Pods stuck on Terminating when Kubernetes (K8s) digester is used
Status update (2023-05-19)
- This issue has been added as a candidate for the 16.3 iteration plan.
- More in-depth analysis and investigation is pending issue assignment to a developer.
Summary
When using a runner in a Kubernetes cluster, if the k8s digester webhook is installed on the cluster, pods created to run pipeline jobs get stuck in the Terminating state without any containers inside them.
This was reported by one of our GitLab Ultimate customers. GitLab team members can find more details in the internal ticket.
The nature of the problem seems to be that having the webhook active disrupts some contextual information that the k8s Go client holds, and the runner then fails to clean up the resources, possibly because it can no longer find them in some way. Without stripping out the client code and recreating it in isolation I can't confirm this at this point, as I haven't worked with this client library before, but the behaviour fits this pattern.
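To make that hypothesis concrete, here is a minimal sketch (not runner code; the kubeconfig path, namespace, pod name, and image are placeholders, and it assumes digester is active for the target namespace) of how a client-go consumer can end up holding a pod spec that differs from what the API server actually stored: the object it submitted still carries the bare tag, while the object the server returns after admission carries the digest-pinned reference.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig; any client-go consumer builds a clientset this way.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// Pod submitted with a tag-only image, the way job pods are submitted.
	submitted := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "digester-probe", Namespace: "default"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{
				{Name: "build", Image: "alpine:3.18", Command: []string{"true"}},
			},
		},
	}

	// The object the API server returns already includes any admission
	// mutations, e.g. the tag resolved to a sha256 digest by digester.
	stored, err := client.CoreV1().Pods(submitted.Namespace).Create(ctx, submitted, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}

	fmt.Println("image as submitted:", submitted.Spec.Containers[0].Image)
	fmt.Println("image as stored:   ", stored.Spec.Containers[0].Image)

	// Remove the probe pod again.
	_ = client.CoreV1().Pods(stored.Namespace).Delete(ctx, stored.Name, metav1.DeleteOptions{})
}
```

If any bookkeeping in the cleanup path keys off the submitted object rather than the returned one, that would match the symptom of the runner not finding the pod it needs to remove; this is only a hypothesis to verify against the executor source.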
Additional customer details
- We have been running, very successfully so far, GitLab Kubernetes runners in GCP in environments with Binary Authorization for container images.
- The Binary Authorization system in GCP means that only 'attested' images can run in the cluster and requires that images are identified to the validating webhook as image[:tag]@sha256:shavalue rather than simply image:tag.
- Whilst this has worked well up to now, we wanted a simpler release process for some pipeline images that could use tags, so following some investigation we found the k8s-digester project: https://github.com/google/k8s-digester.
- Digester can be run, and we are using it this way, as a mutating webhook which looks for tagged images in pod, replicaset, deployment, cron and job resources and resolves them to a sha256 digest in the deployed spec. As mutating webhooks occur prior to validating webhooks, this allows the tag to be resolved into a form that the validating webhook can handle.
- We set this up in our environment running the GitLab runner/executor, and the correct images are now being selected via tags and meet our Binary Authorization policy. However, we have found that on termination of the containers in the executor pod, the pod itself is always left behind flagged as 'Terminating' with 0/2 containers running.
- I've tested a config where the sha values are provided, so the webhook returns an empty patch statement for the pod resource, and exactly the same thing happens: the pod is left Terminating with 0/2 containers running (i.e. both the build and helper containers have completed).
- Turning the webhook off for a namespace is handled at the MutatingWebhookConfiguration level via a label on the namespace, and in that mode the resources aren't routed to the webhook at all. In this mode the pods disappear almost immediately on termination of the containers.
- I'm currently reading through the Kubernetes executor source to try to identify the cause, as I feel that something is losing track and not cleaning up; a quick check of the state a stuck pod is left in is sketched just below this list. Running the main runner with debug logging reveals nothing different between runs.
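As a diagnostic aid only (this is not runner code; the namespace and pod name below are placeholders), reading a stuck pod back with client-go shows whether its deletionTimestamp is set and whether finalizers or lingering container statuses are what keeps it in Terminating:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder namespace and pod name for a job pod stuck in Terminating.
	namespace := "gitlab-runner"
	podName := "runner-xxxxx-project-0-concurrent-0"

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pod, err := client.CoreV1().Pods(namespace).Get(context.Background(), podName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// A pod shows as Terminating once its deletionTimestamp is set; anything
	// listed below (finalizers, container statuses) is what the cluster is
	// still waiting on before the object is removed.
	fmt.Println("deletionTimestamp:", pod.DeletionTimestamp)
	fmt.Println("finalizers:", pod.Finalizers)
	for _, cs := range pod.Status.ContainerStatuses {
		fmt.Printf("container %s ready=%v state=%+v\n", cs.Name, cs.Ready, cs.State)
	}
}
```

The same information is visible with kubectl get pod -o yaml; the point is simply to see which part of the pod lifecycle is left dangling when the webhook is enabled versus disabled.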
Steps to reproduce
- Install the latest version of the GitLab Runner Helm Chart. This was tested with a cluster on GKE.
- Deploy the digester webhook to your cluster.
- Run any pipeline using the newly installed runner and notice that every pod created gets stuck in the Terminating state with 0 containers inside.
Actual behavior
Pods created get stuck in the Terminating state without any containers inside them.
Expected behavior
Pods should be able to successfully terminate.
Relevant logs and/or screenshots
Environment description
This was tested with a GKE (GCP) cluster.
The runner was connected to GitLab.com.
No other values were defined besides gitlabUrl and runnerRegistrationToken (which are mandatory).
Used GitLab Runner version
The latest version available at the moment (15.3.0):
Version: 15.3.0
Git revision: bbcb5aba
Git branch: 15-3-stable
GO version: go1.17.9
Built: 2022-08-19T22:41:11+0000
OS/Arch: linux/amd64
Possible fixes
It seems to be related to the pod lifecycle somehow. When digest resolution is disabled for the namespace (kubectl label namespace default digest-resolution=disabled --overwrite=true), everything works properly again and pods are successfully terminated.
This also doesn't happen on version 13.9.0:
I dropped my runner back to version 13.9.0, prior to the k8s client library update, and the pod terminated absolutely cleanly rather than hanging in the Terminating state. The issue does seem to involve something north of this version. Since the semver versions on k8s/client-go have shot up to v10 (a major version!), I suspect a lot of disruptive change between the v0.21.1 you are using and this; given 0.21.1 was January last year, there's been a lot of change in this area.