When gitlab-runner is stopped/rescheduled/crashes, child jobs hang indefinitely
Summary
When using the Kubernetes executor:
If the controller pod is stopped (deleted, crashed, or rescheduled to a different machine by Kubernetes), any jobs it was running will hang indefinitely, reading from stdin.
Steps to reproduce
These steps assume a running Kubernetes cluster, GitLab CI runner, and a project on which you can start pipeline runs.
The cluster
Notice that the controller pod has been running for a while and has just started a new job:
$ kubectl get -n gitlab-ci pods
NAME READY STATUS RESTARTS AGE
gitlab-runner-d799dd6d4-wkrbl 1/1 Running 0 4d20h
runner-mxszx5s-project-5674-concurrent-0mzhr5 6/6 Running 0 12s
Delete the controller pod
$ kubectl delete -n gitlab-ci pods gitlab-runner-d799dd6d4-wkrbl
pod "gitlab-runner-d799dd6d4-wkrbl" deleted
Check the state of the job
Notice that the job still exists after deleting the old controller pod. Also notice that the Kubernetes replica set / deployment has recreated the controller pod.
$ kubectl get -n gitlab-ci pods
NAME READY STATUS RESTARTS AGE
gitlab-runner-d799dd6d4-7984x 1/1 Running 0 36s
runner-mxszx5s-project-5674-concurrent-0mzhr5 6/6 Running 0 56s
Next, we open a shell in the job pod and try to find out what it is doing.
$ kubectl exec -ti -n gitlab-ci runner-mxszx5s-project-5674-concurrent-0mzhr5 /bin/bash
bash-4.2# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 11700 2580 ? Ss 14:28 0:00 /usr/bin/bash
root 507 0.0 0.0 11836 3036 pts/0 Ss 14:30 0:00 /bin/bash
root 624 0.0 0.0 51760 3604 pts/0 R+ 14:38 0:00 ps aux
bash-4.2# strace -p 1
strace: Process 1 attached
read(0,
As shown here, PID 1 of the pod is blocked reading from stdin (file descriptor 0), and the read never completes.
Actual behavior
The job keeps running, blocked on a read from stdin, until it is manually killed.
Expected behavior
The job pod should be deleted whenever the controller pod is deleted. This can be accomplished by using the ownerReferences feature of Kubernetes when the controller creates new pods.
Example of where this could be set: https://gitlab.com/gitlab-org/gitlab-runner/-/blob/59abc3d324882816618e7372dd85f6adb4d6c6b3/executors/kubernetes/kubernetes.go#L872
Environment description
This is a custom installation using the Kubernetes executor.
config.toml contents
concurrent = 20
check_interval = 30
log_level = "info"
listen_address = '[::]:9252'
Used GitLab Runner version
bash-4.4$ gitlab-runner --version
Version: 12.2.0
Git revision: a987417a
Git branch: 12-2-stable
GO version: go1.8.7
Built: 2019-08-22T13:06:00+0000
OS/Arch: linux/amd64
Possible fixes
This could be fixed by using the ownerReferences feature of Kubernetes when the controller creates new pods.
Example of where this could be set: https://gitlab.com/gitlab-org/gitlab-runner/-/blob/59abc3d324882816618e7372dd85f6adb4d6c6b3/executors/kubernetes/kubernetes.go#L872
https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/