Job marked as success when job terminates midway in Kubernetes
Summary
When using the Kubernetes executor, the job sometimes stops/terminates midway through running and is reported as "Success" without any warning to the user.
Status as of February 14, 2020:
- This bug has been, and continues to be, a top priority for us to resolve. In the 12.5 release cycle, we spent most of the time simply trying to reproduce the bug reliably.
- In 12.6 and 12.7 we continued to work on implementing potential solutions to the problem.
- Work continued in 12.8, but the merge request review for the identified fix did not complete in time for the 12.8 code freeze. So 12.9 is the new target. You can follow along in !1775 (merged)
Steps to reproduce
Environment setup:
- I'm using a Linux laptop with Debian. The `dsniff` package is installed (this provides the `tcpkill` command).
- On this laptop I've installed minikube, using KVM as the virtualization engine.
- Minikube's VM is attached to a virtual network, connected with my local interface through the `virbr2` bridge.
- On the same laptop I have GitLab Runner, with the following configuration:
```toml
# ~/.gitlab-runner/config.toml.kubernetes
concurrent = 3
check_interval = 5
listen_address = "127.0.0.1:6062"

[[runners]]
  name = "apollo-kubernetes"
  url = "https://gitlab.com/"
  token = "[REDACTED]"
  limit = 3
  executor = "kubernetes"
  [runners.kubernetes]
    host = "https://192.168.39.230:8443"
    cert_file = "/home/tmaczukin/.minikube/apiserver.crt"
    key_file = "/home/tmaczukin/.minikube/apiserver.key"
    ca_file = "/home/tmaczukin/.minikube/ca.crt"
    image = "alpine:3.10"
    namespace = "default"
    helper_image = "gitlab/gitlab-runner-helper:x86_64-latest"
    pull_policy = "if-not-present"
```
The Kubernetes `host` value was taken from `~/.kube/config` (via `cat ~/.kube/config`) after starting minikube.
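For reference, the `[runners.kubernetes]` fields above map directly onto a standard client-go `rest.Config`. Below is a minimal sketch of building a client from those values, assuming a recent client-go (0.18+); it is only illustrative and not necessarily how Runner wires this up internally:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Values copied from the [runners.kubernetes] section of config.toml above.
	cfg := &rest.Config{
		Host: "https://192.168.39.230:8443",
		TLSClientConfig: rest.TLSClientConfig{
			CertFile: "/home/tmaczukin/.minikube/apiserver.crt",
			KeyFile:  "/home/tmaczukin/.minikube/apiserver.key",
			CAFile:   "/home/tmaczukin/.minikube/ca.crt",
		},
	}

	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Simple connectivity check: list pods in the configured namespace.
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("connected, found %d pods in the default namespace\n", len(pods.Items))
}
```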
Having all of this prepared, I did the following:
- In one terminal window I've started the Runner process: `gitlab-runner run --config ~/.gitlab-runner/config.toml.kubernetes`.
- I've opened another terminal window and typed the command `sudo tcpkill -i virbr2 -9 port 8443`, but without executing it yet!
- I've opened another terminal window and executed `watch -n 1 kubectl get pods`.
- I've started the job from my test project (https://gitlab.com/tmaczukin-test-projects/test-kubernetes-4119/-/jobs/337201695).
- When I noticed that `for i in $(seq 1 60); do echo -n .; sleep 1; done` was being executed, I executed the prepared `tcpkill` command.
After a moment, `tcpkill` started reporting TCP streams that it was trying to reset, in the job's log I saw `Job succeeded` injected in the middle of the loop execution, and in the Runner output I noticed:

```
Job succeeded   duration=17.8968597s job=337201695 project=14948016 runner=d578332a
ERROR: Error cleaning up pod: Delete https://192.168.39.230:8443/api/v1/namespaces/default/pods/runner-d578332a-project-14948016-concurrent-0mdx6t: read tcp 192.168.39.1:34442->192.168.39.230:8443: read: connection reset by peer  job=337201695 project=14948016 runner=d578332a
```
What's important: while the pod deletion reported an error (because `tcpkill` was still running and it terminated the communication), Kubernetes did accept the request and removed the pod. The `watch -n 1 kubectl get pods` that was showing the job pod during job execution started showing `No resources found` a moment after the job finished.
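This matches what the error message suggests: the Delete request reached the API server before the connection was reset, so the client-side error does not mean the pod survived. A hedged sketch (a hypothetical helper, not Runner's actual cleanup code) of how a cleanup routine could double-check whether the pod is really gone before reporting the cleanup as failed:

```go
package main

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podGone re-checks a pod after a Delete call failed with a transport error.
// If the API server already removed the pod, Get returns a NotFound error and
// the earlier cleanup can be treated as successful.
func podGone(ctx context.Context, c kubernetes.Interface, namespace, name string) (bool, error) {
	_, err := c.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return true, nil // pod was deleted server-side despite the client-side error
	}
	return false, err // still present (err == nil) or some other API error
}
```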
For my tests I used Runner version `12.5.0~beta.1988.g1cc30aa5 (1cc30aa5)`, i.e. the one that includes the updated Kubernetes client library.
Actual behavior
We are using the Kubernetes runner to build our application, together with GCP preemptible machines. Normally, our pipeline contains approximately 12 concurrent build jobs, one for each build profile. Each job needs approximately 20 minutes to complete.
Sometimes a job fails because the machine was preempted, which is OK: it is marked as `failed` and we can just restart it.
But we also encountered a very strange behavior: the build job stops midway, according to our logs, then the `helper` proceeds to upload the artifacts, etc., and the job is marked as `success`.
We suspect the `helper` cannot properly detect the exit code of the `builder` container under some circumstances. This error does not coincide with GCP preemption events in most cases, but we still believe some external event kills the `builder`.
Expected behavior
It is expected that the job is marked as `failed` if the `builder` exits before completion.
Relevant logs and/or screenshots
There is not much in the logs, because the error was not detected. The job `116236` normally needs approximately 20 minutes, but it stopped after roughly 15 minutes (see the `duration` in the log below):
```
Checking for jobs... received  job=116140 repo_url=https://***.****.******/mobile/shield.git runner=37151372
WARNING: Namespace is empty, therefore assuming 'default'.  job=116140 project=1 runner=37151372
.... Other info logs omitted ...
Job succeeded  duration=15m10.746034166s job=116140 project=1 runner=37151372
```
Theories on what is going on
Below is a list of what people are experiencing, with somewhat consistent ways of reproducing the issue; as you can see, some of them are very similar:
As you can see, all of them are related to the pod being removed; when GitLab Runner sees that, it just stops the job but does not surface it as an error to the user. @redbaron1 did some awesome investigation regarding this in #3175 (comment 90286240) & #3175 (comment 90507398), which looks at how we communicate with Kubernetes about the status/stream of STDIN/STDOUT.
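To make that common thread concrete: if the Runner watched the build pod itself, an unexpected deletion could be surfaced as a job failure instead of being treated as a normal end of output. Below is a hedged sketch of such a watch, assuming client-go; it is illustrative only and not the Runner's implementation:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// watchBuildPod blocks until the build pod is deleted or enters the Failed
// phase, and returns an error in both cases so the job can be marked as
// failed rather than succeeded.
func watchBuildPod(ctx context.Context, c kubernetes.Interface, namespace, podName string) error {
	w, err := c.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + podName,
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		switch event.Type {
		case watch.Deleted:
			return fmt.Errorf("pod %s/%s was deleted while the job was still running", namespace, podName)
		case watch.Modified:
			if pod, ok := event.Object.(*corev1.Pod); ok && pod.Status.Phase == corev1.PodFailed {
				return fmt.Errorf("pod %s/%s entered the Failed phase", namespace, podName)
			}
		}
	}
	return nil // watch channel closed (e.g. context cancelled after a normal finish)
}
```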
Possible workarounds that users said worked
Below are some possible workarounds that users reported fixed the issue for them:
Environment description
This is mostly visible for Kubernetes on any kind of environment, be it self-hosted or on GKE, though it has been reported for the Docker Runner as well.
Reported environments that it is failing on:
- Google Cloud Kubernetes ver.
1.12.5-gke.10
- Self-hosted
- AKS
Used GitLab Runner version
The version doesn't seem to affect anything, it has been seen failing for the following versions:
- 11.8.0
- 11.6.0
- 11.3.0
- 11.2.0
Duplicate issues
Proposal
As suggested by @redbaron1 in #4119 (comment 173794124), our Kubernetes SDK is quite outdated, and upgrading it might fix the issue. Upgrading the SDK is something that needs to happen eventually anyway, so it might be a good first step.
One idea that was floating around when talking to @ayufan about this issue is that we might be checking whether a build failed in the wrong way. Right now we check the stream output, which might not be the most reliable way to do it, since the stream can be cut at any time. We used to have similar problems with the Docker executor, and the way we solved them was by checking the exit code of the container instead of the stream. We should check if it's possible to do something similar with pods: instead of relying on the stream, check whether the pod exposes some kind of exit code (we need to investigate if pods expose something similar).
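Pods do expose this via container statuses. Below is a minimal sketch of the proposal, assuming client-go and a build container named "build" (the container name is an assumption for illustration): after the output stream ends, ask the API server for the container's terminated state and use its exit code to decide the job result, instead of trusting the stream alone.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// buildExitCode looks up the terminal state of the build container in the job
// pod and returns its exit code. An error is returned if the container has not
// terminated yet or its status is missing.
func buildExitCode(ctx context.Context, c kubernetes.Interface, namespace, podName string) (int32, error) {
	pod, err := c.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return -1, err
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name != "build" { // container name is an assumption for this sketch
			continue
		}
		if cs.State.Terminated == nil {
			return -1, fmt.Errorf("container %q has not terminated yet", cs.Name)
		}
		return cs.State.Terminated.ExitCode, nil
	}
	return -1, fmt.Errorf("no status found for container %q in pod %s/%s", "build", namespace, podName)
}
```

A non-zero exit code, or a missing terminated state after the pod disappeared, would then fail the job instead of marking it as success.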
Investigation log
**Follow-up steps:** #4119 (comment 190968921)