Kubernetes executor stops receiving trace updates and job eventually times out
Zendesk: https://gitlab.zendesk.com/agent/tickets/93568 (internal)
A customer reported that with the Kubernetes executor their jobs would never finish and would eventually time out. The traces showed that output simply stopped at some point. Naturally, the first assumption is that the command is hanging and never finishing (not a GitLab/Runner bug). However, the customer noted that the job's output was still appearing on the container, indicating the commands were in fact finishing.
To prove this, we added `| tee output.txt` to the command and reran the pipeline. At some random point GitLab stopped seeing updates to the trace: the runner logs showed "Appending trace to coordinator" with the same position each time, indicating the runner wasn't seeing new output, while the tee log file on the container contained much more output. The job finished successfully and the built artifact was present on the container's filesystem.
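For reference, here is a minimal sketch of the kind of job used to reproduce this. The job name, tag, and `build.sh` script are illustrative placeholders, not the customer's actual configuration:

```yaml
# .gitlab-ci.yml (illustrative reproduction job, names are hypothetical)
reproduce-trace-loss:
  tags:
    - kubernetes        # route the job to the Kubernetes executor
  script:
    # Duplicate all command output into a file on the job container's
    # filesystem so it can be compared against the trace GitLab received.
    - ./build.sh 2>&1 | tee output.txt
  after_script:
    # Show how much output actually reached the container, even if the
    # trace in the GitLab UI stopped updating partway through.
    - wc -l output.txt
```

Comparing `output.txt` on the container against the trace shown in the UI is what demonstrated that the commands kept producing output after GitLab stopped receiving trace updates.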
Any ideas what would cause the Kubernetes runner to stop retrieving the trace? I imagine it could be either a communication problem between the runner and the job container or a bug elsewhere.