Race condition on deletion of `gitlab_runner_env` file if all jobs share the same `GIT_CLONE_PATH`
Summary
In parallel jobs that share a `GIT_CLONE_PATH`, random jobs may occasionally fail with one of these errors:
Getting source from Git repository
/bin/bash: line 186: <path>/gitlab_runner_env: No such file or directory
or
Running on <runner>...
rm: can't remove '<path>/gitlab_runner_env': No such file or directory
This appears to be because in GitLab 17.7.0 we started deleting the `gitlab_runner_env` file at the start/end of jobs as part of this MR.
So a race condition can occur, where `job1` deletes the `gitlab_runner_env` file, and `job2` then attempts to read/delete the file but fails because `job1` has already deleted it.
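The failure mode can be sketched sequentially in plain shell. This is only an illustration, not the runner's actual script: the temporary directory stands in for the shared `GIT_CLONE_PATH`, and the "jobs" are just two consecutive `rm` calls on the same file.

```shell
#!/bin/sh
# Hypothetical reproduction: shared_dir stands in for the shared
# GIT_CLONE_PATH, env_file for the runner's gitlab_runner_env file.
shared_dir=$(mktemp -d)
env_file="$shared_dir/gitlab_runner_env"
echo 'EXAMPLE_VAR=1' > "$env_file"

# "job1" cleans up first and succeeds.
rm "$env_file"
job1_status=$?

# "job2" attempts the same cleanup afterwards and fails,
# because the file no longer exists.
rm "$env_file" 2>/dev/null
job2_status=$?

echo "job1 rm exit status: $job1_status"
echo "job2 rm exit status: $job2_status"
rmdir "$shared_dir"
```

In a real pipeline the ordering between jobs is not deterministic, which would explain why only some jobs fail.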
Context
The customer explained why they use a persistent `GIT_CLONE_PATH` on NFS that is shared between all jobs:
We use a shared NFS path for the entire pipeline for a combination of reasons:
- clones for this repo, even shallow ones, are 2+GB
- the build stage of this pipeline generates another 2+GB of output that needs to be used by downstream jobs
- the tests for this pipeline generate another 10+GB of output that we often need to inspect after jobs complete, especially if they fail
They also shared:
- We set all jobs to `GIT_STRATEGY: none` except our initial bootstrap job, which sets `GIT_STRATEGY: fetch`
Actual behavior
Race condition where some jobs fail because the `gitlab_runner_env` file could not be read/deleted.
Expected behavior
Parallel jobs should not fail when reading/deleting `gitlab_runner_env`.
Used GitLab Runner version
GitLab 17.7.0.
If they revert to GitLab 17.6.0, this behaviour is no longer observed.
Possible fixes
- Update the recursive delete to run `rm` with the `-f` flag: https://gitlab.com/gitlab-org/gitlab-runner/-/blob/v17.7.0/shells/bash.go#L271-280
- Only read the file if the `gitlab_runner_env` file exists: https://gitlab.com/gitlab-org/gitlab-runner/-/blob/v17.7.0/shells/bash.go#L211
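A minimal shell sketch of both ideas, assuming the generated script could call helpers like these. The function names `remove_env_file` and `load_env_file` are hypothetical and not taken from GitLab Runner; the temporary directory again stands in for the shared `GIT_CLONE_PATH`.

```shell
#!/bin/sh
# Hypothetical helpers; these names do not come from GitLab Runner.

# Remove the env file without failing if another job already removed it.
remove_env_file() {
    rm -f "$1"
}

# Source the env file only if it is present. A concurrent delete between
# the existence check and the read is still possible, so this narrows the
# race window rather than closing it.
load_env_file() {
    if [ -f "$1" ]; then
        . "$1"
    fi
}

shared_dir=$(mktemp -d)
env_file="$shared_dir/gitlab_runner_env"
echo 'EXAMPLE_VAR=from_env_file' > "$env_file"

load_env_file "$env_file"      # file present: variables are loaded
remove_env_file "$env_file"    # first delete removes the file
remove_env_file "$env_file"    # second delete is a no-op, not an error
second_rm_status=$?
load_env_file "$env_file"      # file absent: skipped without an error
missing_load_status=$?
rmdir "$shared_dir"
```

The `rm -f` change makes deletion safe to repeat. The existence check still leaves a short gap between the test and the read, so avoiding the race entirely would probably need a per-job file or some form of locking.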