Authentication failure when updating submodules using fetch strategy with shared build directories
## Summary
The change introduced in !3134 (merged) (released in 14.4.0) causes an authentication failure when a job starts with "Reinitialized existing Git repository" and the project contains submodules that require a fetch to check out the appropriate ref. This only occurs when `FF_ENABLE_JOB_CLEANUP` is enabled on the runners, as !3134 (merged) was specifically a change to that feature flag's behavior.
## Steps to reproduce
1. Deploy a runner in a configuration where shared storage is used, allowing jobs to use the fetch strategy and reuse the clone of a repository from a previous job.
2. Run a job from project `A`, which has submodule `B`, with a fetch strategy and a submodule strategy of `normal`.
3. Push a new commit to submodule `B`.
4. Update submodule `B` in project `A` to point at the new ref present on submodule `B`.
5. Run another job from project `A`, observing that the job log includes "Reinitialized existing Git repository" and that the job runs in the exact same build directory as the job from step 2. This job fails with an error similar to the one shown below.
## Actual behavior
The job fails with an authentication error. Inspecting the git config files after step 5 above reveals an old, expired token in the submodule's config. For example:
`.git/config`:

```ini
[remote "origin"]
	url = https://gitlab-ci-token:CURRENT_CI_JOB_TOKEN@git.example.com/my/project.git
	fetch = +refs/heads/*:refs/remotes/origin/*
[submodule "vendor"]
	active = true
	url = https://gitlab-ci-token:CURRENT_CI_JOB_TOKEN@git.example.com/my/project-vendor-mod.git
```

`.git/modules/vendor/config`:

```ini
[remote "origin"]
	url = https://gitlab-ci-token:EXPIRED_CI_JOB_TOKEN@git.example.com/my/project-vendor-mod.git
	fetch = +refs/heads/*:refs/remotes/origin/*
```
As a result of removing the `.git/config` file during cleanup, there is a slight but important difference in behavior when submodules are present. Submodule URLs are missing from the `.git/config` file when the repository is re-initialized, because these URLs are normally added to `.git/config` when the `git submodule init` command runs after a fresh clone of a repository.

The runner executes `git submodule sync` to ensure that when a submodule's URL changes (including when the value of `CI_JOB_TOKEN` rotates, as it does for every job that executes), the URL in `.git/modules/vendor/config` (for example) is updated to match the one found in `.git/config`. However, since the URL is missing from `.git/config` when the sync command runs, the URL fails to update, and the expired, stale token remains in `.git/modules/vendor/config`.
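The sync gap can be reproduced locally with stock git alone. In this sketch, local directory paths stand in for the token-bearing HTTPS URLs, and a renamed path stands in for a rotated `CI_JOB_TOKEN`; all repository names here are illustrative, not the runner's actual values.

```shell
set -eu
export GIT_AUTHOR_NAME=ci GIT_AUTHOR_EMAIL=ci@example.com
export GIT_COMMITTER_NAME=ci GIT_COMMITTER_EMAIL=ci@example.com
tmp=$(mktemp -d)
cd "$tmp"

# Upstream submodule repository (stands in for my/project-vendor-mod).
git init -q sub
git -C sub commit -q --allow-empty -m init

# Superproject with the submodule cloned and registered.
git init -q super
cd super
git commit -q --allow-empty -m root
git -c protocol.file.allow=always submodule add -q "$tmp/sub" vendor
git commit -q -m 'add vendor submodule'

# Simulate CI_JOB_TOKEN rotation: the desired submodule URL changes.
git config -f .gitmodules submodule.vendor.url "$tmp/sub-rotated"

# Simulate FF_ENABLE_JOB_CLEANUP: the [submodule "vendor"] section
# is absent from the re-created .git/config.
git config --remove-section submodule.vendor

# `git submodule sync` skips submodules it cannot find registered in
# .git/config, so the per-submodule config keeps the stale URL.
git submodule sync -q
stale_url=$(git config -f .git/modules/vendor/config remote.origin.url)
echo "$stale_url"   # still $tmp/sub, i.e. the "expired token" URL
```

In CI the analogous `git submodule update --init` then fetches with the expired token embedded in that stale URL, producing the HTTP Basic "Access denied" failure shown in the logs below.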
After the sync command has run, the runner currently calls `git submodule update --init` to update and/or initialize submodules. This worked well enough when submodules were either new (and didn't require a sync) or were being re-initialized with their config already present in `.git/config` from a previous execution. Because the init now occurs after the sync instead of before it, the `[submodule "vendor"]` section in `.git/config` is not present early enough in the job setup script for `git submodule sync` to do its job, ultimately resulting in expired tokens being used in the attempted submodule fetch.
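For comparison, a minimal sketch of the reordering that closes the gap: running `git submodule init` before `git submodule sync` restores the `[submodule "vendor"]` entry in `.git/config` first, so sync can then propagate the fresh URL into the per-submodule config (again with illustrative local paths standing in for token-bearing HTTPS URLs).

```shell
set -eu
export GIT_AUTHOR_NAME=ci GIT_AUTHOR_EMAIL=ci@example.com
export GIT_COMMITTER_NAME=ci GIT_COMMITTER_EMAIL=ci@example.com
tmp=$(mktemp -d)
cd "$tmp"

# Same setup as before: an upstream submodule and a superproject.
git init -q sub
git -C sub commit -q --allow-empty -m init
git init -q super
cd super
git commit -q --allow-empty -m root
git -c protocol.file.allow=always submodule add -q "$tmp/sub" vendor
git commit -q -m 'add vendor submodule'

# Rotate the URL and wipe the [submodule "vendor"] section, as above.
git config -f .gitmodules submodule.vendor.url "$tmp/sub-rotated"
git config --remove-section submodule.vendor

# Init first re-registers the submodule in .git/config...
git submodule init
# ...so sync can now update .git/modules/vendor/config as well.
git submodule sync -q
fixed_url=$(git config -f .git/modules/vendor/config remote.origin.url)
echo "$fixed_url"   # now $tmp/sub-rotated, the fresh URL
```

A subsequent `git submodule update` would then fetch with the fresh URL rather than the expired one.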
## Expected behavior
The job should always complete successfully; cleanup should not result in expired/stale job tokens being used when fetching submodules during job startup.
## Relevant logs and/or screenshots
```plaintext
Fetching changes...
Reinitialized existing Git repository in /builds/<runner>/<concurrent>/my/project/.git/
Created fresh repository.
Checking out e1b05cc9 as <redacted>...
Updating/initializing submodules...
Entering 'vendor'
Entering 'vendor'
HEAD is now at 0d6e7569 <redacted>
Submodule 'vendor' (https://gitlab-ci-token:[MASKED]@git.example.com/my/project-vendor-mod.git) registered for path 'vendor'
remote: HTTP Basic: Access denied
fatal: Authentication failed for 'https://git.example.com/my/project-vendor-mod.git/'
Unable to fetch in submodule path 'vendor'; trying to directly fetch 073a7337c0f3e459ae49906fe429bc4a30c327ed:
remote: HTTP Basic: Access denied
fatal: Authentication failed for 'https://git.example.com/my/project-vendor-mod.git/'
Fetched in submodule path 'vendor', but it did not contain 073a7337c0f3e459ae49906fe429bc4a30c327ed. Direct fetching of that commit failed.
```
## Environment description
We began encountering this after upgrading our runners from 14.3.x to 14.5.0 (we skipped over 14.4.0). The runners are deployed via Helm on a GKE cluster, although the issue should be readily reproducible with any executor that supports shared build directories (and use of the fetch strategy).
Slightly redacted `configmap.yaml`:
```yaml
apiVersion: v1
kind: ConfigMap
data:
  config.template.toml: |
    [[runners]]
      image = "ubuntu:20.04"
      output_limit = 20480
      executor = "kubernetes"
      builds_dir = "/builds"
      environment = [
        "FF_ENABLE_JOB_CLEANUP=true",
      ]
      [runners.custom_build_dir]
        enabled = false
      [runners.kubernetes]
        namespace = "gitlab-runner"
        privileged = false
        allow_privilege_escalation = false
        service_account = "gitlab-runner"
        pull_policy = "if-not-present"
        cpu_limit = "15"
        cpu_limit_overwrite_max_allowed = "15"
        memory_limit = "16Gi"
        memory_limit_overwrite_max_allowed = "16Gi"
        cpu_request = "200m"
        cpu_request_overwrite_max_allowed = "15"
        memory_request = "800Mi"
        memory_request_overwrite_max_allowed = "16Gi"
        helper_cpu_limit = "2"
        helper_memory_limit = "2Gi"
        helper_cpu_request = "200m"
        helper_memory_request = "100Mi"
        service_cpu_limit = "2"
        service_memory_limit = "2Gi"
        service_cpu_request = "200m"
        service_memory_request = "512Mi"
        poll_interval = 5
        poll_timeout = 600
        cleanup_grace_period_seconds = 0
        pod_termination_grace_period_seconds = 600
        [runners.kubernetes.pod_labels]
          "ci_commit_ref_slug" = "$CI_COMMIT_REF_SLUG"
          "ci_job_name" = "$CI_JOB_NAME"
          "ci_job_stage" = "$CI_JOB_STAGE"
          "ci_project_id" = "$CI_PROJECT_ID"
          "ci_project_name" = "$CI_PROJECT_NAME"
          "ci_project_namespace" = "$CI_PROJECT_NAMESPACE"
        [runners.kubernetes.pod_annotations]
          "ci_commit_ref_slug" = "$CI_COMMIT_REF_SLUG"
          "ci_job_name" = "$CI_JOB_NAME"
          "ci_job_stage" = "$CI_JOB_STAGE"
          "ci_project_id" = "$CI_PROJECT_ID"
          "ci_project_name" = "$CI_PROJECT_NAME"
          "ci_project_namespace" = "$CI_PROJECT_NAMESPACE"
          "ci_job_url" = "$CI_JOB_URL"
          "ci_pipeline_url" = "$CI_PIPELINE_URL"
          "ci_project_url" = "$CI_PROJECT_URL"
        [runners.kubernetes.node_selector]
          "node-pool" = "worker"
        [runners.kubernetes.node_tolerations]
          "gitlab-runner=true" = "NoSchedule"
        [[runners.kubernetes.volumes.host_path]]
          name = "repo"
          mount_path = "/builds"
          host_path = "/mnt/stateful_partition/kube-ephemeral-ssd/gitlab-builds"
      [runners.cache]
        Type = "gcs"
        Path = ""
        Shared = true
        [runners.cache.gcs]
          BucketName = "private-gitlab-cache-bucket"
  config.toml: |
    concurrent = 100
    check_interval = 5
    log_level = "info"
    listen_address = ':9252'
```
## Used GitLab Runner version
Reproduced on 14.5.0, 14.5.2, and 14.6.0.
## Possible fixes
This is the line of code responsible for the problem: !3134 (diffs)
Proposed resolution: !3265 (merged)