CI/CD → Runners page showing stale runner version during job run after a recent runner version upgrade
Summary
When viewing the CI/CD → Runners page (/admin/runners) after upgrading GitLab Runner on registered runners, the Version value intermittently switches between the version of GitLab Runner that is actually running and the version that was previously in use. The previous version appears as soon as the runner starts to process a job. Once the runner is idle again and requesting jobs, the Version value correctly shows the running version again.
Steps to reproduce
Example:
gitlab-runner_15-7-3_to_15-8-0_on_gitlab_15-8-0_example
Example scenario:
- Register a new runner to a GitLab instance. In this example, GitLab Runner 15.7.3 has been installed via apt and freshly registered to a GitLab instance running 15.8.0.
- If you visit the /admin/runners page, you'll see the registered GitLab Runner correctly shown as Version 15.7.3.
- Upgrade GitLab Runner to a different version. In this example I've upgraded from 15.7.3 to 15.8.0.
- After the upgrade completes, the correct version of the runner is reflected on the /admin/runners page. It appears as 15.8.0, per the upgrade that was just performed.
- Retry an old job (or run a new pipeline and job entirely; the result is the same).
- Once the job is picked up by the runner, the Version value on the /admin/runners page suddenly shows the previous version of the runner, 15.7.3 in this example.
- Once the job completes and no other jobs are being processed by the runner, the runner's Version value on the /admin/runners page reflects the correct version once again, 15.8.0 in this example.
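The flapping described in the steps above is easier to spot if a series of sampled Version values is collapsed into the points where the value changes. The following is a minimal sketch and not part of GitLab: the helper name version_transitions and the sample data are hypothetical, and how the samples are collected (UI, API, or Rails console) is left open.

```ruby
# Hypothetical helper: collapse sampled version strings into change points.
# Returns an array of [index, from, to], where index is the position of the
# sample whose value differs from the sample before it.
def version_transitions(samples)
  transitions = []
  samples.each_cons(2).with_index(1) do |(prev, curr), idx|
    transitions << [idx, prev, curr] if prev != curr
  end
  transitions
end

# Illustrative samples only, mirroring the scenario above:
# runner idle, job picked up, job finished.
samples = ["15.8.0", "15.8.0", "15.7.3", "15.7.3", "15.8.0"]

version_transitions(samples).each do |idx, from, to|
  puts "sample #{idx}: #{from} -> #{to}"
end
```

An empty result would mean the displayed version stayed stable for the whole sampling window.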
What is the current bug behavior?
- The currently used GitLab Runner version is not always correctly reflected on the /admin/runners page, at least not shortly after a GitLab Runner version upgrade. This can confuse users, because it may look as though something went wrong during the GitLab Runner upgrade process when that isn't the case.
What is the expected correct behavior?
- The currently used GitLab Runner version should be reflected correctly and consistently after a GitLab Runner version upgrade is performed. The version displayed on the /admin/runners page, or on any other UI page describing runner details, should always be the actual GitLab Runner version in use, regardless of whether a job is running, and ideally as soon as a GitLab Runner version change is detected.
Additional troubleshooting details
Please note that a new/separate test environment was used in the output below, so the IP addressing and other details may differ slightly from what is shown in the example video, but the same methodology was followed to reproduce the problem.
- It was confirmed that the runner token/authentication has not been duplicated across more than one runner. Running gitlab-runner reset-token --all-runners was tested just to double check, and the problem is still present after cycling the token.
- To try to rule out the possibility of the runner itself sending stale version data, tcpdump was used in the test environment on the instance running GitLab Runner to check the version values being sent in the JSON payloads over HTTP traffic destined for the GitLab instance. Here the only version value we can see being sent from the runner is the correct running version, 15.8.0:

    root@ip-172-31-19-34:~# tcpdump -A -s 0 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' | grep -w "version"
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
    {"info":{"name":"gitlab-runner","version":"15.8.0","revision":"12335144","platform":"linux","architecture":"amd64","executor":"docker","shell":"bash","features":{"variables":true,"image":true,"services":true,"artifacts":true,"cache":true,"shared":false,"upload_multiple_artifacts":true,"upload_raw_artifacts":true,"session":true,"terminal":true,"refspecs":true,"masking":true,"proxy":false,"raw_variables":true,"artifacts_exclude":true,"multi_build_steps":true,"trace_reset":true,"trace_checksum":true,"trace_size":true,"vault_secrets":true,"cancelable":true,"return_exit_code":true,"service_variables":true,"service_multiple_aliases":true},"config":{"gpus":""}},"token":"<REDACTED>","system_id":"s_37bf828021dd"}
- If we verify the runner version live via the Rails console while replicating the scenario described previously, the version discrepancy is visible in the output. You can see the version switch from 15.8.0 to 15.7.3 when a job is run, and back to 15.8.0 once the runner is idle again.

    irb(main):013:0> Rails.logger.level = Logger::DEBUG
    => 0
    irb(main):014:0> testmctest = true
    => true
    irb(main):015:1* loop do
    irb(main):016:1* sleep(1)
    irb(main):017:1* pp Ci::Runner.find_by_token('<REDACTED>').version
    irb(main):018:1* break if !testmctest
    irb(main):019:0> end
    Ci::Runner Load (1.4ms) /*application:console,db_config_name:main,console_hostname:ip-172-31-19-34,console_username:ubuntu*/ SELECT "ci_runners".* FROM "ci_runners" WHERE (token_expires_at IS NULL OR token_expires_at >= NOW()) AND "ci_runners"."token_encrypted" IN ('<REDACTED>', '<REDACTED>') LIMIT 1
    "15.8.0"
    Ci::Runner Load (0.7ms) /*application:console,db_config_name:main,console_hostname:ip-172-31-19-34,console_username:ubuntu*/ SELECT "ci_runners".* FROM "ci_runners" WHERE (token_expires_at IS NULL OR token_expires_at >= NOW()) AND "ci_runners"."token_encrypted" IN ('<REDACTED>', '<REDACTED>') LIMIT 1
    "15.8.0"
    Ci::Runner Load (0.6ms) /*application:console,db_config_name:main,console_hostname:ip-172-31-19-34,console_username:ubuntu*/ SELECT "ci_runners".* FROM "ci_runners" WHERE (token_expires_at IS NULL OR token_expires_at >= NOW()) AND "ci_runners"."token_encrypted" IN ('<REDACTED>', '<REDACTED>') LIMIT 1
    "15.7.3"
    Ci::Runner Load (0.8ms) /*application:console,db_config_name:main,console_hostname:ip-172-31-19-34,console_username:ubuntu*/ SELECT "ci_runners".* FROM "ci_runners" WHERE (token_expires_at IS NULL OR token_expires_at >= NOW()) AND "ci_runners"."token_encrypted" IN ('<REDACTED>', '<REDACTED>') LIMIT 1
    "15.7.3"
    Ci::Runner Load (0.6ms) /*application:console,db_config_name:main,console_hostname:ip-172-31-19-34,console_username:ubuntu*/ SELECT "ci_runners".* FROM "ci_runners" WHERE (token_expires_at IS NULL OR token_expires_at >= NOW()) AND "ci_runners"."token_encrypted" IN ('<REDACTED>', '<REDACTED>') LIMIT 1
    "15.8.0"
- When looking into the ci_runners table further, we can still see the old runner version 15.7.3 being stored:

    irb(main):023:0> pp Ci::Runner.all
    Ci::Runner Load (0.9ms) /*application:console,db_config_name:main,console_hostname:ip-172-31-19-34,console_username:ubuntu*/ SELECT "ci_runners".* FROM "ci_runners"
    [#<Ci::Runner:0x00007f6297586b90
      id: 2,
      token: nil,
      created_at: Wed, 25 Jan 2023 07:25:06.362582000 UTC +00:00,
      updated_at: Wed, 25 Jan 2023 07:25:06.362582000 UTC +00:00,
      description: "[FILTERED]",
      contacted_at: Wed, 25 Jan 2023 07:25:15.067399000 UTC +00:00,
      active: true,
      name: "gitlab-runner",
      version: "15.7.3",
      revision: "914aa415",
      platform: "linux",
      architecture: "amd64",
      run_untagged: true,
      locked: true,
      access_level: "not_protected",
      ip_address: "172.31.19.34",
      maximum_timeout: nil,
      runner_type: "instance_type",
      token_encrypted: "<REDACTED>",
      public_projects_minutes_cost_factor: 0.0,
      private_projects_minutes_cost_factor: 1.0,
      config: {},
      executor_type: "docker",
      maintainer_note: nil,
      token_expires_at: nil,
      allowed_plans: [],
      registration_type: 0,
      creator_id: nil,
      tag_list: nil>]

    gitlabhq_production=> SELECT * FROM ci_runners;
    -[ RECORD 1 ]------------------------+-------------------------------------------------
    id                                   | 2
    token                                |
    created_at                           | 2023-01-25 07:25:06.362582
    updated_at                           | 2023-01-25 07:25:06.362582
    description                          | ip-172-31-19-34
    contacted_at                         | 2023-01-25 07:25:15.067399
    active                               | t
    name                                 | gitlab-runner
    version                              | 15.7.3
    revision                             | 914aa415
    platform                             | linux
    architecture                         | amd64
    run_untagged                         | t
    locked                               | t
    access_level                         | 0
    ip_address                           | 172.31.19.34
    maximum_timeout                      |
    runner_type                          | 1
    token_encrypted                      | <REDACTED>
    public_projects_minutes_cost_factor  | 0
    private_projects_minutes_cost_factor | 1
    config                               | {}
    executor_type                        | 3
    maintainer_note                      |
    token_expires_at                     |
    allowed_plans                        | {}
    registration_type                    | 0
    creator_id                           |
- Is this old version value inside PostgreSQL used to update the version value shown on the /admin/runners page when a job is picked up by a runner? Ideally, from a user experience perspective, the version shown when a runner contacts the GitLab instance to poll for jobs and the version shown when a runner takes a job should always be consistent.
- There is some suspicion at the moment that the discrepancy is due to a difference between what is stored in the PostgreSQL database and what is stored in Redis, although clearing the Redis cache via the rake task does not seem to resolve the problem. The following items are of interest:
- https://gitlab.com/gitlab-org/gitlab/-/blob/16c8fc87c72137b9236a33029d6e4a8d1c22489c/app/models/ci/runner.rb#L41-42
- https://gitlab.com/gitlab-org/gitlab/-/blob/16c8fc87c72137b9236a33029d6e4a8d1c22489c/app/models/concerns/redis_cacheable.rb#L10-21
- https://gitlab.com/gitlab-org/gitlab/-/blob/16c8fc87c72137b9236a33029d6e4a8d1c22489c/app/models/ci/runner.rb#L192
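To make the suspicion above concrete, here is a simplified model of how a cache-first attribute reader can return a different value than the persisted column. This is an illustration under stated assumptions, not GitLab's implementation: FakeRunner, CACHE, heartbeat, and persisted_version are hypothetical names, a plain Hash stands in for Redis, and the linked redis_cacheable.rb remains the source of truth for the real behavior.

```ruby
# Simplified, hypothetical model of a cache-first attribute reader.
# None of these names come from GitLab: the Hash stands in for Redis
# and @db_row stands in for the runner's ci_runners row.
class FakeRunner
  CACHE = {} # stand-in for Redis, keyed by runner id

  def initialize(id, db_version)
    @id = id
    @db_row = { version: db_version }
  end

  # Assumed heartbeat behavior: the reported version is written to the
  # cache only, leaving the database row untouched.
  def heartbeat(reported_version)
    CACHE[@id] = { version: reported_version }
  end

  # Cache-first read: prefer the cached value, fall back to the row.
  def version
    cached = CACHE[@id]
    cached ? cached[:version] : @db_row[:version]
  end

  # Direct read of the persisted column, bypassing the cache.
  def persisted_version
    @db_row[:version]
  end
end

runner = FakeRunner.new(2, "15.7.3")
runner.heartbeat("15.8.0")
puts runner.version            # cache-first read
puts runner.persisted_version  # direct column read
FakeRunner::CACHE.delete(2)    # simulate the cache entry disappearing
puts runner.version            # falls back to the row value
```

In this model, any code path that reads the column directly, or that runs while the cache entry is missing, sees the old value. Whether something similar happens when a runner picks up a job is exactly the open question in this report.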