Allow pruning of stale group runners
What does this MR do and why?
This MR is a follow-up to Add namespace_ci_cd_settings table (!86473 - merged). It implements a background cron worker that deletes stale group runners (that is, CI runners that haven't communicated with the GitLab instance in the last 3 months) in groups that have opted in. A follow-up MR will implement a GraphQL mutation that sets the opt-in flag (namespace_ci_cd_settings.allow_stale_runner_pruning).
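To make the staleness rule concrete, here is a minimal plain-Ruby sketch of the predicate as it appears in the WHERE clause of the DELETE query later in this description. The Runner struct and the stale? helper are illustrative stand-ins, not the actual Ci::Runner model or its stale scope.

```ruby
# Hypothetical stand-in for a CI runner record; only the two timestamps
# used by the staleness check are modelled.
Runner = Struct.new(:created_at, :contacted_at, keyword_init: true)

# A runner is stale when it was created before the cutoff AND it has either
# never contacted the instance or last did so before the cutoff.
def stale?(runner, cutoff)
  runner.created_at < cutoff &&
    (runner.contacted_at.nil? || runner.contacted_at < cutoff)
end

# Cutoff taken from the query plan in this description (roughly 3 months back).
cutoff = Time.utc(2022, 2, 9, 16, 16, 31)

never_contacted_old = Runner.new(created_at: Time.utc(2021, 12, 1), contacted_at: nil)
recently_active     = Runner.new(created_at: Time.utc(2021, 12, 1), contacted_at: Time.utc(2022, 5, 1))
newly_registered    = Runner.new(created_at: Time.utc(2022, 5, 1), contacted_at: nil)

puts stale?(never_contacted_old, cutoff)
puts stale?(recently_active, cutoff)
puts stale?(newly_registered, cutoff)
```

The created_at condition is what keeps freshly registered runners that have not contacted the instance yet from being pruned.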
NOTES
- The commits in this MR are individually reviewable.
- I don't have much experience developing Sidekiq jobs, so I'd appreciate extra attention to anything I may have missed there.
- The ci_cd_settings association will not exist in most cases, as this is a new table. I don't have much experience with this scenario in Rails, so I'm looking forward to suggestions on how best to approach it.
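One possible way to handle the missing association, sketched in plain Ruby rather than ActiveRecord: treat an absent settings row as "not opted in" by using safe navigation. The Group and CiCdSettings structs and the stale_runner_pruning_allowed? helper are hypothetical names for illustration only, and this is not necessarily the approach taken in this MR.

```ruby
# Hypothetical stand-ins for a group and its optional settings row.
CiCdSettings = Struct.new(:allow_stale_runner_pruning, keyword_init: true)
Group = Struct.new(:name, :ci_cd_settings, keyword_init: true)

# A missing settings row (nil) is read as "not opted in" instead of
# raising NoMethodError on nil.
def stale_runner_pruning_allowed?(group)
  group.ci_cd_settings&.allow_stale_runner_pruning == true
end

opted_in    = Group.new(name: 'a', ci_cd_settings: CiCdSettings.new(allow_stale_runner_pruning: true))
opted_out   = Group.new(name: 'b', ci_cd_settings: CiCdSettings.new(allow_stale_runner_pruning: false))
no_settings = Group.new(name: 'c', ci_cd_settings: nil)

puts stale_runner_pruning_allowed?(opted_in)
puts stale_runner_pruning_allowed?(opted_out)
puts stale_runner_pruning_allowed?(no_settings)
```

Defaulting to false keeps pruning strictly opt-in, which matches the intent of the flag.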
Screenshots or screen recordings
Step | Screenshot |
---|---|
1. Start with some stale runners | |
2. Enqueue worker in http://gdk.test:3000/admin/background_jobs | |
3. Check that stale runners are no longer there | |
4. Logs | |
How to set up and validate locally
These are manual steps (not using the Sidekiq dashboard):

- Ensure you have gitlab-runner installed on your machine.
- Register 200 runners against a group (for example gitlab-org; get the registration token from http://gdk.test:3000/groups/gitlab-org/-/runners). In this example we use hyperfine to repeat the command:

      $ brew install hyperfine
      $ hyperfine --min-runs 200 'gitlab-runner register --config /tmp/config.gdk.toml \
          --executor "shell" \
          --url "http://gdk.test:3000/" \
          --description "Group test runner" \
          --tag-list "shell,mac,gdk,test" \
          --run-untagged="false" \
          --locked="false" \
          --access-level="not_protected" --non-interactive \
          --registration-token="${GROUP_REGISTRATION_TOKEN}"'

- Change the created_at field for 100 of the runners in the GDK console, so that they are considered stale:

      > group = ::Group.find(21)
      > group.runners.limit(100).update_all(created_at: 4.months.ago)
      > group.runners.stale.count
      => 100

- The group Runners page should now list half of the runners as never contacted and half as stale.
- Start the worker from the GDK console:

      > Ci::Runners::StaleGroupRunnersPruneCronWorker.new.perform
      => {:total_pruned=>100, :status=>:success}

  As expected, total_pruned is 100, which matches the count of stale runners.
Database query plans
The findings in Draft: Test deleting stale CI runners (!74503 - closed) and https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5910 are relevant here, as the service logic to purge stale runners is very similar.
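Read together, the queries in this section suggest roughly the following loop. This is a simplified in-memory sketch under stated assumptions, not the worker's actual code: the RunnerRow struct, the prune_stale_group_runners function and the two constants are illustrative names, and issuing a single capped delete per namespace window is an assumption.

```ruby
# Hypothetical in-memory stand-ins; the real worker issues the SQL shown in this section.
NAMESPACE_BATCH_SIZE = 1000 # matches the each_batch window (OFFSET 1000)
DELETE_LIMIT         = 5000 # matches LIMIT 5000 in the DELETE subquery

RunnerRow = Struct.new(:id, :namespace_id, :created_at, :contacted_at, keyword_init: true)

# opted_in_namespace_ids: namespaces with allow_stale_runner_pruning = TRUE
# runners: mutable array standing in for the ci_runners table
def prune_stale_group_runners(opted_in_namespace_ids, runners, cutoff)
  # Equivalent of the "check if any groups exist" query.
  return { total_pruned: 0, status: :success } if opted_in_namespace_ids.empty?

  total = 0
  # Walk the opted-in namespaces in ascending id order, one window at a time.
  opted_in_namespace_ids.sort.each_slice(NAMESPACE_BATCH_SIZE) do |namespace_ids|
    doomed_ids = runners.select do |r|
      namespace_ids.include?(r.namespace_id) &&
        r.created_at < cutoff &&
        (r.contacted_at.nil? || r.contacted_at < cutoff)
    end.first(DELETE_LIMIT).map(&:id)

    runners.reject! { |r| doomed_ids.include?(r.id) }
    total += doomed_ids.size
  end

  { total_pruned: total, status: :success }
end

# Sample data mirroring the local validation steps: 200 runners in group 21,
# 100 of them created before the cutoff, plus one stale runner in a group
# that has not opted in.
now        = Time.utc(2022, 5, 9)
stale_time = Time.utc(2022, 1, 9)
cutoff     = Time.utc(2022, 2, 9)

runners = (1..200).map do |i|
  RunnerRow.new(id: i, namespace_id: 21,
                created_at: (i <= 100 ? stale_time : now), contacted_at: nil)
end
runners << RunnerRow.new(id: 201, namespace_id: 99, created_at: stale_time, contacted_at: nil)

result = prune_stale_group_runners([21], runners, cutoff)
puts result.inspect
```

Capping both the namespace window and the delete keeps each statement bounded, which is what the query plans in this section evaluate.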
Check if any groups exist
SELECT 1 AS one
FROM "namespace_ci_cd_settings"
WHERE "namespace_ci_cd_settings"."allow_stale_runner_pruning" = TRUE
LIMIT 1
Limit (cost=0.00..0.06 rows=1 width=4) (actual time=0.014..0.014 rows=0 loops=1)
I/O Timings: read=0.000 write=0.000
-> Seq Scan on public.namespace_ci_cd_settings (cost=0.00..62.00 rows=1100 width=4) (actual time=0.005..0.006 rows=0 loops=1)
Filter: namespace_ci_cd_settings.allow_stale_runner_pruning
Rows Removed by Filter: 0
I/O Timings: read=0.000 write=0.000
https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/10000/commands/35427
each_batch window start
SELECT "namespace_ci_cd_settings"."namespace_id"
FROM "namespace_ci_cd_settings"
WHERE "namespace_ci_cd_settings"."allow_stale_runner_pruning" = TRUE
ORDER BY "namespace_ci_cd_settings"."namespace_id" ASC
LIMIT 1
Limit (cost=0.12..0.15 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=1)
Buffers: shared hit=1
I/O Timings: read=0.000 write=0.000
-> Index Only Scan using index_cicd_settings_on_namespace_id_where_stale_pruning_enabled on public.namespace_ci_cd_settings (cost=0.12..27.63 rows=1100 width=8) (actual time=0.002..0.003 rows=0 loops=1)
Heap Fetches: 0
Buffers: shared hit=1
I/O Timings: read=0.000 write=0.000
https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/10000/commands/35428
each_batch window end
SELECT "namespace_ci_cd_settings"."namespace_id"
FROM "namespace_ci_cd_settings"
WHERE "namespace_ci_cd_settings"."allow_stale_runner_pruning" = TRUE
AND "namespace_ci_cd_settings"."namespace_id" >= 21
ORDER BY "namespace_ci_cd_settings"."namespace_id" ASC
LIMIT 1 OFFSET 1000
Limit (cost=23.51..23.57 rows=1 width=8) (actual time=0.036..0.036 rows=0 loops=1)
Buffers: shared hit=4
I/O Timings: read=0.000 write=0.000
-> Index Only Scan using index_cicd_settings_on_namespace_id_where_stale_pruning_enabled on public.namespace_ci_cd_settings (cost=0.14..23.51 rows=367 width=8) (actual time=0.034..0.034 rows=0 loops=1)
Index Cond: (namespace_ci_cd_settings.namespace_id >= 21)
Heap Fetches: 0
Buffers: shared hit=4
I/O Timings: read=0.000 write=0.000
https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/10000/commands/35431
Delete runners from the window's groups
DELETE FROM "ci_runners"
WHERE "ci_runners"."id" IN (
SELECT "ci_runners"."id"
FROM "ci_runners"
INNER JOIN "ci_runner_namespaces" ON "ci_runner_namespaces"."runner_id" = "ci_runners"."id"
WHERE "ci_runner_namespaces"."namespace_id" IN (<1000 ids>)
AND (ci_runners.created_at < '2022-02-09 16:16:31.457512'
AND (ci_runners.contacted_at IS NULL
OR ci_runners.contacted_at < '2022-02-09 16:16:31.457512'))
LIMIT 5000)
ModifyTable on public.ci_runners (cost=30531.48..47241.86 rows=5000 width=34) (actual time=8.374..8.377 rows=0 loops=1)
Buffers: shared hit=3005 read=6 dirtied=4
I/O Timings: read=7.003 write=0.000
-> Nested Loop (cost=30531.48..47241.86 rows=5000 width=34) (actual time=8.361..8.363 rows=0 loops=1)
Buffers: shared hit=3005 read=6 dirtied=4
I/O Timings: read=7.003 write=0.000
-> HashAggregate (cost=30531.05..30581.05 rows=5000 width=32) (actual time=8.361..8.362 rows=0 loops=1)
Group Key: "ANY_subquery".id
Buffers: shared hit=3005 read=6 dirtied=4
I/O Timings: read=7.003 write=0.000
-> Subquery Scan on ANY_subquery (cost=0.86..30518.55 rows=5000 width=32) (actual time=8.331..8.332 rows=0 loops=1)
Buffers: shared hit=3005 read=6 dirtied=4
I/O Timings: read=7.003 write=0.000
-> Limit (cost=0.86..30468.55 rows=5000 width=4) (actual time=8.330..8.331 rows=0 loops=1)
Buffers: shared hit=3005 read=6 dirtied=4
I/O Timings: read=7.003 write=0.000
-> Nested Loop (cost=0.86..39237.15 rows=6439 width=4) (actual time=8.328..8.329 rows=0 loops=1)
Buffers: shared hit=3005 read=6 dirtied=4
I/O Timings: read=7.003 write=0.000
-> Index Scan using index_ci_runner_namespaces_on_namespace_id on public.ci_runner_namespaces (cost=0.43..9472.16 rows=9137 width=4) (actual time=8.327..8.327 rows=0 loops=1)
Index Cond: (ci_runner_namespaces.namespace_id = ANY ('{<1000 ids>}'::integer[]))
Buffers: shared hit=3005 read=6 dirtied=4
I/O Timings: read=7.003 write=0.000
-> Index Scan using ci_runners_pkey on public.ci_runners ci_runners_1 (cost=0.43..3.26 rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=0)
Index Cond: (ci_runners_1.id = ci_runner_namespaces.runner_id)
Filter: ((ci_runners_1.created_at < '2022-02-09 16:16:31.457512'::timestamp without time zone) AND ((ci_runners_1.contacted_at IS NULL) OR (ci_runners_1.contacted_at < '2022-02-09 16:16:31.457512'::timestamp without time zone)))
Rows Removed by Filter: 0
I/O Timings: read=0.000 write=0.000
-> Index Scan using ci_runners_pkey on public.ci_runners (cost=0.43..3.34 rows=1 width=10) (actual time=0.000..0.000 rows=0 loops=0)
Index Cond: (ci_runners.id = "ANY_subquery".id)
I/O Timings: read=0.000 write=0.000```
https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/10000/commands/35435
</details>
## MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
* [x] I have evaluated the [MR acceptance checklist](https://docs.gitlab.com/ee/development/code_review.html#acceptance-checklist) for this MR.
## Links
- https://gitlab.com/gitlab-org/omnibus-gitlab/-/merge_requests/6094+
- https://gitlab.com/gitlab-org/charts/gitlab/-/merge_requests/2565+
Part of #361112