Implement worker to prune stale group runners
What does this MR do and why?
This MR implements a background service that enables deleting stale group runners (that is, CI runners that haven't communicated with the GitLab instance in the last 3 months). The idea is for a follow-up MR to implement a GraphQL mutation that calls this.
NOTE 1: This MR was modeled around the existing WebHooks::DestroyService service.
NOTE 2: I don't have much experience developing Sidekiq jobs, so I'd appreciate additional attention to aspects there that I may have missed.
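The staleness rule described above (no contact with the instance in the last 3 months) might be sketched roughly as follows. This is an illustrative stand-in in plain Ruby, not the MR's code: the `Runner` struct, the `stale?` helper, and the 90-day approximation of "3 months" are all assumptions made for the example.

```ruby
# Hypothetical stand-in for a CI runner record; not GitLab's actual model.
Runner = Struct.new(:id, :created_at, :contacted_at, keyword_init: true)

# Assumption: "3 months" is approximated as 90 days for this sketch.
STALE_TIMEOUT_SECONDS = 90 * 24 * 60 * 60

# A runner counts as stale here when it was created before the cutoff and
# has either never contacted the instance or last did so before the cutoff.
def stale?(runner, now: Time.now)
  cutoff = now - STALE_TIMEOUT_SECONDS
  return false unless runner.created_at < cutoff

  runner.contacted_at.nil? || runner.contacted_at < cutoff
end
```

Under this reading, a runner that never contacted the instance only becomes stale once its creation date is old enough, which is why the validation steps below backdate `created_at`.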
Screenshots or screen recordings
How to set up and validate locally
- Register 200 runners against a group (e.g. gitlab-org, get registration token from http://gdk.test:3000/groups/gitlab-org/-/runners):

    hyperfine --min-runs 200 'gitlab-runner register -config /tmp/config.gdk.toml \
      --executor "shell" \
      --url "http://gdk.test:3000/" \
      --description "Group test runner" \
      --tag-list "shell,mac,gdk,test" \
      --run-untagged="false" \
      --locked="false" \
      --access-level="not_protected" --non-interactive \
      --registration-token="${GROUP_REGISTRATION_TOKEN}"'
- Change the created_at field for the last 100 runners in the GDK console, so that they are considered stale:

    > group = ::Group.find(21)
    > group.runners.limit(100).update_all(created_at: 4.months.ago)
    > group.runners.stale.count
    => 100
- The group Runners page should now list half never contacted runners and half stale runners.
- Start the worker from the GDK console:

    > Ci::Runners::StaleGroupRunnersPruneWorker.new.perform(User.first, group)
    => {:async=>false, :total_pruned=>100, :status=>:success}
As expected, total_pruned returned 100 which was the count of stale runners, and 100 being smaller than BATCH_SIZE, the work was done synchronously without going through Sidekiq. If we change another 50 runners to become stale, and artificially change Ci::Runners::StaleGroupRunnersPruneService::BATCH_SIZE to 10, then we should see 5 batches being executed in Sidekiq.

    > group.runners.limit(50).update_all(created_at: 4.months.ago)
    > group.runners.stale.count
    => 50
    > Ci::Runners::StaleGroupRunnersPruneWorker.new.perform(User.first, group)
    => {:async=>false, :total_pruned=>50, :status=>:success}
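The batch arithmetic in the steps above can be checked with a small standalone sketch. It is not the service from this MR: `plan_prune` is a hypothetical helper, the batch size of 1000 in the usage lines is a placeholder rather than the real BATCH_SIZE, and treating a count equal to the batch size as asynchronous is an assumption.

```ruby
# Hypothetical helper: decides how a prune request would be split up.
# Mirrors the description: fewer stale runners than one batch means the
# work can run inline; otherwise it is split into batch_size-sized slices.
def plan_prune(stale_ids, batch_size:)
  batches = stale_ids.each_slice(batch_size).to_a
  # Assumption: a count equal to batch_size is treated as async here.
  { async: stale_ids.size >= batch_size, batches: batches }
end

# 100 stale runners with a (placeholder) batch size of 1000: one inline batch.
inline_plan = plan_prune((1..100).to_a, batch_size: 1000)

# 50 stale runners with the batch size lowered to 10: five batches.
batched_plan = plan_prune((1..50).to_a, batch_size: 10)
```

Here `inline_plan[:async]` is false with a single batch, while `batched_plan` holds five batches of ten IDs each, matching the "5 batches" expectation stated in the text.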
Database queries
The purging job in this MR closely follows the script created for https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5910, and tested in !74503 (closed). I'm happy to add more details or clarify things if needed.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Part of #19865 #342605 (closed) Closes Implement worker to remove stale runners from G... (#361112 - closed)