Implement worker to remove stale runners from GitLab SaaS
Overview
We need to create a background job that will run regularly (CronjobQueue) to prune stale runners for a given namespace, if that namespace opted into pruning. Opting in to pruning is done by setting a new allow_stale_runner_pruning field to true in namespace_settings. The background job will enumerate namespaces that have this flag turned on and delete any runners that haven't contacted GitLab in the last 3 months (i.e. stale).
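The selection logic described above can be sketched in plain Ruby. This is only an illustration under stated assumptions: the Runner and NamespaceSetting structs, the contacted_at field name, the stale_runners helper, and the 90-day approximation of "3 months" are all hypothetical and not taken from the GitLab codebase. Runners that have never contacted the instance are left alone here, since this issue does not say how they should be treated.

```ruby
require 'set'

# Hypothetical stand-ins for the real records; field names are assumptions.
Runner = Struct.new(:id, :namespace_id, :contacted_at, keyword_init: true)
NamespaceSetting = Struct.new(:namespace_id, :allow_stale_runner_pruning, keyword_init: true)

# "3 months" approximated as 90 days for this sketch.
STALE_AFTER_SECONDS = 90 * 24 * 60 * 60

# Returns the runners that a single pruning pass would delete.
def stale_runners(runners, settings, now: Time.now)
  # Only namespaces that explicitly opted in are considered.
  opted_in = settings.select(&:allow_stale_runner_pruning).map(&:namespace_id).to_set
  cutoff = now - STALE_AFTER_SECONDS

  runners.select do |runner|
    next false unless opted_in.include?(runner.namespace_id)
    # Assumption: runners with no recorded contact are skipped, not pruned.
    next false if runner.contacted_at.nil?

    runner.contacted_at < cutoff
  end
end
```

A real worker would presumably delete in batches and scope the query in the database rather than filtering in memory; the sketch only shows which runners qualify.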
Problem
As part of https://gitlab.com/gitlab-org/gitlab/-/issues/321368#note_689886009, I've identified that in most namespaces containing more than 1,000 runners, fewer than 1% of the runners have contacted GitLab.com recently. For example, there is a project with 250K+ runners and a namespace with 350K+ runners. This causes unnecessary load on the database and makes it harder to estimate the performance of a given query.
We've recently enabled the ci_runner_limits feature flag, which aims to keep a ceiling of 1,000 runners per namespace/project. Still, a user can have 1,000 runners registered while using only 10 of them, so we should have a way of identifying this situation and automatically pruning the unused runners after a certain time.
Implementation tasks for GitLab.com
- Implement database migration to add the allow_stale_runner_pruning field to namespace_settings.
- Implement background job to enumerate namespaces that have this flag turned on and delete any runners that haven't contacted GitLab in the last 3 months (i.e. stale).
- Implement GraphQL mutation to enable allow_stale_runner_pruning for a namespace.
One thing that is not yet clear is how we'd want to communicate this to customers in advance, so that it doesn't come as a surprise when runners are pruned. Since the runner limits only take into consideration runners that have been registered over 3 months ago, we could start with a window that is longer than 3 months (e.g. 6 months). Users would then have 3 months after release to adapt to the new pruning routine, e.g. by changing the configuration for the expiration delay.
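To make the effect of the longer initial window concrete, here is a small sketch comparing a 3-month and a 6-month window. The helper names stale_cutoff and stale? and the dates are illustrative assumptions, not part of any existing API.

```ruby
require 'date'

# Hypothetical helper: last contacts before this date count as stale.
# Date#<< returns the date the given number of months earlier.
def stale_cutoff(today, window_months)
  today << window_months
end

# True when the runner's last contact falls outside the window.
def stale?(last_contact, today, window_months)
  last_contact < stale_cutoff(today, window_months)
end

today = Date.new(2022, 6, 15)
last_contact = Date.new(2022, 2, 1) # about 4.5 months before today

puts stale?(last_contact, today, 3) # true: pruned under a 3-month window
puts stale?(last_contact, today, 6) # false: kept under a 6-month window
```

With the 6-month window, a runner that went quiet shortly before release is not pruned immediately, which is the grace period the paragraph above describes.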
Follow up work
- Add a notice to Runner registration pages in GitLab when automatic expiration is enabled, with a message explaining that any runner that has not contacted the instance for the configured expiration period will be automatically unregistered.
- Create an audit log of deleted runners by namespace.