# Clean up old expired job artifacts on gitlab.com

## Summary
We have around 800TB (80 million records) of removable artifacts that can't be removed because of inefficient database queries.
`Ci::DestroyExpiredJobArtifactsService` is triggered every 7 minutes to remove expired and unlocked artifacts. It looks at 100,000 artifacts that have an `expire_at` value less than the current timestamp and selects the ones that are unlocked. This is not efficient because it can't reach artifacts with an older `expire_at` value, and it's not possible to add another index for this query because we would need to join three large tables to gather the data.
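To make the bottleneck concrete, here is a toy simulation in plain Ruby (not GitLab code; the names and window size are illustrative) of how a fixed window over the oldest-expiring artifacts gets clogged by locked rows:

```ruby
# Toy model: the cleanup scans only the N oldest expired artifacts per run.
# If those slots are occupied by locked artifacts, unlocked artifacts with
# newer expire_at values are never reached.
Artifact = Struct.new(:id, :expire_at, :locked)

def destroy_expired(artifacts, window_size:, now:)
  window = artifacts
    .select { |a| a.expire_at < now }
    .sort_by(&:expire_at)
    .first(window_size)                 # only the oldest N expired rows
  removable = window.reject(&:locked)   # locked rows stay and refill the window
  artifacts - removable
end

artifacts = [
  Artifact.new(1, 10, true),  # old but locked: occupies a window slot forever
  Artifact.new(2, 20, true),  # same
  Artifact.new(3, 30, false), # unlocked, but never reached with window_size 2
]

remaining = destroy_expired(artifacts, window_size: 2, now: 100)
remaining.map(&:id) # nothing was removed; the window is full of locked rows
```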
## Improvements

As part of #327281 (closed) we:

- Added a new `locked` column to `ci_job_artifacts` and created a partial index on it that will allow us to query for unlocked and expired artifacts more efficiently (!70235 (merged))
- Started populating the new `locked` column on `ci_job_artifacts` when new artifacts are created and when pipelines are unlocked (!72406 (merged))
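For illustration, a partial index like the one described might look roughly like the following migration sketch. This is a hypothetical fragment, not the actual !70235 migration; the index name and the `locked` enum value are assumptions:

```ruby
# Hypothetical schema migration fragment (not GitLab's actual migration).
# A partial index over expire_at, restricted to unlocked rows, lets the
# cleanup service find removable artifacts without joining other tables.
class AddPartialIndexOnUnlockedExpiredArtifacts < ActiveRecord::Migration[6.1]
  disable_ddl_transaction!

  UNLOCKED = 0 # assumed enum value, for illustration only

  def up
    add_index :ci_job_artifacts, :expire_at,
              where: "locked = #{UNLOCKED}",
              name: 'index_ci_job_artifacts_expire_at_unlocked',
              algorithm: :concurrently
  end

  def down
    remove_index :ci_job_artifacts,
                 name: 'index_ci_job_artifacts_expire_at_unlocked'
  end
end
```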
## Remaining work

The preferred approach to this cleanup is to add some sort of process that iterates through old pipelines and artifacts that would otherwise likely never be touched, and marks the artifacts as unlocked so they can be deleted by the now-more-performant `DestroyAllExpiredService`.
Our preferred options here include:
- Add a new background worker that iterates through all old pipelines and marks their `ci_job_artifacts` records' `locked` statuses, then completely remove the worker in a release or two once its job is done.
- Add some code to our now-more-performant `DestroyAllExpired` service to iterate through old pipelines and mark their `ci_job_artifacts` records' `locked` statuses with the remainder of the 5-minute execution window, letting it time out. This will chip away at the `locked = unknown` records in `ci_job_artifacts`, allowing those to be properly destroyed by the `DestroyAllExpiredService` on its next execution. Additionally, if an artifact being set to unlocked is already expired, we can remove it straight away while we have it loaded in memory.
  - Adding a partial index on `ci_job_artifacts WHERE locked = unknown` will help with this. Initially, the index will be very large because it will cover nearly every artifact in the table. Over time, however, the index will shrink and executions will become relatively faster, because we'll be looking for a smaller proportion of the existing records and avoiding sequential scans. When the index is empty, we can remove it at our convenience.
- Write a background migration to iterate through all old pipelines and mark their `ci_job_artifacts` records' `locked` statuses. This has database performance risks that we would need to evaluate.
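The second option above can be sketched in plain Ruby. This is a minimal toy model, not GitLab's implementation; `Pipeline`, `Artifact`, and `unlock_old_artifacts` are hypothetical names, and the real service would work in batched database updates:

```ruby
# Toy sketch: spend the remainder of a fixed execution window walking old
# pipelines, marking their artifacts unlocked, and deleting artifacts that
# are already expired while they are loaded in memory.
Pipeline = Struct.new(:id, :artifacts)
Artifact = Struct.new(:id, :locked, :expire_at)

def unlock_old_artifacts(pipelines, deadline:, now:)
  destroyed = []
  pipelines.each do |pipeline|
    break if Time.now >= deadline # honor the execution window, then time out
    pipeline.artifacts.each do |artifact|
      artifact.locked = false
      # Already expired? Remove it straight away instead of waiting for the
      # next DestroyAllExpiredService run.
      destroyed << artifact.id if artifact.expire_at && artifact.expire_at < now
    end
    pipeline.artifacts.reject! { |a| destroyed.include?(a.id) }
  end
  destroyed
end

pipelines = [
  Pipeline.new(1, [Artifact.new(1, true, 50), Artifact.new(2, true, nil)])
]
destroyed = unlock_old_artifacts(pipelines, deadline: Time.now + 300, now: 100)
```

Here artifact 1 is expired and removed immediately, while artifact 2 (no `expire_at`) is merely unlocked so a later cleanup run can consider it.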
If none of those options are possible for whatever reason:
- We wrangle an SRE to do it manually, and figure out how to instruct self-hosted customers to do the same. This is not ideal for our own workflow or for our self-hosted customers.