# Clean up old expired job artifacts on gitlab.com

## Summary
We have around 800TB (80 million records) of removable artifacts that can't be removed because of inefficient database queries.
`Ci::DestroyExpiredJobArtifactsService` is triggered every 7 minutes to remove expired and unlocked artifacts. It looks at 100,000 artifacts that have an `expire_at` value less than the current timestamp and selects the ones that are unlocked. This is not efficient because it can't reach artifacts with an older `expire_at` value, and it's not possible to add another index for this query because we would need to join three large tables to gather the data.
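To make the bottleneck concrete, here is a toy simulation in plain Ruby (not GitLab code; the names and window size are illustrative) of how a fixed window over the oldest-expiring artifacts gets clogged by locked rows:

```ruby
# Toy model: the cleanup scans only the N oldest expired artifacts per run.
# If those slots are occupied by locked artifacts, unlocked artifacts with
# newer expire_at values are never reached.
Artifact = Struct.new(:id, :expire_at, :locked)

def destroy_expired(artifacts, window_size:, now:)
  window = artifacts
    .select { |a| a.expire_at < now }
    .sort_by(&:expire_at)
    .first(window_size)                 # only the oldest N expired rows
  removable = window.reject(&:locked)   # locked rows stay and refill the window
  artifacts - removable
end

artifacts = [
  Artifact.new(1, 10, true),  # old but locked: occupies a window slot forever
  Artifact.new(2, 20, true),  # same
  Artifact.new(3, 30, false), # unlocked, but never reached with window_size 2
]

remaining = destroy_expired(artifacts, window_size: 2, now: 100)
remaining.map(&:id) # nothing was removed; the window is full of locked rows
```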
## Improvements

As part of #327281 (closed) we:

- Added a new `locked` column to `ci_job_artifacts` and created a partial index on it that will allow us to query for unlocked and expired artifacts more efficiently (!70235 (merged))
- Started populating the new `locked` column on `ci_job_artifacts` when new artifacts are created and when pipelines are unlocked (!72406 (merged))
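For illustration, a partial index like the one described might look roughly like the following migration sketch. This is a hypothetical fragment, not the actual !70235 migration; the index name and the `locked` enum value are assumptions:

```ruby
# Hypothetical schema migration fragment (not GitLab's actual migration).
# A partial index over expire_at, restricted to unlocked rows, lets the
# cleanup service find removable artifacts without joining other tables.
class AddPartialIndexOnUnlockedExpiredArtifacts < ActiveRecord::Migration[6.1]
  disable_ddl_transaction!

  UNLOCKED = 0 # assumed enum value, for illustration only

  def up
    add_index :ci_job_artifacts, :expire_at,
              where: "locked = #{UNLOCKED}",
              name: 'index_ci_job_artifacts_expire_at_unlocked',
              algorithm: :concurrently
  end

  def down
    remove_index :ci_job_artifacts,
                 name: 'index_ci_job_artifacts_expire_at_unlocked'
  end
end
```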
## Remaining work

The preferred approach to this cleanup is to add some sort of process that iterates through old pipelines and artifacts that would otherwise likely never be touched, and marks the artifacts as unlocked so they can be deleted by the now-more-performant `DestroyAllExpiredService`.
Our preferred options here include:
- Add a new background worker that iterates through all old pipelines and marks their `ci_job_artifacts` records' `locked` statuses, then completely remove the worker in a release or two once its job is done.
- Add some code to our now-more-performant `DestroyAllExpired` service to iterate through old pipelines and mark their `ci_job_artifacts` records' `locked` statuses with the remainder of the 5-minute execution window, letting it time out. This will chip away at the `locked = unknown` records in `ci_job_artifacts`, allowing those to be properly destroyed by the `DestroyAllExpiredService` on its next execution. Additionally, if an artifact being set to unlocked is already expired, we can remove it straight away while we have it loaded in memory.
  - Adding a partial index on `ci_job_artifacts WHERE locked = unknown` will help with this. Initially, the index will be very large because it will cover nearly every artifact in the table. Over time, however, the index will shrink and executions will become relatively faster, because we'll be looking for a smaller proportion of the existing records and avoiding sequential scans. When the index is empty, we can remove it at our convenience.
- Write a background migration to iterate through all old pipelines and mark their `ci_job_artifacts` records' `locked` statuses. This has database performance risks that we would need to evaluate.
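The second option above can be sketched in plain Ruby. This is a minimal toy model, not GitLab's implementation; `Pipeline`, `Artifact`, and `unlock_old_artifacts` are hypothetical names, and the real service would work in batched database updates:

```ruby
# Toy sketch: spend the remainder of a fixed execution window walking old
# pipelines, marking their artifacts unlocked, and deleting artifacts that
# are already expired while they are loaded in memory.
Pipeline = Struct.new(:id, :artifacts)
Artifact = Struct.new(:id, :locked, :expire_at)

def unlock_old_artifacts(pipelines, deadline:, now:)
  destroyed = []
  pipelines.each do |pipeline|
    break if Time.now >= deadline # honor the execution window, then time out
    pipeline.artifacts.each do |artifact|
      artifact.locked = false
      # Already expired? Remove it straight away instead of waiting for the
      # next DestroyAllExpiredService run.
      destroyed << artifact.id if artifact.expire_at && artifact.expire_at < now
    end
    pipeline.artifacts.reject! { |a| destroyed.include?(a.id) }
  end
  destroyed
end

pipelines = [
  Pipeline.new(1, [Artifact.new(1, true, 50), Artifact.new(2, true, nil)])
]
destroyed = unlock_old_artifacts(pipelines, deadline: Time.now + 300, now: 100)
```

Here artifact 1 is expired and removed immediately, while artifact 2 (no `expire_at`) is merely unlocked so a later cleanup run can consider it.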
If none of those options are possible for whatever reason:
- We wrangle an SRE to do it manually, and figure out how to instruct self-hosted customers to do the same. This is not ideal for our own workflow or for our self-hosted customers.