
Use spare loop cycles in DestroyAllExpiredService

drew stachon requested to merge set-artifacts-locked-with-spare-time into master

What does this MR do and why?

This MR uses the time and cycles left over in loop_until to set the locked value of Ci::JobArtifact records in Ci::JobArtifact::DestroyAllExpiredService.

Since we started using the new locked column on the job_artifacts table, Ci::JobArtifact::DestroyAllExpiredService has had headroom to do extra work. That column solved our first problem: the growing backlog of expired but unremoved artifacts. The DestroyAllExpiredService defines operational limits of 100,000 records or 300 seconds, and right now we are not particularly close to either of them.
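
For context, loop_until is what bounds the work per run. Below is a minimal sketch of how those limits translate into code; the constant names and the loop_until(timeout:, limit:) signature are assumptions for illustration, not the exact implementation:

# Illustrative only: 1,000 iterations over 100-record batches gives the
# 100,000-record limit, and the timeout gives the 300-second limit.
LOOP_TIMEOUT = 5.minutes
LOOP_LIMIT   = 1_000
BATCH_SIZE   = 100

loop_until(timeout: LOOP_TIMEOUT, limit: LOOP_LIMIT) do
  destroy_artifacts_batch # assumed helper; returns false once nothing is left to destroy
end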

Since the new code we're adding is solely for working through the no-longer-growing backlog of un-removable expired artifacts, and will be removed in another couple of milestones, adding a new worker is somewhat onerous. Instead, we're adding code that, given extra time, takes a pass at Ci::JobArtifact.artifacts_unknown records in batches of 100 and updates their locked status based on the locked status of the related Ci::Pipeline. If it successfully updates artifact records, the loop will continue and remove those freshly unlocked artifact records.
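
Roughly, that extra pass could look like the sketch below. The scopes are the ones shown in the queries further down, the artifact → job → pipeline associations are assumed, and the per-record update is simplified rather than the exact implementation:

# Sketch only: copy each pipeline's locked status onto a batch of expired,
# unknown-status artifacts, so the next iteration can destroy the ones that
# turn out to be unlocked.
batch = Ci::JobArtifact.expired_before(@start_at).artifact_unknown.limit(BATCH_SIZE)

batch.each do |artifact|
  artifact.update!(locked: artifact.job.pipeline.locked)
end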

By doing this, we'll chip away at our existing backlog of at least 800TB of expired artifacts that we have not been able to remove in the past.

When Ci::JobArtifact::DestroyAllExpiredService runs out of work, it will stop executing the loop and exit the worker.

The only queries we run on every iteration of the loop, to figure out whether we have work to do, are both cheap. Although they're broken out into scopes in the code, here they are written out in ActiveRecord to make them slightly easier to read:

Ci::JobArtifact.expired_before(@start_at).artifact_unlocked.limit(BATCH_SIZE) # BATCH_SIZE is 100
Ci::JobArtifact.expired_before(@start_at).artifact_unknown.limit(BATCH_SIZE).distinct.pluck(:job_id) # BATCH_SIZE is 100

Observability

When this code starts running, we should see the rows in this Kibana search show at least one value at its operational limit: either 100,000 records or 300 seconds.

When the backlog of locked = unknown artifacts has been worked through, the service will exit sooner and we'll see headroom against the limits in both columns again.

We should see a decrease in this Sisense chart of total artifact size (NEEDS LINK)

We should see a decrease in our AWS bill. (NEEDS LINK)

Current concerns

  • We're only querying for expired artifacts. If there are existing artifacts configured to expire in the distant future, with their status unknown, we'll never update that status.
    • One implication is that, in order to remove this code, we will need to ship a background migration that cleans up anything left over, and that migration would run into those future-expiring, unknown-status artifacts (see the sketch below).
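
As a rough illustration, the records this code will never touch, and which a later cleanup migration would have to handle, are the unknown-status artifacts whose expiry is still in the future. The scope name here is assumed from the queries above:

# Sketch only: unknown-status artifacts that DestroyAllExpiredService never sees,
# because their expire_at is still in the future.
Ci::JobArtifact.artifact_unknown.where('expire_at > ?', Time.current)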

Screenshots or screen recordings

These are strongly recommended to assist reviewers and reduce the time to merge your change.

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by drew stachon
