Optimize CI::JobArtifact querying from ExpireBuildArtifactsWorker
What does this MR do and why?
This MR introduces three optimizations in, and extra feature flags for more granular control of, the `ExpireBuildArtifactsWorker`. In general, I recommend reading this MR commit by commit.
The first two commits introduce new feature flags that will allow us to re-enable the worker safely. At first, all the flags will be off.
- `ci_job_artifacts_backlog_work`: a flag placed specifically for turning on and off the section of code that queries expired but not-yet-removed artifacts.
- `ci_job_artifacts_large_loop_limit`: a flag allowing us to start at a maximum of 10,000 expirations per worker execution (10% of the prior limit) and scale up to a 50,000 maximum (50% of the prior limit) if we find the performance to be acceptable.
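To make the gating concrete, here is a minimal sketch of how the two flags could wrap the worker's service. `Feature.enabled?` is GitLab's real feature-flag check; the class layout, method names, and constants are illustrative assumptions, not the actual implementation.

```ruby
# Illustrative sketch only: the flag API (Feature.enabled?) is real,
# the service structure and method names are hypothetical.
module Ci
  module JobArtifacts
    class DestroyAllExpiredService
      SMALL_LOOP_LIMIT = 10_000 # 10% of the prior limit
      LARGE_LOOP_LIMIT = 50_000 # 50% of the prior limit

      def execute
        # The ongoing, efficient expiration of new artifacts always runs.
        destroy_new_expired_artifacts

        # The backlog path only runs when explicitly enabled.
        destroy_backlog_artifacts if Feature.enabled?(:ci_job_artifacts_backlog_work)
      end

      private

      def loop_limit
        if Feature.enabled?(:ci_job_artifacts_large_loop_limit)
          LARGE_LOOP_LIMIT
        else
          SMALL_LOOP_LIMIT
        end
      end
    end
  end
end
```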
For the query optimizations, see the long commit messages on the two commits that introduce them. The descriptions there are good, and with them the diffs will be far more comprehensible.
By introducing these flags, we can re-enable `ci_destroy_unlocked_job_artifacts` and stop the accumulation of new job artifacts that cannot be expired via the older, less performant three-table-join process. Separating this change from the backlog-focused work is a useful improvement over our first attempt.
How to set up and validate locally
The work being done here cannot be fully validated locally. I rebuilt the queries, and the application logic that sends them, with the advice and consent of @msmiley, who was very helpful in explaining what was bad before and how to avoid creating similar inefficiencies.
Once we've restarted the `ExpireBuildArtifactsWorker` and are expiring newly created artifacts efficiently going forward (we were doing this before the incident), we'll start working on the backlog again. We have two feature flags that we'll use in tandem to roll out this change in production:
- `ci_job_artifacts_backlog_work`: enables the code that queries the artifacts backlog, i.e. those `ci_job_artifact` records where `ci_job_artifacts.locked = unknown`.
- `ci_job_artifacts_large_loop_limit`: changes the loop limit that caps the number of records our worker will touch in a single execution from 10,000 to 50,000.
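For reference, the staged rollout can be driven from a Rails console with GitLab's real `Feature` API; the sequencing below is simply the plan described above.

```ruby
# Stage 1: start working the backlog at the conservative 10,000 cap.
Feature.enable(:ci_job_artifacts_backlog_work)

# Stage 2: once performance proves acceptable, raise the cap to 50,000.
Feature.enable(:ci_job_artifacts_large_loop_limit)

# Rollback at any point: stop the backlog work entirely.
Feature.disable(:ci_job_artifacts_backlog_work)
```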
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
- Maintainer: please coordinate synchronously with @drew about merging this MR. It should be a carefully monitored deployment.
Original commit log:
Add order and limit to JobArtifact query
Add reusable query to DestroyAllExpiredService
Reuse the same query to ensure that we apply ORDER and LIMIT
Before:

```sql
UPDATE "ci_job_artifacts" SET "locked" = 2
WHERE "ci_job_artifacts"."job_id" IN (1, 2, 3)
```

After:

```sql
UPDATE "ci_job_artifacts" SET "locked" = 2
WHERE "ci_job_artifacts"."id" IN (
  SELECT "ci_job_artifacts"."id" FROM "ci_job_artifacts"
  WHERE "ci_job_artifacts"."job_id" IN (1, 2, 3)
  LIMIT 100
)
```
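In ActiveRecord terms, the fix routes the update through an `id` subquery, since PostgreSQL's `UPDATE` cannot take `ORDER` or `LIMIT` directly. A minimal sketch, assuming an illustrative `job_ids` variable:

```ruby
# Build a capped batch first; the LIMIT lives in this relation.
batch = Ci::JobArtifact.where(job_id: job_ids).limit(100)

# Rails renders batch.select(:id) as the inner SELECT, producing the
# "After" query above: UPDATE ... WHERE id IN (SELECT id ... LIMIT 100).
Ci::JobArtifact.where(id: batch.select(:id)).update_all(locked: 2)
```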
Granular logging in ExpireBuildArtifactsWorker
Destroy unlocked job artifacts without updating
By updating rows in ci_job_artifacts to then be deleted on the next
loop cycle, we create twice as many dead tuples as necessary in the
table. Artifacts locked by ci_pipelines.locked will be updated and
left in place, but the others will get passed directly into the
DestroyBatchService to be removed. The dead tuples created by the
update are unnecessary.
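A rough sketch of the control-flow change this commit describes. `Ci::JobArtifacts::DestroyBatchService` is the real service named in the commit, though its exact call signature may differ; the helper and variable names (`partition_by_pipeline_lock`, `expired_batch`) are hypothetical.

```ruby
# Split the expired batch: artifacts whose pipeline is locked are kept
# and marked; everything else is destroyed immediately instead of being
# updated now and deleted on the next loop cycle.
locked_ids, unlocked_artifacts = partition_by_pipeline_lock(expired_batch)

# One UPDATE for the rows we keep in place...
Ci::JobArtifact.where(id: locked_ids).update_all(locked: 2)

# ...and a direct destroy for the rest: no intermediate UPDATE, so no
# extra dead tuples. (Call shape approximate.)
Ci::JobArtifacts::DestroyBatchService.new(unlocked_artifacts).execute
```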
Optimize Ci::JobArtifact query
First, we need to remove the DISTINCT clause from this query. Using
it forced a de-duplication of the whole table before applying the
limit, which is bad. Also, using DISTINCT implicitly queried for the
lowest job_id values first, which created an enormous pile of dead
tuples at the end of the index that we were explicitly starting our
query from. This is catastrophic for index performance on replicas,
but not on the primary, which can use the LP_DEAD hint to skip dead
index tuples. On replicas, performance remains degraded until we
VACUUM.
Second, we apply an explicit order by expiration to prevent us from
accidentally querying rows in an order that correlates with job_id.
By ensuring that we query rows that are generally well-distributed
across the table, we prevent large blocks of dead index tuples that
need to be scanned before we can come up with 100 live tuples to
return.
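Taken together, the two fixes amount to the following change in query shape. A minimal sketch assuming an `expired` scope; `expire_at` is the real expiry column, but this is not the literal GitLab code:

```ruby
# Before: DISTINCT de-duplicates the entire candidate set before the
# LIMIT applies and implicitly favors the lowest job_id values,
# concentrating dead index tuples at the start of every scan.
Ci::JobArtifact.expired.distinct.limit(100)

# After: no DISTINCT, plus an explicit order by expiry so each batch is
# generally well-distributed across the table instead of correlating
# with job_id.
Ci::JobArtifact.expired.order(:expire_at).limit(100)
```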