Draft: Resolve "Clean up old expired job artifacts"
What does this MR do?
Changes Ci::JobArtifacts::DestroyAllExpiredService
to add a loop_until
strategy using a query that only finds unlocked and expired job artifacts, and puts this new strategy behind a feature flag :ci_destroy_unlocked_job_artifacts
, disabled by default.
The existing strategy uses each_batch
and then loads the unlocked
on each batch as it iterates. This turns out to be quite expensive (minutes, so may never actually find any unlocked and expired rows in the timeout period) vs. using a query that includes locked = 0
(milliseconds) on the same batch size.
Based on the worst-case EXPLAIN
in database-lab, using this strategy would allow us to get through at least 9.9 million rows per day, or through the 80 million records referenced in !322817 in about 9 days.
Database changes
SELECT
"ci_job_artifacts".*
FROM
"ci_job_artifacts"
INNER JOIN "ci_builds" ON "ci_builds"."id" = "ci_job_artifacts"."job_id"
AND "ci_builds"."type" = 'Ci::Build'
INNER JOIN "ci_pipelines" ON "ci_pipelines"."id" = "ci_builds"."commit_id"
WHERE
"ci_job_artifacts"."expire_at" < '2021-08-10 21:24:00.921962'
AND "ci_pipelines"."locked" = 0
LIMIT 100;
Details: https://console.postgres.ai/shared/3db40326-089c-4c25-b0be-cf0385ff8b26
Statistics:
Time: 683.472 ms
- planning: 26.045 ms
- execution: 657.427 ms
- I/O read: 651.964 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 568 (~4.40 MiB) from the buffer pool
- reads: 506 (~4.00 MiB) from the OS file cache, including disk I/O
- dirtied: 2 (~16.00 KiB)
- writes: 0
Plan:
Limit (cost=1.15..894.21 rows=100 width=146) (actual time=16.362..657.281 rows=100 loops=1)
Buffers: shared hit=568 read=506 dirtied=2
I/O Timings: read=651.964 write=0.000
-> Nested Loop (cost=1.15..2309063341.41 rows=258556439 width=146) (actual time=16.360..657.206 rows=100 loops=1)
Buffers: shared hit=568 read=506 dirtied=2
I/O Timings: read=651.964 write=0.000
-> Nested Loop (cost=0.58..1241278628.44 rows=348769047 width=150) (actual time=10.285..327.612 rows=107 loops=1)
Buffers: shared hit=273 read=266
I/O Timings: read=324.800 write=0.000
-> Seq Scan on public.ci_job_artifacts (cost=0.00..120111154.40 rows=363055262 width=146) (actual time=4.191..6.150 rows=107 loops=1)
Filter: (ci_job_artifacts.expire_at < '2021-08-10 21:24:00.921962+00'::timestamp with time zone)
Rows Removed by Filter: 62
Buffers: shared read=4
I/O Timings: read=5.905 write=0.000
-> Index Scan using ci_builds_pkey on public.ci_builds (cost=0.58..3.09 rows=1 width=8) (actual time=3.001..3.001 rows=1 loops=107)
Index Cond: (ci_builds.id = ci_job_artifacts.job_id)
Filter: ((ci_builds.type)::text = 'Ci::Build'::text)
Rows Removed by Filter: 0
Buffers: shared hit=273 read=262
I/O Timings: read=318.896 write=0.000
-> Index Scan using ci_pipelines_pkey on public.ci_pipelines (cost=0.57..3.06 rows=1 width=4) (actual time=3.077..3.077 rows=1 loops=107)
Index Cond: (ci_pipelines.id = ci_builds.commit_id)
Filter: (ci_pipelines.locked = 0)
Rows Removed by Filter: 0
Buffers: shared hit=295 read=240 dirtied=2
I/O Timings: read=327.164 write=0.000
Testing on staging, the picture is even better where queries look like this for each batch of 100:
Ci::JobArtifact Load (3.6ms)
Does this MR meet the acceptance criteria?
Conformity
-
I have included changelog trailers, or none are needed. (Does this MR need a changelog?) -
I have added/updated documentation, or it's not needed. (Is documentation required?) -
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?) -
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?) -
I have self-reviewed this MR per code review guidelines. -
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines) -
I have followed the style guides. -
This change is backwards compatible across updates, or this does not apply.
Availability and Testing
-
I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.) -
I have tested this MR in all supported browsers, or it's not needed. -
I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.
Related to #322817 (closed)