Draft: Resolve "Clean up old expired job artifacts" (!67908) · Merge requests · GitLab.org / GitLab

Changes Ci::JobArtifacts::DestroyAllExpiredService to add a loop_until strategy using a query that only finds unlocked and expired job artifacts, and puts this new strategy behind a feature flag :ci_destroy_unlocked_job_artifacts, disabled by default.

The existing strategy uses each_batch and then loads the unlocked on each batch as it iterates. This turns out to be quite expensive (minutes, so may never actually find any unlocked and expired rows in the timeout period) vs. using a query that includes locked = 0 (milliseconds) on the same batch size.

Based on the worst-case EXPLAIN in database-lab, using this strategy would allow us to get through at least 9.9 million rows per day, or through the 80 million records referenced in !322817 in about 9 days.

Database changes

SELECT
    "ci_job_artifacts".*
FROM
    "ci_job_artifacts"
    INNER JOIN "ci_builds" ON "ci_builds"."id" = "ci_job_artifacts"."job_id"
        AND "ci_builds"."type" = 'Ci::Build'
    INNER JOIN "ci_pipelines" ON "ci_pipelines"."id" = "ci_builds"."commit_id"
WHERE
    "ci_job_artifacts"."expire_at" < '2021-08-10 21:24:00.921962'
    AND "ci_pipelines"."locked" = 0
LIMIT 100;

Time: 683.472 ms
  - planning: 26.045 ms
  - execution: 657.427 ms
    - I/O read: 651.964 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 568 (~4.40 MiB) from the buffer pool
  - reads: 506 (~4.00 MiB) from the OS file cache, including disk I/O
  - dirtied: 2 (~16.00 KiB)
  - writes: 0

 Limit  (cost=1.15..894.21 rows=100 width=146) (actual time=16.362..657.281 rows=100 loops=1)
   Buffers: shared hit=568 read=506 dirtied=2
   I/O Timings: read=651.964 write=0.000
   ->  Nested Loop  (cost=1.15..2309063341.41 rows=258556439 width=146) (actual time=16.360..657.206 rows=100 loops=1)
         Buffers: shared hit=568 read=506 dirtied=2
         I/O Timings: read=651.964 write=0.000
         ->  Nested Loop  (cost=0.58..1241278628.44 rows=348769047 width=150) (actual time=10.285..327.612 rows=107 loops=1)
               Buffers: shared hit=273 read=266
               I/O Timings: read=324.800 write=0.000
               ->  Seq Scan on public.ci_job_artifacts  (cost=0.00..120111154.40 rows=363055262 width=146) (actual time=4.191..6.150 rows=107 loops=1)
                     Filter: (ci_job_artifacts.expire_at < '2021-08-10 21:24:00.921962+00'::timestamp with time zone)
                     Rows Removed by Filter: 62
                     Buffers: shared read=4
                     I/O Timings: read=5.905 write=0.000
               ->  Index Scan using ci_builds_pkey on public.ci_builds  (cost=0.58..3.09 rows=1 width=8) (actual time=3.001..3.001 rows=1 loops=107)
                     Index Cond: (ci_builds.id = ci_job_artifacts.job_id)
                     Filter: ((ci_builds.type)::text = 'Ci::Build'::text)
                     Rows Removed by Filter: 0
                     Buffers: shared hit=273 read=262
                     I/O Timings: read=318.896 write=0.000
         ->  Index Scan using ci_pipelines_pkey on public.ci_pipelines  (cost=0.57..3.06 rows=1 width=4) (actual time=3.077..3.077 rows=1 loops=107)
               Index Cond: (ci_pipelines.id = ci_builds.commit_id)
               Filter: (ci_pipelines.locked = 0)
               Rows Removed by Filter: 0
               Buffers: shared hit=295 read=240 dirtied=2
               I/O Timings: read=327.164 write=0.000

Testing on staging, the picture is even better where queries look like this for each batch of 100:

Draft: Resolve "Clean up old expired job artifacts"

What does this MR do?

Database changes

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Draft: Resolve "Clean up old expired job artifacts"

What does this MR do?

Database changes

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Merge request reports