Geo: Resync direct upload object stored artifacts (!128939) · Merge requests · GitLab.org / GitLab

Michael Kozono requested to merge mk/geo-resync-direct-uploaded-artifacts into master Aug 09, 2023

What does this MR do and why?

This MR causes secondary Geo sites to resync job artifacts which were potentially affected by #419742 (closed).

I aim to backport this MR to 16.3, 16.2, and ideally 16.1, to ensure that customers can upgrade to any latest patch release to automatically avoid data loss.

Note that Geo cannot use Batched Background Migrations #353644

Migration output

Up

If object storage or direct upload is not configured, then it does nothing:

WARNING: Could not write to the database ci: cannot execute UPSERT in a read-only transaction
WARNING: Could not write to the database main: cannot execute UPSERT in a read-only transaction
geo: == [advisory_lock_connection] object_id: 225600, pg_backend_pid: 70141
geo: == 20230808031211 ResyncDirectUploadJobArtifactRegistry: migrating ============
geo: -- Skipping because job artifacts are not stored in object storage with direct upload
geo: == 20230808031211 ResyncDirectUploadJobArtifactRegistry: migrated (0.0002s) ===

geo: == [advisory_lock_connection] object_id: 225600, pg_backend_pid: 70141

If the secondary Geo site is not configured to replicate object storage, then it does nothing:

WARNING: Could not write to the database ci: cannot execute UPSERT in a read-only transaction
WARNING: Could not write to the database main: cannot execute UPSERT in a read-only transaction
geo: == [advisory_lock_connection] object_id: 225180, pg_backend_pid: 30902
geo: == 20230808031211 ResyncDirectUploadJobArtifactRegistry: migrating ============
geo: -- Skipping because this Geo site does not replicate object storage
geo: == 20230808031211 ResyncDirectUploadJobArtifactRegistry: migrated (0.0014s) ===

geo: == [advisory_lock_connection] object_id: 225180, pg_backend_pid: 30902

If there are artifacts that may be affected, then it marks the registry rows as needing to be synced (via Sidekiq jobs):

geo: == [advisory_lock_connection] object_id: 257000, pg_backend_pid: 48361
geo: == 20230808031211 ResyncDirectUploadJobArtifactRegistry: migrating ============
geo: -- Enqueuing Geo::ResyncDirectUploadJobArtifactRegistryWorker jobs to mark artifacts pending
geo: -- See https://gitlab.com/gitlab-org/gitlab/-/issues/419742
geo: -- Enqueued 1 Geo::ResyncDirectUploadJobArtifactRegistryWorker job
geo: == 20230808031211 ResyncDirectUploadJobArtifactRegistry: migrated (0.1244s) ===

geo: == [advisory_lock_connection] object_id: 257000, pg_backend_pid: 48361

In development or test environments, the loop in the migration also outputs the range of each enqueued job:

geo: -- Enqueuing Geo::ResyncDirectUploadJobArtifactRegistryWorker for range 54..810

Down

geo: == [advisory_lock_connection] object_id: 223280, pg_backend_pid: 49037
geo: == 20230808031211 ResyncDirectUploadJobArtifactRegistry: reverting ============
geo: == 20230808031211 ResyncDirectUploadJobArtifactRegistry: reverted (0.0005s) ===

Query analysis

Executed locally since GitLab.com doesn't use Geo, and staging-ref only has 700 artifacts anyway. The each_batch_range scopes work only with the primary key, so I expect performance of each loop is fast. The real work is in the UPDATE queries, and those are performed async, and they are scoped to a range on the primary key.

Side note: I originally scoped each_batch_range to where state = 2, because I expected job_artifact_registry to have an index on state. But it does not have that index. I opened an issue about that. A scope on state is not crucial for this migration though, so I removed it.

each_batch_range query 1 https://explain.depesz.com/s/gOlK:

➜  gitlab git:(49b2f611eeef) ✗ gdk psql-geo
psql (13.9)
Type "help" for help.

gitlabhq_geo_development=# explain analyze SELECT "job_artifact_registry"."id" FROM "job_artifact_registry" ORDER BY "job_artifact_registry"."id" ASC LIMIT 1;
                                                                          QUERY PLAN                                                                          
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.14..0.26 rows=1 width=4) (actual time=0.026..0.027 rows=1 loops=1)
   ->  Index Only Scan using job_artifact_registry_pkey on job_artifact_registry  (cost=0.14..9.34 rows=80 width=4) (actual time=0.025..0.025 rows=1 loops=1)
         Heap Fetches: 0
 Planning Time: 0.863 ms
 Execution Time: 0.051 ms
(5 rows)

each_batch_range query 2 https://explain.depesz.com/s/xL88:

gitlabhq_geo_development=# explain analyze SELECT "job_artifact_registry"."id" FROM "job_artifact_registry" WHERE "job_artifact_registry"."id" >= 17 ORDER BY "job_artifact_registry"."id" ASC LIMIT 1 OFFSET 10000;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=7.12..7.12 rows=1 width=4) (actual time=0.083..0.083 rows=0 loops=1)
   ->  Sort  (cost=6.96..7.12 rows=65 width=4) (actual time=0.076..0.079 rows=64 loops=1)
         Sort Key: id
         Sort Method: quicksort  Memory: 28kB
         ->  Seq Scan on job_artifact_registry  (cost=0.00..5.00 rows=65 width=4) (actual time=0.023..0.052 rows=64 loops=1)
               Filter: (id >= 17)
               Rows Removed by Filter: 16
 Planning Time: 0.157 ms
 Execution Time: 0.103 ms
(9 rows)

each_batch_range query 3 https://explain.depesz.com/s/CHxt:

gitlabhq_geo_development=# explain analyze SELECT MIN(id), MAX(id) FROM "job_artifact_registry" WHERE "job_artifact_registry"."id" >= 17 LIMIT 1;
                                                                                              QUERY PLAN                                                                                               
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..0.58 rows=1 width=8) (actual time=0.078..0.080 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.14..0.29 rows=1 width=4) (actual time=0.022..0.023 rows=1 loops=1)
           ->  Index Only Scan using job_artifact_registry_pkey on job_artifact_registry  (cost=0.14..9.44 rows=65 width=4) (actual time=0.021..0.022 rows=1 loops=1)
                 Index Cond: ((id IS NOT NULL) AND (id >= 17))
                 Heap Fetches: 0
   InitPlan 2 (returns $1)
     ->  Limit  (cost=0.14..0.29 rows=1 width=4) (actual time=0.050..0.050 rows=1 loops=1)
           ->  Index Only Scan Backward using job_artifact_registry_pkey on job_artifact_registry job_artifact_registry_1  (cost=0.14..9.44 rows=65 width=4) (actual time=0.049..0.049 rows=1 loops=1)
                 Index Cond: ((id IS NOT NULL) AND (id >= 17))
                 Heap Fetches: 0
   ->  Result  (cost=0.57..0.58 rows=1 width=8) (actual time=0.077..0.077 rows=1 loops=1)
 Planning Time: 0.732 ms
 Execution Time: 0.118 ms
(14 rows)

Executed by the background job https://explain.depesz.com/s/ngJF:

gitlabhq_geo_development=# explain analyze UPDATE job_artifact_registry SET state = 0, last_synced_at = NULL WHERE state = 2 AND last_synced_at BETWEEN '2023-06-22T00:00:00' AND '2024-02-03T00:00:00' AND id BETWEEN 1 AND 10000;
                                                                                                     QUERY PLAN                                                                                                      
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on job_artifact_registry  (cost=0.00..5.80 rows=74 width=796) (actual time=0.972..0.973 rows=0 loops=1)
   ->  Seq Scan on job_artifact_registry  (cost=0.00..5.80 rows=74 width=796) (actual time=0.035..0.071 rows=80 loops=1)
         Filter: ((last_synced_at >= '2023-06-22 00:00:00-10'::timestamp with time zone) AND (last_synced_at <= '2024-02-03 00:00:00-10'::timestamp with time zone) AND (id >= 1) AND (id <= 10000) AND (state = 2))
 Planning Time: 1.581 ms
 Execution Time: 1.225 ms
(5 rows)

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Edited Aug 22, 2023 by Michael Kozono

Geo: Resync direct upload object stored artifacts

What does this MR do and why?

Migration output

Up

Down

Query analysis

MR acceptance checklist

Merge request reports