Skip to content

Add post migration to backfill projects updated at after repository move

What does this MR do?

We have several tools that extract some information from the database for statistical purposes. These tools, use some attributes to determine which records to extract.

We recently discovered that there is a bug when we move the repository storage of a project, because the updated_at date wasn't updated when the project's repository storage was moved. This column is used by some of the former tools.

We merged a fix to address this, but we need to align the existing updated_at of projects with the right information.

To do that we implement this background migration, where we will retrieve all newer repository storage moves for a project, get those whose updated_at is older than the one in the repository storage move, and updated it.

Refs https://gitlab.com/gitlab-data/analytics/-/issues/7868

Query plans

In the scheduling migration we execute the following query:

SELECT DISTINCT "project_repository_storage_moves"."project_id"
FROM "project_repository_storage_moves"
WHERE "project_repository_storage_moves"."project_id" >= 1
ORDER BY "project_repository_storage_moves"."project_id" ASC
LIMIT 1
OFFSET 10000

The query plan for that query is:

 Limit  (cost=286.69..286.71 rows=1 width=8) (actual time=42.572..42.572 rows=1 loops=1)
   ->  Unique  (cost=0.43..44437.15 rows=1552331 width=8) (actual time=0.040..42.155 rows=10001 loops=1)
         ->  Index Only Scan using index_project_repository_storage_moves_on_project_id on project_repository_storage_moves  (cost=0.43..40548.62 rows=1555413 width=8) (actual time=0.039..40.301 rows=10010 loops=1)
               Index Cond: (project_id >= 1)
               Heap Fetches: 12
 Planning Time: 4.746 ms
 Execution Time: 42.719 ms

In prod, there are 1555156 distinct projects, so given that the batch is 10_000 we will execute 155 loops. Therefore, the time of this migration will be 6.150ms = 6.5s.

Inside the background migtration, the following query is executed:

WITH repository_storage_cte as
  ( SELECT project_id,
           MAX(updated_at) as updated_at
   FROM "project_repository_storage_moves"
   WHERE "project_repository_storage_moves"."project_id" BETWEEN 1 AND 10000
   GROUP BY "project_repository_storage_moves"."project_id" )
UPDATE projects
SET updated_at = (repository_storage_cte.updated_at + interval '1 second')
FROM repository_storage_cte
WHERE projects.id = repository_storage_cte.project_id
  AND projects.updated_at <= repository_storage_cte.updated_at

With cold cached the query plan is:

Update on projects  (cost=146.35..749.39 rows=58 width=822) (actual time=3388.508..3388.512 rows=0 loops=1)
CTE repository_storage_cte
  ->  GroupAggregate  (cost=0.43..145.91 rows=173 width=16) (actual time=1.770..76.656 rows=136 loops=1)
        Group Key: project_repository_storage_moves.project_id
        ->  Index Scan using index_project_repository_storage_moves_on_project_id on project_repository_storage_moves  (cost=0.43..143.32 rows=173 width=16) (actual time=0.798..75.474 rows=136 loops=1)
              Index Cond: ((project_id >= 1) AND (project_id <= 10000))
->  Nested Loop  (cost=0.44..603.48 rows=58 width=822) (actual time=6.850..358.721 rows=136 loops=1)
      ->  CTE Scan on repository_storage_cte  (cost=0.00..3.46 rows=173 width=56) (actual time=2.213..77.885 rows=136 loops=1)
      ->  Index Scan using projects_pkey on projects  (cost=0.44..3.46 rows=1 width=738) (actual time=2.056..2.056 rows=1 loops=136)
            Index Cond: (id = repository_storage_cte.project_id)
            Filter: (updated_at <= repository_storage_cte.updated_at)
Planning Time: 276.241 ms
Execution Time: 3389.899 ms

With warm caches, the query plan is:

Update on projects  (cost=146.35..749.39 rows=58 width=822) (actual time=186.163..186.166 rows=0 loops=1)
  CTE repository_storage_cte
    ->  GroupAggregate  (cost=0.43..145.91 rows=173 width=16) (actual time=0.059..54.651 rows=136 loops=1)
          Group Key: project_repository_storage_moves.project_id
          ->  Index Scan using index_project_repository_storage_moves_on_project_id on project_repository_storage_moves  (cost=0.43..143.32 rows=173 width=16) (actual time=0.035..54.102 rows=136 loops=1)
                Index Cond: ((project_id >= 1) AND (project_id <= 10000))
  ->  Nested Loop  (cost=0.44..603.48 rows=58 width=822) (actual time=0.110..58.329 rows=136 loops=1)
        ->  CTE Scan on repository_storage_cte  (cost=0.00..3.46 rows=173 width=56) (actual time=0.068..55.033 rows=136 loops=1)
        ->  Index Scan using projects_pkey on projects  (cost=0.44..3.46 rows=1 width=738) (actual time=0.020..0.020 rows=1 loops=136)
              Index Cond: (id = repository_storage_cte.project_id)
              Filter: (updated_at <= repository_storage_cte.updated_at)
Planning Time: 1.051 ms
Execution Time: 186.529 ms

With warm caches, the execution time is not relevant so basically, the background migration will take 155 * 120 = 310 mins = 5 hours.

Does this MR meet the acceptance criteria?

Conformity

Edited by Francisco Javier López

Merge request reports

Loading