Add post migration to backfill projects updated at after repository move
What does this MR do?
We have several tools that extract information from the database for statistical purposes. These tools use certain attributes to determine which records to extract.
We recently discovered a bug: when we move the repository storage of a project, the project's updated_at
date was not being updated. This column is used by some of the aforementioned tools.
We merged a fix to address this, but we still need to align the existing updated_at
values of projects with the right information.
To do that, this MR adds a background migration that retrieves the most recent repository storage move for each project, finds the projects whose updated_at
is older than that of the repository storage move, and updates it.
Refs https://gitlab.com/gitlab-data/analytics/-/issues/7868
Query plans
In the scheduling migration we execute the following query:
SELECT DISTINCT "project_repository_storage_moves"."project_id"
FROM "project_repository_storage_moves"
WHERE "project_repository_storage_moves"."project_id" >= 1
ORDER BY "project_repository_storage_moves"."project_id" ASC
LIMIT 1
OFFSET 10000
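The LIMIT 1 OFFSET 10000 trick finds the next batch boundary without loading all ids at once. A rough in-memory equivalent of the batching (a hypothetical helper, not the actual scheduler code):

```ruby
# Splits the sorted distinct project_ids into inclusive [start, stop]
# ranges of at most batch_size ids -- the same ranges the scheduler
# derives by repeatedly asking for the id 10_000 rows ahead.
def batch_ranges(project_ids, batch_size)
  ids = project_ids.uniq.sort
  ids.each_slice(batch_size).map { |slice| [slice.first, slice.last] }
end
```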
The query plan for that query is:
Limit (cost=286.69..286.71 rows=1 width=8) (actual time=42.572..42.572 rows=1 loops=1)
-> Unique (cost=0.43..44437.15 rows=1552331 width=8) (actual time=0.040..42.155 rows=10001 loops=1)
-> Index Only Scan using index_project_repository_storage_moves_on_project_id on project_repository_storage_moves (cost=0.43..40548.62 rows=1555413 width=8) (actual time=0.039..40.301 rows=10010 loops=1)
Index Cond: (project_id >= 1)
Heap Fetches: 12
Planning Time: 4.746 ms
Execution Time: 42.719 ms
In prod, there are 1555156 distinct projects, so given a batch size of 10_000
we will execute 155 loops. Therefore, the scheduling queries will take roughly 155 * 42ms = 6,510ms ≈ 6.5s.
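A quick check of that arithmetic (~42ms per scheduling query, per the plan above):

```ruby
distinct_projects = 1_555_156
batch_size = 10_000

batches = distinct_projects / batch_size  # integer division -> 155 full batches
total_seconds = batches * 42 / 1000.0     # ~42ms per scheduling query
puts "#{batches} batches, ~#{total_seconds.round(1)}s of scheduling queries"
```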
Inside the background migration, the following query is executed:
WITH repository_storage_cte as
( SELECT project_id,
MAX(updated_at) as updated_at
FROM "project_repository_storage_moves"
WHERE "project_repository_storage_moves"."project_id" BETWEEN 1 AND 10000
GROUP BY "project_repository_storage_moves"."project_id" )
UPDATE projects
SET updated_at = (repository_storage_cte.updated_at + interval '1 second')
FROM repository_storage_cte
WHERE projects.id = repository_storage_cte.project_id
AND projects.updated_at <= repository_storage_cte.updated_at
With cold caches, the query plan is:
Update on projects (cost=146.35..749.39 rows=58 width=822) (actual time=3388.508..3388.512 rows=0 loops=1)
CTE repository_storage_cte
-> GroupAggregate (cost=0.43..145.91 rows=173 width=16) (actual time=1.770..76.656 rows=136 loops=1)
Group Key: project_repository_storage_moves.project_id
-> Index Scan using index_project_repository_storage_moves_on_project_id on project_repository_storage_moves (cost=0.43..143.32 rows=173 width=16) (actual time=0.798..75.474 rows=136 loops=1)
Index Cond: ((project_id >= 1) AND (project_id <= 10000))
-> Nested Loop (cost=0.44..603.48 rows=58 width=822) (actual time=6.850..358.721 rows=136 loops=1)
-> CTE Scan on repository_storage_cte (cost=0.00..3.46 rows=173 width=56) (actual time=2.213..77.885 rows=136 loops=1)
-> Index Scan using projects_pkey on projects (cost=0.44..3.46 rows=1 width=738) (actual time=2.056..2.056 rows=1 loops=136)
Index Cond: (id = repository_storage_cte.project_id)
Filter: (updated_at <= repository_storage_cte.updated_at)
Planning Time: 276.241 ms
Execution Time: 3389.899 ms
With warm caches, the query plan is:
Update on projects (cost=146.35..749.39 rows=58 width=822) (actual time=186.163..186.166 rows=0 loops=1)
CTE repository_storage_cte
-> GroupAggregate (cost=0.43..145.91 rows=173 width=16) (actual time=0.059..54.651 rows=136 loops=1)
Group Key: project_repository_storage_moves.project_id
-> Index Scan using index_project_repository_storage_moves_on_project_id on project_repository_storage_moves (cost=0.43..143.32 rows=173 width=16) (actual time=0.035..54.102 rows=136 loops=1)
Index Cond: ((project_id >= 1) AND (project_id <= 10000))
-> Nested Loop (cost=0.44..603.48 rows=58 width=822) (actual time=0.110..58.329 rows=136 loops=1)
-> CTE Scan on repository_storage_cte (cost=0.00..3.46 rows=173 width=56) (actual time=0.068..55.033 rows=136 loops=1)
-> Index Scan using projects_pkey on projects (cost=0.44..3.46 rows=1 width=738) (actual time=0.020..0.020 rows=1 loops=136)
Index Cond: (id = repository_storage_cte.project_id)
Filter: (updated_at <= repository_storage_cte.updated_at)
Planning Time: 1.051 ms
Execution Time: 186.529 ms
With warm caches, the execution time is negligible, so assuming the standard 2-minute delay between scheduled batches, the background migration will take about 155 * 120s = 310 mins ≈ 5 hours.
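And the total-runtime estimate, assuming a 2-minute gap between scheduled batches:

```ruby
batches = 155
delay_seconds = 120  # assumed 2-minute spacing between batches

total_minutes = batches * delay_seconds / 60
total_hours = total_minutes / 60.0
puts "#{total_minutes} minutes (~#{total_hours.round(1)} hours)"
```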