Use more-efficient indexing for the MergeRequestDiff storage migration
What does this MR do?
Alters the query we use to select merge request diffs to migrate to rely on the newly-added files_count
column.
This is populated for old records via a background migration merged recently: !38936 (merged) . However, if the files_count
column is nil for any case, it doesn't really matter - the scheduler is stateless and runs forever, so records will be added to the list to migrate as the files_count
column gets filled.
when: always
)
Selecting records with the new query (Here's the new query we use when plucking a list of 1000 records to migrate:
SELECT "merge_request_diffs"."id"
FROM "merge_request_diffs"
WHERE (NOT "merge_request_diffs"."stored_externally" OR "merge_request_diffs"."stored_externally" IS NULL)
AND (files_count > 0)
With seq scan disabled (I only have ~100 rows), this gives me the following plan locally:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Index Only Scan using index_merge_request_diffs_by_id_partial on merge_request_diffs (cost=0.14..10.14 rows=133 width=4)
(1 row)
Previously, we'd have a WHERE (SELECT ... from merge_request_diff_files WHERE merge_request_diffs.id = merge_request_diff_files.id)
, which started timing out once the migration process got ~50% through the 72 million rows we have in the merge_reuqest_diffs
table.
when: outdated
)
Selecting records with the new query (We modify these queries too, but we don't use them on GitLab.com and there's no expectation that they scale to .com levels at the moment. Still, here they are:
Old merged diffs
SELECT "merge_request_diffs"."id"
FROM "merge_request_diffs"
INNER JOIN "merge_request_metrics"
ON "merge_request_diffs"."merge_request_id" = "merge_request_metrics"."merge_request_id"
AND "merge_request_metrics"."merged_at" IS NOT NULL
INNER JOIN "merge_requests"
ON "merge_request_diffs"."merge_request_id" = "merge_requests"."id"
WHERE (
"merge_request_diffs"."stored_externally" = FALSE OR
"merge_request_diffs"."stored_externally" IS NULL
)
AND "merge_request_diffs"."files_count" > 0
AND "merge_requests"."state_id" = 3
AND "merge_request_metrics"."merged_at" <= '2020-08-10 13:41:55.499024'
AND "merge_request_metrics"."merged_at" IS NOT NULL
LIMIT 1000
Limit (cost=0.42..15.05 rows=1 width=4)
-> Nested Loop (cost=0.42..15.05 rows=1 width=4)
-> Nested Loop (cost=0.27..14.07 rows=1 width=12)
-> Index Only Scan using index_merge_request_metrics_on_merge_request_id_and_merged_at on merge_request_metrics (cost=0.12..5.89 rows=1 width=4)
Index Cond: (merged_at <= '2020-08-10 13:44:14.198804'::timestamp without time zone)
-> Index Scan using index_merge_request_diffs_on_merge_request_id_and_id on merge_request_diffs (cost=0.14..8.16 rows=1 width=8)
Index Cond: (merge_request_id = merge_request_metrics.merge_request_id)
Filter: (((NOT stored_externally) OR (stored_externally IS NULL)) AND (files_count > 0))
-> Index Scan using merge_requests_pkey on merge_requests (cost=0.14..0.56 rows=1 width=4)
Index Cond: (id = merge_request_diffs.merge_request_id)
Filter: (state_id = 3)
(11 rows)
Old closed diffs
SELECT "merge_request_diffs"."id"
FROM "merge_request_diffs"
INNER JOIN "merge_requests"
ON "merge_requests"."id" = "merge_request_diffs"."merge_request_id"
INNER JOIN "merge_request_metrics"
ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
WHERE (
"merge_request_diffs"."stored_externally" = 'f'
OR "merge_request_diffs"."stored_externally" IS NULL
)
AND "merge_request_diffs"."files_count" > 0
AND "merge_requests"."state_id" = 2
AND "merge_request_metrics"."latest_closed_at" <= '2020-08-10 13:47:16.224416'
LIMIT 1000
Limit (cost=0.42..17.30 rows=1 width=4)
-> Nested Loop (cost=0.42..17.30 rows=1 width=4)
-> Nested Loop (cost=0.27..16.32 rows=1 width=12)
-> Index Scan using index_merge_request_metrics_on_latest_closed_at on merge_request_metrics (cost=0.12..8.14 rows=1 width=4)
Index Cond: (latest_closed_at <= '2020-08-10 13:47:16.224416+00'::timestamp with time zone)
-> Index Scan using index_merge_request_diffs_on_merge_request_id_and_id on merge_request_diffs (cost=0.14..8.16 rows=1 width=8)
Index Cond: (merge_request_id = merge_request_metrics.merge_request_id)
Filter: (((NOT stored_externally) OR (stored_externally IS NULL)) AND (files_count > 0))
-> Index Scan using merge_requests_pkey on merge_requests (cost=0.14..0.56 rows=1 width=4)
Index Cond: (id = merge_request_diffs.merge_request_id)
Filter: (state_id = 2)
Not latest diffs
SELECT "merge_request_diffs"."id"
FROM "merge_request_diffs"
INNER JOIN "merge_requests"
ON "merge_requests"."id" = "merge_request_diffs"."merge_request_id"
AND "merge_request_diffs"."id" != "merge_requests"."latest_merge_request_diff_id"
WHERE (
"merge_request_diffs"."stored_externally" = 'f'
OR "merge_request_diffs"."stored_externally" IS NULL
)
AND "merge_request_diffs"."files_count" > 0
LIMIT 1000
Limit (cost=47.30..57.35 rows=200 width=4)
-> Hash Join (cost=47.30..57.35 rows=200 width=4)
Hash Cond: (merge_request_diffs.merge_request_id = merge_requests.id)
Join Filter: (merge_request_diffs.id <> merge_requests.latest_merge_request_diff_id)
-> Bitmap Heap Scan on merge_request_diffs (cost=9.20..18.71 rows=201 width=8)
Recheck Cond: ((files_count > 0) AND ((NOT stored_externally) OR (stored_externally IS NULL)))
-> Bitmap Index Scan on index_merge_request_diffs_by_id_partial (cost=0.00..9.15 rows=201 width=0)
-> Hash (cost=35.45..35.45 rows=212 width=8)
-> Index Scan using merge_requests_pkey on merge_requests (cost=0.14..35.45 rows=212 width=8)
rails db:migrate
lupine@t470p:~/dev/gdk/gitlab$ bin/rails db:migrate
== 20200813143304 AddNewExternalDiffMigrationIndex: migrating =================
-- transaction_open?()
-> 0.0000s
-- index_exists?(:merge_request_diffs, :id, {:name=>"index_merge_request_diffs_by_id_partial", :where=>"files_count > 0 AND ((NOT stored_externally) OR (stored_externally IS NULL))", :algorithm=>:concurrently})
-> 0.0023s
-- add_index(:merge_request_diffs, :id, {:name=>"index_merge_request_diffs_by_id_partial", :where=>"files_count > 0 AND ((NOT stored_externally) OR (stored_externally IS NULL))", :algorithm=>:concurrently})
-> 0.0080s
== 20200813143304 AddNewExternalDiffMigrationIndex: migrated (0.0109s) ========
== 20200813143356 RemoveOldExternalDiffMigrationIndex: migrating ==============
-- transaction_open?()
-> 0.0000s
-- indexes(:merge_request_diffs)
-> 0.0021s
== 20200813143356 RemoveOldExternalDiffMigrationIndex: migrated (0.0027s) =====
Adding the new index seems fast: https://gitlab.slack.com/archives/CLJMDRD8C/p1597671440196300
The query has been executed. Duration: 7.359 min
This might be because most of the rows still have files_count: nil
, though
rails db:migrate:down
lupine@t470p:~/dev/gdk/gitlab$ VERSION=20200813143356 bin/rails db:migrate:down
== 20200813143356 RemoveOldExternalDiffMigrationIndex: reverting ==============
-- transaction_open?()
-> 0.0000s
-- index_exists?(:merge_request_diffs, [:merge_request_id, :id], {:where=>{:stored_externally=>[nil, false]}, :algorithm=>:concurrently})
-> 0.0045s
== 20200813143356 RemoveOldExternalDiffMigrationIndex: reverted (0.0048s) =====
lupine@t470p:~/dev/gdk/gitlab$ VERSION=20200813143304 bin/rails db:migrate:down
== 20200813143304 AddNewExternalDiffMigrationIndex: reverting =================
-- transaction_open?()
-> 0.0000s
-- indexes(:merge_request_diffs)
-> 0.0033s
-- remove_index(:merge_request_diffs, {:algorithm=>:concurrently, :name=>"index_merge_request_diffs_by_id_partial"})
-> 0.0038s
== 20200813143304 AddNewExternalDiffMigrationIndex: reverted (0.0075s) ========
Does this MR meet the acceptance criteria?
Conformity
-
Changelog entry -
Documentation (if required) -
Code review guidelines -
Merge request performance guidelines -
Style guides -
Database guides -
Separation of EE specific content
Availability and Testing
-
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process. -
Tested in all supported browsers -
Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
-
Label as security and @ mention @gitlab-com/gl-security/appsec
-
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods -
Security reports checked/validated by a reviewer from the AppSec team
Closes #227570 (closed)