Migrate mentions for commit notes to DB table
What does this MR do?
-
Cleans commit_user_mentions records that have no actual mentions. Based on #database-lab there are currently 0 records, so that should be fast.
explain DELETE FROM "commit_user_mentions" WHERE "commit_user_mentions"."mentioned_users_ids" IS NULL AND "commit_user_mentions"."mentioned_groups_ids" IS NULL AND "commit_user_mentions"."mentioned_projects_ids" IS NULL AND "commit_user_mentions"."id" >= 1
Click to see the plan.
ModifyTable on public.commit_user_mentions (cost=0.00..396.38 rows=85 width=6) (actual time=58.317..58.317 rows=0 loops=1) Buffers: shared hit=1 read=80 dirtied=14 I/O Timings: read=18.096 -> Seq Scan on public.commit_user_mentions (cost=0.00..396.38 rows=85 width=6) (actual time=58.288..58.288 rows=0 loops=1) Filter: ((commit_user_mentions.mentioned_users_ids IS NULL) AND (commit_user_mentions.mentioned_groups_ids IS NULL) AND (commit_user_mentions.mentioned_projects_ids IS NULL) AND (commit_user_mentions.id >= 1)) Rows Removed by Filter: 5830 Buffers: shared hit=1 read=80 dirtied=14 I/O Timings: read=18.096 Time: 58.535 ms - planning: 0.170 ms - execution: 58.365 ms - I/O read: 18.096 ms - I/O write: 0.000 ms Shared buffers: - hits: 1 (~8.00 KiB) from the buffer pool - reads: 80 (~640.00 KiB) from the OS file cache, including disk I/O - dirtied: 14 (~112.00 KiB) - writes: 0
-
This MR contains migrations for commit notes mentions.
These migrations are to be run after !19088 (merged) is merged and confirmed to be working fine on production
UPDATE: migration is on production and seemingly working.
This MR is based on initial migration that contains common code under `CreateResourceUserMention` to be used on all mentions migrations of other models, including `Commit`.
I've pulled some estimate counts of the records we might be looking at from slack `database-lab` channel.
Database benchmarks
Number of rows affected
-
Commit notes with mentions ~177788
explain select id from notes where noteable_type = ‘Commit’ and note like ‘%@%’ Index Only Scan using commit_mentions_temp_index on public.notes (cost=0.42..31898.56 rows=986504 width=4) (actual time=0.203..5144.721 rows=177788 loops=1) Heap Fetches: 4133 Buffers: shared hit=73401 read=4987 dirtied=216 I/O Timings: read=5011.186
Migrate commit notes mentions
Query plans for sql queries to migrate mentions for commit notes based on database-lab
data.
-
Temporary index is now created as a single index for all noteable_types, rather than one per each noteable_type.
CREATE INDEX CONCURRENTLY mentions_in_notes_temp_index ON notes (id, noteable_type) WHERE note ~~ '%@%'::text
\di+ mentions_in_notes_temp_index
List of relations Schema | Name | Type | Owner | Table | Size | Description --------+------------------------------+-------+--------+-------+--------+------------- public | mentions_in_notes_temp_index | index | gitlab | notes | 813 MB | (1 row)
-
explain SELECT “notes”.“id” FROM “notes” LEFT JOIN commit_user_mentions ON notes.id = commit_user_mentions.note_id WHERE (note LIKE ‘%@%’ AND commit_user_mentions.commit_id IS NULL AND notes.noteable_type = ‘Commit’) ORDER BY notes.id LIMIT 1
Click to see the plan WITH index. Time: 3.718 ms https://explain.depesz.com/s/iCWw
Limit (cost=0.70..24943.43 rows=1 width=4) (actual time=0.374..0.375 rows=1 loops=1) Buffers: shared hit=3 read=4 I/O Timings: read=0.297 -> Merge Join (cost=0.70..24943.43 rows=1 width=4) (actual time=0.373..0.373 rows=1 loops=1) Filter: (commit_user_mentions.commit_id IS NULL) Rows Removed by Filter: 0 Buffers: shared hit=3 read=4 I/O Timings: read=0.297 -> Index Only Scan using commit_mentions_temp_index on public.notes (cost=0.42..21960.30 rows=959901 width=4) (actual time=0.361..0.361 rows=1 loops=1) Heap Fetches: 0 Buffers: shared read=4 I/O Timings: read=0.297 -> Index Scan using index_commit_user_mentions_on_note_id on public.commit_user_mentions (cost=0.28..600.06 rows=5335 width=45) (actual time=0.008..0.008 rows=1 loops=1) Buffers: shared hit=3 Time: 3.718 ms planning: 3.319 ms execution: 0.399 ms I/O read: 0.297 ms I/O write: 0.000 ms Shared buffers: hits: 3 (~24.00 KiB) from the buffer pool reads: 4 (~32.00 KiB) from the OS file cache, including disk I/O dirtied: 0 writes: 0
-
explain SELECT “notes”.“id” FROM “notes” LEFT JOIN commit_user_mentions ON notes.id = commit_user_mentions.note_id WHERE (note LIKE ‘%@%’ AND commit_user_mentions.commit_id IS NULL AND notes.noteable_type = ‘Commit’) AND notes.id >= 5450000 ORDER BY notes.id ASC LIMIT 1 OFFSET 10000
Click to see the plan. Time: 42.215 ms. https://explain.depesz.com/s/pnLr
Limit (cost=34861.24..69721.78 rows=1 width=4) (actual time=38.042..38.045 rows=1 loops=1) Buffers: shared hit=4604 read=639 I/O Timings: read=28.204 -> Merge Join (cost=0.70..34861.24 rows=1 width=4) (actual time=0.264..37.223 rows=10001 loops=1) Filter: (commit_user_mentions.commit_id IS NULL) Rows Removed by Filter: 0 Buffers: shared hit=4604 read=639 I/O Timings: read=28.204 -> Index Only Scan using commit_mentions_temp_index on public.notes (cost=0.42..31864.66 rows=970639 width=4) (actual time=0.252..34.289 rows=10001 loops=1) Index Cond: (notes.id >= 5450000) Heap Fetches: 114 Buffers: shared hit=4601 read=639 I/O Timings: read=28.204 -> Index Scan using index_commit_user_mentions_on_note_id on public.commit_user_mentions (cost=0.28..560.34 rows=3796 width=45) (actual time=0.008..0.008 rows=1 loops=1) Buffers: shared hit=3 Time: 42.215 ms planning: 4.118 ms execution: 38.097 ms I/O read: 28.204 ms I/O write: 0.000 ms Shared buffers: hits: 4604 (~36.00 MiB) from the buffer pool reads: 639 (~5.00 MiB) from the OS file cache, including disk I/O dirtied: 0 writes: 0
-
explain SELECT min(“notes”.“id”), max(“notes”.“id”) FROM “notes” LEFT JOIN commit_user_mentions ON notes.id = commit_user_mentions.note_id WHERE (note LIKE ‘%@%’ AND commit_user_mentions.commit_id IS NULL AND notes.noteable_type = ‘Commit’) AND notes.id >= 5450000 AND notes.id < 15450000
Click to see the plan. Time: 168.444 ms. https://explain.depesz.com/s/YHW7
Aggregate (cost=8339.26..8339.27 rows=1 width=8) (actual time=165.263..165.263 rows=1 loops=1) Buffers: shared hit=2768 read=1714 dirtied=1 I/O Timings: read=110.949 -> Hash Left Join (cost=493.46..8339.26 rows=1 width=4) (actual time=3.737..163.955 rows=4269 loops=1) Hash Cond: (notes.id = commit_user_mentions.note_id) Filter: (commit_user_mentions.commit_id IS NULL) Rows Removed by Filter: 0 Buffers: shared hit=2768 read=1714 dirtied=1 I/O Timings: read=110.949 -> Index Only Scan using commit_mentions_temp_index on public.notes (cost=0.56..7790.70 rows=14839 width=4) (actual time=1.193..158.082 rows=4269 loops=1) Index Cond: ((notes.id >= 26730000) AND (notes.id < 31730000) AND (notes.noteable_type = 'Commit'::text)) Heap Fetches: 3 Buffers: shared hit=2680 read=1714 dirtied=1 I/O Timings: read=110.949 -> Hash (cost=414.62..414.62 rows=6262 width=45) (actual time=2.531..2.531 rows=6299 loops=1) Buckets: 8192 Batches: 1 Memory Usage: 538kB Buffers: shared hit=88 -> Seq Scan on public.commit_user_mentions (cost=0.00..414.62 rows=6262 width=45) (actual time=0.010..1.299 rows=6299 loops=1) Buffers: shared hit=88 Time: 168.444 ms - planning: 3.128 ms - execution: 165.316 ms - I/O read: 110.949 ms - I/O write: 0.000 ms Shared buffers: - hits: 2768 (~21.60 MiB) from the buffer pool - reads: 1714 (~13.40 MiB) from the OS file cache, including disk I/O - dirtied: 1 (~8.00 KiB) - writes: 0
-
explain SELECT notes.* FROM "notes" LEFT JOIN commit_user_mentions ON notes.id = commit_user_mentions.note_id WHERE (note LIKE '%@%' AND commit_user_mentions.commit_id IS NULL AND notes.noteable_type = 'Commit') AND notes.id >= 5450000 AND notes.id < 21730000
Click to see the plan. **Time: 12.889 s**. https://explain.depesz.com/s/0QJ
Hash Join (cost=2172.54..2610.66 rows=1 width=2467) (actual time=260.147..260.261 rows=157 loops=1) Hash Cond: (commit_user_mentions.note_id = notes.id) Filter: (commit_user_mentions.commit_id IS NULL) Rows Removed by Filter: 0 Buffers: shared hit=5 read=290 dirtied=10 I/O Timings: read=252.525 -> Seq Scan on public.commit_user_mentions (cost=0.00..414.62 rows=6262 width=45) (actual time=0.015..23.113 rows=6299 loops=1) Buffers: shared hit=1 read=87 dirtied=10 I/O Timings: read=20.928 -> Hash (cost=2157.74..2157.74 rows=1184 width=2467) (actual time=236.084..236.084 rows=157 loops=1) Buckets: 2048 Batches: 1 Memory Usage: 146kB Buffers: shared hit=1 read=203 I/O Timings: read=231.597 -> Index Scan using commit_mentions_temp_index on public.notes (cost=0.56..2157.74 rows=1184 width=2467) (actual time=3.757..235.239 rows=157 loops=1) Index Cond: ((notes.id >= 5450000) AND (notes.id < 11730000) AND ((notes.noteable_type)::text = 'Commit'::text)) Buffers: shared hit=1 read=203 I/O Timings: read=231.597 Time: 12.889 s - planning: 3.111 ms - execution: 12.886 s - I/O read: 12.653 s - I/O write: 0.000 ms Shared buffers: - hits: 495 (~3.90 MiB) from the buffer pool - reads: 12663 (~98.90 MiB) from the OS file cache, including disk I/O - dirtied: 53 (~424.00 KiB) - writes: 0
explain SELECT notes.* FROM "notes"
LEFT JOIN commit_user_mentions ON notes.id = commit_user_mentions.note_id
WHERE (note LIKE '%@%' AND commit_user_mentions.commit_id IS NULL
AND notes.noteable_type = 'Commit') AND notes.id >= 11800000 AND notes.id < 13000000
Click to see the plan for 1355 rows cold cache. **Time: 1.584 s**. https://explain.depesz.com/s/r9x8
Hash Left Join (cost=601.12..7672.16 rows=1 width=2471) (actual time=6.620..1579.179 rows=1355 loops=1)
Hash Cond: (notes.id = commit_user_mentions.note_id)
Filter: (commit_user_mentions.commit_id IS NULL)
Rows Removed by Filter: 0
Buffers: shared hit=565 read=1247 dirtied=9
I/O Timings: read=1542.118
-> Index Scan using mentions_in_notes_temp_index on public.notes (cost=0.56..7057.16 rows=3849 width=2471) (actual time=2.391..1571.064 rows=1355 loops=1)
Index Cond: ((notes.id >= 11800000) AND (notes.id < 13000000) AND ((notes.noteable_type)::text = 'Commit'::text))
Buffers: shared hit=459 read=1247 dirtied=9
I/O Timings: read=1542.118
-> Hash (cost=502.47..502.47 rows=7847 width=45) (actual time=4.206..4.207 rows=7903 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 659kB
Buffers: shared hit=106
-> Seq Scan on public.commit_user_mentions (cost=0.00..502.47 rows=7847 width=45) (actual time=0.013..1.935 rows=7903 loops=1)
Buffers: shared hit=106
Time: 1.584 s
planning: 3.810 ms
execution: 1.580 s
I/O read: 1.542 s
I/O write: 0.000 ms
Shared buffers:
hits: 565 (~4.40 MiB) from the buffer pool
reads: 1247 (~9.70 MiB) from the OS file cache, including disk I/O
dirtied: 9 (~72.00 KiB)
writes: 0
Migration Runtime estimates
[OLD] Commit notes mentions migration: 10K batches + 3 - 10 mins jobs + index
- Concurrent Index creation: ~55 <-> 120 mins? On database-lab channel the index creation took 55 mins, and it was not run concurrently which on production I assume it will be slower because we have to run the index creation concurrently.
- ~177788 records in 10K batches results in ~18 background migration jobs. Scheduling a single job given all the queries seems to run way below 100ms should take ~1s, so ~20-30 secs to schedule all 18 jobs.
This next estimate is a very rough approximation, as I was able to only run jobs locally which showed me a runtime foreach each job between 3-5 mins with 2 mins delay between jobs(for only ~500K user mentions from 1.5M notes), and where running times were between 2 and 4 mins to the end of migration. Taking same estimations for 18 jobs we got:
18 jobs * 3 mins = 54 mins
18 jobs * 5 mins = 90 mins
18 jobs * 10 mins = 180 mins
[Current] Commit notes mentions migration: 1K batches + 3 - 10 mins jobs, no index
- index will be pre-created in !26307 (merged) -> 0 mins
- ~177788 records in 1K batches results in ~180 jobs scheduled to run 3 mins apart. Jobs should still be scheduled in under 1 min.
- running jobs estimates
-
180 jobs * 3 mins = 540 mins
-> this is probably most plausible now that we lowered the batch size and increased the scheduling delay -
180 jobs * 5 mins = 900 mins
-> less plausible, maybe possible for some jobs to the end of migration ?🤔 -
180 jobs * 10 mins = 1800 mins
-> unlikely
-
Does this MR meet the acceptance criteria?
Conformity
-
Changelog entry -
Documentation (if required) -
Code review guidelines -
Merge request performance guidelines -
Style guides -
Database guides -
Separation of EE specific content
Availability and Testing
-
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process. -
Tested in all supported browsers
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
-
Label as security and @ mention @gitlab-com/gl-security/appsec
-
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods -
Security reports checked/validated by a reviewer from the AppSec team
re #198326