Reduce space needed for merge request diff commits
What does this MR do?
The table merge_request_diff_commits stores the names and Emails of commit authors and committers. This data is stored as-is. This means that if I push 10 commits, GitLab stores my name and Email address 20 times: 10 times for the author details, and 10 times for the committer details.
This commit adds a set of migrations and code changes to resolve this problem. Instead of storing names and Emails for every occurrence, we'll store a unique set of names and Emails in a separate table. The table merge_request_context_commits in turn has two new columns: commit_author_id and committer_id.
When creating rows for merge_request_context_commits, we take the author and committer details and try to find an existing row in the newly added table. If no row exists, one is created.
The resulting setup is such that given a name and Email pair of (X, Y), we only store a single occurrence of that pair. This reduces the amount of space necessary to store all this information. Based on our findings in #331823 (closed), we estimate that over 95% of the data currently stored is duplicate data. For GitLab.com, we estimate this will translate to roughly 500 GB of data we no longer need to store.
The migration process for GitLab.com will likely take around two weeks, based on our estimates discussed in #331823 (comment 595865070). The exact time may differ based on how lucky (or not) we get, and the exact number of rows that have to be migrated.
See the following issues for more information:
Migration output
db/migrate/20210602155056_add_merge_request_diff_commit_users.rb
up:
== AddMergeRequestDiffCommitUsers: migrating =================================
-- create_table(:merge_request_diff_commit_users, {:id=>:bigint})
-- quote_column_name(:name)
-> 0.0000s
-- quote_column_name(:email)
-> 0.0000s
-> 0.2931s
-- quote_table_name("check_147358fc42")
-> 0.0000s
-- quote_table_name("check_f5fa206cf7")
-> 0.0000s
-- quote_table_name(:merge_request_diff_commit_users)
-> 0.0000s
-- execute("ALTER TABLE \"merge_request_diff_commit_users\"\nADD CONSTRAINT \"check_147358fc42\" CHECK (char_length(\"name\") <= 512),\nADD CONSTRAINT \"check_f5fa206cf7\" CHECK (char_length(\"email\") <= 512)\n")
-> 0.1400s
-- transaction_open?()
-> 0.0000s
-- current_schema()
-> 0.1392s
-- execute("ALTER TABLE merge_request_diff_commit_users\nADD CONSTRAINT merge_request_diff_commit_users_name_or_email_existence\nCHECK ( (COALESCE(name, '') != '') OR (COALESCE(email, '') != '') )\nNOT VALID;\n")
-> 0.1403s
-- current_schema()
-> 0.1396s
-- execute("ALTER TABLE merge_request_diff_commit_users VALIDATE CONSTRAINT merge_request_diff_commit_users_name_or_email_existence;")
-> 0.1399s
== AddMergeRequestDiffCommitUsers: migrated (2.5322s) ========================
down:
== AddMergeRequestDiffCommitUsers: reverting =================================
-- drop_table(:merge_request_diff_commit_users)
-> 0.1610s
== AddMergeRequestDiffCommitUsers: reverted (0.1610s) ========================
db/migrate/20210602155110_add_merge_request_diff_commit_user_columns.rb
up:
== AddMergeRequestDiffCommitUserColumns: migrating ===========================
-- column_exists?(:merge_request_diff_commits, :commit_author_id)
-> 0.1424s
-- add_column(:merge_request_diff_commits, :commit_author_id, :bigint)
-> 0.1499s
-- column_exists?(:merge_request_diff_commits, :committer_id)
-> 0.1419s
-- add_column(:merge_request_diff_commits, :committer_id, :bigint)
-> 0.1408s
== AddMergeRequestDiffCommitUserColumns: migrated (0.5753s) ==================
down:
== AddMergeRequestDiffCommitUserColumns: reverting ===========================
-- column_exists?(:merge_request_diff_commits, :commit_author_id)
-> 0.1417s
-- remove_column(:merge_request_diff_commits, :commit_author_id)
-> 0.1554s
-- column_exists?(:merge_request_diff_commits, :committer_id)
-> 0.1411s
-- remove_column(:merge_request_diff_commits, :committer_id)
-> 0.1401s
== AddMergeRequestDiffCommitUserColumns: reverted (0.5785s) ==================
Background migration timings
On Postgres.ai:
Benchmark.measure { Gitlab::BackgroundMigration::MigrateMergeRequestDiffCommitUsers.new.perform(1, 40_001) }
=> #<Benchmark::Tms:0x0000560cf7c99488 @cstime=0.0, @cutime=0.0, @label="", @real=133.6710021990002, @stime=0.07279900000000006, @total=10.357776000000001, @utime=10.284977000000001>
The exact timings vary a bit. For example, the first time I ran this it took 618 seconds in total. I suspect this is due to caches and what not. On staging this batch size took around 1 minute. Postgres.ai in general is quite a bit slower compared to production, so I expect production timings to be better.
A single update using Postgres.ai takes about 150 milliseconds:
gitlabhq_dblab=# update merge_request_diff_commits set commit_author_id = 3, committer_id = 3 where (merge_request_diff_id, relative_order) = (1382, 0);
UPDATE 0
Time: 142,764 ms
Also take into account that the roundtrip from my local environment to Postgres.ai and back may contribute to these timings.
This batch creates 4474 rows in merge_request_diff_commit_users
, and updates 178195 rows in merge_request_diff_commits
.
Bulk update timings
Rows in merge_request_diff_commits
are updated in bulk. The queries for this look like the following query:
SQL
UPDATE merge_request_diff_commits
SET commit_author_id = CASE
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (1, 0) THEN 1
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (1, 1) THEN 1
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (25, 0) THEN 2
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (29, 0) THEN 3
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (45, 0) THEN 4
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (58, 0) THEN 5
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (58, 1) THEN 6
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (58, 2) THEN 6
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (74, 0) THEN 7
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (90, 0) THEN 8
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (90, 1) THEN 8
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (90, 2) THEN 8
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (90, 3) THEN 8
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (96, 0) THEN 9
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (97, 0) THEN 10
END,
committer_id = CASE
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (1, 0) THEN 1
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (1, 1) THEN 1
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (25, 0) THEN 2
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (29, 0) THEN 3
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (45, 0) THEN 4
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (58, 0) THEN 5
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (58, 1) THEN 6
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (58, 2) THEN 6
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (74, 0) THEN 7
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (90, 0) THEN 8
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (90, 1) THEN 8
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (90, 2) THEN 8
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (90, 3) THEN 8
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (96, 0) THEN 9
WHEN (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) = (97, 0) THEN 10
END
WHERE (merge_request_diff_commits.merge_request_diff_id, merge_request_diff_commits.relative_order) IN (
(1, 0), (1, 1), (25, 0), (29, 0), (45, 0), (58, 0), (58, 1), (58, 2), (74, 0), (90, 0), (90, 1), (90, 2), (90, 3), (96, 0), (97, 0)
);
Each query is limited to updating at most 1000 rows. The above query completes on staging in 25 milliseconds. The query plan is as follows:
EXPLAIN ANALYZE
Update on merge_request_diff_commits (cost=35.26..1894.98 rows=1072 width=232) (actual time=5.325..5.331 rows=0 loops=1)
-> Bitmap Heap Scan on merge_request_diff_commits (cost=35.26..1894.98 rows=1072 width=232) (actual time=0.385..2.374 rows=15 loops=1)
Recheck Cond: (((merge_request_diff_id = 1) AND (relative_order = 0)) OR ((merge_request_diff_id = 1) AND (relative_order = 1)) OR ((merge_request_diff_id = 25) AND (relative_order = 0)) OR ((merge_request_diff_id = 29) AND (relative_order = 0)) OR ((merge_request_diff_id = 45) AND (relative_order = 0)) OR ((merge_request_diff_id = 58) AND (relative_order = 0)) OR ((merge_request_diff_id = 58) AND (relative_order = 1)) OR ((merge_request_diff_id = 58) AND (relative_order = 2)) OR ((merge_request_diff_id = 74) AND (relative_order = 0)) OR ((merge_request_diff_id = 90) AND (relative_order = 0)) OR ((merge_request_diff_id = 90) AND (relative_order = 1)) OR ((merge_request_diff_id = 90) AND (relative_order = 2)) OR ((merge_request_diff_id = 90) AND (relative_order = 3)) OR ((merge_request_diff_id = 96) AND (relative_order = 0)) OR ((merge_request_diff_id = 97) AND (relative_order = 0)))
Heap Blocks: exact=6
-> BitmapOr (cost=35.26..35.26 rows=1072 width=0) (actual time=0.134..0.140 rows=0 loops=1)
-> Bitmap Index Scan on merge_request_diff_commits_pkey (cost=0.00..2.08 rows=92 width=0) (actual time=0.053..0.054 rows=1 loops=1)
Index Cond: ((merge_request_diff_id = 1) AND (relative_order = 0))
-> Bitmap Index Scan on merge_request_diff_commits_pkey (cost=0.00..2.08 rows=49 width=0) (actual time=0.018..0.018 rows=1 loops=1)
Index Cond: ((merge_request_diff_id = 1) AND (relative_order = 1))
-> Bitmap Index Scan on merge_request_diff_commits_pkey (cost=0.00..2.08 rows=92 width=0) (actual time=0.003..0.004 rows=2 loops=1)
Index Cond: ((merge_request_diff_id = 25) AND (relative_order = 0))
-> Bitmap Index Scan on merge_request_diff_commits_pkey (cost=0.00..2.08 rows=92 width=0) (actual time=0.003..0.003 rows=2 loops=1)
Index Cond: ((merge_request_diff_id = 29) AND (relative_order = 0))
-> Bitmap Index Scan on merge_request_diff_commits_pkey (cost=0.00..2.08 rows=92 width=0) (actual time=0.024..0.024 rows=2 loops=1)
Index Cond: ((merge_request_diff_id = 45) AND (relative_order = 0))
-> Bitmap Index Scan on merge_request_diff_commits_pkey (cost=0.00..2.08 rows=92 width=0) (actual time=0.003..0.003 rows=2 loops=1)
Index Cond: ((merge_request_diff_id = 58) AND (relative_order = 0))
-> Bitmap Index Scan on merge_request_diff_commits_pkey (cost=0.00..2.08 rows=49 width=0) (actual time=0.003..0.003 rows=2 loops=1)
Index Cond: ((merge_request_diff_id = 58) AND (relative_order = 1))
-> Bitmap Index Scan on merge_request_diff_commits_pkey (cost=0.00..2.08 rows=35 width=0) (actual time=0.003..0.003 rows=2 loops=1)
Index Cond: ((merge_request_diff_id = 58) AND (relative_order = 2))
-> Bitmap Index Scan on merge_request_diff_commits_pkey (cost=0.00..2.08 rows=92 width=0) (actual time=0.003..0.003 rows=2 loops=1)
Index Cond: ((merge_request_diff_id = 74) AND (relative_order = 0))
-> Bitmap Index Scan on merge_request_diff_commits_pkey (cost=0.00..2.08 rows=92 width=0) (actual time=0.003..0.003 rows=2 loops=1)
Index Cond: ((merge_request_diff_id = 90) AND (relative_order = 0))
-> Bitmap Index Scan on merge_request_diff_commits_pkey (cost=0.00..2.08 rows=49 width=0) (actual time=0.003..0.003 rows=2 loops=1)
Index Cond: ((merge_request_diff_id = 90) AND (relative_order = 1))
-> Bitmap Index Scan on merge_request_diff_commits_pkey (cost=0.00..2.08 rows=35 width=0) (actual time=0.003..0.003 rows=2 loops=1)
Index Cond: ((merge_request_diff_id = 90) AND (relative_order = 2))
-> Bitmap Index Scan on merge_request_diff_commits_pkey (cost=0.00..2.08 rows=30 width=0) (actual time=0.003..0.003 rows=2 loops=1)
Index Cond: ((merge_request_diff_id = 90) AND (relative_order = 3))
-> Bitmap Index Scan on merge_request_diff_commits_pkey (cost=0.00..2.08 rows=92 width=0) (actual time=0.003..0.003 rows=2 loops=1)
Index Cond: ((merge_request_diff_id = 96) AND (relative_order = 0))
-> Bitmap Index Scan on merge_request_diff_commits_pkey (cost=0.00..2.08 rows=92 width=0) (actual time=0.003..0.003 rows=2 loops=1)
Index Cond: ((merge_request_diff_id = 97) AND (relative_order = 0))
Planning Time: 3.554 ms
Execution Time: 5.678 ms
Impact analysis
- Replication lag: minor (see !63669 (comment 615623486))
- CPU/load: non-existing (see !63669 (comment 614605856))
- Background migration time: 2 weeks at minimum, possibly 3-4 weeks depending on the exact amount of data to migrate
- Time per background job: roughly 60-90 seconds
- Size of the new table: <=5GB (see !63669 (comment 614736560))
Does this MR meet the acceptance criteria?
Conformity
-
I have included a changelog entry, or it's not needed. (Does this MR need a changelog?) -
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?) -
I have self-reviewed this MR per code review guidelines. -
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines) -
I have followed the style guides. -
This change is backwards compatible across updates, or this does not apply.