Normalise and de-duplicate committer and author details in merge_request_diff_commits
In #331523 (comment 583654940), as part of our sharding effort, we found that a lot of data in merge_request_diff_commits
is used for storing committer and author details (= name + email). Most of this data is duplicate data.
To reduce the amount of space needed to store this data, we need to normalise and de-duplicate this data. To do so, we need two tables with these structures:
-
merge_request_diff_commit_names
-
id
(primary key) -
name
(text, unique)
-
-
merge_request_diff_commit_emails
-
id
(primary key) -
email
(text, unique)
-
We then need to add the following columns to merge_request_diff_commits
:
committer_name_id
committer_email_id
author_name_id
author_email_id
These columns would have foreign keys to the appropriate tables (e.g. committer_name_id
points to merge_request_diff_commit_names.id
), using the nullify
rule. The columns can be nullable, because I believe that commit author/committer details are technically not required (at least they aren't in our current database scheme).
The rule doesn't really matter though, as we'd never delete rows from the commit names/emails tables. Indexes on these columns are also not needed for that reason.
When we create a new row in merge_request_diff_commits
, we take the following steps for the committer name, email, the author name, and the author email:
- Run a
SELECT
to see if an existing row is present in the appropriate target table (e.g. for the committer name we'd look inmerge_request_diff_commit_names
) - If a row exists, use the ID of that row to populate the appropriate column in
merge_request_diff_commits
- If no row exists, insert one and catch any unique record errors we may run into. If we run into such an error, run a
SELECT
to get the row - Populate the column the same way as step 2
Migration Steps
Migrating the existing data is done as follows:
- A regular migration adds the new tables and columns. Initially these are all empty
- We add a column
name_and_email_migrated
tomerge_request_diff_commits
. This column is a boolean that defaults tofalse
- The code is changed to support reading from both. That is, if the ID columns have a value we read from them; otherwise we read from the old text based columns
- A post-deployment migration schedules the migrating of the existing rows, using the same logic for populating tables as outlines above. For every row migrated we set
name_and_email_migrated
totrue
- In the next release, we run a regular migration that gets all rows where
name_and_email_migrated
isfalse
and migrates them in place. This number should be close to zero (or zero itself), and mostly servers as a backup in case some background migrations didn't finish - In this same release we change the code to only read from the new columns. We also add the necessary rules to ignore the old columns
- In this same release we add a post-deployment migration that removes the old text based columns