Skip to content

Fix merge request commits that are missing committer/author details

Yorick Peterse requested to merge yorick/fix-missing-commits-data into master

What does this MR do and why?

In MR !63669 (merged) we introduced a new data format for storing merge request diff commit authors and committers. As part of this work we made changes to the import/export code to support this new format, and added a set of migrations to migrate existing data to this new format. At this time we supported reading and writing of data in both the old and new format, allowing us to gradually migrate data over to the new format.

In !72219 (merged) we ensured all migrations are done, stopped using the old data format, and removed the columns storing this data.

Unfortunately, this chain of events uncovered a bug in our import/export logic. Consider the following timeline of events:

  1. You export project "Cooking Recipes" from a GitLab instance running a version earlier than 14.1 (e.g. 14.0).
  2. The instance you intend to import this project into is running 14.1 or newer. Existing data has been fully migrated already.
  3. You import the project into this new instance.

At this point, the imported data is using the old format, not the format. This is because we forgot to take into account users importing exports using GitLab 14.0 or older, instead only covering exports generated using GitLab 14.1 or newer. Because the background migrations finished, or the data imported would fall in a "bucket" (= a chunk or rows to migrate) that had already been migrated, the data would never be updated to the new format.

In this commit we resolve this problem in two steps. First, we change the import/export logic to support importing data in both the old and new format. Exports still use the new format. In addition, we include a background migration that processes all projects created using a GitLab import/export since the first mentioned merge request was introduced. For each such project we scan over the merge request diff commits and fix any that are missing the commit author or committer details.

For small self-hosted instances this process is unlikely to take more than a few minutes. On GitLab.com however we expect this process to take a few days, as we have to process around 200 000 projects imported since July. This means we'll likely need additional manual intervention similar to the manual work needed for #334394 (closed).

See #344080 (closed) for additional details.

Migration output

==  ScheduleFixMergeRequestDiffCommitUsersMigration: migrating ================
-- transaction()
==  ScheduleFixMergeRequestDiffCommitUsersMigration: migrated (0.0714s) =======

Query output

The background migration runs a query to get all the diff commits that need fixing. These queries look like this:

EXPLAIN ANALYZE
SELECT merge_request_diff_commits.merge_request_diff_id,
       merge_request_diff_commits.relative_order,
       merge_request_diff_commits.sha,
       merge_request_diff_commits.committer_id,
       merge_request_diff_commits.commit_author_id
FROM merge_request_diff_commits
INNER JOIN merge_request_diffs ON merge_request_diffs.id = merge_request_diff_commits.merge_request_diff_id
INNER JOIN merge_requests ON merge_requests.id = merge_request_diffs.merge_request_id
WHERE merge_requests.id IN (
    SELECT merge_requests.id
    FROM merge_requests
    WHERE merge_requests.target_project_id = 278964
    AND merge_requests.id >= 124079136
    AND merge_requests.id < 124110008
)
AND (commit_author_id IS NULL OR committer_id IS NULL);

The plan of this query on production is:

 Gather  (cost=1002.41..497738.42 rows=1 width=45) (actual time=39.832..43.734 rows=0 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   ->  Nested Loop  (cost=2.41..496738.32 rows=1 width=45) (actual time=36.613..36.616 rows=0 loops=2)
         ->  Nested Loop  (cost=1.85..496736.46 rows=1 width=53) (actual time=36.612..36.614 rows=0 loops=2)
               ->  Nested Loop  (cost=1.14..12128.84 rows=18 width=12) (actual time=7.570..30.951 rows=10 loops=2)
                     ->  Parallel Index Scan using merge_requests_pkey on merge_requests merge_requests_1  (cost=0.57..12102.26 rows=9 width=4) (actual time=7.534..29.821 rows=4 loops=2)
                           Index Cond: ((id >= 124079136) AND (id < 124110008))
                           Filter: (target_project_id = 278964)
                           Rows Removed by Filter: 13625
                     ->  Index Only Scan using index_merge_request_diffs_on_merge_request_id_and_id on merge_request_diffs  (cost=0.57..2.82 rows=13 width=8) (actual time=0.083..0.246 rows=2 loops=9)
                           Index Cond: (merge_request_id = merge_requests_1.id)
                           Heap Fetches: 50
               ->  Index Scan using merge_request_diff_commits_pkey on merge_request_diff_commits  (cost=0.71..26922.40 rows=25 width=45) (actual time=0.595..0.595 rows=0 loops=19)
                     Index Cond: (merge_request_diff_id = merge_request_diffs.id)
                     Filter: ((commit_author_id IS NULL) OR (committer_id IS NULL))
                     Rows Removed by Filter: 1
         ->  Index Only Scan using merge_requests_pkey on merge_requests  (cost=0.57..1.85 rows=1 width=4) (never executed)
               Index Cond: (id = merge_request_diffs.merge_request_id)
               Heap Fetches: 0
 Planning Time: 17.800 ms
 Execution Time: 43.849 ms

To schedule the background migration, we run queries like this:

select id
from projects
where id in (
        select id
        from projects
        where id between (31013821 - 10000) and 31013821
)
and import_type = 'gitlab_project'
and created_at >= '2021-07-07 00:00:00';

These queries produce a plan along the lines of the following:

 Gather  (cost=1134.10..26444.84 rows=26 width=4) (actual time=14.338..56.862 rows=561 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Nested Loop  (cost=134.10..25442.24 rows=11 width=4) (actual time=9.935..48.743 rows=187 loops=3)
         ->  Parallel Bitmap Heap Scan on projects projects_1  (cost=133.66..14121.38 rows=3342 width=4) (actual time=9.280..33.871 rows=2739 loops=3)
               Recheck Cond: ((id >= 31003821) AND (id <= 31013821))
               Heap Blocks: exact=2074
               ->  Bitmap Index Scan on projects_pkey  (cost=0.00..131.66 rows=8022 width=0) (actual time=7.160..7.160 rows=21048 loops=1)
                     Index Cond: ((id >= 31003821) AND (id <= 31013821))
         ->  Index Scan using projects_pkey on projects  (cost=0.44..3.39 rows=1 width=4) (actual time=0.005..0.005 rows=0 loops=8217)
               Index Cond: (id = projects_1.id)
               Filter: ((created_at >= '2021-07-07 00:00:00+00'::timestamp with time zone) AND ((import_type)::text = 'gitlab_project'::text))
               Rows Removed by Filter: 1
 Planning Time: 4.802 ms
 Execution Time: 57.020 ms

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Yorick Peterse

Merge request reports

Loading