Bulk import: split references pipeline into smaller workers
What does this MR do and why?
Closes #429609 (closed).
Splits `BulkImports::Projects::Pipelines::ReferencesPipeline` from a single worker into smaller workers behind a feature flag: [Feature flag] Rollout of `bulk_import_async_re... (#430181 - closed). The pipeline is responsible for updating references in issue and MR descriptions and notes so that they are correctly mapped to the new project.
Before
A single worker was responsible for fetching all objects (issues, MRs and notes), building references and saving the objects. In one case, this took ~3.5 hours to complete.
After
The single-worker approach was renamed to `LegacyReferencesPipeline` so that it is still used when the feature flag is not enabled. `ReferencesPipeline` is now responsible for fetching all objects and enqueuing a worker for each one so that their refs can be updated asynchronously. The new workers do not block the import: when a worker fails, it does not fail the entire import. Instead, the failure is added to the import's failures.
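The fan-out and the non-blocking failure handling described above can be sketched roughly as follows. This is a minimal plain-Ruby illustration, not the code in this MR: every class name prefixed with `Sketch`, the in-memory `queue` array and the `transformer` lambda are hypothetical stand-ins for the real pipeline, Sidekiq and the reference-rewriting logic.

```ruby
# Hypothetical, simplified stand-ins for illustration only; these are not the
# actual GitLab classes or Sidekiq APIs.
SketchFailure = Struct.new(:record_type, :record_id, :message)

# Holds the failures recorded during an import.
class SketchImport
  attr_reader :failures

  def initialize
    @failures = []
  end
end

# Per-object worker: updates one object's references and records any error
# on the import instead of re-raising it.
class SketchTransformWorker
  def initialize(import, transformer)
    @import = import
    @transformer = transformer
  end

  def perform(record_type, record_id)
    @transformer.call(record_type, record_id)
    true
  rescue StandardError => e
    @import.failures << SketchFailure.new(record_type, record_id, e.message)
    false
  end
end

# Pipeline: only enqueues one job per object; it does no transformation itself.
class SketchReferencesPipeline
  def initialize(queue)
    @queue = queue
  end

  def run(records)
    records.each { |record_type, record_id| @queue << [record_type, record_id] }
  end
end

# Usage: the job for note 3 raises, the other jobs still complete.
queue = []
SketchReferencesPipeline.new(queue).run([[:issue, 1], [:merge_request, 2], [:note, 3]])

import = SketchImport.new
transformer = ->(type, id) { raise ArgumentError, "bad reference" if type == :note && id == 3 }
worker = SketchTransformWorker.new(import, transformer)
queue.each { |type, id| worker.perform(type, id) }
```

In this sketch a failing job only appends to `import.failures`, so the remaining jobs are unaffected, which mirrors the behaviour described for the new workers.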
 | Before the pipeline | After the pipeline
---|---|---
MR description | |
MR note | |
Issue description | |
Issue note | |
Database queries
The database queries remained the same except for the following improvements:
- Instead of loading the whole issue, MR or note record, we now only select `id`.
- Instead of stepping through issues and MRs twice (once for themselves and once for their notes), we loop through them once and load up notes within the same loop.
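As a rough illustration of the second point, the single pass over issues and MRs could look like the sketch below. It uses hypothetical in-memory structs (`SketchNoteable`, `SketchNote`) and a hypothetical helper `reference_targets` rather than the real models and queries, and it keeps only ids, as the first point describes.

```ruby
# Hypothetical in-memory records; in the real pipeline these would be
# database rows loaded with only their id selected.
SketchNoteable = Struct.new(:id, :kind)
SketchNote = Struct.new(:id, :noteable_id, :noteable_kind)

# Walks each issue/MR once and returns [kind, id] pairs for the noteable and
# for its notes in the same pass.
def reference_targets(noteables, notes)
  notes_by_parent = notes.group_by { |note| [note.noteable_kind, note.noteable_id] }

  noteables.flat_map do |noteable|
    note_ids = notes_by_parent.fetch([noteable.kind, noteable.id], []).map(&:id)
    [[noteable.kind, noteable.id]] + note_ids.map { |id| ["Note", id] }
  end
end
```

Each returned pair corresponds to one object whose references need updating, so the caller never has to iterate over the issues and MRs a second time to reach their notes.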
The resulting database queries are as follows for the gitlab project:
- Loading up issues in batches:
  - 6.853 ms for `SELECT "issues"."iid" FROM "issues" WHERE "issues"."project_id" = 278964 ORDER BY "issues"."iid" ASC LIMIT 1`
  - 21.165 ms for `SELECT "issues"."iid" FROM "issues" WHERE "issues"."project_id" = 278964 AND "issues"."iid" >= 1 ORDER BY "issues"."iid" ASC LIMIT 1 OFFSET 100`
  - 191.533 ms for `SELECT "issues"."id" FROM "issues" WHERE "issues"."project_id" = 278964 AND "issues"."iid" >= 1 AND "issues"."iid" < 101`
- Issue notes:
  - 49.734 ms for `SELECT "notes"."id" FROM "notes" WHERE "notes"."noteable_id" = 278965 AND "notes"."noteable_type" = 'Issue' ORDER BY "notes"."id" ASC LIMIT 1`
  - 14.630 ms for `SELECT "notes"."id" FROM "notes" WHERE "notes"."noteable_id" = 278965 AND "notes"."noteable_type" = 'Issue' AND "notes"."id" >= 1 ORDER BY "notes"."id" ASC LIMIT 1 OFFSET 100`
  - 12.014 ms for `SELECT "notes"."id" FROM "notes" WHERE "notes"."noteable_id" = 278965 AND "notes"."noteable_type" = 'Issue' AND "notes"."id" >= 1 AND "notes"."id" < 101`
- Merge requests in batches (similar queries):
  - 23.701 ms
  - 62.736 ms
  - 684.578 ms
- MR notes:
  - 68.414 ms
  - 14.846 ms
  - 15.246 ms
Because these queries have already been database reviewed and are performant enough, I don't think we need additional review on this.
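For readers unfamiliar with the three-query shape listed above (find the first `iid`, find the `iid` 100 rows later, then select the `id`s in that range), here is a minimal in-memory sketch of the same batching idea. `SketchRow` and `id_batches` are hypothetical names; the real queries are generated by the application's batching code and run against the database.

```ruby
# Hypothetical row with just the two columns the queries above touch.
SketchRow = Struct.new(:id, :iid)

# Returns arrays of ids, one array per batch, mimicking the query pattern:
#   1. first iid                      (ORDER BY iid ASC LIMIT 1)
#   2. iid that starts the next batch (iid >= start ... LIMIT 1 OFFSET batch_size)
#   3. ids in the half-open range     (iid >= start AND iid < stop)
def id_batches(rows, batch_size: 100)
  sorted = rows.sort_by(&:iid)
  batches = []
  start = sorted.first&.iid

  while start
    stop = sorted.select { |row| row.iid >= start }[batch_size]&.iid
    batch = sorted.select { |row| row.iid >= start && (stop.nil? || row.iid < stop) }
    batches << batch.map(&:id)
    start = stop # nil once there is no further batch, which ends the loop
  end

  batches
end
```

With `batch_size: 100` and `iid`s starting at 1, the first range is `iid >= 1 AND iid < 101`, matching the third issue query listed above.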
How to set up and validate locally
- Disable the feature flag: `Feature.disable(:bulk_import_async_references_pipeline)`.
- Import a group via the Direct Importer. Add refs to issues, MRs and notes on the projects being imported.
- Tail the importer logs to see a single worker for `LegacyReferencesPipeline`.
- Ensure that the refs are converted to links pointing to the new project.
- Enable the feature flag: `Feature.enable(:bulk_import_async_references_pipeline)`.
- Import the group again.
- Tail the logs or view the Sidekiq UI to see that a single worker called `ReferencesPipeline` is enqueued and then a `TransformReferencesWorker` for each issue, MR and note.
- Ensure that the refs are converted to links pointing to the new project.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #429609 (closed)