Add BulkImports NdjsonExtractor & update labels pipeline to use it
What does this MR do?
This MR adds NdjsonExtractor and updates the labels pipeline to use it in Bulk Imports.
More information on the Bulk Imports group migration tool: https://docs.gitlab.com/ee/user/group/import/
The majority of Bulk Import ETL pipelines (extract -> transform -> load) use the GraphQL API to import data. However, due to the challenges described in #326757 (closed), not all group relations can be transferred while preserving all their associations. For example, if an epic has notes that have award emojis, such nested relations are difficult to preserve using the GraphQL API because of nested pagination.
Instead, download the exported relation ndjson.gz file from the 'Group relations export API', which was recently added as part of #329864 (closed) (https://docs.gitlab.com/ee/api/group_relations_export.html), and import it. This way we can easily preserve all nested associations, as we're reusing a lot of the behaviour from the existing Import/Export codebase.
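As a rough illustration of the download step, the sketch below builds the download URL for one exported relation. The endpoint path (`/api/v4/groups/:id/export_relations/download` with a `relation` query parameter) is taken from my reading of the linked API documentation and should be checked against it. The helper name `export_download_uri` is hypothetical and is not part of this MR.

```ruby
require 'uri'

# Hypothetical helper (not part of this MR): builds the URL used to download
# one exported relation. The endpoint path is an assumption based on the
# 'Group relations export API' documentation linked above.
def export_download_uri(base_url, group_id, relation)
  # The group may be given as a numeric ID or a URL-encoded full path.
  encoded_id = URI.encode_www_form_component(group_id.to_s)
  uri = URI.join(base_url, "/api/v4/groups/#{encoded_id}/export_relations/download")
  uri.query = URI.encode_www_form(relation: relation)
  uri
end

puts export_download_uri('https://gitlab.example.com', 1, 'labels')
```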
NdjsonExtractor does the following:
- Downloads labels.ndjson.gz from the source GitLab instance using the 'Group relations export API'
- Decompresses it
- Reads data from the file and returns it for processing (one line at a time using ImportExport NdjsonReader)
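The decompress-and-read steps above can be sketched with the Ruby standard library alone. This is a minimal illustration, not the NdjsonExtractor or NdjsonReader code from this MR: the helper name `read_ndjson_gz` and the sample label attributes are hypothetical, and the download step is left out so the example runs offline.

```ruby
require 'json'
require 'stringio'
require 'zlib'

# Hypothetical helper (not the MR's implementation): decompresses gzipped
# NDJSON held in memory and parses it one line at a time.
def read_ndjson_gz(gz_bytes)
  rows = []
  gz = Zlib::GzipReader.new(StringIO.new(gz_bytes))
  gz.each_line do |line|
    line = line.strip
    next if line.empty? # skip blank lines between records

    rows << JSON.parse(line)
  end
  gz.close
  rows
end

# Sample data standing in for a downloaded labels.ndjson.gz file.
sample = Zlib.gzip(%({"title":"bug","color":"#ff0000"}\n{"title":"feature","color":"#00ff00"}\n))
read_ndjson_gz(sample).each { |label| puts label['title'] }
```

Each line of an NDJSON file is an independent JSON document, which is what makes line-by-line processing possible without loading one large JSON structure.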
LabelsPipeline is updated from the GraphQL extractor to NdjsonExtractor in order to preserve the epic-label association. The Epics pipeline is going to be updated to use NdjsonExtractor in a future MR. This MR is split from my draft MR (!61044 (closed)) in an attempt to have a smaller MR that is easier to review.
The updated LabelsPipeline utilises the existing Import/Export RelationFactory, which brings a lot of benefits, like making sure all nested relations are transformed into objects, all attributes are sanitized, appropriate attributes are added, etc.
Mentions #329864 (closed)
Screenshots (strongly suggested)
Does this MR meet the acceptance criteria?
Conformity
- I have included a changelog entry, or it's not needed. (Does this MR need a changelog?)
- I have added/updated documentation, or it's not needed. (Is documentation required?)
- I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?)
- I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?)
- I have self-reviewed this MR per code review guidelines.
- This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines)
- I have followed the style guides.
Availability and Testing
- I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.)
- I have tested this MR in all supported browsers, or it's not needed.
- I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.
Security
Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.
- Label as security and @ mention @gitlab-com/gl-security/appsec
- The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- Security reports checked/validated by a reviewer from the AppSec team