Introduce `.ndjson` as a way to process import/export
Problem to solve
We've previously identified the need to reduce overall memory consumption for both imports and exports. The problems as of now are:
- Peak memory use is a function of project size (specifically the metadata tree encoding it)
- There are known inefficiencies in the current encoding, such as creating duplicate entries
The current solution therefore doesn't scale, since memory use rises with project size (to the extent that, in some cases, we were unable to process projects at all).
To address these concerns, we propose to introduce a new data-interchange format (DIF) based on `.ndjson`, which would allow us to process imports and exports with approximately constant memory use, i.e. regardless of project size. An early proof-of-concept has shown very promising results. However, this is a complex undertaking, as there are a number of things that need to happen before we can switch over to `.ndjson`:
- We need to introduce versioning of import/export to allow us to introduce breaking changes: #35861 (closed)
- We need to implement `.ndjson` on the export side,
- We need to implement `.ndjson` on the import side.
Proposal
What follows is an overview of:
- Why we think `ndjson` is a good contender for solving the problem outlined above
- How a project export would be represented in `ndjson`
- An estimate of the memory savings this would afford us
- General risks and impediments to reaching this goal
Benefits of ndjson
`ndjson` (newline-delimited JSON) is a JSON-based DIF optimized for streaming use cases. Since plain JSON encodes an entity in a single monolithic tree, it needs to either be interpreted in its entirety or tokenized and streamed in parts. Both approaches are problematic for different reasons: the former because the data needs to be loaded into memory in its entirety (which is inefficient for large data sets; it is the approach we're currently taking), the latter because it operates at a very low level of the data structure, making it cumbersome to deal with from a development point of view.
`ndjson` instead splits a data structure that can be encoded in JSON into smaller JSON values, where each such value is written and read as a single line of text. An `ndjson`-formatted stream is therefore not a valid JSON document, but each line in such a stream is.
The format itself is more formally specified here.
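For illustration, here is a minimal Ruby sketch (not GitLab code; the file name and relation data are made up) of writing and reading an ndjson file line by line:

```ruby
require 'json'

# Writing: each issue becomes one self-contained JSON document on its own line.
issues = [
  { iid: 1, title: 'First issue', notes: [] },
  { iid: 2, title: 'Second issue', notes: [{ note: 'Looks good' }] }
]

File.open('issues.ndjson', 'w') do |file|
  issues.each { |issue| file.puts(issue.to_json) }
end

# Reading: only one line (one issue) needs to be held in memory at a time.
File.foreach('issues.ndjson') do |line|
  issue = JSON.parse(line)
  # ... process `issue`, then let it be garbage-collected ...
end
```

The resulting file contains two lines, each of which is valid JSON on its own, while the file as a whole is not a single JSON document.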
This approach has a number of benefits:
- Constant memory use. Data can be streamed from files or over the wire line by line, and already processed lines can be discarded. This means memory use will never be larger than the largest single-line entity, which, averaged over a sufficiently large number of projects, should mean roughly constant memory use.
- Familiar concepts & tooling. Each line is a valid JSON document (i.e. either an array or object), so no additional tooling or knowledge is needed to process it, meaning there is much less friction compared to other wire formats like protobuf.
- Checkpointing. The line-by-line nature of ndjson gives us a natural way to checkpoint imports and exports, which can be used to implement abort/resume as well as progress indicators / ETAs for end users and frontends. We could, for instance, keep a simple file pointer around to which we can reset a file when an import is paused and later resumed.
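To make the checkpointing idea concrete, here is a hypothetical Ruby sketch (the method name, offset handling, and persistence mechanism are all assumptions, not existing code) that records a byte offset after each processed line, so a paused import could resume from where it stopped:

```ruby
require 'json'

# Hypothetical resumable reader: the returned offset would be persisted
# somewhere (e.g. alongside the import state) whenever the import is paused,
# and passed back in on resume.
def each_relation_row(path, offset: 0)
  File.open(path) do |file|
    file.seek(offset)
    file.each_line do |line|
      yield JSON.parse(line)
      offset += line.bytesize # checkpoint: everything up to here is done
    end
  end
  offset
end

# Resuming is just starting from the last recorded offset:
# offset = each_relation_row('merge_requests.ndjson', offset: offset) { |row| import(row) }
```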
Representing GitLab project exports
In terms of encoding e.g. exported projects to ndjson, a natural split would be along the "top-level associations" that we currently define for project exports (e.g. `issues`, `merge_requests`, etc.; see `import_export.yml` for a full list). In this model, each direct child node of `project` would be written to a separate file, which in turn contains one entity entry per line. This has the benefit that we can follow a familiar structure / schema. Moreover, since we would have one file per top-level association, entity-specific post-processing becomes easier: we could for instance limit or transform the number of merge requests we export without ever running the application, purely by text-processing `merge_requests.ndjson` (see the short sketch after the listing below). An example break-down:
├── [ 5] auto_devops.ndjson
├── [ 31] ci_cd_settings.ndjson
├── [172K] ci_pipelines.ndjson
├── [ 219] container_expiration_policy.ndjson
├── [ 5] error_tracking_setting.ndjson
├── [139K] issues.ndjson
├── [1.2K] labels.ndjson
├── [2.3M] merge_requests.ndjson
├── [ 5] metrics_setting.ndjson
├── [2.7K] milestones.ndjson
├── [ 316] project_feature.ndjson
├── [1.6K] project_members.ndjson
├── [1.3K] project.ndjson
├── [ 745] protected_branches.ndjson
├── [ 5] service_desk_setting.ndjson
└── [ 589] services.ndjson
A challenge with this approach is that the size distribution can be quite uneven, since some relations contain much more data than others.
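To illustrate the text-processing point made above: because every merge request is a self-contained line, trimming an export down to (say) its first 100 merge requests is a plain file operation. A hypothetical one-off in Ruby, using the file names from the listing:

```ruby
# Keep only the first 100 merge requests of an export, without booting
# the application or parsing the whole file into memory.
File.open('merge_requests.trimmed.ndjson', 'w') do |out|
  File.foreach('merge_requests.ndjson').first(100).each { |line| out.write(line) }
end
```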
Expected memory savings
Some early measurements can be found in !23920 (closed).
We expect even large relations not to exceed a few dozen KB in size; since this memory can be released once the relation has been processed, we expect dramatic improvements in memory use. These gains become larger as the project metadata grows. Even with very large individual relations (say 1MB per line), here is how `ndjson` would compare to a DOM-based approach:
| Total JSON metadata | Max JSON (before) | Max JSON (after) | Reduction |
|---|---|---|---|
| 80MB | 80MB | 1MB | 98.75% |
| 1GB | 1GB | 1MB | 99.9% |
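To spell out how the reduction column is computed: it is one minus the ratio of peak memory after to peak memory before, e.g. 1 - 1MB/80MB = 98.75%, and 1 - 1MB/1GB ≈ 99.9%.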
Risks and impediments
A major challenge will be to migrate to the new format without compromising user experience too much, since this constitutes a breaking change: older archives exported under the current logic would not be compatible with `ndjson`. We need to decide to what extent we want to keep supporting the older, less efficient format, or whether we prefer a clean break with instructions for users on how to migrate. One suggestion made to ease this effort was a dedicated tool that customers could use to transform older project archives into the new format.
There might be a time during which we need to support both formats simultaneously, which will temporarily increase complexity in the code base.
Making sure that we do not compromise existing import/export functionality and correctness is another risk. We have already worked towards better automated monitoring & testing of import/export functionality, but large imports especially can break in subtle ways that are difficult to detect.
Finally, since this is a larger effort, we do not expect ~"group::memory" to implement every aspect of it, but rather to prepare everything as much as possible and eventually hand the project over to ~"group::import". However, we have already done a high-level issue breakdown, which is summarized below.
Approach
Below is a proposed breakdown of the work that needs to get done and how it could be split into increments that we can deliver.
Track 0: Preparatory work
- Separate `Group` from `Project` logic (important, in progress)
  - move (dumb) all (almost) code from the `Gitlab::ImportExport` namespace
    - #207846 (closed)
- Drop or rework `RelationRenameService` (important, #207960 (closed))
  - we need to figure out if or how `RelationRenameService` will carry over to `ndjson`, since it duplicates relations it renames, which is not compatible with streaming data
  - we might have to drop this (check with PM)
  - similarly, we have recently dropped support for legacy merge request formats in !25616 (merged)
    - next step: open an MR to remove it
- Introduce a rake task for synchronous exports (optional, in progress)
  - ship an `export.rake` task similar to `import.rake`
    - #207847 (closed)
- Collect exporter metrics in regular intervals (optional)
  - extend our metrics gathering to also include exporting projects (similar to importing)
  - this could be similar to what we did for imports, where results are regularly published via a CI pipeline (TODO: create issue for this)
Track 1: Moving toward ndjson
We can split this track up further to work in parallel on the export and import side of things.
Exporting
MR1: Export via streaming serializer, introduce "Writer" abstraction
- Introduce a streaming serializer as a drop-in replacement for `FastHashSerializer`
- a `Writer` can persist relations in different ways
- it would still produce a "fat JSON", so no ndjson here yet
- there would be no structural changes here yet, it's mostly a refactor
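As a rough sketch of what the `Writer` seam could look like (class and method names are illustrative assumptions, not the actual implementation): the streaming serializer would hand relations to a writer one at a time, and the first writer would still assemble a single fat JSON document.

```ruby
require 'json'

# Illustrative only: receives relations one by one from a (hypothetical)
# streaming serializer instead of one big pre-built hash.
class LegacyWriter
  def initialize(path)
    @path = path
    @tree = {}
  end

  # Called once per top-level relation.
  def write_relation(name, data)
    @tree[name] = data
  end

  # Still produces a single "fat JSON" file, so no ndjson at this stage.
  def close
    File.write(@path, @tree.to_json)
  end
end
```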
MR2: Introduce ndjson writer
- this implements a `Writer` that writes ndjson
- based on a feature flag, it can switch between fat JSON and ndjson, or write both outputs
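A matching sketch of an ndjson-backed writer, selected behind a feature flag (the flag name and class are made up for illustration); it writes one `<relation>.ndjson` file per top-level relation, as described earlier:

```ruby
require 'json'
require 'fileutils'

# Illustrative only: persists each relation to its own <relation>.ndjson file,
# one JSON document per row.
class NdjsonWriter
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(dir)
  end

  def write_relation(name, rows)
    File.open(File.join(@dir, "#{name}.ndjson"), 'w') do |file|
      rows.each { |row| file.puts(row.to_json) }
    end
  end
end

# Switching via feature flag (hypothetical flag name):
# writer = Feature.enabled?(:ndjson_export) ? NdjsonWriter.new(dir) : LegacyWriter.new(path)
```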
MR3: (nice to have) Allow exporting using either ndjson or the legacy format (but not both)
- See also Track 2 (need to clarify with product how to achieve that, since it requires user input)
Importing
MR4: Introduce "Reader" abstraction
- a `Reader` can read JSON files for further processing
- its only implementation would be to read fat JSON files
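A sketch of the `Reader` side (again, names are illustrative only); its first implementation simply loads the existing fat JSON file in full:

```ruby
require 'json'

# Illustrative only: loads the whole legacy export into memory and then
# yields each top-level relation to the import logic.
class LegacyReader
  def initialize(path)
    @path = path
  end

  def each_relation
    tree = JSON.parse(File.read(@path))
    tree.each { |name, data| yield name, data }
  end
end
```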
MR5: Introduce ndjson `Reader`
- this implements the `Reader` that can parse ndjson
- based on a feature flag and/or the file format, it decides which reader to choose
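And an ndjson counterpart that streams rows lazily, so no relation is ever fully materialized. How the reader is chosen (feature flag, archive contents) is still open; the check below is only an assumption:

```ruby
require 'json'

# Illustrative only: exposes each <relation>.ndjson file as a lazy stream of rows.
class NdjsonReader
  def initialize(dir)
    @dir = dir
  end

  def each_relation
    Dir.glob(File.join(@dir, '*.ndjson')).each do |path|
      name = File.basename(path, '.ndjson')
      rows = File.foreach(path).lazy.map { |line| JSON.parse(line) }
      yield name, rows
    end
  end
end

# Hypothetical selection based on what the uploaded archive contains:
# reader = Dir.exist?(ndjson_dir) ? NdjsonReader.new(ndjson_dir) : LegacyReader.new(json_path)
```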
Track 2: Expose ndjson import/export to users
Tackle the smallest possible aspect of #35861 (closed).
I would like us to have a way to indicate, as part of the export request, which "format" of export we want. Maybe it should be two radio buttons under the existing description:
> Export project
> Export this project with all its related data in order to move your project to a new GitLab instance. Once the export is finished, you can import the file from the "New Project" page.
- Use legacy version compatible with GitLab < 12.9 (internally it would create big JSON)
- Export using new version compatible with GitLab >= 12.9 (internally it would create ndjson) (default)
We would continue importing legacy/big JSON until 13.0; with 13.1 we would remove support for exporting/importing legacy JSON.
Intended users
Links / references
POC MR: !23920 (closed)