Introduce `.ndjson` as a way to process import/export
Problem to solve
We've previously identified the need to reduce overall memory consumption for both imports and exports. The problems as of now are:
- Peak memory use is a function of project size (specifically the metadata tree encoding it)
- There are known inefficiencies in the current encoding, such as creating duplicate entries
The current solution therefore doesn't scale, since memory use rises with project size (to the extent that, in some cases, we were unable to process projects at all).
To address these concerns, we propose to introduce a new data-interchange format (DIF) based on `.ndjson`, which would allow us to process imports and exports with approximately constant memory use, i.e. regardless of project size. An early proof-of-concept has shown very promising results. However, this is a complex undertaking, as there are a number of things that need to happen before we can switch over to `.ndjson`:
- We need to introduce versioning of import/export to allow us to introduce breaking changes: #35861 (closed)
- We need to implement `.ndjson` on the export side,
- We need to implement `.ndjson` on the import side.
Proposal
What follows is an overview of:
- Why we think `ndjson` is a good contender for solving the problem outlined above
- How a project export would be represented in `ndjson`
- An estimate of the memory savings this would afford us
- General risks and impediments to reaching this goal
Benefits of ndjson
`ndjson` (newline-delimited JSON) is a JSON-based DIF optimized for streaming use cases. Since plain JSON encodes an entity in a single monolithic tree, it needs to either be interpreted in its entirety or tokenized and streamed in parts. Both approaches are problematic for different reasons: the former because the data needs to be loaded into memory in its entirety (which is inefficient for large data sets; it is the approach we're currently taking), the latter because it operates at a very low level of the data structure, making it cumbersome to deal with from a development point of view.
`ndjson` instead splits a data structure that can be encoded in JSON into smaller JSON values, where each such value is written and read as a single line of text. An `ndjson`-formatted stream is therefore not a valid JSON document, but each line in such a stream is.
The format itself is more formally specified here.
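For illustration, here is a minimal Ruby sketch (not GitLab code; the file name and relation data are made up) of writing and reading an ndjson file line by line:

```ruby
require 'json'

# Writing: each issue becomes one self-contained JSON document on its own line.
issues = [
  { iid: 1, title: 'First issue', notes: [] },
  { iid: 2, title: 'Second issue', notes: [{ note: 'Looks good' }] }
]

File.open('issues.ndjson', 'w') do |file|
  issues.each { |issue| file.puts(issue.to_json) }
end

# Reading: only one line (one issue) needs to be held in memory at a time.
File.foreach('issues.ndjson') do |line|
  issue = JSON.parse(line)
  # ... process `issue`, then let it be garbage-collected ...
end
```

The resulting file contains two lines, each of which is valid JSON on its own, while the file as a whole is not a single JSON document.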
This approach has a number of benefits:
- Constant memory use. Data can be streamed from files or over the wire line by line, and already processed lines can be discarded. This means memory use will never be larger than the largest single-line entity, which, averaged over a sufficiently large number of projects, should mean roughly constant memory use.
- Familiar concepts & tooling. Each line is a valid JSON document (i.e. either an array or object), so no additional tooling or knowledge is needed to process it, meaning there is much less friction compared to other wire formats like protobuf.
- Checkpointing. The line-by-line nature of ndjson gives us a natural way to checkpoint imports and exports, which can be used to implement abort/resume as well as progress indicators / ETAs for end users and frontends. We could, for instance, keep a simple file pointer around to which we can reset a file when an import is paused and later resumed.
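To make the checkpointing idea concrete, here is a hypothetical Ruby sketch (the method name, offset handling, and persistence mechanism are all assumptions, not existing code) that records a byte offset after each processed line, so a paused import could resume from where it stopped:

```ruby
require 'json'

# Hypothetical resumable reader: the returned offset would be persisted
# somewhere (e.g. alongside the import state) whenever the import is paused,
# and passed back in on resume.
def each_relation_row(path, offset: 0)
  File.open(path) do |file|
    file.seek(offset)
    file.each_line do |line|
      yield JSON.parse(line)
      offset += line.bytesize # checkpoint: everything up to here is done
    end
  end
  offset
end

# Resuming is just starting from the last recorded offset:
# offset = each_relation_row('merge_requests.ndjson', offset: offset) { |row| import(row) }
```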
Representing GitLab project exports
In terms of encoding e.g. exported projects to ndjson, a natural split would be along the "top-level associations" that we currently define for project exports (e.g. `issues`, `merge_requests`, etc.; see `import_export.yml` for a full list). In this model, each direct child node of `project` would be written to a separate file, which in turn contains one entity entry per line. This has the benefit that we can follow a familiar structure / schema. Moreover, since we would have one file per top-level association, entity-specific post-processing becomes easier: we could for instance limit or transform the number of merge requests we export without ever running the application, purely by text-processing `merge_requests.ndjson` (see the short sketch after the listing below). An example break-down:
├── [ 5] auto_devops.ndjson
├── [ 31] ci_cd_settings.ndjson
├── [172K] ci_pipelines.ndjson
├── [ 219] container_expiration_policy.ndjson
├── [ 5] error_tracking_setting.ndjson
├── [139K] issues.ndjson
├── [1.2K] labels.ndjson
├── [2.3M] merge_requests.ndjson
├── [ 5] metrics_setting.ndjson
├── [2.7K] milestones.ndjson
├── [ 316] project_feature.ndjson
├── [1.6K] project_members.ndjson
├── [1.3K] project.ndjson
├── [ 745] protected_branches.ndjson
├── [ 5] service_desk_setting.ndjson
└── [ 589] services.ndjson
A challenge with this approach is that the size distribution can be quite uneven, since some relations contain much more data than others.
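To illustrate the text-processing point made above: because every merge request is a self-contained line, trimming an export down to (say) its first 100 merge requests is a plain file operation. A hypothetical one-off in Ruby, using the file names from the listing:

```ruby
# Keep only the first 100 merge requests of an export, without booting
# the application or parsing the whole file into memory.
File.open('merge_requests.trimmed.ndjson', 'w') do |out|
  File.foreach('merge_requests.ndjson').first(100).each { |line| out.write(line) }
end
```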
Expected memory savings
Some early measurements can be found in !23920 (closed).
We expect even large relations not to exceed a few dozen KB in size; since this memory can be released once the relation has been processed, we expect dramatic improvements in memory use. These gains become larger as the project metadata grows. Even with very large individual relations (say 1MB per line), here is how `ndjson` would compare to a DOM-based approach:
| Total JSON metadata | Max JSON (before) | Max JSON (after) | Reduction |
|---|---|---|---|
| 80MB | 80MB | 1MB | 98.75% |
| 1GB | 1GB | 1MB | 99.9% |
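To spell out how the reduction column is computed: it is one minus the ratio of peak memory after to peak memory before, e.g. 1 - 1MB/80MB = 98.75%, and 1 - 1MB/1GB ≈ 99.9%.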
Risks and impediments
A major challenge will be to migrate to the new format without compromising user experience too much, since this constitutes a breaking change: older archives exported under the current logic would not be compatible with `ndjson`. We need to decide to what extent we want to keep supporting the older, less efficient format, or whether we prefer a clean break with instructions for users on how to migrate. One suggestion made to ease this effort was a dedicated tool that customers could use to transform older project archives into the new format.
There might be a time during which we need to support both formats simultaneously, which will temporarily increase complexity in the code base.
Making sure that we do not compromise existing import/export functionality and correctness is another risk. We have already worked towards better automated monitoring & testing of import/export functionality, but large imports especially can break in subtle ways that are difficult to detect.
Finally, since this is a larger effort, we do not expect ~"group::memory" to implement every aspect of it, but rather to prepare everything as much as possible and eventually hand the project over to ~"group::import". However, we have already done a high-level issue breakdown, which is summarized below.
Approach
Below is a proposed breakdown of the work that needs to get done and how it could be split into increments that we can deliver.
Track 0: Preparatory work
- Separate `Group` from `Project` logic (important, in progress)
  - move (dumb) all (almost) code from the `Gitlab::ImportExport` namespace
    - #207846 (closed)
- Drop or rework `RelationRenameService` (important, #207960 (closed))
  - we need to figure out if or how `RelationRenameService` will carry over to `ndjson`, since it duplicates relations it renames, which is not compatible with streaming data
  - we might have to drop this (check with PM)
  - similarly, we have recently dropped support for legacy merge request formats in !25616 (merged)
    - next step: open an MR to remove it
- Introduce a rake task for synchronous exports (optional, in progress)
  - ship an `export.rake` task similar to `import.rake`
    - #207847 (closed)
- Collect exporter metrics in regular intervals (optional)
  - extend our metrics gathering to also include exporting projects (similar to importing)
  - this could be similar to what we did for imports, where results are regularly published via a CI pipeline (TODO: create issue for this)
Track 1: Moving toward ndjson
We can split this track up further to work in parallel on the export and import side of things.
Exporting
MR1: Export via streaming serializer, introduce "Writer" abstraction
- Introduce a streaming serializer as a drop-in replacement for `FastHashSerializer`
- a `Writer` can persist relations in different ways
- it would still produce a "fat JSON", so no ndjson here yet
- there would be no structural changes here yet, it's mostly a refactor
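As a rough sketch of what the `Writer` seam could look like (class and method names are illustrative assumptions, not the actual implementation): the streaming serializer would hand relations to a writer one at a time, and the first writer would still assemble a single fat JSON document.

```ruby
require 'json'

# Illustrative only: receives relations one by one from a (hypothetical)
# streaming serializer instead of one big pre-built hash.
class LegacyWriter
  def initialize(path)
    @path = path
    @tree = {}
  end

  # Called once per top-level relation.
  def write_relation(name, data)
    @tree[name] = data
  end

  # Still produces a single "fat JSON" file, so no ndjson at this stage.
  def close
    File.write(@path, @tree.to_json)
  end
end
```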
MR2: Introduce ndjson writer
- this implements a `Writer` that writes ndjson
- based on a feature flag, it can switch between fat JSON and ndjson, or write both outputs
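A matching sketch of an ndjson-backed writer, selected behind a feature flag (the flag name and class are made up for illustration); it writes one `<relation>.ndjson` file per top-level relation, as described earlier:

```ruby
require 'json'
require 'fileutils'

# Illustrative only: persists each relation to its own <relation>.ndjson file,
# one JSON document per row.
class NdjsonWriter
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(dir)
  end

  def write_relation(name, rows)
    File.open(File.join(@dir, "#{name}.ndjson"), 'w') do |file|
      rows.each { |row| file.puts(row.to_json) }
    end
  end
end

# Switching via feature flag (hypothetical flag name):
# writer = Feature.enabled?(:ndjson_export) ? NdjsonWriter.new(dir) : LegacyWriter.new(path)
```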
MR3: (nice to have) Allow exporting using either ndjson or the legacy format (but not both)
- See also Track 2 (need to clarify with product how to achieve that, since it requires user input)
Importing
MR4: Introduce "Reader" abstraction
- a `Reader` can read JSON files for further processing
- its only implementation would be to read fat JSON files
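A sketch of the `Reader` side (again, names are illustrative only); its first implementation simply loads the existing fat JSON file in full:

```ruby
require 'json'

# Illustrative only: loads the whole legacy export into memory and then
# yields each top-level relation to the import logic.
class LegacyReader
  def initialize(path)
    @path = path
  end

  def each_relation
    tree = JSON.parse(File.read(@path))
    tree.each { |name, data| yield name, data }
  end
end
```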
MR5: Introduce ndjson `Reader`
- this implements the `Reader` that can parse ndjson
- based on a feature flag and/or the file format, it decides which reader to choose
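And an ndjson counterpart that streams rows lazily, so no relation is ever fully materialized. How the reader is chosen (feature flag, archive contents) is still open; the check below is only an assumption:

```ruby
require 'json'

# Illustrative only: exposes each <relation>.ndjson file as a lazy stream of rows.
class NdjsonReader
  def initialize(dir)
    @dir = dir
  end

  def each_relation
    Dir.glob(File.join(@dir, '*.ndjson')).each do |path|
      name = File.basename(path, '.ndjson')
      rows = File.foreach(path).lazy.map { |line| JSON.parse(line) }
      yield name, rows
    end
  end
end

# Hypothetical selection based on what the uploaded archive contains:
# reader = Dir.exist?(ndjson_dir) ? NdjsonReader.new(ndjson_dir) : LegacyReader.new(json_path)
```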
Track 2: Expose ndjson import/export to users
Tackle the smallest possible aspect of #35861 (closed).
I would like us to have a way to indicate, as part of the export request, which "format" of export we want. Maybe it should be two radio buttons under the existing description:
> Export project
> Export this project with all its related data in order to move your project to a new GitLab instance. Once the export is finished, you can import the file from the "New Project" page.
- Use legacy version compatible with GitLab < 12.9 (internally it would create big JSON)
- Export using new version compatible with GitLab >= 12.9 (internally it would create ndjson) (default)
We would continue importing legacy/big JSON until 13.0; with 13.1 we would remove support for exporting/importing legacy JSON.
Intended users
Links / references
POC MR: !23920 (closed)