WIP: Implement `ndjson` support for `import/export`
What does this MR do?
Implement `ndjson` support for import/export.
This implements `ndjson` and streaming JSON support to handle two cases:
- big `project.json` (legacy way)
- new `.ndjson` format, where each relation receives a separate file, and each item is stored per line
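To illustrate the new layout, here is a minimal sketch of writing and reading one relation as `.ndjson`. The helper names `write_relation` and `each_relation_item` are hypothetical and are not the classes used in this MR; the sketch only shows that each line is an independent JSON document, so a reader never needs the whole file in memory.

```ruby
require "json"
require "tmpdir"

# Hypothetical helper: write one relation to its own .ndjson file,
# one JSON document per line.
def write_relation(dir, relation, items)
  path = File.join(dir, "#{relation}.ndjson")
  File.open(path, "w") do |file|
    items.each { |item| file.puts(JSON.generate(item)) }
  end
  path
end

# Hypothetical helper: yield one parsed item per line. File.foreach reads
# the file line by line, so only a single item is parsed at a time.
def each_relation_item(path)
  return enum_for(:each_relation_item, path) unless block_given?

  File.foreach(path) do |line|
    yield JSON.parse(line) unless line.strip.empty?
  end
end

Dir.mktmpdir do |dir|
  issues = [{ "iid" => 1, "title" => "First" }, { "iid" => 2, "title" => "Second" }]
  path = write_relation(dir, "issues", issues)
  each_relation_item(path) { |issue| puts issue["title"] }
end
```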
This properly detects both old and new file contents without any changes to the files, so backward compatibility is maintained.
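One way such detection could work is sketched below. This is an assumption for illustration only: it decides the format from which files are present in the extracted archive, and the method name `detect_export_format` is hypothetical. The actual detection in this MR may differ.

```ruby
require "tmpdir"

# Assumption: an export directory is treated as ndjson when it contains any
# *.ndjson file, and as legacy when it only contains project.json.
def detect_export_format(dir)
  return :ndjson if Dir.glob(File.join(dir, "**", "*.ndjson")).any?
  return :legacy if File.exist?(File.join(dir, "project.json"))

  raise ArgumentError, "no known export format found in #{dir}"
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, "project.json"), "{}")
  puts detect_export_format(dir)
end
```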
This implements a trick that lets the streaming JSON writer append data incrementally.
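The general idea of appending to a JSON document incrementally can be sketched as follows. This is not the writer from this MR: `StreamingJsonWriter` and its methods are hypothetical, and the sketch assumes that all elements of one array are appended contiguously.

```ruby
require "json"
require "stringio"

# Hypothetical writer: emits one JSON object whose array values are appended
# element by element, so the full document is never built in memory.
# Assumption: elements of the same array are appended one after another.
class StreamingJsonWriter
  def initialize(io)
    @io = io
    @first_key = true
    @open_array = nil
    @io.write("{")
  end

  # Write a complete value under `key`.
  def write_attribute(key, value)
    close_array
    write_key(key)
    @io.write(JSON.generate(value))
  end

  # Append a single element to the array stored under `key`.
  def append(key, item)
    if @open_array == key
      @io.write(",")
    else
      close_array
      write_key(key)
      @io.write("[")
      @open_array = key
    end
    @io.write(JSON.generate(item))
  end

  # Close any open array and the enclosing object.
  def close
    close_array
    @io.write("}")
  end

  private

  def write_key(key)
    @io.write(",") unless @first_key
    @first_key = false
    @io.write("#{JSON.generate(key.to_s)}:")
  end

  def close_array
    return unless @open_array

    @io.write("]")
    @open_array = nil
  end
end

io = StringIO.new
writer = StreamingJsonWriter.new(io)
writer.write_attribute(:name, "demo")
writer.append(:issues, { "iid" => 1 })
writer.append(:issues, { "iid" => 2 })
writer.close
puts io.string
```

Because each call writes straight to the IO object, memory use depends on the size of a single item rather than on the size of the whole export.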
Overall, when exporting legacy/`ndjson` or importing `ndjson`, this gives the process constant memory usage and also significantly reduces the latency of data processing, because it does not escape to native code.
This does remove the usage of `RelationFactory` on the exporting side. I believe it is an OK trade-off to make.
Performance
Keep in mind that the idle memory usage of GitLab is around 500 MB.
1. The `master` branch: `git checkout b213471f`
1.1. Import on `master`: `IMPORT_DEBUG=1 bin/rake gitlab:import_export:import[root,root,gitlabhq-with-issues-4,tmp/exports/gitlabhq_with_issues_export_ndjson_v2.tar.gz]`
Time to finish: 1260.872610532002
Number of SQL calls: 147407
Memory usage: 890.62109375 MiB
GC calls: 2718
GC major calls: 55
Label: process_345
1.2. Export on `master`: `IMPORT_DEBUG=1 bin/rake gitlab:import_export:export[root,root,gitlabhq-with-issues-3,tmp/exports/gitlabhq_with_issues_export_legacy_v2.tar.gz]`
Time to finish: 97.66875144900041
Number of SQL calls: 4006
Memory usage: 761.77734375 MiB
GC calls: 199
GC major calls: 26
Label: process_309
pid="process_110"
2. The `implement-ndjson` branch: `git checkout dbcec49a`
2.1. Import on `implement-ndjson`: `IMPORT_DEBUG=1 bin/rake gitlab:import_export:import[root,root,gitlabhq-with-issues-5,tmp/exports/gitlabhq_with_issues_export_ndjson_v2.tar.gz]`
Time to finish: 1207.3776693220025
Number of SQL calls: 147418
Memory usage: 671.0703125 MiB
GC calls: 2737
GC major calls: 42
Label: process_378
2.2. Export on `implement-ndjson`: `IMPORT_DEBUG=1 bin/rake gitlab:import_export:export[root,root,gitlabhq-with-issues-3,tmp/exports/gitlabhq_with_issues_export_ndjson_v2.tar.gz]`
Time to finish: 102.0661853370002
Number of SQL calls: 4006
Memory usage: 564.35546875 MiB
GC calls: 199
GC major calls: 23
Label: process_280
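For a quick comparison, the memory figures reported above can be turned into deltas with a few lines of Ruby. The values are copied from the runs above; the `reduction` helper and the constant name are only illustrative.

```ruby
# Memory usage (MiB) copied from the four runs above.
MEMORY_MIB = {
  import: { master: 890.62109375, ndjson: 671.0703125 },
  export: { master: 761.77734375, ndjson: 564.35546875 }
}.freeze

# Returns the absolute saving in MiB and the relative saving in percent.
def reduction(before, after)
  saved = before - after
  [saved, 100.0 * saved / before]
end

MEMORY_MIB.each do |task, usage|
  saved, percent = reduction(usage[:master], usage[:ndjson])
  puts format("%-6s %.1f MiB less (%.1f%%)", task, saved, percent)
end
```

On these numbers the branch uses roughly 220 MiB less on import and roughly 197 MiB less on export, which is about a quarter less in both cases.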
Does this MR meet the acceptance criteria?
Conformity
- Changelog entry
- Documentation (if required)
- Code review guidelines
- Merge request performance guidelines
- Style guides
- Database guides
- Separation of EE specific content
Availability and Testing
- Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- Tested in all supported browsers
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
- Label as security and @ mention @gitlab-com/gl-security/appsec
- The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- Security reports checked/validated by a reviewer from the AppSec team