De-dup project tree entries (!22598) · Merge requests · GitLab.org / GitLab

Matthias Käppler requested to merge 27070-dedup-import-json into master Jan 08, 2020

What does this MR do?

Note: This change introduces a feature flag

name: dedup_project_import_metadata
default: off

The proposal is to make better use of heap memory during imports by removing duplicate entries from the project metadata tree, thereby shrinking it.

Implementation-wise, I introduced a new collaborator, ProjectTreeProcessor, which ProjectTreeRestorer delegates to before handing the relation tree off to RelationTreeRestorer. This makes it easy to swap out implementations, e.g. for testing and comparison; to that end I introduced a no-op IdentityProjectTreeProcessor, which leaves the tree untouched.
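To illustrate the delegation described above, here is a minimal sketch. The `process` method name, the constructor arguments, and the `TreeRestorerSketch` class are assumptions made for illustration; they are not taken from the MR's actual code, and a lambda stands in for RelationTreeRestorer.

```ruby
# Illustrative only: the `process` method name and the constructor arguments
# are assumptions, not the MR's actual interface.
class IdentityProjectTreeProcessor
  # No-op processor: returns the very same tree object, untouched.
  def process(tree_hash)
    tree_hash
  end
end

# Hypothetical stand-in for ProjectTreeRestorer, reduced to the delegation step.
class TreeRestorerSketch
  def initialize(processor:, relation_restorer:)
    @processor = processor
    @relation_restorer = relation_restorer
  end

  def restore(tree_hash)
    # Run the processor first, then hand the result on to the next stage.
    @relation_restorer.call(@processor.process(tree_hash))
  end
end

tree = { 'issues' => [{ 'title' => 'a' }, { 'title' => 'a' }] }

restorer = TreeRestorerSketch.new(
  processor: IdentityProjectTreeProcessor.new,
  relation_restorer: ->(processed) { processed } # stand-in for RelationTreeRestorer
)

result = restorer.restore(tree)
```

Because the processor is injected, a de-duplicating implementation and the identity implementation can be compared against the same input without touching the restorer.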

Preliminary results are not good: overall memory usage appears to have gone up, contrary to the original hypothesis:

Before: [memory profile image]

After: [memory profile image]

The profile shows that the original tree has not shrunk at all; instead, an additional 84M is allocated in dedup_hash.

I suspect this is because, in addition to the original tree, we're building up a new hash that takes care of all the bookkeeping around which nodes have been visited before. More measurements are required to see whether this can account for the discrepancy.
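The bookkeeping cost can be made concrete with a small sketch. This is not the MR's dedup_hash implementation; `dedup_tree` and `seen` are hypothetical names, and the approach (replace every node with the first equal node encountered) is only one plausible way to de-duplicate a tree.

```ruby
# Hypothetical sketch of the de-dup idea; not the MR's dedup_hash method.
# Walks the tree bottom-up and replaces each node with the first equal node
# that was seen earlier. `seen` is the bookkeeping hash.
def dedup_tree(node, seen = {})
  case node
  when Hash
    node.transform_values! { |value| dedup_tree(value, seen) }
  when Array
    node.map! { |value| dedup_tree(value, seen) }
  end

  # Look up an equal node; if there is none, register this one as canonical.
  seen[node] ||= node
end

tree = {
  'issues' => [
    { 'author' => { 'name' => 'alice' }, 'state' => 'opened' },
    { 'author' => { 'name' => 'alice' }, 'state' => 'opened' }
  ]
}

seen = {}
deduped = dedup_tree(tree, seen)
first_issue, second_issue = deduped['issues']
```

After the call, both array slots point at the same Hash instance, so the duplicate can be garbage collected. The `seen` hash, however, keeps one entry per distinct node for as long as it is alive, which is the kind of extra allocation suspected above.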

The dedup_hash method also adds roughly 7-12 seconds of runtime (I've seen this swing quite wildly between runs), because it needs to traverse a large tree in its entirety.

UPDATE: the above results refer to the original proposal. I found a faster solution, but it also performs worse than not de-duping at all, as explained in this comment.

Does this MR meet the acceptance criteria?

Conformity

[-] Changelog entry: will be added in the follow-up issue that removes the feature flag.
[-] Documentation (if required)
Code review guidelines
[-] Merge request performance guidelines
Style guides
[-] Database guides
[-] Separation of EE specific content

Availability and Testing

unit tests
run manual tests against gitlabhq project.json (this takes 14s to run, so I didn't make it a unit test)
run full gitlabhq import locally and compare results
run gitlabhq import against branch-specific GL instance
add feature-toggle
only apply optimization if project tree >= 500MB
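The last two items (feature toggle plus size threshold) could be combined into a single guard, sketched below. The method name, the constant, and the way the tree size is measured are all assumptions; in particular, "500MB" is read here as 500 MiB, and the flag state is passed in as a plain boolean rather than looked up through GitLab's feature-flag API.

```ruby
# Hypothetical guard; names and the byte interpretation of "500MB" are assumptions.
DEDUP_MIN_TREE_BYTES = 500 * 1024 * 1024

# Returns true only when the dedup_project_import_metadata flag is on AND the
# project tree is at least as large as the threshold.
def dedup_enabled?(tree_bytes:, flag_enabled:)
  flag_enabled && tree_bytes >= DEDUP_MIN_TREE_BYTES
end

small_tree = dedup_enabled?(tree_bytes: 10 * 1024 * 1024, flag_enabled: true)
large_tree = dedup_enabled?(tree_bytes: 600 * 1024 * 1024, flag_enabled: true)
flag_off   = dedup_enabled?(tree_bytes: 600 * 1024 * 1024, flag_enabled: false)
```

With a guard like this, small imports would skip the extra traversal entirely and only large trees would pay for it.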

Edited May 31, 2022 by 🤖 GitLab Bot 🤖
