De-dup project tree entries
What does this MR do?
Closes #27070 (closed)
Note: This change introduces a feature flag
- name: `dedup_project_import_metadata`
- default: off
The proposal is to reduce heap memory usage during imports by removing duplicate entries from the project metadata tree, thus shrinking it.
Implementation-wise, I introduced a new collaborator, `ProjectTreeProcessor`, which the `ProjectTreeRestorer` delegates to before handing the relation tree off to `RelationTreeRestorer`. This makes it easy to swap out implementations, e.g. for testing and comparison (to that end I introduced a no-op `IdentityProjectTreeProcessor`, which leaves the tree untouched).
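A minimal sketch of how this delegation could look. The class names follow the description above, but the method names (`process`, `restore`) and the `processor:` keyword argument are assumptions made for illustration, not the actual code in this MR.

```ruby
# Hypothetical sketch: class names come from the MR description,
# method names and signatures are assumed.
class IdentityProjectTreeProcessor
  # No-op processor: returns the tree untouched.
  def process(tree_hash)
    tree_hash
  end
end

class ProjectTreeRestorer
  # Any object responding to #process can be injected, which is what
  # makes implementations easy to swap for testing and comparison.
  def initialize(tree_hash, processor: IdentityProjectTreeProcessor.new)
    @tree_hash = tree_hash
    @processor = processor
  end

  def restore
    processed_tree = @processor.process(@tree_hash)
    # In the real code the processed tree would now be handed off to
    # RelationTreeRestorer; here it is simply returned.
    processed_tree
  end
end
```

With the identity processor injected, `restore` returns the very same object it was given, which gives a baseline to compare a de-duplicating processor against.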
Preliminary results are not good: overall memory usage appears to have gone up, contrary to the original hypothesis:
Before
After
It is evident from that profile that the original tree has not shrunk at all; rather, an additional 84M are being allocated in `dedup_hash`.
I suspect this is because, in addition to the original tree, we're building up a new hash that takes care of all the bookkeeping around which nodes have been visited before. More measurements are required to see whether this can account for the discrepancy.
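One cheap way to take such a measurement is to count object allocations around a block of code. The helper below is only a sketch; the name `allocated_objects` is hypothetical, and it counts allocated objects rather than bytes, so it complements a memory profile rather than replacing it.

```ruby
# Hypothetical helper: returns how many objects were allocated while
# the given block ran, using Ruby's built-in GC statistics.
def allocated_objects
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end
```

Wrapping the de-duplication step in `allocated_objects { ... }` and comparing the count against a run with the no-op processor would show how much of the extra allocation comes from the bookkeeping itself.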
The `dedup_hash` method also adds roughly 7-12 seconds of additional runtime (I've seen this swing quite wildly between multiple runs), because it needs to traverse a large tree in its entirety.
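To illustrate where both costs come from, here is a minimal sketch of structural de-duplication. It is not the `dedup_hash` implementation from this MR; the signature and the `seen` bookkeeping hash are assumptions. It visits every node once and keeps a reference to each distinct value in `seen`, so the bookkeeping grows with the number of distinct nodes in the tree.

```ruby
# Hypothetical sketch, not the MR's actual dedup_hash.
# Replaces structurally equal nodes with a single shared instance.
def dedup_hash(node, seen = {})
  case node
  when Hash
    # De-duplicate children first so equal subtrees share their contents.
    node.transform_values! { |value| dedup_hash(value, seen) }
  when Array
    node.map! { |value| dedup_hash(value, seen) }
  end
  # `seen` maps each distinct value to its first occurrence. Note that
  # Ruby stores a frozen copy of every unfrozen String used as a key,
  # which is one source of additional allocations.
  seen[node] ||= node
end
```

Every node has to be hashed and looked up in `seen`, which is why the full traversal is expensive on a large tree, and why the savings from shared nodes have to outweigh the size of `seen` for the approach to pay off.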
UPDATE: the above results refer to the original proposal; I found a faster solution, but it also performs worse than not doing any de-duping at all, as explained in this comment
Does this MR meet the acceptance criteria?
Conformity
- [-] Changelog entry: will add in a follow-up issue that removes the feature flag.
- [-] Documentation (if required)
- Code review guidelines
- [-] Merge request performance guidelines
- Style guides
- [-] Database guides
- [-] Separation of EE specific content
Availability and Testing
- unit tests
- run manual tests against gitlabhq `project.json` (this takes 14s to run, so I didn't make it a unit test)
- run full gitlabhq import locally and compare results
- run gitlabhq import against branch-specific GL instance
- add feature-toggle
- only apply optimization if project tree >= 500MB
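The last two items could be combined into a single guard, sketched below. The constant and method names are hypothetical, 500MB is interpreted as 500 MiB, and how the tree size is obtained (for example from the size of `project.json` on disk) is left open, since the description does not specify it.

```ruby
# Hypothetical guard; names are illustrative and not taken from the MR.
# Assumes "500MB" means 500 MiB.
DEDUP_THRESHOLD_BYTES = 500 * 1024 * 1024

# flag_enabled stands in for the state of the
# dedup_project_import_metadata feature flag.
def dedup_applicable?(tree_size_bytes, flag_enabled:)
  flag_enabled && tree_size_bytes >= DEDUP_THRESHOLD_BYTES
end
```

Keeping the check in one predicate means the caller can fall back to the no-op processor whenever it returns false.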