De-duplicate project tree on import
UPD: The refactoring part (required to reduce the complexity and prepare for the actual change): #33226 (closed)
It seems that we have very high data duplication when importing projects.
Doing some extensive tests on the gitlabhq.json
export and trying to understand the memory cost of import, we found a few interesting tricks that can make it significantly more performant:
- Loading JSON is expensive and consumes a lot of memory,
- Raw JSON weighs around 70MB,
- Loading JSON takes a significant amount of time:
```ruby
Benchmark.measure { json = JSON.load(File.read('./tmp/exports/gitlabhq/project.json')) }
=> #<Benchmark::Tms:0x000055f3be8daa38
 @cstime=0.0,
 @cutime=0.0,
 @label="",
 @real=3.4529981750001753,
 @stime=0.13202700000000012,
 @total=3.4529029999999983,
 @utime=3.3208759999999984>
```
- Parsed JSON consumes over 200MB of memory:

```ruby
Gitlab::Utils::DeepSize.new(json, max_size: 10000000000).size
=> 215655327
```
- We can de-duplicate the JSON (strings/hashes/arrays):

```ruby
# Recursively de-duplicate a parsed JSON structure: equal strings, arrays and
# hashes are collapsed into a single shared instance, tracked in `map`.
def dedup_hash(item, map = {})
  return map[item] if map.key?(item)

  new_item =
    case item
    when String
      item
    when Array
      item.map { |a| dedup_hash(a, map) }
    when Hash
      item.map { |k, v| [dedup_hash(k, map), dedup_hash(v, map)] }.to_h
    else
      item
    end

  map[item] = new_item
  new_item
end
```
```ruby
Gitlab::Utils::DeepSize.new(json_dedup, max_size: 10000000000).size
=> 60687921
```
```ruby
Benchmark.measure { dedup_hash(json) }
=> #<Benchmark::Tms:0x000055f3bfe46098
 @cstime=0.0,
 @cutime=0.0,
 @label="",
 @real=7.577473183000166,
 @stime=0.016188999999999787,
 @total=7.577327999999997,
 @utime=7.561138999999997>
```
The cost might seem high, but we actually free about two-thirds of the memory used by the JSON (215MB → 61MB).
- As a side effect of de-duplication, we can now use the `object_id` of hash entries when expanding relations to cache relation objects. This becomes super cheap and allows us to remove a ton of redundant, expensive operations that find/create relations by querying the database and instantiating objects (see the sketch after this list).
- This side effect, which costs around 7s of de-duplication time, should allow us to reduce the import time of the database structure by something like 50%, and to reduce the memory footprint to an almost constant value (around 100MB in this case) instead of the current memory balloon, apart from the cost of holding the de-duplicated JSON structure in memory.
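A minimal, self-contained sketch of what that `object_id`-based caching could look like, reusing the `dedup_hash` from above (the `Relation`/`build_relation` names are illustrative, not the importer's actual API):

```ruby
# Because dedup_hash collapses equal sub-hashes into one shared object, the
# object_id of a relation hash is a stable cache key: the expensive
# find-or-create work runs once per distinct hash, every repeat is a cache hit.
Relation = Struct.new(:attributes)

relation_cache = {}

build_relation = lambda do |relation_hash|
  relation_cache[relation_hash.object_id] ||= begin
    # Stand-in for the real work: DB lookup / insert + object instantiation.
    Relation.new(relation_hash)
  end
end

label = { 'title' => 'bug', 'color' => '#ff0000' }
tree  = dedup_hash('issues' => [{ 'labels' => [label] }, { 'labels' => [label.dup] }])

first  = build_relation.call(tree['issues'][0]['labels'][0])
second = build_relation.call(tree['issues'][1]['labels'][0])
first.equal?(second) # => true, the second call never reaches the expensive work
```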
Proposal
- De-duplicate the import hash,
- When importing objects, rewrite the hash,
- Since the hash was de-duplicated, the rewritten hash will reuse already created objects (see the sketch below).
This should allow us to:
- Reduce the memory footprint of the JSON hash by 2/3 for big projects, at very little upfront cost,
- Reduce the number of SQL queries and object creations, since objects that were already created will not be re-created,
- Make import run in almost constant memory, except for a few objects (labels and milestones),
- Overall, reduce import time by 50-70%.
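A hedged sketch of that rewrite step, assuming a de-duplicated tree as input (`rewrite_tree` and `build_object` are invented names, not the actual importer code): repeated sub-hashes share an `object_id`, so the first visit imports the relation and every later occurrence is rewritten to the already-created object.

```ruby
def rewrite_tree(node, built = {})
  case node
  when Array
    node.map! { |child| rewrite_tree(child, built) }
  when Hash
    # A repeated (de-duplicated) sub-hash is the very same object, so its
    # object_id tells us it was already imported.
    return built[node.object_id] if built.key?(node.object_id)

    node.keys.each { |key| node[key] = rewrite_tree(node[key], built) }
    built[node.object_id] = build_object(node) # placeholder: create/find the record once
  else
    node
  end
end
```

On a tree that was not de-duplicated the memoisation would simply never hit, which is why the de-duplication pass has to happen first to get the reuse benefit.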
Final measurements (post-MR)
(mk: Reposting some findings that appeared in comments in the MR here, in the spirit of keeping the issue the source of truth)
Observation 1
I reran some tests with a 2.5GB project JSON, and the results are definitely a lot better. I do remember though that this particular project tree (the one from a recent incident) was very oddly distributed in terms of data, so I'm not sure how representative this is.
As before, I only measured the resources spent on processing the project tree, not the relation restorer that follows it:
No-dedup:
- Peak RSS: 8,689,944 kB (~8.69GB)
- Final RSS: 8,457,200 kB (~8.46GB)
- Duration: 1m37s

Dedup'ed:
- Peak RSS: 8,734,700 kB (~8.73GB)
So while runtime has more than doubled and peak RSS is still slightly higher, just before going into relation processing we see an improvement of ~2.3GB in overall memory use. This will be sustained until the job ends (though memory will increase again due to the work that follows).
Given how long a job like this would normally run, the increase in duration isn't all that significant IMHO. >2GB less memory used is pretty significant though.
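For reference, the issue doesn't spell out how these numbers were captured; one way to obtain comparable Peak/Final RSS figures on Linux is to read the process's `/proc` status (illustrative helper, not part of the importer):

```ruby
def rss_kb
  status = File.read('/proc/self/status')
  {
    peak_kb:    status[/^VmHWM:\s+(\d+)/, 1].to_i,  # RSS high-water mark ("Peak RSS")
    current_kb: status[/^VmRSS:\s+(\d+)/, 1].to_i   # current RSS ("Final RSS" when read at the end)
  }
end

# ... run the project tree processing ...
stats = rss_kb
puts "Peak RSS: #{stats[:peak_kb]} kB, Final RSS: #{stats[:current_kb]} kB"
```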
Observation 2
So the current trade-off is:
- Without optimization: We're optimizing for the bulk of the curve. Slightly faster & more memory friendly for smaller imports (the majority), but a large price is paid for larger imports (the minority/outliers/the "tail end")
- With optimization: We're optimizing for the tail of the curve. Slightly slower & less memory friendly for smaller imports (the majority) but vastly more memory efficient for larger imports (the minority)
What's more important to us? Also, do we have any data on project import sizes anywhere to understand what percentage of imports would actually benefit from this change?
(mk: we ended up addressing this problem by moving the optimization behind a file size check and only applying it to files > 500MB)
Observation 3
Did some more measurements here with de-duping enabled, now for different project tree sizes, to determine a cut-off point:
| project.json size | Peak RSS | Final RSS | Improvement |
|---|---|---|---|
| 417MB | 1.98GB | 1.98GB | 0% |
| 839MB | 3.42GB | 2.68GB | ~21% |
| 1.2GB | 4.90GB | 3.89GB | ~20% |
These numbers suggest that the optimization is only useful for project metadata trees larger than ~500MB. This is still a bit finger-in-the-air, since it likely also depends on the distribution of data within the tree, but the ballpark should be correct.
Conclusion: if we want the optimization to apply automatically, only do it for project JSON that exceeds 500MB.
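A rough sketch of such an automatic gate, assuming the `dedup_hash` from above (the constant and method names here are made up, not the shipped implementation):

```ruby
require 'json'

DEDUP_THRESHOLD_BYTES = 500 * 1024 * 1024 # ~500MB cut-off suggested by the measurements above

def read_project_tree(path)
  json = JSON.parse(File.read(path))

  if File.size(path) > DEDUP_THRESHOLD_BYTES
    dedup_hash(json) # large trees: the extra CPU time buys a big memory win
  else
    json             # small/medium trees: de-duplication is not worth the overhead
  end
end
```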