Skip to content

Cache external MR diffs on disk during Project Export to speed up export process

What does this MR do and why?

👀 What

Improves performance of externally stored merge request diffs export during Project Export in order to speed the overall export process. It caches merge request diffs in local temp storage and reuses locally stored diff in order to perform 'diff files' export, by reading from file on disk, instead of making a network request every time.

It downloads the entire diff file and places it under 'project-%{project_id}-external-mr-%{mr_id}-diff-%{id}-cache'.

After a merge request is done exporting, we remove the tmp file and folder, keeping disk usage in check.

🤔 Why

As described in #227653 (closed) externally stored diffs take a very long time to export due to network overhead and the way we fetch diffs from object storage.

A test project on staging (https://staging.gitlab.com/gk-import-tests/gitlabhq/) with ~4k MRs takes ~60 minutes to export vs 80 seconds on local file system.

385 diffs example

Staging

Benchmark.realtime do
  MergeRequestDiff.find(8184405).merge_request_diff_files.last(385).each do |f|
    f.utf8_diff
  end
end

=> 18.50129810348153 

Local file system

Benchmark.realtime do
  MergeRequestDiff.find(8184405).merge_request_diff_files.last(385).each do |f|
    f.utf8_diff
  end
end

=> 0.014228000305593014

A 0.4 Mb diff file which takes to read on its own 0.03 seconds is being read 385 times with a different content range every time, ending up taking 385 * 0.03 (11 seconds) not counting serialization, writing json to file and other operations that happen during export. Each diff retrieval take ~0.1 seconds on average using external storage (based on my testing).

This is a nice solution for large diffs that are stored in object storage, but not very efficient when performing Project Export. 18 seconds for a 0.4Mb file is not the best performance.

Profiling diff fetch from #227653 (comment 995761725) shows that most of the time spent is during network request:

Total: 0.228454
Sort by: total_time

 %self      total      self      wait     child     calls  name                           location
  0.00      0.228     0.000     0.000     0.228        1   Object#irb_binding             (irb):260
  0.01      0.228     0.000     0.000     0.228        1   MergeRequestDiffFile#diff_export (irb):60
  0.01      0.228     0.000     0.000     0.228        1   MergeRequestDiff#cached_external_diff (irb):4
  0.00      0.228     0.000     0.000     0.228        1   MergeRequestDiff#cache_external_diff (irb):19
  0.00      0.227     0.000     0.000     0.227        1   MergeRequestDiff#opening_external_diff /opt/gitlab/embedded/service/gitlab-rails/app/models/merge_request_diff.rb:489
  0.01      0.227     0.000     0.000     0.227        1   GitlabUploader#open            /opt/gitlab/embedded/service/gitlab-rails/app/uploaders/gitlab_uploader.rb:100
  0.02      0.219     0.000     0.000     0.219        3  *<Class::IO>#open               
  0.01      0.219     0.000     0.000     0.219        1   Gitlab::HttpIO#read            /opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/http_io.rb:78
  0.01      0.219     0.000     0.000     0.219        1   Gitlab::HttpIO#get_chunk       /opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/http_io.rb:150
  0.01      0.219     0.000     0.000     0.219        1   <Class::Net::HTTP>#start       /opt/gitlab/embedded/lib/ruby/2.7.0/net/http.rb:587
  0.01      0.219     0.000     0.000     0.218        1   Net::HTTP#start                /opt/gitlab/embedded/lib/ruby/2.7.0/net/http.rb:928
  0.01      0.213     0.000     0.000     0.213        1   Sentry::Net::HTTP#request      /opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/sentry-ruby-core-5.1.1/lib/sentry/net/http.rb:28
  0.00      0.213     0.000     0.000     0.213        1   Net::HTTP#request              /opt/gitlab/embedded/lib/ruby/2.7.0/net/http.rb:1481
  0.01      0.213     0.000     0.000     0.213        1   Net::HTTP#transport_request    /opt/gitlab/embedded/lib/ruby/2.7.0/net/http.rb:1515

📓 Additional notes

  1. It's possible to cache the file using Carrierwave with it's external_diff.cache_stored_file! which will download it under uploader's cache_dir (which is likely to be shared/external-diffs/tmp/cache/.... However, the caching is only visible per-object basis (meaning sometihng like MergeRequestDiff.find(...).external_diff.cached? will return nil.
  2. It's possible to retrieve_from_cache! providing the string id that you get when calling cache_stored_file!, but in order to do that we'd have to save the cache id somewhere during export (e.g. redis) and reusing it for each of the diff files. Current solution was less effort.

🔬 Testing results

Env: Staging Data set: 3500 externally stored diffs

Before After
175 seconds 33 seconds

Screenshots or screen recordings

These are strongly recommended to assist reviewers and reduce the time to merge your change.

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by George Koltsov

Merge request reports

Loading