Cache external MR diffs on disk during Project Export to speed up export process
What does this MR do and why?
👀 What
Improves performance of externally stored merge request diffs export during Project Export in order to speed the overall export process. It caches merge request diffs in local temp storage and reuses locally stored diff in order to perform 'diff files' export, by reading from file on disk, instead of making a network request every time.
It downloads the entire diff file and places it under 'project-%{project_id}-external-mr-%{mr_id}-diff-%{id}-cache'.
After a merge request is done exporting, we remove the tmp file and folder, keeping disk usage in check.
🤔 Why
As described in #227653 (closed) externally stored diffs take a very long time to export due to network overhead and the way we fetch diffs from object storage.
A test project on staging (https://staging.gitlab.com/gk-import-tests/gitlabhq/) with ~4k MRs takes ~60 minutes to export vs 80 seconds on local file system.
385 diffs example
Staging
Benchmark.realtime do
MergeRequestDiff.find(8184405).merge_request_diff_files.last(385).each do |f|
f.utf8_diff
end
end
=> 18.50129810348153
Local file system
Benchmark.realtime do
MergeRequestDiff.find(8184405).merge_request_diff_files.last(385).each do |f|
f.utf8_diff
end
end
=> 0.014228000305593014
A 0.4 Mb diff file which takes to read on its own 0.03 seconds is being read 385 times with a different content range every time, ending up taking 385 * 0.03 (11 seconds) not counting serialization, writing json to file and other operations that happen during export. Each diff retrieval take ~0.1 seconds on average using external storage (based on my testing).
This is a nice solution for large diffs that are stored in object storage, but not very efficient when performing Project Export. 18 seconds for a 0.4Mb file is not the best performance.
Profiling diff fetch from #227653 (comment 995761725) shows that most of the time spent is during network request:
Total: 0.228454
Sort by: total_time
%self total self wait child calls name location
0.00 0.228 0.000 0.000 0.228 1 Object#irb_binding (irb):260
0.01 0.228 0.000 0.000 0.228 1 MergeRequestDiffFile#diff_export (irb):60
0.01 0.228 0.000 0.000 0.228 1 MergeRequestDiff#cached_external_diff (irb):4
0.00 0.228 0.000 0.000 0.228 1 MergeRequestDiff#cache_external_diff (irb):19
0.00 0.227 0.000 0.000 0.227 1 MergeRequestDiff#opening_external_diff /opt/gitlab/embedded/service/gitlab-rails/app/models/merge_request_diff.rb:489
0.01 0.227 0.000 0.000 0.227 1 GitlabUploader#open /opt/gitlab/embedded/service/gitlab-rails/app/uploaders/gitlab_uploader.rb:100
0.02 0.219 0.000 0.000 0.219 3 *<Class::IO>#open
0.01 0.219 0.000 0.000 0.219 1 Gitlab::HttpIO#read /opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/http_io.rb:78
0.01 0.219 0.000 0.000 0.219 1 Gitlab::HttpIO#get_chunk /opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/http_io.rb:150
0.01 0.219 0.000 0.000 0.219 1 <Class::Net::HTTP>#start /opt/gitlab/embedded/lib/ruby/2.7.0/net/http.rb:587
0.01 0.219 0.000 0.000 0.218 1 Net::HTTP#start /opt/gitlab/embedded/lib/ruby/2.7.0/net/http.rb:928
0.01 0.213 0.000 0.000 0.213 1 Sentry::Net::HTTP#request /opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/sentry-ruby-core-5.1.1/lib/sentry/net/http.rb:28
0.00 0.213 0.000 0.000 0.213 1 Net::HTTP#request /opt/gitlab/embedded/lib/ruby/2.7.0/net/http.rb:1481
0.01 0.213 0.000 0.000 0.213 1 Net::HTTP#transport_request /opt/gitlab/embedded/lib/ruby/2.7.0/net/http.rb:1515
📓 Additional notes
- It's possible to cache the file using Carrierwave with it's
external_diff.cache_stored_file!
which will download it under uploader'scache_dir
(which is likely to beshared/external-diffs/tmp/cache/...
. However, the caching is only visible per-object basis (meaning sometihng likeMergeRequestDiff.find(...).external_diff.cached?
will returnnil
. - It's possible to
retrieve_from_cache!
providing the string id that you get when callingcache_stored_file!
, but in order to do that we'd have to save the cache id somewhere during export (e.g. redis) and reusing it for each of the diff files. Current solution was less effort.
🔬 Testing results
Env: Staging Data set: 3500 externally stored diffs
Before | After |
---|---|
175 seconds | 33 seconds |
Screenshots or screen recordings
These are strongly recommended to assist reviewers and reduce the time to merge your change.
How to set up and validate locally
Numbered steps to set up and validate the change are strongly suggested.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.