Add rubyzip refinement to speed up entry count
What does this MR do and why?
In https://gitlab.com/gitlab-org/gitlab/-/issues/345673 we identified the need to provide a faster, more efficient way to count entries in a zip file.
This MR adds a small refinement to Zip::File (part of the rubyzip gem) that does just that. It operates by reading only the End of Central Directory (EOCD) record, which contains an integer field with the entry count.
For a zip file with 1M entries, it improves performance by several orders of magnitude: reading the EOCD takes constant CPU time and memory, whereas iterating over the Central Directory entries is O(N) in both.
More details, including timings, are in https://gitlab.com/gitlab-org/gitlab/-/issues/345673#note_733321374
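To illustrate why reading the EOCD is constant-time, here is a minimal standalone sketch in plain Ruby. It is not the refinement added by this MR and does not use rubyzip; the helper name `eocd_entry_count` and the constants are hypothetical and exist only for this example. It assumes a seekable IO opened in binary mode, relies on the record layout from the zip format specification, and does not guard against an archive comment that happens to contain the EOCD signature bytes.

```ruby
# Standalone sketch, not the refinement added by this MR: count zip entries by
# reading only the End of Central Directory (EOCD) record.

EOCD_SIGNATURE          = [0x06054b50].pack('V') # "PK\x05\x06"
ZIP64_LOCATOR_SIGNATURE = [0x07064b50].pack('V') # "PK\x06\x07"
ZIP64_EOCD_SIGNATURE    = [0x06064b50].pack('V') # "PK\x06\x06"
EOCD_MIN_SIZE      = 22     # fixed-size part of the EOCD record
ZIP64_LOCATOR_SIZE = 20     # size of the Zip64 EOCD locator
MAX_COMMENT_SIZE   = 0xFFFF # the comment length is a 16-bit field

# Hypothetical helper: returns the entry count recorded in the archive.
# `io` must be a seekable IO opened in binary mode.
def eocd_entry_count(io)
  io.seek(0, IO::SEEK_END)
  size = io.pos
  raise ArgumentError, 'too small to be a zip archive' if size < EOCD_MIN_SIZE

  # The EOCD is the last record in the file, followed only by an optional
  # comment, so the last 22 + 65535 bytes always contain it.
  tail_size = [size, EOCD_MIN_SIZE + MAX_COMMENT_SIZE].min
  io.seek(size - tail_size)
  tail = io.read(tail_size)

  # Search backwards, starting at the last offset where a full record fits.
  eocd_at = tail.rindex(EOCD_SIGNATURE, tail.bytesize - EOCD_MIN_SIZE)
  raise ArgumentError, 'EOCD record not found' unless eocd_at

  # "Total number of entries" is a 16-bit little-endian field at offset 10.
  count = tail.byteslice(eocd_at + 10, 2).unpack1('v')
  return count unless count == 0xFFFF

  # 0xFFFF can mean the real count lives in the Zip64 EOCD record, which is
  # found through a 20-byte locator placed directly before the EOCD.
  locator_at = eocd_at - ZIP64_LOCATOR_SIZE
  return count if locator_at.negative?

  locator = tail.byteslice(locator_at, ZIP64_LOCATOR_SIZE)
  return count unless locator.start_with?(ZIP64_LOCATOR_SIGNATURE)

  # The locator stores the absolute offset of the Zip64 EOCD at byte 8.
  io.seek(locator.byteslice(8, 8).unpack1('Q<'))
  zip64_eocd = io.read(40).to_s
  unless zip64_eocd.bytesize == 40 && zip64_eocd.start_with?(ZIP64_EOCD_SIGNATURE)
    raise ArgumentError, 'Zip64 EOCD record not found'
  end

  # The 64-bit "total number of entries" field sits at offset 32.
  zip64_eocd.byteslice(32, 8).unpack1('Q<')
end
```

It could be called as `File.open(archive_path, 'rb') { |io| eocd_entry_count(io) }`. However many entries the archive holds, the sketch reads at most about 64 KiB from the end of the file plus 40 bytes for Zip64 archives, which is where the constant CPU and memory cost comes from.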
How to set up and validate locally
Start a Rails console and paste this snippet:
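module M
  using GemExtensions::Rubyzip::Refinements

  # Counts entries by reading only the EOCD record.
  def fast_count(archive_path)
    # The block form closes the file handle once the count has been read.
    File.open(archive_path) { |io| Zip::File.entry_count(io) }
  end

  extend self
end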
[8] pry(main)> M.fast_count '/home/git/gitlab/spec/fixtures/safe_zip/valid-simple.zip'
=> 7
Compare:
$ unzip -l spec/fixtures/safe_zip/valid-simple.zip
Archive:  spec/fixtures/safe_zip/valid-simple.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2019-01-17 16:30   public/
       12  2019-01-17 16:30   public/index.html
        0  2019-01-17 16:30   public/assets/
        0  2019-01-17 16:30   public/assets/image.png
        6  2019-01-17 16:30   public/images
        0  2019-01-17 16:30   source/
       12  2019-01-17 16:30   source/index.html
---------                     -------
       30                     7 files
Possible follow-ups
- Write documentation for efficient use of zip files
- Write a Cop that flags potentially harmful use of rubyzip APIs
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #345673