Artifact / Cache Performance

Description

The issues with Artifact/Cache archiving and extraction are well known. There are several of them, and they fall into two categories. To summarise:

Compression/Extraction

The built-in compression Go offers out of the box can be a little slow.
We're not making use of concurrency.
Memory allocations are greater than they need to be.

Collecting info about files

filepath.Walk is slow.
We're not making use of concurrency.
We perform multiple lstats on the same files as they're passed from one function to another.

MR !1640 (closed) tried to deal with all of these issues at once. Ignoring dependency changes, it modified 21 files (+345 lines, -892 lines). Part of the problem was that it interwove new code with old, rather than creating an entirely new execution path that could easily be feature flagged.

Proposal

I'm creating this issue to break the improvement work down into multiple smaller issues.

Compression/Extraction

!2055 (merged) Use https://github.com/klauspost/pgzip as a drop-in replacement for compress/gzip. A community contribution has already been approved to do this.
!2190 (closed) Use https://github.com/klauspost/compress/zip as a drop-in replacement for archive/zip. This uses klauspost's incredibly improved deflate library.
!2195 (merged) Implement a solution to easily replace and implement new Archivers/Extractors. GitLab supports zip, gzip and raw, but the code to do so isn't built around any common interface. Replacing an archiver or extractor requires modifying an awful lot of code.

There should be a way to register an Archiver/Extractor for a specific format and the rest of the codebase not notice.

I propose the first step is introducing an API for the registration of Archivers and Extractors using common interfaces. The first Archivers and Extractors implemented will be shims that bridge to the existing archiving and extraction code. We change the entry points into archiving/extraction to use the new interfaces, but the shims call back into the existing implementation. This reduces how many tests need to be modified, and allows us to continue using the legacy implementations until they can be entirely removed.

We don't introduce any new Archivers/Extractors in this step; we only introduce the new API that we can use in the future. We don't feature flag this. We need this interface so that different Archivers/Extractors can be feature flagged more easily in the future. This should be a relatively safe change, as few lines of code will be modified.

The API introduces a generic way to set the compression level (Fastest, Fast, Default, Slow, Slowest). This will be used in the future.
!2205 (closed) Rewrite how we glob and walk directories for creating artifacts and caches. This process is too tightly coupled to archiving/extracting, and there are many inefficiencies. The result should be a map of file paths to os.FileInfos (map[string]os.FileInfo).

Ideally, lstats would only happen once, but this would require a large change. The good news is, although the os.FileInfo is thrown away at several points in the existing implementation, the OS does cache these results and lstats are faster on subsequent calls.

For this step, we focus only on making the initial walking, globbing and lstat fast. Later, we can remove the code in between that performs redundant lstats.
!2210 (merged) Implement a new Archiver/Extractor for zip: https://github.com/saracen/fastzip - This will be feature flagged (on by default?) and eventually replaces all of the existing zip implementation, even the "extra" field functionality.
Teach the artifact/cache helper binaries about the compression levels the Archiver/Extractor interface understands (step 3, !2195).
We should be able to refactor the gzip implementation. I think most of the code will remain the same; we'll just be moving it to where it now seems most appropriate, updating tests and allowing the compression level to be controlled.
Tidy the steps between "getting the files" (step 4, !2205) and passing them to the Archiver/Extractor interface (step 3, !2195).

Links to related issues and merge requests / references

Edited Sep 08, 2020 by Arran Walker