packed_binaries: Extract binaries in parallel
Currently we extract each of Gitaly's packed binaries serially. This task consumes the majority of the time spent during initialization. By extracting these files in parallel we can measurably reduce startup time.
On a 4-core system this improves startup time by ~10%, using the same benchmarking process as 96438c24 (gitaly: Don't block on preloading licensedb, 2023-09-20):
Benchmark 1: ./gitaly-par serve config.toml
Time (mean ± σ): 228.5 ms ± 4.2 ms [User: 285.8 ms, System: 76.8 ms]
Range (min … max): 222.7 ms … 237.0 ms 13 runs
Benchmark 2: ./gitaly-st serve config.toml
Time (mean ± σ): 254.5 ms ± 6.9 ms [User: 315.1 ms, System: 75.6 ms]
Range (min … max): 246.7 ms … 272.2 ms 11 runs
Summary
./gitaly-par serve config.toml ran
1.11 ± 0.04 times faster than ./gitaly-st serve config.toml
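The percentages quoted here can be read either as a speedup ratio or as a reduction in mean wall-clock time. Writing \bar{t}_{par} and \bar{t}_{st} for the mean times of ./gitaly-par and ./gitaly-st (notation introduced only for this illustration), the 4-core means give roughly 10% under both readings:

```latex
\text{speedup} = \frac{\bar{t}_{\text{st}}}{\bar{t}_{\text{par}}}
  = \frac{254.5\,\text{ms}}{228.5\,\text{ms}} \approx 1.11
\qquad
\text{time saved} = 1 - \frac{\bar{t}_{\text{par}}}{\bar{t}_{\text{st}}}
  = 1 - \frac{228.5}{254.5} \approx 10.2\%
```

The two readings diverge as the gain grows: the 16-core means give a speedup of about 1.21 but a time reduction of about 17.0%.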
On a 16-core system the improvement grows to ~20%:
Benchmark 1: ./gitaly-par serve config.toml
Time (mean ± σ): 234.7 ms ± 6.0 ms [User: 326.4 ms, System: 169.5 ms]
Range (min … max): 228.6 ms … 247.5 ms 12 runs
Benchmark 2: ./gitaly-st serve config.toml
Time (mean ± σ): 282.9 ms ± 10.4 ms [User: 377.1 ms, System: 156.3 ms]
Range (min … max): 266.3 ms … 302.0 ms 11 runs
Summary
'./gitaly-par serve config.toml' ran
1.21 ± 0.05 times faster than './gitaly-st serve config.toml'
This does place more demand on the disk, but only momentarily. When the
host is under heavy I/O pressure, simulated here with stress-ng --iomix 5,
parallel extraction's advantage grows to ~36% on the 16-core system,
though run-to-run variance is much higher:
Benchmark 1: ./gitaly-par serve config.toml
Time (mean ± σ): 545.9 ms ± 159.9 ms [User: 731.7 ms, System: 177.7 ms]
Range (min … max): 326.9 ms … 913.7 ms 10 runs
Benchmark 2: ./gitaly-st serve config.toml
Time (mean ± σ): 740.9 ms ± 242.8 ms [User: 977.4 ms, System: 150.7 ms]
Range (min … max): 378.9 ms … 1029.0 ms 10 runs
Summary
'./gitaly-par serve config.toml' ran
1.36 ± 0.60 times faster than './gitaly-st serve config.toml'
Edited by Will Chandler (ex-GitLab)