git: Inject `pack.writeReverseIndex` into commands generating packfiles
Git v2.31.0 has introduced a new on-disk reverse-index for packfiles, which is used to map from a given object position in the packfile to its entry in the packfile index. This reverse index has typically been generated by git on the fly, which can take some time for packfiles with a lot of objects. With the new on-disk file, these computations can be skipped.
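To illustrate the mapping described above, here is a minimal sketch of what a reverse index conceptually contains. It is not git's implementation: the offsets are made-up toy data and the function names are hypothetical. The packfile index lists objects sorted by object ID, whereas the reverse index lists them in the order in which they appear in the packfile.

```python
# Toy data (hypothetical): pack offsets of four objects, listed in
# packfile-index order, i.e. sorted by object ID.
offsets_by_index_pos = [4096, 12, 980, 310]


def build_reverse_index(offsets):
    """Return a list whose i-th entry is the index position of the i-th
    object in pack order (objects sorted by their offset in the pack)."""
    return sorted(range(len(offsets)), key=lambda idx_pos: offsets[idx_pos])


def index_pos_for_pack_pos(rev_index, pack_pos):
    """Map a position in the packfile to the matching packfile index entry."""
    return rev_index[pack_pos]


# Without an on-disk file, this sort has to happen in memory on demand.
rev_index = build_reverse_index(offsets_by_index_pos)
print(rev_index)
print(index_pos_for_pack_pos(rev_index, 0))
```

The sort over all objects is the on-the-fly work that an on-disk reverse index makes unnecessary.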
In theory, the on-disk reverse index should speed up computations which
need to look up object sizes. This includes git-cat-file(1) with
`--batch-check=%(objectsize)`, but should also include commands like
git-rev-list(1) with a `--filter=blob:limit`. Given that we do not use
git-cat-file(1) anywhere to print object sizes, the former does not
matter for us. But the latter should matter given that we use the filter
to look up LFS pointers.
Unfortunately, benchmarks didn't show much of an improvement. The following tests were run in linux.git:
    Benchmark #1: git rev-list --all --objects --filter=blob:limit=200 --use-bitmap-index # with reverse-index
      Time (mean ± σ): 14.354 s ± 0.105 s [User: 12.411 s, System: 1.941 s]
      Range (min … max): 14.203 s … 14.487 s 5 runs

    Benchmark #2: git rev-list --all --objects --filter=blob:limit=200 --use-bitmap-index # without reverse-index
      Time (mean ± σ): 14.076 s ± 0.076 s [User: 12.032 s, System: 2.042 s]
      Range (min … max): 13.988 s … 14.156 s 5 runs

    Summary
      'git rev-list --all --objects --filter=blob:limit=200 --use-bitmap-index # without reverse-index' ran
        1.02 ± 0.01 times faster than 'git rev-list --all --objects --filter=blob:limit=200 --use-bitmap-index # with reverse-index'
There is one more angle to the reverse index though: because it no
longer needs to be computed in memory, git doesn't have to allocate it
anymore. `struct revindex_entry` is 16 bytes big, which isn't much. But
for biggish repositories with millions of objects, this adds up to a
sizeable chunk of memory. E.g. for linux.git with about 9M objects, this
is about 140MB of RAM allocated only for the reverse index. This can be
seen in the maximum RSS of the above two commands: with the reverse
index, it peaks at 2.34GB, while it peaks at 2.45GB without it. Also,
there are significantly fewer page faults: from 465k down to 390k when
using the reverse index.
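As a quick sanity check of the memory estimate, the arithmetic can be sketched as follows. The numbers are the ones quoted above, the object count is approximate, and the helper name is hypothetical.

```python
ENTRY_SIZE_BYTES = 16      # size of struct revindex_entry, as stated above
OBJECT_COUNT = 9_000_000   # approximate number of objects in linux.git


def in_memory_revindex_bytes(objects, entry_size=ENTRY_SIZE_BYTES):
    """Estimate the memory needed for an in-memory reverse index."""
    return objects * entry_size


# Estimated allocation in MB (decimal).
estimate_mb = in_memory_revindex_bytes(OBJECT_COUNT) / 1_000_000

# Observed difference in peak RSS between the two benchmark runs, in MB.
observed_mb = 2450 - 2340

print(f"estimated: {estimate_mb:.0f}MB, observed RSS difference: {observed_mb}MB")
```

The estimate and the observed RSS difference are in the same ballpark, which is consistent with the in-memory reverse index accounting for most of the difference.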
One downside of the reverse index is that it obviously requires disk space. In case of linux.git, the packfile is 2GB, its index is 250MB, the bitmap is 65MB, and on top of that we have the reverse index with 36MB, which is roughly 1.5% more storage than we required before. This is not representative, but it is a ballpark figure that helps as orientation.
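The storage overhead can be reproduced with a few lines. The sizes are the rounded figures quoted above and the helper name is hypothetical.

```python
# Rounded on-disk sizes for linux.git in MB, as quoted above.
existing_mb = {"packfile": 2000, "index": 250, "bitmap": 65}
REV_INDEX_MB = 36


def overhead_percent(existing, added_mb):
    """Return the added size as a percentage of the existing total."""
    return added_mb / sum(existing.values()) * 100


overhead = overhead_percent(existing_mb, REV_INDEX_MB)
print(f"reverse index adds {overhead:.1f}% on top of {sum(existing_mb.values())}MB")
```

Note that 36MB for about 9M objects works out to roughly 4 bytes per object, compared to 16 bytes per entry for the in-memory structure.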
So it seems like the reverse index is a mixed bag, and it's not yet clear whether it helps us or not. It does trade disk space for memory and page faults. In theory, it also improves the runtime of some commands for largish repos, but we couldn't really observe that for our typical use cases in the medium-sized linux.git repo. But a 1.5% increase in disk space seems small enough to give it a shot and test whether it has an impact on production systems.
This commit thus starts injecting the `pack.writeReverseIndex` option
into commands which generate packfiles. Because it's doubtful whether it
helps or not, it's currently hidden behind a feature flag.
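The general shape of such an injection can be sketched as below. This is an illustration only and not the code of this commit: the function name, the flag parameter and the set of subcommands are assumptions. The only real interfaces used are git's `-c <name>=<value>` option and the `pack.writeReverseIndex` config key.

```python
# Hypothetical set of subcommands which may end up writing packfiles.
PACK_GENERATING_SUBCOMMANDS = {"repack", "gc", "fetch", "receive-pack"}


def build_git_command(subcommand, args, reverse_index_enabled):
    """Assemble a git command line, injecting pack.writeReverseIndex via
    `-c` when the (hypothetical) feature flag is enabled and the
    subcommand may generate packfiles."""
    cmd = ["git"]
    if reverse_index_enabled and subcommand in PACK_GENERATING_SUBCOMMANDS:
        cmd += ["-c", "pack.writeReverseIndex=true"]
    return cmd + [subcommand, *args]


print(build_git_command("repack", ["-a", "-d"], reverse_index_enabled=True))
print(build_git_command("repack", ["-a", "-d"], reverse_index_enabled=False))
```

Passing the option via `-c` keeps it scoped to a single invocation, so disabling the feature flag immediately restores the previous behaviour for newly written packfiles.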