git: Inject `pack.writeReverseIndex` into commands generating packfiles (!3292) · Merge requests · GitLab.org / gitaly

Patrick Steinhardt requested to merge pks-repack-write-reverse-index into master Mar 25, 2021

Git v2.31.0 has introduced a new on-disk reverse-index for packfiles, which is used to map from a given object position in the packfile to its entry in the packfile index. This reverse index has typically been generated by git on the fly, which can take some time for packfiles with a lot of objects. With the new on-disk file, these computations can be skipped.
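
To illustrate what such a reverse index contains, here is a minimal sketch with made-up data. The pack index is ordered by object ID, whereas the reverse index lists the index positions in pack-offset order. The four offsets are invented for illustration, and the pipeline only mimics the idea; it is not how git computes the index internally.

```shell
# Toy pack index: "<index position> <pack offset>", listed in object-ID order.
# The offsets are invented for illustration only.
idx='0 900
1 12
2 450
3 130'

# Sorting by pack offset and keeping only the index positions yields the
# reverse index: entry N is the index position of the Nth object in the pack.
revindex=$(printf '%s\n' "$idx" | sort -n -k2,2 | cut -d' ' -f1 | paste -sd' ' -)
echo "$revindex"
```

Here the first object in the pack is index entry 1, the second is index entry 3, and so on.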

In theory, the on-disk reverse index should speed up computations which need to look up object sizes. This includes git-cat-file(1) with --batch-check=%(objectsize), but should also include commands like git-rev-list(1) with a --filter=blob:limit. We do not use git-cat-file(1) anywhere to print object sizes, so the former does not matter for us. The latter should matter though, given that we use the filter to look up LFS pointers.

Unfortunately, benchmarks didn't show much of an improvement. The following tests were run in linux.git:

Benchmark #1: git rev-list --all --objects --filter=blob:limit=200 --use-bitmap-index # with reverse-index
  Time (mean ± σ):     14.354 s ±  0.105 s    [User: 12.411 s, System: 1.941 s]
  Range (min … max):   14.203 s … 14.487 s    5 runs

Benchmark #2: git rev-list --all --objects --filter=blob:limit=200 --use-bitmap-index # without reverse-index
  Time (mean ± σ):     14.076 s ±  0.076 s    [User: 12.032 s, System: 2.042 s]
  Range (min … max):   13.988 s … 14.156 s    5 runs

Summary
  'git rev-list --all --objects --filter=blob:limit=200 --use-bitmap-index # without reverse-index' ran
    1.02 ± 0.01 times faster than 'git rev-list --all --objects --filter=blob:limit=200 --use-bitmap-index # with reverse-index'

There is one more angle to the reverse index though: because it doesn't need to be computed in memory anymore, git doesn't have to allocate it either. struct revindex_entry is 16 bytes in size, which isn't much. But for biggish repositories with millions of objects, this adds up to a sizeable chunk of memory. E.g. for linux.git with about 9M objects, this is about 140MB of RAM allocated only for the reverse index. This can be seen in the maximum RSS of the above two commands: with the reverse index it peaks at 2.34GB, while it peaks at 2.45GB without it. There are also significantly fewer page faults: 390k with the reverse index, down from 465k without it.
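
As a back-of-the-envelope check, the memory estimate can be reproduced with shell arithmetic. The object count, entry size and RSS values are the figures quoted above; the variable names are only illustrative.

```shell
objects=9000000   # approximate object count of linux.git, as quoted above
entry_bytes=16    # size of struct revindex_entry, as quoted above

# Expected size of the in-memory reverse index, in MB.
revindex_mb=$(( objects * entry_bytes / 1000000 ))
echo "estimated in-memory reverse index: ~${revindex_mb} MB"

# Peak RSS of the two benchmark runs, in MB.
rss_without=2450
rss_with=2340
rss_saved_mb=$(( rss_without - rss_with ))
echo "observed peak RSS difference: ~${rss_saved_mb} MB"
```

The estimate and the measured difference are of the same order of magnitude.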

One downside of the reverse index is that it obviously requires disk space. In case of linux.git, the packfile is 2GB, its index is 250MB, the bitmap is 65MB, and on top of that we have the reverse index with 36MB, which is roughly 1.5% more storage than we required before. This is not representative, but it is a ballpark figure that helps with orientation.
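
The overhead figure can be derived from the sizes quoted above. The sketch below uses integer arithmetic in basis points (hundredths of a percent) to stay within plain shell; the variable names are illustrative.

```shell
# Sizes for linux.git in MB, as quoted above.
pack_mb=2000
idx_mb=250
bitmap_mb=65
rev_mb=36

# Storage required before the reverse index existed.
before_mb=$(( pack_mb + idx_mb + bitmap_mb ))

# Relative overhead in basis points, rounded down.
overhead_bp=$(( rev_mb * 10000 / before_mb ))
printf 'reverse index overhead: %d.%02d%%\n' $(( overhead_bp / 100 )) $(( overhead_bp % 100 ))
```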

So it seems like the reverse index is a mixed bag, and it's not yet clear whether it helps us or not. It does trade disk space for memory and page faults. In theory, it also improves the runtime of some commands for largish repos, but we couldn't really observe that for our typical use cases with the medium-sized linux.git repo. But a 1.5% increase in disk space seems small enough to give it a shot and test whether it has an impact on production systems.

This commit thus starts injecting the pack.writeReverseIndex option into commands which generate packfiles. Because it's doubtful whether it helps or not, it's currently hidden behind a feature flag.
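
For manual experiments outside of Gitaly, the same effect can be approximated on the command line. The following is only a sketch and not the implementation in this commit: the `git_pack_cmd` wrapper and the `WRITE_REVINDEX` toggle are hypothetical stand-ins for the injected option and the feature flag, while `pack.writeReverseIndex` is the actual git configuration key.

```shell
# Allow overriding the git binary; defaults to the one found in PATH.
GIT=${GIT:-git}
# Hypothetical toggle standing in for the feature flag.
WRITE_REVINDEX=${WRITE_REVINDEX:-true}

# Run a packfile-generating git command, injecting the option when enabled.
git_pack_cmd() {
    if [ "$WRITE_REVINDEX" = true ]; then
        "$GIT" -c pack.writeReverseIndex=true "$@"
    else
        "$GIT" "$@"
    fi
}

# Example usage inside a repository (requires git v2.31.0 or newer):
#   git_pack_cmd repack -a -d
#   ls .git/objects/pack/*.rev
```

Passing the option via `-c` keeps it scoped to the single invocation instead of persisting it in the repository's configuration.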

