blob: Speed up LFS pointer search via object type filters
The ListLFSPointers() RPC returns all LFS pointers referenced by a set
of revisions. This filtering is quite expensive: we first need to
enumerate all reachable objects, then for each object we need to check
whether it's a blob and whether its size indicates that it can be an LFS
pointer, and finally we need to read the blob's contents and test
whether it really is an LFS pointer.
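
The three checks described above can be sketched roughly as follows.
This is a simplified illustration, not the actual pipeline code: the
`object` type and all function names are hypothetical, and the content
check only looks at the version line and the presence of an `oid` line
rather than fully validating the pointer.

```go
package main

import (
	"bytes"
	"fmt"
)

// maxLFSPointerSize mirrors the 200-byte blob size limit used for candidates.
const maxLFSPointerSize = 200

// lfsPointerPrefix is the first line of a pointer file in the Git LFS v1 format.
var lfsPointerPrefix = []byte("version https://git-lfs.github.com/spec/v1\n")

// object is a hypothetical stand-in for whatever the object enumeration yields.
type object struct {
	Type    string
	Content []byte
}

// isLFSPointerCandidate applies the cheap checks: object type and size.
func isLFSPointerCandidate(o object) bool {
	return o.Type == "blob" && len(o.Content) <= maxLFSPointerSize
}

// looksLikeLFSPointer applies a simplified content check: the blob must start
// with the version line and contain a SHA-256 oid line.
func looksLikeLFSPointer(content []byte) bool {
	return bytes.HasPrefix(content, lfsPointerPrefix) &&
		bytes.Contains(content, []byte("\noid sha256:"))
}

// filterLFSPointers keeps only the objects which pass all checks.
func filterLFSPointers(objects []object) []object {
	var pointers []object
	for _, o := range objects {
		if isLFSPointerCandidate(o) && looksLikeLFSPointer(o.Content) {
			pointers = append(pointers, o)
		}
	}
	return pointers
}

func main() {
	objects := []object{
		{Type: "commit", Content: []byte("tree ...")},
		{Type: "blob", Content: []byte("hello world\n")},
		{Type: "blob", Content: []byte("version https://git-lfs.github.com/spec/v1\n" +
			"oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393\n" +
			"size 12345\n")},
	}
	fmt.Println(len(filterLFSPointers(objects)))
}
```

Note that every non-blob object still has to be inspected before it can
be rejected, which is the cost the rest of this message is about.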
To optimize this a bit, we set up a blob size limit of 200 bytes, which
is the maximum size an LFS pointer can have. While this significantly
brings down the number of candidate blobs, git-rev-list(1) will still
unconditionally list all the other object types. Effectively, we're thus
needlessly retrieving object info for all tags, commits and trees only
to notice that they aren't blobs in the first place, which is a huge
waste of time.
To tackle this problem, we have upstreamed two new options for git-rev-list(1):
- By default, git-rev-list(1) unconditionally prints objects which
have been provided directly, either via the command line or via
stdin. A new option `--filter-provided-objects` has been added
which changes this behaviour and causes provided revisions to be
filtered as well.
- A new object type filter `--filter=object:type=<type>` has been
added which will cause git-rev-list(1) to only list objects whose
type matches the given type.
Used in combination, these options bring down the number of potential
LFS pointer candidates by a significant factor. Executed on linux.git:

    $ git rev-list --objects --filter=blob:limit=200 --all | wc -l
    7146677

    $ git rev-list --objects --filter=blob:limit=200 --all \
        --filter=object:type=blob --filter-provided-objects | wc -l
    15217

For this particular repo, we have roughly 470 times fewer objects to
check for whether they are an LFS pointer or not. Naturally, this is an
artificial demonstration only, because we don't typically search for LFS
pointers with --all. But we can expect that this translates to speedups
at a smaller scale, too, by not having to do pointless work.
So let's use this by setting up the new withObjectTypeFilter() option
in case we're running a Git version which supports it. No new feature
flag is introduced given that we only implement it on the new pipeline
code, which is already guarded by a feature flag anyway.
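
The version check could look roughly like the sketch below. It assumes
that Git 2.32.0 is the first release shipping both options, which should
be verified against the Git release notes. `gitVersion`,
`parseGitVersion()` and `supportsObjectTypeFilter()` are hypothetical
names and not the helpers used in the actual code.

```go
package main

import "fmt"

// gitVersion is a hypothetical representation of a parsed Git version.
type gitVersion struct{ major, minor, patch int }

// parseGitVersion parses the output of `git version`, ignoring any suffix
// after the patch level.
func parseGitVersion(s string) (gitVersion, error) {
	var v gitVersion
	if _, err := fmt.Sscanf(s, "git version %d.%d.%d", &v.major, &v.minor, &v.patch); err != nil {
		return gitVersion{}, fmt.Errorf("parsing %q: %w", s, err)
	}
	return v, nil
}

// atLeast reports whether v is the same as or newer than other.
func (v gitVersion) atLeast(other gitVersion) bool {
	if v.major != other.major {
		return v.major > other.major
	}
	if v.minor != other.minor {
		return v.minor > other.minor
	}
	return v.patch >= other.patch
}

// supportsObjectTypeFilter assumes the object type filter is available
// starting with Git 2.32.0.
func supportsObjectTypeFilter(v gitVersion) bool {
	return v.atLeast(gitVersion{major: 2, minor: 32, patch: 0})
}

func main() {
	for _, raw := range []string{"git version 2.31.1", "git version 2.32.0"} {
		v, err := parseGitVersion(raw)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s: object type filter supported: %v\n", raw, supportsObjectTypeFilter(v))
	}
}
```

On older versions the caller would simply skip the option and fall back
to checking each object's type itself.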

Part of #3618