Change Zoekt to use Gitaly for fetching code
Problem
Based on our investigation in #384722 (closed), it seems likely that there is no way to use the public Git API to fetch only changed files without storing the bare repo. Storing the whole bare repo is inefficient in both storage and bandwidth. Our current design also depends on cloning over public URLs, which adds bandwidth costs because traffic goes out to the internet and back, all the way through Cloudflare and so on.
Solution
We should change our indexing logic to work similarly to gitlab-elasticsearch-indexer, where we use the Gitaly API to fetch the files changed since the last time we indexed the repo.
This also solves all the following issues
Technical details
We need to implement a new binary that is responsible for fetching files from Gitaly and writing them to Zoekt index files. This would basically be like https://github.com/DylanGriffith/zoekt/blob/main/gitindex/index.go in the Zoekt codebase, but with the go-git calls swapped for direct Gitaly calls. This has a couple of implications:
- It only works as an internal service: Gitaly is not internet accessible, so the service needs authenticated, privileged access to Gitaly
- This won't really make sense to contribute back to Zoekt as it's too GitLab specific
- This will therefore need to use Zoekt as a library (e.g. these kinds of calls to build Zoekt documents)
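The shape of such a binary can be sketched as follows. This is only an illustration under stated assumptions: `FileSource` stands in for whatever Gitaly client replaces go-git, `DocumentSink` stands in for the Zoekt builder used as a library, and `ChangedFile`, `IndexChanges`, `memSource` and `memSink` are hypothetical names that exist in neither codebase.

```go
package main

import "fmt"

// ChangedFile describes one path that differs between two commits.
// Hypothetical type: it is not part of the Gitaly or Zoekt APIs.
type ChangedFile struct {
	Path    string
	Content []byte // unused when Deleted is true
	Deleted bool
}

// FileSource is a hypothetical stand-in for the Gitaly client that would
// replace go-git. It returns the files changed between two commits.
type FileSource interface {
	ChangedFiles(fromSHA, toSHA string) ([]ChangedFile, error)
}

// DocumentSink is a hypothetical stand-in for the Zoekt index builder
// when Zoekt is used as a library.
type DocumentSink interface {
	Add(path string, content []byte) error
	Delete(path string) error
	Finish(indexedSHA string) error
}

// IndexChanges copies the changes between fromSHA and toSHA from src into
// sink. It returns the number of added or updated paths and deleted paths.
func IndexChanges(src FileSource, sink DocumentSink, fromSHA, toSHA string) (int, int, error) {
	added, deleted := 0, 0
	files, err := src.ChangedFiles(fromSHA, toSHA)
	if err != nil {
		return 0, 0, fmt.Errorf("fetching changed files: %w", err)
	}
	for _, f := range files {
		if f.Deleted {
			if err := sink.Delete(f.Path); err != nil {
				return added, deleted, fmt.Errorf("deleting %s: %w", f.Path, err)
			}
			deleted++
			continue
		}
		if err := sink.Add(f.Path, f.Content); err != nil {
			return added, deleted, fmt.Errorf("adding %s: %w", f.Path, err)
		}
		added++
	}
	// Record the new SHA only after every change has been written.
	if err := sink.Finish(toSHA); err != nil {
		return added, deleted, fmt.Errorf("finishing index: %w", err)
	}
	return added, deleted, nil
}

// memSource and memSink are in-memory fakes for trying the sketch out.
type memSource struct{ files []ChangedFile }

func (m memSource) ChangedFiles(_, _ string) ([]ChangedFile, error) {
	return m.files, nil
}

type memSink struct {
	docs map[string][]byte
	sha  string
}

func (m *memSink) Add(path string, content []byte) error {
	m.docs[path] = content
	return nil
}

func (m *memSink) Delete(path string) error {
	delete(m.docs, path)
	return nil
}

func (m *memSink) Finish(sha string) error {
	m.sha = sha
	return nil
}
```

Keeping Gitaly and Zoekt behind two small interfaces is one possible design choice: the GitLab-specific code stays in this binary, and the loop can be exercised with the in-memory fakes before real clients are wired in.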
This service will likely look very similar to gitlab-elasticsearch-indexer and will need to be passed similar data (e.g. Gitaly server storage details for the specific project). However, since it will write to local Zoekt index files, it will need to be a web server instead of being shelled out to from Sidekiq.
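A minimal sketch of what the web server entry point might look like, assuming a JSON `POST` body. The `IndexRequest` fields, `parseIndexRequest` and `newIndexHandler` are all hypothetical and are not taken from gitlab-elasticsearch-indexer or Gitaly; the real payload would mirror whatever connection details Rails already passes to the Elasticsearch indexer.

```go
package main

import (
	"encoding/json"
	"errors"
	"io"
	"net/http"
)

// IndexRequest is a hypothetical payload. The field names are illustrative
// only and would need to match what GitLab actually sends.
type IndexRequest struct {
	RepoID        int64  `json:"repo_id"`
	GitalyAddress string `json:"gitaly_address"`
	GitalyStorage string `json:"gitaly_storage"`
	RelativePath  string `json:"relative_path"`
}

// parseIndexRequest decodes the body and rejects requests that lack the
// details needed to reach the project's repository.
func parseIndexRequest(body []byte) (IndexRequest, error) {
	var req IndexRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return IndexRequest{}, err
	}
	if req.RepoID == 0 || req.GitalyAddress == "" || req.GitalyStorage == "" || req.RelativePath == "" {
		return IndexRequest{}, errors.New("repo_id, gitaly_address, gitaly_storage and relative_path are required")
	}
	return req, nil
}

// newIndexHandler wraps an indexing function in an HTTP handler.
// The index callback is where the Gitaly fetch and Zoekt write would run.
func newIndexHandler(index func(IndexRequest) error) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		// Cap the body size; the request carries metadata, not file contents.
		body, err := io.ReadAll(http.MaxBytesReader(w, r.Body, 1<<20))
		if err != nil {
			http.Error(w, "cannot read body", http.StatusBadRequest)
			return
		}
		req, err := parseIndexRequest(body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		if err := index(req); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}
```

The handler could be mounted with `http.Handle` on a path of our choosing and served with `http.ListenAndServe`. Authentication is deliberately left out of this sketch.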
I also found that the logic for figuring out the currently indexed SHA is at https://github.com/DylanGriffith/zoekt/blob/719983ac8e9f107daf9a47f90142568ab8c94e52/gitindex/index.go#L614 . So we won't need to keep track of the IndexStatus like we do for Elasticsearch since Zoekt will know it from the existing Zoekt index files.
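To illustrate why no separate IndexStatus is needed, the decision can be driven entirely by the SHA read back from the index files. The sketch below is an assumption about how that decision might be structured; `Plan` and `planIndex` are hypothetical names, and reading the SHA from the Zoekt index is left to the logic linked above.

```go
package main

// Plan says how much work an index run has to do.
type Plan int

const (
	PlanNoop        Plan = iota // the index is already at the target SHA
	PlanFull                    // no existing index: index every file
	PlanIncremental             // index only files changed since the indexed SHA
)

// planIndex picks a plan from the SHA recorded in the existing index files
// (empty when the repository has never been indexed) and the target SHA.
func planIndex(indexedSHA, targetSHA string) Plan {
	switch {
	case indexedSHA == "":
		return PlanFull
	case indexedSHA == targetSHA:
		return PlanNoop
	default:
		return PlanIncremental
	}
}
```

This sketch does not handle the case where the indexed SHA no longer exists in the repository, for example after a force push. A real implementation would probably need to fall back to a full index there.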
There is also the option of implementing this fully like Elasticsearch and sending the full file contents over HTTP to Zoekt, but this would likely be much less efficient than having Zoekt pull them via Gitaly RPC, as it adds extra encoding and decoding, an extra network hop, and a slower interface. It would probably also be more difficult to implement this HTTP API on the Zoekt side, as it doesn't fit well with the existing indexing logic.