Remove archive generation from the Go proxy
This is a follow up to a discussion: !27746 (comment 343643472).
The Go proxy MVC implementation may walk the entire repository tree. This can have a negative impact on performance. Until this is mitigated, the Go proxy is behind a feature flag. Once archive generation is moved into Workhorse or otherwise removed from Rails request processing, the feature flag can be removed.
Additionally, subdirectories containing a
go.mod
file must be excluded.That does make this a little trickier to handle. If this were just a simple glob pattern, then I think it's pretty straightforward change. I left some details out. This is the flow:
sequenceDiagram Client->>+Workhorse: GET /api/v4/1234/api/packages/go/module/@v/v1.0.0.zip Workhorse->>+Rails: GET /api/v4/1234/api/packages/go/module/@v/v1.0.0.zip Rails->>+Workhorse: Gitlab-Workhorse-Send-Data Workhorse->>Gitaly: SendArchiveRequest Gitaly->>Git: git archive v1.0.0 Git->>Gitaly: <streamed data> Gitaly->>Workhorse: <streamed data> Workhorse->>Client: v1.0.0.zip
Rails sends back the
Gitlab-Workhorse-Send-Data
header that Workhorse recognizes as a request to the Gitaly file servers to rungit archive
.It's easy to add a prefix (e.g.
gitlab.com/my/project
) before each file because the gRPCSendArchiveRequest
already has aPrefix
parameter (https://gitlab.com/gitlab-org/gitaly/blob/f89b33baaa4b34db9444d92466921c1e4a0a66f5/internal/service/repository/archive.go#L96-119) that is passed togit archive --prefix <prefix>
.
git archive
apparently works with globs just fine. For example, I was able to do this on the GitLab repo:git archive --prefix "gitlab.com/gitlab-org/" --format zip master "*/**/README.md" > /tmp/test.zip
This generates a
test.zip
file that only hasREADME.md
files.However, to fulfill the "subdirectories containing a
go.mod
file must be excluded" requirement, we would have to make a second pass through thiszip
file, scan the tree, and delete the files before sending it back. That's probably doable, but we would probably have to teach Gitaly to do this special filtering before sending it back to Workhorse. That being said, it's probably pretty fast to scan the ZIP metadata and make that decision versus trying to do this with Git repository data.
The download API doesn't support the Golang prefix here. I think we'll want to extend the Workhorse code to be able to do this.
archiveParams
includesArchivePrefix
, so the workhorse already supports--prefix
; all that needs to change on that front isGitlab::Workhorse.send_git_archive
.We would have to make a second pass through this
zip
file, scan the tree, and delete the files before sending it back. That's probably doable, but we would probably have to teach Gitaly to do this special filtering before sending it back to Workhorse.Should Gitaly do this or should Workhorse? It feels a bit weird for Gitaly to have code specific to Go (and less weird for Workhorse to have that). Workhorse will need an additional method either way. Instead of teeing to a temp file and the HTTP stream, Workhorse could copy to the temp file, scan it, then write a filtered version to the HTTP stream.
As an alternative solution, what about using
RepositoryService.SearchFilesByName
(Repository#search_files_by_name
) to get a complete list of files in the repository? This could be used to construct a list of what files should be archived. This could be done by a custom Workhorse method, or if that Gitaly call is fast enough, perhaps in the Rails app. However,RepositoryService.GetArchive
only permits a single path argument at the moment. So this would require either a newRepositoryService
method (GetArchiveWithPaths
), or a modification to support specifying multiple paths forGetArchive
, if that can be done in a backwards compatible way.
I think the best approach is:
- List all of the files:
RepositoryService.SearchFilesByName(query: <path>, ref: <ref>)
- Identify any
<path>/sub/dir/go.mod
files and create a list of those subdirectories- Request an archive:
RepositoryService.GetArchive(commit_id: <ref>.sha, path: <path>, prefix: <prefix>)
- Filter the archive stream on the fly, skipping any entry within one of the subdirectories from step 2
As far as I can tell, the only mechanism for excluding files from
git archive
is with a gitattributes file, which is problematic. The only way I can think of excluding files (without some /tmp/gitattributes hack) is to exhaustively list every file we do want, which could be a problem for large Go repositories. Additionally,RepositoryService.GetArchive
does not currently support multiple paths. Thus I think it would be better to filter after the fact.This approach would also eliminate the need to scan the
git archive
. I believe TAR and ZIP are both streaming formats, so the stream can be rewritten and entries omitted on the fly, without saving it to disk at any point.I see two ways to implement this:
- Create a new Workhorse method for this specific case; or
- Call
Repository#search_files_by_name
fromgo_proxy.rb
and add anExcludePaths []string
parameter to Workhorse'sSendArchive
methodI prefer implementation option 2, because it doesn't involve adding a custom method to Workhorse. I think it may be feasible because
git ls-tree
in my local GitLab repo runs in a fraction of a second:time git ls-tree --full-tree --name-status -r master | wc -l #> 32786 #> git ls-tree --full-tree --name-status -r master 0.03s user 0.05s system 77% cpu 0.101 total #> wc -l 0.00s user 0.02s system 15% cpu 0.098 total