Dependency proxy: some files will interrupt the parallel upload
🔥 Problem
At the heart of the dependency proxy logic, such as in the Maven dependency proxy, lies the `dependencyproxy` component of workhorse.
The goal is to retrieve a file from a remote URL and send it back to the client interacting with the dependency proxy. At the same time, we want to upload this file to a GitLab backend endpoint.
To solve this, workhorse effectively uses a tee: we read from the remote URL once and write to two different locations, the client response and the upload endpoint. For more technical details, read this.
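To make the tee idea concrete, here is a minimal, hypothetical sketch (not the actual workhorse code; all names and URLs are made up) using `io.TeeReader` and `io.Pipe`: every byte streamed to the client response is simultaneously fed to the upload request body.

```go
// Hypothetical sketch of the tee approach; not the actual workhorse implementation.
package sketch

import (
	"io"
	"net/http"
)

func proxyAndUpload(w http.ResponseWriter, r *http.Request, remoteURL, uploadURL string) error {
	remoteResp, err := http.Get(remoteURL)
	if err != nil {
		return err
	}
	defer remoteResp.Body.Close()

	// Whatever is written to pw becomes readable from pr by the upload request.
	pr, pw := io.Pipe()

	// Every byte read from the remote body is also written to pw.
	tee := io.TeeReader(remoteResp.Body, pw)

	// Run the upload in the background, reading from the pipe. Note that it
	// shares the original request's context, which matters later in this story.
	uploadErr := make(chan error, 1)
	go func() {
		req, err := http.NewRequestWithContext(r.Context(), http.MethodPut, uploadURL, pr)
		if err != nil {
			uploadErr <- err
			return
		}
		resp, err := http.DefaultClient.Do(req)
		if err == nil {
			resp.Body.Close()
		}
		uploadErr <- err
	}()

	// Stream the remote file to the client; this simultaneously feeds the upload.
	_, copyErr := io.Copy(w, tee)
	pw.CloseWithError(copyErr) // signal EOF (or the copy failure) to the upload side
	if copyErr != nil {
		return copyErr
	}
	return <-uploadErr
}
```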
In https://gitlab.com/gitlab-com/ops-sub-department/section-ops-request-for-help/-/issues/305, a problem was reported where the dependency proxy was returning a file but not uploading it.
At first, we thought it could be an authentication problem with the upload endpoint, but it turned out to be something more complex than that.
⛏ Digging deeper
We were able to reproduce the problem locally and the workhorse logs showed:
{"code":500,"correlation_id":"01HRFDJG8TYQZ41ENEJYR3PK8K","error":"dependency proxy: failed to upload file","level":"error","method":"GET","msg":"","time":"2024-03-08T17:15:38+01:00","uri":"/api/v4/projects/301/dependency_proxy/packages/maven/main/foo/bar/ko.xml"}
Obviously, there is a problem with the upload request. Above that log, we had:
{"correlation_id":"01HRFDJG8TYQZ41ENEJYR3PK8K","error":"RequestBody: upload failed: PUT request \"<object storage url>": Put \"<object storage url>\": context canceled","level":"error","method":"PUT","msg":"","time":"2024-03-08T17:15:38+01:00","uri":""}
That `context canceled` is suspicious.
Looking at the upload request, we can see that its `context` is tied to the original request (the one made by the client interacting with the dependency proxy).
Here is the kicker: if we target a different but similar (smaller) file, it works.
It seems to me that the original client closes the connection while the upload request is ongoing; since the upload request is on the same `context`, it gets canceled.
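To illustrate the mechanism (a generic `net/http` sketch, not workhorse code; the handler and URL are made up): when the client disconnects, `net/http` cancels `r.Context()`, and any in-flight outbound request built on that context is aborted with `context canceled`.

```go
// Generic sketch (not workhorse code): an outbound request that shares the
// incoming request's context is aborted when the client disconnects.
package sketch

import (
	"log"
	"net/http"
)

func handler(w http.ResponseWriter, r *http.Request) {
	// r.Context() is canceled by net/http as soon as the client goes away.
	req, err := http.NewRequestWithContext(r.Context(), http.MethodPut, "http://backend.example/upload", nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// If the client disconnects while this call is in flight, it fails with:
	//   Put "http://backend.example/upload": context canceled
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Println("upload failed:", err)
		return
	}
	resp.Body.Close()
}
```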
🏁 Confirming the context cancel
Let's try to confirm that the context of the original request is canceled and that this is what makes the upload request fail.
If this is true, then we could put the upload request in a different `context` and that should work. Here are the changes:
diff --git a/workhorse/internal/dependencyproxy/dependencyproxy.go b/workhorse/internal/dependencyproxy/dependencyproxy.go
index a28d108b431b..c9485d928f68 100644
--- a/workhorse/internal/dependencyproxy/dependencyproxy.go
+++ b/workhorse/internal/dependencyproxy/dependencyproxy.go
@@ -138,7 +138,7 @@ func (p *Injector) fetchUrl(ctx context.Context, params *entryParams) (*http.Res
 func (p *Injector) newUploadRequest(ctx context.Context, params *entryParams, originalRequest *http.Request, body io.Reader) (*http.Request, error) {
 	method := p.uploadMethodFrom(params)
 	uploadUrl := p.uploadUrlFrom(params, originalRequest)
-	request, err := http.NewRequestWithContext(ctx, method, uploadUrl, body)
+	request, err := http.NewRequestWithContext(context.Background(), method, uploadUrl, body)
 	if err != nil {
 		return nil, err
 	}
Sure enough, with these changes, the problem is fixed and the upload part is no longer interrupted.
Now, why does the original client close the connection before the upload request completes?
1️⃣ Theory 1
I thought that once we are done with the `tee`, the client closes the connection, and this wouldn't work for the upload part because dependency proxy uploads are workhorse-assisted uploads.
It involves multiple steps, and uploading the actual data sits in the middle of the list. In other words, after uploading the data, or rather writing it (to whatever target, object storage or file system), we still have requests to make to the Rails backend. Thus, if the client closes everything at the end of the tee, we could interrupt the remaining steps of the upload.
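As a rough outline of that idea (the step names below are hypothetical and the real workhorse flow may differ; all we rely on here is that requests to the Rails backend still happen after the data is written): writing the data is a middle step, and every step shares the same context.

```go
// Hypothetical outline of a workhorse-assisted upload; step names are made up.
package sketch

import (
	"context"
	"io"
)

func assistedUpload(
	ctx context.Context,
	data io.Reader,
	authorize func(context.Context) error, // hypothetical pre-flight request
	writeData func(context.Context, io.Reader) error, // object storage or file system
	finalize func(context.Context) error, // post-upload request to the Rails backend
) error {
	if err := authorize(ctx); err != nil {
		return err
	}
	if err := writeData(ctx, data); err != nil {
		return err
	}
	// If the client closes the connection once it has its bytes, and ctx is the
	// original request context, this remaining step is the one that gets canceled.
	return finalize(ctx)
}
```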
The problem with this theory is that it should then happen pretty often, and certainly for smaller files too, which is not the case with the two files I'm working with.
2️⃣ Theory 2
I also thought that the file contents (XML structures) could have an effect on this, but after playing with different contents and different parts of the failing file, I don't see anything in the content that could cause the client to close the connection early.
Keep in mind that the client receives the failing file in its entirety. It is the exact same file as the one at the remote URL. It is only the upload part that fails.
3️⃣ Theory 3
After reading more about the `ResponseWriter`, it seems that data can be flushed. I'm wondering if the failing file is large enough that the last bytes of data get flushed and the client closes the connection. Unfortunately, this happens while the upload part is ongoing.
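For reference, this is how a flush can be triggered explicitly in `net/http` (a generic sketch, not workhorse code); the server also flushes on its own once its internal write buffer fills up, which is why a large enough response can reach the client before the handler is done.

```go
// Generic sketch: explicitly flushing buffered response data to the client.
package sketch

import "net/http"

func streamHandler(w http.ResponseWriter, r *http.Request) {
	_, _ = w.Write([]byte("some chunk of data"))

	// Push buffered bytes to the client now. net/http also flushes implicitly
	// when its internal write buffer fills, so a sufficiently large response
	// can reach the client while the handler is still running.
	if f, ok := w.(http.Flusher); ok {
		f.Flush()
	}
}
```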
🐛 Issue reproduction
- Create a project.
- In the Packages and Registries settings, enable the Maven dependency proxy and use the following URL: https://gitlab.com/-/snippets/3686132/raw (this is a snippet that contains both files: the one that works and the one that doesn't).
Let's curl the successful file:
$ curl -vvv "http://<username>:<pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/main/foo/bar/ok.xml"
Everything works. Now check the package registry of the project: a package has been uploaded. It contains the file that we just requested.
Let's try with the failing file:
$ curl -vvv "http://<username>:<pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/main/foo/bar/ko.xml"
- No package in the package registry.
- The workhorse logs show a `dependency proxy: failed to upload file` error.
🚒 Solutions
I don't have a clear view of which solution would be best here.
- Use a different `context` for the upload part. That helped us verify the issue, but I'm not sure it's an actual solution. By using a different `context`, we're no longer tied to the original request, which means we won't properly clean up used resources when the client closes its connection. In other words, I think we could be introducing memory leaks.
- Perhaps we could use a `context` with a timeout (see the sketch after this list). It either completes or gets canceled.
- Alternative: see #448886 (comment 1837088205).
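As a rough illustration of the timeout option (a sketch only; it assumes Go 1.21+ for `context.WithoutCancel`, and the helper name and timeout value are made up): the upload context keeps the values of the original context but ignores its cancellation, and is bounded by its own timeout instead.

```go
// Sketch: detach the upload from the client's cancellation while still
// bounding it with a timeout. Assumes Go 1.21+ for context.WithoutCancel.
package sketch

import (
	"context"
	"io"
	"net/http"
	"time"
)

func newDetachedUploadRequest(originalCtx context.Context, method, uploadURL string, body io.Reader) (*http.Request, context.CancelFunc, error) {
	// Keep request-scoped values but not the cancellation of originalCtx;
	// 5 minutes is an arbitrary example value.
	uploadCtx, cancel := context.WithTimeout(context.WithoutCancel(originalCtx), 5*time.Minute)

	req, err := http.NewRequestWithContext(uploadCtx, method, uploadURL, body)
	if err != nil {
		cancel()
		return nil, nil, err
	}
	return req, cancel, nil
}
```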