Maven virtual registries: improve the senddependency workhorse command (!167034)

David Fernandez requested to merge 461561-workhorse-senddependency-improvement into master Sep 24, 2024

🎈 Context

In the maven virtual registry feature, we rely heavily on the send-dependency command from workhorse.

The send-dependency command is pretty powerful. It instructs workhorse to:

1. Get the contents from a remote url (passed in the parameters).
2. Ask rails for an upload location.
3. Stream the contents from the remote url into a "T junction" that sends them to the external client and to the upload location at the same time (see https://about.gitlab.com/blog/2024/02/15/compose-readers-and-writers-in-golang-applications/#reader--%3E-reader-%2B-writer for more details).
4. Confirm the upload to rails.
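
To make the "T junction" of the streaming step more concrete, here is a minimal sketch. Workhorse is written in Go and the linked blog post covers Go readers and writers; the sketch below uses Ruby only to match the console snippets in this description. `tee_copy` is a hypothetical helper, not workhorse code, and the `StringIO` objects stand in for the remote body, the external client and the upload location.

```ruby
require 'stringio'

# Hypothetical helper (not workhorse code): copies `source` to every writer in
# `sinks` chunk by chunk, so no sink has to wait for the whole body.
# The sinks are written one after the other for each chunk; a real
# implementation may write to them concurrently.
def tee_copy(source, sinks, chunk_size: 16 * 1024)
  total = 0
  while (chunk = source.read(chunk_size))
    sinks.each { |sink| sink.write(chunk) }
    total += chunk.bytesize
  end
  total
end

body   = '<project>spring-web</project>'
remote = StringIO.new(body) # stands in for the remote url response body
client = StringIO.new       # stands in for the external client
upload = StringIO.new       # stands in for the upload location

bytes = tee_copy(remote, [client, upload], chunk_size: 8)
puts "copied #{bytes} bytes to #{[client, upload].size} sinks"
```

Both sinks end up with the same bytes after a single pass over the source, which is why the remote file only has to be downloaded once.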

From a rails point of view, workhorse will hit two endpoints for:

1. Getting the upload location. This is called the upload authorization.
2. Confirming that the file was sent to the upload location. This is mainly used to create or persist whatever business objects we need on the rails side. For example, link this upload with an Active Record instance.

In #451242 (comment 1890233598), we raised the Wild Idea (😺): what if we completely avoid (1.), the upload authorization request?

The upload authorization comes from the upload handling in workhorse. This handling is mainly used when external clients upload a file to GitLab. However, here, the upload is not initiated by an external client. It's initiated by the rails backend sending the send-dependency command to workhorse. This raises a question: why should workhorse contact rails again for the upload authorization if rails could send this authorization directly along with the send-dependency command? Skipping that round trip would reduce the overall execution time and lower the load on the rails backend.

The upload authorization endpoints in rails are usually used to validate that an upload can be done. For example, during the upload authorization, user permissions are checked to make sure that the current user can upload a file. Another check can be the file size, to avoid accepting super large 🐘 files. Now, in the send-dependency command case, rails is already doing part of these validations (such as user permissions). There is no need to tell workhorse to ask for an upload authorization: rails can build the upload authorization response structure and send it along with the send-dependency command. Workhorse then reads this authorization, uploads the file to the target destination and confirms the upload.
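
As a rough illustration of the rails side of this idea, the sketch below builds the parameters of a send-dependency command with an optional embedded authorization. Every key name (`upload_config`, `authorized_upload_response`, `destination`, `maximum_size`, `digest_functions`), the method name and the example values are assumptions made for this sketch; the real structures are the ones changed in this MR.

```ruby
require 'json'

# All key names and values below are hypothetical, for illustration only.
def send_dependency_params(url:, upload_url:, authorization: nil)
  upload_config = { 'upload_url' => upload_url }
  # Only embed the authorization when the caller provides one, so callers
  # that do not pass it keep the current behavior.
  upload_config['authorized_upload_response'] = authorization if authorization
  { 'url' => url, 'upload_config' => upload_config }
end

authorization = {
  'destination' => '/tmp/uploads',    # local path or remote object location
  'maximum_size' => 5 * 1024 * 1024,  # bytes
  'digest_functions' => %w[sha1 md5]  # allowed digest functions
}

with_auth = send_dependency_params(
  url: 'https://repo1.maven.org/maven2/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom',
  upload_url: 'http://gdk.test:8000/upload-confirmation', # made-up url
  authorization: authorization
)
without_auth = send_dependency_params(
  url: with_auth['url'],
  upload_url: with_auth['upload_config']['upload_url']
)

puts JSON.pretty_generate(with_auth)
```

Keeping the authorization optional mirrors the constraint that the command is shared by several features.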

This optimization has been described in #461561 (closed) for virtual registries and this MR implements it.

🔬 What does this MR do and why?

- Update the workhorse send-dependency command. The UploadConfig structure now has a new field: AuthorizedUploadResponse. This new field holds the upload authorization structure: essentially, the destination to upload the file to (remote or local), the maximum file size and the digest functions that are allowed.
  - Note that the send-dependency command is used by several different features. Thus, by default, this new field is not set and the workhorse part behaves as it currently does.
- Update the related send-dependency tests.
- Update the rails workhorse helper to support the new field.
- Update the Maven virtual registry to set this new field when using the send-dependency command.
- Update the related rails specs.
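
The default behavior described above can be pictured with a small model. This is a sketch under assumptions, not the workhorse implementation (which is written in Go): `FakeRails`, `handle_send_dependency` and the hash keys are hypothetical names, and the streaming itself is left out.

```ruby
# Hypothetical stand-in for the rails backend: it only records the requests it receives.
class FakeRails
  attr_reader :requests

  def initialize
    @requests = []
  end

  def authorize_upload
    @requests << :authorize
    { 'destination' => '/tmp/uploads' }
  end

  def confirm_upload
    @requests << :upload
    true
  end
end

# Hypothetical model of the workhorse side: reuse the authorization embedded in
# the command when present, otherwise fall back to asking rails for it.
def handle_send_dependency(command, rails)
  authorization = command.dig('upload_config', 'authorized_upload_response') ||
                  rails.authorize_upload
  # ... the remote file would be streamed to the client and to the
  # authorized destination here ...
  rails.confirm_upload
  authorization
end

legacy = FakeRails.new
handle_send_dependency({ 'upload_config' => {} }, legacy)

optimized = FakeRails.new
embedded = { 'upload_config' => { 'authorized_upload_response' => { 'destination' => '/tmp/uploads' } } }
handle_send_dependency(embedded, optimized)

puts "field not set: #{legacy.requests.inspect}"
puts "field set:     #{optimized.requests.inspect}"
```

In this model, a command without the field triggers an authorize request followed by the confirmation, while a command with the field only triggers the confirmation.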

🗒 MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

✅

🖥 Screenshots or screen recordings

No UI changes.

⚗ How to set up and validate locally

⚠ Please make sure that you compile workhorse and restart it before trying this MR. The workhorse changes of this MR only take effect once workhorse is recompiled. Thus, from the GDK root directory:

$ cd gitlab/workhorse
$ make
$ gdk restart workhorse

Let's set up a maven virtual registry and pull a file through it. This pull will use the send-dependency command and upload the file to the GitLab instance. This upload should not need to ping the upload authorize endpoint.

Have a root group and a PAT ready.

In a rails console:

Feature.enable(:virtual_registry_maven) # enable the maven virtual registry
root_group = Group.find(<root group id>) # the following lines refer to root_group
r = ::VirtualRegistries::Packages::Maven::Registry.create!(group: root_group)
u = ::VirtualRegistries::Packages::Maven::Upstream.create!(group: root_group, url: 'https://repo1.maven.org/maven2')
VirtualRegistries::Packages::Maven::RegistryUpstream.create!(group: root_group, registry: r, upstream: u)

Now, from the gitlab folder, watch the rails logs with:

$ tail -f log/development.log | grep "Started "

Pull a file with:

$ curl --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/<r.id>/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom"

Check the logs:

Started GET "/api/v4/virtual_registries/packages/maven/<r.id>/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom" for 172.16.123.1 at 2024-09-24 16:54:15 +0200
Started POST "/api/v4/virtual_registries/packages/maven/<r.id>/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom/upload" for 127.0.0.1 at 2024-09-24 16:54:15 +0200

As we can see, there are two requests:

1. The request received by the virtual registry endpoint. This request returns the workhorse send-dependency command with the upload authorization.
2. The upload confirmation (on /upload), sent by workhorse once it has uploaded the file to its destination.

We completely avoided the upload authorization endpoint (url ending with /authorize) 🎉
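
If you prefer to check this programmatically rather than by eye, a small script can scan the "Started" lines. This is only a convenience sketch: `started_requests` is a hypothetical helper, and the log sample reuses the two lines shown above with 42 standing in for <r.id>.

```ruby
# Log lines copied from the output above, with 42 standing in for <r.id>.
log = <<~LOG
  Started GET "/api/v4/virtual_registries/packages/maven/42/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom" for 172.16.123.1 at 2024-09-24 16:54:15 +0200
  Started POST "/api/v4/virtual_registries/packages/maven/42/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom/upload" for 127.0.0.1 at 2024-09-24 16:54:15 +0200
LOG

# Hypothetical helper: extracts [verb, path] pairs from the "Started ..." lines of a rails log.
def started_requests(log)
  log.scan(/^Started (\w+) "([^"]+)"/)
end

requests = started_requests(log)
authorize_calls = requests.select { |_verb, path| path.end_with?('/authorize') }

puts "#{requests.size} requests, #{authorize_calls.size} to /authorize"
```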

⚠ If you want to try the scenario again, you need to reset the virtual registry. The registry "caches" the remote file, so requesting it again with $ curl will not use the send-dependency command. To reset the cache:

u.reload.cached_responses.destroy_all

🚀 Performance review

For the performance review, we're going to use the following scenario:

Use this dummy maven application.
Configure a virtual registry in our local GitLab instance that targets maven central, which is the official public registry.

Configure the maven application so that we replace the maven central reference with the virtual registry endpoint. We will use this settings.xml file:

<settings>
  <mirrors>
    <mirror>
      <id>gitlab-maven</id>
      <name>GitLab proxy of central repo</name>
      <url>http://gdk.test:8000/api/v4/virtual_registries/packages/maven/<registry.id></url>
      <mirrorOf>central</mirrorOf>
    </mirror>
  </mirrors>
  <servers>
    <server>
      <id>gitlab-maven</id>
      <configuration>
        <httpHeaders>
          <property>
            <name>Private-Token</name>
            <value><PAT></value>
          </property>
        </httpHeaders>
      </configuration>
    </server>
  </servers>
</settings>

Run $ mvn compile -s settings.xml, which will pull packages (through the virtual registry) and compile the application.
- In this configuration, we will pull close to 1000 files.
To keep the analysis focused on the web requests done between workhorse and rails, the object storage is disabled in our local GitLab instance (the file system will be used to store the uploaded files).
This is by no means a highly accurate performance analysis; the goal here is to get a glimpse of the improvements.

Between each scenario, we will make sure that:

- the virtual registry is completely empty (no cached entries) to make sure that all packages are downloaded from Maven central and they go through the send-dependency command.
- ~/.m2/repository is removed to make sure that $ mvn will pull the packages and not use the ones that are present in the local maven cache.

Here are the results:

| Metric | On `master` | With this MR | Improvement |
| --- | --- | --- | --- |
| Execution time reported by `$ mvn` | 04:16 min | 03:58 min | `7 %` |
| Number of files pulled | 916 | 916 | - |
| Number of requests made to rails | 2748 | 1832 | `33.33 %` |

The change has a positive impact: rails handles one request fewer per pulled file (916 fewer in total) and the build finishes about 7 % faster.
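
For reference, the improvement figures can be recomputed from the raw values in the results. The helper names below are hypothetical; the inputs are the numbers reported above.

```ruby
# Converts a "mm:ss" duration into seconds.
def mmss_to_seconds(value)
  minutes, seconds = value.split(':').map(&:to_i)
  minutes * 60 + seconds
end

# Relative reduction from `before` to `after`, in percent, rounded to 2 decimals.
def improvement_percent(before, after)
  ((before - after) * 100.0 / before).round(2)
end

files = 916
time_gain    = improvement_percent(mmss_to_seconds('04:16'), mmss_to_seconds('03:58'))
request_gain = improvement_percent(2748, 1832)

puts "execution time: #{time_gain} %"           # 256 s down to 238 s
puts "requests to rails: #{request_gain} %"
puts "requests per file: #{2748 / files} on master, #{1832 / files} with this MR"
```

The request counts are exact multiples of the number of files: three requests per file on `master` and two with this MR, which matches the removal of the upload authorization request.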

Edited Sep 25, 2024 by David Fernandez
