Check maven path for maven file API endpoints [RUN ALL RSPEC] [RUN AS-IF-FOSS]
🍰 Context
The maven package registry exposes a few API endpoints so that maven clients (mainly $ mvn
) can interact with it.
In addition, those endpoints are "duplicated" at different levels so that users can use the most appropriate one.
The endpoint we're going to focus on this MR is the one used to "pull" or download a given package file. Basically, the request goes like this: Hey package registry, given file X from package P
. The backend will check across the available packages to locate the proper one and locate the right package file. (This part is not as trivial as it seems)
On the other hand, $ mvn
has a given behavior for pulling dependencies. Let's say that we have a maven project with 5 dependencies and 3 package registries set up (maven central + gitlab.com + a third party registry). The $ mvn
will not check the registries in the order they are declared. Instead, it will check them all at the same time. I'm not sure why is it so.
So in other words, on the maven file API endpoint, gitlab.com is completely hammered with requests hey do you have this file for this package?
.
This leads to this peculiar situation where the majority of the requests on those file endpoints don't have a 200
response but 404
response:
=> 60-75% of the requests on those endpoints are for package files that doesn't exist at all on gitlab.com
The main issue here is that the backend will still go through all the queries (through the groups/projects and packages) to end up with an empty result set and reply 404
.
In #326819 (closed), we noticed a way to improve this situation:
- One the arguments of that endpoint is the maven
path
. Thispath
is actually persisted in the database on thepackages_maven_metadata
table.- The idea here is to add a preliminary query that will quickly check if the requested
path
exists in the table. If it exists, we let the backend do the work as it does today. If it doesn't exist (the majority of the requests), we can quickly reply404
.
- The idea here is to add a preliminary query that will quickly check if the requested
This improvement should help with the horrible response times we see on those endpoints.
Note that those endpoints deal with two methods: GET
and HEAD
requests.
🤔 What does this MR do?
- Add a preliminary check for the maven file API endpoints that will verify the existency of the given maven
path
inpackages_maven_metadata
- if the path doesn't exist, a
404
is directly returned - if the path exist, the backend will process the request the same way it's currently doing
- if the path doesn't exist, a
- Update the related specs
- Update specs for the file API endpoint on all levels
- Util functions were upgraded with keyword arguments for better readability
- Util functions now accept a
path
argument that will be used in the contacted url. - Added several contexts for the case
with a non existing maven path
- Those file API endpoints are among the busiest endpoints for the ~"group::package" team. As such, the blast radius in case of a problem with this change is quite big. This change is gated behind a feature flag to provide an additional safety net for gitlab.com
- Tracking issue: #327487 (closed)
📸 Screenshots (strongly suggested)
Let's have a look at the SQL queries triggered by the different levels, with / without this MR, with / without the feature flag enabled.
Level | Conditions | Number of SQL queries | Difference |
---|---|---|---|
Project | Before this MR Path exists |
11 | Baseline |
Project | Before this MR Path doesn't exists |
10 | Baseline |
Project | With this MR FF disabled Path exists |
12 | +1 query (the FF existence check) |
Project | With this MR FF disabled Path doesn't exist |
11 | +1 query (the FF existence check) |
Project | With this MR FF enabled Path exists |
13 | +2 queries (the FF existence check and the maven path existence check) |
Project | With this MR FF enabled Path doesn't exist |
3 | -7 queries |
Group | Before this MR Path exists |
11 | Baseline |
Group | Before this MR Path doesn't exist |
10 | Baseline |
Group | With this MR FF disabled Path exists |
12 | +1 query (the FF existence check) |
Group | With this MR FF disabled Path doesn't exist |
13 | +1 query (the FF existence check) |
Group | With this MR FF enabled Path exists |
13 | +2 queries (the FF existence check and the maven path existence check) |
Group | With this MR FF enabled Path doesn't exist |
3 | -7 queries |
Instance | Before this MR Path exists |
9 | Baseline |
Instance | Before this MR Path doesn't exist |
8 | Baseline |
Instance | With this MR FF disabled Path exists |
10 | +1 query (the FF existence check) |
Instance | With this MR FF disabled Path doesn't exist |
9 | +1 query (the FF existence check) |
Instance | With this MR FF enabled Path exists |
11 | +2 queries (the FF existence check and the maven path existence check) |
Instance | With this MR FF enabled Path doesn't exist |
3 | -5 queries |
Notes
- The MR impact is minimal (+1/+2 queries).
- One of the added query is the feature flag existence check. This query is not triggered all the time. In addition, this query will disappear when the feature flag is removed
- The other query is the path existence check. Given the execution time of this query, the impact is minimal
- The most used scenario (feature flag enabled + path doesn't exist) gets a nice bonus of -5/-7 SQL queries.
In conclusion, this change properly accelerate the conditions we want to target (the requested package file doesn't exist) while minimally impacting the other conditions. The above analysis bring us more confidence in this change
📐 Does this MR meet the acceptance criteria?
Conformity
-
📋 Does this MR need a changelog?-
I have included a changelog entry. (needed by the new index) - [-] I have not included a changelog entry because ________.
-
- [-] Documentation (if required)
-
Code review guidelines -
Merge request performance guidelines -
Style guides -
Database guides - [-] Separation of EE specific content
Availability and Testing
- [-] Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- [-] Tested in all supported browsers
- [-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
- [-] Label as security and @ mention
@gitlab-com/gl-security/appsec
- [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- [-] Security reports checked/validated by a reviewer from the AppSec team
💾 Database review
Migration up
$ rails db:migrate
== 20210413092922 AddIndexToPackagesMavenMetadataPath: migrating ==============
-- transaction_open?()
-> 0.0000s
-- index_exists?(:packages_maven_metadata, :path, {:name=>"index_packages_maven_metadata_on_path", :algorithm=>:concurrently})
-> 0.0079s
-- execute("SET statement_timeout TO 0")
-> 0.0006s
-- add_index(:packages_maven_metadata, :path, {:name=>"index_packages_maven_metadata_on_path", :algorithm=>:concurrently})
-> 0.0199s
-- execute("RESET ALL")
-> 0.0007s
== 20210413092922 AddIndexToPackagesMavenMetadataPath: migrated (0.0300s) =====
Migration down
$ rails db:rollback
== 20210413092922 AddIndexToPackagesMavenMetadataPath: reverting ==============
-- transaction_open?()
-> 0.0000s
-- index_exists?(:packages_maven_metadata, :path, {:name=>"index_packages_maven_metadata_on_path", :algorithm=>:concurrently})
-> 0.0021s
-- execute("SET statement_timeout TO 0")
-> 0.0005s
-- remove_index(:packages_maven_metadata, {:name=>"index_packages_maven_metadata_on_path", :algorithm=>:concurrently, :column=>:path})
-> 0.0054s
-- execute("RESET ALL")
-> 0.0005s
== 20210413092922 AddIndexToPackagesMavenMetadataPath: reverted (0.0094s) =====