NPM package registry: fine tune a query for the instance level
👽 Context
The NPM registry can be accessed at two "levels":
- The project level
- The instance level
At the instance level, only packages that follow the naming convention are available.
How do we locate the correct package? Scanning all the NPM packages for a given GitLab instance would go through way too many packages. What we do is that we leverage the naming convention to find the project hosting the packages.
- We extract the package scope.
- From that scope, we locate the namespace.
- We execute a package finder for that namespace.
- We look at the most recent package and read the
project_id
. We just reduced an instance access level to a project access level.- Accessing packages within a project scope is obviously way more efficient than trying to scan all existing npm packages.
These steps are nicely packed here.
The Packages::Package
table we could find two npm packages with the same name and same version. That scope ensures that only the most recent row is used.
Looking at the source of that scope, we can see that all
is called in a subquery. The problem is that all
will basically use the current conditions attached to the main query to use them in the subquery = the conditions will appear twice.
Back to step (3.) above, going through all the packages of a given root namespace can be expensive. Some namespaces have a high number of nested sub groups and basically, the finder will need to use the hierarchy traversal conditions. See https://gitlab.com/gitlab-org/gitlab/-/blob/7eac953926529288a5a332c18a058e8dcc4efb61/app/models/namespace.rb#L283. Those conditions efficiency depends on the number of objects nested in the root namespace.
Now combine both points and what is the result? This. By reusing a condition that is hard to execute twice, we get an overall slow query:
Query
SELECT "packages_packages".*
FROM "packages_packages"
WHERE "packages_packages"."project_id" IN (
SELECT "projects"."id"
FROM "projects"
WHERE "projects"."namespace_id" IN (
SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM "namespaces"
WHERE (traversal_ids @> ('{XXX}'))
)
)
AND "packages_packages"."package_type" = 2
AND "packages_packages"."name" = 'XXX'
AND "packages_packages"."status" = 0
AND "packages_packages"."id" IN (
SELECT MAX(id) AS id
FROM "packages_packages"
WHERE "packages_packages"."project_id" IN (
SELECT "projects"."id"
FROM "projects"
WHERE "projects"."namespace_id" IN (
SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM "namespaces"
WHERE (traversal_ids @> ('{XXX}'))
)
)
AND "packages_packages"."package_type" = 2
AND "packages_packages"."name" = 'XXX'
AND "packages_packages"."status" = 0
GROUP BY "packages_packages"."version"
)
ORDER BY "packages_packages"."id" DESC
LIMIT 1;
This is issue #339835 (closed).
🔧 Solution
How can we improve things?
Looking at the query above, we see that rails is smart enough to compact the scope + .last
+ &.project_id
in a single query. In other words, the query is already returning the project id linked to the most recent package that has the name that we look for.
Because, we're only interested in the most recent package, we don't need to use the scope at all. Why handle duplicated packages when we're basically using the most recent package? This means that we don't need the .last_of_each_version
scope for this call.
The finder being used in multiple locations, we can't simply drop the scope from the finder. We need to update it so that it accepts an option that let clients of this finder control if they want to use the scope or not.
Lastly, because the npm package registry is one of the most used on gitlab.com, the changes here are going to be deployed behind a feature flag. This is to gives us a chance to rollout the fix incrementally and to have a safety net to fall back to the old implementation quickly if an issue occurs.
🤔 What does this MR do?
- Update
Packages::Npm::PackageFinder
to acceptlast_of_each_version
option. That option will be enabled by default. - Update the instance level access to don't use that option when searching for the project id.
- Gate those changes behind a feature flag:
npm_finder_query_avoid_duplicated_conditions
- Rollout issue: #340099 (closed)
- Update the related specs
📸 Screenshots or Screencasts (strongly suggested)
Here is the original query:
Original query
SELECT "packages_packages".*
FROM "packages_packages"
WHERE "packages_packages"."project_id" IN (
SELECT "projects"."id"
FROM "projects"
WHERE "projects"."namespace_id" IN (
SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM "namespaces"
WHERE (traversal_ids @> ('{XXX}'))
)
)
AND "packages_packages"."package_type" = 2
AND "packages_packages"."name" = 'XXX'
AND "packages_packages"."status" = 0
AND "packages_packages"."id" IN (
SELECT MAX(id) AS id
FROM "packages_packages"
WHERE "packages_packages"."project_id" IN (
SELECT "projects"."id"
FROM "projects"
WHERE "projects"."namespace_id" IN (
SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM "namespaces"
WHERE (traversal_ids @> ('{XXX}'))
)
)
AND "packages_packages"."package_type" = 2
AND "packages_packages"."name" = 'XXX'
AND "packages_packages"."status" = 0
GROUP BY "packages_packages"."version"
)
ORDER BY "packages_packages"."id" DESC
LIMIT 1;
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6369/commands/21713
Here is the query without using the scope:
Query without scope
SELECT "packages_packages".*
FROM "packages_packages"
WHERE "packages_packages"."project_id" IN (
SELECT "projects"."id"
FROM "projects"
WHERE "projects"."namespace_id" IN (
SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM "namespaces"
WHERE (traversal_ids @> ('{XXX}'))
)
)
AND "packages_packages"."package_type" = 2
AND "packages_packages"."name" = 'XXX'
AND "packages_packages"."status" = 0
ORDER BY "packages_packages"."id" DESC
LIMIT 1;
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6393/commands/21819
As wanted, the subquery (with the MAX
) has been removed.
It's a bit hard to evaluate the impact of one less subquery as the conditions for group hierarchy traversals behave a bit strangely on pg.ai. Sometimes they are fast and other times they are slower. We still believe that we should have better execution times without the subquery.
⚙ How to setup and validate locally (strongly suggested)
Group hierarchy traversals seems to have several improvements in progress. To get the query that we have on production, we need to enable both feature flag:
use_traversal_ids
recursive_approach_for_all_projects
Please note that no matter what is the form of the group hierarchy traversal condition, this MR will still remove a duplicated condition.
- Have a user with a personal access token ready (scope
api
) - Create a group
- Create a project inside that group
- With gl_pru, upload 3-4 packages but be careful to use this form for the package name:
$ bundle exec thor package:push --package-type=npm --user=<username> --token=<pat_token> --url=<gitlab_base_url>/api/v4/projects/<project_id>/packages/npm/ --name=test --version=1.3.8 --npm-package-scope=@<group_path>
- In a rails console:
Packages::Npm::PackageFinder.new('@<group_path>/test', namespace: Namespace.top_most.by_path('<group_path>'), last_of_each_version: true) Packages::Npm::PackageFinder.new('@<group_path>/test', namespace: Namespace.top_most.by_path('<group_path>'), last_of_each_version: false)
When last_of_each_version
is set to true
, it will give you the query currently used on gitlab.com.
When last_of_each_version
is set to false
, it will give you the query as modified by this MR with the feature flag is enabled.
📐 Does this MR meet the acceptance criteria?
Conformity
- [-] I have included changelog trailers, or none are needed. (Does this MR need a changelog?)
- [-] I have added/updated documentation, or it's not needed. (Is documentation required?)
- [-] I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?)
-
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?) -
I have self-reviewed this MR per code review guidelines. -
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines) -
I have followed the style guides. -
This change is backwards compatible across updates, or this does not apply.
Availability and Testing
-
I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.) - [-] I have tested this MR in all supported browsers, or it's not needed.
- [-] I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.
Security
Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.
- [-] Label as security and @ mention
@gitlab-com/gl-security/appsec
- [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- [-] Security reports checked/validated by a reviewer from the AppSec team
💿 Database review
- Query to get the project id from a namespace: !69572 (comment 669990307)