Improve performance of registry enqueuer query
## 🔎 What does this MR do and why?
We are in the process of migrating all container repositories to the new container registry. GitLab Rails drives this process with the `EnqueuerWorker`. The `EnqueuerWorker` is responsible for finding the next qualified container repository for import, and then starting the import by making an API request to the registry.
We discovered the main query being used to find the next qualified container repository is underperforming, using a `Seq Scan` instead of an `Index Scan` on `container_repositories`. This was rather confusing because we already have a few indexes that looked like the right index for the job:
```sql
"idx_container_repos_on_migration_state_migration_plan_created" btree (migration_state, migration_plan, created_at)
"tmp_idx_container_repos_on_non_migrated" btree (project_id, id) WHERE migration_state <> 'import_done'::text AND created_at < '2022-01-23 00:00:00'::timestamp without time zone
```
After some digging, we discovered that the PostgreSQL planner mistakenly calculates the cost of using an index as higher than the cost of a sequential scan, due to startup costs. This means that for this particular query, no index on a related condition will ever be chosen over the sequential scan.

Based on its internal statistics, the planner expects to find a matching record faster if it just starts scanning the table from the beginning. But if we ask it for two records, it decides to use the index instead.
So in this MR, we update from using `.take` (which applies `LIMIT 1`) to using `.limit(2)[0]`. We cannot use `.limit(1).first` because Rails will translate that into `LIMIT 1`. Using `[0]` forces Rails to execute the query with `LIMIT 2` and then use the first record returned. If no records are found, `nil` is returned just as it would have been using `.take`.
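The return-value equivalence can be sketched in plain Ruby. This is an illustration only: `FakeRelation` is a made-up stand-in for an ActiveRecord relation, not real Rails code, and it exists just to show that `limit(2)[0]` yields the first record or `nil` while sending `LIMIT 2` to the database.

```ruby
# Sketch: why `.limit(2)[0]` behaves like `.take` at the Ruby level.
# FakeRelation is an illustrative stand-in for an ActiveRecord relation;
# it records the LIMIT it was given instead of building real SQL.
class FakeRelation
  attr_reader :limit_value

  def initialize(records)
    @records = records
  end

  def limit(n)
    @limit_value = n
    self
  end

  # `[0]` loads the (limited) result set and returns the first row,
  # or nil when the result set is empty.
  def [](index)
    @records.first(@limit_value)[index]
  end
end

empty = FakeRelation.new([])
full  = FakeRelation.new([:repo_a, :repo_b, :repo_c])

p empty.limit(2)[0] # => nil, same as `.take` on an empty relation
p full.limit(2)[0]  # => :repo_a, the first matching record
p full.limit_value  # => 2, the LIMIT that would reach the database
```

The caller only ever sees one record (or `nil`), so the extra row fetched by `LIMIT 2` is invisible outside the query itself.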
You may be asking, "Why are we adding what looks like a band-aid to the query rather than fixing this at a deeper level?" The registry migration is continuously running on production, so this query currently executes around 2000 times per hour. We are on a tight timeline to complete the migration, so we want to introduce as few changes, and as little risk, as possible. Updating to use a limit and ignoring the extra returned record is the "boring solution" in this case. It allows us to keep moving forward without risking the introduction of any further bugs into the system.
## 💿 Database
The only update to this query is changing `LIMIT 1` to `LIMIT 2`:

```sql
SELECT "container_repositories".* FROM "container_repositories"
INNER JOIN "projects" ON "projects"."id" = "container_repositories"."project_id"
INNER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id"
WHERE "container_repositories"."migration_state" = 'default'
AND "container_repositories"."created_at" < '2022-01-23 00:00:00'
AND (
  "container_repositories"."migration_plan" IN ('free', 'early_adopter')
  OR "container_repositories"."migration_plan" IS NULL
)
AND (NOT EXISTS (
  SELECT 1
  FROM feature_gates
  WHERE feature_gates.feature_key = 'container_registry_phase_2_deny_list'
  AND feature_gates.key = 'actors'
  AND feature_gates.value = concat('Group:', namespaces.traversal_ids[1])
)) LIMIT 2; -- previously: LIMIT 1;
```
- Explain plan before (16.67s): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/10144/commands/35811 😱
- Explain plan after (16ms): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/10144/commands/35813 😎

Note also the large reduction in buffer usage.
## Screenshots or screen recordings
With no returned values:

```ruby
[28] pry(main)> ContainerRepository.ready_for_import.limit(2)[0]
  ContainerRepository Load (1.6ms)  SELECT "container_repositories".* FROM "container_repositories" INNER JOIN "projects" ON "projects"."id" = "container_repositories"."project_id" INNER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id" WHERE "container_repositories"."migration_state" = 'default' AND "container_repositories"."created_at" < '2022-01-23 00:00:00' AND "container_repositories"."migration_plan" IN ('ultimate', 'gold', 'ultimate_trial') AND (NOT EXISTS (
  SELECT 1
  FROM feature_gates
  WHERE feature_gates.feature_key = 'container_registry_phase_2_deny_list'
  AND feature_gates.key = 'actors'
  AND feature_gates.value = concat('Group:', namespaces.traversal_ids[1])
)) LIMIT 2 /*application:console,db_config_name:main,line:(pry):28:in `__pry__'*/
=> nil
```
With multiple returned values (I've omitted `.ready_for_import` from these to show how the LIMIT is affected and for easier setup):
```ruby
[29] pry(main)> ContainerRepository.limit(2)[0]
  ContainerRepository Load (0.5ms)  SELECT "container_repositories".* FROM "container_repositories" LIMIT 2 /*application:console,db_config_name:main,line:(pry):29:in `__pry__'*/
=> #<ContainerRepository:0x00007fc98cf642c8
```
## How to set up and validate locally
- Create some `container_repositories` locally:

  ```ruby
  10.times { FactoryBot.create(:container_repository, project: Project.first) }
  ```

- Try running the following to ensure they return the same results:

  Example with no results returned (you may have something returned if you have older `container_repositories` records stored locally):

  ```ruby
  ContainerRepository.ready_for_import.take # old
  ContainerRepository.ready_for_import.limit(2)[0] # new
  ```

  Example with results returned:

  ```ruby
  ContainerRepository.take # old
  ContainerRepository.limit(2)[0] # new
  ```
## 📏 MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related: #362544 (closed)