Improve EnqueuerWorker query performance
What does this MR do and why?
We are in the process of migrating all container repositories to the new container registry. GitLab Rails drives this process with the `EnqueuerWorker`, which is responsible for finding the next qualified container repository for import and then starting the import by making an API request to the registry.
In #362544 (closed), we discovered a strange performance issue where the main query used by this worker performs a Seq Scan rather than an Index Scan, despite a well-defined index existing for the query. The PostgreSQL planner mistakenly calculates the cost of using the index as higher than the cost of a sequential scan because of startup costs: based on its internal statistics, it expects to find a matching record faster by scanning the table from the beginning. If we ask it for more records, it decides to use the index instead.
In the previous MR that fixed this problem, !87733 (merged), we updated the query from `LIMIT 1` to `LIMIT 2`. But it seems the planner now thinks it will find the first 2 records faster with a Seq Scan. So in this MR, we increase to `LIMIT 25` to ensure we do not run into the problem again for the remaining duration of the registry migration (just a few more weeks).
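The shape of the change is sketched below. All names here are hypothetical and a stand-in relation replaces the real ActiveRecord scope to keep the sketch self-contained; the point is that the worker still only imports one repository per run, so over-fetching is purely a planner hint:

```ruby
# Sketch only: FakeRelation stands in for the ActiveRecord scope on
# ContainerRepository used by the real worker.
FakeRelation = Struct.new(:rows) do
  def limit(n)
    FakeRelation.new(rows.first(n)) # would emit "LIMIT n" in real SQL
  end

  def first
    rows.first
  end
end

BUFFERED_LIMIT = 25 # was 2; large enough to steer the planner to the index

def next_qualified_repository(relation)
  # Ask for 25 rows so the planner chooses the Index Scan, but use only one.
  relation.limit(BUFFERED_LIMIT).first
end

repos = FakeRelation.new([:repo_a, :repo_b, :repo_c])
next_qualified_repository(repos) # => :repo_a
```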
The migration has a fairly strict due date and is continuously running, so to keep risk low, we are choosing to make this minimal change to the query rather than explore other options, such as setting `seq_page_cost` higher in the transaction that performs the query.
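For reference, the alternative we decided against would look roughly like the following. This is a sketch only; the cost value shown is illustrative and would need tuning:

```sql
-- Hypothetical alternative (not implemented): raise the relative cost of
-- sequential page reads for this transaction only, so the planner prefers
-- the index. SET LOCAL resets automatically at COMMIT or ROLLBACK.
BEGIN;
SET LOCAL seq_page_cost = 4;  -- default is 1.0; value here is illustrative
-- ... run the EnqueuerWorker query with its original LIMIT ...
COMMIT;
```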
The entire system driving the migration is behind a feature flag, so I have not included a changelog.
Database
- Query before (LIMIT 2): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/10687/commands/38546
- Query with change (LIMIT 25): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/10709/commands/38617
- Reference query (LIMIT 3): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/10687/commands/38558
The Postgres query planner uses a variety of statistics and weights to decide on a query plan. It assigns the Seq Scan a certain cost based on the number of rows it expects to scan, which is determined by the table statistics (rough estimates of things like row/tuple counts). What is happening here is that the calculated cost of the Seq Scan is lower than the cost of an Index Scan, so the Seq Scan is used. We can see the cost difference by looking at the explain plan, then disabling Seq Scans and looking at the explain plan again:
```sql
-- This was tested on a postgres.ai cloned instance
gitlabhq_dblab# explain SELECT "container_repositories".* FROM "container_repositories" INNER JOIN "projects" ON "projects"."id" = "container_repositories"."project_id" INNER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id" WHERE "container_repositories"."migration_state" = 'default' AND "container_repositories"."created_at" < '2022-01-23 00:00:00' AND (NOT EXISTS (
    SELECT 1
    FROM feature_gates
    WHERE feature_gates.feature_key = 'container_registry_phase_2_deny_list'
    AND feature_gates.key = 'actors'
    AND feature_gates.value = concat('Group:', namespaces.traversal_ids[1]) )) LIMIT 2;
                                                                                    QUERY PLAN
═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
 Limit  (cost=1.28..7.86 rows=2 width=168)
   ->  Nested Loop Anti Join  (cost=1.28..2009839.49 rows=611156 width=168)
         ->  Nested Loop  (cost=1.00..1812567.19 rows=650825 width=196)
               ->  Nested Loop  (cost=0.44..1395617.51 rows=650825 width=172)
                     ->  Seq Scan on container_repositories  (cost=0.00..218572.13 rows=650825 width=168)
                           Filter: ((created_at < '2022-01-23 00:00:00'::timestamp without time zone) AND (migration_state = 'default'::text))
                     ->  Index Scan using projects_pkey on projects  (cost=0.44..1.81 rows=1 width=8)
                           Index Cond: (id = container_repositories.project_id)
               ->  Index Scan using namespaces_pkey on namespaces  (cost=0.56..0.64 rows=1 width=32)
                     Index Cond: (id = projects.namespace_id)
         ->  Index Only Scan using index_feature_gates_on_feature_key_and_key_and_value on feature_gates  (cost=0.28..0.30 rows=1 width=8)
               Index Cond: ((feature_key = 'container_registry_phase_2_deny_list'::text) AND (key = 'actors'::text) AND (value = concat('Group:', (namespaces.traversal_ids)[1])))
(12 rows)

----------------------
-- Disable Seq Scan --
----------------------
gitlabhq_dblab# set enable_seqscan to off;
SET
gitlabhq_dblab# explain SELECT "container_repositories".* FROM "container_repositories" INNER JOIN "projects" ON "projects"."id" = "container_repositories"."project_id" INNER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id" WHERE "container_repositories"."migration_state" = 'default' AND "container_repositories"."created_at" < '2022-01-23 00:00:00' AND (NOT EXISTS (
    SELECT 1
    FROM feature_gates
    WHERE feature_gates.feature_key = 'container_registry_phase_2_deny_list'
    AND feature_gates.key = 'actors'
    AND feature_gates.value = concat('Group:', namespaces.traversal_ids[1]) )) LIMIT 2;
                                                                                    QUERY PLAN
═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
 Limit  (cost=1.70..7.87 rows=2 width=168)
   ->  Nested Loop Anti Join  (cost=1.70..1883835.70 rows=611156 width=168)
         ->  Nested Loop  (cost=1.42..1686563.40 rows=650825 width=196)
               ->  Nested Loop  (cost=0.86..1269613.71 rows=650825 width=172)
                     ->  Index Scan using tmp_idx_container_repos_on_non_migrated on container_repositories  (cost=0.42..92568.34 rows=650825 width=168)
                           Filter: (migration_state = 'default'::text)
                     ->  Index Scan using projects_pkey on projects  (cost=0.44..1.81 rows=1 width=8)
                           Index Cond: (id = container_repositories.project_id)
               ->  Index Scan using namespaces_pkey on namespaces  (cost=0.56..0.64 rows=1 width=32)
                     Index Cond: (id = projects.namespace_id)
         ->  Index Only Scan using index_feature_gates_on_feature_key_and_key_and_value on feature_gates  (cost=0.28..0.30 rows=1 width=8)
               Index Cond: ((feature_key = 'container_registry_phase_2_deny_list'::text) AND (key = 'actors'::text) AND (value = concat('Group:', (namespaces.traversal_ids)[1])))
(12 rows)
```
- With Seq Scan: `cost=1.28..7.86`
- Without Seq Scan: `cost=1.70..7.87`
The Seq Scan has a lower estimated cost, so Postgres chooses that plan.
To decide on a limit that gives us enough buffer, I compared the Index Scan cost with the Seq Scan cost at several limits. Right now, the Index Scan has the better cost, so to obtain the Seq Scan cost, I removed the various indexes the query might use until a Seq Scan was chosen:
| Limit | Index Scan cost | Seq Scan cost | Cost difference |
|---|---|---|---|
| 1 | `cost=1.70..4.79` | `cost=1.28..4.57` | -0.22 |
| 2 | `cost=1.70..7.87` | `cost=1.28..7.86` | -0.01 |
| 10 | `cost=1.70..32.53` | `cost=1.28..34.17` | 1.64 |
| 25 | `cost=1.70..78.77` | `cost=1.28..83.50` | 4.73 |
| 50 | `cost=1.70..155.83` | `cost=1.28..165.71` | 9.88 |
| 100 | `cost=1.70..309.95` | `cost=1.28..330.14` | 20.19 |
| 1000 | `cost=1005.74..2343.71` | `cost=1005.32..2523.71` | 180 (uses a parallel index scan, leading to higher cost and query time) |
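The pattern in the table can be reproduced from the two full explain plans earlier: the cost of a `LIMIT n` node scales roughly linearly between the child plan's startup cost and its total cost. That linear model is an assumption about the planner on my part, but it matches the observed numbers closely:

```ruby
# Approximate the planner's LIMIT-node cost: startup cost plus the fraction
# of the child plan's run cost needed to produce n of its estimated rows.
def limit_cost(startup:, total:, rows:, n:)
  startup + (total - startup) * (n.to_f / rows)
end

# Numbers taken from the two explain plans above (611,156 estimated rows
# out of the Nested Loop Anti Join node).
seq   = { startup: 1.28, total: 2_009_839.49, rows: 611_156 }
index = { startup: 1.70, total: 1_883_835.70, rows: 611_156 }

limit_cost(**seq,   n: 2)  # ≈ 7.86, marginally cheaper -> Seq Scan wins
limit_cost(**index, n: 2)  # ≈ 7.86
limit_cost(**seq,   n: 25) # ≈ 83.5
limit_cost(**index, n: 25) # ≈ 78.8, now cheaper -> Index Scan wins
```

The Seq Scan's lower startup cost dominates at tiny limits, while the Index Scan's lower per-row cost dominates once the limit grows, which is why padding the limit flips the chosen plan.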
We know that in the past `LIMIT 1` worked with the index, but we had to transition to `LIMIT 2` when we crossed the threshold where the Seq Scan cost less. Since then, we have hit the same threshold for `LIMIT 2` and now have to transition to at least `LIMIT 3`. Both costs increase as the limit increases, but we can speculate that switching to `LIMIT 2` only closed the gap by about 0.22.
Based on this, I recommend we add additional padding by going with LIMIT 25.
The query time is still reasonable: 168ms cold, 7ms warm. This query runs in a loop, so we can expect to see the warm timing most of the time.
Screenshots or screen recordings
With no returned values:
```ruby
[28] pry(main)> ContainerRepository.ready_for_import.limit(25)[0]
  ContainerRepository Load (1.6ms)  SELECT "container_repositories".* FROM "container_repositories" INNER JOIN "projects" ON "projects"."id" = "container_repositories"."project_id" INNER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id" WHERE "container_repositories"."migration_state" = 'default' AND "container_repositories"."created_at" < '2022-01-23 00:00:00' AND (NOT EXISTS (
    SELECT 1
    FROM feature_gates
    WHERE feature_gates.feature_key = 'container_registry_phase_2_deny_list'
    AND feature_gates.key = 'actors'
    AND feature_gates.value = concat('Group:', namespaces.traversal_ids[1])
  )) LIMIT 25 /*application:console,db_config_name:main,line:(pry):28:in `__pry__'*/
=> nil
```
With multiple returned values (I've omitted `.ready_for_import` from these to show how the `LIMIT` is affected, and for easier setup):
```ruby
[29] pry(main)> ContainerRepository.limit(25)[0]
  ContainerRepository Load (0.5ms)  SELECT "container_repositories".* FROM "container_repositories" LIMIT 25 /*application:console,db_config_name:main,line:(pry):29:in `__pry__'*/
=> #<ContainerRepository:0x00007fc98cf642c8
```
How to set up and validate locally
- Create some container_repositories locally:

  ```ruby
  26.times { FactoryBot.create(:container_repository, project: Project.first) }
  ```

- Try running the following to ensure they return the same results.

  Example with no results returned (you may have something returned if you have older `container_repositories` records stored locally):

  ```ruby
  ContainerRepository.ready_for_import.limit(2)[0]  # old
  ContainerRepository.ready_for_import.limit(25)[0] # new
  ```

  Example with results returned:

  ```ruby
  ContainerRepository.limit(2)[0]  # old
  ContainerRepository.limit(25)[0] # new
  ```
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #365893 (closed).