Loose index scan distinct count [RUN ALL RSPEC] [RUN AS-IF-FOSS]
What does this MR do?
This MR introduces a new utility class to build batched distinct count queries using the loose index scan technique.
The utility class is integrated with BatchCount
behind a feature flag: loose_index_scan_for_distinct_values
When counting distinct (distinct(column)
) elements the DB reads all related rows from the disk. Example:
select count(distinct(project_id)) from issues where project_id in (278964, 2789645, 278966);
Aggregate (cost=3703.17..3703.18 rows=1 width=8) (actual time=45.722..45.723 rows=1 loops=1)
-> Index Only Scan using index_issues_on_project_id_and_iid on issues (cost=0.57..3493.77 rows=83758 width=4) (actual time=0.032..30.022 rows=80808 loops=1)
Index Cond: (project_id = ANY ('{278964,2789645,278966}'::integer[]))
Heap Fetches: 3347
Planning Time: 0.575 ms
Execution Time: 45.773 ms
This query returns 1 as the result. To produce this result , the DB reads about 80000 items from the index.
The loose index scan minimalizes the number of rows read per distinct item by utilizing a recursive loop:
Gitlab::Database::DistinctCount.new(Issue, :project_id).count(from: 278964, to: 278967)
WITH RECURSIVE "counter_cte" AS (
(SELECT "issues"."project_id" AS project_id
FROM "issues"
WHERE "issues"."project_id" >= 278964
AND "issues"."project_id" < 278967
ORDER BY "issues"."project_id" ASC
LIMIT 1)
UNION
(SELECT
(SELECT "issues"."project_id"
FROM "issues"
WHERE "issues"."project_id" > "counter_cte"."project_id"
AND "issues"."project_id" < 278967
ORDER BY "issues"."project_id" ASC
LIMIT 1) AS project_id
FROM "counter_cte"
WHERE "counter_cte"."project_id" < 278967))
SELECT COUNT(project_id)
FROM "counter_cte" AS "issues";
Aggregate (cost=22.52..22.53 rows=1 width=8) (actual time=0.040..0.041 rows=1 loops=1)
CTE counter_cte
-> Recursive Union (cost=0.57..21.82 rows=31 width=4) (actual time=0.020..0.036 rows=2 loops=1)
-> Limit (cost=0.57..0.61 rows=1 width=4) (actual time=0.019..0.019 rows=1 loops=1)
-> Index Only Scan using index_issues_on_project_id_and_iid on issues issues_2 (cost=0.57..3679.39 rows=83257 width=4) (actual time=0.018..0.018 rows=1 loops=1)
Index Cond: ((project_id >= 278964) AND (project_id < 278967))
Heap Fetches: 0
-> WorkTable Scan on counter_cte (cost=0.00..2.06 rows=3 width=4) (actual time=0.007..0.007 rows=0 loops=2)
Filter: (project_id < 278967)
Rows Removed by Filter: 0
SubPlan 1
-> Limit (cost=0.57..0.61 rows=1 width=4) (actual time=0.008..0.009 rows=0 loops=1)
-> Index Only Scan using index_issues_on_project_id_and_iid on issues issues_1 (cost=0.57..15101.54 rows=344968 width=4) (actual time=0.008..0.008 rows=0 loops=1)
Index Cond: ((project_id > counter_cte.project_id) AND (project_id < 278967))
Heap Fetches: 0
-> CTE Scan on counter_cte issues (cost=0.00..0.62 rows=31 width=4) (actual time=0.021..0.037 rows=2 loops=1)
Planning Time: 0.979 ms
Execution Time: 0.088 ms
(18 rows)
The Gitlab::Database::DistinctCount
works with Usage Data with one limitation: GROUP BY
is not supported.
Performance
I ran the distinct_count(::Ci::Build.where(time_period), :user_id)
with the new code on DB lab and I got a bit better total runtime: 27min vs 32min (measured on PG.ai where I have about 160ms latency)
Results:
Query | Distinct count | Optimized distinct count |
---|---|---|
Slowest CI::Build batch | 10s - https://explain.depesz.com/s/RWup | 6s - https://explain.depesz.com/s/fubX |
Random CI::Build batch | 2.1s - https://explain.depesz.com/s/sRnv | 0.8s - https://explain.depesz.com/s/Idpy |
Random deployment batch | 1s - https://explain.depesz.com/s/Cg9I | 0.4s https://explain.depesz.com/s/CbcX |
Each measurements are uncached, waited several hours between running the queries on a PRD replica.
Does this MR meet the acceptance criteria?
Conformity
-
📋 Does this MR need a changelog?-
I have included a changelog entry. -
I have not included a changelog entry because _____.
-
-
Documentation (if required) -
Code review guidelines -
Merge request performance guidelines -
Style guides -
Database guides -
Separation of EE specific content
Availability and Testing
-
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process. -
Tested in all supported browsers -
Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
-
Label as security and @ mention @gitlab-com/gl-security/appsec
-
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods -
Security reports checked/validated by a reviewer from the AppSec team