Optimize scenario where all runners are being considered
pedropombeiro/377963/remove-ordering-by-runner_id
branch, please change to master
before merging
What does this MR do and why?
Describe in detail what your merge request does and why.
This MR improves the logic introduced in !107694 (merged) so that we take into account the scenario where all runners are being queries, not just a subset. In that special case, we can just query the latest 100 builds, without referencing the ci_runners
table.
Part of Provide estimated wait times for instance runners (#377963 - closed)
Screenshots or screen recordings
Screenshots are required for UI changes, and strongly recommended for all other merge requests.
How to set up and validate locally
Numbered steps to set up and validate the change are strongly suggested.
-
Go to the shell in your GDK
gitlab
directory and runbundle exec rake "gitlab:seed:runner_fleet[root]"
. This will seed your GDK with some runners and jobs required for testing this MR. -
Log into http://gdk.test:3000 as an administrator. Make sure you have an Ultimate license enabled.
-
Visit http://gdk.test:3000/-/graphql-explorer and enter the following query:
{ runners { count jobsStatistics { queuedDuration { p50 p75 p90 p95 p99 } } } }
-
Upon running the query, you should see the different percentile values.
Database query plan
Old query
SQL query
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY builds.duration) AS p50, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER
BY builds.duration) AS p75, PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY builds.duration) AS p90,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY builds.duration) AS p95,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY builds.duration) AS p99
FROM (
SELECT ("ci_builds"."started_at" - "ci_builds"."queued_at") AS duration
FROM (WITH RECURSIVE "array_cte" AS MATERIALIZED (
SELECT "ci_runners"."id"
FROM "ci_runners"), "recursive_keyset_cte" AS ((
SELECT NULL::ci_builds AS records, array_cte_id_array, ci_builds_id_array,
0::bigint AS count
FROM (
SELECT ARRAY_AGG("array_cte"."id") AS array_cte_id_array, ARRAY_AGG("ci_builds"."id") AS ci_builds_id_array
FROM (
SELECT "array_cte"."id"
FROM array_cte) array_cte
LEFT JOIN LATERAL (
SELECT "ci_builds"."id" AS id
FROM "ci_builds"
WHERE "ci_builds"."runner_id" = "array_cte"."id"
ORDER BY "ci_builds"."id" DESC
LIMIT 1) ci_builds ON TRUE
WHERE "ci_builds"."id" IS NOT NULL) array_scope_lateral_query
LIMIT 1)
UNION ALL (
SELECT (
SELECT ci_builds
FROM "ci_builds"
WHERE "ci_builds"."id" = recursive_keyset_cte.ci_builds_id_array[position]
LIMIT 1), array_cte_id_array, recursive_keyset_cte.ci_builds_id_array[:position_query.position - 1] ||
next_cursor_values.id || recursive_keyset_cte.ci_builds_id_array[position_query.position + 1:],
recursive_keyset_cte.count + 1
FROM recursive_keyset_cte, LATERAL (
SELECT id, position
FROM UNNEST(ci_builds_id_array)
WITH ORDINALITY AS u (id, position)
WHERE id IS NOT NULL
ORDER BY 1 DESC
LIMIT 1) AS position_query, LATERAL (
SELECT "record"."id"
FROM (
VALUES (NULL)) AS nulls
LEFT JOIN (
SELECT "ci_builds"."id" AS id
FROM "ci_builds"
WHERE "ci_builds"."runner_id" = recursive_keyset_cte.array_cte_id_array[position]
AND ("ci_builds"."id" < recursive_keyset_cte.ci_builds_id_array[position])
ORDER BY "ci_builds"."id" DESC
LIMIT 1) record ON TRUE
LIMIT 1) AS next_cursor_values))
SELECT (records).*
FROM "recursive_keyset_cte" AS "ci_builds"
WHERE (count <> 0)) ci_builds
WHERE "ci_builds"."started_at" IS NOT NULL
AND "ci_builds"."runner_id" IS NOT NULL
LIMIT 100) builds
There is no full query plan as this query would timeout.
Query plan
Aggregate (cost=1524192.87..1524192.88 rows=1 width=80)
-> Limit (cost=1524187.12..1524190.63 rows=99 width=16)
-> Subquery Scan on ci_builds (cost=1524187.12..1524190.63 rows=99 width=16)
-> CTE Scan on recursive_keyset_cte ci_builds_1 (cost=1524187.12..1524189.39 rows=99 width=3967)
Filter: (((records).started_at IS NOT NULL) AND ((records).runner_id IS NOT NULL) AND (count <> 0))
CTE array_cte
-> Index Only Scan using index_ci_runners_on_active on ci_runners (cost=0.43..113554.15 rows=1818147 width=4)
CTE recursive_keyset_cte
-> Recursive Union (cost=1410170.87..1410632.97 rows=101 width=104)
-> Limit (cost=1410170.87..1410170.89 rows=1 width=104)
-> Subquery Scan on array_scope_lateral_query (cost=1410170.87..1410170.89 rows=1 width=104)
-> Aggregate (cost=1410170.87..1410170.88 rows=1 width=64)
-> Nested Loop (cost=0.70..1401080.13 rows=1818147 width=12)
-> CTE Scan on array_cte (cost=0.00..36362.94 rows=1818147 width=4)
-> Subquery Scan on ci_builds_3 (cost=0.70..0.74 rows=1 width=8)
Filter: (ci_builds_3.id IS NOT NULL)
-> Limit (cost=0.70..0.73 rows=1 width=8)
-> Index Only Scan using index_ci_builds_on_runner_id_and_id_desc on ci_builds ci_builds_4 (cost=0.70..1165.55 rows=45486 width=8)
Index Cond: (runner_id = array_cte.id)
-> Nested Loop (cost=0.86..46.01 rows=10 width=104)
-> Nested Loop (cost=0.15..1.95 rows=10 width=80)
-> WorkTable Scan on recursive_keyset_cte (cost=0.00..0.20 rows=10 width=72)
-> Limit (cost=0.15..0.16 rows=1 width=16)
-> Sort (cost=0.15..0.18 rows=10 width=16)
Sort Key: u.id DESC
-> Function Scan on unnest u (cost=0.00..0.10 rows=10 width=16)
Filter: (id IS NOT NULL)
-> Limit (cost=0.71..0.77 rows=1 width=8)
-> Nested Loop Left Join (cost=0.71..0.77 rows=1 width=8)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Limit (cost=0.71..0.74 rows=1 width=8)
-> Index Only Scan using index_ci_builds_on_runner_id_and_id_desc on ci_builds ci_builds_5 (cost=0.71..426.95 rows=15162 width=8)
Index Cond: ((runner_id = (recursive_keyset_cte.array_cte_id_array)[u."position"]) AND (id < (recursive_keyset_cte.ci_builds_id_array)[u."position"]))
SubPlan 2
-> Limit (cost=0.58..3.60 rows=1 width=1275)
-> Index Scan using index_ci_builds_on_id_partition_id_unique on ci_builds ci_builds_2 (cost=0.58..3.60 rows=1 width=1275)
Index Cond: (id = (recursive_keyset_cte.ci_builds_id_array)[u."position"])
New query
SQL query
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY builds.duration) AS p50, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER
BY builds.duration) AS p75, PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY builds.duration) AS p90,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY builds.duration) AS p95,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY builds.duration) AS p99
FROM (
SELECT ("ci_builds"."started_at" - "ci_builds"."queued_at") AS duration
FROM "ci_builds"
WHERE "ci_builds"."started_at" IS NOT NULL
AND "ci_builds"."runner_id" IS NOT NULL
ORDER BY "ci_builds"."id" DESC
LIMIT 100) builds
Execution plan
https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/14198/commands/49825
Aggregate (cost=332.39..332.40 rows=1 width=80) (actual time=196.899..196.900 rows=1 loops=1)
Buffers: shared hit=343 read=298 dirtied=294
I/O Timings: read=188.040 write=0.000
-> Limit (cost=0.58..330.13 rows=100 width=24) (actual time=43.767..196.658 rows=100 loops=1)
Buffers: shared hit=333 read=298 dirtied=294
I/O Timings: read=188.040 write=0.000
-> Index Scan using ci_builds_pkey on public.ci_builds (cost=0.58..2873274047.35 rows=871879300 width=24) (actual time=43.763..196.596 rows=100 loops=1)
Filter: ((ci_builds.started_at IS NOT NULL) AND (ci_builds.runner_id IS NOT NULL))
Rows Removed by Filter: 533
Buffers: shared hit=333 read=298 dirtied=294
I/O Timings: read=188.040 write=0.000
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.