Limit job statistics calculation to most recent 5000 runners
What does this MR do and why?
Describe in detail what your merge request does and why.
This MR is a follow-up for !107694 (merged) to reduce the possibility of users crafting GraphQL queries that will timeout. Instead of computing the statistics from the latest 100 jobs from a pool of all the runners in a filter, we limit to the latest 5000 runners in that filter. This number is high enough to encompass all the instance runners in .com, which is the initial intended scenario for this GraphQL query. Some additional context is in the original issue: !107694 (comment 1223613181).
Part of Provide estimated wait times for instance runners (#377963 - closed)
Screenshots or screen recordings
Screenshots are required for UI changes, and strongly recommended for all other merge requests.
How to set up and validate locally
Numbered steps to set up and validate the change are strongly suggested.
Database query plan
This is a query that would timeout on .com, but now executes in 10 seconds. To be noted that it is only available to admins on Ultimate plans.
SQL query
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY builds.duration) AS p50, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER
BY builds.duration) AS p75, PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY builds.duration) AS p90,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY builds.duration) AS p95,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY builds.duration) AS p99
FROM (
SELECT ("ci_builds"."started_at" - "ci_builds"."queued_at") AS duration
FROM (WITH RECURSIVE "array_cte" AS MATERIALIZED (
SELECT "ci_runners"."id"
FROM "ci_runners"
WHERE "ci_runners"."runner_type" = 2
ORDER BY "ci_runners"."id" DESC
LIMIT 5000), "recursive_keyset_cte" AS ((
SELECT NULL::ci_builds AS records, array_cte_id_array, ci_builds_id_array,
0::bigint AS count
FROM (
SELECT ARRAY_AGG("array_cte"."id") AS array_cte_id_array, ARRAY_AGG("ci_builds"."id") AS ci_builds_id_array
FROM (
SELECT "array_cte"."id"
FROM array_cte) array_cte
LEFT JOIN LATERAL (
SELECT "ci_builds"."id" AS id
FROM "ci_builds"
WHERE "ci_builds"."runner_id" = "array_cte"."id"
ORDER BY "ci_builds"."id" DESC
LIMIT 1) ci_builds ON TRUE
WHERE "ci_builds"."id" IS NOT NULL) array_scope_lateral_query
LIMIT 1)
UNION ALL (
SELECT (
SELECT ci_builds
FROM "ci_builds"
WHERE "ci_builds"."id" = recursive_keyset_cte.ci_builds_id_array[position]
LIMIT 1), array_cte_id_array, recursive_keyset_cte.ci_builds_id_array[:position_query.position - 1] ||
next_cursor_values.id || recursive_keyset_cte.ci_builds_id_array[position_query.position + 1:],
recursive_keyset_cte.count + 1
FROM recursive_keyset_cte, LATERAL (
SELECT id, position
FROM UNNEST(ci_builds_id_array)
WITH ORDINALITY AS u (id, position)
WHERE id IS NOT NULL
ORDER BY 1 DESC
LIMIT 1) AS position_query, LATERAL (
SELECT "record"."id"
FROM (
VALUES (NULL)) AS nulls
LEFT JOIN (
SELECT "ci_builds"."id" AS id
FROM "ci_builds"
WHERE "ci_builds"."runner_id" = recursive_keyset_cte.array_cte_id_array[position]
AND ("ci_builds"."id" < recursive_keyset_cte.ci_builds_id_array[position])
ORDER BY "ci_builds"."id" DESC
LIMIT 1) record ON TRUE
LIMIT 1) AS next_cursor_values))
SELECT (records).*
FROM "recursive_keyset_cte" AS "ci_builds"
WHERE (count <> 0)) ci_builds
WHERE "ci_builds"."started_at" IS NOT NULL
AND "ci_builds"."runner_id" IS NOT NULL
LIMIT 100) builds
Execution plan
https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/14198/commands/49816
Aggregate (cost=9142.89..9142.90 rows=1 width=80) (actual time=6127.229..6127.324 rows=1 loops=1)
Buffers: shared hit=28326 read=10415 dirtied=6322
I/O Timings: read=5745.482 write=0.000
-> Limit (cost=9137.14..9140.65 rows=99 width=16) (actual time=5946.515..6127.003 rows=100 loops=1)
Buffers: shared hit=28316 read=10415 dirtied=6322
I/O Timings: read=5745.482 write=0.000
-> Subquery Scan on ci_builds (cost=9137.14..9140.65 rows=99 width=16) (actual time=5946.513..6126.977 rows=100 loops=1)
Buffers: shared hit=28316 read=10415 dirtied=6322
I/O Timings: read=5745.482 write=0.000
-> CTE Scan on recursive_keyset_cte ci_builds_1 (cost=9137.14..9139.41 rows=99 width=3967) (actual time=5946.508..6126.867 rows=100 loops=1)
Filter: (((ci_builds_1.records).started_at IS NOT NULL) AND ((ci_builds_1.records).runner_id IS NOT NULL) AND (ci_builds_1.count <> 0))
Rows Removed by Filter: 1
Buffers: shared hit=28316 read=10415 dirtied=6322
I/O Timings: read=5745.482 write=0.000
CTE array_cte
-> Limit (cost=0.43..4796.99 rows=5000 width=4) (actual time=0.520..506.481 rows=5000 loops=1)
Buffers: shared hit=1702 read=7443 dirtied=4598
I/O Timings: read=397.307 write=0.000
-> Index Scan using ci_runners_pkey on public.ci_runners (cost=0.43..367300.09 rows=382878 width=4) (actual time=0.519..505.207 rows=5000 loops=1)
Filter: (ci_runners.runner_type = 2)
Rows Removed by Filter: 3715
Buffers: shared hit=1702 read=7443 dirtied=4598
I/O Timings: read=397.307 write=0.000
CTE recursive_keyset_cte
-> Recursive Union (cost=3878.05..4340.15 rows=101 width=104) (actual time=5940.396..6123.860 rows=101 loops=1)
Buffers: shared hit=28316 read=10415 dirtied=6322
I/O Timings: read=5745.482 write=0.000
-> Limit (cost=3878.05..3878.07 rows=1 width=104) (actual time=5940.368..5940.411 rows=1 loops=1)
Buffers: shared hit=27164 read=10276 dirtied=6197
I/O Timings: read=5650.557 write=0.000
-> Subquery Scan on array_scope_lateral_query (cost=3878.05..3878.07 rows=1 width=104) (actual time=5940.368..5940.410 rows=1 loops=1)
Buffers: shared hit=27164 read=10276 dirtied=6197
I/O Timings: read=5650.557 write=0.000
-> Aggregate (cost=3878.05..3878.06 rows=1 width=64) (actual time=5940.366..5940.400 rows=1 loops=1)
Buffers: shared hit=27164 read=10276 dirtied=6197
I/O Timings: read=5650.557 write=0.000
-> Nested Loop (cost=0.70..3853.04 rows=5000 width=12) (actual time=4.327..5933.537 rows=1865 loops=1)
Buffers: shared hit=27164 read=10276 dirtied=6197
I/O Timings: read=5650.557 write=0.000
-> CTE Scan on array_cte (cost=0.00..100.00 rows=5000 width=4) (actual time=0.526..513.205 rows=5000 loops=1)
Buffers: shared hit=1702 read=7443 dirtied=4598
I/O Timings: read=397.307 write=0.000
-> Subquery Scan on ci_builds_3 (cost=0.70..0.74 rows=1 width=8) (actual time=1.081..1.082 rows=0 loops=5000)
Filter: (ci_builds_3.id IS NOT NULL)
Rows Removed by Filter: 0
Buffers: shared hit=25462 read=2833 dirtied=1599
I/O Timings: read=5253.251 write=0.000
-> Limit (cost=0.70..0.73 rows=1 width=8) (actual time=1.080..1.080 rows=0 loops=5000)
Buffers: shared hit=25462 read=2833 dirtied=1599
I/O Timings: read=5253.251 write=0.000
-> Index Only Scan using index_ci_builds_on_runner_id_and_id_desc on public.ci_builds ci_builds_4 (cost=0.70..1165.55 rows=45486 width=8) (actual time=1.079..1.079 rows=0 loops=5000)
Index Cond: (ci_builds_4.runner_id = array_cte.id)
Heap Fetches: 3449
Buffers: shared hit=25462 read=2833 dirtied=1136
I/O Timings: read=5253.251 write=0.000
-> Nested Loop (cost=0.86..46.01 rows=10 width=104) (actual time=1.824..1.826 rows=1 loops=100)
Buffers: shared hit=1152 read=139 dirtied=125
I/O Timings: read=94.924 write=0.000
-> Nested Loop (cost=0.15..1.95 rows=10 width=80) (actual time=0.709..0.710 rows=1 loops=100)
Buffers: shared hit=3
I/O Timings: read=0.000 write=0.000
-> WorkTable Scan on recursive_keyset_cte (cost=0.00..0.20 rows=10 width=72) (actual time=0.000..0.001 rows=1 loops=100)
I/O Timings: read=0.000 write=0.000
-> Limit (cost=0.15..0.16 rows=1 width=16) (actual time=0.707..0.707 rows=1 loops=100)
Buffers: shared hit=3
I/O Timings: read=0.000 write=0.000
-> Sort (cost=0.15..0.18 rows=10 width=16) (actual time=0.706..0.706 rows=1 loops=100)
Sort Key: u.id DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=3
I/O Timings: read=0.000 write=0.000
-> Function Scan on unnest u (cost=0.00..0.10 rows=10 width=16) (actual time=0.146..0.376 rows=1863 loops=100)
Filter: (u.id IS NOT NULL)
Rows Removed by Filter: 2
I/O Timings: read=0.000 write=0.000
-> Limit (cost=0.71..0.77 rows=1 width=8) (actual time=0.722..0.722 rows=1 loops=100)
Buffers: shared hit=548 read=101 dirtied=95
I/O Timings: read=68.475 write=0.000
-> Nested Loop Left Join (cost=0.71..0.77 rows=1 width=8) (actual time=0.721..0.721 rows=1 loops=100)
Buffers: shared hit=548 read=101 dirtied=95
I/O Timings: read=68.475 write=0.000
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=100)
I/O Timings: read=0.000 write=0.000
-> Limit (cost=0.71..0.74 rows=1 width=8) (actual time=0.720..0.720 rows=1 loops=100)
Buffers: shared hit=548 read=101 dirtied=95
I/O Timings: read=68.475 write=0.000
-> Index Only Scan using index_ci_builds_on_runner_id_and_id_desc on public.ci_builds ci_builds_5 (cost=0.71..426.95 rows=15162 width=8) (actual time=0.715..0.715 rows=1 loops=100)
Index Cond: ((ci_builds_5.runner_id = (recursive_keyset_cte.array_cte_id_array)[u."position"]) AND (ci_builds_5.id < (recursive_keyset_cte.ci_builds_id_array)[u."position"]))
Heap Fetches: 167
Buffers: shared hit=548 read=101 dirtied=87
I/O Timings: read=68.475 write=0.000
SubPlan 2
-> Limit (cost=0.58..3.60 rows=1 width=1275) (actual time=0.286..0.286 rows=1 loops=100)
Buffers: shared hit=601 read=38 dirtied=30
I/O Timings: read=26.449 write=0.000
-> Index Scan using index_ci_builds_on_id_partition_id_unique on public.ci_builds ci_builds_2 (cost=0.58..3.60 rows=1 width=1275) (actual time=0.280..0.280 rows=1 loops=100)
Index Cond: (ci_builds_2.id = (recursive_keyset_cte.ci_builds_id_array)[u."position"])
Buffers: shared hit=505 read=38 dirtied=5
I/O Timings: read=26.449 write=0.000
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.