GraphQL: Expose jobs statistics in CiRunnersConnection
What does this MR do and why?
Describe in detail what your merge request does and why.
This MR exposes statistics for queued times of jobs executed by runners in a CiRunnersConnection
:
{
runners(type: INSTANCE_TYPE) {
count
jobsStatistics {
queuedDuration {
p50
p75
p90
p95
p99
}
}
}
}
It resorts to the technique described in Efficient IN
operator queries to avoid query timeouts against ci_builds
.
To avoid a huge MR at the end, I'm breaking up the work into several MRs to make reviews easier for everyone. The work is divided as follows:
Part of #377963 (closed)
Screenshots or screen recordings
Screenshots are required for UI changes, and strongly recommended for all other merge requests.
header | header |
---|---|
admin on EE | |
admin on CE |
How to set up and validate locally
Numbered steps to set up and validate the change are strongly suggested.
-
Go to the shell in your GDK
gitlab
directory and runbundle exec rake "gitlab:seed:runner_fleet[root]"
. This will seed your GDK with some runners and jobs required for testing this MR. -
Log into http://gdk.test:3000 as an administrator. Make sure you have an Ultimate license enabled.
-
Visit http://gdk.test:3000/-/graphql-explorer and enter the following query:
{ runners(type: INSTANCE_TYPE) { count jobsStatistics { queuedDuration { p50 p75 p90 p95 p99 } } } }
-
Upon running the query, you should see the different percentile values.
-
On the terminal in your GDK gitlab directory, follow the instructions to Simulate a CE instance with a licensed GDK:
export FOSS_ONLY=1 gdk restart rails && gdk restart webpack
-
Re-run the query. It should note that the statistics object is not available.
Database query plans
SQL query
SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY builds.duration) AS p50, PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER
BY builds.duration) AS p75, PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY builds.duration) AS p90,
PERCENTILE_DISC(0.95) WITHIN GROUP (ORDER BY builds.duration) AS p95,
PERCENTILE_DISC(0.99) WITHIN GROUP (ORDER BY builds.duration) AS p99
FROM (
SELECT ("ci_builds"."started_at" - "ci_builds"."queued_at") AS duration
FROM (WITH RECURSIVE "array_cte" AS MATERIALIZED (
SELECT "ci_runners"."id"
FROM "ci_runners"
WHERE "ci_runners"."runner_type" = 1), "recursive_keyset_cte" AS ((
SELECT NULL::ci_builds AS records, array_cte_id_array, ci_builds_runner_id_array,
ci_builds_id_array, 0::bigint AS count
FROM (
SELECT ARRAY_AGG("array_cte"."id") AS array_cte_id_array, ARRAY_AGG("ci_builds"."runner_id") AS
ci_builds_runner_id_array, ARRAY_AGG("ci_builds"."id") AS ci_builds_id_array
FROM (
SELECT "array_cte"."id"
FROM array_cte) array_cte
LEFT JOIN LATERAL (
SELECT "ci_builds"."runner_id" AS runner_id, "ci_builds"."id" AS id
FROM "ci_builds"
WHERE "ci_builds"."runner_id" = "array_cte"."id"
ORDER BY "ci_builds"."runner_id" DESC, "ci_builds"."id" DESC
LIMIT 1) ci_builds ON TRUE
WHERE "ci_builds"."runner_id" IS NOT NULL
AND "ci_builds"."id" IS NOT NULL) array_scope_lateral_query
LIMIT 1)
UNION ALL (
SELECT (
SELECT ci_builds
FROM "ci_builds"
WHERE "ci_builds"."id" = recursive_keyset_cte.ci_builds_id_array[position]
LIMIT 1), array_cte_id_array, recursive_keyset_cte.ci_builds_runner_id_array[:position_query.position - 1] ||
next_cursor_values.runner_id || recursive_keyset_cte.ci_builds_runner_id_array[position_query.position + 1:],
recursive_keyset_cte.ci_builds_id_array[:position_query.position - 1] || next_cursor_values.id ||
recursive_keyset_cte.ci_builds_id_array[position_query.position + 1:],
recursive_keyset_cte.count + 1
FROM recursive_keyset_cte, LATERAL (
SELECT runner_id, id, position
FROM UNNEST(ci_builds_runner_id_array, ci_builds_id_array)
WITH ORDINALITY AS u (runner_id, id, position)
WHERE runner_id IS NOT NULL
AND id IS NOT NULL
ORDER BY 1 DESC, 2 DESC
LIMIT 1) AS position_query, LATERAL (
SELECT "record"."runner_id", "record"."id"
FROM (
VALUES (NULL, NULL)) AS nulls
LEFT JOIN (
SELECT "ci_builds"."runner_id" AS runner_id, "ci_builds"."id" AS id
FROM "ci_builds"
WHERE "ci_builds"."runner_id" = recursive_keyset_cte.array_cte_id_array[position]
AND (("ci_builds"."runner_id", "ci_builds"."id") <
(recursive_keyset_cte.ci_builds_runner_id_array[position],
recursive_keyset_cte.ci_builds_id_array[position]))
ORDER BY "ci_builds"."runner_id" DESC, "ci_builds"."id" DESC
LIMIT 1) record ON TRUE
LIMIT 1) AS next_cursor_values))
SELECT (records).*
FROM "recursive_keyset_cte" AS "ci_builds"
WHERE (count <> 0)) ci_builds
WHERE "ci_builds"."started_at" IS NOT NULL
LIMIT 100) builds
Execution plan
https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/14068/commands/49515
Aggregate (cost=627.94..627.95 rows=1 width=80) (actual time=942.373..942.387 rows=1 loops=1)
Buffers: shared hit=1230 read=477 dirtied=62
I/O Timings: read=914.891 write=0.000
-> Limit (cost=622.16..625.68 rows=100 width=16) (actual time=538.522..941.438 rows=100 loops=1)
Buffers: shared hit=1221 read=476 dirtied=62
I/O Timings: read=914.327 write=0.000
-> Subquery Scan on ci_builds (cost=622.16..625.68 rows=100 width=16) (actual time=538.520..941.399 rows=100 loops=1)
Buffers: shared hit=1221 read=476 dirtied=62
I/O Timings: read=914.327 write=0.000
-> CTE Scan on recursive_keyset_cte ci_builds_1 (cost=622.16..624.43 rows=100 width=3967) (actual time=538.461..941.212 rows=100 loops=1)
Filter: (((ci_builds_1.records).started_at IS NOT NULL) AND (ci_builds_1.count <> 0))
Rows Removed by Filter: 1
Buffers: shared hit=1221 read=476 dirtied=62
I/O Timings: read=914.327 write=0.000
CTE array_cte
-> Index Scan using index_ci_runners_on_runner_type on public.ci_runners (cost=0.43..81.64 rows=96 width=4) (actual time=4.942..136.997 rows=73 loops=1)
Index Cond: (ci_runners.runner_type = 1)
Buffers: shared hit=5 read=75 dirtied=26
I/O Timings: read=134.382 write=0.000
CTE recursive_keyset_cte
-> Recursive Union (cost=75.55..540.52 rows=101 width=136) (actual time=526.379..940.333 rows=101 loops=1)
Buffers: shared hit=1221 read=476 dirtied=62
I/O Timings: read=914.327 write=0.000
-> Limit (cost=75.55..75.57 rows=1 width=136) (actual time=526.373..526.378 rows=1 loops=1)
Buffers: shared hit=255 read=262 dirtied=51
I/O Timings: read=518.192 write=0.000
-> Subquery Scan on array_scope_lateral_query (cost=75.55..75.57 rows=1 width=136) (actual time=526.372..526.377 rows=1 loops=1)
Buffers: shared hit=255 read=262 dirtied=51
I/O Timings: read=518.192 write=0.000
-> Aggregate (cost=75.55..75.56 rows=1 width=96) (actual time=526.371..526.374 rows=1 loops=1)
Buffers: shared hit=255 read=262 dirtied=51
I/O Timings: read=518.192 write=0.000
-> Nested Loop (cost=0.70..74.82 rows=96 width=16) (actual time=18.264..525.968 rows=60 loops=1)
Buffers: shared hit=255 read=262 dirtied=51
I/O Timings: read=518.192 write=0.000
-> CTE Scan on array_cte (cost=0.00..1.92 rows=96 width=4) (actual time=4.946..137.314 rows=73 loops=1)
Buffers: shared hit=5 read=75 dirtied=26
I/O Timings: read=134.382 write=0.000
-> Subquery Scan on ci_builds_3 (cost=0.70..0.75 rows=1 width=12) (actual time=5.319..5.319 rows=1 loops=73)
Filter: ((ci_builds_3.runner_id IS NOT NULL) AND (ci_builds_3.id IS NOT NULL))
Rows Removed by Filter: 0
Buffers: shared hit=250 read=187 dirtied=25
I/O Timings: read=383.811 write=0.000
-> Limit (cost=0.70..0.74 rows=1 width=12) (actual time=5.316..5.316 rows=1 loops=73)
Buffers: shared hit=250 read=187 dirtied=25
I/O Timings: read=383.811 write=0.000
-> Index Only Scan using index_ci_builds_on_runner_id_and_id_desc on public.ci_builds ci_builds_4 (cost=0.70..1566.28 rows=45567 width=12) (actual time=5.313..5.313 rows=1 loops=73)
Index Cond: (ci_builds_4.runner_id = array_cte.id)
Heap Fetches: 36
Buffers: shared hit=250 read=187 dirtied=20
I/O Timings: read=383.811 write=0.000
-> Nested Loop (cost=0.87..46.29 rows=10 width=136) (actual time=4.133..4.135 rows=1 loops=100)
Buffers: shared hit=966 read=214 dirtied=11
I/O Timings: read=396.135 write=0.000
-> Nested Loop (cost=0.16..1.98 rows=10 width=112) (actual time=0.066..0.067 rows=1 loops=100)
Buffers: shared hit=6
I/O Timings: read=0.000 write=0.000
-> WorkTable Scan on recursive_keyset_cte (cost=0.00..0.20 rows=10 width=104) (actual time=0.000..0.001 rows=1 loops=100)
I/O Timings: read=0.000 write=0.000
-> Limit (cost=0.16..0.16 rows=1 width=20) (actual time=0.063..0.063 rows=1 loops=100)
Buffers: shared hit=6
I/O Timings: read=0.000 write=0.000
-> Sort (cost=0.16..0.18 rows=10 width=20) (actual time=0.062..0.062 rows=1 loops=100)
Sort Key: u.runner_id DESC, u.id DESC
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=6
I/O Timings: read=0.000 write=0.000
-> Function Scan on u (cost=0.01..0.11 rows=10 width=20) (actual time=0.018..0.035 rows=60 loops=100)
Filter: ((u.runner_id IS NOT NULL) AND (u.id IS NOT NULL))
Rows Removed by Filter: 0
I/O Timings: read=0.000 write=0.000
-> Limit (cost=0.71..0.78 rows=1 width=12) (actual time=2.539..2.539 rows=1 loops=100)
Buffers: shared hit=559 read=113 dirtied=9
I/O Timings: read=248.235 write=0.000
-> Nested Loop Left Join (cost=0.71..0.78 rows=1 width=12) (actual time=2.538..2.538 rows=1 loops=100)
Buffers: shared hit=559 read=113 dirtied=9
I/O Timings: read=248.235 write=0.000
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=100)
I/O Timings: read=0.000 write=0.000
-> Limit (cost=0.71..0.75 rows=1 width=12) (actual time=2.536..2.536 rows=1 loops=100)
Buffers: shared hit=559 read=113 dirtied=9
I/O Timings: read=248.235 write=0.000
-> Index Only Scan using index_ci_builds_on_runner_id_and_id_desc on public.ci_builds ci_builds_5 (cost=0.71..560.47 rows=15189 width=12) (actual time=2.533..2.533 rows=1 loops=100)
Index Cond: ((ROW(ci_builds_5.runner_id, ci_builds_5.id) < ROW((recursive_keyset_cte.ci_builds_runner_id_array)[u."position"], (recursive_keyset_cte.ci_builds_id_array)[u."position"])) AND (ci_builds_5.runner_id = (recursive_keyset_cte.array_cte_id_array)[u."position"]))
Heap Fetches: 255
Buffers: shared hit=559 read=113 dirtied=7
I/O Timings: read=248.235 write=0.000
SubPlan 2
-> Limit (cost=0.58..3.60 rows=1 width=1272) (actual time=1.510..1.510 rows=1 loops=100)
Buffers: shared hit=401 read=101 dirtied=2
I/O Timings: read=147.900 write=0.000
-> Index Scan using index_ci_builds_on_id_partition_id_unique on public.ci_builds ci_builds_2 (cost=0.58..3.60 rows=1 width=1272) (actual time=1.506..1.506 rows=1 loops=100)
Index Cond: (ci_builds_2.id = (recursive_keyset_cte.ci_builds_id_array)[u."position"])
Buffers: shared hit=399 read=101
I/O Timings: read=147.900 write=0.000
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.