Skip to content

Limit job statistics calculation to most recent 5000 runners

Pedro Pombeiro requested to merge pedropombeiro/377963/limit-source-runners into master

What does this MR do and why?

Describe in detail what your merge request does and why.

This MR is a follow-up for !107694 (merged) to reduce the possibility of users crafting GraphQL queries that will timeout. Instead of computing the statistics from the latest 100 jobs from a pool of all the runners in a filter, we limit to the latest 5000 runners in that filter. This number is high enough to encompass all the instance runners in .com, which is the initial intended scenario for this GraphQL query. Some additional context is in the original issue: !107694 (comment 1223613181).

Part of Provide estimated wait times for instance runners (#377963 - closed)

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

Database query plan

This is a query that would timeout on .com, but now executes in 10 seconds. To be noted that it is only available to admins on Ultimate plans.

SQL query
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY builds.duration) AS p50, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER
  BY builds.duration) AS p75, PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY builds.duration) AS p90,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY builds.duration) AS p95,
  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY builds.duration) AS p99
FROM (
  SELECT ("ci_builds"."started_at" - "ci_builds"."queued_at") AS duration
  FROM (WITH RECURSIVE "array_cte" AS MATERIALIZED (
      SELECT "ci_runners"."id"
      FROM "ci_runners"
      WHERE "ci_runners"."runner_type" = 2
      ORDER BY "ci_runners"."id" DESC
      LIMIT 5000), "recursive_keyset_cte" AS ((
        SELECT NULL::ci_builds AS records, array_cte_id_array, ci_builds_id_array,
          0::bigint AS count
        FROM (
          SELECT ARRAY_AGG("array_cte"."id") AS array_cte_id_array, ARRAY_AGG("ci_builds"."id") AS ci_builds_id_array
        FROM (
          SELECT "array_cte"."id"
          FROM array_cte) array_cte
        LEFT JOIN LATERAL (
          SELECT "ci_builds"."id" AS id
          FROM "ci_builds"
          WHERE "ci_builds"."runner_id" = "array_cte"."id"
          ORDER BY "ci_builds"."id" DESC
          LIMIT 1) ci_builds ON TRUE
      WHERE "ci_builds"."id" IS NOT NULL) array_scope_lateral_query
    LIMIT 1)
UNION ALL (
  SELECT (
      SELECT ci_builds
      FROM "ci_builds"
      WHERE "ci_builds"."id" = recursive_keyset_cte.ci_builds_id_array[position]
      LIMIT 1), array_cte_id_array, recursive_keyset_cte.ci_builds_id_array[:position_query.position - 1] ||
	next_cursor_values.id || recursive_keyset_cte.ci_builds_id_array[position_query.position + 1:],
	recursive_keyset_cte.count + 1
  FROM recursive_keyset_cte, LATERAL (
      SELECT id, position
      FROM UNNEST(ci_builds_id_array)
        WITH ORDINALITY AS u (id, position)
      WHERE id IS NOT NULL
      ORDER BY 1 DESC
      LIMIT 1) AS position_query, LATERAL (
    SELECT "record"."id"
    FROM (
      VALUES (NULL)) AS nulls
    LEFT JOIN (
      SELECT "ci_builds"."id" AS id
      FROM "ci_builds"
      WHERE "ci_builds"."runner_id" = recursive_keyset_cte.array_cte_id_array[position]
        AND ("ci_builds"."id" < recursive_keyset_cte.ci_builds_id_array[position])
      ORDER BY "ci_builds"."id" DESC
      LIMIT 1) record ON TRUE
  LIMIT 1) AS next_cursor_values))
SELECT (records).*
FROM "recursive_keyset_cte" AS "ci_builds"
WHERE (count <> 0)) ci_builds
WHERE "ci_builds"."started_at" IS NOT NULL
  AND "ci_builds"."runner_id" IS NOT NULL
LIMIT 100) builds
Execution plan

https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/14198/commands/49816

 Aggregate  (cost=9142.89..9142.90 rows=1 width=80) (actual time=6127.229..6127.324 rows=1 loops=1)
   Buffers: shared hit=28326 read=10415 dirtied=6322
   I/O Timings: read=5745.482 write=0.000
   ->  Limit  (cost=9137.14..9140.65 rows=99 width=16) (actual time=5946.515..6127.003 rows=100 loops=1)
         Buffers: shared hit=28316 read=10415 dirtied=6322
         I/O Timings: read=5745.482 write=0.000
         ->  Subquery Scan on ci_builds  (cost=9137.14..9140.65 rows=99 width=16) (actual time=5946.513..6126.977 rows=100 loops=1)
               Buffers: shared hit=28316 read=10415 dirtied=6322
               I/O Timings: read=5745.482 write=0.000
               ->  CTE Scan on recursive_keyset_cte ci_builds_1  (cost=9137.14..9139.41 rows=99 width=3967) (actual time=5946.508..6126.867 rows=100 loops=1)
                     Filter: (((ci_builds_1.records).started_at IS NOT NULL) AND ((ci_builds_1.records).runner_id IS NOT NULL) AND (ci_builds_1.count <> 0))
                     Rows Removed by Filter: 1
                     Buffers: shared hit=28316 read=10415 dirtied=6322
                     I/O Timings: read=5745.482 write=0.000
                     CTE array_cte
                       ->  Limit  (cost=0.43..4796.99 rows=5000 width=4) (actual time=0.520..506.481 rows=5000 loops=1)
                             Buffers: shared hit=1702 read=7443 dirtied=4598
                             I/O Timings: read=397.307 write=0.000
                             ->  Index Scan using ci_runners_pkey on public.ci_runners  (cost=0.43..367300.09 rows=382878 width=4) (actual time=0.519..505.207 rows=5000 loops=1)
                                   Filter: (ci_runners.runner_type = 2)
                                   Rows Removed by Filter: 3715
                                   Buffers: shared hit=1702 read=7443 dirtied=4598
                                   I/O Timings: read=397.307 write=0.000
                     CTE recursive_keyset_cte
                       ->  Recursive Union  (cost=3878.05..4340.15 rows=101 width=104) (actual time=5940.396..6123.860 rows=101 loops=1)
                             Buffers: shared hit=28316 read=10415 dirtied=6322
                             I/O Timings: read=5745.482 write=0.000
                             ->  Limit  (cost=3878.05..3878.07 rows=1 width=104) (actual time=5940.368..5940.411 rows=1 loops=1)
                                   Buffers: shared hit=27164 read=10276 dirtied=6197
                                   I/O Timings: read=5650.557 write=0.000
                                   ->  Subquery Scan on array_scope_lateral_query  (cost=3878.05..3878.07 rows=1 width=104) (actual time=5940.368..5940.410 rows=1 loops=1)
                                         Buffers: shared hit=27164 read=10276 dirtied=6197
                                         I/O Timings: read=5650.557 write=0.000
                                         ->  Aggregate  (cost=3878.05..3878.06 rows=1 width=64) (actual time=5940.366..5940.400 rows=1 loops=1)
                                               Buffers: shared hit=27164 read=10276 dirtied=6197
                                               I/O Timings: read=5650.557 write=0.000
                                               ->  Nested Loop  (cost=0.70..3853.04 rows=5000 width=12) (actual time=4.327..5933.537 rows=1865 loops=1)
                                                     Buffers: shared hit=27164 read=10276 dirtied=6197
                                                     I/O Timings: read=5650.557 write=0.000
                                                     ->  CTE Scan on array_cte  (cost=0.00..100.00 rows=5000 width=4) (actual time=0.526..513.205 rows=5000 loops=1)
                                                           Buffers: shared hit=1702 read=7443 dirtied=4598
                                                           I/O Timings: read=397.307 write=0.000
                                                     ->  Subquery Scan on ci_builds_3  (cost=0.70..0.74 rows=1 width=8) (actual time=1.081..1.082 rows=0 loops=5000)
                                                           Filter: (ci_builds_3.id IS NOT NULL)
                                                           Rows Removed by Filter: 0
                                                           Buffers: shared hit=25462 read=2833 dirtied=1599
                                                           I/O Timings: read=5253.251 write=0.000
                                                           ->  Limit  (cost=0.70..0.73 rows=1 width=8) (actual time=1.080..1.080 rows=0 loops=5000)
                                                                 Buffers: shared hit=25462 read=2833 dirtied=1599
                                                                 I/O Timings: read=5253.251 write=0.000
                                                                 ->  Index Only Scan using index_ci_builds_on_runner_id_and_id_desc on public.ci_builds ci_builds_4  (cost=0.70..1165.55 rows=45486 width=8) (actual time=1.079..1.079 rows=0 loops=5000)
                                                                       Index Cond: (ci_builds_4.runner_id = array_cte.id)
                                                                       Heap Fetches: 3449
                                                                       Buffers: shared hit=25462 read=2833 dirtied=1136
                                                                       I/O Timings: read=5253.251 write=0.000
                             ->  Nested Loop  (cost=0.86..46.01 rows=10 width=104) (actual time=1.824..1.826 rows=1 loops=100)
                                   Buffers: shared hit=1152 read=139 dirtied=125
                                   I/O Timings: read=94.924 write=0.000
                                   ->  Nested Loop  (cost=0.15..1.95 rows=10 width=80) (actual time=0.709..0.710 rows=1 loops=100)
                                         Buffers: shared hit=3
                                         I/O Timings: read=0.000 write=0.000
                                         ->  WorkTable Scan on recursive_keyset_cte  (cost=0.00..0.20 rows=10 width=72) (actual time=0.000..0.001 rows=1 loops=100)
                                               I/O Timings: read=0.000 write=0.000
                                         ->  Limit  (cost=0.15..0.16 rows=1 width=16) (actual time=0.707..0.707 rows=1 loops=100)
                                               Buffers: shared hit=3
                                               I/O Timings: read=0.000 write=0.000
                                               ->  Sort  (cost=0.15..0.18 rows=10 width=16) (actual time=0.706..0.706 rows=1 loops=100)
                                                     Sort Key: u.id DESC
                                                     Sort Method: top-N heapsort  Memory: 25kB
                                                     Buffers: shared hit=3
                                                     I/O Timings: read=0.000 write=0.000
                                                     ->  Function Scan on unnest u  (cost=0.00..0.10 rows=10 width=16) (actual time=0.146..0.376 rows=1863 loops=100)
                                                           Filter: (u.id IS NOT NULL)
                                                           Rows Removed by Filter: 2
                                                           I/O Timings: read=0.000 write=0.000
                                   ->  Limit  (cost=0.71..0.77 rows=1 width=8) (actual time=0.722..0.722 rows=1 loops=100)
                                         Buffers: shared hit=548 read=101 dirtied=95
                                         I/O Timings: read=68.475 write=0.000
                                         ->  Nested Loop Left Join  (cost=0.71..0.77 rows=1 width=8) (actual time=0.721..0.721 rows=1 loops=100)
                                               Buffers: shared hit=548 read=101 dirtied=95
                                               I/O Timings: read=68.475 write=0.000
                                               ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=100)
                                                     I/O Timings: read=0.000 write=0.000
                                               ->  Limit  (cost=0.71..0.74 rows=1 width=8) (actual time=0.720..0.720 rows=1 loops=100)
                                                     Buffers: shared hit=548 read=101 dirtied=95
                                                     I/O Timings: read=68.475 write=0.000
                                                     ->  Index Only Scan using index_ci_builds_on_runner_id_and_id_desc on public.ci_builds ci_builds_5  (cost=0.71..426.95 rows=15162 width=8) (actual time=0.715..0.715 rows=1 loops=100)
                                                           Index Cond: ((ci_builds_5.runner_id = (recursive_keyset_cte.array_cte_id_array)[u."position"]) AND (ci_builds_5.id < (recursive_keyset_cte.ci_builds_id_array)[u."position"]))
                                                           Heap Fetches: 167
                                                           Buffers: shared hit=548 read=101 dirtied=87
                                                           I/O Timings: read=68.475 write=0.000
                                   SubPlan 2
                                     ->  Limit  (cost=0.58..3.60 rows=1 width=1275) (actual time=0.286..0.286 rows=1 loops=100)
                                           Buffers: shared hit=601 read=38 dirtied=30
                                           I/O Timings: read=26.449 write=0.000
                                           ->  Index Scan using index_ci_builds_on_id_partition_id_unique on public.ci_builds ci_builds_2  (cost=0.58..3.60 rows=1 width=1275) (actual time=0.280..0.280 rows=1 loops=100)
                                                 Index Cond: (ci_builds_2.id = (recursive_keyset_cte.ci_builds_id_array)[u."position"])
                                                 Buffers: shared hit=505 read=38 dirtied=5
                                                 I/O Timings: read=26.449 write=0.000

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Pedro Pombeiro

Merge request reports

Loading