Optimize scenario where all runners are being considered (!108003) · Merge requests · GitLab.org / GitLab

Pedro Pombeiro requested to merge pedropombeiro/377963/fix-jobs-duration-stats into pedropombeiro/377963/remove-ordering-by-runner_id Jan 03, 2023

⚠ This MR is based on the pedropombeiro/377963/remove-ordering-by-runner_id branch, please change to master before merging ⚠

What does this MR do and why?

Describe in detail what your merge request does and why.

This MR improves the logic introduced in !107694 (merged) so that we take into account the scenario where all runners are being queries, not just a subset. In that special case, we can just query the latest 100 builds, without referencing the ci_runners table.

Part of Provide estimated wait times for instance runners (#377963 - closed)

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

Go to the shell in your GDK gitlab directory and run bundle exec rake "gitlab:seed:runner_fleet[root]". This will seed your GDK with some runners and jobs required for testing this MR.
Log into http://gdk.test:3000 as an administrator. Make sure you have an Ultimate license enabled.

Visit http://gdk.test:3000/-/graphql-explorer and enter the following query:

{
  runners {
    count
    jobsStatistics {
      queuedDuration {
        p50
        p75
        p90
        p95
        p99
      }
    }
  }
}

Upon running the query, you should see the different percentile values.

Database query plan

Old query

SQL query

SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY builds.duration) AS p50, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER
  BY builds.duration) AS p75, PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY builds.duration) AS p90,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY builds.duration) AS p95,
  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY builds.duration) AS p99
FROM (
  SELECT ("ci_builds"."started_at" - "ci_builds"."queued_at") AS duration
  FROM (WITH RECURSIVE "array_cte" AS MATERIALIZED (
      SELECT "ci_runners"."id"
      FROM "ci_runners"), "recursive_keyset_cte" AS ((
          SELECT NULL::ci_builds AS records, array_cte_id_array, ci_builds_id_array,
            0::bigint AS count
          FROM (
            SELECT ARRAY_AGG("array_cte"."id") AS array_cte_id_array, ARRAY_AGG("ci_builds"."id") AS ci_builds_id_array
          FROM (
            SELECT "array_cte"."id"
            FROM array_cte) array_cte
          LEFT JOIN LATERAL (
            SELECT "ci_builds"."id" AS id
            FROM "ci_builds"
            WHERE "ci_builds"."runner_id" = "array_cte"."id"
            ORDER BY "ci_builds"."id" DESC
            LIMIT 1) ci_builds ON TRUE
        WHERE "ci_builds"."id" IS NOT NULL) array_scope_lateral_query
      LIMIT 1)
  UNION ALL (
    SELECT (
        SELECT ci_builds
        FROM "ci_builds"
        WHERE "ci_builds"."id" = recursive_keyset_cte.ci_builds_id_array[position]
	LIMIT 1), array_cte_id_array, recursive_keyset_cte.ci_builds_id_array[:position_query.position - 1] ||
	  next_cursor_values.id || recursive_keyset_cte.ci_builds_id_array[position_query.position + 1:],
	  recursive_keyset_cte.count + 1
    FROM recursive_keyset_cte, LATERAL (
        SELECT id, position
        FROM UNNEST(ci_builds_id_array)
          WITH ORDINALITY AS u (id, position)
        WHERE id IS NOT NULL
        ORDER BY 1 DESC
        LIMIT 1) AS position_query, LATERAL (
      SELECT "record"."id"
      FROM (
        VALUES (NULL)) AS nulls
    LEFT JOIN (
      SELECT "ci_builds"."id" AS id
      FROM "ci_builds"
      WHERE "ci_builds"."runner_id" = recursive_keyset_cte.array_cte_id_array[position]
          AND ("ci_builds"."id" < recursive_keyset_cte.ci_builds_id_array[position])
        ORDER BY "ci_builds"."id" DESC
        LIMIT 1) record ON TRUE
    LIMIT 1) AS next_cursor_values))
SELECT (records).*
FROM "recursive_keyset_cte" AS "ci_builds"
WHERE (count <> 0)) ci_builds
WHERE "ci_builds"."started_at" IS NOT NULL
  AND "ci_builds"."runner_id" IS NOT NULL
LIMIT 100) builds

There is no full query plan as this query would timeout.

Query plan

Aggregate  (cost=1524192.87..1524192.88 rows=1 width=80)
  ->  Limit  (cost=1524187.12..1524190.63 rows=99 width=16)
        ->  Subquery Scan on ci_builds  (cost=1524187.12..1524190.63 rows=99 width=16)
              ->  CTE Scan on recursive_keyset_cte ci_builds_1  (cost=1524187.12..1524189.39 rows=99 width=3967)
                    Filter: (((records).started_at IS NOT NULL) AND ((records).runner_id IS NOT NULL) AND (count <> 0))
                    CTE array_cte
                      ->  Index Only Scan using index_ci_runners_on_active on ci_runners  (cost=0.43..113554.15 rows=1818147 width=4)
                    CTE recursive_keyset_cte
                      ->  Recursive Union  (cost=1410170.87..1410632.97 rows=101 width=104)
                            ->  Limit  (cost=1410170.87..1410170.89 rows=1 width=104)
                                  ->  Subquery Scan on array_scope_lateral_query  (cost=1410170.87..1410170.89 rows=1 width=104)
                                        ->  Aggregate  (cost=1410170.87..1410170.88 rows=1 width=64)
                                              ->  Nested Loop  (cost=0.70..1401080.13 rows=1818147 width=12)
                                                    ->  CTE Scan on array_cte  (cost=0.00..36362.94 rows=1818147 width=4)
                                                    ->  Subquery Scan on ci_builds_3  (cost=0.70..0.74 rows=1 width=8)
                                                          Filter: (ci_builds_3.id IS NOT NULL)
                                                          ->  Limit  (cost=0.70..0.73 rows=1 width=8)
                                                                ->  Index Only Scan using index_ci_builds_on_runner_id_and_id_desc on ci_builds ci_builds_4  (cost=0.70..1165.55 rows=45486 width=8)
                                                                      Index Cond: (runner_id = array_cte.id)
                            ->  Nested Loop  (cost=0.86..46.01 rows=10 width=104)
                                  ->  Nested Loop  (cost=0.15..1.95 rows=10 width=80)
                                        ->  WorkTable Scan on recursive_keyset_cte  (cost=0.00..0.20 rows=10 width=72)
                                        ->  Limit  (cost=0.15..0.16 rows=1 width=16)
                                              ->  Sort  (cost=0.15..0.18 rows=10 width=16)
                                                    Sort Key: u.id DESC
                                                    ->  Function Scan on unnest u  (cost=0.00..0.10 rows=10 width=16)
                                                          Filter: (id IS NOT NULL)
                                  ->  Limit  (cost=0.71..0.77 rows=1 width=8)
                                        ->  Nested Loop Left Join  (cost=0.71..0.77 rows=1 width=8)
                                              ->  Result  (cost=0.00..0.01 rows=1 width=0)
                                              ->  Limit  (cost=0.71..0.74 rows=1 width=8)
                                                    ->  Index Only Scan using index_ci_builds_on_runner_id_and_id_desc on ci_builds ci_builds_5  (cost=0.71..426.95 rows=15162 width=8)
                                                          Index Cond: ((runner_id = (recursive_keyset_cte.array_cte_id_array)[u."position"]) AND (id < (recursive_keyset_cte.ci_builds_id_array)[u."position"]))
                                  SubPlan 2
                                    ->  Limit  (cost=0.58..3.60 rows=1 width=1275)
                                          ->  Index Scan using index_ci_builds_on_id_partition_id_unique on ci_builds ci_builds_2  (cost=0.58..3.60 rows=1 width=1275)
                                                Index Cond: (id = (recursive_keyset_cte.ci_builds_id_array)[u."position"])

New query

SQL query

SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY builds.duration) AS p50, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER
  BY builds.duration) AS p75, PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY builds.duration) AS p90,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY builds.duration) AS p95,
  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY builds.duration) AS p99
FROM (
  SELECT ("ci_builds"."started_at" - "ci_builds"."queued_at") AS duration
  FROM "ci_builds"
  WHERE "ci_builds"."started_at" IS NOT NULL
    AND "ci_builds"."runner_id" IS NOT NULL
  ORDER BY "ci_builds"."id" DESC
  LIMIT 100) builds

Execution plan

https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/14198/commands/49825

 Aggregate  (cost=332.39..332.40 rows=1 width=80) (actual time=196.899..196.900 rows=1 loops=1)
   Buffers: shared hit=343 read=298 dirtied=294
   I/O Timings: read=188.040 write=0.000
   ->  Limit  (cost=0.58..330.13 rows=100 width=24) (actual time=43.767..196.658 rows=100 loops=1)
         Buffers: shared hit=333 read=298 dirtied=294
         I/O Timings: read=188.040 write=0.000
         ->  Index Scan using ci_builds_pkey on public.ci_builds  (cost=0.58..2873274047.35 rows=871879300 width=24) (actual time=43.763..196.596 rows=100 loops=1)
               Filter: ((ci_builds.started_at IS NOT NULL) AND (ci_builds.runner_id IS NOT NULL))
               Rows Removed by Filter: 533
               Buffers: shared hit=333 read=298 dirtied=294
               I/O Timings: read=188.040 write=0.000

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Edited Jan 03, 2023 by Pedro Pombeiro

Optimize scenario where all runners are being considered