Add service calculating ci job queuing time percentiles using ClickHouse
What does this MR do and why?
Extracted from Draft: Resolve "[Runner Fleet dashboard] API fo... (!128752 - closed) this MR adds:
- draft database schema for ci_builds
- a service that selects the data from CH
It will be used in GraphQL API for:
- [Runner Fleet dashboard] API for percentiles of... (#421199 - closed)
- Runner Fleet Dashboard - Admin View:Starter Met... (&11169 - closed)
Database queries and performance
Click to expand
NOTE: to get a decent performance locally, you need to add PARTITION BY toStartOfDay(started_at_bucket)
to the materialized view:
diff --git a/db/click_house/main/20230719101806_create_ci_finished_builds_aggregated_queueing_delay_percentiles.sql b/db/click_house/main/20230719101806_create_ci_finished_builds_aggregated_queueing_delay_percentiles.sql
index 0fb59783192f..0a2aad8e5243 100644
--- a/db/click_house/main/20230719101806_create_ci_finished_builds_aggregated_queueing_delay_percentiles.sql
+++ b/db/click_house/main/20230719101806_create_ci_finished_builds_aggregated_queueing_delay_percentiles.sql
@@ -9,3 +9,6 @@ CREATE TABLE ci_finished_builds_aggregated_queueing_delay_percentiles
)
ENGINE = AggregatingMergeTree()
ORDER BY (status, runner_type, started_at_bucket)
+PARTITION BY toStartOfDay(started_at_bucket)
Otherwise, we'll only have a single partition, and ClickHouse will only use a single thread. See more discussion on the slack thread
vladimirs-macbook-pro.local :) SELECT count(*) FROM ci_finished_builds;
SELECT count(*)
FROM ci_finished_builds
Query id: 3098f35e-9497-46e6-8a27-290e196facc5
┌───count()─┐
│ 300000000 │
└───────────┘
1 row in set. Elapsed: 0.001 sec.
vladimirs-macbook-pro.local :) EXPLAIN PIPELINE SELECT started_at_bucket as time,
INTERVAL quantileMerge(0.99)(start_delay_quantile) SECOND as p99
,
INTERVAL quantileMerge(0.95)(start_delay_quantile) SECOND as p95
,
INTERVAL quantileMerge(0.90)(start_delay_quantile) SECOND as p90
,
INTERVAL quantileMerge(0.50)(start_delay_quantile) SECOND as p50
,
INTERVAL quantileMerge(0.25)(start_delay_quantile) SECOND as p25
FROM ci_finished_builds_aggregated_queueing_delay_percentiles
WHERE status IN ['success', 'failure'] AND
started_at_bucket >= '2023-09-06 09:19:24' AND
started_at_bucket <= '2023-09-06 12:19:24'
GROUP BY started_at_bucket
ORDER BY started_at_bucket;
EXPLAIN PIPELINE
SELECT
started_at_bucket AS time,
toIntervalSecond(quantileMerge(0.99)(start_delay_quantile)) AS p99,
toIntervalSecond(quantileMerge(0.95)(start_delay_quantile)) AS p95,
toIntervalSecond(quantileMerge(0.9)(start_delay_quantile)) AS p90,
toIntervalSecond(quantileMerge(0.5)(start_delay_quantile)) AS p50,
toIntervalSecond(quantileMerge(0.25)(start_delay_quantile)) AS p25
FROM ci_finished_builds_aggregated_queueing_delay_percentiles
WHERE (status IN ['success', 'failure']) AND (started_at_bucket >= '2023-09-06 09:19:24') AND (started_at_bucket <= '2023-09-06 12:19:24')
GROUP BY started_at_bucket
ORDER BY started_at_bucket ASC
Query id: 6b666bc0-21d3-43ec-af30-0cd6c25a60c5
┌─explain───────────────────────────────────────┐
│ (Expression) │
│ ExpressionTransform │
│ (Sorting) │
│ MergingSortedTransform 10 → 1 │
│ MergeSortingTransform × 10 │
│ LimitsCheckingTransform × 10 │
│ PartialSortingTransform × 10 │
│ (Expression) │
│ ExpressionTransform × 10 │
│ (Aggregating) │
│ Resize 5 → 10 │
│ AggregatingTransform × 5 │
│ StrictResize 5 → 5 │
│ (Expression) │
│ ExpressionTransform × 5 │
│ (ReadFromMergeTree) │
│ MergeTreeThread × 5 0 → 1 │
└───────────────────────────────────────────────┘
17 rows in set. Elapsed: 0.004 sec.
vladimirs-macbook-pro.local :) SELECT started_at_bucket as time,
INTERVAL quantileMerge(0.99)(start_delay_quantile) SECOND as p99
,
INTERVAL quantileMerge(0.95)(start_delay_quantile) SECOND as p95
,
INTERVAL quantileMerge(0.90)(start_delay_quantile) SECOND as p90
,
INTERVAL quantileMerge(0.50)(start_delay_quantile) SECOND as p50
,
INTERVAL quantileMerge(0.25)(start_delay_quantile) SECOND as p25
FROM ci_finished_builds_aggregated_queueing_delay_percentiles
WHERE status IN ['success', 'failure'] AND
started_at_bucket >= '2023-09-06 09:19:24' AND
started_at_bucket <= '2023-09-06 12:19:24'
GROUP BY started_at_bucket
ORDER BY started_at_bucket;
SELECT
started_at_bucket AS time,
toIntervalSecond(quantileMerge(0.99)(start_delay_quantile)) AS p99,
toIntervalSecond(quantileMerge(0.95)(start_delay_quantile)) AS p95,
toIntervalSecond(quantileMerge(0.9)(start_delay_quantile)) AS p90,
toIntervalSecond(quantileMerge(0.5)(start_delay_quantile)) AS p50,
toIntervalSecond(quantileMerge(0.25)(start_delay_quantile)) AS p25
FROM ci_finished_builds_aggregated_queueing_delay_percentiles
WHERE (status IN ['success', 'failure']) AND (started_at_bucket >= '2023-09-06 09:19:24') AND (started_at_bucket <= '2023-09-06 12:19:24')
GROUP BY started_at_bucket
ORDER BY started_at_bucket ASC
Query id: b81c4fd0-6de5-4abf-af7b-b3430d492137
┌───────────────────────time─┬─p99─┬─p95─┬─p90─┬─p50─┬─p25─┐
│ 2023-09-06 09:20:00.000000 │ 47 │ 30 │ 23 │ 6 │ 2 │
│ 2023-09-06 09:25:00.000000 │ 44 │ 29 │ 22 │ 6 │ 2 │
│ 2023-09-06 09:30:00.000000 │ 45 │ 30 │ 22 │ 7 │ 2 │
│ 2023-09-06 09:35:00.000000 │ 44 │ 30 │ 23 │ 6 │ 2 │
│ 2023-09-06 09:40:00.000000 │ 47 │ 30 │ 23 │ 6 │ 2 │
│ 2023-09-06 09:45:00.000000 │ 47 │ 29 │ 22 │ 6 │ 2 │
│ 2023-09-06 09:50:00.000000 │ 47 │ 30 │ 23 │ 6 │ 2 │
│ 2023-09-06 09:55:00.000000 │ 46 │ 30 │ 23 │ 6 │ 2 │
│ 2023-09-06 10:00:00.000000 │ 47 │ 30 │ 23 │ 6 │ 2 │
│ 2023-09-06 10:05:00.000000 │ 44 │ 29 │ 22 │ 7 │ 2 │
│ 2023-09-06 10:10:00.000000 │ 44 │ 28 │ 22 │ 6 │ 2 │
│ 2023-09-06 10:15:00.000000 │ 43 │ 29 │ 23 │ 6 │ 2 │
│ 2023-09-06 10:20:00.000000 │ 41 │ 29 │ 22 │ 6 │ 2 │
│ 2023-09-06 10:25:00.000000 │ 40 │ 28 │ 22 │ 6 │ 2 │
│ 2023-09-06 10:30:00.000000 │ 36 │ 26 │ 21 │ 6 │ 2 │
│ 2023-09-06 10:35:00.000000 │ 34 │ 26 │ 21 │ 6 │ 2 │
│ 2023-09-06 10:40:00.000000 │ 30 │ 24 │ 20 │ 6 │ 2 │
│ 2023-09-06 10:45:00.000000 │ 26 │ 22 │ 18 │ 6 │ 2 │
│ 2023-09-06 10:50:00.000000 │ 22 │ 19 │ 16 │ 5 │ 2 │
│ 2023-09-06 10:55:00.000000 │ 18 │ 15 │ 13 │ 5 │ 2 │
│ 2023-09-06 11:00:00.000000 │ 14 │ 12 │ 10 │ 4 │ 2 │
│ 2023-09-06 11:05:00.000000 │ 9 │ 8 │ 7 │ 3 │ 1 │
│ 2023-09-06 11:10:00.000000 │ 5 │ 4 │ 3 │ 1 │ 0 │
│ 2023-09-06 11:15:00.000000 │ 0 │ 0 │ 0 │ 0 │ 0 │
└────────────────────────────┴─────┴─────┴─────┴─────┴─────┘
24 rows in set. Elapsed: 0.073 sec. Processed 6.11 thousand rows, 837.48 KB (84.23 thousand rows/s., 11.54 MB/s.)
How to set up and validate locally
- Follow steps from https://docs.gitlab.com/ee/development/database/clickhouse/clickhouse_within_gitlab.html up until
ClickHouseTestRunner.new.ensure_schema
- Connect to clickhouse and generate the data (replace 300000000 with number of builds you want to generate):
INSERT INTO ci_finished_builds (id, project_id, pipeline_id, status, finished_at, created_at, started_at, queued_at, runner_id, runner_manager_system_xid, runner_run_untagged, runner_type, runner_manager_version, runner_manager_revision, runner_manager_platform, runner_manager_architecture) SELECT number AS id, toUInt64(1 * pow(1 - (rand() / 4294967295.0), -1/1.0)) AS project_id, toUInt64(1 * pow(1 - (rand() / 4294967295.0), -1/1.0)) AS pipeline_id, arrayElement(['success', 'success', 'success', 'success', 'success', 'failed', 'failed', 'cancelled'], 1 + (rand() % 8)) AS status, (now() - toIntervalSecond(rand() % (86400 * 30))) AS finished_at, (finished_at - toIntervalSecond(-600 * ln(1 - rand() / 4294967295.0))) AS started_at, (started_at - toIntervalSecond(-10 * ln(1 - rand() / 4294967295.0))) AS queued_at, (queued_at - toIntervalSecond(-10 * ln(1 - rand() / 4294967295.0))) AS created_at, rand() % 10000 AS runner_id, toString(rand() % 1000) AS runner_manager_system_xid, rand() % 2 AS runner_run_untagged, rand() % 3 + 1 AS runner_type, '' AS runner_manager_version, '' AS runner_manager_revision, '' AS runner_manager_platform, '' AS runner_manager_architecture FROM numbers(300000000);
- execute the following in console:
Feature.enable(:clickhouse_ci_analytics) Ci::CollectQueueingDelayHistoryService.new(current_user: User.first, percentiles: [99, 95, 90, 50, 25], runner_type: 1, from_time: 3.hours.ago, to_time: Time.now).execute
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.
Edited by Vladimir Shushlin