Split CI minutes resets into different workers
What does this MR do?
Related to #213223 (closed)
In #213223 (comment 317200857) we noticed that `ClearSharedRunnersMinutesWorker` was being killed by Sidekiq Memory Killer because it was running for a very long time and consuming a lot of memory due to the amount of data being processed. This worker does not scale because it processes an ever-increasing number of namespaces and projects.
In this MR I've taken a different approach:

- `ClearSharedRunnersMinutesWorker` runs as a cronjob on the 1st of every month. Based on the total number of namespaces, it pre-batches work by ID range and schedules a new `Ci::BatchResetMinutesWorker` per ID range. Currently on GitLab.com there are almost 7M namespaces, so rather than having 1 worker dealing with 7M updates, we have a constant batch size of 100,000 records per worker. In total, as of today, we should create about 70 `Ci::BatchResetMinutesWorker` jobs.
- Each `Ci::BatchResetMinutesWorker` will perform the existing logic of `Namespace#reset_ci_minutes(ids)` in batches of 1,000.
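The pre-batching step above can be sketched roughly as follows. This is an illustrative sketch, not the exact implementation in this MR: the method name `batch_ranges` and the fixed starting ID are assumptions.

```ruby
BATCH_SIZE = 100_000

# Given the highest namespace ID, compute the [from, to] ID ranges that
# each Ci::BatchResetMinutesWorker would receive (illustrative only).
def batch_ranges(max_namespace_id, batch_size = BATCH_SIZE)
  (1..max_namespace_id).step(batch_size).map do |from|
    [from, from + batch_size - 1]
  end
end

# With ~7M namespaces this yields about 70 ranges:
ranges = batch_ranges(7_000_000)
ranges.first # => [1, 100000]
ranges.size  # => 70
```

Because the ranges are computed from IDs rather than by loading records, the cron worker's own memory use stays constant regardless of how many namespaces exist.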
Other notable changes:
One of the reasons the CI minutes were not fully processed was the 1-hour exclusive lease taken by `ClearSharedRunnersMinutesWorker`. When Sidekiq Memory Killer killed and restarted the worker after 29 minutes of running, the restarted worker exited immediately (successfully) because of the existing exclusive lease. This then caused the worker not to be retried.
In this MR I've also:

- not used `include CronjobQueue` in `Ci::BatchResetMinutesWorker`, as it disables retries. Instead we want to be able to retry a batch if it fails for some reason (e.g. SQL timeouts)
- not used the exclusive lease in the new strategy, so that if the worker is killed and retried we can still do the processing immediately
Feature flag
This new logic is switched on by default via the `ci_parallel_minutes_reset` feature flag.
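Roughly, the guard in the cron worker might look like the following. The `Feature` module here is a simplified stand-in for GitLab's feature flag API, which consults flag storage at runtime:

```ruby
# Simplified stand-in for GitLab's Feature API (assumption: the real
# implementation looks up flag state in the database, not a default).
module Feature
  def self.enabled?(name, default_enabled: false)
    default_enabled
  end
end

if Feature.enabled?(:ci_parallel_minutes_reset, default_enabled: true)
  # pre-batch by ID range and schedule Ci::BatchResetMinutesWorker jobs
else
  # fall back to the old single-worker reset
end
```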
Query plans
To understand the overall load that `ClearSharedRunnersMinutesWorker` generates, I'm reporting here the query plans of what happens in each batch. Each `Ci::BatchResetMinutesWorker` receives a range of 100,000 IDs to process.
Let's consider the 2nd instance of `Ci::BatchResetMinutesWorker` (processing IDs from 100,001 to 200,000).
1. Inside `Namespace.reset_ci_minutes_for_batch!` we use `each_batch` to process the 100,000 namespaces in further sub-batches of 1,000:

```sql
SELECT "namespaces"."id" FROM "namespaces"
WHERE "namespaces"."id" BETWEEN 100001 AND 200000 AND "namespaces"."id" >= 100001
ORDER BY "namespaces"."id" ASC LIMIT 1 OFFSET 1000
```

`/chatops run explain SELECT "namespaces"."id" FROM "namespaces" WHERE "namespaces"."id" BETWEEN 100001 AND 200000 AND "namespaces"."id" >= 100001 ORDER BY "namespaces"."id" ASC LIMIT 1 OFFSET 1000`
```
Limit (cost=189.60..189.79 rows=1 width=4) (actual time=1.686..1.687 rows=1 loops=1)
  Buffers: shared hit=194 read=1
  I/O Timings: read=0.016
  -> Index Only Scan using namespaces_pkey on namespaces (cost=0.43..17415.18 rows=92060 width=4) (actual time=0.090..1.622 rows=1001 loops=1)
       Index Cond: ((id >= 100001) AND (id <= 200000) AND (id >= 100001))
       Heap Fetches: 674
       Buffers: shared hit=194 read=1
       I/O Timings: read=0.016
Planning time: 2.378 ms
Execution time: 1.718 ms
```
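`each_batch` avoids scanning the whole range at once by walking the primary key: each iteration finds the upper-bound ID of the next sub-batch (the `LIMIT 1 OFFSET 1000` query above) and then operates only on `id >= lower AND id < upper`. A rough pure-Ruby analogue of that loop (illustrative, not the Rails implementation; the method name `each_id_batch` is made up):

```ruby
# Walk a sorted ID range in fixed-size sub-batches, the way each_batch
# walks the primary key index (illustrative analogue; the real
# each_batch issues the LIMIT 1 OFFSET n query to find each bound).
def each_id_batch(from_id, to_id, of:)
  lower = from_id
  while lower <= to_id
    upper = [lower + of, to_id + 1].min
    yield(lower, upper - 1) # process ids in [lower, upper - 1]
    lower = upper
  end
end

batches = []
each_id_batch(100_001, 200_000, of: 1_000) { |a, b| batches << [a, b] }
batches.size # => 100
```

Each sub-batch then runs the three UPDATE statements below against at most 1,000 namespace IDs, keeping transaction time and lock scope small.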
2. Then `Namespace.recalculate_extra_shared_runners_minutes_limits!(namespaces)` executes:

```sql
UPDATE "namespaces"
SET extra_shared_runners_minutes_limit = GREATEST((namespaces.shared_runners_minutes_limit + namespaces.extra_shared_runners_minutes_limit) - ROUND(namespace_statistics.shared_runners_seconds / 60.0), 0)
FROM namespace_statistics
WHERE "namespaces"."id" BETWEEN 100001 AND 200000
  AND "namespaces"."id" >= 100001 AND "namespaces"."id" < 101001
  AND (namespaces.shared_runners_minutes_limit > 0)
  AND (namespaces.extra_shared_runners_minutes_limit > 0)
  AND (namespace_statistics.namespace_id = namespaces.id)
  AND (namespace_statistics.shared_runners_seconds > (namespaces.shared_runners_minutes_limit * 60));
```
```
ModifyTable on public.namespaces (cost=0.86..1905.35 rows=1 width=354) (actual time=417.958..417.959 rows=0 loops=1)
  Buffers: shared hit=66 read=104 dirtied=1
  I/O Timings: read=416.707
  -> Nested Loop (cost=0.86..1905.35 rows=1 width=354) (actual time=417.955..417.955 rows=0 loops=1)
       Buffers: shared hit=66 read=104 dirtied=1
       I/O Timings: read=416.707
       -> Index Scan using namespaces_pkey on public.namespaces (cost=0.43..1900.88 rows=1 width=336) (actual time=417.952..417.952 rows=0 loops=1)
            Index Cond: ((namespaces.id >= 100001) AND (namespaces.id <= 200000) AND (namespaces.id >= 100001) AND (namespaces.id < 101001))
            Filter: ((namespaces.shared_runners_minutes_limit > 0) AND (namespaces.extra_shared_runners_minutes_limit > 0))
            Rows Removed by Filter: 937
            Buffers: shared hit=66 read=104 dirtied=1
            I/O Timings: read=416.707
       -> Index Scan using index_namespace_statistics_on_namespace_id on public.namespace_statistics (cost=0.42..4.45 rows=1 width=14) (actual time=0.000..0.000 rows=0 loops=0)
            Index Cond: (namespace_statistics.namespace_id = namespaces.id)
            Filter: (namespace_statistics.shared_runners_seconds > (namespaces.shared_runners_minutes_limit * 60))
            Rows Removed by Filter: 0
```
Time: 418.864 ms
- planning: 0.700 ms
- execution: 418.164 ms
- I/O read: 416.707 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 66 (~528.00 KiB) from the buffer pool
- reads: 104 (~832.00 KiB) from the OS file cache, including disk I/O
- dirtied: 1 (~8.00 KiB)
- writes: 0
3. Then `Namespace.reset_shared_runners_seconds!(namespaces)` resets minutes for the namespaces and related projects:

```sql
UPDATE "namespace_statistics"
SET "shared_runners_seconds" = 0, "shared_runners_seconds_last_reset" = '2020-04-10 07:39:54.487884'
WHERE "namespace_statistics"."namespace_id" IN (
  SELECT "namespaces"."id" FROM "namespaces" WHERE "namespaces"."id" BETWEEN 100001 AND 200000 AND "namespaces"."id" >= 100001 AND "namespaces"."id" < 101001
)
AND "namespace_statistics"."shared_runners_seconds" != 0;

UPDATE "project_statistics"
SET "shared_runners_seconds" = 0, "shared_runners_seconds_last_reset" = '2020-04-10 07:39:54.489526'
WHERE "project_statistics"."namespace_id" IN (
  SELECT "namespaces"."id" FROM "namespaces" WHERE "namespaces"."id" BETWEEN 100001 AND 200000 AND "namespaces"."id" >= 100001 AND "namespaces"."id" < 101001
)
AND "project_statistics"."shared_runners_seconds" != 0;
```
```
ModifyTable on public.namespace_statistics (cost=0.86..5672.88 rows=6 width=32) (actual time=88.383..88.384 rows=0 loops=1)
  Buffers: shared hit=2996 read=71 dirtied=12
  I/O Timings: read=80.028
  -> Nested Loop (cost=0.86..5672.88 rows=6 width=32) (actual time=10.935..75.227 rows=5 loops=1)
       Buffers: shared hit=2982 read=67 dirtied=7
       I/O Timings: read=67.303
       -> Index Scan using namespaces_pkey on public.namespaces (cost=0.43..1896.08 rows=960 width=10) (actual time=0.018..0.611 rows=937 loops=1)
            Index Cond: ((namespaces.id >= 100001) AND (namespaces.id <= 200000) AND (namespaces.id >= 100001) AND (namespaces.id < 101001))
            Buffers: shared hit=170
       -> Index Scan using index_namespace_statistics_on_namespace_id on public.namespace_statistics (cost=0.42..3.92 rows=1 width=14) (actual time=0.079..0.079 rows=0 loops=937)
            Index Cond: (namespace_statistics.namespace_id = namespaces.id)
            Filter: (namespace_statistics.shared_runners_seconds <> 0)
            Rows Removed by Filter: 0
            Buffers: shared hit=2812 read=67 dirtied=7
            I/O Timings: read=67.303
```
Time: 89.057 ms
- planning: 0.612 ms
- execution: 88.445 ms
- I/O read: 80.028 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 2996 (~23.40 MiB) from the buffer pool
- reads: 71 (~568.00 KiB) from the OS file cache, including disk I/O
- dirtied: 12 (~96.00 KiB)
- writes: 0
```
ModifyTable on public.project_statistics (cost=0.87..32719.46 rows=105 width=96) (actual time=3288.946..3288.946 rows=0 loops=1)
  Buffers: shared hit=3055 read=2827 dirtied=216
  I/O Timings: read=3201.249
  -> Nested Loop (cost=0.87..32719.46 rows=105 width=96) (actual time=246.081..3258.271 rows=11 loops=1)
       Buffers: shared hit=3009 read=2807 dirtied=200
       I/O Timings: read=3171.649
       -> Index Scan using namespaces_pkey on public.namespaces (cost=0.43..1896.08 rows=960 width=10) (actual time=0.018..2.231 rows=937 loops=1)
            Index Cond: ((namespaces.id >= 100001) AND (namespaces.id <= 200000) AND (namespaces.id >= 100001) AND (namespaces.id < 101001))
            Buffers: shared hit=170
       -> Index Scan using index_project_statistics_on_namespace_id on public.project_statistics (cost=0.43..32.10 rows=1 width=74) (actual time=3.370..3.473 rows=0 loops=937)
            Index Cond: (project_statistics.namespace_id = namespaces.id)
            Filter: (project_statistics.shared_runners_seconds <> 0)
            Rows Removed by Filter: 3
            Buffers: shared hit=2839 read=2807 dirtied=200
            I/O Timings: read=3171.649
```
Time: 3.290 s
- planning: 0.573 ms
- execution: 3.289 s
- I/O read: 3.201 s
- I/O write: 0.000 ms
Shared buffers:
- hits: 3055 (~23.90 MiB) from the buffer pool
- reads: 2827 (~22.10 MiB) from the OS file cache, including disk I/O
- dirtied: 216 (~1.70 MiB)
- writes: 0
4. Finally `Namespace.reset_ci_minutes_notifications!(namespaces)` executes:

```sql
UPDATE "namespaces"
SET "last_ci_minutes_notification_at" = NULL, "last_ci_minutes_usage_notification_level" = NULL
WHERE "namespaces"."id" BETWEEN 100001 AND 200000 AND "namespaces"."id" >= 100001 AND "namespaces"."id" < 101001;
```
```
ModifyTable on public.namespaces (cost=0.43..1896.08 rows=960 width=348) (actual time=5323.079..5323.079 rows=0 loops=1)
  Buffers: shared hit=47399 read=4794 dirtied=3819
  I/O Timings: read=5069.302
  -> Index Scan using namespaces_pkey on public.namespaces (cost=0.43..1896.08 rows=960 width=348) (actual time=0.031..4.259 rows=937 loops=1)
       Index Cond: ((namespaces.id >= 100001) AND (namespaces.id <= 200000) AND (namespaces.id >= 100001) AND (namespaces.id < 101001))
       Buffers: shared hit=170
```
Time: 5.323 s
- planning: 0.254 ms
- execution: 5.323 s
- I/O read: 5.069 s
- I/O write: 0.000 ms
Shared buffers:
- hits: 47399 (~370.30 MiB) from the buffer pool
- reads: 4794 (~37.50 MiB) from the OS file cache, including disk I/O
- dirtied: 3819 (~29.80 MiB)
- writes: 0
Does this MR meet the acceptance criteria?
Conformity
- Changelog entry
- [-] Documentation (if required)
- Code review guidelines
- Merge request performance guidelines
- Style guides
- Database guides
- Separation of EE specific content
Availability and Testing
- Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- [-] Tested in all supported browsers
- [-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
- [-] Label as security and @ mention @gitlab-com/gl-security/appsec
- [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- [-] Security reports checked/validated by a reviewer from the AppSec team