Improve PipelineScheduleWorker and check if schedule is executed
## What does this MR do and why?
- Improve the performance of `PipelineScheduleWorker` to reduce the time needed to enqueue pipelines.
- Check whether a schedule has already been executed.

Currently, `PipelineScheduleWorker` takes a long time to process, so it might not finish before the next `PipelineScheduleWorker` starts to run. This means pipeline schedules that haven't been processed by the first `PipelineScheduleWorker` might be picked up by the second one.

Merge request !101328 (merged) tried to make `RunPipelineScheduleWorker` idempotent to reduce duplicated enqueues. However, this didn't prevent duplication from happening at two different timestamps, and duplications are still being reported by a user in this thread: #338609 (comment 1200751371).

Therefore, the aim of this MR is to reduce the execution time of `PipelineScheduleWorker` so that it doesn't overlap with the next run. At the same time, a check is added to prevent a schedule from being executed when it's already scheduled for the next run.
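The idea behind that check can be sketched in plain Ruby (a hypothetical helper for illustration, not GitLab's actual implementation): if a schedule's `next_run_at` is already in the future, a previous worker run must have picked it up and rescheduled it, so the current run can skip it.

```ruby
require 'time'

# Hypothetical guard: a schedule whose next_run_at has already been pushed
# past the current time was rescheduled by an earlier worker run, so it is
# considered "already scheduled for the next run" and can be skipped.
def already_scheduled_for_next_run?(next_run_at, now: Time.now)
  next_run_at > now
end
```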
## Queries and their performance
This MR uses this scope:

```ruby
Ci::PipelineSchedule
  .select(:id, :owner_id) # Minimize the selected columns
  .runnable_schedules
  .find_in_batches do |schedules|
```
And the queries produced are:
- The first batch produces the following query:

  ```sql
  SELECT "ci_pipeline_schedules"."id", "ci_pipeline_schedules"."owner_id" FROM "ci_pipeline_schedules" WHERE "ci_pipeline_schedules"."active" = TRUE AND (next_run_at < '2023-01-05 02:19:00') ORDER BY "ci_pipeline_schedules"."id" ASC LIMIT 1000
  ```

  The postgres.ai analyze report is at https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/14210/commands/49856

  The explain is:

  ```
  Limit  (cost=0.42..171.29 rows=1000 width=8) (actual time=0.317..30.370 rows=1000 loops=1)
    Buffers: shared hit=4859 read=22 dirtied=36
    I/O Timings: read=1.154 write=0.000
    ->  Index Scan using ci_pipeline_schedules_pkey on public.ci_pipeline_schedules  (cost=0.42..14468.22 rows=84670 width=8) (actual time=0.316..30.249 rows=1000 loops=1)
          Filter: (ci_pipeline_schedules.active AND (ci_pipeline_schedules.next_run_at < '2023-01-05 02:19:00'::timestamp without time zone))
          Rows Removed by Filter: 4302
          Buffers: shared hit=4859 read=22 dirtied=36
          I/O Timings: read=1.154 write=0.000
  ```
- Batches after the first one produce a query similar to the above, with an additional condition `"ci_pipeline_schedules"."id" > $MAX_ID_FROM_LAST_BATCH`:

  ```sql
  SELECT "ci_pipeline_schedules"."id", "ci_pipeline_schedules"."owner_id" FROM "ci_pipeline_schedules" WHERE "ci_pipeline_schedules"."active" = TRUE AND (next_run_at < '2023-01-05 02:19:00') AND "ci_pipeline_schedules"."id" > 10000 ORDER BY "ci_pipeline_schedules"."id" ASC LIMIT 1000
  ```

  The postgres.ai analyze report is at https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/14210/commands/49857

  The explain is:

  ```
  Limit  (cost=0.42..178.99 rows=1000 width=8) (actual time=0.040..3.928 rows=1000 loops=1)
    Buffers: shared hit=2911
    I/O Timings: read=0.000 write=0.000
    ->  Index Scan using ci_pipeline_schedules_pkey on public.ci_pipeline_schedules  (cost=0.42..14723.73 rows=82451 width=8) (actual time=0.039..3.837 rows=1000 loops=1)
          Index Cond: (ci_pipeline_schedules.id > 10000)
          Filter: (ci_pipeline_schedules.active AND (ci_pipeline_schedules.next_run_at < '2023-01-05 02:19:00'::timestamp without time zone))
          Rows Removed by Filter: 1950
          Buffers: shared hit=2911
          I/O Timings: read=0.000 write=0.000
  ```
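The extra `id >` condition in the later batches comes from `find_in_batches`, which pages through rows by primary key. That batching pattern can be sketched in plain Ruby (hypothetical in-memory data ordered by `id`, not ActiveRecord):

```ruby
# Sketch of keyset pagination as done by find_in_batches: each batch only
# fetches rows whose id is greater than the last id seen, which is why every
# query after the first gains an `id > $MAX_ID_FROM_LAST_BATCH` condition.
# Assumes `rows` is ordered by :id, as the SQL ORDER BY guarantees.
def each_batch(rows, batch_size: 1000)
  last_id = 0
  loop do
    batch = rows.select { |r| r[:id] > last_id }.first(batch_size)
    break if batch.empty?

    yield batch
    last_id = batch.last[:id]
  end
end
```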
As the plans above show, an index should be used instead of the filter. The index is prepared in this MR and added in !107904 (closed).

If the index is added at postgres.ai as follows:

```sql
EXEC CREATE INDEX index_ci_pipeline_schedules_on_id_and_next_run_at_and_active ON ci_pipeline_schedules USING btree (id, next_run_at) WHERE (active = true);
```
The execution plans for the above queries become:
- The first batch produces the following query:

  ```sql
  SELECT "ci_pipeline_schedules"."id", "ci_pipeline_schedules"."owner_id" FROM "ci_pipeline_schedules" WHERE "ci_pipeline_schedules"."active" = TRUE AND (next_run_at < '2023-01-05 02:19:00') ORDER BY "ci_pipeline_schedules"."id" ASC LIMIT 1000
  ```

  The postgres.ai analyze report is at https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/14210/commands/49860

  The explain is:

  ```
  Limit  (cost=0.42..127.70 rows=1000 width=8) (actual time=0.922..3.279 rows=1000 loops=1)
    Buffers: shared hit=935 read=5
    I/O Timings: read=0.956 write=0.000
    ->  Index Scan using index_ci_pipeline_schedules_on_id_and_next_run_at_and_active on public.ci_pipeline_schedules  (cost=0.42..9687.35 rows=76103 width=8) (actual time=0.921..3.181 rows=1000 loops=1)
          Index Cond: (ci_pipeline_schedules.next_run_at < '2023-01-05 02:19:00'::timestamp without time zone)
          Buffers: shared hit=935 read=5
          I/O Timings: read=0.956 write=0.000
  ```
- Batches after the first one produce a query similar to the above, with an additional condition `"ci_pipeline_schedules"."id" > $MAX_ID_FROM_LAST_BATCH`:

  ```sql
  SELECT "ci_pipeline_schedules"."id", "ci_pipeline_schedules"."owner_id" FROM "ci_pipeline_schedules" WHERE "ci_pipeline_schedules"."active" = TRUE AND (next_run_at < '2023-01-05 02:19:00') AND "ci_pipeline_schedules"."id" > 10000 ORDER BY "ci_pipeline_schedules"."id" ASC LIMIT 1000
  ```

  The postgres.ai analyze report is at https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/14210/commands/49859

  The explain is:

  ```
  Limit  (cost=0.42..133.50 rows=1000 width=8) (actual time=0.213..2.369 rows=1000 loops=1)
    Buffers: shared hit=945 read=9
    I/O Timings: read=0.335 write=0.000
    ->  Index Scan using index_ci_pipeline_schedules_on_id_and_next_run_at_and_active on public.ci_pipeline_schedules  (cost=0.42..9862.55 rows=74108 width=8) (actual time=0.212..2.269 rows=1000 loops=1)
          Index Cond: ((ci_pipeline_schedules.id > 10000) AND (ci_pipeline_schedules.next_run_at < '2023-01-05 02:19:00'::timestamp without time zone))
          Buffers: shared hit=945 read=9
          I/O Timings: read=0.335 write=0.000
  ```
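For reference, a partial index like the one above would typically be added in a GitLab-style post-deployment migration along these lines. This is only a hedged sketch using GitLab's `add_concurrent_index` migration helper; the actual migration lives in !107904.

```ruby
# Hypothetical migration sketch mirroring the SQL above; names and version
# number are illustrative, not copied from !107904.
class AddIndexToCiPipelineSchedulesOnIdAndNextRunAt < Gitlab::Database::Migration[2.1]
  INDEX_NAME = 'index_ci_pipeline_schedules_on_id_and_next_run_at_and_active'

  # add_concurrent_index builds the index without locking the table,
  # so DDL transactions must be disabled.
  disable_ddl_transaction!

  def up
    add_concurrent_index :ci_pipeline_schedules, [:id, :next_run_at],
      where: 'active = true', name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :ci_pipeline_schedules, INDEX_NAME
  end
end
```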
## Screenshots or screen recordings
Screenshots are required for UI changes, and strongly recommended for all other merge requests.
## How to set up and validate locally
Numbered steps to set up and validate the change are strongly suggested.
## MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #338609 (closed)