Backend: Scheduled pipelines are triggering multiple times
Summary
Scheduled pipelines are occasionaly triggered multiple times unexpectedly.
Steps to reproduce
- Setup a scheduled pipeline with
0 */3 * * *
Custom schedule on Beijing timeline - Wait
- Measure and see how many pipelines are triggered per hour with a chart like https://app.periscopedata.com/app/gitlab/590833/WIP:-Kyle-Wiebers-Scratchpad?widget=12272748&udv=0
Example Project
There have been a few of these... (click to expand):
GitLab JiHu Code Sync
URL | Owner | Frequency chart | Schedule Screenshot |
---|---|---|---|
https://gitlab.com/gitlab-jh/code-sync/-/pipeline_schedules/146378/edit | @gitlab-jh-bot | https://app.periscopedata.com/app/gitlab/590833/WIP:-Kyle-Wiebers-Scratchpad?widget=12272748&udv=0 |
Triage Ops
URL | Owner | Frequency chart | Schedule Screenshot |
---|---|---|---|
https://gitlab.com/gitlab-org/quality/triage-ops/-/pipeline_schedules/157187/edit | @gitlab-bot | https://app.periscopedata.com/app/gitlab/590833/WIP:-Kyle-Wiebers-Scratchpad?widget=12481416&udv=0 |
GitLab Bi-Hourly Pipeline
URL | Owner | Frequency chart | Schedule Screenshot |
---|---|---|---|
https://gitlab.com/gitlab-org/gitlab/-/pipeline_schedules/23503/edit | @gitlab-bot | https://app.periscopedata.com/app/gitlab/590833/WIP:-Kyle-Wiebers-Scratchpad?widget=12481421&udv=0 |
What is the current bug behavior?
Pipeline occasionally gets triggered multiple times
What is the expected correct behavior?
Pipeline is triggered once according to the schedule
Relevant logs and/or screenshots
- https://app.periscopedata.com/app/gitlab/590833/WIP:-Kyle-Wiebers-Scratchpad?widget=12272748&udv=0
- gitlab-org/quality/triage-ops#722 (comment 651848969)
Output of checks
This bug happens on GitLab.com
Results of GitLab environment info
Expand for output related to GitLab environment info
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:env:info`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)
Results of GitLab application Check
Expand for output related to the GitLab application check
(For installations with omnibus-gitlab package run and paste the output of:
sudo gitlab-rake gitlab:check SANITIZE=true
)(For installations from source run and paste the output of:
sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true
)(we will only investigate if the tests are passing)
Possible fixes
From #338609 (comment 942671252):
PipelineScheduleWorker
runs at 04:00am- It finds "runnable schedules"
- For each schedule, it uses
PipelineScheduleService
to executeschedule.schedule_next_run!
(which updates thenext_run_at
column in the database)- It then asynchronously queues a
RunPipelineScheduleWorker
with the schedule information- Importantly, the above may take longer than 5 minutes, so it may still be processing schedules from the original "runnable schedules" list generated 5 minutes earlier
PipelineScheduleWorker
runs at 04:05am - the 04:00am job is still running at this time- It finds "runnable schedules" - this may include schedules that co-exist in the still-running 04:00am job due to
next_run_at
not being updated yet- It duplicates a
RunPipelineScheduleWorker
for these schedulesI might be off the mark, but it appears that
PipelineScheduleWorker
might not be marked as idempotent? Looking at the docs on Sidekiq idempotent jobs,idempotent!
with adeduplicate :until_executed
strategy looks like it could be a good option. The docs state:It can be used to prevent jobs from running simultaneously multiple times.
which looks to be the key problem here (I'm assuming that cron-based workers can still use this feature).
Shifting
schedule.schedule_next_run!
into theRunPipelineScheduleWorker
is a good idea, but would duplication still not be possible, if it takes longer than 5 minutes for the worker to be picked up and executed (if there is a large backlog of pending jobs)?
- Move
schedule_next_run!
insideRunPipelineScheduleWorker
- Make
RunPipelineScheduleWorker
idempotent withdeduplicate :until_executed
- Potentially remove the
PipelineScheduleService
altogether
To validate the fix we should compare the PipelineScheduleWorker
duration before and after to ensure it's lower than 5 minutes.
Unknown but may be related to !62826 (merged)