Reset skipped jobs on new alive jobs during pipeline processing
What does this MR do and why?
As discovered in #388539 (closed), a race condition sometimes occurs during pipeline processing, which results in some dependent jobs remaining skipped after clicking the "Play all manual" button. A simplified visual workflow of what is happening can be found here.
In this MR, we fix this condition by re-running ResetSkippedJobsService on any jobs that change from a "stopped" status to an "alive" status during pipeline processing. The status definitions can be found in app/models/concerns/ci/has_status.rb.
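As a rough illustration of what "changing from a stopped status to an alive status" means, the following standalone Ruby sketch compares job statuses before and after a processing pass. The status lists and the `newly_alive_job_names` helper are assumptions made for this example only; the authoritative definitions are in app/models/concerns/ci/has_status.rb, and this is not the code added by the MR.

```ruby
# Illustrative status groups (assumed for this sketch); the real lists live in
# app/models/concerns/ci/has_status.rb and may differ.
STOPPED_STATUSES = %w[success failed canceled skipped manual scheduled].freeze
ALIVE_STATUSES = %w[created waiting_for_resource preparing pending running].freeze

# Hypothetical helper: given job-name => status hashes taken before and after
# a processing pass, return the names of jobs that moved from a stopped
# status to an alive status.
def newly_alive_job_names(before, after)
  after.select do |name, new_status|
    STOPPED_STATUSES.include?(before[name]) && ALIVE_STATUSES.include?(new_status)
  end.keys
end

before = { 'manual-job-1' => 'manual', 'manual-job-2' => 'manual', 'job-1' => 'skipped' }
after  = { 'manual-job-1' => 'pending', 'manual-job-2' => 'manual', 'job-1' => 'skipped' }

# Jobs returned here are the ones for which dependent skipped jobs would need
# to be reset again.
puts newly_alive_job_names(before, after).inspect
```

In this sketch only manual-job-1 is reported, because it is the only job whose status moved from a stopped status (manual) to an alive one (pending).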
Feature Flag: ci_reset_skipped_jobs_in_atomic_processing
How to set up and validate locally
The following steps present a scenario where there is a high likelihood of demonstrating the issue.
- First, check out this branch and amend the codebase with a couple of sleep calls. The goal is to ensure that AtomicProcessingService starts running just after playing manual-job-1 and before playing manual-job-2.
Add sleep(0.5) to app/services/ci/play_manual_stage_service.rb:15.
Add sleep(1) to app/services/ci/pipeline_processing/atomic_processing_service.rb:94.
If the above sleep times don't reliably reproduce the error, try updating the times to 1.5 and 2 seconds. These are the values that worked on my local machine.
- Update your project's .gitlab-ci.yml file with the following contents:
stages:
  - build
  - test
manual-job-1:
  stage: build
  when: manual
  script: echo
manual-job-2:
  stage: build
  when: manual
  script: echo
job-1:
  stage: test
  needs: [manual-job-1, manual-job-2]
  script: echo
- Run the pipeline. Initially it should look like the following screenshot. (Note that the initial processing may take a few seconds longer given the sleep calls we added.)
- Now click the "Play all manual" button of the build stage. The following results:
In the above screenshot, observe that job-1 is in skipped status.
- Now enable the feature flag:
Feature.enable(:ci_reset_skipped_jobs_in_atomic_processing)
- Repeat steps 3-4 (run the pipeline, then click "Play all manual"), and observe that the issue does not occur: job-1 goes to created status and eventually succeeds. Repeat the test several times to ensure the pipeline reliably succeeds.
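To double-check that the test configuration really exercises the scenario, a small standalone Ruby sketch can parse the .gitlab-ci.yml contents from the steps above and list which jobs need manual jobs. The `manual_dependencies` helper is hypothetical and exists only for this illustration; it uses Ruby's standard YAML library and does not touch GitLab itself.

```ruby
require 'yaml'

# The .gitlab-ci.yml contents from the validation steps.
CI_CONFIG = <<~CONFIG
  stages:
    - build
    - test
  manual-job-1:
    stage: build
    when: manual
    script: echo
  manual-job-2:
    stage: build
    when: manual
    script: echo
  job-1:
    stage: test
    needs: [manual-job-1, manual-job-2]
    script: echo
CONFIG

# Hypothetical helper: map each job to the manual jobs it lists under `needs`.
# Jobs that appear here are the ones that could remain skipped if the race occurs.
def manual_dependencies(config)
  jobs = config.reject { |key, _| key == 'stages' }
  manual_names = jobs.select { |_, job| job['when'] == 'manual' }.keys
  jobs.each_with_object({}) do |(name, job), result|
    needed = Array(job['needs']) & manual_names
    result[name] = needed unless needed.empty?
  end
end

config = YAML.safe_load(CI_CONFIG)
puts manual_dependencies(config).inspect
```

For this configuration the only entry is job-1, which needs both manual-job-1 and manual-job-2, so it is the job to watch when clicking "Play all manual".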
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #388539 (closed)