Draft: Fix stuck skipped jobs after playing a manual job
What does this MR do and why?
This MR fixes the problem when a manual job is played and some subsequent jobs are stuck at "skipped" status.
Related to #341561 (closed)
When playing a manual job, we enqueue PipelineProcessWorker
in a few places:
-
Ci::PlayBuildService
for the played job (build.enqueue
triggersafter_transition
) -
Ci::AfterRequeueJobService
for each subsequent skipped job (processable.process
triggersafter_transition
)
In this commit, we skip pipeline processing for all by using "skip_pipeline_processing", and enqueue a single PipelineProcessWorker
manually.
Why?
With the following examples in the section below, you cannot always reproduce the bug in the production environment. This is a sign of a race-condition or concurrency problem. I tried those examples in my local environment and saw that PipelineProcessWorker
was enqueued more than once redundantly, sometimes.
Example CI config:
stages:
- A
- B
- C
A:
stage: A
script: exit 0
when: manual
B:
stage: B
needs: ["A"]
script: exit 0
C1:
stage: C
needs: ["B"]
script: exit 0
C2:
stage: C
needs: ["B"]
script: exit 0
C3:
stage: C
needs: ["B"]
script: exit 0
Initial status:
When you play the job A:
- In
Ci::AfterRequeueJobService
; dependent jobs are iterated and run theprocess
method on them.
# simplified version from https://gitlab.com/gitlab-org/gitlab/-/blob/2b63a36845760f558f268a695d7bdb92cc2520ff/app/services/ci/after_requeue_job_service.rb
(stage_dependent_jobs(processable) | needs_dependent_jobs(processable)).each do |job|
job.process(current_user)
end
- Those jobs are in random order. In our example, the order is: "C2, B, C1, C3".
- When we process "C2", its status will be "created". And we enqueue a
PipelineProcessWorker
right after it. - In our race-condition case, that worker is started to be processed before even iterating other jobs; "B, C1, C3"
-
PipelineProcessWorker
callsCi::PipelineProcessing::AtomicProcessingService
andupdate_processable!
is called for "C2". - When we calculate the status of "C2", we use the status of "B", which is "skipped" at the moment. So, we mark "C2" as "skipped".
- The problem is here, we are calculating the status of a job according to an on-the-fly state of the pipeline.
- And "C2" stays in the "skipped" state forever, even after processing "B, C1, C3".
Screenshots or screen recordings / How to set up and validate locally
Example 1 of the bug: "Steps to reproduce" section of #341561 (closed)
diff --git a/app/services/ci/after_requeue_job_service.rb b/app/services/ci/after_requeue_job_service.rb
index 9101ae91967..742bd4e3e54 100644
--- a/app/services/ci/after_requeue_job_service.rb
+++ b/app/services/ci/after_requeue_job_service.rb
@@ -13,6 +13,7 @@ def process_subsequent_jobs(processable)
(stage_dependent_jobs(processable) | needs_dependent_jobs(processable))
.each do |processable|
process(processable)
+ sleep 1
end
end
diff --git a/app/services/ci/play_build_service.rb b/app/services/ci/play_build_service.rb
index c1cf06a4631..ba0f3992ad9 100644
--- a/app/services/ci/play_build_service.rb
+++ b/app/services/ci/play_build_service.rb
@@ -8,6 +8,8 @@ def execute(build, job_variables_attributes = nil)
# Try to enqueue the build, otherwise create a duplicate.
#
if build.enqueue
+ sleep 1
+
build.tap do |build|
build.update(user: current_user, job_variables_attributes: job_variables_attributes || [])
stages:
- first
- second
- third
first-a:
stage: first
script: sleep 10
when: manual
first-b:
stage: first
script: sleep 10
when: manual
second-a:
stage: second
script: echo
needs: [first-a]
second-b:
stage: second
script: echo
needs: [first-b]
third-a:
stage: third
script: echo
needs:
- first-a
- second-a
when: manual
third-b:
stage: third
script: echo
needs:
- first-b
- second-b
when: manual
Click "Play All";
You'll see this;
Then this;
- Example 2 of the bug: Customer report: #341561 (comment 703918241)
- Example 3 of the bug: "Steps to reproduce" section of #336513 (closed)
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.