Skip to content

Draft: Fix stuck skipped jobs after playing a manual job

Furkan Ayhan requested to merge 341561-skipped-after-running-manual into master

What does this MR do and why?

This MR fixes the problem when a manual job is played and some subsequent jobs are stuck at "skipped" status.

Related to #341561 (closed)

When playing a manual job, we enqueue PipelineProcessWorker in a few places:

  1. Ci::PlayBuildService for the played job (build.enqueue triggers after_transition)
  2. Ci::AfterRequeueJobService for each subsequent skipped job (processable.process triggers after_transition)

In this commit, we skip pipeline processing for all by using "skip_pipeline_processing", and enqueue a single PipelineProcessWorker manually.

Why?

With the following examples in the section below, you cannot always reproduce the bug in the production environment. This is a sign of a race-condition or concurrency problem. I tried those examples in my local environment and saw that PipelineProcessWorker was enqueued more than once redundantly, sometimes.


Example CI config:

stages:
  - A
  - B
  - C

A:
  stage: A
  script: exit 0
  when: manual

B:
  stage: B
  needs: ["A"]
  script: exit 0

C1:
  stage: C
  needs: ["B"]
  script: exit 0

C2:
  stage: C
  needs: ["B"]
  script: exit 0

C3:
  stage: C
  needs: ["B"]
  script: exit 0

Initial status:

Screen_Shot_2021-11-05_at_13.58.06

When you play the job A:

Screen_Shot_2021-11-05_at_13.58.25

  1. In Ci::AfterRequeueJobService; dependent jobs are iterated and run the process method on them.
# simplified version from https://gitlab.com/gitlab-org/gitlab/-/blob/2b63a36845760f558f268a695d7bdb92cc2520ff/app/services/ci/after_requeue_job_service.rb

(stage_dependent_jobs(processable) | needs_dependent_jobs(processable)).each do |job|
  job.process(current_user)
end
  1. Those jobs are in random order. In our example, the order is: "C2, B, C1, C3".
  2. When we process "C2", its status will be "created". And we enqueue a PipelineProcessWorker right after it.
  3. In our race-condition case, that worker is started to be processed before even iterating other jobs; "B, C1, C3"
  4. PipelineProcessWorker calls Ci::PipelineProcessing::AtomicProcessingService and update_processable! is called for "C2".
  5. When we calculate the status of "C2", we use the status of "B", which is "skipped" at the moment. So, we mark "C2" as "skipped".
    • The problem is here, we are calculating the status of a job according to an on-the-fly state of the pipeline.
  6. And "C2" stays in the "skipped" state forever, even after processing "B, C1, C3".

Screen_Shot_2021-11-05_at_13.59.03

Screenshots or screen recordings / How to set up and validate locally

Example 1 of the bug: "Steps to reproduce" section of #341561 (closed)
diff --git a/app/services/ci/after_requeue_job_service.rb b/app/services/ci/after_requeue_job_service.rb
index 9101ae91967..742bd4e3e54 100644
--- a/app/services/ci/after_requeue_job_service.rb
+++ b/app/services/ci/after_requeue_job_service.rb
@@ -13,6 +13,7 @@ def process_subsequent_jobs(processable)
       (stage_dependent_jobs(processable) | needs_dependent_jobs(processable))
       .each do |processable|
         process(processable)
+        sleep 1
       end
     end

diff --git a/app/services/ci/play_build_service.rb b/app/services/ci/play_build_service.rb
index c1cf06a4631..ba0f3992ad9 100644
--- a/app/services/ci/play_build_service.rb
+++ b/app/services/ci/play_build_service.rb
@@ -8,6 +8,8 @@ def execute(build, job_variables_attributes = nil)
       # Try to enqueue the build, otherwise create a duplicate.
       #
       if build.enqueue
+        sleep 1
+
         build.tap do |build|
           build.update(user: current_user, job_variables_attributes: job_variables_attributes || [])
stages:
  - first
  - second
  - third

first-a:
  stage: first
  script: sleep 10
  when: manual

first-b:
  stage: first
  script: sleep 10
  when: manual

second-a:
  stage: second
  script: echo
  needs: [first-a]

second-b:
  stage: second
  script: echo
  needs: [first-b]

third-a:
  stage: third
  script: echo
  needs:
    - first-a
    - second-a
  when: manual

third-b:
  stage: third
  script: echo
  needs:
    - first-b
    - second-b
  when: manual

Click "Play All";

Screen_Shot_2021-10-25_at_22.24.13

You'll see this;

Screen_Shot_2021-10-25_at_23.14.59

Then this;

Screen_Shot_2021-10-25_at_23.19.23

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Furkan Ayhan

Merge request reports

Loading