Fix DAG order of subsequent jobs after requeue
What does this MR do and why?
When we requeue a job, we need to process subsequent skipped jobs, moreover, we need to do this in a specific order. Previously, we had an order by stage_idx
but after introducing same-stage jobs, this disappeared.
We recently fixed this problem for 99% of occurrences: !77528 (merged), however, the problem still exists for the same-stage pipelines. Example use-case: #217129 (comment 826686713)
In this MR, we are ordering jobs by stage first, then their DAG relationships.
We realized this when investigating #341561 (closed) and the discussions in !72778 (closed). (!72778 (comment 729165514))
This bug fix is behind a FF ci_fix_order_of_subsequent_jobs
#345587 (closed).
Previous attempt: !74394 (closed)
Screenshots or screen recordings
Example CI config:
stages: [a, b, c]
a1:
stage: a
script: exit $(($RANDOM % 2))
a2:
stage: a
script: exit 0
needs: [a1]
b1:
stage: b
script: exit 0
needs: [b2]
b2:
stage: b
script: exit 0
c1:
stage: c
script: exit 0
needs: [b2]
c2:
stage: c
script: exit 0
Initial status:
When you retry a1:
- In
Ci::AfterRequeueJobService
; dependent jobs are iterated and run theprocess
method on them. - Those jobs are in random order if they are in the same stage.
- When we retry "a1...",
Ci::AfterRequeueJobService
sets subsequent jobs as "created" one by one. - After setting a job as "created", we enqueue a
PipelineProcessWorker
. - In our race-condition case, that worker is started to be processed before even iterating other jobs.
- Summary; retry a1, set b1,
PipelineProcessWorker
runs, then set b2. - When we calculate the status of b1 in AtomicProcessing, we use the status of b2, which is "skipped" at the moment. So, we mark b1 as "skipped".
- The problem is here, we are calculating the status of a job according to an on-the-fly state of the pipeline.
- And b1 stays in the "skipped" state forever, even after processing b2.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.
Related to #341561 (closed)