Draft: Fix the order of subsequent jobs when requeuing a job (in DB)
What does this MR do and why?
This MR is just an alternative PoC for #341561 (closed).
In other MR (!74394 (closed)), we tried solving the ordering problem on the Ruby side, in this, on the DB side.
Copied from the other MR:
When we requeue a job, we need to process subsequent skipped jobs, moreover, we need to do this in a specific order. Previously, we had an order by
stage_idx
but after introducing same-stage jobs, this disappeared.In this MR, we are ordering jobs by stage first, then their DAG relationships.
We realized this when investigating #341561 (closed) and the discussions in !72778 (closed). (!72778 (comment 729165514))
This bug fix is behind a FF ci_fix_order_of_subsequent_jobs
#345587 (closed).
DB
New SQL query;
SELECT "ci_builds".*
FROM "ci_builds"
INNER JOIN (
SELECT id, MAX(depth) AS depth
FROM (
WITH RECURSIVE "base_and_descendants" AS (
(SELECT 1 AS depth, ARRAY[ci_builds.id] AS tree_path, false AS tree_cycle, "ci_builds".* FROM "ci_builds" WHERE "ci_builds"."id" = 1670888824)
UNION
(SELECT ("base_and_descendants"."depth" + 1), tree_path || "ci_builds".id, "ci_builds".id = ANY(tree_path), "ci_builds".*
FROM "ci_builds", "base_and_descendants", "ci_build_needs"
WHERE "ci_builds"."commit_id" = "base_and_descendants"."commit_id"
AND "ci_build_needs"."build_id" = "ci_builds"."id"
AND "ci_build_needs"."name" = "base_and_descendants"."name"
AND "base_and_descendants"."tree_cycle" = FALSE)
)
SELECT ci_builds.id, MAX(depth) AS depth
FROM "base_and_descendants" AS "ci_builds"
WHERE "ci_builds"."id" NOT IN (SELECT "ci_builds"."id" FROM "ci_builds" WHERE "ci_builds"."id" = 1670888824)
GROUP BY "ci_builds"."id"
UNION
SELECT ci_builds.id, -1 AS depth FROM "ci_builds" WHERE "ci_builds"."commit_id" = 386704131 AND (stage_idx > 1)
) union_ci_builds
GROUP BY id
) t2
ON ci_builds.id = t2.id
WHERE ("ci_builds"."status" IN ('skipped'))
ORDER BY stage_idx ASC, depth ASC, id ASC;
- Cold run: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7250/commands/25786
- Hot run: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7250/commands/25787
How to set up and validate locally
Bug description:
Example CI config:
stages:
- A
- B
- C
A:
stage: A
script: exit 0
when: manual
B:
stage: B
needs: ["A"]
script: exit 0
C1:
stage: C
needs: ["B"]
script: exit 0
C2:
stage: C
needs: ["B"]
script: exit 0
C3:
stage: C
needs: ["B"]
script: exit 0
Initial status:
When you play the job A:
- In
Ci::AfterRequeueJobService
; dependent jobs are iterated and run theprocess
method on them.
# simplified version from https://gitlab.com/gitlab-org/gitlab/-/blob/2b63a36845760f558f268a695d7bdb92cc2520ff/app/services/ci/after_requeue_job_service.rb
(stage_dependent_jobs(processable) | needs_dependent_jobs(processable)).each do |job|
job.process(current_user)
end
- Those jobs are in random order. In our example, the order is: "C2, B, C1, C3".
- When we process "C2", its status will be "created". And we enqueue a
PipelineProcessWorker
right after it. - In our race-condition case, that worker is started to be processed before even iterating other jobs; "B, C1, C3"
-
PipelineProcessWorker
callsCi::PipelineProcessing::AtomicProcessingService
andupdate_processable!
is called for "C2". - When we calculate the status of "C2", we use the status of "B", which is "skipped" at the moment. So, we mark "C2" as "skipped".
- The problem is here, we are calculating the status of a job according to an on-the-fly state of the pipeline.
- And "C2" stays in the "skipped" state forever, even after processing "B, C1, C3".
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.