Draft: Fix the order of subsequent jobs when requeuing a job (in DB) (!74491) · Merge requests · GitLab.org / GitLab

Furkan Ayhan requested to merge 341561-order-of-processed-jobs-with-sql into master Nov 15, 2021

What does this MR do and why?

This MR is just an alternative PoC for #341561 (closed).

In other MR (!74394 (closed)), we tried solving the ordering problem on the Ruby side, in this, on the DB side.

Copied from the other MR:

When we requeue a job, we need to process subsequent skipped jobs, moreover, we need to do this in a specific order. Previously, we had an order by stage_idx but after introducing same-stage jobs, this disappeared.

In this MR, we are ordering jobs by stage first, then their DAG relationships.

We realized this when investigating #341561 (closed) and the discussions in !72778 (closed). (!72778 (comment 729165514))

This bug fix is behind a FF ci_fix_order_of_subsequent_jobs #345587 (closed).

DB

New SQL query;

SELECT "ci_builds".*
FROM "ci_builds"
INNER JOIN (
  SELECT id, MAX(depth) AS depth
  FROM (
    WITH RECURSIVE "base_and_descendants" AS (
      (SELECT 1 AS depth, ARRAY[ci_builds.id] AS tree_path, false AS tree_cycle, "ci_builds".* FROM "ci_builds" WHERE "ci_builds"."id" = 1670888824)
      UNION
      (SELECT ("base_and_descendants"."depth" + 1), tree_path || "ci_builds".id, "ci_builds".id = ANY(tree_path), "ci_builds".*
       FROM "ci_builds", "base_and_descendants", "ci_build_needs"
       WHERE "ci_builds"."commit_id" = "base_and_descendants"."commit_id"
         AND "ci_build_needs"."build_id" = "ci_builds"."id"
         AND "ci_build_needs"."name" = "base_and_descendants"."name"
         AND "base_and_descendants"."tree_cycle" = FALSE)
    )
    SELECT ci_builds.id, MAX(depth) AS depth
    FROM "base_and_descendants" AS "ci_builds"
    WHERE "ci_builds"."id" NOT IN (SELECT "ci_builds"."id" FROM "ci_builds" WHERE "ci_builds"."id" = 1670888824)
    GROUP BY "ci_builds"."id"

    UNION

    SELECT ci_builds.id, -1 AS depth FROM "ci_builds" WHERE "ci_builds"."commit_id" = 386704131 AND (stage_idx > 1)
  ) union_ci_builds
  GROUP BY id
) t2
ON ci_builds.id = t2.id
WHERE ("ci_builds"."status" IN ('skipped'))
ORDER BY stage_idx ASC, depth ASC, id ASC;

How to set up and validate locally

Bug description:

Example CI config:

stages:
  - A
  - B
  - C

A:
  stage: A
  script: exit 0
  when: manual

B:
  stage: B
  needs: ["A"]
  script: exit 0

C1:
  stage: C
  needs: ["B"]
  script: exit 0

C2:
  stage: C
  needs: ["B"]
  script: exit 0

C3:
  stage: C
  needs: ["B"]
  script: exit 0

Initial status:

When you play the job A:

In Ci::AfterRequeueJobService; dependent jobs are iterated and run the process method on them.

# simplified version from https://gitlab.com/gitlab-org/gitlab/-/blob/2b63a36845760f558f268a695d7bdb92cc2520ff/app/services/ci/after_requeue_job_service.rb

(stage_dependent_jobs(processable) | needs_dependent_jobs(processable)).each do |job|
  job.process(current_user)
end

Those jobs are in random order. In our example, the order is: "C2, B, C1, C3".
When we process "C2", its status will be "created". And we enqueue a PipelineProcessWorker right after it.
In our race-condition case, that worker is started to be processed before even iterating other jobs; "B, C1, C3"
PipelineProcessWorker calls Ci::PipelineProcessing::AtomicProcessingService and update_processable! is called for "C2".
When we calculate the status of "C2", we use the status of "B", which is "skipped" at the moment. So, we mark "C2" as "skipped".
- The problem is here, we are calculating the status of a job according to an on-the-fly state of the pipeline.
And "C2" stays in the "skipped" state forever, even after processing "B, C1, C3".

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Edited Nov 15, 2021 by Furkan Ayhan

Draft: Fix the order of subsequent jobs when requeuing a job (in DB)

What does this MR do and why?

DB

How to set up and validate locally

MR acceptance checklist

Merge request reports