Draft: Fix the order of subsequent same-stage jobs when requeuing a job
What does this MR do and why?
When we requeue a job, we need to process subsequent skipped jobs, moreover, we need to do this in a specific order. Previously, we had an order by stage_idx
but after introducing same-stage jobs, this disappeared.
We recently fixed this problem for 99% of occurrences: !77528 (merged), however, the problem still exists for the same-stage pipelines. Example use-case: #217129 (comment 826686713)
In this MR, we are ordering jobs by stage first, then their DAG relationships.
We realized this when investigating #341561 (closed) and the discussions in !72778 (closed). (!72778 (comment 729165514))
- This bug fix is behind a FF
ci_fix_order_of_subsequent_jobs
#345587 (closed). - This MR adds a new column
processing_order
to theci_builds_metadata
table.
Note: We have two options for this fix; 1. doing the sorting on the Ruby side, 2. doing the sorting on the DB side. I chose the first one for this MR. And, I'll create another MR for the second option. (just did: !74491 (closed))
DB
UP
== 20220201092453 AddProcessingOrderToCiBuildsMetadata: migrating =============
-- add_column(:ci_builds_metadata, :processing_order, :integer)
-> 0.0024s
== 20220201092453 AddProcessingOrderToCiBuildsMetadata: migrated (0.0024s) ====
DOWN
== 20220201092453 AddProcessingOrderToCiBuildsMetadata: reverting =============
-- remove_column(:ci_builds_metadata, :processing_order, :integer)
-> 0.0020s
== 20220201092453 AddProcessingOrderToCiBuildsMetadata: reverted (0.0035s) ====
New query:
SELECT "ci_builds".*
FROM "ci_builds"
INNER JOIN "ci_builds_metadata" ON "ci_builds_metadata"."build_id" = "ci_builds"."id"
WHERE "ci_builds"."type" IN ('Ci::Processable',
'Ci::Build',
'Ci::Bridge')
AND "ci_builds"."commit_id" = 383904039
AND ("ci_builds"."status" IN ('skipped'))
AND (stage_idx > 0
OR "ci_builds"."scheduling_type" = 1
AND (EXISTS
(SELECT 1
FROM "ci_build_needs"
WHERE (ci_builds.id=ci_build_needs.build_id)
AND "ci_build_needs"."name" = 'build1')))
ORDER BY "ci_builds_metadata"."processing_order" ASC;
Query performance
ALTER TABLE "ci_builds_metadata" ADD "processing_order" integer DEFAULT 0 NOT NULL;
cold: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7511/commands/26727
hot: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7511/commands/26728
Screenshots or screen recordings
Bug description:
Example CI config:
default:
image: alpine
test1:
script: exit 0
when: manual
test2:
needs: ["test1"]
script: exit 0
parallel: 20
test3:
needs: ["test2"]
script: exit 0
parallel: 20
Initial status:
When you play the job A:
- In
Ci::AfterRequeueJobService
; dependent jobs are iterated and run theprocess
method on them.
# simplified version from https://gitlab.com/gitlab-org/gitlab/-/blob/84ff263c6a740b5ddf0202166c1cc085a42184a6/app/services/ci/after_requeue_job_service.rb#L26-28
stage_dependent_jobs.or(needs_dependent_jobs).ordered_by_stage.each do |job|
job.process(current_user)
end
- Those jobs are in random order if they are in the same stage. In our example, the order is: "test3..., test2...".
- When we process "test3...", its status will be "created". And we enqueue a
PipelineProcessWorker
right after it. - In our race-condition case, that worker is started to be processed before even iterating other jobs; "test2..."
-
PipelineProcessWorker
callsCi::PipelineProcessing::AtomicProcessingService
andupdate_processable!
is called for "test3...". - When we calculate the status of "test3...", we use the status of "test2...", which is "skipped" at the moment. So, we mark "test3..." as "skipped".
- The problem is here, we are calculating the status of a job according to an on-the-fly state of the pipeline.
- And "test3..." stays in the "skipped" state forever, even after processing "test2...".
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.