Backend: Stage play manual jobs leave some jobs in skipped state
Problem
When using manual jobs, the stage-level play button can cause improper transitions and leave some later jobs in a skipped state.
Technical details as mentioned by @furkanayhan:
- The "Play All" button triggers Ci::PlayBuildService for each manual job in the stage.
- They both call Ci::EnqueueJobService.
- In ResetSkippedJobsService, we lock jobs one by one but I think there is a race condition because of simultaneous workers. Maybe we should lock all jobs instead of one by one.
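The difference between locking jobs one by one and locking all of them at once can be sketched roughly as follows. This is a minimal in-memory illustration, not GitLab's implementation: the `Job` struct and the `reset_one_by_one` and `reset_all_locked` methods are hypothetical names, and a Ruby `Mutex` stands in for whatever locking the real service uses.

```ruby
# Hypothetical stand-in for a CI job; not GitLab's actual model.
Job = Struct.new(:name, :status, :lock)

def build_jobs(names)
  names.map { |name| Job.new(name, :skipped, Mutex.new) }
end

# One by one: each job is locked only while it is updated, so another
# worker can interleave between two jobs of the same batch.
def reset_one_by_one(jobs)
  jobs.each do |job|
    job.lock.synchronize { job.status = :created if job.status == :skipped }
  end
end

# All at once: acquire every lock before touching any job and release
# them only after the whole batch is updated. Locks are taken in a
# stable order (by name) so that two workers cannot deadlock.
def reset_all_locked(jobs)
  held = []
  begin
    jobs.sort_by(&:name).each do |job|
      job.lock.lock
      held << job.lock
    end
    jobs.each { |job| job.status = :created if job.status == :skipped }
  ensure
    held.reverse_each(&:unlock)
  end
end
```

With the second variant, a concurrent worker sees either the state before the whole batch or the state after it, never a partially reset batch.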
More information is available in the following thread.
Investigation
- Detailed analysis: #388539 (comment 1312083689)
- Simplified analysis and potential solution: #388539 (comment 1353271403)
- Latest proposed solution: #388539 (comment 1367106753)
Proposal
We are proceeding with the latest proposed solution. As part of this solution, it was also deemed necessary to update ResetSkippedJobsService to support multiple jobs as input by default, for performance reasons. This effort is being tracked in a separate issue: Backend: Update ResetSkippedJobsService to work... (#410223 - closed).
Additionally, adjacent to the current problem, we determined that it would be best to update the PipelineProcessWorker deduplication strategy from :until_executing to :until_executed, if_deduplicated: :reschedule_once. The purpose of this change is to:
- Provide more clarity on pipeline processing.
- Improve performance by reducing the number of jobs that run and are then immediately dropped because they fail to obtain the lease in AtomicProcessService.execute.
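The performance argument can be illustrated with a toy model. The sketch below is not GitLab's deduplication code: `ToyDedupQueue`, `simulate`, and the strategy label `:until_executed_reschedule_once` are hypothetical, and the model assumes that (a) under `:until_executing` the deduplication key is released when a job starts, (b) under `:until_executed` it is released when the job finishes, (c) a duplicate dropped while the key is held causes one reschedule after completion, and (d) a job that starts while another holds the lease is dropped without doing useful work.

```ruby
# Toy model of a deduplicating queue with a single processing lease.
class ToyDedupQueue
  attr_reader :useful_runs, :wasted_runs

  def initialize(strategy)
    @strategy = strategy   # :until_executing or :until_executed_reschedule_once
    @key_held = false      # true while duplicates must be dropped
    @flagged = false       # true if a duplicate was dropped (reschedule needed)
    @queued = 0
    @lease_held = false
    @useful_runs = 0
    @wasted_runs = 0
  end

  def enqueue
    if @key_held
      @flagged = true if @strategy == :until_executed_reschedule_once
      return :dropped
    end
    @key_held = true
    @queued += 1
    :enqueued
  end

  # A worker picks up a queued job. Returns :started, :wasted, or nil.
  def start
    return nil if @queued.zero?

    @queued -= 1
    @key_held = false if @strategy == :until_executing
    if @lease_held
      @wasted_runs += 1    # ran, could not get the lease, dropped
      return :wasted
    end
    @lease_held = true
    :started
  end

  # The job holding the lease finishes.
  def finish
    @lease_held = false
    @useful_runs += 1
    return unless @strategy == :until_executed_reschedule_once

    @key_held = false
    return unless @flagged

    @flagged = false
    enqueue                # reschedule exactly once
  end
end

def simulate(strategy)
  queue = ToyDedupQueue.new(strategy)
  queue.enqueue               # first job is queued
  queue.start                 # first job starts and takes the lease
  3.times { queue.enqueue }   # more events arrive while it runs
  queue.start                 # a second worker looks for work mid-run
  queue.finish                # first job finishes
  queue.finish while queue.start == :started # drain remaining jobs
  [queue.useful_runs, queue.wasted_runs]
end
```

In this model, `simulate(:until_executing)` returns `[1, 1]` (one run is wasted on the lease) while `simulate(:until_executed_reschedule_once)` returns `[2, 0]` (the follow-up run happens only after the first has finished).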
Implementation
Description | MR / Issue |
---|---|
Update PipelineProcessWorker deduplication strategy to until_executed | !115261 (merged) |
[Feature flag] Roll out ci_pipeline_process_worker_dedup_until_executed | #397829 (closed) |
Remove ci_pipeline_process_worker_dedup_until_executed feature flag |
!120174 (merged) |
(Prerequisite) Backend: Update ResetSkippedJobsService to work with multiple jobs as input | #410223 (closed) |
Reset skipped jobs on new alive jobs during pipeline processing | !118269 (merged) |
[Feature flag] Roll out ci_reset_skipped_jobs_in_atomic_processing | #410203 (closed) |
We can use the following logs to monitor how often this occurs:
Kibana Sidekiq logs: https://log.gprd.gitlab.net/goto/fd383b80-0bae-11ee-a017-0d32180b1390 (filtered by json.message: "Running ResetSkippedJobsService on new alive jobs")