Re-spawn the AssignResource worker if busy
What does this MR do and why?
We previously encountered a problem where a job with a resource group gets stuck due to a race condition. The cause is that the AssignResourceFromResourceGroupWorker, which allocates a job to a resource group, can only run one at a time per resource group because it uses the deduplicated: until_executed strategy. This was resolved by adding an if_deduplicated: reschedule_once option to the AssignResourceFromResourceGroupWorker. (More details here: Pipeline job depends on Resource Group could be... (#342123 - closed).)
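As a rough illustration of the deduplication behaviour described above, here is a toy Ruby model. It is not GitLab's Sidekiq middleware: the class ToyDedupQueue and its methods are hypothetical names, and the model assumes that a duplicate arriving while a job is in flight is dropped but remembered, so that exactly one follow-up run is scheduled once the running job finishes.

```ruby
# Toy model only: NOT GitLab's deduplication middleware.
# ToyDedupQueue and all method names are hypothetical.
class ToyDedupQueue
  attr_reader :log

  def initialize
    @in_flight = {}   # key => true while a job for that key is scheduled or running
    @reschedule = {}  # key => true if a duplicate arrived in the meantime
    @log = []
  end

  # Called whenever something asks to enqueue a job for `key`
  # (for example, one key per resource group).
  def enqueue(key)
    if @in_flight[key]
      @reschedule[key] = true        # duplicate dropped, but remembered once
      @log << [:deduplicated, key]
    else
      @in_flight[key] = true
      @log << [:scheduled, key]
    end
  end

  # Called when the job for `key` has finished executing.
  def finish(key)
    @in_flight.delete(key)
    @log << [:finished, key]
    enqueue(key) if @reschedule.delete(key)  # at most one follow-up run
  end
end

queue = ToyDedupQueue.new
queue.enqueue("resource_group_1")  # scheduled
queue.enqueue("resource_group_1")  # deduplicated
queue.enqueue("resource_group_1")  # deduplicated again, still only one follow-up
queue.finish("resource_group_1")   # finished, then scheduled once more
```

In this model, any number of duplicates collapse into a single follow-up run, which is why a badly timed follow-up can still miss a resource that is released slightly later.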
Now, it turns out that we are still running into race conditions for pipelines that run in parallel, or pipelines with multiple downstream/child pipelines that then run in parallel. Essentially, the AssignResourceFromResourceGroupWorker might stop assigning jobs to a specific resource group because it checks whether a resource is free before that resource has actually been freed.
This MR solves this "stuck" situation by kicking off an AssignResourceFromResourceGroupWorker job for a resource group if:
- there are no "free" resources yet, AND
- there are still more upcoming processables/builds for that resource group
The idea is that for the next round of AssignResourceFromResourceGroupWorker, the resource would already be free and can be assigned to a build.
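The respawn condition can be sketched with a small, self-contained Ruby simulation. This is not the code in this MR: ToyResourceGroup and assign_round are hypothetical names, the resource group is reduced to a counter of free resources plus a queue of waiting builds, and the respawn_enabled argument merely stands in for the feature flag.

```ruby
# Hypothetical, simplified stand-in for a resource group; not GitLab's model.
ToyResourceGroup = Struct.new(:free_resources, :waiting_builds)

# Runs one assignment round and returns [assigned_builds, respawn].
# `respawn` is true when another round should be scheduled.
def assign_round(group, respawn_enabled:)
  assigned = []

  # Hand out free resources to waiting builds, oldest first.
  while group.free_resources > 0 && !group.waiting_builds.empty?
    group.free_resources -= 1
    assigned << group.waiting_builds.shift
  end

  # Respawn only if no resource is free yet AND builds are still waiting.
  respawn = respawn_enabled &&
            group.free_resources.zero? &&
            !group.waiting_builds.empty?

  [assigned, respawn]
end

# Round 1 runs before the previous job has released its resource.
group = ToyResourceGroup.new(0, ["deploy"])
assigned, respawn = assign_round(group, respawn_enabled: true)
# assigned is empty and respawn is true, so another round gets scheduled

# The previous job releases its resource, then round 2 runs.
group.free_resources = 1
assigned, respawn = assign_round(group, respawn_enabled: true)
# assigned is ["deploy"] and respawn is false
```

With respawn_enabled: false, the first round returns no assignment and no respawn, which corresponds to the "stuck" state: nothing triggers another round after the resource is released.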
This change is behind a Feature Flag. Rollout issue: [Feature flag] Rollout of `respawn_assign_resou... (#450793 - closed)
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Screenshots or screen recordings
N/A
How to set up and validate locally
This is actually very hard to replicate locally and, so far, I've only observed this in this example group: https://gitlab.com/dkua1_ultimate_group/private/zd/gitlab-job-stuck-at-waiting-status-460524.
I would suggest just testing that there are no errors happening with this setup:
- Create a project
- Add a .gitlab-ci.yml and a child .deploy.yml pipeline configuration (see the examples below)
- Run the pipeline several times with the Feature Flag respawn_assign_resource_worker enabled and disabled.

.gitlab-ci.yml
    # note: the script here is just based on a user-reported issue with this particular problem,
    # where there is a job that changes the resource_group's process_mode during pipeline execution,
    # further exacerbating the possibility of the race condition happening
    build:
      stage: build
      resource_group: "resource_group_1"
      script:
        - apk add --no-cache curl
        - |
          curl "https://gdk.test:3443/api/v4/projects/32/resource_groups/resource_group_1" \
            -k -X PUT \
            --header "Authorization: Bearer <the-personal-access-token>" \
            --data "{\"process_mode\": \"oldest_first\"}"

    deploy:
      stage: deploy
      resource_group: "resource_group_1"
      trigger:
        include: ".deploy.yml"
        strategy: depend

.deploy.yml child pipeline configuration
    deploy:
      stage: deploy
      script:
        - echo "DEPLOY"
      environment:
        name: production
        action: start

    deploy2:
      stage: deploy
      script:
        - echo "DEPLOY2"
      environment:
        name: production2
        action: start

    deploy3:
      stage: deploy
      script:
        - echo "DEPLOY3"
      environment:
        name: production3
        action: start

    deploy4:
      stage: deploy
      script:
        - echo "DEPLOY4"
      environment:
        name: production4
        action: start

    deploy5:
      stage: deploy
      script:
        - echo "DEPLOY5"
      environment:
        name: production5
        action: start

Related to #436988 (closed)