Use until_executing when deduplicating the "assign resource worker"

What does this MR do and why?

Jobs with a resource group sometimes get stuck "waiting for resource" due to a race condition. (See: https://docs.gitlab.com/ee/ci/resource_groups/#race-conditions-in-complex-or-busy-pipelines)

The problem occurs because the AssignResourceFromResourceGroupWorker, which allocates a job to a resource group, is deduplicated with an until_executed strategy. This means that if a worker job for a resource group is still running, a newly queued worker job for the same resource group is dropped.

We came up with different solutions to resolve this, and settled on "re-spawning" the worker when certain conditions are met. However, upon verifying in production, we observed that !147313 (merged) did not really fix the problem because the "re-spawned" job also runs into race conditions (see #436988 (comment 1856609263)).

The change

This MR tackles the actual cause of the problem: jobs being dropped if another job for the same resource group is RUNNING. Here, we change the deduplication strategy to until_executing, which means that a job is dropped only if another job for the same resource group is QUEUED; if that job is already running, new jobs can be queued. I believe that this change, in combination with the first fix, will prevent jobs from getting stuck at "waiting for resource".
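The difference between the two strategies can be illustrated with a small simulation. This is a sketch in plain Ruby and not GitLab's Sidekiq deduplication middleware: the DedupQueue class, its methods, and the "resource_group_1" key are hypothetical names invented for illustration. It assumes a deduplication key that is released either when the job finishes (until_executed) or when the job starts (until_executing).

```ruby
# Toy model of the two deduplication strategies (hypothetical, for illustration only).
class DedupQueue
  attr_reader :queued, :dropped

  def initialize(strategy)
    @strategy = strategy # :until_executed or :until_executing
    @locks = {}          # deduplication keys currently held
    @queued = []         # jobs waiting to run
    @dropped = []        # jobs discarded as duplicates
  end

  # Enqueue a job unless its deduplication key is still held.
  def enqueue(key)
    if @locks[key]
      @dropped << key
      false
    else
      @locks[key] = true
      @queued << key
      true
    end
  end

  # A worker picks up the next job. With :until_executing the key is released here.
  def start
    key = @queued.shift
    @locks.delete(key) if key && @strategy == :until_executing
    key
  end

  # The job completes. With :until_executed the key is only released now.
  def finish(key)
    @locks.delete(key) if @strategy == :until_executed
  end
end

# Returns true if a second job, enqueued while the first one is running, is accepted.
def enqueue_while_running(strategy)
  queue = DedupQueue.new(strategy)
  queue.enqueue("resource_group_1")            # first worker job
  running = queue.start                        # the job is now running
  accepted = queue.enqueue("resource_group_1") # second job arrives mid-run
  queue.finish(running)
  accepted
end

puts enqueue_while_running(:until_executed)  # false: the second job is dropped
puts enqueue_while_running(:until_executing) # true: the second job is queued
```

In this model, both strategies still drop a duplicate that arrives while the first job is only queued; they differ only once the first job has started running.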

Caveats and considerations

This issue is impossible to replicate locally, so it is very hard to verify the actual effectiveness of the fix.

  • The change is therefore introduced behind a feature flag, which will be enabled for example projects in production, where I will test the changes
  • It's not possible to switch deduplication strategies through a feature flag, so I have instead introduced a new worker that is the exact copy of AssignResourceFromResourceGroupWorker, except it has a deduplication strategy of until_executing. (FF rollout issue: #460793 (closed))
  • Switching between the new worker and the old worker when enabling/disabling feature flags should be okay. See: #460793 (closed)
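The worker switch described above can be sketched as follows. This is a minimal, self-contained illustration and not the code in this MR: FakeFeature is a hypothetical stand-in for the feature flag check, the worker classes are stubs without the real Ci::ResourceGroups namespace, and the assign_resource_worker helper is an invented name.

```ruby
FLAG = :assign_resource_worker_deduplicate_until_executing

# Hypothetical stand-in for a feature flag store, so the sketch runs outside GitLab.
module FakeFeature
  @enabled = {}

  def self.enable(name)
    @enabled[name] = true
  end

  def self.disable(name)
    @enabled.delete(name)
  end

  def self.enabled?(name)
    @enabled.key?(name)
  end
end

# Stub worker: only reports which class was asked to run.
class AssignResourceFromResourceGroupWorker
  def self.perform_async(resource_group_id)
    "#{name}(#{resource_group_id})"
  end
end

# Subclassing is used here only for brevity; the MR describes the new worker
# as a copy of the old one with a different deduplication strategy.
class NewAssignResourceFromResourceGroupWorker < AssignResourceFromResourceGroupWorker
end

# Pick the worker class based on the state of the feature flag.
def assign_resource_worker
  if FakeFeature.enabled?(FLAG)
    NewAssignResourceFromResourceGroupWorker
  else
    AssignResourceFromResourceGroupWorker
  end
end

FakeFeature.disable(FLAG)
puts assign_resource_worker.perform_async(42) # AssignResourceFromResourceGroupWorker(42)

FakeFeature.enable(FLAG)
puts assign_resource_worker.perform_async(42) # NewAssignResourceFromResourceGroupWorker(42)
```

Selecting the class at enqueue time means that toggling the flag only affects newly scheduled jobs; jobs already queued for the other worker still run under their original class.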

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Screenshots or screen recordings

N/A. See validation steps.

How to set up and validate locally

The problem is impossible to replicate locally, but we can instead make sure that this change does not introduce any errors. We can also make sure that the correct workers are called depending on the status of the feature flag.

Setup

  1. Create a project

  2. Add a .gitlab-ci.yml and child .deploy.yml pipeline configuration

    .gitlab-ci.yml
    build:
      stage: build
      script: echo "building stuff"
    
    deploy_a:
      stage: deploy
      variables:
        RESOURCE_GROUP_KEY: "resource_group_a_child"
      trigger:
        include: ".deploy.yml"
        strategy: depend
    
    # you can delete all the other deploy triggers below for faster testing
    deploy_b:
      stage: deploy
      variables:
        RESOURCE_GROUP_KEY: "resource_group_b_child"
      trigger:
        include: ".deploy.yml"
        strategy: depend
    
    deploy_c:
      stage: deploy
      variables:
        RESOURCE_GROUP_KEY: "resource_group_c_child"
      trigger:
        include: ".deploy.yml"
        strategy: depend
    
    deploy_d:
      stage: deploy
      variables:
        RESOURCE_GROUP_KEY: "resource_group_d_child"
      trigger:
        include: ".deploy.yml"
        strategy: depend
    
    deploy_e:
      stage: deploy
      variables:
        RESOURCE_GROUP_KEY: "resource_group_e_child"
      trigger:
        include: ".deploy.yml"
        strategy: depend

    .deploy.yml
    deploy1:
      stage: deploy
      resource_group: $RESOURCE_GROUP_KEY
      script:
        - echo "DEPLOY"
      environment:
        name: production
        action: start
    
    deploy2:
      stage: deploy
      resource_group: $RESOURCE_GROUP_KEY
      script:
        - echo "DEPLOY2"
      environment:
        name: production2
        action: start
    
    deploy3:
      stage: deploy
      resource_group: $RESOURCE_GROUP_KEY
      script:
        - echo "DEPLOY3"
      environment:
        name: production3
        action: start
      
    deploy4:
      stage: deploy
      resource_group: $RESOURCE_GROUP_KEY
      script:
        - echo "DEPLOY4"
      environment:
        name: production4
        action: start
    
    deploy5:
      stage: deploy
      resource_group: $RESOURCE_GROUP_KEY
      script:
        - echo "DEPLOY5"
      environment:
        name: production5
        action: start
  3. (Optional) Enable log level :info for your development environment by editing the config/environments/development.rb file and adding a config.log_level = :info line.

    • in 2 different terminal windows, run the following in your GDK directory:

      to check logs for AssignResourceFromResourceGroupWorker

      gdk tail rails-background-jobs | grep '"class":"Ci::ResourceGroups::AssignResourceFromResourceGroupWorker"'

      to check logs for NewAssignResourceFromResourceGroupWorker

      gdk tail rails-background-jobs | grep '"class":"Ci::ResourceGroups::NewAssignResourceFromResourceGroupWorker"'

Testing

With the assign_resource_worker_deduplicate_until_executing feature flag disabled, run the pipeline a couple of times and verify that AssignResourceFromResourceGroupWorker is being called.

  • in https://gdk.test:3443/admin/background_jobs, check the Metrics tab and verify that AssignResourceFromResourceGroupWorker jobs are being run

  • if you have enabled log level :info, verify that:

    • logs for AssignResourceFromResourceGroupWorker are showing
    • logs for NewAssignResourceFromResourceGroupWorker are NOT showing

With the assign_resource_worker_deduplicate_until_executing feature flag enabled, run the pipeline a couple of times and verify that NewAssignResourceFromResourceGroupWorker is being called.

  • in https://gdk.test:3443/admin/background_jobs, check the Metrics tab and verify that NewAssignResourceFromResourceGroupWorker jobs are being run

  • if you have enabled log level :info, verify that:

    • logs for AssignResourceFromResourceGroupWorker are NOT showing
    • logs for NewAssignResourceFromResourceGroupWorker are showing

Related to #436988 (closed)

Edited by Pam Artiaga
