Resource group timeout
Problem
Customers are occasionally hitting an issue where CI jobs get incorrectly and permanently stuck "waiting for resource" (see: #121715 and #121717). It seems the system can sometimes think a resource group has jobs running when, in fact, no running jobs are associated with that resource group. This blocks all future jobs in this resource group from starting, and has the potential to bring all CI pipelines to a grinding halt. The only current workaround is having Support manually release the resource group.
I'm not sure why this is happening, or whether the underlying issue can be fixed, but as a fallback solution, it'd be nice if resource groups "timed out." I'll caveat this by saying I don't have a deep understanding of GitLab's architecture, so it's hard to describe what I mean with much detail. But basically, if a resource group exists for a long enough period of time without any running jobs associated with it, it should time out and be cleaned up automatically.
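To make the idea a bit more concrete, here is a rough sketch of what such a timeout check could look like. Everything in it is hypothetical: `Job`, `Resource`, `release_stale_resources`, and the one-hour timeout are stand-ins I made up for illustration, not GitLab's actual models or settings. It also assumes the thing that needs cleaning up is a resource still held by a job that is no longer running.

```ruby
# Hypothetical stand-ins for illustration; these are NOT GitLab's real models.
Job = Struct.new(:id, :status) do
  def running?
    status == :running
  end
end

Resource = Struct.new(:job, :assigned_at) do
  def free?
    job.nil?
  end

  def release!
    self.job = nil
    self.assigned_at = nil
  end
end

# Releases resources whose holding job is not running and that have been
# held for longer than `timeout` seconds. Returns the released resources.
def release_stale_resources(resources, timeout:, now: Time.now)
  stale = resources.select do |resource|
    !resource.free? && !resource.job.running? && (now - resource.assigned_at) > timeout
  end
  stale.each(&:release!)
  stale
end

now = Time.now
resources = [
  Resource.new(Job.new(1, :running), now - 7200), # held by a live job: kept
  Resource.new(Job.new(2, :failed), now - 7200),  # phantom hold: released
  Resource.new(Job.new(3, :success), now - 60)    # finished, but within the timeout: kept
]

released = release_stale_resources(resources, timeout: 3600, now: now)
puts "released #{released.size} resource(s)"
```

In a real implementation this would presumably run periodically (for example from a scheduled worker), and the timeout would need to be long enough that a job which has only just finished is not mistaken for a phantom hold.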
Functionality like this would potentially solve the pain point customers are experiencing in #121715 and #121717.
This request is similar to #118434, but not the same. (That issue is asking for the job itself to time out, but it wouldn't fix the problem of a phantom resource group blocking future jobs from running.)
Problem details
It seems that in some cases the system failed to process pipeline jobs due to ActiveRecord::StaleObjectError (i.e. a concurrent update).
Proposal
Use Gitlab::OptimisticLocking to update the pipeline job in a concurrency-safe way:
diff --git a/app/services/ci/resource_groups/assign_resource_from_resource_group_service.rb b/app/services/ci/resource_groups/assign_resource_from_resource_group_service.rb
index dfd97498fc8..055b27ea3c2 100644
--- a/app/services/ci/resource_groups/assign_resource_from_resource_group_service.rb
+++ b/app/services/ci/resource_groups/assign_resource_from_resource_group_service.rb
@@ -10,7 +10,9 @@ def execute(resource_group)
   free_resources = resource_group.resources.free.count
   resource_group.upcoming_processables.take(free_resources).each do |processable|
-    processable.enqueue_waiting_for_resource
+    Gitlab::OptimisticLocking.retry_lock(processable, name: 'enqueue_waiting_for_resource') do |processable|
+      processable.enqueue_waiting_for_resource
+    end
   end
 end
 # rubocop: enable CodeReuse/ActiveRecord
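For anyone unfamiliar with the pattern, the following standalone sketch shows the general idea behind retrying on a stale write. It is not GitLab's implementation of Gitlab::OptimisticLocking: `FakeRow`, `FakeProcessable`, `retry_lock_sketch`, and the local `StaleObjectError` class are all hypothetical stand-ins, and the version counter only imitates what a lock version column does in the database.

```ruby
# Hypothetical stand-ins for illustration; not ActiveRecord or GitLab code.
class StaleObjectError < StandardError; end

# Simulates a database row that carries a version counter.
class FakeRow
  attr_reader :status, :lock_version

  def initialize
    @status = :waiting_for_resource
    @lock_version = 0
  end

  # Writes only if the caller's version still matches the stored one.
  def write(status, expected_version)
    raise StaleObjectError if expected_version != @lock_version

    @status = status
    @lock_version += 1
  end
end

# An in-memory copy of the row, as one worker process would hold it.
class FakeProcessable
  def initialize(row)
    @row = row
    reset
  end

  # Refreshes the cached version from the row.
  def reset
    @lock_version = @row.lock_version
    self
  end

  def enqueue_waiting_for_resource
    @row.write(:pending, @lock_version)
    @lock_version = @row.lock_version
  end
end

# Simplified retry loop: on a stale write, refresh the subject and try again,
# giving up after `retries` failed attempts.
def retry_lock_sketch(subject, retries: 3)
  attempts = 0
  begin
    yield(subject)
  rescue StaleObjectError
    attempts += 1
    raise if attempts > retries

    subject.reset
    retry
  end
end

row = FakeRow.new
processable = FakeProcessable.new(row)
row.write(:waiting_for_resource, 0) # a concurrent writer bumps the version

retry_lock_sketch(processable) { |subject| subject.enqueue_waiting_for_resource }
puts row.status
```

Without the retry wrapper, the first call to `enqueue_waiting_for_resource` raises because the in-memory copy is out of date, which is the kind of failure that could leave a job stuck in "waiting for resource".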