Draft: Enforce concurrency limit for Shared Runners queue
What does this MR do?
Related to https://gitlab.com/gitlab-org/gitlab/-/issues/329123
This change enforces a concurrency limit as the number of jobs that can be running on Shared Runners. The limit is enforced at root namespace level.
When a Runner requests a new job to pick, before we assign the job to the Runner we check if the concurrency limit is exceeded, if so we ignore the build and move on to the next job.
The concurrency value is tracked in Redis and is:
- incremented any time a job is being running by a shared runner
- decremented any time a job completes after being executed by a shared runner
Trade-offs considered
Ideally we should not have jobs in pending
state if they can't be picked by the runner. The complex aspect of the queue implementation is that a job could be picked by either shared or specific runners is both are configured for the given project. Blocking a job just because it could potentially be picked by shared runners is not possible, at least until we have a separate queue for shared runners.
- Pros: simpler approach. Leave job in
pending
until the number of running jobs goes below the concurrency limit - Cons: Jobs still remain in the queue and could potentially be picked again by
RegisterJobService
. Although with an increasing number of jobs in the queue, the fair scheduling should drop the priority of those builds overtime.
idea 2
Another idea I have considered was to change Ci::ProcessBuildService
to enqueue
the build only if the concurrency limit allowed that. Otherwise the job would have remained in created
state.
- Pros: Jobs don't become visible to the runners
- Cons: We need to match builds with any specific runners to see if they could potentially be picked by specific runners and ignore the concurrency limit. This was getting complex when considering that, in case of both specific and shared runners, a job could be moved to the queue and still be picked by shared runners.
idea 3
Similarly to the concept of resource_group
we could either leverage the existing status waiting_for_resource
or introduce a new status which gates when the build can go to the pending
state.
- Pros: This seems to be probably the most semantically correct approach
- Cons: It's more complex and would require quite some refactoring to implement. Given that we are limiting concurrency of jobs as the MVC, where we could change strategy in the future, it carries a considerable risk if we switch strategy.
Screenshots (strongly suggested)
Does this MR meet the acceptance criteria?
Conformity
-
📋 Does this MR need a changelog?-
I have included a changelog entry. -
I have not included a changelog entry because _____.
-
-
Documentation (if required) -
Code review guidelines -
Merge request performance guidelines -
Style guides -
Database guides -
Separation of EE specific content
Availability and Testing
-
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process. -
Tested in all supported browsers -
Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
-
Label as security and @ mention @gitlab-com/gl-security/appsec
-
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods -
Security reports checked/validated by a reviewer from the AppSec team