Skip to content

Draft: Enforce concurrency limit for Shared Runners queue

What does this MR do?

Related to https://gitlab.com/gitlab-org/gitlab/-/issues/329123

This change enforces a concurrency limit as the number of jobs that can be running on Shared Runners. The limit is enforced at root namespace level.

When a Runner requests a new job to pick, before we assign the job to the Runner we check if the concurrency limit is exceeded, if so we ignore the build and move on to the next job.

The concurrency value is tracked in Redis and is:

  • incremented any time a job is being running by a shared runner
  • decremented any time a job completes after being executed by a shared runner

Trade-offs considered

Ideally we should not have jobs in pending state if they can't be picked by the runner. The complex aspect of the queue implementation is that a job could be picked by either shared or specific runners is both are configured for the given project. Blocking a job just because it could potentially be picked by shared runners is not possible, at least until we have a separate queue for shared runners.

  • Pros: simpler approach. Leave job in pending until the number of running jobs goes below the concurrency limit
  • Cons: Jobs still remain in the queue and could potentially be picked again by RegisterJobService. Although with an increasing number of jobs in the queue, the fair scheduling should drop the priority of those builds overtime.

idea 2

Another idea I have considered was to change Ci::ProcessBuildService to enqueue the build only if the concurrency limit allowed that. Otherwise the job would have remained in created state.

  • Pros: Jobs don't become visible to the runners
  • Cons: We need to match builds with any specific runners to see if they could potentially be picked by specific runners and ignore the concurrency limit. This was getting complex when considering that, in case of both specific and shared runners, a job could be moved to the queue and still be picked by shared runners.

idea 3

Similarly to the concept of resource_group we could either leverage the existing status waiting_for_resource or introduce a new status which gates when the build can go to the pending state.

  • Pros: This seems to be probably the most semantically correct approach
  • Cons: It's more complex and would require quite some refactoring to implement. Given that we are limiting concurrency of jobs as the MVC, where we could change strategy in the future, it carries a considerable risk if we switch strategy.

Screenshots (strongly suggested)

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team
Edited by Fabio Pitino

Merge request reports

Loading