Fix allowed_plans handling for instance runners
The allowed_plans feature was added to make it possible to limit some instance runners to specified SaaS plans only.
By its nature, it works similarly to other filters like tags, run_untagged or protected: GitLab iterates over the basic list of jobs applicable to the runner that asked for a job and excludes the jobs that do not match that runner. One of the matchers is the allowed_plans one. If it's defined and the plan related to the job (job -> project -> namespace -> plan attached to the namespace) doesn't match, the job is dropped from the list.
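The plan check described above can be illustrated with a minimal sketch. The Struct models, the plan_allowed? helper and the plan names below are hypothetical stand-ins for illustration, not GitLab's actual classes or matcher code.

```ruby
# Hypothetical, simplified models; these are not GitLab's real classes.
Namespace = Struct.new(:plan)
Project   = Struct.new(:namespace)
Job       = Struct.new(:id, :project)
Runner    = Struct.new(:allowed_plans)

# A job matches when the runner defines no allowed_plans, or when the plan
# reached via job -> project -> namespace is on the runner's list.
def plan_allowed?(runner, job)
  plans = runner.allowed_plans
  return true if plans.nil? || plans.empty?

  plans.include?(job.project.namespace.plan)
end

# Illustrative plan names only.
runner      = Runner.new(%w[premium ultimate])
free_job    = Job.new(1, Project.new(Namespace.new("free")))
premium_job = Job.new(2, Project.new(Namespace.new("premium")))

# Exclude the candidates whose plan the runner does not accept.
matching = [free_job, premium_job].select { |job| plan_allowed?(runner, job) }
puts matching.map(&:id).inspect
```

In this sketch only the job whose namespace plan is on the runner's list survives the filter; a runner without allowed_plans accepts every job.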
To make responses for /api/v4/jobs/request return in a reasonable time, we've defined an arbitrary MAX_QUEUE_DEPTH limit equal to 45. If a matching job is not found after 45 iterations, we respond to the Runner with 409 Conflict.
Usually this works as expected. The problem happens when a lot of non-applicable jobs are targeting a runner with an allowed_plans configuration.
In that case, an applicable job may not be handled at all, because there will always be 45 or more non-applicable jobs ahead of it in the queue. Non-applicable jobs will then hang in pending with the stuck label until the stuck builds cleaner cancels them. At the same time, applicable jobs will hang in a normal pending state for just as long. As stale jobs are cleaned up in an order more or less similar to their creation (the automatic cancel decision is based on the pending duration), non-applicable jobs will block the applicable jobs for their entire lifetime.
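The blocking effect can be reproduced with a small simulation. This is only a sketch: request_job is a hypothetical stand-in for the job request handler, jobs are plain hashes, and the :assigned status is an illustrative label; only the limit of 45 and the 409 Conflict outcome come from the description above.

```ruby
MAX_QUEUE_DEPTH = 45

# Hypothetical stand-in for the job request handler: walk the pending queue
# in order and give up once MAX_QUEUE_DEPTH candidates have been examined.
def request_job(queue, allowed_plans)
  queue.first(MAX_QUEUE_DEPTH).each do |job|
    return [:assigned, job] if allowed_plans.include?(job[:plan])
  end
  [:conflict, nil] # corresponds to the 409 Conflict response
end

# 50 jobs from a plan the runner does not accept, queued ahead of one it does.
blocked_queue = Array.new(50) { |i| { id: i, plan: "free" } }
blocked_queue << { id: 50, plan: "premium" }

blocked_status, blocked_job = request_job(blocked_queue, ["premium"])
puts blocked_status

# Once fewer than 45 non-applicable jobs precede it, the job is reachable.
drained_queue = blocked_queue.drop(10)
drained_status, drained_job = request_job(drained_queue, ["premium"])
puts drained_status
```

As long as new non-applicable jobs keep arriving at least as fast as the cleaner removes old ones, the first situation persists and the applicable job is never reached.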
We need to update the allowed_plans mechanism to prevent such locking.
One of the ideas, proposed by @mbobin, is to update Ci::PipelineCreation::DropNotRunnableBuildsService to also handle this case, just like it was done for pipeline minutes. With that, job matching against allowed_plans would be done once, at job creation time. If the job did not match the criteria, it would be canceled immediately. We would not need to check for it later and, most importantly, it would not generate a backlog of non-matching jobs that would extend the queue and clog it for applicable jobs.
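A rough sketch of that idea follows. It does not reflect the real interface of Ci::PipelineCreation::DropNotRunnableBuildsService: drop_not_runnable, the hash-based jobs and runners, and the status symbols are all hypothetical, and it assumes the restricted instance runners are the only runners available to the jobs.

```ruby
# Hypothetical sketch: decide at creation time whether any runner could
# ever take the job, and cancel it right away if none can.
def drop_not_runnable(jobs, runners)
  jobs.each do |job|
    runnable = runners.any? do |runner|
      plans = runner[:allowed_plans]
      plans.empty? || plans.include?(job[:plan])
    end
    job[:status] = runnable ? :pending : :canceled
  end
end

# Illustrative data: one restricted runner, 50 non-applicable jobs, 1 applicable.
runners = [{ allowed_plans: ["premium"] }]
jobs = Array.new(50) { |i| { id: i, plan: "free", status: :created } }
jobs << { id: 50, plan: "premium", status: :created }

drop_not_runnable(jobs, runners)

# Only jobs that some runner can take ever enter the pending queue.
pending_ids = jobs.select { |job| job[:status] == :pending }.map { |job| job[:id] }
canceled_count = jobs.count { |job| job[:status] == :canceled }
puts pending_ids.inspect
puts canceled_count
```

Because the non-matching jobs never become pending in this model, the pending queue holds only the applicable job, which sits well within the 45-iteration limit.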