[Experimental] Define monitoring threshold for job queue duration
What does this MR do?
This is an experimental feature that may be removed without deprecation!
Taking jobs from the pending queue is one of the most popular indicators of whether the Runner works as expected. For some time now we have been exporting a histogram metric, gitlab_runner_job_queue_duration_seconds_*, that reports the queuing durations of jobs landing on the runner.
This metric makes it possible to analyse the behavior and provides data for things like capacity planning and configuration changes.
For basic monitoring, and especially for defining an SLI (Service Level Indicator) based on the job queuing duration (if that is the important factor for the user), we could leverage the same data received from GitLab, but export it in a much simpler form.
And this is what this commit brings. Apart from updating the histogram (which we still do!), we're now able to define a threshold in seconds. If the queuing duration of the received job is longer than the configured threshold, a dedicated counter metric, gitlab_runner_acceptable_job_queuing_duration_exceeded, is increased.
In the monitoring layer, that counter can then be analysed with the rate() function, for example to alert when the rate of exceeding the threshold is higher than an acceptable value.
Such configuration could look as follows:
[[runners]]
name = "example-runner"
[runners.monitoring]
[[runners.monitoring.job_queuing_durations]]
periods = ["* * * * * * *"]
timezone = "UTC"
threshold = "1m30s"
In this case, jobs queued for 1 minute 30 seconds or less will be counted as acceptable. A job that was queued for 1 minute 31 seconds exceeds the threshold and therefore increases the counter.
Optionally, a second value sent from GitLab can be used: ProjectJobsRunningOnInstanceRounnerCount. It is usable only for instance runners, as for other types (group or project) it's always set to Inf+. For instance runners it reports how many jobs a particular project is already executing on the instance runners at the moment of job scheduling. That number is checked up to a limit which is hardcoded in GitLab. If a project runs from 0 to INSTANCE_RUNNER_RUNNING_JOBS_MAX_BUCKET jobs, the value is set to the specific number. If it exceeds the limit, the value is set to INSTANCE_RUNNER_RUNNING_JOBS_MAX_BUCKET+ (where the limit value is placed instead of the constant name). This number makes it possible to analyse the Fair Scheduling Algorithm that is built into GitLab CI/CD and used for instance runners.
To include that field in the threshold exceeding calculation, the jobs_running_for_project entry should be configured with a regexp to match against the value sent by GitLab. That could look as follows:
[[runners]]
name = "example-runner"
[runners.monitoring]
[[runners.monitoring.job_queuing_durations]]
periods = ["* * * * * * *"]
timezone = "UTC"
threshold = "1m30s"
jobs_running_for_project = "^[0-3]$"
In this case, the metric will be increased when the job queuing duration exceeds 1 minute 30 seconds, but only when GitLab reported that the project, at the moment of job scheduling, was already running from 0 to 3 jobs on any existing instance runner. If that project was already running 4 or more jobs on the instance runners, the threshold is ignored and the expectation is counted as met.
For a single runner we can define multiple job_queuing_durations entries, matched by different time periods. This makes it possible to define different thresholds for dedicated times. The periods field is evaluated using a cron syntax within the configured time zone. If the timezone field is not defined, Local is assumed, which should use the time zone set for the runner process in the OS.
Entries are evaluated in the order of definition, and the last matching configuration is applied for a given time.
An example of using periods could look as follows:
[[runners]]
name = "example-runner"
[runners.monitoring]
[[runners.monitoring.job_queuing_durations]]
periods = ["* * * * * * *"]
timezone = "UTC"
threshold = "1m"
[[runners.monitoring.job_queuing_durations]]
periods = ["* * * * * sat,sun *"]
timezone = "UTC"
threshold = "5m"
With this configuration, a 1 minute threshold would be used as the default most of the time, but during the weekend (sat,sun) that threshold would be increased to 5 minutes.
When merged, this change will add a metric exported as:
# HELP gitlab_runner_acceptable_job_queuing_duration_exceeded Increased each time when the queuing duration was longer than the configured threshold
# TYPE gitlab_runner_acceptable_job_queuing_duration_exceeded counter
gitlab_runner_acceptable_job_queuing_duration_exceeded_total{runner="9_F4bzrV3",system_id="s_b5a2f9de542e"} 0
Why was this MR needed?
To provide a generalized way of defining an SLI in GitLab Runner based on the job queuing duration (which is one of the most popular factors in determining whether the Runner setup is healthy and works as expected).
The idea was taken from the discussion at gitlab-com/runbooks!6225 (comment 1642599654)
What's the best way to test this MR?
What are the relevant issue numbers?
https://gitlab.com/gitlab-org/ci-cd/shared-runners/infrastructure/-/issues/194