Expose queue duration related metrics in job payload sent to the runner
What does this MR do and why?
This MR adds the queued_for and project_jobs_running_on_instance_runners_count fields to the job_info section of the job payload sent to GitLab Runner.
With queued_for, which is the difference between the time the response is generated and the time the job was scheduled for queueing (stored in the queued_at field of the database record), Runner will know how long the job has been in the queue.
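The calculation described above can be sketched roughly as follows. This is an illustration in Python, not the code from this MR: the helper names, the use of seconds as the unit, and the clamping of negative values to zero are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def queued_for_seconds(queued_at, now=None):
    """Return how long a job has been queued, in seconds (assumed unit)."""
    if queued_at is None:
        # Job was never scheduled for queueing, so there is nothing to report.
        return None
    if now is None:
        now = datetime.now(timezone.utc)
    # Assumption: clamp at zero so clock skew never yields a negative duration.
    return max((now - queued_at).total_seconds(), 0.0)


def build_job_info(queued_at, now=None):
    """Hypothetical builder for the job_info section of the payload."""
    return {"queued_for": queued_for_seconds(queued_at, now)}


if __name__ == "__main__":
    queued_at = datetime(2022, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    response_time = queued_at + timedelta(seconds=90)
    print(build_job_info(queued_at, response_time))  # {'queued_for': 90.0}
```

Computing the value at response generation time means the runner does not need a clock synchronized with GitLab to interpret it.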
This value will then be used on the GitLab Runner side to generate a histogram metric representing queueing times for each specific runner and each [[runners]] worker that asks for jobs. This will allow runner administrators to track the queueing times of jobs targeting their runners.
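To make the histogram idea concrete, here is a minimal sketch of a per-runner histogram fed with queued_for values. It is written in plain Python for illustration only and is not the GitLab Runner implementation: the class name, the bucket boundaries, and the runner identifiers are hypothetical.

```python
import bisect
from collections import defaultdict

# Hypothetical bucket upper bounds, in seconds.
BUCKETS = [1, 5, 30, 60, 300, 900, 3600]


class QueueDurationHistogram:
    """Tracks queueing times separately for each runner."""

    def __init__(self, buckets=BUCKETS):
        self.buckets = sorted(buckets)
        # One counter list per runner; the extra last slot is the +Inf bucket.
        self.counts = defaultdict(lambda: [0] * (len(self.buckets) + 1))
        self.sums = defaultdict(float)

    def observe(self, runner_id, queued_for):
        # bisect_left picks the first bucket whose upper bound is >= the value.
        index = bisect.bisect_left(self.buckets, queued_for)
        self.counts[runner_id][index] += 1
        self.sums[runner_id] += queued_for

    def cumulative(self, runner_id):
        """Return cumulative bucket counts, the way Prometheus exposes them."""
        total = 0
        result = []
        for count in self.counts[runner_id]:
            total += count
            result.append(total)
        return result


if __name__ == "__main__":
    histogram = QueueDurationHistogram()
    for value in (0.5, 5, 4000):
        histogram.observe("runner-a", value)
    print(histogram.cumulative("runner-a"))
```

Because each runner process only records observations for the jobs it receives, the label set stays small on every individual metrics endpoint.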
Similarly, project_jobs_running_on_instance_runners_count, sent for jobs targeting instance runners (and only instance type runners!), will allow runner administrators to see how scheduled jobs fit into the fair scheduling algorithm.
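As a rough illustration of what this field conveys, the sketch below counts a project's running jobs that are assigned to instance runners. The data shape, the helper names, and the choice to omit the field for other runner types are assumptions made for the example, not a description of the actual query in GitLab.

```python
INSTANCE_TYPE = "instance_type"


def count_project_jobs_on_instance_runners(running_jobs, project_id):
    """Count running jobs of one project that are handled by instance runners."""
    return sum(
        1
        for job in running_jobs
        if job["project_id"] == project_id and job["runner_type"] == INSTANCE_TYPE
    )


def instance_runner_job_info(runner_type, running_jobs, project_id):
    """Hypothetical helper: only instance type runners receive the count."""
    if runner_type != INSTANCE_TYPE:
        # Assumption: the field is simply left out for group and project runners.
        return {}
    return {
        "project_jobs_running_on_instance_runners_count":
            count_project_jobs_on_instance_runners(running_jobs, project_id)
    }


if __name__ == "__main__":
    jobs = [
        {"project_id": 1, "runner_type": "instance_type"},
        {"project_id": 1, "runner_type": "project_type"},
        {"project_id": 2, "runner_type": "instance_type"},
    ]
    print(instance_runner_job_info("instance_type", jobs, 1))
```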
We will use this information, for example, to improve our SLI definitions for SaaS runners on GitLab.com. Currently our apdex is the same for all the runner types that we manage, because it's calculated from a general metric exposed by GitLab, which by design doesn't partition such information per runner.
With this change and the planned GitLab Runner change, each runner owner will be able to track such data alongside other runner metrics. For us, this means that we will be able to define different SLIs for each SaaS runner shard that we maintain.
The biggest value of this change, however, is that this metric becomes usable for self-hosted GitLab Runner instances. On GitLab installations like GitLab.com, individual users who self-host their runners and would like to track queuing performance are unable to do that, as GitLab internal metrics are... well... internal. The job_queue_duration_seconds metric exposed by GitLab can't be partitioned by individual Runner ID, as the cardinality of such data would quickly kill any Prometheus server.
By passing this data to Runner and exposing it there, each Runner owner can track the queuing times of their own runner instances, without needing GitLab administrators to expose GitLab's metric, and with the data partitioned by each individual runner.
GitLab Runner update at gitlab-runner!3499 (merged).
Screenshots or screen recordings
N/A
How to set up and validate locally
Numbered steps to set up and validate the change are strongly suggested.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.