Proposal: Experiment with pushing internal job metrics to InfluxDB
Context
- Practical follow-up of gitlab-org/quality/engineering-productivity/team#147 (closed).
- Discussion with @acunskis: https://docs.google.com/document/d/1XOwlsESg6mL416wSzXJ5Gk2uCNHSL27Ac-ZhiuQn5w4/edit
There are some metrics related to our CI/CD pipelines that cannot be obtained by querying the GitLab internal database from Sisense. Some examples are:
Job-level RSpec-related metrics
- When was a specific RSpec filter triggered/used?
  - e.g. How often and for which jobs did we reach the threshold of 20 spec failures? (see MR)
- How often did we receive `PG::Canceled` errors? (see MR)
- Statistics on RSpec errors we had in certain jobs (we have other tools in place to get those metrics at the moment)
  - Aggregate the RSpec errors per pipeline, group of specs, …
- How many RSpec failures did the job have at the first RSpec run? And at the second RSpec run, if any?
- How many RSpec retries succeeded with the `rspec-retry` gem, compared to RSpec retries in a new process? (see the sketch after this list)
  - I have the impression that this number is WAY higher.
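To illustrate the per-run counts mentioned above, here is a minimal sketch of how a job could read them, assuming RSpec is run with `--format json --out <file>` and one report is written per run (the report paths are hypothetical):

```ruby
require 'json'

# Read the summary section of an RSpec JSON report (rspec --format json --out <file>).
def rspec_run_stats(report_path)
  return nil unless File.exist?(report_path)

  summary = JSON.parse(File.read(report_path)).fetch('summary', {})

  {
    examples: summary['example_count'],
    failures: summary['failure_count'],
    pending: summary['pending_count']
  }
end

# Hypothetical report paths: one report per RSpec run inside the job.
first_run  = rspec_run_stats('rspec/report-first-run.json')
second_run = rspec_run_stats('rspec/report-retry-run.json') # nil if no retry in a new process happened

puts "First run: #{first_run.inspect}"
puts "Retry run: #{second_run.inspect}"
```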
Job-level non-RSpec-related metrics
- How many and which jobs had to retry the RSpec suite in a separate process?
- How long did a container image pull (i.e. a Docker pull) take?
- What is the cache hit/miss ratio at the beginning of certain jobs?
- More generally, if a job has access to its own logs (I suppose it does?), get the timings of each step inside the job (e.g. how long the git clone took, how long installing gems took, ...); see the sketch after this list
- How long did the job take to download artifacts?
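For the step timings, here is a minimal sketch that pairs the `section_start:<unix_timestamp>:<section_name>` / `section_end:<unix_timestamp>:<section_name>` markers that GitLab Runner writes into the raw job trace. The `job.log` path is a placeholder; the raw trace can be fetched via `GET /projects/:id/jobs/:job_id/trace`:

```ruby
# Pair the section markers from the raw job trace to get per-step durations
# (git clone, gem install, artifact download, ...).
log = File.read('job.log') # placeholder path for the raw trace

starts    = {}
durations = {}

log.scan(/section_(start|end):(\d+):([\w.-]+)/) do |kind, timestamp, name|
  if kind == 'start'
    starts[name] = timestamp.to_i
  elsif starts.key?(name)
    durations[name] = timestamp.to_i - starts[name]
  end
end

durations.each { |name, seconds| puts "#{name}: #{seconds}s" }
```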
Pipeline-level RSpec-related metrics
- Delta between the slowest RSpec job and the fastest RSpec job of the same category that were parallelized at the same time (e.g. RSpec system), excluding the jobs that had flaky tests (i.e. where the failed tests were retried in another process in the same job)
  - It can probably be done in a dashboard once we have the job-level data available (e.g. group jobs by `pipeline_id`, take the min and max job based on their "test category", only take jobs that haven't been retried in a separate process, and compute the delta); an API-based sketch follows this list
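As an alternative to the dashboard approach, here is a minimal sketch that computes that delta directly from the pipeline jobs API (`GET /projects/:id/pipelines/:pipeline_id/jobs`). The job-name prefix and the `GITLAB_API_TOKEN` variable are assumptions, and this sketch does not yet exclude jobs that were retried in a separate process:

```ruby
require 'json'
require 'net/http'
require 'uri'

# CI_PROJECT_ID and CI_PIPELINE_ID are predefined CI/CD variables;
# GITLAB_API_TOKEN is an assumed CI/CD variable with read_api scope.
project_id  = ENV.fetch('CI_PROJECT_ID')
pipeline_id = ENV.fetch('CI_PIPELINE_ID')
uri = URI("https://gitlab.com/api/v4/projects/#{project_id}/pipelines/#{pipeline_id}/jobs?per_page=100")

request = Net::HTTP::Get.new(uri)
request['PRIVATE-TOKEN'] = ENV.fetch('GITLAB_API_TOKEN')
response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(request) }
jobs = JSON.parse(response.body)

# Keep the finished jobs of one parallelized category; the name prefix is an assumption.
durations = jobs
  .select { |job| job['name'].start_with?('rspec system') && job['duration'] }
  .map    { |job| job['duration'] }

puts "Slowest/fastest delta: #{(durations.max - durations.min).round(1)}s" unless durations.empty?
```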
Goal
Experiment with instrumenting our CI/CD jobs with InfluxDB, and create a graph in the Quality Grafana Dashboard.
We can even receive alerts from Grafana about those metrics.
First iteration proposal
- Start with one of the use cases above
- Make a proof of concept for that one metric
- Set an alert for it in Grafana
- If we like the process and the outcome, create issues for the other ideas above
Technical considerations
- Eventually, we should be able to visualize the data in Sisense/Tableau/Kibana. Until then, we can set everything up in Grafana, since that setup already exists, and move to other visualization tools when we're ready. In my view, this should not be a heavy migration.
- Have a working local setup with Docker containers for GDK, InfluxDB, and Grafana, so that we can test everything end-to-end.
- Example of how to push data to InfluxDB
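For instance, here is a minimal sketch of pushing a single job-level data point with the `influxdb-client` gem (InfluxDB 2.x API). The URL/token variables, bucket, org, measurement name, and field values are placeholders:

```ruby
require 'influxdb-client'

# Client configuration; URL/token come from (assumed) CI/CD variables, and
# bucket/org names are placeholders.
client = InfluxDB2::Client.new(
  ENV.fetch('QA_INFLUXDB_URL'),
  ENV.fetch('QA_INFLUXDB_TOKEN'),
  bucket: 'ci-job-metrics',
  org: 'gitlab-quality',
  precision: InfluxDB2::WritePrecision::SECOND
)

# One data point per job: tags for grouping/filtering, fields for the measured values.
point = InfluxDB2::Point.new(name: 'rspec_job')
  .add_tag('job_name', ENV['CI_JOB_NAME'])
  .add_tag('pipeline_id', ENV['CI_PIPELINE_ID'])
  .add_field('failures_first_run', 3)
  .add_field('retried_in_new_process', false)

client.create_write_api.write(data: point)
client.close!
```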