Proposal: Experiment with pushing internal job metrics to InfluxDB
Context
- Practical follow-up of gitlab-org/quality/engineering-productivity/team#147 (closed).
- Discussion with @acunskis: https://docs.google.com/document/d/1XOwlsESg6mL416wSzXJ5Gk2uCNHSL27Ac-ZhiuQn5w4/edit
There are some metrics related to our CI/CD pipelines that cannot be obtained by querying the GitLab internal database from Sisense. Some examples are:
Job-level RSpec-related metrics
- When was a specific RSpec filter triggered/used?
  - e.g. How often and for which jobs did we reach the threshold of 20 spec failures? (see MR)
- How often did we receive `PG::Canceled` errors? (see MR)
- Statistics on RSpec errors we had in certain jobs (we have other tools in place to get those metrics at the moment)
  - Aggregate the RSpec errors per pipeline, group of specs, …
- How many RSpec failures did the job have at the first RSpec run? And at the second RSpec run, if any?
- How many RSpec retries succeeded with the `rspec-retry` gem, compared to RSpec retries in a new process? (see the sketch after this list)
  - I have the impression that this number is WAY higher.
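To illustrate the per-run counts mentioned above, here is a minimal sketch of how a job could read them, assuming RSpec is run with `--format json --out <file>` and one report is written per run (the report paths are hypothetical):

```ruby
require 'json'

# Read the summary section of an RSpec JSON report (rspec --format json --out <file>).
def rspec_run_stats(report_path)
  return nil unless File.exist?(report_path)

  summary = JSON.parse(File.read(report_path)).fetch('summary', {})

  {
    examples: summary['example_count'],
    failures: summary['failure_count'],
    pending: summary['pending_count']
  }
end

# Hypothetical report paths: one report per RSpec run inside the job.
first_run  = rspec_run_stats('rspec/report-first-run.json')
second_run = rspec_run_stats('rspec/report-retry-run.json') # nil if no retry in a new process happened

puts "First run: #{first_run.inspect}"
puts "Retry run: #{second_run.inspect}"
```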
Job-level non-RSpec-related metrics
- How many and which jobs had to retry the RSpec suite in a separate process?
- How long did a container image pull (i.e. a Docker pull) take?
- What is the cache hit/miss ratio at the beginning of certain jobs?
- More generally, if a job has access to its own logs (I suppose it does?), get the timings of each step inside the job (e.g. how long the git clone took, how long installing gems took, ...); see the sketch after this list
- How long did the job take to download artifacts?
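For the step timings, here is a minimal sketch that pairs the `section_start:<unix_timestamp>:<section_name>` / `section_end:<unix_timestamp>:<section_name>` markers that GitLab Runner writes into the raw job trace. The `job.log` path is a placeholder; the raw trace can be fetched via `GET /projects/:id/jobs/:job_id/trace`:

```ruby
# Pair the section markers from the raw job trace to get per-step durations
# (git clone, gem install, artifact download, ...).
log = File.read('job.log') # placeholder path for the raw trace

starts    = {}
durations = {}

log.scan(/section_(start|end):(\d+):([\w.-]+)/) do |kind, timestamp, name|
  if kind == 'start'
    starts[name] = timestamp.to_i
  elsif starts.key?(name)
    durations[name] = timestamp.to_i - starts[name]
  end
end

durations.each { |name, seconds| puts "#{name}: #{seconds}s" }
```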
Pipeline-level RSpec-related metrics
- Delta between the slowest RSpec job and the fastest RSpec job of the same category that were parallelized at the same time (e.g. RSpec system), excluding the jobs that had flaky tests (i.e. where the failed tests were retried in another process in the same job)
  - It can probably be done in a dashboard once we have the job-level data available (e.g. group jobs by `pipeline_id`, take the min and max job based on their "test category", only take jobs that haven't been retried in a separate process, and compute the delta); an API-based sketch follows this list
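As an alternative to the dashboard approach, here is a minimal sketch that computes that delta directly from the pipeline jobs API (`GET /projects/:id/pipelines/:pipeline_id/jobs`). The job-name prefix and the `GITLAB_API_TOKEN` variable are assumptions, and this sketch does not yet exclude jobs that were retried in a separate process:

```ruby
require 'json'
require 'net/http'
require 'uri'

# CI_PROJECT_ID and CI_PIPELINE_ID are predefined CI/CD variables;
# GITLAB_API_TOKEN is an assumed CI/CD variable with read_api scope.
project_id  = ENV.fetch('CI_PROJECT_ID')
pipeline_id = ENV.fetch('CI_PIPELINE_ID')
uri = URI("https://gitlab.com/api/v4/projects/#{project_id}/pipelines/#{pipeline_id}/jobs?per_page=100")

request = Net::HTTP::Get.new(uri)
request['PRIVATE-TOKEN'] = ENV.fetch('GITLAB_API_TOKEN')
response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(request) }
jobs = JSON.parse(response.body)

# Keep the finished jobs of one parallelized category; the name prefix is an assumption.
durations = jobs
  .select { |job| job['name'].start_with?('rspec system') && job['duration'] }
  .map    { |job| job['duration'] }

puts "Slowest/fastest delta: #{(durations.max - durations.min).round(1)}s" unless durations.empty?
```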
Goal
Experiment with instrumenting our CI/CD jobs with InfluxDB, and create a graph in the Quality Grafana Dashboard.
We can even receive alerts from Grafana about those metrics.
First iteration proposal
- Start with one of the use cases above
- Make a proof of concept for that one metric
- Set an alert for it in Grafana
- If we like the process and the outcome, create issues for the other ideas above
Technical considerations
- Eventually, we should be able to visualize the data in Sisense/Tableau/Kibana. Until then, we can set everything up in Grafana, since that setup already exists, and move to other visualization tools when we're ready. In my view, this should not be a heavy migration.
- Have a working local setup with Docker containers for GDK, InfluxDB, and Grafana, so that we can test everything end-to-end.
- Example of how to push data to InfluxDB
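For instance, here is a minimal sketch of pushing a single job-level data point with the `influxdb-client` gem (InfluxDB 2.x API). The URL/token variables, bucket, org, measurement name, and field values are placeholders:

```ruby
require 'influxdb-client'

# Client configuration; URL/token come from (assumed) CI/CD variables, and
# bucket/org names are placeholders.
client = InfluxDB2::Client.new(
  ENV.fetch('QA_INFLUXDB_URL'),
  ENV.fetch('QA_INFLUXDB_TOKEN'),
  bucket: 'ci-job-metrics',
  org: 'gitlab-quality',
  precision: InfluxDB2::WritePrecision::SECOND
)

# One data point per job: tags for grouping/filtering, fields for the measured values.
point = InfluxDB2::Point.new(name: 'rspec_job')
  .add_tag('job_name', ENV['CI_JOB_NAME'])
  .add_tag('pipeline_id', ENV['CI_PIPELINE_ID'])
  .add_field('failures_first_run', 3)
  .add_field('retried_in_new_process', false)

client.create_write_api.write(data: point)
client.close!
```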