Adding pipeline job failure reason metric
Background
We had an incident in which child pipeline creation failed. The error was reported by users. We might have detected it ourselves if we had a specific metric for it.
When implementing the child-of-child pipelines feature, we introduced a new failure reason, reached_max_descendant_pipelines_depth: 1009.
After rolling out the feature to some users, the number of jobs that failed with reason 1009 increased.
This issue was opened for the "whether we can improve upon our Grafana dashboards and alerting given that we are aware of the error returned" item of the corrective actions issue, after discussing it with Grzegorz.
Proposal
Currently, we have a Grafana dashboard for job failures. Its data is provided by Runner.
Similarly, we can add a metric and build a dashboard from data provided by the Rails side, where we call the drop method on jobs.
Dummy example:
failure_reason = :reached_max_descendant_pipelines_depth
# Counter of job failures, labelled by failure reason
counter = Gitlab::Metrics.counter(:gitlab_ci_job_failures, 'Job failures')
# Record the failure before dropping the job
counter.increment(reason: failure_reason)
@bridge.drop!(failure_reason)
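To make the idea easier to try outside the Rails app, here is a minimal, self-contained sketch of the same pattern. `InMemoryCounter`, `FakeJob`, and `drop_with_metric` are hypothetical stand-ins introduced only for illustration; they are not existing GitLab classes or helpers, and the real counter would come from `Gitlab::Metrics.counter` as in the dummy example.

```ruby
# In-memory stand-in for a labelled counter (hypothetical, illustration only).
class InMemoryCounter
  def initialize
    @values = Hash.new(0)
  end

  # Increment the series identified by the given label set.
  def increment(labels = {}, by = 1)
    @values[labels] += by
  end

  # Read the current value for a label set (0 if never incremented).
  def get(labels = {})
    @values[labels]
  end
end

# Minimal stand-in for a job or bridge that can be dropped (hypothetical).
class FakeJob
  attr_reader :failure_reason

  def drop!(reason)
    @failure_reason = reason
  end
end

# Count the failure under its reason label, then drop the job.
def drop_with_metric(job, reason, counter)
  counter.increment(reason: reason)
  job.drop!(reason)
end

counter = InMemoryCounter.new
job = FakeJob.new
drop_with_metric(job, :reached_max_descendant_pipelines_depth, counter)
drop_with_metric(FakeJob.new, :reached_max_descendant_pipelines_depth, counter)
drop_with_metric(FakeJob.new, :script_failure, counter)
```

Routing every drop through one helper like this would keep the metric and the state change together, so a new failure reason is counted without extra work at each call site. Whether a single choke point exists in the actual code base is an assumption.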
Of course, having this metric and dashboard is not enough by itself. Either we implement an alert mechanism for unusual increases, or we check the dashboard whenever we roll out a feature.
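As a rough illustration of what "unusual increase" could mean for a single failure reason, the sketch below compares the latest count with the average of earlier windows. The function name, the `factor` and `min_count` thresholds, and the window counts are all assumptions made for this example; a real alert would more likely live in the monitoring stack's alerting rules than in Ruby.

```ruby
# Returns true when the latest window's failure count exceeds the average of
# the previous windows by more than `factor`. `min_count` suppresses alerts
# on very small numbers. Both thresholds are hypothetical defaults.
def unusual_increase?(previous_counts, current_count, factor: 3.0, min_count: 10)
  return false if previous_counts.empty? || current_count < min_count

  baseline = previous_counts.sum.to_f / previous_counts.size
  current_count > baseline * factor
end

# Example window counts for one failure reason (made-up numbers).
spike  = unusual_increase?([2, 3, 1, 2], 40) # baseline 2.0, 40 > 6.0
steady = unusual_increase?([2, 3, 1, 2], 4)  # below min_count
```

Here `spike` is `true` and `steady` is `false`. A check of this shape could run per failure reason during a rollout, which is the situation where the 1009 increase would have stood out.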