Retry and auto-close master broken incidents (!2151) · GitLab.org / Quality Department / triage-ops

Jennifer Li requested to merge jennli-triage-master-broken-i2 into master Apr 11, 2023

What does this MR do and why?

This MR implements the following features:

- Retries each failed job and posts the retry web URL to the triage notes.
- Retries the whole pipeline if an incident has at least 10 failed jobs.
- Closes the incident if all of the failed jobs are caused by transient failures that are known to us. These are the failures labelled master-broken::dependency-upgrade, master-broken::gitlab-com-overloaded, master-broken::failed-to-pull-image, and master-broken::runner-disk-full.
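
The decision rules above can be sketched as a small pure function. This is an illustrative Python sketch only, not the code in this MR: `FailedJob`, `TriagePlan`, `plan_triage`, and the `root_cause_label` field are hypothetical names, and the sketch assumes that a pipeline retry replaces the per-job retries and that an incident with no failed jobs is left open.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Labels listed in the MR description as known transient failures.
TRANSIENT_LABELS = frozenset({
    "master-broken::dependency-upgrade",
    "master-broken::gitlab-com-overloaded",
    "master-broken::failed-to-pull-image",
    "master-broken::runner-disk-full",
})

# Threshold from the MR description: retry the pipeline at 10 or more failed jobs.
PIPELINE_RETRY_THRESHOLD = 10


@dataclass(frozen=True)
class FailedJob:
    """Hypothetical record for one failed job attached to an incident."""
    name: str
    root_cause_label: Optional[str]  # None when no root cause was identified


@dataclass(frozen=True)
class TriagePlan:
    """Hypothetical summary of the actions to take for one incident."""
    retry_pipeline: bool
    retry_jobs: Tuple[str, ...]
    close_incident: bool


def plan_triage(failed_jobs: Sequence[FailedJob]) -> TriagePlan:
    """Decide which actions to take for an incident's failed jobs."""
    # Assumption: once the threshold is reached, one pipeline retry
    # replaces the individual job retries.
    retry_pipeline = len(failed_jobs) >= PIPELINE_RETRY_THRESHOLD
    retry_jobs = () if retry_pipeline else tuple(job.name for job in failed_jobs)

    # Close only when there is at least one failed job and every one of
    # them carries a known transient label.
    close_incident = bool(failed_jobs) and all(
        job.root_cause_label in TRANSIENT_LABELS for job in failed_jobs
    )
    return TriagePlan(retry_pipeline, retry_jobs, close_incident)
```

Keeping the decision separate from the API calls that perform the retries and post the triage note makes the thresholds and label list easy to check in a dry run.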

Related issues:

Expected impact & dry-runs

These are strongly recommended to assist reviewers and reduce the time to merge your change.

See https://gitlab.com/gitlab-org/quality/triage-ops/-/tree/master/doc/scheduled#testing-with-a-dry-run on how to perform dry-runs for new policies.

See https://gitlab.com/gitlab-org/quality/triage-ops/-/blob/master/doc/reactive/best_practices.md#use-the-sandbox-to-test-new-processors on how to make sure a new processor can be tested.

Action items

- If adding environment variables for reactive processors, update config/triage-web.yaml and .gitlab/ci/triage-web.yml
- (If applicable) Add documentation to the handbook pages for Triage Operations =>
- (If applicable) Identify the affected groups and how to communicate to them:
  - /cc @person_or_group =>
  - Relevant Slack channels =>
  - Engineering week-in-review

Edited Apr 12, 2023 by Jennifer Li