Retry and auto-close master broken incidents
What does this MR do and why?
This MR implements the following features:
- Retries each failed job and posts the retry web URL to the triage notes
- Retries the whole pipeline if there are at least 10 failed jobs in one incident
- Closes the incident if all of the failed jobs are caused by transient failures that are known to us. This includes master-broken::dependency-upgrade, master-broken::gitlab-com-overloaded, master-broken::failed-to-pull-image, and master-broken::runner-disk-full
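The three behaviours above can be sketched as a single decision function. This is only an illustration in Python, not the code in this MR: `FailedJob`, `root_cause_label`, `plan_actions`, and the action names are hypothetical, the labels are written in scoped-label form (`master-broken::...`), and treating the pipeline retry as a replacement for per-job retries (rather than an addition) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

# Transient root-cause labels named in this MR description
# (scoped-label spelling is assumed).
TRANSIENT_LABELS = frozenset({
    "master-broken::dependency-upgrade",
    "master-broken::gitlab-com-overloaded",
    "master-broken::failed-to-pull-image",
    "master-broken::runner-disk-full",
})

# "at least 10 failed jobs in one incident" triggers a pipeline retry.
PIPELINE_RETRY_THRESHOLD = 10


@dataclass
class FailedJob:
    """Hypothetical stand-in for a failed CI job attached to an incident."""
    name: str
    root_cause_label: Optional[str] = None  # None means the cause is unknown


def plan_actions(failed_jobs: List[FailedJob]) -> List[str]:
    """Return the actions to take for one incident, in order."""
    actions: List[str] = []
    if not failed_jobs:
        return actions

    # Assumption: a pipeline retry replaces individual job retries.
    if len(failed_jobs) >= PIPELINE_RETRY_THRESHOLD:
        actions.append("retry_pipeline")
    else:
        actions.append("retry_failed_jobs")

    # Close only when every failed job has a known transient cause.
    if all(job.root_cause_label in TRANSIENT_LABELS for job in failed_jobs):
        actions.append("close_incident")

    return actions
```

A single job that failed because of a full runner disk would be retried and its incident closed, while an incident containing any job with an unknown cause would stay open for manual triage.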
Related issues:
- Automatically close broken master incidents for... (gitlab-org/gitlab#398689 - closed)
- Automatically retry broken master pipelines if ... (gitlab-org/gitlab#398243 - closed)
Expected impact & dry-runs
These are strongly recommended to assist reviewers and reduce the time to merge your change.
See https://gitlab.com/gitlab-org/quality/triage-ops/-/tree/master/doc/scheduled#testing-with-a-dry-run on how to perform dry-runs for new policies.
See https://gitlab.com/gitlab-org/quality/triage-ops/-/blob/master/doc/reactive/best_practices.md#use-the-sandbox-to-test-new-processors on how to make sure a new processor can be tested.
Action items
- If adding environment variables for reactive processors, update config/triage-web.yaml and .gitlab/ci/triage-web.yml
- (If applicable) Add documentation to the handbook pages for Triage Operations
- (If applicable) Identify the affected groups and how to communicate to them:
  - /cc @person_or_group
  - Relevant Slack channels
  - Engineering week-in-review
Edited by Jennifer Li