Consistently handle recovery alerts
What does this MR do?
This MR better aligns the processing of recovery alerts between the Prometheus & HTTP alert integrations. Recovery alerts are notifications we receive from a monitoring tool that indicate that a problem with an application is resolved.
This MR has two sub-goals:
- address inconsistency in how recovery alerts are treated relative to the language used in incident management settings
- address inconsistency in how recovery alerts are handled based on source
Related issues & MRs
- Related issues: #299962 (closed), #299960 (closed)
- Pre-req MR: !58519 (merged) - refactor to alert processing specs
- FE MR: !58515 (merged) - update text for setting to auto-resolve incidents to match this behavior
- 2nd BE MR: !58513 (merged) - update text in system note for recovery alert
- Docs MR: !58514 (merged) - adds docs for this behavior and other incident settings
Changes
| Before | After |
|---|---|
| Recovery alerts closed the alert and corresponding issue only if the setting was enabled. | Recovery alerts always close the corresponding alert. |
| - Prometheus recovery alerts which do not correspond to an existing GitLab alert are swallowed.<br>- HTTP recovery alerts which do not correspond to an existing GitLab alert are handled by creating a new open alert. | Recovery alerts which do not correspond to an existing GitLab alert are created as open alerts, then immediately automatically resolved. |
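The "After" column can be read as the following flow. This is only a minimal Python sketch of the described behavior, not the code in this MR: `Alert`, `Project`, `process_recovery_alert`, the `"open"`/`"resolved"` status strings, and the `auto_close_incident` flag are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Alert:
    # Hypothetical stand-in for a GitLab alert record.
    fingerprint: str
    status: str = "open"
    incident_open: bool = False


@dataclass
class Project:
    # Hypothetical stand-in for the auto-resolve incident setting.
    auto_close_incident: bool = True
    alerts: Dict[str, Alert] = field(default_factory=dict)


def process_recovery_alert(project: Project, fingerprint: str) -> Alert:
    """Apply the 'After' behavior to a recovery alert from either source."""
    alert = project.alerts.get(fingerprint)
    if alert is None:
        # No existing alert: create it as open first, then resolve it below.
        alert = Alert(fingerprint=fingerprint)
        project.alerts[fingerprint] = alert

    # The alert itself is always resolved, regardless of the setting.
    alert.status = "resolved"

    # Only the incident is gated on the setting.
    if alert.incident_open and project.auto_close_incident:
        alert.incident_open = False
    return alert
```

With the setting disabled, the alert still resolves while its incident stays open; with no matching alert, a new one is recorded and ends up resolved.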
Testing
Enable/disable automatic incident creation/resolution
- With maintainer+ permissions, navigate to Settings > Operations > Incidents in a project
- Toggle "Create an incident. Incidents are created for each alert triggered." to control incident creation
- Toggle "Automatically close incidents when the associated Prometheus alert resolves." to control incident resolution
Sending recovery alerts
- With maintainer+ permissions, navigate to Settings > Operations > Alert integrations in the project
- Creating a recovery alert for a generic HTTP integration:
- Click 'Add new integration' button
- Enter a name, switch the toggle to active, skip the rest, then click 'Save & create test alert'
- Send a test recovery alert with payload:
{ "title": "This is a self-resolving HTTP alert", "end_time": "2021-04-30T11:22:40Z" }
- To send a firing alert, exclude the `end_time` key in the payload
- Creating a recovery alert for a Prometheus integration:
- Click 'Add new integration' button
- Switch the toggle to active, make up any URL (it won't matter), then click 'Save & create test alert'
- Send a test recovery alert with payload:
{ "version" : "4", "groupKey": null, "status": "resolved", "receiver": "", "groupLabels": {}, "commonLabels": {}, "commonAnnotations": {}, "externalURL": "", "alerts": [{ "startsAt": "2021-04-30T11:22:40Z", "generatorURL": "http://host?g0.expr=up", "endsAt": "2021-04-30T19:22:40Z", "status": "resolved", "labels": { "gitlab_environment_name": "production" }, "annotations": { "title": "This is a self-resolving Prometheus alert" } }] }
- To send a firing alert, replace `payload["alerts"][0]["status"]` with a value of `"firing"`
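For repeated testing, the firing and recovery variants of both payloads can be generated rather than edited by hand. The sketch below is one possible way to do that in Python; `build_http_alert` and `to_firing` are hypothetical helper names, and the field values are copied from the test payloads above.

```python
import copy
import json
from typing import Any, Dict, Optional

# Prometheus recovery payload from the testing steps above.
PROMETHEUS_RECOVERY: Dict[str, Any] = {
    "version": "4",
    "groupKey": None,
    "status": "resolved",
    "receiver": "",
    "groupLabels": {},
    "commonLabels": {},
    "commonAnnotations": {},
    "externalURL": "",
    "alerts": [{
        "startsAt": "2021-04-30T11:22:40Z",
        "generatorURL": "http://host?g0.expr=up",
        "endsAt": "2021-04-30T19:22:40Z",
        "status": "resolved",
        "labels": {"gitlab_environment_name": "production"},
        "annotations": {"title": "This is a self-resolving Prometheus alert"},
    }],
}


def build_http_alert(title: str, end_time: Optional[str] = None) -> Dict[str, Any]:
    """Generic HTTP payload: including end_time makes it a recovery alert."""
    payload: Dict[str, Any] = {"title": title}
    if end_time is not None:
        payload["end_time"] = end_time
    return payload


def to_firing(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Copy a Prometheus payload and mark each alert entry as firing."""
    firing = copy.deepcopy(payload)
    for alert in firing["alerts"]:
        alert["status"] = "firing"
    return firing


if __name__ == "__main__":
    recovery = build_http_alert(
        "This is a self-resolving HTTP alert", "2021-04-30T11:22:40Z"
    )
    print(json.dumps(recovery))
    print(json.dumps(to_firing(PROMETHEUS_RECOVERY)))
```

The printed JSON can be pasted into the test alert form. `to_firing` changes only the per-alert `status`, matching the step above, and leaves the original payload untouched.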
Does this MR meet the acceptance criteria?
Conformity
- 📋 Does this MR need a changelog?
  - I have included a changelog entry.
  - I have not included a changelog entry because _____.
- Documentation (if required) - !58514 (merged)
- Code review guidelines
- Merge request performance guidelines
- Style guides
- [-] Database guides
- [-] Separation of EE specific content
Availability and Testing
- Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- Tested in all supported browsers
- Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
- Label as security and @ mention @gitlab-com/gl-security/appsec
- The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- Security reports checked/validated by a reviewer from the AppSec team
Edited by Sarah Yasonik