
Consistently handle recovery alerts

Sarah Yasonik requested to merge sy-consistently-auto-resolve-alerts into master

What does this MR do?

This MR better aligns the processing of recovery alerts between the Prometheus & HTTP alert integrations. Recovery alerts are notifications we receive from a monitoring tool that indicate that a problem with an application is resolved.

This MR has two sub-goals:

  1. address inconsistency in how recovery alerts are treated relative to the language used in incident management settings
  2. address inconsistency in how recovery alerts are handled based on source

Related issues & MRs

Changes

Before:
  • Recovery alerts closed the alert and corresponding issue only if the setting was enabled.
  • Prometheus recovery alerts which did not correspond to an existing GitLab alert were swallowed.
  • HTTP recovery alerts which did not correspond to an existing GitLab alert were handled by creating a new open alert.

After:
  • Recovery alerts always close the corresponding alert.
  • Recovery alerts which do not correspond to an existing GitLab alert are created as open alerts, then immediately automatically resolved.
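The unified behavior in the "After" column can be sketched as follows. This is illustrative Python pseudologic only, not GitLab's actual Ruby implementation; the `AlertStore` class and all names in it are hypothetical stand-ins:

```python
# Illustrative sketch of the unified recovery-alert behavior described above.
# All names here are hypothetical; GitLab's real implementation is in Ruby.

class AlertStore:
    """Minimal in-memory stand-in for GitLab's alert records."""
    def __init__(self):
        self.alerts = {}  # fingerprint -> status

    def find(self, fingerprint):
        return self.alerts.get(fingerprint)

    def create_open(self, fingerprint):
        self.alerts[fingerprint] = "triggered"

    def resolve(self, fingerprint):
        self.alerts[fingerprint] = "resolved"


def process_recovery_alert(store, fingerprint):
    """A recovery alert always resolves the matching alert; if no matching
    alert exists yet, one is created open and then immediately resolved."""
    if store.find(fingerprint) is None:
        store.create_open(fingerprint)
    store.resolve(fingerprint)
    return store.find(fingerprint)
```

Note that, regardless of source or whether the alert previously existed, the end state is the same: a resolved alert.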

Testing

Enable/disable automatic incident creation/resolution

  1. With maintainer+ permissions, navigate to Settings > Operations > Incidents in a project
    • Toggle 'Create an incident. Incidents are created for each alert triggered.' to control incident creation
    • Toggle 'Automatically close incidents when the associated Prometheus alert resolves.' to control incident resolution

Sending recovery alerts

  1. With maintainer+ permissions, navigate to Settings > Operations > Alert integrations in the project
  2. Creating a recovery alert for a generic HTTP integration:
    • Click 'Add new integration' button
    • Enter a name, switch the toggle to active, and skip the rest
    • Click 'Save & create test alert'
    • Send a test recovery alert with payload:
      { 
        "title": "This is a self-resolving HTTP alert", 
        "end_time": "2021-04-30T11:22:40Z" 
      }
    • To send a firing alert, omit the end_time key from the payload
  3. Creating a recovery alert for a Prometheus integration:
    • Click 'Add new integration' button
    • Switch the toggle to active and enter any URL (it won't matter)
    • Click 'Save & create test alert'
    • Send a test recovery alert with payload:
      {
        "version" : "4",
        "groupKey": null,
        "status": "resolved",
        "receiver": "",
        "groupLabels": {},
        "commonLabels": {},
        "commonAnnotations": {},
        "externalURL": "", 
        "alerts": [{
          "startsAt": "2021-04-30T11:22:40Z", 
          "generatorURL": "http://host?g0.expr=up", 
          "endsAt": "2021-04-30T19:22:40Z",
          "status": "resolved",
          "labels": {
            "gitlab_environment_name": "production"
          }, 
          "annotations": {
            "title": "This is a self-resolving Prometheus alert"
          }
        }]
      }
    • To send a firing alert, replace payload["alerts"][0]["status"] with a value of "firing"
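The two test payloads above can also be built programmatically, which makes it easy to flip between firing and recovery variants. A small sketch (Python for illustration; the timestamps and titles mirror the examples above, and actually POSTing the payload to the integration's endpoint with its authorization key is left out):

```python
# Builds the HTTP and Prometheus test payloads shown above.
# The firing/resolved switch mirrors the manual steps: the generic HTTP
# integration signals recovery via the presence of "end_time", while the
# Prometheus payload signals it via the alert's "status" field.

def http_payload(title, resolved=True):
    """Generic HTTP integration payload; end_time marks it as a recovery."""
    payload = {"title": title}
    if resolved:
        payload["end_time"] = "2021-04-30T11:22:40Z"
    return payload


def prometheus_payload(title, resolved=True):
    """Prometheus webhook payload; alerts[0]['status'] marks firing vs resolved."""
    status = "resolved" if resolved else "firing"
    return {
        "version": "4",
        "groupKey": None,
        "status": status,
        "receiver": "",
        "groupLabels": {},
        "commonLabels": {},
        "commonAnnotations": {},
        "externalURL": "",
        "alerts": [{
            "startsAt": "2021-04-30T11:22:40Z",
            "generatorURL": "http://host?g0.expr=up",
            "endsAt": "2021-04-30T19:22:40Z",
            "status": status,
            "labels": {"gitlab_environment_name": "production"},
            "annotations": {"title": title},
        }],
    }
```

To send one of these, serialize it as JSON and POST it to the integration's URL with the authorization key shown on the integration's page (e.g. via curl or urllib.request).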

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team
Edited by Sarah Yasonik
