Escalate alerts according to the escalation policies for a project
Once escalation policies tables and models are available for projects with on-call schedules, we will want to escalate alerts according to the rules of the escalation policy for the project. This issue represents the work needed to actually adhere to the policy defined for a project.
Scope/requirements of this issue:
- Add support for escalating alerts when escalation policy dictates
- Escalations will only need to be to a provided schedule
- Escalations will only need to be after a given number of minutes >=0
- Escalations will occur if the alert is not either `acknowledged` or `resolved`, as defined by the user
- Escalation rules should not apply to alerts which were created before the escalation policy
- Escalation rules should be adhered to as closely as possible for the sake of user trust. If we're late, that's money.
- Escalations should only occur once per escalation rule, per alert.
- Re-triggered alerts should start the escalation policy over, as if they had just been created. ("Re-triggered" meaning that the status was set to acknowledged/resolved, then back to triggered)
- If an escalation policy is modified, existing alerts should follow the original escalation rules. (If using the new rules is easier, do that instead & communicate the change in expectations to Product.)
Out of scope: backfilling escalation policies, auto-creating escalation policies, system notes for escalations, email updates
Proposal:
Table: incident_management_alert_escalations
Model: IncidentManagement::AlertEscalations
Column | Required | Type | Description |
---|---|---|---|
id | true | Integer | ID of the escalation |
policy_id | true | Integer | Escalation Policy to which the escalation corresponds |
alert_id | true | Integer | ID of the alert |
created_at | true | datetime_with_zone | Creation time of the escalation (AKA - time at which the escalation was "triggered") |
updated_at | true | datetime_with_zone | Update time of the escalation (AKA - time at which notifications were last sent out) |
Flow:
- An alert comes in.
- An escalation policy is identified.
- Any zero-minute escalation rules are enacted.
- An `Escalation` is added to the `incident_management_escalations` table.
- A cronjob runs every minute, starting a job for each `Escalation`.
- Job content:
  - Get Escalation.
  - Get job `start_time`.
  - Get alert. Get policy & rules.
  - Filter to applicable rules.
    - `alert.status >= escalation_rule.status` (the status isn't expectedly resolved/ack-ed)
    - `(escalation.current_time - escalation.created_at) >= escalation_rule.time_elapsed` (it's been too long)
    - `(escalation.updated_at - escalation.created_at) < escalation_rule.time_elapsed` (we haven't already notified for this rule)
  - For each applicable rule, send notifications.
  - Set `escalation.updated_at` to job `start_time`.
- On status change of alert or incident to `Resolved`, remove the `Escalation`. On status change of an alert from `Resolved` to anything else, create an `Escalation`.
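A minimal plain-Ruby sketch of the job's rule filter, to make the three conditions concrete. Everything here is an assumption for illustration rather than the actual implementation: the `Alert`, `Rule`, and `Escalation` structs stand in for the real models, `escalate_unless` and `elapsed_minutes` are hypothetical field names, and the numeric status comparison is replaced with an explicit `HANDLED` lookup because the ordering of status values is not defined in this issue.

```ruby
# Hypothetical stand-ins for the real models; all names are illustrative only.
Alert      = Struct.new(:status)            # :triggered, :acknowledged or :resolved
Rule       = Struct.new(:escalate_unless, :elapsed_minutes)
Escalation = Struct.new(:created_at, :updated_at)

# Assumption: which alert statuses count as "handled" for a rule's target status.
HANDLED = {
  acknowledged: %i[acknowledged resolved],
  resolved: %i[resolved]
}.freeze

# Returns the rules that are due at `start_time` and not yet notified.
def applicable_rules(escalation, alert, rules, start_time)
  since_created  = start_time - escalation.created_at             # seconds
  since_notified = escalation.updated_at - escalation.created_at  # seconds

  rules.select do |rule|
    threshold = rule.elapsed_minutes * 60
    !HANDLED.fetch(rule.escalate_unless).include?(alert.status) && # status isn't ack-ed/resolved as required
      since_created >= threshold &&                                # it's been too long
      since_notified < threshold                                   # not already notified for this rule
  end
end

# One job run: pick the due rules, then record the job start time.
# Sending notifications is left to the caller in this sketch.
def run_escalation_job(escalation, alert, rules, start_time)
  due = applicable_rules(escalation, alert, rules, start_time)
  escalation.updated_at = start_time
  due
end
```

With rules at 0, 10, and 30 minutes and a job starting 16 minutes after `created_at`, this sketch selects only the 10-minute rule. The zero-minute rule is never picked up by the job, because `updated_at` equals `created_at` at insertion, which matches the flow's step of enacting zero-minute rules at intake.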
When the same alert keeps firing:
- Notify when new alerts arrive and on escalations only.
- New alerts trigger the escalation policy, sending one notification per rule.
- Re-occurrences of existing alerts do nothing extra, but the alert will continue to be escalated according to the escalation policy. (Example: An alert was created 16 minutes ago. There are escalation rules for 0, 10, & 30 minutes. We've already sent out a notification at 0 minutes and another at 10 minutes. Now, the alert integration receives the same payload again, but we do nothing.)
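
The intake behaviour above could be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the real alert pipeline: `AlertIntake` is a hypothetical class, and `fingerprint` is a hypothetical key standing in for whatever identifies "the same alert".

```ruby
# Hypothetical in-memory intake; class and method names are illustrative only.
class AlertIntake
  attr_reader :notifications

  # rule_minutes: the elapsed-minute thresholds of the policy's rules, e.g. [0, 10, 30]
  def initialize(rule_minutes)
    @rule_minutes = rule_minutes
    @open = {}          # fingerprint => time the escalation was started
    @notifications = [] # [fingerprint, rule_minutes] pairs sent so far
  end

  def receive(fingerprint, now)
    # Re-occurrence of an existing alert: do nothing extra.
    return :ignored if @open.key?(fingerprint)

    # New alert: start the escalation and enact zero-minute rules immediately.
    @open[fingerprint] = now
    @rule_minutes.select(&:zero?).each { |minutes| @notifications << [fingerprint, minutes] }
    :escalation_started
  end
end
```

Later rules (10 and 30 minutes in the example) are deliberately not handled here; in this sketch they remain the responsibility of the per-minute escalation job.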
Validations/constraints:
- `escalation`, `alert` should be present
- Unique constraint: Combo of `policy_id, alert_id` should be unique
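
As a rough illustration of these constraints, here is an in-memory stand-in for the table. It is a sketch under assumptions: `EscalationTable` and `DuplicateError` are hypothetical names, presence is checked on the `policy_id` and `alert_id` columns from the proposal, and in the real schema the uniqueness would presumably be enforced by a database index rather than application code.

```ruby
# Hypothetical in-memory stand-in for the escalations table and its constraints.
class EscalationTable
  DuplicateError = Class.new(StandardError)

  def initialize
    @rows = {} # [policy_id, alert_id] => row
  end

  def insert(policy_id:, alert_id:, now: Time.now)
    # Presence: both references are required.
    raise ArgumentError, 'policy_id and alert_id are required' if policy_id.nil? || alert_id.nil?

    # Uniqueness: at most one escalation per (policy_id, alert_id) pair.
    key = [policy_id, alert_id]
    raise DuplicateError, "escalation already exists for #{key.inspect}" if @rows.key?(key)

    @rows[key] = { policy_id: policy_id, alert_id: alert_id, created_at: now, updated_at: now }
  end

  # Mirrors removing the escalation when the alert is resolved.
  def delete(policy_id:, alert_id:)
    @rows.delete([policy_id, alert_id])
  end
end
```

Deleting the row on resolution and inserting a fresh one when the alert leaves `Resolved` keeps the pair unique while resetting `created_at`, so a re-triggered alert starts the policy over.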
Edited by Sean Arnold