Add Pending Alert Escalations table, model, services and worker (!64274)

Sean Arnold requested to merge 323139-create-alert-escalations into master Jun 17, 2021

What does this MR do?

Note: This is behind feature flag escalation_policies_mvc, and licensed flag escalation_policies.

DB Migration

This adds the pending alert escalations table (`incident_management_pending_alert_escalations`), as part of #323139 (closed).

`incident_management_pending_alert_escalations`	Type	Null
id	bigint	not null
rule_id	bigint	null
alert_id	bigint	not null
schedule_id	bigint	not null
status	smallint	not null
process_at	timestamp with time zone	not null
created_at	timestamp with time zone	not null
updated_at	timestamp with time zone	not null

Database commands:

== 20210617022324 CreateIncidentManagementPendingAlertEscalations: migrating ==

CREATE TABLE incident_management_pending_alert_escalations (
  id bigserial NOT NULL,
  rule_id bigint,
  alert_id bigint NOT NULL,
  schedule_id bigint NOT NULL,
  process_at timestamp with time zone NOT NULL,
  created_at timestamp with time zone NOT NULL,
  updated_at timestamp with time zone NOT NULL,
  status smallint NOT NULL,
  PRIMARY KEY (id, process_at)
) PARTITION BY RANGE (process_at);
CREATE INDEX index_incident_management_pending_alert_escalations_on_alert_id
  ON incident_management_pending_alert_escalations USING btree (alert_id);

CREATE INDEX index_incident_management_pending_alert_escalations_on_rule_id
  ON incident_management_pending_alert_escalations USING btree (rule_id);

CREATE INDEX index_incident_management_pending_alert_escalations_on_schedule_id
  ON incident_management_pending_alert_escalations USING btree (schedule_id);

CREATE INDEX index_incident_management_pending_alert_escalations_on_process_at
  ON incident_management_pending_alert_escalations USING btree (process_at);

ALTER TABLE incident_management_pending_alert_escalations ADD CONSTRAINT fk_rails_fcbfd9338b
  FOREIGN KEY (schedule_id) REFERENCES incident_management_oncall_schedules(id) ON DELETE CASCADE;

ALTER TABLE incident_management_pending_alert_escalations ADD CONSTRAINT fk_rails_057c1e3d87
  FOREIGN KEY (rule_id) REFERENCES incident_management_escalation_rules(id) ON DELETE SET NULL;

ALTER TABLE incident_management_pending_alert_escalations ADD CONSTRAINT fk_rails_8d8de95da9
  FOREIGN KEY (alert_id) REFERENCES alert_management_alerts(id) ON DELETE CASCADE;

Down

== 20210617022324 CreateIncidentManagementPendingAlertEscalations: reverting ==
-- drop_table(:incident_management_pending_alert_escalations)
   -> 0.0145s
== 20210617022324 CreateIncidentManagementPendingAlertEscalations: reverted (0.0216s)

Creation of Pending Alert Escalations

We create an escalation for every incoming alert whose project has an escalation policy (with rules) set up. This is guarded by the feature flag.

The logic for creating the escalations is held in IncidentManagement::PendingEscalations::CreateService, which takes a target (an AlertManagement::Alert, and in the future, an Incident issue).

Deleting / Creating Escalations on status changes

We create or delete escalations as a result of an Alert status change:

Alert Status change	Result
`triggered/acknowledged` -> `resolved/ignored`	Delete existing Alert Escalations for alert
`resolved/ignored` -> `triggered/acknowledged`	Create a new Alert Escalation for the alert
`resolved/ignored` -> `resolved/ignored`	No change
`triggered/acknowledged` -> `triggered/acknowledged`	No change
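The table above can be read as a small decision function. The following is a minimal sketch, not the code in this MR: `escalation_action`, `OPEN_STATUSES` and `CLOSED_STATUSES` are hypothetical names chosen for illustration, and the returned symbols only label the three outcomes.

```ruby
# Hypothetical helper: maps an alert status change to the escalation action
# listed in the status change table. None of these names come from the MR.
OPEN_STATUSES   = %i[triggered acknowledged].freeze
CLOSED_STATUSES = %i[resolved ignored].freeze

def escalation_action(from_status, to_status)
  [from_status, to_status].each do |status|
    unless OPEN_STATUSES.include?(status) || CLOSED_STATUSES.include?(status)
      raise ArgumentError, "unknown status: #{status.inspect}"
    end
  end

  was_open = OPEN_STATUSES.include?(from_status)
  now_open = OPEN_STATUSES.include?(to_status)

  if was_open && !now_open
    :delete_escalations   # triggered/acknowledged -> resolved/ignored
  elsif !was_open && now_open
    :create_escalation    # resolved/ignored -> triggered/acknowledged
  else
    :no_change            # status stays within the same group
  end
end
```

Grouping the four statuses into two sets keeps the rule to a single comparison of "was open" against "is open now".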

`IncidentManagement::PendingEscalations::ProcessService`

This evaluates the rule information that is stored on each PendingEscalation. If the criteria are met (the required status is not set on the alert, and enough time has passed so that process_at is now in the past), then we notify the on-call schedule.
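As an illustration of the two criteria, here is a minimal sketch using plain Ruby structs. `SketchAlert`, `SketchPendingEscalation`, `required_status` and `should_notify?` are hypothetical stand-ins, not the models or methods in this MR, and the status check is assumed to be a simple inequality; the real service may compare statuses differently.

```ruby
# Hypothetical stand-ins for the alert and the pending escalation record.
SketchAlert = Struct.new(:status)
SketchPendingEscalation = Struct.new(:alert, :required_status, :process_at)

# Returns true when both criteria hold:
#   1. the alert does not have the status the rule requires, and
#   2. process_at has been reached.
def should_notify?(escalation, now: Time.now)
  status_missing = escalation.alert.status != escalation.required_status
  due = escalation.process_at <= now

  status_missing && due
end
```

Passing `now` as a keyword argument keeps the check deterministic in tests.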

Workers

To run the service mentioned above, we have a cron worker and a job worker.

The cron worker, IncidentManagement::Escalations::ScheduleEscalationCheckCronWorker, iterates over the pending escalations that are ready to process, and spawns an IncidentManagement::Escalations::PendingAlertEscalationCheckWorker job for each.

It does this in batches of 1000 using bulk_perform_async.
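The batching can be sketched in plain Ruby as follows. This is an assumption-laden illustration rather than the worker code: `FakeCheckWorker` is a hypothetical stand-in that only records the argument lists it receives, `schedule_checks` is an invented name, and the escalation IDs are passed in as a plain array instead of being loaded from the database.

```ruby
BATCH_SIZE = 1000

# Hypothetical stand-in for the job worker. It records each argument list
# passed to bulk_perform_async instead of enqueuing real jobs.
class FakeCheckWorker
  @enqueued = []

  class << self
    attr_reader :enqueued

    def bulk_perform_async(args_list)
      @enqueued << args_list
    end
  end
end

# Splits the IDs into batches and hands each batch to the worker as a list
# of per-job argument arrays, one [id] per job.
def schedule_checks(escalation_ids, worker: FakeCheckWorker, batch_size: BATCH_SIZE)
  escalation_ids.each_slice(batch_size) do |ids|
    worker.bulk_perform_async(ids.map { |id| [id] })
  end
end
```

With 2,500 ready escalations, this sketch makes three bulk calls of 1000, 1000 and 500 jobs.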

Does this MR meet the acceptance criteria?

Conformity

I have included changelog trailers, or none are needed. (Does this MR need a changelog?)
I have added/updated documentation, or it's not needed. (Is documentation required?)
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?)
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?)
I have self-reviewed this MR per code review guidelines.
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines)
I have followed the style guides.
This change is backwards compatible across updates, or this does not apply.

Availability and Testing

I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.)
I have tested this MR in all supported browsers, or it's not needed.
I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.

Security

Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.

Label as security and @ mention @gitlab-com/gl-security/appsec
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
Security reports checked/validated by a reviewer from the AppSec team

Related to #323139 (closed)

Edited Jun 27, 2021 by Sean Arnold