Optimize alert integration notification failure logs

Sarah Yasonik requested to merge sy-improve-alert-notifcation-logs into master

What does this MR do and why?

  • Updates the message for alert creation failure logs to a static string for easier searching in Kibana.
  • Removes the log for failed recovery alert resolution, as it cannot be reached.
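
To illustrate why a static message is easier to search for, here is a minimal sketch using only the Ruby standard library. It is not the code from this MR: the message string, the `alert_errors` key, and both helper methods are hypothetical names chosen for the example. The idea is that the message never varies, so one exact-match query finds every occurrence, while the variable details live in separate fields.

```ruby
require 'json'
require 'logger'
require 'stringio'

# Hypothetical static message; the real string used in the MR may differ.
CREATION_FAILURE_MESSAGE = 'Unable to create alert'

# Builds a structured entry. The message is constant; anything that varies
# per failure (project, validation errors) goes into its own key.
def creation_failure_entry(project_id:, errors:)
  {
    message: CREATION_FAILURE_MESSAGE,
    project_id: project_id,
    alert_errors: errors
  }
end

# Writes the entry as one JSON line at warn level.
def log_creation_failure(logger, project_id:, errors:)
  logger.warn(JSON.generate(creation_failure_entry(project_id: project_id, errors: errors)))
end
```

Had the validation errors been interpolated into the message instead, every distinct error would produce a different string and an exact-match search would miss most of them.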

How to set up and validate locally

Modified log for creation failure

A user can't directly cause a creation failure, but there is a race condition when multiple new, identical payloads are saved at the same time. This might cause either an AlertManagement::Alert#fingerprint validation error or an ActiveRecord::RecordNotUnique error (which will be handled better in a future MR).
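
The two failure modes can be sketched with a small in-memory simulation. This is an illustration only, not GitLab code: `AlertTable`, `simulate_duplicate_saves`, and the local `RecordNotUnique` class are hypothetical stand-ins, and the interleaving of the two requests is written out by hand rather than produced by real concurrency.

```ruby
require 'set'

# Hypothetical stand-in for ActiveRecord::RecordNotUnique.
class RecordNotUnique < StandardError; end

# Hypothetical in-memory table with a unique index on fingerprint.
class AlertTable
  def initialize
    @fingerprints = Set.new
  end

  # Application-level uniqueness validation: a plain read, no lock held.
  def fingerprint_available?(fingerprint)
    !@fingerprints.include?(fingerprint)
  end

  # Database-level unique index: rejects a duplicate at write time.
  def insert(fingerprint)
    return if @fingerprints.add?(fingerprint)

    raise RecordNotUnique, "duplicate fingerprint #{fingerprint}"
  end
end

# Two requests with the same fingerprint both validate before either writes.
def simulate_duplicate_saves(fingerprint)
  table = AlertTable.new
  first_valid = table.fingerprint_available?(fingerprint)
  second_valid = table.fingerprint_available?(fingerprint)

  table.insert(fingerprint) if first_valid

  error = nil
  begin
    table.insert(fingerprint) if second_valid
  rescue RecordNotUnique => e
    error = e
  end

  # A third request arriving after the write fails validation instead.
  { both_validated: first_valid && second_valid,
    write_error: error,
    late_request_valid: table.fingerprint_available?(fingerprint) }
end
```

Which error appears depends only on timing: a request that validates after the first write fails the validation, while one that validates before it fails at the unique index.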

Unfortunately, neither of these errors can be easily triggered by modifying inputs, so the simplest way to see the behavior locally is a hack using pry.

  1. Add a debugger in app/services/concerns/alert_management/alert_processing.rb
     def process_new_alert
       return if resolving_alert?
    +  binding.pry
       if alert.save
  2. Tail the logs: tail -f log/development.log
  3. Trigger alert processing in the rails console
    payload = { 'annotations' => { 'title' => 'TITLE' }, 'startsAt' => '2021-04-30T11:22:40Z' }
    project = Project.first
    AlertManagement::ProcessPrometheusAlertService.new(project, payload).execute
  4. When the debugger pops up, modify an alert attribute so that validations fail.
    > alert.title = nil
    > continue
  5. View the logs to see the error.

Removed log for status change failure

This can't be triggered by any action I'm aware of or have seen evidence of in the logs:

  1. AlertManagement::Alert#resolve defines a fallback for when the payload is missing an appropriate timestamp
  2. AlertManagement::AlertProcessing doesn't modify any attributes on an alert prior to calling #resolve
  3. There are no database-level constraints or validations which might cause a status change to resolved to fail

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Sarah Yasonik
