Prometheus alerts delivered from AlertManager to GitLab issues are being silently dropped
See gitlab-com/runbooks!2592 (merged) and gitlab-com/gl-infra/production#2451 (closed) for more details.
GitLab.com's AlertManager infrastructure delivers some alerts to GitLab.com issues, but these alerts are being silently dropped.
On Jul 23, 2020 @ 00:40:07.791, AlertManager delivered a webhook alert to GitLab.com:
Log entry (while it lasts): https://log.gprd.gitlab.net/app/kibana#/discover/doc/AW5F1e45qthdGjPJueGO/pubsub-rails-inf-gprd-003224?id=lWQceXMBOELd9C8V9tGa
The server responded with a 200.
The following params were delivered to GitLab.com:
{
"key": "receiver",
"value": "issue:gitlab\\.com/gitlab-com/gl-infra/production"
},
{
"key": "status",
"value": "firing"
},
{
"key": "alerts",
"value": "[{\"status\"=>\"firing\", \"labels\"=>{\"alert_type\"=>\"cause\", \"alertname\"=>\"SSLCertExpiresSoon\", \"env\"=>\"gprd\", \"environment\"=>\"gprd\", \"instance\"=>\"https://status.gitlab.com\", \"job\"=>\"blackbox\", \"monitor\"=>\"default\", \"pager\"=>\"issue\", \"project\"=>\"gitlab.com/gitlab-com/gl-infra/production\", \"provider\"=>\"gcp\", \"region\"=>\"us-east\", \"severity\"=>\"s2\", \"shard\"=>\"default\", \"stage\"=>\"main\", \"tier\"=>\"sv\", \"type\"=>\"blackbox\"}, \"annotations\"=>{\"description\"=>\"[FILTERED]\", \"runbook\"=>\"docs/frontend/ssl_cert.md\", \"title\"=>\"[FILTERED]\"}, \"startsAt\"=>\"2020-07-23T00:30:00.587237764Z\", \"endsAt\"=>\"0001-01-01T00:00:00Z\", \"generatorURL\"=>\"https://prometheus.gprd.gitlab.net/graph?g0.expr=probe_ssl_earliest_cert_expiry%7Bjob%3D%22blackbox%22%7D+-+time%28%29+%3C+14+%2A+86400&g0.tab=1\", \"fingerprint\"=>\"1f00c90951546e3b\"}]"
},
{
"key": "groupLabels",
"value": "{\"alertname\"=>\"SSLCertExpiresSoon\", \"env\"=>\"gprd\", \"stage\"=>\"main\", \"tier\"=>\"sv\", \"type\"=>\"blackbox\"}"
},
{
"key": "commonLabels",
"value": "{\"alert_type\"=>\"cause\", \"alertname\"=>\"SSLCertExpiresSoon\", \"env\"=>\"gprd\", \"environment\"=>\"gprd\", \"instance\"=>\"https://status.gitlab.com\", \"job\"=>\"blackbox\", \"monitor\"=>\"default\", \"pager\"=>\"issue\", \"project\"=>\"gitlab.com/gitlab-com/gl-infra/production\", \"provider\"=>\"gcp\", \"region\"=>\"us-east\", \"severity\"=>\"s2\", \"shard\"=>\"default\", \"stage\"=>\"main\", \"tier\"=>\"sv\", \"type\"=>\"blackbox\"}"
},
{
"key": "commonAnnotations",
"value": "{\"description\"=>\"[FILTERED]\", \"runbook\"=>\"docs/frontend/ssl_cert.md\", \"title\"=>\"[FILTERED]\"}"
},
{
"key": "externalURL",
"value": "http://alerts-01-inf-ops:9093"
},
{
"key": "version",
"value": "4"
},
{
"key": "groupKey",
"value": "{}/{env=\"gprd\",pager=\"issue\",project=\"gitlab.com/gitlab-com/gl-infra/production\"}:{alertname=\"SSLCertExpiresSoon\", env=\"gprd\", stage=\"main\", tier=\"sv\", type=\"blackbox\"}"
},
{
"key": "namespace_id",
"value": "gitlab-com/gl-infra"
},
{
"key": "project_id",
"value": "production"
},
{
"key": "alert",
"value": "{\"receiver\"=>\"issue:gitlab\\\\.com/gitlab-com/gl-infra/production\", \"status\"=>\"firing\", \"alerts\"=>[{\"status\"=>\"firing\", \"labels\"=>{\"alert_type\"=>\"cause\", \"alertname\"=>\"SSLCertExpiresSoon\", \"env\"=>\"gprd\", \"environment\"=>\"gprd\", \"instance\"=>\"https://status.gitlab.com\", \"job\"=>\"blackbox\", \"monitor\"=>\"default\", \"pager\"=>\"issue\", \"project\"=>\"gitlab.com/gitlab-com/gl-infra/production\", \"provider\"=>\"gcp\", \"region\"=>\"us-east\", \"severity\"=>\"s2\", \"shard\"=>\"default\", \"stage\"=>\"main\", \"tier\"=>\"sv\", \"type\"=>\"blackbox\"}, \"annotations\"=>{\"description\"=>\"[FILTERED]\", \"runbook\"=>\"docs/frontend/ssl_cert.md\", \"title\"=>\"[FILTERED]\"}, \"startsAt\"=>\"2020-07-23T00:30:00.587237764Z\", \"endsAt\"=>\"0001-01-01T00:00:00Z\", \"generatorURL\"=>\"https://prometheus.gprd.gitlab.net/graph?g0.expr=probe_ssl_earliest_cert_expiry%7Bjob%3D%22blackbox%22%7D+-+time%28%29+%3C+14+%2A+86400&g0.tab=1\", \"fingerprint\"=>\"1f00c90951546e3b\"}], \"groupLabels\"=>{\"alertname\"=>\"SSLCertExpiresSoon\", \"env\"=>\"gprd\", \"stage\"=>\"main\", \"tier\"=>\"sv\", \"type\"=>\"blackbox\"}, \"commonLabels\"=>{\"alert_type\"=>\"cause\", \"alertname\"=>\"SSLCertExpiresSoon\", \"env\"=>\"gprd\", \"environment\"=>\"gprd\", \"instance\"=>\"https://status.gitlab.com\", \"job\"=>\"blackbox\", \"monitor\"=>\"default\", \"pager\"=>\"issue\", \"project\"=>\"gitlab.com/gitlab-com/gl-infra/production\", \"provider\"=>\"gcp\", \"region\"=>\"us-east\", \"severity\"=>\"s2\", \"shard\"=>\"default\", \"stage\"=>\"main\", \"tier\"=>\"sv\", \"type\"=>\"blackbox\"}, \"commonAnnotations\"=>{\"description\"=>\"[FILTERED]\", \"runbook\"=>\"docs/frontend/ssl_cert.md\", \"title\"=>\"[FILTERED]\"}, \"externalURL\"=>\"http://alerts-01-inf-ops:9093\", \"version\"=>\"4\", \"groupKey\"=>\"{}/{env=\\\"gprd\\\",pager=\\\"issue\\\",project=\\\"gitlab.com/gitlab-com/gl-infra/production\\\"}:{alertname=\\\"SSLCertExpiresSoon\\\", env=\\\"gprd\\\", stage=\\\"main\\\", tier=\\\"sv\\\", type=\\\"blackbox\\\"}\"}"
}
The webhook was delivered successfully (HTTP 200), but no issue was created.
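To investigate, the delivery can be replayed by hand against the notify endpoint. The following is a sketch, not a verified reproduction: the token is a placeholder, and the payload is a minimal AlertManager v4 webhook body reconstructed from the fields in the dump above.

```python
# Sketch: replay a minimal AlertManager v4 webhook payload against the
# project's Prometheus alert notify endpoint, to observe whether a 200
# response actually corresponds to an issue being created.
import json
import urllib.request

# Endpoint and token from the receiver configuration; TOKEN is a placeholder.
NOTIFY_URL = "https://gitlab.com/gitlab-com/gl-infra/production/prometheus/alerts/notify.json"
TOKEN = "SECRET"

# Minimal v4 payload trimmed to the essential fields from the dump above.
payload = {
    "version": "4",
    "status": "firing",
    "receiver": "issue:gitlab\\.com/gitlab-com/gl-infra/production",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "SSLCertExpiresSoon",
                "env": "gprd",
                "pager": "issue",
                "severity": "s2",
            },
            "annotations": {
                "title": "Replay test",
                "runbook": "docs/frontend/ssl_cert.md",
            },
            "startsAt": "2020-07-23T00:30:00Z",
            "endsAt": "0001-01-01T00:00:00Z",
        }
    ],
}

def replay():
    """POST the payload; returns the HTTP status code."""
    req = urllib.request.Request(
        NOTIFY_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {TOKEN}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        # Note: a 200 here does not guarantee an issue was created --
        # that is precisely the silent-drop behavior under investigation.
        return resp.status
```

After calling `replay()`, the project's issue list would need to be checked manually, since the response body does not indicate whether an issue was filed.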
The alertmanager configuration is as follows:
```yaml
- name: issue:gitlab.com/gitlab-com/gl-infra/production
  webhook_configs:
  - http_config:
      bearer_token: SECRET
    send_resolved: true
    url: https://gitlab.com/gitlab-com/gl-infra/production/prometheus/alerts/notify.json
```
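For context, the `groupKey` in the payload above implies a routing rule along these lines. This is a sketch reconstructed from the matchers in the `groupKey`; the actual routing tree is not reproduced here:

```yaml
route:
  routes:
  # Matchers inferred from the groupKey:
  # {}/{env="gprd",pager="issue",project="gitlab.com/gitlab-com/gl-infra/production"}
  - match:
      env: gprd
      pager: issue
      project: gitlab.com/gitlab-com/gl-infra/production
    receiver: issue:gitlab.com/gitlab-com/gl-infra/production
```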
In the case of the GitLab.com alert that was lost, we could easily have missed an SSL certificate renewal alert had it not been noticed through other means. It is critical for the availability of GitLab.com that our alerting infrastructure works as expected.
Therefore I'm marking this as ~P2 ~S2.