Show alerts in environment index page
Problem to solve
As part of #8295 (comment 298418198) we want to stop a deployment in case an alert is raised by alert manager (see &2877 (closed) for more about what an "Alert" is). A good first step toward this would be to notify users that such an event happened, even before stopping anything.
Intended users
Further details
In case there is a degradation in performance or quality, we will notify the user on the environment index page (deploy board) so that they will know something is wrong and can take action.
Using the existing Prometheus API, we will query the current threshold of error rates.
We already associate Environments to Alerts in a 1:N relation. This means we can show a list of alerts for a specific environment, or only show the latest one.
For more information, see &2877 (closed) for what the devopsmonitor team is planning in an upcoming milestone:
Screenshot |
---|
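As a rough sketch of the Prometheus query step, the snippet below expands the `%{kube_namespace}` and `%{ci_environment_slug}` placeholders in the HTTP Error Rate (%) expression and builds a URL for Prometheus' instant-query endpoint (`/api/v1/query`). The wildcard patterns are written as `.*`, which is assumed to match the linked NGINX Ingress metrics documentation. The helper names, the example host, and the threshold comparison are illustrative assumptions, not the existing GitLab implementation.

```python
from urllib.parse import urlencode

# HTTP Error Rate (%) expression, with GitLab-style %{...} placeholders.
# The ".*" wildcards are an assumption based on the linked metrics docs.
ERROR_RATE_QUERY = (
    'sum(rate(nginx_ingress_controller_requests{status=~"5.*",'
    'namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) / '
    'sum(rate(nginx_ingress_controller_requests{'
    'namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) * 100'
)


def expand_query(template: str, variables: dict[str, str]) -> str:
    """Replace each %{name} placeholder with its value (hypothetical helper)."""
    query = template
    for name, value in variables.items():
        query = query.replace("%{" + name + "}", value)
    return query


def instant_query_url(base_url: str, query: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def exceeds_threshold(error_rate_percent: float, threshold_percent: float) -> bool:
    """True when the measured error rate is strictly above the threshold."""
    return error_rate_percent > threshold_percent
```

For example, `expand_query(ERROR_RATE_QUERY, {"kube_namespace": "prod-ns", "ci_environment_slug": "production"})` yields a query scoped to one environment, which can then be passed to `instant_query_url` with the address of the Prometheus server.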
Proposal
- We will display the latest alert (already supported in &2877 (closed)) in case a threshold is crossed for the environment on the environment list/deploy board.
- This will only be done for primary environments (no grouped review environments for example)
- Only one alert will be visible at a time
- The alert which will be shown is the latest one unless there is a critical alert that is persisting.
- Alerts in the environment page/deploy board should be dismissed automatically if a corresponding metric returns to normal and doesn't exceed a threshold. If the alert has already ended, it should not appear.
- The payload of the alert will include [Alert severity icon] [Alert severity title] - [when alert started] [alert condition] [metric name] - [Error rate]. [View details]
  - [Alert severity icon], [Alert severity title], [when alert started], [alert condition], and [metric name] are pulled from the alerts API
  - [View details] links to the metrics page with the correct environment selected, solving #214927 (closed)
  - [Error rate] will use [pre-existing defined error rates](https://docs.gitlab.com/ee/user/project/integrations/prometheus_library/nginx_ingress.html#metrics-supported):
Name | Query |
---|---|
Throughput (req/sec) | sum(label_replace(rate(nginx_ingress_controller_requests{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m]), "status_code", "${1}xx", "status", "(.)..")) by (status_code) |
Latency (ms) | sum(rate(nginx_ingress_controller_ingress_upstream_latency_seconds_sum{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) / sum(rate(nginx_ingress_controller_ingress_upstream_latency_seconds_count{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) * 1000 |
HTTP Error Rate (%) | sum(rate(nginx_ingress_controller_requests{status=~"5.*",namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) / sum(rate(nginx_ingress_controller_requests{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) * 100 |
- Show the alert below the environment or pod information (in case the deploy board is active), similar to merge request widgets. frontend backend
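The selection rules in the proposal (one alert at a time, ended alerts hidden, a persisting critical alert taking priority over newer ones) can be sketched roughly as follows. The `Alert` fields mirror the payload placeholders above, but the class, the function names, and the `"critical"` severity value are assumptions made for illustration and do not describe the actual alerts API. The primary-environment rule is not covered here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Alert:
    severity: str                        # assumed values, e.g. "critical", "warning"
    title: str                           # [Alert severity title]
    started_at: datetime                 # [when alert started]
    condition: str                       # [alert condition], e.g. "exceeded 0.1%"
    metric_name: str                     # [metric name]
    ended_at: Optional[datetime] = None  # None while the alert is still firing


def alert_to_display(alerts: list[Alert]) -> Optional[Alert]:
    """Pick the single alert to show for one environment, or None."""
    # Alerts that have already ended should not appear.
    active = [a for a in alerts if a.ended_at is None]
    if not active:
        return None
    # A still-firing critical alert takes priority over newer, less severe ones.
    critical = [a for a in active if a.severity == "critical"]
    candidates = critical or active
    # Among the candidates, show the most recently started alert.
    return max(candidates, key=lambda a: a.started_at)


def summary_line(alert: Alert) -> str:
    """Render a short text line in the style of the mockup."""
    return f"{alert.title} - {alert.metric_name} {alert.condition}."
```

With an active alert whose title is "Critical", metric name "HTTP error rate", and condition "exceeded 0.1%", `summary_line` produces the same text as the mockup: "Critical - HTTP error rate exceeded 0.1%."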
Mockup (browser made) |
---|
Code I injected to create the mockup above:
<div style="
/* padding-top: 5px; */
/* padding-bottom: 5px; */
"><div class="mr-widget-extension d-flex align-items-center pl-3" style="
vertical-align: middle;
/* margin-top: 5px; */
padding-top: 5px;
padding-bottom: 5px;
"><svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 12 12" style="
margin-right: 8px;
">
<path fill-rule="evenodd" d="M6.70565033,0.184992446 L10.7943497,2.49459124 C11.2310076,2.74124783 11.5,3.19708802 11.5,3.69040121 L11.5,8.30959879 C11.5,8.80291198 11.2310076,9.25875217 10.7943497,9.50540876 L6.70565033,11.8150076 C6.26899239,12.0616641 5.73100761,12.0616641 5.29434967,11.8150076 L1.20565033,9.50540876 C0.768992386,9.25875217 0.5,8.80291198 0.5,8.30959879 L0.5,3.69040121 C0.5,3.19708802 0.768992386,2.74124783 1.20565033,2.49459124 L5.29434967,0.184992446 C5.73100761,-0.0616641488 6.26899239,-0.0616641488 6.70565033,0.184992446 Z" style="
fill: #8c210d;
"></path>
</svg>
<span style="
margin-right: 4px;
">Critical - HTTP error rate exceeded 0.1%.</span><button type="button" class="btn btn-link btn-md"><!----> View details</button></div> <!----></div>
Permissions and Security
Documentation
Availability & Testing
What does success look like, and how can we measure that?
What is the type of buyer?
Is this a cross-stage feature?
Links / references
Scoped off
Edited by Dimitrie Hoekstra