Add a fast_timeout for the `ServerService.ServerInfo` endpoint
What does this MR do?
- Adds a fast timeout for the `ServerService.ServerInfo` endpoint
- Fixes https://gitlab.com/gitlab-org/gitlab-ce/issues/49116
- Reduces the impact of https://gitlab.com/gitlab-org/gitlab-ce/issues/49112
# Simulate a Gitaly server under extreme load by attaching a debugger, which pauses the process
$ lldb -p $(pgrep gitaly) &
# Without this fix: Prometheus scrape endpoint times out after ~55s with a 500 error
$ time curl http://localhost:3000/-/metrics
NameError at /-/metrics
=======================
> uninitialized constant GRPC::GRPC
...
real 0m58.135s
user 0m0.014s
sys 0m0.016s
# With this fix: the Gitaly check times out after ~10s and the Prometheus scrape endpoint returns the metrics successfully
$ time curl http://localhost:3000/-/metrics
client_browser_timing_count{event="contentComplete"} 40
# HELP client_browser_timing Multiprocess metric
# TYPE client_browser_timing histogram
client_browser_timing_bucket{event="connect",le="+Inf"} 40
client_browser_timing_bucket{event="connect",le="0.005"} 40
...
real 0m13.607s
user 0m0.020s
sys 0m0.027s
Why was this MR needed?
When any single Gitaly server fails, up to 50% of the web and API fleet's capacity is consumed by Prometheus healthcheck requests. These happen 4 times a minute on each node, and each request currently takes almost a full minute before failing with a 500 error.
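To make the load concrete, here is a rough back-of-the-envelope sketch. It assumes each scrape holds one worker for its whole duration and that the local timings shown above (58.135s without the fix, 13.607s with it) are representative; `busy_workers` is a hypothetical helper, not part of GitLab.

```ruby
# Average number of workers kept busy by scrapes on one node (Little's law):
# arrival rate (requests per second) multiplied by the time each request
# holds a worker (seconds).
def busy_workers(scrapes_per_minute, seconds_per_scrape)
  scrapes_per_minute / 60.0 * seconds_per_scrape
end

without_fix = busy_workers(4, 58.135) # timing from the failing run above
with_fix    = busy_workers(4, 13.607) # timing from the fixed run above

puts format('without fix: %.2f workers, with fix: %.2f workers',
            without_fix, with_fix)
# prints "without fix: 3.88 workers, with fix: 0.91 workers"
```

Under these assumptions, scrapes alone tie up nearly four workers per node without the fix and under one with it. How that maps to a percentage of the fleet depends on the number of workers per node, which is not stated here.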
With this fix, when any single Gitaly server fails, the Prometheus endpoint returns after about 10 seconds with the correct metrics and a 200 response. https://gitlab.com/gitlab-org/gitlab-ce/issues/49112 will further correct this behaviour by detaching healthchecking from Prometheus scraping.
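The general idea of bounding a health probe with a short deadline can be sketched as follows. This is a minimal illustration only: it uses Ruby's standard `Timeout` module as a stand-in, whereas the actual change applies a timeout to the gRPC call, and `FAST_TIMEOUT` and `probe_with_deadline` are hypothetical names rather than GitLab's API.

```ruby
require 'timeout'

# Hypothetical deadline for health probes, in seconds (about 10s per this MR).
FAST_TIMEOUT = 10

# Runs the given probe block, giving up once `deadline` seconds have elapsed.
# Returns :ok on success, :timeout if the deadline passed, and :error if the
# probe raised any other StandardError, so the caller can still render metrics.
def probe_with_deadline(deadline = FAST_TIMEOUT)
  Timeout.timeout(deadline) { yield }
  :ok
rescue Timeout::Error
  :timeout
rescue StandardError
  :error
end

# A probe that hangs (like a paused Gitaly process) is cut off quickly.
status = probe_with_deadline(0.1) { sleep 1 }
puts status
```

The point of returning a status instead of raising is that one unresponsive server is reported as unhealthy while the rest of the metrics response is still produced.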
Does this MR meet the acceptance criteria?
- Changelog entry added, if necessary
- Documentation created/updated
- API support added
- Tests added for this feature/bug
- Conform by the code review guidelines
  - Has been reviewed by a UX Designer
  - Has been reviewed by a Frontend maintainer
  - Has been reviewed by a Backend maintainer
  - Has been reviewed by a Database specialist
- Conform by the merge request performance guides
- Conform by the style guides
- Conform by the database guides
- If you have multiple commits, please combine them into a few logically organized commits by squashing them
- Internationalization required/considered
- End-to-end tests pass (`package-and-qa` manual pipeline job)
What are the relevant issue numbers?
- Fixes https://gitlab.com/gitlab-org/gitlab-ce/issues/49116
- Reduces the impact of https://gitlab.com/gitlab-org/gitlab-ce/issues/49112
Edited by Rémy Coutable