Tweaks to setting HTTP client timeouts
What does this MR do?
Refs #233109 (closed)
Several weeks ago we noticed that Prometheus queries running in Usage Ping timed out only after blocking for up to 1 minute, a very long time to wait for a response from the server. This causes head-of-line blocking, since everything else in UP needs to wait for these queries to either return successfully or fail, since everything in UP runs sequentially.
In a somewhat knee-jerk response to this we then set request timeouts in PrometheusClient
itself to fairly strict values (5 seconds connection timeout, 10 seconds read timeout.) However, Usage Ping is not the only component using PrometheusClient
, so this had knock-on effects on other parts of the product that use Prometheus (self-monitoring for instance) that started to time out more frequently.
Moreover, we are currently running an experiment to set default timeouts directly at the HTTP client level, which PrometheusClient
is built on, and several other services such as web-hooks. All of this leads to a somewhat fragmented and confusing situation.
This MR tries to mitigate this by:
- Unblocking the HTTP default timeouts feature to go live by removing the strict timeouts from
PrometheusClient
and instead:- Restoring the previous 60 seconds timeout in
PrometheusService
, which builds on top ofPrometheusClient
, and which is used by features like self-monitoring, but not Usage Ping - In Usage Ping, simply rely on the new HTTP default timeouts, which are stricter than 60s, but not as strict as the ones we originally put in place. We are also working on !38293 (merged), which will short-circuit data collection as soon as we start hitting timeouts, so as to unclog the pipes
- Restoring the previous 60 seconds timeout in
I also included some quality of life improvements like logging the time we had to wait for webhook requests to fail to application logs, since timeouts appear to happen quite frequently when invoking webhooks.
Does this MR meet the acceptance criteria?
Conformity
-
Changelog entry - [-] Documentation (if required)
-
Code review guidelines -
Merge request performance guidelines -
Style guides - [-] Database guides
- [-] Separation of EE specific content
Availability and Testing
-
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process. - [-] Tested in all supported browsers
- [-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done