Tweaks to setting HTTP client timeouts (!39318) · Merge requests · GitLab.org / GitLab

Matthias Käppler requested to merge 233109-http-timeout-handling-improvements into master Aug 12, 2020

What does this MR do?

Several weeks ago we noticed that Prometheus queries running in Usage Ping timed out only after blocking for up to 1 minute, a very long time to wait for a response from the server. This causes head-of-line blocking, since everything else in UP needs to wait for these queries to either return successfully or fail, since everything in UP runs sequentially.

In a somewhat knee-jerk response to this we then set request timeouts in PrometheusClient itself to fairly strict values (5 seconds connection timeout, 10 seconds read timeout.) However, Usage Ping is not the only component using PrometheusClient, so this had knock-on effects on other parts of the product that use Prometheus (self-monitoring for instance) that started to time out more frequently.

Moreover, we are currently running an experiment to set default timeouts directly at the HTTP client level, which PrometheusClient is built on, and several other services such as web-hooks. All of this leads to a somewhat fragmented and confusing situation.

This MR tries to mitigate this by:

Unblocking the HTTP default timeouts feature to go live by removing the strict timeouts from PrometheusClient and instead:
- Restoring the previous 60 seconds timeout in PrometheusService, which builds on top of PrometheusClient, and which is used by features like self-monitoring, but not Usage Ping
- In Usage Ping, simply rely on the new HTTP default timeouts, which are stricter than 60s, but not as strict as the ones we originally put in place. We are also working on !38293 (merged), which will short-circuit data collection as soon as we start hitting timeouts, so as to unclog the pipes

I also included some quality of life improvements like logging the time we had to wait for webhook requests to fail to application logs, since timeouts appear to happen quite frequently when invoking webhooks.

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
[-] Tested in all supported browsers
[-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done

Edited Aug 12, 2020 by Matthias Käppler

Tweaks to setting HTTP client timeouts

What does this MR do?

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Merge request reports