Skip to content

Decrease the default fast statement timeout

Bob Van Landuyt requested to merge bvl-search-count-optimizations into master

What does this MR do and why?

Decreasing from 5s to 4.5s will allow callers that gracefully handle these exceptions some time to complete within the low-urgency request duration.

This is currently used by the SearchController#count action to count records within a user defined scope. This endpoint is the biggest contributor to the error budget spend for groupglobal search based on the old 5s fixed threshold: https://log.gprd.gitlab.net/goto/5774d0b0-93e5-11ec-9dd2-93d354bef8e7

This shows that most (>82%) of the requests exceeding the 5s duration would result in a 408: https://log.gprd.gitlab.net/goto/c6a635c0-93e8-11ec-a649-b7cbb8e4f62e This means that for these requests we've occupied a puma-worker and the database for 5s without accomplishing anything. It also means that we would terminate 17% of those slow requests earlier while they might still have produced results. But since we gracefully handle those timeouts, I think this is an acceptable outcome: The user wasn't waiting for those anyway.

When lowering this statement timeout for the SearchController#count, we are not limiting the sum of all queries, but rather each query. So, the issue count that could query twice, might still succeed but cause an apdex violation: https://gitlab.com/gitlab-org/gitlab/blob/bb9480ad4cb5b7852b61d4026323ff5b9b75c228/lib/gitlab/search_results.rb#L88

It is also used by`Gitlab::IssuablesCountForState`

This for counting merge requests/issues or epics by state when searching on their respective dashboards. This use is instrumented and hasn't been exceeding the statement timeout all too often:

image

Thanos

This also updates the urgency for the SearchController#count endpoint so that it remains at the 5s target duration when we switch to the new configurable target duration: https://docs.gitlab.com/ee/development/application_slis/rails_request_apdex.html

Isn't this gaming the system? No, I don't think so: it's improving the accuracy of the SLI measuring request duration. The 5s statement timeout was added as a safeguard to not hold a database connection for a query that we won't be able to complete. However, in some cases it is acceptable to not return results with very minor user impact. This means that there is less noise in the signal for when things do unexpectedly slow down: we can tighten our SLO and alert sooner.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Bob Van Landuyt

Merge request reports

Loading