Investigate and optimize slow Prometheus queries
Started originally in !37924 (merged):

- @nolith started a discussion:

  > Thanks @mkaeppler 💚 I'm leaving a comment to make sure we have a follow-up to evaluate the benefits of this change and consider lowering those numbers again.

I do agree with you, those numbers are really generous.
After using the stricter timeouts in `PrometheusClient` for a while, other issues have crept up, and we have now moved those default timeouts further down the stack into `Gitlab::HTTP` (see the sketch after this paragraph). It appears, however, that `PrometheusService` and `Clusters::Applications::Prometheus`, which sit on top of `PrometheusClient` and are used e.g. by GitLab Self-Monitoring, are still experiencing timeouts, so I have rolled the feature flag back for now.
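For illustration, here is a minimal sketch of what applying default timeouts in a shared HTTParty-style wrapper can look like. The class name, constant, and values are assumptions for the example, not the actual `Gitlab::HTTP` implementation:

```ruby
require 'httparty'

# Illustrative only: a shared HTTP wrapper that applies default timeouts to
# every request, in the spirit of pushing them down the stack as described
# above. Class name, constant, and values are assumptions, not Gitlab::HTTP.
class SketchHTTP
  include HTTParty

  DEFAULT_TIMEOUT_OPTIONS = {
    open_timeout: 10, # seconds allowed for establishing the connection
    read_timeout: 20  # seconds allowed for reading the response
  }.freeze

  # GET with the defaults merged in; explicit caller options take precedence.
  def self.get_with_defaults(path, options = {})
    get(path, DEFAULT_TIMEOUT_OPTIONS.merge(options))
  end
end
```

The point of merging at the lowest layer is that every caller inherits the limits unless it deliberately overrides them, instead of each service class remembering to pass its own.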
A common offender appears to be `DeploymentQuery`, as seen here: https://sentry.gitlab.net/gitlab/gitlabcom/issues/1748090/?query=is%3Aunresolved%20ReactiveCachingWorker
But there could be others, especially since a lot of application code uses `try_get`, a non-throwing variant of the HTTP GET call that logs errors and returns nil instead of raising. I have already extended our logging logic to capture all occurrences, regardless of which variant is used.
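For illustration, a minimal sketch of a non-throwing GET along the lines of `try_get` as described above. The specific error classes and logger here are assumptions, not the actual code:

```ruby
require 'httparty'
require 'logger'

LOGGER = Logger.new($stdout)

# Illustrative try_get-style helper: rescue transport errors, log them, and
# return nil so failures never raise but remain visible in the logs.
def try_get(url, options = {})
  HTTParty.get(url, options)
rescue Net::OpenTimeout, Net::ReadTimeout, SocketError, Errno::ECONNREFUSED => e
  LOGGER.warn("GET #{url} failed: #{e.class}: #{e.message}")
  nil # callers must handle the missing response
end
```

This is exactly why silent timeouts can hide: callers receive nil and carry on, so only the logging layer reveals how often the underlying queries are failing.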
I think we should have the respective teams investigate why these queries time out so often, and improve them so they complete in about a second or less.
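One way to sanity-check a candidate query against that budget is the Prometheus HTTP API's `timeout` parameter, which aborts evaluation server-side past the given limit. A hedged sketch (the URL and query string are placeholders):

```ruby
require 'httparty'

# Run a suspect query with a hard one-second server-side budget. Prometheus
# answers 503 with errorType "timeout" when the budget is exceeded, which
# makes slow queries easy to spot when looping over candidates.
response = HTTParty.get(
  'http://prometheus.example:9090/api/v1/query',
  query: {
    query: 'avg(rate(http_requests_total[5m]))', # the slow query under test
    timeout: '1s'                                # fail fast past one second
  }
)

puts "#{response.code}: #{response.body}"
```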