Change manage events metric counter method
What does this MR do and why?
Relates to #343679 (closed)
Context
In multiple recent runs of manual SaaS Service Ping, metric failures have caused UMAU to be null for our SaaS customer base. This is a highly important metric for the business, so the failure should be treated as a priority1 severity1 issue.
Underlying problems
- The underlying relation has an extremely large number of rows: 1,343,524,676. Because the query is a distinct count, it uses the default batch size of 10,000, while the batch counter has a max loops limit of 10,000. Before each query is executed, the counter checks that no constraint is violated. If any constraint is not met, the query is discarded and the fallback value is returned.
For the affected metric, dividing the number of rows by the batch size gives a result that surpasses the max loops limit: 1343524676 / 10000 = 134,352.4676
The advised corrective action is to use the estimated batch counter, since it has a higher max volume limit of 4,000,000,000.
- The underlying query
SELECT COUNT(DISTINCT events.author_id) FROM events
performs a distinct count on a non-unique attribute, so it might suffer from the unbalanced batching problem (a more detailed explanation is available in the documentation). Using the estimated batch counter instead of the currently used ordinary batch counter might address the problem.
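The loop-limit arithmetic from the first point can be sketched as follows. This is only an illustration using the figures quoted above and the same simplification (row count divided by batch size). The constants and the helpers `loops_needed` and `exceeds_loop_limit?` are hypothetical names, not the actual batch counter implementation.

```ruby
# Figures quoted in this description; they are hard-coded here for
# illustration and are not read from the GitLab codebase.
BATCH_SIZE = 10_000
MAX_LOOPS  = 10_000
TOTAL_ROWS = 1_343_524_676

# Number of batch queries needed to cover the whole range
# (integer ceiling division).
def loops_needed(total_rows, batch_size)
  (total_rows + batch_size - 1) / batch_size
end

# Hypothetical check: true when the required number of loops is above
# the limit, which is the case where the fallback value is returned.
def exceeds_loop_limit?(total_rows, batch_size, max_loops)
  loops_needed(total_rows, batch_size) > max_loops
end

loops = loops_needed(TOTAL_ROWS, BATCH_SIZE)
puts "loops needed: #{loops}, limit: #{MAX_LOOPS}"
puts "fallback returned: #{exceeds_loop_limit?(TOTAL_ROWS, BATCH_SIZE, MAX_LOOPS)}"
```

With these numbers the required loop count is more than thirteen times the limit, which is consistent with the metric falling back to a null value.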
Database
This MR affects two metrics, all-time and monthly. Below are links to postgres.ai with the old and new queries:
- Old monthly query https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7026/commands/24871
- New monthly query https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7026/commands/24875
- Old all-time query https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/7026/commands/24877
- New all-time query https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/7026/commands/24879
The old queries take about 5.5 s to execute, while the new ones take between 1 and 1.5 s.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.