[Error Budget] Improve `Groups::UsageQuotasController#index` performance
Summary
The grouputilization error budgets are being impacted by the performance of the Groups::UsageQuotasController#index
action.
Example apdex failures from Kibana:
Proposal
The bulk of the time of these failures is time spent in DB, so we can focus our attention on that and see if there are areas we can improve the query performance, or make related changes (e.g. reducing how often we need to make the query via caching).
DB queries
In this view, the largest offending queries being executed are:
Query details
```On a cold cache, this query is very slow and for our larger customers, the request probably times out:
Time: 1.910 min
- planning: 11.657 ms
- execution: 1.910 min
- I/O read: 2.478 min
- I/O write: 0.000 ms
Shared buffers:
- hits: 327554 (~2.50 GiB) from the buffer pool
- reads: 92754 (~724.60 MiB) from the OS file cache, including disk I/O
- dirtied: 6477 (~50.60 MiB)
- writes: 0
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/23430/commands/75311
It performs better on a warm cache, but still not great: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/23430/commands/75312
Query details
This goes through the full_user_count method, which performs poorly for huge group hierarchies:
Time: 19.771 s
- planning: 11.245 ms
- execution: 19.759 s
- I/O read: 21.140 s
- I/O write: 0.000 ms
Shared buffers:
- hits: 160267 (~1.20 GiB) from the buffer pool
- reads: 10814 (~84.50 MiB) from the OS file cache, including disk I/O
- dirtied: 357 (~2.80 MiB)
- writes: 0
https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/23430/commands/75314
It performs better on a warm cache: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/23430/commands/75318
The class being used performs three similar queries, all with similar performance as they go via the billed_user_finder
:
these queries have been tested against one of our largest known group hierarchies, they're certainly worst-case scenarios
Additional notes
The likelihood is that there won't be much that can be done to improve the query performance, it's known that they're slow and inefficient.
We might see significant improvement when initiatives like https://gitlab.com/gitlab-org/customers-gitlab-com/-/issues/7130+ move forward.
For now, we can look into caching and other related improvements:
Generating a pending member count
Currently, the query is made to determine whether or not a link is displayed to the pending members view.
Focussing on reducing how often the query is made for a group, some options are:
# | Proposal | Estimated effort | Estimated impact/notes |
---|---|---|---|
1 | Utilise ReactiveCaching for the pending member count | Medium-high | Although this won't improve the initial query performance, subsequent queries should be much faster (for the life of the cache), we employ this same technique for the billable members count |
2 | Always display a pending members tab | Low-medium | This removes the need for the query entirely, but adds a new tab to be built on the frontend |
See #429205 (comment 1619973329) for the initial discussion/context