Clamp DB pool size to defined MIN and MAX
What does this MR do?
- Refs #36377 (closed)
- Closes https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9197
- Closes #195163
After using the current pool sizing strategy in prod for a while, we had identified several issues that led to production incidents, which I will not reiterate here (there's a long-ish trail of breadcrumbs in the referenced issue, however). The issues mainly boil down to either not having enough connections to pgbouncer available, or not having a cap in place to the requested configuration.
This MR attempts to choose a safer range for the database connection pools we use.
The previous approach was to either use the configured system concurrency, or configured pool size, whichever is larger.
The new algorithm is more intricate, introducing bounds while leaving some room for flexibility, and works as follows:
- Define headroom
h
, which is a heuristically picked number by which the pool size is increased additionally (currently:10
) - Define
MAX
which is a hard cap for the maximum number of pooled connections (currently:20
) -
t
: Obtain the current runtime concurrency (max thread count) -
u
: Obtain the pool size as configured indatabase.yml
(or 0 if not specified) - Define
MIN
asmin(t + h, MAX)
=> this is the absolute lower bound for the pool size that does not also exceed our ceiling - Define
p
by clampingu
toMIN,MAX
=> this will at least yield the already valid thread based size, at most the defined hard cap, or otherwise au
that is considered valid
p
is the final pool size.
Rationale:
- we want to establish hard limits, so that we can never slip below or above pool sizes that would either cause connection timeouts due to backlogging or swamp pgbouncer with incoming requests
- we want some level of auto-scaling this value based on how many threads we configure
- we want to give users the option to override this, but only within reasonable bounds (see above)
I believe the above approach ticks all these boxes. It is a bit complicated but I racked my brain trying to simplify it without success.
Does this MR meet the acceptance criteria?
Conformity
- [-] Changelog entry
- [-] Documentation (if required)
-
Code review guidelines - [-] Merge request performance guidelines
-
Style guides - [-] Database guides
- [-] Separation of EE specific content
Availability and Testing
-
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process. - [-] Tested in all supported browsers
-
Informed Infrastructure department of a default or new setting change, if applicable per definition of done