Skip to content

Clamp DB pool size to defined MIN and MAX

Matthias Käppler requested to merge 9197-conn-pool-lower-bound into master

What does this MR do?

After using the current pool sizing strategy in prod for a while, we had identified several issues that led to production incidents, which I will not reiterate here (there's a long-ish trail of breadcrumbs in the referenced issue, however). The issues mainly boil down to either not having enough connections to pgbouncer available, or not having a cap in place to the requested configuration.

This MR attempts to choose a safer range for the database connection pools we use.

The previous approach was to either use the configured system concurrency, or configured pool size, whichever is larger.

The new algorithm is more intricate, introducing bounds while leaving some room for flexibility, and works as follows:

  1. Define headroom h, which is a heuristically picked number by which the pool size is increased additionally (currently: 10)
  2. Define MAX which is a hard cap for the maximum number of pooled connections (currently: 20)
  3. t: Obtain the current runtime concurrency (max thread count)
  4. u: Obtain the pool size as configured in database.yml (or 0 if not specified)
  5. Define MIN as min(t + h, MAX) => this is the absolute lower bound for the pool size that does not also exceed our ceiling
  6. Define p by clamping u to MIN,MAX => this will at least yield the already valid thread based size, at most the defined hard cap, or otherwise a u that is considered valid

p is the final pool size.

Rationale:

  • we want to establish hard limits, so that we can never slip below or above pool sizes that would either cause connection timeouts due to backlogging or swamp pgbouncer with incoming requests
  • we want some level of auto-scaling this value based on how many threads we configure
  • we want to give users the option to override this, but only within reasonable bounds (see above)

I believe the above approach ticks all these boxes. It is a bit complicated but I racked my brain trying to simplify it without success.

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Edited by 🤖 GitLab Bot 🤖

Merge request reports

Loading