Consider replacing puma_worker_killer
Problem
We have known for a while that `puma_worker_killer` (PWK) is not a great solution for managing memory. It is a blunt tool that performs two tasks:
- It occasionally reaps worker processes when a fixed memory budget, measured in total process cluster RSS, is exceeded.
- It optionally reaps workers based on a timer.
Neither approach is ideal, for several reasons:
- RSS is a poor measure of real memory use: it does not account for pages shared between processes, and it counts memory that may be dormant and could be returned to the OS kernel before the node actually runs out of memory. This means PWK over-estimates how much of a node's memory budget is actually being utilized.
- These memory limits are static and not easily tweaked. Some of the values are hard-coded in `gitlab-rails`, others are configurable but spread out across several repositories (Omnibus, charts).
- The memory limits need to be tuned in lockstep with other configuration that is memory-bound, such as the calculation in Omnibus for how many Puma workers it should run by default, but also with the system requirements for the various reference architectures we propose.
Proposal
In &8105 we found that a primary reason for memory use going up over time in Puma is Ruby heap fragmentation. You can read about how this manifests in detail here.
Rather than allocating a fixed amount of RSS to Puma processes, I suggest we look at heap utilization instead. High memory use is not a bad thing as long as that memory is used efficiently; it just means we are expanding into what the system makes available to us. It is only when memory gets saturated that we run into problems.
With that in mind, I think we should consider writing a new memory watchdog for Puma that optimizes for the following instead:
1. Maintain high heap utilization in Puma workers.
2. Avoid saturating node memory by expanding too far into what is available.
I would think of 1 as "herding" processes over time to maintain order, and of 2 as a precautionary kill switch in case 1 fails to keep memory use steady.
For 1, I think we can look at heap fragmentation as the inverse of heap utilization: a Ruby heap with a high degree of fragmentation is poorly utilized. To act on this, we could regularly sample a heap fragmentation metric; if fragmentation exceeds a desired level, the memory watchdog would reap that process. A rough sketch of what that could look like follows below.
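As an illustration only, here is one way such a metric could be derived from `GC.stat` in MRI. The formula, the threshold, and the sampling interval are assumptions for the sketch, not decided values:

```ruby
# Sketch: estimate heap fragmentation as the share of eden heap slots
# not occupied by live objects (0 = fully utilized, values near 1 =
# heavily fragmented). Formula and threshold are illustrative only.
SLOTS_PER_HEAP_PAGE = GC::INTERNAL_CONSTANTS[:HEAP_PAGE_OBJ_LIMIT]

def heap_fragmentation
  stat = GC.stat
  live_slots = stat[:heap_live_slots].to_f
  eden_slots = stat[:heap_eden_pages] * SLOTS_PER_HEAP_PAGE
  1.0 - (live_slots / eden_slots)
end

FRAGMENTATION_LIMIT = 0.5 # placeholder; would need tuning

# Each worker samples itself in a background thread and asks to be
# restarted once fragmentation exceeds the limit (Puma's primary
# respawns workers that exit).
Thread.new do
  loop do
    sleep 60
    Process.kill('TERM', Process.pid) if heap_fragmentation > FRAGMENTATION_LIMIT
  end
end
```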
For 2, i.e. if memory use expands even in the face of low fragmentation, we can still try to preempt the Linux OOM killer and prevent node memory from saturating by applying a maximum threshold. I think this threshold should not be based on RSS but rather on PSS (Proportional Set Size). This has the following benefits:
- It accounts for resident memory that is shared between all Puma processes, including the primary.
- It scales with the number of processes running. If there are `N` Puma processes, then adding an additional worker automatically reduces each worker's PSS, since every shared page is now attributed proportionally across `N+1` processes. For example, 200 MB shared across 4 processes adds 50 MB to each worker's PSS; with a 5th worker, that drops to 40 MB.
This requires less fine-tuning than the current configuration, where we budget the primary and worker processes separately. Since it is a last resort, I would also expect it to need less frequent tuning, or it could even be set automatically by allocating a fixed percentage of node memory to Puma at runtime; see the sketch below.
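For reference, on Linux (4.14+) a process's PSS can be read from `/proc/<pid>/smaps_rollup`. Below is a minimal sketch of how the watchdog could check a node-relative budget; the 80% allocation and the helper names are hypothetical:

```ruby
# Sketch: read a process's PSS in kilobytes (Linux 4.14+).
def pss_kb(pid = Process.pid)
  File.foreach("/proc/#{pid}/smaps_rollup") do |line|
    return line.split[1].to_i if line.start_with?('Pss:')
  end
end

# Total node memory in kilobytes, from /proc/meminfo.
def total_node_memory_kb
  File.foreach('/proc/meminfo') do |line|
    return line.split[1].to_i if line.start_with?('MemTotal:')
  end
end

# Hypothetical last-resort check: is the summed PSS of all Puma
# processes above a fixed share of node memory? The 80% figure is a
# placeholder for whatever percentage we would allocate to Puma.
PUMA_MEMORY_BUDGET_KB = (total_node_memory_kb * 0.8).to_i

def over_budget?(puma_pids)
  puma_pids.sum { |pid| pss_kb(pid) } > PUMA_MEMORY_BUDGET_KB
end
```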
Measuring success
I think we can measure success by:
- Getting to a steadier, flatter memory curve in Puma workers
- Reducing the number of customer tickets reporting node availability issues due to over-eager PWK kills
- Reducing the complexity and frequency of maintaining worker killer config