Default Puma `per_worker_max_memory_mb` is too low (>13.5)
Summary
A customer invoked emergency support today because GitLab was not usable (very slow). GitLab team members can read more in the ticket.
This is a single virtual machine deployment, which the customer had increased to 64 GB RAM and 32 CPUs and then rebooted. When I joined the call:
- Load average was ~48-50 (that is, the number of threads requesting CPU exceeded the CPU count).
- Memory use was about 28 GB of disk cache and 6 GB free, with the rest used for process RSS (plenty of available memory).
- Reviewing `puma_stdout.log`, every time PumaWorkerKiller polled, it killed a worker.
- The situation became critical during APAC hours on Tuesday; they'd upgraded to 13.12 the previous weekend.
- Looking back at `puma_stdout.log`, there were no `PumaWorkerKiller: Out of memory.` errors prior to the upgrade. In their case, it looks like there might be a regression in 13.12.
I suggested that they set `per_worker_max_memory_mb` to `1300` and reconfigure. After this, the load average settled down to below 20, and there were no further `PumaWorkerKiller: Out of memory.` events.
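For an Omnibus installation, that change is a one-line edit to `/etc/gitlab/gitlab.rb` (the `puma['per_worker_max_memory_mb']` key; `1300` is the value suggested above):

```ruby
# /etc/gitlab/gitlab.rb
# Raise the PumaWorkerKiller per-worker memory threshold to 1300 MB.
puma['per_worker_max_memory_mb'] = 1300
```

Then apply it with `sudo gitlab-ctl reconfigure`, which restarts Puma with the new threshold.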
We document how to change this value but not how to determine an appropriate value.
I selected this value on the basis that:
- a rolling restart occurs periodically to do any required or recommended housekeeping
- GitLab.com has a value of 1342 set.
- `PumaWorkerKiller` triggers at 98% of the configured value (so the effective value is `1274`).
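As a sanity check on those numbers, a minimal sketch, assuming the total budget PumaWorkerKiller enforces is simply `workers * per_worker_max_memory_mb`, scaled by the 98% trigger point:

```ruby
# Effective kill thresholds implied by the 98% trigger point.
# Assumption: PumaWorkerKiller's RAM budget is workers * per_worker_max_memory_mb.
per_worker_max_memory_mb = 1300
percent_usage = 0.98

effective_per_worker_mb = per_worker_max_memory_mb * percent_usage
# 1300 MB * 0.98 = 1274 MB per worker

workers = 4
effective_total_mb = workers * per_worker_max_memory_mb * percent_usage
# 4 * 1300 MB * 0.98 = 5096 MB total
```

The same arithmetic against the `max: 4798.08 mb` figure in the log excerpt below recovers that instance's (lower) configured budget.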
Restarting Puma workers appears not to be cheap, so `per_worker_max_memory_mb` should be tuned so that the benefit of `PumaWorkerKiller` recycling workers exceeds the cost of doing so.
I've also noticed workers being recycled at very high frequency in other tickets (all internal links):
- https://gitlab.zendesk.com/agent/tickets/215193 (13.10; I also recommended increasing it)
- (removed: they're still on 13.1, so the original 850 MB threshold applies)
What is the current bug behavior?
Excessive interventions by `PumaWorkerKiller` recycling workers, such as on every single 20-second check.
What is the expected correct behavior?
- `PumaWorkerKiller` intervenes only occasionally.
- We document expectations, such as: a worker would be expected to run for at least an hour, or at least 3 hours.
Relevant logs and/or screenshots
From: https://gitlab.zendesk.com/agent/tickets/221213
This illustrates a worker being killed every time `PumaWorkerKiller` checks. It's from a 13.1 instance, so the absolute numbers differ from later defaults.
```json
{"timestamp":"2021-06-29T10:05:53.427Z","pid":70409,"message":"- Worker 0 (PID: 26140) booted, phase: 0"}
{"timestamp":"2021-06-29T10:06:12.805Z","pid":70409,"message":"PumaWorkerKiller: Out of memory. 4 workers consuming total: 4810.1484375 mb out of max: 4798.08 mb. Sending TERM to pid 26022 consuming 997.9453125 mb."}
{"timestamp":"2021-06-29T10:06:13.491Z","pid":70409,"message":"- Worker 2 (PID: 26193) booted, phase: 0"}
{"timestamp":"2021-06-29T10:06:32.806Z","pid":70409,"message":"PumaWorkerKiller: Out of memory. 4 workers consuming total: 4805.3203125 mb out of max: 4798.08 mb. Sending TERM to pid 25743 consuming 1001.7265625 mb."}
{"timestamp":"2021-06-29T10:06:33.458Z","pid":70409,"message":"- Worker 3 (PID: 26273) booted, phase: 0"}
{"timestamp":"2021-06-29T10:06:52.807Z","pid":70409,"message":"PumaWorkerKiller: Consuming 4741.6484375 mb with master and 4 workers."}
{"timestamp":"2021-06-29T10:07:12.808Z","pid":70409,"message":"PumaWorkerKiller: Out of memory. 4 workers consuming total: 4827.26171875 mb out of max: 4798.08 mb. Sending TERM to pid 26140 consuming 995.2265625 mb."}
{"timestamp":"2021-06-29T10:07:13.424Z","pid":70409,"message":"- Worker 0 (PID: 26380) booted, phase: 0"}
{"timestamp":"2021-06-29T10:07:32.809Z","pid":70409,"message":"PumaWorkerKiller: Out of memory. 4 workers consuming total: 4887.19140625 mb out of max: 4798.08 mb. Sending TERM to pid 26193 consuming 1000.94921875 mb."}
{"timestamp":"2021-06-29T10:07:33.419Z","pid":70409,"message":"- Worker 2 (PID: 26476) booted, phase: 0"}
{"timestamp":"2021-06-29T10:07:52.810Z","pid":70409,"message":"PumaWorkerKiller: Out of memory. 4 workers consuming total: 4891.7109375 mb out of max: 4798.08 mb. Sending TERM to pid 26273 consuming 1010.65625 mb."}
{"timestamp":"2021-06-29T10:07:53.520Z","pid":70409,"message":"- Worker 3 (PID: 26668) booted, phase: 0"}
{"timestamp":"2021-06-29T10:08:12.811Z","pid":70409,"message":"PumaWorkerKiller: Out of memory. 4 workers consuming total: 4878.02734375 mb out of max: 4798.08 mb. Sending TERM to pid 26081 consuming 1017.703125 mb."}
{"timestamp":"2021-06-29T10:08:13.435Z","pid":70409,"message":"- Worker 1 (PID: 26920) booted, phase: 0"}
{"timestamp":"2021-06-29T10:08:32.812Z","pid":70409,"message":"PumaWorkerKiller: Out of memory. 4 workers consuming total: 4855.03515625 mb out of max: 4798.08 mb. Sending TERM to pid 26380 consuming 1004.70703125 mb."}
{"timestamp":"2021-06-29T10:08:33.379Z","pid":70409,"message":"- Worker 0 (PID: 27147) booted, phase: 0"}
{"timestamp":"2021-06-29T10:08:52.813Z","pid":70409,"message":"PumaWorkerKiller: Out of memory. 4 workers consuming total: 4842.765625 mb out of max: 4798.08 mb. Sending TERM to pid 26920 consuming 1000.75390625 mb."}
{"timestamp":"2021-06-29T10:08:53.439Z","pid":70409,"message":"- Worker 1 (PID: 27351) booted, phase: 0"}
{"timestamp":"2021-06-29T10:09:12.815Z","pid":70409,"message":"PumaWorkerKiller: Out of memory. 4 workers consuming total: 4871.23828125 mb out of max: 4798.08 mb. Sending TERM to pid 26668 consuming 1001.00390625 mb."}
{"timestamp":"2021-06-29T10:09:13.459Z","pid":70409,"message":"- Worker 3 (PID: 27545) booted, phase: 0"}
{"timestamp":"2021-06-29T10:09:32.816Z","pid":70409,"message":"PumaWorkerKiller: Out of memory. 4 workers consuming total: 4924.44921875 mb out of max: 4798.08 mb. Sending TERM to pid 27351 consuming 1015.8828125 mb."}
{"timestamp":"2021-06-29T10:09:33.474Z","pid":70409,"message":"- Worker 1 (PID: 27618) booted, phase: 0"}
{"timestamp":"2021-06-29T10:09:52.818Z","pid":70409,"message":"PumaWorkerKiller: Out of memory. 4 workers consuming total: 4883.78515625 mb out of max: 4798.08 mb. Sending TERM to pid 26476 consuming 1013.671875 mb."}
{"timestamp":"2021-06-29T10:09:53.418Z","pid":70409,"message":"- Worker 2 (PID: 27690) booted, phase: 0"}
```
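To quantify the kill cadence from `puma_stdout.log` rather than eyeballing it, a minimal sketch (the `mean_kill_interval` helper is hypothetical, not part of GitLab; it assumes one JSON event per line, as in the excerpt above):

```ruby
require 'json'
require 'time'

# Given puma_stdout.log lines (one JSON event per line), return the mean
# number of seconds between PumaWorkerKiller "Out of memory" kill events,
# or nil if there were fewer than two kills.
def mean_kill_interval(lines)
  kills = lines.filter_map do |line|
    event = JSON.parse(line) rescue nil  # skip non-JSON lines
    next unless event && event['message'].to_s.include?('PumaWorkerKiller: Out of memory')
    Time.parse(event['timestamp'])
  end
  return nil if kills.size < 2
  kills.each_cons(2).sum { |a, b| b - a } / (kills.size - 1)
end
```

For the excerpt above this comes out at roughly 20 seconds, i.e. a kill on essentially every check.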