Fix check for DISABLE_PUMA_WORKER_KILLER environment variable
What does this MR do and why?
See #365954 (closed)
This came out of an investigation into why we weren't seeing any memory killer events in SaaS. I described the issue here: #334831 (comment 966134364)
The Puma worker killer is a gem we use to reap Puma workers in case the process cluster runs above a given memory budget.
It is meant to be enabled by default, but you can disable it via the DISABLE_PUMA_WORKER_KILLER
environment variable.
Copying over the bug description:
In puma.rb, we check ENV['DISABLE_PUMA_WORKER_KILLER']
for truthiness -- that is not correct since environment variables are always strings and any string, even when empty, is truthy in Ruby:
irb(main):004:0> "" || "not set"
=> ""
This is not a problem as long as we don't set this environment at all (i.e. the expression evaluates to nil
, which is falsey) or set it to something truthy (1
, true
, foo
), but as soon as we set it to something that is (supposed to be) falsey (0
, false
), then the unless
condition will still evaluate to true, which is what triggers this bug.
(
Turns out we explicitly set this to false
in our GitLab chart: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/blob/3322ba5ff67d3ff4996222610cd8966e66a5ab35/charts/gitlab/gprd/charts/gitlab/charts/webservice/values.yaml#L112
Therefore the worker killer is neither running for users of our chart by default, nor for SaaS.
I fixed this by using Gitlab::Utils.to_boolean
on the string first. I also removed the double-negative ("unless disabled" -> "if enabled")
How to set up and validate locally
On master
you can set DISABLE_PUMA_WORKER_KILLER=0
(or "false") to trigger the bug. This ought to enable it, but it doesn't.
On this branch, it now should leave it enabled.
Risk considerations
As my analysis in #334831 (comment 964366171) shows, we sometimes run twice the RSS budget compared to what should be allowed given the current worker killer thresholds. However, these are figures when working at the extremes in SaaS; I have therefore decided to do two things to de-risk this fix:
-
Disable the PWK in chart-based deployments including SaaS. This is done via #364184 (closed). -
Increase memory killer limits. I tested these limits against our gpt
performance test suite here.
Related MRs:
- gitlab-org/build/CNG!1042 (merged)
- omnibus-gitlab!6173 (merged)
- gitlab-org/charts/gitlab!2605 (merged)
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.
Related to #334831 (closed)