Add ability to collect diagnostic reports
Problem to solve
In &8105 we identified the need to automatically collect diagnostic data from production Puma instances, such as:
- Ruby heap dumps
- jemalloc stats
- Process maps
- Any other diagnostic reports
This is to reduce reliance on SREs to produce such reports on behalf of other teams and promote a self-service approach to debugging production issues.
For the MVC, we can focus on just one of these reports, and potentially follow up with others as necessary.
I suggest starting with the recently implemented jemalloc stats report because it is small and, being in JSON format, is easy to post-process.
Proposal
In this first iteration, we will focus on producing such reports on a running Puma production instance. Collection of the reports will still require the help of an SRE to copy down those files for analysis.
We must assume that most or all of this functionality incurs a significant cost and may interfere with serving user traffic. We therefore need to make sure that these reports are collected in a way that minimizes impact on node availability. To begin with, report collection should likely be enabled only on certain nodes, guarded by an environment switch.
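As a rough illustration of the environment switch, the guard could be as small as the sketch below. The variable name `DIAGNOSTIC_REPORTS_ENABLED` and the accepted values are assumptions made for this example; the issue does not prescribe either.

```ruby
# Hypothetical helper: the switch name and accepted values are assumptions,
# not something this issue specifies.
def diagnostic_reports_enabled?(env = ENV, switch = 'DIAGNOSTIC_REPORTS_ENABLED')
  # Treat an unset or unrecognized value as "off" so nodes stay opted out by default.
  %w[1 true yes].include?(env.fetch(switch, '').strip.downcase)
end
```

Defaulting to "off" means only nodes that explicitly set the variable pay the cost of collecting reports.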
The overall goal to keep in mind here is autonomy: producing the reports must not require intervention at the node level, i.e. it must not require help from an SRE.
Requirements
- Puma workers dump a report file via some sort of signal; the signal must not be based on per-process interaction, since the goal here is to collect these reports without the need for node access.
- The signal could be based on events (such as low memory situations) or timers, or both. We should consider designing a pluggable/flexible interface to adapt any number of signals.
- At most one Puma worker dumps a report at any given time. This ensures continued availability, since some reports, such as heap dumps, block the Ruby VM.
- Ideally, no communication between nodes is required, such as distributed locks, but we could consider it if absolutely necessary.
- Report files must not pile up and potentially saturate local pod storage (heap dumps can be hundreds of MB in size). If possible, consider gzipping files and performing cleanups automatically.
- The change is guarded either by an environment switch or, perhaps better, by an ops FF.
- (Optional) The implementation also works for Sidekiq, not just Puma. While this is a stretch goal, we should avoid "designing past" Sidekiq and finding ourselves, a month from now, with a solution that is not reusable at all.
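To make the requirements above more concrete, here is a minimal sketch of how they might fit together. It is not a proposed implementation. It assumes that all workers on a node can see the same local report directory, and every name in it (`DiagnosticReports`, `IntervalTrigger`, `Writer`, the `.lock` file) is hypothetical. It shows a timer-based trigger behind a small interface, a non-blocking file lock so that at most one worker on a node writes at a time without any cross-node coordination, gzipped output, and pruning of old files.

```ruby
require 'fileutils'
require 'zlib'

# Hypothetical sketch: none of these names come from an existing codebase.
module DiagnosticReports
  # Any object that responds to #fire? can act as a trigger, so timers and
  # event-based checks (for example, low memory) can share one interface.
  class IntervalTrigger
    def initialize(interval_seconds, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
      @interval = interval_seconds
      @clock = clock
      @last_fired_at = @clock.call
    end

    # True at most once per interval.
    def fire?
      now = @clock.call
      return false if now - @last_fired_at < @interval

      @last_fired_at = now
      true
    end
  end

  class Writer
    def initialize(dir:, max_files: 3)
      @dir = dir
      @max_files = max_files
      FileUtils.mkdir_p(@dir)
    end

    # Yields a gzip IO for the report body. Returns the report path, or nil
    # when another worker on this node currently holds the lock.
    def write(name)
      File.open(File.join(@dir, '.lock'), File::RDWR | File::CREAT, 0o644) do |lock|
        # Non-blocking: a busy lock means "skip this round", not "wait".
        next nil unless lock.flock(File::LOCK_EX | File::LOCK_NB)

        stamp = Time.now.strftime('%Y%m%dT%H%M%S%L')
        path = File.join(@dir, "#{name}.#{Process.pid}.#{stamp}.gz")
        Zlib::GzipWriter.open(path) { |gz| yield gz }
        prune
        path
      end # closing the lock file releases the lock
    end

    private

    # Keeps only the newest @max_files reports so local storage stays bounded.
    def prune
      reports = Dir.glob(File.join(@dir, '*.gz')).sort_by { |file| File.mtime(file) }
      excess = reports.size - @max_files
      FileUtils.rm_f(reports.first(excess)) if excess.positive?
    end
  end
end
```

A worker could then periodically run something like `writer.write('jemalloc_stats') { |io| io.write(report_json) } if trigger.fire?`, where `report_json` stands in for whatever produces the report. Nothing in the sketch is Puma-specific, so the same pieces could in principle be reused from a Sidekiq process.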