Add ability to collect Ruby heap dumps
We found that restarting workers based on high heap fragmentation is only effective on api nodes. Moreover, we are seeing what appear to be substantial memory leaks in web nodes.
We cannot diagnose these without pulling heap dumps. Since it is difficult to time this even with help from an SRE (it would have to happen before the process dies, likely during a weekend), I think we should build heap dump collection straight into the application.
My proposal is to:
- MR1: Add a new life-cycle hook `on_worker_stop` that is called when a Puma or Sidekiq worker is about to shut down (!103372 (merged))
- MR2: Wiring: Leverage `memory-watchdog` to signal the worker that it should dump `ObjectSpace` before shutting down (!103957 (merged)). This does not yet write heap dumps.
- MR3: Refactor - extract shared logic from `ReportsDaemon` into a new `Reporter` class: !104264 (merged)
- MR4: Refactor - extract shared logic from `Jemalloc` report into `Reporter`: !104727 (merged)
- MR5: Add `gzip` support to `Reporter` file streaming logic: !105115 (merged)
- MR6: Implement `HeapDump` report method to produce an object space dump: !106406 (merged)
The uploader will then pick up the resulting dump file and put it into GCS (this was done in #362902 (closed)).
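
To illustrate what MR5 and MR6 amount to, here is a minimal sketch of writing a gzip-compressed object space dump using only the Ruby standard library. The method name `write_gzipped_heap_dump` and the two-step approach (dump to a temporary file, then compress) are assumptions made for illustration; this is not the code from the MRs above, and the real `Reporter` may stream and compress differently.

```ruby
require 'objspace'
require 'zlib'
require 'tempfile'

# Hypothetical helper for illustration only; not the implementation
# from the merge requests listed above.
def write_gzipped_heap_dump(target_path)
  Tempfile.create(['heap_dump', '.json']) do |raw|
    # ObjectSpace.dump_all writes one JSON document per line to the given IO.
    # It expects a real IO object, so we dump to a temporary file first.
    ObjectSpace.dump_all(output: raw)
    raw.flush
    raw.rewind

    # Stream the raw dump through gzip into the target file.
    Zlib::GzipWriter.open(target_path) do |gz|
      IO.copy_stream(raw, gz)
    end
  end

  target_path
end
```

Something like `write_gzipped_heap_dump('/tmp/heap_dump.json.gz')` could then be called from the `on_worker_stop` hook, leaving a file for the uploader to collect. The temporary file costs extra disk space, which may matter for large heaps; piping into an external `gzip` process would be one way to avoid that.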
Edited by Matthias Käppler