Add heap fragmentation metric
What does this MR do and why?
Related to #365252 (closed)
In &8105 we found that heap growth when the Puma worker killer is not running is primarily due to Ruby heap fragmentation. As the number of non-empty heap pages increases over time, so does RSS, and this memory can never be returned to the OS.
I think it would be useful to have a proper metric for the degree of heap fragmentation in Puma and Sidekiq processes. It can be derived from existing GC stat metrics as:
1 - (objects_alive / (heap_pages_with_live_objects * OBJECT_SLOTS_PER_HEAP_PAGE))
This yields the degree of fragmentation of the Ruby-managed heap (i.e. the object space) as a fraction between 0 and 1, which can also be read as a percentage. It does not account for memory fragmentation at the allocator / OS level, though that problem should be largely solved by using jemalloc.
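As a rough illustration of the formula above, here is a minimal sketch in plain Ruby. The method names are hypothetical and this is not necessarily how the MR implements it. The mapping of the formula's terms onto `GC.stat[:heap_live_slots]`, `GC.stat[:heap_eden_pages]` and `GC::INTERNAL_CONSTANTS[:HEAP_PAGE_OBJ_LIMIT]` is an assumption, and these keys may differ between Ruby versions.

```ruby
# Pure calculation: the fraction of slots on in-use heap pages that hold no
# live object. Clamping to 0.0..1.0 is an addition for this sketch, not part
# of the formula in the description.
def heap_fragmentation(live_slots:, pages_with_live_objects:, slots_per_page:)
  total_slots = pages_with_live_objects * slots_per_page
  return 0.0 if total_slots <= 0

  ratio = 1.0 - (live_slots.to_f / total_slots)
  ratio.clamp(0.0, 1.0)
end

# Assumed mapping of the formula's terms onto GC stats; key names are an
# assumption and may not exist on every Ruby version (fetch raises if not).
def current_heap_fragmentation
  stat = GC.stat
  heap_fragmentation(
    live_slots: stat.fetch(:heap_live_slots),
    pages_with_live_objects: stat.fetch(:heap_eden_pages),
    slots_per_page: GC::INTERNAL_CONSTANTS.fetch(:HEAP_PAGE_OBJ_LIMIT)
  )
end
```

For example, 300 live objects on a single page with 400 slots gives `1 - 300/400 = 0.25`, i.e. a quarter of the occupied page space is unused.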
Originally discussed here: &8105 (comment 990438364)
Outside of it being a useful metric to track, we may be able to utilize this to issue OOM kill events should a process continue to see heap fragmentation over extended periods of time, since this results in real and sustained memory used by the Ruby VM that we cannot return to the OS kernel.
Screenshots or screen recordings
This is from my local env. The metric is very spiky, especially during application start. We need to see how it behaves in production since a local dev env is not representative of production GC activity.
How to set up and validate locally
Just run Puma and/or Sidekiq and look at Prometheus, or pull metrics directly, e.g. via /-/metrics on Puma.
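To pull the metric without Prometheus, something like the following sketch could work. It assumes Puma is reachable at `http://localhost:3000` (a local dev default, not stated above) and that the exported metric name contains the word `fragmentation`. Both are assumptions, and the helper names are hypothetical.

```ruby
require "net/http"
require "uri"

# Keep only sample lines (not "# HELP" / "# TYPE" comments) that mention
# "fragmentation". The substring is an assumption about the metric's name.
def fragmentation_samples(metrics_text)
  metrics_text.each_line.map(&:strip).select do |line|
    !line.empty? && !line.start_with?("#") && line.include?("fragmentation")
  end
end

# Fetch the Prometheus text exposition from Puma and filter it.
# The default URL is an assumed local address; adjust host and port as needed.
def fetch_fragmentation_samples(url = "http://localhost:3000/-/metrics")
  fragmentation_samples(Net::HTTP.get(URI(url)))
end
```

Calling `puts fetch_fragmentation_samples` while Puma is running should print the matching sample lines, if any exist under that name.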
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.