Add heap fragmentation metric

Matthias Käppler requested to merge 365252-measure-heap-fragmentation into master

What does this MR do and why?

Related to #365252 (closed)

In &8105 we found that heap growth when the Puma worker killer is not running is primarily due to Ruby heap fragmentation. As the number of non-empty heap pages increases over time, so does RSS, and this memory cannot be returned to the OS as long as those pages still hold live objects.

I think it would be useful to have a proper metric for the degree of heap fragmentation in Puma and Sidekiq processes. It can be derived from existing GC stat metrics as:

1 - (objects_alive / (heap_pages_with_live_objects * OBJECT_SLOTS_PER_HEAP_PAGE))

This yields the degree of fragmentation of the Ruby-managed heap (i.e. the object space) as a ratio between 0 and 1, where 0 means every slot in every non-empty page holds a live object. It does not account for memory fragmentation at the allocator / OS level, though that problem should be largely solved by using jemalloc.
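As a rough illustration of the formula, here is a minimal sketch in Ruby. The method names are hypothetical and not taken from this MR. The mapping of the formula's terms onto `GC.stat[:heap_live_slots]`, `GC.stat[:heap_eden_pages]` and `GC::INTERNAL_CONSTANTS[:HEAP_PAGE_OBJ_LIMIT]` is an assumption about which GC stats fit best. On Ruby versions where slot sizes vary between pages, the result may only be an approximation.

```ruby
# Hypothetical helper: computes the degree of heap fragmentation from raw counts.
# Returns a Float between 0.0 (fully packed pages) and 1.0 (almost empty pages).
def heap_fragmentation(live_slots:, pages_with_live_objects:, slots_per_page:)
  capacity = pages_with_live_objects * slots_per_page
  # Avoid dividing by zero when no pages are reported.
  return 0.0 if capacity.zero?

  1 - (live_slots / capacity.to_f)
end

# Assumption: eden pages stand in for "heap pages with live objects", and
# HEAP_PAGE_OBJ_LIMIT for the number of object slots per heap page.
def current_heap_fragmentation(gc_stat = GC.stat)
  heap_fragmentation(
    live_slots: gc_stat[:heap_live_slots],
    pages_with_live_objects: gc_stat[:heap_eden_pages],
    slots_per_page: GC::INTERNAL_CONSTANTS[:HEAP_PAGE_OBJ_LIMIT]
  )
end
```

For example, a single page with 400 slots of which 300 hold live objects gives `heap_fragmentation(live_slots: 300, pages_with_live_objects: 1, slots_per_page: 400)`, which evaluates to 0.25.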

Originally discussed here: &8105 (comment 990438364)

Besides being a useful metric to track, we may be able to use this to issue OOM kill events should a process continue to see high heap fragmentation over extended periods of time, since this results in real and sustained memory use by the Ruby VM that we cannot return to the OS kernel.
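To make the idea of "fragmentation over extended periods of time" concrete, here is one possible sketch. It is not part of this MR. The class name, the threshold and the window size are all hypothetical, and what to do when the condition holds (e.g. emitting a kill event) is left open.

```ruby
# Hypothetical helper: reports true once every sample in a sliding window
# of the most recent observations exceeds a fragmentation threshold.
class SustainedFragmentationWatcher
  def initialize(threshold:, window_size:)
    @threshold = threshold
    @window_size = window_size
    @samples = []
  end

  # Records one fragmentation sample (a ratio between 0 and 1) and returns
  # true only if the window is full and all samples in it are above the threshold.
  def observe(fragmentation)
    @samples << fragmentation
    @samples.shift if @samples.size > @window_size

    @samples.size == @window_size && @samples.all? { |sample| sample > @threshold }
  end
end
```

Requiring every sample in the window to exceed the threshold means a single low reading resets the condition. Given how spiky the metric appears locally, that may or may not be the right trade-off.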

Screenshots or screen recordings

This is from my local env. The metric is very spiky, especially during application start. We need to see how it behaves in production since a local dev env is not representative of production GC activity.

Screenshot_from_2022-06-30_16-45-09

How to set up and validate locally

Run Puma and/or Sidekiq and look at the metric in Prometheus, or pull metrics directly, e.g. via /-/metrics on Puma.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Matthias Käppler
