Trigger stackprof by sending a SIGUSR2 signal
What does this MR do?
This is a proposed implementation of #225473 (closed).
Problem statement
It's currently quite difficult to see which Ruby code is spending a lot of time on CPU, and to do so safely in production (e.g. gitlab.com).
Proposed solution
It would be great to have a low-overhead sampling profiler such as stackprof available in production.
Implementation notes
The main design decision is to use a SIGUSR2 signal to trigger profiling. This was chosen for a few reasons:
- By using a signal instead of an endpoint, it can be applied to both Rails and Sidekiq processes.
- Since we run many web processes via puma cluster, this allows targeting specific worker processes, or all of them at once (pkill -USR2 puma:).
- Synthetic per-endpoint profiling (possibly even using the StackProf.run API) is very misleading in a production environment, because it implies that only a single "request" is being profiled, whereas stacks from the entire process are being sampled. By modelling profiling as a process-wide time-based (or manually stoppable) operation, the UI is aligned with the implementation.
The SIGUSR2 signal was chosen as it is the same one recommended by gperftools, the ancestor of stackprof. There is a minor clash with puma signals, but I believe this to be acceptable as SIGUSR1 has the same behaviour in our configuration.
As a consequence of using a signal, we need to make the code both interrupt-safe and thread-safe. We can use a pipe as a signalling mechanism, and handle the profiling in a separate thread.
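To illustrate the pipe approach, here is a minimal sketch (not the code in this MR). The SignalPipe class and its method names are hypothetical. The trap handler only writes a single byte, and a regular thread picks it up and does the actual work outside of signal context.

```ruby
# Self-pipe sketch: the trap handler only writes one byte; a regular
# thread does the real work outside of signal context.
# SignalPipe is a hypothetical name used for illustration only.
class SignalPipe
  attr_reader :thread

  # signal: name of the signal to trap, e.g. "USR2"
  # block:  work to run on the worker thread for every signal received
  def initialize(signal, &block)
    @signal = signal
    @reader, @writer = IO.pipe
    @thread = Thread.new { run(&block) }

    # Keep the handler minimal: no locks, no heavy work.
    # write_nonblock never blocks the interrupted thread.
    @previous = Signal.trap(signal) do
      begin
        @writer.write_nonblock(".")
      rescue IO::WaitWritable, IOError, Errno::EPIPE
        # Pipe full or closed: dropping a wake-up is acceptable here.
      end
    end
  end

  # Restore the previous handler and stop the worker thread.
  def close
    Signal.trap(@signal, @previous || "DEFAULT")
    @writer.close
    @thread.join
    @reader.close
  end

  private

  def run
    # read(1) blocks until a byte arrives and returns nil at EOF.
    while @reader.read(1)
      yield
    end
  end
end
```

Because the handler never takes a lock, it is safe to run at any interrupt point, and everything that is not interrupt-safe stays on the worker thread.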
Stackprof works by setting a timer which collects stacks at a given frequency. The default frequency is 1 kHz (1000 samples per second). I lowered it to 100 Hz (100 samples per second), but this can be overridden via an env variable.
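As a rough sketch of the frequency handling: StackProf's :cpu mode takes its sampling interval in microseconds, so 100 Hz corresponds to an interval of 10,000 µs. The env variable name STACKPROF_FREQUENCY_HZ and the helper below are assumptions made for illustration, not necessarily what this MR uses.

```ruby
# Convert a sampling frequency in Hz into a sampling interval in
# microseconds, the unit StackProf's :cpu mode expects.
DEFAULT_FREQUENCY_HZ = 100

# env is injectable so the helper can be exercised without touching ENV.
# "STACKPROF_FREQUENCY_HZ" is a hypothetical variable name.
def sampling_interval_usec(env = ENV)
  hz = Integer(env.fetch("STACKPROF_FREQUENCY_HZ", DEFAULT_FREQUENCY_HZ.to_s))
  raise ArgumentError, "frequency must be positive" unless hz.positive?

  1_000_000 / hz
end

# With the stackprof gem loaded, the value would be passed as:
#   StackProf.start(mode: :cpu, interval: sampling_interval_usec)
```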
The sampled stacks are held in memory until StackProf.results is called. At that point they are written out to disk, and can be garbage collected from memory.
The first SIGUSR2 starts profiling; the second SIGUSR2 stops it and writes the samples to disk. These samples can potentially use a lot of memory. To avoid unbounded growth, the profiler will time out after 30 seconds and stop automatically. This should safeguard against forgetting to stop the profile.
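The start/stop/timeout behaviour can be sketched as a small loop over the read end of a pipe, where each byte stands for one received signal. This is an illustration under assumptions: profiling_loop and its on_start/on_stop callbacks are hypothetical names, and the real implementation may be structured differently.

```ruby
# Toggle loop sketch: each byte on `reader` represents one signal.
# The first byte starts profiling; the next byte, or the timeout,
# stops it. The loop ends when the write end of the pipe is closed.
def profiling_loop(reader, timeout:, on_start:, on_stop:)
  while reader.read(1) # idle: block until the first wake-up
    on_start.call

    # Wait for a second wake-up, but no longer than `timeout` seconds.
    ready = IO.select([reader], nil, nil, timeout)
    byte = reader.read(1) if ready

    reason =
      if ready.nil? then :timeout
      elsif byte then :signal
      else :closed
      end
    on_stop.call(reason)

    break if reason == :closed
  end
end
```

In a real setup, on_start might call StackProf.start and on_stop might call StackProf.stop followed by StackProf.results, with timeout set to 30.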
Because the puma master has a process name of the shape "puma 4.3.3.gitlab.2 (unix:///Users/igor/code/gitlab-development-kit/gitlab.socket) [gitlab-puma-worker]", but the workers have "puma: cluster worker 0: 61472 [gitlab-puma-worker]", we can use puma: to select only workers.
To initiate profile capture on all puma workers, run:
$ pkill -USR2 puma:
This will profile for 30 seconds (or until a second SIGUSR2 is sent) and then write the samples out to $TMPDIR/stackprof.$PID.$RANDOM.profile.
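For illustration, an output path of that shape could be built as follows. The profile_path helper is hypothetical, and the use of SecureRandom.hex(5) for the random component is an assumption based on the ten-hex-character suffix in the example file name further down.

```ruby
require "securerandom"
require "tmpdir"

# Build a per-process output path with a random suffix, so repeated
# profiles of the same PID do not overwrite each other.
# Dir.tmpdir honours $TMPDIR when it is set.
def profile_path(dir: Dir.tmpdir, pid: Process.pid)
  File.join(dir, "stackprof.#{pid}.#{SecureRandom.hex(5)}.profile")
end
```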
These profiles can then be processed via the stackprof CLI and flamegraph.pl:
$ bundle exec stackprof --stackcollapse /tmp/stackprof.55769.c6c3906452.profile | flamegraph.pl > flamegraph.svg
This will produce a flamegraph like the one below. Unlike rbspy, it represents stacks which were on-CPU.
Screenshots
A sample flamegraph from profiling gdk locally.
Does this MR meet the acceptance criteria?
Conformity
- Changelog entry
- Documentation (if required)
- Code review guidelines
- Merge request performance guidelines
- Style guides
- Database guides
- Separation of EE specific content
Availability and Testing
- Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- Tested in all supported browsers
- Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
- Label as security and @ mention @gitlab-com/gl-security/appsec
- The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- Security reports checked/validated by a reviewer from the AppSec team
cc @mkaeppler @ayufan @andrewn @stanhu @smcgivern @cmiskell @msmiley