Run jemalloc stats report by timer
What does this MR do and why?
PoC for #362900 (closed).
Pulls a Jemalloc stats dump from every Puma worker on a regular, timer-based schedule.
FF rollout issue: #367845 (closed)
SRE rollout support issue: gitlab-com/gl-infra/delivery#2486 (closed)
Next steps
- This MR (1) goes through reviews, including SRE
- I asked John to help with the emptyDir k8s configuration for .com
- We could merge this MR when it's OK from the code perspective, as it wouldn't be enabled until we added the ENV var and flipped the FF. So it shouldn't be blocked.
- After it's merged, I'll open an MR (2) to add GITLAB_DIAGNOSTIC_REPORTS_ENABLED to our staging or canaries
- After the MR (2) with ENV vars is merged and we have redeployed, we could activate the FF, keeping an eye on the metrics (although I don't expect to see anything out of the ordinary on canary)
- When it's OK on canary, and we can confirm that reports are being generated, I will open an MR (3) to add GITLAB_DIAGNOSTIC_REPORTS_ENABLED to production. Once again we'll go with the FF activation, now on prod.
- In parallel, we could work on the reports upload feature
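The double gate described above (ENV var plus feature flag) could be sketched roughly as follows. This is only an illustration: the method name, the set of accepted truthy values, and the `flag_enabled` argument (standing in for a real feature flag lookup) are assumptions, not the code in this MR.

```ruby
# Hypothetical guard: reports run only when BOTH the ENV switch and the
# feature flag are on. The accepted truthy values are an assumption.
TRUTHY_VALUES = %w[1 true yes on].freeze

def diagnostic_reports_enabled?(env: ENV, flag_enabled: false)
  raw = env.fetch('GITLAB_DIAGNOSTIC_REPORTS_ENABLED', '')
  env_on = TRUTHY_VALUES.include?(raw.strip.downcase)

  # `flag_enabled` stands in for the real feature flag check.
  env_on && flag_enabled
end
```

With this shape, merging the code is safe before either switch is set: a missing ENV var or a disabled flag both leave reporting off.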
TODO
First iteration
- Run CPU utilization test with and without reports (and vary the frequency)
- Cover everything with specs
- Restrict the growth of the report dir (in code or in volume config or both)
- Consider adding an ENV var switch to enable reports per-node
- Ask SRE to pull Jemalloc reports from production. Both for Puma and Sidekiq workers. Note the size of the report and how long it takes to produce it. Request issue: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15993
- Consider and decide if we want to use per-process thread or a centralized master thread which would signal the workers to report (more in !91283 (comment 1011494980))
- Investigate how this affects performance. Set the aggressive frequency of the report generation, and check how it hits the resources. Via GPT or ab over a single endpoint. Results: https://docs.google.com/document/d/1ODuAvY5dKnuftgpwis3KK3-4ighpQRsma-mVp1l8cR8/edit?usp=sharing
- Make the imports folder configurable. ENV variable + safe default (e.g. /tmp)
- Try to reuse Daemon class for the implementation of the Timer
- Include logical worker id alongside PID in the report filename
- Run initializer only for Puma/Sidekiq (currently on a general initializer, which would run even in rails c)
- Investigate and fix CI failures (review-deploy is failing)
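To make several of the items above concrete (per-process timer thread, configurable folder with a safe default, worker id plus PID in the filename), here is a minimal sketch. Everything named here is hypothetical: the class, the `EXAMPLE_REPORTS_DIR` variable, the filename layout, and the block that stands in for the actual Jemalloc stats dump. It does not use the Daemon class mentioned above.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical per-process reporter. The block passed to `new` stands in
# for the real stats dump and must return the report body as a String.
class PeriodicReporter
  def initialize(interval_s:, worker_id:,
                 dir: ENV.fetch('EXAMPLE_REPORTS_DIR', Dir.tmpdir), &dump)
    @interval_s = interval_s
    @worker_id = worker_id
    @dir = dir
    @dump = dump
    @stop = false
    @mutex = Mutex.new
    @cv = ConditionVariable.new
  end

  # Filename carries the logical worker id, the PID and a timestamp.
  def report_path(now = Time.now)
    name = format('jemalloc_stats.worker_%s.%d.%d.txt',
                  @worker_id, Process.pid, now.to_i)
    File.join(@dir, name)
  end

  def write_report
    FileUtils.mkdir_p(@dir)
    path = report_path
    File.write(path, @dump.call)
    path
  end

  # Background thread: wait for the interval (or a stop signal), then report.
  def start
    @thread = Thread.new do
      @mutex.synchronize do
        until @stop
          @cv.wait(@mutex, @interval_s)
          write_report unless @stop
        end
      end
    end
  end

  def stop
    @mutex.synchronize do
      @stop = true
      @cv.signal
    end
    @thread&.join
  end
end
```

Waiting on a condition variable with a timeout, rather than a plain `sleep`, lets `stop` wake the thread immediately instead of blocking for up to one full interval.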
Stretch goals / Follow-ups
- [-] we may want to profile the report generation in-depth (code execution). More details could be found in the MR which introduced the report.
- [-] gzip reports (concern: CPU-heavy operation, would need additional performance tests)
- configure automatic reports cleanups
- [-] add another report
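One possible shape for the automatic cleanup follow-up is a "keep the newest N files" pass over the report directory. This is a sketch under assumptions: the method name, the `keep` parameter and the glob pattern are invented for illustration, and a size-based or volume-level limit could be used instead.

```ruby
# Hypothetical cleanup: keep only the `keep` most recently modified files
# matching `pattern` in `dir`, delete the rest, return how many were removed.
def prune_reports(dir, keep:, pattern: '*')
  files = Dir.glob(File.join(dir, pattern)).select { |f| File.file?(f) }
  stale = files.sort_by { |f| File.mtime(f) }.reverse.drop(keep)
  stale.each { |f| File.delete(f) }
  stale.size
end
```

Calling this right after each report is written would bound the directory to roughly `keep` files per pattern.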
Risks & Performance
Storage:
- We'll put an additional restriction on the dir (see the discussion on the emptyDir in the comments)
- On production, each report was around 2.5 MB (more details).
- Running every hour, that is 24 reports, or ~60 MB/day per reporting process, if no cleanups are done.
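The storage estimate is simple arithmetic and can be re-run for other intervals or process counts. The helper below is a hypothetical illustration; only the 2.5 MB report size and the hourly interval come from the numbers above.

```ruby
# Hypothetical helper: MB written per day, given report size and interval.
def daily_report_mb(report_mb:, interval_s:, processes: 1)
  reports_per_day = 86_400.0 / interval_s
  report_mb * reports_per_day * processes
end

daily_report_mb(report_mb: 2.5, interval_s: 3600) # => 60.0
```

The `processes` parameter is there because each worker writes its own report, so the per-node total scales with the number of workers.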
System performance:
- On production, each report took 2-10 seconds (more details)
- To test it locally, I ran GCK in production mode.
- All configs were the GCK defaults (2 Puma workers).
- I ran the Apache benchmark with ab -t 300 -c 8 "http://localhost:3000/api/v4/projects".
- Here are full results with no reporting, and with reporting every 1, 10, and 30 seconds: https://docs.google.com/document/d/1ODuAvY5dKnuftgpwis3KK3-4ighpQRsma-mVp1l8cR8/edit?usp=sharing
- Based on them, we shouldn't expect any visible performance impact (especially taking into account that we are not going to run them too frequently)
- We should keep in mind that GCK reports are generated much faster (~1s on GCK vs 5-10s on prod) and take less space (< 1 MB on GCK vs ~2.5 MB on prod)
- Still, running reports every hour shouldn't make any visible impact
How to set up and validate locally
I suggest testing locally with GCK.
The reason is that, to pull an actual Jemalloc report, libjemalloc must be on LD_PRELOAD (more).
It is already configured in GCK this way.
- Pull this branch: 362900-jemalloc-stats-report
- Set smaller timeouts in JemallocStats, e.g. 10 (seconds) each
- Open the path (currently: tmp/ - could be changed, refer to the code) and check that the reports are there.
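Before waiting for reports, it can help to confirm the LD_PRELOAD precondition from a Ruby console. The check below is a hypothetical helper, not part of this MR; it only inspects the environment variable and assumes the library filename contains "jemalloc".

```ruby
# Hypothetical check: is a jemalloc library listed in LD_PRELOAD?
# LD_PRELOAD entries may be separated by spaces or colons.
def jemalloc_preloaded?(env = ENV)
  env.fetch('LD_PRELOAD', '')
     .split(/[\s:]+/)
     .any? { |lib| File.basename(lib).include?('jemalloc') }
end

puts "jemalloc preloaded: #{jemalloc_preloaded?}"
```

If this prints false, no real Jemalloc report can be produced and the report directory should be expected to stay empty.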
Screenshots or screen recordings
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #362900 (closed)
Edited by Aleksei Lipniagov