[POC] Export Sidekiq metrics from separate server process
Goal
Implement a POC demonstrating that the approach outlined in &6409 (closed) works.
Approach
- Make SidekiqExporter (and its dependencies) available to Sidekiq::CLI (it runs outside the Rails context currently)
  - This is the key challenge here, since it accesses Settings and other things specific to the Rails app
- Make the sidekiq-cluster process spawn a metrics server process and bind to a different port (e.g. 3809; 3807 is where sidekiq procs currently attempt to bind)
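The second step of the approach could look roughly like the following. This is a minimal sketch, not the actual implementation: the environment variable name SIDEKIQ_METRICS_SERVER_PORT, the helper names, and the placeholder response body are all hypothetical, and a plain TCPServer stands in for whatever the real SidekiqExporter would serve.

```ruby
require 'socket'

# All names below are hypothetical illustrations, not GitLab's actual API.
WORKER_METRICS_PORT = 3807 # port the Sidekiq workers currently try to bind
DEFAULT_SERVER_PORT = 3809 # example port mentioned in this issue

# Resolve the metrics server port from the environment (assumed variable name).
def metrics_server_port(env = ENV)
  port = Integer(env.fetch('SIDEKIQ_METRICS_SERVER_PORT', DEFAULT_SERVER_PORT))
  raise ArgumentError, "port #{port} is used by the workers" if port == WORKER_METRICS_PORT

  port
end

# Fork a child process that answers every HTTP request with a placeholder body.
# Returns the child's pid so the parent can supervise it.
def spawn_metrics_server(port)
  # Bind in the parent so that a port conflict surfaces before forking.
  server = TCPServer.new('127.0.0.1', port)

  pid = fork do
    loop do
      client = server.accept
      # Read and discard the request head (everything up to the blank line).
      loop do
        line = client.gets
        break if line.nil? || line == "\r\n"
      end
      body = "# placeholder metrics\n"
      client.write("HTTP/1.1 200 OK\r\n" \
                   "Content-Type: text/plain\r\n" \
                   "Content-Length: #{body.bytesize}\r\n" \
                   "Connection: close\r\n\r\n#{body}")
      client.close
    end
  end

  server.close # the child keeps its own copy of the listening socket
  pid
end
```

Under these assumptions, the cluster parent would call `spawn_metrics_server(metrics_server_port)` once, before starting any workers, and keep the returned pid alongside the worker pids.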
Outcome
- Existing worker(s) continue to export metrics via 3807 to allow for a smooth rollout
- Server process exports the same metrics via a different port (set from env variable)
- Metrics must be identical, processes must not interfere with one another
  - Specifically, workers never remove anything from the shared metrics directory; only the server process does. This will address the problem where metrics disappear because restarting workers attempt to wipe data from it.
This should lay the foundation for a clean cut-over later.
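One way to check the "metrics must be identical" outcome during the POC is to scrape both ports and compare the sample lines. The sketch below assumes the responses are in the Prometheus text format and that both scrapes happen while the system is idle, since counters that change between the two requests would otherwise make the bodies differ. The helper names are hypothetical.

```ruby
require 'set'

# Reduce a Prometheus text-format body to the set of its sample lines,
# ignoring blank lines, comment lines (# HELP / # TYPE) and ordering.
def metric_samples(body)
  body.each_line
      .map(&:strip)
      .reject { |line| line.empty? || line.start_with?('#') }
      .to_set
end

# True when both bodies expose exactly the same samples.
def same_metrics?(worker_body, server_body)
  metric_samples(worker_body) == metric_samples(server_body)
end
```

The two bodies would come from the worker endpoint on 3807 and from the server process on its own port.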
Notes / Assumptions / Questions
- All environments launch sidekiq-cluster rather than individual SK processes
  - Should be the case as of recently so that we either go through bin/background_jobs (GCK/GDK, from source) or through a service wrapper (Omnibus, CNG via gitlab-org/build/CNG!819 (merged))
- How will clean-up logic be handled? This was one of the points of friction
  - By moving all clean-up logic to the parent process, the problem of workers deleting each other's metrics should be solved because:
    - The server process starts before any child processes (i.e. workers) are spawned; each worker can therefore assume it operates on a clean directory
    - As soon as any child process dies, all children will be terminated and the parent process quits. Restarting it is therefore equivalent to a cold start and will also wipe all metrics.
  - More context here: #27818 (comment 251044947)
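The clean-up reasoning above can be sketched as a small supervisor. This is only an illustration under stated assumptions: `reset_metrics_dir` and `supervise` are hypothetical names, the workers are modelled as plain forked blocks rather than real Sidekiq processes, and the real sidekiq-cluster code may handle signals differently.

```ruby
require 'fileutils'
require 'tmpdir'

# Wipe and recreate the shared metrics directory.
# Only the parent calls this; workers never delete anything.
def reset_metrics_dir(dir)
  FileUtils.rm_rf(dir)
  FileUtils.mkdir_p(dir)
end

# Hypothetical supervisor: clean the directory, start the workers, and stop
# everything as soon as the first child exits. Returns the pid that died first.
def supervise(metrics_dir, worker_count)
  reset_metrics_dir(metrics_dir) # cold start: children always see a clean directory

  pids = Array.new(worker_count) do |index|
    fork do
      Signal.trap('TERM') { exit!(0) } # stop immediately when the parent asks
      yield index                      # worker body; may write metrics files
      exit!(0)
    end
  end

  dead_pid = Process.wait # blocks until any child exits
  survivors = pids - [dead_pid]

  survivors.each do |pid|
    begin
      Process.kill('TERM', pid)
    rescue Errno::ESRCH
      # child already gone
    end
  end

  survivors.each do |pid|
    begin
      Process.wait(pid)
    rescue Errno::ECHILD
      # child already reaped
    end
  end

  dead_pid
end
```

Because the directory is reset only at the top of `supervise`, a restart of the parent behaves like a cold start, while a worker that comes and goes never touches files written by its siblings.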
Edited by Matthias Käppler