Use sidekiq-cluster in GDK
Broken out of #34396 (closed)
Need
We are currently not dog-fooding `sidekiq-cluster` with developers, even though it is used both by large customers and on gitlab.com. While it is available through Omnibus containers, that does not seem to be what developers use to work on GitLab. Rather, they use the GDK, which runs a single-process sidekiq instance via the `bin/background_jobs` script. Because we bypass `sidekiq-cluster` most of the time in development, we miss issues related to multi-process scenarios (#33125 (closed), https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8194, #37781 (closed), #33133 (closed)) and detect them late, usually in production.
Furthermore, we are planning to move process supervision out of GitLab itself and into an external supervisor (most likely `sidekiq-cluster` itself), which will leave even more code paths uncovered during development and local testing unless `sidekiq-cluster` is used.
The proposal is therefore to stop using a single-process sidekiq in development with the GDK and to use `sidekiq-cluster` instead, with a minimum of 2 processes.
Approach
We should change or replace `bin/background_jobs` to run the `ee/bin/sidekiq-cluster` script instead of doing a `bundle exec sidekiq`, which only runs a single instance. We should try to accomplish this without changing the GDK itself, so that it is a "drop-in" replacement if at all possible. (The GDK also maintains `runit` run scripts that wrap around `bin/background_jobs`.) Implementation-wise we should stay as close as possible to what Omnibus installations do to start a sidekiq cluster, so as to reduce configuration and maintenance drift.
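A minimal sketch of what such a drop-in replacement could look like, assuming we keep the existing script name and that two positional queue groups are enough to get two worker processes; the queue names below are placeholders, not a decided mapping:

```shell
#!/bin/sh
# Hypothetical drop-in bin/background_jobs: start a 2-process cluster via
# ee/bin/sidekiq-cluster instead of a single `bundle exec sidekiq`.
# The queue groups below are placeholders; how to derive them from
# sidekiq_queues.yml is the open question discussed further down.
cd "$(dirname "$0")/.." || exit 1

# Two positional queue groups -> two worker processes, run in the foreground
# so the GDK's runit scripts can keep supervising the wrapper.
exec bundle exec ee/bin/sidekiq-cluster "default,mailers" "default,mailers"
```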
An open question is how to address the difference in queue configuration and mapping between a single-process ("1P") sidekiq and a cluster. With 1P, we currently use `sidekiq_queues.yml`, which contains all queues mapped to their priorities. This file is not used by `sidekiq-cluster`; it instead uses a somewhat byzantine approach to configure queues (a consolidated example follows after the list):
1. Read comma-separated "queue groups" from the CLI (e.g. `a` and `b,c`).
2. Load all known queues from `all_queues.yml` and explode the queues read in step 1 (`a,a:x` and `b,b:x,b:y`).
3. Count queue occurrences per group; these counts become the priorities sidekiq will use.
4. Generate a long CLI argument string carrying all queues and priorities and call sidekiq (`-q a,1 -q a:x,1 -q b,1 -q b:x,1 -q b:y,1`).
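Putting those steps together, an invocation like the following (queue names made up for illustration) would roughly end up spawning one sidekiq process per queue group, each with a weighted queue list:

```shell
# Two queue groups on the CLI: "a" and "b,c"
ee/bin/sidekiq-cluster "a" "b,c"

# After exploding the groups against all_queues.yml, this is roughly
# equivalent to starting two sidekiq processes:
#   bundle exec sidekiq -q a,1 -q a:x,1            # process 1: group "a"
#   bundle exec sidekiq -q b,1 -q b:x,1 -q b:y,1   # process 2: group "b,c"
```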
There is a proposal to change how workers are assigned to queues based on a selector syntax: gitlab-com/gl-infra/scalability#45 (closed)
Meanwhile, we can probably move forward with something simpler without undermining that proposal, such as specifying the cluster scale directly via the CLI and having each worker process all queues (considering that this is for non-production use cases only); a rough sketch of this follows below.
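As a rough sketch only: the CLI argument and the placeholder queue group below are assumptions, not existing interfaces, and expressing "all queues" to `sidekiq-cluster` is still part of the open question above.

```shell
#!/bin/sh
# Hypothetical sketch: take the desired number of worker processes from the
# CLI (falling back to 2) and hand every process the same queue group.
WORKERS=${1:-2}

groups=""
i=0
while [ "$i" -lt "$WORKERS" ]; do
  # Repeat the same group once per desired worker process; "default,mailers"
  # stands in for "all queues" here.
  groups="$groups default,mailers"
  i=$((i + 1))
done

# $groups is intentionally unquoted so each group becomes a separate argument.
exec bundle exec ee/bin/sidekiq-cluster $groups
```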
Non-goals:
- This does not mean we will make `sidekiq-cluster` available in Core distributions as part of this issue; it only becomes available to developers
- This issue does not cover changes in gitlab-compose-kit
- There are some ergonomic issues with mapping queues to processes started by `sidekiq-cluster`; we won't solve these in this issue
Benefits
- Better coverage of our multi-process job setup during development and earlier detection of regressions
- Less configuration drift between production and development setups
- An easier path to making `sidekiq-cluster` available in the Core tier once we decide to do so
Competition
I haven't found a good alternative solution; please leave feedback if you know of other approaches.