Use sidekiq-cluster in GDK
Broken out of #34396 (closed)
Need
We are currently not dog-fooding `sidekiq-cluster` with developers, even though it is used both by large customers and on gitlab.com. While it is available through Omnibus containers, that does not seem to be what developers use to work on GitLab. Rather, they use the GDK, which runs a single-process sidekiq instance via the `bin/background_jobs` script. Because we bypass `sidekiq-cluster` most of the time in development, we miss issues related to multi-process scenarios (#33125 (closed), https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8194, #37781 (closed), #33133 (closed)) and detect them late, usually in production.
Furthermore, we are planning to move process supervision out of GitLab itself and into an external supervisor (most likely `sidekiq-cluster` itself), which will leave even more code paths uncovered during development and local testing unless `sidekiq-cluster` is used.
The proposal is therefore to stop using a single-process sidekiq in development with the GDK and to use `sidekiq-cluster` instead, with a minimum of 2 processes.
Approach
We should change or replace `bin/background_jobs` to run the `ee/bin/sidekiq-cluster` script instead of doing a `bundle exec sidekiq`, which only runs a single instance. We should try to accomplish this without changing the GDK itself, so that it is a "drop-in" replacement if at all possible. (The GDK also maintains `runit` run scripts that wrap around `bin/background_jobs`.) Implementation-wise we should stay as close as possible to what Omnibus installations do to start a sidekiq cluster, so as to reduce configuration and maintenance drift.
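A minimal sketch of what such a drop-in replacement could look like, assuming we keep the existing script name and that two positional queue groups are enough to get two worker processes; the queue names below are placeholders, not a decided mapping:

```shell
#!/bin/sh
# Hypothetical drop-in bin/background_jobs: start a 2-process cluster via
# ee/bin/sidekiq-cluster instead of a single `bundle exec sidekiq`.
# The queue groups below are placeholders; how to derive them from
# sidekiq_queues.yml is the open question discussed further down.
cd "$(dirname "$0")/.." || exit 1

# Two positional queue groups -> two worker processes, run in the foreground
# so the GDK's runit scripts can keep supervising the wrapper.
exec bundle exec ee/bin/sidekiq-cluster "default,mailers" "default,mailers"
```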
An open question is how to address the difference in queue configuration and mapping between a single-process ("1P") sidekiq and a cluster. With 1P, we currently use `sidekiq_queues.yml`, which contains all queues mapped to their priorities. This file is not used by `sidekiq-cluster`; it instead uses a somewhat byzantine approach to configure queues (a consolidated example follows after the list):
1. Read comma-separated "queue groups" from the CLI (e.g. `a` and `b,c`).
2. Load all known queues from `all_queues.yml` and explode the queues read in step 1 (`a,a:x` and `b,b:x,b:y`).
3. Count queue occurrences per group; these counts become the priorities sidekiq will use.
4. Generate a long CLI argument string carrying all queues and priorities and call sidekiq (`-q a,1 -q a:x,1 -q b,1 -q b:x,1 -q b:y,1`).
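Putting those steps together, an invocation like the following (queue names made up for illustration) would roughly end up spawning one sidekiq process per queue group, each with a weighted queue list:

```shell
# Two queue groups on the CLI: "a" and "b,c"
ee/bin/sidekiq-cluster "a" "b,c"

# After exploding the groups against all_queues.yml, this is roughly
# equivalent to starting two sidekiq processes:
#   bundle exec sidekiq -q a,1 -q a:x,1            # process 1: group "a"
#   bundle exec sidekiq -q b,1 -q b:x,1 -q b:y,1   # process 2: group "b,c"
```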
There is a proposal to change how workers are assigned to queues based on a selector syntax: gitlab-com/gl-infra/scalability#45 (closed)
Meanwhile, we can probably move forward with something simpler without undermining that proposal, such as specifying the cluster scale directly via the CLI and having each worker process all queues (considering that this is for non-production use cases only); a rough sketch of this follows below.
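As a rough sketch only: the CLI argument and the placeholder queue group below are assumptions, not existing interfaces, and expressing "all queues" to `sidekiq-cluster` is still part of the open question above.

```shell
#!/bin/sh
# Hypothetical sketch: take the desired number of worker processes from the
# CLI (falling back to 2) and hand every process the same queue group.
WORKERS=${1:-2}

groups=""
i=0
while [ "$i" -lt "$WORKERS" ]; do
  # Repeat the same group once per desired worker process; "default,mailers"
  # stands in for "all queues" here.
  groups="$groups default,mailers"
  i=$((i + 1))
done

# $groups is intentionally unquoted so each group becomes a separate argument.
exec bundle exec ee/bin/sidekiq-cluster $groups
```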
Non-goals:
- This does not mean we will make `sidekiq-cluster` available in Core distributions as part of this issue; it only becomes available to developers
- This issue does not cover changes in gitlab-compose-kit
- There are some ergonomic issues with mapping queues to processes started by `sidekiq-cluster`; we won't solve these in this issue
Benefits
- Better coverage of our multi-process job setup during development and earlier detection of regressions
- Less configuration drift between production and development setups
- An easier path to making `sidekiq-cluster` available in the Core tier once we decide to do so
Competition
I haven't found a good alternative solution; please leave feedback if you know of other approaches.