Bring `sidekiq-cluster` script to Core
Problem Statement
Currently, running sidekiq in clustered mode (i.e. spawning more than one worker process) is technically only available as part of GitLab EE distributions, and for self-managed environments only in the Starter+ tiers. Because of that, when booting sidekiq in a development environment with the GDK, the lowest common denominator is assumed, which is to run sidekiq in a single-process setup. That can be a problem, because the environment developers work in diverges from what actually runs in production (i.e. gitlab.com and higher-tier self-managed environments). We have already seen problems in production that went unnoticed and are specific to a multi-process sidekiq setup, such as race conditions between worker processes to initialize prometheus, as well as initialization races leading to crashes.
We should consider:
- Making `sidekiq-cluster` available as part of Core and making it a 1-process "cluster"
- In development, using `sidekiq-cluster` to run sidekiq locally (e.g. with 2 processes)
This would include changing the `background_jobs` script we use to boot up sidekiq locally to utilize `sidekiq-cluster`.
That way we get better coverage of code that is actually used by the majority of GitLab deployments, particularly gitlab.com. I also see this being aligned with the recent change to focus on availability over velocity. An open question is how to restrict users who are not eligible to run more than 1 sidekiq process from actually doing so.
Reach
Personas:
- Sasha (engineer) - because they get more confidence in making changes to sidekiq and testing them in an environment closer to production
- Devon (devops) - because it simplifies how we operate sidekiq in any environment
Reach 10.0 = Impacts the vast majority (~80% or greater) of our users, prospects, or customers.
Impact
2.0 = High impact
Confidence
80% = Medium confidence
Effort
Medium to High. Below is a breakdown of what I think would need to happen, but we can roll this out in multiple stages.
Terms:
1P = single-process
nP = clustered setup
GitLab (the app)
Roughly in this order (all of this needs to happen):
- Find a way to consolidate queue configuration between 1P and nP setups, since they use vastly different approaches currently
- Start using sidekiq-cluster via GDK. This should be a relatively straightforward first step towards alignment and putting sidekiq-cluster on a hot path for ongoing development.
- Revert changes in !11001 (merged) to move the script back under `bin/`
- Rewrite `bin/background_jobs` to wrap `bin/sidekiq-cluster`
- Update any remaining documentation if necessary
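To make the `bin/background_jobs` rewrite more concrete, here is a minimal Ruby sketch of the command line such a wrapper might build. It assumes that `sidekiq-cluster` takes one positional argument per worker process, each a comma-separated list of queues. The helper name `sidekiq_cluster_command`, the constant `SIDEKIQ_CLUSTER_BIN`, and the queue names are illustrative only and are not part of the proposal.

```ruby
# Sketch only: build the argv that a rewritten bin/background_jobs might exec.
# `sidekiq_cluster_command` and SIDEKIQ_CLUSTER_BIN are hypothetical names.
SIDEKIQ_CLUSTER_BIN = 'bin/sidekiq-cluster'

def sidekiq_cluster_command(queue_groups)
  raise ArgumentError, 'at least one queue group is required' if queue_groups.empty?

  # Assumption: each positional argument is one queue group, i.e. one process.
  [SIDEKIQ_CLUSTER_BIN, *queue_groups]
end

# 1P: a "cluster" with a single process that handles both example queues.
single = sidekiq_cluster_command(['post_receive,mailers'])

# nP: two processes, as suggested for local development.
double = sidekiq_cluster_command(['post_receive', 'mailers'])

puts single.join(' ')
puts double.join(' ')
```

With this shape, a 1P setup is just the special case of a cluster with exactly one queue group, so the wrapper would not need a separate code path for it.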
Omnibus
I think these steps can happen incrementally, and only the first one is necessary for an MVC (minimum viable change):
- Revert changes in omnibus-gitlab!3216 (merged)
- Use `sidekiq-cluster` by default (it's currently disabled by default). This means that we should probably also provide sensible defaults for queue grouping, which we don't currently. We ask the user to manually fill in `sidekiq_cluster['queue_groups']` instead. See also "consolidate queue config" above.
- Deprecate or remove `sv-sidekiq-run` i.e. 1P setups. If you want a single process, just create a cluster from 1 queue group. With a better approach to configuring sidekiq-cluster, this should be simple to do.
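As an illustration of what a default for queue grouping could look like, the Ruby sketch below spreads a list of queue names round-robin across a given number of processes. It renders each group as a comma-separated string, which is the shape assumed here for the entries of `sidekiq_cluster['queue_groups']`. The helper `default_queue_groups`, the round-robin strategy, and the queue names are assumptions made for this example. A real default would likely need to take workload characteristics into account.

```ruby
# Hypothetical helper: derive default queue groups for `process_count`
# processes. Round-robin assignment is an illustrative choice only.
def default_queue_groups(queues, process_count)
  raise ArgumentError, 'process_count must be >= 1' if process_count < 1

  groups = Array.new(process_count) { [] }

  # Sort first so the result does not depend on the input order.
  queues.sort.each_with_index do |queue, index|
    groups[index % process_count] << queue
  end

  # Drop empty groups (more processes than queues) and join each group
  # into one comma-separated entry.
  groups.reject(&:empty?).map { |group| group.join(',') }
end

example_queues = %w[post_receive mailers pipeline_default]

# 1P: a cluster built from a single queue group.
p default_queue_groups(example_queues, 1)

# nP: the same queues split across two processes.
p default_queue_groups(example_queues, 2)
```

Under this sketch, the 1P case falls out of the same code path as the clustered case, which matches the idea of replacing `sv-sidekiq-run` with a cluster built from one queue group.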
"From Source" installations
We also allow users to install and run GitLab from source. This comes with a significant amount of configuration and setup overhead for users, but it is an option we provide. These users would be affected by this change, because they use the `bin/background_jobs` script as part of an `init.d` supervision script we provide, and which under the current proposal would now launch `sidekiq-cluster` instead. This may or may not be a drop-in replacement, depending on how much the `background_jobs` script will have to change and/or if additional variables need to be set e.g. through the environment.
GDK
If we keep `bin/background_jobs` to simply point to sidekiq-cluster then I don't think any changes are required here. Otherwise we'd have to change the service run script to point directly to sidekiq-cluster.
Related Product issue: https://gitlab.com/gitlab-com/Product/issues/574