Allow Sidekiq jobs to use readonly database replicas
Currently Sidekiq always uses the `primary`, even though it does not always need to. This means that all of Sidekiq's database traffic hits the primary, whereas only some web database traffic does. From https://dashboards.gitlab.net/d/000000144/postgresql-overview, we can see that none of the replicas are used as much as the primary, but all of them are in the same ballpark.
Overall, our metrics suggest that we spend more database time in web transactions (green line) than Sidekiq jobs (orange line), but Sidekiq is still a significant percentage:
We currently have no way to tell whether a given worker requires read-only or read-write access to data. If we started annotating workers, we could use replicas most of the time for operations that require neither writes nor fully up-to-date data, like:
- all notifications
- all webhooks
- all ...
This would allow us to remove a number of `SELECT` statements from `master`.
group::scalability is already spending a lot of effort on annotating workers; maybe we could follow the same pattern here.
Proposed solution
- We should be able to define the `data consistency` requirement for a worker:
  - `always`: the worker is required to use the primary (the default).
  - `sticky`: the worker would use a replica for as long as possible, but would switch to the primary either on a write or on long replication lag. Use it on jobs that need to be executed as fast as possible.
  - `delayed`: the worker would switch to the primary only on a write and would otherwise always use a replica. If there is a long replication lag, the job will be delayed; only if the replica is still not up to date on the next retry will it switch to the primary. Use it on jobs where we are fine with delaying execution, given their importance: `expire caches`, or `execute hooks`...
  It is also possible to control the `data consistency` configuration with a feature flag for each worker:
  `data_consistency :delayed, feature_flag: :load_balancing_for_build_hooks_worker`
- To be safer, we should be able to control load balancing for Sidekiq by setting the ENV variable `ENABLE_LOAD_BALANCING_FOR_SIDEKIQ` to `'true'`.
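The three levels above can be read as a small decision function. The following is only a minimal sketch of that reading, not the actual implementation: `choose_database`, its keyword arguments (`wrote`, `replica_caught_up`, `retried`) and the returned symbols are hypothetical names introduced for illustration.

```ruby
# Hypothetical sketch of the routing rules described above.
# None of these names come from the GitLab codebase.
#
# consistency       - :always, :sticky or :delayed
# wrote             - true once the job has performed a write
# replica_caught_up - true if the replica is not lagging behind
# retried           - true if the job has already been delayed once
#
# Returns :primary, :replica or :delay (reschedule the job).
def choose_database(consistency, wrote:, replica_caught_up:, retried: false)
  unless %i[always sticky delayed].include?(consistency)
    raise ArgumentError, "unknown data consistency: #{consistency.inspect}"
  end

  # :always never uses a replica; the other levels switch on write.
  return :primary if consistency == :always || wrote
  return :replica if replica_caught_up

  if consistency == :sticky
    # Run as soon as possible: fall back to the primary on replication lag.
    :primary
  else
    # :delayed postpones the job once, then falls back to the primary.
    retried ? :primary : :delay
  end
end

choose_database(:sticky,  wrote: false, replica_caught_up: false)                # => :primary
choose_database(:delayed, wrote: false, replica_caught_up: false)                # => :delay
choose_database(:delayed, wrote: false, replica_caught_up: false, retried: true) # => :primary
```

The only difference between `sticky` and `delayed` in this sketch is what happens on replication lag: `sticky` trades replica usage for latency, while `delayed` trades latency for replica usage.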
Rollout plan:
- For GitLab.com, possibly Omnibus too: we should make sure that the pgbouncer nodes for the read-only replicas are configured with a Sidekiq pool. At present, (iirc) only the primary has a Sidekiq pool (since the replica pool would be unused).
  - @jarv opened a Charts issue gitlab-org/charts/gitlab#2619 (closed) for this; once this is done we will need to add an option to allow Sidekiq to use the load balancing config.
- We will also need to configure the read-replica pgbouncer pools on the `patroni` nodes: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12871
- Enable load balancing for Sidekiq by setting `ENV['ENABLE_LOAD_BALANCING_FOR_SIDEKIQ']='true'`. This will enable load balancing, but we will still always use the `primary` database for all workers, since a worker's `data_consistency` defaults to `:always`.
- In #324232 (closed) we will configure `BuildHooksWorker` `data_consistency` to `:delayed`, controlled by the feature flag `load_balancing_for_build_hooks_worker`.
- Roll out the feature flag `load_balancing_for_build_hooks_worker`.
- If everything is fine, we will proceed with updating other workers listed in &5592 (closed)
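
The gating in this rollout plan can be sketched as follows. This is an illustration only: `effective_data_consistency`, the `enabled_flags` list and passing the environment in as a hash are assumptions made for the example, not the real API.

```ruby
# Hypothetical sketch: resolves which data consistency a worker really uses
# at each rollout step. Names and arguments are illustrative assumptions.
#
# env           - hash of environment variables (e.g. ENV.to_h)
# declared      - level declared on the worker; defaults to :always
# feature_flag  - optional flag name guarding the declared level
# enabled_flags - names of the feature flags that are currently switched on
def effective_data_consistency(env:, declared: :always, feature_flag: nil, enabled_flags: [])
  # Without the ENV switch, Sidekiq does not use load balancing at all.
  return :always unless env['ENABLE_LOAD_BALANCING_FOR_SIDEKIQ'] == 'true'

  # A worker guarded by a disabled feature flag keeps using the primary.
  return :always if feature_flag && !enabled_flags.include?(feature_flag)

  declared
end

flag = :load_balancing_for_build_hooks_worker
on   = { 'ENABLE_LOAD_BALANCING_FOR_SIDEKIQ' => 'true' }

# ENV variable set, worker not annotated yet
effective_data_consistency(env: on) # => :always
# BuildHooksWorker annotated, feature flag still off
effective_data_consistency(env: on, declared: :delayed, feature_flag: flag) # => :always
# Feature flag rolled out
effective_data_consistency(env: on, declared: :delayed, feature_flag: flag, enabled_flags: [flag]) # => :delayed
```

Each step changes exactly one input, so a regression can be reverted by undoing only the last step (disabling the flag, or unsetting the ENV variable).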