Infradev: Sidekiq shard/DB saturation Security::ScanResultPolicies::AddApproversToRulesWorker
Summary
During incident gitlab-com/gl-infra/production#17692 (closed), an Ultimate group was shared with gitlab-org. This triggered the creation of 4.2M Security::ScanResultPolicies::AddApproversToRulesWorker jobs, which saturated both the Sidekiq shard and database connections.
More details and logs are available in the incident issue.
Impact
The Sidekiq catchall shard was saturated, increasing job queueing and execution latency.
Database connections were also saturated, which put further pressure on the shard and queue as jobs waited for DB connections.
Recommendation
Review the worker batching strategy. From a quick glance, we create one job per project and then batch process 100 users at a time.
Batch processing multiple projects per job could help reduce the number of jobs.
Overall, as jobs of this kind are neither critical nor urgent, we should consider throttling their execution, given the risk of saturation when it comes to group-wide policies.
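To make the recommendation concrete, here is a minimal sketch of batching projects per job and staggering job start times. It is plain Ruby with no Sidekiq dependency, and it is not the worker's actual implementation: `plan_jobs`, the constants, and their values are all hypothetical, chosen only for illustration.

```ruby
# Hypothetical sketch: the method name, constants, and values below are
# illustrative assumptions, not the actual GitLab implementation.

PROJECTS_PER_JOB  = 100 # assumed number of projects handled by one job
JOBS_PER_INTERVAL = 50  # assumed cap on jobs that may start per interval
INTERVAL_SECONDS  = 60  # assumed interval length in seconds

# Groups project IDs into batches and assigns each batch a start delay so
# that at most `jobs_per_interval` jobs share the same delay.
# Returns an array of [delay_in_seconds, project_id_batch] pairs.
def plan_jobs(project_ids,
              projects_per_job: PROJECTS_PER_JOB,
              jobs_per_interval: JOBS_PER_INTERVAL,
              interval_seconds: INTERVAL_SECONDS)
  project_ids.each_slice(projects_per_job).each_with_index.map do |batch, index|
    # Integer division places jobs 0..49 in the first interval,
    # jobs 50..99 in the second, and so on.
    delay = (index / jobs_per_interval) * interval_seconds
    [delay, batch]
  end
end

plan = plan_jobs((1..12_345).to_a)
puts "jobs scheduled: #{plan.size}"
puts "last job delay: #{plan.last.first}s"
```

Each `[delay, batch]` pair could then be handed to Sidekiq's `perform_in(delay, batch)`, assuming the worker were changed to accept a list of project IDs. If the one-job-per-project reading above is right, a batch size of 100 would turn roughly 4.2M jobs into about 42,000, and the per-interval cap bounds how many of them compete for database connections at once.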
Verification
Security::ScanResultPolicies::AddApproversToRulesWorker has an upper limit on how many jobs are created or scheduled during a given interval, achieved by batching projects (reducing the number of jobs), by throttling job creation, or both.