Drop Sidekiq jobs based on feature flag
What does this MR do and why?
This adds a new feature to the DeferJobs (now SkipJobs) middleware: dropping jobs entirely
instead of deferring them. This is useful during an incident where we
know we don't want to resume job execution, whereas deferring jobs
piles them up in the ScheduledSet and can lead to a thundering herd
problem once the jobs are released.
Jobs are dropped based on a new feature flag, drop_sidekiq_jobs_{worker_name}.
When a job is dropped:
- The job is logged with a job status of dropped.
- The sidekiq_jobs_skipped_total metric counter is incremented.
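To make the drop path concrete, here is a minimal, self-contained sketch of how a Sidekiq-style server middleware can drop a job by not yielding to the rest of the chain. It is an illustration only, not the actual SkipJobs implementation: the class names SkipJobsSketch and SleepWorkerSketch, the enabled_flags list standing in for the feature flag store, and the in-memory Hash standing in for the Prometheus counter are all assumptions made for this example. The defer path is omitted.

```ruby
require "logger"
require "stringio"

# Hypothetical sketch, not the real GitLab SkipJobs middleware.
class SkipJobsSketch
  attr_reader :skipped_total

  # enabled_flags: list of enabled flag names (stand-in for the feature flag store)
  # logger: any Logger-compatible object
  def initialize(enabled_flags:, logger:)
    @enabled_flags = enabled_flags
    @logger = logger
    # Stand-in for a labelled counter such as sidekiq_jobs_skipped_total
    @skipped_total = Hash.new(0)
  end

  # Same shape as a Sidekiq server middleware: the block runs the job.
  def call(worker, job, _queue)
    worker_name = worker.class.name

    if @enabled_flags.include?("drop_sidekiq_jobs_#{worker_name}")
      @skipped_total[{ action: "dropped", worker: worker_name }] += 1
      @logger.info("#{worker_name} JID-#{job['jid']}: dropped")
      return # not yielding means the job body never runs
    end

    yield
  end
end

# Hypothetical worker class used only to exercise the sketch
class SleepWorkerSketch; end

log = StringIO.new
middleware = SkipJobsSketch.new(
  enabled_flags: ["drop_sidekiq_jobs_SleepWorkerSketch"],
  logger: Logger.new(log)
)

executed = false
middleware.call(SleepWorkerSketch.new, { "jid" => "abc123" }, "default") { executed = true }
```

With the flag present in enabled_flags, the block passed to call is never invoked, so executed stays false while the counter entry for action "dropped" is incremented. Because the job is never rescheduled, nothing accumulates for later release, which is the property that distinguishes dropping from deferring.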
Other changes:
- The DeferJobs middleware is renamed to SkipJobs, because the middleware can now either drop or defer jobs.
- Some refactoring in the SkipJobs middleware.
- The sidekiq_jobs_deferred_total counter is changed to sidekiq_jobs_skipped_total, with an action label of either dropped or deferred.
Related runbooks MR update gitlab-com/runbooks!5950 (merged)
Resolves gitlab-com/gl-infra/scalability#2384 (closed)
How to set up and validate locally
- Check out this branch and restart Sidekiq:
gdk restart rails-background-jobs
- In Rails console, enable the dropping jobs FF:
[2] pry(main)> Feature.enable(:"drop_sidekiq_jobs_Chaos::SleepWorker")
true
- Keep track of the Sidekiq logs:
gdk tail rails-background-jobs
- Perform a job with Chaos::SleepWorker.perform_async(1).
- You should see the JSON log with job_status: dropped:
2023-06-20_10:12:03.19963 rails-background-jobs : {"severity":"INFO","time":"2023-06-20T10:12:03.199Z","retry":3,"queue":"default","backtrace":true,"version":0,"queue_namespace":"chaos","args":["5"],"class":"Chaos::SleepWorker","jid":"65a90eac9b58596fce382bf9","created_at":"2023-06-20T10:12:03.112Z","correlation_id":"0748eb5dc63aca9eea44dec8aa7c6956","worker_data_consistency":"always","idempotency_key":"resque:gitlab:duplicate:default:f4c11a9e1a6f757ee1fa238573132dc4603850fb406ece5b923b61ba12942078","size_limiter":"validated","enqueued_at":"2023-06-20T10:12:03.131Z","job_size_bytes":3,"pid":42235,"message":"Chaos::SleepWorker JID-65a90eac9b58596fce382bf9: dropped: 0.066827 sec","job_status":"dropped","scheduling_latency_s":0.000739,"redis_calls":3,"redis_duration_s":0.000254,"redis_read_bytes":215,"redis_write_bytes":257,"redis_feature_flag_calls":1,"redis_feature_flag_duration_s":8.9e-05,"redis_feature_flag_read_bytes":213,"redis_feature_flag_write_bytes":71,"redis_queues_calls":2,"redis_queues_duration_s":0.000165,"redis_queues_read_bytes":2,"redis_queues_write_bytes":186,"db_count":0,"db_write_count":0,"db_cached_count":0,"db_replica_count":0,"db_primary_count":0,"db_main_count":0,"db_ci_count":0,"db_main_replica_count":0,"db_ci_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":0,"db_main_cached_count":0,"db_ci_cached_count":0,"db_main_replica_cached_count":0,"db_ci_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_ci_wal_count":0,"db_main_replica_wal_count":0,"db_ci_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_ci_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_ci_replica_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":0.0,"db_main_duration_s":0.0,"db_ci_duration_s":0.0,"db_main_replica_duration_s":0.0,"db_ci_replica_duration_s":0.0,"cpu_s":0.00167,"worker_id":"sidekiq_0","rate_limiting_gates":[],"duration_s":0.066827,"completed_at":"2023-06-20T10:12:03.199Z","load_balancing_strategy":"primary","db_duration_s":0.0}
- To check the metric, ensure GDK root's gdk.yml has sidekiq_exporter enabled and prometheus enabled, then gdk restart:
prometheus:
  enabled: true
gitlab:
  rails_background_jobs:
    sidekiq_exporter_enabled: true
- Check that the counter is incremented:
$ curl gdk.test:3807/metrics | rg skipped
# HELP sidekiq_jobs_skipped_total Multiprocess metric
# TYPE sidekiq_jobs_skipped_total counter
sidekiq_jobs_skipped_total{action="dropped",worker="Chaos::SleepWorker"} 1
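The two manual checks in the steps above (the job_status field in the JSON log and the counter value in the metrics output) can also be scripted. The following is a rough sketch under stated assumptions: job_status_from, parse_sample and skipped_count are hypothetical helper names introduced here, the log sample is a shortened version of the line shown earlier, and the metric parser handles only simple name{labels} value samples rather than the full Prometheus exposition format.

```ruby
require "json"

# Hypothetical helper: extract job_status from a "<timestamp> <service> : {json}" log line.
def job_status_from(line)
  json_start = line.index("{")
  return nil unless json_start

  JSON.parse(line[json_start..])["job_status"]
rescue JSON::ParserError
  nil
end

# Hypothetical helper: parse one `name{label="v",...} value` sample line.
# Comment lines (# HELP, # TYPE) and unlabelled samples return nil.
def parse_sample(line)
  match = /\A(?<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?<labels>[^}]*)\}\s+(?<value>\S+)\z/.match(line.strip)
  return nil unless match

  labels = match[:labels].scan(/(\w+)="([^"]*)"/).to_h
  { name: match[:name], labels: labels, value: match[:value].to_f }
end

# Hypothetical helper: sum sidekiq_jobs_skipped_total for one action and worker.
def skipped_count(metrics_text, action:, worker:)
  metrics_text.each_line.sum do |line|
    sample = parse_sample(line)
    next 0 unless sample && sample[:name] == "sidekiq_jobs_skipped_total"
    next 0 unless sample[:labels]["action"] == action && sample[:labels]["worker"] == worker

    sample[:value]
  end
end

# Shortened sample in the same shape as the log line shown earlier
log_line = '2023-06-20_10:12:03.19963 rails-background-jobs : ' \
           '{"class":"Chaos::SleepWorker","jid":"65a90eac9b58596fce382bf9","job_status":"dropped"}'

metrics = <<~TEXT
  # HELP sidekiq_jobs_skipped_total Multiprocess metric
  # TYPE sidekiq_jobs_skipped_total counter
  sidekiq_jobs_skipped_total{action="dropped",worker="Chaos::SleepWorker"} 1
TEXT

puts job_status_from(log_line)
puts skipped_count(metrics, action: "dropped", worker: "Chaos::SleepWorker")
```

In practice the metrics text would come from the curl command above and the log line from the gdk tail output; the inline samples here only keep the sketch runnable on its own.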
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Edited by Gregorius Marco