Drop Sidekiq jobs based on feature flag

Gregorius Marco requested to merge mg-blackhole-sidekiq-jobs into master

What does this MR do and why?

This adds a new feature to the DeferJobs (now SkipJobs) middleware: dropping jobs entirely instead of deferring them. This is useful during an incident when we know we don't want to resume job execution, whereas deferring jobs piles them up in the ScheduledSet and can lead to a thundering herd problem once they are released. Jobs are dropped based on a new feature flag, drop_sidekiq_jobs_{worker_name}.

When a job is dropped:

  • The job is logged with job_status: dropped.
  • The sidekiq_jobs_skipped_total metric counter is incremented.

Other changes:

  • The DeferJobs middleware is renamed to SkipJobs because it can now either drop or defer jobs.
  • Some refactoring in the SkipJobs middleware.
  • The sidekiq_jobs_deferred_total counter is renamed to sidekiq_jobs_skipped_total, with an action label of either dropped or deferred.
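
The drop-or-defer decision described above can be sketched roughly as follows. This is an illustration only, not the actual SkipJobs code: FakeFeature and FakeCounter are hypothetical stand-ins for GitLab's Feature API and a labelled Prometheus counter, the defer flag name defer_sidekiq_jobs_{worker_name} and the DELAY value are assumptions, and only the call(worker, job, queue) shape follows the usual Sidekiq server middleware convention.

```ruby
# Hypothetical stand-in for a feature flag store; only enable/enabled? are modelled.
class FakeFeature
  def initialize
    @enabled = {}
  end

  def enable(name)
    @enabled[name.to_sym] = true
  end

  def enabled?(name)
    @enabled.fetch(name.to_sym, false)
  end
end

# Hypothetical stand-in for a counter keyed by its label set.
class FakeCounter
  attr_reader :values

  def initialize
    @values = Hash.new(0)
  end

  def increment(labels)
    @values[labels] += 1
  end
end

# Illustrative middleware: drop takes precedence over defer, otherwise run the job.
class SkipJobsSketch
  DELAY = 300 # seconds a deferred job is pushed into the future (assumed value)

  def initialize(feature:, counter:, scheduled:)
    @feature = feature
    @counter = counter
    @scheduled = scheduled # stand-in for the ScheduledSet
  end

  def call(worker, job, _queue)
    name = worker.class.name
    if @feature.enabled?("drop_sidekiq_jobs_#{name}")
      skip(job, name, "dropped") # job is discarded, nothing is rescheduled
    elsif @feature.enabled?("defer_sidekiq_jobs_#{name}") # assumed flag name
      @scheduled << [job, DELAY]
      skip(job, name, "deferred")
    else
      yield # normal execution
    end
  end

  private

  # Record the outcome on the job payload and bump the labelled counter.
  def skip(job, name, action)
    job["job_status"] = action
    @counter.increment(action: action, worker: name)
    nil
  end
end

class DemoWorker; end # hypothetical worker class for the demo

feature = FakeFeature.new
counter = FakeCounter.new
scheduled = []
middleware = SkipJobsSketch.new(feature: feature, counter: counter, scheduled: scheduled)

feature.enable("drop_sidekiq_jobs_DemoWorker")
job = { "jid" => "abc123" }
ran = false
middleware.call(DemoWorker.new, job, "default") { ran = true }
puts "status=#{job['job_status']} ran=#{ran} scheduled=#{scheduled.size}"
```

Because a dropped job is never added to the scheduled list, nothing accumulates to be released later, which is the property that avoids the thundering herd described in this MR.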

Related runbooks update: gitlab-com/runbooks!5950 (merged)

Resolves gitlab-com/gl-infra/scalability#2384 (closed)

How to set up and validate locally

  1. Check out this branch and restart Sidekiq: gdk restart rails-background-jobs
  2. In the Rails console, enable the feature flag for dropping jobs:
[2] pry(main)> Feature.enable(:"drop_sidekiq_jobs_Chaos::SleepWorker")
true
  3. Follow the Sidekiq logs: gdk tail rails-background-jobs
  4. Enqueue a job with Chaos::SleepWorker.perform_async(1).
  5. You should see a JSON log entry with job_status: dropped:
2023-06-20_10:12:03.19963 rails-background-jobs : {"severity":"INFO","time":"2023-06-20T10:12:03.199Z","retry":3,"queue":"default","backtrace":true,"version":0,"queue_namespace":"chaos","args":["5"],"class":"Chaos::SleepWorker","jid":"65a90eac9b58596fce382bf9","created_at":"2023-06-20T10:12:03.112Z","correlation_id":"0748eb5dc63aca9eea44dec8aa7c6956","worker_data_consistency":"always","idempotency_key":"resque:gitlab:duplicate:default:f4c11a9e1a6f757ee1fa238573132dc4603850fb406ece5b923b61ba12942078","size_limiter":"validated","enqueued_at":"2023-06-20T10:12:03.131Z","job_size_bytes":3,"pid":42235,"message":"Chaos::SleepWorker JID-65a90eac9b58596fce382bf9: dropped: 0.066827 sec","job_status":"dropped","scheduling_latency_s":0.000739,"redis_calls":3,"redis_duration_s":0.000254,"redis_read_bytes":215,"redis_write_bytes":257,"redis_feature_flag_calls":1,"redis_feature_flag_duration_s":8.9e-05,"redis_feature_flag_read_bytes":213,"redis_feature_flag_write_bytes":71,"redis_queues_calls":2,"redis_queues_duration_s":0.000165,"redis_queues_read_bytes":2,"redis_queues_write_bytes":186,"db_count":0,"db_write_count":0,"db_cached_count":0,"db_replica_count":0,"db_primary_count":0,"db_main_count":0,"db_ci_count":0,"db_main_replica_count":0,"db_ci_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":0,"db_main_cached_count":0,"db_ci_cached_count":0,"db_main_replica_cached_count":0,"db_ci_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_ci_wal_count":0,"db_main_replica_wal_count":0,"db_ci_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_ci_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_ci_replica_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":0.0,"db_main_duration_s":0.0,"db_ci_duration_s":0.0,"db_main_replica_duration_s":0.0,"db_ci_replica_duration_s":0.0,"cpu_s":0.00167,"worker_id":"sidekiq_0","rate_limiting_gates":[],"duration_s":0.066827,"completed_at":"2023-06-20T10:12:03.199Z","load_balancing_strategy":"primary","db_duration_s":0.0}
  6. To check the metric, make sure sidekiq_exporter and prometheus are enabled in gdk.yml in the GDK root, then run gdk restart:
prometheus:
  enabled: true
gitlab:
  rails_background_jobs:
    sidekiq_exporter_enabled: true
  7. Check that the counter is incremented:
$ curl gdk.test:3807/metrics | rg skipped
# HELP sidekiq_jobs_skipped_total Multiprocess metric
# TYPE sidekiq_jobs_skipped_total counter
sidekiq_jobs_skipped_total{action="dropped",worker="Chaos::SleepWorker"} 1
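
The two manual checks in the steps above (the job_status in the log and the counter value) could also be scripted. The following is a minimal sketch under some assumptions: the helper names parse_gdk_log_line and skipped_count are hypothetical, the log sample is a shortened copy of the entry shown earlier, and the metrics text is pasted in rather than fetched from the exporter.

```ruby
require "json"

# Extract the JSON payload from a "<timestamp> <service> : <json>" line.
# Returns nil when the line carries no parseable JSON.
def parse_gdk_log_line(line)
  json_start = line.index("{")
  return nil unless json_start

  JSON.parse(line[json_start..])
rescue JSON::ParserError
  nil
end

# Find the sidekiq_jobs_skipped_total sample matching the given labels
# in Prometheus text exposition output. Returns nil when no sample matches.
def skipped_count(metrics_text, action:, worker:)
  metrics_text.each_line do |line|
    next if line.start_with?("#") # skip HELP and TYPE comments

    match = line.match(/\Asidekiq_jobs_skipped_total\{(?<labels>[^}]*)\}\s+(?<value>\S+)/)
    next unless match

    labels = match[:labels].scan(/(\w+)="([^"]*)"/).to_h
    return match[:value].to_f if labels["action"] == action && labels["worker"] == worker
  end
  nil
end

# Shortened sample of the log entry from the validation steps.
log_line = '2023-06-20_10:12:03.19963 rails-background-jobs : ' \
           '{"class":"Chaos::SleepWorker","jid":"65a90eac9b58596fce382bf9","job_status":"dropped"}'

# Sample exporter output copied from the validation steps.
metrics = <<~METRICS
  # HELP sidekiq_jobs_skipped_total Multiprocess metric
  # TYPE sidekiq_jobs_skipped_total counter
  sidekiq_jobs_skipped_total{action="dropped",worker="Chaos::SleepWorker"} 1
METRICS

entry = parse_gdk_log_line(log_line)
dropped_in_log = !entry.nil? && entry["job_status"] == "dropped"
dropped_count = skipped_count(metrics, action: "dropped", worker: "Chaos::SleepWorker")

puts "log shows dropped: #{dropped_in_log}"
puts "dropped counter: #{dropped_count.inspect}"
```

Matching on both the action and worker labels matters here because the renamed counter now carries dropped and deferred samples under the same metric name.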

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Gregorius Marco
