Defer sidekiq jobs based on database health status indicators (!121277) · Merge requests · GitLab.org / GitLab

Prabakaran Murugesan requested to merge 404898-db-health-based-throttle-sidekiq-jobs into master May 19, 2023

What does this MR do and why?

This is part III of #404898 (closed). !121261 (merged) exposed the variables needed to define the health check context. This MR performs the database health check and defers (re-queues) the Sidekiq job if the database signals to stop.

Approach:

We extend the existing Sidekiq DeferJobs middleware to perform a check on database health status as well.
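
As a rough illustration of this approach, here is a minimal, self-contained sketch of a Sidekiq-style server middleware that re-schedules a job instead of running it when a health check says to stop. It is not the code from this MR: SketchDeferJobs, SketchWorker, the health_check and scheduler callables, and the hash keys returned by database_health_check_attrs are all assumptions made for the example. The real middleware also consults the feature flag and the actual database health indicators.

```ruby
# Hypothetical sketch, not GitLab's implementation.
# A Sidekiq server middleware is an object that responds to
# call(worker, job, queue) and yields to let the job run.
class SketchDeferJobs
  # health_check: callable taking a worker class; returns true when the
  #               database signals "stop" (assumed interface).
  # scheduler:    callable (worker_class, delay_seconds, args) that
  #               re-enqueues the job. In a real Sidekiq process this
  #               could be klass.perform_in(delay, *args).
  def initialize(health_check:, scheduler:)
    @health_check = health_check
    @scheduler = scheduler
  end

  def call(worker, job, _queue)
    klass = worker.class

    if defer?(klass)
      delay = klass.database_health_check_attrs[:delay_by]
      @scheduler.call(klass, delay, job['args'])
      return # skip execution; the job runs later as a newly scheduled job
    end

    yield
  end

  private

  # Defer only workers that opted in and whose health check says "stop".
  def defer?(klass)
    klass.respond_to?(:defer_on_database_health_signal?) &&
      klass.defer_on_database_health_signal? &&
      @health_check.call(klass)
  end
end

# Stand-in worker exposing the attributes the sketch reads.
# The hash keys are assumed for illustration.
class SketchWorker
  def self.database_health_check_attrs
    { gitlab_schema: :gitlab_main, delay_by: 60, tables: [:users] }
  end

  def self.defer_on_database_health_signal?
    true
  end
end

scheduled = []
middleware = SketchDeferJobs.new(
  health_check: ->(_klass) { true }, # pretend the database says "stop"
  scheduler: ->(klass, delay, args) { scheduled << [klass, delay, args] }
)

executed = false
middleware.call(SketchWorker.new, { 'args' => [1] }, 'default') { executed = true }

puts "executed: #{executed}, scheduled entries: #{scheduled.size}"
```

Injecting the health check and scheduler as callables keeps the sketch runnable without Sidekiq or a database, and mirrors the local validation trick of forcing the health predicate to return true.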

Feature flag:

This feature is behind a feature flag (defer_sidekiq_workers_on_database_health_signal). Rollout issue: #412990 (closed)

P.S. The initial effort was done in MR#119187, which has the entire changeset and became huge, so it was split into smaller MRs to have more control.

How to set up and validate locally

Enable the 'defer_sidekiq_workers_on_database_health_signal' feature flag from the Rails console: Feature.enable(:defer_sidekiq_workers_on_database_health_signal).
Create a new worker or choose an existing one. This walkthrough uses 'Chaos::SleepWorker'.

Set 'database_health_check_attrs' for the worker, e.g.:

 # frozen_string_literal: true

 module Chaos
   class SleepWorker # rubocop:disable Scalability/IdempotentWorker
     ...
     ...
     defer_on_database_health_signal :gitlab_main, 1.minute, [:users]

     def perform(duration_s)
       Gitlab::Chaos.sleep(duration_s)
     end
   end
 end

Chaos::SleepWorker.defer_on_database_health_signal? should now return true.
Make defer_job_by_database_health_signal? return true locally, so that we don't have to set up an actual database health status evaluation.
Run reload! in the local Rails console if you are reusing the same console.
Tail the job logs: gdk tail rails-background-jobs | grep Chaos::SleepWorker
- If needed, restart the service locally with gdk restart rails-background-jobs

Performing the job should schedule it to run a minute later instead of executing it immediately.

pry(main)> Chaos::SleepWorker.perform_async(1)
=> "9e402fc977d7bcf32597fe91"

pry(main)> queue = Sidekiq::ScheduledSet.new
pry(main)> queue.map { |job| job }
=> [#<Sidekiq::SortedEntry:0x00000001495b7788
 @args=nil,
 @item=
 {"retry"=>3,
   "queue"=>"default",
   "backtrace"=>true,
   "version"=>0,
   "queue_namespace"=>"chaos",
   "class"=>"Chaos::SleepWorker",
   "args"=>[1],
   "jid"=>"d51303502cb5f6849488961b",
   "created_at"=>1685034663.062152,
   "correlation_id"=>"b8b450db286639352dd5195d6c85ce13",
   "meta.caller_id"=>"Chaos::SleepWorker",
   "meta.feature_category"=>"not_owned",
   "meta.root_caller_id"=>"Chaos::SleepWorker",
   "worker_data_consistency"=>"always",
   "size_limiter"=>"validated",
   "scheduled_at"=>1685034723.062093},
 @parent=#<Sidekiq::ScheduledSet:0x000000014950ccc0 @_size=1, @name="schedule">,
 @queue="default",
 @score=1685034723.062093,
 @value=
 "{\"retry\":3,\"queue\":\"default\",\"backtrace\":true,\"version\":0,\"queue_namespace\":\"chaos\",\"class\":\"Chaos::SleepWorker\",\"args\":[1],\"jid\":\"d51303502cb5f6849488961b\",\"created_at\":1685034663.062152,\"correlation_id\":\"b8b450db286639352dd5195d6c85ce13\",\"meta.caller_id\":\"Chaos::SleepWorker\",\"meta.feature_category\":\"not_owned\",\"meta.root_caller_id\":\"Chaos::SleepWorker\",\"worker_data_consistency\":\"always\",\"size_limiter\":\"validated\",\"scheduled_at\":1685034723.062093}">]

Note the 'jid' for later, and note that 'scheduled_at' is one minute after 'created_at'.
After a minute, logs should start appearing in the rails-background-jobs tail started earlier.
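
The one-minute gap can be checked directly from the two timestamps in the ScheduledSet output above. This is only a small arithmetic sketch using the values copied from that output:

```ruby
# Epoch timestamps copied from the ScheduledSet output above.
created_at   = 1685034663.062152
scheduled_at = 1685034723.062093

# Difference in seconds between enqueueing and the deferred run time.
delay_seconds = scheduled_at - created_at
puts format('deferred by %.3f seconds', delay_seconds)

# Human-readable form of both timestamps (UTC).
puts Time.at(created_at).utc
puts Time.at(scheduled_at).utc
```

The difference rounds to 60 seconds, which matches the 1.minute delay configured on the worker.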

When the job executes now, it is re-queued again (because defer_job_by_database_health_signal? is still forced to return true), but with a different 'jid' than the previous one.

pry(main)> queue = Sidekiq::ScheduledSet.new
pry(main)> queue.map { |job| job }
# We should be able to see another job scheduled after a minute of prev job execution, with new 'jid'
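
To make the jid comparison less error-prone than reading the inspect output by eye, the job payloads can be filtered by worker class. The helper below is a hypothetical sketch (scheduled_jids_for is not part of Sidekiq or GitLab) that works on plain payload hashes. The second jid is a made-up placeholder standing in for whatever value the re-queued job actually receives.

```ruby
# Hypothetical helper: given job payload hashes (the "@item" shown in the
# output above), return the jids of scheduled jobs for one worker class.
def scheduled_jids_for(items, worker_class_name)
  items
    .select { |item| item['class'] == worker_class_name }
    .map { |item| item['jid'] }
end

# In a Rails console the payloads could come from:
#   Sidekiq::ScheduledSet.new.map(&:item)
# Here two hand-written payloads stand in for two consecutive snapshots.
first_snapshot  = [{ 'class' => 'Chaos::SleepWorker', 'jid' => 'd51303502cb5f6849488961b' }]
second_snapshot = [{ 'class' => 'Chaos::SleepWorker', 'jid' => 'placeholder-new-jid' }] # made-up jid

before = scheduled_jids_for(first_snapshot, 'Chaos::SleepWorker')
after  = scheduled_jids_for(second_snapshot, 'Chaos::SleepWorker')

puts "re-queued with a new jid: #{(after - before).any?}"
```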

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Related to #404898 (closed)

Edited May 25, 2023 by Prabakaran Murugesan
