Introduce lock-free rescheduling for duplicate job
What does this MR do and why?
This MR removes the use of dedup locks and introduces a signalling key for until_executed
server middleware to determine whether to reschedule jobs. This is a lock-free way to fix the race condition in rescheduling jobs for until_executed, reschedule_once
workers.
Using a simple key instead of a msgpacked string, we can easily use multi
to transactionally check and delete the signalling key. Without needing the deduplicated
key in the msgpacked string, we can delete the cookie key before performing the transaction above. This way we guarantee that the job is scheduled at least once (either from server or client middleware) during a data race.
This approach only modifies the server middleware's rescheduling logic, leaving the client middleware's deduplication logic completely unchanged. This helps to prevent broken deduplication logic (lock contention under high loads) that caused gitlab-com/gl-infra/production#18264 (comment 1995096958).
Scalability considerations:
- No exclusive lease usage. This approach does not add locks.
- Redis commands introduced are
del
andsetnx
, both runs very fast on the Redis server. - No change to client-middleware deduplication behaviour.
Background of ideas: #471245 (comment 1996229546)
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Screenshots or screen recordings
Screenshots are required for UI changes, and strongly recommended for all other merge requests.
Before | After |
---|---|
How to set up and validate locally
With feature flag enabled and cron disabled (gdk.test:3000/admin/sidekiq/cron
). To be honest here, I'm not entirely sure if this is a good way to validate the protection against a data race on the rescheduling logic.
- Apply the diff. The between dependent Redis commands help create more instances of data races while the cache counters help keep track. While repeating the test, I noticed that it the log does not pipe cleanly into the text file. I opted to use this approach instead where I incremented a counter in the cache database:
Click to expand
diff --git a/lib/gitlab/sidekiq_logging/deduplication_logger.rb b/lib/gitlab/sidekiq_logging/deduplication_logger.rb
index 85d7da048377..527afd27eef3 100644
--- a/lib/gitlab/sidekiq_logging/deduplication_logger.rb
+++ b/lib/gitlab/sidekiq_logging/deduplication_logger.rb
@@ -15,6 +15,7 @@ def deduplicated_log(job, deduplication_type, deduplication_options = {})
payload.merge!(
deduplication_options.compact.transform_keys { |k| "deduplication.options.#{k}" })
+ Gitlab::Redis::Cache.with { |c| c.incr("expt:deduped") } if job['rescheduled_once']
Sidekiq.logger.info payload
end
@@ -30,6 +31,8 @@ def rescheduled_log(job)
payload['job_status'] = 'rescheduled'
payload['message'] = "#{base_message(payload)}: rescheduled"
+ Gitlab::Redis::Cache.with { |c| c.incr("expt:rescheduled") }
+
Sidekiq.logger.info payload
end
end
diff --git a/lib/gitlab/sidekiq_logging/structured_logger.rb b/lib/gitlab/sidekiq_logging/structured_logger.rb
index 4754417639f1..cf1d31e6cf81 100644
--- a/lib/gitlab/sidekiq_logging/structured_logger.rb
+++ b/lib/gitlab/sidekiq_logging/structured_logger.rb
@@ -86,6 +86,8 @@ def log_job_done(job, started_time, payload, job_exception = nil)
'done'
end
+ Gitlab::Redis::Cache.with { |c| c.incr("expt:done") }
+
payload['message'] = "#{message}: #{job_status}: #{payload['duration_s']} sec"
payload['job_status'] = job_status
payload['job_deferred_by'] = job['deferred_by'] if job['deferred']
diff --git a/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/deduplicates_when_scheduling.rb b/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/deduplicates_when_scheduling.rb
index 0deb1686b9c9..1b36ce881cfd 100644
--- a/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/deduplicates_when_scheduling.rb
+++ b/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/deduplicates_when_scheduling.rb
@@ -35,6 +35,8 @@ def deduplicate?(job)
return false unless check! && duplicate_job.duplicate?
+ sleep 0.01
+
job['duplicate-of'] = duplicate_job.existing_jid
return false unless duplicate_job.idempotent? # only dedup idempotent jobs
diff --git a/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executed.rb b/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executed.rb
index fce677492565..3d0f9762827f 100644
--- a/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executed.rb
+++ b/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executed.rb
@@ -22,10 +22,12 @@ def perform(job)
should_reschedule = if Feature.enabled?(:use_sidekiq_dedup_signaling, type: :beta) # rubocop:disable Gitlab/FeatureFlagWithoutActor -- global flags
# use the deduplicating-signalling key
duplicate_job.delete!
+ sleep 0.01
duplicate_job.check_reschedule_and_delete
else
# use the deduplicated flag in cookie
res = duplicate_job.should_reschedule?
+ sleep 0.01
duplicate_job.delete!
res
end
- Run on rails console
[1] pry(main)> count = []
=> []
[2] pry(main)> threads = []
=> []
[3] pry(main)> 10.times {
[3] pry(main)* threads << Thread.new {100.times { count << PipelineProcessWorker.perform_async(123456) != nil } }
[3] pry(main)* }
=> 10
[4] pry(main)> threads.each(&:join)
=> [#<Thread:0x000000016ebfa828 (pry):4 dead>,
#<Thread:0x000000016ebfa670 (pry):4 dead>,
#<Thread:0x000000016ebfa530 (pry):4 dead>,
#<Thread:0x000000016ebfa418 (pry):4 dead>,
#<Thread:0x000000016ebfa2b0 (pry):4 dead>,
#<Thread:0x000000016ebfa1c0 (pry):4 dead>,
#<Thread:0x000000016ebfa0a8 (pry):4 dead>,
#<Thread:0x000000016ebf9f18 (pry):4 dead>,
#<Thread:0x000000016ebf9e28 (pry):4 dead>,
#<Thread:0x000000016ebf9cc0 (pry):4 dead>]
[5] pry(main)> count.size
=> 1000
[6] pry(main)> count.compact.count
=> 78
- Check the 3 counter keys. It should add up to the following formula (note that this differs by runs).
redis /Users/sylvesterchin/work/gitlab-development-kit/redis/redis.socket[2]> mget expt:deduped expt:rescheduled expt:done
1) "75"
2) "99"
3) "102"
-----
"rescheduled" - "deduplicated" + count = "done"
99-75+78 = 102
Note: Rescheduled jobs could get deduplicated too. This can be seen in the logs.
- Repeating this with feature flag disabled yields the same result. This shows that dedup logic works as intended. Note: i've ran this many times with varying amount and random sleeps injected but both approaches would add up correctly. The race condition can only be replicated with intentionally timed sleeps (see below).
Race condition validation
We can re-use !158329 (merged) to show that the existing solution will create a race condition while this approach does not.
Click to show diff
diff --git a/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/deduplicates_when_scheduling.rb b/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/deduplicates_when_scheduling.rb
index 0deb1686b9c9..aa7eb1276efe 100644
--- a/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/deduplicates_when_scheduling.rb
+++ b/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/deduplicates_when_scheduling.rb
@@ -39,6 +39,9 @@ def deduplicate?(job)
return false unless duplicate_job.idempotent? # only dedup idempotent jobs
+ puts "SLEEP BEFORE SETTING DEDUP"
+ sleep 1
+
duplicate_job.update_latest_wal_location!
duplicate_job.set_deduplicated_flag!
diff --git a/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executed.rb b/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executed.rb
index fce677492565..107d3bdaf3f5 100644
--- a/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executed.rb
+++ b/lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executed.rb
@@ -19,17 +19,25 @@ def perform(job)
# early return since not reschedulable. ensure block to handle cleanup.
return unless duplicate_job.reschedulable?
+ puts "BEFORE CHECKING RESCHEDULE" # rubocop:disable Rails/Output -- debug
+ sleep 1
+
should_reschedule = if Feature.enabled?(:use_sidekiq_dedup_signaling, type: :beta) # rubocop:disable Gitlab/FeatureFlagWithoutActor -- global flags
# use the deduplicating-signalling key
duplicate_job.delete!
+ puts "AFTER DELETING"
+ sleep 1
duplicate_job.check_reschedule_and_delete
else
# use the deduplicated flag in cookie
res = duplicate_job.should_reschedule?
+ puts "BEFORE DELETING"
+ sleep 1
duplicate_job.delete!
res
end
+ puts "RESCHEDULE?: #{should_reschedule}"
job_deleted = true
duplicate_job.reschedule if should_reschedule
- Steps to reproduce the problem can be found in the past MR linked above
- To verify fix, enable the feature flag
-
Scenario 1: client does dedup check after server deletes cookie Trigger 2
PipelineProcessWorker.perform_async(123456)
jobs on the console. A cue would be to trigger the 2nd job afterBEFORE CHECKING RESCHEDULE
and beforeAFTER DELETING
(this shows up after the server acquires the excl lease). - The client will schedule the job since the server deletes the cookie key and no deduplication happens:
# 1st job
024-07-14_13:20:25.23358 rails-background-jobs : BEFORE CHECKING RESCHEDULE
2024-07-14_13:20:26.23981 rails-background-jobs : AFTER DELETING
...
2024-07-14_13:20:26.70534 rails-background-jobs : BEFORE CHECKING RESCHEDULE # 2nd job. this is fine since the the first job has completed the worker logic
2024-07-14_13:20:27.24511 rails-background-jobs : RESCHEDULE?: false # 1st job
...
2024-07-14_13:20:27.71332 rails-background-jobs : AFTER DELETING # 2nd job
2024-07-14_13:20:28.71956 rails-background-jobs : RESCHEDULE?: false # 2nd job
-
Scenario 2: client does dedup check before server deletes cookie: Run
PipelineProcessWorker.perform_async(123456); PipelineProcessWorker.perform_async(123456)
to quickly trigger 2 jobs on the console. Observe that server reschedules the job.
# 1st job
2024-07-14_13:25:30.25530 rails-background-jobs : BEFORE CHECKING RESCHEDULE
2024-07-14_13:25:31.26005 rails-background-jobs : AFTER DELETING
2024-07-14_13:25:32.26398 rails-background-jobs : RESCHEDULE?: true
...
# 2nd job
2024-07-14_13:25:32.33674 rails-background-jobs : BEFORE CHECKING RESCHEDULE
2024-07-14_13:25:33.34506 rails-background-jobs : AFTER DELETING
2024-07-14_13:25:34.34886 rails-background-jobs : RESCHEDULE?: false