Allow migrating scheduled and to-be-retried Sidekiq jobs

We now allow Sidekiq worker routing to be configured by administrators. For example, they can say 'all jobs go to the default queue', or 'project export and import workers share a queue'. Right now, the only really useful case is to re-route jobs to the default queue, but we will support other options in future.

Migrating jobs from one queue to another sounds simple: listen to both the old and new queues, update the worker routing, wait for the old queue to be empty, and stop listening to the old queue. But there's a catch: Sidekiq maintains two sorted sets of jobs that are to be run in the future. There is the scheduled set (for jobs that use perform_in, perform_at, or similar, where we choose to run a job in the future) and the retry set (after failing, a job is retried with some back-off).

Both of those sets are 'global': there isn't one for each possible destination queue. That means that the set entries themselves contain information about their destination queue. In the migration case above, the destination queue recorded in an entry might be the old queue, which is no longer listened to.
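
To make this concrete, here is a minimal sketch of what an entry in one of those sets looks like and how its destination could be rewritten. The class, queue, args, and jid fields follow Sidekiq's job payload format; the reroute helper, the mapping hash, and the jid value are hypothetical names chosen for illustration, not the code added in this merge request.

```ruby
require 'json'

# An entry in the scheduled or retry set is a JSON string. Its sorted-set
# score (not shown here) is the time at which the job should run.
entry = JSON.generate(
  'class' => 'PostReceive',
  'queue' => 'post_receive',
  'args' => [0],
  'jid' => 'abc123' # illustrative value
)

# Hypothetical helper: returns the rewritten JSON, or nil when the entry
# does not need to change.
def reroute(entry, mappings)
  job = JSON.parse(entry)
  destination = mappings[job['class']]
  return nil if destination.nil? || job['queue'] == destination

  job['queue'] = destination
  JSON.generate(job)
end

migrated = reroute(entry, 'PostReceive' => 'default')
puts JSON.parse(migrated)['queue'] # prints: default
```

Because the queue name is part of the member itself, changing it means removing the old member and adding a new one with the same score.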

This adds two Rake tasks (one for the retry set and one for the scheduled set) that allow administrators to rewrite the job data in those sorted sets.

It uses these Redis commands:

  1. ZSCAN to iterate over the sets. This is O(1) per call, and provides useful guarantees about iterating over a set that may be changing as it's operated on.
  2. ZREM to remove the old job hash. This is O(log(N)) per call, where N is the number of elements in the set.
  3. ZADD to add the new job hash with the new queue name. This is also O(log(N)) per call.

ZREM and ZADD will each be called once per item to be migrated, so there may be many invocations of these commands during this task's run.
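
The loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation in this merge request: migrate_set is a hypothetical helper, and redis is assumed to be a client that responds to zscan, zrem, and zadd the way the redis-rb gem does (zscan returning the next cursor plus [member, score] pairs, zrem returning true when a single member was removed).

```ruby
require 'json'

# Hypothetical helper. `mappings` is a hash of worker class name => new queue.
def migrate_set(redis, key, mappings)
  scanned = 0
  migrated = 0
  cursor = '0'

  loop do
    # ZSCAN returns the next cursor and a batch of [member, score] pairs.
    cursor, entries = redis.zscan(key, cursor)

    entries.each do |member, score|
      scanned += 1
      job = JSON.parse(member)
      destination = mappings[job['class']]
      next if destination.nil? || job['queue'] == destination

      # Assumption of this sketch: only re-add the job if we removed it
      # ourselves, in case something else took it since it was scanned.
      next unless redis.zrem(key, member)

      job['queue'] = destination
      # Re-add with the original score so the run time is preserved.
      redis.zadd(key, score, JSON.generate(job))
      migrated += 1
    end

    # A cursor of "0" means the iteration is complete.
    break if cursor == '0'
  end

  { scanned: scanned, migrated: migrated }
end
```

ZSCAN may return the same element more than once while the set is being modified, so a scanned count can exceed the set's size. In this sketch, an entry that has already been rewritten is skipped on a second visit because its queue already matches the destination.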

Testing

A simple way to test this, without changing any local configuration, is to run the following in a console:

10000.times { |i| AuthorizedProjectsWorker.perform_in(i.minutes, 0) }
10000.times { |i| PostReceive.perform_in(i.minutes, 0) }
Gitlab::SidekiqMigrateJobs.new('schedule').execute('PostReceive' => 'default')

(You don't need to be running Sidekiq.)

And then run the task, which will migrate PostReceive jobs back from the default queue to the post_receive queue:

$ bundle exec rake gitlab:sidekiq:migrate_jobs:schedule
I, [2021-05-10T19:25:41.330799 #64971]  INFO -- : Processing schedule set. Estimated size: 19998.
I, [2021-05-10T19:25:41.424507 #64971]  INFO -- : In progress. Scanned records: 1000. Migrated records: 485.
I, [2021-05-10T19:25:41.502785 #64971]  INFO -- : In progress. Scanned records: 2000. Migrated records: 977.
I, [2021-05-10T19:25:41.612464 #64971]  INFO -- : In progress. Scanned records: 3000. Migrated records: 1449.
I, [2021-05-10T19:25:41.694336 #64971]  INFO -- : In progress. Scanned records: 4000. Migrated records: 1888.
I, [2021-05-10T19:25:41.842944 #64971]  INFO -- : In progress. Scanned records: 5000. Migrated records: 2365.
I, [2021-05-10T19:25:42.017017 #64971]  INFO -- : In progress. Scanned records: 6000. Migrated records: 2792.
I, [2021-05-10T19:25:42.229430 #64971]  INFO -- : In progress. Scanned records: 7000. Migrated records: 3223.
I, [2021-05-10T19:25:42.352093 #64971]  INFO -- : In progress. Scanned records: 8000. Migrated records: 3667.
I, [2021-05-10T19:25:42.429180 #64971]  INFO -- : In progress. Scanned records: 9000. Migrated records: 4101.
I, [2021-05-10T19:25:42.505926 #64971]  INFO -- : In progress. Scanned records: 10000. Migrated records: 4503.
I, [2021-05-10T19:25:42.592300 #64971]  INFO -- : In progress. Scanned records: 11000. Migrated records: 4902.
I, [2021-05-10T19:25:42.662101 #64971]  INFO -- : In progress. Scanned records: 12000. Migrated records: 5299.
I, [2021-05-10T19:25:42.734463 #64971]  INFO -- : In progress. Scanned records: 13000. Migrated records: 5712.
I, [2021-05-10T19:25:42.822835 #64971]  INFO -- : In progress. Scanned records: 14000. Migrated records: 6130.
I, [2021-05-10T19:25:42.971456 #64971]  INFO -- : In progress. Scanned records: 15000. Migrated records: 6530.
I, [2021-05-10T19:25:43.034188 #64971]  INFO -- : In progress. Scanned records: 16000. Migrated records: 6911.
I, [2021-05-10T19:25:43.099864 #64971]  INFO -- : In progress. Scanned records: 17000. Migrated records: 7298.
I, [2021-05-10T19:25:43.177007 #64971]  INFO -- : In progress. Scanned records: 18000. Migrated records: 7666.
I, [2021-05-10T19:25:43.243486 #64971]  INFO -- : In progress. Scanned records: 19000. Migrated records: 8026.
I, [2021-05-10T19:25:43.305267 #64971]  INFO -- : In progress. Scanned records: 20000. Migrated records: 8353.
I, [2021-05-10T19:25:43.365154 #64971]  INFO -- : In progress. Scanned records: 21000. Migrated records: 8680.
I, [2021-05-10T19:25:43.432905 #64971]  INFO -- : In progress. Scanned records: 22000. Migrated records: 9028.
I, [2021-05-10T19:25:43.509522 #64971]  INFO -- : In progress. Scanned records: 23000. Migrated records: 9349.
I, [2021-05-10T19:25:43.614153 #64971]  INFO -- : In progress. Scanned records: 24000. Migrated records: 9685.
I, [2021-05-10T19:25:43.679748 #64971]  INFO -- : In progress. Scanned records: 25000. Migrated records: 9993.
I, [2021-05-10T19:25:43.681029 #64971]  INFO -- : Done. Scanned records: 25025. Migrated records: 9999.

Related issues

For gitlab-com/gl-infra/scalability#1015 (closed).

Edited by Sean McGivern
