Enqueuer job: fix the re-enqueue
💈 Context
We're currently implementing a data migration on the Container Registry. This migration is going to be driven by the rails backend.
At the core of the rails part lies the Enqueuer worker. Its responsibility is to find the next eligible image repository to migrate and call the container registry to start/retry the migration.
To keep things under control, we have the concept of `capacity`. Those are like slots that ongoing migrations can take. For example, let's say we have a capacity of 10. The Enqueuer has to start the migration on 10 image repositories.
(A) How do we achieve that? Simply by re-enqueuing a job at the end of the Enqueuer `#perform` if the current load and the current capacity allow it. Using our example again, the Enqueuer will "chain" 9 executions after the first one.
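The chaining above can be sketched in plain Ruby. This is a hypothetical simulation, not the actual worker: `CAPACITY`, the queue array, and the state hash are all assumptions made for illustration.

```ruby
# Simulated re-enqueue chaining: each execution starts one migration and,
# while there is spare capacity, re-enqueues itself at the end of #perform.
CAPACITY = 10 # assumed capacity value, matching the example above

def perform(queue, state)
  state[:ongoing] += 1 # start the migration on one image repository
  # re-enqueue only while we are still below capacity
  queue << :enqueuer_job if state[:ongoing] < CAPACITY
end

queue = [:enqueuer_job] # the first enqueued job
state = { ongoing: 0 }
executions = 0

until queue.empty?
  queue.shift
  executions += 1
  perform(queue, state)
end
```

With a capacity of 10, this runs 10 executions in total (the first one plus 9 chained ones) and fills all 10 slots.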
(B) In !83091 (merged), we extended the deduplication with `until_executed`. The reason was that we noticed on staging that we could have situations where multiple jobs could interleave their executions (see #356130 (closed)) and that's not something we want.
Now combine (B) with (A) and what happens? Well, simple: the re-enqueue is executed but, because `until_executed` deduplication is in place, that re-enqueue is immediately rejected. The Enqueuer is not properly filling ongoing migrations until reaching capacity: it's as if we had a forced capacity of 1. This is issue #356433 (closed).
This MR tries to fix the situation with:
- Use `until_executing` for the deduplication so that the re-enqueue happening at the end of the `#perform` is successful.
- Use an exclusive lease so that we guarantee that two parallel jobs can't run together. One of them will simply end with a no-op.
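The exclusive-lease behavior can be illustrated with a minimal simulation. This is an assumption-laden sketch: the real worker relies on GitLab's Redis-backed exclusive lease, not a `Mutex`, and `try_obtain_lease` here is a simplified stand-in.

```ruby
# Simulated exclusive lease: of two parallel jobs, only the one that takes
# the lease does any work; the other immediately ends as a no-op.
LEASE = Mutex.new
RESULTS = Queue.new

def try_obtain_lease
  # try_lock never blocks: a second caller gets false and becomes a no-op
  return RESULTS << :noop unless LEASE.try_lock

  begin
    RESULTS << :performed
    yield
  ensure
    LEASE.unlock
  end
end

first  = Thread.new { try_obtain_lease { sleep 0.2 } } # grabs the lease
sleep 0.05                                             # let the first job take it
second = Thread.new { try_obtain_lease { sleep 0.2 } } # lease taken: no-op
[first, second].each(&:join)
```

Translated back to the worker: when two Enqueuer jobs happen to run in parallel, only the one holding the lease fills migration slots; the other returns right away.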
🔬 What does this MR do and why?
- Use `until_executing` deduplication
- Use an exclusive lease in the `#perform` method
- Update the related spec
🖼 Screenshots or screen recordings
n/a
📸 How to set up and validate locally
We can't really validate the `until_executing` deduplication, but we can check the exclusive lease usage.
- Update the `#perform` method to:

  ```ruby
  def perform
    try_obtain_lease do
      sleep 60 * 5
    end
  end
  ```
- Update the background jobs script (`Procfile` in GDK) to have `SIDEKIQ_WORKERS=2` and make sure that when you start your background workers, you get: `Starting cluster with 2 processes`.
- This is important as the first process will handle the first job and sleep for 5 minutes.
- Tail the background jobs logs:

  ```shell
  $ gdk tail rails-background-jobs
  ```
- In a rails console, enqueue the first job:

  ```ruby
  ContainerRegistry::Migration::EnqueuerWorker.perform_async
  # => "ae41adf62044c8a8456db633"
  ```
- Wait for the `start` message:

  ```json
  {"severity":"INFO","time":"2022-03-23T15:35:12.234Z","class":"ContainerRegistry::Migration::EnqueuerWorker","jid":"ae41adf62044c8a8456db633","job_status":"start"}
  ```
- Enqueue the second job:

  ```ruby
  ContainerRegistry::Migration::EnqueuerWorker.perform_async
  # => "f2758b4400237162ab8bac75"
  ```
- The second job immediately ends because of the lease taken:

  ```json
  {"severity":"INFO","time":"2022-03-23T15:36:35.608Z","class":"ContainerRegistry::Migration::EnqueuerWorker","jid":"f2758b4400237162ab8bac75","job_status":"start"}
  {"severity":"INFO","time":"2022-03-23T15:36:36.177Z","class":"ContainerRegistry::Migration::EnqueuerWorker","jid":"f2758b4400237162ab8bac75","message":"ContainerRegistry::Migration::EnqueuerWorker JID-f2758b4400237162ab8bac75: done: 0.568867 sec","job_status":"done"}
  ```
🚥 MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- [ ] I have evaluated the MR acceptance checklist for this MR.