
Enqueuer Worker: update the lease key

David Fernandez requested to merge 361445-scope-lease-ley into master

🎛 Context

We're currently implementing a data migration on the Container Registry. This migration is driven by the Rails backend. For all the nitty-gritty details, see &7316 (comment 897867569).

At the core of the Rails logic lies the Enqueuer worker. Its job is to find the next image repository and start its migration on the Container Registry side.

Because this background job can be enqueued by multiple sources (itself, a cron schedule, or a migration hitting a final state), we use job deduplication as well as an exclusive lease to make sure that two executions can't work on the same image repository.

The problem is the lease key. It's a fixed one, which means that every execution uses the same key. In effect, we enforce serial execution even when multiple jobs are running at the same time.

On another note, the lease can be left behind during restarts or shutdowns. When that happens, the whole migration pauses for the lease duration, which is 30 minutes.
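To illustrate the stale lease problem, here is a minimal sketch using a hypothetical in-memory lease store. `ExpiringLeaseStore` and the key name are assumptions made for this example; they are not `Gitlab::ExclusiveLease` or the actual key. Time is passed in explicitly to keep the example deterministic.

```ruby
# Hypothetical in-memory lease store with a timeout (illustration only).
class ExpiringLeaseStore
  def initialize
    @expires_at = {}
  end

  # Returns true when the lease is obtained, false when it is still held.
  def try_obtain(key, timeout:, now:)
    expiry = @expires_at[key]
    return false if expiry && now < expiry

    @expires_at[key] = now + timeout
    true
  end
end

thirty_minutes = 30 * 60
store = ExpiringLeaseStore.new
key = 'enqueuer:fixed_key' # illustrative fixed key, not the real one

# A job obtains the lease, then the process is shut down before releasing it.
first = store.try_obtain(key, timeout: thirty_minutes, now: 0)
# 29 minutes later, every other execution is still blocked.
blocked = store.try_obtain(key, timeout: thirty_minutes, now: 29 * 60)
# Only once the 30 minutes have elapsed can work resume.
resumed = store.try_obtain(key, timeout: thirty_minutes, now: 30 * 60)
```

With a single fixed key, that 30 minute window blocks every image repository, not just the one that was being processed.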

Both aspects above are too restrictive. What we need to guarantee here is simply that 2 jobs don't work on the same image repository. If we have 2 jobs and they work on image repositories A and B, then we can let them run in parallel.

That's issue #361445 (closed).

The solution suggested by this MR is quite simple: use the image repository id in the lease key.
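As a rough sketch of the difference, the example below compares a fixed key with a key scoped by image repository id. `FakeLeaseStore` and `fixed_key` are hypothetical stand-ins invented for this illustration; only the scoped key format is taken from the validation steps in this description.

```ruby
# Hypothetical in-memory lease store (illustration only, no expiry).
class FakeLeaseStore
  def initialize
    @held = {}
  end

  # Returns true when the lease is obtained, false when it is already held.
  def try_obtain(key)
    return false if @held.key?(key)

    @held[key] = true
    true
  end
end

# Illustrative fixed key; the real previous key is not shown in this MR description.
def fixed_key
  'enqueuer:fixed_key'
end

# Key scoped by image repository id, in the format used in the validation steps.
def scoped_key(repository_id)
  "container_registry:migration:enqueuer_worker:for:#{repository_id}"
end

# Fixed key: the second job is blocked even though it targets another repository.
fixed_store = FakeLeaseStore.new
fixed_results = [1, 2].map { |_repository_id| fixed_store.try_obtain(fixed_key) }

# Scoped key: repositories 1 and 2 proceed in parallel, a second job on 1 is blocked.
scoped_store = FakeLeaseStore.new
scoped_results = [1, 2, 1].map { |id| scoped_store.try_obtain(scoped_key(id)) }
```

Here `fixed_results` is `[true, false]` while `scoped_results` is `[true, true, false]`: only executions targeting the same image repository exclude each other.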

The expected effect of this MR is a higher throughput because we allow parallel executions (as long as they don't work on the same image repository). We will not go into details but we have a "capacity" in place so that the throughput has a limit no matter what happens. This was put in place to avoid flooding the Container Registry with requests for migrations.

🔬 What does this MR do and why?

  • Update the Enqueuer lease key to take the selected image repository id into account.
  • Update the generated logs:
    • Log when an execution ends as a no-op because the lease is already taken.
    • Log when no image repository is selected.
  • Update the related specs.
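The resulting flow can be sketched roughly as follows. This is a simplified, hypothetical model and not the actual `EnqueuerWorker` code: `run_enqueuer`, `lease_key_for`, the `Set` of held leases, and the returned symbols are assumptions made for illustration. The two log keys match the ones shown in the validation steps.

```ruby
require 'set'

# Key format taken from the validation steps in this description.
def lease_key_for(repository_id)
  "container_registry:migration:enqueuer_worker:for:#{repository_id}"
end

# Hypothetical, simplified execution. `held_leases` stands in for the lease
# store and `log` for the extra fields attached to the job's structured log.
def run_enqueuer(repository_id, held_leases, log)
  if repository_id.nil?
    log[:no_container_repository_found] = true
    return :no_op
  end

  key = lease_key_for(repository_id)
  if held_leases.include?(key)
    log[:lease_already_taken] = true
    return :no_op
  end

  # The lease stays held here to simulate an execution that is still running.
  held_leases << key
  :migration_started
end

held = Set.new
first_log = {}
first = run_enqueuer(42, held, first_log)   # lease is free
second_log = {}
second = run_enqueuer(42, held, second_log) # lease is already taken
empty_log = {}
empty = run_enqueuer(nil, held, empty_log)  # nothing to migrate
```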

The migration is currently gated behind several feature flags and is happening only on gitlab.com. That's why no changelog was added to this MR.

📺 Screenshots or screen recordings

n / a

How to set up and validate locally

  1. Have a GDK ready with the registry setup.
  2. In a Rails console, create a dummy image repository that will be picked up for migration:
    image = FactoryBot.create(:container_repository, project: Project.first, created_at: 3.years.ago)
  3. Run the background worker with the lease already taken:
    lease_key = "container_registry:migration:enqueuer_worker:for:#{image.id}"
    Gitlab::ExclusiveLease.new(lease_key, timeout: 30.minutes).try_obtain
    
    ContainerRegistry::Migration::EnqueuerWorker.perform_async
  4. You will see this in the Sidekiq logs (truncated here for readability):
    {"class":"ContainerRegistry::Migration::EnqueuerWorker","job_status":"done","extra.container_registry_migration_enqueuer_worker.lease_already_taken":true}
    • Notice the extra.container_registry_migration_enqueuer_worker.lease_already_taken key set to true.

Now, let's see if we can display the log about no image repositories:

  1. Prevent the image from being picked up:
    image.update!(created_at: 5.seconds.ago)
  2. Re-execute the Enqueuer:
    ContainerRegistry::Migration::EnqueuerWorker.perform_async
  3. Check the logs:
    {"class":"ContainerRegistry::Migration::EnqueuerWorker","job_status":"done","extra.container_registry_migration_enqueuer_worker.no_container_repository_found":true}
    • Notice the extra.container_registry_migration_enqueuer_worker.no_container_repository_found key set to true.
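When checking many log lines, a small helper can look for these keys instead of reading the JSON by eye. The sketch below uses only Ruby's standard `json` library; `enqueuer_flag?` is a hypothetical helper name, and the sample lines are the truncated ones shown above.

```ruby
require 'json'

# Hypothetical helper: true when the given Enqueuer flag is set in a JSON log line.
def enqueuer_flag?(log_line, flag)
  key = "extra.container_registry_migration_enqueuer_worker.#{flag}"
  JSON.parse(log_line)[key] == true
end

lease_line = '{"class":"ContainerRegistry::Migration::EnqueuerWorker","job_status":"done","extra.container_registry_migration_enqueuer_worker.lease_already_taken":true}'
empty_line = '{"class":"ContainerRegistry::Migration::EnqueuerWorker","job_status":"done","extra.container_registry_migration_enqueuer_worker.no_container_repository_found":true}'

lease_taken = enqueuer_flag?(lease_line, 'lease_already_taken')
nothing_found = enqueuer_flag?(empty_line, 'no_container_repository_found')
mismatch = enqueuer_flag?(lease_line, 'no_container_repository_found')
```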

🚥 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by David Fernandez
