Tell runners to back off while executing database migrations
Executing migrations that acquire locks on the ci_builds table is becoming increasingly challenging because the table is under continuous use from runners and Sidekiq. Migrations that execute well in the testing and staging environments fail unpredictably when they hit production. In the latest occurrence, gitlab-com/gl-infra/production#8521 (closed), we were competing with runners that were trying to upload artifacts, which caused deadlocks.
This is a problem for CI data partitioning because we can merge only one MR at a time and must wait for it to execute successfully in production before merging the next.
Proposal
When a migration that needs an expensive lock on the ci_builds/metadata tables starts, return a 429 response to all runner requests with a Retry-After: 300 header.
We can add a database helper to annotate the migrations that need this functionality:
```ruby
class MyMigration < Gitlab::Database::Migration[2.1]
  enable_runner_backoff!

  def up
    execute("LOCK TABLE #{TARGET_TABLE_NAME}, #{SOURCE_TABLE_NAME} IN SHARE ROW EXCLUSIVE MODE")
    ...
  end
end
```
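As a rough illustration of what the helper could do, `enable_runner_backoff!` might set a class-level flag that the migration framework checks, marking a shared backoff window only while the migration body runs. This is a hypothetical sketch, not GitLab's actual implementation: `Backoff`, `RunnerBackoffConcern`, and `with_runner_backoff` are all invented names, and an in-memory hash stands in for the shared store.

```ruby
# Hypothetical sketch of how `enable_runner_backoff!` could be wired up.
# `Backoff` stands in for a shared (e.g. Redis-backed) store; all names
# here are illustrative.
module Backoff
  STORE = {}

  def self.enable!(ttl_seconds)
    STORE[:runner_backoff] = Time.now + ttl_seconds
  end

  def self.disable!
    STORE.delete(:runner_backoff)
  end

  def self.active?
    deadline = STORE[:runner_backoff]
    !deadline.nil? && Time.now < deadline
  end
end

# Extended into the migration class; `enable_runner_backoff!` just sets
# a flag, and the framework wraps the migration body accordingly.
module RunnerBackoffConcern
  def enable_runner_backoff!
    @runner_backoff = true
  end

  def runner_backoff?
    @runner_backoff == true
  end

  def with_runner_backoff(ttl_seconds: 300)
    return yield unless runner_backoff?

    Backoff.enable!(ttl_seconds)
    begin
      yield
    ensure
      # Always clear the flag, even if the migration raises.
      Backoff.disable!
    end
  end
end

class ExampleMigration
  extend RunnerBackoffConcern
  enable_runner_backoff!

  def self.up
    with_runner_backoff do
      # execute("LOCK TABLE ...") would run here; runners would
      # receive 429s only inside this window.
      Backoff.active?
    end
  end
end
```

Keeping the flag scoped to the migration body (with an `ensure`) means a failed migration does not leave runners backing off indefinitely.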
And for the API:
```ruby
module API
  module Ci
    class Runner < ::API::Base
      before do
        if locking_migrations_in_progress?
          set_status_code_in_env(429)
          error!({}, 429, { 'Retry-After' => 5.minutes })
        end
      end
      ...
```
I think we can use a Redis key to communicate between the migrations and the API.
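The handshake could be as simple as key existence: the migration sets a key with a TTL before taking the lock, and the API's `before` hook checks it. A minimal sketch, with a plain hash plus expiry timestamp standing in for Redis; the key name and helper names are hypothetical.

```ruby
# Hypothetical sketch of the Redis-key handshake. A plain hash with an
# expiry timestamp stands in for Redis; the key name is illustrative.
BACKOFF_KEY = 'gitlab:ci:runner_backoff'
FAKE_REDIS = {}

# Migration side: set the key with a TTL (like Redis SET ... EX 300),
# so a crashed migration cannot leave runners backing off forever.
def start_locking_migration!(ttl_seconds: 300)
  FAKE_REDIS[BACKOFF_KEY] = Time.now + ttl_seconds
end

def finish_locking_migration!
  FAKE_REDIS.delete(BACKOFF_KEY)
end

# API side: the predicate used by the `before` hook only needs to check
# that the key exists and has not expired.
def locking_migrations_in_progress?
  expires_at = FAKE_REDIS[BACKOFF_KEY]
  !expires_at.nil? && Time.now < expires_at
end
```

The TTL matters: with real Redis, an `EX` on the key bounds the worst case if the migration process dies before clearing it.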
Unknowns
Do the runners respect 429s? Or do they retry regardless of the Retry-After value and give up after X attempts?
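Whatever the runner actually does today, the behavior we would want from the client side is roughly: on a 429, wait for the Retry-After value (falling back to a default when the header is missing), and give up after a bounded number of attempts. A sketch of that loop, where the yielded request and all names are stand-ins, not runner code:

```ruby
# Sketch of the desired client retry behavior: honor Retry-After on a
# 429, fall back to a default wait when the header is absent, and stop
# after `max_attempts`. The yielded request is a stand-in.
def request_with_backoff(max_attempts: 3, sleeper: ->(s) { sleep(s) })
  attempts = 0
  loop do
    attempts += 1
    status, headers = yield
    return status unless status == 429
    return status if attempts >= max_attempts

    delay = headers.fetch('Retry-After', '30').to_i
    sleeper.call(delay.positive? ? delay : 30)
  end
end
```

Injecting `sleeper` keeps the loop testable without real waits; with responses of 429 (Retry-After: 300), 429 (no header), then 200, it sleeps 300 and then the 30-second default before succeeding.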