Tell runners to back off while executing database migrations
Executing migrations that acquire locks on the ci_builds table is becoming increasingly challenging because the table is under continuous use from runners and Sidekiq. Migrations that execute well in the testing and staging environments fail unpredictably when they hit production. In the latest occurrence, gitlab-com/gl-infra/production#8521 (closed), we were competing with runners that were trying to upload artifacts, which caused deadlocks.
This is a problem for CI data partitioning because we can merge only one MR at a time and must wait for it to execute successfully in production before merging the next.
Proposal
When a migration that needs an expensive lock on the ci_builds/metadata tables starts, return a 429 response to all runner requests with a Retry-After: 300 header.
We can add a database helper to annotate the migrations that need this functionality:
```ruby
class MyMigration < Gitlab::Database::Migration[2.1]
  enable_runner_backoff!

  def up
    execute("LOCK TABLE #{TARGET_TABLE_NAME}, #{SOURCE_TABLE_NAME} IN SHARE ROW EXCLUSIVE MODE")
    ...
  end
end
```
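As a rough illustration of what the helper could do, `enable_runner_backoff!` might set a class-level flag that the migration framework checks, marking a shared backoff window only while the migration body runs. This is a hypothetical sketch, not GitLab's actual implementation: `Backoff`, `RunnerBackoffConcern`, and `with_runner_backoff` are all invented names, and an in-memory hash stands in for the shared store.

```ruby
# Hypothetical sketch of how `enable_runner_backoff!` could be wired up.
# `Backoff` stands in for a shared (e.g. Redis-backed) store; all names
# here are illustrative.
module Backoff
  STORE = {}

  def self.enable!(ttl_seconds)
    STORE[:runner_backoff] = Time.now + ttl_seconds
  end

  def self.disable!
    STORE.delete(:runner_backoff)
  end

  def self.active?
    deadline = STORE[:runner_backoff]
    !deadline.nil? && Time.now < deadline
  end
end

# Extended into the migration class; `enable_runner_backoff!` just sets
# a flag, and the framework wraps the migration body accordingly.
module RunnerBackoffConcern
  def enable_runner_backoff!
    @runner_backoff = true
  end

  def runner_backoff?
    @runner_backoff == true
  end

  def with_runner_backoff(ttl_seconds: 300)
    return yield unless runner_backoff?

    Backoff.enable!(ttl_seconds)
    begin
      yield
    ensure
      # Always clear the flag, even if the migration raises.
      Backoff.disable!
    end
  end
end

class ExampleMigration
  extend RunnerBackoffConcern
  enable_runner_backoff!

  def self.up
    with_runner_backoff do
      # execute("LOCK TABLE ...") would run here; runners would
      # receive 429s only inside this window.
      Backoff.active?
    end
  end
end
```

Keeping the flag scoped to the migration body (with an `ensure`) means a failed migration does not leave runners backing off indefinitely.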
And for the API:
```ruby
module API
  module Ci
    class Runner < ::API::Base
      before do
        if locking_migrations_in_progress?
          set_status_code_in_env(429)
          error!({}, 429, { 'Retry-After' => 5.minutes })
        end
      end
      ...
```
I think we can use a Redis key to communicate between the migrations and the API.
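The handshake could be as simple as key existence: the migration sets a key with a TTL before taking the lock, and the API's `before` hook checks it. A minimal sketch, with a plain hash plus expiry timestamp standing in for Redis; the key name and helper names are hypothetical.

```ruby
# Hypothetical sketch of the Redis-key handshake. A plain hash with an
# expiry timestamp stands in for Redis; the key name is illustrative.
BACKOFF_KEY = 'gitlab:ci:runner_backoff'
FAKE_REDIS = {}

# Migration side: set the key with a TTL (like Redis SET ... EX 300),
# so a crashed migration cannot leave runners backing off forever.
def start_locking_migration!(ttl_seconds: 300)
  FAKE_REDIS[BACKOFF_KEY] = Time.now + ttl_seconds
end

def finish_locking_migration!
  FAKE_REDIS.delete(BACKOFF_KEY)
end

# API side: the predicate used by the `before` hook only needs to check
# that the key exists and has not expired.
def locking_migrations_in_progress?
  expires_at = FAKE_REDIS[BACKOFF_KEY]
  !expires_at.nil? && Time.now < expires_at
end
```

The TTL matters: with real Redis, an `EX` on the key bounds the worst case if the migration process dies before clearing it.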
Unknowns
Do the runners respect 429s? Or do they retry regardless of the Retry-After value and give up after X attempts?
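Whatever the runner actually does today, the behavior we would want from the client side is roughly: on a 429, wait for the Retry-After value (falling back to a default when the header is missing), and give up after a bounded number of attempts. A sketch of that loop, where the yielded request and all names are stand-ins, not runner code:

```ruby
# Sketch of the desired client retry behavior: honor Retry-After on a
# 429, fall back to a default wait when the header is absent, and stop
# after `max_attempts`. The yielded request is a stand-in.
def request_with_backoff(max_attempts: 3, sleeper: ->(s) { sleep(s) })
  attempts = 0
  loop do
    attempts += 1
    status, headers = yield
    return status unless status == 429
    return status if attempts >= max_attempts

    delay = headers.fetch('Retry-After', '30').to_i
    sleeper.call(delay.positive? ? delay : 30)
  end
end
```

Injecting `sleeper` keeps the loop testable without real waits; with responses of 429 (Retry-After: 300), 429 (no header), then 200, it sleeps 300 and then the 30-second default before succeeding.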