Adjust sub-batch size for failed Batched Background Migration Jobs
requested to merge 377308-adjust-the-sub_batch_size-in-background-migrations-if-we-get-a-query-timeout-exception into master
What does this MR do and why?
Overview
Reduces the sub_batch_size of BatchedMigrationJob when a timeout happens during sub-batch processing.
It rescues the following exceptions:
ActiveRecord::StatementTimeout
ActiveRecord::ConnectionTimeoutError
ActiveRecord::AdapterTimeout
ActiveRecord::LockWaitTimeout
ActiveRecord::QueryCanceled
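As a rough illustration of the rescue step, the sketch below converts several timeout-style exceptions into a single error, mirroring how the Details section describes SubBatchTimeoutError. The module `Sketch`, its stand-in exception classes, and the method `with_timeout_handling` are hypothetical names so that the example runs without Rails; they are not the code from this MR.

```ruby
# Minimal sketch, not the MR's implementation. The classes below are local
# stand-ins (assumptions) for the ActiveRecord exceptions listed above.
module Sketch
  class StatementTimeout < StandardError; end
  class LockWaitTimeout < StandardError; end
  class QueryCanceled < StandardError; end

  TIMEOUT_EXCEPTIONS = [StatementTimeout, LockWaitTimeout, QueryCanceled].freeze

  # Hypothetical stand-in playing the role of SubBatchTimeoutError.
  class SubBatchTimeoutError < StandardError; end

  # Runs the block and re-raises any timeout-style error as one error type.
  # Ruby records the original exception in `cause` automatically.
  def self.with_timeout_handling
    yield
  rescue *TIMEOUT_EXCEPTIONS => e
    raise SubBatchTimeoutError, e.message
  end
end

begin
  Sketch.with_timeout_handling do
    raise Sketch::QueryCanceled, 'canceling statement due to statement timeout'
  end
rescue Sketch::SubBatchTimeoutError => e
  puts "#{e.class}: #{e.message} (cause: #{e.cause.class})"
end
```

Funnelling every timeout into one error type means the caller only needs a single rescue clause to decide whether the sub-batch size should be reduced.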
Solves #377308 (closed)
Feature Flag Issue: #393556 (closed)
Details
If a timeout happens while processing each_sub_batch, a Gitlab::Database::BackgroundMigration::SubBatchTimeoutError error will be raised.
This error is rescued by the migration wrapper and processed by BatchedJob#reduce_sub_batch_size!, which reduces the sub-batch size by 25%:
- BatchedJob#sub_batch_size will never go lower than batch_size
- BatchedJob#sub_batch_size will be reduced 2 times (about 44% in total) before the cycle is reset by BatchedJob#split_and_retry!
  - After BatchedJob#attempts is reset to 0, the cycle starts over again.
- The cycle happens while changing the state of BatchedMigrationJob to :failed
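The arithmetic behind the "2 times, or 44%" figure can be checked with a short sketch. The constant and method names here are hypothetical, and truncating to an integer is an assumption inferred from the 150 to 112 example in the validation steps; the real logic lives in BatchedJob#reduce_sub_batch_size!.

```ruby
# Minimal sketch of the reduction arithmetic; names are assumptions,
# only the 25% figure and the two reductions come from the description.
REDUCE_FACTOR = 0.75        # keep 75% of the size, i.e. reduce by 25%
REDUCTIONS_PER_CYCLE = 2    # reductions before split_and_retry! resets the cycle

# Assumed rounding: drop the fractional part (112.5 becomes 112).
def reduced_sub_batch_size(size)
  (size * REDUCE_FACTOR).to_i
end

sizes = [150]
REDUCTIONS_PER_CYCLE.times { sizes << reduced_sub_batch_size(sizes.last) }

# Two consecutive 25% reductions leave 0.75 * 0.75 = 56.25% of the original.
total_reduction_percent = ((1 - REDUCE_FACTOR**REDUCTIONS_PER_CYCLE) * 100).round

puts sizes.inspect            # [150, 112, 84]
puts total_reduction_percent  # 44
```

The two reductions compound rather than add up, which is why the total is about 44% and not 50%.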
How to set up and validate locally
- Create a new background migration:
rails g post_deployment_migration AdjustSubBatchSizeOnTimeout
Example
class AdjustSubBatchSizeOnTimeout < Gitlab::Database::Migration[2.1]
  MIGRATION = 'AdjustSubBatchSizeOnTimeout'
  TABLE_NAME = :issues
  BATCH_COLUMN = :id
  BATCH_SIZE = 500
  SUB_BATCH_SIZE = 150

  restrict_gitlab_migration gitlab_schema: :gitlab_main

  def up
    queue_batched_background_migration(
      MIGRATION,
      TABLE_NAME,
      BATCH_COLUMN,
      batch_size: BATCH_SIZE,
      sub_batch_size: SUB_BATCH_SIZE,
      job_interval: 2.minutes
    )
  end

  def down
    delete_batched_background_migration(MIGRATION, TABLE_NAME, BATCH_COLUMN, [])
  end
end
- Create a new class to process the migration:
Example
module Gitlab
  module BackgroundMigration
    class AdjustSubBatchSizeOnTimeout < BatchedMigrationJob
      operation_name :update_all
      feature_category :database

      def perform
        each_sub_batch do |_|
          Issue.transaction do
            # statement_timeout is in milliseconds: force a timeout on the sleep below.
            Issue.connection.execute 'SET statement_timeout = 10'
            issue = Issue.lock.find(1)
            Logger.new($stdout).info('Lock on Issue(1) for 10min.')
            issue.connection.execute('SELECT * FROM pg_sleep(600);')
          end
        end
      end
    end
  end
end
- Run rails db:migrate. In the output, check for:
Caused by:
PG::QueryCanceled: ERROR: canceling statement due to statement timeout
- Open the console, find the first retriable job, and check its sub_batch_size. It should be reduced by 25%:
base_model = Gitlab::Database.database_base_models[:main]
migration = Gitlab::Database::BackgroundMigration::BatchedMigration.active_migration(connection: base_model.connection)
retriable_job = migration.batched_jobs.retriable.first
retriable_job.status
=> 2 #failed
retriable_job.sub_batch_size
=> 112 # 150 - 25% = 112.5 (stored as 112)
- Retry the failed job and check that its sub_batch_size is reduced again:
migration_wrapper = Gitlab::Database::BackgroundMigration::BatchedMigrationWrapper.new(connection: base_model.connection)
migration_wrapper.perform(retriable_job)
retriable_job.status
=> 2 #failed
retriable_job.sub_batch_size
=> 84 # 112 - 25% = 84
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #377308 (closed)
Edited by Leonardo da Rosa