Adds How-to section to BBM docs
What does this MR do and why?
It adds a how-to section to the BBM docs.
- Re-organize some topics under this new section:
  - Generate a batched background migration
  - Enqueue a batched background migration
  - Use job arguments
  - Use filters
  - Access data for multiple databases
  - Re-queue batched background migrations
  - Batch over non-distinct columns
  - 🆕 Calculate overall time estimation of a batched background migration
  - Cleaning up a batched background migration
- Adds a new section called Calculate overall time estimation of a batched background migration
Changes
How to
Generate a batched background migration
The custom generator batched_background_migration scaffolds the necessary files and accepts table_name, column_name, and feature_category as arguments. Usage:
bundle exec rails g batched_background_migration my_batched_migration --table_name=<table-name> --column_name=<column-name> --feature_category=<feature-category>
This command creates the following files:
db/post_migrate/20230214231008_queue_my_batched_migration.rb
spec/migrations/20230214231008_queue_my_batched_migration_spec.rb
lib/gitlab/background_migration/my_batched_migration.rb
spec/lib/gitlab/background_migration/my_batched_migration_spec.rb
Enqueue a batched background migration
Queueing a batched background migration should be done in a post-deployment migration. Use this queue_batched_background_migration example, which queues the migration to be executed in batches. Replace the class name and arguments with the values from your migration:
queue_batched_background_migration(
  JOB_CLASS_NAME,
  TABLE_NAME,
  JOB_ARGUMENTS,
  JOB_INTERVAL
)
NOTE:
This helper raises an error if the number of provided job arguments does not match the number of job arguments defined in JOB_CLASS_NAME.
Make sure the newly-created data is either migrated, or saved in both the old and new version upon creation. Removals in turn can be handled by defining foreign keys with cascading deletes.
Use job arguments
BatchedMigrationJob provides the job_arguments helper method for job classes to define the job arguments they need.
Batched migrations scheduled with queue_batched_background_migration must use the helper to define the job arguments:
queue_batched_background_migration(
  'CopyColumnUsingBackgroundMigrationJob',
  TABLE_NAME,
  'name', 'name_convert_to_text',
  job_interval: DELAY_INTERVAL
)
NOTE:
If the number of defined job arguments does not match the number of job arguments provided when scheduling the migration, queue_batched_background_migration raises an error.
In this example, copy_from returns name, and copy_to returns name_convert_to_text:
class CopyColumnUsingBackgroundMigrationJob < BatchedMigrationJob
  job_arguments :copy_from, :copy_to
  operation_name :update_all
  def perform
    from_column = connection.quote_column_name(copy_from)
    to_column = connection.quote_column_name(copy_to)
    assignment_clause = "#{to_column} = #{from_column}"
    each_sub_batch do |relation|
      relation.update_all(assignment_clause)
    end
  end
end
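To make the argument-count check concrete, here is a minimal plain-Ruby sketch of how a job_arguments-style class macro could behave. ToyBatchedJob and ToyCopyColumn are hypothetical stand-ins written for illustration only; this is not the actual BatchedMigrationJob implementation, and the real helper may differ in where and how it validates.

```ruby
# Hypothetical toy base class: NOT GitLab's BatchedMigrationJob.
class ToyBatchedJob
  # Class macro: records the argument names and defines one reader per name.
  def self.job_arguments(*names)
    @job_argument_names = names
    names.each_with_index do |name, index|
      define_method(name) { @job_arguments[index] }
    end
  end

  def self.job_argument_names
    @job_argument_names || []
  end

  # Rejects a mismatch between the declared and the provided arguments.
  def initialize(*job_arguments)
    expected = self.class.job_argument_names.size
    unless job_arguments.size == expected
      raise ArgumentError,
            "expected #{expected} job arguments, got #{job_arguments.size}"
    end

    @job_arguments = job_arguments
  end
end

# Hypothetical job class mirroring the copy_from / copy_to example.
class ToyCopyColumn < ToyBatchedJob
  job_arguments :copy_from, :copy_to
end

job = ToyCopyColumn.new('name', 'name_convert_to_text')
puts job.copy_from # prints "name"
puts job.copy_to   # prints "name_convert_to_text"
```

Passing only one argument, as in ToyCopyColumn.new('name'), raises ArgumentError in this sketch, which mirrors the mismatch error described in the note.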
Use filters
By default, when creating background jobs to perform the migration, batched background migrations iterate over the full specified table. This iteration is done using the PrimaryKeyBatchingStrategy. If the table has 1000 records and the batch size is 100, the work is batched into 10 jobs. For illustrative purposes, EachBatch is used like this:
# PrimaryKeyBatchingStrategy
Namespace.each_batch(of: 100) do |relation|
  relation.where(type: nil).update_all(type: 'User') # this happens in each background job
end
In some cases, only a subset of records must be examined. If only 10% of the 1000 records need examination, apply a filter to the initial relation when the jobs are created:
Namespace.where(type: nil).each_batch(of: 100) do |relation|
  relation.update_all(type: 'User')
end
In the first example, we don't know how many records will be updated in each batch. In the second (filtered) example, we know exactly 100 will be updated with each batch.
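The difference between the two approaches can be simulated in plain Ruby without a database. In this sketch, the Record struct, the fixture data (every tenth record has a nil type), and the variable names are assumptions made for illustration; each_slice stands in for EachBatch.

```ruby
# Hypothetical in-memory stand-in for the namespaces table.
Record = Struct.new(:id, :type)

# 1000 records, 10% of which (every tenth id) still have a nil type.
records = (1..1000).map { |id| Record.new(id, (id % 10).zero? ? nil : 'Group') }

# Unfiltered: batch over all records, then filter inside each batch.
# Each entry is the number of records that a batch would actually update.
unfiltered_updates = records.each_slice(100).map do |batch|
  batch.count { |record| record.type.nil? }
end

# Filtered: apply the filter first, then batch only the matching records.
filtered_updates = records.select { |record| record.type.nil? }
                          .each_slice(100)
                          .map(&:size)

puts "unfiltered: #{unfiltered_updates.size} batches"
puts "filtered:   #{filtered_updates.size} batch(es)"
```

With this fixture, the unfiltered run needs 10 batches to update 100 records in total, while the filtered run needs a single batch of 100. With real data, the per-batch count in the unfiltered case depends on how the matching rows are distributed.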
BatchedMigrationJob provides a scope_to helper method to apply additional filters and achieve this:
- Create a new migration job class that inherits from BatchedMigrationJob and defines the additional filter:
  class BackfillNamespaceType < BatchedMigrationJob
    scope_to ->(relation) { relation.where(type: nil) }
    operation_name :update_all
    feature_category :source_code_management
    def perform
      each_sub_batch do |sub_batch|
        sub_batch.update_all(type: 'User')
      end
    end
  end
  NOTE: For EE migrations that define scope_to, ensure the module extends ActiveSupport::Concern. Otherwise, records are processed without taking the scope into consideration.
- In the post-deployment migration, enqueue the batched background migration:
  class BackfillNamespaceType < Gitlab::Database::Migration[2.1]
    MIGRATION = 'BackfillNamespaceType'
    DELAY_INTERVAL = 2.minutes
    restrict_gitlab_migration gitlab_schema: :gitlab_main
    def up
      queue_batched_background_migration(
        MIGRATION,
        :namespaces,
        :id,
        job_interval: DELAY_INTERVAL
      )
    end
    def down
      delete_batched_background_migration(MIGRATION, :namespaces, :id, [])
    end
  end
NOTE:
When applying additional filters, ensure they are properly covered by an index to optimize EachBatch performance.
In the example above, an index on (type, id) is needed to support the filters. See the EachBatch documentation for more information.
Access data for multiple databases
Unlike regular migrations, background migrations have access to multiple databases and can be used to efficiently access and update data across them. To indicate which database to use, define an ActiveRecord model inline in the migration code. The model should inherit from the ApplicationRecord that matches the database where the table is located. Using ActiveRecord::Base is disallowed because it does not explicitly specify which database is used to access a given table.
# good
class Gitlab::BackgroundMigration::ExtractIntegrationsUrl
  class Project < ::ApplicationRecord
    self.table_name = 'projects'
  end
  class Build < ::Ci::ApplicationRecord
    self.table_name = 'ci_builds'
  end
end
# bad
class Gitlab::BackgroundMigration::ExtractIntegrationsUrl
  class Project < ActiveRecord::Base
    self.table_name = 'projects'
  end
  class Build < ActiveRecord::Base
    self.table_name = 'ci_builds'
  end
end
Similarly, using ActiveRecord::Base.connection is disallowed. Replace it, preferably with the connection of the model.
# good
Project.connection.execute("SELECT * FROM projects")
# acceptable
ApplicationRecord.connection.execute("SELECT * FROM projects")
# bad
ActiveRecord::Base.connection.execute("SELECT * FROM projects")
Re-queue batched background migrations
If one of the batched background migrations contains a bug that is fixed in a patch release, you must requeue the batched background migration so the migration repeats on systems that already performed the initial migration.
When you requeue the batched background migration, turn the original queuing into a no-op by emptying the #up and #down methods of the migration that performed the original queuing. Otherwise, the batched background migration is queued multiple times on systems that are upgrading multiple patch releases at once.
When you start the second post-deployment migration, delete the previously batched migration with the provided code:
delete_batched_background_migration(MIGRATION_NAME, TABLE_NAME, COLUMN, JOB_ARGUMENTS)
Batch over non-distinct columns
The default batching strategy provides an efficient way to iterate over primary key columns. However, if you need to iterate over columns where values are not unique, you must use a different batching strategy.
The LooseIndexScanBatchingStrategy batching strategy uses a special version of EachBatch to provide efficient and stable iteration over the distinct column values.
This example shows a batched background migration where the issues.project_id column is used as the batching column.
Database post-migration:
class ProjectsWithIssuesMigration < Gitlab::Database::Migration[2.1]
  MIGRATION = 'BatchProjectsWithIssues'
  INTERVAL = 2.minutes
  BATCH_SIZE = 5000
  SUB_BATCH_SIZE = 500
  restrict_gitlab_migration gitlab_schema: :gitlab_main
  disable_ddl_transaction!
  def up
    queue_batched_background_migration(
      MIGRATION,
      :issues,
      :project_id,
      job_interval: INTERVAL,
      batch_size: BATCH_SIZE,
      batch_class_name: 'LooseIndexScanBatchingStrategy', # Override the default batching strategy
      sub_batch_size: SUB_BATCH_SIZE
    )
  end
  def down
    delete_batched_background_migration(MIGRATION, :issues, :project_id, [])
  end
end
Implementing the background migration class:
module Gitlab
  module BackgroundMigration
    class BatchProjectsWithIssues < Gitlab::BackgroundMigration::BatchedMigrationJob
      include Gitlab::Database::DynamicModelHelpers
      operation_name :backfill_issues
      def perform
        distinct_each_batch do |batch|
          project_ids = batch.pluck(batch_column)
          # do something with the distinct project_ids
        end
      end
    end
  end
end
NOTE:
Additional filters defined with scope_to are ignored by LooseIndexScanBatchingStrategy and distinct_each_batch.
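The idea behind a loose index scan can be sketched in plain Ruby: instead of visiting every row, jump from one distinct value straight to the next larger one. In this sketch, a sorted array stands in for an index on issues.project_id and Array#bsearch stands in for the index lookup. The each_distinct_batch method and the sample data are hypothetical and are not part of GitLab's API.

```ruby
# Yields batches of distinct values from a sorted array, skipping duplicates
# by jumping to the first value strictly greater than the current one.
def each_distinct_batch(sorted_values, of:)
  batch = []
  current = sorted_values.first

  until current.nil?
    batch << current
    if batch.size == of
      yield batch
      batch = []
    end

    # Find-minimum binary search: first element greater than `current`,
    # or nil when no such element exists.
    current = sorted_values.bsearch { |value| value > current }
  end

  yield batch unless batch.empty?
end

# Hypothetical project_id values of 12 issues spread over 5 projects.
project_ids = [1, 1, 1, 2, 2, 5, 5, 5, 5, 7, 9, 9]

batches = []
each_distinct_batch(project_ids, of: 2) { |batch| batches << batch }

p batches # prints [[1, 2], [5, 7], [9]]
```

Each project ID appears in exactly one batch, regardless of how many issues belong to that project, which is the property that batching over a non-unique column needs.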
Calculate overall time estimation of a batched background migration
It's possible to estimate how long a BBM will take to complete. GitLab already provides an estimation through the db:gitlabcom-database-testing pipeline.
This estimation is built by sampling production data in a test environment and represents the maximum time that the migration could take, not necessarily the actual time that the migration will take. In certain scenarios, estimations provided by the db:gitlabcom-database-testing pipeline may not be enough to account for all the singularities around the records being migrated, making further calculations necessary. When that is the case, the formula
interval * number of records / max batch size
can be used to determine an approximate estimation of how long the migration will take.
Here, interval and max batch size refer to options defined for the job, and number of records is the total tuple count of the records to be migrated.
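As a worked example, the formula can be wrapped in a small Ruby helper. The method name estimated_bbm_seconds and the input values (a 2-minute interval, 10 million records, and a max batch size of 5000) are assumptions chosen for illustration, not figures from a real migration.

```ruby
# Approximate duration in seconds: interval * number of records / max batch size.
def estimated_bbm_seconds(interval_seconds:, record_count:, max_batch_size:)
  raise ArgumentError, 'max_batch_size must be positive' unless max_batch_size.positive?

  interval_seconds * record_count.to_f / max_batch_size
end

seconds = estimated_bbm_seconds(
  interval_seconds: 120,    # job interval of 2 minutes (assumed)
  record_count: 10_000_000, # records to migrate (assumed)
  max_batch_size: 5_000     # max batch size (assumed)
)

puts seconds                   # prints 240000.0
puts (seconds / 3600).round(1) # prints 66.7 (hours)
```

Under these assumed inputs the migration needs 2000 batches, one every 2 minutes, so roughly 66.7 hours. Treat the result as an approximation, in the same spirit as the pipeline estimate.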
Cleaning up a batched background migration
NOTE: Cleaning up any remaining background migrations must be done in either a major or minor release. You must not do this in a patch release.
Because background migrations can take a long time, you can't immediately clean things up after queueing them. For example, you can't drop a column used in the migration process, as jobs would fail. You must add a separate post-deployment migration in a future release that finishes any remaining jobs before cleaning things up. (For example, removing a column.)
To migrate the data from column foo (containing a big JSON blob) to column bar (containing a string), you would:
- Release A:
  - Create a migration class that performs the migration for a row with a given ID.
  - Update new rows using one of these techniques:
    - Create a new trigger for simple copy operations that don't need application logic.
    - Handle this operation in the model/service as the records are created or updated.
    - Create a new custom background job that updates the records.
  - Queue the batched background migration for all existing rows in a post-deployment migration.
- Release B:
  - Add a post-deployment migration that checks if the batched background migration is completed.
  - Deploy code so that the application starts using the new column and stops updating new records.
  - Remove the old column.
A bump to the import/export version may be required if importing a project from a prior version of GitLab requires the data to be in the new format.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #388789 (closed)