Resolve "Add a table to store batched jobs state changes"
What does this MR do and why?
We only keep the latest snapshot of each batched background migration job, so we cannot see its history of state transitions. Without that history, we have no records to consult when we need to debug issues that happened in the past. This data could also be helpful for product managers, who would be able to extract information about what happened during the execution of a batched background migration.
Case 1 - job fails:
Current behavior:
gitlab/lib/gitlab/database/background_migration/batched_migration_wrapper.rb
def perform(batch_tracking_record)
  start_tracking_execution(batch_tracking_record)
  execute_batch(batch_tracking_record)
  batch_tracking_record.status = :succeeded
rescue Exception # rubocop:disable Lint/RescueException
  batch_tracking_record.status = :failed
  raise
ensure
  finish_tracking_execution(batch_tracking_record)
  track_prometheus_metrics(batch_tracking_record)
end
When an exception is raised, we move the job to the failed state, but we don't store the exception class or message. This MR creates a new table to store the transitions and any errors raised along the way.
Example:
Imagine that we have a batched job running, and for some reason, the job fails. We will create a record with the following information:
- previous_status: running
- next_status: failed
- exception: in this field, we can pass useful information like error_name and error_message.
Note:
A job can fail multiple times (we have a retry mechanism implemented). We should store the error message for each failure, because each failure can have a different cause.
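The failure case can be sketched in plain Ruby as follows. This is only an illustration under stated assumptions, not the code in this MR: `TransitionLog`, `TrackedJob`, `transition_to`, and `perform_with_logging` are hypothetical in-memory stand-ins for the new table and the wrapper, and the field names follow the `exception_class` and `exception_message` columns shown in the migration output.

```ruby
# Hypothetical in-memory stand-in for one row of the new transition table.
TransitionLog = Struct.new(
  :previous_status, :next_status, :exception_class, :exception_message,
  keyword_init: true
)

# Hypothetical stand-in for a batched job that logs every status change.
class TrackedJob
  attr_reader :status, :transition_logs

  def initialize
    @status = :pending
    @transition_logs = []
  end

  # Appends one log entry describing the change, then updates the status.
  def transition_to(next_status, exception: nil)
    @transition_logs << TransitionLog.new(
      previous_status: @status,
      next_status: next_status,
      exception_class: exception&.class&.name,
      exception_message: exception&.message
    )
    @status = next_status
  end
end

# Same shape as `perform`, but the exception details are kept in the log.
def perform_with_logging(job)
  job.transition_to(:running)
  yield
  job.transition_to(:succeeded)
rescue Exception => error # rubocop:disable Lint/RescueException
  job.transition_to(:failed, exception: error)
  raise
end

# Two attempts that both fail, as the retry mechanism could produce.
job = TrackedJob.new
2.times do
  begin
    perform_with_logging(job) { raise ArgumentError, 'batch out of range' }
  rescue ArgumentError
    # Swallowed here only so the example can show a second attempt.
  end
end

job.transition_logs.each do |log|
  puts "#{log.previous_status} -> #{log.next_status} #{log.exception_class}"
end
```

Each failed attempt leaves its own entry with the exception class and message, so a retry does not overwrite the details of an earlier failure.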
Case 2 - other transitions:
To understand the whole picture when we debug a batched background migration job, we need to store every transition that happens during the job's runtime. Otherwise, we only have access to the last state of the job. Example:
Job X history:
- previous_status: running, next_status: failed
- previous_status: running, next_status: pending
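Reading a job's history back could look roughly like the sketch below. It assumes each stored transition carries a job identifier and a timestamp; `Transition`, `TransitionHistory`, `record`, and `for_job` are hypothetical names used only for illustration, and the real data would live in the new table, not in memory.

```ruby
# Hypothetical in-memory stand-in for one stored transition.
Transition = Struct.new(
  :job_id, :previous_status, :next_status, :created_at,
  keyword_init: true
)

# Hypothetical append-only store for the transitions of all jobs.
class TransitionHistory
  def initialize
    @rows = []
  end

  # Adds one transition; nothing is ever updated or removed.
  def record(job_id:, previous_status:, next_status:, at: Time.now)
    @rows << Transition.new(
      job_id: job_id,
      previous_status: previous_status,
      next_status: next_status,
      created_at: at
    )
  end

  # Returns the transitions of a single job, oldest first.
  def for_job(job_id)
    @rows.select { |row| row.job_id == job_id }.sort_by(&:created_at)
  end
end

# Rebuild the "Job X history" example with fixed timestamps.
history = TransitionHistory.new
started = Time.at(0)
history.record(job_id: 'X', previous_status: :running, next_status: :failed, at: started)
history.record(job_id: 'X', previous_status: :running, next_status: :pending, at: started + 60)

history.for_job('X').each do |row|
  puts "#{row.previous_status} -> #{row.next_status}"
end
```

Because entries are only appended, the full sequence of states stays available after the job object itself has moved on to its latest status.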
up migration:
== 20211123135255 CreateBatchedBackgroundMigrationJobTransitionLogs: migrating
-- create_table(:batched_background_migration_job_transition_logs, {})
-- quote_column_name(:exception_class)
-> 0.0000s
-- quote_column_name(:exception_message)
-> 0.0000s
-> 0.0063s
== 20211123135255 CreateBatchedBackgroundMigrationJobTransitionLogs: migrated (0.0063s)
down migration:
== 20211123135255 CreateBatchedBackgroundMigrationJobTransitionLogs: reverting
-- drop_table(:batched_background_migration_job_transition_logs, {})
-> 0.0059s
== 20211123135255 CreateBatchedBackgroundMigrationJobTransitionLogs: reverted (0.0082s)
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #346271 (closed)