Resolve "Add a table to store batched jobs state changes"

Diogo Frazão requested to merge 346271-state-change-history-batchedjob-2 into master

What does this MR do and why?

We only keep the latest snapshot of a batched background migration job record, so we cannot see the history of its state transitions. This is a problem because we sometimes need to debug issues that happened in the past, and we have no records to work from. The data could also be useful to product managers, who would be able to extract information about what happened during the execution of a batched background migration.

Case 1 - job fails:

Current behavior:

gitlab/lib/gitlab/database/background_migration/batched_migration_wrapper.rb

def perform(batch_tracking_record)
  start_tracking_execution(batch_tracking_record)

  execute_batch(batch_tracking_record)

  batch_tracking_record.status = :succeeded
rescue Exception # rubocop:disable Lint/RescueException
  batch_tracking_record.status = :failed

  raise
ensure
  finish_tracking_execution(batch_tracking_record)
  track_prometheus_metrics(batch_tracking_record)
end

When an exception is raised, we move the job to the failed state, but we don't store the exception class or message. This MR creates a new table that stores the state transitions along with any associated error.

Example:

Imagine that we have a batched job running, and for some reason, the job fails.

We will create a record with the following information:

  • previous_status: running
  • next_status: failed
  • exception: in this field, we can pass useful information like error_name and error_message.

Note:

A job can fail multiple times because we have a retry mechanism. We should store the error message for each failure, since each attempt can fail for a different reason.
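
To illustrate Case 1, here is a minimal plain-Ruby sketch of recording a transition together with the exception. It is not the code in this MR: `TransitionLog`, `TrackedJob`, `transition_to`, and `perform_with_logging` are hypothetical names, and an in-memory array stands in for the new table.

```ruby
# Hypothetical stand-in for a row in the transition log table.
TransitionLog = Struct.new(
  :previous_status, :next_status, :exception_class, :exception_message,
  keyword_init: true
)

# Hypothetical stand-in for a batched job; keeps its logs in memory.
class TrackedJob
  attr_reader :status, :transition_logs

  def initialize
    @status = :pending
    @transition_logs = []
  end

  # Append a log entry for the transition, then update the status.
  def transition_to(next_status, exception: nil)
    @transition_logs << TransitionLog.new(
      previous_status: @status,
      next_status: next_status,
      exception_class: exception&.class&.name,
      exception_message: exception&.message
    )
    @status = next_status
  end
end

# Same shape as the wrapper's perform method, but every status change
# goes through transition_to so the error is kept with the transition.
def perform_with_logging(job)
  job.transition_to(:running)
  yield
  job.transition_to(:succeeded)
rescue Exception => e # rubocop:disable Lint/RescueException
  job.transition_to(:failed, exception: e)
  raise
end
```

Calling `perform_with_logging(job) { raise ArgumentError, 'boom' }` re-raises the error and leaves a final log entry with `previous_status: :running`, `next_status: :failed`, `exception_class: "ArgumentError"`, and `exception_message: "boom"`. Because each failure appends its own entry, a retried job keeps one error per attempt.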

Case 2 - other transitions:

To understand the whole picture when debugging a batched background migration job, we need to store every transition that occurs during the job's runtime. Otherwise, we only have access to the job's last state. Example:

Job X history:

  1. previous_status: running, next_status: failed
  2. previous_status: running, next_status: pending
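
As a rough sketch of how a stored history like this could be read back while debugging, the plain-Ruby example below formats the rows as a timeline and filters the failures. The helper names `timeline` and `failures` are hypothetical, the rows are plain hashes rather than database records, and the exception values are made up for illustration.

```ruby
# Illustrative rows mirroring the "Job X history" example.
# The exception values are invented for this sketch.
JOB_X_HISTORY = [
  { previous_status: :running, next_status: :failed,
    exception_class: 'StandardError', exception_message: 'example failure' },
  { previous_status: :running, next_status: :pending,
    exception_class: nil, exception_message: nil }
].freeze

# Build one human-readable line per transition, in order.
def timeline(history)
  history.each_with_index.map do |row, index|
    line = "#{index + 1}. #{row[:previous_status]} -> #{row[:next_status]}"
    next line unless row[:exception_class]

    "#{line} (#{row[:exception_class]}: #{row[:exception_message]})"
  end
end

# Keep only the transitions that ended in a failure.
def failures(history)
  history.select { |row| row[:next_status] == :failed }
end

puts timeline(JOB_X_HISTORY)
# 1. running -> failed (StandardError: example failure)
# 2. running -> pending
```

With only the last snapshot, the first line of this timeline would be lost as soon as the job moved on to its next state.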

up migration:

== 20211123135255 CreateBatchedBackgroundMigrationJobTransitionLogs: migrating
-- create_table(:batched_background_migration_job_transition_logs, {})
-- quote_column_name(:exception_class)
   -> 0.0000s
-- quote_column_name(:exception_message)
   -> 0.0000s
   -> 0.0063s
== 20211123135255 CreateBatchedBackgroundMigrationJobTransitionLogs: migrated (0.0063s)

down migration:

== 20211123135255 CreateBatchedBackgroundMigrationJobTransitionLogs: reverting
-- drop_table(:batched_background_migration_job_transition_logs, {})
   -> 0.0059s
== 20211123135255 CreateBatchedBackgroundMigrationJobTransitionLogs: reverted (0.0082s)

Related to #346271 (closed)

Edited by Diogo Frazão