Batched background migration marked as finished, but there are failed jobs
This is extracted from #341663 (comment 691232113).
A customer reported that they have a batched background migration marked as finished:
gitlabrds=> SELECT * FROM batched_background_migrations WHERE table_name = 'ci_build_needs'\gx
-[ RECORD 1 ]-----+-----------------------------------------------
id | 7
created_at | 2021-08-17 17:05:43.264787+00
updated_at | 2021-08-18 03:00:05.094282+00
min_value | 1
max_value | 25201
batch_size | 20000
sub_batch_size | 1000
interval | 120
status | 3
job_class_name | CopyColumnUsingBackgroundMigrationJob
batch_class_name | PrimaryKeyBatchingStrategy
table_name | ci_build_needs
column_name | id
job_arguments | [["build_id"], ["build_id_convert_to_bigint"]]
total_tuple_count | 22649
pause_ms | 100
but some of its jobs are actually marked as failed:
gitlabrds=> SELECT status, COUNT(*) FROM batched_background_migration_jobs WHERE batched_background_migration_id = 7 GROUP BY status;
status | count
--------+-------
2 | 1
3 | 1
(2 rows)
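A minimal Ruby sketch of the invariant that is violated here. It assumes the status codes used at the time: migration status 3 means finished, and job statuses 0, 1, 2 and 3 mean pending, running, failed and succeeded. It also assumes a simplified batching model where each job covers a contiguous id range of batch_size ids. The real PrimaryKeyBatchingStrategy walks actual ids, so its boundaries may differ. Under that model, min_value 1, max_value 25201 and batch_size 20000 give two jobs, which matches the two job rows counted above. The constants and helper names are hypothetical.

```ruby
# Hypothetical status codes, assumed to mirror the enums in use at the time.
MIGRATION_FINISHED = 3
JOB_STATUSES = { 0 => :pending, 1 => :running, 2 => :failed, 3 => :succeeded }.freeze

# Simplified model: contiguous id ranges of batch_size ids each (ceiling division).
def expected_job_count(min_value, max_value, batch_size)
  ((max_value - min_value + 1) + batch_size - 1) / batch_size
end

# True when a migration is marked finished although at least one of its
# jobs has not succeeded (failed, pending or running).
def inconsistent?(migration_status, job_status_counts)
  return false unless migration_status == MIGRATION_FINISHED

  job_status_counts.any? do |status, count|
    count.positive? && JOB_STATUSES[status] != :succeeded
  end
end

# Values taken from the psql output above: status => count.
jobs = { 2 => 1, 3 => 1 }
puts expected_job_count(1, 25_201, 20_000) # prints 2
puts jobs.values.sum                       # prints 2
puts inconsistent?(3, jobs)                # prints true
```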
Indeed, a number of rows were not migrated:
gitlabrds=> CREATE INDEX CONCURRENTLY tmp_index_ci_build_needs_not_migrated ON ci_build_needs (build_id_convert_to_bigint) WHERE build_id_convert_to_bigint = 0;
CREATE INDEX
gitlabrds=> SELECT COUNT(*) FROM ci_build_needs WHERE build_id_convert_to_bigint = 0;
count
-------
3150
(1 row)
From a quick look at the related code, this should not be possible, so we may be hitting an edge case or a race condition: https://gitlab.com/gitlab-org/gitlab/-/blob/7202bb889de8525d0e395b0dd4eccc42425fca9b/lib/gitlab/database/background_migration/batched_migration_runner.rb#L118-122.
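To make the expected behaviour concrete, here is a minimal Ruby sketch of a finishing guard: a migration is only finalized once no job is pending or running, and it ends up failed rather than finished if any job failed. This is not GitLab's implementation. The Struct types, the finalize method and the status symbols are hypothetical stand-ins for the real ActiveRecord models.

```ruby
# Hypothetical stand-ins for the migration and job records.
Job = Struct.new(:status)
Migration = Struct.new(:status, :jobs)

ACTIVE_JOB_STATUSES = %i[pending running].freeze

# Returns the new migration status, or nil when jobs are still in flight
# and the migration must stay as it is.
def finalize(migration)
  return nil if migration.jobs.any? { |job| ACTIVE_JOB_STATUSES.include?(job.status) }

  # Any failed job means the migration cannot be considered finished.
  migration.status =
    if migration.jobs.any? { |job| job.status == :failed }
      :failed
    else
      :finished
    end
end

m = Migration.new(:active, [Job.new(:failed), Job.new(:succeeded)])
p finalize(m) # prints :failed
p m.status    # prints :failed
```

Even with a guard like this, the check and the status update read job state at one point in time. If a job's failure is recorded after that read, the migration could still be marked finished, which is one way the race condition suspected above could arise.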