RCA: Stale database schema problem caused by `db/post_migrate/20230711093010_drop_default_partition_id_value_for_ci_tables.rb`
Problem
We had a customer escalation: after a zero-downtime upgrade from 16.1 to 16.2, downstream pipelines could not be created.
The error logged by `Ci::CreateDownstreamPipelineWorker` was:
PG::NotNullViolation: ERROR: null value in column "source_partition_id" of relation "ci_sources_pipelines" violates not-null constraint
DETAIL: Failing row contains (2328526, 3928, 3912358, 3928, 3912288, 44676629, 100, null).
What happened?
- The application was running on 16.1.
- The database migrations for 16.2 were run.
- The application Puma/Sidekiq nodes were restarted.
- The database post migrations were run.
- The `Ci::CreateDownstreamPipelineWorker` started to fail with a data integrity error.
- A Puma/Sidekiq restart fixed the issue.
Why did it fail?
- The application, once loaded at step 3 (the Puma/Sidekiq restart), read the DB structure as `ci_sources_pipelines.source_partition_id default 100`.
- The database post migration `20230711093010_drop_default_partition_id_value_for_ci_tables.rb` then changed the default to `ci_sources_pipelines.source_partition_id default null`.
- The code at https://gitlab.com/gitlab-org/gitlab/-/blob/v16.2.8-ee/app/models/ci/sources/pipeline.rb#L44 did not set the value, because the cached schema still had `default 100`. Since the value equaled the cached default, it was not sent with `INSERT INTO ci_sources_pipelines (source_partition_id)`: the application expected the database to set it via `default`.
- Once we restarted the application, it read the database default as `nil`, which made `Ci::Sources::Pipeline#set_source_partition_id` copy the `source_job.partition_id` value.
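The sequence above can be reproduced with a small toy model. This is only a sketch in plain Ruby, not GitLab or Rails code: `FakeTable`, `AppProcess`, and `NotNullViolation` are hypothetical stand-ins, and the "skip values equal to the cached default" rule models the behaviour described above rather than the real ActiveRecord implementation.

```ruby
# Toy model of the failure; none of these classes are GitLab or Rails code.
class NotNullViolation < StandardError; end

# Stand-in for the PostgreSQL table: it owns the real column default.
class FakeTable
  attr_accessor :default
  attr_reader :rows

  def initialize(default:)
    @default = default
    @rows = []
  end

  # Columns missing from the INSERT fall back to the table's current default.
  def insert(columns)
    value = columns.fetch(:source_partition_id, default)
    raise NotNullViolation, 'null value in column "source_partition_id"' if value.nil?

    @rows << value
    value
  end
end

# Stand-in for a Puma/Sidekiq process: it snapshots the default once, at boot.
class AppProcess
  def initialize(table)
    @table = table
    @cached_default = table.default
  end

  def create_source_pipeline(job_partition_id:)
    value = @cached_default
    # Callback as described above: copy from the job only when nothing is set.
    value = job_partition_id if value.nil?
    # A value equal to the cached default is left for the database to fill in.
    columns = value == @cached_default ? {} : { source_partition_id: value }
    @table.insert(columns)
  end
end

table = FakeTable.new(default: 100)
stale = AppProcess.new(table) # booted before the post migration
table.default = nil           # the post migration drops the default

begin
  stale.create_source_pipeline(job_partition_id: 100)
rescue NotNullViolation => e
  puts "stale process: #{e.message}"
end

fresh = AppProcess.new(table) # restarted after the post migration
puts "fresh process inserted #{fresh.create_source_pipeline(job_partition_id: 100)}"
```

The stale process omits the column and hits the not-null constraint, while a process booted after the migration sends the value explicitly.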
Why did the existing mitigation fail?
- We recently had a similar issue caused by a stale database schema cache: https://gitlab.com/gitlab-com/feature-change-locks/-/issues/38.
- We identified a solution to the root cause in #412980 (closed).
- We implemented !121957 (merged) as a mitigation.
- This mitigation did not help in this case, since the queries were not failing with `ActiveRecord::StatementInvalid` but rather with `PG::NotNullViolation`.
- To handle this case we would need a proper "forced" schema reload, as proposed in #412980 (comment 1404785697) or #412980 (comment 1408831804).
- We forgot to add `source_partition_id` to `columns_changing_defaults`: #427489 (comment 1592168571)
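To illustrate the last point, here is a minimal sketch of what listing a column as "changing its default" could achieve, assuming the intent is to always send that column explicitly instead of relying on the database default. The real helper is not reproduced here: `insert_columns`, `apply_insert`, `always_send`, and `MissingValue` are hypothetical names in plain Ruby.

```ruby
# Hypothetical sketch; this is not the GitLab helper.
class MissingValue < StandardError; end

# Pick the columns to send: anything that differs from the cached default,
# plus every column explicitly listed in always_send.
def insert_columns(attributes, cached_defaults, always_send: [])
  attributes.select do |name, value|
    always_send.include?(name) || value != cached_defaults[name]
  end
end

# Simulate the database: fill omitted columns from its own defaults and
# reject rows that still contain a null.
def apply_insert(columns, db_defaults)
  row = db_defaults.merge(columns)
  missing = row.select { |_, value| value.nil? }.keys
  raise MissingValue, "null value in column #{missing.first}" unless missing.empty?

  row
end

cached_defaults = { source_partition_id: 100 } # what the stale process believes
db_defaults     = { source_partition_id: nil } # what the database has after the post migration
attributes      = { source_partition_id: 100 }

forced = insert_columns(attributes, cached_defaults, always_send: [:source_partition_id])
puts apply_insert(forced, db_defaults).inspect
```

Without `always_send`, the same attributes produce an empty column list and the insert is rejected. With it, the stale process still writes a valid row.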
Possible solutions
- Implement a proactive schema reload across the cluster: #412980 (comment 1404785697) or #412980 (comment 1408831804).
- Forbid DDL changes (adding, changing, or removing columns) in post migrations.
- (New) Run CI tests with the application started against the old DB structure, then update the database to the new DB structure midway through the run.
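As a rough illustration of the first option, the sketch below shows one way a process could detect a stale schema and reload it before doing work. It is not the design from the linked comments: it assumes the database exposes some schema version that each process can compare against, and `FakeDatabase`, `SchemaAwareProcess`, and `ensure_fresh_schema` are hypothetical names in plain Ruby.

```ruby
# Hypothetical sketch; class and method names are illustrative only.
class FakeDatabase
  attr_reader :schema_version, :defaults

  def initialize
    @schema_version = 1
    @defaults = { source_partition_id: 100 }
  end

  # A migration changes column defaults and bumps the version.
  def migrate!(new_defaults)
    @defaults = @defaults.merge(new_defaults)
    @schema_version += 1
  end
end

class SchemaAwareProcess
  attr_reader :cached_defaults, :reloads

  def initialize(database)
    @database = database
    @reloads = 0
    load_schema
  end

  # Call before each unit of work (request or job).
  # Returns true when the cached schema was stale and has been reloaded.
  def ensure_fresh_schema
    return false if @loaded_version == @database.schema_version

    load_schema
    @reloads += 1
    true
  end

  private

  def load_schema
    @cached_defaults = @database.defaults.dup
    @loaded_version = @database.schema_version
  end
end

database = FakeDatabase.new
process = SchemaAwareProcess.new(database)
database.migrate!(source_partition_id: nil) # post migration runs while the process is up
process.ensure_fresh_schema                 # the process notices and reloads
puts process.cached_defaults.inspect
```

The version check is cheap compared to a full reload, so it can run often. The trade-off is that every node needs a reliable signal that the schema changed.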
Edited by Kamil Trzciński (Back 2025-01-01)