database major upgrade - make promote_database in pg-upgrade.rb idempotent
Summary
Observed on a customer's environment, and during verification of the fix for #7841 (closed)
For pg-upgrade to work on a Geo secondary Rails database server, it promotes the database from being a replica to read/write:
pg_ctl -D #{@db_worker.data_dir} promote
This is a one-way trip. There's no pg_ctl command to reverse this.
If a Geo environment is part way through a PostgreSQL upgrade, the primary site will already be upgraded. So, it's not possible to re-establish the back-level secondary as a replica of the primary, since different major releases of PostgreSQL cannot be replicas of each other.
So, the second time through, the promotion step fails and the upgrade is reverted. See Relevant logs for more output.
pg_ctl: cannot promote server; server is not in standby mode
Workaround
- Back up the Omnibus code:
  cd /opt/gitlab/embedded/service/omnibus-ctl
  cp -a pg-upgrade.rb pg-upgrade.rb_backup
- Edit pg-upgrade.rb and comment out these lines in the function promote_database, located at line 435 in 15.11.12:
  #@db_worker.run_pg_command(
  #  "#{base_path}/embedded/bin/pg_ctl -D #{@db_worker.data_dir} promote"
  #)
- Re-run gitlab-ctl pg-upgrade
- Roll back the code change:
  cd /opt/gitlab/embedded/service/omnibus-ctl
  mv pg-upgrade.rb_backup pg-upgrade.rb
Steps to reproduce
- Run pg-upgrade on a Geo secondary.
- Have it fail at any point between promoting the database and actually upgrading.
- Try to repeat the upgrade.
What is the current bug behavior?
If pg-upgrade fails on a Geo secondary, it can leave the system in a state that then cannot be upgraded, since the Rails database is promoted and this code isn't idempotent.
What is the expected correct behavior?
Have the promote_database code either only run if the database is a replica, or trap the error about the database already being promoted, and return success.
@db_worker.run_pg_command(
"#{base_path}/embedded/bin/pg_ctl -D #{@db_worker.data_dir} promote"
)
Caution: this isn't the only use case for this code, e.g. gitlab#300761 (closed)
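The two options described above could look roughly like the sketch below. It is not the actual pg-upgrade.rb code: CommandError, replica?, promote_once, and the injected lambdas are hypothetical names used only for illustration, and it assumes that the command runner raises an error carrying pg_ctl's STDERR. The first option also assumes the database is running and can be queried at that point in the upgrade.

```ruby
# Sketch only. CommandError and both lambdas are hypothetical stand-ins; the
# real pg-upgrade.rb helpers may report failures differently.
class CommandError < StandardError; end

# Message pg_ctl prints when the server has already been promoted
# (taken from the log output in this issue).
NOT_STANDBY = 'server is not in standby mode'

# Option 1: promote only if PostgreSQL reports that it is still a replica.
# `query` is assumed to return psql's unaligned, tuples-only output
# ("t" or "f") for the given SQL.
def replica?(query)
  query.call('SELECT pg_is_in_recovery()').strip == 't'
end

# Option 2: attempt the promotion and treat "already promoted" as success.
# `run` is assumed to raise CommandError with pg_ctl's STDERR as the message.
def promote_once(run, promote_cmd)
  run.call(promote_cmd)
  :promoted
rescue CommandError => e
  # Any other failure is still fatal and is re-raised unchanged.
  raise unless e.message.include?(NOT_STANDBY)
  :already_promoted
end
```

Either approach would let a second run of pg-upgrade continue past the promotion step. Matching on the error text is brittle if the pg_ctl message ever changes, so the replica check may be the more robust choice where the database can be queried.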
Relevant logs
Checking if PostgreSQL bin files are symlinked to the expected location: OK
Starting the database
Waiting 30 seconds to ensure tasks complete before PostgreSQL upgrade.
See https://docs.gitlab.com/omnibus/settings/database.html#upgrade-packaged-postgresql-server for details
If you do not want to upgrade the PostgreSQL server at this time, enter Ctrl-C and see the documentation for details
Please hit Ctrl-C now if you want to cancel the operation.
..............................Detected a Geo secondary node
Upgrading the postgresql database
Promoting the database
STDOUT:
STDERR: pg_ctl: cannot promote server; server is not in standby mode
== Fatal error ==
There was an error promoting the database from standby, please check the logs and output.
== Reverting ==
Details of package version
Omnibus 15.11.12 / 15.11.13 are currently the main versions affected, as the PostgreSQL upgrade has to be done before upgrading to 16.0. For earlier 15.x releases, pg-upgrade fails for another reason.
Environment details
- Omnibus Geo