gitlab-ctl pgb-notify issue(s) - first following a code path where it tries to unpause a not-paused database, secondly the rescue fails
Summary
A customer issue (documented in two non-public documents: an issue and a support ticket) contains the following output:
INFO -- : Running: gitlab-ctl pgb-notify --pg-database gitlabhq_production --newhost database7.example.com --user pgbouncer --hostuser gitlab-consul
ERROR -- : STDERR: Error running command: GitlabCtl::Errors::ExecutionError
ERROR -- : STDERR: ERROR: ERROR: database gitlabhq_production is not paused
This occurs when Patroni has failed over and Consul attempts to reconfigure the PgBouncer database configuration and HUP PgBouncer.
The automatic failover of PgBouncer isn't happening; the customer is working around it by restarting GitLab services on the PgBouncer nodes.
If my analysis below is roughly correct, then I think the write is working, but the next step, resume_if_paused, is going wrong.
Steps to reproduce
One possibility is that this is broken on Omnibus 16.1.2 with PG 14.
Or, this is a more subtle issue, in which case we will need to gather more data if the problem is seen again.
What is the current bug behavior?
- I suspect that the code determines that database_paused? is true and so runs pgbouncer_command("RESUME #{@database}")
  - this seems the most likely reason for ERROR: database gitlabhq_production is not paused
- the rescue then also fails, generating Error running command: GitlabCtl::Errors::ExecutionError
  - This code within notify, I guess this is what gitlab-ctl pgb-notify maps to, via pgb.notify
  - I guess that the rescue we see is actually the one guarding pgb
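To make the first suspicion concrete, here is a minimal, self-contained Ruby sketch. It is not the Omnibus code: FakeConsole, Notifier, the believes_paused flag, and the local ExecutionError class are all hypothetical stand-ins, and RELOAD is assumed to represent the step that HUPs pgbouncer. It only models a paused check that wrongly answers true, so that RESUME is sent to a database that is not paused.

```ruby
# Hypothetical stand-in for GitlabCtl::Errors::ExecutionError.
class ExecutionError < StandardError; end

# Hypothetical stand-in for the PgBouncer admin console.
class FakeConsole
  attr_reader :issued

  def initialize(paused:)
    @paused = paused
    @issued = []
  end

  # Records every command; rejects RESUME when the database is not paused.
  def run(sql)
    @issued << sql
    if sql.start_with?('RESUME')
      raise ExecutionError, "ERROR: database #{sql.split.last} is not paused" unless @paused
      @paused = false
    end
    'OK'
  end
end

# Hypothetical notifier whose paused check can disagree with the console.
class Notifier
  def initialize(console, database, believes_paused:)
    @console = console
    @database = database
    @believes_paused = believes_paused
  end

  # Assumption: models a check that may report the wrong state.
  def database_paused?
    @believes_paused
  end

  def resume_if_paused
    @console.run("RESUME #{@database}") if database_paused?
  end

  def notify
    resume_if_paused
    @console.run('RELOAD') # assumed to stand in for the HUP of pgbouncer
  end
end

console = FakeConsole.new(paused: false)
notifier = Notifier.new(console, 'gitlabhq_production', believes_paused: true)

begin
  notifier.notify
rescue ExecutionError => e
  puts e.message # ERROR: database gitlabhq_production is not paused
end
puts console.issued.inspect # ["RESUME gitlabhq_production"]
```

In this model the rejected RESUME stops notify before the reload step is reached, which would be consistent with PgBouncer not following the failover until its services are restarted. Whether the real code behaves this way still needs to be confirmed.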
What is the expected correct behavior?
I think the database_paused? code should have returned false, given the error that the database is not paused.
But then additionally, the inner rescue should work.
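A rough sketch of that expected behavior, under stated assumptions: the notify method below, its paused and run callables, and the local ExecutionError class are hypothetical and are not the Omnibus interface. Treating a failed RESUME as something to log before carrying on to RELOAD is only one reading of "the inner rescue should work".

```ruby
# Hypothetical error type standing in for GitlabCtl::Errors::ExecutionError.
class ExecutionError < StandardError; end

# Hypothetical notify step. `paused` and `run` are injected callables so the
# sketch runs without PgBouncer; neither is the real Omnibus interface.
def notify(database, paused:, run:, log: [])
  begin
    # Expected: RESUME is only issued when the database really is paused.
    run.call("RESUME #{database}") if paused.call(database)
  rescue ExecutionError => e
    # Assumed reading: a failed RESUME is logged and does not abort the run.
    log << "RESUME failed: #{e.message}"
  end
  run.call('RELOAD') # assumed to stand in for the HUP of pgbouncer
  log
end

# Usage: the database is not paused, so only the reload is issued.
issued = []
run = lambda do |sql|
  issued << sql
  'OK'
end
notify('gitlabhq_production', paused: ->(_db) { false }, run: run)
puts issued.inspect # ["RELOAD"]
```

Injecting the paused check and the command runner keeps the sketch runnable without a PgBouncer instance and makes both expectations easy to exercise separately.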
Relevant logs
Details of package version
Omnibus 16.1.2, PostgreSQL 14.