Geo: GitLab's database load balancing doesn't work with standby clusters
GitLab ships with a Database Loadbalancing feature as per documentation in : https://docs.gitlab.com/ee/administration/database_load_balancing.html. While there is no obvious reason why it shouldn't work in general with a Geo secondary, it appears that we rely on some queries that can't be run on a Standby Cluster. So in order to verify whether a node is "healthy" on secondary node, it needs to compare the status against the Follower Leader. We use latency for that, but how we retrieve should be different when querying against a Standby Leader.
This command can replicate the current problem:
lb = Gitlab::Database::LoadBalancing::LoadBalancer.new(Gitlab::Database::LoadBalancing.hosts)
lb.primary_write_location
This will run the following :
ActiveRecord::Base.retrieve_connection.select_all("SELECT pg_current_wal_insert_lsn()::text AS location").first
When running this in a Standby Leader, we should get the following error:
ActiveRecord::StatementInvalid (PG::ObjectNotInPrerequisiteState: ERROR: recovery is in progress)
HINT: WAL control functions cannot be executed during recovery.
When running it in a Primary PostgreSQL node, we get something like this as response instead:
=> {"location"=>"0/800C4C28"}
For a Standby cluster we should probably rely on pg_last_wal_replay_lsn()
when querying the leader:
SELECT pg_last_wal_replay_lsn()::text AS location
Related Documentation:
- https://www.postgresql.org/docs/current/warm-standby.html#SYNCHRONOUS-REPLICATION
- https://www.postgresql.org/docs/12/warm-standby.html#CASCADING-REPLICATION
Testing and rollout
The change will be tested in Staging using an ENV based feature flag: USE_NEW_LOAD_BALANCER_QUERY
. This is planned to default to true as soon as it pass staging, and remove the feature flag at next release: #324657 (closed)