Add ops feature flags to control load balancer replication_lag_time
What does this MR do and why?
This MR adds two ops feature flags that let the application database load balancer use replicas it would normally skip because their replication lag time exceeds max_replication_lag_time. One doubles max_replication_lag_time, and the other ignores it completely.
The intent is to have these flags available to prevent an outage if the replicas cannot keep up with the WAL rate and the primary becomes saturated because no replicas are available.
- load_balancer_double_replication_lag_time should be tried first.
- load_balancer_ignore_replication_lag_time should be a last resort.
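
The effect of the two flags on the lag limit can be sketched as follows. This is a minimal illustration, not the code in this MR: the helper names effective_max_lag and replica_usable? are hypothetical, and the choice to let the ignore flag take precedence when both flags are enabled is an assumption.

```ruby
# Hypothetical helpers for illustration only; not GitLab's implementation.

# Returns the lag limit in seconds, or nil when lag should not be checked.
def effective_max_lag(base_seconds, double: false, ignore: false)
  return nil if ignore # assumption: the ignore flag wins if both are enabled

  double ? base_seconds * 2 : base_seconds
end

# A replica is usable when lag is not checked or is within the limit.
def replica_usable?(lag_seconds, limit)
  limit.nil? || lag_seconds <= limit
end
```

With a base limit of 60 seconds, a replica lagging by 90 seconds would be rejected without flags, accepted with the doubling flag, and accepted at any lag with the ignore flag.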
Relates to: https://gitlab.com/gitlab-org/gitlab/-/issues/429935
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
How to set up and validate locally
- Configure GDK with database load balancing and at least one replica (see https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/database_load_balancing.md).
- Using the rails console, perform a read query and verify that it used the replica.
- Simulate 90 seconds of lag (set recovery_min_apply_delay = '90s' in the replica's postgresql.conf).
- Using the rails console, perform a read query and verify that it used the primary.
- Enable the load_balancer_double_replication_lag_time feature flag.
- Using the rails console, perform a read query and verify that it used the replica even though the replica is lagging.
The same test can be performed with more than 120s of lag using the load_balancer_ignore_replication_lag_time feature flag.
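
The expected outcome of each validation step can be modelled with a short standalone script. It is a sketch under stated assumptions, not GitLab code: host_for_read is a hypothetical function, max_replication_lag_time is assumed to be 60 seconds (consistent with the 90s and 120s figures above), and the 130-second value is just an example of "more than 120s".

```ruby
# Hypothetical model of the validation steps; not GitLab code.

# Returns :replica when the lag is within the limit (a nil limit means
# lag is ignored), otherwise :primary.
def host_for_read(lag_seconds, max_lag)
  within_limit = max_lag.nil? || lag_seconds <= max_lag
  within_limit ? :replica : :primary
end

base_max_lag = 60 # seconds; assumed value of max_replication_lag_time

scenarios = [
  ["90s lag, no flags",      90,  base_max_lag],
  ["90s lag, double flag",   90,  base_max_lag * 2],
  ["130s lag, double flag",  130, base_max_lag * 2],
  ["130s lag, ignore flag",  130, nil]
]

scenarios.each do |label, lag, limit|
  puts "#{label}: #{host_for_read(lag, limit)}"
end
# 90s lag, no flags: primary
# 90s lag, double flag: replica
# 130s lag, double flag: primary
# 130s lag, ignore flag: replica
```

The third row shows why the doubling flag alone is not enough once lag exceeds twice the configured limit, which is the case the ignore flag covers.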