DB Load balancing sticking optimistically removes all sticking when only 1 replica is caught up
Problem
As described in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23578#note_1402938200, we have a problem with the all_caught_up? logic:
```ruby
def all_caught_up?(namespace, id)
  location = last_write_location_for(namespace, id)

  # No stored write location means the user has no recent write to wait for.
  return true unless location

  @load_balancer.select_up_to_date_host(location).tap do |found|
    ActiveSupport::Notifications.instrument(
      'caught_up_replica_pick.load_balancing',
      { result: found }
    )

    # Clears the sticking as soon as ANY single replica is caught up.
    unstick(namespace, id) if found
  end
end
```
This logic is used for our primary "sticking". When a user performs a write to the database, we store (in Redis) the LSN identifying the primary's write location at the time of that write. We then stick all of that user's future read queries to the primary until we find a caught-up replica we could use instead. The logic is flawed, however, because we clear the sticking as soon as we find a single caught-up replica. Once cleared, all future requests by the user will happily use any replica. This can result in stale reads (e.g. 404s) when a single replica is slightly further behind than the others.
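To make the failure mode concrete, here is a minimal, self-contained simulation. All class and method names are illustrative, not the real Sticking/LoadBalancer implementation:

```ruby
# Hypothetical model of the flawed flow: one replica caught up, one lagging.
Replica = Struct.new(:name, :lsn)

class FlawedSticking
  def initialize(replicas)
    @replicas = replicas
    @stuck_lsn = nil
  end

  # After a write we remember the primary's write location (LSN).
  def write!(lsn)
    @stuck_lsn = lsn
  end

  # Mirrors all_caught_up?: unsticks as soon as ONE replica is caught up.
  def pick_host_for_read
    if @stuck_lsn
      caught_up = @replicas.find { |r| r.lsn >= @stuck_lsn }
      return :primary unless caught_up # nothing caught up: stay on primary

      @stuck_lsn = nil       # BUG: one caught-up replica clears sticking entirely
      return caught_up.name  # this request is safe...
    end

    @replicas.sample.name    # ...but later reads may land on a lagging replica
  end
end

replicas = [Replica.new(:replica_a, 100), Replica.new(:replica_b, 42)]
sticking = FlawedSticking.new(replicas)

sticking.write!(100)        # user writes; primary LSN is 100
sticking.pick_host_for_read # => :replica_a (caught up; sticking is cleared)
sticking.pick_host_for_read # => may pick :replica_b at LSN 42: a stale read
```

With one replica at the write LSN and one behind it, the first read finds the caught-up replica and clears the sticking, so the second read is free to hit the lagging replica and return stale data.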
How to reproduce
- Using https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/database_load_balancing_with_service_discovery.md, set up your GDK with multiple replicas
- Using https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/database_load_balancing.md#simulating-replication-delay, configure one of your replicas with a one-minute delay (see the configuration sketch after this list)
- Create an issue and check the network tab for 404s
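The linked GDK guide covers simulating replication delay; in PostgreSQL this is typically done with the recovery_min_apply_delay setting on the replica. A minimal sketch, with an assumed file location (the actual path depends on your GDK layout):

```
# postgresql.conf of the replica you want to lag (path varies per setup)
recovery_min_apply_delay = '1min'
```

With one replica held a minute behind, a read that lands on it right after the issue is created will not yet see the new rows, producing the 404s above.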
Solution
There are a few proposals described at https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23578#note_1406604435. The "Pessimistic unsticking" option seems the easiest to implement, but we need to figure out whether there is a safe way to roll it out without impacting primary traffic too much. Feature flags won't be practical for this, so we might consider another approach, such as rolling it out at a specific low-traffic time to confirm primary load isn't affected, or using metrics to gain confidence that the change won't meaningfully increase the primary write workload.
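A hedged sketch of what pessimistic unsticking could look like. The per-host iteration is an assumption: @load_balancer.hosts and host.caught_up? are illustrative interfaces here, not the confirmed LoadBalancer API.

```ruby
# Sketch only: unstick once EVERY replica has replayed past the stored
# write location, instead of unsticking on the first caught-up replica.
def all_caught_up?(namespace, id)
  location = last_write_location_for(namespace, id)

  return true unless location

  hosts = @load_balancer.hosts # assumed accessor for all replica hosts
  caught_up = hosts.select { |host| host.caught_up?(location) }

  ActiveSupport::Notifications.instrument(
    'caught_up_replica_pick.load_balancing',
    { result: caught_up.any? }
  )

  # Pessimistic: clear the sticking only when no replica is still behind,
  # so a later request can never be routed to a lagging replica.
  unstick(namespace, id) if caught_up.size == hosts.size

  caught_up.any?
end
```

The trade-off is that reads stay stuck to the primary for longer (until the slowest replica catches up), which is exactly why the rollout needs care around primary load.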