Is Database Load Balancing DNS refresh resilient to the DNS reload thread crashing?
Possible problem
Recently we've been testing changes for CI decomposition that involve moving read-only queries between Patroni hosts, and there consistently seemed to be far more delay than we expected. The DNS reloading code is meant to refresh DNS every 1 minute (plus a small random delay), but at times we've seen the host lists not being updated for around an hour. One recent example from gitlab-com/gl-infra/production#7167 (closed): when we added all Patroni hosts to the list behind ci-db-replica.service.consul, it seemed like only 2 of them were receiving traffic for the first ~40 minutes before the rest started to receive a little traffic, and after about an hour it settled down and the traffic was evenly spread.
We also validated at around 00:20 that db-replica.service.consul was resolving to all of the Patroni Main hosts, so it seems the problem was a delay on the Rails side in reloading DNS, or something else going on. We experienced a similar issue in gitlab-com/gl-infra/production#7121 (comment 960546811), where reads were going to old replicas for longer than they should have.
Hypothesis (needs investigation)
Based on my understanding, the load balancing refresh happens periodically in a background thread. I'm not sure anything monitors this thread, so if the thread gets killed at any point, the whole Rails process may be stuck with stale DNS values for its replicas, and they will never be refreshed until the pod is restarted.
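To illustrate the failure mode, here is a minimal sketch (hypothetical class and resolver, not the actual GitLab implementation) of a periodic DNS refresh running in an unsupervised background thread. If the resolver raises, the uncaught exception kills the thread silently: the process keeps serving traffic, but the host list is frozen at its last value.

```ruby
require "resolv"

# Hypothetical sketch of an unsupervised periodic DNS refresh thread.
# Any uncaught exception inside the loop terminates the thread for good.
class ServiceDiscovery
  attr_reader :hosts, :refresh_thread

  def initialize(interval: 60, resolver: nil)
    @hosts = []
    @interval = interval
    # Default resolver looks up the consul service name; injectable for tests.
    @resolver = resolver ||
      -> { Resolv::DNS.new.getaddresses("db-replica.service.consul").map(&:to_s) }
  end

  def start
    # Nothing monitors or restarts this thread once it dies.
    @refresh_thread = Thread.new do
      loop do
        @hosts = @resolver.call
        sleep(@interval + rand(0.0..5.0)) # refresh with small random jitter
      end
    end
  end
end
```

A transient `Resolv::ResolvError` (or any other uncaught exception) in `@resolver.call` ends the loop; `Thread#alive?` then returns false, but nothing in the sketch ever checks it, which matches the suspected behaviour.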
Possible solution
If the above is valid (requires more investigation), then maybe we should update the GitLab health checks to validate that the DNS reload thread is healthy, so that K8s can kill the pod if something goes wrong with that thread.
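One way such a check could work (a sketch with hypothetical names, not an existing GitLab API): the refresh loop records a heartbeat timestamp on every successful iteration, and the readiness probe fails when the heartbeat goes stale, letting Kubernetes restart the pod.

```ruby
# Hypothetical liveness probe for the DNS refresh thread. The refresh loop
# calls #beat each iteration; a Rails health check endpoint calls #healthy?.
class DnsRefreshProbe
  # Refresh is expected every ~60s, so allow generous slack before failing.
  STALE_AFTER = 300 # seconds

  def initialize
    @last_heartbeat = Time.now
    @mutex = Mutex.new
  end

  # Called from inside the refresh loop after each successful DNS reload.
  def beat
    @mutex.synchronize { @last_heartbeat = Time.now }
  end

  # Called from the health check endpoint (e.g. /-/readiness);
  # returns false once the thread has stopped beating.
  def healthy?
    @mutex.synchronize { Time.now - @last_heartbeat < STALE_AFTER }
  end
end
```

A heartbeat check like this catches both a crashed thread and a hung one, whereas checking `Thread#alive?` alone would miss a thread stuck in a blocking call.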