Improve logging for database load balancer host_offline events
What does this MR do and why?
The database load balancer logs a host_offline event when a replica status check marks a host offline. Currently, the event does not record which health check failed or what measured value caused the failure.
During multiple incidents, we have relied on these log events to reconstruct the timeline and topology of database failures. Recording this information would make it easier to understand the sequence of events that led to degraded performance.
This MR adds two fields to the event. lag_time indicates that the replication_lag_below_threshold? check failed, and records how many seconds in the past pg_last_xact_replay_timestamp was. lag_size indicates that the data_is_recent_enough? check failed, and records the number of bytes by which the replica was lagging behind the primary.
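As a rough illustration of the resulting log shape, here is a minimal sketch of how such a payload might be assembled. It is not the code in this MR: the helper name host_offline_payload, the db_host key, and the example host names are assumptions made for illustration. Only the event name host_offline and the field names lag_time and lag_size come from the description above.

```ruby
require 'json'

# Hypothetical helper, not the load balancer's actual implementation.
# lag_time: seconds of replay lag, passed only when the time-based check failed.
# lag_size: bytes of replication lag, passed only when the size-based check failed.
def host_offline_payload(host:, lag_time: nil, lag_size: nil)
  payload = { event: 'host_offline', db_host: host }
  # Include each field only when a value was supplied, so the log entry
  # shows which check(s) failed and by how much.
  payload[:lag_time] = lag_time unless lag_time.nil?
  payload[:lag_size] = lag_size unless lag_size.nil?
  payload
end

# Example: only the time-based check failed.
puts JSON.generate(host_offline_payload(host: 'replica-1.example', lag_time: 75.2))
```

Omitting a field when its check passed keeps the entry unambiguous: the presence of lag_time or lag_size is itself the signal that the corresponding check failed.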
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #491265 (closed)