Improve logging for database load balancer host_offline events (!166542)

Matt Kasa requested to merge 491265-improve-logging-for-database-load-balancer-host_offline-events into master Sep 18, 2024

What does this MR do and why?

The database load balancer logs a host_offline event when a replica status check marks a host offline, but the event currently does not record which health check failed or what value caused the failure.

In multiple incidents, we have relied on these log events to reconstruct the timeline and topology of database failures. Recording this information would help us understand the sequence of events that led to degraded performance.

This MR adds two fields to the event. lag_time indicates that the replication_lag_below_threshold? check failed and records how many seconds in the past pg_last_xact_replay_timestamp was. lag_size indicates that the data_is_recent_enough? check failed and records how many bytes the replica was lagging behind the primary.
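
As a rough illustration of the resulting log payload, the sketch below builds a host_offline event that carries lag_time and lag_size. It is not the code in this MR: the helper name host_offline_event, the db_host key, and the choice to omit a field that was not measured are assumptions made for the example.

```ruby
require 'json'

# Hypothetical helper, not GitLab's implementation: builds the structured
# payload for a host_offline event, attaching the measured lag values.
#
# lag_time - seconds since pg_last_xact_replay_timestamp (Float), or nil
# lag_size - bytes the replica is behind the primary (Integer), or nil
def host_offline_event(db_host:, lag_time: nil, lag_size: nil)
  {
    event: :host_offline,
    db_host: db_host,   # assumed key name for the replica's host
    lag_time: lag_time,
    lag_size: lag_size
  }.compact             # drop fields that were not measured (an assumption)
end

# Example values are made up for illustration.
event = host_offline_event(
  db_host: 'replica-1.example',
  lag_time: 75.2,
  lag_size: 16_777_216
)

# Emit the event as one JSON line, as a structured logger might.
puts JSON.generate(event)
```

With both values present in a single event, a reader of the logs could see which check failed and by how much without correlating against separate metrics.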

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #491265 (closed)
