Log aggregated primary changes in PerRepositoryElector

Sami Hiltunen requested to merge smh-log-failovers-in-per-repo into master

PerRepositoryElector doesn't currently log any primary changes, which makes it less observable than the other electors. The reason is volume: with one primary per repository, it has far more primary records than the other electors. Logging every change individually is no longer feasible, as the log volume would grow with the number of repositories in the cluster.

This commit improves the situation by logging aggregated demotion and promotion counts for each storage. This gives an overview of how many repositories a given storage was demoted from being the primary of, and how many repositories it became the primary for.
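A minimal sketch of the aggregation idea, not the code in this merge request. The `primaryChange` and `storageCounts` types, the function names, the storage names, and the log line format are all assumptions made for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

// primaryChange is a hypothetical record of one repository's primary moving
// from one storage to another. An empty string means "no primary".
type primaryChange struct {
	RepositoryID int64
	OldPrimary   string
	NewPrimary   string
}

// storageCounts holds the aggregated counts for a single storage.
type storageCounts struct {
	Demoted  int // repositories this storage stopped being the primary for
	Promoted int // repositories this storage became the primary for
}

// aggregateChanges folds individual primary changes into per-storage counts,
// so the amount of log output depends on the number of storages rather than
// on the number of repositories.
func aggregateChanges(changes []primaryChange) map[string]storageCounts {
	counts := make(map[string]storageCounts)
	for _, c := range changes {
		if c.OldPrimary == c.NewPrimary {
			continue // the primary did not actually change
		}
		if c.OldPrimary != "" {
			sc := counts[c.OldPrimary]
			sc.Demoted++
			counts[c.OldPrimary] = sc
		}
		if c.NewPrimary != "" {
			sc := counts[c.NewPrimary]
			sc.Promoted++
			counts[c.NewPrimary] = sc
		}
	}
	return counts
}

// logLines renders one line per storage, sorted by storage name so the
// output is deterministic.
func logLines(counts map[string]storageCounts) []string {
	storages := make([]string, 0, len(counts))
	for s := range counts {
		storages = append(storages, s)
	}
	sort.Strings(storages)

	lines := make([]string, 0, len(storages))
	for _, s := range storages {
		c := counts[s]
		lines = append(lines, fmt.Sprintf("storage=%s demoted=%d promoted=%d", s, c.Demoted, c.Promoted))
	}
	return lines
}

func main() {
	changes := []primaryChange{
		{RepositoryID: 1, OldPrimary: "gitaly-1", NewPrimary: "gitaly-2"},
		{RepositoryID: 2, OldPrimary: "gitaly-1", NewPrimary: "gitaly-3"},
	}
	for _, line := range logLines(aggregateChanges(changes)) {
		fmt.Println(line)
	}
}
```

However many repositories fail over, this produces at most one line per storage involved.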

The downside of aggregation is that the logs no longer say exactly which repositories' primaries were demoted and which storages were promoted. Ideally we'd log the individual demotions and promotions. We could do this for repository-specific primaries as well once we switch from the table-wide failover logic to a lazy election approach. Lazy elections would allow us to perform failovers only for repositories that need a functioning primary right now, namely those receiving a write. That would limit failovers to the repositories being written to during the primary's outage, which would keep the logs manageable again.

As an interim solution, this should suffice to give some observability into the failovers.

Rollout #3492 (closed)

Edited by Sami Hiltunen
