Retry failovers until they succeed in PerRepositoryElector

Sami Hiltunen requested to merge smh-retry-election into master

Failovers with repository-specific primaries check a larger number of rows than failovers with the SQL elector. This is because each repository has its own primary record, as opposed to there being a single primary record per virtual storage. With a low number of records, the SQL elector can check all the primary records periodically. This becomes more problematic with a larger number of primary records. To avoid unnecessarily checking many rows, the PerRepositoryElector only checks the primaries when there has been a change in a Gitaly node's health status. If performing the failovers fails, we'd only try again after the next change in a Gitaly node's health status. This could leave some repositories without healthy primaries if we fail to perform the failovers when we receive the health change event.

This commit fixes the problem by periodically reattempting the failovers if there was a health change and Praefect failed to run the failovers. This way we don't end up waiting until the next health change event to try again.

The ideal fix would be to remove the background loop and instead elect a primary on demand when one is needed. That involves more work though, so as a quicker measure we'll just retry performing the failovers. #3207 (closed) is tracking this.

We are getting ready to roll out repository-specific primaries to production, so this ties up the loose end in a manner that works.

Edited by Sami Hiltunen
