
Extract health consensus logic into a view and query for it directly in the primary elector

Sami Hiltunen requested to merge smh-healthy-storages-view into master

HealthManager currently contains the logic that determines which Gitaly nodes the Praefect nodes consider healthy and which Praefect nodes are part of the quorum. The logic itself works fine, but the consensus is read from the database and passed in memory to the components that need it, namely the primary elector. The primary elector then runs the elections in a separate database transaction. In practice, this works. In theory, the Praefect nodes could run elections using an outdated view of the healthy nodes, which could cause the primary node to flicker unnecessarily.

This MR extracts the consensus logic into a view. With the view, the primary elector can query the health consensus directly, without first loading it into memory.

This view will also be needed when implementing the lazy failover logic in praefect dataloss and the read-only repository metric. Currently, a repository is considered read-only if the primary stored in the database is outdated. With lazy failovers, an outdated recorded primary doesn't mean the repository is currently in read-only mode, as the repository could fail over immediately if a request arrives for it and a viable primary exists. To support this use case without duplicating our query logic, we need to extract the concept of a valid primary into a view. The healthy_storages view is going to be a part of that view.

The HealthManager now has to perform two queries when running health checks. Combining the update with the consensus query is no longer feasible, because modifications made in a CTE are not visible in the tables during the same query. To work around that limitation, the health checks are first updated and the consensus is queried immediately after. This should work fine, as the important thing is to notice changes in the health of the Gitaly nodes and trigger an election run. It works even when updating the health checks and querying the consensus are done in different transactions. As PerRepositoryElector is the only consumer of the health consensus at the moment, we can remove the second step completely once lazy failovers are implemented.

Related to #3207 (closed)
