Rollout repository specific primaries
The repository specific primaries stack is built on the assumption that we have a database record of every repository stored on the virtual storage. Variable replication factor needs these records too, which makes rolling out repository specific primaries a prerequisite to rolling out variable replication factor.
To summarize known open issues on the new request router and elector:
- #3256 (closed)
  - Handling of storage scoped mutators is not implemented. This affects the `Namespace` service's operations, which rename top-level directories in the storage. This is a potential blocker, but as far as I've understood, these operations should not be used anymore with legacy storage being dropped. Since GitLab.com has migrated to hashed storage already, we should be good. The current implementation simply returns `unimplemented` errors.
- #3259 (closed)
  - Each Praefect uses the consensus of healthy nodes to decide whether a node is healthy, instead of the health of its local connection. This mostly means that a Praefect may not use a Gitaly in a transaction or replication if the majority of Praefects think it is unhealthy. I don't think this should be a blocking problem, but it is likely something we want to address later so we don't unnecessarily leave out healthy nodes because the consensus is not healthy yet, or vice versa.
- #3207 (closed)
  - Since each repository now has its own primary, there are more records to update on failovers. This is mostly a performance optimization. Given failovers should be fairly rare, I don't think this has to block the rollout. The current implementation performs failovers when a Gitaly node's health changes, as opposed to doing so at a specific interval as the SQL elector does. This avoids unnecessarily running the failover query, which now involves more records.
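To illustrate the consensus-based health check described under #3259, here is a minimal Go sketch. It is not Praefect's actual implementation: the `healthReports` type, the `healthyByConsensus` function, and the node names are hypothetical. It also assumes that consensus means a strict majority of the Praefects that submitted a report.

```go
package main

import "fmt"

// healthReports maps a Praefect name to the set of Gitaly nodes that
// Praefect currently considers healthy. The type and all names are
// hypothetical and exist only for this illustration.
type healthReports map[string]map[string]bool

// healthyByConsensus reports whether a strict majority of the Praefects
// that submitted a report consider the given Gitaly node healthy.
func healthyByConsensus(reports healthReports, gitaly string) bool {
	votes := 0
	for _, healthy := range reports {
		if healthy[gitaly] {
			votes++
		}
	}
	return votes > len(reports)/2
}

func main() {
	reports := healthReports{
		"praefect-1": {"gitaly-a": true, "gitaly-b": true},
		"praefect-2": {"gitaly-a": true},
		"praefect-3": {"gitaly-a": true},
	}
	// praefect-1 can reach gitaly-b, but the majority cannot, so every
	// Praefect (including praefect-1) leaves gitaly-b out.
	fmt.Println(healthyByConsensus(reports, "gitaly-a"))
	fmt.Println(healthyByConsensus(reports, "gitaly-b"))
}
```

The `gitaly-b` case is the trade-off mentioned above: a node that is reachable from one Praefect is still left out until the majority agrees it is healthy.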
The rollout would require a short downtime. We should update the election strategy in the configuration of each Praefect and bring them down. Once all of them are down, we can restart them. This is to avoid running with two electors concurrently, which could cause split brain issues due to using different primaries.
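The ordering constraint in this procedure can be sketched as a small pre-restart check in Go. This is only an illustration under stated assumptions: the `praefect` struct and `readyToRestart` function are hypothetical, and the strategy value `per_repository` is assumed to be the name of the new election strategy in the configuration.

```go
package main

import "fmt"

// praefect is a hypothetical view of one Praefect process during the rollout.
type praefect struct {
	name     string
	strategy string // election strategy written in its configuration
	running  bool   // whether the process is currently up
}

// readyToRestart returns an error unless every Praefect has been
// reconfigured to the target strategy and is currently stopped. A process
// that is still running keeps using the elector it was started with, so
// starting another one now could run two electors concurrently.
func readyToRestart(nodes []praefect, target string) error {
	for _, n := range nodes {
		if n.strategy != target {
			return fmt.Errorf("%s is not configured for %q yet", n.name, target)
		}
		if n.running {
			return fmt.Errorf("%s is still running", n.name)
		}
	}
	return nil
}

func main() {
	nodes := []praefect{
		{name: "praefect-1", strategy: "per_repository", running: false},
		{name: "praefect-2", strategy: "per_repository", running: true},
	}
	if err := readyToRestart(nodes, "per_repository"); err != nil {
		fmt.Println("wait:", err)
	}

	nodes[1].running = false
	if err := readyToRestart(nodes, "per_repository"); err == nil {
		fmt.Println("all Praefects are down and reconfigured; restart them")
	}
}
```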
Right now we are working on two routing/election stacks side by side, one for virtual storage scoped primaries and one for repository specific primaries. Ideally, we'd do this rollout as soon as possible and require repository specific primaries to be used in 14.0. This would allow us to start removing the implementation of the previous router and elector and go forward expecting that we know of every repository in the cluster.
The repository import job must have been run to ensure the repositories are recorded in the database. This has completed on GitLab.com for staging and production. I'm wondering whether we need to do further work to ensure this has been done in 14.0, or whether communicating it in the upgrade instructions would be enough. The status can be seen either from a flag in the database or through a log statement which is printed on Praefect's startup.
/cc @zj-gitlab @mjwood