Override the Gitaly Cluster replication factor for a specific repository
Problem to solve
Currently, the replication factor of a Gitaly Cluster always equals the number of nodes in the cluster. This makes Gitaly Cluster impractical for installations with very large storage requirements, like GitLab.com (e.g. a 500 TB cluster with 50 nodes requires 25 PB to be provisioned).
There needs to be a way to specify a replication factor lower than the total number of nodes in the cluster to make Gitaly Cluster work at this scale.
So that we can start iterating towards more flexible configurations, we need a very minimal starting point.
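The storage arithmetic behind the example above can be sketched as follows. This is only an illustration: `provisionedTB` is a hypothetical helper, and the replication factor of 3 is an assumed value chosen for comparison, not a figure from this proposal.

```go
package main

import "fmt"

// provisionedTB returns the raw storage (in TB) that must be provisioned
// when every repository is stored replicationFactor times.
// Hypothetical helper for illustration only.
func provisionedTB(logicalTB, replicationFactor int) int {
	return logicalTB * replicationFactor
}

func main() {
	const logicalTB = 500 // repository data, from the example above
	const nodes = 50      // node count, from the example above

	// Today: the replication factor equals the node count.
	fmt.Println(provisionedTB(logicalTB, nodes)) // 25000 (TB), i.e. 25 PB

	// Assumed override: keep only 3 copies of each repository.
	fmt.Println(provisionedTB(logicalTB, 3)) // 1500 (TB), i.e. 1.5 PB
}
```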
Proposal
Implement a praefect command that can be used to override the replication factor for a repository path:
- Praefect will randomly select the nodes on which to keep a copy
- Praefect will stop routing reads or writes to nodes that are not selected to keep a copy
- Praefect will asynchronously bring nodes back online if the replication factor is increased
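The selection behaviour listed above could look roughly like the sketch below. It is not Praefect's implementation: `assignReplicas`, `newRNG`, and the node names are hypothetical, and the choice to keep already-assigned nodes when the factor changes (so existing copies are reused) is an assumption that goes beyond what the proposal states.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"sort"
)

// newRNG returns a seeded random source so runs can be reproduced.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// assignReplicas returns the nodes that should hold a copy of one repository.
// It assumes current is a duplicate-free subset of all. Nodes in current are
// kept where possible; any extra nodes are picked at random from the rest.
func assignReplicas(current, all []string, factor int, rng *rand.Rand) ([]string, error) {
	if factor < 1 || factor > len(all) {
		return nil, errors.New("replication factor must be between 1 and the node count")
	}

	// Randomly decide which existing copies survive if the factor is lowered.
	kept := append([]string(nil), current...)
	rng.Shuffle(len(kept), func(i, j int) { kept[i], kept[j] = kept[j], kept[i] })
	if len(kept) > factor {
		kept = kept[:factor]
	}

	// Collect the nodes that do not hold a copy yet.
	assigned := make(map[string]bool, len(kept))
	for _, n := range kept {
		assigned[n] = true
	}
	var spare []string
	for _, n := range all {
		if !assigned[n] {
			spare = append(spare, n)
		}
	}

	// If the factor was raised, add random spare nodes; these would then
	// be populated by asynchronous replication.
	rng.Shuffle(len(spare), func(i, j int) { spare[i], spare[j] = spare[j], spare[i] })
	for _, n := range spare {
		if len(kept) == factor {
			break
		}
		kept = append(kept, n)
	}

	sort.Strings(kept)
	return kept, nil
}

func main() {
	rng := newRNG(1)
	nodes := []string{"gitaly-1", "gitaly-2", "gitaly-3", "gitaly-4", "gitaly-5"}

	// Lower the factor from 5 (every node) to 3.
	three, _ := assignReplicas(nodes, nodes, 3, rng)
	fmt.Println(three)

	// Raise it to 4: the three existing copies stay, one node is added.
	four, _ := assignReplicas(three, nodes, 4, rng)
	fmt.Println(four)
}
```

Reads and writes for the repository would then be routed only to the returned nodes.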
This will impact any logic that presumes a copy of each repo exists on all healthy nodes, including:
- read distribution
- write transactions/voting
- data loss commands
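As a rough illustration of how the three areas above change, the sketch below filters by the nodes assigned to a repository instead of by all healthy nodes. All function names are hypothetical, and the simple-majority quorum is an assumption made for the example; Praefect's actual voting rules may differ.

```go
package main

import "fmt"

// readCandidates returns the nodes a read may be routed to:
// nodes that are healthy and are assigned a copy of the repository.
func readCandidates(healthy, assigned []string) []string {
	isAssigned := make(map[string]bool, len(assigned))
	for _, n := range assigned {
		isAssigned[n] = true
	}
	var out []string
	for _, n := range healthy {
		if isAssigned[n] {
			out = append(out, n)
		}
	}
	return out
}

// voteThreshold returns the votes needed for a simple majority of the
// assigned nodes (assumed quorum rule), rather than of all cluster nodes.
func voteThreshold(assignedCount int) int {
	return assignedCount/2 + 1
}

// unhealthyAssigned returns the assigned nodes that are currently unhealthy,
// which is what a data loss report would need to consider per repository.
func unhealthyAssigned(healthy, assigned []string) []string {
	isHealthy := make(map[string]bool, len(healthy))
	for _, n := range healthy {
		isHealthy[n] = true
	}
	var out []string
	for _, n := range assigned {
		if !isHealthy[n] {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	healthy := []string{"gitaly-1", "gitaly-2", "gitaly-4", "gitaly-5"}
	assigned := []string{"gitaly-2", "gitaly-3", "gitaly-4"}

	fmt.Println(readCandidates(healthy, assigned))    // [gitaly-2 gitaly-4]
	fmt.Println(voteThreshold(len(assigned)))         // 2
	fmt.Println(unhealthyAssigned(healthy, assigned)) // [gitaly-3]
}
```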
Edited by James Ramsay (ex-GitLab)