Gitaly Zero Downtime Deploys
How do we handle zero-downtime deploys in Network Gitaly?
When Gitaly runs locally, listening on a unix socket and accessing Git data via an NFS share, it's trivial to handle zero-downtime deploys:
- The workers will shut down their processes in a rolling deploy
- The LB will distribute traffic to the other workers
- Gitaly will shut down too, but since the other GitLab processes on that machine have terminated and there are no incoming requests, this is not an issue
- The deploy will complete. The processes will be restarted and the new Gitaly process will start receiving requests
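To make the local setup concrete, here is a minimal Go sketch of a client dialing Gitaly over its unix socket with gRPC. The socket path and the wiring are illustrative assumptions, not the actual GitLab code:

```go
package main

import (
	"net"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// Dial the local Gitaly unix socket; gRPC assumes TCP targets by
	// default, so we supply a custom dialer.
	conn, err := grpc.Dial(
		"/var/opt/gitlab/gitaly/gitaly.socket", // hypothetical path
		grpc.WithInsecure(),
		grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
			return net.DialTimeout("unix", addr, timeout)
		}),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	// A real client would now create Gitaly service stubs on conn.
}
```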
Unfortunately, when Gitaly is running across the network things become trickier.
- The Gitaly process serves traffic for a single shard of the Git data
- As currently designed, there is a single Gitaly process per shard
- During deployment, this process needs to be replaced with a newer one
- Since the Gitaly 'clients' (worker processes) will still be running during the deployment, some Gitaly process needs to keep serving their requests if we are to have a zero-downtime deployment.
The bigger problem is that Gitaly, right now, is a single point of failure.
There are many potential solutions to this problem. Here are some. I've categorised them into two groups:
Solutions with Redundancy
Ultimately Gitaly will need to be redundant and not a single point of failure (SPOF).
What I like about these solutions is that they move towards that goal.
Option 1: Redundancy with Client Load Balancing and Fallback
The client knows multiple network addresses for each git shard. If one address stops responding, the client switches over to another (sketched below).
Advantages:
- In future, we can use this to run Gitaly on multiple server processes, removing all SPOFs for Gitaly
- Handles other types of process failure (e.g. SEGFAULTs) too
- Other reasons to restart a process, such as a memory leak, would be safe too
Disadvantages:
- Additional complexity in the client, although there are several gRPC-related projects looking to solve this problem
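As a sketch of what the client side could look like with a recent grpc-go, using its built-in round_robin load-balancing policy; the resolver scheme, shard name, and backend addresses below are hypothetical:

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/resolver"
	"google.golang.org/grpc/resolver/manual"
)

func main() {
	// A manual resolver hands gRPC both backends for the shard; the
	// round_robin policy spreads calls across them and skips any
	// address that stops responding.
	r := manual.NewBuilderWithScheme("gitaly")
	r.InitialState(resolver.State{Addresses: []resolver.Address{
		{Addr: "gitaly-1.internal:9999"}, // hypothetical backends
		{Addr: "gitaly-2.internal:9999"},
	}})

	conn, err := grpc.Dial(
		"gitaly:///default-shard",
		grpc.WithResolvers(r),
		grpc.WithInsecure(),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}
```

During a rolling restart of the two backends, calls simply land on whichever process is up.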
Option 2: Redundancy with Server Load Balancing and Fallback
The client knows a single network address for each git shard (same as now), but that address points at a load balancer, which forwards requests to one of several Gitaly processes behind it. If one process is not responding, the load balancer switches over to another.
During an upgrade, the Gitaly processes are restarted in sequence, to ensure continued uptime.
Advantages:
- Simple to understand and straightforward
- No code changes on the client or server
Disadvantages:
- (minor) Nginx becomes the SPOF.
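For illustration, this is roughly what the nginx side might look like, assuming TCP-level (stream) proxying of the gRPC traffic; all hostnames and ports are made up:

```nginx
stream {
    upstream gitaly_default_shard {
        # Two Gitaly processes serving the same shard; during a deploy
        # they are restarted one at a time and nginx routes around the
        # one that is momentarily down.
        server gitaly-1.internal:9999 max_fails=1 fail_timeout=5s;
        server gitaly-2.internal:9999 max_fails=1 fail_timeout=5s;
    }

    server {
        listen 9999;
        proxy_pass gitaly_default_shard;
    }
}
```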
Solutions without Redundancy
Option 3: Always be `accept(3)`ing
Single Gitaly instance per NFS server (same as now). Gitaly performs a graceful restart, passing the open listening socket from the parent process to the child in a similar manner to how unicorn does this. Several Golang libraries already support this.
During the deploy, the Gitaly binary is replaced with the upgraded one and then a SIGHUP is sent to the currently running daemon. It will spawn the new version of Gitaly as a child process and pass the listening socket to the child, before gracefully shutting down once all outstanding requests are completed.
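A minimal Go sketch of that handover, written by hand to show the mechanism (libraries such as github.com/facebookgo/grace package this up); the GITALY_UPGRADE variable and port are made up, and a real implementation would also drain in-flight requests before exiting:

```go
package main

import (
	"net"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	var ln net.Listener
	var err error
	if os.Getenv("GITALY_UPGRADE") == "1" {
		// Child of a graceful restart: fd 3 is the inherited listener.
		ln, err = net.FileListener(os.NewFile(3, "inherited-listener"))
	} else {
		ln, err = net.Listen("tcp", ":9999")
	}
	if err != nil {
		panic(err)
	}

	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)
	go func() {
		<-hup
		// The binary on disk has been replaced by the deploy; re-exec it
		// and hand over the listening socket, then stop accepting.
		f, err := ln.(*net.TCPListener).File()
		if err != nil {
			panic(err)
		}
		cmd := exec.Command(os.Args[0])
		cmd.Env = append(os.Environ(), "GITALY_UPGRADE=1")
		cmd.ExtraFiles = []*os.File{f} // becomes fd 3 in the child
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Start(); err != nil {
			panic(err)
		}
		ln.Close() // new connections now go to the child
	}()

	for {
		conn, err := ln.Accept()
		if err != nil {
			return // listener closed during handover; draining would happen here
		}
		go func(c net.Conn) {
			c.Write([]byte("hello from gitaly sketch\n"))
			c.Close()
		}(conn)
	}
}
```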
Advantages:
- No code changes on the client
- Very small, well understood change to the server (one extra library, one line of code change)
- Probably the smoothest deployment process for users. Since the old process will only terminate once the outstanding requests have completed, long-running `git clone` requests will not be affected or need to be reissued.
Disadvantages:
- No process redundancy
Option 4: New Port for Every Deploy
I don't think this option is actually workable due to the way git repository shards are configured.
The idea is that we alternate between two ports: each deploy starts the new Gitaly process on the unused port, then repoints clients at it.
Disadvantages:
- Probably not workable: the port is part of every client's shard configuration, so each deploy would require reconfiguring all clients
What do you think, @gl-gitaly @ernstvn @pcarranza @sytses?