Gitaly Zero Downtime Deploys
How do we handle zero-downtime deploys in Network Gitaly?
When Gitaly runs locally, listening on a unix socket and accessing Git data via an NFS share, it's trivial to handle zero-downtime deploys:
- The workers will shut down their processes in a rolling deploy
- The LB will distribute traffic to the other workers
- Gitaly will shut down too, but since the other GitLab processes on that machine have terminated and there are no incoming requests, this is not an issue
- The deploy will complete. The processes will be restarted and the new Gitaly process will start receiving requests
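To make the local setup concrete, here is a minimal Go sketch of a client dialing Gitaly over its unix socket with gRPC. The socket path and the wiring are illustrative assumptions, not the actual GitLab code:

```go
package main

import (
	"net"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// Dial the local Gitaly unix socket; gRPC assumes TCP targets by
	// default, so we supply a custom dialer.
	conn, err := grpc.Dial(
		"/var/opt/gitlab/gitaly/gitaly.socket", // hypothetical path
		grpc.WithInsecure(),
		grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
			return net.DialTimeout("unix", addr, timeout)
		}),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	// A real client would now create Gitaly service stubs on conn.
}
```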
Unfortunately, when Gitaly is running across the network things become trickier.
- The Gitaly process serves traffic for a single shard of the Git data
- As currently designed, there is a single Gitaly process per shard
- During deployment, this process needs to be replaced with a newer one
- Since the Gitaly 'clients' (worker processes) will still be running during the deployment, some Gitaly process needs to keep serving their requests if we are to have a zero-downtime deployment.
The bigger problem is that Gitaly, right now, is a single point of failure.
There are many potential solutions to this problem. Here are some. I've categorised them into two groups:
Solutions with Redundancy
Ultimately Gitaly will need to be redundant and not a single point of failure (SPOF).
What I like about these solutions is that they move towards that goal.
Option 1: Redundancy with Client Load Balancing and Fallback
The client knows multiple network addresses for each git shard. If one address stops responding, the client switches over to another (sketched below).
Advantages:
- In future, we can use this to run Gitaly on multiple server processes, removing all SPOFs for Gitaly
- Handles other types of process failure (e.g. SEGFAULTs) too
- Other reasons to restart a process, such as a memory leak, would be safe too
Disadvantages:
- Additional complexity in the client, although there are several gRPC-related projects looking to solve this problem
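As a sketch of what the client side could look like with a recent grpc-go, using its built-in round_robin load-balancing policy; the resolver scheme, shard name, and backend addresses below are hypothetical:

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/resolver"
	"google.golang.org/grpc/resolver/manual"
)

func main() {
	// A manual resolver hands gRPC both backends for the shard; the
	// round_robin policy spreads calls across them and skips any
	// address that stops responding.
	r := manual.NewBuilderWithScheme("gitaly")
	r.InitialState(resolver.State{Addresses: []resolver.Address{
		{Addr: "gitaly-1.internal:9999"}, // hypothetical backends
		{Addr: "gitaly-2.internal:9999"},
	}})

	conn, err := grpc.Dial(
		"gitaly:///default-shard",
		grpc.WithResolvers(r),
		grpc.WithInsecure(),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}
```

During a rolling restart of the two backends, calls simply land on whichever process is up.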
Option 2: Redundancy with Server Load Balancing and Fallback
The client knows a single network address for each git shard (same as now), but that address points at a load balancer, which forwards requests to one of several Gitaly processes behind it. If one process is not responding, the load balancer switches over to another.
During an upgrade, the Gitaly processes are restarted in sequence, to ensure continued uptime.
Advantages:
- Simple to understand and straightforward
- No code changes on the client or server
Disadvantages:
- (minor) Nginx becomes the SPOF.
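For illustration, this is roughly what the nginx side might look like, assuming TCP-level (stream) proxying of the gRPC traffic; all hostnames and ports are made up:

```nginx
stream {
    upstream gitaly_default_shard {
        # Two Gitaly processes serving the same shard; during a deploy
        # they are restarted one at a time and nginx routes around the
        # one that is momentarily down.
        server gitaly-1.internal:9999 max_fails=1 fail_timeout=5s;
        server gitaly-2.internal:9999 max_fails=1 fail_timeout=5s;
    }

    server {
        listen 9999;
        proxy_pass gitaly_default_shard;
    }
}
```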
Solutions without Redundancy
Option 3: Always be `accept(3)`ing
Single Gitaly instance per NFS server (same as now). Gitaly performs a graceful restart, passing the open listening socket from the parent process to the child in a similar manner to how unicorn does this. Several Golang libraries already support this.
During the deploy, the Gitaly binary is replaced with the upgraded one and then a SIGHUP is sent to the currently running daemon. It will spawn the new version of Gitaly as a child process and pass the listening socket to the child, before gracefully shutting down once all outstanding requests are completed.
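A minimal Go sketch of that handover, written by hand to show the mechanism (libraries such as github.com/facebookgo/grace package this up); the GITALY_UPGRADE variable and port are made up, and a real implementation would also drain in-flight requests before exiting:

```go
package main

import (
	"net"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	var ln net.Listener
	var err error
	if os.Getenv("GITALY_UPGRADE") == "1" {
		// Child of a graceful restart: fd 3 is the inherited listener.
		ln, err = net.FileListener(os.NewFile(3, "inherited-listener"))
	} else {
		ln, err = net.Listen("tcp", ":9999")
	}
	if err != nil {
		panic(err)
	}

	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)
	go func() {
		<-hup
		// The binary on disk has been replaced by the deploy; re-exec it
		// and hand over the listening socket, then stop accepting.
		f, err := ln.(*net.TCPListener).File()
		if err != nil {
			panic(err)
		}
		cmd := exec.Command(os.Args[0])
		cmd.Env = append(os.Environ(), "GITALY_UPGRADE=1")
		cmd.ExtraFiles = []*os.File{f} // becomes fd 3 in the child
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Start(); err != nil {
			panic(err)
		}
		ln.Close() // new connections now go to the child
	}()

	for {
		conn, err := ln.Accept()
		if err != nil {
			return // listener closed during handover; draining would happen here
		}
		go func(c net.Conn) {
			c.Write([]byte("hello from gitaly sketch\n"))
			c.Close()
		}(conn)
	}
}
```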
Advantages:
- No code changes on the client
- Very small, well understood change to the server (one extra library, one line of code change)
- Probably the smoothest deployment process for users. Since the old process will only terminate once the outstanding requests have completed, long-running `git clone` requests will not be affected or need to be reissued.
Disadvantages:
- No process redundancy
Option 4: New Port for Every Deploy
I don't think this option is actually workable due to the way git repository shards are configured.
The idea is that we alternate between two ports: each deploy starts the new Gitaly process on the unused port, then repoints clients at it.
Disadvantages:
- Probably not workable: the port is part of every client's shard configuration, so each deploy would require reconfiguring all clients
What do you think, @gl-gitaly @ernstvn @pcarranza @sytses?