
Configure retries for read-only Gitaly RPCs

Will Chandler (ex-GitLab) requested to merge wc/gitaly-client-retry into master

What does this MR do and why?

Currently our Ruby Gitaly gRPC client automatically performs only 'transparent' retries, where the request has reached gRPC's internal load balancer but has not yet gone onto the wire. However, gRPC can retry requests in many more scenarios when configured to do so.

Add a retryPolicy to Gitaly's service_config to allow retries for any read-only RPC that fails with an UNAVAILABLE status code. This status code is used exclusively to indicate a connection or network failure. We allow up to two additional attempts, sent after backoffs of 250ms and 500ms. The overall request deadline is still honored. This lets the client handle momentary service interruptions without bubbling errors up to users.
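To make the timing concrete, here is a minimal sketch of how the nominal backoff schedule falls out of the retry policy parameters. The helper name nominal_backoffs and the 1s maxBackoff are assumptions for illustration, not values taken from this MR. gRPC also applies jitter to each delay, so real waits will differ from these nominal figures.

```ruby
# Hypothetical helper: nominal delay before each retry attempt, ignoring
# the jitter gRPC applies in practice.
# max_attempts counts the original request, so 3 means up to 2 retries.
def nominal_backoffs(max_attempts:, initial:, multiplier:, max:)
  (1...max_attempts).map do |retry_number|
    # Exponential growth, capped at the configured maximum backoff.
    [initial * multiplier**(retry_number - 1), max].min
  end
end

# initial and multiplier match the 250ms/500ms described above;
# max: 1.0 is an assumed value.
delays = nominal_backoffs(max_attempts: 3, initial: 0.25, multiplier: 2, max: 1.0)
puts delays.inspect # => [0.25, 0.5]
```

Because the overall request deadline still applies, a request whose deadline is shorter than the accumulated backoff fails before all attempts are used.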

Gitaly sets a MethodOption named op_type on each RPC to indicate whether it will modify the repository. This is accessible to Go clients, but in Ruby we are unfortunately forced to manually list the RPCs known to be read-only, as the Ruby protobuf implementation does not support accessing MethodOptions. gRPC-Core issue # 1198 is open to track adding this feature.
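As a rough sketch of the approach described above, the service config can be built from a hand-maintained list of read-only RPCs and passed to the channel as JSON. The constant READ_ONLY_RPCS, the helper retry_service_config, the short illustrative RPC list, and the 1s maxBackoff are all assumptions here, not the exact contents of this MR. The channel argument keys grpc.service_config and grpc.enable_retries are standard gRPC channel arguments.

```ruby
require 'json'

# Hypothetical, hand-maintained subset of read-only RPCs, keyed by service.
# The real list has to be kept in sync with Gitaly's op_type annotations.
READ_ONLY_RPCS = {
  'gitaly.CommitService' => %w[FindCommit ListCommits],
  'gitaly.RefService' => %w[FindDefaultBranchName]
}.freeze

# Builds a gRPC service config that retries the listed RPCs on UNAVAILABLE.
def retry_service_config(read_only_rpcs = READ_ONLY_RPCS)
  names = read_only_rpcs.flat_map do |service, methods|
    methods.map { |method| { service: service, method: method } }
  end

  {
    methodConfig: [
      {
        name: names,
        retryPolicy: {
          maxAttempts: 3,            # original request plus two retries
          initialBackoff: '0.25s',
          maxBackoff: '1s',          # assumed value
          backoffMultiplier: 2,
          retryableStatusCodes: ['UNAVAILABLE']
        }
      }
    ]
  }
end

# Channel arguments that would be handed to the gRPC stub constructor.
channel_args = {
  'grpc.enable_retries' => 1,
  'grpc.service_config' => JSON.generate(retry_service_config)
}
```

Listing methods explicitly, rather than applying the policy to a whole service, keeps mutating RPCs out of the retry path even when they share a service with read-only ones.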

How to set up and validate locally


  1. Stop your GDK's Gitaly with gdk stop gitaly
  2. Open a Rails console session and execute a FindCommit with Project.last.repository.commit('HEAD')
  3. Immediately start Gitaly with gdk start gitaly. If Gitaly starts quickly enough, the request will succeed with output like => #<Commit id:603975295665c2601289682bd3eefe92da22f848 i-user-0-1696879720/lab-coat@603975295665c2601289682bd3eefe92da22f848>
    1. If you have trouble getting the timing right, increasing maxAttempts and maxBackoff may help.

Note that if you are using Praefect, it adds additional delay. In that case, stopping only Praefect will be easier.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
