Implement a custom DNS resolver for Gitaly
For #4529 (closed). For more information, please read this comment and the following thread.
gRPC ships with a built-in DNS resolver. It works quite well in most scenarios, but it has some drawbacks:
- After the DNS name is resolved for the first time, the resolver does not refresh the list of addresses until the client connection actively triggers it. The client connection does so only when it detects that some of its subchannels are permanently unavailable. As a result, once the client connection is stable, the client is not aware of new hosts added to the cluster via DNS service discovery. This behavior leads to unexpected stickiness and workload skew, especially after a failover.
- Support for SRV records is in an awkward state. This record type is only supported when the grpclb load-balancing strategy is enabled. Unfortunately, that strategy is deprecated, and its behavior is not what we expected. In the short term, we would like to use the round-robin strategy. In the longer term, we may have a custom strategy for a Raft-based cluster. Thus, SRV service discovery will be crucial in the future.
- The resolver detects service config via a TXT record, if any. While this option is convenient for a generic gRPC setup, it does not make sense for Gitaly, so we should get rid of it.
This commit implements a custom DNS resolver. This resolver has some major features:
- Resolve DNS service discovery via A records
- Periodically refresh the DNS (5 minutes by default)
- Update DNS state only if it detects real changes
- Support logging
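The "update only on real changes" feature can be sketched as an order-insensitive comparison of two lookup results. This is a minimal illustration, not the code in this commit: the function name `addrsChanged` and the sample addresses are assumptions made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// addrsChanged reports whether two lookup results differ, ignoring order.
// Hypothetical helper: the name and signature are illustrative only.
func addrsChanged(prev, next []string) bool {
	if len(prev) != len(next) {
		return true
	}
	// Sort copies so the caller's slices are left untouched.
	a := append([]string(nil), prev...)
	b := append([]string(nil), next...)
	sort.Strings(a)
	sort.Strings(b)
	for i := range a {
		if a[i] != b[i] {
			return true
		}
	}
	return false
}

func main() {
	prev := []string{"10.0.0.1:8075", "10.0.0.2:8075"}

	// Same hosts in a different order: no state update is needed.
	fmt.Println(addrsChanged(prev, []string{"10.0.0.2:8075", "10.0.0.1:8075"})) // false

	// One host was replaced: the new state should be pushed.
	fmt.Println(addrsChanged(prev, []string{"10.0.0.1:8075", "10.0.0.3:8075"})) // true
}
```

DNS servers often rotate the order of A records between responses, so comparing sorted copies avoids pushing a new state when only the order differs.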
Service discovery via SRV records is not supported in this version, to keep backward compatibility with Ruby clients.
gRPC depends on the target's scheme to determine which resolver to use. The built-in DNS resolver registers itself with the "dns" scheme. Ideally, we would use a different scheme for this resolver. However, Ruby and other c-ares-based clients don't support custom resolvers. At GitLab, the gRPC target configuration is shared between components. To ensure compatibility between clients, this resolver intentionally replaces the built-in resolver under the same "dns" scheme.
In theory, I can stub the whole DNS lookup operation. However, I really don't want to stub too much. To test the real DNS behavior, I bring up a real DNS server with this package. It serves DNS over UDP, and the answers it returns are controlled by the test.
Architecture
flowchart TD
Target["dns://8.8.8.8:53/gitaly.consul.internal"]--Pick by dns scheme\nOr grpc.WithResolvers--> dnsresolver.Builder
dnsresolver.Builder--> dnsresolver.Resolver
subgraph ClientConn
dnsresolver.Resolver -.Refresh.-> dnsresolver.Resolver
dnsresolver.Resolver -- Update state --> LoadBalancer
LoadBalancer --> SubChannel1
LoadBalancer --> SubChannel2
LoadBalancer --> SubChannel3
SubChannel1 -. Report .-> LoadBalancer
SubChannel2 -. Report .-> LoadBalancer
SubChannel3 -. Report .-> LoadBalancer
end
subgraph Gitaly
Gitaly1
Gitaly2
Gitaly3
end
SubChannel1 -- TCP --> Gitaly1
SubChannel2 -- TCP --> Gitaly2
SubChannel3 -- TCP --> Gitaly3
dnsresolver.Resolver --> net.Resolver
net.Resolver -.If specify authority.-> Authority[Authority Nameserver\n8.8.8.8:53]
net.Resolver -..-> Authority2[OS's configured nameserver]
net.Resolver -..-> /etc/resolv.conf
Note: While the above figure is specific to grpc-go, grpc-core follows a very similar flow.
In general, when a client calls grpc.Dial, the target URL must be resolved by a resolver. gRPC supports many built-in resolvers, including a DNS resolver. It also provides a powerful framework for building custom resolvers. Given the problems stated in the section above, I decided to build one. A resolver consists of two main parts: Builder and Resolver.
Builder creates a resolver object. A builder handles a particular scheme. At module loading time, the builder must register itself with a global resolver registry. Alternatively, users can pass the builder via grpc.WithResolvers, which does not modify the global registry. When the client connection resolves the target, it uses the target's scheme to pick the correct builder, then uses that builder to create a Resolver object. Every client connection maintains one resolver object.
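The selection rules can be illustrated with a small standard-library model. This is a simplified sketch, not grpc-go's API: the `builder`, `registry`, and `pick` names are hypothetical. It models two behaviors of grpc-go: the last builder registered for a scheme wins, and builders passed per connection take precedence over the global registry.

```go
package main

import "fmt"

// builder is a hypothetical stand-in for a resolver builder.
type builder struct {
	scheme string
	name   string
}

// registry models a global, scheme-keyed builder registry.
type registry map[string]builder

// register stores b under its scheme; a later registration for the
// same scheme replaces the earlier one.
func (r registry) register(b builder) { r[b.scheme] = b }

// pick models builder selection for one connection: builders supplied
// for that connection are checked first, then the global registry.
func pick(scheme string, local []builder, global registry) (builder, bool) {
	for _, b := range local {
		if b.scheme == scheme {
			return b, true
		}
	}
	b, ok := global[scheme]
	return b, ok
}

func main() {
	global := registry{}
	global.register(builder{scheme: "dns", name: "built-in"})
	global.register(builder{scheme: "dns", name: "custom"}) // replaces built-in

	b, _ := pick("dns", nil, global)
	fmt.Println(b.name) // custom

	local := []builder{{scheme: "dns", name: "per-connection"}}
	b, _ = pick("dns", local, global)
	fmt.Println(b.name) // per-connection
}
```

In this model, registering under the same "dns" scheme is what lets a custom resolver take over without changing the target configuration shared with other clients.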
Resolver resolves the target URL on behalf of the client connection. The result is passed to the client connection via the UpdateState API. In this implementation, the DNS resolver starts a goroutine that periodically checks the state of the target URL. The client connection can also trigger an early resolution if it detects a connectivity change, such as a connection interruption. Under the hood, the Resolver delegates the actual name resolution to the standard library's net.Resolver. Depending on the runtime platform, the standard resolver does plenty of things. Eventually, it needs to reach a DNS nameserver via UDP. The DNS nameserver is likely to be configured by the runtime OS. Clients can also specify the nameserver address in the target URL (8.8.8.8:53, for example).