Service Discovery sometimes fails inside of Kubernetes
High-level statement of the problem
Currently the GitLab Rails application stack used by our web, api, and sidekiq components depends on Consul to discover a list of PostgreSQL database secondaries, which our code connects to for read-only queries to reduce load on the primary. This query to Consul is made periodically (roughly every 60-70 seconds).
In our current virtual machine infrastructure, we run a Consul agent on each node and configure our application to query Consul via its DNS interface at address localhost, TCP port 8600 (the port Consul listens on for TCP DNS requests). This has worked without issue.
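For context, here is a minimal sketch of what that discovery configuration looks like on the virtual machines, reusing the load_balancing keys that appear later in this issue; the exact key layout and record name are illustrative assumptions rather than a copy of our production settings:

```yaml
# Hedged sketch: DNS-based discovery pointed at the local consul agent on a VM.
# The key layout and record name are assumptions for illustration, reusing the
# load_balancing keys shown later in this issue.
load_balancing:
  discover:
    nameserver: localhost       # consul agent running on the same node
    record: db-replica.service.consul.
    record_type: SRV
    port: 8600                  # consul's TCP DNS port
    use_tcp: true
```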
When migrating some of our components to run on Kubernetes (GKE), we discovered that our application throws a few different transient errors when talking to Consul. This causes the application to fall back to sending all queries to the database primary until the next query to Consul succeeds and the list of secondaries is returned.
While the rate of errors appears small, with more of GitLab.com's services being migrated to Kubernetes we want to be absolutely sure we are not sending more traffic to the PostgreSQL primary than is needed. To this end, we want to investigate these errors and eliminate them completely.
Performance issues were unrelated (gitlab-org/charts/gitlab#2377 (closed)).
This is noted as a blocker for pushing the API service into Kubernetes (gitlab-com/gl-infra&271 (closed)).
The Errors that we are investigating
There are a few different classes of errors we are seeing, but after much investigation and tweaking, the errors we still consistently see and wish to solve are:
- The most important error, which appears to be transient network connectivity issues to Consul:
  Service discovery encountered an error: No route to host - connect(2) for 10.224.45.232:30001
- Our code throwing an error on what appears to be a malformed DNS response from Consul (this error is very rare):
  Service discovery encountered an error: comparison of Integer with nil failed
  We will not be addressing this particular error due to its rarity.
The architecture of the components in question
Our workloads experiencing the problem are the following:

- Kubernetes deployments gitlab-webservice-git and gitlab-webservice-websockets in the gitlab Kubernetes namespace, in the following clusters:
  - Production clusters
    - gprd-us-east1-b
    - gprd-us-east1-c
    - gprd-us-east1-d
  - Staging clusters
    - gstg-us-east1-b
    - gstg-us-east1-c
    - gstg-us-east1-d
- Kubernetes deployments in the gitlab Kubernetes namespace:
  - gitlab-sidekiq-catchall-v1
  - gitlab-sidekiq-database-throttled-v1
  - gitlab-sidekiq-elasticsearch-v1
  - gitlab-sidekiq-gitaly-throttled-v1
  - gitlab-sidekiq-low-urgency-cpu-bound-v1
  - gitlab-sidekiq-memory-bound-v1
  - gitlab-sidekiq-urgent-cpu-bound-v1
  - gitlab-sidekiq-urgent-other-v1

  in the following clusters:
  - Production clusters
    - gprd-gitlab-gke
  - Staging clusters
    - gstg-gitlab-gke
Consul itself is deployed via a Helm chart into the Kubernetes namespace consul. It runs as a DaemonSet on all nodes in all our GKE clusters. The Kubernetes Service we care about is called consul-consul-dns, which is of type ClusterIP, listens on port 53, and forwards to port 8600 on the consul DaemonSet pods.
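For orientation, a minimal sketch of what such a Service looks like; the selector labels are assumptions based on a typical consul Helm chart release and are not copied from our deployment:

```yaml
# Hedged sketch of the consul-consul-dns Service described above.
# Selector labels are assumptions based on a typical consul Helm chart.
apiVersion: v1
kind: Service
metadata:
  name: consul-consul-dns
  namespace: consul
spec:
  type: ClusterIP
  selector:
    app: consul          # assumption: label applied by the consul chart
  ports:
    - name: dns-tcp
      protocol: TCP
      port: 53           # port the Service listens on
      targetPort: 8600   # consul's TCP DNS port on the DaemonSet pods
```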
Our application was configured to use the following:

    load_balancing:
      discover:
        nameserver: consul-consul-dns.consul.svc.cluster.local.
        record: db-replica.service.consul.
        record_type: SRV
        port: 53
        use_tcp: true
This meant it would use kube-dns to look up the ClusterIP of the consul Kubernetes Service, then perform a TCP DNS lookup against that ClusterIP to get the SRV records for the DNS record db-replica.service.consul.
This architecture has since changed as we attempted changes to alleviate the issues, but all of the issues we are seeing were first experienced with this setup.
Architecture changes we have tried
- The first step was to spin up a new headless Kubernetes Service and configure our workloads to talk to that instead. The theory was that bypassing the iptables rules put in place by kube-proxy, and instead talking directly to a pod IP returned by the headless service DNS lookup, would alleviate the issues. This unfortunately had to be rolled back because our code was caching the DNS result from looking up the headless service, and as Kubernetes nodes (and thus the consul DaemonSet pods running on them) were cycled, our workload pods kept attempting to talk to consul pods that no longer existed.
- The second step was to add an extra Kubernetes service called consul-consul-nodeport, which was a NodePort service that took port 30001 on all nodes. We then configured our application to use the Kubernetes downward API to connect to Consul on the local node via the NodePort (see the sketch after this list).
  @ggillies just realised that this does not have the intended effect we want. We want all client pods to talk only to the consul pod running on the same host. NodePort services sit on top of ClusterIP, so switching to NodePort still sends traffic to random pods, and is no different from using a ClusterIP Service.
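For reference, a minimal sketch of the NodePort attempt described above; the selector labels are assumptions, and only the service name, type, and node port come from this issue:

```yaml
# Hedged sketch of the consul-consul-nodeport attempt. Selector labels are
# assumptions; the service name, type, and node port come from this issue.
apiVersion: v1
kind: Service
metadata:
  name: consul-consul-nodeport
  namespace: consul
spec:
  type: NodePort
  selector:
    app: consul          # assumption: label applied by the consul chart
  ports:
    - name: dns-tcp
      protocol: TCP
      port: 53
      targetPort: 8600
      nodePort: 30001    # taken on every node in the cluster
```

In the workload spec, the node's IP would be exposed via the downward API (a fieldRef to status.hostIP) so the application could target the node IP on port 30001. As the realisation above notes, however, kube-proxy still load-balances NodePort traffic across all consul endpoints, so this does not pin a client to the consul pod on its own node.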
Status
Resolved!
Current status: in all environments, we have changed Consul to expose its DNS TCP port as a hostPort that maps directly to the local pod. We have configured our application to use this hostPort, making absolutely sure it is hitting only the consul pod running on the same node.
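A minimal sketch of what the resolved setup looks like; the port number, variable name, and downward API wiring are illustrative assumptions, with only the hostPort on the consul DaemonSet and the application targeting its local node coming from this issue:

```yaml
# Hedged sketch of the resolved setup; port numbers, names, and the downward
# API wiring are illustrative assumptions.
#
# 1) consul DaemonSet container fragment: expose the TCP DNS port as a
#    hostPort so it is reachable on each node's own IP.
ports:
  - name: dns-tcp
    containerPort: 8600
    hostPort: 8600           # assumption: same port number reused on the host
    protocol: TCP
---
# 2) Workload container fragment: resolve the local node's IP via the downward
#    API so the application only talks to the consul pod on its own node.
env:
  - name: CONSUL_DNS_HOST    # hypothetical variable name
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
```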
Reference:
From investigations described in the comments:
- We're not sure that we are capturing all of the Consul deployment's logs 📜
- We could correlate connect(2) failures with both kube-dns scale-down events and consul members being changed due to Node autoscaling. These should be minimal, and are extremely transitory in comparison to No response from nameservers list 🎯 👌
- We can slightly optimize the configuration by populating the FQDN of the Consul service within K8s, by configuring nameserver: consul-consul-dns.consul.svc.cluster.local. (note: including the trailing .). This ensures that kube-dns responds as fast as possible with the address of Consul and prevents multiple calls to the search domain entries in resolv.conf 🏎
- We confirmed that these messages can be coming from resolving the nameserver, not from the query to the nameserver. To that end, I'm going to pull together an MR to separate the resolution of the nameserver from the @resolver definition. We can then know which DNS call is failing. 🕵
(Detailed in #271575 (comment 564402963)) We've moved back to using the ClusterIP service, while still keeping the preStop hook in consul, to see if that keeps the No route to host - connect(2) for <IP> errors gone while also making the No response from nameservers list errors drop back down like they did before the switch to the headless service.
With gitlab-com/gl-infra/production#4469 (closed) now done, all pods only connect to their local consul pod. Unfortunately, the connect error can still be seen (mentioned in the comments):
- https://log.gprd.gitlab.net/goto/3ad6824113c44c224db18610f7184fd4
- https://sentry.gitlab.net/gitlab/gitlabcom/issues/1823997/?query=83a7cc3ef9f94b88b6216e6d19ac9445
The above is due to not fully understanding how a NodePort Service works. Traffic going to the NodePort service is still sent to any available Pod.
We seem to see it more frequently in our sidekiq pods, which is interesting; perhaps this is load related?
As a next step we'll increase observability of the service to try to determine how many requests are failing. Issue to be created.
Milestones
- Recreate the behavior outside of our application
- Bolster our consul configuration
  - Consul upgrade: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13004
  - Consul resource configuration: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!376 (merged)
  - Consul logging: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12908
- Explore other options for running consul in Kubernetes: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13005