Service Discovery sometimes fails inside of Kubernetes
High-level statement of the problem
Currently the GitLab Rails application stack used by our web, api, and sidekiq components depends on Consul to discover a list of PostgreSQL database secondaries, which our code connects to for read-only queries to reduce load on the primary. This query to Consul is made periodically (roughly every 60-70 seconds).
In our current virtual machine infrastructure, we run a Consul agent on each node and configure our application to query Consul via its DNS interface at address localhost, TCP port 8600 (the port Consul listens on for TCP DNS requests). This has worked without issue.
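For context, here is a minimal sketch of what that discovery configuration looks like on the virtual machines, reusing the load_balancing keys that appear later in this issue; the exact key layout and record name are illustrative assumptions rather than a copy of our production settings:

```yaml
# Hedged sketch: DNS-based discovery pointed at the local consul agent on a VM.
# The key layout and record name are assumptions for illustration, reusing the
# load_balancing keys shown later in this issue.
load_balancing:
  discover:
    nameserver: localhost       # consul agent running on the same node
    record: db-replica.service.consul.
    record_type: SRV
    port: 8600                  # consul's TCP DNS port
    use_tcp: true
```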
When migrating some of our components to run on Kubernetes (GKE), we discovered that our application throws a few different transient errors when talking to Consul. This causes the application to fall back to sending all queries to the database primary until the next query to Consul succeeds and the list of secondaries is returned.
While the rate of errors appears small, with more of GitLab.com's services being migrated to Kubernetes we want to be absolutely sure we are not sending more traffic to the PostgreSQL primary than is needed. To this end, we want to investigate these errors and eliminate them completely.
Performance issues were unrelated (gitlab-org/charts/gitlab#2377 (closed)).
This is noted as a blocker for pushing the API service into Kubernetes (gitlab-com/gl-infra&271 (closed)).
The Errors that we are investigating
There are a few different classes of errors we are seeing, but after much investigation and tweaking, the errors we still consistently see and wish to solve are:
- The most important error, which appears to be transient network connectivity issues to Consul:
  Service discovery encountered an error: No route to host - connect(2) for 10.224.45.232:30001
- Our code throwing an error on what appears to be a malformed DNS response from Consul (this error is very rare):
  Service discovery encountered an error: comparison of Integer with nil failed
  We will not be addressing this particular error due to its rarity.
The architecture of the components in question
Our workloads experiencing the problem are the following:

- Kubernetes deployments gitlab-webservice-git and gitlab-webservice-websockets in the gitlab Kubernetes namespace, in the following clusters:
  - Production clusters
    - gprd-us-east1-b
    - gprd-us-east1-c
    - gprd-us-east1-d
  - Staging clusters
    - gstg-us-east1-b
    - gstg-us-east1-c
    - gstg-us-east1-d
- Kubernetes deployments in the gitlab Kubernetes namespace:
  - gitlab-sidekiq-catchall-v1
  - gitlab-sidekiq-database-throttled-v1
  - gitlab-sidekiq-elasticsearch-v1
  - gitlab-sidekiq-gitaly-throttled-v1
  - gitlab-sidekiq-low-urgency-cpu-bound-v1
  - gitlab-sidekiq-memory-bound-v1
  - gitlab-sidekiq-urgent-cpu-bound-v1
  - gitlab-sidekiq-urgent-other-v1

  in the following clusters:
  - Production clusters
    - gprd-gitlab-gke
  - Staging clusters
    - gstg-gitlab-gke
Consul itself is deployed via a Helm chart into the Kubernetes namespace consul. It runs as a DaemonSet on all nodes in all our GKE clusters. The Kubernetes Service we care about is called consul-consul-dns, which is of type ClusterIP, listens on port 53, and forwards to port 8600 on the consul DaemonSet pods.
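For orientation, a minimal sketch of what such a Service looks like; the selector labels are assumptions based on a typical consul Helm chart release and are not copied from our deployment:

```yaml
# Hedged sketch of the consul-consul-dns Service described above.
# Selector labels are assumptions based on a typical consul Helm chart.
apiVersion: v1
kind: Service
metadata:
  name: consul-consul-dns
  namespace: consul
spec:
  type: ClusterIP
  selector:
    app: consul          # assumption: label applied by the consul chart
  ports:
    - name: dns-tcp
      protocol: TCP
      port: 53           # port the Service listens on
      targetPort: 8600   # consul's TCP DNS port on the DaemonSet pods
```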
Our application was configured to use the following:

    load_balancing:
      discover:
        nameserver: consul-consul-dns.consul.svc.cluster.local.
        record: db-replica.service.consul.
        record_type: SRV
        port: 53
        use_tcp: true
This meant it would use kube-dns to look up the ClusterIP of the consul Kubernetes Service, then perform a TCP DNS lookup against that ClusterIP to get the SRV records for the DNS record db-replica.service.consul.
This architecture has since changed as we attempted changes to alleviate the issues, but all of the issues we are seeing were first experienced with this setup.
Architecture changes we have tried
- The first step was to spin up a new headless Kubernetes Service and configure our workloads to talk to that instead. The theory was that bypassing the iptables rules put in place by kube-proxy, and instead talking directly to a pod IP returned by the headless service DNS lookup, would alleviate the issues. This unfortunately had to be rolled back because our code was caching the DNS result from looking up the headless service, and as Kubernetes nodes (and thus the consul DaemonSet pods running on them) were cycled, our workload pods kept attempting to talk to consul pods that no longer existed.
- The second step was to add an extra Kubernetes service called consul-consul-nodeport, which was a NodePort service that took port 30001 on all nodes. We then configured our application to use the Kubernetes downward API to connect to Consul on the local node via the NodePort (see the sketch after this list).
  @ggillies just realised that this does not have the intended effect we want. We want all client pods to talk only to the consul pod running on the same host. NodePort services sit on top of ClusterIP, so switching to NodePort still sends traffic to random pods, and is no different from using a ClusterIP Service.
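For reference, a minimal sketch of the NodePort attempt described above; the selector labels are assumptions, and only the service name, type, and node port come from this issue:

```yaml
# Hedged sketch of the consul-consul-nodeport attempt. Selector labels are
# assumptions; the service name, type, and node port come from this issue.
apiVersion: v1
kind: Service
metadata:
  name: consul-consul-nodeport
  namespace: consul
spec:
  type: NodePort
  selector:
    app: consul          # assumption: label applied by the consul chart
  ports:
    - name: dns-tcp
      protocol: TCP
      port: 53
      targetPort: 8600
      nodePort: 30001    # taken on every node in the cluster
```

In the workload spec, the node's IP would be exposed via the downward API (a fieldRef to status.hostIP) so the application could target the node IP on port 30001. As the realisation above notes, however, kube-proxy still load-balances NodePort traffic across all consul endpoints, so this does not pin a client to the consul pod on its own node.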
Status
Resolved!
Current status: in all environments, we have changed Consul to expose its DNS TCP port as a hostPort that maps directly to the local pod. We have configured our application to use this hostPort, making absolutely sure it is hitting only the consul pod running on the same node.
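A minimal sketch of what the resolved setup looks like; the port number, variable name, and downward API wiring are illustrative assumptions, with only the hostPort on the consul DaemonSet and the application targeting its local node coming from this issue:

```yaml
# Hedged sketch of the resolved setup; port numbers, names, and the downward
# API wiring are illustrative assumptions.
#
# 1) consul DaemonSet container fragment: expose the TCP DNS port as a
#    hostPort so it is reachable on each node's own IP.
ports:
  - name: dns-tcp
    containerPort: 8600
    hostPort: 8600           # assumption: same port number reused on the host
    protocol: TCP
---
# 2) Workload container fragment: resolve the local node's IP via the downward
#    API so the application only talks to the consul pod on its own node.
env:
  - name: CONSUL_DNS_HOST    # hypothetical variable name
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
```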
Reference:
From investigations described in the comments:
- We're not sure that we are capturing all of the Consul deployment's logs 📜
- We could correlate connect(2) failures with both kube-dns scale-down events and consul members being changed due to Node autoscaling. These should be minimal, and are extremely transitory in comparison to No response from nameservers list 🎯 👌
- We can slightly optimize the configuration by populating the FQDN of the Consul service within K8s, by configuring nameserver: consul-consul-dns.consul.svc.cluster.local. (note: including the trailing .). This ensures that kube-dns responds as fast as possible with the address of Consul and prevents multiple calls to the search domain entries in resolv.conf 🏎
- We confirmed that these messages can be coming from resolving the nameserver, not from the query to the nameserver. To that end, I'm going to pull together an MR to separate the resolution of the nameserver from the @resolver definition. We can then know which DNS call is failing. 🕵
(Detailed in #271575 (comment 564402963)) We've moved back to using the ClusterIP service, while still keeping the preStop hook in consul, to see if that keeps the No route to host - connect(2) for <IP> errors gone while also making the No response from nameservers list errors drop back down like they did before the switch to the headless service.
With gitlab-com/gl-infra/production#4469 (closed) now done, all pods only connect to their local consul pod. Unfortunately, the connect error can still be seen (mentioned in the comments):
- https://log.gprd.gitlab.net/goto/3ad6824113c44c224db18610f7184fd4
- https://sentry.gitlab.net/gitlab/gitlabcom/issues/1823997/?query=83a7cc3ef9f94b88b6216e6d19ac9445
The above is due to not fully understanding how a NodePort Service works. Traffic going to the NodePort service is still sent to any available Pod.
We seem to see it more frequently in our sidekiq pods, which is interesting; perhaps this is load related?
As a next step we'll increase observability of the service to try to determine how many requests are failing. Issue to be created.
Milestones
- Recreate the behavior outside of our application
- Bolster our consul configuration
  - Consul upgrade: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13004
  - Consul resource configuration: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!376 (merged)
  - Consul logging: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12908
- Explore other options for running consul in Kubernetes: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13005