# Timeout on proxied upgrade requests (attach, exec, portforward, ...)

## Problem
Initially reported with https://gitlab.com/gitlab-org/cluster-integration/gitlab-agent/-/issues/371+ (confidential)
Under unknown circumstances (more on that later), proxied Kube API requests via KAS / agentk that require a connection upgrade to SPDY (e.g. `kubectl run tmp --restart=Never --rm -i --image centos -- /bin/bash`) run into timeouts.
Probably affected endpoints:
- exec
- attach
- portforward
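
For context, all of these endpoints work the same way on the client side: the client POSTs to the subresource URL and asks the server to upgrade the connection to SPDY before streaming data. Below is a minimal, hypothetical client-go sketch of an attach; the kubeconfig handling and the URL (which mirrors the attach request visible in the agentk logs further below) are illustrative assumptions, not code from this repository:

```go
package main

import (
	"context"
	"net/url"
	"os"

	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/remotecommand"
)

func main() {
	// Load a rest.Config the same way kubectl would (illustrative only).
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	// Attach subresource URL; mirrors the request seen in the agentk logs below.
	u, err := url.Parse(config.Host + "/api/v1/namespaces/default/pods/tmp/attach?container=tmp&stdin=true&stdout=true&tty=true")
	if err != nil {
		panic(err)
	}
	// NewSPDYExecutor performs the HTTP POST with "Connection: Upgrade" /
	// "Upgrade: SPDY/3.1" and, on a 101 response, speaks SPDY on the raw connection.
	exec, err := remotecommand.NewSPDYExecutor(config, "POST", u)
	if err != nil {
		panic(err)
	}
	err = exec.StreamWithContext(context.Background(), remotecommand.StreamOptions{
		Stdin:  os.Stdin,
		Stdout: os.Stdout,
		Tty:    true,
	})
	if err != nil {
		panic(err)
	}
}
```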
## Problem Confirmation
When enabling verbose logging with `-v11` and setting `DEBUG=true` (the latter is for the `moby/spdystream` package, not `kubectl` itself), we'll notice that it times out waiting for a SPDY `SYN_REPLY` frame after trying to create the first SPDY stream using `SYN_STREAM` following the protocol switch (connection upgrade).
This can be confirmed by adding more specialized debug logs to KAS and agentk.
Specialized logging patch
```diff
diff --git a/internal/tool/grpctool/inbound_grpc_to_outbound_http.go b/internal/tool/grpctool/inbound_grpc_to_outbound_http.go
index 88ca771d..0ac4058d 100644
--- a/internal/tool/grpctool/inbound_grpc_to_outbound_http.go
+++ b/internal/tool/grpctool/inbound_grpc_to_outbound_http.go
@@ -135,6 +135,7 @@ func (x *InboundGrpcToOutboundHttp) pipeInboundToOutbound(inbound InboundGrpcToO
}
}),
WithCallback(HttpRequestDataFieldNumber, func(data *HttpRequest_Data) error {
+ x.Log.Sugar().Debugf("in gRPC -> out HTTP: data (len=%d) '% x'", len(data.Data), data.Data)
_, err := pw.Write(data.Data)
return x.maybeHandleIoError("request body write", err)
}),
@@ -164,6 +165,7 @@ func (x *InboundGrpcToOutboundHttp) pipeInboundToOutbound(inbound InboundGrpcToO
}
upgradeConn = r.UpgradeConn
}
+ x.Log.Sugar().Debugf("in gRPC -> out HTTP (upgraded): data (len=%d) '% x'", len(data.Data), data.Data)
_, err := upgradeConn.Write(data.Data)
return x.maybeHandleIoError("upgrade request write", err)
}),
@@ -184,10 +186,14 @@ func (x *InboundGrpcToOutboundHttp) logRequest(header *HttpRequest_Header) {
}
req := header.Request
sugar := x.Log.Sugar()
+ isUpgrade := ""
+ if req.IsUpgrade() {
+ isUpgrade = " (performing upgrade)"
+ }
if len(req.Query) > 0 {
- sugar.Debugf("Handling %s %s?%s", req.Method, req.UrlPath, req.UrlQuery().Encode())
+ sugar.Debugf("Handling %s %s?%s%s", req.Method, req.UrlPath, req.UrlQuery().Encode(), isUpgrade)
} else {
- sugar.Debugf("Handling %s %s", req.Method, req.UrlPath)
+ sugar.Debugf("Handling %s %s%s", req.Method, req.UrlPath, isUpgrade)
}
}
@@ -219,6 +225,7 @@ func (x *InboundGrpcToOutboundHttp) sendResponseHeaderAndBody(inbound InboundGrp
retErr = x.maybeHandleIoError("response body close", err)
}
}()
+ x.Log.Sugar().Debugf("out HTTP -> in gRPC: sending headers with status code %d (%s): %v", resp.StatusCode, resp.Status, resp.Header)
err := inbound.Send(&HttpResponse{
Message: &HttpResponse_Header_{
Header: &HttpResponse_Header{
@@ -246,6 +253,7 @@ func (x *InboundGrpcToOutboundHttp) sendResponseHeaderAndBody(inbound InboundGrp
n, readErr := resp.Body.Read(buffer)
if n > 0 { // handle n>0 before readErr != nil to ensure any consumed data gets forwarded
data.Data = buffer[:n]
+ x.Log.Sugar().Debugf("out HTTP -> in gRPC: data (len=%d) '% x'", len(data.Data), data.Data)
sendErr := inbound.Send(msg)
if sendErr != nil {
return x.handleIoError("SendMsg(HttpResponse_Data) failed", sendErr)
@@ -262,6 +270,10 @@ func (x *InboundGrpcToOutboundHttp) sendResponseHeaderAndBody(inbound InboundGrp
}
func (x *InboundGrpcToOutboundHttp) sendUpgradeResponseStream(inbound InboundGrpcToOutboundHttpStream, upgradeConn *bufio.Reader) error {
+ x.Log.Sugar().Debugf("out HTTP -> in gRPC (upgraded): started response streaming")
+ defer func() {
+ x.Log.Sugar().Debugf("out HTTP -> in gRPC (upgraded): stopped response streaming")
+ }()
buffer := memz.Get32k()
defer memz.Put32k(buffer)
data := &HttpResponse_UpgradeData{} // allocate once
@@ -271,9 +283,12 @@ func (x *InboundGrpcToOutboundHttp) sendUpgradeResponseStream(inbound InboundGrp
},
}
for {
+ x.Log.Sugar().Debugf("out HTTP -> in gRPC (upgraded): reading from upgraded connection")
n, readErr := upgradeConn.Read(buffer)
+ x.Log.Sugar().Debugf("out HTTP -> in gRPC (upgraded): read %d bytes from upgraded connection to forward (err=%s", n, readErr)
if n > 0 { // handle n>0 before readErr != nil to ensure any consumed data gets forwarded
data.Data = buffer[:n]
+ x.Log.Sugar().Debugf("out HTTP -> in gRPC (upgraded): data (len=%d) '% x'", len(data.Data), data.Data)
sendErr := inbound.Send(msg)
if sendErr != nil {
return x.handleIoError("SendMsg(HttpResponse_UpgradeData) failed", sendErr)
diff --git a/internal/tool/grpctool/inbound_http_to_outbound_grpc.go b/internal/tool/grpctool/inbound_http_to_outbound_grpc.go
index 78364664..9b8a8de9 100644
--- a/internal/tool/grpctool/inbound_http_to_outbound_grpc.go
+++ b/internal/tool/grpctool/inbound_http_to_outbound_grpc.go
@@ -348,6 +348,7 @@ func (x *InboundHttpToOutboundGrpc) pipeInboundToOutboundUpgraded(outboundClient
n, readErr := inboundStream.Read(buffer)
if n > 0 { // handle n>0 before readErr != nil to ensure any consumed data gets forwarded
data.Data = buffer[:n]
+ x.Log.Sugar().Debugf("in HTTP -> out gRPC (upgraded): data (len=%d) '% x'", len(data.Data), data.Data)
sendErr := outboundClient.Send(msg)
if sendErr != nil {
if readErr == io.EOF {
@@ -376,6 +377,7 @@ func (x *InboundHttpToOutboundGrpc) pipeOutboundToInboundUpgraded(outboundClient
err := HttpResponseStreamVisitor.Get().Visit(outboundClient,
WithStartState(HttpResponseTrailerFieldNumber),
WithCallback(HttpResponseUpgradeDataFieldNumber, func(data *HttpResponse_UpgradeData) error {
+ x.Log.Sugar().Debugf("out gRPC -> in HTTP (upgraded): data (len=%d) '% x'", len(data.Data), data.Data)
_, err := inboundStream.Write(data.Data)
if err != nil {
writeFailed = true
```
... where we can see in agentk that a `SYN_STREAM`, a `PING` and eventually a `GOAWAY` frame are forwarded to the Kube API server without ever receiving a `SYN_REPLY`:
{"level":"debug","time":"2023-05-01T09:06:33.201+0200","msg":"Handling POST /api/v1/namespaces/default/pods/tmp/attach?container=tmp&stdin=true&stdout=true&tty=true (performing upgrade)","agent_id":1}
{"level":"debug","time":"2023-05-01T09:06:33.260+0200","msg":"out HTTP -> in gRPC: sending headers with status code 101 (101 Switching Protocols): map[Connection:[Upgrade] Upgrade:[SPDY/3.1] Via:[1.1 gitlab-agent/v0.0.0/00000000] X-Stream-Protocol-Version:[v4.channel.k8s.io]]","agent_id":1}
{"level":"debug","time":"2023-05-01T09:06:33.260+0200","msg":"out HTTP -> in gRPC (upgraded): started response streaming","agent_id":1}
{"level":"debug","time":"2023-05-01T09:06:33.260+0200","msg":"out HTTP -> in gRPC (upgraded): reading from upgraded connection","agent_id":1}
{"level":"debug","time":"2023-05-01T09:06:33.271+0200","msg":"in gRPC -> out HTTP (upgraded): data (len=49) '80 03 00 01 00 00 00 29 00 00 00 01 00 00 00 00 00 00 78 f9 e3 c6 a7 c2 62 60 60 60 04 a7 b3 92 a2 d4 c4 5c 68 51 c2 9a 0a 0a 68 00 00 00 00 ff ff'","agent_id":1}
{"level":"debug","time":"2023-05-01T09:06:38.273+0200","msg":"in gRPC -> out HTTP (upgraded): data (len=12) '80 03 00 06 00 00 00 04 00 00 00 01'","agent_id":1}
{"level":"debug","time":"2023-05-01T09:07:03.274+0200","msg":"in gRPC -> out HTTP (upgraded): data (len=16) '80 03 00 07 00 00 00 08 00 00 00 00 00 00 00 00'","agent_id":1}
Log Explanation:

- Line 1 (from KAS to Kube API): HTTP POST request to attach to a pod, with the `Connection: upgrade` header set.
- Line 2 (from Kube API to KAS): HTTP 101 response for the attach (upgrading the connection).
- Line 3 (agentk info): started streaming on the upgraded connection.
- Line 4 (agentk info): waiting to read from the upgraded connection.
- Line 5 (from KAS to Kube API): `SYN_STREAM` (`80 03 00 01 ...`) SPDY packet.
- Line 6 (from KAS to Kube API): `PING` (`80 03 00 06 ...`) SPDY packet.
- Line 7 (from KAS to Kube API): `GOAWAY` (`80 03 00 07 ...`) SPDY packet, received ~30 seconds after the `SYN_STREAM` (that timeout is hardcoded in `kubectl`).
... as you can see, there is no data flowing from the Kube API to agentk on the upgraded connection.
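
For reference, the frame types can be read straight off the first bytes of the hex dumps above: a SPDY/3 control frame starts with a set control bit, a 15-bit version, a 16-bit frame type, an 8-bit flags field and a 24-bit payload length. Below is a small, standalone decoder (purely illustrative, not part of KAS or agentk) applied to the dumps from the log:

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// SPDY/3 control frame types relevant here.
var frameTypes = map[uint16]string{
	1: "SYN_STREAM",
	2: "SYN_REPLY",
	3: "RST_STREAM",
	4: "SETTINGS",
	6: "PING",
	7: "GOAWAY",
}

// describeFrame decodes the 8-byte header of a SPDY control frame.
func describeFrame(raw []byte) string {
	if len(raw) < 8 || raw[0]&0x80 == 0 {
		return "not a SPDY control frame"
	}
	version := binary.BigEndian.Uint16(raw[0:2]) & 0x7fff // clear the control bit
	typ := binary.BigEndian.Uint16(raw[2:4])
	flags := raw[4]
	length := uint32(raw[5])<<16 | uint32(raw[6])<<8 | uint32(raw[7])
	return fmt.Sprintf("SPDY v%d %s (flags=0x%02x, payload=%d bytes)", version, frameTypes[typ], flags, length)
}

func main() {
	for _, dump := range []string{
		"800300010000002900000001", // start of log line 5
		"800300060000000400000001", // log line 6
		"800300070000000800000000", // start of log line 7
	} {
		raw, _ := hex.DecodeString(dump)
		fmt.Println(describeFrame(raw))
	}
}
```

Running it prints SPDY v3 `SYN_STREAM` (payload 41 bytes), `PING` (4 bytes) and `GOAWAY` (8 bytes), matching the interpretation above.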
The problem can also be confirmed by sniffing the traffic between agentk and the Kube API server (note that you'll need to add a `KeyLogWriter` to the TLS config used by agentk for the API calls).
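
Go's standard `crypto/tls` package exposes this directly via `tls.Config.KeyLogWriter`. A minimal sketch of what such a debug-only hook could look like; the file path and the exact place where agentk's transport is configured are assumptions, not actual agentk code:

```go
package main

import (
	"crypto/tls"
	"log"
	"os"
)

// newDebugTLSConfig returns a TLS config that appends NSS key log material
// to keyLogPath so tools like Wireshark can decrypt the captured traffic.
// Debugging only - never enable this in production.
func newDebugTLSConfig(keyLogPath string) (*tls.Config, error) {
	f, err := os.OpenFile(keyLogPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600)
	if err != nil {
		return nil, err
	}
	return &tls.Config{KeyLogWriter: f}, nil
}

func main() {
	cfg, err := newDebugTLSConfig("/tmp/keylog.txt") // path is an arbitrary example
	if err != nil {
		log.Fatal(err)
	}
	_ = cfg // plug this into the transport agentk uses for its Kube API calls
}
```

Pointing Wireshark's TLS "(Pre)-Master-Secret log filename" preference at that file then lets it decrypt the captured streams.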
And the entire decrypted TLS stream:
When connecting to the Kube API server directly using `kubectl`, the decrypted TLS stream looks like this:
... the only difference here is that the TCP packets are segmented differently (the data is the same) - though this is pretty much random and shouldn't affect the functionality ...
## Affected
I've run several tests using different KAS / agentk versions, Kubernetes distributions and Kubernetes versions (built with different Go versions), and the results are pretty random. What I can say, though, applies at least to my machine. I'm also connecting directly to KAS (no nginx whatsoever between `kubectl` and KAS).
KAS / agentk versions tested:

- master
- v15.8.0

Against the following configurations:
| distribution | k8s version | k8s go version | container runtime | works? |
|---|---|---|---|---|
| rancher-desktop | v1.24.12+k3s1 | go1.19.7 | containerd://1.6.8 | |
| k3d | v1.25.6+k3s1 | go1.19.5 | containerd://1.6.15-k3s1 | |
| kind | v1.22.17 | go1.16.15 | containerd://1.6.19-46-g941215f49 | |
| kind | v1.23.17 | go1.19.6 | containerd://1.6.19-46-g941215f49 | |
| kind | v1.24.12 | go1.19.7 | containerd://1.6.19-46-g941215f49 | |
| kind | v1.26.3 | go1.19.7 | containerd://1.6.19-46-g941215f49 | |
| gke | v1.22.17-gke.8000 | go1.16.15b7 | containerd://1.5.18 | |
| gke | v1.23.17-gke.2000 | go1.19.6 | containerd://1.5.18 | |
| gke | v1.24.12-gke.1000 | go1.19.7 | containerd://1.6.18 | |
| gke | v1.25.8-gke.1000 | go1.19.7 | containerd://1.6.18 | |
| gke | v1.26.3-gke.1000 | go1.19.7 | containerd://1.6.18 | |
| eks | v1.22.17-eks-ec5523e | go1.16.15 | docker://20.10.17 | |
| eks | v1.23.17-eks-ec5523e | go1.19.6 | docker://20.10.17 | |
| eks | v1.24.12-eks-ec5523e | go1.19.7 | containerd://1.6.19 | |
| eks | v1.25.8-eks-ec5523e | go1.19.7 | containerd://1.6.19 | |
| eks | v1.26.2-eks-a59e1f0 | go1.19.6 | containerd://1.6.19 | |
As mentioned before, the results (`works?` column) are pretty much random. It first looked like k8s versions after `v1.24` are affected, but `kind` actually never works and `eks v1.23.17` is also an outlier in that data ...
## Ideas and next steps?
The biggest mystery for me at the moment is still why it only fails for proxied requests via KAS / agentk - it works fine when connecting to the Kube API directly - even though the streams look exactly the same. Some ideas about possible causes:
- Related to Kubernetes configuration (Kube API, kubelet, ...)
- Related to the container runtime used by kubelet
- Another bug somewhere in the stack
  - Maybe in `moby/spdystream` (according to the Git history there have been problems with it earlier)?
## Prior Art
Last release we discovered a similar bug related to SPDY, specifically the `SYN_REPLY` frame.
It turned out to be a regression in the Go standard library that was introduced into Go 1.19 with https://go-review.googlesource.com/c/go/+/269997. I've reported it in https://github.com/golang/go/issues/59564; a fix was submitted in https://go-review.googlesource.com/c/go/+/485775 and is currently under review.
In KAS 15.11 we've worked around this bug with Fix double flush for proxied 1xx responses to a... (!912 - merged).