Support mTLS for receptive agents
See individual commits.
## Testing setup
I couldn't test this with GDK (something was wrong with it), so I tested the changes by mocking the Rails reply (like before the API existed). Here is my configuration for testing.

Note that both `agent.gdk.test` and `agent1.gdk.test` resolve to the same IP. This is to test certificate validation.
```shell
$ cat /etc/hosts
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost
172.16.123.1 gdk.test registry.test agent.gdk.test agent1.gdk.test
```
Mocking `/api/v4/internal/kubernetes/receptive_agents` in `internal/module/kas2agentk_tunnel/server/module.go`:
```go
resp := &gapi.GetReceptiveAgentsResponse{
	Agents: []*gapi.ReceptiveAgent{
		{
			Id:  3,
			Url: "grpcs://agent1.gdk.test:8082",
			//CaCert: "",
			//TlsHost: "agent.gdk.test",
			AuthConfig: &gapi.ReceptiveAgent_Mtls{
				Mtls: &gapi.ReceptiveAgentMutualTLSAuth{
					ClientCert: `-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
`,
					ClientKey: `-----BEGIN PRIVATE KEY-----
...
-----END PRIVATE KEY-----
`,
				},
			},
		},
	},
}
```
agentk command line:

```
--context=rancher-desktop
--token-file=token-gdk-agent1.txt
--api-cert-file=agent.gdk.test.pem
--api-key-file=agent.gdk.test-key.pem
--api-mtls=true
--api-listen-address=:8082
--private-api-jwt-file=private-api-gdk-secret.txt
```
agentk env:

```
GRPC_GO_LOG_SEVERITY_LEVEL=debug
GRPC_GO_LOG_VERBOSITY_LEVEL=99
LOG_LEVEL=debug
OWN_PRIVATE_API_URL=grpc://127.0.0.1:8081
POD_NAME=agent1
POD_NAMESPACE=ns
```
Certs can be generated using mkcert, like this:

```shell
brew install mkcert  # if you don't have it already
mkcert -install      # if you don't have it already
mkcert -ecdsa agent.gdk.test           # generate server certs, i.e. what agentk will use
mkcert --client -ecdsa agent.gdk.test  # generate client certs, i.e. what kas will use
```

```shell
$ ls -la
-rw-------@ 1 mike staff  241 29 Aug 11:35 agent.gdk.test-client-key.pem
-rw-r--r--@ 1 mike staff 1196 29 Aug 11:35 agent.gdk.test-client.pem
-rw-------@ 1 mike staff  241 29 Aug 11:35 agent.gdk.test-key.pem
-rw-r--r--@ 1 mike staff 1184 29 Aug 11:35 agent.gdk.test.pem
```
## Testing results
### Happy path

It simply works, nothing interesting to show.
### Invalid server cert

What's more interesting is how it fails when the certificate doesn't match the host name. Note that we use `agent1.gdk.test` as the host, but the server certificate is for `agent.gdk.test`.
```go
resp := &gapi.GetReceptiveAgentsResponse{
	Agents: []*gapi.ReceptiveAgent{
		{
			Id:  3,
			Url: "grpcs://agent1.gdk.test:8082",
			// ...
		},
	},
}
```
kas logs:

```
{"time":"2024-08-29T14:31:42.609785+10:00","level":"INFO","msg":"[core]Creating new client transport to \"{Addr: \\\"172.16.123.1:8082\\\", ServerName: \\\"agent1.gdk.test:8082\\\", }\": connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate is valid for agent.gdk.test, not agent1.gdk.test\""}
```
agentk logs:

```
{"time":"2024-08-29T14:51:27.442938+10:00","level":"INFO","msg":"[core][Server #1]grpc: Server.Serve failed to create ServerTransport: connection error: desc = \"ServerHandshake(\\\"172.16.123.1:64158\\\") failed: remote error: tls: bad certificate\""}
```
If we set `TlsHost` to `agent.gdk.test`, we can connect even with the "invalid" hostname. Expected behavior. kas logs:
```
...
{"time":"2024-08-29T14:57:35.286652+10:00","level":"INFO","msg":"[core]original dial target is: \"dns:agent1.gdk.test:8082\""}
{"time":"2024-08-29T14:57:35.286729+10:00","level":"INFO","msg":"[core][Channel #13]Channel created"}
{"time":"2024-08-29T14:57:35.286776+10:00","level":"INFO","msg":"[core][Channel #13]parsed dial target is: resolver.Target{URL:url.URL{Scheme:\"dns\", Opaque:\"agent1.gdk.test:8082\", User:(*url.Userinfo)(nil), Host:\"\", Path:\"\", RawPath:\"\", OmitHost:false, ForceQuery:false, RawQuery:\"\", Fragment:\"\", RawFragment:\"\"}}"}
{"time":"2024-08-29T14:57:35.286808+10:00","level":"INFO","msg":"[core][Channel #13]Channel authority set to \"agent.gdk.test\""}
{"time":"2024-08-29T14:57:35.287107+10:00","level":"INFO","msg":"[core][Channel #13]Channel exiting idle mode"}
{"time":"2024-08-29T14:57:35.363111+10:00","level":"INFO","msg":"[core][Channel #13]Resolver state updated: {\n \"Addresses\": [\n {\n \"Addr\": \"172.16.123.1:8082\",\n \"ServerName\": \"\",\n \"Attributes\": null,\n \"BalancerAttributes\": null,\n \"Metadata\": null\n }\n ],\n \"Endpoints\": [\n {\n \"Addresses\": [\n {\n \"Addr\": \"172.16.123.1:8082\",\n \"ServerName\": \"\",\n \"Attributes\": null,\n \"BalancerAttributes\": null,\n \"Metadata\": null\n }\n ],\n \"Attributes\": null\n }\n ],\n \"ServiceConfig\": null,\n \"Attributes\": null\n} (resolver returned new addresses)"}
{"time":"2024-08-29T14:57:35.363222+10:00","level":"INFO","msg":"[core][Channel #13]Channel switches to new LB policy \"round_robin\""}
{"time":"2024-08-29T14:57:35.363357+10:00","level":"INFO","msg":"[balancer]base.baseBalancer: got new ClientConn state: {{[{Addr: \"172.16.123.1:8082\", ServerName: \"\", }] [{[{Addr: \"172.16.123.1:8082\", ServerName: \"\", }] <nil>}] <nil> <nil>} <nil>}"}
{"time":"2024-08-29T14:57:35.363411+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel created"}
{"time":"2024-08-29T14:57:35.363455+10:00","level":"INFO","msg":"[roundrobin]roundrobinPicker: Build called with info: {map[]}"}
{"time":"2024-08-29T14:57:35.363501+10:00","level":"INFO","msg":"[core][Channel #13]Channel Connectivity change to CONNECTING"}
{"time":"2024-08-29T14:57:35.363509+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel Connectivity change to CONNECTING"}
{"time":"2024-08-29T14:57:35.363549+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel picks a new address \"172.16.123.1:8082\" to connect"}
{"time":"2024-08-29T14:57:35.363596+10:00","level":"INFO","msg":"[balancer]base.baseBalancer: handle SubConn state change: 0xc001355470, CONNECTING"}
{"time":"2024-08-29T14:57:35.368221+10:00","level":"INFO","msg":"[core]CPU time info is unavailable on non-linux environments."}
{"time":"2024-08-29T14:57:35.386206+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel Connectivity change to READY"}
{"time":"2024-08-29T14:57:35.386275+10:00","level":"INFO","msg":"[balancer]base.baseBalancer: handle SubConn state change: 0xc001355470, READY"}
{"time":"2024-08-29T14:57:35.386337+10:00","level":"INFO","msg":"[roundrobin]roundrobinPicker: Build called with info: {map[SubConn(id:14):{{Addr: \"172.16.123.1:8082\", ServerName: \"\", }}]}"}
{"time":"2024-08-29T14:57:35.386385+10:00","level":"INFO","msg":"[core][Channel #13]Channel Connectivity change to READY"}
...
{"time":"2024-08-29T14:57:49.60601+10:00","level":"INFO","msg":"Registering agent","agent_id":3,"agent_version":"v0.0.0","expires":"2024-08-29T05:12:49.606003Z","pod_name":"agent1","pod_namespace":"ns"}
{"time":"2024-08-29T14:57:50.128083+10:00","level":"INFO","msg":"Config: new commit","grpc_service":"gitlab.agent.agent_configuration.rpc.AgentConfiguration","grpc_method":"GetConfiguration","agent_id":3,"project_id":"root/agents","commit_id":"0c5bd4b671e02299775b4b3fbe53e007f0c2c87c"}
```
### Invalid client cert

I copied, then uninstalled and deleted the existing CA that mkcert had installed on my machine. Let's see how providing root CAs works and fails. With the same client and server certs, kas logs:
```
{"time":"2024-08-29T15:26:43.742286+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority\""}
```
agentk logs:

```
{"time":"2024-08-29T15:27:24.105034+10:00","level":"INFO","msg":"[core][Server #1]grpc: Server.Serve failed to create ServerTransport: connection error: desc = \"ServerHandshake(\\\"172.16.123.1:65402\\\") failed: remote error: tls: bad certificate\""}
```
This is because both sides present certificates that the other end doesn't have a CA for. OK, let's give kas a CA to validate agentk's server certificate (using our fake `GetReceptiveAgentsResponse`). Now kas prints a different message:
```
{"time":"2024-08-29T15:31:22.167017+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel Connectivity change to IDLE, last error: connection error: desc = \"error reading server preface: remote error: tls: unknown certificate authority\""}
```
OK, let's give agentk the CA too so that it can validate kas' client certificate. We do this via the new `--api-client-ca-cert-file` command line flag:
```
--context=rancher-desktop
--token-file=token-gdk-agent1.txt
--api-cert-file=agent.gdk.test.pem
--api-key-file=agent.gdk.test-key.pem
--api-client-ca-cert-file=rootCA.pem
--api-listen-address=:8082
--private-api-jwt-file=private-api-gdk-secret.txt
```
And the connection is established just fine this time (`[Channel #13 SubChannel #14]Subchannel Connectivity change to READY`)! We get errors from `AgentInfo()` because the GitLab client in kas doesn't have the CA to validate GDK's GitLab certificate. But this proves that kas connected to agentk, agentk found the tunnel, agentk sent a request to kas, and kas tried to validate agentk's token by calling GitLab, which is where it failed. kas<->agentk mTLS works as expected, which is what we wanted to test here.
kas logs:

```
...
{"time":"2024-08-29T15:34:05.743562+10:00","level":"INFO","msg":"[core]original dial target is: \"dns:agent.gdk.test:8082\""}
{"time":"2024-08-29T15:34:05.743629+10:00","level":"INFO","msg":"[core][Channel #13]Channel created"}
{"time":"2024-08-29T15:34:05.743674+10:00","level":"INFO","msg":"[core][Channel #13]parsed dial target is: resolver.Target{URL:url.URL{Scheme:\"dns\", Opaque:\"agent.gdk.test:8082\", User:(*url.Userinfo)(nil), Host:\"\", Path:\"\", RawPath:\"\", OmitHost:false, ForceQuery:false, RawQuery:\"\", Fragment:\"\", RawFragment:\"\"}}"}
{"time":"2024-08-29T15:34:05.743706+10:00","level":"INFO","msg":"[core][Channel #13]Channel authority set to \"agent.gdk.test:8082\""}
{"time":"2024-08-29T15:34:05.743865+10:00","level":"INFO","msg":"[core][Channel #13]Channel exiting idle mode"}
{"time":"2024-08-29T15:34:05.806692+10:00","level":"INFO","msg":"[core][Channel #13]Resolver state updated: {\n \"Addresses\": [\n {\n \"Addr\": \"172.16.123.1:8082\",\n \"ServerName\": \"\",\n \"Attributes\": null,\n \"BalancerAttributes\": null,\n \"Metadata\": null\n }\n ],\n \"Endpoints\": [\n {\n \"Addresses\": [\n {\n \"Addr\": \"172.16.123.1:8082\",\n \"ServerName\": \"\",\n \"Attributes\": null,\n \"BalancerAttributes\": null,\n \"Metadata\": null\n }\n ],\n \"Attributes\": null\n }\n ],\n \"ServiceConfig\": null,\n \"Attributes\": null\n} (resolver returned new addresses)"}
{"time":"2024-08-29T15:34:05.806789+10:00","level":"INFO","msg":"[core][Channel #13]Channel switches to new LB policy \"round_robin\""}
{"time":"2024-08-29T15:34:05.806898+10:00","level":"INFO","msg":"[balancer]base.baseBalancer: got new ClientConn state: {{[{Addr: \"172.16.123.1:8082\", ServerName: \"\", }] [{[{Addr: \"172.16.123.1:8082\", ServerName: \"\", }] <nil>}] <nil> <nil>} <nil>}"}
{"time":"2024-08-29T15:34:05.806949+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel created"}
{"time":"2024-08-29T15:34:05.80699+10:00","level":"INFO","msg":"[roundrobin]roundrobinPicker: Build called with info: {map[]}"}
{"time":"2024-08-29T15:34:05.807027+10:00","level":"INFO","msg":"[core][Channel #13]Channel Connectivity change to CONNECTING"}
{"time":"2024-08-29T15:34:05.807075+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel Connectivity change to CONNECTING"}
{"time":"2024-08-29T15:34:05.807139+10:00","level":"INFO","msg":"[balancer]base.baseBalancer: handle SubConn state change: 0xc0005b8d20, CONNECTING"}
{"time":"2024-08-29T15:34:05.807141+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel picks a new address \"172.16.123.1:8082\" to connect"}
{"time":"2024-08-29T15:34:05.811498+10:00","level":"INFO","msg":"[core]CPU time info is unavailable on non-linux environments."}
{"time":"2024-08-29T15:34:05.833905+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel Connectivity change to READY"}
{"time":"2024-08-29T15:34:05.833984+10:00","level":"INFO","msg":"[balancer]base.baseBalancer: handle SubConn state change: 0xc0005b8d20, READY"}
{"time":"2024-08-29T15:34:05.834045+10:00","level":"INFO","msg":"[roundrobin]roundrobinPicker: Build called with info: {map[SubConn(id:14):{{Addr: \"172.16.123.1:8082\", ServerName: \"\", }}]}"}
{"time":"2024-08-29T15:34:05.834099+10:00","level":"INFO","msg":"[core][Channel #13]Channel Connectivity change to READY"}
...
{"time":"2024-08-29T15:34:13.317839+10:00","level":"ERROR","msg":"AgentInfo()","grpc_service":"gitlab.agent.agent_registrar.rpc.AgentRegistrar","grpc_method":"Register","error":"Get \"https://gdk.test:3333/api/v4/internal/kubernetes/agent_info\": tls: failed to verify certificate: x509: certificate signed by unknown authority"}
```