
Support mTLS for receptive agents

Mikhail Mazurskiy requested to merge ash2k/receptive-mtls into master

See individual commits.

Testing setup

I couldn't test this with GDK (something was wrong with it), so I tested the changes by mocking the Rails reply (as was done before there was an API). Here is my configuration for testing:

Note that the agent.gdk.test and agent1.gdk.test hosts resolve to the same IP. This is to test cert validation.

cat /etc/hosts
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1	localhost
255.255.255.255	broadcasthost
::1             localhost
172.16.123.1 gdk.test registry.test agent.gdk.test agent1.gdk.test

Mocking /api/v4/internal/kubernetes/receptive_agents in internal/module/kas2agentk_tunnel/server/module.go:

resp := &gapi.GetReceptiveAgentsResponse{
	Agents: []*gapi.ReceptiveAgent{
		{
			Id:  3,
			Url: "grpcs://agent1.gdk.test:8082",
			//CaCert:     "",
			//TlsHost: "agent.gdk.test",
			AuthConfig: &gapi.ReceptiveAgent_Mtls{
				Mtls: &gapi.ReceptiveAgentMutualTLSAuth{
					ClientCert: `-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
`,
					ClientKey: `-----BEGIN PRIVATE KEY-----
...
-----END PRIVATE KEY-----
`,
				},
			},
		},
	},
}
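
For readers unfamiliar with how fields like these usually end up in a Go TLS client, here is a minimal sketch. It is not the code from this MR: clientMTLSConfig and newClientPair are hypothetical helpers, and the mapping (CaCert to RootCAs, TlsHost to ServerName, ClientCert/ClientKey to the client key pair) is an assumption about how such a response could be consumed. The throwaway self-signed pair only stands in for the mkcert-generated client cert so that the example runs without any files.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// clientMTLSConfig is a hypothetical helper (not from the MR) that turns
// PEM strings like the ones in the mocked response into a *tls.Config.
func clientMTLSConfig(caCertPEM, tlsHost, clientCertPEM, clientKeyPEM string) (*tls.Config, error) {
	pair, err := tls.X509KeyPair([]byte(clientCertPEM), []byte(clientKeyPEM))
	if err != nil {
		return nil, fmt.Errorf("client key pair: %w", err)
	}
	cfg := &tls.Config{
		Certificates: []tls.Certificate{pair}, // presented to the server during the handshake
		ServerName:   tlsHost,                 // empty: left for the dialer to derive from the target
		MinVersion:   tls.VersionTLS12,
	}
	if caCertPEM != "" {
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM([]byte(caCertPEM)) {
			return nil, errors.New("CA cert: no certificates found in PEM data")
		}
		cfg.RootCAs = pool // nil RootCAs means the system trust store is used
	}
	return cfg, nil
}

// newClientPair generates a throwaway self-signed ECDSA cert and key, PEM-encoded.
// It only stands in for the client cert files produced by mkcert.
func newClientPair() (certPEM, keyPEM string, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return "", "", err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "example-client"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return "", "", err
	}
	keyDER, err := x509.MarshalPKCS8PrivateKey(key)
	if err != nil {
		return "", "", err
	}
	certPEM = string(pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}))
	keyPEM = string(pem.EncodeToMemory(&pem.Block{Type: "PRIVATE KEY", Bytes: keyDER}))
	return certPEM, keyPEM, nil
}

func main() {
	certPEM, keyPEM, err := newClientPair()
	if err != nil {
		panic(err)
	}
	// Empty CA (system roots), explicit TLS host, as in the commented-out fields above.
	cfg, err := clientMTLSConfig("", "agent.gdk.test", certPEM, keyPEM)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.ServerName, len(cfg.Certificates), cfg.RootCAs == nil)
}
```

Leaving RootCAs nil when no CA is supplied mirrors the setup here, where the mkcert root CA is installed into the system trust store.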

agentk command line:

--context=rancher-desktop
--token-file=token-gdk-agent1.txt
--api-cert-file=agent.gdk.test.pem
--api-key-file=agent.gdk.test-key.pem
--api-mtls=true
--api-listen-address=:8082
--private-api-jwt-file=private-api-gdk-secret.txt

agentk env:

GRPC_GO_LOG_SEVERITY_LEVEL=debug;GRPC_GO_LOG_VERBOSITY_LEVEL=99;LOG_LEVEL=debug;OWN_PRIVATE_API_URL=grpc://127.0.0.1:8081;POD_NAME=agent1;POD_NAMESPACE=ns

Certs can be generated using mkcert, like this:

brew install mkcert # if you don't have it already
mkcert -install # if you don't have it already
mkcert -ecdsa agent.gdk.test # generate server certs i.e. what agentk will use
mkcert --client -ecdsa agent.gdk.test # generate client certs i.e. what kas will use
ls -la                                             
-rw-------@  1 mike  staff    241 29 Aug 11:35 agent.gdk.test-client-key.pem
-rw-r--r--@  1 mike  staff   1196 29 Aug 11:35 agent.gdk.test-client.pem
-rw-------@  1 mike  staff    241 29 Aug 11:35 agent.gdk.test-key.pem
-rw-r--r--@  1 mike  staff   1184 29 Aug 11:35 agent.gdk.test.pem

Testing results

Happy path

It simply works, nothing interesting to show.

Invalid server cert

It is more interesting to see how it fails, for example when the cert doesn't match the host name. Note that we use agent1.gdk.test as the host, but the server cert is for agent.gdk.test.

resp := &gapi.GetReceptiveAgentsResponse{
	Agents: []*gapi.ReceptiveAgent{
		{
			Id:  3,
			Url: "grpcs://agent1.gdk.test:8082",
			// ...
		},
	},
}

kas logs:

{"time":"2024-08-29T14:31:42.609785+10:00","level":"INFO","msg":"[core]Creating new client transport to \"{Addr: \\\"172.16.123.1:8082\\\", ServerName: \\\"agent1.gdk.test:8082\\\", }\": connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate is valid for agent.gdk.test, not agent1.gdk.test\""}

agentk logs:

{"time":"2024-08-29T14:51:27.442938+10:00","level":"INFO","msg":"[core][Server #1]grpc: Server.Serve failed to create ServerTransport: connection error: desc = \"ServerHandshake(\\\"172.16.123.1:64158\\\") failed: remote error: tls: bad certificate\""}

If we set TlsHost to agent.gdk.test, we can connect even with the "invalid" hostname. This is the expected behavior. kas logs:

...
{"time":"2024-08-29T14:57:35.286652+10:00","level":"INFO","msg":"[core]original dial target is: \"dns:agent1.gdk.test:8082\""}
{"time":"2024-08-29T14:57:35.286729+10:00","level":"INFO","msg":"[core][Channel #13]Channel created"}
{"time":"2024-08-29T14:57:35.286776+10:00","level":"INFO","msg":"[core][Channel #13]parsed dial target is: resolver.Target{URL:url.URL{Scheme:\"dns\", Opaque:\"agent1.gdk.test:8082\", User:(*url.Userinfo)(nil), Host:\"\", Path:\"\", RawPath:\"\", OmitHost:false, ForceQuery:false, RawQuery:\"\", Fragment:\"\", RawFragment:\"\"}}"}
{"time":"2024-08-29T14:57:35.286808+10:00","level":"INFO","msg":"[core][Channel #13]Channel authority set to \"agent.gdk.test\""}
{"time":"2024-08-29T14:57:35.287107+10:00","level":"INFO","msg":"[core][Channel #13]Channel exiting idle mode"}
{"time":"2024-08-29T14:57:35.363111+10:00","level":"INFO","msg":"[core][Channel #13]Resolver state updated: {\n  \"Addresses\": [\n    {\n      \"Addr\": \"172.16.123.1:8082\",\n      \"ServerName\": \"\",\n      \"Attributes\": null,\n      \"BalancerAttributes\": null,\n      \"Metadata\": null\n    }\n  ],\n  \"Endpoints\": [\n    {\n      \"Addresses\": [\n        {\n          \"Addr\": \"172.16.123.1:8082\",\n          \"ServerName\": \"\",\n          \"Attributes\": null,\n          \"BalancerAttributes\": null,\n          \"Metadata\": null\n        }\n      ],\n      \"Attributes\": null\n    }\n  ],\n  \"ServiceConfig\": null,\n  \"Attributes\": null\n} (resolver returned new addresses)"}
{"time":"2024-08-29T14:57:35.363222+10:00","level":"INFO","msg":"[core][Channel #13]Channel switches to new LB policy \"round_robin\""}
{"time":"2024-08-29T14:57:35.363357+10:00","level":"INFO","msg":"[balancer]base.baseBalancer: got new ClientConn state: {{[{Addr: \"172.16.123.1:8082\", ServerName: \"\", }] [{[{Addr: \"172.16.123.1:8082\", ServerName: \"\", }] <nil>}] <nil> <nil>} <nil>}"}
{"time":"2024-08-29T14:57:35.363411+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel created"}
{"time":"2024-08-29T14:57:35.363455+10:00","level":"INFO","msg":"[roundrobin]roundrobinPicker: Build called with info: {map[]}"}
{"time":"2024-08-29T14:57:35.363501+10:00","level":"INFO","msg":"[core][Channel #13]Channel Connectivity change to CONNECTING"}
{"time":"2024-08-29T14:57:35.363509+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel Connectivity change to CONNECTING"}
{"time":"2024-08-29T14:57:35.363549+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel picks a new address \"172.16.123.1:8082\" to connect"}
{"time":"2024-08-29T14:57:35.363596+10:00","level":"INFO","msg":"[balancer]base.baseBalancer: handle SubConn state change: 0xc001355470, CONNECTING"}
{"time":"2024-08-29T14:57:35.368221+10:00","level":"INFO","msg":"[core]CPU time info is unavailable on non-linux environments."}
{"time":"2024-08-29T14:57:35.386206+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel Connectivity change to READY"}
{"time":"2024-08-29T14:57:35.386275+10:00","level":"INFO","msg":"[balancer]base.baseBalancer: handle SubConn state change: 0xc001355470, READY"}
{"time":"2024-08-29T14:57:35.386337+10:00","level":"INFO","msg":"[roundrobin]roundrobinPicker: Build called with info: {map[SubConn(id:14):{{Addr: \"172.16.123.1:8082\", ServerName: \"\", }}]}"}
{"time":"2024-08-29T14:57:35.386385+10:00","level":"INFO","msg":"[core][Channel #13]Channel Connectivity change to READY"}
...
{"time":"2024-08-29T14:57:49.60601+10:00","level":"INFO","msg":"Registering agent","agent_id":3,"agent_version":"v0.0.0","expires":"2024-08-29T05:12:49.606003Z","pod_name":"agent1","pod_namespace":"ns"}
{"time":"2024-08-29T14:57:50.128083+10:00","level":"INFO","msg":"Config: new commit","grpc_service":"gitlab.agent.agent_configuration.rpc.AgentConfiguration","grpc_method":"GetConfiguration","agent_id":3,"project_id":"root/agents","commit_id":"0c5bd4b671e02299775b4b3fbe53e007f0c2c87c"}
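
The hostname mismatch and the effect of the TlsHost override can be reproduced in isolation with the Go standard library. This is only a sketch under assumptions: newSelfSignedCert and nameToVerify are hypothetical helpers, not code from this MR, and the throwaway certificate stands in for the mkcert-generated server cert. It checks the same certificate against the URL host and against the overridden name.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newSelfSignedCert creates a throwaway ECDSA certificate valid for the given DNS names.
func newSelfSignedCert(dnsNames ...string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, fmt.Errorf("generate key: %w", err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "example-server"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     dnsNames,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, fmt.Errorf("create certificate: %w", err)
	}
	return x509.ParseCertificate(der)
}

// nameToVerify picks the name the certificate is checked against:
// the explicit TLS host if one is set, otherwise the host from the URL.
func nameToVerify(urlHost, tlsHost string) string {
	if tlsHost != "" {
		return tlsHost
	}
	return urlHost
}

func main() {
	cert, err := newSelfSignedCert("agent.gdk.test")
	if err != nil {
		panic(err)
	}
	// URL host only: the cert is not valid for agent1.gdk.test, so this is an error.
	fmt.Println("without TLS host:", cert.VerifyHostname(nameToVerify("agent1.gdk.test", "")))
	// With the override, the cert is checked against agent.gdk.test instead.
	fmt.Println("with TLS host:", cert.VerifyHostname(nameToVerify("agent1.gdk.test", "agent.gdk.test")))
}
```

Only the name used for verification changes; the dial target (and therefore the resolved IP) stays the same, which matches the logs above where the dial target is agent1.gdk.test:8082 but the channel authority is agent.gdk.test.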

Invalid client cert

I copied the existing CA that mkcert installed onto my machine, then uninstalled and deleted it. Let's see how providing root CAs works and fails.

So, with the same client and server certs, kas logs:

{"time":"2024-08-29T15:26:43.742286+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority\""}

agentk logs:

{"time":"2024-08-29T15:27:24.105034+10:00","level":"INFO","msg":"[core][Server #1]grpc: Server.Serve failed to create ServerTransport: connection error: desc = \"ServerHandshake(\\\"172.16.123.1:65402\\\") failed: remote error: tls: bad certificate\""}

This is because both ends present certificates that the other end doesn't have a CA for. Ok, let's give kas a CA to validate agentk's (server) certificate (using our fake GetReceptiveAgentsResponse). Now kas prints a different message:

{"time":"2024-08-29T15:31:22.167017+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel Connectivity change to IDLE, last error: connection error: desc = \"error reading server preface: remote error: tls: unknown certificate authority\""}

Ok, let's give agentk the CA too so that it can validate kas' client certificate. We do this via the new --api-client-ca-cert-file command line flag:

--context=rancher-desktop
--token-file=token-gdk-agent1.txt
--api-cert-file=agent.gdk.test.pem
--api-key-file=agent.gdk.test-key.pem
--api-client-ca-cert-file=rootCA.pem
--api-listen-address=:8082
--private-api-jwt-file=private-api-gdk-secret.txt

And the connection is established just fine this time ([Channel #13 SubChannel #14]Subchannel Connectivity change to READY)! We get errors from AgentInfo() because the GitLab client in kas doesn't have the CA to validate the GDK's GitLab certificate and fails. But this proves that kas connected to agentk, agentk found the tunnel, agentk sent a request to kas, and kas tried to validate agentk's token by calling GitLab, which is where it failed. kas<->agentk mTLS works as expected, which is what we wanted to test here.

...
{"time":"2024-08-29T15:34:05.743562+10:00","level":"INFO","msg":"[core]original dial target is: \"dns:agent.gdk.test:8082\""}
{"time":"2024-08-29T15:34:05.743629+10:00","level":"INFO","msg":"[core][Channel #13]Channel created"}
{"time":"2024-08-29T15:34:05.743674+10:00","level":"INFO","msg":"[core][Channel #13]parsed dial target is: resolver.Target{URL:url.URL{Scheme:\"dns\", Opaque:\"agent.gdk.test:8082\", User:(*url.Userinfo)(nil), Host:\"\", Path:\"\", RawPath:\"\", OmitHost:false, ForceQuery:false, RawQuery:\"\", Fragment:\"\", RawFragment:\"\"}}"}
{"time":"2024-08-29T15:34:05.743706+10:00","level":"INFO","msg":"[core][Channel #13]Channel authority set to \"agent.gdk.test:8082\""}
{"time":"2024-08-29T15:34:05.743865+10:00","level":"INFO","msg":"[core][Channel #13]Channel exiting idle mode"}
{"time":"2024-08-29T15:34:05.806692+10:00","level":"INFO","msg":"[core][Channel #13]Resolver state updated: {\n  \"Addresses\": [\n    {\n      \"Addr\": \"172.16.123.1:8082\",\n      \"ServerName\": \"\",\n      \"Attributes\": null,\n      \"BalancerAttributes\": null,\n      \"Metadata\": null\n    }\n  ],\n  \"Endpoints\": [\n    {\n      \"Addresses\": [\n        {\n          \"Addr\": \"172.16.123.1:8082\",\n          \"ServerName\": \"\",\n          \"Attributes\": null,\n          \"BalancerAttributes\": null,\n          \"Metadata\": null\n        }\n      ],\n      \"Attributes\": null\n    }\n  ],\n  \"ServiceConfig\": null,\n  \"Attributes\": null\n} (resolver returned new addresses)"}
{"time":"2024-08-29T15:34:05.806789+10:00","level":"INFO","msg":"[core][Channel #13]Channel switches to new LB policy \"round_robin\""}
{"time":"2024-08-29T15:34:05.806898+10:00","level":"INFO","msg":"[balancer]base.baseBalancer: got new ClientConn state: {{[{Addr: \"172.16.123.1:8082\", ServerName: \"\", }] [{[{Addr: \"172.16.123.1:8082\", ServerName: \"\", }] <nil>}] <nil> <nil>} <nil>}"}
{"time":"2024-08-29T15:34:05.806949+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel created"}
{"time":"2024-08-29T15:34:05.80699+10:00","level":"INFO","msg":"[roundrobin]roundrobinPicker: Build called with info: {map[]}"}
{"time":"2024-08-29T15:34:05.807027+10:00","level":"INFO","msg":"[core][Channel #13]Channel Connectivity change to CONNECTING"}
{"time":"2024-08-29T15:34:05.807075+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel Connectivity change to CONNECTING"}
{"time":"2024-08-29T15:34:05.807139+10:00","level":"INFO","msg":"[balancer]base.baseBalancer: handle SubConn state change: 0xc0005b8d20, CONNECTING"}
{"time":"2024-08-29T15:34:05.807141+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel picks a new address \"172.16.123.1:8082\" to connect"}
{"time":"2024-08-29T15:34:05.811498+10:00","level":"INFO","msg":"[core]CPU time info is unavailable on non-linux environments."}
{"time":"2024-08-29T15:34:05.833905+10:00","level":"INFO","msg":"[core][Channel #13 SubChannel #14]Subchannel Connectivity change to READY"}
{"time":"2024-08-29T15:34:05.833984+10:00","level":"INFO","msg":"[balancer]base.baseBalancer: handle SubConn state change: 0xc0005b8d20, READY"}
{"time":"2024-08-29T15:34:05.834045+10:00","level":"INFO","msg":"[roundrobin]roundrobinPicker: Build called with info: {map[SubConn(id:14):{{Addr: \"172.16.123.1:8082\", ServerName: \"\", }}]}"}
{"time":"2024-08-29T15:34:05.834099+10:00","level":"INFO","msg":"[core][Channel #13]Channel Connectivity change to READY"}
...
{"time":"2024-08-29T15:34:13.317839+10:00","level":"ERROR","msg":"AgentInfo()","grpc_service":"gitlab.agent.agent_registrar.rpc.AgentRegistrar","grpc_method":"Register","error":"Get \"https://gdk.test:3333/api/v4/internal/kubernetes/agent_info\": tls: failed to verify certificate: x509: certificate signed by unknown authority"}
Edited by Mikhail Mazurskiy
