Limit the max age of a TLS keepalive connection
Why was this MR needed?
Previously, the Runner kept the default DisableKeepAlives setting of false, which ensures that API requests to POST /api/v4/jobs/request reuse the same connection before and after jobs run. This connection appears to live indefinitely, and such a long-lived connection can cause a number of problems:
- When TLS certificates were rotated on GitLab.com, existing connections continued to use the old ones to populate CI_SERVER_TLS_CA_FILE for Git clones. Limiting the connection to 15 minutes will force the Runner to reconnect and pick up the latest certificates.
- As https://github.com/golang/go/issues/54429 describes, web services may scale up over time and distribute the load. Long-lived connections can prevent connections from being evenly distributed.
This commit also adds a connection_max_age setting. If the value is not specified, the default of 15 minutes is used.

When the max age is reached, this commit calls CloseIdleConnections(). This will force a reconnection if all network calls are idle. Once https://github.com/golang/go/issues/54429 is implemented, we could avoid the need to manage this timer.
What's the best way to test this MR?
With main branch

1. In your config.toml, register a runner with gitlab.com.
2. Run tcpdump -i <interface> -w /tmp/gitlab1.pcap host gitlab.com.
3. Run the runner: ./out/binaries/gitlab-runner run --config config.toml
4. Wait 5-10 minutes and hit CTRL-C for both tcpdump and gitlab-runner.
5. Open wireshark /tmp/gitlab1.pcap. On the first Client Info message, click that message, right-click on Follow -> TCP stream.
6. Sort by Info.

You should see many Client Hello messages.
With this branch
1. Check out this branch and compile (make runner-bin-host).
2. In config.toml, add connection_max_age = "1s".
3. Run tcpdump -i <interface> -w /tmp/gitlab2.pcap host gitlab.com.
4. Re-run the runner: ./out/binaries/gitlab-runner run --config config.toml
5. Repeat steps 4-6 from the main branch test; there should be only one Client Hello message for the first TCP stream.
What are the relevant issue numbers?
Relates to #37275 (closed)