Draft: Improve Kubernetes executor's pod ready detection
What does this MR do?
Improves pod readiness detection and handles cases where the Pod is reported as "ready" but actually has unready or terminated containers.
The reason the pod failed is now reported, rather than being silently ignored.
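As a rough illustration of the idea (not the code in this MR), the check can be sketched as below. The types are simplified, hypothetical stand-ins for the pod and container status that Kubernetes reports, and the names checkPodReady and errPodNotReady are made up for this example. The sketch assumes that a terminated container is a hard failure, while a container that is merely not ready yet should be polled again.

```go
package main

import (
	"errors"
	"fmt"
)

// ContainerState is a hypothetical, simplified stand-in for the per-container
// state reported by Kubernetes.
type ContainerState struct {
	Running    bool
	Terminated bool
	Reason     string // e.g. a short reason reported by the runtime
	Message    string // e.g. the runtime's error message
}

// ContainerStatus is a hypothetical, simplified container status.
type ContainerStatus struct {
	Name  string
	Ready bool
	State ContainerState
}

// PodStatus is a hypothetical, simplified pod status.
type PodStatus struct {
	Phase             string // "Pending", "Running", ...
	ContainerStatuses []ContainerStatus
}

// errPodNotReady signals that the caller should keep polling.
var errPodNotReady = errors.New("pod not ready yet")

// checkPodReady returns nil only when the pod is running and every container
// is ready. A terminated container is reported as a distinct error that
// carries the runtime's reason, instead of being silently ignored.
func checkPodReady(pod PodStatus) error {
	if pod.Phase != "Running" {
		return errPodNotReady
	}

	// First pass: any terminated container is a hard failure.
	for _, c := range pod.ContainerStatuses {
		if c.State.Terminated {
			return fmt.Errorf("container %q terminated: %s: %s",
				c.Name, c.State.Reason, c.State.Message)
		}
	}

	// Second pass: containers that are not ready yet are retryable.
	for _, c := range pod.ContainerStatuses {
		if !c.Ready {
			return fmt.Errorf("%w: container %q is not ready", errPodNotReady, c.Name)
		}
	}

	return nil
}
```

A caller would poll checkPodReady, retry while errors.Is(err, errPodNotReady) holds, and fail the job with the returned error otherwise.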
Why was this MR needed?
Fixes an issue where a pod is advertised as ready even though the build container failed to start or was terminated. I think there are a few cases where this can happen, but it occurs easily on Windows if you specify "pwsh" as the shell and use a job image that doesn't contain it.
What's the best way to test this MR?
On a Kubernetes cluster with Windows nodes, specify pwsh as the shell, but use a nanoserver image for the job (which doesn't include pwsh).
Before this MR, the error response is: ERROR: Job failed (system failure): prepare environment: unable to upgrade connection: 404 request not found.
After this MR, the error response is still rather cryptic, but it is the response containerd/Docker returns if you try to start a container with an entrypoint that doesn't exist.
What are the relevant issue numbers?
Closes #29103