Draft: Catch external pod disruptions / terminations
What does this MR do?
When a pod disappears (e.g. on spot/preemptible instances), we now treat this as a `system_failure` instead of a `build_failure`.
Why was this MR needed?
There are cases where pod disruptions are only caught late, because we poll the pod at relatively long intervals.
Example: the process in the build container talks to a service in a service container. When the service container is terminated by the kubelet, the service is no longer reachable, the build script fails, and the job is reported as a `build_failure` even though the cause was external.
What's the best way to test this MR?
- `runner.toml`:

```toml
listen_address = ":9252"
concurrent = 3
check_interval = 1
log_level = "debug"
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "dm"
  limit = 3
  url = "https://gitlab.com/"
  id = 0
  token = "glrt-NopeNopeNope"
  token_obtained_at = 0001-01-01T00:00:00Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "kubernetes"
  shell = "bash"
  [runners.kubernetes]
    image = "ubuntu:22.04"
    privileged = true
    [[runners.kubernetes.services]]
      name = "nginx"
    [[runners.kubernetes.volumes.empty_dir]]
      name = "docker-certs"
      mount_path = "/certs/client"
      medium = "Memory"
  [runners.feature_flags]
    FF_USE_ADVANCED_POD_SPEC_CONFIGURATION = true
    FF_USE_POD_ACTIVE_DEADLINE_SECONDS = true
    FF_PRINT_POD_EVENTS = true
    FF_USE_FASTZIP = true
```
- `pipeline.yaml`:

```yaml
stages:
  - test

variables:
  DOCKER_HOST: tcp://docker:2376
  DOCKER_TLS_CERTDIR: "/certs"
  DOCKER_TLS_VERIFY: 1
  DOCKER_CERT_PATH: "$DOCKER_TLS_CERTDIR/client"
  # FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY: true

default:
  image: docker
  services:
    - docker:dind
    - nginx

Test:
  stage: test
  retry:
    max: 2
    when: runner_system_failure
  script:
    - |
      while true ; do
        echo '====='
        docker info >/dev/null
        wget -O /dev/null http://nginx/
        sleep 10
        # exit 1
        # exit 0
      done
```
- Run a build.
- Disrupt the pod:
  - On EKS you can e.g. use the Fault Injection Service to mimic a spot instance termination.
  - On GCP, the termination of a preemptible instance can be mimicked by a shutdown / ACPI soft-off.
  - You can evict the pod, e.g. with the `kubectl-evict` plugin.
  - You can "just" delete the pod (`kubectl delete pod <pod>`).
- Verify that the job fails with a `system_failure` rather than a `build_failure`.