Draft: Catch external pod disruptions / terminations (!5068) · Merge requests · GitLab.org / gitlab-runner

Hannes Hörl requested to merge hhoerl/catch-external-pod-disruptions into main Oct 10, 2024

What does this MR do?

When a pod disappears (e.g. because a spot/preemptible instance is reclaimed), we treat this as a system_failure.

Why was this MR needed?

There are cases where pod disruptions are only caught late, because we poll the pod at relatively long intervals.

Example: the process in the build container talks to a service in a service container. If the kubelet terminates the service container, the service is no longer reachable, so the build script fails and signals a build_failure, even though the actual cause is the pod disruption.
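The intended classification can be sketched roughly as follows. This is an illustration only, not the code in this MR: `PodSnapshot`, `Disrupted` and `ClassifyFailure` are hypothetical names, and the three flags are assumed stand-ins for what the executor could learn about the pod (the pod object is gone, a deletion was requested, or the pod carries a disruption condition such as Kubernetes' `DisruptionTarget`).

```go
package main

import "fmt"

// FailureReason mirrors the two outcomes discussed in this MR.
type FailureReason string

const (
	BuildFailure  FailureReason = "build_failure"
	SystemFailure FailureReason = "system_failure"
)

// PodSnapshot is a hypothetical, simplified view of the build pod's state.
type PodSnapshot struct {
	Exists            bool // false once the pod object is gone
	DeletionRequested bool // assumed: a deletion timestamp is set on the pod
	DisruptionTarget  bool // assumed: the pod was marked for eviction/preemption
}

// Disrupted reports whether the pod was terminated from outside the job.
func (p PodSnapshot) Disrupted() bool {
	return !p.Exists || p.DeletionRequested || p.DisruptionTarget
}

// ClassifyFailure checks the pod state before the script result, so a script
// that failed only because its pod was disrupted is not reported as a
// build_failure. The second return value is false when nothing failed.
func ClassifyFailure(scriptFailed bool, pod PodSnapshot) (FailureReason, bool) {
	if pod.Disrupted() {
		return SystemFailure, true
	}
	if scriptFailed {
		return BuildFailure, true
	}
	return "", false
}

func main() {
	// The service container was killed, so the script failed as well.
	evicted := PodSnapshot{Exists: true, DisruptionTarget: true}
	reason, _ := ClassifyFailure(true, evicted)
	fmt.Println(reason) // prints "system_failure"
}
```

The ordering is the point of the sketch: the disruption check has to win over the script's exit status, otherwise the scenario above is still reported as a build_failure.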

What's the best way to test this MR?

runner.toml

listen_address = ":9252"
concurrent = 3
check_interval = 1
log_level = "debug"
shutdown_timeout = 0
[session_server]
  session_timeout = 1800
[[runners]]
  name = "dm"
  limit = 3
  url = "https://gitlab.com/"
  id = 0
  token = "glrt-NopeNopeNope"
  token_obtained_at = 0001-01-01T00:00:00Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "kubernetes"
  shell = "bash"
  [runners.kubernetes]
    image = "ubuntu:22.04"
    privileged = true
    [[runners.kubernetes.services]]
      name = "nginx"
    [[runners.kubernetes.volumes.empty_dir]]
      name = "docker-certs"
      mount_path = "/certs/client"
      medium = "Memory"
  [runners.feature_flags]
    FF_USE_ADVANCED_POD_SPEC_CONFIGURATION = true
    FF_USE_POD_ACTIVE_DEADLINE_SECONDS = true
    FF_PRINT_POD_EVENTS = true
    FF_USE_FASTZIP = true

pipeline.yaml

stages:
  - test

variables:
  DOCKER_HOST: tcp://docker:2376
  DOCKER_TLS_CERTDIR: "/certs"
  DOCKER_TLS_VERIFY: 1
  DOCKER_CERT_PATH: "$DOCKER_TLS_CERTDIR/client"
  # FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY: true

default:
  image: docker
  services:
    - docker:dind
    - nginx

Test:
  stage: test
  retry:
    max: 2
    when: runner_system_failure
  script:
    - |
      while true ; do
        echo '====='
        docker info >/dev/null
        wget -O /dev/null http://nginx/
        sleep 10

        # exit 1
        # exit 0
      done

1. Run a build.
2. Disrupt the pod:
   - for EKS you can e.g. use the Fault Injection Service to mimic a spot instance termination
   - on GCP, preemptible instances and their termination can be mimicked by a shutdown / ACPI soft off
   - you can evict the pod, e.g. with kubectl-evict
   - you can "just" delete the pod
3. Check that this results in a system_failure rather than a build_failure.
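Why the distinction matters for the test pipeline: its `retry` block only retries on `runner_system_failure`, so a disruption that is misreported as a script failure would not be retried. A minimal sketch of that decision, with hypothetical `RetryPolicy` and `ShouldRetry` names, and assuming system_failure and build_failure surface to the pipeline as `runner_system_failure` and `script_failure` respectively:

```go
package main

import "fmt"

// RetryPolicy is a hypothetical mirror of the job's `retry:` block.
type RetryPolicy struct {
	Max  int      // maximum number of retries
	When []string // failure reasons that qualify for a retry
}

// ShouldRetry reports whether a failed job qualifies for another attempt,
// given how many retries have already been used.
func ShouldRetry(p RetryPolicy, failure string, retriesDone int) bool {
	if retriesDone >= p.Max {
		return false
	}
	for _, w := range p.When {
		if w == failure {
			return true
		}
	}
	return false
}

func main() {
	// Same values as the `retry:` block in the test pipeline.
	policy := RetryPolicy{Max: 2, When: []string{"runner_system_failure"}}

	fmt.Println(ShouldRetry(policy, "runner_system_failure", 0)) // prints "true"
	fmt.Println(ShouldRetry(policy, "script_failure", 0))        // prints "false"
}
```

With the test pipeline, a correctly classified disruption should therefore show up as a retried job rather than a plain failed one.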
