Draft: Catch external pod disruptions / terminations

Hannes Hörl requested to merge hhoerl/catch-external-pod-disruptions into main

What does this MR do?

When a pod disappears (e.g. because it ran on a spot/preemptible instance that was reclaimed), we treat this as a system_failure.

Why was this MR needed?

There are cases where pod disruptions are only caught late, because we poll the pod at relatively long intervals.

Example: the process in the build container talks to a service in a service container. The kubelet terminates the service container, so the service is no longer reachable; the build script then fails and signals a build_failure, even though the underlying cause is a pod disruption.
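To make the intended outcome concrete, here is a minimal sketch of the classification this MR is after. It is not the runner's actual code: `podState`, `Disrupted`, and `classify` are hypothetical names invented for illustration, and the real change may detect disruptions quite differently. The only point is that a known pod disruption should take precedence over a non-zero script exit code.

```go
package main

import "fmt"

// FailureReason holds the two failure reasons named in this MR.
type FailureReason string

const (
	BuildFailure  FailureReason = "build_failure"
	SystemFailure FailureReason = "system_failure"
)

// podState is a hypothetical, simplified view of what the executor
// knows about the build pod at the time the script finishes.
type podState struct {
	// Disrupted is assumed to be true when the pod was deleted, evicted,
	// or lost together with its node (e.g. spot/preemptible termination).
	Disrupted bool
}

// classify returns the failure reason for a finished script, or an
// empty string if the job succeeded. A disruption wins over the exit
// code, because a failing script may only be a symptom of the disruption.
func classify(scriptExitCode int, pod podState) FailureReason {
	if pod.Disrupted {
		return SystemFailure
	}
	if scriptExitCode != 0 {
		return BuildFailure
	}
	return ""
}

func main() {
	// Script failed because the service container was terminated.
	fmt.Println(classify(1, podState{Disrupted: true}))
	// Script failed on its own, pod was healthy.
	fmt.Println(classify(1, podState{Disrupted: false}))
}
```

With this ordering, the `retry: when: runner_system_failure` rule in the test pipeline would apply to disrupted jobs, while genuine script failures are still reported as build failures.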

What's the best way to test this MR?

  • runner.toml
    listen_address = ":9252"
    concurrent = 3
    check_interval = 1
    log_level = "debug"
    shutdown_timeout = 0
    [session_server]
      session_timeout = 1800
    [[runners]]
      name = "dm"
      limit = 3
      url = "https://gitlab.com/"
      id = 0
      token = "glrt-NopeNopeNope"
      token_obtained_at = 0001-01-01T00:00:00Z
      token_expires_at = 0001-01-01T00:00:00Z
      executor = "kubernetes"
      shell = "bash"
      [runners.kubernetes]
        image = "ubuntu:22.04"
        privileged = true
        [[runners.kubernetes.services]]
          name = "nginx"
        [[runners.kubernetes.volumes.empty_dir]]
          name = "docker-certs"
          mount_path = "/certs/client"
          medium = "Memory"
      [runners.feature_flags]
        FF_USE_ADVANCED_POD_SPEC_CONFIGURATION = true
        FF_USE_POD_ACTIVE_DEADLINE_SECONDS = true
        FF_PRINT_POD_EVENTS = true
        FF_USE_FASTZIP = true
  • pipeline.yaml
    stages:
      - test
    
    variables:
      DOCKER_HOST: tcp://docker:2376
      DOCKER_TLS_CERTDIR: "/certs"
      DOCKER_TLS_VERIFY: 1
      DOCKER_CERT_PATH: "$DOCKER_TLS_CERTDIR/client"
      # FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY: true
    
    default:
      image: docker
      services:
        - docker:dind
        - nginx
    
    Test:
      stage: test
      retry:
        max: 2
        when: runner_system_failure
      script:
        - |
          while true ; do
            echo '====='
            docker info >/dev/null
            wget -O /dev/null http://nginx/
            sleep 10
    
            # exit 1
            # exit 0
          done
  • run a build

  • disrupt the pod

    • for EKS you can e.g. use the Fault Injection Service to mimic a spot instance termination
    • on GCP preemptible instances and the termination thereof can be mimicked by a shutdown / ACPI soft off
    • you can evict the pod, e.g. with kubectl-evict
    • you can "just" delete the pod
  • verify that this results in a system_failure rather than a build_failure

What are the relevant issue numbers?
