Introduce a timeout for prepare stage of a single job execution
Release notes
{placeholder for release notes}
Overview
In gitlab-com/gl-infra/production#2351 (closed) there were networking problems between GitLab and DockerHub which resulted in the `docker pull $IMAGE` command taking an extremely long time, since the Runner had no timeout in place. The `docker pull` ended up running for the full duration of the job, so if the job timeout was 3 hours, the job spent 3 hours trying to pull the image before it finished. This had a cascading effect because the available number of concurrent jobs was filled up by these jobs pulling forever.
Proposal
- Allow the user to define a configurable timeout for the whole `prepare` stage of a single job execution. Note: by introducing the timeout for the `prepare` stage, as noted here, this becomes a generic solution that applies to all CI jobs regardless of the executor that is used.
- Introduce a default limit of 15 minutes. The assumption is that 15 minutes is enough time to prepare the environment for docker+linux; docker+windows may need a longer timeout.
- Enable the default timeout to be changed at the `[[runners]]` level in the runner's `config.toml` file (see the configuration sketch below).
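As a rough illustration of the `[[runners]]`-level setting, below is a minimal Go sketch of how the option might be added to the runner's configuration structs (GitLab Runner parses `config.toml` into Go structs via TOML tags). The field name `PrepareTimeout` and the key `prepare_timeout` are assumptions for illustration only; the actual names would be decided during implementation.

```go
package config

import "time"

// Hypothetical addition to the [[runners]] settings; the field name and
// TOML key are illustrative assumptions, not the Runner's actual schema.
type RunnerSettings struct {
	// PrepareTimeout limits how long the prepare stage of a single job
	// execution may take, e.g. "15m".
	PrepareTimeout string `toml:"prepare_timeout,omitempty"`
}

// GetPrepareTimeout parses the configured value and falls back to the
// proposed default of 15 minutes when the value is unset or invalid.
func (s *RunnerSettings) GetPrepareTimeout() time.Duration {
	if d, err := time.ParseDuration(s.PrepareTimeout); err == nil && d > 0 {
		return d
	}
	return 15 * time.Minute
}
```

In `config.toml` this would then correspond to something like `prepare_timeout = "15m"` inside a `[[runners]]` section (again, the key name is only an assumption here).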
Expected behavior when implemented
- Docker executor: in case of the Docker executor, the `prepare` stage would be limited mostly by the time of pulling the images and spinning up the service containers.
- Docker Machine executor: in case of the Docker Machine executor, it would additionally include the time needed to spin up the autoscaled VM, if that is required by the provided autoscaling configuration.
- VirtualBox/Parallels: in case of the VirtualBox/Parallels executors, it would mostly rely on the time needed to spin up a new VM.
- SSH & Shell: in case of the SSH and Shell executors, the `prepare` stage should be mostly unnoticeable.
- Custom executor: in case of the Custom executor, it fully depends on the driver's configuration.
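Because the timeout wraps the whole `prepare` stage rather than any executor-specific step, the mechanism itself can stay executor-agnostic. The following is a minimal, hypothetical Go sketch (GitLab Runner is written in Go) of enforcing a deadline around a prepare call; the `executor` interface shape and the error message are assumptions for illustration, not the Runner's actual internals.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// executor stands in for any concrete executor (docker, docker+machine,
// virtualbox, shell, custom, ...). Only the prepare step matters here.
type executor interface {
	Prepare(ctx context.Context) error
}

// runPrepare enforces a single timeout around the whole prepare stage,
// independent of which executor is in use.
func runPrepare(e executor, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	errCh := make(chan error, 1)
	go func() { errCh <- e.Prepare(ctx) }()

	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		return fmt.Errorf("prepare stage exceeded %s: %w", timeout, ctx.Err())
	}
}

// slowExecutor simulates a prepare step (e.g. an image pull) that hangs
// far longer than the configured timeout.
type slowExecutor struct{}

func (slowExecutor) Prepare(ctx context.Context) error {
	select {
	case <-time.After(time.Hour): // pretend the pull never finishes
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	err := runPrepare(slowExecutor{}, 2*time.Second)
	fmt.Println("prepare result:", err)
	fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
}
```

In the Runner itself, the context would presumably be derived from the job's own context, so a cancelled job would also cancel a still-running prepare stage.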
Original proposal (included here to provide additional context)
- Implement a timeout when pulling a docker image for the `docker` executor of around 15-30 minutes, so that the job fails if it takes longer than that time (if it has not pulled the image by then, it's safe to say the image isn't going to be pulled). This would make the Runner more resilient to packet loss between the Runner and the registry it's pulling the image from. With a timeout in place the job reaches its failed state much more quickly, and we don't end up putting back pressure on the job queues.
- We need to think carefully about the timeout here and make sure that we leave enough time to pull Windows images that are sometimes 5 GB large. If the `docker pull` logic has a way to see whether we are still receiving data from the network request, it might be ideal to add some kind of check on when the last byte was written (see the sketch after this list).
- We should also investigate how Kubernetes does this internally and try to mimic it, since Kubernetes is resilient to such problems and just shows an error to the user when it can't pull an image.
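To make the original proposal more concrete, here is a rough, self-contained Go sketch of pulling an image through the Docker SDK with both a hard overall timeout and a "last byte written" inactivity check. It is an illustration under assumptions, not Runner code: the helper names and durations are made up, and the exact options type (`types.ImagePullOptions`) varies between Docker SDK versions.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// idleWatchReader resets a watchdog timer every time bytes arrive, so the
// pull is aborted only when the progress stream has been silent for too long.
type idleWatchReader struct {
	r     io.Reader
	timer *time.Timer
	idle  time.Duration
}

func (w *idleWatchReader) Read(p []byte) (int, error) {
	n, err := w.r.Read(p)
	if n > 0 {
		w.timer.Reset(w.idle) // data is still flowing; push the deadline out
	}
	return n, err
}

func pullWithTimeouts(image string, total, idle time.Duration) error {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		return err
	}
	defer cli.Close()

	// Hard upper bound on the whole pull (the 15-30 minute limit from the proposal).
	ctx, cancel := context.WithTimeout(context.Background(), total)
	defer cancel()

	rc, err := cli.ImagePull(ctx, image, types.ImagePullOptions{})
	if err != nil {
		return fmt.Errorf("starting pull: %w", err)
	}
	defer rc.Close()

	// Inactivity watchdog: cancel the pull if no progress bytes arrive for `idle`.
	watchdog := time.AfterFunc(idle, cancel)
	defer watchdog.Stop()

	// The pull only completes once the progress stream is drained.
	_, err = io.Copy(io.Discard, &idleWatchReader{r: rc, timer: watchdog, idle: idle})
	if err != nil {
		return fmt.Errorf("pulling %s: %w", image, err)
	}
	return nil
}

func main() {
	err := pullWithTimeouts("docker.io/library/alpine:latest", 20*time.Minute, 2*time.Minute)
	fmt.Println("pull result:", err)
}
```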