WIP: Add retry on docker requests (!1983) · Merge requests · GitLab.org / gitlab-runner

Steve Xuereb requested to merge 4450-error-job-failed-system-failure-error-no-such-container-container_id-retry into master Apr 01, 2020

What does this MR do?

Add a retry mechanism for all Docker commands that can return a 404 when sending requests to the Docker API. It uses the retry package, so every time the API returns a 404 the request is retried with some backoff logic.

Why was this MR needed?

Sometimes, when the Docker daemon is under load or there are performance problems, the Runner may try to start or inspect a container that hasn't been created yet, which causes the job to fail.

Are there points in the code the reviewer needs to double check?

Looking at all the methods that we use against the v1.25 API, the following are the ones that can return a 404 because the requested object (for example, the container) is not found:

ImageInspectWithRaw
ContainerStart
ContainerKill
ContainerInspect
ContainerAttach
ContainerRemove
ContainerLogs
ContainerExecCreate
ContainerExecAttach
NetworkRemove
NetworkDisconnect
NetworkInspect

Discussion for PoC

Background

Docker returns an objectNotFound error every time the API returns a 404. There doesn't seem to be a way to make it retry automatically when a 404 is returned, in wrapError or anywhere else, since each function call is different. To make this reusable, we need to wrap each call in a run function that just returns an error, and then check that error's type, which is what we are doing in this PoC.

What you should be looking at when evaluating the PoC

Does it make sense, how we are using the retry mechanism?
Is there a better way instead of having us create a struct to implement the retryable interface every time?
Do you think it's clear that the user has to look at the struct fields to get the response?
Do we all wish Go had generics?
Would it be easier to implement a backoff mechanism just for the Docker library?
Does it make sense to pass the context to Run instead of having it as a struct field?
Should we implement this for every method we have inside the client, or only for the ones where it makes sense?

What you should NOT be looking at when evaluating the PoC

Code quality
Logging
Any error checking

This is a simple PoC to get an idea of what we need to do. The

Does this MR meet the acceptance criteria?

Documentation created/updated
Added tests for this feature/bug
In case of conflicts with master - branch was rebased

What are the relevant issue numbers?

reference #4450 (closed)

Edited Apr 01, 2020 by Steve Xuereb
