WIP: Add retry on docker requests
What does this MR do?
Add a retry mechanism for all docker commands that can return a 404 on sending requests to the Docker API. Use the retry package, so every time a 404 is returned from the API it is returned it is retried with some backoff logic.
Why was this MR needed?
Sometimes when the Docker daemon is underload or there is performance problems it might result in the Runner trying to start/inspect a container that hasn't been created yet, which results in a failure on the job.
Are there points in the code the reviewer needs to double check?
Looking at all the methods that we use and at v1.25 API below are the ones that can return a 404
because the container is not found:
- ImageInspectWithRaw
- ContainerStart
- ContainerKill
- ContainerInspect
- ContainerAttach
- ContainerRemove
- ContainerLogs
- ContainerExecCreate
- ContainerExecAttach
- NetworkRemove
- NetworkDisconnect
- NetworkInspect
Discussion for PoC
Background
Docker will return an objecNotFound
, every time a 404
is returned from the API. It doesn't seem like there is a way to make it automatically retry when 404
error is returned in wrapError
or anything since each function call is different. To make it reusable we need to wrap each call a run
function which will just return an error and then check that error type which is what we are doing in this PoC.
What you should be looking when evaluating the PoC
- Does it make sense, how we are using the retry mechanism?
- Is there a better way instead of having us create a struct to implement the retryable interface every time?
- Do you think it's clear that the user has to look at the struct fields to get the response from the field?
- Do we all wish Go has generics?
- Would it be easier to implement a backoff mechanism just for the Docker library?
- Does it make sense to have
context
, passed toRun
instead of having it as a struct field. - Should we implement this for every method we have inside of the client or the ones that only that make sense?
What you should NOT be looking at when evaluating the PoC
- Code quality
- Logging
- Any error checking
This is a simple PoC to get an idea of what we need to do. The
Does this MR meet the acceptance criteria?
-
Documentation created/updated -
Added tests for this feature/bug -
In case of conflicts with master
- branch was rebased
What are the relevant issue numbers?
reference #4450 (closed)