Stop considering Docker image pull failures as runner system failures
Overview
In gitlab-com/gl-infra/production#4649 (closed) we saw a spike of system_failures because the runner failed to pull Docker images, for example https://gitlab.com/steveazz/playground/-/jobs/1278831109. For GitLab.com we have an SLI that checks the error rate of runner_system_failure. The image keyword is something the user controls, so a single user, as we saw in gitlab-com/gl-infra/production#4649, can trigger this SLI with an image that doesn't exist, and there is no action for us to take.
Proposal
When the runner fails to pull an image, the job failure shouldn't be considered a runner system failure.
Edited by Steve Xuereb