Image pull failed: Back-off pulling image (Kubernetes Executor)
Status update 2022-04-07
- We are testing on an EKS cluster (1.19) in order to reliably reproduce the error messages listed in the issue description and comments.
- So far we haven't been able to reproduce the error with our test configuration.
Test matrix
| Test Date | Runner Version | EKS | Test Result |
|---|---|---|---|
| 2022-04-07 | 14.9.0 | 1.19 | No errors pulling images |
Configurations used
gitlab-ci.yml
```yaml
stages:
  - build
  - test

build-image:
  stage: build
  variables:
    DOCKER_DRIVER: overlay2
    DOCKER_HOST: tcp://dockerservice:2375/
    DOCKER_TLS_CERTDIR: ""
  image: docker:20.10.14
  services:
    - name: docker:20.10.14-dind
      alias: dockerservice
  before_script:
    - sleep 30
    - docker info
  script:
    - echo ${CI_BUILD_TOKEN} | docker login -u gitlab-ci-token --password-stdin ${CI_REGISTRY}
    - |
      docker build --pull --no-cache -t ${CI_REGISTRY_IMAGE}:1.0.0 - << EOF
      FROM alpine:3.13.1
      RUN apk update
      RUN apk upgrade
      RUN apk add --no-cache git ca-certificates curl python3 jq apache2-utils nmap-ncat netcat-openbsd wget
      ENTRYPOINT []
      EOF
    - docker push ${CI_REGISTRY_IMAGE}:1.0.0

test:
  stage: test
  image: ${CI_REGISTRY_IMAGE}:1.0.0
  variables:
    GIT_STRATEGY: none
  script:
    - sleep 10
  parallel: 50
```
config.toml
```toml
concurrent = 90
check_interval = 1
log_level = "debug"

[session_server]
  session_timeout = 1800

[[runners]]
  url = "https://gitlab.com/"
  token = "__REDACTED__"
  executor = "kubernetes"
  [runners.custom_build_dir]
  [runners.kubernetes]
    #service_account="image-pull-secret-sa"
    pull_policy="always"
    #image_pull_secrets=["regcred"]
    #image = "aaro301/alpine:latest"
    namespace_overwrite_allowed = ""
    privileged = true
    allow_privilege_escalation = true
    service_account_overwrite_allowed = ""
    pod_annotations_overwrite_allowed = ""
    terminationGracePeriodSeconds = 30
    [runners.kubernetes.affinity]
    [runners.kubernetes.volumes]
    [runners.kubernetes.dns_config]
```
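For reference, a sketch of one way to apply such a configuration with the official gitlab-runner Helm chart (chart value names like gitlabUrl, runnerRegistrationToken and runners.config come from the chart; the rest mirrors the config.toml above):

```shell
# Sketch only: install a runner with an equivalent configuration via the
# gitlab/gitlab-runner Helm chart. The registration token stays redacted.
helm repo add gitlab https://charts.gitlab.io
helm repo update

cat > runner-values.yaml << 'EOF'
gitlabUrl: https://gitlab.com/
runnerRegistrationToken: "__REDACTED__"
concurrent: 90
checkInterval: 1
logLevel: debug
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        pull_policy = "always"
        privileged = true
        allow_privilege_escalation = true
        terminationGracePeriodSeconds = 30
EOF

helm install gitlab-runner gitlab/gitlab-runner \
  --namespace runners --create-namespace \
  -f runner-values.yaml
```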
Problem
We are currently experiencing a problem on our Kubernetes cluster: pods that need to pull their image through imagePullSecrets sometimes fail with the error "image pull failed: Back-off pulling image".
This problem appeared after we upgraded our Kubernetes cluster from 1.17 to 1.18.
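For context, the imagePullSecret involved is a standard docker-registry secret in the runner namespace, referenced from config.toml via image_pull_secrets (the commented-out regcred entry above). A sketch with placeholder credentials:

```shell
# Sketch of the kind of pull secret involved; the registry matches the one in the
# logs below, username/password are placeholders, not our real credentials.
kubectl --namespace runners create secret docker-registry regcred \
  --docker-server=registry.example.net \
  --docker-username=<registry-user> \
  --docker-password=<registry-token>
```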
How to reproduce
- Configure a Kubernetes executor runner and set pull_policy = "always" to maximize the chances of reproducing the issue
- Create a project with this gitlab-ci.yml:
  ```yaml
  stages:
    - build
    - test

  build-image:
    stage: build
    variables:
      DOCKER_HOST: tcp://127.0.0.1:2375
      DOCKER_TLS_CERTDIR: ""
    image: docker:19.03.15
    services:
      - docker:19.03.15-dind
    script:
      - echo ${CI_BUILD_TOKEN} | docker login -u gitlab-ci-token --password-stdin ${CI_REGISTRY}
      - |
        docker build --pull --no-cache -t ${CI_REGISTRY_IMAGE}:1.0.0 - << EOF
        FROM alpine:3.13.1
        RUN apk update --quiet && \
            apk upgrade --quiet && \
            apk add --quiet --no-cache git ca-certificates curl python3 jq apache2-utils nmap-ncat netcat-openbsd wget
        ENTRYPOINT []
        EOF
      - docker push ${CI_REGISTRY_IMAGE}:1.0.0

  test:
    stage: test
    image: ${CI_REGISTRY_IMAGE}:1.0.0
    variables:
      GIT_STRATEGY: none
    script:
      - sleep 10
    parallel: 30
  ```
- Schedule this project to run every 10 minutes until some jobs fail with "Image pull failed: Back-off pulling image"
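When a job fails this way, the events on the build pod confirm it is the same failure mode; for example (namespace runners as in our setup, pod name is whichever build pod failed):

```shell
# Look for image pull back-off events in the runner namespace
kubectl --namespace runners get events --sort-by=.lastTimestamp | grep -iE 'ErrImagePull|Back-off'

# Inspect a failing build pod directly; the Events section shows the pull error
kubectl --namespace runners describe pod <runner-pod-name>
```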
Tests & Results
We tried multiple configurations on staging environments:
| Versions | EKS 1.17.12 | EKS 1.18.9 | EKS 1.19.6 | EKS 1.21.2 | EKS 1.18.9 (master) + EKS 1.17.12 (nodes) |
|---|---|---|---|---|---|
| Runner 13.3 | OK | Random errors | Random errors | Untested | Untested |
| Runner 13.8 | OK | Random errors | Random errors | Untested | OK |
| Runner 13.9 | OK | Random errors | Random errors | Untested | OK |
| Runner 13.12 | OK | Random errors | Random errors | Random errors | OK |
Debug & Trace logs
Runner 13.8.0 / GitLab 13.8.4
```
ERROR: Job failed (system failure): prepare environment: image pull failed: Back-off pulling image "registry.example.net/tests/issue-image:1.0.0". Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information duration=2m54.750286918s job=22436 project=550 runner=aDykLRhX
WARNING: Failed to process runner builds=18 error=prepare environment: image pull failed: Back-off pulling image "registry.example.net/tests/issue-image:1.0.0". Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information executor=kubernetes runner=aDykLRhX
```
Kubelet 1.18.9
```
kubelet: E0305 10:59:52.282050 3958 reflector.go:178] object-"runners"/"runner-adyklrhx-project-550-concurrent-216pxpk": Failed to list *v1.Secret: secrets "runner-adyklrhx-project-550-concurrent-216pxpk" is forbidden: User "system:node:issue.eu-west-1.compute.internal" cannot list resource "secrets" in API group "" in the namespace "runners": no relationship found between node "issue.eu-west-1.compute.internal" and this object
kubelet: I0305 10:59:52.347492 4012 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-lxr6w" (UniqueName: "kubernetes.io/secret/14b74129-b719-4183-bffc-530b5d9d5bb3-default-token-lxr6w") pod "runner-adyklrhx-project-550-concurrent-21q7gp7" (UID: "14b74129-b719-4183-bffc-530b5d9d5bb3")
kubelet: W0305 10:59:53.509762 4012 kubelet_pods.go:863] Unable to retrieve pull secret runners/runner-adyklrhx-project-550-concurrent-216pxpk for runners/runner-adyklrhx-project-550-concurrent-21q7gp7 due to failed to sync secret cache: timed out waiting for the condition. The image pull may not succeed.
kubelet: E0305 10:59:55.315834 4012 pod_workers.go:191] Error syncing pod 14b74129-b719-4183-bffc-530b5d9d5bb3 ("runner-adyklrhx-project-550-concurrent-21q7gp7_runners(14b74129-b719-4183-bffc-530b5d9d5bb3)"), skipping: failed to "StartContainer" for "build" with ErrImagePull: "rpc error: code = Unknown desc = Error response from daemon: Get https://registry.example.net/v2/tests/issue-image/manifests/1.0.0: denied: access forbidden"
kubelet: E0305 10:59:56.125427 4012 pod_workers.go:191] Error syncing pod 14b74129-b719-4183-bffc-530b5d9d5bb3 ("runner-adyklrhx-project-550-concurrent-21q7gp7_runners(14b74129-b719-4183-bffc-530b5d9d5bb3)"), skipping: failed to "StartContainer" for "build" with ImagePullBackOff: "Back-off pulling image \"registry.example.net/tests/issue-image:1.0.0\""
kubelet: E0305 10:59:57.146155 4012 pod_workers.go:191] Error syncing pod 14b74129-b719-4183-bffc-530b5d9d5bb3 ("runner-adyklrhx-project-550-concurrent-21q7gp7_runners(14b74129-b719-4183-bffc-530b5d9d5bb3)"), skipping: failed to "StartContainer" for "build" with ImagePullBackOff: "Back-off pulling image \"registry.example.net/tests/issue-image:1.0.0\""
```
NB: on 1.17.12 we also see the first message ("no relationship found between...") but not the message "due to failed to sync secret cache: timed out waiting for the condition. The image pull may not succeed."
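To compare node versions, the kubelet messages quoted above can be pulled straight from a worker node's journal, assuming a systemd-managed kubelet as on the standard EKS AMIs:

```shell
# Run on a worker node (or against your log aggregation) to filter the
# kubelet messages discussed above.
journalctl -u kubelet --since "1 hour ago" \
  | grep -E 'failed to sync secret cache|no relationship found between node|Back-off pulling image'
```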
EKS "Master" 1.18.9
```
09:59:52.181921 - scheduler.go:731 - pod runners/runner-adyklrhx-project-550-concurrent-21q7gp7 is bound successfully on node "issue.eu-west-1.compute.internal", 12 nodes evaluated, 2 nodes were found feasible.
09:59:52.163165 - audit.k8s.io/v1 - create binding (@/api/v1/namespaces/runners/pods/runner-adyklrhx-project-550-concurrent-21q7gp7/binding & issue.eu-west-1.compute.internal ) - (HTTP 201)
09:59:52.201538 - node_authorizer.go:197 - NODE DENY: node "issue.eu-west-1.compute.internal" cannot get unknown secret runners/runner-adyklrhx-project-550-concurrent-216pxpk
09:59:52.201492 - audit.k8s.io/v1 - no relationship found between node "xxx" and this object (the ImagePullSecret) - (HTTP 403)
09:59:52.208891 - audit.k8s.io/v1 - get pods - (HTTP 200)
09:59:52.212011 - audit.k8s.io/v1 - patch pod status (running) - (HTTP 200)
09:59:53.522568 - audit.k8s.io/v1 - list the ImagePullSecret - (HTTP 200)
09:59:56.133547 - audit.k8s.io/v1 - get the ImagePullSecret - (HTTP 200)
09:59:56.139091 - audit.k8s.io/v1 - patch pod status: ErrImagePull - (HTTP 200)
```
NB: on 1.17.12 we don't see any differences except that the last audit message is "patch pod status: RUNNING" instead of "ErrImagePull".
Assumptions
- We have ruled out a race condition between the creation of the pod and the creation of the secret, because we observed pods that still ended up in ImagePullBackOff after a 30 s wait to be scheduled on our cluster.
- Since we can't reproduce the problem with 1.17 kubelets against a 1.18 master, we assume the problem is linked to kubelet >= 1.18 combined with the GitLab Runner design.
- Could it be a race condition between the scheduler and the node_authorizer/graph_populator? https://github.com/kubernetes/kubernetes/pull/46941
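A rough way to check the timing for yourself is to watch the per-job secrets and build pods side by side and compare the creation timestamps of a failing pod and its pull secret; a sketch:

```shell
# Terminal 1: watch the per-job secrets the runner creates
kubectl --namespace runners get secrets --watch

# Terminal 2: watch the build pods and their status transitions (ErrImagePull, ImagePullBackOff)
kubectl --namespace runners get pods --watch

# Afterwards, compare creation timestamps of a failing pod and its pull secret
kubectl --namespace runners get pod <runner-pod-name> -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
kubectl --namespace runners get secret <runner-secret-name> -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
```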
Other related issues
- Seems to be the same problem as #4846