Kubernetes exceeded quota error still does not trigger retries in version 16.9.0
Summary
Runner 16.9.0 introduced, with PR a7118702, the ability to retry scheduling Kubernetes Pods for jobs when a custom error such as exceeded quota is returned by the API server.
With this version, retries do not seem to be executed: jobs fail "as usual" on the first try if the quota is exceeded on the namespace.
Following the example in the docs (by the way, I think there is an indentation issue for runners.kubernetes.retry_limits), no retries appear in the job logs, even with high values for the "exceeded quota" key:
[[runners]]
  name = "myRunner"
  url = "https://gitlab.example.com/"
  executor = "kubernetes"
  [runners.kubernetes]
    retry_limit = 5
    [runners.kubernetes.retry_limits]
      "exceeded quota" = 20
Also, it's not clear whether retries are applied automatically from the runner configuration, or whether the job in .gitlab-ci.yml must also set a retry field in its definition.
Steps to reproduce
- Deploy the runner at version 16.9.0 with the Kubernetes executor, the new [runners.kubernetes.retry_limits] field, and the "exceeded quota" key. The executor's namespace should have resource quotas configured.
- Start pipelines using the KUBERNETES_CPU_* and KUBERNETES_MEMORY_* variables to request values high enough to trigger the exceeded quota error.
- Check that jobs fail immediately, without any retry that would wait until quota is freed to schedule the job's Pod.
.gitlab-ci.yml
my_job:
  variables:
    KUBERNETES_MEMORY_LIMIT: '5Gi'
    KUBERNETES_MEMORY_REQUEST: '5Gi'
    KUBERNETES_CPU_LIMIT: 4
    KUBERNETES_CPU_REQUEST: 4
Actual behavior
- Jobs fail directly (the same behaviour as before 16.9.0)
Expected behavior
- Jobs retry scheduling the Pod as many times as defined in the runners.kubernetes.retry_limits configuration
Relevant logs and/or screenshots
job log
Environment description
- GitLab.com SaaS
- Runner 16.9.0 deployed in an on-premises K8s cluster with the Kubernetes executor
config.toml contents
[runners.kubernetes.retry_limits]
"exceeded quota" = 200
Used GitLab Runner version
Possible fixes
- Make sure the NewRetry method also handles non-network errors when setting the WithCheck builder: https://gitlab.com/gitlab-org/gitlab-runner/-/blob/v16.9.0/executors/kubernetes/kubernetes.go?ref_type=tags#L173
Edited by Nabil ZOUABI