Kubernetes should use a scheduled cleaner to ensure all related resources are cleaned up
This is Part 2 of this proposal. Find part 1 at #4184 (closed)
Summary
If a Runner is abruptly shut down, it doesn't get the chance to clean up. After a restart, the Runner is unaware of previously created resources. This proposal moves this burden away from GitLab Runner into a separate garbage collector/cleaner.
Proposal
In an external project, create a cleaner that is run through a Kubernetes CronJob every X amount of time. The cleaner's job would be to clean up any stale resources marked with specific annotations.
Specification:
- `cleaner` that is running on a cron every X time.
  - Search for pods annotated with `cleaner.kubernetes.gitlab.org/ttl: 'X'`, where `X` is the number of seconds after which the pod is considered stale, for example, `3600` (1hr).
  - Get the pod, whatever its state (running/failed/initializing), and check its creation time. If the pod was created longer ago than the `ttl`, delete it.
- Have GitLab Runner specify `pod_annotations` inside of the `config.toml`.
  - If the timeout of a job is 3 hours, users can update the `config.toml` to specify `cleaner.kubernetes.gitlab.org/ttl: '11700'` (3hrs15min). The extra 15min are there in case pod deletion requested by GitLab Runner takes a long time.
  - Users are also able to override annotations.
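The 3hrs15min value is the job timeout plus a 15 minute leeway, expressed in seconds. A minimal Python sketch of that arithmetic follows. The helper name `ttl_annotation` and its default leeway are hypothetical, chosen for illustration only, and are not part of GitLab Runner.

```python
# Annotation key taken from the specification above.
TTL_ANNOTATION = "cleaner.kubernetes.gitlab.org/ttl"


def ttl_annotation(job_timeout_seconds: int, leeway_seconds: int = 15 * 60) -> dict:
    """Build the pod annotation for a job timeout plus leeway (hypothetical helper).

    The value is a string because Kubernetes annotation values are strings.
    """
    if job_timeout_seconds <= 0:
        raise ValueError("job timeout must be positive")
    if leeway_seconds < 0:
        raise ValueError("leeway must not be negative")
    return {TTL_ANNOTATION: str(job_timeout_seconds + leeway_seconds)}


# A 3 hour job timeout with the default 15 minute leeway.
three_hour_job = ttl_annotation(3 * 60 * 60)
```

The resulting key and value are what would go into `pod_annotations` in the `config.toml`.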
Example:
- `cleaner` runs and finds a pod that was created `3hrs20min` ago with the annotation `cleaner.kubernetes.gitlab.org/ttl: '11700'`. It deletes the pod.
- `cleaner` runs and finds a pod that was created `1hr` ago with the annotation `cleaner.kubernetes.gitlab.org/ttl: '11700'`. It ignores the pod.
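The decision in these two examples can be sketched as a small pure function in Python. This is only an illustration under stated assumptions: the function name `should_delete` is hypothetical, pods without the annotation or with an unparsable value are assumed to be left alone, and the pod state is deliberately ignored as described in the specification.

```python
from datetime import datetime, timedelta, timezone

# Annotation key taken from the specification above.
TTL_ANNOTATION = "cleaner.kubernetes.gitlab.org/ttl"


def should_delete(annotations: dict, created_at: datetime, now: datetime) -> bool:
    """Return True when the pod is older than its ttl annotation (hypothetical helper)."""
    raw = annotations.get(TTL_ANNOTATION)
    if raw is None:
        return False  # assumption: pods without the annotation are never touched
    try:
        ttl_seconds = int(raw)
    except (TypeError, ValueError):
        return False  # assumption: an unparsable ttl is ignored rather than treated as 0
    if ttl_seconds < 0:
        return False
    return now - created_at > timedelta(seconds=ttl_seconds)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
annotations = {TTL_ANNOTATION: "11700"}

# Pod created 3hrs20min ago (12000s > 11700s): deleted.
print(should_delete(annotations, now - timedelta(hours=3, minutes=20), now))  # True
# Pod created 1hr ago (3600s < 11700s): ignored.
print(should_delete(annotations, now - timedelta(hours=1), now))  # False
```

Keeping the check free of any Kubernetes client calls makes it easy to unit test. The actual listing and deletion of pods would wrap around it.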
Specification document:
https://gitlab.com/azzsteve/podgc/-/blob/3a4fea99acf31386a2a3156a3c4bf5d761057c35/SPEC.md
Distribution:
- Helm: Most of the Kubernetes executor users use the Helm Chart. We could easily add a CronJob resource to the templates, which will be created and managed by helm. With a few configuration options, such as `cron job image` and `cron job expression` we should be good to go.
- Runner Operator: The Operator, just like the Helm chart can deploy and manage the CronJob.
- Others: For other deployments we can provide a simple CronJob yaml in the docs, which users can use to manage it themselves.
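To show how the two configuration options (image and cron expression) might map onto a CronJob resource, here is a rough Python sketch that builds the manifest as a dictionary. The function name, the default resource name `pod-cleaner`, the example image, and the schedule are all hypothetical. A real deployment would also need a service account allowed to list and delete pods, which is omitted here.

```python
import json


def cleaner_cronjob(image: str, schedule: str, name: str = "pod-cleaner") -> dict:
    """Build a minimal batch/v1 CronJob manifest for the cleaner (hypothetical helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,  # the "cron job expression" option
            "concurrencyPolicy": "Forbid",  # assumption: avoid overlapping cleaner runs
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "OnFailure",
                            "containers": [
                                # the "cron job image" option
                                {"name": "cleaner", "image": image}
                            ],
                        }
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    # Placeholder image and schedule, for illustration only.
    manifest = cleaner_cronjob("registry.example.com/cleaner:latest", "*/15 * * * *")
    print(json.dumps(manifest, indent=2))
```

Since `kubectl apply -f -` accepts JSON as well as YAML, the printed manifest could be piped to it directly.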
Future iterations
The above specification is a great first iteration. Future improvements might or might not include:
- Automatically setting the ttl annotation to be the same (with a few minutes/seconds leeway) as the job's duration.
- Terminate pods through liveness probes and let them wait for cleanup in a failed state. This way they will consume fewer resources.
Edited by Darren Eastman