Implement graceful build container shutdown for docker executor
What does this MR do?
This MR implements graceful shutdown of build containers in the docker executor. It does this in two ways:
1. Uses `Init: true` in `container.HostConfig`, which runs `tini-init` as `PID 1`. This is equivalent to including `--init` in `docker run` or `docker create` and is available in `docker >= 1.13` (see https://github.com/krallin/tini). This propagates the `SIGTERM` that docker sends to `PID 1` on `docker stop` to the children of `PID 1`.
2. `docker exec`s into the build container and sends `SIGTERM` to all processes except PID 1 and itself, in decreasing numerical PID order. This part is only implemented for the `bash`/`sh` shell.
Even with 1 in place, 2 is necessary to handle the case where shells are in the mix (which is always), since shells do not propagate signals to their children. To properly send `SIGTERM` (or any signal) to a process spawned by a shell, we have to send the signal directly to that process. We send signals in decreasing PID order on the assumption that the leaf (and thus higher-numbered) processes are the long-running ones preventing the container from exiting, and that terminating them will allow the rest of the process tree to terminate naturally. Note that any shell processes that are sent signals will ignore them anyway, so this approach is less heavy-handed than it might appear at first.
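The ordering described above can be sketched in plain `sh`. This is only an illustration under stated assumptions, not the script this MR actually runs through `docker exec`: the helper names `pids_to_signal` and `term_all` are hypothetical, and the PID list is assumed to come from something like `ps -e -o pid=` inside the container.

```shell
#!/bin/sh
# Hypothetical helper (not from the MR): print the PIDs that should be
# signalled, highest first, skipping PID 1 and the caller's own PID.
# Usage: pids_to_signal SELF_PID PID...
pids_to_signal() {
    self="$1"
    shift
    for pid in "$@"; do
        case "$pid" in
            1|"$self") ;;        # never signal init or ourselves
            *) echo "$pid" ;;
        esac
    done | sort -rn              # decreasing numerical order
}

# Hypothetical helper (not from the MR): send SIGTERM to every PID read
# from stdin, one per line. Errors are ignored because a process may
# already have exited by the time we reach it.
term_all() {
    while read -r pid; do
        kill -TERM "$pid" 2>/dev/null || true
    done
}
```

Inside a container this might be used as `all=$(ps -e -o pid=); pids_to_signal "$$" $all | term_all`. Capturing the PID list first keeps the pipeline's own short-lived processes out of the list.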
Note: best reviewed one commit at a time.
Note: I'm working on adding an integration test.
Why was this MR needed?
Enable graceful shutdown of build containers when a job has been cancelled or times out, to properly enable cleanup of resources created by the job.
What's the best way to test this MR?
Run the following job on a runner with the docker executor:
.gitlab-ci.yml
stages:
  - test
test:
  timeout: 15s
  stage: test
  image: ubuntu:latest # this can be any container
  script:
    - ./long-script-with-cleanup.sh
long-script-with-cleanup.sh
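#!/bin/sh
# Start from a clean output file.
rm -f output.txt
# Simulate a slow cleanup so we can tell whether the container was
# given time to finish before being killed.
cleanup() {
    echo "caught signal $1" | tee -a output.txt
    sleep 3
    echo "exiting..." | tee -a output.txt
    exit 0
}
trap 'cleanup "SIGTERM"' TERM
# Long-running work: one line per second for up to a minute.
for i in $(seq 1 60); do
    echo "$i" | tee -a output.txt
    sleep 1
done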
Test both of the following cases:
- let the job run until it times out
- cancel the job from the UI before the timeout
Then inspect the docker volumes for the output file created by the job script.
> sudo cat $(sudo find docker/volumes -name output.txt)
1
2
3
4
5
6
7
8
9
10
11
12
13
caught signal SIGTERM
exiting...
In both cases, the lines:
caught signal SIGTERM
exiting...
indicate that the process received `SIGTERM` AND was allowed to exit before being `SIGKILL`ed.
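The manual inspection above could also be scripted. The following is a minimal sketch, assuming the output file has already been located on the host. The function name `check_graceful_exit` is hypothetical and is not part of this MR.

```shell
#!/bin/sh
# Hypothetical check (not from the MR): succeed only if FILE shows that
# the trap fired and that the cleanup ran to completion afterwards.
# Usage: check_graceful_exit FILE
check_graceful_exit() {
    file="$1"
    # Line number of the first "caught signal SIGTERM" line, if any.
    caught=$(grep -n -x -F 'caught signal SIGTERM' "$file" | head -n 1 | cut -d: -f1)
    # Line number of the last "exiting..." line, if any.
    exited=$(grep -n -x -F 'exiting...' "$file" | tail -n 1 | cut -d: -f1)
    # Both lines must exist, and "exiting..." must come after the signal.
    [ -n "$caught" ] && [ -n "$exited" ] && [ "$exited" -gt "$caught" ]
}
```

If the function returns a non-zero status, either the signal never reached the script or the container was killed before the cleanup finished.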
What are the relevant issue numbers?
Closes #6359