(experimental) Add shell-adapter that supports process termination within Shell executor
What does this MR do?
Solves issues like Long running jobs canceled in GitLab UI, but ru... (#3376 - closed) and CI or CD process does not receive SIGTERM on te... (#27443 - closed), which occur when the specific mix of Unix + `su` + shell executor is used.
Why was this MR needed?
To solve a known issue (#3376 (comment 1012451347)) that we thought we had already solved.
Background
The known use case happens on Unix systems when certain criteria are met:
- Shell executor is used,
- `bash` or `powershell` shell is defined,
- runner process is owned by one user (usually `root` when being a system service, but this doesn't need to be a rule),
- jobs are being executed by another user (`gitlab-runner run` is executed with the `--user some-user` CLI option; for our DEB/RPM installation it would use the `gitlab-runner` user created when installing the package).
To execute jobs, the shell executor starts a new process of the defined shell and either sends the job step script to it through STDIN or passes it through a script file. Either way, in the default case when the `--user` flag is not used, we have a process tree that looks more or less like this:
# ps axfo user,group,pid,ppid,pgid,args
USER GROUP PID PPID PGID COMMAND
root root 100 1 100 gitlab-runner run --config /etc/gitlab-runner/config.toml
root root 101 100 101 bash -l
root root 102 101 101 \_ bash -l
root root 103 102 101 \_ sleep 600
Runner has started a `bash -l` shell and within this shell a `sleep 600` process is running. The important thing is the `PGID` column. Both the shell and the `sleep` command are set by runner to be executed within the same process group.
When an event that triggers job interruption happens (forceful shutdown of the runner process, job timeout, job cancellation through the UI), runner uses something that we name `ProcessKiller` to terminate job execution. For that, `ProcessKiller` sends a `SIGTERM` signal first, waits up to 10 minutes (default and configurable value) and then sends the `SIGKILL` signal.
The signal is sent to the process group. This is done by taking the `PID` of the direct child process that runner has created (`bash --login` in this case), negating the number, and sending the signal to it. So in the example above runner would execute an equivalent of `kill -SIGTERM -101` and, if that did not cause process termination within 10 minutes, it would then call `kill -SIGKILL -101`. The benefit of this approach is that `bash --login` doesn't need to be set up to handle signals, as the OS kernel will signal each process from the group individually.
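The "negate the PID and signal the group" step can be sketched in a few lines. This is only an illustration in Python, not the runner's Go implementation: `kill_process_group` and its `grace_seconds` parameter are hypothetical names, and a 5 second grace period stands in for the 10 minute default.

```python
import os
import signal
import subprocess


def kill_process_group(proc, grace_seconds):
    """Signal proc's whole process group: SIGTERM first, SIGKILL after the grace period.

    Hypothetical helper for illustration only.
    """
    pgid = os.getpgid(proc.pid)
    # A negative PID addresses the process group, like `kill -SIGTERM -<pgid>`.
    os.kill(-pgid, signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        os.kill(-pgid, signal.SIGKILL)
        proc.wait()
        return "killed"


# start_new_session=True puts the shell into a fresh process group (PGID == PID);
# the `sleep` it starts inherits that group.
shell = subprocess.Popen(["sh", "-c", "sleep 600; true"], start_new_session=True)
outcome = kill_process_group(shell, grace_seconds=5)
```

Because the kernel delivers the signal to every member of the group, neither the shell nor `sleep` needs any signal handling of its own for this to work.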
Now, the problem begins when we decide to use the `--user` flag. If this was done, our example would change to something like this:
# ps axfo user,group,pid,ppid,pgid,args
USER GROUP PID PPID PGID COMMAND
root root 100 1 100 gitlab-runner run --config /etc/gitlab-runner/config.toml --user gr --working-directory /home/gr
root root 101 100 101 su -s /bin/bash gr -c bash -l
gr gr 102 101 102 \_ bash -l
gr gr 103 102 102 \_ bash -l
gr gr 104 103 102 \_ sleep 600
In this case we have `su -s /bin/bash -c bash -l` being the direct child process. And as in the previous example, the clue is in the `PGID` value. In this case, despite being asked to create all descendant processes in the same process group, `su` creates the shell process in a group separate from its own. This seems reasonable, as `su` is creating a new login session for the selected user.
So what happens when we reach a job interruption event? Runner sends `kill -SIGTERM -101`, but in this case the `101` process group contains only the `su` command. This means that the OS kernel will signal only `su`.
`su` is being a good player here and after receiving `SIGTERM` it starts terminating the session that it started. It does that by:
- sending `SIGTERM` to its child process,
- waiting up to 2 seconds,
- sending `SIGKILL` to its child process.
And here we have a problem, because the signal is sent to the process and not to the process group. `su` expects that the shell will push termination forward to its descendants. Unfortunately, this doesn't happen.
From what it looks like, when the shell is executing things, it doesn't respond to signals, so `SIGTERM` is ignored. After a moment `su` follows with `SIGKILL`, and here the OS kernel doesn't ask - it just kills the process. But it kills only the one process that is `su`'s direct descendant. And we're then left with:
# ps axfo user,group,pid,ppid,pgid,args
USER GROUP PID PPID PGID COMMAND
root root 100 1 100 gitlab-runner run --config /etc/gitlab-runner/config.toml --user gr --working-directory /home/gr
gr gr 103 0 102 bash -l
gr gr 104 103 102 \_ sleep 600
The first `bash -l` process, being the direct descendant of `su`, is gone (killed by the OS kernel). The second one, with its tree of subprocesses, is orphaned, so its parent PID (`PPID`) is changed to `0` and it is left behind. And it will stay there until manually removed or until it exits by itself (so, depending on what we've been executing in our CI/CD job, it may never exit).
A solution for this could be a trap defined in the shell script, but... we have no way to do this. The first shell process in the tree is used by `su` as a proxy - it is created to start the user's "session" and immediately starts the second shell process, which is what runner asks for to have a place for job execution. But even if we had a way to define a trap here, it's complicated. Our `bash` shell is a wrapper that should support both `bash` and `sh`. And - in theory - any Unix shell that claims to be compatible with `bash/sh`. And there is also `powershell`, which can be used on Unix systems with `su` being involved.
The number of combinations that we would need to cover starts getting big and hard to maintain.
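To make the failure mode from this section concrete, here is a minimal Python sketch that reproduces the orphaning without `su`. It assumes a POSIX `sh` and a `sleep` binary are available; killing only the direct child stands in for what `su` does at the end of its termination sequence, and `is_alive` is a hypothetical helper.

```python
import os
import signal
import subprocess

# The shell starts `sleep` in the background, prints its PID, and waits for it.
shell = subprocess.Popen(
    ["sh", "-c", "sleep 600 & echo $!; wait"],
    stdout=subprocess.PIPE,
    text=True,
    start_new_session=True,  # fresh process group, PGID == shell PID
)
grandchild_pid = int(shell.stdout.readline())

# Signal only the direct child, not the process group.
os.kill(shell.pid, signal.SIGKILL)
shell.wait()


def is_alive(pid):
    """Return True if a process with this PID still exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    return True


orphan_survived = is_alive(grandchild_pid)
orphan_pgid = os.getpgid(grandchild_pid)

# Clean up: signalling the group reaches the orphan that the single-process kill missed.
os.killpg(orphan_pgid, signal.SIGKILL)
shell.stdout.close()
```

The orphan keeps the process group of the shell that was killed, which is why a group-wide signal still reaches it while a signal to the direct child alone does not.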
How the solution works
The solution proposed in this MR is to create a "dummy shell with a trap" that would be generic for all setups. This is done by defining the `gitlab-runner shell-adapter` command. When the feature flag `FF_USE_SHELL_ADAPTER_IN_SHELL_EXECUTOR` is set to `true`, this is the command that is started as the direct descendant of `su`:
# ps axfo user,group,pid,ppid,pgid,args
USER GROUP PID PPID PGID COMMAND
root root 100 1 100 gitlab-runner run --config /etc/gitlab-runner/config.toml --user gr --working-directory /home/gr
root root 101 100 101 su -s /bin/bash gr -c gitlab-runner shell-adapter --command bash -l
gr gr 102 101 102 \_ gitlab-runner shell-adapter --command bash -l
gr gr 103 102 103 \_ bash -l
gr gr 104 103 103 \_ bash -l
gr gr 105 104 103 \_ sleep 600
`gitlab-runner shell-adapter` has two jobs:
- start the command requested by `--command`, forcing the usage of a process group,
- handle the `SIGTERM` signal and execute exactly the same `ProcessKill` mechanism on the executed command when the signal is received.
With this, when we reach a job interruption event, `su` receives `SIGTERM`. It passes it to its descendant - the `shell-adapter` command. `shell-adapter` recognizes the signal and sends `SIGTERM` to all processes in the group created from its descendants. If these don't exit within 2 seconds from when the initial `SIGTERM` to `su` was sent, `shell-adapter` is killed with `SIGKILL`.
We're left then with:
# ps axfo user,group,pid,ppid,pgid,args
USER GROUP PID PPID PGID COMMAND
root root 100 1 100 gitlab-runner run --config /etc/gitlab-runner/config.toml --user gr --working-directory /home/gr
gr gr 103 0 103 bash -l
gr gr 104 103 103 \_ bash -l
gr gr 105 104 103 \_ sleep 600
So `su` is gone, `shell-adapter` is gone, and the shells and the `sleep` command are left behind. But in this case they've all already received the `SIGTERM` signal and hopefully are being terminated right now. And at least for a simple example like the one here, both shells and `sleep` are terminated almost immediately, and we're left with just:
# ps axfo user,group,pid,ppid,pgid,args
USER GROUP PID PPID PGID COMMAND
root root 100 1 100 gitlab-runner run --config /etc/gitlab-runner/config.toml --user gr --working-directory /home/gr
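The two jobs of `shell-adapter` described in this section can be sketched as follows. This is a Python illustration under stated assumptions, not the Go code from this MR: `run_adapter` is a hypothetical name, and the demo command sends `SIGTERM` to its own parent to stand in for `su` signalling the adapter.

```python
import os
import signal
import subprocess


def run_adapter(command):
    """Run command in its own process group and forward SIGTERM to that group.

    Hypothetical sketch of the adapter idea; returns the command's exit status.
    """
    # start_new_session=True is one way to force a new process group (PGID == child PID).
    child = subprocess.Popen(command, start_new_session=True)

    def forward_sigterm(signum, frame):
        try:
            os.killpg(child.pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # the group is already gone

    previous = signal.signal(signal.SIGTERM, forward_sigterm)
    try:
        return child.wait()
    finally:
        signal.signal(signal.SIGTERM, previous)


# Demo: after a short delay the job signals the adapter (its parent process),
# then blocks. The adapter forwards SIGTERM to the whole job group.
exit_status = run_adapter(["sh", "-c", "sleep 0.3; kill -TERM $PPID; sleep 600"])
```

A negative return value from `wait()` means the command was ended by that signal number, so here the job is expected to be reported as terminated by `SIGTERM` rather than left running.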
Potential problem
For now, the solution depends on all processes respecting `SIGTERM`. But the truth is that not all of them do. It may happen that, even though `shell-adapter` has signaled the whole process group with `SIGTERM` before being terminated itself, some orphans are left behind.
I see two solutions for this:
- Limit the graceful termination time to 1 second in `shell-adapter`. In this case, if whatever the job is executing does not exit immediately, it will get `SIGKILL` (the whole process group, in fact) before `shell-adapter` is force-killed itself. This, however, is risky, as some jobs may require a longer termination period to exit in a reasonably clean way.
- Have two layers of `shell-adapter`. As `su` terminates and kills only its direct descendant, we could work around this by having one `shell-adapter` that properly forwards the termination signal and is basically just thrown to be devoured by `su`, and a second one that would persist together with the job shells and script, but would already be in a `waiting for you to handle SIGTERM before I send SIGKILL in a moment` mode. And then we can set any value for the timeouts.

In this case, our initial state would be:
# ps axfo user,group,pid,ppid,pgid,args
USER GROUP PID PPID PGID COMMAND
root root 100 1 100 gitlab-runner run --config /etc/gitlab-runner/config.toml --user gr --working-directory /home/gr
root root 101 100 101 su -s /bin/bash gr -c gitlab-runner shell-adapter --command gitlab-runner shell-adapter --command bash -l
gr gr 102 101 102 \_ gitlab-runner shell-adapter --command gitlab-runner shell-adapter --command bash -l
gr gr 103 102 103 \_ gitlab-runner shell-adapter --command bash -l
gr gr 104 103 104 \_ bash -l
gr gr 105 104 104 \_ bash -l
gr gr 106 105 104 \_ sleep 600
After `su` has sent its `SIGTERM` and the following `SIGKILL` signals, we would be left in an intermediate state:
# ps axfo user,group,pid,ppid,pgid,args
USER GROUP PID PPID PGID COMMAND
root root 100 1 100 gitlab-runner run --config /etc/gitlab-runner/config.toml --user gr --working-directory /home/gr
gr gr 103 0 103 gitlab-runner shell-adapter --command bash -l
gr gr 104 103 104 \_ bash -l
gr gr 105 104 104 \_ bash -l
gr gr 106 105 104 \_ sleep 600
But now we don't need to hope that `bash` and `sleep`, or whatever else the job is executing, will make sure to exit after receiving `SIGTERM` from the already terminated `shell-adapter`. We have another instance of `shell-adapter` that "owns" the rest of the process tree and makes sure to send a `SIGKILL` after the configured time.
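The behavior expected from that second, persisting layer could look roughly like the sketch below. It is a Python illustration of the idea only: `run_persistent_adapter` and `grace_seconds` are hypothetical names, the 0.5 second grace period is arbitrary, and the demo job deliberately ignores `SIGTERM` to represent a process that does not respect it.

```python
import os
import signal
import subprocess


def run_persistent_adapter(command, grace_seconds):
    """Forward SIGTERM to the job's process group, then SIGKILL it after a grace period.

    Hypothetical sketch; returns the command's exit status.
    """
    child = subprocess.Popen(command, start_new_session=True)  # own process group
    term_requested = False

    def on_sigterm(signum, frame):
        nonlocal term_requested
        term_requested = True
        try:
            os.killpg(child.pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # the group is already gone

    previous = signal.signal(signal.SIGTERM, on_sigterm)
    try:
        # Normal operation: poll until the job exits or termination is requested.
        while not term_requested:
            try:
                return child.wait(timeout=0.1)
            except subprocess.TimeoutExpired:
                continue
        # SIGTERM was forwarded; allow the configured time before escalating.
        try:
            return child.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            os.killpg(child.pid, signal.SIGKILL)
            return child.wait()
    finally:
        signal.signal(signal.SIGTERM, previous)


# Demo: the job ignores SIGTERM (and so does the `sleep` it starts), signals the
# adapter to simulate a termination request, then blocks.
job = ["sh", "-c", "trap '' TERM; sleep 0.3; kill -TERM $PPID; sleep 600"]
exit_status = run_persistent_adapter(job, grace_seconds=0.5)
```

Since this layer is not the direct descendant of `su`, the 2 second window no longer limits it, and the grace period can be any configured value.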
What's the best way to test this MR?
What are the relevant issue numbers?
Related to https://gitlab.com/gitlab-com/ops-sub-department/section-ops-request-for-help/-/issues/7, #3376 (closed), #27443 (closed)