(experimental) Add shell-adapter that supports process termination within Shell executor (!3511)

Tomasz Maczukin requested to merge fix-shell-executor-process-termination-again into main Jun 30, 2022

What does this MR do?

Solves issues like Long running jobs canceled in GitLab UI, but ru... (#3376 - closed) and CI or CD process does not receive SIGTERM on te... (#27443 - closed), which occur when the specific mix of Unix, su, and the shell executor is used.

Why was this MR needed?

To solve a known issue (#3376 (comment 1012451347)) that we thought we had already solved.

Background

The known use case happens on Unix systems when certain criteria are met:

Shell executor is used,
bash or powershell shell is defined,
runner process is owned by one user (usually root when running as a system service, though this is not required),
jobs are being executed by another user (gitlab-runner run is executed with the --user some-user CLI option; for our DEB/RPM installation it would use the gitlab-runner user created when installing the package).

To execute jobs, the shell executor starts a new process of the defined shell and either sends the job step script to it through STDIN or passes it through a script file. In the default case, when the --user flag is not used, we have a process tree that looks more or less like this:

# ps axfo user,group,pid,ppid,pgid,args
USER     GROUP        PID    PPID    PGID COMMAND
root     root         100       1     100 gitlab-runner run --config /etc/gitlab-runner/config.toml
root     root         101     100     101 bash -l
root     root         102     101     101  \_ bash -l
root     root         103     102     101      \_ sleep 600

Runner has started a bash -l shell, and within this shell a sleep 600 process is running. The important thing is the PGID column. Both the shell and the sleep command are set by runner to be executed within the same process group.

When an event that triggers job interruption happens (forceful shutdown of the runner process, job timeout, job cancellation through the UI), runner uses something that we name ProcessKiller to terminate job execution. For that, ProcessKiller sends a SIGTERM signal first, waits up to 10 minutes (default and configurable value) and then sends the SIGKILL signal.

The signal is sent to the process group. This is done by taking the PID of the direct child process that runner has created (bash --login in this case), negating the number, and sending the signal to it. So in our example above, runner would execute an equivalent of kill -SIGTERM -101, and if that did not cause process termination within 10 minutes, it would then call kill -SIGKILL -101. The benefit of this approach is that bash --login doesn't need to be set up to handle signals, as the OS kernel will signal each process in the group individually.
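The SIGTERM-then-SIGKILL sequence described above can be sketched in a few lines. This is only an illustration in Python, not the runner's actual ProcessKiller code; the function name kill_process_group and the 5-second timeout are made up for the example, and start_new_session=True is used here simply as a convenient way to make the child a process group leader.

```python
import os
import signal
import subprocess
import time


def kill_process_group(proc, graceful_timeout):
    """Send SIGTERM to the child's process group, then SIGKILL if it is still alive."""
    pgid = proc.pid  # the child is its own group leader, so PGID == PID
    os.killpg(pgid, signal.SIGTERM)  # same effect as `kill -SIGTERM -<pid>`
    try:
        return proc.wait(timeout=graceful_timeout)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)  # same effect as `kill -SIGKILL -<pid>`
        return proc.wait()


# The shell and its background sleep share one process group.
proc = subprocess.Popen(["sh", "-c", "sleep 600 & wait"], start_new_session=True)
time.sleep(0.2)  # give the shell a moment to start the sleep
exit_code = kill_process_group(proc, graceful_timeout=5.0)
```

Because the signal goes to the group, both the shell and the sleep receive it, and the shell needs no trap of its own.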

Now, the problem begins when we decide to use the --user flag. In that case, our example changes to something like this:

# ps axfo user,group,pid,ppid,pgid,args
USER     GROUP        PID    PPID    PGID COMMAND
root     root         100       1     100 gitlab-runner run --config /etc/gitlab-runner/config.toml --user gr --working-directory /home/gr
root     root         101     100     101 su -s /bin/bash gr -c bash -l
gr       gr           102     101     102  \_ bash -l
gr       gr           103     102     102      \_ bash -l
gr       gr           104     103     102          \_ sleep 600

In this case we have su -s /bin/bash gr -c bash -l as the direct child process. As in the previous example, the clue is in the PGID value. Despite being asked to create all descendant processes in the same process group, su creates the shell process in a group separate from its own. This seems reasonable, as su is creating a new login session for the selected user.
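To make the PGID difference concrete, here is a small Python sketch that compares the process group of a plain child with one that was moved into its own session. It only mimics the effect seen with su; using start_new_session=True for that purpose is an assumption of this example, not a statement about how su is implemented.

```python
import os
import subprocess

# A plain child inherits the parent's process group.
same_group = subprocess.Popen(["sleep", "600"])
# A child started in a new session becomes the leader of a new group.
own_group = subprocess.Popen(["sleep", "600"], start_new_session=True)

parent_pgid = os.getpgrp()
same_group_pgid = os.getpgid(same_group.pid)
own_group_pgid = os.getpgid(own_group.pid)

print("parent PGID:     ", parent_pgid)
print("plain child PGID:", same_group_pgid)
print("new session PGID:", own_group_pgid)

# Clean up both children.
for child in (same_group, own_group):
    child.kill()
    child.wait()
```

A signal sent to the parent's group would reach the first child but not the second, which is the situation runner ends up in with su.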

So what happens when we reach a job interruption event? Runner sends kill -SIGTERM -101, but in this case the 101 process group contains only the su command. This means that the OS kernel will signal only su.

su is a good player here, and after receiving SIGTERM it starts terminating the session that it started. It does that by:

sending SIGTERM to its child process,
waiting up to 2 seconds,
sending SIGKILL to its child process.

And here we have a problem, because the signal is sent to the process and not to the process group. su expects that the shell will push termination forward to its descendants. Unfortunately, this doesn't happen.

From what it looks like, when the shell is executing things, it doesn't respond to the signals, so SIGTERM is ignored. After a moment su follows with SIGKILL, and here the OS kernel doesn't ask: it just kills the process. But it kills only the one process that is su's direct descendant. We're then left with:

# ps axfo user,group,pid,ppid,pgid,args
USER     GROUP        PID    PPID    PGID COMMAND
root     root         100       1     100 gitlab-runner run --config /etc/gitlab-runner/config.toml --user gr --working-directory /home/gr
gr       gr           103       0     102 bash -l
gr       gr           104     103     102  \_ sleep 600

The first bash -l process, the direct descendant of su, is gone (killed by the OS kernel). The second one, with its tree of subprocesses, is orphaned, so its parent PID (PPID) is changed to 0 and it is left behind. It will stay there until it is manually removed or until it exits by itself (so depending on what we've been executing in our CI/CD job, it may never exit).
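The orphaning effect is easy to reproduce outside of runner. The following Python sketch kills only a shell (standing in for what su does to its direct child) and then checks whether the shell's background sleep is still around. The helper is_alive and the use of sh with a background sleep are assumptions of this example.

```python
import os
import signal
import subprocess


def is_alive(pid):
    """Return True if a process with this PID still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user


# The shell starts a background sleep, prints its PID, and waits for it.
proc = subprocess.Popen(
    ["sh", "-c", "sleep 600 & echo $!; wait"],
    stdout=subprocess.PIPE, text=True, start_new_session=True,
)
sleep_pid = int(proc.stdout.readline())

# Signal only the shell process, not its process group.
os.kill(proc.pid, signal.SIGKILL)
proc.wait()
proc.stdout.close()

orphan_survived = is_alive(sleep_pid)
print("sleep still running after its shell was killed:", orphan_survived)

# Clean up: a group-wide signal does reach the orphan.
os.killpg(proc.pid, signal.SIGKILL)
```

The last line works because the process group outlives its leader for as long as any member is still running.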

A solution for this could be a trap defined in the shell script, but... we have no way to do this. The first shell process in the tree is used by su as a proxy: it is created to start the user's "session" and immediately starts the second shell process, which is what runner asks for as the place for job execution. But even if we had a way to define a trap here, it's complicated. Our bash shell is a wrapper that should support both bash and sh and, in theory, any Unix shell that claims to be compatible with bash/sh. There is also powershell, which can be used on Unix systems with su being involved.

The number of combinations that we would need to cover starts getting big and hard to maintain.

How the solution works

The solution proposed in this MR is to create a "dummy shell with a trap" that would be generic for all setups. This is done by defining the gitlab-runner shell-adapter command. When the feature flag FF_USE_SHELL_ADAPTER_IN_SHELL_EXECUTOR is set to true, this is the command that is started as the direct descendant of su:

# ps axfo user,group,pid,ppid,pgid,args
USER     GROUP        PID    PPID    PGID COMMAND
root     root         100       1     100 gitlab-runner run --config /etc/gitlab-runner/config.toml --user gr --working-directory /home/gr
root     root         101     100     101 su -s /bin/bash gr -c gitlab-runner shell-adapter --command bash -l
gr       gr           102     101     102  \_ gitlab-runner shell-adapter --command bash -l
gr       gr           103     102     103      \_ bash -l
gr       gr           104     103     103          \_ bash -l
gr       gr           105     104     103              \_ sleep 600

gitlab-runner shell-adapter has two jobs:

start the command requested by --command, forcing the usage of a process group,
handle the SIGTERM signal and execute exactly the same ProcessKiller mechanism on the executed command when the signal is received.

With this, when we reach a job interruption event, su receives SIGTERM and passes it to its descendant, the shell-adapter command. shell-adapter recognizes the signal and sends SIGTERM to all processes in the group created from its descendants. If these don't exit within 2 seconds of the initial SIGTERM being sent to su, shell-adapter is killed with SIGKILL.
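The adapter's two jobs can be illustrated with a short Python sketch. This is not the code from this MR: run_adapter, its graceful_timeout parameter, and the polling interval are hypothetical, and the timer at the bottom only stands in for the SIGTERM that su would deliver.

```python
import os
import signal
import subprocess
import threading
import time


def run_adapter(command, graceful_timeout=2.0):
    """Run command in its own process group and forward SIGTERM to that group."""
    proc = subprocess.Popen(command, start_new_session=True)
    deadline = None

    def forward_sigterm(signum, frame):
        nonlocal deadline
        deadline = time.monotonic() + graceful_timeout
        try:
            os.killpg(proc.pid, signal.SIGTERM)  # signal the whole group
        except ProcessLookupError:
            pass  # the group is already gone

    previous = signal.signal(signal.SIGTERM, forward_sigterm)
    try:
        while True:
            try:
                return proc.wait(timeout=0.1)
            except subprocess.TimeoutExpired:
                # After the grace period, fall back to SIGKILL for the group.
                if deadline is not None and time.monotonic() >= deadline:
                    os.killpg(proc.pid, signal.SIGKILL)
                    return proc.wait()
    finally:
        signal.signal(signal.SIGTERM, previous)


# Demo: deliver SIGTERM to this process shortly after the job starts.
threading.Timer(0.5, os.kill, args=(os.getpid(), signal.SIGTERM)).start()
exit_code = run_adapter(["sh", "-c", "sleep 600 & wait"])
print("job exited with", exit_code)
```

In the real setup the adapter itself may be killed by su before the grace period ends, which is the limitation discussed under "Potential problem".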

We're left then with:

# ps axfo user,group,pid,ppid,pgid,args
USER     GROUP        PID    PPID    PGID COMMAND
root     root         100       1     100 gitlab-runner run --config /etc/gitlab-runner/config.toml --user gr --working-directory /home/gr
gr       gr           103       0     103 bash -l
gr       gr           104     103     103  \_ bash -l
gr       gr           105     104     103      \_ sleep 600

So su is gone, shell-adapter is gone, and the shells and the sleep command are left behind. But in this case they have all already received the SIGTERM signal and hopefully are terminating right now. At least for a simple example like the one here, both shells and sleep terminate almost immediately, and we're left with just:

# ps axfo user,group,pid,ppid,pgid,args
USER     GROUP        PID    PPID    PGID COMMAND
root     root         100       1     100 gitlab-runner run --config /etc/gitlab-runner/config.toml --user gr --working-directory /home/gr

Potential problem

The solution currently depends on all processes respecting SIGTERM. But the truth is that not all of them do. So even though shell-adapter has signaled the whole process group with SIGTERM before being terminated itself, some orphans may still be left behind.

I see two solutions for this:

Limit the graceful termination time to 1 second in shell-adapter. In this case, if whatever the job is executing does not exit immediately, it will get SIGKILL (the whole process group, in fact) before shell-adapter is force-killed itself. This however is risky, as some jobs may require a longer termination period to exit in a reasonably clean way.

Have two layers of shell-adapter. As su terminates and kills only its direct descendant, we could work around this by having one shell-adapter that properly forwards the termination signal and is basically just thrown to be devoured by su, and a second one that would persist together with the job shells and script, but would already be in an "awaiting for you to handle SIGTERM before I send SIGKILL in a moment" mode. Then we can set any value for the timeouts.

In this case, our initial state would be:

# ps axfo user,group,pid,ppid,pgid,args
USER     GROUP        PID    PPID    PGID COMMAND
root     root         100       1     100 gitlab-runner run --config /etc/gitlab-runner/config.toml --user gr --working-directory /home/gr
root     root         101     100     101 su -s /bin/bash gr -c gitlab-runner shell-adapter --command gitlab-runner shell-adapter --command bash -l
gr       gr           102     101     102  \_ gitlab-runner shell-adapter --command gitlab-runner shell-adapter --command bash -l
gr       gr           103     102     103      \_ gitlab-runner shell-adapter --command bash -l
gr       gr           104     103     104          \_ bash -l
gr       gr           105     104     104              \_ bash -l
gr       gr           106     105     104                  \_ sleep 600

After su sends its SIGTERM and then its SIGKILL, we would be left in an intermediate state:

# ps axfo user,group,pid,ppid,pgid,args
USER     GROUP        PID    PPID    PGID COMMAND
root     root         100       1     100 gitlab-runner run --config /etc/gitlab-runner/config.toml --user gr --working-directory /home/gr
gr       gr           103       0     103 gitlab-runner shell-adapter --command bash -l
gr       gr           104     103     104  \_ bash -l
gr       gr           105     104     104      \_ bash -l
gr       gr           106     105     104          \_ sleep 600

But now we don't need to hope that bash and sleep or whatever else the job is executing will make sure to exit after receiving SIGTERM from the already terminated shell-adapter. We have another instance of shell-adapter that "owns" the rest of the process tree and makes sure to send a SIGKILL after the configured time.

What's the best way to test this MR?

What are the relevant issue numbers?

Edited Jun 30, 2022 by Tomasz Maczukin
