Cancel stage script upon job cancellation in attach mode
What does this MR do?
When a job is cancelled from the UI while the Kubernetes executor runs in attach mode, GitLab Runner doesn't cancel the execution and hangs until the job eventually times out.
With this MR, a command is remotely executed on the stage container to explicitly cancel the ongoing script.
For the bash shell, all processes whose name ends with the stage script, together with their child processes, are killed. However, we avoid killing the parent process responsible for the tee redirection into the output.log file. Without this redirection, GitLab Runner cannot get the trap_status and gracefully finish the job.
The same logic is used for the powershell shell.
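The selection rule described above can be illustrated with a small sketch. This is not the command GitLab Runner actually executes: the function name pids_to_kill, the sample process listing, and the paths in it are hypothetical and only show the idea. It assumes a listing in "pid ppid command" form, marks every process whose command line ends with the script name, then marks all of their descendants. A parent shell whose command line merely contains the script name (because it ends with the tee redirection) is left alone.

```shell
# Hypothetical sketch, not GitLab Runner's actual implementation.
# Usage: pids_to_kill SCRIPT_NAME < listing
# Each input line is "pid ppid command...". Prints, in listing order, the
# PIDs whose command line ends with SCRIPT_NAME plus all their descendants.
pids_to_kill() {
  awk -v name="$1" '
    {
      pid = $1
      parent[pid] = $2
      order[NR] = pid
      # Rebuild the command line from the remaining fields.
      cmd = ""
      for (i = 3; i <= NF; i++) cmd = cmd (i > 3 ? " " : "") $i
      # Match only when the command line ENDS with the script name.
      n = length(name); c = length(cmd)
      if (c >= n && substr(cmd, c - n + 1) == name) marked[pid] = 1
    }
    END {
      # Propagate the mark from parents to children until nothing changes.
      do {
        changed = 0
        for (k = 1; k <= NR; k++) {
          p = order[k]
          if (!(p in marked) && (parent[p] in marked)) {
            marked[p] = 1
            changed = 1
          }
        }
      } while (changed)
      for (k = 1; k <= NR; k++) if (order[k] in marked) print order[k]
    }
  '
}

# Made-up listing: PID 10 is the parent shell that owns the tee pipeline,
# PID 11 is the stage script, PID 13 is a command started by the script.
pids_to_kill step_script <<'EOF'
1 0 init
10 1 sh -c /scripts/step_script 2>&1 | tee -a /logs/output.log
11 10 /bin/bash /scripts/step_script
12 10 tee -a /logs/output.log
13 11 sleep 3600
EOF
# prints 11 and 13, one per line; 10 and 12 are spared
```

On a live system the listing could come from something like `ps -e -o pid= -o ppid= -o args=`, with the resulting PIDs passed to `kill`. Sparing PIDs 10 and 12 is the point of the sketch: the pipeline into output.log stays alive, so the final status can still be written and read.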
Why was this MR needed?
When using the Kubernetes executor with a self-hosted Runner (16.11.1) on GitLab.com, jobs appear to hang instead of being cancelled immediately.
What's the best way to test this MR?
Pipeline passes
gitlab-ci
variables:
  FF_USE_POWERSHELL_PATH_RESOLVER: "true"
  FF_RETRIEVE_POD_WARNING_EVENTS: "true"
  FF_PRINT_POD_EVENTS: "true"
  FF_SCRIPT_SECTIONS: "true"
  FF_USE_DUMB_INIT_WITH_KUBERNETES_EXECUTOR: "false" # also tested with "true" value

simple-job:
  script:
    - sleep 3600
  after_script:
    - echo "this is the after_script running"
Bash
config.toml
concurrent = 1
check_interval = 1
log_level = "debug"
shutdown_timeout = 0
listen_address = ':9252'
[session_server]
session_timeout = 1800
[[runners]]
name = "investigation"
url = "https://gitlab.com/"
id = 0
token = "glrt-REDACTED"
token_obtained_at = "0001-01-01T00:00:00Z"
token_expires_at = "0001-01-01T00:00:00Z"
executor = "kubernetes"
shell = "bash"
[runners.kubernetes]
host = ""
bearer_token_overwrite_allowed = false
image = "alpine"
pod_termination_grace_period_seconds = 3600
namespace = ""
namespace_overwrite_allowed = ""
pod_labels_overwrite_allowed = ""
service_account_overwrite_allowed = ""
pod_annotations_overwrite_allowed = ""
node_selector_overwrite_allowed = ".*"
allow_privilege_escalation = false
[[runners.kubernetes.services]]
[runners.kubernetes.dns_config]
[runners.kubernetes.pod_labels]
user = "ratchade"
These tests were run with FF_USE_DUMB_INIT_WITH_KUBERNETES_EXECUTOR set to both "true" and "false".
PowerShell
config.toml
concurrent = 1
check_interval = 1
log_level = "debug"
shutdown_timeout = 0
[session_server]
session_timeout = 1800
[[runners]]
name = ""
url = "https://gitlab.com/"
id = 0
token = "glrt-REDACTED"
token_obtained_at = "0001-01-01T00:00:00Z"
token_expires_at = "0001-01-01T00:00:00Z"
executor = "kubernetes"
shell = "powershell"
[runners.kubernetes]
host = ""
bearer_token_overwrite_allowed = false
image = "mcr.microsoft.com/windows/servercore:ltsc2022"
namespace = ""
namespace_overwrite_allowed = ""
node_selector_overwrite_allowed = ""
helper_image = "gitlab/gitlab-runner-helper:x86_64-latest-servercore21H2"
poll_timeout = 3600
pod_labels_overwrite_allowed = ""
service_account_overwrite_allowed = ""
pod_annotations_overwrite_allowed = ""
[runners.kubernetes.node_selector]
"kubernetes.io/arch" = "amd64"
"kubernetes.io/os" = "windows"
"node.kubernetes.io/windows-build" = "10.0.20348"
[runners.kubernetes.pod_security_context]
[runners.kubernetes.volumes]
[runners.kubernetes.dns_config]
Job cancelled as expected
What are the relevant issue numbers?
Closes #37780 (closed), https://gitlab.com/gitlab-com/ops-sub-department/section-ops-request-for-help/-/issues/340