Jobs intermittently hang after executing `runner_script_trap` with attach strategy

Hello there!

We at @maltcommunity are migrating our CI to Gitlab CI. We're pretty excited because Gitlab CI fits our needs! We stumbled upon an intermittent bug which is very similar to #25346 (closed). We see it almost every day. We'd love to hear what you all think about it.

Summary

Jobs hang on a step that outside of what's defined in .gitlab-ci.yml. It's usually when uploading caches, but the most recent examples show it also happens when extracting the cache as well as just setting up the environment.

Steps to reproduce

Set FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY = false and run jobs. In our case, we build some Java applications with maven. I can provide the job definitions as well as .gitlab-ci.yml, although I currently don't believe it's relevant: the problem seems to occur in the helper container.

Actual behavior

We usually see this kind of logs:

Saving cache for successful job
Creating cache m2-gitlab-parent-6...
.m2/repository: found 2395 matching files and directories 
Uploading cache.zip to https://storage.googleapis.com/[redacted]
Created cache
Creating cache node-6465f6dab11ad48f5f055dc39eb8f90044374c3c-6...
node_modules: found 153333 matching files and directories 
WARNING: parent/node_modules: no matching files    
Uploading cache.zip to https://storage.googleapis.com/[redacted]
Created cache
ERROR: Job failed: execution took longer than 1h0m0s seconds

Cache is present on the bucket. Nothing wrong in this regard. Usually the jobs are less than 15 minutes long. So, the helper takes more than 45 minutes idling, until it times out.

Expected behavior

Cache is present and job ends whenever it's uploaded.

Relevant logs and/or screenshots

We turned on debug logs and started to see a pattern: it stops right at the end of the runner_script_trap function, which is called several times in the helper container. Here are 2 different examples:

++ echo 'Checking cache for m2-gitlab-api-models/messenger-api-models-6...'
Checking cache for m2-gitlab-api-models/messenger-api-models-6...
++ /usr/bin/gitlab-runner-helper cache-extractor --file [redacted] --timeout 10 --url 'https://storage.googleapis.com/[redacted]'
Downloading cache.zip from https://storage.googleapis.com/[redacted]
++ echo 'Successfully extracted cache'
Successfully extracted cache
++ /usr/bin/gitlab-runner-helper --version
++ echo 'Checking cache for node-6465f6dab11ad48f5f055dc39eb8f90044374c3c-6...'
Checking cache for node-6465f6dab11ad48f5f055dc39eb8f90044374c3c-6...
++ /usr/bin/gitlab-runner-helper cache-extractor --file [redacted] --timeout 10 --url 'https://storage.googleapis.com/[redacted]'
Downloading cache.zip from https://storage.googleapis.com/[redacted] 
++ echo 'Successfully extracted cache'
Successfully extracted cache
+ exit 0
+ runner_script_trap
+ exit_code=0
+ log_file=/logs-26473222-1459024058/output.log
+ out_json='{"command_exit_code": 0, "script": "/scripts-26473222-1459024058/restore_cache"}'
++ tail -c1 /logs-26473222-1459024058/output.log
++ wc -l
+ [[ 1 -gt {"command_exit_code": 0, "script": "/scripts-26473222-1459024058/restore_cache"}
0 ]]
+ printf '{"command_exit_code": 0, "script": "/scripts-26473222-1459024058/restore_cache"}\n'
+ exit 0
ERROR: Job failed: execution took longer than 1h0m0s seconds

Preparing environment 01:00:00
Waiting for pod gitlab-runner/runner-kjn4uzc6-project-26473222-concurrent-21tqdf7 to be running, status is Pending
+ set -eo pipefail
+ set +o noclobber
+ :
+ eval 'echo "Running on $(hostname) via malt-gitlab-runner-gitlab-runner-5c9597f855-7qp7l..."
'
+++ hostname
++ echo 'Running on runner-kjn4uzc6-project-26473222-concurrent-21tqdf7 via malt-gitlab-runner-gitlab-runner-5c9597f855-7qp7l...'
Running on runner-kjn4uzc6-project-26473222-concurrent-21tqdf7 via malt-gitlab-runner-gitlab-runner-5c9597f855-7qp7l...
+ exit 0
+ runner_script_trap
+ exit_code=0
+ log_file=/logs-26473222-1458907526/output.log
+ out_json='{"command_exit_code": 0, "script": "/scripts-26473222-1458907526/prepare_script"}'
++ tail -c1 /logs-26473222-1458907526/output.log
++ wc -l
+ [[ 1 -gt {"command_exit_code": 0, "script": "/scripts-26473222-1458907526/prepare_script"}
0 ]]
+ printf '{"command_exit_code": 0, "script": "/scripts-26473222-1458907526/prepare_script"}\n'
+ exit 0
ERROR: Job failed: execution took longer than 1h0m0s seconds

Environment description

Used GitLab Runner version

We're using the Kubernetes executor v14.1.0. The helper image is registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-8925d9a0

Are resources lacking?

At first, we thought this issue was due to the lack of resources on our machines. I monitored last 2 example and I can say pods and nodes weren't under heavy load. CPU usage was nominal as well as RAM, network, disk usage. I believe it's a bug somewhere around runner_script_trap[1]

Additional info

I'd be happy to try any config change or code ones in order to help figuring out/fixing this issue 🙂

[1] https://gitlab.com/gitlab-org/gitlab-runner/-/blob/ba0d3d6bc66999b30c22cc9c7f54ad14c91dbdaa/shells/bash_trap.go#L17

Edited Jul 28, 2021 by Johan Lorenzo