Provides users the option to force-cancel a canceling pipeline (was Jobs stuck for hours in "Canceling" state with "Waiting for Resource" message)
Status update (2024-10-17)
- Specific to the related bug with the Runner Kubernetes executor: a fix was merged in v17.3 and was also used to patch GitLab Runner v16.11 through v17.2.
- The following patches have been released as of 2024-07-27:
* GitLab Runner v17.2.1 / GitLab Runner Helm Chart v0.67.1
* GitLab Runner v17.1.1 / GitLab Runner Helm Chart v0.66.1
* GitLab Runner v17.0.2 / GitLab Runner Helm Chart v0.65.2
* GitLab Runner v16.11.3 / GitLab Runner Helm Chart v0.64.3
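To check whether a given Runner version already contains the Kubernetes executor fix, the list above can be turned into a small lookup. This is only a sketch: `includes_kubernetes_cancel_fix?` and the constants are hypothetical names, not part of GitLab Runner, and the sketch assumes that any later patch release in the same minor series keeps the fix. Versions older than v16.11 return `false` simply because no patch is listed for them here; this issue does not say whether they are affected.

```ruby
require "rubygems" # Gem::Version ships with Ruby

# First patched release in each minor series, copied from the list above.
PATCHED_RELEASES = {
  "17.2"  => "17.2.1",
  "17.1"  => "17.1.1",
  "17.0"  => "17.0.2",
  "16.11" => "16.11.3"
}.freeze

# The fix was merged in v17.3, so every release from 17.3.0 on is assumed to have it.
FIRST_FIXED_RELEASE = Gem::Version.new("17.3.0")

# Hypothetical helper: true if the given Runner version is listed as patched.
def includes_kubernetes_cancel_fix?(version_string)
  version = Gem::Version.new(version_string.delete_prefix("v"))
  return true if version >= FIRST_FIXED_RELEASE

  major, minor = version.segments
  minimum = PATCHED_RELEASES["#{major}.#{minor}"]
  return false if minimum.nil? # no patch listed for this minor series

  version >= Gem::Version.new(minimum)
end

puts includes_kubernetes_cancel_fix?("v17.1.0")
puts includes_kubernetes_cancel_fix?("v16.11.3")
```

The Helm chart versions are not covered by this sketch; they would need their own mapping.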
A separate issue (#483290 (closed)) also causes jobs to get stuck in canceling, but only in specific circumstances. See #483290 (comment 2097069681) to determine the appropriate fix.
Overview
This issue is being opened as per the documentation.
Description: A GitLab Premium customer reports that jobs are stuck in "Waiting for resource", but the UI does not show a status of running or pending; rather, it shows canceling.
- Project: /sparksuite-family/hoa-express/main-stack/
- Job IDs: 7027551018, 7027751534
- Job status: canceling
- How often the problem occurs: Problem began occurring last week and has been seen sporadically since then.
- Steps to reproduce the problem:
They have not been able to reproduce this issue consistently. It has been seen multiple times in the last week.
The job has been re-run since the initial failure, but a recording of the issue is available here: https://images.sparksuite.com/v/4QCsZEKKoJOs7jcmEyku
Zendesk ticket (internal link only)
Troubleshooting notes
User/Customer | GitLab Hosted or Self-Managed Runner | Runner Executor |
---|---|---|
Wes Cossick | Self-Managed Runner | Docker Machine |
Niklas van Schrick | Self-Managed Runner | Kubernetes |
SFDC | Self-Managed Runner | Kubernetes |
Internal link | Kubernetes | |
Jon Benson | Self-Managed Runner | |
Implementation Guide
Allow users to force-cancel a pipeline that is stuck in canceling. A job can end up stuck in canceling because of infrastructure issues (for example, a runner that ran out of memory) or because users mistakenly run logic that takes longer than they expected.
It might be worth getting a UX proposal for this: do we want a separate force-cancel button, or should the cancel button remain available and transition the job from canceling to canceled?
From a backend perspective we can add to CommitStatus:
    event :cancel do
      transition canceling: :canceled
      transition running: :canceling, if: :supports_canceling?
      transition CANCELABLE_STATUSES.map(&:to_sym) + [:manual] => :canceled
    end
Then we need to ensure that stage, pipeline, and cross-project status changes work well with the new logic.
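The order of the transitions in the proposed `cancel` event matters, because the first matching rule wins. The plain-Ruby sketch below mimics that ordering without the state machine library, so it can be run on its own. `next_status_on_cancel` is a hypothetical helper, and the `CANCELABLE_STATUSES` list here is an illustrative stand-in that is assumed to include `running`; the real constant in the GitLab codebase may contain different statuses.

```ruby
# Illustrative stand-in only; the real CANCELABLE_STATUSES may differ.
CANCELABLE_STATUSES = %w[created pending running].freeze

# Hypothetical helper: returns the status a job would move to when the
# cancel event fires, or nil if no transition applies.
def next_status_on_cancel(status, supports_canceling:)
  # Rule 1 (the proposed addition): a second cancel on a job that is
  # already canceling forces it straight to canceled.
  return :canceled if status == :canceling

  # Rule 2: a running job that supports graceful canceling first moves
  # to canceling.
  return :canceling if status == :running && supports_canceling

  # Rule 3: every other cancelable status, plus manual, goes directly to
  # canceled. A running job that does not support canceling lands here.
  return :canceled if (CANCELABLE_STATUSES.map(&:to_sym) + [:manual]).include?(status)

  nil
end

status = :running
status = next_status_on_cancel(status, supports_canceling: true) # first cancel
puts status
status = next_status_on_cancel(status, supports_canceling: true) # force-cancel
puts status
```

In this sketch a second cancel on a stuck job is what performs the force-cancel, which corresponds to the option of keeping the existing cancel button available rather than adding a new one.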