The ProcessRequirementsReportsWorker retries on an unrecoverable error
I came here from looking at the error budgets for Certify. The biggest cause of spend is `RequirementsManagement::ProcessRequirementsReportsWorker` failing and being retried 3 times. This worker fails on roughly 30% of its runs with a `Gitlab::Access::AccessDeniedError`. Over the past 7 days we've spent about 8 hours of compute running jobs that cannot succeed.
| json.class.keyword | json.job_status.keyword | json.error_class.keyword | Count | Count % | Sum of json.duration_s (seconds) |
|---|---|---|---|---|---|
| RequirementsManagement::ProcessRequirementsReportsWorker | done | Missing | 523,534 | 69.45% | 87,031.19 |
| RequirementsManagement::ProcessRequirementsReportsWorker | fail | Gitlab::Access::AccessDeniedError | 230,299 | 30.55% | 30,303.60 |
(https://log.gprd.gitlab.net/goto/44537aca483b44655c01d987e84e9fa9)
This error is raised from the `ProcessTestReportsService`. This kind of error does not warrant a Sidekiq retry, since it is unlikely to succeed on a subsequent run. Perhaps we could also skip scheduling this worker from the `BuildFinishedWorker` if it is going to be a no-op anyway?
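As a rough sketch of the retry side, one option is to rescue the error inside the worker so the job completes instead of raising, which means Sidekiq never re-enqueues it. The worker body, the `perform` argument, and the service call signature below are assumptions for illustration, not the current implementation; `Gitlab::ErrorTracking.track_exception` is the existing helper for reporting swallowed exceptions.

```ruby
# frozen_string_literal: true

module RequirementsManagement
  class ProcessRequirementsReportsWorker
    include ApplicationWorker

    def perform(build_id)
      build = Ci::Build.find_by_id(build_id)
      return unless build

      # Existing service call (arguments assumed for illustration).
      ProcessTestReportsService.new(build.project, build.user).execute(build)
    rescue Gitlab::Access::AccessDeniedError => error
      # The permission check will not pass on a later attempt either, so don't
      # re-raise: the job finishes as "done" and Sidekiq never retries it.
      Gitlab::ErrorTracking.track_exception(error, build_id: build_id)
    end
  end
end
```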
Acceptance Criteria
There are two changes to make here:
- Don't retry this job if it's unlikely to succeed on subsequent attempts (see the sketch above).
- Don't schedule this job in the first place if it won't succeed. A strategy for this is detailed in #338717 (comment 653146872); a sketch follows the quoted comment below:
> So in this particular example, I think we could consider wrapping the 4 checks in a method that we can check before scheduling the job.
>
> In our codebase, we have this kind of pattern using a nested `Async` class; perhaps we could do something similar here?
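Something along these lines could work for the scheduling side. The method name `processable?` and the individual conditions below are placeholders standing in for the 4 checks mentioned in the comment, only meant to show the checks wrapped in one place that both the service and `Ci::BuildFinishedWorker` can consult before enqueueing; the same guard could equally live behind a nested `Async` class as the comment suggests.

```ruby
# frozen_string_literal: true

module RequirementsManagement
  class ProcessTestReportsService
    # Hypothetical guard wrapping the existing pre-conditions (placeholder
    # checks shown here, not the real four) so callers can ask the same
    # question before enqueueing the worker.
    def self.processable?(build)
      project = build.project

      project.licensed_feature_available?(:requirements) &&
        project.requirements.exists? &&
        build.user.present? &&
        Ability.allowed?(build.user, :create_requirement_test_report, project)
    end
  end
end

# In Ci::BuildFinishedWorker, only schedule the job when it can succeed
# (hook name and placement assumed):
#
#   if RequirementsManagement::ProcessTestReportsService.processable?(build)
#     RequirementsManagement::ProcessRequirementsReportsWorker.perform_async(build.id)
#   end
```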