Prevent CounterJobWorker from exceeding 300 seconds
What does this MR do and why?
Solves #357966 (closed).
This MR prevents CounterJobWorker from exceeding 300 seconds by enforcing a timeout: the BatchCounter checks between batches whether the timeout has elapsed and, if so, stops and returns a partial result. The CounterJobWorker then schedules itself again to continue counting later. This keeps each run of the job under 300 seconds and protects our Error Budget.
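The reschedule-on-timeout flow described above can be sketched roughly as follows. This is an illustration, not the MR's actual code: the class name CounterJobSketch, the injected counter and scheduler callables, and the shape of the result hash are all assumptions.

```ruby
# Hypothetical sketch: names and hash keys are illustrative, not from the MR.
class CounterJobSketch
  # counter:   callable taking a partial result (or nil) and returning a hash
  #            such as { status: :completed, count: 5 } or
  #            { status: :timeout, partial_result: {...} } (assumed shape).
  # scheduler: callable that enqueues another run with the partial result.
  def initialize(counter:, scheduler:)
    @counter = counter
    @scheduler = scheduler
  end

  def perform(partial_result = nil)
    result = @counter.call(partial_result)

    case result[:status]
    when :completed
      result[:count]
    when :timeout
      # Hand the partial result to a later run instead of exceeding the limit.
      @scheduler.call(result[:partial_result])
      nil
    else
      # Any other status (for example a cancelled query): no count available.
      nil
    end
  end
end
```

In a real worker the scheduler would presumably enqueue the same job again with the partial result as an argument; here it is injected so the flow can be exercised without a job queue.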
This required a new method in BatchCounter: BatchCounter#count_with_timeout. It is similar to BatchCounter#count, but takes additional optional arguments for a timeout and a partial result (so that a count operation can be resumed from a previous partial result). It also has a different return value: instead of returning just the count, it returns a hash with a status that distinguishes between outcomes (completion, timeout, cancellation, etc.). To avoid duplication without breaking the existing callers of BatchCounter#count, that method was reimplemented on top of the new BatchCounter#count_with_timeout: it enforces no timeout and maps the result hash to a simple count, preserving the original behavior.
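As a rough illustration of how the two methods relate, here is a minimal in-memory sketch. It assumes a hypothetical BatchCounterSketch that counts an array of ids rather than running batched database queries, and the keyword names (timeout, partial_result) and result hash keys are assumptions, not the MR's actual signature.

```ruby
# Hypothetical in-memory sketch; the real BatchCounter runs batched SQL queries.
class BatchCounterSketch
  def initialize(ids, batch_size: 2,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ids = ids
    @batch_size = batch_size
    @clock = clock # injectable so the timeout can be tested deterministically
  end

  # Returns { status: :completed, count: n } or
  # { status: :timeout, partial_result: { count: n, offset: i } } (assumed shape).
  def count_with_timeout(timeout: nil, partial_result: nil)
    count = partial_result ? partial_result[:count] : 0
    offset = partial_result ? partial_result[:offset] : 0
    deadline = timeout && (@clock.call + timeout)

    while offset < @ids.size
      # The deadline is checked between batches, never in the middle of one.
      if deadline && @clock.call >= deadline
        return { status: :timeout,
                 partial_result: { count: count, offset: offset } }
      end

      count += @ids[offset, @batch_size].size
      offset += @batch_size
    end

    { status: :completed, count: count }
  end

  # Original interface: no timeout, plain integer, -1 when undetermined.
  def count
    result = count_with_timeout
    result[:status] == :completed ? result[:count] : -1
  end
end
```

Passing the partial_result from a timed-out call into the next call resumes counting at the stored offset, while count keeps returning a bare integer for existing callers.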
Note: the pre-existing BatchCounter#count returns a count, or -1 when the count cannot be determined (database query cancelled, bad arguments, etc.). Arguably, returning a result object with a status, as the new BatchCounter#count_with_timeout does, would be better, but that would require changing all the call sites in non-trivial ways, which is beyond the scope of this MR. It could be done in a separate issue.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #357966 (closed)