Prevent CounterJobWorker from exceeding 300 seconds

What does this MR do and why?

Solves #357966 (closed).

This MR prevents CounterJobWorker from exceeding 300 seconds by enforcing a timeout: between batches, BatchCounter checks whether the timeout has elapsed and, if so, stops and returns a partial result. The CounterJobWorker then schedules itself again to continue counting later. This way, each run of the job stays under 300 seconds, which protects our Error Budget.
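The rescheduling flow described above could look roughly like the following sketch. It is not the code from this MR: the class name SketchCounterJobWorker, the injected counter and reschedule callables, the status symbols, and the 250-second MAX_RUNTIME are all assumptions made for illustration.

```ruby
# Hypothetical sketch of a worker that resumes counting across runs.
# None of these names are taken from the MR itself.
class SketchCounterJobWorker
  # Assumed per-run budget in seconds, kept below the 300-second limit.
  MAX_RUNTIME = 250

  attr_reader :final_count

  # counter:    callable accepting timeout: and partial_results:, returning
  #             a hash with a :status key (stand-in for the batch counter)
  # reschedule: callable that enqueues another run with the partial result
  def initialize(counter:, reschedule:)
    @counter = counter
    @reschedule = reschedule
    @final_count = nil
  end

  def perform(partial_results = nil)
    result = @counter.call(timeout: MAX_RUNTIME, partial_results: partial_results)

    case result[:status]
    when :completed
      # Counting finished within the budget: keep the final count.
      @final_count = result[:count]
    when :timeout
      # Budget exhausted: hand the partial result to a later run.
      @reschedule.call(result[:partial_results])
    else
      # Any other status (for example a cancelled query): fall back to -1.
      @final_count = -1
    end
  end
end
```

In this sketch, no single run waits for the whole count: a run either finishes or passes its progress to the next run.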

This required a new method in BatchCounter: BatchCounter#count_with_timeout. It is similar to BatchCounter#count, but takes additional optional arguments for a timeout and a partial result (so that a count operation can be restarted from a previous partial result). It also has a different return value: instead of returning just the count, it returns a hash with a status that distinguishes between different situations (completion, timeout, cancellation, etc.). To avoid duplication without breaking the existing callers of BatchCounter#count, BatchCounter#count was reimplemented on top of the new BatchCounter#count_with_timeout: it enforces no timeout and maps the result hash to a simple count, preserving the original behavior.
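As a rough illustration of this design, the sketch below counts an in-memory array of ids in batches. It is a simplified stand-in, not the actual BatchCounter: the class name SketchBatchCounter, the hash keys (:status, :count, :partial_results, :next_id), the status symbols, and the injectable clock are assumptions.

```ruby
# Hypothetical, simplified stand-in for BatchCounter. It counts a plain
# array of integer ids in batches instead of querying a database.
class SketchBatchCounter
  FALLBACK = -1

  # clock: callable returning the current time in seconds (injectable for tests)
  def initialize(ids, batch_size: 2,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ids = ids
    @batch_size = batch_size
    @clock = clock
  end

  # Returns a hash with a :status so callers can tell completion,
  # timeout, and failure apart.
  def count_with_timeout(timeout: nil, partial_results: nil)
    started_at = @clock.call
    count = partial_results ? partial_results[:count] : 0
    next_id = partial_results ? partial_results[:next_id] : @ids.min
    return { status: :completed, count: 0 } if next_id.nil?

    last_id = @ids.max

    while next_id <= last_id
      # Check the elapsed time between batches, never in the middle of one.
      if timeout && (@clock.call - started_at) > timeout
        return { status: :timeout,
                 partial_results: { count: count, next_id: next_id } }
      end

      batch_end = next_id + @batch_size
      count += @ids.count { |id| id >= next_id && id < batch_end }
      next_id = batch_end
    end

    { status: :completed, count: count }
  rescue StandardError
    # Stands in for a cancelled query or bad arguments.
    { status: :cancelled }
  end

  # Original-style interface: no timeout, plain integer, -1 on failure.
  def count
    result = count_with_timeout(timeout: nil)
    result[:status] == :completed ? result[:count] : FALLBACK
  end
end
```

Because count delegates to count_with_timeout with no timeout, existing callers keep receiving a plain integer (or -1), while new callers can inspect the status and resume from the partial result.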

Note: the pre-existing BatchCounter#count returns a count, or -1 in situations where the count cannot be determined (database query cancelled, bad arguments, etc.). Arguably, returning a result object with a status, as the new BatchCounter#count_with_timeout does, would be better, but that would involve changing all the call sites in non-trivial ways, which is beyond the scope of this MR. It could be handled in a separate issue.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #357966 (closed)

Edited by Magdalena Frankiewicz
