Shorten MergeTrains refresh life span and add StuckTrainWorker
What does this MR do and why?
This MR changes the locking mechanism we use to prevent concurrent executions of the MergeTrain::RefreshService on the same merge train.
Instead of using deduplicate with the standard 6-hour lock TTL, we use a SleepingLock from ExclusiveLeaseHelpers with a 4-minute TTL. RefreshService itself has been modified to stop refreshing cars on the merge train after three minutes and return an error, which causes the worker to immediately re-queue another job.
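The time-boxed behaviour described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual RefreshService code: `refresh_train`, `run_worker`, the injected `clock`, and the result hash shape are all hypothetical names chosen for the example. The only values taken from this MR are the three-minute budget and the idea that an error result leads to another job.

```ruby
# Hypothetical sketch: none of these method names come from the GitLab codebase.

# Refreshes cars in order until the time budget (three minutes by default)
# is spent. `clock` is injected so the sketch can run without real waiting.
def refresh_train(cars, clock:, budget: 3 * 60)
  deadline = clock.call + budget
  done = 0

  cars.each do |car|
    if clock.call >= deadline
      # Budget exhausted: report an error so the caller re-queues a job.
      return { status: :error, remaining: cars.drop(done) }
    end

    yield car # the actual per-car refresh work
    done += 1
  end

  { status: :success, remaining: [] }
end

# Simulates the worker: keep running short jobs until every car is refreshed.
# Returns the number of jobs that were needed.
def run_worker(cars, clock:, &refresh_car)
  runs = 0
  remaining = cars
  loop do
    runs += 1
    result = refresh_train(remaining, clock: clock, &refresh_car)
    remaining = result[:remaining]
    return runs if result[:status] == :success
  end
end
```

Because the budget (three minutes) is shorter than the lock TTL (four minutes), a job that respects the budget should normally finish before its lease expires, assuming no single car refresh takes more than about a minute.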
Effectively, the MergeTrain::RefreshWorker will continue running on a given merge train until all the cars have been refreshed, but by shortening the lifespan of each job we become more fault tolerant, because the RefreshWorker can be triggered more often.
This change, on its own, shouldn't affect the execution of the RefreshService or solve the stuck MergeTrain problem as described in Merge request stuck in locked state when gettin... (#389044). But it does allow us to introduce a new worker that runs on a 5-minute interval, detects what we believe to be a stuck MergeTrain, and fires off a new RefreshWorker job to take care of it.
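As a rough illustration of what such a periodic check could look like, here is a minimal sketch. The description does not say how a train is judged to be stuck, so the rule used here (the train still has cars, and its last refresh is older than a threshold) is purely an assumption, as are the hash keys, the 10-minute threshold, and the method names.

```ruby
# Hypothetical sketch: the "stuck" rule and all names here are assumptions.

# Returns the ids of trains that look stuck: they still have cars, but no
# refresh has been recorded within `threshold` seconds of `now`.
def stuck_train_ids(trains, now:, threshold: 10 * 60)
  trains
    .select { |t| t[:car_count] > 0 && now - t[:last_refreshed_at] > threshold }
    .map { |t| t[:id] }
end

# One tick of the periodic worker: enqueue a refresh for each stuck train.
# `enqueue` stands in for scheduling a new RefreshWorker job.
def rescue_stuck_trains(trains, now:, enqueue:)
  stuck_train_ids(trains, now: now).each { |id| enqueue.call(id) }
end
```

Since each refresh job now holds its lock for at most four minutes, a check that runs every five minutes can enqueue a new job without waiting hours for a stale lock to expire.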
How to set up and validate locally
Numbered steps to set up and validate the change are strongly suggested.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.