More granular processing of replication jobs
Problems to solve
-
Currently, each Praefect instance has a single worker goroutine processing the replication queue. An instance dequeues 10 jobs at a time and processes them sequentially. If the first job in the batch takes a long time (for example, 10 minutes), the other 9 jobs dequeued by the instance are blocked for those 10 minutes.
-
Since replication jobs do not currently have a timeout, a slow job can block replications to 10 repositories indefinitely. Because dequeuing a job acquires the repository lock, no other Praefect instance can process any jobs for the affected repositories either: #3486
-
Job attempts are decremented for every job in the dequeue batch. Problems processing one job can lead to attempts for other jobs being decremented as well, since they are dequeued in the same batch.
Possible solution
-
Each Praefect should have a configurable number of worker goroutines processing replication jobs. This allows the concurrency to be determined independently of the number of Praefect instances. With a higher number of workers, a few slow jobs do not block replications for the whole cluster.
-
Each worker should dequeue and lock a single job at a time to prevent a problematic job from either blocking replications to other repositories or affecting the attempt count for other jobs.
-
We should enforce a timeout for replication jobs to prevent them from blocking workers indefinitely: #3486