Improve scheduling for batched background migrations
After initial production testing of batched migrations, we have a few improvements that can be made immediately.
Allow short variance when scheduling next job
A current limitation of the sidekiq-cron scheduling is that it only polls for jobs once per minute. Because the interval between jobs is often configured as a whole number of minutes, the cron poll can miss the next job's scheduled time by a matter of seconds, which delays the job until the following poll.
In practice, we saw a migration with a configured interval of 2 minutes running jobs mostly at 3-minute intervals. To work around this problem, we can add a small variance when checking whether the interval has elapsed since the last job. If we're within a reasonable window of the next execution (a few seconds), we should execute the job.
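The variance check described above could look roughly like the following. This is a minimal sketch, not the worker's actual code: the method name `interval_elapsed?`, the constant `INTERVAL_VARIANCE`, and the 5-second value are all assumptions chosen for illustration.

```ruby
# Hypothetical tolerance; the proposal only says "a few seconds".
INTERVAL_VARIANCE = 5 # seconds

# Returns true when the next job is due, allowing it to start up to
# `variance` seconds before the full interval has elapsed.
# A migration that has never run a job is always considered due.
def interval_elapsed?(last_job_started_at, interval, now: Time.now, variance: INTERVAL_VARIANCE)
  return true if last_job_started_at.nil?

  now - last_job_started_at >= interval - variance
end
```

With a 120-second interval, a poll that arrives 118 seconds after the last job would be accepted under this sketch, whereas a strict comparison would push the job to the next minute's poll.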
Increase timeout on ExclusiveLease
We currently take out an ExclusiveLease in the BatchedBackgroundMigrationWorker to ensure that background jobs execute on a single thread. The timeout for this lease is currently set at two times the job interval.
However, in a production incident where database response times degrade considerably, the lease could expire before the job has finished its work, resulting in multiple jobs executing in parallel. To avoid compounding the problem in this way, we should increase the timeout on the lease to a more conservative value.
This shouldn't impact operation during normal execution, as the lease is released when the job finishes.
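The effect of the timeout can be illustrated with a small in-memory model. This is only a sketch: `SketchLease` is a hypothetical stand-in and not the real ExclusiveLease, time is passed in as plain seconds, and the multiplier of 5 is an assumed value, since the proposal only asks for something "more conservative".

```ruby
# In-memory stand-in for a lease with a timeout; NOT the real ExclusiveLease.
class SketchLease
  def initialize(timeout)
    @timeout = timeout
    @expires_at = nil
  end

  # Returns true if the lease was obtained, false if it is still held.
  def try_obtain(now)
    return false if @expires_at && now < @expires_at

    @expires_at = now + @timeout
    true
  end

  # Releasing on completion means a long timeout costs nothing in the
  # normal case; the timeout only matters when a job overruns or dies.
  def release
    @expires_at = nil
  end
end

# Hypothetical helper: derive the lease timeout from the job interval.
def lease_timeout(interval, multiplier: 5)
  interval * multiplier
end
```

In this model, a job with a 120-second interval that takes 300 seconds under a 2x timeout (240 seconds) lets a second worker obtain the lease at the 300-second mark. With the larger timeout (600 seconds) the second attempt is refused until the first job releases the lease.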
Make pause time configurable
When processing a batched background migration, we have two batching values: batch_size and sub_batch_size. The batch_size is used to find the boundaries for each background job that executes. The sub_batch_size is used inside the background job to process the outer batch in small pieces, which keeps query times low.
This strategy seems to work well, and we include a small sleep between each sub-batch to slow operations on the database. However, there is an upper limit on sub_batch_size that is determined by execution time on the database. At the same time, we may be able to increase the overall batch_size by large amounts, especially when processing in off-peak hours.
Unfortunately, the more sub-batches a job processes, the more time it spends pausing between individual queries, and those pauses can ultimately dominate the total execution time of the job. We should make the pause time a configurable value too, so that we can increase or decrease it as necessary.
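A configurable pause in the sub-batch loop might be sketched as follows. The method name `process_batch`, the `pause` and `sleeper` parameters, and the 0.1-second default are assumptions for illustration, not the existing job's interface. Note that this sketch pauses after every sub-batch, including the last one.

```ruby
# Processes ids in sub-batches of `sub_batch_size`, pausing after each one.
# `pause` (in seconds) is the configurable value; `sleeper` is injectable
# so the loop can be exercised without actually sleeping.
def process_batch(ids, sub_batch_size:, pause: 0.1, sleeper: ->(seconds) { sleep(seconds) })
  ids.each_slice(sub_batch_size) do |sub_batch|
    yield sub_batch
    sleeper.call(pause) if pause.positive?
  end
end
```

The cost of the pause scales with the number of sub-batches: a batch_size of 100,000 with a sub_batch_size of 1,000 yields 100 sub-batches, so a 0.1-second pause adds about 10 seconds per job. Lowering the pause during off-peak hours would recover most of that time.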