Geo: Fail syncs which exceed a timeout
Problem
- Sidekiq is shut down ungracefully while a sync job is running.
- I think that even if Sidekiq is restarted and is able to push the job back onto the queue, the job will fail 3 times and disappear because of an orphaned lease.
So the job will be stuck in state = started. This is rare, but possible.
I actually got my GDK into this state without trying:
How to find affected records
Return the number of PackageFile sync attempts which started more than 8 hours ago:
Geo::PackageFileRegistry.where("state = 1 AND last_synced_at < ?", 8.hours.ago).count
Return the first PackageFile sync attempt which started more than 8 hours ago:
Geo::PackageFileRegistry.where("state = 1 AND last_synced_at < ?", 8.hours.ago).first
Workaround
Mark all affected registry records as "failed". These will be picked up by the background jobs.
Geo::PackageFileRegistry.where("state = 1 AND last_synced_at < ?", 8.hours.ago).update_all(state: 3)
Proposal
Like VerificationTimeoutWorker, add a SyncTimeoutWorker, which moves syncs that started a long time ago to failed.
Note that the performance of the worker is not a concern: stuck records count against the concurrency limit, so it would take a very long time for many records to get stuck.
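To make the proposal concrete, here is a minimal plain-Ruby sketch of the sweep such a worker might perform. It is not the actual implementation: the Registry struct, SYNC_TIMEOUT, and fail_timed_out_syncs are hypothetical names, the 8-hour cutoff is just the value used in the console queries above, and a real worker would presumably run an update against the registry table rather than filter an in-memory array.

```ruby
# Plain-Ruby sketch of the proposed timeout sweep. Everything here is
# hypothetical: Registry, SYNC_TIMEOUT, and fail_timed_out_syncs stand in
# for the real registry model and worker.

# State values mirror the console queries above: 1 = started, 3 = failed.
STARTED = 1
FAILED = 3

# Assumed cutoff, matching the 8 hours used in the console queries.
SYNC_TIMEOUT = 8 * 60 * 60 # seconds

Registry = Struct.new(:id, :state, :last_synced_at)

# Marks every registry that has been "started" for longer than the timeout
# as failed, and returns the registries that were changed.
def fail_timed_out_syncs(registries, now: Time.now, timeout: SYNC_TIMEOUT)
  cutoff = now - timeout
  stuck = registries.select do |r|
    r.state == STARTED && r.last_synced_at < cutoff
  end
  stuck.each { |r| r.state = FAILED }
  stuck
end

now = Time.now
registries = [
  Registry.new(1, STARTED, now - 9 * 60 * 60),  # stuck: started 9 hours ago
  Registry.new(2, STARTED, now - 60),           # still within the timeout
  Registry.new(3, FAILED, now - 24 * 60 * 60)   # already failed, left alone
]

changed = fail_timed_out_syncs(registries, now: now)
puts changed.map(&:id).inspect
```

Only the first record is moved to failed. The recently started sync is left alone, so a sweep like this should not interfere with syncs that are still within the timeout.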
Edited by Michael Kozono