Geo: Fail syncs which exceed a timeout

Problem

Sidekiq is shut down ungracefully while a sync job is running.
I think even if Sidekiq is restarted and is able to push the job back onto the queue, the job will fail 3 times and disappear, due to an orphaned lease.

So the job will be stuck in state = started. ~~This is rare, but possible.~~

I actually got my GDK into this state without trying.

How to find affected records

Return the number of PackageFile sync attempts which started more than 8 hours ago:

Geo::PackageFileRegistry.where("state = 1 AND last_synced_at < ?", 8.hours.ago).count

Return the first PackageFile sync attempt which started more than 8 hours ago:

Geo::PackageFileRegistry.where("state = 1 AND last_synced_at < ?", 8.hours.ago).first

Workaround

Mark all affected registry records as "failed". These will be picked up by the background jobs.

Geo::PackageFileRegistry.where("state = 1 AND last_synced_at < ?", 8.hours.ago).update_all(state: 3)

Proposal

Like VerificationTimeoutWorker, add a SyncTimeoutWorker which moves syncs that started a long time ago to failed.

Note that performance of the worker is not a concern: stuck records count against the concurrency limit, so it would take a very long time for many records to get stuck.
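
A rough sketch of the core logic such a SyncTimeoutWorker might run, written in plain Ruby so it can be tried outside Rails. Everything here is illustrative: `RegistryRecord`, `fail_sync_timeouts`, and the constant names are hypothetical, the state values 1 (started) and 3 (failed) are taken from the queries above, and the 8-hour timeout is only an assumption borrowed from those same queries. A real worker would presumably do this with a single scoped `update_all` per registry class, as in the workaround.

```ruby
# Plain-Ruby sketch; no Rails required. All names here are hypothetical.
STATE_STARTED = 1 # matches "state = 1" in the queries above
STATE_FAILED  = 3 # matches "state: 3" in the workaround
SYNC_TIMEOUT_SECONDS = 8 * 60 * 60 # assumed: the 8 hours used in the queries above

# Stand-in for a registry row such as Geo::PackageFileRegistry.
RegistryRecord = Struct.new(:id, :state, :last_synced_at)

# Marks every record that has been "started" for longer than the timeout
# as "failed" and returns the records it changed.
def fail_sync_timeouts(records, now: Time.now, timeout: SYNC_TIMEOUT_SECONDS)
  cutoff = now - timeout
  timed_out = records.select do |record|
    record.state == STATE_STARTED && record.last_synced_at < cutoff
  end
  timed_out.each { |record| record.state = STATE_FAILED }
end

now = Time.now
records = [
  RegistryRecord.new(1, STATE_STARTED, now - 9 * 60 * 60), # stuck for 9 hours
  RegistryRecord.new(2, STATE_STARTED, now - 60),          # started a minute ago
  RegistryRecord.new(3, STATE_FAILED, now - 24 * 60 * 60)  # already failed
]

changed = fail_sync_timeouts(records, now: now)
puts changed.map(&:id).inspect
```

Only the first record is moved to failed; the recently started sync and the already-failed one are left alone, so running the check repeatedly is harmless.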

Edited Mar 02, 2021 by Michael Kozono