datastore: Fix acknowledgement of stale jobs considering timezones (!3784) · Merge requests · GitLab.org / gitaly

Patrick Steinhardt requested to merge pks-datastore-fix-timezone-bug into master Aug 20, 2021

For quite some time, one specific test for the Postgres replication event queue has been failing: it queues up a replication job, dequeues it, and then immediately tries to acknowledge stale jobs. The acknowledged jobs are never in the correct state, though: they're marked as failed even though they should still be in progress. While this could be a race, the fact that it only occurs for some developers strongly hints that something else is going on: even bumping the threshold to an hour doesn't fix it.

If one bumps the timeout to slightly above two hours, though, it now starts to fail. This is a strong indicator that the issue is timezone-related, given that I'm located at UTC+2. And indeed: while we always make sure to insert and compare SQL timestamps in the replication queue as UTC, we don't do so when acknowledging stale jobs. Depending on the timezone, this means that we either take way too long to update jobs (if in a positive timezone) or always mark jobs as failed immediately (if in a negative timezone).

Fix the bug by correctly using the UTC timezone when acknowledging stale jobs.

Changelog: fixed
