Create leases in pending, clear completed_timestamp in mark_worker_started
Before raising this MR, consider whether the following are required, and complete if so:
- Unit tests
- Metrics
- Documentation update(s)
Description
This PR addresses an issue found in the JobAssigner where jobs without a lease but with worker_start/worker_completed timestamps could cause a deadlock. The cause is that assign_n_leases selects a batch of jobs with SELECT FOR UPDATE in a session and, inside that session, calls job.create_lease. This in turn calls job.update_lease_state, which clears the worker_start/worker_completed timestamps and attempts to persist the change to the database using a different session. The result is a deadlock: the inner transaction waits for a lock held by the outer transaction, which cannot finish until the inner call returns. This doesn't happen all the time only because these fields are typically empty when a job doesn't have a lease, but all it takes to hit this code path is a job left in an inconsistent state, for example by connectivity issues.
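To make the locking pattern concrete, here is a minimal, self-contained sketch of the same shape of problem. It is not the BuildGrid code: it uses the standard-library sqlite3 module, where BEGIN IMMEDIATE stands in for SELECT FOR UPDATE (SQLite locks the whole database rather than individual rows), and the table layout, the demo function, and its use_second_session flag are all hypothetical. The branch that writes through the outer connection is only there for contrast and is not the change made in this PR.

```python
import os
import sqlite3
import tempfile


def open_conn(path, timeout):
    # isolation_level=None: transactions are controlled explicitly below.
    return sqlite3.connect(path, timeout=timeout, isolation_level=None)


def demo(use_second_session):
    """Hold a write lock in an outer connection, then clear a timestamp
    either through a second connection or through the outer one."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "jobs.db")

        # Hypothetical schema: a job with a stale worker timestamp and no lease.
        setup = open_conn(path, 1.0)
        setup.execute(
            "CREATE TABLE jobs (name TEXT PRIMARY KEY, worker_start_timestamp TEXT)"
        )
        setup.execute("INSERT INTO jobs VALUES ('job-1', '2020-01-01T00:00:00')")
        setup.close()

        # Outer "session": takes the lock and keeps its transaction open,
        # standing in for the SELECT FOR UPDATE in assign_n_leases.
        outer = open_conn(path, 1.0)
        outer.execute("BEGIN IMMEDIATE")
        outer.execute("SELECT name FROM jobs").fetchall()

        # Inner write, standing in for the timestamp reset done while the
        # outer transaction is still open.
        inner = open_conn(path, 0.2) if use_second_session else outer
        try:
            inner.execute(
                "UPDATE jobs SET worker_start_timestamp = NULL WHERE name = 'job-1'"
            )
            outcome = "updated"
        except sqlite3.OperationalError:
            # The second connection gave up waiting for the outer lock.
            outcome = "lock timeout"

        outer.execute("COMMIT")
        if inner is not outer:
            inner.close()
        outer.close()
        return outcome
```

In this sketch, demo(use_second_session=True) ends in a lock timeout because the second connection waits on a lock that the caller itself is holding, while demo(use_second_session=False) completes the update.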
We have a longer-standing issue to rework how we schedule and handle jobs so that transactions are used properly, of which removing the in-memory scheduler was the first step. This PR addresses only one very specific issue, which is debilitating when it happens.
I've also added a test to verify this behavior is fixed with these changes.
Validation
The new test test_job_lease_creation_with_preexisting_worker_timestamps covers this behavior. Running this test against the old code fails with lock timeouts, which do not occur with the new version.