t0610: work around flaky test with concurrent writers

Patrick Steinhardt requested to merge pks-reftable-win32-flaky-tests into master

In 6241ce21 (refs/reftable: reload locked stack when preparing transaction, 2024-09-24) we have introduced a new test that exercises how the reftable backend behaves with many concurrent writers all racing with each other. This test was introduced after a couple of fixes in this context that should make concurrent writes behave gracefully. As it turns out though, Windows systems do not yet handle concurrent writes properly, as we've got two reports for Cygwin and MinGW failing in this newly added test.

The root cause of this is how we update the "tables.list" file: when writing a new stack of tables we first write the data into a lockfile and then rename that file into place. But Windows forbids us from doing that rename when the target path is open for reading by another process. And as the test races both readers and writers with each other we are quite likely to hit this edge case.
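
The update pattern described above can be sketched as follows. This is a minimal illustration in Python, not the reftable library's C implementation: the function name `update_tables_list` and the table names used with it are hypothetical, and the ".lock" suffix is an assumption of this sketch.

```python
import os


def update_tables_list(stack_dir, table_names):
    """Replace "tables.list" by writing a lockfile and renaming it into place."""
    target = os.path.join(stack_dir, "tables.list")
    lock = target + ".lock"  # assumed lockfile name for this sketch
    # O_EXCL makes creation fail if another writer already holds the lock.
    fd = os.open(lock, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        with os.fdopen(fd, "w") as f:
            f.write("".join(name + "\n" for name in table_names))
            f.flush()
            os.fsync(f.fileno())
        # Atomic on POSIX. On Windows this can raise PermissionError
        # while another process has the target open.
        os.replace(lock, target)
    except BaseException:
        # Drop our lockfile so other writers are not blocked forever.
        if os.path.exists(lock):
            os.unlink(lock)
        raise
```

On POSIX systems a reader that already has the old "tables.list" open keeps seeing the old contents, while new readers see the replaced file. It is the final rename step that Windows may refuse.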

Now the two reports are somewhat different from one another:

  • On Cygwin we hit timeouts because we fail to lock the "tables.list" file within 10 seconds. The renames themselves succeed because Cygwin provides extensive compatibility logic to make them work even when the target file is already open.

  • On MinGW we hit I/O errors on rename. While we do have some retry logic in place to make the rename work in some cases, this is seemingly not sufficient when there is this much contention around the files.
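
To illustrate why bounded retries can fall short under contention, here is a rough sketch of retry-on-rename logic with exponential backoff. It is not the code in "compat/mingw.c": the function `rename_with_retry`, its parameters, and the delays are assumptions made for this example.

```python
import os
import time


def rename_with_retry(src, dst, attempts=5, initial_delay=0.001, replace=os.replace):
    """Retry a rename with exponential backoff while the target is busy.

    Returns the number of tries it took. `replace` is injectable so the
    behavior can be exercised without a real Windows file handle.
    """
    delay = initial_delay
    for attempt in range(attempts):
        try:
            replace(src, dst)
            return attempt + 1
        except PermissionError:
            if attempt == attempts - 1:
                raise  # contention outlasted the retry budget
            time.sleep(delay)
            delay *= 2
```

With many readers repeatedly opening the target, every retry can land in a window where some process holds the file open, so the retry budget runs out and the error surfaces to the caller.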

Neither of these cases is a regression: the logic didn't work before the mentioned commit, and after the commit it performs well on Linux and macOS, and at least a bit better on Windows. But the tests surface that we need to put more thought into how to make this work properly on MinGW systems.

The fact that Cygwin can work around this issue with better emulation of POSIX-style atomic renames shows that we can in theory make MinGW work better, as well. But doing so likely requires quite some fiddling with Windows internals, and Git v2.47 is about to be released in a couple of days. This makes any potential fix quite risky as it would have to happen deep down in our rename(3P) implementation in "compat/mingw.c".

Let's instead work around both issues by disabling the test on MinGW and by significantly increasing the locking timeout for Cygwin. This bumped timeout also helps when running with e.g. the address and memory sanitizers, which also tend to significantly extend the runtime of this test.
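
The effect of the locking timeout can be sketched with a simple polling lock. This is an illustration only and does not reproduce Git's lockfile code: `acquire_lock`, `timeout_ms`, and the fixed polling interval are hypothetical.

```python
import os
import time


def acquire_lock(path, timeout_ms, poll_interval=0.01):
    """Create `path` exclusively, polling until `timeout_ms` has elapsed."""
    deadline = time.monotonic() + timeout_ms / 1000
    while True:
        try:
            # Succeeds only if no other writer currently holds the lock.
            return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not lock {path} within {timeout_ms} ms")
            time.sleep(poll_interval)
```

A larger timeout does not make any single writer faster. It only gives slow writers, for example under sanitizers, more time to get their turn before the operation is reported as failed.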

This should be revisited after Git v2.47 is out.

Signed-off-by: Patrick Steinhardt <ps@pks.im>

Part of Racy writes in the reftable backend can cause I... (#402).
