backup: Fix the flake in TestLogEntryArchiver
We noticed some flaky tests in internal/backup
package. They come
from TestLogEntryArchiver tests:
https://gitlab.com/gitlab-org/gitaly/-/jobs/7578192954
=== FAIL: internal/backup TestLogEntryArchiver/notify_range_of_entries (unknown)
panic: sync: negative WaitGroup counter
goroutine 771 [running]:
sync.(*WaitGroup).Add(0x10?, 0x10?)
/usr/local/go/src/sync/waitgroup.go:62 +0xd8
sync.(*WaitGroup).Done(...)
/usr/local/go/src/sync/waitgroup.go:87
gitlab.com/gitlab-org/gitaly/v16/internal/backup.(*mockLogManager).AcknowledgeTransaction(0xc0056ff280, {0x750000c004b8cd48?, 0x0?}, 0x5)
/builds/gitlab-org/gitaly/internal/backup/log_entry_test.go:71 +0x136
gitlab.com/gitlab-org/gitaly/v16/internal/backup.(*LogEntryArchiver).ingestNotifications.func1({0x300cf78?, 0xc0056ff280?})
/builds/gitlab-org/gitaly/internal/backup/log_entry.go:292 +0x42
gitlab.com/gitlab-org/gitaly/v16/internal/backup.(*mockLogManagerAccessor).CallLogManager(0xc00517d008, {0xc0041020c0?, 0xc004ba69a0?}, {0x26d1669?, 0xc002688060?}, 0xc0060d8de0?, 0xc00463c618)
/builds/gitlab-org/gitaly/internal/backup/log_entry_test.go:53 +0x142
gitlab.com/gitlab-org/gitaly/v16/internal/backup.(*LogEntryArchiver).ingestNotifications(0xc0060fafa0, {0x301f3e8, 0xc004166000})
/builds/gitlab-org/gitaly/internal/backup/log_entry.go:291 +0x4c3
gitlab.com/gitlab-org/gitaly/v16/internal/backup.(*LogEntryArchiver).main(0xc0060fafa0, {0x301f3e8, 0xc004166000}, 0xc002688000, 0xc002688060)
/builds/gitlab-org/gitaly/internal/backup/log_entry.go:242 +0x1c5
gitlab.com/gitlab-org/gitaly/v16/internal/backup.(*LogEntryArchiver).Run.func1()
/builds/gitlab-org/gitaly/internal/backup/log_entry.go:211 +0x1eb
created by gitlab.com/gitlab-org/gitaly/v16/internal/backup.(*LogEntryArchiver).Run in goroutine 654
/builds/gitlab-org/gitaly/internal/backup/log_entry.go:188 +0x78
In these tests, we mocked the log manager. It stores the list of incoming acknowledgments from LogEntryArchiver. After the final LSN is acknowledged, it calls a finish function to release an outside wait group. After investigating, we found out that mockLogManager's AcknowledgeTransaction is called more than once with the same LSN.
When the LogEntryArchiver processes a notification, it sets the current state's high water mark = notification's high water mark and notifies another goroutine for processing. The processing goroutine processes log entries sequentially from the current nextLSN -> high watermark. After each iteration, it acknowledges the corresponding log entry. If the next notification contains the same high watermark as the prior, there's a chance the processing goroutine already handles all log entries. It then re-acknowledges the high watermark and skips the notification.
For example:
- Ingest notification [3, 5]
- Process 3, ack 3 -> nextLSN = 4
- Process 4, ack 4 -> nextLSN = 5
- Process 5, ack 5 -> nextLSN = 6
- Ingest notification [4, 5] -> re-ack 5
The real LogManager allows duplicated acknowledgment. In fact, it simply sets a pointer. So, to fix the flake, we simply allow the mockLogManager to call the finish function for at most pre-defined amount of calls. Excessive calls are ignored.
This commit also stops the archiver completely before moving to the assertion stage.
Before
This flake could be reproduced by running the tests with -race -failfast -count 1000