Skip to content

backup: Fix the flake in TestLogEntryArchiver

Quang-Minh Nguyen requested to merge qmnguyen0711/fix-log-achiver-flake into master

We noticed some flaky tests in internal/backup package. They come from TestLogEntryArchiver tests:

https://gitlab.com/gitlab-org/gitaly/-/jobs/7578192954

=== FAIL: internal/backup TestLogEntryArchiver/notify_range_of_entries (unknown)
panic: sync: negative WaitGroup counter
goroutine 771 [running]:
sync.(*WaitGroup).Add(0x10?, 0x10?)
	/usr/local/go/src/sync/waitgroup.go:62 +0xd8
sync.(*WaitGroup).Done(...)
	/usr/local/go/src/sync/waitgroup.go:87
gitlab.com/gitlab-org/gitaly/v16/internal/backup.(*mockLogManager).AcknowledgeTransaction(0xc0056ff280, {0x750000c004b8cd48?, 0x0?}, 0x5)
	/builds/gitlab-org/gitaly/internal/backup/log_entry_test.go:71 +0x136
gitlab.com/gitlab-org/gitaly/v16/internal/backup.(*LogEntryArchiver).ingestNotifications.func1({0x300cf78?, 0xc0056ff280?})
	/builds/gitlab-org/gitaly/internal/backup/log_entry.go:292 +0x42
gitlab.com/gitlab-org/gitaly/v16/internal/backup.(*mockLogManagerAccessor).CallLogManager(0xc00517d008, {0xc0041020c0?, 0xc004ba69a0?}, {0x26d1669?, 0xc002688060?}, 0xc0060d8de0?, 0xc00463c618)
	/builds/gitlab-org/gitaly/internal/backup/log_entry_test.go:53 +0x142
gitlab.com/gitlab-org/gitaly/v16/internal/backup.(*LogEntryArchiver).ingestNotifications(0xc0060fafa0, {0x301f3e8, 0xc004166000})
	/builds/gitlab-org/gitaly/internal/backup/log_entry.go:291 +0x4c3
gitlab.com/gitlab-org/gitaly/v16/internal/backup.(*LogEntryArchiver).main(0xc0060fafa0, {0x301f3e8, 0xc004166000}, 0xc002688000, 0xc002688060)
	/builds/gitlab-org/gitaly/internal/backup/log_entry.go:242 +0x1c5
gitlab.com/gitlab-org/gitaly/v16/internal/backup.(*LogEntryArchiver).Run.func1()
	/builds/gitlab-org/gitaly/internal/backup/log_entry.go:211 +0x1eb
created by gitlab.com/gitlab-org/gitaly/v16/internal/backup.(*LogEntryArchiver).Run in goroutine 654
	/builds/gitlab-org/gitaly/internal/backup/log_entry.go:188 +0x78

In these tests, we mocked the log manager. It stores the list of incoming acknowledgments from LogEntryArchiver. After the final LSN is acknowledged, it calls a finish function to release an outside wait group. After investigating, we found out that mockLogManager's AcknowledgeTransaction is called more than once with the same LSN.

When the LogEntryArchiver processes a notification, it sets the current state's high water mark = notification's high water mark and notifies another goroutine for processing. The processing goroutine processes log entries sequentially from the current nextLSN -> high watermark. After each iteration, it acknowledges the corresponding log entry. If the next notification contains the same high watermark as the prior, there's a chance the processing goroutine already handles all log entries. It then re-acknowledges the high watermark and skips the notification.

For example:

  • Ingest notification [3, 5]
  • Process 3, ack 3 -> nextLSN = 4
  • Process 4, ack 4 -> nextLSN = 5
  • Process 5, ack 5 -> nextLSN = 6
  • Ingest notification [4, 5] -> re-ack 5

The real LogManager allows duplicated acknowledgment. In fact, it simply sets a pointer. So, to fix the flake, we simply allow the mockLogManager to call the finish function for at most pre-defined amount of calls. Excessive calls are ignored.

This commit also stops the archiver completely before moving to the assertion stage.

Before

This flake could be reproduced by running the tests with -race -failfast -count 1000

Screenshot_2024-08-14_at_10.52.17

After

Screenshot_2024-08-14_at_14.38.28

Edited by Quang-Minh Nguyen

Merge request reports

Loading