Race condition in Concurrency Limiter
Relates to #2574 (closed)
The concurrency limiter has a race condition where a weighted semaphore can be updated during garbage collection. See the affected critical sections from concurrency_limiter.go below:
func (c *ConcurrencyLimiter) attemptCollection(lockKey string) {
c.mux.Lock()
defer c.mux.Unlock()
ws := c.semaphores[lockKey]
if ws == nil {
return
}
if !ws.TryAcquire(c.max) {
return
}
// By releasing, we prevent a lockup of goroutines that have already
// acquired the semaphore, but have yet to acquire on it
ws.Release(c.max)
// 🚩 There is a RACE CONDITION here! This is because during Limit, the caller
// may have acquired a semaphore outside of this c.mux lock.
// If we managed to acquire all the locks, we can remove the semaphore for this key
delete(c.semaphores, lockKey)
}
// Limit will limit the concurrency of f
func (c *ConcurrencyLimiter) Limit(ctx context.Context, lockKey string, f LimitedFunc) (interface{}, error) {
if c.max <= 0 {
return f()
}
start := time.Now()
c.monitor.Queued(ctx)
w := c.getSemaphore(lockKey) // 🚩 c.mux lock released here
// Attempt to cleanup the semaphore it's no longer being used
defer c.attemptCollection(lockKey)
err := w.Acquire(ctx, 1) // 🚩 RACE CONDITION HERE!
c.monitor.Dequeued(ctx)
if err != nil {
return nil, err
}
c.monitor.Enter(ctx, time.Since(start))
defer c.monitor.Exit(ctx)
defer w.Release(1)
resp, err := f()
return resp, err
}
This problem becomes more apparent during unit tests when the test is run such that the concurrency limits are stress tested. This was done in !1965 (merged). Now, we see test failures where the maximum number of concurrent operations exceeds the expected range (https://gitlab.com/gitlab-org/gitaly/-/jobs/483167996):
concurrency_limiter_test.go:176:
Error Trace: concurrency_limiter_test.go:176
concurrency_limiter.go:103
concurrency_limiter_test.go:172
asm_amd64.s:1373
Error: Should be true
Test: TestLimiter/wide-spread
Messages: Expected the number of concurrent operations (126) to not exceed the maximum concurrency (105)
concurrency_limiter_test.go:176:
Error Trace: concurrency_limiter_test.go:176
concurrency_limiter.go:103
concurrency_limiter_test.go:172
asm_amd64.s:1373
Error: Should be true
Test: TestLimiter/wide-spread
Messages: Expected the number of concurrent operations (128) to not exceed the maximum concurrency (105)
concurrency_limiter_test.go:176:
Error Trace: concurrency_limiter_test.go:176
concurrency_limiter.go:103
concurrency_limiter_test.go:172
asm_amd64.s:1373
Error: Should be true
Test: TestLimiter/wide-spread
Messages: Expected the number of concurrent operations (130) to not exceed the maximum concurrency (105)
concurrency_limiter_test.go:193:
Error Trace: concurrency_limiter_test.go:193
Error: Should be true
Test: TestLimiter/wide-spread
Messages: Expected maximum concurrency to be in the range [95,105] but got 171
FAIL
Rather than rely on indeterministic test results due to the race condition, we should fix the underlying problem. The test should never exceed the specified maximum concurrency range. We currently use a range of allowable test results to mitigate the race condition, this should not be needed.