Race condition in Concurrency Limiter

Relates to #2574 (closed)

The concurrency limiter has a race condition where a weighted semaphore can be updated during garbage collection. See the affected critical sections from concurrency_limiter.go below:

func (c *ConcurrencyLimiter) attemptCollection(lockKey string) {
	c.mux.Lock()
	defer c.mux.Unlock()

	ws := c.semaphores[lockKey]
	if ws == nil {
		return
	}

	if !ws.TryAcquire(c.max) {
		return
	}

	// By releasing, we prevent a lockup of goroutines that have already
	// acquired the semaphore, but have yet to acquire on it
	ws.Release(c.max)

	// 🚩 There is a RACE CONDITION here! This is because during Limit, the caller
	// may have acquired a semaphore outside of this c.mux lock.

	// If we managed to acquire all the locks, we can remove the semaphore for this key
	delete(c.semaphores, lockKey)
}

// Limit will limit the concurrency of f
func (c *ConcurrencyLimiter) Limit(ctx context.Context, lockKey string, f LimitedFunc) (interface{}, error) {
	if c.max <= 0 {
		return f()
	}

	start := time.Now()
	c.monitor.Queued(ctx)

	w := c.getSemaphore(lockKey) // 🚩 c.mux lock released here

	// Attempt to cleanup the semaphore it's no longer being used
	defer c.attemptCollection(lockKey)

	err := w.Acquire(ctx, 1) // 🚩 RACE CONDITION HERE!
	c.monitor.Dequeued(ctx)

	if err != nil {
		return nil, err
	}

	c.monitor.Enter(ctx, time.Since(start))
	defer c.monitor.Exit(ctx)

	defer w.Release(1)

	resp, err := f()

	return resp, err
}

This problem becomes more apparent during unit tests when the test is run such that the concurrency limits are stress tested. This was done in !1965 (merged). Now, we see test failures where the maximum number of concurrent operations exceeds the expected range (https://gitlab.com/gitlab-org/gitaly/-/jobs/483167996):

         concurrency_limiter_test.go:176: 
             	Error Trace:	concurrency_limiter_test.go:176
             	            				concurrency_limiter.go:103
             	            				concurrency_limiter_test.go:172
             	            				asm_amd64.s:1373
             	Error:      	Should be true
             	Test:       	TestLimiter/wide-spread
             	Messages:   	Expected the number of concurrent operations (126) to not exceed the maximum concurrency (105)
         concurrency_limiter_test.go:176: 
             	Error Trace:	concurrency_limiter_test.go:176
             	            				concurrency_limiter.go:103
             	            				concurrency_limiter_test.go:172
             	            				asm_amd64.s:1373
             	Error:      	Should be true
             	Test:       	TestLimiter/wide-spread
             	Messages:   	Expected the number of concurrent operations (128) to not exceed the maximum concurrency (105)
         concurrency_limiter_test.go:176: 
             	Error Trace:	concurrency_limiter_test.go:176
             	            				concurrency_limiter.go:103
             	            				concurrency_limiter_test.go:172
             	            				asm_amd64.s:1373
             	Error:      	Should be true
             	Test:       	TestLimiter/wide-spread
             	Messages:   	Expected the number of concurrent operations (130) to not exceed the maximum concurrency (105)
         concurrency_limiter_test.go:193: 
             	Error Trace:	concurrency_limiter_test.go:193
             	Error:      	Should be true
             	Test:       	TestLimiter/wide-spread
             	Messages:   	Expected maximum concurrency to be in the range [95,105] but got 171
 FAIL

Rather than rely on indeterministic test results due to the race condition, we should fix the underlying problem. The test should never exceed the specified maximum concurrency range. We currently use a range of allowable test results to mitigate the race condition, this should not be needed.

Edited Mar 26, 2020 by Paul Okstad (ex-GitLab)

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information