Use semaphore to limit concurrent walks for WalkFallbackParallel

Hayley Swimelar requested to merge limit-walkfallbackparallel into release/2.8-gitlab

Rationale

During testing of the GCS driver garbage collection of a large (10TB) registry, most runs produced multiple instances of the following error from enumerating blobs in the mark phase:

Get "<BLOB_PATH>%2F&prettyPrint=false&projection=full&versions=false": http2: client conn not usable

Looking at the net/http package in the standard library, we see that (currently) this error is only returned here: https://github.com/golang/go/blob/22688f740dbbae281c1de09c2b4fe6520337a124/src/net/http/h2_bundle.go#L7654

Following the code backwards a bit suggests that our issue is caused either by too many connections being opened at once, or by a connection sitting idle for too long. This indicates that the Limiter, which reduces the rate of GCS's List and Stat calls within WalkFallbackParallel, is not effectively limiting the driver under these circumstances.

Solution

This MR modifies WalkFallbackParallel to use a semaphore to limit the number of goroutines active at any one time.

Results

These are the last lines from several runs at different concurrency limits (with the least relevant fields removed for brevity):

==> gcs-with-fixes-25-max.log <==
msg="mark stage complete" blobs_marked=163088 blobs_to_delete=139044 duration_s=2018.351790469 go.version=go1.14 manifests_to_delete=20970 storage_use_estimate_bytes=4610263066303

==> gcs-with-fixes-50-max.log <==
msg="mark stage complete" blobs_marked=163088 blobs_to_delete=139044 duration_s=1314.63886816 go.version=go1.14 manifests_to_delete=20970 storage_use_estimate_bytes=4610263066303

==> gcs-with-fixes-100-max.log <==
msg="mark stage complete" blobs_marked=163088 blobs_to_delete=139044 duration_s=1001.826073841 go.version=go1.14 manifests_to_delete=20970 storage_use_estimate_bytes=4610263066303

==> gcs-with-fixes-200-max.log <==
msg="mark stage complete" blobs_marked=163088 blobs_to_delete=139044 duration_s=942.574678945 go.version=go1.14 manifests_to_delete=20970 storage_use_estimate_bytes=4610263066303

==> gcs-with-fixes-400-max.log <==
msg="mark stage complete" blobs_marked=163088 blobs_to_delete=139044 duration_s=952.659678137 go.version=go1.14 manifests_to_delete=20970 storage_use_estimate_bytes=4610263066303

This table summarizes the findings overall. For memory consumption, these figures are mostly representative, as memory tends to climb to a peak and ease up only slightly during the course of the run. For Max Goroutines, these data represent short-lived spikes in usage with typical goroutine usage being somewhat lower.

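| Max Concurrency | Approx. Completion Time (minutes) | Max Goroutines | Max Memory Usage |
|-----------------|-----------------------------------|----------------|------------------|
| 25              | 33                                | 557            | 2 GiB            |
| 50              | 22                                | 740            | 2.5 GiB          |
| 100             | 16.5                              | 824            | 3 GiB            |
| 200             | 15.7                              | 1285           | 3.7 GiB          |
| 400             | 15.8                              | 1716           | 4 GiB            |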

Looking at the data, we observe diminishing returns after 100 max concurrency, with 400 max concurrency being worse than 200 max concurrency in all measures.
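The step-by-step gain between adjacent limits can be checked directly from the duration_s values in the log lines above. The following is a small sketch of that arithmetic; the `run` type and `speedups` function are names introduced here for illustration.

```go
package main

import "fmt"

// run pairs a concurrency limit with the duration_s value reported
// in the corresponding "mark stage complete" log line.
type run struct {
	maxConcurrency int
	durationS      float64
}

// runs holds the figures copied from the log excerpts in this MR.
var runs = []run{
	{25, 2018.351790469},
	{50, 1314.63886816},
	{100, 1001.826073841},
	{200, 942.574678945},
	{400, 952.659678137},
}

// speedups returns, for each run after the first, how many times faster
// it completed than the run before it. A value below 1 means the higher
// limit was slower.
func speedups(rs []run) []float64 {
	out := make([]float64, 0, len(rs))
	for i := 1; i < len(rs); i++ {
		out = append(out, rs[i-1].durationS/rs[i].durationS)
	}
	return out
}

func main() {
	for _, r := range runs {
		fmt.Printf("max=%d: %.1f min\n", r.maxConcurrency, r.durationS/60)
	}
	for i, s := range speedups(runs) {
		fmt.Printf("%d -> %d: %.2fx\n", runs[i].maxConcurrency, runs[i+1].maxConcurrency, s)
	}
}
```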

Raw Data:

gcs-with-fixes-25-max-stats

gcs-with-fixes-50-max-stats

gcs-with-fixes-100-max-stats

gcs-with-fixes-200-max-stats

gcs-with-fixes-400-max-stats

Weaknesses

The pre-existing GCS parameter maxconcurrency is co-opted for the walk as well. It is possible that users may wish to vary this independently, but this approach was chosen so that we can break it out into a new parameter later if we do receive requests for this. Additionally, maxConcurrency seems to be intended as a general concurrency option, as the comment above the field in the gcs driverParameters indicates:

// maxConcurrency limits the number of concurrent driver operations
// to GCS, which ultimately increases reliability of many simultaneous
// pushes by ensuring we aren't DoSing our own server with many
// connections.

This MR increases the complexity of the doWalkFallbackParallel function, mostly due to the inability to use defer to release the semaphore. Generous comments are added to address this, and the clean-up and error handling code follows a pattern commonly found in code written in the C programming language.
