Use semaphore to limit concurrent walks for WalkFallbackParallel
Rationale
During testing of the GCS driver garbage collection of a large (10TB) registry, most runs produced multiple instances of the following error from enumerating blobs in the mark phase:
Get "<BLOB_PATH>%2F&prettyPrint=false&projection=full&versions=false": http2: client conn not usable
Looking at the net/http package in the standard library, we see that (currently) this error is only returned here: https://github.com/golang/go/blob/22688f740dbbae281c1de09c2b4fe6520337a124/src/net/http/h2_bundle.go#L7654
Following the code backwards a bit suggests that our issue is caused either by too many connections being opened at one time, or by a connection sitting idle for too long. This indicates that the Limiter, which reduces the rate of GCS's List and Stat calls within WalkFallbackParallel, is not effectively limiting the driver under these circumstances.
Solution
This MR modifies WalkFallbackParallel to use a semaphore to limit the number of active goroutines at a single time.
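As a rough illustration of the idea (not the code in this MR), the sketch below bounds a recursive, parallel walk with a counting semaphore built from a buffered channel. The names node, buildTree and walkLimited are hypothetical, and the real driver may use a different semaphore implementation. Note that only the number of in-flight visits is bounded here; goroutines waiting for a slot still exist, which fits the table in the Results section showing goroutine counts above the concurrency limit.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// node is a hypothetical stand-in for a directory in the storage backend.
type node struct {
	children []*node
}

// buildTree returns a tree of the given depth in which every inner node
// has fanout children.
func buildTree(depth, fanout int) *node {
	n := &node{}
	if depth <= 1 {
		return n
	}
	for i := 0; i < fanout; i++ {
		n.children = append(n.children, buildTree(depth-1, fanout))
	}
	return n
}

// walkLimited visits every node, allowing at most maxConcurrency visits to
// run at once. It returns the number of nodes visited and the highest
// number of visits observed in flight.
func walkLimited(root *node, maxConcurrency int, visit func(*node)) (visited, peak int64) {
	sem := make(chan struct{}, maxConcurrency) // counting semaphore
	var wg sync.WaitGroup
	var inFlight int64

	var walk func(n *node)
	walk = func(n *node) {
		defer wg.Done()

		sem <- struct{}{} // acquire a slot; blocks while the limit is reached
		cur := atomic.AddInt64(&inFlight, 1)
		for { // record the highest in-flight count seen so far
			p := atomic.LoadInt64(&peak)
			if cur <= p || atomic.CompareAndSwapInt64(&peak, p, cur) {
				break
			}
		}
		visit(n)
		atomic.AddInt64(&visited, 1)
		atomic.AddInt64(&inFlight, -1)
		<-sem // release before spawning children, so no slot is held while they wait

		for _, c := range n.children {
			wg.Add(1)
			go walk(c)
		}
	}

	wg.Add(1)
	go walk(root)
	wg.Wait()
	return visited, peak
}

func main() {
	root := buildTree(4, 3) // 1 + 3 + 9 + 27 = 40 nodes
	visited, peak := walkLimited(root, 5, func(*node) { time.Sleep(time.Millisecond) })
	fmt.Printf("visited=%d peak=%d (limit 5)\n", visited, peak)
}
```

The slot is released before the children are spawned; holding it across the recursion would let parents starve their own children of slots.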
Results
These are the last lines from several runs at different concurrency limits (with the least relevant fields removed for brevity):
==> gcs-with-fixes-25-max.log <==
msg="mark stage complete" blobs_marked=163088 blobs_to_delete=139044 duration_s=2018.351790469 go.version=go1.14 manifests_to_delete=20970 storage_use_estimate_bytes=4610263066303
==> gcs-with-fixes-50-max.log <==
msg="mark stage complete" blobs_marked=163088 blobs_to_delete=139044 duration_s=1314.63886816 go.version=go1.14 manifests_to_delete=20970 storage_use_estimate_bytes=4610263066303
==> gcs-with-fixes-100-max.log <==
msg="mark stage complete" blobs_marked=163088 blobs_to_delete=139044 duration_s=1001.826073841 go.version=go1.14 manifests_to_delete=20970 storage_use_estimate_bytes=4610263066303
==> gcs-with-fixes-200-max.log <==
msg="mark stage complete" blobs_marked=163088 blobs_to_delete=139044 duration_s=942.574678945 go.version=go1.14 manifests_to_delete=20970 storage_use_estimate_bytes=4610263066303
==> gcs-with-fixes-400-max.log <==
msg="mark stage complete" blobs_marked=163088 blobs_to_delete=139044 duration_s=952.659678137 go.version=go1.14 manifests_to_delete=20970 storage_use_estimate_bytes=4610263066303
This table summarizes the findings overall. For memory consumption, these figures are mostly representative, as memory tends to climb to a peak and ease up only slightly during the course of the run. For Max Goroutines, these data represent short-lived spikes in usage with typical goroutine usage being somewhat lower.
Max Concurrency | Approx. Completion Time in Minutes | Max Goroutines | Max Memory Usage |
---|---|---|---|
25 | 33 | 557 | 2 GiB |
50 | 22 | 740 | 2.5 GiB |
100 | 16.5 | 824 | 3 GiB |
200 | 15.7 | 1285 | 3.7 GiB |
400 | 15.8 | 1716 | 4 GiB |
Looking at the data, we observe diminishing returns after 100 max concurrency, with 400 max concurrency being worse than 200 max concurrency in all measures.
Weaknesses
The pre-existing GCS parameter maxconcurrency is co-opted for the walk as well. It is possible users may wish to vary this independently, but this approach was chosen so that we can break it out into a new parameter later if we receive requests for it. Additionally, maxConcurrency seems to be intended as a general concurrency option, as the comment above the field in the gcs driverParameters indicates:
// maxConcurrency limits the number of concurrent driver operations
// to GCS, which ultimately increases reliability of many simultaneous
// pushes by ensuring we aren't DoSing our own server with many
// connections.
This MR increases the complexity of the doWalkFallbackParallel function, mostly due to the inability to use defer to release the semaphore. Generous comments have been added to address this, and the clean-up and error handling code follows a pattern commonly found in code written in the C programming language.
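To make the defer constraint concrete, here is a minimal sketch assuming the C-style pattern referred to is a single goto clean-up label; that reading is an assumption, and listThenRecurse, errStat and failAt are hypothetical names rather than anything from the driver. The slot has to be released before the function recurses, which is earlier than a defer would run, so every path is funnelled through one release point instead.

```go
package main

import (
	"errors"
	"fmt"
)

// errStat is a hypothetical error used to exercise the failure path.
var errStat = errors.New("stat failed")

// listThenRecurse simulates listing a path while holding a semaphore slot,
// then recursing into its children after the slot has been released.
// It returns the number of paths visited. failAt names a path whose
// simulated backend call fails.
func listThenRecurse(sem chan struct{}, path string, depth int, failAt string) (int, error) {
	// Declared up front: goto may not jump over variable declarations.
	var (
		err      error
		children []string
		total    = 1
	)

	sem <- struct{}{} // acquire a slot for the backend calls only

	if path == failAt {
		err = errStat
		goto release
	}
	if depth > 0 {
		children = []string{path + "/a", path + "/b"}
	}

release:
	// Success and failure both pass through here exactly once, so the slot
	// is always returned before we recurse or return. A deferred release
	// would hold the slot across the recursion below and, with a limit of
	// 1, block forever on the first child.
	<-sem

	if err != nil {
		return 0, err
	}
	for _, c := range children {
		n, err := listThenRecurse(sem, c, depth-1, failAt)
		if err != nil {
			return 0, err
		}
		total += n
	}
	return total, nil
}

func main() {
	sem := make(chan struct{}, 1)
	n, err := listThenRecurse(sem, "root", 2, "")
	fmt.Println(n, err, len(sem))
	n, err = listThenRecurse(sem, "root", 2, "root/a")
	fmt.Println(n, err, len(sem))
}
```

The sketch recurses sequentially to stay short; the same release-before-recursing ordering applies when the children are walked in goroutines.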