Resolve "Optimize the s3 storage driver's Walk function"
Allow the s3 driver's walk function to run the WalkFn
in parallel.
Standard concurrently techniques were difficult to apply in this instance due to the recursive nature of Walk, along with its employment of a callback function.
In the event than an error is encountered WalkFn
s that have not yet been called will not be called and new goroutines will be be spawned. This is accomplished by adapting a technique found on the Go Blog Under Explicit Cancellation:
This approach has a problem: each downstream receiver needs to know the number of potentially blocked upstream senders and arrange to signal those senders on early return. Keeping track of these counts is tedious and error-prone.
We need a way to tell an unknown and unbounded number of goroutines to stop sending their values downstream. In Go, we can do this by closing a channel, because a receive operation on a closed channel can always proceed immediately, yielding the element type's zero value.
This means that main can unblock all the senders simply by closing the done channel. This close is effectively a broadcast signal to the senders. We extend each of our pipeline functions to accept done as a parameter and arrange for the close to happen via a defer statement, so that all return paths from main will signal the pipeline stages to exit.
Changes Needed to Accommodate this MR
The changes in this MR require that all functions passed to the storage driver's Walk
method (and derivative functions) must be thread safe. The following changes should be merged and these issues resolved in before this one to insure the stability of the registry:
#4 (closed)
#5 (closed)
#6 (closed)
#7 (closed)
#8 (closed)
#9 (closed)
Current Behavior
The walk behavior of the walk function is mostly accomplished through an non-exported doWalk
function, this function calls the AWS SDK's ListObjectsV2PagesWithContext
which calls the provided callback function on each S3 object for (potentially multiple) paginated requests. Requests are limited by a provided prefix. These objects are then inspected for information about the files and directories they represent, appended to a slice, and iterated over.
For each iteration the callback function is called on the file.
On error, pagination is stopped and the error reported to Walk
.
If the file is a directory, doWalk
is called recursively on this directory, using this directory's prefix in order to descend down the filesystem from this point.
For each page of results, all directories and their subdirectories must be fully walked and each file in the request must be evaluated before the next request is made.
New Behavior
Within the doWalk
function, every object from a page of results from ListObjectsV2PagesWithContext
is evaluated within it's own goroutine, the next page of results is requested as soon as each goroutine has been spawned.
doWalk
has been updated to run in a thread-safe manner. If the file is a directory, doWalk
will be spawned in a new goroutine, and the parent doWalk
is free to exit before its child goroutines it spawns resolve.
Walk
will wait for all goroutines spawned by doWalk
and its recursive calls to resolve before returning. In this way, calls to Walk
remain synchronous, with the exception that the callback function must be thread-safe.
Errors are reported on an error channel. As soon as the first error is encountered, new goroutines exit immediately and doWalk
calls request no further pages. This error is ultimately reported by the Walk
function.