# Optimize the garbage collector sweep stage for S3
We're currently looking at optimizing the Container Registry garbage collection algorithm for S3 (gitlab#31071 (closed)).
In this issue we'll investigate possible improvements for the sweep stage.
## Current Behaviour
The GC sweep stage starts from the lists of manifests and blobs that should be deleted. The manifest list is built during the mark stage (recursive path walk), and the blob list is built similarly, but during the sweep stage.

With the lists of manifests and blobs to delete, the GC uses `storage.Vacuum` as an intermediary for the storage driver, removing objects from the backend.
Note: Optimizations around the path walk are being investigated in #3 (closed), so they're out of the scope of this issue. Here we focus exclusively on the delete operation.
## Blobs
- The GC, in `storage.MarkAndSweep`, iterates over the list of blobs to delete and calls `storage.Vacuum.RemoveBlob` for each one, passing the blob digest (e.g. `sha256:f13f9a179e1c60cfe6b7a8b8b983443f43d9c49f81e1e449617234bec10407e6`) as argument.
  - The `storage.Vacuum.RemoveBlob` method parses the blob digest, generates its full path (e.g. `docker/registry/v2/blobs/sha256/f1/f13f9a179e1c60cfe6b7a8b8b983443f43d9c49f81e1e449617234bec10407e6`) and invokes the storage driver `Delete` method.
    - The `storage.driver.s3.Delete` method will then:
      - Send a `ListObjects` request to S3 to obtain the full list of all objects under a blob's path. In this case, a blob's path has a single object within, the `data` file.
      - With the list of objects in hand (a single `data` file), an S3 `DeleteObjects` request is built and sent. During the request, S3 will automatically delete the parent folder `f13f9a179e1c60cfe6b7a8b8b983443f43d9c49f81e1e449617234bec10407e6`, and its parent `f1` if empty.
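The digest-to-path derivation described above can be sketched as follows. This is a minimal illustration, not the registry's actual code: `blobDataPath` is a hypothetical helper name, and the path layout simply mirrors the example path above.

```go
package main

import (
	"fmt"
	"strings"
)

// blobDataPath is a hypothetical helper mirroring what storage.Vacuum.RemoveBlob
// does internally: it parses a digest string like "sha256:f13f..." and builds
// the full path of the blob's data file under the registry root.
func blobDataPath(dgst string) (string, error) {
	parts := strings.SplitN(dgst, ":", 2)
	if len(parts) != 2 || parts[0] == "" || len(parts[1]) < 2 {
		return "", fmt.Errorf("invalid digest: %q", dgst)
	}
	algorithm, hex := parts[0], parts[1]
	// Blobs live under blobs/<algorithm>/<first two hex chars>/<hex>/data.
	return fmt.Sprintf("docker/registry/v2/blobs/%s/%s/%s/data", algorithm, hex[:2], hex), nil
}

func main() {
	p, err := blobDataPath("sha256:f13f9a179e1c60cfe6b7a8b8b983443f43d9c49f81e1e449617234bec10407e6")
	if err != nil {
		panic(err)
	}
	fmt.Println(p)
}
```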
The process described above is synchronous and sequential. Thus, for each blob to delete, we send two HTTP requests to S3 (list and delete) and wait for the response before moving to the next one. So we have `2N` network requests, where `N` is the number of blobs to delete.
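The `2N` cost can be made concrete with a toy model of the sweep loop. The type and function names below are illustrative stand-ins, not the registry's real driver interface; the fake driver just counts requests.

```go
package main

import "fmt"

// countingDriver is a toy stand-in for the S3 storage driver that counts the
// requests the sweep loop would issue. Names here are illustrative only.
type countingDriver struct{ requests int }

func (d *countingDriver) ListObjects(path string) []string {
	d.requests++
	// A blob's path holds a single object, the data file.
	return []string{path + "/data"}
}

func (d *countingDriver) DeleteObjects(keys []string) {
	d.requests++
}

// sweepBlobs mimics the current synchronous loop: one ListObjects plus one
// DeleteObjects per blob, so 2N requests for N blobs.
func sweepBlobs(d *countingDriver, blobPaths []string) {
	for _, p := range blobPaths {
		keys := d.ListObjects(p)
		d.DeleteObjects(keys)
	}
}

func main() {
	d := &countingDriver{}
	sweepBlobs(d, []string{"blobs/sha256/aa/a1", "blobs/sha256/bb/b2", "blobs/sha256/cc/c3"})
	fmt.Println(d.requests) // 2 requests per blob
}
```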
### Bottlenecks and Possible Solutions
- The `storage.driver.s3.Delete` method is generic, i.e. it's designed to work with all kinds of objects. That's why a `ListObjects` request always precedes a `DeleteObjects` request. For blob deletion, we don't need the `ListObjects`; we could delete the `data` file directly. This would cut the required network requests in half.
- We're not making use of the bulk capabilities of the S3 `DeleteObjects` operation when deleting blobs. It supports up to `1000` objects per request (source), which means that we could potentially send a single request instead of `2N` in some scenarios. Again, the `storage.driver.s3.Delete` method is generic. In the blob deletion scenario, it currently does a `DeleteObjects` request with a single element (the `data` file) per blob.
- If using bulk requests isn't viable or enough, we can parallelize the calls to `storage.Vacuum.RemoveBlob` in `storage.MarkAndSweep`, spawning a goroutine per blob deletion.
- This is minor, but in `storage.MarkAndSweep` a blob's digest is converted to a string just to be passed as an argument to `storage.Vacuum.RemoveBlob`, where it's then parsed back into a digest. The `storage.Vacuum.RemoveBlob` method is only used in one other place, the `proxy.NewRegistryPullThroughCache` function, where the same thing happens (digest converted to a string). So we can avoid these conversions.
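The bulk-delete idea hinges on batching keys to respect the `1000`-object limit per `DeleteObjects` request. A minimal sketch of that batching, assuming a hypothetical `chunkKeys` helper rather than the registry's actual implementation:

```go
package main

import "fmt"

// chunkKeys splits a list of object keys into batches of at most batchSize,
// matching the S3 DeleteObjects limit of 1000 keys per request. This is a
// sketch of the batching a bulk-delete optimization would need.
func chunkKeys(keys []string, batchSize int) [][]string {
	var batches [][]string
	for len(keys) > 0 {
		n := batchSize
		if len(keys) < n {
			n = len(keys)
		}
		batches = append(batches, keys[:n])
		keys = keys[n:]
	}
	return batches
}

func main() {
	// 2500 blob data files would need only 3 DeleteObjects requests,
	// instead of 2 * 2500 sequential List+Delete round trips.
	keys := make([]string, 2500)
	for i := range keys {
		keys[i] = fmt.Sprintf("blobs/sha256/%02d/data", i%100)
	}
	fmt.Println(len(chunkKeys(keys, 1000)))
}
```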
The improvements related to blob deletion are being implemented in !31 (merged).
## Manifests
- In `storage.MarkAndSweep`, the garbage collector iterates over the list of manifests to delete and calls `storage.Vacuum.RemoveManifest` for each one, passing the manifest name, digest and tag references as arguments. This method will then do two things:
  - Iterate over the tag references list and:
    - Generate the tag index directory full path (e.g. `docker/registry/v2/repositories/myorg/myproj/myimg/_manifests/tags/index/sha256/82d...057`)
    - Send a `ListObjects` request to S3 to check if the tag index file exists (stat). If so, the storage driver `Delete` method is invoked to delete the tag reference. The `storage.driver.s3.Delete` method will then:
      - Send a `ListObjects` request to S3 to obtain the list of objects under the tag index full path (a second `ListObjects` request for the same object). There is only one, the tag index link file (e.g. `82d...057/link`).
      - With the list of objects in hand (only one), an S3 `DeleteObjects` request is built and sent.
  - Build the manifest revision directory full path (e.g. `docker/registry/v2/repositories/myorg/myproj/myimg/_manifests/revisions/sha256/82d...057`) and invoke the storage driver `Delete` method to delete the manifest revision. The `storage.driver.s3.Delete` method will then:
    - Send a `ListObjects` request to S3 to obtain the list of objects under the manifest revision full path. There is only one, the manifest revision link file (e.g. `82d...057/link`).
    - With the list of objects in hand (only one), an S3 `DeleteObjects` request is built and sent.
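The two directory paths involved above can be sketched with a pair of helpers. These are hypothetical names, and the layout simply mirrors the example paths in this section (repository `myorg/myproj/myimg`, truncated digest kept as-is):

```go
package main

import "fmt"

// tagIndexPath builds the tag index directory path for a repository and a
// manifest digest, mirroring the example path above. Illustrative only.
func tagIndexPath(repo, hexDigest string) string {
	return fmt.Sprintf("docker/registry/v2/repositories/%s/_manifests/tags/index/sha256/%s", repo, hexDigest)
}

// manifestRevisionPath builds the manifest revision directory path for a
// repository and a manifest digest. Illustrative only.
func manifestRevisionPath(repo, hexDigest string) string {
	return fmt.Sprintf("docker/registry/v2/repositories/%s/_manifests/revisions/sha256/%s", repo, hexDigest)
}

func main() {
	fmt.Println(tagIndexPath("myorg/myproj/myimg", "82d...057"))
	fmt.Println(manifestRevisionPath("myorg/myproj/myimg", "82d...057"))
}
```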
The process described above is synchronous and sequential. For each manifest to delete, the GC always sends at least two HTTP requests to S3 (`ListObjects` and `DeleteObjects`). That's `2N` network requests, where `N` is the number of manifests to delete. On top of that, for each manifest we can have another `2M` requests (`ListObjects` and `DeleteObjects`) to delete its tag index files (if any), where `M` is the number of tag index files for that manifest.
### Bottlenecks and Possible Solutions
- The bottlenecks identified for blobs also apply to manifests.
- We can potentially bundle the deletion of the manifest revision and tag index files in a single `DeleteObjects` request.
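The bundling idea amounts to collecting the revision link and all tag index links into one key list, so a single `DeleteObjects` request replaces several List+Delete pairs. A sketch under that assumption (`bundleManifestKeys` is a hypothetical name; paths mirror the examples above):

```go
package main

import "fmt"

// bundleManifestKeys collects the manifest revision link file and every tag
// index link file into a single key list, so one DeleteObjects request can
// remove them all. Illustrative sketch, not the registry's actual code.
func bundleManifestKeys(revisionPath string, tagIndexPaths []string) []string {
	keys := []string{revisionPath + "/link"}
	for _, p := range tagIndexPaths {
		keys = append(keys, p+"/link")
	}
	return keys
}

func main() {
	keys := bundleManifestKeys(
		"docker/registry/v2/repositories/myorg/myproj/myimg/_manifests/revisions/sha256/82d...057",
		[]string{"docker/registry/v2/repositories/myorg/myproj/myimg/_manifests/tags/index/sha256/82d...057"},
	)
	// Two link files, one DeleteObjects request.
	fmt.Println(len(keys))
}
```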
The improvements related to manifest deletion are being implemented in !39 (merged).