Skip to content

repository: Fix ReplicateRepository returning before RPCs have finished

Patrick Steinhardt requested to merge pks-replicate-repository-cancellation into master

When calling ReplicateRepository(), we try to both synchronize the gitattributes file as well as the actual repository contents. This is being done in parallel via the errgroup package, which allows us to run multiple functions in parallel and then retrieve the first error that was returned by any of them.

One feature that is not obvious is that errgroup.WithContext() returns a cancellable context which gets cancelled as soon as any of the invoked functions returns an error. We're using this feature right now, which means that if either of the synchronizing functions fails, the other one will get cancelled.

While it sounds like a logical thing to do, it does have problems in combination with the gRPC calls. If the context of a gRPC call gets cancelled, then the call will get aborted. But while there is no official documentation proving this, it seems like cancellation happens asynchronously. So while we're cancelling the remote call, we'll return even though the remote side hasn't yet been cancelled. Which effectively means that it may still be running on the remote server after ReplicateRepository() has returned on the local client.

There doesn't seem to be any way to synchronize on the remote cancellation, so what this commit instead does is to simply not pass the cancellable context of the errgroup to those functions. This has the downside of not cancelling RPCs even if we know that they'll fail, or that we don't care for their result. But on the other hand, it fixes above race where we may modify the repository even after the RPC has returned. This also fixes a test race in "TestReplicate/BadRepository".

Merge request reports

Loading