[Container Repository] The sync fails (at first) and then succeeds on the next retry when blobs are too large

Problem

The container_registry_token_expire_delay parameter defaults to 5 minutes, and some blobs cannot be downloaded within that time. When that happens, the next blob (or manifest) download in the batch fails because the token has already expired. The registry record is marked as 'failed', so it is retried immediately, and in most cases the retry downloads successfully. In theory, this can cause problems for images that contain several extra-large blobs, because several retries are needed before the image syncs successfully. Because the retry delay increases progressively, such an image can take a very long time to sync.

How to find out that it affects you

You will find a log message like this:

2022-12-05_20:04:19.35550 time="2022-12-05T20:04:19.355Z" level=info msg="token not to be used after 2022-12-05 19:54:07 +0000 UTC - currently 2022-12-05 20:04:19.350129808 +0000 UTC m=+260696.975772286"
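To get a rough sense of how far past its expiry the token was, the two timestamps in that message can be compared. The following is only a sketch: the function name token_overrun is made up for illustration, and it assumes the message always has the shape shown above, with both timestamps in UTC (+0000). Fractional seconds are dropped.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Matches Go-style timestamps such as "2022-12-05 20:04:19.350129808 +0000 UTC".
# Assumption: both timestamps in the message are in UTC (+0000).
TIMESTAMP = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(?:\.\d+)? \+0000 UTC"
PATTERN = re.compile(
    r"token not to be used after " + TIMESTAMP + r" - currently " + TIMESTAMP
)


def token_overrun(log_line: str) -> Optional[timedelta]:
    """Return how long the token had been expired, or None if the line does not match."""
    match = PATTERN.search(log_line)
    if match is None:
        return None
    fmt = "%Y-%m-%d %H:%M:%S"
    expired_at = datetime.strptime(match.group(1), fmt)
    seen_at = datetime.strptime(match.group(2), fmt)
    return seen_at - expired_at
```

For the message above, the token had been expired for a little over 10 minutes by the time the request was made.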

Workaround

Increase the value of container_registry_token_expire_delay.
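One way to change the value might be the application settings API. The sketch below only builds the request and does not send it. It assumes that container_registry_token_expire_delay is exposed as an application setting (in minutes) at PUT /api/v4/application/settings and that an administrator access token is available. The function name, the host, and the value of 30 are illustrative and not taken from this issue.

```python
import urllib.parse
import urllib.request


def build_expire_delay_request(
    base_url: str, admin_token: str, minutes: int
) -> urllib.request.Request:
    """Build (but do not send) a PUT request that sets the token expiry delay."""
    body = urllib.parse.urlencode(
        {"container_registry_token_expire_delay": minutes}
    ).encode()
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v4/application/settings",
        data=body,
        method="PUT",
        # Assumption: an administrator personal access token is required.
        headers={"PRIVATE-TOKEN": admin_token},
    )


# Example usage (hypothetical host and token; uncomment to actually send):
# request = build_expire_delay_request("https://gitlab.example.com", "<admin-token>", 30)
# urllib.request.urlopen(request)
```

A suitable value depends on how long the largest blobs take to download, so 30 is only a placeholder.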

Suggestion

I think the sync should catch the "token expired" response during the HTTP request and simply renew the token before retrying the blob download.
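The idea could look roughly like the sketch below. It is not the actual sync code: TokenExpired, download_with_renewal, and the two callables are hypothetical names used to illustrate the renew-and-retry flow, and how an expired token is actually detected in the registry response is left to the caller.

```python
from typing import Callable, Tuple


class TokenExpired(Exception):
    """Hypothetical error raised when the registry rejects an expired token."""


def download_with_renewal(
    download: Callable[[str], bytes],
    request_token: Callable[[], str],
    token: str,
    max_renewals: int = 1,
) -> Tuple[bytes, str]:
    """Download one blob, renewing the token if it expired in the meantime.

    Returns the blob and the token that was used last, so the caller can
    reuse it for the next blob in the batch.
    """
    renewals = 0
    while True:
        try:
            return download(token), token
        except TokenExpired:
            if renewals >= max_renewals:
                raise  # Give up so the existing retry logic can take over.
            token = request_token()
            renewals += 1
```

Limiting the number of renewals keeps a persistently failing download from looping forever. After the limit is reached, the error propagates and the record can still be marked as 'failed' as it is today.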

Edited Dec 06, 2022 by Valery Sizov