Race condition finding or creating a container repository by path
Summary
A race condition occurs when finding or creating a container repository by path while the repository does not yet exist.
- Follow-up from Investigate: Intermittent failure on push to co... (#404326 - closed)
- Seems to happen for large layers, although I haven't confirmed whether it also happens for small ones.
- Can reproduce locally
- Reported on GitLab versions 14.10.5 and 16.3.
- Cannot reproduce on GitLab.com
- This was first addressed by Fix registry race condition (!75483 - merged) but there are still some cases when this happens.
Some ideas:
- For small database instances, `ensure_container_repository!` reaches `find_by_path!` too quickly.
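The suspected failure mode can be sketched as a deterministic simulation in plain Ruby (illustrative only, not GitLab code; the constants and method below are hypothetical stand-ins). Under READ COMMITTED, a transaction that hits a unique violation and retries a lookup only sees committed rows, so the lookup can still come up empty while the concurrent insert is uncommitted:

```ruby
# Simulation of the visibility gap behind the race (not GitLab code;
# names are illustrative). Transaction A has inserted the row but not
# yet committed; transaction B hits a unique violation on its own
# insert, retries with a read of committed data, and finds nothing --
# the analogue of find_by_path! raising ActiveRecord::RecordNotFound.

RecordNotFound = Class.new(StandardError)

COMMITTED   = {}                                    # rows visible to other transactions
UNCOMMITTED = { "root/project/unknown2" => :row }   # A's insert, not yet committed

def find_by_path!(path)
  COMMITTED.fetch(path) { raise RecordNotFound, "Couldn't find ContainerRepository" }
end

# B's retry after its INSERT raised a unique violation: A still hasn't
# committed, so the row is invisible and the lookup fails.
begin
  find_by_path!("root/project/unknown2")
rescue RecordNotFound => e
  puts "retry fails: #{e.message}"
end

# Once A commits, the same lookup succeeds.
COMMITTED.merge!(UNCOMMITTED)
puts find_by_path!("root/project/unknown2")  # prints "row"
```

This matches the symptom that an immediate retry of the whole push succeeds: by then the first transaction has committed.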
Steps to reproduce
This can be reproduced locally using Podman and the container registry.
- Install Podman: https://podman.io/docs/installation
- Enable the registry with authentication.
- Optional: configure an insecure registry with Podman (https://podman.io/docs/installation#registriesconf):
```
podman machine ssh
sudo vi /etc/containers/registries.conf

## add the following lines
[[registry]]
location="registry.test:5000"
insecure=true
```
- Build an image with a few fairly large layers in it (sample `Dockerfile` below). Remember to tag the image with a different path for every request. The aim is to trigger `find_by_path!` inside `find_or_create_from_path`:
```dockerfile
FROM alpine:latest
RUN dd if=/dev/urandom of=1M bs=100M count=1 "iflag=fullblock"
RUN dd if=/dev/urandom of=2M bs=300M count=1 "iflag=fullblock"
RUN dd if=/dev/urandom of=3M bs=100M count=1 "iflag=fullblock"
RUN dd if=/dev/urandom of=4M bs=100M count=1 "iflag=fullblock"
RUN dd if=/dev/urandom of=5M bs=200M count=1 "iflag=fullblock"
RUN dd if=/dev/urandom of=6M bs=100M count=1 "iflag=fullblock"
RUN dd if=/dev/urandom of=7M bs=100M count=1 "iflag=fullblock"
RUN dd if=/dev/urandom of=8M bs=100M count=1 "iflag=fullblock"
RUN dd if=/dev/urandom of=9M bs=700M count=1 "iflag=fullblock"
RUN dd if=/dev/urandom of=10M bs=100M count=1 "iflag=fullblock"
```
- Push the image to the registry and expect the following error:
```
podman push registry.test:5000/root/project/unknown2:latest
Getting image source signatures
Copying blob a3c15c74a5f3 done |
Copying blob 2d29c494279d done |
Copying blob 80150619a846 done |
Copying blob b2669a77b6d4 done |
Copying blob dbfa4f640bf1 done |
Copying blob 507bc4814517 done |
Copying blob 072eb7954b0f done |
Copying blob 6c720da2a9cd done |
Copying blob 7b8b5191c1b9 done |
Error: trying to reuse blob sha256:5f4d9fc4d98de91820d2a9c81e501c8cc6429bc8758b43fcb2cd50f4cab9a324 at destination: Requesting bearer token: invalid status code from registry 404 (Not Found)
```
Example Project
Reported in #404326 (closed) on self-managed installations of GitLab 14.10.5 and 16.3.
I have not been able to reproduce on GitLab.com yet.
What is the current bug behavior?
Pushing a new container repository fails when the `ContainerRepository` record does not yet exist. Pushing again succeeds, because by then the `ContainerRepository` has been created.
What is the expected correct behavior?
No race condition, regardless of how many times `ensure_container_repository!` is called concurrently.
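The desired property can be sketched with an in-memory stand-in (hypothetical names, not GitLab's implementation): when the check-and-insert is atomic, every concurrent caller gets the same record and none observes "not found":

```ruby
require "thread" # Mutex (stdlib)

# Minimal thread-safe find-or-create sketch (illustrative, not GitLab
# code). The mutex makes the check-and-insert atomic, so concurrent
# callers always converge on a single record.
class RepositoryStore
  def initialize
    @records = {}
    @lock = Mutex.new
  end

  def find_or_create_from_path(path)
    @lock.synchronize { @records[path] ||= { path: path } }
  end

  def size
    @lock.synchronize { @records.size }
  end
end

store = RepositoryStore.new

# 20 concurrent callers racing on the same path; all succeed and all
# receive the one record that was created.
results = Array.new(20) { Thread.new { store.find_or_create_from_path("root/project/app") } }.map(&:value)

puts results.uniq.length  # prints 1
puts store.size           # prints 1
```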
Relevant logs and/or screenshots
podman push output
```
Error: trying to reuse blob sha256:5f4d9fc4d98de91820d2a9c81e501c8cc6429bc8758b43fcb2cd50f4cab9a324 at destination: Requesting bearer token: invalid status code from registry 404 (Not Found)
```
exception log
```
ActiveRecord::RecordNotFound (Couldn't find ContainerRepository with [WHERE "container_repositories"."project_id" = $1 AND "container_repositories"."name" = $2]):
app/models/container_repository.rb:622:in `find_by_path!'
app/models/container_repository.rb:614:in `find_or_create_from_path'
app/services/auth/container_registry_authentication_service.rb:229:in `ensure_container_repository!'
app/services/auth/container_registry_authentication_service.rb:201:in `process_repository_access'
app/services/auth/container_registry_authentication_service.rb:170:in `process_scope'
app/services/auth/container_registry_authentication_service.rb:157:in `block in scopes'
app/services/auth/container_registry_authentication_service.rb:156:in `map'
app/services/auth/container_registry_authentication_service.rb:156:in `scopes'
app/services/auth/container_registry_authentication_service.rb:28:in `execute'
ee/app/services/ee/auth/container_registry_authentication_service.rb:12:in `execute'
```
Possible fixes
See #428115 (comment 1609807228)
TL;DR
```ruby
def self.find_or_create_from_path(path)
  record = safe_find_or_create_by!(
    project: path.repository_project,
    name: path.repository_name
  )
  return record if record.persisted?

  # Poll for up to one second so a concurrent transaction that created
  # the record has a chance to commit and become visible.
  deadline = Time.zone.now + 1.second
  while Time.zone.now < deadline
    container = find_by_path(path)
    return container if container
  end

  nil
rescue ActiveRecord::RecordNotUnique
  find_by_path(path)
end
```
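The bounded retry at the heart of this fix can be isolated as a small helper (hypothetical, for illustration only): poll a lookup until it returns a value or a deadline passes, instead of failing on the first miss. This uses a monotonic clock and plain floats rather than ActiveSupport durations:

```ruby
# Hypothetical helper isolating the bounded-retry pattern: repeat a
# lookup until it yields a truthy value, or give up (return nil) once
# the deadline passes.
def poll_until(timeout_seconds, interval: 0.01)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout_seconds
  loop do
    result = yield
    return result if result
    return nil if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
end

# A value that only becomes visible after a short delay, standing in
# for a concurrent transaction committing mid-poll.
visible_after = Process.clock_gettime(Process::CLOCK_MONOTONIC) + 0.05
record = poll_until(1.0) do
  :found if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= visible_after
end

puts record                            # prints "found"
puts poll_until(0.05) { nil }.inspect  # prints "nil"
```

One design trade-off worth noting: polling caps the extra latency at the timeout but still returns `nil` if the other transaction takes longer than a second to commit, so callers must still handle the not-found case.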