Atomically refresh branch/tag cache in PostReceive
What does this MR do and why?
Previously if someone two pushes in close succession, there was a chance that the a pipeline would not be created due to race conditions with the branch/tag cache:
- Job 1 expires cache.
- Job 1 starts computing branch list.
- Job 2 starts.
- Job 2 expires cache (no-op because nothing is there).
- Job 1 finishes computing branch list, persists cache.
- Job 2 reads from stale cache instead of loading a fresh branch list.
To fix this scenario, this commit adds a feature flag
post_receive_sync_refresh_cache
that will cause the branch and
tag cache to be refreshed atomatically inside PostReceive.
Relates to #20891 (closed)
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
How to set up and validate locally
The only way I've been able to reproduce and validate this is by injecting artificial delays and using an Omnibus instance with multiple Sidekiq workers. For some reason, the GDK only seems to be able to handle one push at a time, not concurrently. Could be due to only having one Sidekiq process.
- Install GitLab on Omnibus and ensure there are multiple Sidekiq processes pulling from the queue. In
/etc/gitlab/gitlab.rb
, add this line:
sidekiq['queue_groups'] = ['*'] * 4
- Apply this patch to the instance:
cd /opt/gitlab/embedded/service/gitlab-rails/
curl -o /tmp/post-receive.diff https://gitlab.com/gitlab-org/gitlab/-/merge_requests/160946.diff
patch -p1 < /tmp/post-receive.diff
You can ignore the warnings about spec
files not applying.
- Then apply this patch for injecting delays:
diff --git a/app/workers/post_receive.rb b/app/workers/post_receive.rb
index 3480a3b397f7..b475f28539dd 100644
--- a/app/workers/post_receive.rb
+++ b/app/workers/post_receive.rb
@@ -21,6 +21,7 @@ def perform(gl_repository, identifier, changes, push_options = {})
container, project, repo_type = Gitlab::GlRepository.parse(gl_repository)
@project = project
@gl_repository = gl_repository
+ @changes = changes
if container.nil? || (container.is_a?(ProjectSnippet) && project.nil?)
log("Triggered hook for non-existing gl_repository \"#{gl_repository}\"")
@@ -121,6 +122,13 @@ def expire_caches(post_received, repository)
def expire_branch_cache(repository)
repository.expire_branches_cache
+ STDERR.puts "=== #{Time.now} expired branch cache, changes = #{@changes}"
+
+ if @changes.include?("delay-test")
+ STDERR.puts "=== received delay test branch, pausing for 5 seconds to create race condition"
+ sleep 5
+ end
+
# Consider the scenario where multiple pushes happen in close succession:
#
# 1. Job 1 expires cache.
diff --git a/lib/gitlab/git/repository.rb b/lib/gitlab/git/repository.rb
index 754e975fedb0..42285f62ee62 100644
--- a/lib/gitlab/git/repository.rb
+++ b/lib/gitlab/git/repository.rb
@@ -109,6 +109,10 @@ def create_repository(default_branch = nil, object_format: nil)
def branch_names
refs = list_refs([Gitlab::Git::BRANCH_REF_PREFIX])
+ STDERR.puts "=== #{Time.now} called ListRefs RPC finished in #branch_names, sleeping for 10 to delay cache from being filled"
+ sleep 10
+ STDERR.puts "=== #{Time.now} done sleeping in #branch_names"
+
refs.map { |ref| Gitlab::Git.branch_name(ref.name) }
end
I saved this to /tmp/debug.patch
and ran patch -p1 < /tmp/debug.patch
from the same directory.
- Reconfigure and restart Sidekiq (for good measure):
gitlab-ctl reconfigure
gitlab-ctl restart sidekiq
- Now create or use a project that has a
.gitlab-ci.yml
with something simple:
test:
script:
- echo "hello world"
- Clone the project in two different directories:
cd /tmp
git clone git@git.example.com:root/test-project.git test1
git clone git@git.example.com:root/test-project.git test2
- Now watch the logs:
sudo tail -f /var/log/gitlab/sidekiq/current | egrep "===|Reference not found"
- Now run this script:
TEST_NUMBER=1
cd /tmp/test1
git checkout -b test-$TEST_NUMBER
git push -o mr.create -u origin HEAD &
git -C /tmp/test2 checkout -b delay-test-$TEST_NUMBER
sleep 6 # Sleep long enough for the ListRefs RPC to start
git -C /tmp/test2 push -o mr.create -u origin HEAD
- Without the feature flag, you should a
Reference not found
error:
=== 2024-08-01 21:48:11 +0000 expired branch cache, changes = 0000000000000000000000000000000000000000 3eabef3aad20de2223dee166384adac554d805e0 refs/heads/test-1
=== 2024-08-01 21:48:11 +0000 called ListRefs RPC finished in #branch_names, sleeping for 10 to delay cache from being filled
=== 2024-08-01 21:48:11 +0000 called ListRefs RPC finished in #branch_names, sleeping for 10 to delay cache from being filled
=== 2024-08-01 21:48:17 +0000 expired branch cache, changes = 0000000000000000000000000000000000000000 3eabef3aad20de2223dee166384adac554d805e0 refs/heads/delay-test-1
=== received delay test branch, pausing for 5 seconds to create race condition
=== 2024-08-01 21:48:21 +0000 done sleeping in #branch_names
=== 2024-08-01 21:48:21 +0000 done sleeping in #branch_names
{"severity":"WARN","time":"2024-08-01T21:48:22.598Z","class":"Git::BranchHooksService","correlation_id":"01J47YMHDA68A7P7BH1W21SVTE","project_id":51,"project_path":"root/test-orject","message":"Error creating pipeline","errors":"Reference not found","pipeline_params":{"before":"0000000000000000000000000000000000000000","after":"3eabef3aad20de2223dee166384adac554d805e0","ref":"refs/heads/delay-test-1","variables_attributes":[],"checkout_sha":"3eabef3aad20de2223dee166384adac554d805e0"},"retry":0}
- Now run
gitlab-rails console
and enable the feature flag:
Feature.enable(:post_receive_sync_refresh_cache)
- Repeat step 8 with
TEST_NUMBER=2
instead. You should no longer see theReference not found
, and notice the secondListRefs
RPC message is staggered afterdone sleeping
messages:
=== 2024-08-01 21:39:53 +0000 expired branch cache, changes = 0000000000000000000000000000000000000000 3eabef3aad20de2223dee166384adac554d805e0 refs/heads/test-2
=== 2024-08-01 21:39:53 +0000 called ListRefs RPC finished in #branch_names, sleeping for 10 to delay cache from being filled
=== 2024-08-01 21:39:58 +0000 expired branch cache, changes = 0000000000000000000000000000000000000000 3eabef3aad20de2223dee166384adac554d805e0 refs/heads/delay-test-2
=== received delay test branch, pausing for 5 seconds to create race condition
=== 2024-08-01 21:40:03 +0000 done sleeping in #branch_names
=== 2024-08-01 21:40:03 +0000 called ListRefs RPC finished in #branch_names, sleeping for 10 to delay cache from being filled
=== 2024-08-01 21:40:13 +0000 done sleeping in #branch_names