Skip to content

Atomically refresh branch/tag cache in PostReceive

Stan Hu requested to merge sh-post-receive-atomic-branch-tag-sync into master

What does this MR do and why?

Previously if someone two pushes in close succession, there was a chance that the a pipeline would not be created due to race conditions with the branch/tag cache:

  1. Job 1 expires cache.
  2. Job 1 starts computing branch list.
  3. Job 2 starts.
  4. Job 2 expires cache (no-op because nothing is there).
  5. Job 1 finishes computing branch list, persists cache.
  6. Job 2 reads from stale cache instead of loading a fresh branch list.

To fix this scenario, this commit adds a feature flag post_receive_sync_refresh_cache that will cause the branch and tag cache to be refreshed atomatically inside PostReceive.

Relates to #20891 (closed)

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

How to set up and validate locally

The only way I've been able to reproduce and validate this is by injecting artificial delays and using an Omnibus instance with multiple Sidekiq workers. For some reason, the GDK only seems to be able to handle one push at a time, not concurrently. Could be due to only having one Sidekiq process.

  1. Install GitLab on Omnibus and ensure there are multiple Sidekiq processes pulling from the queue. In /etc/gitlab/gitlab.rb, add this line:
sidekiq['queue_groups'] = ['*'] * 4
  1. Apply this patch to the instance:
cd /opt/gitlab/embedded/service/gitlab-rails/
curl -o /tmp/post-receive.diff https://gitlab.com/gitlab-org/gitlab/-/merge_requests/160946.diff
patch -p1 < /tmp/post-receive.diff

You can ignore the warnings about spec files not applying.

  1. Then apply this patch for injecting delays:
diff --git a/app/workers/post_receive.rb b/app/workers/post_receive.rb
index 3480a3b397f7..b475f28539dd 100644
--- a/app/workers/post_receive.rb
+++ b/app/workers/post_receive.rb
@@ -21,6 +21,7 @@ def perform(gl_repository, identifier, changes, push_options = {})
     container, project, repo_type = Gitlab::GlRepository.parse(gl_repository)
     @project = project
     @gl_repository = gl_repository
+    @changes = changes
 
     if container.nil? || (container.is_a?(ProjectSnippet) && project.nil?)
       log("Triggered hook for non-existing gl_repository \"#{gl_repository}\"")
@@ -121,6 +122,13 @@ def expire_caches(post_received, repository)
   def expire_branch_cache(repository)
     repository.expire_branches_cache
 
+    STDERR.puts "=== #{Time.now} expired branch cache, changes = #{@changes}"
+
+    if @changes.include?("delay-test")
+      STDERR.puts "=== received delay test branch, pausing for 5 seconds to create race condition"
+      sleep 5
+    end
+
     # Consider the scenario where multiple pushes happen in close succession:
     #
     # 1. Job 1 expires cache.
diff --git a/lib/gitlab/git/repository.rb b/lib/gitlab/git/repository.rb
index 754e975fedb0..42285f62ee62 100644
--- a/lib/gitlab/git/repository.rb
+++ b/lib/gitlab/git/repository.rb
@@ -109,6 +109,10 @@ def create_repository(default_branch = nil, object_format: nil)
       def branch_names
         refs = list_refs([Gitlab::Git::BRANCH_REF_PREFIX])
 
+        STDERR.puts "=== #{Time.now} called ListRefs RPC finished in #branch_names, sleeping for 10 to delay cache from being filled"
+        sleep 10
+        STDERR.puts "=== #{Time.now} done sleeping in #branch_names"
+
         refs.map { |ref| Gitlab::Git.branch_name(ref.name) }
       end

I saved this to /tmp/debug.patch and ran patch -p1 < /tmp/debug.patch from the same directory.

  1. Reconfigure and restart Sidekiq (for good measure):
gitlab-ctl reconfigure
gitlab-ctl restart sidekiq
  1. Now create or use a project that has a .gitlab-ci.yml with something simple:
test:
  script:
    - echo "hello world"
  1. Clone the project in two different directories:
cd /tmp
git clone git@git.example.com:root/test-project.git test1
git clone git@git.example.com:root/test-project.git test2
  1. Now watch the logs:
sudo tail -f /var/log/gitlab/sidekiq/current | egrep "===|Reference not found"
  1. Now run this script:
TEST_NUMBER=1
cd /tmp/test1
git checkout -b test-$TEST_NUMBER
git push -o mr.create -u origin HEAD &
git -C /tmp/test2 checkout -b delay-test-$TEST_NUMBER
sleep 6 # Sleep long enough for the ListRefs RPC to start
git -C /tmp/test2 push -o mr.create -u origin HEAD
  1. Without the feature flag, you should a Reference not found error:
=== 2024-08-01 21:48:11 +0000 expired branch cache, changes = 0000000000000000000000000000000000000000 3eabef3aad20de2223dee166384adac554d805e0 refs/heads/test-1
=== 2024-08-01 21:48:11 +0000 called ListRefs RPC finished in #branch_names, sleeping for 10 to delay cache from being filled
=== 2024-08-01 21:48:11 +0000 called ListRefs RPC finished in #branch_names, sleeping for 10 to delay cache from being filled
=== 2024-08-01 21:48:17 +0000 expired branch cache, changes = 0000000000000000000000000000000000000000 3eabef3aad20de2223dee166384adac554d805e0 refs/heads/delay-test-1
=== received delay test branch, pausing for 5 seconds to create race condition
=== 2024-08-01 21:48:21 +0000 done sleeping in #branch_names
=== 2024-08-01 21:48:21 +0000 done sleeping in #branch_names
{"severity":"WARN","time":"2024-08-01T21:48:22.598Z","class":"Git::BranchHooksService","correlation_id":"01J47YMHDA68A7P7BH1W21SVTE","project_id":51,"project_path":"root/test-orject","message":"Error creating pipeline","errors":"Reference not found","pipeline_params":{"before":"0000000000000000000000000000000000000000","after":"3eabef3aad20de2223dee166384adac554d805e0","ref":"refs/heads/delay-test-1","variables_attributes":[],"checkout_sha":"3eabef3aad20de2223dee166384adac554d805e0"},"retry":0}
  1. Now run gitlab-rails console and enable the feature flag:
Feature.enable(:post_receive_sync_refresh_cache)
  1. Repeat step 8 with TEST_NUMBER=2 instead. You should no longer see the Reference not found, and notice the second ListRefs RPC message is staggered after done sleeping messages:
=== 2024-08-01 21:39:53 +0000 expired branch cache, changes = 0000000000000000000000000000000000000000 3eabef3aad20de2223dee166384adac554d805e0 refs/heads/test-2
=== 2024-08-01 21:39:53 +0000 called ListRefs RPC finished in #branch_names, sleeping for 10 to delay cache from being filled
=== 2024-08-01 21:39:58 +0000 expired branch cache, changes = 0000000000000000000000000000000000000000 3eabef3aad20de2223dee166384adac554d805e0 refs/heads/delay-test-2
=== received delay test branch, pausing for 5 seconds to create race condition
=== 2024-08-01 21:40:03 +0000 done sleeping in #branch_names
=== 2024-08-01 21:40:03 +0000 called ListRefs RPC finished in #branch_names, sleeping for 10 to delay cache from being filled
=== 2024-08-01 21:40:13 +0000 done sleeping in #branch_names
Edited by Stan Hu

Merge request reports

Loading