"ref not found" after new branch pushed
Issue
Customer in internal ZD link has been plagued with intermittent problems associated with newly-pushed branches. We have confirmed that it is not caused by slow sidekiq jobs. Periodically they end up with Merge Requests that claim "source branch not found" despite all other indications on the system showing that the source branch was successfully pushed and processed. The users have found that if they push another commit on the same branch, or push another branch to the same repo, it will usually resolve the problem. We have identified a rails console script that will find existing MRs with this symptom and (if run in this version) it will recover for those existing examples by clearing the branches cache:
MergeRequest.find_each(start: 660000, finish: 666300, batch_size: 50) do |mr|
if !mr.source_branch_exists? and mr.open?
puts "#{mr.project.full_path}, #{mr.iid}, #{mr.source_branch}"
mr.project.repository.expire_branches_cache
check_service = MergeRequests::MergeabilityCheckService.new(mr)
check_service.execute(recheck: true, retry_lease: false)
puts "#{mr.project.full_path}!#{mr.iid} refreshed"
end
end
We identified that we can reliably find repositories that have this condition from this error in the Git::BranchHooksService
sidekiq job, which is kicked off as part of the PostReceive
job when a new branch is received:
{"severity":"WARN","time":"2021-03-24T15:53:16.778Z","class":"Git::BranchHooksService","correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","project_id":43212,"project_path":"sre/repo_name","message":"Error creating pipeline","errors":"Reference not found","pipeline_params":{"before":"0000000000000000000000000000000000000000","after":"451379f2d0c16ec53cfdd5b00bdcbe7cb1476b7e","ref":"refs/heads/branch/path/name","variables_attributes":[],"checkout_sha":"451379f2d0c16ec53cfdd5b00bdcbe7cb1476b7e"},"retry":0}
Using the "Reference not found error" in Git::BranchHooksService
we were able to identify at least one repository where the users were not using merge requests but the repository otherwise showed the same behavior as one identified by searching from the MR perspective: namely, that you can look at the repository in the GitLab UI, see the name of the "missing" branch in the branches list, but then get a 404 when trying to click through to that branch. They have not had users reporting this issue because users are unlikely to poke at the UI to view a branch they just pushed, unless they're creating a merge request.
We believe we've narrowed down where the problem happens to these few lines in the PostReceive
code
expire_caches(post_received, project.repository)
enqueue_project_cache_update(post_received, project)
process_ref_changes(project, user, push_options: push_options, changes: changes)
update_remote_mirrors(post_received, project)
after_project_changes_hooks(project, user, changes.refs, changes.repository_data)
Clearly the first thing we do here is expire the caches, then enqueue_project_cache_update
is just a job that updates the cached project size, so is irrelevant to this situation. process_ref_changes
is the thing that eventually runs the Git::BranchHooksService
job where we first see the missing ref behavior reflected in logs.
The customer has 4 front-end GitLab nodes with repository data mounted via NFS. We have exhaustively double-checked possible inconsistencies across nodes due to NFS timing issues, possibly looking at different redis nodes (they have 3 redis nodes, with sentinel; there have been no redis failovers). GitLab version is 13.7.5 confirmed on all nodes, and redis is 5.0.9. We also confirmed that whenever they identified "bad" repositories with the rails query (without the cache expiration step) they saw the same list of repositories across all GitLab nodes, which of course just checks that all nodes are seeing the same redis cache results.
We were then able to identify one push where we saw all the activity associated with that push executed on the same GitLab node, which would seem to rule out any intra-node inconsistency:
First, the push comes in via http, as seen in production_json.log
:
{"method":"GET","path":"/sre/repo_name.git/info/refs","format":"*/*","controller":"Repositories::GitHttpController","action":"info_refs","status":200,"time":"2021-03-24T15:53:15.650Z","params":[{"key":"service","value":"git-receive-pack"},{"key":"repository_path","value":"sre/repo_name.git"}],"remote_ip":"10.254.100.6","user_id":3950,"username":"username","ua":"git/2.29.2","correlation_id":"01F1JF18REK4P6W35KS3AM89TF","meta.user":"username","meta.project":"sre/repo_name","meta.root_namespace":"sre","meta.caller_id":"Repositories::GitHttpController#info_refs","meta.feature_category":"source_code_management","redis_calls":2,"redis_duration_s":0.000811,"redis_read_bytes":312,"redis_write_bytes":98,"redis_cache_calls":2,"redis_cache_duration_s":0.000811,"redis_cache_read_bytes":312,"redis_cache_write_bytes":98,"db_count":19,"db_write_count":0,"db_cached_count":1,"cpu_s":0.2,"db_duration_s":0.00981,"view_duration_s":0.0002,"duration_s":0.36388}
We see the associated gitaly activity in the gitaly log:
{"correlation_id":"01F1JF18REK4P6W35KS3AM89TF","grpc.code":"OK","grpc.meta.auth_version":"v2","grpc.meta.client_name":"gitlab-workhorse","grpc.meta.deadline_type":"none","grpc.method":"InfoRefsReceivePack","grpc.request.fullMethod":"/gitaly.SmartHTTPService/InfoRefsReceivePack","grpc.request.glProjectPath":"sre/repo_name","grpc.request.glRepository":"project-43212","grpc.request.repoPath":"@hashed/b8/e0/b8e018a79fd712b5c0864804a42459d4c93e53bc1c8fb67642a32a86c6f117a1.git","grpc.request.repoStorage":"default","grpc.request.topLevelGroup":"@hashed","grpc.service":"gitaly.SmartHTTPService","grpc.start_time":"2021-03-24T15:53:15Z","grpc.time_ms":100.718,"level":"info","msg":"finished streaming call with code OK","peer.address":"@","pid":43853,"span.kind":"server","system":"grpc","time":"2021-03-24T15:53:15.752Z"}
Next we see the sidekiq PostReceive
job, with associated Git::BranchHooksService
:
{"severity":"INFO","time":"2021-03-24T15:53:16.668Z","class":"PostReceive","args":["project-43212","user-3950","0000000000000000000000000000000000000000 451379f2d0c16ec53cfdd5b00bdcbe7cb1476b7e refs/heads/branch/path/name\n","{}"],"retry":3,"queue":"post_receive","version":0,"jid":"db03919cc4addcec56315078","created_at":"2021-03-24T15:53:16.662Z","meta.user":"username","meta.project":"sre/repo_name","meta.root_namespace":"sre","meta.caller_id":"/api/:version/internal/post_receive","meta.feature_category":"source_code_management","correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","enqueued_at":"2021-03-24T15:53:16.666Z","pid":43962,"message":"PostReceive JID-db03919cc4addcec56315078: start","job_status":"start","scheduling_latency_s":0.001787}
{"severity":"WARN","time":"2021-03-24T15:53:16.778Z","class":"Git::BranchHooksService","correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","project_id":43212,"project_path":"sre/repo_name","message":"Error creating pipeline","errors":"Reference not found","pipeline_params":{"before":"0000000000000000000000000000000000000000","after":"451379f2d0c16ec53cfdd5b00bdcbe7cb1476b7e","ref":"refs/heads/branch/path/name","variables_attributes":[],"checkout_sha":"451379f2d0c16ec53cfdd5b00bdcbe7cb1476b7e"},"retry":0}
{"severity":"INFO","time":"2021-03-24T15:53:16.825Z","class":"PostReceive","args":["project-43212","user-3950","0000000000000000000000000000000000000000 451379f2d0c16ec53cfdd5b00bdcbe7cb1476b7e refs/heads/branch/path/name\n","{}"],"retry":3,"queue":"post_receive","version":0,"jid":"db03919cc4addcec56315078","created_at":"2021-03-24T15:53:16.662Z","meta.user":"username","meta.project":"sre/repo_name","meta.root_namespace":"sre","meta.caller_id":"/api/:version/internal/post_receive","meta.feature_category":"source_code_management","correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","enqueued_at":"2021-03-24T15:53:16.666Z","pid":43962,"message":"PostReceive JID-db03919cc4addcec56315078: done: 0.15706 sec","job_status":"done","scheduling_latency_s":0.001787,"gitaly_calls":5,"gitaly_duration_s":0.055844,"rugged_calls":1,"rugged_duration_s":0.003038,"redis_calls":42,"redis_duration_s":0.017637,"redis_read_bytes":813,"redis_write_bytes":11267,"redis_cache_calls":25,"redis_cache_duration_s":0.011773,"redis_cache_read_bytes":511,"redis_cache_write_bytes":2554,"redis_queues_calls":6,"redis_queues_duration_s":0.002228,"redis_queues_read_bytes":234,"redis_queues_write_bytes":8124,"redis_shared_state_calls":11,"redis_shared_state_duration_s":0.003636,"redis_shared_state_read_bytes":68,"redis_shared_state_write_bytes":589,"db_count":33,"db_write_count":6,"db_cached_count":0,"duration_s":0.15706,"cpu_s":0.055241,"completed_at":"2021-03-24T15:53:16.825Z","db_duration_s":0.023526}
and the gitaly activity associated with the PostReceive
job, including a FindAllBranchNames
call that gives no errors:
{"correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","grpc.code":"OK","grpc.meta.auth_version":"v2","grpc.meta.client_name":"gitlab-sidekiq","grpc.meta.deadline_type":"unknown","grpc.method":"CommitsBetween","grpc.request.deadline":"2021-03-24T15:54:16Z","grpc.request.fullMethod":"/gitaly.CommitService/CommitsBetween","grpc.request.glProjectPath":"sre/repo_name","grpc.request.glRepository":"project-43212","grpc.request.repoPath":"@hashed/b8/e0/b8e018a79fd712b5c0864804a42459d4c93e53bc1c8fb67642a32a86c6f117a1.git","grpc.request.repoStorage":"default","grpc.request.topLevelGroup":"@hashed","grpc.service":"gitaly.CommitService","grpc.start_time":"2021-03-24T15:53:16Z","grpc.time_ms":25.292,"level":"info","msg":"finished streaming call with code OK","peer.address":"@","pid":43853,"span.kind":"server","system":"grpc","time":"2021-03-24T15:53:16.713Z"}
{"correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","grpc.code":"OK","grpc.meta.auth_version":"v2","grpc.meta.client_name":"gitlab-sidekiq","grpc.meta.deadline_type":"unknown","grpc.method":"FindAllTagNames","grpc.request.deadline":"2021-03-24T15:53:46Z","grpc.request.fullMethod":"/gitaly.RefService/FindAllTagNames","grpc.request.glProjectPath":"sre/repo_name","grpc.request.glRepository":"project-43212","grpc.request.repoPath":"@hashed/b8/e0/b8e018a79fd712b5c0864804a42459d4c93e53bc1c8fb67642a32a86c6f117a1.git","grpc.request.repoStorage":"default","grpc.request.topLevelGroup":"@hashed","grpc.service":"gitaly.RefService","grpc.start_time":"2021-03-24T15:53:16Z","grpc.time_ms":3.806,"level":"info","msg":"finished streaming call with code OK","peer.address":"@","pid":43853,"span.kind":"server","system":"grpc","time":"2021-03-24T15:53:16.762Z"}
{"correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","grpc.code":"OK","grpc.meta.auth_version":"v2","grpc.meta.client_name":"gitlab-sidekiq","grpc.meta.deadline_type":"unknown","grpc.method":"FindAllBranchNames","grpc.request.deadline":"2021-03-24T15:53:46Z","grpc.request.fullMethod":"/gitaly.RefService/FindAllBranchNames","grpc.request.glProjectPath":"sre/repo_name","grpc.request.glRepository":"project-43212","grpc.request.repoPath":"@hashed/b8/e0/b8e018a79fd712b5c0864804a42459d4c93e53bc1c8fb67642a32a86c6f117a1.git","grpc.request.repoStorage":"default","grpc.request.topLevelGroup":"@hashed","grpc.service":"gitaly.RefService","grpc.start_time":"2021-03-24T15:53:16Z","grpc.time_ms":3.777,"level":"info","msg":"finished streaming call with code OK","peer.address":"@","pid":43853,"span.kind":"server","system":"grpc","time":"2021-03-24T15:53:16.776Z"}
{"correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","grpc.code":"OK","grpc.meta.auth_version":"v2","grpc.meta.client_name":"gitlab-sidekiq","grpc.meta.deadline_type":"unknown","grpc.method":"CommitDelta","grpc.request.deadline":"2021-03-24T15:53:46Z","grpc.request.fullMethod":"/gitaly.DiffService/CommitDelta","grpc.request.glProjectPath":"sre/repo_name","grpc.request.glRepository":"project-43212","grpc.request.repoPath":"@hashed/b8/e0/b8e018a79fd712b5c0864804a42459d4c93e53bc1c8fb67642a32a86c6f117a1.git","grpc.request.repoStorage":"default","grpc.request.topLevelGroup":"@hashed","grpc.service":"gitaly.DiffService","grpc.start_time":"2021-03-24T15:53:16Z","grpc.time_ms":16.981,"level":"info","msg":"finished streaming call with code OK","peer.address":"@","pid":43853,"span.kind":"server","system":"grpc","time":"2021-03-24T15:53:16.797Z"}
{"correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","grpc.code":"OK","grpc.meta.auth_version":"v2","grpc.meta.client_name":"gitlab-sidekiq","grpc.meta.deadline_type":"unknown","grpc.method":"FilterShasWithSignatures","grpc.request.deadline":"2021-03-24T15:53:46Z","grpc.request.fullMethod":"/gitaly.CommitService/FilterShasWithSignatures","grpc.request.glProjectPath":"sre/repo_name","grpc.request.glRepository":"project-43212","grpc.request.repoPath":"@hashed/b8/e0/b8e018a79fd712b5c0864804a42459d4c93e53bc1c8fb67642a32a86c6f117a1.git","grpc.request.repoStorage":"default","grpc.request.topLevelGroup":"@hashed","grpc.service":"gitaly.CommitService","grpc.start_time":"2021-03-24T15:53:16Z","grpc.time_ms":0.569,"level":"info","msg":"finished streaming call with code OK","peer.address":"@","pid":43853,"span.kind":"server","system":"grpc","time":"2021-03-24T15:53:16.814Z"}
We have most recently been attempting to reproduce and/or get more information to see if git for-each-ref is reporting any error without returning a failure status, by doing repeated branch pushes to a repo and calling out git for-each-ref calls in loops. That effort has so far proved unfruitful.
Current state
We have a few different areas of investigation open currently:
NFS/Gitaly
See @zj-gitlab's comments below
Race conditions in the cache
@nick.thomas has identified two possible race conditions in the set cache adapter:
- https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/repository_cache_adapter.rb#L67
- https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/repository_set_cache.rb#L39
These should be updated to use atomic Redis methods, though whether they're the cause of the issue or not we don't currently know.
Nick's comment: #326066 (comment 539076174)
Merge request for this is workflowin review : !57607 (merged)
PostReceive
worker
Backup plan: work around the issue in the
This is an attempt to keep forcibly clearing the cache inside the PostReceive
worker until it includes the branches we're looking for. This isn't a "solution" but it is something we want to have ready in case we just need to work around the issue for now.