Repository pushes during Elasticsearch indexing cause data to be omitted from the index
Summary
During the upgrade to 11.5 we had to upgrade Elasticsearch and perform a full reindex (and run sudo gitlab-rake gitlab:elastic:clear_index_status). Our indexing took about half a day, and after running sudo gitlab-rake gitlab:elastic:index_repositories_async (following https://docs.gitlab.com/ee/integration/elasticsearch.html) multiple times with various BATCH options (5, and we scaled out Sidekiq), sudo gitlab-rake gitlab:elastic:index_repositories_status finally showed 100%.
However, we found that quite a few commits and blobs are missing from the index. I noticed that the affected repositories are the ones that received changes while they were still pending indexing.
Steps to reproduce
- Create repo A, push commits A1...A100,000
- Create repo B, push commits B1...B10
- Start indexing; A is indexed first
- While A is being indexed, push a commit to B, commit B11
- B11 is indexed (by the Post Receive job, which then triggers the Elastic Commit Indexer)
- A finishes indexing and the indexer moves on to B, but it skips B because the IndexStatus record (updated when B11 was indexed) indicates that B has already been fully indexed.
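The steps above can be modelled with a small, self-contained Ruby sketch. This is a simplified illustration, not GitLab code: FakeProject, FakeIndex, index_push and index_full are hypothetical names, and last_indexed only stands in for the last_commit column of the IndexStatus record.

```ruby
# Hypothetical, simplified model of the race described in the steps above.
class FakeProject
  attr_reader :name, :commits
  attr_accessor :last_indexed # stand-in for IndexStatus#last_commit

  def initialize(name, commits)
    @name = name
    @commits = commits.dup
    @last_indexed = nil
  end

  def head
    commits.last
  end
end

class FakeIndex
  attr_reader :docs

  def initialize
    @docs = []
  end

  # Push-triggered indexing: only the pushed commits are indexed, yet HEAD is
  # recorded as the last indexed commit without checking what came before.
  def index_push(project, new_commits)
    project.commits.concat(new_commits)
    @docs.concat(new_commits)
    project.last_indexed = project.head
  end

  # Bulk indexing pass: a project whose status already points at HEAD is skipped.
  def index_full(project)
    return :skipped if project.last_indexed == project.head

    @docs.concat(project.commits - @docs)
    project.last_indexed = project.head
    :indexed
  end
end

repo_b = FakeProject.new("B", (1..10).map { |i| "B#{i}" })
index = FakeIndex.new

index.index_push(repo_b, ["B11"])  # push arrives while B is still pending
result = index.index_full(repo_b)  # the bulk pass reaches B later
missing = repo_b.commits - index.docs

puts "bulk pass result: #{result}"
puts "missing from index: #{missing.join(', ')}"
```

In this model the bulk pass reports :skipped for B, and B1 through B10 never reach the index, which matches the behavior reported here.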
Example Project
What is the current bug behavior?
Commits B1 through B10 are thus omitted from the Elasticsearch index.
What is the expected correct behavior?
All commits in repository B should have been indexed.
Relevant logs and/or screenshots
None.
Output of checks
Results of GitLab environment info
System information
System: Ubuntu 16.04
Proxy: no
Current User: git
Using RVM: no
Ruby Version: 2.4.5p335
Gem Version: 2.7.6
Bundler Version: 1.16.6
Rake Version: 12.3.1
Redis Version: 3.2.12
Git Version: 2.18.1
Sidekiq Version: 5.2.1
Go Version: unknown

GitLab information
Version: 11.5.0-ee
Revision: cb71fca
Directory: /opt/gitlab/embedded/service/gitlab-rails
DB Adapter: postgresql
DB Version: 9.6.6
URL:
HTTP Clone URL: https:///some-group/some-project.git
SSH Clone URL: :some-group/some-project.git
Elasticsearch: yes
Geo: no
Using LDAP: yes
Using Omniauth: yes
Omniauth Providers:

GitLab Shell
Version: 8.4.1
Repository storage paths:
- default: /var/opt/gitlab/git-data/repositories
Hooks: /opt/gitlab/embedded/service/gitlab-shell/hooks
Git: /opt/gitlab/embedded/bin/git
Results of GitLab application Check
Checking GitLab Shell ...
GitLab Shell version >= 8.4.1 ? ... OK (8.4.1)
hooks directories in repos are links: ... ok/repository is empty
Running /opt/gitlab/embedded/service/gitlab-shell/bin/check
Check GitLab API access: OK
Redis available via internal API: OK
Access to /var/opt/gitlab/.ssh/authorized_keys: OK
gitlab-shell self-check successful
Checking GitLab Shell ... Finished
Checking Gitaly ...
default ... OK
Checking Gitaly ... Finished
Checking Sidekiq ...
Running? ... no # (running on different server)
Try fixing it:
sudo -u git -H RAILS_ENV=production bin/background_jobs start
For more information see:
doc/install/installation.md in section "Install Init Script"
see log/sidekiq.log for possible errors
Please fix the error above and rerun the checks.
Checking Sidekiq ... Finished
Checking Reply by email ...
IMAP server credentials are correct? ... no
Try fixing it:
An error occurred: Errno::ETIMEDOUT: Connection timed out - connect(2) for "10.60.1.208" port 143
Check that the information in config/gitlab.yml is correct
For more information see:
doc/administration/reply_by_email.md
Please fix the error above and rerun the checks.
Init.d configured correctly? ... skipped
MailRoom running? ... skipped
Checking Reply by email ... Finished
Checking LDAP ...
Server: ldapmain
LDAP authentication... Success
LDAP users with access to your GitLab server (only showing the first 100 results)
Checking LDAP ... Finished
Checking GitLab ...
Git configured correctly? ... yes
Database config exists? ... yes
All migrations up? ... yes
Database contains orphaned GroupMembers? ... no
GitLab config exists? ... yes
GitLab config up to date? ... yes
Log directory writable? ... yes
Tmp directory writable? ... yes
Uploads directory exists? ... yes
Uploads directory has correct permissions? ... yes
Uploads directory tmp has correct permissions? ... yes
Init script exists? ... skipped (omnibus-gitlab has no init script)
Init script up-to-date? ... skipped (omnibus-gitlab has no init script)
Projects have namespace: ... yes
Redis version >= 2.8.0? ... yes
Ruby version >= 2.3.5 ? ... yes (2.4.5)
Git version >= 2.9.5 ? ... yes (2.18.1)
Git user has default SSH configuration? ... no # using fast key lookup
Try fixing it:
mkdir ~/gitlab-check-backup-1543458009
sudo mv /var/opt/gitlab/.ssh/id_rsa.pub ~/gitlab-check-backup-1543458009
sudo mv /var/opt/gitlab/.ssh/id_rsa ~/gitlab-check-backup-1543458009
For more information see:
doc/ssh/README.md in section "SSH on the GitLab server"
Please fix the error above and rerun the checks.
Active users: ...
Elasticsearch version 5.1 - 5.5? ... no (5.6.13) # the check script needs updating :smile:
For more information see:
doc/integration/elasticsearch.md
Checking GitLab ... Finished
Possible fixes
I have fixed this locally by monkey-patching Gitlab::Elastic::Indexer with these definitions of run and update_index_status (for 11.5 code):
module FixRace
  def run(from_sha = nil, to_sha = nil)
    to_sha = nil if to_sha == Gitlab::Git::BLANK_SHA
    head_commit = repository.try(:commit)

    if repository.nil? || !repository.exists? || repository.empty? || head_commit.nil?
      update_index_status(from_sha, Gitlab::Git::BLANK_SHA) # Pass the from_sha parameter
      return
    end

    run_indexer!(from_sha, to_sha)
    update_index_status(from_sha, to_sha) # Pass the from_sha parameter

    true
  end

  private

  # Add the from_sha parameter, perform a compare and swap with a database row lock.
  def update_index_status(from_sha, to_sha)
    head_commit = repository.try(:commit)

    # Use the eager-loaded association if available. An index_status should
    # always be created, even if the repository is empty, so we know it's
    # been looked at.
    index_status = project.index_status
    index_status ||=
      begin
        IndexStatus.find_or_create_by(project_id: project.id)
      rescue ActiveRecord::RecordNotUnique
        retry
      end

    # Don't update the index status if we never reached HEAD
    return if head_commit && to_sha && head_commit.sha != to_sha

    sha = head_commit.try(:sha)
    sha ||= Gitlab::Git::BLANK_SHA

    index_status.with_lock do
      # Do not update the indexing status if there is no record of what was
      # indexed and we selected a range: it is possible we did not index the
      # entire repository.
      next if index_status.last_commit.nil? && !from_sha.nil? && from_sha != Gitlab::Git::BLANK_SHA

      index_status.update(last_commit: sha, indexed_at: Time.now)
      project.index_status(true)
    end
  end
end
I then re-indexed after running sudo gitlab-rake gitlab:elastic:clear_index_status, and even with changes made to other repositories, and also for new repositories, the indexing status record is properly updated only when the entire repository is indexed.
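
The decision made inside update_index_status can be restated as a pure function, which makes the cases easier to check in isolation. This is only a sketch under assumptions: record_status? is a hypothetical helper that does not exist in GitLab, BLANK_SHA here is a local stand-in for Gitlab::Git::BLANK_SHA, and the short strings are placeholder SHAs.

```ruby
# Local stand-in for Gitlab::Git::BLANK_SHA (assumed to be forty zeros).
BLANK_SHA = ("0" * 40).freeze

# Hypothetical helper: returns true when the index status may be updated.
# last_commit is what the status record currently holds; from_sha/to_sha are
# the range that was just indexed; head_sha is the repository HEAD (or nil).
def record_status?(last_commit:, from_sha:, to_sha:, head_sha:)
  # Never record progress if the indexed range did not reach HEAD.
  return false if head_sha && to_sha && head_sha != to_sha

  # A range was selected but nothing was recorded before: the earlier
  # history may never have been indexed, so leave the status untouched.
  partial_range = !from_sha.nil? && from_sha != BLANK_SHA
  return false if last_commit.nil? && partial_range

  true
end

# Push of B11 while B is still pending its first full index: not recorded.
p record_status?(last_commit: nil, from_sha: "b10", to_sha: "b11", head_sha: "b11")
# Full index from the beginning of history: recorded.
p record_status?(last_commit: nil, from_sha: nil, to_sha: "b11", head_sha: "b11")
# Incremental push after a full index has been recorded: recorded.
p record_status?(last_commit: "b11", from_sha: "b11", to_sha: "b12", head_sha: "b12")
```

With this rule, the push of B11 in the reproduction steps leaves the status empty, so the later bulk pass still indexes B in full.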