Reduce load caused by `ElasticCommitIndexerWorker`
The Elasticsearch feature on GitLab.com causes us some problems. One of them is that processing commits is quite I/O heavy.
Git push activity currently peaks at 15,000 pushes per hour, so this is the absolute minimum load we need to plan for.
One suggestion by @vsizov is to avoid re-reading commit data on push:
> I've done some research of Sidekiq post_receive job and it looks pretty doable to not read git data again for ES because we could feed it to ES right there. The downside is that we will bring Elasticsearch logic to post_receive job again, this is something we tried to avoid but this time we can think of making it more durable. I mean, the problem with ES should not break the post receive job because it will break whole GitLab instance.
With those caveats in mind, this seems like a reasonable enhancement that should reduce much of the I/O load caused by keeping the indexes up to date.
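To make the "a problem with ES should not break the post receive job" requirement concrete, here is a minimal Ruby sketch. It assumes the commit data is already in memory when post-receive runs. `FakeIndexer` and `PostReceiveSketch` are hypothetical names for illustration only, not existing GitLab classes, and the real job would look quite different.

```ruby
require 'logger'

# Hypothetical stand-in for whatever talks to Elasticsearch.
# Not GitLab's real indexer; it only records what it was given.
class FakeIndexer
  attr_reader :indexed

  def initialize(fail_with: nil)
    @fail_with = fail_with
    @indexed = []
  end

  def index_commits(commits)
    raise @fail_with if @fail_with

    @indexed.concat(commits)
  end
end

# Hypothetical post-receive step: indexing is attempted with the commit
# data we already hold, but any indexing error is contained here.
class PostReceiveSketch
  def initialize(indexer:, logger: Logger.new(nil))
    @indexer = indexer
    @logger = logger
  end

  def process(commits)
    result = { processed: commits.size, indexed: false, retry_later: false }

    begin
      # Feed the already-loaded commits to ES instead of re-reading git data.
      @indexer.index_commits(commits)
      result[:indexed] = true
    rescue StandardError => e
      # An ES failure must not fail the job; flag the push for a later retry.
      @logger.warn("indexing skipped: #{e.message}")
      result[:retry_later] = true
    end

    result
  end
end
```

With a healthy indexer, `process` returns `indexed: true`. With a failing one it still returns normally, with `retry_later: true`, so the rest of post-receive can carry on. How the retry would actually be scheduled is left open here.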
Another thing to consider is whether we could somehow avoid the indexing step for commits that have been seen before. If someone forks gitlab-ce, do we have to process all those commits again? How about if they push a copy that isn't an explicit fork? What happens at the moment?
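As a rough illustration of the "seen before" idea, the sketch below filters a pushed batch down to SHAs that have not been indexed yet. `SeenCommitFilter` is a hypothetical name, and the in-memory `Set` stands in for whatever shared store would really be needed. It also assumes commits could be indexed once per SHA rather than once per project, which may not match how the index is laid out today.

```ruby
require 'set'

# Hypothetical filter that remembers which commit SHAs were already indexed.
# An in-memory Set is used purely for illustration.
class SeenCommitFilter
  def initialize
    @seen = Set.new
  end

  # Returns only the SHAs not seen before, and records them as seen.
  # Set#add? returns nil when the element is already present.
  def unseen(shas)
    shas.select { |sha| @seen.add?(sha) }
  end
end
```

A fork, or a pushed copy that isn't an explicit fork, would share its SHAs with the original, so under these assumptions only the commits that are genuinely new would reach the indexer.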
/cc @vsizov