Projects getting stuck indexing forever and using lots of resources
Problem
Recently I've seen several issues and support tickets where a project gets stuck indexing. In many cases we've seen that the project is being indexed multiple times in parallel.
Indexing the same project twice in parallel is very bad: Elasticsearch receives conflicting updates and returns errors to all the clients, and (I think) those errors cause the clients to retry, which creates a cascading failure loop.
We avoid this with a locking mechanism so that a single project is never indexed in parallel, but the lock has a 1 hour timeout. I suspect that a project which takes longer than 1 hour to index can create a self-perpetuating problem: the lock expires mid-run, a second indexing run starts in parallel, the runs conflict and fail, and the project ends up indexing forever.
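The hazard above can be sketched as a lease-style lock with a TTL. This is an illustrative in-memory stand-in (modeled on a Redis-like `SET NX EX` primitive), not GitLab's actual lock implementation; all names here are made up:

```python
import time

class LeaseLock:
    """Toy lease lock: a key is held until its TTL expires.

    Illustrative only -- stands in for a Redis-style
    `SET key value NX EX ttl` lock, not GitLab's real code.
    """

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._locks = {}  # key -> expiry timestamp

    def try_acquire(self, key, now=None):
        now = time.monotonic() if now is None else now
        expiry = self._locks.get(key)
        if expiry is not None and expiry > now:
            return False  # lock still held by another indexer
        self._locks[key] = now + self.ttl
        return True

lock = LeaseLock(ttl_seconds=3600)

assert lock.try_acquire("project:42", now=0.0)         # first indexer wins
assert not lock.try_acquire("project:42", now=1800.0)  # second attempt blocked
# After the 1h TTL expires, a second run can acquire the lock even though
# the first may still be indexing -- the parallel-indexing hazard.
assert lock.try_acquire("project:42", now=3601.0)
```

The key point is the last assertion: nothing ties lock expiry to the first process actually having stopped.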
Solution
Reduce the likelihood of indexing the same project twice
Consider increasing the timeout on the project lock. Since the lock must have some timeout, we may also want a mechanism to kill an indexing process that has exceeded the timeout, as extra protection against running in parallel. Ultimately it's better for a project's indexing to be killed if it can't finish in time than to have multiple runs in parallel that never finish, block other processes, and consume resources.
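One way to enforce this on the worker side is to run the indexer as a child process with a hard deadline matching the lock timeout, and kill it if it overruns. A minimal sketch (the command and function name are illustrative, not the real worker code):

```python
import subprocess

def run_indexer_with_deadline(cmd, timeout_seconds):
    """Run an indexing command, killing it if it exceeds the deadline.

    Returns True if it finished in time, False if it was killed.
    Killing the overrunning process keeps the lock timeout honest:
    once the lock can expire, nothing from the old run should still
    be writing to Elasticsearch.
    """
    try:
        subprocess.run(cmd, timeout=timeout_seconds, check=True)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout before raising
        return False

# Illustrative usage with a stand-in command, not the real indexer:
finished = run_indexer_with_deadline(["sleep", "5"], timeout_seconds=1)
# finished is False: the process overran its deadline and was killed
```

The design point is that the kill deadline and the lock TTL must agree; if the process can outlive the lock, the parallel-indexing window reopens.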
Figure out why the process uses so much memory
I think this is happening because there is some retrying of failures built into gitlab-elasticsearch-indexer, which I think is what caused gitlab-elasticsearch-indexer#51 (closed) to begin with. That problem was fixed by not retrying 413 status code failures, but I think we actually shouldn't be retrying any failures in gitlab-elasticsearch-indexer. Sidekiq itself should already be capable of doing retries, and retrying inside the indexer appears to be causing unbounded memory growth.

There is one downside, however: I think the reason the indexer supports retries is that it tries to cleverly retry only the specific failed documents from a bulk request. The only case I'm aware of for now where that happens is document conflicts, which are actually better handled via locking, so I think it's still probably best if we just remove retrying from the indexer altogether.
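The proposed policy can be sketched as: inspect the bulk response, and on any item failure raise instead of retrying in-process, so the job-level retry (Sidekiq) re-runs the whole job. The response shape below follows the Elasticsearch Bulk API (`errors` flag plus per-item `error` entries); the exception class is illustrative:

```python
class BulkIndexError(Exception):
    """Raised so the job-level retry mechanism (e.g. Sidekiq) re-runs
    the whole job, instead of the indexer retrying items itself."""

def check_bulk_response(response):
    """Inspect an Elasticsearch-style _bulk response and fail fast.

    Returns [] on full success; raises on any item failure rather
    than retrying individual documents in-process.
    """
    if not response.get("errors"):
        return []
    failed = [
        item[action]
        for item in response["items"]
        for action in item
        if item[action].get("error")
    ]
    raise BulkIndexError(
        f"{len(failed)} bulk item(s) failed; letting the caller retry"
    )

# A 409 version conflict is better prevented by locking than retried:
conflict_response = {
    "errors": True,
    "items": [
        {"index": {"status": 200}},
        {"index": {"status": 409,
                   "error": {"type": "version_conflict_engine_exception"}}},
    ],
}
try:
    check_bulk_response(conflict_response)
except BulkIndexError as e:
    print(e)  # the job fails and is retried as a whole
```

This trades some wasted work (re-sending already-succeeded documents) for bounded memory and a single, well-understood retry path.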