GitLab's automatic Elasticsearch reindexing should support retries for manual slicing

Problem

We have now seen several times (gitlab-com/gl-infra/production#2408, gitlab-com/gl-infra/production#3172) that Elasticsearch reindexing is quite brittle. We can investigate the root cause each time something fails, but the fundamental problem is that reindexing is not resilient and does not retry on failure. Work appears to be ongoing on the Elasticsearch side to improve this in https://github.com/elastic/elasticsearch/issues/60362 and https://github.com/elastic/elasticsearch/issues/42612, but it is not clear how long it will be before there is a solution that lets us reliably reindex our large index.

Solution

We've had success in all the above issues by retrying slices using manual slicing. We should expand the automated reindex feature to do the following:

Break up the work into many slices (maybe 2x number of shards)
Trigger a reindex for those slices in batches (possibly 20% of slices at a time)
Store the task ID and slice number of all currently running slices
Periodically check the status of those slices. As soon as any slice is marked as completed but does not have the correct number of docs (that is, total != created + updated + deleted), retry it and update its stored task ID
Allow administrators to choose the shard multiplier (used to calculate the number of slices per index during reindexing: slices = multiplier * number of shards) and the maximum number of slices running at once. These values will be stored with the Elastic Reindexing Task and used during the reindexing process.
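
The steps above could be sketched roughly as follows. This is a minimal Python illustration, not GitLab's implementation: SliceScheduler, start_slice, fetch_task and the helper names are all hypothetical, and the two callables stand in for whatever code actually submits a sliced reindex and reads back the task status. The only Elasticsearch detail relied on is the documented manual slicing form of the reindex request body (source.slice with id and max).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


def slice_count(shards: int, multiplier: int = 2) -> int:
    """slices = multiplier * number of shards."""
    return shards * multiplier


def max_running_slices(total_slices: int, percent: int = 20) -> int:
    """Batch size as a percentage of all slices, never less than one."""
    return max(1, total_slices * percent // 100)


def reindex_body(source: str, dest: str, slice_id: int, max_slices: int) -> dict:
    """Request body for one manually sliced reindex."""
    return {
        "source": {"index": source, "slice": {"id": slice_id, "max": max_slices}},
        "dest": {"index": dest},
    }


def slice_succeeded(status: dict) -> bool:
    """A completed slice is correct only if every document is accounted for."""
    return status["total"] == status["created"] + status["updated"] + status["deleted"]


@dataclass
class SliceScheduler:
    total_slices: int
    max_running: int
    # Hypothetical callables: start_slice submits a slice and returns its task ID;
    # fetch_task returns {"completed": bool, "status": {...counts...}}.
    start_slice: Callable[[int], str]
    fetch_task: Callable[[str], dict]
    running: Dict[int, str] = field(default_factory=dict)  # slice number -> task ID
    pending: List[int] = field(init=False)
    done: Set[int] = field(default_factory=set)

    def __post_init__(self) -> None:
        self.pending = list(range(self.total_slices))

    def tick(self) -> bool:
        """One periodic check. Returns True once every slice has succeeded."""
        for slice_id, task_id in list(self.running.items()):
            task = self.fetch_task(task_id)
            if not task["completed"]:
                continue
            del self.running[slice_id]
            if slice_succeeded(task["status"]):
                self.done.add(slice_id)
            else:
                # Completed with missing docs: queue it first so it is retried
                # immediately and gets a fresh task ID below.
                self.pending.insert(0, slice_id)
        while self.pending and len(self.running) < self.max_running:
            slice_id = self.pending.pop(0)
            self.running[slice_id] = self.start_slice(slice_id)
        return len(self.done) == self.total_slices
```

A caller would invoke tick() on a timer (for example from a periodic worker) until it returns True. Persisting running, pending and done between invocations is left out here; in the proposal that state would live alongside the Elastic Reindexing Task.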

Edited Apr 30, 2021 by Terri Chu