Advanced Search Background Migrations for Elasticsearch Indexing
Problem
At some point we will need to perform a migration on our Elasticsearch index that involves more than just reindexing within a cluster, for example when we want to add epics to the index (#4745 (closed)) or make any other change that requires us to start indexing something new about a document.
For now this issue will serve as a place to brainstorm ideas about this and link up to previous discussions and ideas.
Solutions
Original plan for multi version support
Originally we planned to introduce multi-version support using an approach that relied entirely on GitLab to manage both indexes, reading from the old one and writing to both until the migration finished. There is more information at !18254 (merged) and &1769. As of writing (%13.3), most of the code for this still exists in GitLab in a half-implemented form.
The 3 primary concerns I had with this approach were:
- Reindexing was done on the GitLab side and involved reading every single document from the database and sending it to Elasticsearch again. At the rate we index things, this would likely take weeks, whereas our current in-cluster reindexing can finish in under 24 hours and can be made arbitrarily faster by scaling up the cluster before the operation.
- Reindexing everything from GitLab to the cluster again is very wasteful when only a small part of the index needs to change. For example, to add epics to the index we could very quickly index just the epics instead of reindexing every document. Many migrations can be done more efficiently with a targeted approach (e.g. adding a new field to a document type only requires reindexing the documents that actually have that field). The implementation we had was very generic, so we couldn't write custom migrations that did the least amount of work possible; instead we relied on reindexing everything from scratch for any change.
- The implementation writes to both indexes at the same time, which means one slow index will block indexing. If we are reindexing to a new cluster, that cluster may be under very high load, which will also slow down incremental indexing since the two are coupled. This could probably have been fixed by refactoring the code a little, but it was another part that still needed to be improved.
Directory of migrations, migrations are workers
I believe that we should do our best to take inspiration from Rails DB migrations for whatever solution we come up with. There are many lessons in them that we don't want to have to learn again.
As such, I can imagine a simple approach: a directory of workers whose names start with a timestamp, like Rails DB migrations, so that they are executed in order. A periodic Sidekiq worker looks for workers in this directory that haven't yet run, and the completion of each worker is persisted in the Elasticsearch cluster.
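A minimal sketch of what the periodic worker's lookup could look like, in plain Ruby. Everything here is an assumption for illustration: the `CompletedMigrationsStore` class is an in-memory stand-in for an index in the ES cluster, and the `<timestamp>_<name>.rb` file-name convention is borrowed from Rails rather than taken from any existing GitLab code.

```ruby
# Hypothetical in-memory stand-in for an Elasticsearch index that records
# which migrations have completed.
class CompletedMigrationsStore
  def initialize
    @versions = []
  end

  def completed?(version)
    @versions.include?(version)
  end

  def mark_completed(version)
    @versions << version
  end
end

# Assumed naming convention: "<14-digit timestamp>_<name>.rb".
MIGRATION_FILE = /\A(\d{14})_(\w+)\.rb\z/

# Returns [version, name] pairs that have not run yet, oldest first.
def pending_migrations(file_names, store)
  file_names
    .map { |file| MIGRATION_FILE.match(file) }
    .compact
    .map { |match| [match[1].to_i, match[2]] }
    .reject { |version, _name| store.completed?(version) }
    .sort_by(&:first)
end

# What one run of the periodic worker might do: execute pending migrations
# in timestamp order and record each one as completed.
def run_pending(file_names, store)
  pending_migrations(file_names, store).map do |version, name|
    # A real worker would load and execute the migration here.
    store.mark_completed(version)
    name
  end
end

store = CompletedMigrationsStore.new
files = ["20200801120000_add_epics.rb", "20200715093000_add_new_field.rb", "README.md"]
first_run = run_pending(files, store)   # runs both migrations, oldest first
second_run = run_pending(files, store)  # nothing left to run
```

Because completion is looked up in the store on every run, the periodic worker can be invoked repeatedly without re-running anything.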
We might be tempted to persist the list of migrations that have run in Postgres. That would not work well when a user connects a new ES cluster to GitLab, because Postgres would still list the migrations as completed even though they were never applied to the new cluster. It's probably better to persist the record of completed migrations in the ES cluster itself so it's more likely to be in sync.
One problem I can think of is that a single job may not be appropriate for a very long migration. We would likely want to spawn many smaller Sidekiq jobs from a migration. Perhaps we can use a "Poison Pill" approach: queue up a batch of jobs to be executed in sequence (one at a time), then add one last job after all of them that signifies the migration is complete and writes to the completed migrations index in Elasticsearch.
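The "Poison Pill" idea can be sketched roughly as follows. This is only an illustration in plain Ruby: `SequentialQueue` is a hypothetical in-process queue standing in for a Sidekiq queue that is processed one job at a time, and the `indexed` and `completed` arrays stand in for the search index and the completed migrations index.

```ruby
# Hypothetical in-process queue; jobs run strictly in the order queued.
class SequentialQueue
  def initialize
    @jobs = []
  end

  def push(&job)
    @jobs << job
  end

  def drain
    @jobs.shift.call until @jobs.empty?
  end
end

# Splits a long migration into batch jobs, then queues one final job that
# records completion. Because jobs run in order, the final job only runs
# after every batch has been processed.
def enqueue_migration(queue, version, record_ids, batch_size, indexed, completed)
  record_ids.each_slice(batch_size) do |batch|
    queue.push { indexed.concat(batch) } # stand-in for indexing one batch
  end
  queue.push { completed << version }    # the final marker job
end

indexed = []
completed = []
queue = SequentialQueue.new
enqueue_migration(queue, 20200801120000, (1..10).to_a, 4, indexed, completed)
completed_before_drain = completed.dup
queue.drain
```

Note that the ordering guarantee here comes entirely from the queue being processed one job at a time. With concurrent workers the marker job could overtake a batch, so a real implementation would need some other way to confirm that all batches have finished.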
Sometimes a migration will need to trigger a reindex of the ES cluster, since we may need to reindex the cluster before adding some new data. This could probably still make use of the existing reindex feature in GitLab. The Sidekiq worker might just kick off the reindexing and keep requeueing itself until the reindex is completed, and then persist the fact that the reindex is completed.
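As a rough sketch of the requeue-until-done pattern, assuming a hypothetical `FakeReindexTask` in place of the real reindex feature and a plain loop in place of Sidekiq rescheduling:

```ruby
# Hypothetical stand-in for a running reindex; a real worker would ask
# GitLab's reindex feature or the cluster for the task status instead.
class FakeReindexTask
  def initialize(polls_until_done)
    @remaining = polls_until_done
  end

  def completed?
    @remaining -= 1
    @remaining <= 0
  end
end

# One execution of the hypothetical migration worker. Returns :requeue while
# the reindex is still running, or :done once completion has been recorded.
def perform_reindex_migration(state, completed, version)
  # First execution: kick off the reindex and remember the task.
  state[:task] ||= FakeReindexTask.new(3)

  return :requeue unless state[:task].completed?

  completed << version
  :done
end

state = {}
completed = []
results = []
# Stand-in for the worker requeueing itself until the reindex finishes.
loop do
  result = perform_reindex_migration(state, completed, 20200801120000)
  results << result
  break if result == :done
end
```

Each execution does a small, bounded amount of work (one status check), so no single job has to stay alive for the whole duration of the reindex.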
A Sidekiq migration may also just trigger something that runs on the ES cluster and watch until it completes. For example, an in-cluster reindex might make use of a script to transform the data during the migration or filter out certain documents. Or it might perform an "Update all" operation that runs a script on a set of docs to update some part of them.
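For illustration, the request bodies for these two kinds of cluster-side operation might look something like the following, using Elasticsearch's `_reindex` and `_update_by_query` APIs. The index names, field names, query and Painless scripts are all made up for this sketch, and the helper methods are hypothetical.

```ruby
require "json"

# Body for POST /_reindex: the query filters which documents are copied and
# the script transforms each one on the way to the destination index.
def reindex_body(source_index, dest_index)
  {
    source: { index: source_index, query: { term: { type: "issue" } } },
    dest: { index: dest_index },
    script: { lang: "painless", source: "ctx._source.remove('legacy_field')" }
  }
end

# Body for POST /<index>/_update_by_query: runs a script against every
# document matching the query, updating part of each document in place.
def update_by_query_body
  {
    query: { term: { type: "issue" } },
    script: {
      lang: "painless",
      source: "ctx._source.visibility_level = params.level",
      params: { level: 0 }
    }
  }
end

puts JSON.pretty_generate(reindex_body("gitlab-v1", "gitlab-v2"))
puts JSON.pretty_generate(update_by_query_body)
```

In both cases the worker would only need to submit the request and then watch the resulting task until it completes, since the work itself happens inside the cluster.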
Release notes
Previously, adding features to Advanced Search often required a manual reindex before the new feature could be used.
Now when we add new features, reindexing will happen in the background without manual intervention. Reindexing can still be performed manually when needed.
https://docs.gitlab.com/ee/development/elasticsearch.html#creating-a-new-global-search-migration