[Elasticsearch] Decouple schema and search code for ActiveModels to allow for versioned schema
Original Description
We often need to update the index mapping: to improve search, fix a bug, add a new field, add a new analyzer, and so on. On relatively small instances this can be done by simply removing the index and creating a new one. But that is acceptable only if indexing completes in a few minutes or you have no high-availability requirement. For those who need to do it with no downtime, we have to prepare documentation on how to do that. We also need to prepare an example pipeline for Logstash.
The common practice is the following strategy:
- Update the application
- Create a new empty index with the new mapping
- Use Logstash to copy all data from the old index to the new one (this is much faster than reading the data from the database/repository again). Copying should be limited by a date condition (up to some cutoff date), since new data continues to go to the old index. We need to use the Scroll API and bulk import here to make it fast. All of this can be configured in Logstash.
- Remove the old index and create an alias to the new one. From this point, GitLab will work with the new index.
- Run indexing again for data created since the last reindex (so that no data is lost).
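As a rough illustration of the date condition and the alias switch in the steps above, here is a minimal Ruby sketch that only builds the Elasticsearch request bodies. The field name `updated_at` and the index names `gitlab` and `gitlab-v2` are assumptions made for the example, not names taken from GitLab, and the actual copy would still be done by Logstash with scroll and bulk.

```ruby
require 'json'
require 'time'

# Range query for the copy step. By default it selects documents updated
# before the cutoff (the bulk copy); with catch_up: true it selects documents
# updated at or after the cutoff (the final re-run).
# `updated_at` is an assumed field name.
def copy_query(cutoff:, catch_up: false)
  operator = catch_up ? :gte : :lt
  { query: { range: { updated_at: { operator => cutoff.iso8601 } } } }
end

# Body for POST /_aliases: point an alias carrying the old index's name at
# the new index and delete the old index in the same request.
def alias_swap(old_index:, new_index:)
  {
    actions: [
      { add: { index: new_index, alias: old_index } },
      { remove_index: { index: old_index } }
    ]
  }
end

cutoff = Time.utc(2019, 1, 1)
puts JSON.pretty_generate(copy_query(cutoff: cutoff))
puts JSON.pretty_generate(copy_query(cutoff: cutoff, catch_up: true))
puts JSON.pretty_generate(alias_swap(old_index: 'gitlab', new_index: 'gitlab-v2'))
```

Because the alias takes over the old index name, the application would not need to know that the underlying index changed.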
One more alternative is to temporarily disable Elasticsearch, prepare the new index with either GitLab rake tasks or Logstash, and enable it again.
cc @dzaporozhets @jacobvosmaer @jnijhof @DouweM
@sytses Not sure if I should mention you in such issues...
UPDATE
I think the more appropriate way to handle this is actually to have a versioned index instead. For this purpose, we should have the table es_indexes(name, is_active, progress) in the database. On every data update we would update every active index. While the newest index is still being built, we could use the oldest active one. This way we could switch indexes smoothly and roll back to the old one when needed. Although this is not a cheap change, I think it is the best one for us.
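To make the es_indexes idea more concrete, here is a small in-memory Ruby sketch of how writes and reads could be routed. `EsIndex` and `IndexRegistry` are hypothetical names, the rows are plain structs rather than database records, and treating `progress` as a 0-100 percentage is an assumption.

```ruby
# In-memory stand-in for a row of the proposed es_indexes table.
# Assumption: `progress` is a percentage, and 100 means fully built.
EsIndex = Struct.new(:name, :is_active, :progress, keyword_init: true) do
  def built?
    progress >= 100
  end
end

# Hypothetical helper deciding which indexes receive writes and reads.
class IndexRegistry
  def initialize(indexes)
    @indexes = indexes # ordered from oldest to newest
  end

  # Every data update goes to all active indexes.
  def write_targets
    @indexes.select(&:is_active)
  end

  # Search the newest active index once it is built; until then,
  # keep using the oldest active one.
  def read_target
    active = write_targets
    return nil if active.empty?

    active.last.built? ? active.last : active.first
  end

  # Rolling back means deactivating the newest active index.
  def rollback!
    newest = write_targets.last
    newest.is_active = false if newest
  end
end

registry = IndexRegistry.new([
  EsIndex.new(name: 'gitlab-v1', is_active: true, progress: 100),
  EsIndex.new(name: 'gitlab-v2', is_active: true, progress: 40)
])
puts registry.write_targets.map(&:name).inspect
puts registry.read_target.name
```

In this sketch the switch to the new index happens implicitly once its progress reaches 100, and a rollback is just a flag change on the newest row.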
Refactor our ES classes so we can have multiple versions of them at the same time, so the application still knows how to query an old index.
To scope down a bit, this issue only dealt with ActiveModel decoupling. Decoupling of repository commits/blobs indexing has been extracted as another issue: https://gitlab.com/gitlab-org/gitlab-ee/issues/12548
Right now, if we change an attribute name, for example, we edit the relevant classes, and thus the application "forgets" how to query the old index.
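One possible shape for the decoupling, sketched in plain Ruby: each schema version keeps its own mapping together with the query code that matches it, so an older index can still be queried after an attribute is renamed. The module names and the `name` to `title` rename are purely illustrative and do not reflect the actual GitLab classes.

```ruby
# Hypothetical layout: one module per schema version, each bundling the
# mapping with the query builder that matches it.
module ProjectSearch
  module V1
    MAPPING = { properties: { name: { type: 'text' } } }.freeze

    def self.query(term)
      { query: { match: { name: term } } }
    end
  end

  module V2
    # Illustrative rename: `name` became `title` in version 2.
    MAPPING = { properties: { title: { type: 'text' } } }.freeze

    def self.query(term)
      { query: { match: { title: term } } }
    end
  end

  # Versions are an incrementing counter mapped to their implementation.
  VERSIONS = { 1 => V1, 2 => V2 }.freeze

  def self.for_version(version)
    VERSIONS.fetch(version) do
      raise ArgumentError, "unknown schema version #{version}"
    end
  end
end

puts ProjectSearch.for_version(1).query('gitlab').inspect
puts ProjectSearch.for_version(2).query('gitlab').inspect
```

With a lookup like this, the index version recorded for the active index would select the matching query code, and adding a version would not require touching the older ones.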
We should define how many versions of indexes we'll support (infinite? current and previous? a set number?). In all cases we'll want to keep the version as an incrementing counter, I think. Our answer to this would dictate how we'd deal with the refactor (versioned folders, etc.).
We probably want to allow an infinite number of schemas.