Allow Elasticsearch framework to index any data
## Problem to solve
The current Elasticsearch framework relies on ActiveRecord to keep the index up to date: callbacks on create/update/delete create, update, or delete the corresponding document in Elasticsearch.
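As a rough illustration of that coupling, here is a plain-Ruby sketch (not actual GitLab code; the class and queue names are invented) of how a model lifecycle event enqueues the record for indexing:

```ruby
# Stand-in for the Redis-backed bookkeeping queue.
class FakeIndexQueue
  def self.items
    @items ||= []
  end

  def self.track!(ref)
    items << ref
  end
end

class Issue
  attr_reader :id, :title

  def initialize(id, title)
    @id = id
    @title = title
  end

  # Stand-in for an ActiveRecord after_commit callback on create/update:
  # saving the record queues a reference for later bulk indexing.
  def save
    FakeIndexQueue.track!("Issue #{id}")
    self
  end
end

Issue.new(1, "Add docs").save
FakeIndexQueue.items # => ["Issue 1"]
```

The point is that nothing can reach the queue without going through an ActiveRecord lifecycle event, which is exactly the constraint this proposal relaxes.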
With the potential of using Elasticsearch as a vector store, it may be useful to allow the framework to index any data that does not necessarily have a matching ActiveRecord record.
For example:
- X-Ray reports have a single record per project with a nested jsonb field containing multiple libs, each of which needs its own embedding. At the moment the framework only allows a single Elasticsearch document per record, instead of one document per lib.
- GitLab Duo documentation: it should be possible to index documentation files/chunks without needing a corresponding database record.
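To make the X-Ray case concrete, here is a hedged sketch of the fan-out: one report row (with a jsonb payload listing several libs) becomes one Elasticsearch document per lib. The field names are assumptions for illustration, not the real schema.

```ruby
# Hypothetical fan-out: one X-Ray report row -> one document per lib.
def xray_documents(report)
  report[:payload]['libs'].map do |lib|
    {
      _id: "xray-#{report[:project_id]}-#{lib['name']}",
      project_id: report[:project_id],
      lib_name: lib['name'],
      description: lib['description']
    }
  end
end

report = {
  project_id: 42,
  payload: { 'libs' => [
    { 'name' => 'rails',   'description' => 'web framework' },
    { 'name' => 'sidekiq', 'description' => 'background jobs' }
  ] }
}

docs = xray_documents(report)
docs.map { |d| d[:_id] } # => ["xray-42-rails", "xray-42-sidekiq"]
```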
## Proposal
Update the framework to allow indexing, updating and deleting data in any serialized format. It could be a module that is added to any class along with `serialize` and `deserialize` methods. The `BulkIndexer` can still be used to queue up items to be sent to Elasticsearch.
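A minimal sketch of what such a mixin could look like. The module name, method signatures, and `DocumentationChunk` class are assumptions, not a real API:

```ruby
# Hedged sketch: any class gains indexability by including this module and
# implementing #serialize (plus a class-level .deserialize). Names are
# illustrative only.
module Indexable
  def serialize
    raise NotImplementedError, "#{self.class} must implement #serialize"
  end
end

class DocumentationChunk
  include Indexable

  attr_reader :path, :chunk_id, :content

  def initialize(path, chunk_id, content)
    @path = path
    @chunk_id = chunk_id
    @content = content
  end

  # What the BulkIndexer would send to Elasticsearch for this chunk.
  def serialize
    { _id: "doc-#{path}-#{chunk_id}", path: path, content: content }
  end

  # Rebuild a chunk from an indexed document.
  def self.deserialize(hash)
    new(hash[:path], hash[:_id].split('-').last.to_i, hash[:content])
  end
end

chunk = DocumentationChunk.new('user/search.md', 3, 'Advanced search docs')
doc = chunk.serialize
doc[:_id] # => "doc-user/search.md-3"
```

Because the contract is just `serialize`/`deserialize`, the same `BulkIndexer` path could serve ActiveRecord-backed classes and purely in-memory ones alike.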
Another consideration is that we don't need to tie ourselves to the `elasticsearch-rails` gem. At this point we barely use anything from the gem (only search), and even then we've monkey-patched many parts of it. If we find ourselves pushing against the gem, we should start developing completely separate code paths and accept that we may some day drop `elasticsearch-rails` entirely.
There is also a lot of legacy around the versioning of Elasticsearch classes, which has never been used; we'd do ourselves a favour by moving away from that as well.
The GitLab events platform (https://docs.gitlab.com/ee/architecture/blueprints/gitlab_events_platform/) could serve as an alternative to the Redis-based Elasticsearch queue framework: a more general-purpose event stream that can make the same delivery guarantees.
We might also want to introduce a way to handle long-running/asynchronous serialization. For example, generating embeddings requires an API call that can take a while to respond, and we don't want multiple processes blocked waiting for it. With the current `ProcessBookkeepingService`, serialization happens inline because it only fetches and transforms a database record.
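One possible shape for this, sketched here purely as an assumption, is a two-phase flow: the bookkeeping step stores only a lightweight reference, and the worker that drains the queue performs the slow embedding call when it builds the document. The `EmbeddingDocRef` class and `fetch_embedding` helper are hypothetical:

```ruby
# Hypothetical deferred serialization: enqueueing is cheap; the expensive
# embedding call happens later, inside the indexing worker, so no other
# process blocks on it.
class EmbeddingDocRef
  attr_reader :text

  def initialize(text)
    @text = text
  end

  # Called by the indexing worker when draining the queue, not at
  # enqueue time.
  def serialize
    { content: text, embedding: fetch_embedding(text) }
  end

  private

  # Stand-in for the real (slow) embeddings API call.
  def fetch_embedding(text)
    text.bytes.take(3)
  end
end

queue = [EmbeddingDocRef.new('hello world')] # cheap: no API call yet
docs = queue.map(&:serialize)                # slow work happens here
docs.first[:embedding] # => [104, 101, 108]
```

Keeping the queued payload small also matches the existing pattern of queueing record references rather than full documents.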