Elasticsearch: return to using a separate index per document type
Elasticsearch has deprecated parent-child relationships and top-level document types: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/removal-of-types.html
Use of these features is not possible long-term (Valery: there is a transparent replacement join datatype). In the short term, use of these features causes us two problems:
- The index is bloated by an as-yet-unquantified amount, because every document has every field of every type.
- Use of parent-child relationships means we can't spread the documents out evenly across Elasticsearch shards. As a result, some shards end up with far more data on them than others.
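The shard-skew problem can be illustrated with a small simulation. This is only a sketch under stated assumptions: `shard_for` uses MD5 as a deterministic stand-in for Elasticsearch's real routing hash, and the project sizes (one large project, nine small ones) are made up for illustration. It compares routing each document by its parent's ID, as parent-child relationships require, with routing by the document's own ID.

```python
import hashlib
from collections import Counter


def shard_for(routing_key: str, num_shards: int) -> int:
    # Stand-in for Elasticsearch's routing hash; MD5 is used here only
    # because it is deterministic and available in the standard library.
    digest = hashlib.md5(routing_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def shard_loads(docs, num_shards, route_by_parent):
    # docs is a list of (doc_id, parent_id) pairs.
    # Returns the number of documents that land on each shard.
    loads = Counter()
    for doc_id, parent_id in docs:
        key = parent_id if route_by_parent else doc_id
        loads[shard_for(key, num_shards)] += 1
    return [loads[shard] for shard in range(num_shards)]


def example_docs():
    # Hypothetical data: one project with 1000 child documents and
    # nine projects with 10 child documents each.
    docs = [(f"project-0-doc-{i}", "project-0") for i in range(1000)]
    for p in range(1, 10):
        docs += [(f"project-{p}-doc-{i}", f"project-{p}") for i in range(10)]
    return docs


if __name__ == "__main__":
    docs = example_docs()
    print("routed by parent:", shard_loads(docs, 5, route_by_parent=True))
    print("routed by own id:", shard_loads(docs, 5, route_by_parent=False))
```

With parent routing, every child of the large project lands on the same shard, so that shard holds at least 1000 of the 1090 documents. Routing by the document's own ID spreads them much more evenly.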
We should investigate returning to the old situation of having an index per document type, treating commits and blobs as separate data types, of course.
Joins, if necessary, can be done in-application or using the new join capability.
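As a rough idea of what an in-application join could look like, here is a minimal sketch. It assumes the application has already run two separate searches, one against a commits index and one against a blobs index, and that each blob document carries a `commit_sha` field matching a commit's `sha` field. Both field names and the function `join_blobs_to_commits` are hypothetical and used only for illustration.

```python
from collections import defaultdict


def join_blobs_to_commits(commit_hits, blob_hits):
    # Group blobs by the commit they belong to.
    # "commit_sha" and "sha" are assumed field names.
    blobs_by_sha = defaultdict(list)
    for blob in blob_hits:
        blobs_by_sha[blob["commit_sha"]].append(blob)

    # Attach the matching blobs to each commit, preserving commit order.
    return [
        {"commit": commit, "blobs": blobs_by_sha.get(commit["sha"], [])}
        for commit in commit_hits
    ]


if __name__ == "__main__":
    commits = [{"sha": "abc", "message": "Fix bug"}, {"sha": "def", "message": "Add docs"}]
    blobs = [
        {"commit_sha": "abc", "path": "lib/a.rb"},
        {"commit_sha": "abc", "path": "lib/b.rb"},
    ]
    for row in join_blobs_to_commits(commits, blobs):
        print(row["commit"]["sha"], [b["path"] for b in row["blobs"]])
```

This costs two queries instead of one, but it keeps each index free to be sharded independently.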
(Previous discussion related to splitting the repository type up, now obsolete)
The following discussion from !2709 (merged) should be addressed:
- @smcgivern started a discussion: I see we have talked about splitting the types in the issue, which makes sense to me: https://gitlab.com/gitlab-org/gitlab-ee/issues/3011#note_37888885
@vsizov @nick.thomas do we already have an issue for that?
Per https://gitlab.com/gitlab-org/gitlab-ee/issues/3011#note_37888885, currently we store 'commits' and 'blobs' in Elasticsearch with a _type of repository. This means commits have all the fields of blobs, and vice versa. It also complicates querying these document types, and causes bugs.
Can we do this with a data migration? To me, asking our users to reindex all their repositories for this is unreasonable.
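One possible shape for such a migration, sketched under several assumptions: Elasticsearch's Reindex API (`POST _reindex`) copies documents between indices on the server side from their stored `_source`, so repositories would not need to be re-read. The snippet below only builds the request bodies. The index names (`gitlab`, `gitlab-commits`, `gitlab-blobs`) and the marker fields (`commit.sha`, `blob.oid`) are hypothetical, and the sketch does not deal with routing, with removing the unused fields from the copied documents, or with the mappings of the destination indices.

```python
import json


def reindex_body(source_index, dest_index, marker_field):
    # Select only the documents in which the marker field is present,
    # and copy them into the destination index.
    return {
        "source": {
            "index": source_index,
            "query": {"exists": {"field": marker_field}},
        },
        "dest": {"index": dest_index},
    }


def split_repository_requests(source_index="gitlab"):
    # All names here are assumptions for illustration, not the real mapping.
    return [
        reindex_body(source_index, "gitlab-commits", "commit.sha"),
        reindex_body(source_index, "gitlab-blobs", "blob.oid"),
    ]


if __name__ == "__main__":
    for body in split_repository_requests():
        # Each body would be sent as the JSON payload of POST _reindex.
        print(json.dumps(body, indent=2))
```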
A thought: how much extra space do these fields take up per document, even though they're empty?