Implement a percentage-based rollout for Elasticsearch on GitLab.com
Problem
We believe rolling out Elasticsearch to all GitLab.com projects will mean indexing and searching a very large volume of data. An all-or-nothing rollout may not be safe, since it would not give us enough time to react to problems by scaling out our infrastructure or making indexing and searching more efficient. It would also take a lot of manual effort to repeatedly enable the feature and roll it back each time we learn about a new scaling challenge.
Solution
Roll out to a percentage of groups at a time starting with Gold groups.
This will require some changes to GitLab to support this in a sensible way.
We can currently limit the groups that are indexed and searched in Elasticsearch, but this feature has the following problems:
- It likely does not handle very large numbers of groups in the list (it was only designed to be used for a few groups), so the admin UI will probably break or time out when there are hundreds or thousands of groups in this list. It may also have performance impacts in other parts of the code where we check this list.
- It was intended to be used by enabling indexing for a set of groups, letting the indexing finish, and only then enabling it for searching.
- Apart from clicking through the UI or writing one-off scripts for the console, there is no controlled way to roll this out to large numbers of groups.
Extend this logic of rolling out to groups
In order to solve the above problems we'll want to:
- Adapt this feature so that the admin UI does not display all the groups in the rollout when their number exceeds some sensible limit (e.g. 20)
- Ensure that everywhere this logic is used, it scales sensibly when there are thousands of groups in the rollout
- Set an extra boolean index_statuses.records_initially_indexed indicating that we've finished Elastic::IndexRecordService#initial_import_project for the given project
- Update our logic in use_elasticsearch? to ensure that all projects within the current scope (i.e. all projects in the group, or just this project for project search) have index_statuses.initial_import_complete set to true
- Create a script that can be run from the Rails console to enable Elasticsearch for large numbers of groups at a time
- Ensure that groups can be removed from the rollout without data loss or bugs, so that they stop being indexed and searched if any part of the system starts to become overloaded
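The gating rule described above (only search through Elasticsearch once every project in scope has finished its initial import) can be sketched in plain Ruby. This is a minimal illustration, not GitLab's implementation: the class name ElasticsearchRolloutSketch, its constructor arguments, and the two method names are hypothetical, and the hashes stand in for the database tables.

```ruby
require 'set'

# Hypothetical, in-memory sketch of the proposed check. The real logic would
# live in use_elasticsearch? and read from the database instead of hashes.
class ElasticsearchRolloutSketch
  # enabled_group_ids:    ids of groups that are part of the rollout
  # project_ids_by_group: { group_id => [project_id, ...] }
  # initially_indexed:    { project_id => true/false }, a stand-in for the
  #                       proposed boolean on index_statuses
  def initialize(enabled_group_ids:, project_ids_by_group:, initially_indexed:)
    @enabled_group_ids = Set.new(enabled_group_ids) # O(1) membership checks
    @project_ids_by_group = project_ids_by_group
    @initially_indexed = initially_indexed
  end

  # Group-scoped search: the group must be in the rollout and every one of
  # its projects must have completed the initial import.
  # Note: a group with no projects passes, because all? is true on [].
  def use_elasticsearch_for_group?(group_id)
    return false unless @enabled_group_ids.include?(group_id)

    @project_ids_by_group.fetch(group_id, []).all? do |project_id|
      @initially_indexed[project_id] == true
    end
  end

  # Project-scoped search: only this one project needs to be fully indexed.
  def use_elasticsearch_for_project?(group_id, project_id)
    @enabled_group_ids.include?(group_id) && @initially_indexed[project_id] == true
  end
end
```

With this shape, a group that is in the rollout but still has one project mid-import keeps falling back to the existing search, while a project-scoped search in that same group can already use Elasticsearch if that particular project is done.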
TODO
- Hide projects/namespaces when there are more than 50 in the admin UI
- Store status as index_statuses.records_initially_indexed after indexing all DB records is completed for a project. Skip per !20760 (comment 258510946)
- Determine whether or not to use Elasticsearch based on non-empty SHA in IndexStatus and also IndexStatus#records_initially_indexed? => true. Skip per !20760 (comment 258510946)
- --- Assign review ---
- Validate how the different features behave when there are 100,000 namespaces and 100,000 projects enabled
  - What queries are happening when loading the search page scoped to one of those groups
  - What queries are happening when loading the search page scoped to a different group that is not enabled
  - What queries are happening when loading the search page scoped to one of those projects
  - What queries are happening when loading the search page scoped to a different project that is not enabled
- --- Merge ---
- Add an API to trigger rollout to percentages at a time (admin only): send the desired rollout percentage. We first check whether the current percentage is already greater than or equal to this and do nothing if so (idempotent). Otherwise we grab the next set of ids (ordered by id, which would help us later figure out which ones had been enabled, and we should also log it) and then enable Elasticsearch for them.
- --- Merge ---
- Add support to roll back the percentage via admin API (we should reverse order the elasticsearch_indexed_namespaces by created_at here so we disable only the most recently enabled).
- --- Assign review ---
- Validate that it's safe to remove something from the rollout, make some changes to that project, then re-add it to the rollout. Is it idempotent?
- --- Merge ---
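
The percentage rollout and rollback items in the list above can be sketched as follows. This is only an in-memory illustration under assumptions: the PercentageRollout class, its Entry struct, and the integer counter used in place of a real created_at timestamp are all hypothetical, and the @enabled array merely stands in for the elasticsearch_indexed_namespaces records.

```ruby
# Hypothetical in-memory model of the rollout; not GitLab's implementation.
class PercentageRollout
  Entry = Struct.new(:namespace_id, :created_at)

  attr_reader :enabled

  def initialize(all_namespace_ids)
    @all_ids = all_namespace_ids.sort # roll out in id order
    @enabled = []                     # stand-in for elasticsearch_indexed_namespaces
    @clock = 0                        # fake, strictly increasing created_at
  end

  # Raise the rollout to `percentage`. Does nothing (idempotent) when the
  # rollout is already at or above the target. Returns the newly enabled ids
  # so that the caller can log them.
  def rollout_to(percentage)
    target = target_count(percentage)
    return [] if @enabled.size >= target

    already = @enabled.map(&:namespace_id)
    newly = (@all_ids - already).first(target - @enabled.size)
    newly.each { |id| @enabled << Entry.new(id, @clock += 1) }
    newly
  end

  # Lower the rollout to `percentage`, disabling the most recently enabled
  # namespaces first (reverse created_at order). Returns the removed ids.
  def rollback_to(percentage)
    target = target_count(percentage)
    return [] if @enabled.size <= target

    removed = @enabled.sort_by(&:created_at).reverse.first(@enabled.size - target)
    @enabled -= removed
    removed.map(&:namespace_id)
  end

  private

  # Number of namespaces that should be enabled at the given percentage.
  def target_count(percentage)
    (@all_ids.size * percentage / 100.0).floor
  end
end
```

Calling rollout_to twice with the same percentage enables nothing the second time, and rollback_to undoes the latest batch first, so the namespaces that have been indexed the longest stay enabled.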