Migration should update 1 project at a time & remove missing projects
What does this MR do and why?
Related to #351381 (closed)
Noticed a few issues while testing the migration from !107730 (merged) in staging.
Note: I do not believe this needs a revert; we can fix it in the current milestone. Advanced Search migrations are currently paused in production via a feature flag.
- Projects which are missing from the database should not remain in the index. They will never pass the missing traversal ids check, so the migration will run forever.
- The migration kicks off update_by_query for many projects at once but stores only one task_id in the migration. This can leave too many tasks running in Elasticsearch, and we have no way to track task completion (except for the one task stored in the migration).
- Update by query is not using batch_size.
Fixes
- remove projects not found in the database from the index using ElasticDeleteProjectWorker
- only process 1 project at a time, 1 project will map to 1 task in the migration
- add max_docs to update_by_query
- update specs
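The fixes above amount to a small per-run decision: wait on the single tracked task, drop projects that no longer exist in the database, and otherwise start one bounded update_by_query. The following is a minimal sketch of that flow, not the actual migration code. The method names, the shape of the state hash, and the action symbols are all hypothetical; only max_docs and wait_for_completion are real update_by_query parameters.

```ruby
# Hypothetical sketch of the one-project-at-a-time flow (not the real migration).
# state is what the migration would persist between runs:
#   {}                                   -> nothing in flight
#   { project_id: 1, task_id: "abc:1" }  -> one update_by_query task in flight
# Returns [action, new_state].
def next_step(state, candidate_project_id:, db_project_ids:, task_completed: false)
  if state[:task_id]
    # Exactly one task is tracked; do not start another until it finishes.
    return [:wait, state] unless task_completed

    return [:task_finished, {}]
  end

  # No project left in the index without traversal_ids.
  return [:done, {}] if candidate_project_id.nil?

  # A project missing from the database can never be fixed, so remove it
  # from the index (the MR does this via ElasticDeleteProjectWorker).
  return [:delete_from_index, {}] unless db_project_ids.include?(candidate_project_id)

  [:start_update_by_query, { project_id: candidate_project_id }]
end

# Request parameters for a bounded, asynchronous update_by_query.
# wait_for_completion: false makes Elasticsearch return a task id to store.
def update_by_query_params(max_docs:)
  { wait_for_completion: false, max_docs: max_docs }
end
```

Keeping one project per task means the stored task_id is always enough to tell whether the migration may proceed.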
Screenshots or screen recordings
N/A
How to set up and validate locally
Elasticsearch query to remove migration from migrations index
URL: DELETE http://localhost:9200/gitlab-development-migrations/_doc/20221221110300
Elasticsearch query to remove traversal_ids from documents
URL: POST http://localhost:9200/gitlab-development/_update_by_query?wait_for_completion=true&refresh=true
{
"script": "ctx._source.remove('traversal_ids');",
"query": {
"bool": {
"must": {
"terms": {
"type": [
"wiki_blob",
"blob"
]
}
}
}
}
}
Elasticsearch query to count documents missing traversal_ids
URL: GET http://localhost:9200/gitlab-development/_count
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "traversal_ids"
}
},
"must": {
"terms": {
"type": [
"wiki_blob",
"blob"
]
}
}
}
}
}
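If you would rather run the count from a Ruby script than from an HTTP client, a sketch along these lines should work. It builds the same query body as above and sends it with POST, which the Elasticsearch count API accepts as well as GET. The method names are hypothetical, and the base URL and index name are the local development defaults used in this description.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Builds the query body for counting documents that have no traversal_ids.
def missing_traversal_ids_query(types = %w[wiki_blob blob])
  {
    query: {
      bool: {
        must_not: { exists: { field: 'traversal_ids' } },
        must: { terms: { type: types } }
      }
    }
  }
end

# Sends the query to the _count endpoint and returns the integer count.
# Assumes an unauthenticated local Elasticsearch, as in a default GDK setup.
def count_missing_traversal_ids(base_url: 'http://localhost:9200', index: 'gitlab-development')
  uri = URI("#{base_url}/#{index}/_count")
  body = JSON.generate(missing_traversal_ids_query)
  response = Net::HTTP.post(uri, body, 'Content-Type' => 'application/json')
  JSON.parse(response.body).fetch('count')
end
```

Calling count_missing_traversal_ids before and after running the migration gives a quick check that the number is going down.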
- set up GDK for Elasticsearch and make sure the indexes are created and populated
- stop rails-background-jobs to make sure the cron worker doesn't process the migration for you
  gdk stop rails-background-jobs
- remove all of the existing traversal_ids from the index (see "Elasticsearch query to remove traversal_ids from documents" above)
- check that there are blob and wiki_blob documents missing traversal_ids (see "Elasticsearch query to count documents missing traversal_ids" above)
- delete the migration record from the migrations index if it exists (see "Elasticsearch query to remove migration from migrations index" above)
- start rails console and manually run the migration worker
  Elastic::MigrationWorker.new.perform
- monitor the elasticsearch.log file for progress. You will need to run the worker many times to walk through all of the projects in your index (there may be a lot)
- once you are satisfied, you can start rails-background-jobs back up and let it finish automatically
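Because each run now handles a single project, repeating the worker by hand gets tedious. A small loop like the one below can do the repetition in the rails console. This is only a sketch: run_until_migrated is a hypothetical helper, and using "count of documents missing traversal_ids reaches zero" as the stop condition is an assumption about when the migration is finished.

```ruby
# Hypothetical helper: calls run_once until remaining reports zero,
# or until max_runs is reached. Returns the number of runs performed.
#   run_once:  callable that triggers one migration pass
#   remaining: callable that returns how many documents still lack traversal_ids
#   pause:     seconds to sleep between runs, giving the task time to complete
def run_until_migrated(run_once:, remaining:, max_runs: 100, pause: 0)
  runs = 0
  while remaining.call > 0 && runs < max_runs
    run_once.call
    runs += 1
    sleep(pause) if pause > 0
  end
  runs
end
```

In the console, run_once could be -> { Elastic::MigrationWorker.new.perform } and remaining could wrap the count query shown earlier in this description. A pause of several seconds leaves time for each update_by_query task to complete.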
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Edited by Terri Chu