Refactor migration to process more data
What does this MR do and why?
Related to #351381 (closed)
Background
Following up from !109379 (merged) and !107730 (merged) after more testing in staging.
Note: Advanced Search migrations are currently paused in production via a feature flag
In staging, the migration is taking too long to run for a few reasons:
- there are projects in the index that are not found in the database
- many of those projects contain a small number of files (10-20 files)
- the migration only processes 1 project every 45 seconds
At the current rate, staging will not finish for a long time. I'm unsure whether this issue with projects deleted from the database also exists in production, though I suspect it does. The deleted projects are automation projects created by QA, and I'll open another issue to look into what is happening there.
Changes
This MR aims to speed up the migration by:
- reducing the throttle to 5 seconds, since running more often lets us process data faster
- running up to 100 projects at once
- tracking multiple tasks in the migration_state, each with its associated project_id
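To illustrate the last two points, here is a minimal sketch of how several tasks could be tracked alongside their project ids with a cap of 100 per run. It is not the code in this MR: the `tasks` key, the method names, and the block-based interface are all assumptions made for illustration.

```ruby
# Hypothetical sketch: none of these names are taken from the MR's code.
MAX_PROJECTS_PER_RUN = 100 # "run up to 100 projects at once"

# Start one task per untracked project, up to the remaining capacity, and
# record each task id next to the project id it belongs to. The block is
# expected to start the task and return its id.
def track_new_tasks(state, candidate_project_ids, limit: MAX_PROJECTS_PER_RUN)
  tasks = state.fetch(:tasks, [])
  tracked = tasks.map { |task| task[:project_id] }
  capacity = [limit - tasks.size, 0].max

  new_tasks = (candidate_project_ids - tracked).first(capacity).map do |project_id|
    { project_id: project_id, task_id: yield(project_id) }
  end

  state.merge(tasks: tasks + new_tasks)
end

# Drop every task the block reports as finished; what is left is still in flight.
def drop_finished_tasks(state)
  remaining = state.fetch(:tasks, []).reject { |task| yield(task[:task_id]) }
  state.merge(tasks: remaining)
end
```

Each run would first drop finished tasks and then top the list back up, so the state never holds more than 100 entries and a slow task does not block the others.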
Screenshots or screen recordings
N/A
How to set up and validate locally
Elasticsearch query to remove migration from migrations index
URL: DELETE http://localhost:9200/gitlab-development-migrations/_doc/20221221110300
Elasticsearch query to remove traversal_ids from documents
URL: POST http://localhost:9200/gitlab-development/_update_by_query?wait_for_completion=true&refresh=true
{
  "script": "ctx._source.remove('traversal_ids');",
  "query": {
    "bool": {
      "must": {
        "terms": {
          "type": [
            "wiki_blob",
            "blob"
          ]
        }
      }
    }
  }
}
Elasticsearch query to count documents missing traversal_ids
URL: GET http://localhost:9200/gitlab-development/_count
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "traversal_ids"
        }
      },
      "must": {
        "terms": {
          "type": [
            "wiki_blob",
            "blob"
          ]
        }
      }
    }
  }
}
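If you would rather run the update and count queries from a Ruby script than from an HTTP client, the sketch below builds the same two request bodies. The host and index name are the ones used in the queries above; the helper names and the `es_post` wrapper are hypothetical and not part of GitLab.

```ruby
require "json"
require "net/http"
require "uri"

# Host and index from the queries above (a local GDK setup).
ES_BASE = "http://localhost:9200/gitlab-development"
BLOB_TYPES = %w[wiki_blob blob].freeze

# Filter shared by both queries: only blob and wiki_blob documents.
def blob_type_filter
  { terms: { type: BLOB_TYPES } }
end

# Body for _update_by_query: strip traversal_ids from matching documents.
def remove_traversal_ids_body
  {
    script: "ctx._source.remove('traversal_ids');",
    query: { bool: { must: blob_type_filter } }
  }
end

# Body for _count: matching documents that have no traversal_ids field.
def missing_traversal_ids_body
  {
    query: {
      bool: {
        must_not: { exists: { field: "traversal_ids" } },
        must: blob_type_filter
      }
    }
  }
end

# Hypothetical helper: POST a JSON body and return the parsed JSON response.
def es_post(url, body)
  uri = URI(url)
  request = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
  request.body = JSON.generate(body)
  response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
  JSON.parse(response.body)
end
```

For example, `es_post("#{ES_BASE}/_count", missing_traversal_ids_body)["count"]` should return the number of documents still missing traversal_ids. The `_count` endpoint accepts POST as well as GET, which is why a single helper is enough here.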
- set up gdk for Elasticsearch and make sure the indexes are created and populated
- stop rails-background-jobs to make sure the cron worker doesn't process the migration for you
  gdk stop rails-background-jobs
- remove all of the existing traversal_ids from the index (see "Elasticsearch query to remove traversal_ids from documents" above)
- check that there are blob and wiki_blob documents missing traversal_ids (see "Elasticsearch query to count documents missing traversal_ids" above)
- delete the migration record from the migrations index if it exists (see "Elasticsearch query to remove migration from migrations index" above)
- start rails console and manually run the migration worker
  Elastic::MigrationWorker.new.perform
- monitor the elasticsearch.log file for progress. You will need to run the worker repeatedly to walk through all of the projects in your index (there may be a lot)
- check the migration document to make sure it gets updated with tracking details
  GET http://localhost:9200/gitlab-development-migrations/_doc/20221221110300
- once you are satisfied, you can start rails-background-jobs back up and let the migration finish automatically
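Because the worker has to be run many times by hand, a small loop can save some typing in the rails console. This is only a sketch: `run_until_done` is a hypothetical helper, and the `remaining` callable is assumed to wrap the count query for documents missing traversal_ids.

```ruby
# Hypothetical helper: call `step` until `remaining` reports zero or the run
# limit is reached. Returns the number of times `step` was called.
def run_until_done(step:, remaining:, max_runs: 1_000, pause: 0)
  runs = 0
  while remaining.call.positive? && runs < max_runs
    step.call
    runs += 1
    sleep(pause) if pause.positive?
  end
  runs
end

# Possible use in the rails console (count_missing_traversal_ids is a
# hypothetical method that runs the count query and returns an integer):
#
#   run_until_done(
#     step: -> { Elastic::MigrationWorker.new.perform },
#     remaining: -> { count_missing_traversal_ids },
#     pause: 5
#   )
```

The `max_runs` cap is there so the loop stops even if the count never reaches zero, for example if some documents cannot be updated.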
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.