
Migration should update 1 project at a time & remove missing projects

What does this MR do and why?

Related to #351381 (closed)

Noticed a few issues while testing the migration from !107730 (merged) in staging.

Note: I do not believe this needs a revert; we can fix it in the current milestone. Advanced Search migrations are currently paused in production via a feature flag.

  • Projects which are missing from the database should not remain in the index. They will never pass the missing traversal_ids check, so the migration will run forever.
  • The migration kicks off update_by_query for many projects at once but stores only one task_id. This can leave too many tasks running in Elasticsearch at the same time, and we have no way to track task completion (except for the one task stored in the migration).
  • update_by_query is not using batch_size.

Fixes

  • remove projects not found in the database from the index using ElasticDeleteProjectWorker
  • only process 1 project at a time, 1 project will map to 1 task in the migration
  • add max_docs to update_by_query
  • update specs
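
The fixes above can be sketched as a single migration step. This is a minimal Python illustration under stated assumptions, not the Ruby code in this MR: `FakeElasticsearch`, `migrate_step`, the state keys, and the `BATCH_SIZE` value are hypothetical names chosen for the example. Only the ideas come from the description: one stored task maps to one project, projects missing from the database are deleted from the index, and each update is capped with max_docs.

```python
from dataclasses import dataclass, field

BATCH_SIZE = 100  # assumed cap passed as max_docs; the real value is not stated in this MR


@dataclass
class FakeElasticsearch:
    """In-memory stand-in for the index (hypothetical, for illustration only)."""
    # project_id -> number of documents still missing traversal_ids
    missing: dict = field(default_factory=dict)
    tasks: dict = field(default_factory=dict)
    next_task: int = 1

    def next_project_missing_traversal_ids(self):
        pending = sorted(p for p, n in self.missing.items() if n > 0)
        return pending[0] if pending else None

    def update_by_query(self, project_id, max_docs):
        # Update at most max_docs documents and return a task id.
        self.missing[project_id] = max(0, self.missing[project_id] - max_docs)
        task_id = f"task-{self.next_task}"
        self.next_task += 1
        self.tasks[task_id] = "completed"  # the fake finishes every task immediately
        return task_id

    def task_completed(self, task_id):
        return self.tasks.get(task_id) == "completed"

    def delete_project(self, project_id):
        # Stands in for scheduling ElasticDeleteProjectWorker.
        self.missing.pop(project_id, None)


def migrate_step(es, database_project_ids, state):
    """Run one migration iteration and return the new migration state."""
    task_id = state.get("task_id")
    if task_id is not None and not es.task_completed(task_id):
        return state  # the single tracked task is still running: start nothing new

    project_id = es.next_project_missing_traversal_ids()
    if project_id is None:
        return {"completed": True}

    if project_id not in database_project_ids:
        # A project missing from the database can never pass the check, so remove it.
        es.delete_project(project_id)
        return {"deleted_project_id": project_id}

    # Exactly one project and one task per iteration.
    task_id = es.update_by_query(project_id, max_docs=BATCH_SIZE)
    return {"project_id": project_id, "task_id": task_id}
```

Because the state holds at most one task_id, the number of update tasks started by the migration is bounded to one at a time, and a project with more than `BATCH_SIZE` documents is simply picked up again on the next iteration.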

Screenshots or screen recordings

N/A

How to set up and validate locally

Elasticsearch query to remove migration from migrations index

URL: DELETE http://localhost:9200/gitlab-development-migrations/_doc/20221221110300

Elasticsearch query to remove traversal_ids from documents

URL: POST http://localhost:9200/gitlab-development/_update_by_query?wait_for_completion=true&refresh=true

{
	"script": "ctx._source.remove('traversal_ids');",
	"query": {
		"bool": {
			"must": {
				"terms": {
					"type": [
						"wiki_blob",
						"blob"
					]
				}
			}
		}
	}
}

Elasticsearch query to count documents missing traversal_ids

URL: GET http://localhost:9200/gitlab-development/_count

{
	"query": {
		"bool": {
			"must_not": {
				"exists": {
					"field": "traversal_ids"
				}
			},
			"must": {
				"terms": {
					"type": [
						"wiki_blob",
						"blob"
					]
				}
			}
		}
	}
}

  1. set up GDK for Elasticsearch and make sure the indexes are created and populated
  2. stop rails-background-jobs to make sure the cron worker doesn't process the migration for you: gdk stop rails-background-jobs
  3. remove all of the existing traversal_ids from the index (see Elasticsearch query to remove traversal_ids from documents above)
  4. check that there are blob and wiki_blob documents missing traversal_ids (see Elasticsearch query to count documents missing traversal_ids above)
  5. delete the migration record from the migrations index if it exists (see Elasticsearch query to remove migration from migrations index above)
  6. start a rails console and manually run the migration worker: Elastic::MigrationWorker.new.perform
  7. monitor the elasticsearch.log file for progress. The migration now handles one project per run, so you will need to run the worker repeatedly to walk through all of the projects in your index (there may be a lot)
  8. you can start rails-background-jobs back up and let it automatically finish once you are satisfied
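
The three Elasticsearch requests used in the steps above can also be built from a small script. This is a rough sketch using only the Python standard library. The URLs, index names, migration version, and query bodies are copied from this description; the function names and the `send` helper are hypothetical and not part of GitLab or GDK.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"     # local Elasticsearch, as in the URLs above
INDEX = "gitlab-development"           # development index name from the URLs above
MIGRATION_VERSION = "20221221110300"   # migration document id from the DELETE URL above
TYPES = ["wiki_blob", "blob"]


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request with an optional JSON body."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def delete_migration_request():
    # Step 5: remove the migration record from the migrations index.
    return build_request("DELETE", f"/{INDEX}-migrations/_doc/{MIGRATION_VERSION}")


def remove_traversal_ids_request():
    # Step 3: strip traversal_ids from all blob and wiki_blob documents.
    body = {
        "script": "ctx._source.remove('traversal_ids');",
        "query": {"bool": {"must": {"terms": {"type": TYPES}}}},
    }
    path = f"/{INDEX}/_update_by_query?wait_for_completion=true&refresh=true"
    return build_request("POST", path, body)


def count_missing_request():
    # Step 4: count blob and wiki_blob documents that have no traversal_ids.
    body = {
        "query": {
            "bool": {
                "must_not": {"exists": {"field": "traversal_ids"}},
                "must": {"terms": {"type": TYPES}},
            }
        }
    }
    return build_request("GET", f"/{INDEX}/_count", body)


def send(request):
    """Send a request to the local cluster and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With Elasticsearch running locally, `send(count_missing_request())["count"]` can be checked before and after each run of the migration worker to confirm the number of documents missing traversal_ids is going down.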

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Terri Chu
