
Refactor migration to process more data

What does this MR do and why?

Related to #351381 (closed)

Background

Following up from !109379 (merged) and !107730 (merged) after more testing in staging.

Note: Advanced Search migrations are currently paused in production via a feature flag.

In staging, the migration is taking too long to run for a few reasons:

  • there are projects in the index that are not found in the database
  • many of those projects contain a small number of files (10-20 files)
  • the migration only processes 1 project every 45 seconds

At the current rate, staging will not finish for a long time. I'm unsure whether this issue with projects deleted from the database also exists in production (I suspect it does). The deleted projects are automation projects created by QA, and I'll open another issue to look into what is happening there.
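To put the current rate in numbers, here is a rough back-of-the-envelope sketch. It assumes every run finishes within the throttle interval, so the result is an upper bound. The function name is made up for illustration, and the 100,000-project index size is a hypothetical figure, not a measurement from staging.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Upper bound on projects processed per day for a given batch size and throttle.
# Assumes each run completes before the next one is due.
def max_projects_per_day(projects_per_run:, throttle_seconds:)
  (SECONDS_PER_DAY / throttle_seconds) * projects_per_run
end

# Current behaviour described above: 1 project every 45 seconds.
current_rate = max_projects_per_day(projects_per_run: 1, throttle_seconds: 45)
puts current_rate # => 1920

# Hypothetical index size, for illustration only.
hypothetical_project_count = 100_000
days_needed = (hypothetical_project_count.to_f / current_rate).ceil
puts days_needed # => 53
```

At most 1,920 projects per day means that even a moderately sized index takes weeks to walk through.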

Changes

This MR aims to speed up the migration by:

  • reducing the throttle to 5 seconds; running more often allows us to process data faster
  • running up to 100 projects at once
  • tracking multiple tasks in the migration_state, along with the associated project_id
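As an illustration of the last two points, the sketch below shows one possible way to track a batch of tasks together with their project ids. It is not the code in this MR: the method names, the `tasks` key, and the hash layout are all hypothetical, and the blocks stand in for whatever actually submits a per-project update and checks whether a task has finished.

```ruby
# Batch limit taken from the description above.
MAX_PROJECTS_PER_RUN = 100

# Starts one task per project (up to the limit) and returns the state to persist.
# The block is a stand-in that submits the work for a project and returns a task id.
def start_batch(project_ids, limit: MAX_PROJECTS_PER_RUN)
  tasks = project_ids.first(limit).map do |project_id|
    { task_id: yield(project_id), project_id: project_id }
  end
  { tasks: tasks }
end

# Returns a new state without the tasks that the block reports as completed.
def prune_completed(state)
  { tasks: state[:tasks].reject { |task| yield(task[:task_id]) } }
end

# Usage with fake task ids (hypothetical values).
state = start_batch((1..250).to_a) { |project_id| "task-#{project_id}" }
puts state[:tasks].size # => 100

state = prune_completed(state) { |task_id| task_id == "task-1" }
puts state[:tasks].size # => 99
```

Keeping the project_id next to each task id makes it possible to tell which project a still-running or failed task belongs to on the next run.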

Screenshots or screen recordings

N/A

How to set up and validate locally

Elasticsearch query to remove migration from migrations index

URL: DELETE http://localhost:9200/gitlab-development-migrations/_doc/20221221110300

Elasticsearch query to remove traversal_ids from documents

URL: POST http://localhost:9200/gitlab-development/_update_by_query?wait_for_completion=true&refresh=true

{
	"script": "ctx._source.remove('traversal_ids');",
	"query": {
		"bool": {
			"must": {
				"terms": {
					"type": [
						"wiki_blob",
						"blob"
					]
				}
			}
		}
	}
}

Elasticsearch query to count documents missing traversal_ids

URL: GET http://localhost:9200/gitlab-development/_count

{
	"query": {
		"bool": {
			"must_not": {
				"exists": {
					"field": "traversal_ids"
				}
			},
			"must": {
				"terms": {
					"type": [
						"wiki_blob",
						"blob"
					]
				}
			}
		}
	}
}

  1. set up GDK for Elasticsearch and make sure the indexes are created and populated
  2. stop rails-background-jobs to make sure the cron worker doesn't process the migration for you: gdk stop rails-background-jobs
  3. remove all of the existing traversal_ids from the index (see Elasticsearch query to remove traversal_ids from documents above)
  4. check that there are blob and wiki_blob documents missing traversal_ids (see Elasticsearch query to count documents missing traversal_ids above)
  5. delete the migration record from the migrations index if it exists (see Elasticsearch query to remove migration from migrations index above)
  6. start a Rails console and manually run the migration worker: Elastic::MigrationWorker.new.perform
  7. monitor the elasticsearch.log file for progress. You will need to run the worker repeatedly to walk through all of the projects in your index (there may be a lot)
  8. check the migration document to make sure it gets updated with tracking details: GET http://localhost:9200/gitlab-development-migrations/_doc/20221221110300
  9. once you are satisfied, you can start rails-background-jobs again and let it finish automatically
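If you prefer to script the Elasticsearch calls from steps 3, 4, 5, and 8 rather than use an HTTP client, a minimal sketch using only Ruby's standard library could look like the following. The URLs, index names, and query bodies are copied from this description. The helper names (`build_request`, `send_request`) and the constants are made up for illustration, and the sketch assumes a default local GDK with no Elasticsearch authentication.

```ruby
require "json"
require "net/http"
require "uri"

# Values taken from the URLs in this description (local GDK defaults assumed).
ES_URL = "http://localhost:9200"
INDEX = "gitlab-development"
MIGRATION_DOC = "/gitlab-development-migrations/_doc/20221221110300"
BLOB_TYPES = %w[wiki_blob blob].freeze

# Step 3: body that removes traversal_ids from blob and wiki_blob documents.
def remove_traversal_ids_body
  {
    script: "ctx._source.remove('traversal_ids');",
    query: { bool: { must: { terms: { type: BLOB_TYPES } } } }
  }
end

# Step 4: body that counts blob and wiki_blob documents missing traversal_ids.
def count_missing_traversal_ids_body
  {
    query: {
      bool: {
        must_not: { exists: { field: "traversal_ids" } },
        must: { terms: { type: BLOB_TYPES } }
      }
    }
  }
end

# Hypothetical helper: builds a request object without sending it.
def build_request(request_class, path, body = nil)
  request = request_class.new(URI.join(ES_URL, path))
  if body
    request["Content-Type"] = "application/json"
    request.body = JSON.generate(body)
  end
  request
end

# Hypothetical helper: sends a built request and returns the response.
def send_request(request)
  uri = request.uri
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end

REQUESTS = {
  remove_traversal_ids: build_request(
    Net::HTTP::Post,
    "/#{INDEX}/_update_by_query?wait_for_completion=true&refresh=true",
    remove_traversal_ids_body
  ),
  count_missing: build_request(Net::HTTP::Get, "/#{INDEX}/_count", count_missing_traversal_ids_body),
  delete_migration: build_request(Net::HTTP::Delete, MIGRATION_DOC),
  check_migration: build_request(Net::HTTP::Get, MIGRATION_DOC)
}.freeze
```

Nothing is sent until you call, for example, `puts send_request(REQUESTS[:count_missing]).body` against a running local Elasticsearch.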

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.


Edited by Terri Chu
