Registry import observer worker
🔭 Context
We are preparing for Phase 2 of the Container Registry migration which involves importing all existing container repositories to the new platform (Phase 1 involved routing all new container repositories to the new platform). See &7316 (closed) for full details of how the import will work.
Once we start the import process, we need a way to monitor what states container repositories are in so we can quickly identify any problems.
🔬 What does this MR do and why?
This MR introduces a new cron worker, `ContainerRegistry::Migration::ObserverWorker`, which will log the counts of each `migration_state` for `container_repositories` every 30 minutes. This will allow us to see if, for example, too many registry imports are aborting or getting stuck as we ramp up the import speed.
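The worker's core logic can be sketched in plain Ruby, decoupled from Rails and Sidekiq so it runs standalone. The state names and the injected counter below are assumptions for illustration; the real worker would query the `ContainerRepository` model and use structured application logging.

```ruby
require 'logger'
require 'json'

# Minimal sketch of the observer logic. MIGRATION_STATES lists seven
# hypothetical state names (the document says there are 7 states; these
# particular names are an assumption).
class MigrationObserver
  MIGRATION_STATES = %w[default pre_importing pre_import_done importing
                        import_done import_aborted import_skipped].freeze

  # counter: a callable mapping a state name to its record count,
  # standing in for the real batch-counted database query.
  def initialize(counter:, logger: Logger.new($stdout))
    @counter = counter
    @logger = logger
  end

  def perform
    counts = MIGRATION_STATES.to_h { |state| [state, @counter.call(state)] }
    @logger.info(JSON.generate(counts)) # one log line with all counts
    counts
  end
end
```

A stub counter makes it easy to exercise the logging path without a database, which is also how the worker's unit tests could be structured.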
🐘 Database
The `BatchCount` module is used to count the records.
I considered a query without batch counting, using an index on `migration_state`, but that was over the performance guideline even with an INDEX ONLY SCAN, and as the table grows it will only get slower. Batching creates more queries, but keeps each one fast, so there is no risk of timeouts. To keep the performance of the batch counter high, we add an index on `(id, migration_state)` so we can ensure an INDEX ONLY SCAN is used for these queries.
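The id-range batching loop can be illustrated with an in-memory sketch. The real implementation issues the SQL shown below; this mimics the same loop (find the lowest id, locate the next batch boundary, count the range, advance) in plain Ruby, with the batch size of 50,000 taken from the document.

```ruby
# Illustrative sketch of id-range batch counting. `rows` is an array of
# hashes with :id and :migration_state, standing in for the table.
def batch_count(rows, state, batch_size: 50_000)
  # Mirrors: SELECT id WHERE migration_state = state (ids kept sorted).
  ids = rows.select { |r| r[:migration_state] == state }
            .map { |r| r[:id] }.sort

  total = 0
  lower = ids.first # mirrors the initial MIN(id) query
  until lower.nil?
    in_scope = ids.select { |id| id >= lower }
    # Mirrors: SELECT id WHERE id >= lower ... ORDER BY id
    #          LIMIT 1 OFFSET batch_size  (nil on the final batch)
    upper = in_scope[batch_size]
    # Mirrors: SELECT COUNT(id) WHERE id >= lower AND id < upper
    total += in_scope.count { |id| upper.nil? || id < upper }
    lower = upper
  end
  total
end
```

Each iteration touches at most one batch of ids, which is why the individual queries stay fast regardless of total table size.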
Queries
The `batch_count` operation will create the following queries. These examples use `migration_state: 'default'`, which is the current state of all container repositories on production, but the count will run for each state (there are 7). Most of the time, the majority of container repositories will be in `default` or `import_done`, with a smaller number in the other states.
Postgres.ai setup
This sets some records to other migration states (currently all are `default` on production) so we get more realistic query times.
exec CREATE INDEX CONCURRENTLY index_container_repositories_on_id_migration_state ON container_repositories USING btree (id, migration_state);
exec update container_repositories set migration_state = 'import_done' where id % 3 = 0;
exec update container_repositories set migration_state = 'importing' where id % 1000 = 0;
Unfortunately, because the index is added and data is touched on update, we cannot get true cold-cache queries.
Initial min query: 1.68ms
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8343/commands/29516
SELECT MIN("container_repositories"."id") FROM "container_repositories" WHERE "container_repositories"."migration_state" = 'default';
Initial max query: 0.37ms
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8343/commands/29517
SELECT MAX("container_repositories"."id") FROM "container_repositories" WHERE "container_repositories"."migration_state" = 'default';
Batch max query: 20ms
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8343/commands/29531
SELECT "container_repositories"."id" FROM "container_repositories" WHERE "container_repositories"."id" >= 1 AND "container_repositories"."migration_state" = 'default' ORDER BY "container_repositories"."id" ASC LIMIT 1 OFFSET 50000;
Count query: 55.8ms
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8343/commands/29520
Note: since we updated the values of many records, I took a random lower ID and guessed at the upper ID, so the count is higher than 50,000, giving a conservative query time.
SELECT COUNT("container_repositories"."id") FROM "container_repositories" WHERE "container_repositories"."migration_state" = 'default' AND "container_repositories"."id" >= 1 AND "container_repositories"."id" < ;
Total job time
We try to keep Sidekiq jobs below 5 minutes.
There are ~2 million rows in the table being counted, which is 40 batches of 50_000. So, conservatively, if each batch takes 100ms and we run 40 batches for each of the 7 migration states:
(100 * 40 * 7) = 28000ms => 28s
This gives us plenty of room in the 5 minute window.
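The estimate above can be checked with a few lines of Ruby (the ~2 million row count and 100ms per batch are the figures assumed in this MR description):

```ruby
# Back-of-envelope check of the worker's total runtime estimate.
rows         = 2_000_000  # approximate table size
batch_size   = 50_000
per_batch_ms = 100        # conservative per-batch cost
states       = 7

batches  = rows / batch_size               # batches per state
total_ms = per_batch_ms * batches * states # total across all states

puts "#{total_ms} ms (#{total_ms / 1000.0} s)"
```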
Migration output
Up:
== 20220202034409 AddTmpIndexOnIdAndMigrationStateToContaienrRepositories: migrating
-- transaction_open?()
-> 0.0000s
-- index_exists?(:container_repositories, [:id, :migration_state], {:name=>"tmp_index_container_repositories_on_id_migration_state", :algorithm=>:concurrently})
-> 0.0058s
-- execute("SET statement_timeout TO 0")
-> 0.0008s
-- add_index(:container_repositories, [:id, :migration_state], {:name=>"tmp_index_container_repositories_on_id_migration_state", :algorithm=>:concurrently})
-> 0.0144s
-- execute("RESET statement_timeout")
-> 0.0008s
== 20220202034409 AddTmpIndexOnIdAndMigrationStateToContaienrRepositories: migrated (0.0248s)
Down:
== 20220202034409 AddTmpIndexOnIdAndMigrationStateToContaienrRepositories: reverting
-- transaction_open?()
-> 0.0000s
-- indexes(:container_repositories)
-> 0.0048s
-- execute("SET statement_timeout TO 0")
-> 0.0009s
-- remove_index(:container_repositories, {:algorithm=>:concurrently, :name=>"tmp_index_container_repositories_on_id_migration_state"})
-> 0.0069s
-- execute("RESET statement_timeout")
-> 0.0008s
== 20220202034409 AddTmpIndexOnIdAndMigrationStateToContaienrRepositories: reverted (0.0166s)
📷 Screenshots or screen recordings
N/A
How to set up and validate locally
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- [x] I have evaluated the MR acceptance checklist for this MR.
Related to #349746 (closed)