Registry import observer worker
🔭 Context
We are preparing for Phase 2 of the Container Registry migration which involves importing all existing container repositories to the new platform (Phase 1 involved routing all new container repositories to the new platform). See &7316 (closed) for full details of how the import will work.
Once we start the import process, we need a way to monitor what states container repositories are in so we can quickly identify any problems.
🔬 What does this MR do and why?
This MR introduces a new cron worker, `ContainerRegistry::Migration::ObserverWorker`, which will log the counts of each `migration_state` for `container_repositories` every 30 minutes. This will allow us to see if, for example, too many registry imports are aborting or getting stuck as we ramp up the import speed.
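The worker's core logic can be sketched in plain Ruby, decoupled from Rails and Sidekiq so it runs standalone. The state names and the injected counter below are assumptions for illustration; the real worker would query the `ContainerRepository` model and use structured application logging.

```ruby
require 'logger'
require 'json'

# Minimal sketch of the observer logic. MIGRATION_STATES lists seven
# hypothetical state names (the document says there are 7 states; these
# particular names are an assumption).
class MigrationObserver
  MIGRATION_STATES = %w[default pre_importing pre_import_done importing
                        import_done import_aborted import_skipped].freeze

  # counter: a callable mapping a state name to its record count,
  # standing in for the real batch-counted database query.
  def initialize(counter:, logger: Logger.new($stdout))
    @counter = counter
    @logger = logger
  end

  def perform
    counts = MIGRATION_STATES.to_h { |state| [state, @counter.call(state)] }
    @logger.info(JSON.generate(counts)) # one log line with all counts
    counts
  end
end
```

A stub counter makes it easy to exercise the logging path without a database, which is also how the worker's unit tests could be structured.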
🐘 Database
The `BatchCount` module is used to count the records.
I considered a query without batch counting, using an index on `migration_state`, but that was over the performance guideline even with an INDEX ONLY SCAN, and as the table grows it will only get slower. Batching creates more queries, but keeps each one fast, so there is no risk of timeouts. To keep the performance of the batch counter high, we add an index on `(id, migration_state)` so we can ensure an INDEX ONLY SCAN is used for these queries.
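The id-range batching loop can be illustrated with an in-memory sketch. The real implementation issues the SQL shown below; this mimics the same loop (find the lowest id, locate the next batch boundary, count the range, advance) in plain Ruby, with the batch size of 50,000 taken from the document.

```ruby
# Illustrative sketch of id-range batch counting. `rows` is an array of
# hashes with :id and :migration_state, standing in for the table.
def batch_count(rows, state, batch_size: 50_000)
  # Mirrors: SELECT id WHERE migration_state = state (ids kept sorted).
  ids = rows.select { |r| r[:migration_state] == state }
            .map { |r| r[:id] }.sort

  total = 0
  lower = ids.first # mirrors the initial MIN(id) query
  until lower.nil?
    in_scope = ids.select { |id| id >= lower }
    # Mirrors: SELECT id WHERE id >= lower ... ORDER BY id
    #          LIMIT 1 OFFSET batch_size  (nil on the final batch)
    upper = in_scope[batch_size]
    # Mirrors: SELECT COUNT(id) WHERE id >= lower AND id < upper
    total += in_scope.count { |id| upper.nil? || id < upper }
    lower = upper
  end
  total
end
```

Each iteration touches at most one batch of ids, which is why the individual queries stay fast regardless of total table size.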
Queries
The `batch_count` operation will create the following queries. These examples use `migration_state: 'default'`, which is the current state of all container repositories on production, but the count will run for each state (there are 7). Most of the time, the majority of container repositories will be in `default` or `import_done`, with a smaller number in the other states.
Postgres.ai setup
This sets some records to other migration states (currently all are `default` on production) so we get more realistic query times.
exec CREATE INDEX CONCURRENTLY index_container_repositories_on_id_migration_state ON container_repositories USING btree (id, migration_state);
exec update container_repositories set migration_state = 'import_done' where id % 3 = 0;
exec update container_repositories set migration_state = 'importing' where id % 1000 = 0;
Unfortunately, because the index is added and data is touched on update, we cannot get true cold-cache queries.
Initial min query: 1.68ms
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8343/commands/29516
SELECT MIN("container_repositories"."id") FROM "container_repositories" WHERE "container_repositories"."migration_state" = 'default';
Initial max query: 0.37ms
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8343/commands/29517
SELECT MAX("container_repositories"."id") FROM "container_repositories" WHERE "container_repositories"."migration_state" = 'default';
Batch max query: 20ms
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8343/commands/29531
SELECT "container_repositories"."id" FROM "container_repositories" WHERE "container_repositories"."id" >= 1 AND "container_repositories"."migration_state" = 'default' ORDER BY "container_repositories"."id" ASC LIMIT 1 OFFSET 50000;
Count query: 55.8ms
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/8343/commands/29520
Note: since we updated the values of many records, I took a random lower ID and guessed at the upper ID, so the count is higher than 50,000, giving a conservative query time.
SELECT COUNT("container_repositories"."id") FROM "container_repositories" WHERE "container_repositories"."migration_state" = 'default' AND "container_repositories"."id" >= 1 AND "container_repositories"."id" < ;
Total job time
We try to keep Sidekiq jobs below 5 minutes.
There are ~2 million rows in the table being counted, which is 40 batches of 50_000. So, conservatively, if each batch takes 100ms and we run 40 batches for each of the 7 migration states:
(100 * 40 * 7) = 28000ms => 28s
This gives us plenty of room in the 5 minute window.
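The estimate above can be checked with a few lines of Ruby (the ~2 million row count and 100ms per batch are the figures assumed in this MR description):

```ruby
# Back-of-envelope check of the worker's total runtime estimate.
rows         = 2_000_000  # approximate table size
batch_size   = 50_000
per_batch_ms = 100        # conservative per-batch cost
states       = 7

batches  = rows / batch_size               # batches per state
total_ms = per_batch_ms * batches * states # total across all states

puts "#{total_ms} ms (#{total_ms / 1000.0} s)"
```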
Migration output
Up:
== 20220202034409 AddTmpIndexOnIdAndMigrationStateToContaienrRepositories: migrating
-- transaction_open?()
-> 0.0000s
-- index_exists?(:container_repositories, [:id, :migration_state], {:name=>"tmp_index_container_repositories_on_id_migration_state", :algorithm=>:concurrently})
-> 0.0058s
-- execute("SET statement_timeout TO 0")
-> 0.0008s
-- add_index(:container_repositories, [:id, :migration_state], {:name=>"tmp_index_container_repositories_on_id_migration_state", :algorithm=>:concurrently})
-> 0.0144s
-- execute("RESET statement_timeout")
-> 0.0008s
== 20220202034409 AddTmpIndexOnIdAndMigrationStateToContaienrRepositories: migrated (0.0248s)
Down:
== 20220202034409 AddTmpIndexOnIdAndMigrationStateToContaienrRepositories: reverting
-- transaction_open?()
-> 0.0000s
-- indexes(:container_repositories)
-> 0.0048s
-- execute("SET statement_timeout TO 0")
-> 0.0009s
-- remove_index(:container_repositories, {:algorithm=>:concurrently, :name=>"tmp_index_container_repositories_on_id_migration_state"})
-> 0.0069s
-- execute("RESET statement_timeout")
-> 0.0008s
== 20220202034409 AddTmpIndexOnIdAndMigrationStateToContaienrRepositories: reverted (0.0166s)
📷 Screenshots or screen recordings
N/A
How to set up and validate locally
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- [x] I have evaluated the MR acceptance checklist for this MR.
Related to #349746 (closed)