Remove duplicate services in background migration
What does this MR do?
This removes duplicated records of the same service type on the same project, using the current (non-deterministic) default order to decide which record to keep.
Database Migrations
db/post_migrate/20201207165956_remove_duplicate_services.rb
Post-deployment migration: This post-deployment migration uses keyset pagination to bulk-queue background jobs, passing a list of project IDs where duplicate services exist.
Example query plans on staging:
- First batch:
SELECT DISTINCT "services"."project_id" FROM "services" WHERE "services"."project_id" IS NOT NULL GROUP BY "services"."project_id", "services"."type" HAVING (count(*) > 1) ORDER BY "services"."project_id" ASC LIMIT 1
SELECT DISTINCT "services"."project_id" FROM "services" WHERE "services"."project_id" IS NOT NULL AND "services"."project_id" >= 21311 GROUP BY "services"."project_id", "services"."type" HAVING (count(*) > 1) ORDER BY "services"."project_id" ASC LIMIT 1 OFFSET 5000
SELECT DISTINCT "services"."project_id" FROM "services" WHERE "services"."project_id" IS NOT NULL AND "services"."project_id" >= 21311 AND "services"."project_id" < 30604 GROUP BY "services"."project_id", "services"."type" HAVING (count(*) > 1)
- Subsequent batches:
SELECT DISTINCT "services"."project_id" FROM "services" WHERE "services"."project_id" IS NOT NULL AND "services"."project_id" >= 30604 GROUP BY "services"."project_id", "services"."type" HAVING (count(*) > 1) ORDER BY "services"."project_id" ASC LIMIT 1 OFFSET 5000
SELECT DISTINCT "services"."project_id" FROM "services" WHERE "services"."project_id" IS NOT NULL AND "services"."project_id" >= 30604 AND "services"."project_id" < 39360 GROUP BY "services"."project_id", "services"."type" HAVING (count(*) > 1)
- etc.
Migration output:
$ rake db:migrate:up VERSION=20201207165956
== 20201207165956 RemoveDuplicateServices: migrating ==========================
== 20201207165956 RemoveDuplicateServices: migrated (0.0946s) =================
$ rake db:migrate:down VERSION=20201207165956
== 20201207165956 RemoveDuplicateServices: reverting ==========================
== 20201207165956 RemoveDuplicateServices: reverted (0.0000s) =================
lib/gitlab/background_migration/remove_duplicate_services.rb
Background migration: This background migration deletes duplicate records for each batch of project IDs.
Example query plans on staging:
- Once for each batch:
- Once for each project/type combination (5000+ times per batch):
Expected runtime
On production (via #database-lab
):
- 508 projects affected, resulting in 1 batch / < 2 minutes total runtime.
- 715 duplicate records to be deleted.
On staging (via psql
):
- 959831 projects affected, resulting in 192 batches / ~6.4 hours total runtime.
- ~960050+ duplicate records to be deleted (I couldn't find out the exact number because the query times out).
- Most of these were a side-effect of testing instance integrations: gitlab-com/gl-infra/production#1651 (closed).
If we exclude the inherited records (and clean them separately somehow), we'd have:- 41 projects affected, resulting in 1 batch / < 2 minutes total runtime.
- ~249+ duplicate records to be deleted.
Previous timings from first version
db/post_migrate/20201207165956_remove_duplicate_services.rb
Post-deployment migration: This post-deployment migration uses keyset pagination to bulk-queue background jobs for all projects with services.
Example query plans on staging:
- First batch:
SELECT "projects"."id" FROM "projects" WHERE (EXISTS (SELECT "services"."type" FROM "services" WHERE (services.project_id = projects.id) GROUP BY "services"."type" HAVING (count(type) > 1))) ORDER BY "projects"."id" ASC LIMIT 1
SELECT "projects"."id" FROM "projects" WHERE (EXISTS (SELECT "services"."type" FROM "services" WHERE (services.project_id = projects.id) GROUP BY "services"."type" HAVING (count(type) > 1))) AND "projects"."id" >= 21311 ORDER BY "projects"."id" ASC LIMIT 1 OFFSET 5000
SELECT MIN(projects.id), MAX(projects.id) FROM "projects" WHERE (EXISTS (SELECT "services"."type" FROM "services" WHERE (services.project_id = projects.id) GROUP BY "services"."type" HAVING (count(type) > 1))) AND "projects"."id" >= 21311 AND "projects"."id" < 30604
- Subsequent batches:
SELECT "projects"."id" FROM "projects" WHERE (EXISTS (SELECT "services"."type" FROM "services" WHERE (services.project_id = projects.id) GROUP BY "services"."type" HAVING (count(type) > 1))) AND "projects"."id" >= 30604 ORDER BY "projects"."id" ASC LIMIT 1 OFFSET 5000
SELECT MIN(projects.id), MAX(projects.id) FROM "projects" WHERE (EXISTS (SELECT "services"."type" FROM "services" WHERE (services.project_id = projects.id) GROUP BY "services"."type" HAVING (count(type) > 1))) AND "projects"."id" >= 30604 AND "projects"."id" < 39360
- etc.
Migration output:
$ rake db:migrate:up VERSION=20201207165956
== 20201207165956 RemoveDuplicateServices: migrating ==========================
== 20201207165956 RemoveDuplicateServices: migrated (0.0946s) =================
$ rake db:migrate:down VERSION=20201207165956
== 20201207165956 RemoveDuplicateServices: reverting ==========================
== 20201207165956 RemoveDuplicateServices: reverted (0.0000s) =================
lib/gitlab/background_migration/remove_duplicate_services.rb
Background migration: This background migration processes each batch of projects.
Example query plans on staging:
- Once for each batch:
SELECT "projects".* FROM "projects" WHERE (EXISTS (SELECT "services"."type" FROM "services" WHERE (services.project_id = projects.id) GROUP BY "services"."type" HAVING (count(type) > 1))) AND "projects"."id" BETWEEN 21311 AND 30603
- Once for each project:
SELECT "services".* FROM "services" WHERE "services"."project_id" = 21309
- Once for each service type on the project:
SELECT COUNT(*) FROM "services" WHERE "services"."project_id" = 21309 AND "services"."type" = 'JiraService' AND "services"."id" != 48495499
- Once for each service type with duplicate records:
DELETE FROM "services" WHERE "services"."project_id" = 21309 AND "services"."type" = 'JiraService' AND "services"."id" != 48495499
Expected runtime:
- We have around 720'000 projects with services on gitlab.com, but it's not known how many of these have duplicate services.
- A previous production query 9 months ago resulted in 669'991 records which we expect to be deleted.
- With a batch size of 500 that would be at maximum ~150 batches with ~5 hours total runtime.
Does this MR meet the acceptance criteria?
Conformity
-
Changelog entry - [-] Documentation (if required)
-
Code review guidelines -
Merge request performance guidelines -
Style guides -
Database guides - [-] Separation of EE specific content
Availability and Testing
-
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process. - [-] Tested in all supported browsers
- [-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
- [-] Label as security and @ mention
@gitlab-com/gl-security/appsec
- [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- [-] Security reports checked/validated by a reviewer from the AppSec team
Related to #290008 (closed)