Draft: Add unique index to lfs_objects_projects table (!76388) · Merge requests · GitLab.org / GitLab

Stan Hu requested to merge sh-add-unique-constraint-lfs-objects-projects into master Dec 08, 2021

What does this MR do and why?

Previously the lfs_objects_projects table had an index for (project_id, lfs_object_iid), but this was not unique. As a result, some activities that used an upsert, such as copying LFS designs from one issue to another, may actually insert a duplicate entry.

On GitLab.com, we currently have 7,304 duplicate rows for a table that has over 130 million rows. To prevent further duplication we add a new unique index and remove the old one.

Relates to #347466

Database migration timings

Type: Post deploy
Duration: 745.8 s
Database size change: -186.11 MiB

Query	Calls	Total Time	Max Time	Mean Time	Rows
WITH duplicates AS ( SELECT MIN(id) AS min_id, project_id, lfs_object_id, COUNT() FROM lfs_objects_projects GROUP BY (project_id, lfs_object_id) having COUNT() > $1 ) DELETE FROM lfs_objects_projects a USING duplicates b WHERE a.id <> b.min_id AND a.project_id = b.project_id AND a.lfs_object_id = b.lfs_object_id /application:test,db_config_name:main/	1	609485.1 ms	609485.1 ms	609485.1 ms	7548
CREATE UNIQUE INDEX CONCURRENTLY "lfs_objects_projects_on_project_id_and_lfs_object_id_uniq_index" ON "lfs_objects_projects" ("project_id", "lfs_object_id") /application:test,db_config_name:main/	1	133998.6 ms	133998.6 ms	133998.6 ms	0
DROP INDEX CONCURRENTLY "index_lfs_objects_projects_on_project_id_and_lfs_object_id" /application:test,db_config_name:main/	1	3.7 ms	3.7 ms	3.7 ms	0
SELECT "feature_gates".* FROM "feature_gates" WHERE "feature_gates"."feature_key" = $1 /application:test,db_config_name:main/	2	1.7 ms	1.6 ms	0.8 ms	2
SELECT "postgres_async_indexes".* FROM "postgres_async_indexes" WHERE "postgres_async_indexes"."name" = $1 LIMIT $2 /application:test,db_config_name:main/	2	0.1 ms	0.0 ms	0.0 ms	0

The slowest query that finds duplicates takes 90 seconds on GitLab.com:

gitlabhq_production=# explain analyze SELECT MIN(id), project_id, lfs_object_id, COUNT(*) FROM lfs_objects_projects GROUP BY (project_id, lfs_object_id) having COUNT(*) > 1;
                                                                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.57..4848953.96 rows=3139923 width=20) (actual time=328.716..87977.801 rows=7340 loops=1)
   Group Key: project_id, lfs_object_id
   Filter: (count(*) > 1)
   Rows Removed by Filter: 94372871
   ->  Index Scan using index_lfs_objects_projects_on_project_id_and_lfs_object_id on lfs_objects_projects  (cost=0.57..3553735.63 rows=94197696 width=12) (actual time=0.252..57959.160 rows=94387759 loops=1)
 Planning Time: 0.453 ms
 Execution Time: 87978.565 ms
(7 rows)

Recreating the indexes takes more time, though. See !76388 (comment 758368443) for full details.

How to set up and validate locally

Create a project with LFS files.
Fork the project.
On the Rails console, create dupes:

source_project = Project.find_by_full_path('first/project')
target_project = Project.find_by_full_path('second/project')
Projects::LfsPointers::LfsLinkService
        .new(target_project)
        .execute(source_project.lfs_objects_oids)

In gdk psql, run:

gitlabhq_production=# WITH dupes AS (SELECT lfs_object_id, project_id, count(*) AS count from lfs_objects_projects group by (lfs_object_id, project_id) having count(*) > 1) SELECT COUNT(*) from dupes;
 count
-------
  1
(1 row)

count should be greater than 0.

Switch to this branch, run the migration, and ensure count is 0.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Edited Dec 09, 2021 by Stan Hu

Draft: Add unique index to lfs_objects_projects table

What does this MR do and why?

Database migration timings

How to set up and validate locally

MR acceptance checklist

Merge request reports