Draft: Add unique index to lfs_objects_projects table
What does this MR do and why?
Previously the lfs_objects_projects
table had an index for
(project_id, lfs_object_iid), but this was not unique. As a result, some
activities that used an upsert, such as copying LFS designs from one
issue to another, may actually insert a duplicate entry.
On GitLab.com, we currently have 7,304 duplicate rows for a table that has over 130 million rows. To prevent further duplication we add a new unique index and remove the old one.
Relates to #347466
Database migration timings
- Type: Post deploy
- Duration: 745.8 s
- Database size change: -186.11 MiB
Query | Calls | Total Time | Max Time | Mean Time | Rows |
---|---|---|---|---|---|
WITH duplicates AS ( |
1 | 609485.1 ms | 609485.1 ms | 609485.1 ms | 7548 |
CREATE UNIQUE INDEX CONCURRENTLY "lfs_objects_projects_on_project_id_and_lfs_object_id_uniq_index" ON "lfs_objects_projects" ("project_id", "lfs_object_id") /*application:test,db_config_name:main*/ |
1 | 133998.6 ms | 133998.6 ms | 133998.6 ms | 0 |
DROP INDEX CONCURRENTLY "index_lfs_objects_projects_on_project_id_and_lfs_object_id" /*application:test,db_config_name:main*/ |
1 | 3.7 ms | 3.7 ms | 3.7 ms | 0 |
SELECT "feature_gates".* |
2 | 1.7 ms | 1.6 ms | 0.8 ms | 2 |
SELECT "postgres_async_indexes".* |
2 | 0.1 ms | 0.0 ms | 0.0 ms | 0 |
The slowest query that finds duplicates takes 90 seconds on GitLab.com:
gitlabhq_production=# explain analyze SELECT MIN(id), project_id, lfs_object_id, COUNT(*) FROM lfs_objects_projects GROUP BY (project_id, lfs_object_id) having COUNT(*) > 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.57..4848953.96 rows=3139923 width=20) (actual time=328.716..87977.801 rows=7340 loops=1)
Group Key: project_id, lfs_object_id
Filter: (count(*) > 1)
Rows Removed by Filter: 94372871
-> Index Scan using index_lfs_objects_projects_on_project_id_and_lfs_object_id on lfs_objects_projects (cost=0.57..3553735.63 rows=94197696 width=12) (actual time=0.252..57959.160 rows=94387759 loops=1)
Planning Time: 0.453 ms
Execution Time: 87978.565 ms
(7 rows)
Recreating the indexes takes more time, though. See !76388 (comment 758368443) for full details.
How to set up and validate locally
- Create a project with LFS files.
- Fork the project.
- On the Rails console, create dupes:
source_project = Project.find_by_full_path('first/project')
target_project = Project.find_by_full_path('second/project')
Projects::LfsPointers::LfsLinkService
.new(target_project)
.execute(source_project.lfs_objects_oids)
- In
gdk psql
, run:
gitlabhq_production=# WITH dupes AS (SELECT lfs_object_id, project_id, count(*) AS count from lfs_objects_projects group by (lfs_object_id, project_id) having count(*) > 1) SELECT COUNT(*) from dupes;
count
-------
1
(1 row)
count
should be greater than 0.
- Switch to this branch, run the migration, and ensure
count
is 0.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.