Skip to content

Draft: Add unique index to lfs_objects_projects table

Stan Hu requested to merge sh-add-unique-constraint-lfs-objects-projects into master

What does this MR do and why?

Previously the lfs_objects_projects table had an index for (project_id, lfs_object_iid), but this was not unique. As a result, some activities that used an upsert, such as copying LFS designs from one issue to another, may actually insert a duplicate entry.

On GitLab.com, we currently have 7,304 duplicate rows for a table that has over 130 million rows. To prevent further duplication we add a new unique index and remove the old one.

Relates to #347466

Database migration timings

  • Type: Post deploy
  • Duration: 745.8 s
  • Database size change: -186.11 MiB
Query Calls Total Time Max Time Mean Time Rows
WITH duplicates AS (
SELECT MIN(id) AS min_id, project_id, lfs_object_id, COUNT()
FROM lfs_objects_projects
GROUP BY (project_id, lfs_object_id) having COUNT(
) > $1
) DELETE
FROM lfs_objects_projects a USING duplicates b
WHERE a.id <> b.min_id AND a.project_id = b.project_id AND a.lfs_object_id = b.lfs_object_id
/*application:test,db_config_name:main*/
1 609485.1 ms 609485.1 ms 609485.1 ms 7548
CREATE UNIQUE INDEX CONCURRENTLY "lfs_objects_projects_on_project_id_and_lfs_object_id_uniq_index" ON "lfs_objects_projects" ("project_id", "lfs_object_id") /*application:test,db_config_name:main*/
1 133998.6 ms 133998.6 ms 133998.6 ms 0
DROP INDEX CONCURRENTLY "index_lfs_objects_projects_on_project_id_and_lfs_object_id" /*application:test,db_config_name:main*/
1 3.7 ms 3.7 ms 3.7 ms 0
SELECT "feature_gates".*
FROM "feature_gates"
WHERE "feature_gates"."feature_key" = $1 /*application:test,db_config_name:main*/
2 1.7 ms 1.6 ms 0.8 ms 2
SELECT "postgres_async_indexes".*
FROM "postgres_async_indexes"
WHERE "postgres_async_indexes"."name" = $1
LIMIT $2 /*application:test,db_config_name:main*/
2 0.1 ms 0.0 ms 0.0 ms 0

The slowest query that finds duplicates takes 90 seconds on GitLab.com:

gitlabhq_production=# explain analyze SELECT MIN(id), project_id, lfs_object_id, COUNT(*) FROM lfs_objects_projects GROUP BY (project_id, lfs_object_id) having COUNT(*) > 1;
                                                                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.57..4848953.96 rows=3139923 width=20) (actual time=328.716..87977.801 rows=7340 loops=1)
   Group Key: project_id, lfs_object_id
   Filter: (count(*) > 1)
   Rows Removed by Filter: 94372871
   ->  Index Scan using index_lfs_objects_projects_on_project_id_and_lfs_object_id on lfs_objects_projects  (cost=0.57..3553735.63 rows=94197696 width=12) (actual time=0.252..57959.160 rows=94387759 loops=1)
 Planning Time: 0.453 ms
 Execution Time: 87978.565 ms
(7 rows)

Recreating the indexes takes more time, though. See !76388 (comment 758368443) for full details.

How to set up and validate locally

  1. Create a project with LFS files.
  2. Fork the project.
  3. On the Rails console, create dupes:
source_project = Project.find_by_full_path('first/project')
target_project = Project.find_by_full_path('second/project')
Projects::LfsPointers::LfsLinkService
        .new(target_project)
        .execute(source_project.lfs_objects_oids)
  1. In gdk psql, run:
gitlabhq_production=# WITH dupes AS (SELECT lfs_object_id, project_id, count(*) AS count from lfs_objects_projects group by (lfs_object_id, project_id) having count(*) > 1) SELECT COUNT(*) from dupes;
 count
-------
  1
(1 row)

count should be greater than 0.

  1. Switch to this branch, run the migration, and ensure count is 0.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Stan Hu

Merge request reports

Loading