Add two partial nonunique indexes (!21230) · Merge requests · GitLab.org / GitLab

charlie ablett requested to merge 30390-cablett-duplicate-labels-due-to-race-condittion into master Dec 05, 2019

What does this MR do?

Add two partial non-unique indexes on labels for [project_id, title] and [group_id, title].

This is the first part of the strategy for #30390 (comment 254445189). By adding the indexes, we can then run the query on production to identify how many duplicate rows have been created due to this race condition.

Using #database-lab to run (only the [title project_id] index):

explain SELECT L1.project_id, L1.title 
FROM labels L1 
WHERE (SELECT COUNT(*) 
  FROM labels L2 
  WHERE L1.project_id=L2.project_id AND L1.title=L2.title)>1;

Query plan without index

Seq Scan on labels l1  (cost=0.00..76555885.70 rows=3308220 width=13)
  Filter: ((SubPlan 1) > 1)
  SubPlan 1
    ->  Aggregate  (cost=7.64..7.65 rows=1 width=8)
          ->  Index Scan using index_labels_on_project_id on labels l2  (cost=0.43..7.64 rows=1 width=0)
                Index Cond: (l1.project_id = project_id)
                Filter: ((l1.title)::text = (title)::text)

Execution summary without index (3.965 min)

Time: 3.965 min
  - planning: 1.421 ms
  - execution: 3.965 min
    - I/O read: 2.451 min
    - I/O write: 2.447 s

Shared buffers:
  - hits: 83687304 (~638.50 GiB) from the buffer pool
  - reads: 156844 (~1.20 GiB) from the OS file cache, including disk I/O
  - dirtied: 4148 (~32.40 MiB)
  - writes: 26587 (~207.70 MiB)

Having created the index:

CREATE INDEX CONCURRENTLY index_labels_on_project_id_and_title 
ON labels(project_id, title) 
WHERE labels.group_id = null;

Query plan with index

Seq Scan on public.labels l1  (cost=0.00..76597123.74 rows=3310013 width=13) (actual time=208.064..79341.877 rows=906 loops=1)
   Filter: ((SubPlan 1) > 1)
   Rows Removed by Filter: 9929134
   Buffers: shared hit=83843340 read=93
   I/O Timings: read=7.654
   SubPlan 1
     ->  Aggregate  (cost=7.64..7.65 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=9930040)
           Buffers: shared hit=83719214
           ->  Index Scan using index_labels_on_project_id on public.labels l2  (cost=0.43..7.64 rows=1 width=0) (actual time=0.005..0.007 rows=1 loops=9930040)
                 Index Cond: (l1.project_id = l2.project_id)
                 Filter: ((l1.title)::text = (l2.title)::text)
                 Rows Removed by Filter: 11
                 Buffers: shared hit=83719214

Execution summary with index (1.322 min)

Time: 1.322 min
  - planning: 1.044 ms
  - execution: 1.322 min
    - I/O read: 7.654 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 83843340 (~639.70 GiB) from the buffer pool
  - reads: 93 (~744.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

Still has a table scan and is using 639GB 😱 but since we are making a specific query that won't really be indexed again (rows with identical values in two columns) that might be unavoidable? 🤔

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
Tested in all supported browsers

Related to #30390 (closed)

Edited May 19, 2022 by Coung Ngo

Add two partial nonunique indexes

What does this MR do?

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Merge request reports