Skip to content

Remove duplicates from vulnerability_occurrences

Michał Zając requested to merge 212322-remove-duplicate-vulnerabilities into master

What does this MR do?

This MR removes rows from vulnerability_occurrences for which we would get a duplicate UUIDv5. See #292239 (closed) for more information about why this is a problem.

Related to #212322 (closed)

Queries

Total time expected is (6039407/10000) * 0.061ms ~= 36840.3827ms ~= 37s. According to https://docs.gitlab.com/ee/development/database_review.html#timing-guidelines-for-migrations a post-migration has recommended execution time of 10 minutes which should be plenty.

Adding index

CREATE INDEX tmp_idx_deduplicate_vulnerability_occurrences ON vulnerability_occurrences(project_id,report_type,location_fingerprint,primary_identifier_id,id);
Timings
The query has been executed. Duration: 3.453 min

Selecting one batch of duplicates

WITH batch AS (
  SELECT vulnerability_occurrences.report_type, vulnerability_occurrences.location_fingerprint, vulnerability_occurrences.primary_identifier_id, vulnerability_occurrences.project_id 
  FROM vulnerability_occurrences 
  WHERE vulnerability_occurrences.id >= 1
  AND vulnerability_occurrences.id < 10000
) 
SELECT DISTINCT batch.report_type, batch.location_fingerprint, batch.primary_identifier_id, batch.project_id, array_agg(id) as ids 
FROM batch 
AS batch 
INNER JOIN vulnerability_occurrences 
ON vulnerability_occurrences.report_type = batch.report_type
AND vulnerability_occurrences.location_fingerprint = batch.location_fingerprint
AND vulnerability_occurrences.primary_identifier_id = batch.primary_identifier_id
AND vulnerability_occurrences.project_id = batch.project_id
GROUP BY batch.report_type, batch.location_fingerprint, batch.primary_identifier_id, batch.project_id
HAVING (COUNT(*) > 1);

https://explain.depesz.com/s/V4li

Query plan
 Unique  (cost=40808.54..40808.56 rows=1 width=78) (actual time=60.442..60.442 rows=0 loops=1)
   CTE batch
     ->  Index Scan using vulnerability_occurrences_pkey on vulnerability_occurrences vulnerability_occurrences_1  (cost=0.43..9079.63 rows=8972 width=35) (actual time=0.018..12.312 rows=8985 loops=1)
           Index Cond: ((id >= 1) AND (id < 10000))
   ->  Sort  (cost=31728.90..31728.91 rows=1 width=78) (actual time=60.441..60.441 rows=0 loops=1)
         Sort Key: batch.report_type, batch.location_fingerprint, batch.primary_identifier_id, batch.project_id, (array_agg(vulnerability_occurrences.id))
         Sort Method: quicksort  Memory: 25kB
         ->  GroupAggregate  (cost=31728.86..31728.89 rows=1 width=78) (actual time=60.435..60.435 rows=0 loops=1)
               Group Key: batch.report_type, batch.location_fingerprint, batch.primary_identifier_id, batch.project_id
               Filter: (count(*) > 1)
               Rows Removed by Filter: 8985
               ->  Sort  (cost=31728.86..31728.87 rows=1 width=54) (actual time=55.904..56.443 rows=8985 loops=1)
                     Sort Key: batch.report_type, batch.location_fingerprint, batch.primary_identifier_id, batch.project_id
                     Sort Method: quicksort  Memory: 1086kB
                     ->  Nested Loop  (cost=0.56..31728.85 rows=1 width=54) (actual time=0.039..45.779 rows=8985 loops=1)
                           ->  CTE Scan on batch  (cost=0.00..179.44 rows=8972 width=46) (actual time=0.019..15.632 rows=8985 loops=1)
                           ->  Index Scan using index_vulnerability_occurrences_on_unique_keys on vulnerability_occurrences  (cost=0.56..3.51 rows=1 width=43) (actual time=0.003..0.003 rows=1 loops=8985)
                                 Index Cond: ((project_id = batch.project_id) AND (primary_identifier_id = batch.primary_identifier_id) AND (location_fingerprint = batch.location_fingerprint))
                                 Filter: (batch.report_type = report_type)
 Planning Time: 0.447 ms
 Execution Time: 60.784 ms

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team
Edited by Michał Zając

Merge request reports

Loading