Skip to content

Improve the performance of StoreScansWorker

What does this MR do and why?

This MR changes the current way of Security::Finding ingestion from one by one approach to in batches approach. Also marks the findings as deduplicated while creating them to prevent running unnecessary update queries if possible.

If re-running the deduplication logic is required, with this MR, the logic will be running in batches as well.

Related sentry errors:

Database review

Below you can see 2 different versions of the queries introduced by this MR;

Without adding an extra index

Select the lower bound for batch
SELECT
    "security_findings"."id"
FROM
    "security_findings"
WHERE
    "security_findings"."scan_id" = 4855182
    AND "security_findings"."deduplicated" = TRUE
ORDER BY
    "security_findings"."id" ASC
LIMIT 1
Limit  (cost=3013.77..3013.77 rows=1 width=8) (actual time=15.384..15.387 rows=1 loops=1)
 Buffers: shared hit=7 read=21
 I/O Timings: read=15.058 write=0.000
 ->  Sort  (cost=3013.77..3021.22 rows=2981 width=8) (actual time=15.382..15.383 rows=1 loops=1)
       Sort Key: security_findings.id
       Sort Method: top-N heapsort  Memory: 25kB
       Buffers: shared hit=7 read=21
       I/O Timings: read=15.058 write=0.000
       ->  Index Scan using index_security_findings_on_scan_id_and_deduplicated on public.security_findings  (cost=0.57..2998.86 rows=2981 width=8) (actual time=6.117..15.307 rows=167 loops=1)
             Index Cond: ((security_findings.scan_id = 4855182) AND (security_findings.deduplicated = true))
             Buffers: shared hit=4 read=21
             I/O Timings: read=15.058 write=0.000

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/8314/commands/29392

Select the upper bound for the batch
SELECT
    "security_findings"."id"
FROM
    "security_findings"
WHERE
    "security_findings"."scan_id" = 4855182
    AND "security_findings"."deduplicated" = TRUE
    AND "security_findings"."id" >= 16098660
ORDER BY
    "security_findings"."id" ASC
LIMIT 1 OFFSET 50
Limit  (cost=3105.26..3105.26 rows=1 width=8) (actual time=0.254..0.254 rows=1 loops=1)
   Buffers: shared hit=28
   I/O Timings: read=0.000 write=0.000
   ->  Sort  (cost=3105.13..3112.54 rows=2962 width=8) (actual time=0.247..0.250 rows=51 loops=1)
         Sort Key: security_findings.id
         Sort Method: top-N heapsort  Memory: 29kB
         Buffers: shared hit=28
         I/O Timings: read=0.000 write=0.000
         ->  Index Scan using index_security_findings_on_scan_id_and_deduplicated on public.security_findings  (cost=0.57..3006.31 rows=2962 width=8) (actual time=0.041..0.124 rows=167 loops=1)
               Index Cond: ((security_findings.scan_id = 4855182) AND (security_findings.deduplicated = true))
               Filter: (security_findings.id >= 16098660)
               Rows Removed by Filter: 0
               Buffers: shared hit=25
               I/O Timings: read=0.000 write=0.000

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/8314/commands/29397.

Update query
UPDATE
    "security_findings"
SET
    deduplicated = FALSE
WHERE
    "security_findings"."scan_id" = 4855182
    AND "security_findings"."deduplicated" = TRUE
    AND "security_findings"."id" >= 16098660
    AND "security_findings"."id" < 16098763
ModifyTable on public.security_findings  (cost=0.57..25.26 rows=1 width=112) (actual time=0.371..0.371 rows=0 loops=1)
   Buffers: shared hit=110
   I/O Timings: read=0.000 write=0.000
   ->  Index Scan using security_findings_pkey on public.security_findings  (cost=0.57..25.26 rows=1 width=112) (actual time=0.369..0.369 rows=0 loops=1)
         Index Cond: ((security_findings.id >= 16098660) AND (security_findings.id < 16098763))
         Filter: (security_findings.deduplicated AND (security_findings.scan_id = 4855182))
         Rows Removed by Filter: 50
         Buffers: shared hit=110
         I/O Timings: read=0.000 write=0.000

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/8314/commands/29399.

Adter adding an extra index

Execution times improved a lot after creating the following index;

CREATE INDEX index_security_findings_on_scan_id_and_deduplicated_and_id ON security_findings USING btree (scan_id, deduplicated, id);
Select the lower bound for batch
SELECT
    "security_findings"."id"
FROM
    "security_findings"
WHERE
    "security_findings"."scan_id" = 4855182
    AND "security_findings"."deduplicated" = TRUE
ORDER BY
    "security_findings"."id" ASC
LIMIT 1
Limit  (cost=0.57..0.61 rows=1 width=8) (actual time=0.332..0.332 rows=1 loops=1)
 Buffers: shared hit=4 read=4
 I/O Timings: read=0.217 write=0.000
 ->  Index Only Scan using index_security_findings_on_scan_id_and_deduplicated_and_id on public.security_findings  (cost=0.57..109.09 rows=2982 width=8) (actual time=0.330..0.330 rows=1 loops=1)
       Index Cond: ((security_findings.scan_id = 4855182) AND (security_findings.deduplicated = true))
       Heap Fetches: 0
       Buffers: shared hit=4 read=4
       I/O Timings: read=0.217 write=0.000

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/8314/commands/29401

Select the upper bound for the batch
SELECT
    "security_findings"."id"
FROM
    "security_findings"
WHERE
    "security_findings"."scan_id" = 4855182
    AND "security_findings"."deduplicated" = TRUE
    AND "security_findings"."id" >= 16098660
ORDER BY
    "security_findings"."id" ASC
LIMIT 1 OFFSET 50
Limit  (cost=2.52..2.56 rows=1 width=8) (actual time=0.156..0.156 rows=1 loops=1)
   Buffers: shared hit=29
   I/O Timings: read=0.000 write=0.000
   ->  Index Only Scan using index_security_findings_on_scan_id_and_deduplicated_and_id on public.security_findings  (cost=0.57..116.12 rows=2963 width=8) (actual time=0.115..0.151 rows=51 loops=1)
         Index Cond: ((security_findings.scan_id = 4855182) AND (security_findings.deduplicated = true) AND (security_findings.id >= 16098660))
         Heap Fetches: 42
         Buffers: shared hit=29
         I/O Timings: read=0.000 write=0.000

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/8314/commands/29402

Update query
UPDATE
    "security_findings"
SET
    deduplicated = FALSE
WHERE
    "security_findings"."scan_id" = 4855182
    AND "security_findings"."deduplicated" = TRUE
    AND "security_findings"."id" >= 16098660
    AND "security_findings"."id" < 16098763
ModifyTable on public.security_findings  (cost=0.57..3.60 rows=1 width=112) (actual time=0.087..0.088 rows=0 loops=1)
   Buffers: shared hit=7
   I/O Timings: read=0.000 write=0.000
   ->  Index Scan using index_security_findings_on_scan_id_and_deduplicated_and_id on public.security_findings  (cost=0.57..3.60 rows=1 width=112) (actual time=0.036..0.036 rows=0 loops=1)
         Index Cond: ((security_findings.scan_id = 4855182) AND (security_findings.deduplicated = true) AND (security_findings.id >= 16098660) AND (security_findings.id < 16098763))
         Buffers: shared hit=7
         I/O Timings: read=0.000 write=0.000

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/8314/commands/29403

Shall we introduce this index before or with this MR?

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Mehmet Emin INAC

Merge request reports

Loading