Fix Timeout when Auto Resolving SAST Vulnerabilities
What does this MR do and why?
Context:
When one or more SAST rules are removed, the SAST Automatic Vulnerability Resolution process kicks in and auto-resolves all the Vulnerabilities detected from the rules now removed.
Problem:
In the logic of auto resolution, particularly when fetching the dropped vulnerabilities to resolve them, the query to the DB was getting canceled since the associated SQL query was timing out. Here's the issue describing it in detail.
Query
SELECT vulnerabilities.id
FROM vulnerabilities
INNER JOIN projects ON projects.id = vulnerabilities.project_id
INNER JOIN vulnerability_occurrences occurences ON occurences.vulnerability_id = vulnerabilities.id
INNER JOIN vulnerability_identifiers ON vulnerability_identifiers.id = occurences.primary_identifier_id
WHERE
vulnerabilities.state = $1
AND vulnerabilities.resolved_on_default_branch = $2
AND projects.archived = $3
AND vulnerabilities.project_id = $4
AND vulnerability_identifiers.id IN ($5)
ORDER BY vulnerabilities.id ASC
LIMIT 10
Query Plan
Limit (cost=3198.06..3198.06 rows=1 width=8) (actual time=314513.432..314513.454 rows=0 loops=1)
Buffers: shared hit=441460 read=168488 dirtied=34800
I/O Timings: read=306215.850 write=0.000
-> Sort (cost=3198.06..3198.06 rows=1 width=8) (actual time=314513.429..314513.450 rows=0 loops=1)
Sort Key: vulnerabilities.id
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=441460 read=168488 dirtied=34800
I/O Timings: read=306215.850 write=0.000
-> Nested Loop (cost=2.13..3198.05 rows=1 width=8) (actual time=314513.341..314513.361 rows=0 loops=1)
Buffers: shared hit=441457 read=168488 dirtied=34800
I/O Timings: read=306215.850 write=0.000
-> Nested Loop (cost=1.57..3194.46 rows=1 width=16) (actual time=314513.330..314513.335 rows=0 loops=1)
Buffers: shared hit=441457 read=168488 dirtied=34800
I/O Timings: read=306215.850 write=0.000
-> Nested Loop (cost=1.00..3163.43 rows=49 width=8) (actual time=23.446..172950.018 rows=92648 loops=1)
Buffers: shared hit=3143 read=89431 dirtied=3362
I/O Timings: read=170830.319 write=0.000
-> Index Only Scan using vulnerability_identifiers_pkey on public.vulnerability_identifiers identifiers (cost=0.43..11.25 rows=5 width=8) (actual time=6.873..22.494 rows=5 loops=1)
Index Cond: (identifiers.id = ANY ('{318256345,318256346,318256353,318256358,318256361}'::bigint[]))
Heap Fetches: 3
Buffers: shared hit=13 read=14
I/O Timings: read=21.997 write=0.000
-> Index Scan using index_vulnerability_occurrences_on_primary_identifier_id on public.vulnerability_occurrences occurences (cost=0.57..625.12 rows=532 width=16) (actual time=5.659..34547.169 rows=18530 loops=5)
Index Cond: (occurences.primary_identifier_id = identifiers.id)
Buffers: shared hit=3130 read=89417 dirtied=3362
I/O Timings: read=170808.322 write=0.000
-> Index Scan using tmp_index_on_vulnerabilities_non_dismissed on public.vulnerabilities (cost=0.57..0.63 rows=1 width=16) (actual time=1.520..1.520 rows=0 loops=92648)
Index Cond: (vulnerabilities.id = occurences.vulnerability_id)
Filter: ((NOT vulnerabilities.resolved_on_default_branch) AND (vulnerabilities.project_id = 27752814) AND (vulnerabilities.state = 1))
Rows Removed by Filter: 1
Buffers: shared hit=438314 read=79057 dirtied=31438
I/O Timings: read=135385.531 write=0.000
-> Index Scan using idx_projects_on_repository_storage_last_repository_updated_at on public.projects (cost=0.56..3.58 rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=0)
Index Cond: (projects.id = 27752814)
Filter: (NOT projects.archived)
Rows Removed by Filter: 0
I/O Timings: read=0.000 write=0.000
Plan Summary
**Time: 5.242 min**
- planning: 11.956 ms
- execution: 5.242 min \<----- avg query execution time
- I/O read: 5.104 min
- I/O write: 0.000 ms
**Shared buffers:**
- hits: 441460 (\~3.40 GiB) from the buffer pool
- reads: 168488 (\~1.30 GiB) from the OS file cache, including disk I/O
- dirtied: 34800 (\~271.90 MiB)
</td>
</tr>
</table>
Postgres.ai Link: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/22068/commands/71402 (internal only)
Solution:
Multiple approaches were available to solve the query bottleneck like Scoping with column filters to reduce the size of the cartesian table, Switching from Joins to Sub-queries, and Reordering Jonins to reduce the cartesian table size. However, they were inconsistent with GitLab's guideline limits for different inputs. The approach that worked is along the lines of what was outlined in one of the comments in the MR -- Splitting the main query into two parts along with adding an index for the separated expensive query.
The final solution is as follows: Split the main fetch query into two parts by separating the logic of fetching Vulnerability IDs that have dropped identifiers as primary identifiers (expensive). These Vulnerability IDs are used to further filter by state
and resolved_on_default_branch
columns on vulnerabilities
in a separate query. Here are the queries that are split into two:
Split Query 1: Fetch vulnerability IDs that have dropped identifiers as primary identifiers.
SELECT
vo.vulnerability_id
FROM
vulnerability_identifiers vi
INNER JOIN vulnerability_occurrences vo ON vo.primary_identifier_id = vi.id
WHERE
vi.id IN ($1)
ORDER BY
vi.id
LIMIT 100
Postgres AI(internal): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/22630/commands/72962
Split Query 2: Use Query 1 results and filter Vulnerabilities further by state and resolved_on_default_branch columns on vulnerabilities
explain SELECT vulnerabilities.id
FROM vulnerabilities
WHERE
vulnerabilities.state = 1
AND vulnerabilities.resolved_on_default_branch = FALSE
AND vulnerabilities.id IN (3362225, 3362224, 3362223, 3362222, 3362220, 3362218, 3362217, 3362216, 3362215, 3362213, 3362211, 3362210, 3362208, 3362207, 3338595, 3338594, 3338593, 3338592, 3338591, 3338590, 3338589, 3338588, 3338587, 3338586, 3338585, 3338584, 3338583, 3338582, 3338581, 3338580, 3338579, 3338578, 3338577, 3338576, 3338575, 3338574, 3338573, 3338572, 3338571, 3338570, 3338569, 3338568, 3338567, 3338566, 3338565, 3338564, 3338563, 3338562, 3338561, 3338560, 3338559, 3338558, 3338557, 3338556, 3338555, 3338554, 3338553, 3338552, 3338551, 3338550, 3338549, 3338548, 3338547, 3338546, 3338545, 3338544, 3338543, 3338542, 3338541, 3338540, 3338539, 3338538, 3338537, 3338536, 3338535, 3338534, 3338533, 3338532, 3338531, 3338530, 3338529, 3338528, 3338527, 3338526, 3338525, 3338524, 3338523, 3338522, 3338521, 3338520, 3338519, 3338518, 3338517, 3338516, 3338515, 3338514, 3338513, 3338512, 3338511, 3338510)
ORDER BY vulnerabilities.id ASC
LIMIT 100
https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/23034/commands/74175
Index Replacement on vulnerability_occurrences table
Although Split Query 1
seems straightforward but is a join made with a large table (vulnerability_occurrences
) so we need to bring down the execution cost by adding the following index :
create index on vulnerability_occurrences(primary_identifier_id, vulnerability_id)
Since there is already an index on vulnerability_occurrences(primary_identifier_id)
named index_vulnerability_occurrences_on_primary_identifier_id
, we need to replace the old index with the above index.
Postgres.ai Link of Split Query 1
post index addition: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/23218/commands/74711
Relevant Issue Numbers
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.