Populate `resolved_on_default_branch` column for existing vulnerabilities
What does this MR do?
This MR introduces a post-migration task to schedule a data migration job which will populate the resolved_on_default_branch
column of existing vulnerabilities.
Migraration up
== 20200806100713 SchedulePopulateResolvedOnDefaultBranchColumn: migrating ====
-- table_exists?(:projects)
-> 0.0007s
-- table_exists?(:vulnerabilities)
-> 0.0005s
== 20200806100713 SchedulePopulateResolvedOnDefaultBranchColumn: migrated (0.1239s)
Migraration down
== 20200806100713 SchedulePopulateResolvedOnDefaultBranchColumn: reverting ====
== 20200806100713 SchedulePopulateResolvedOnDefaultBranchColumn: reverted (0.0000s)
Database Queries
Scheduler post-migration task
The scheduler post-migration task schedules a data migration task for every 100 projects by using the following query multiple times;
SELECT DISTINCT
"vulnerabilities"."project_id"
FROM
"vulnerabilities"
WHERE
"vulnerabilities"."project_id" >= 1
ORDER BY
"vulnerabilities"."project_id" ASC
LIMIT 1 OFFSET 100
Limit (cost=1480.47..1495.27 rows=1 width=8) (actual time=957.839..957.839 rows=1 loops=1)
Buffers: shared hit=637 read=971
I/O Timings: read=917.190
-> Unique (cost=0.43..79123.27 rows=5346 width=8) (actual time=0.056..957.820 rows=101 loops=1)
Buffers: shared hit=637 read=971
I/O Timings: read=917.190
-> Index Only Scan using index_vulnerabilities_on_project_id on public.vulnerabilities (cost=0.43..72160.16 rows=2785242 width=8) (actual time=0.054..942.893 rows=134338 loops=1)
Index Cond: (vulnerabilities.project_id >= 1)
Heap Fetches: 0
Buffers: shared hit=637 read=971
I/O Timings: read=917.190
https://gitlab.slack.com/archives/CLJMDRD8C/p1597843572499800
Background task
Background task takes project_ids
argument as an array of integers and runs the migration for each of them with the following queries;
Load project
SELECT
"projects".*
FROM
"projects"
WHERE
"projects"."id" = $project_id
LIMIT 1
Limit (cost=0.43..3.45 rows=1 width=723) (actual time=11.330..11.330 rows=1 loops=1)
Buffers: shared read=4
I/O Timings: read=11.218
-> Index Scan using projects_pkey on public.projects (cost=0.43..3.45 rows=1 width=723) (actual time=11.327..11.327 rows=1 loops=1)
Index Cond: (projects.id = 278964)
Buffers: shared read=4
I/O Timings: read=11.218
https://gitlab.slack.com/archives/CLJMDRD8C/p1597677861201100
Load Route
This task also loads the route entry for each project as it has some crucial information for project entity like so;
SELECT
"routes".*
FROM
"routes"
WHERE
"routes"."source_id" = $project_id
AND "routes"."source_type" = 'Project'
LIMIT 1
Limit (cost=0.56..3.58 rows=1 width=80) (actual time=13.741..13.742 rows=1 loops=1)
Buffers: shared read=5
I/O Timings: read=13.681
-> Index Scan using index_routes_on_source_type_and_source_id on public.routes (cost=0.56..3.58 rows=1 width=80) (actual time=13.740..13.740 rows=1 loops=1)
Index Cond: (((routes.source_type)::text = 'Project'::text) AND (routes.source_id = 278964))
Buffers: shared read=5
I/O Timings: read=13.681
https://gitlab.slack.com/archives/CLJMDRD8C/p1597677948203500
Load latest successful pipeline with security reports
After loading the project & route entries, the task tries to load the latest successful pipeline ID with security reports to load the artifact entry later based on.
SELECT
"ci_pipelines"."id"
FROM
"ci_pipelines"
WHERE ci_pipelines.project_id = 278964
AND ci_pipelines.ref = 'master'
AND ci_pipelines.status IN ('success')
AND (EXISTS (
SELECT
1
FROM
"ci_builds"
WHERE
"ci_builds"."type" = 'Ci::Build'
AND ("ci_builds"."retried" IS FALSE
OR "ci_builds"."retried" IS NULL)
AND (EXISTS (
SELECT
1
FROM
"ci_job_artifacts"
WHERE (ci_builds.id = ci_job_artifacts.job_id)
AND "ci_job_artifacts"."file_type" IN (5, 6, 7, 8, 21, 23)))
AND (ci_pipelines.id = ci_builds.commit_id)))
ORDER BY
"ci_pipelines"."id" DESC
LIMIT 1
Limit (cost=1.84..23444.68 rows=1 width=4) (actual time=0.927..0.927 rows=1 loops=1)
Buffers: shared hit=959
-> Nested Loop Semi Join (cost=1.84..25599582.66 rows=1092 width=4) (actual time=0.926..0.926 rows=1 loops=1)
Buffers: shared hit=959
-> Index Only Scan using index_ci_pipelines_on_project_id_and_ref_and_status_and_id on public.ci_pipelines (cost=0.57..2948.65 rows=79343 width=4) (actual time=0.040..0.040 rows=1 loops=1)
Index Cond: ((ci_pipelines.project_id = 278964) AND (ci_pipelines.ref = 'master'::text) AND (ci_pipelines.status = 'success'::text))
Heap Fetches: 1
Buffers: shared hit=6
-> Nested Loop Semi Join (cost=1.27..322.60 rows=1 width=4) (actual time=0.884..0.884 rows=1 loops=1)
Buffers: shared hit=953
-> Index Scan using index_ci_builds_on_commit_id_and_status_and_type on public.ci_builds (cost=0.70..102.17 rows=113 width=8) (actual time=0.016..0.215 rows=121 loops=1)
Index Cond: ((ci_builds.commit_id = ci_pipelines.id) AND ((ci_builds.type)::text = 'Ci::Build'::text))
Filter: ((ci_builds.retried IS FALSE) OR (ci_builds.retried IS NULL))
Rows Removed by Filter: 0
Buffers: shared hit=129
-> Index Only Scan using index_ci_job_artifacts_on_job_id_and_file_type on public.ci_job_artifacts (cost=0.57..1.94 rows=1 width=4) (actual time=0.005..0.005 rows=0 loops=121)
Index Cond: (ci_job_artifacts.job_id = ci_builds.id)
Heap Fetches: 427
Filter: (ci_job_artifacts.file_type = ANY ('{5,6,7,8,21,23}'::integer[]))
Rows Removed by Filter: 4
Buffers: shared hit=824
https://gitlab.slack.com/archives/CLJMDRD8C/p1597933858070400
Load JobArtifact based on latest successful pipeline ID
After loading the latest successful pipeline ID from the database, the task tries to load the job artifacts based on the pipeline ID for security report files(5, 6, 7, 8, 21, 23).
SELECT
"ci_job_artifacts".*
FROM
"ci_job_artifacts"
INNER JOIN "ci_builds" ON "ci_job_artifacts"."job_id" = "ci_builds"."id"
AND "ci_builds"."commit_id" = $commit_id
AND "ci_builds"."type" = 'Ci::Build'
AND ("ci_builds"."retried" IS FALSE
OR "ci_builds"."retried" IS NULL)
WHERE
"ci_job_artifacts"."file_type" IN (5, 6, 7, 8, 21, 23)
Nested Loop (cost=1.27..1217.33 rows=1 width=131) (actual time=3.510..3.510 rows=0 loops=1)
Buffers: shared hit=3 read=2
I/O Timings: read=3.480
-> Index Scan using index_ci_builds_on_commit_id_and_status_and_type on public.ci_builds (cost=0.70..129.25 rows=115 width=4) (actual time=3.509..3.509 rows=0 loops=1)
Index Cond: ((ci_builds.commit_id = 177652417) AND ((ci_builds.type)::text = 'Ci::Build'::text))
Filter: ((ci_builds.retried IS FALSE) OR (ci_builds.retried IS NULL))
Rows Removed by Filter: 0
Buffers: shared hit=3 read=2
I/O Timings: read=3.480
-> Index Scan using index_ci_job_artifacts_on_job_id_and_file_type on public.ci_job_artifacts (cost=0.57..9.45 rows=1 width=131) (actual time=0.000..0.000 rows=0 loops=0)
Index Cond: (ci_job_artifacts.job_id = ci_builds.id)
Filter: (ci_job_artifacts.file_type = ANY ('{5,6,7,8,21,23}'::integer[]))
Rows Removed by Filter: 0
https://gitlab.slack.com/archives/CLJMDRD8C/p1597360148010400
Load vulnerability scanners for project
SELECT
"vulnerability_scanners".*
FROM
"vulnerability_scanners"
WHERE
"vulnerability_scanners"."project_id" = $project_id
AND "vulnerability_scanners"."external_id" IN ($external_id_1, $external_id_2)
Index Scan using index_vulnerability_scanners_on_project_id_and_external_id on public.vulnerability_scanners (cost=0.29..5.10 rows=1 width=54) (actual time=2.118..2.118 rows=0 loops=1)
Index Cond: ((vulnerability_scanners.project_id = 278964) AND ((vulnerability_scanners.external_id)::text = ANY ('{external_id_1,external_id_2}'::text[])))
Buffers: shared hit=5 read=2
I/O Timings: read=1.969
https://gitlab.slack.com/archives/CLJMDRD8C/p1597678042205900
Load vulnerability identifiers for project
SELECT
"vulnerability_identifiers".*
FROM
"vulnerability_identifiers"
WHERE
"vulnerability_identifiers"."project_id" = $project_id
AND "vulnerability_identifiers"."fingerprint" IN ($fingerprint_1, $fingerprint_2)
Index Scan using index_vulnerability_identifiers_on_project_id_and_fingerprint on public.vulnerability_identifiers (cost=0.42..5.38 rows=1 width=149) (actual time=2.195..2.195 rows=0 loops=1)
Index Cond: ((vulnerability_identifiers.project_id = 278964) AND (vulnerability_identifiers.fingerprint = ANY ('{"\\x66696e6765727072696e745f31","\\x66696e6765727072696e745f32"}'::bytea[])))
Buffers: shared hit=6 read=3
I/O Timings: read=2.112
https://gitlab.slack.com/archives/CLJMDRD8C/p1597678127208300
Load findings from database
SELECT
"vulnerability_occurrences".*
FROM
"vulnerability_occurrences"
WHERE
"vulnerability_occurrences"."project_id" = $project_id
AND "vulnerability_occurrences"."scanner_id" = $scanner_id
AND "vulnerability_occurrences"."primary_identifier_id" = $identifier_id
AND "vulnerability_occurrences"."location_fingerprint" = $location_fingerprint
LIMIT 1
Limit (cost=0.56..3.58 rows=1 width=1113) (actual time=8.433..8.434 rows=0 loops=1)
Buffers: shared hit=3 read=4
I/O Timings: read=8.373
-> Index Scan using index_vulnerability_occurrences_on_unique_keys on public.vulnerability_occurrences (cost=0.56..3.58 rows=1 width=1113) (actual time=8.432..8.432 rows=0 loops=1)
Index Cond: ((vulnerability_occurrences.project_id = 278964) AND (vulnerability_occurrences.primary_identifier_id = 1) AND (vulnerability_occurrences.location_fingerprint = '\x6c6f636174696f6e5f66696e6765727072696e74'::bytea) AND (vulnerability_occurrences.scanner_id = 1))
Buffers: shared hit=3 read=4
I/O Timings: read=8.373
https://gitlab.slack.com/archives/CLJMDRD8C/p1597678191210700
Update disappeared vulnerabilities
UPDATE
"vulnerabilities"
SET
"resolved_on_default_branch" = true
WHERE
"vulnerabilities"."project_id" = $project_id
AND "vulnerabilities"."id" NOT IN ($vulnerability_id_1, $vulnerability_id_2)
ModifyTable on public.vulnerabilities (cost=0.43..140.02 rows=93 width=295) (actual time=0.084..0.084 rows=0 loops=1)
Buffers: shared hit=3
-> Index Scan using index_vulnerabilities_on_project_id on public.vulnerabilities (cost=0.43..140.02 rows=93 width=295) (actual time=0.063..0.063 rows=0 loops=1)
Index Cond: (vulnerabilities.project_id = 1)
Filter: (vulnerabilities.id <> 1)
Rows Removed by Filter: 0
Buffers: shared hit=3
https://gitlab.slack.com/archives/CLJMDRD8C/p1597844859016800
Expected timing
explain select distinct project_id from vulnerabilities
6590 project ids.
(6590 / 100) * 5 minutes ~= 330 minutes to finish the migration for all projects.
Related to #227114 (closed)
Conformity
-
Changelog entry - [-] Documentation (if required)
-
Code review guidelines -
Merge request performance guidelines -
Style guides -
Database guides -
Separation of EE specific content
Availability and Testing
- [-] Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- [-] Tested in all supported browsers
- [-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
- [-] Label as security and @ mention
@gitlab-com/gl-security/appsec
- [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- [-] Security reports checked/validated by a reviewer from the AppSec team