Populate `latest_pipeline_id` values for `vulnerability_statistics` (!63451) · Merge requests · GitLab.org / GitLab

Mehmet Emin INAC requested to merge 271408_populate_latest_pipeline_id_for_vulnerability_statistics into master Jun 07, 2021

What does this MR do?

This MR introduces a new database migration to schedule background jobs, eventually populating the missing values for the latest_pipeline_id column of the vulnerability_statistics table.

Related to #271408 (closed).

Database migration

This MR introduces a new schema migration and data migration to populate the latest_pipeline_id column for 22_538 projects.

Rake schema migration

rake db:migrate VERSION=20210602164044

== 20210602164044 ScheduleLatestPipelineIdPopulation: migrating ===============
-- Scheduled 0 PopulateLatestPipelineIds jobs with a maximum of 100 records per batch and an interval of 120 seconds.

The migration is expected to take at least 0 seconds. Expect all jobs to have completed after 2021-06-07 20:17:12 UTC."
== 20210602164044 ScheduleLatestPipelineIdPopulation: migrated (0.0214s) ======

rake db:rollback VERSION=20210602164044

== 20210602164044 ScheduleLatestPipelineIdPopulation: reverting ===============
== 20210602164044 ScheduleLatestPipelineIdPopulation: reverted (0.0000s) ======

SQLs used during the scheduling process

Initial SQL to find lower bound

SELECT
    "project_settings"."project_id"
FROM
    "project_settings"
    LEFT OUTER JOIN vulnerability_statistics vs ON vs.project_id = project_settings.project_id
WHERE
    "vs"."latest_pipeline_id" IS NULL
    AND (has_vulnerabilities IS TRUE)
ORDER BY
    "project_settings"."project_id" ASC
LIMIT 1

Limit  (cost=0.57..0.69 rows=1 width=4) (actual time=2.055..2.057 rows=1 loops=1)
   Buffers: shared hit=3 read=3 dirtied=1
   I/O Timings: read=2.014 write=0.000
   ->  Merge Left Join  (cost=0.57..2646.99 rows=22116 width=4) (actual time=2.054..2.055 rows=1 loops=1)
         Merge Cond: (project_settings.project_id = vs.project_id)
         Filter: (vs.latest_pipeline_id IS NULL)
         Rows Removed by Filter: 0
         Buffers: shared hit=3 read=3 dirtied=1
         I/O Timings: read=2.014 write=0.000
         ->  Index Only Scan using index_project_settings_on_project_id_partially on public.project_settings  (cost=0.29..546.44 rows=24263 width=4) (actual time=1.936..1.937 rows=1 loops=1)
               Heap Fetches: 0
               Buffers: shared hit=1 read=2
               I/O Timings: read=1.918 write=0.000
         ->  Index Scan using index_vulnerability_statistics_on_unique_project_id on public.vulnerability_statistics vs  (cost=0.29..2012.18 rows=20493 width=16) (actual time=0.111..0.111 rows=1 loops=1)
               Buffers: shared hit=2 read=1 dirtied=1
               I/O Timings: read=0.096 write=0.000

SQL to find upper bound of first batch

SELECT
    "project_settings"."project_id"
FROM
    "project_settings"
    LEFT OUTER JOIN vulnerability_statistics vs ON vs.project_id = project_settings.project_id
WHERE
    "vs"."latest_pipeline_id" IS NULL
    AND (has_vulnerabilities IS TRUE)
    AND "project_settings"."project_id" >= 1
ORDER BY
    "project_settings"."project_id" ASC
LIMIT 1 OFFSET 100

 Limit  (cost=12.82..12.94 rows=1 width=4) (actual time=16.736..16.738 rows=1 loops=1)
   Buffers: shared hit=98 read=40 dirtied=11
   I/O Timings: read=15.529 write=0.000
   ->  Merge Left Join  (cost=0.57..2707.65 rows=22116 width=4) (actual time=0.040..16.721 rows=101 loops=1)
         Merge Cond: (project_settings.project_id = vs.project_id)
         Filter: (vs.latest_pipeline_id IS NULL)
         Rows Removed by Filter: 30
         Buffers: shared hit=98 read=40 dirtied=11
         I/O Timings: read=15.529 write=0.000
         ->  Index Only Scan using index_project_settings_on_project_id_partially on public.project_settings  (cost=0.29..607.10 rows=24263 width=4) (actual time=0.029..2.548 rows=131 loops=1)
               Index Cond: (project_settings.project_id >= 1)
               Heap Fetches: 1
               Buffers: shared hit=24 read=2
               I/O Timings: read=2.398 write=0.000
         ->  Index Scan using index_vulnerability_statistics_on_unique_project_id on public.vulnerability_statistics vs  (cost=0.29..2012.18 rows=20493 width=16) (actual time=0.006..13.981 rows=119 loops=1)
               Buffers: shared hit=74 read=38 dirtied=11
               I/O Timings: read=13.131 write=0.000

SQL to query MIN&MAX project_id values of the batch

SELECT
    MIN("project_settings"."project_id"),
    MAX("project_settings"."project_id")
FROM
    "project_settings"
    LEFT OUTER JOIN vulnerability_statistics vs ON vs.project_id = project_settings.project_id
WHERE
    "vs"."latest_pipeline_id" IS NULL
    AND (has_vulnerabilities IS TRUE)
    AND "project_settings"."project_id" >= 1

Aggregate  (cost=2818.23..2818.24 rows=1 width=8) (actual time=394.307..394.309 rows=1 loops=1)
   Buffers: shared hit=20908 read=874 dirtied=347
   I/O Timings: read=327.782 write=0.000
   ->  Merge Left Join  (cost=0.57..2707.65 rows=22116 width=4) (actual time=0.026..389.403 rows=22547 loops=1)
         Merge Cond: (project_settings.project_id = vs.project_id)
         Filter: (vs.latest_pipeline_id IS NULL)
         Rows Removed by Filter: 1816
         Buffers: shared hit=20908 read=874 dirtied=347
         I/O Timings: read=327.782 write=0.000
         ->  Index Only Scan using index_project_settings_on_project_id_partially on public.project_settings  (cost=0.29..607.10 rows=24263 width=4) (actual time=0.015..102.000 rows=24363 loops=1)
               Index Cond: (project_settings.project_id >= 1)
               Heap Fetches: 163
               Buffers: shared hit=4541 read=218 dirtied=75
               I/O Timings: read=87.900 write=0.000
         ->  Index Scan using index_vulnerability_statistics_on_unique_project_id on public.vulnerability_statistics vs  (cost=0.29..2012.18 rows=20493 width=16) (actual time=0.005..274.581 rows=20493 loops=1)
               Buffers: shared hit=16367 read=656 dirtied=272
               I/O Timings: read=239.881 write=0.000

Background data migration

Initial SQL query to load projects into memory

SELECT
    "projects".*
FROM
    "projects"
    INNER JOIN "project_settings" ON "project_settings"."project_id" = "projects"."id"
    LEFT OUTER JOIN vulnerability_statistics vs ON vs.project_id = project_settings.project_id
WHERE
    "vs"."latest_pipeline_id" IS NULL
    AND (has_vulnerabilities IS TRUE)
    AND "projects"."id" BETWEEN 1 AND 4

Nested Loop Left Join  (cost=1.14..13.53 rows=1 width=746) (actual time=6.549..6.551 rows=0 loops=1)
   Filter: (vs.latest_pipeline_id IS NULL)
   Rows Removed by Filter: 0
   Buffers: shared hit=4 read=6
   I/O Timings: read=6.453 write=0.000
   ->  Nested Loop  (cost=0.85..13.14 rows=1 width=750) (actual time=6.549..6.550 rows=0 loops=1)
         Buffers: shared hit=4 read=6
         I/O Timings: read=6.453 write=0.000
         ->  Index Scan using projects_pkey on public.projects  (cost=0.56..6.23 rows=3 width=746) (actual time=4.910..6.520 rows=2 loops=1)
               Index Cond: ((projects.id >= 1) AND (projects.id <= 4))
               Buffers: shared read=6
               I/O Timings: read=6.453 write=0.000
         ->  Index Only Scan using index_project_settings_on_project_id_partially on public.project_settings  (cost=0.29..2.31 rows=1 width=4) (actual time=0.012..0.012 rows=0 loops=2)
               Index Cond: (project_settings.project_id = projects.id)
               Heap Fetches: 0
               Buffers: shared hit=4
               I/O Timings: read=0.000 write=0.000
   ->  Index Scan using index_vulnerability_statistics_on_unique_project_id on public.vulnerability_statistics vs  (cost=0.29..0.38 rows=1 width=16) (actual time=0.000..0.000 rows=0 loops=0)
         Index Cond: (vs.project_id = project_settings.project_id)
         I/O Timings: read=0.000 write=0.000

Query to load latest successful pipeline ID with security reports

SELECT
    "ci_pipelines"."id"
FROM
    "ci_pipelines"
WHERE ("ci_pipelines"."id" IN (
        SELECT
            "ci_pipelines"."id"
        FROM
            "ci_pipelines"
        WHERE
            ci_pipelines.project_id = 1
            AND ci_pipelines.ref = 'master'
            AND ci_pipelines.status IN ('success', 'failed', 'canceled', 'skipped')
        ORDER BY
            "ci_pipelines"."id" DESC
        LIMIT 100))
AND (EXISTS (
        SELECT
            1
        FROM
            "ci_builds"
        WHERE
            "ci_builds"."type" = 'Ci::Build'
            AND ("ci_builds"."retried" IS FALSE
                OR "ci_builds"."retried" IS NULL)
            AND (EXISTS (
                    SELECT
                        1
                    FROM
                        "ci_job_artifacts"
                    WHERE (ci_builds.id = ci_job_artifacts.job_id)
                    AND "ci_job_artifacts"."file_type" IN (5, 6, 7, 8, 21, 23)))
            AND (ci_pipelines.id = ci_builds.commit_id)))
ORDER BY
    "ci_pipelines"."id" DESC
LIMIT 1

 Limit  (cost=132203.72..132203.73 rows=1 width=4) (actual time=13.512..13.515 rows=0 loops=1)
   Buffers: shared hit=3 read=5
   I/O Timings: read=13.413 write=0.000
   ->  Sort  (cost=132203.72..132203.73 rows=1 width=4) (actual time=13.511..13.513 rows=0 loops=1)
         Sort Key: ci_pipelines.id DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=3 read=5
         I/O Timings: read=13.413 write=0.000
         ->  Nested Loop Semi Join  (cost=47.86..132203.71 rows=1 width=4) (actual time=13.478..13.480 rows=0 loops=1)
               Buffers: shared read=5
               I/O Timings: read=13.413 write=0.000
               ->  Nested Loop  (cost=46.58..260.51 rows=100 width=8) (actual time=13.477..13.479 rows=0 loops=1)
                     Buffers: shared read=5
                     I/O Timings: read=13.413 write=0.000
                     ->  HashAggregate  (cost=46.01..47.01 rows=100 width=4) (actual time=13.476..13.478 rows=0 loops=1)
                           Group Key: ci_pipelines_1.id
                           Buffers: shared read=5
                           I/O Timings: read=13.413 write=0.000
                           ->  Limit  (cost=44.51..44.76 rows=100 width=4) (actual time=13.474..13.475 rows=0 loops=1)
                                 Buffers: shared read=5
                                 I/O Timings: read=13.413 write=0.000
                                 ->  Sort  (cost=44.51..45.60 rows=436 width=4) (actual time=13.474..13.475 rows=0 loops=1)
                                       Sort Key: ci_pipelines_1.id DESC
                                       Sort Method: quicksort  Memory: 25kB
                                       Buffers: shared read=5
                                       I/O Timings: read=13.413 write=0.000
                                       ->  Index Only Scan using index_ci_pipelines_on_project_id_and_ref_and_status_and_id on public.ci_pipelines ci_pipelines_1  (cost=0.70..27.85 rows=436 width=4) (actual time=13.466..13.466 rows=0 loops=1)
                                             Index Cond: ((ci_pipelines_1.project_id = 1) AND (ci_pipelines_1.ref = 'master'::text))
                                             Heap Fetches: 0
                                             Filter: ((ci_pipelines_1.status)::text = ANY ('{success,failed,canceled,skipped}'::text[]))
                                             Rows Removed by Filter: 0
                                             Buffers: shared read=5
                                             I/O Timings: read=13.413 write=0.000
                     ->  Index Only Scan using ci_pipelines_pkey on public.ci_pipelines  (cost=0.57..2.13 rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=0)
                           Index Cond: (ci_pipelines.id = ci_pipelines_1.id)
                           Heap Fetches: 0
                           I/O Timings: read=0.000 write=0.000
               ->  Nested Loop Semi Join  (cost=1.28..1320.93 rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=0)
                     I/O Timings: read=0.000 write=0.000
                     ->  Index Scan using index_ci_builds_on_commit_id_and_status_and_type on public.ci_builds  (cost=0.70..349.62 rows=297 width=8) (actual time=0.000..0.000 rows=0 loops=0)
                           Index Cond: ((ci_builds.commit_id = ci_pipelines.id) AND ((ci_builds.type)::text = 'Ci::Build'::text))
                           Filter: ((ci_builds.retried IS FALSE) OR (ci_builds.retried IS NULL))
                           Rows Removed by Filter: 0
                           I/O Timings: read=0.000 write=0.000
                     ->  Index Only Scan using index_ci_job_artifacts_on_job_id_and_file_type on public.ci_job_artifacts  (cost=0.57..3.27 rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=0)
                           Index Cond: (ci_job_artifacts.job_id = ci_builds.id)
                           Heap Fetches: 0
                           Filter: (ci_job_artifacts.file_type = ANY ('{5,6,7,8,21,23}'::integer[]))
                           Rows Removed by Filter: 0
                           I/O Timings: read=0.000 write=0.000

Query to UPSERT latest_pipeline_id value

INSERT INTO vulnerability_statistics (project_id, letter_grade, latest_pipeline_id, created_at, updated_at)
    VALUES (1, 0, 2, '2021-06-07 20:40:48.221369', '2021-06-07 20:40:48.221369'), (4, 0, 6, '2021-06-07 20:40:48.230279', '2021-06-07 20:40:48.230279')
ON CONFLICT (project_id)
    DO UPDATE SET
        latest_pipeline_id = COALESCE(vulnerability_statistics.latest_pipeline_id, EXCLUDED.latest_pipeline_id), updated_at = EXCLUDED.updated_at

Execution plan on prod-replica for just one project;

 ModifyTable on public.vulnerability_statistics  (cost=0.00..0.01 rows=1 width=70) (actual time=1.832..1.833 rows=0 loops=1)
   Buffers: shared hit=14 read=1 dirtied=1
   I/O Timings: read=1.722 write=0.000
   ->  Result  (cost=0.00..0.01 rows=1 width=70) (actual time=0.008..0.009 rows=1 loops=1)
         Buffers: shared hit=1
         I/O Timings: read=0.000 write=0.000

Does this MR meet the acceptance criteria?

Conformity

I have included a changelog entry, or it's not needed. (Does this MR need a changelog?)
[-] I have added/updated documentation, or it's not needed. (Is documentation required?)
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?)
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?)
I have self-reviewed this MR per code review guidelines.
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines)
I have followed the style guides.
This change is backwards compatible across updates, or this does not apply.

Availability and Testing

[-] I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.)
[-] I have tested this MR in all supported browsers, or it's not needed.
[-] I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.

Security

Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.

[-] Label as security and @ mention @gitlab-com/gl-security/appsec
[-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
[-] Security reports checked/validated by a reviewer from the AppSec team

Edited Jun 08, 2021 by Mehmet Emin INAC

Populate `latest_pipeline_id` values for `vulnerability_statistics`