Skip to content

Create partitions for package metadata tables

Igor Frenkel requested to merge 382567-partition-package-metadata into master

What does this MR do and why?

This MR partitions the package metadata tables (prefixed by pm_) in order to make queries and data storage more scalable. Initially, the main type of query to run against this dataset will be in fetching licenses for a set of sbom components. But more queries are anticipated (see continuous vulnerability scans for an example).

Related issue: #382567

Partitioning design

The main changes to the package metadata (which was introduced in Update DB schema to store data imported from th... (#373163 - closed)) is to create a list partition of the 3 largest tables. These are partitioned by purl_type. A purl_type is a smallint value which denotes which package registry a package came from: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/models/package_metadata/package.rb

pm_packages already had the purl_type column. For pm_package_versions and pm_package_version_licenses a denormalization is required to add this column in order to be able to partition these tables.

Primary keys are removed from the pm_packages and pm_package_versions as these would require the primary key to appear in the partition.

Data queries

These tables will be initially used to query package metadata to find licenses for packages (e.g. find licenses for components in ingested sbom).

The anticipated query would look something like:

  SELECT ps.name, pvs.version, ls.spdx_identifier
  FROM pm_packages ps
  JOIN pm_package_versions pvs ON
    pvs.purl_type = ps.purl_type AND pvs.pm_package_id = ps.id
  JOIN pm_package_version_licenses pvls ON
    pvls.purl_type = pvs.purl_type and pvls.pm_package_version_id = pvs.id
  JOIN testp2_pm_licenses ls ON
    pvls.pm_license_id = ls.id
  WHERE ps.purl_type = 1 AND
    (name, version) IN (('depA', 'v1'), ('depA', 'v2'))

Note: In some cases the query could be done by joining sbom_components stored in the DB. But we use (name, version) IN (... sbom values ...) rather than a join to avoid cross joins, for the following reasons:

  • As of today SBOM components are not stored in the DB; the feature has not been enabled.
  • It's expected that SBOM components stored in the DB will be limited to HEADs of branches. In the other cases, we'll still have to parse the SBOM reports, and the SBOM components will be in memory.
  • It should be possible to move the package metadata tables into a separate database, and we can't JOIN tables that belong to different databases.

How to set up and validate locally

  1. run migrations to create the partitions
  2. load database by using the bulk insert script and getting the license csv from this internal link
  3. create some fake "sbom component data"
  4. run client query and analyze

More details for each point below.

1. migration

bundle exec rails db:migrate

2. bulk insert script

The bulk insert script is used to load data from a sample license file into the database https://gitlab.com/ifrenkel/load-package-metadata

Download license data from https://drive.google.com/drive/folders/1lrcLBiTPIEmGMcUBy6luPA8Q3WYKv_s8

Run script to load the data (make sure to set GDK_DIR var or change the way the pg host is set when running queries):

git clone https://gitlab.com/ifrenkel/load-package-metadata /tmp/load-package-metadata
export GDK_DIR="/gdk/postgresql"
ruby /tmp/load-package-metadata/main.rb -csv_path ~/Downloads/licenses.csv -purl_type 5

3. create data sample

Run this query to create a table which has a subset of sample the package metadata which will represent components.

create table if not exists sample_deps (name varchar(255), version varchar(255));
with A as (select ps.id, min(pvs.id) from pm_packages ps
  join pm_package_versions pvs on ps.id = pvs.pm_package_id and ps.purl_type = pvs.purl_type
  where ps.purl_type = 5
  group by ps.id
  order by random()
  limit 95
), B as 
(select name, version from pm_packages ps
  join pm_package_versions pvs on ps.id = pvs.pm_package_id and ps.purl_type = pvs.purl_type
  where (ps.id, pvs.id) in (select  * from A)
)
insert into sample_deps (name, version)
  select * from b;

4. run query to fetch spdx identifier for dependencies

EXPLAIN ANALYZE
  SELECT ps.name, pvs.version, ls.spdx_identifier
  FROM pm_packages ps
  JOIN pm_package_versions pvs ON
    pvs.purl_type = ps.purl_type AND pvs.pm_package_id = ps.id
  JOIN pm_package_version_licenses pvls ON
    pvls.purl_type = pvs.purl_type and pvls.pm_package_version_id = pvs.id
  JOIN pm_licenses ls ON
    pvls.pm_license_id = ls.id
  WHERE ps.purl_type = 5 AND
    (name, version) IN (SELECT name, version FROM sample_deps);

Note: the query avoids a join to the data sample table.

Data characteristics

Query plan

Following the methodology above and issuing a query like:

EXPLAIN ANALYZE
  SELECT ps.name, pvs.version, ls.spdx_identifier
  FROM pm_packages ps
  JOIN pm_package_versions pvs ON
    pvs.purl_type = ps.purl_type AND pvs.pm_package_id = ps.id
  JOIN pm_package_version_licenses pvls ON
    pvls.purl_type = pvs.purl_type and pvls.pm_package_version_id = pvs.id
  JOIN pm_licenses ls ON
    pvls.pm_license_id = ls.id
  WHERE ps.purl_type = 5 AND
    (name, version) IN (?);

With ? replaced by inline values (e.g. IN (('pkg1','ver1'),...)).

The query plan looks like:

 Hash Join  (cost=132.75..2251.21 rows=122500 width=73) (actual time=0.324..2.314 rows=95 loops=1)
   Hash Cond: (pvls.pm_license_id = ls.id)
   ->  Nested Loop  (cost=125.73..1916.53 rows=122500 width=54) (actual time=0.276..2.255 rows=95 loops=1)
         ->  Nested Loop  (cost=125.44..1829.57 rows=245 width=56) (actual time=0.269..1.899 rows=95 loops=1)
               ->  Bitmap Heap Scan on test1_pm_packages_5 ps  (cost=125.02..208.77 rows=94 width=45) (actual time=0.246..0.364 rows=95 loops=1)
                     Recheck Cond: (((purl_type = 5) AND (name = 'em.hiitwirwi:wielsvwi-ibigyxsv-wtevoyxmpw6-5-b_6.55'::text)) OR ((purl_type = 5) AND (name = 'em.hnp.xirwsvjpsa:xirwsvjpsa-rexmzi-gy554'::text)) OR ... snip ...)
                     Heap Blocks: exact=58
                     ->  BitmapOr  (cost=125.02..125.02 rows=95 width=0) (actual time=0.239..0.242 rows=0 loops=1)
                           ->  Bitmap Index Scan on test1_pm_packages_5_purl_type_name_idx  (cost=0.00..1.29 rows=1 width=0) (actual time=0.008..0.008 rows=1 loops=1)
                                 Index Cond: ((purl_type = 5) AND (name = 'em.hiitwirwi:wielsvwi-ibigyxsv-wtevoyxmpw6-5-b_6.55'::text))
                           ... snip ...
                           ->  Bitmap Index Scan on test1_pm_packages_5_purl_type_name_idx  (cost=0.00..1.29 rows=1 width=0) (actual time=0.004..0.004 rows=1 loops=1)
                                 Index Cond: ((purl_type = 5) AND (name = 'gl.itjp.fpyifvemr.ribyw:hipxe-wsyvgmrk_6.57'::text))
               ->  Index Scan using test1_pm_package_versions_5_pm_package_id_purl_type_idx on test1_pm_package_versions_5 pvs  (cost=0.42..17.23 rows=1 width=29) (actual time=0.006..0.016 rows=1 loops=95)
                     Index Cond: ((pm_package_id = ps.id) AND (purl_type = 5))
                     Filter: (((version = '1.4.2'::text) OR (version = '2.4.1'::text) OR ... snip ... AND (version = '1.5.0'::text))))
                     Rows Removed by Filter: 8
         ->  Index Scan using test1_pm_package_version_lice_pm_package_version_id_purl_t_idx4 on test1_pm_package_version_licenses_5 pvls  (cost=0.29..0.34 rows=1 width=18) (actual time=0.004..0.004 rows=1 loops=95)
               Index Cond: ((pm_package_version_id = pvs.id) AND (purl_type = 5))
   ->  Hash  (cost=4.23..4.23 rows=223 width=35) (actual time=0.041..0.041 rows=223 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 24kB
         ->  Seq Scan on test1_pm_licenses ls  (cost=0.00..4.23 rows=223 width=35) (actual time=0.010..0.023 rows=223 loops=1)
 Planning Time: 3.824 ms
 Execution Time: 2.606 ms
(209 rows)

Full plan (i.e. the OR matches on literals are included there) is included as an attachment for MR brevity: query_plan.txt

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Fabien Catteau

Merge request reports

Loading