Create partitions for package metadata tables
What does this MR do and why?
This MR partitions the package metadata tables (prefixed by pm_
) in order to make queries and data storage more scalable. Initially, the main type of query to run against this dataset will be in fetching licenses for a set of sbom components. But more queries are anticipated (see continuous vulnerability scans for an example).
Related issue: #382567
Partitioning design
The main changes to the package metadata (which was introduced in Update DB schema to store data imported from th... (#373163 - closed)) is to create a list
partition of the 3 largest tables. These are partitioned by purl_type
. A purl_type
is a smallint
value which denotes which package registry a package came from: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/models/package_metadata/package.rb
pm_packages
already had the purl_type
column. For pm_package_versions
and pm_package_version_licenses
a denormalization is required to add this column in order to be able to partition these tables.
Primary keys are removed from the pm_packages
and pm_package_versions
as these would require the primary key to appear in the partition.
Data queries
These tables will be initially used to query package metadata to find licenses for packages (e.g. find licenses for components in ingested sbom).
The anticipated query would look something like:
SELECT ps.name, pvs.version, ls.spdx_identifier
FROM pm_packages ps
JOIN pm_package_versions pvs ON
pvs.purl_type = ps.purl_type AND pvs.pm_package_id = ps.id
JOIN pm_package_version_licenses pvls ON
pvls.purl_type = pvs.purl_type and pvls.pm_package_version_id = pvs.id
JOIN testp2_pm_licenses ls ON
pvls.pm_license_id = ls.id
WHERE ps.purl_type = 1 AND
(name, version) IN (('depA', 'v1'), ('depA', 'v2'))
Note: In some cases the query could be done by joining sbom_components
stored in the DB. But we use (name, version) IN (... sbom values ...)
rather than a join to avoid cross joins, for the following reasons:
- As of today SBOM components are not stored in the DB; the feature has not been enabled.
- It's expected that SBOM components stored in the DB will be limited to HEADs of branches. In the other cases, we'll still have to parse the SBOM reports, and the SBOM components will be in memory.
- It should be possible to move the package metadata tables into a separate database, and we can't JOIN tables that belong to different databases.
How to set up and validate locally
- run migrations to create the partitions
- load database by using the bulk insert script and getting the license csv from this internal link
- create some fake "sbom component data"
- run client query and analyze
More details for each point below.
1. migration
bundle exec rails db:migrate
2. bulk insert script
The bulk insert script is used to load data from a sample license file into the database https://gitlab.com/ifrenkel/load-package-metadata
Download license data from https://drive.google.com/drive/folders/1lrcLBiTPIEmGMcUBy6luPA8Q3WYKv_s8
Run script to load the data (make sure to set GDK_DIR
var or change the way the pg host is set when running queries):
git clone https://gitlab.com/ifrenkel/load-package-metadata /tmp/load-package-metadata
export GDK_DIR="/gdk/postgresql"
ruby /tmp/load-package-metadata/main.rb -csv_path ~/Downloads/licenses.csv -purl_type 5
3. create data sample
Run this query to create a table which has a subset of sample the package metadata which will represent components.
create table if not exists sample_deps (name varchar(255), version varchar(255));
with A as (select ps.id, min(pvs.id) from pm_packages ps
join pm_package_versions pvs on ps.id = pvs.pm_package_id and ps.purl_type = pvs.purl_type
where ps.purl_type = 5
group by ps.id
order by random()
limit 95
), B as
(select name, version from pm_packages ps
join pm_package_versions pvs on ps.id = pvs.pm_package_id and ps.purl_type = pvs.purl_type
where (ps.id, pvs.id) in (select * from A)
)
insert into sample_deps (name, version)
select * from b;
4. run query to fetch spdx identifier for dependencies
EXPLAIN ANALYZE
SELECT ps.name, pvs.version, ls.spdx_identifier
FROM pm_packages ps
JOIN pm_package_versions pvs ON
pvs.purl_type = ps.purl_type AND pvs.pm_package_id = ps.id
JOIN pm_package_version_licenses pvls ON
pvls.purl_type = pvs.purl_type and pvls.pm_package_version_id = pvs.id
JOIN pm_licenses ls ON
pvls.pm_license_id = ls.id
WHERE ps.purl_type = 5 AND
(name, version) IN (SELECT name, version FROM sample_deps);
Note: the query avoids a join to the data sample table.
Data characteristics
Query plan
Following the methodology above and issuing a query like:
EXPLAIN ANALYZE
SELECT ps.name, pvs.version, ls.spdx_identifier
FROM pm_packages ps
JOIN pm_package_versions pvs ON
pvs.purl_type = ps.purl_type AND pvs.pm_package_id = ps.id
JOIN pm_package_version_licenses pvls ON
pvls.purl_type = pvs.purl_type and pvls.pm_package_version_id = pvs.id
JOIN pm_licenses ls ON
pvls.pm_license_id = ls.id
WHERE ps.purl_type = 5 AND
(name, version) IN (?);
With ?
replaced by inline values (e.g. IN (('pkg1','ver1'),...)
).
The query plan looks like:
Hash Join (cost=132.75..2251.21 rows=122500 width=73) (actual time=0.324..2.314 rows=95 loops=1)
Hash Cond: (pvls.pm_license_id = ls.id)
-> Nested Loop (cost=125.73..1916.53 rows=122500 width=54) (actual time=0.276..2.255 rows=95 loops=1)
-> Nested Loop (cost=125.44..1829.57 rows=245 width=56) (actual time=0.269..1.899 rows=95 loops=1)
-> Bitmap Heap Scan on test1_pm_packages_5 ps (cost=125.02..208.77 rows=94 width=45) (actual time=0.246..0.364 rows=95 loops=1)
Recheck Cond: (((purl_type = 5) AND (name = 'em.hiitwirwi:wielsvwi-ibigyxsv-wtevoyxmpw6-5-b_6.55'::text)) OR ((purl_type = 5) AND (name = 'em.hnp.xirwsvjpsa:xirwsvjpsa-rexmzi-gy554'::text)) OR ... snip ...)
Heap Blocks: exact=58
-> BitmapOr (cost=125.02..125.02 rows=95 width=0) (actual time=0.239..0.242 rows=0 loops=1)
-> Bitmap Index Scan on test1_pm_packages_5_purl_type_name_idx (cost=0.00..1.29 rows=1 width=0) (actual time=0.008..0.008 rows=1 loops=1)
Index Cond: ((purl_type = 5) AND (name = 'em.hiitwirwi:wielsvwi-ibigyxsv-wtevoyxmpw6-5-b_6.55'::text))
... snip ...
-> Bitmap Index Scan on test1_pm_packages_5_purl_type_name_idx (cost=0.00..1.29 rows=1 width=0) (actual time=0.004..0.004 rows=1 loops=1)
Index Cond: ((purl_type = 5) AND (name = 'gl.itjp.fpyifvemr.ribyw:hipxe-wsyvgmrk_6.57'::text))
-> Index Scan using test1_pm_package_versions_5_pm_package_id_purl_type_idx on test1_pm_package_versions_5 pvs (cost=0.42..17.23 rows=1 width=29) (actual time=0.006..0.016 rows=1 loops=95)
Index Cond: ((pm_package_id = ps.id) AND (purl_type = 5))
Filter: (((version = '1.4.2'::text) OR (version = '2.4.1'::text) OR ... snip ... AND (version = '1.5.0'::text))))
Rows Removed by Filter: 8
-> Index Scan using test1_pm_package_version_lice_pm_package_version_id_purl_t_idx4 on test1_pm_package_version_licenses_5 pvls (cost=0.29..0.34 rows=1 width=18) (actual time=0.004..0.004 rows=1 loops=95)
Index Cond: ((pm_package_version_id = pvs.id) AND (purl_type = 5))
-> Hash (cost=4.23..4.23 rows=223 width=35) (actual time=0.041..0.041 rows=223 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 24kB
-> Seq Scan on test1_pm_licenses ls (cost=0.00..4.23 rows=223 width=35) (actual time=0.010..0.023 rows=223 loops=1)
Planning Time: 3.824 ms
Execution Time: 2.606 ms
(209 rows)
Full plan (i.e. the OR
matches on literals are included there) is included as an attachment for MR brevity: query_plan.txt
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.