Update package metadata license lookup to use compressed data
What does this MR do and why?
This MR is the first of two, and has been split this way to reduce the size of each MR:
- This MR
-
adds a new
compressed_package_metadata_query
feature flag.This MR stubs
compressed_package_metadata_query
to false in all the tests. -
updates
PackageLicenses#fetch
to use compressed data if thecompressed_package_metadata_query
feature flag is enabled.
-
-
Test both sides of compressed_package_metadata_... (!120207 - merged)
- Tests both sides of the
compressed_package_metadata_query
feature flag.
- Tests both sides of the
Background
When PackageLicenses#fetch
was first implemented, it performed queries that relied on storing every single license for every package version. This was relatively straightforward to implement, and it made sense at the time since we assumed that all the data would take up about 5GB
, however, it turns out that the complete license dataset takes up 26GB.
In an effort to reduce the memory requirements, we've come up with a structure which effectively "compresses" the license information by only recording versions where the licenses change. Packages might have a large number of versions, but most only use a single license, so this allows us to reduce the required storage dramatically, to about 1GB
.
The existing PackageLicenses#fetch
(aka the "uncompressed" fetcher) executes queries against the following tables:
pm_packages
pm_package_versions
pm_package_version_licenses
pm_licenses
The new "compressed" fetcher only needs to query the following tables:
-
pm_packages
The
licenses
jsonb column in this table contains all the necessary license information, in a compressed structure. This allows us to avoid needing to join other tables, and reduces data footprint considerably. -
pm_licenses
This allows us to look up the license identifier for each entry in the
licenses
column fetched from thepm_packages
table.This table contains less than
500
entries and is read into memory, allowing us to avoid separate queries for every license.
PackageLicenses#fetch
work?
How does The Gitlab::LicenseScanning::PackageLicenses#fetch
method accepts the output of the following two methods:
Gitlab::LicenseScanning::PipelineComponents.fetch
Gitlab::LicenseScanning::BranchComponents.fetch
These methods were added by Add classes for fetching SBOM components (!105994 - merged) and output an array of components
, each of which have the following fields:
name
purl_type
version
We then pass this array of components
into the Gitlab::LicenseScanning::PackageLicenses.fetch
which will return the corresponding licenses by querying pm_packages
and pm_licenses
tables which was added by Update DB schema to store data imported from th... (#373163 - closed).
The Gitlab::LicenseScanning::PipelineComponents.fetch
method is required by the new license scanning implementation that is a part of &9400 (closed). This new approach relies on SBoM components instead of a license scanning report and will allow us to remove the need for a separate license scanning job.
Terms and concepts referenced in this MR
-
component
A
Hashie::Mash
object (which will most likely be changed to a PORO in the future) that come from the SBOM report for the pipeline. This SBOM is generated by the GitLab gemnasium analyzer.The conversion of SBOM report to components takes place in Gitlab::LicenseScanning::PipelineComponents#fetch.
Each
component
is a dependency of a project and consists of the following fields:name
purl_type
version
-
pm_packages
The
pm_packages
table contains every known package, across every package manager (maven
,gem
,golang
,nuget
,npm
, etc) and there's currently over5,319,434
entries in thepm_packages
table.The fields of interest in the
pm_packages
table are:-
purl_type
: an integer representing the purl_type of thepackage
. See Enums::Sbom for the complete list of supportpurl_types
. -
name
: the name of the package, for examplemyapi-fastapi
orgithub.com/ocaml/ocamlbuild
. -
licenses
: ajsonb
column containing a serialized list of license information such as:-
default_licenses
- an array of license ids in thepm_licenses
table which are returned when we couldn't determine a matching license -
lowest_version
- the lowest version we support. -
highest_version
- the highest version we support. -
other_versions
- an array ofversions
and thelicense_ids
the correspond to.
-
-
Caches used in this MR
This MR adds two methods for caching data:
-
build_component_versions_cache
We fetch package details from the
pm_packages
table which only contains thepurl_type
andname
. We need a way to know which versions were requested for eachpurl_type
andname
, so we use a hash to store this data for faster lookups. -
There's only about
500
licenses in thepm_licenses
table, and the data doesn't change often, so we use a cache to avoid a separate sql query for every spdx identifier.
Both of these caches are used by the Gitlab::LicenseScanning::PackageLicenses#compressed_fetch method.
Benchmarks
Benchmark results of fetching N records using both compressed
and uncompressed
package metadata. The compressed
package metadata fetcher (the first table) is being implemented in this MR.
Benchmark results with compressed package metadata
===================================================================
user system total real
100 records 0.017064 0.000331 0.017395 ( 0.019686)
200 records 0.026447 0.000283 0.026730 ( 0.031948)
300 records 0.032604 0.000376 0.032980 ( 0.041379)
400 records 0.039828 0.000596 0.040424 ( 0.050058)
500 records 0.049104 0.000743 0.049847 ( 0.060847)
600 records 0.057560 0.000830 0.058390 ( 0.070677)
700 records 0.065156 0.001347 0.066503 ( 0.081667)
800 records 0.072402 0.000933 0.073335 ( 0.090494)
900 records 0.079092 0.001450 0.080542 ( 0.098282)
1000 records 0.087363 0.001284 0.088647 ( 0.110238)
2000 records 0.171223 0.002407 0.173630 ( 0.216396)
3000 records 0.249895 0.003460 0.253355 ( 0.317418)
4000 records 0.327978 0.004005 0.331983 ( 0.410994)
5000 records 0.447468 0.006151 0.453619 ( 0.550493)
===================================================================
Benchmark results with uncompressed package metadata
===================================================================
user system total real
100 records 0.007238 0.000080 0.007318 ( 0.010811)
200 records 0.012162 0.000075 0.012237 ( 0.019070)
300 records 0.019363 0.000179 0.019542 ( 0.031903)
400 records 0.024126 0.000355 0.024481 ( 0.039589)
500 records 0.029296 0.000270 0.029566 ( 0.047063)
600 records 0.035818 0.000891 0.036709 ( 0.058822)
700 records 0.041775 0.000384 0.042159 ( 0.065195)
800 records 0.046606 0.000716 0.047322 ( 0.072832)
900 records 0.052423 0.000537 0.052960 ( 0.083973)
1000 records 0.057706 0.000532 0.058238 ( 0.093362)
2000 records 0.117508 0.002094 0.119602 ( 0.203078)
3000 records 0.167786 0.001746 0.169532 ( 0.277484)
4000 records 0.225191 0.002783 0.227974 ( 0.374707)
5000 records 0.278470 0.002795 0.281265 ( 0.470225)
===================================================================
Data characteristics
Raw SQL
https://paste.depesz.com/s/CJF
SELECT
pm_packages.*
FROM
pm_packages
WHERE
(
pm_packages.purl_type,
pm_packages.name
) IN (
( 4, 'github.com/scop/powerline-go' ),
...
);
Query plan
5,000
random records out of 5,246,788
Planning Time: 23.864 ms
Execution Time: 107.283 ms
- https://explain.depesz.com/s/YsAw
- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/18789/commands/62401
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.
Related to #408906 (closed)