Update package metadata license lookup to use compressed data (!119607) · Merge requests · GitLab.org / GitLab

Adam Cohen requested to merge 408906-update-package-licenses-to-use-deduplicated-data into master May 04, 2023

What does this MR do and why?

This MR is the first of two, and has been split this way to reduce the size of each MR:

This MR
- adds a new compressed_package_metadata_query feature flag.
  
  This MR stubs compressed_package_metadata_query to false in all the tests.
- updates PackageLicenses#fetch to use compressed data if the compressed_package_metadata_query feature flag is enabled.
Test both sides of compressed_package_metadata_... (!120207 - merged)
- Tests both sides of the compressed_package_metadata_query feature flag.

Background

When PackageLicenses#fetch was first implemented, it performed queries that relied on storing every single license for every package version. This was relatively straightforward to implement, and it made sense at the time since we assumed that all the data would take up about 5GB, however, it turns out that the complete license dataset takes up 26GB.

In an effort to reduce the memory requirements, we've come up with a structure which effectively "compresses" the license information by only recording versions where the licenses change. Packages might have a large number of versions, but most only use a single license, so this allows us to reduce the required storage dramatically, to about 1GB.

The existing PackageLicenses#fetch (aka the "uncompressed" fetcher) executes queries against the following tables:

pm_packages
pm_package_versions
pm_package_version_licenses
pm_licenses

The new "compressed" fetcher only needs to query the following tables:

pm_packages

The licenses jsonb column in this table contains all the necessary license information, in a compressed structure. This allows us to avoid needing to join other tables, and reduces data footprint considerably.
pm_licenses

This allows us to look up the license identifier for each entry in the licenses column fetched from the pm_packages table.

This table contains less than 500 entries and is read into memory, allowing us to avoid separate queries for every license.

How does `PackageLicenses#fetch` work?

The Gitlab::LicenseScanning::PackageLicenses#fetch method accepts the output of the following two methods:

Gitlab::LicenseScanning::PipelineComponents.fetch
Gitlab::LicenseScanning::BranchComponents.fetch

These methods were added by Add classes for fetching SBOM components (!105994 - merged) and output an array of components, each of which have the following fields:

name
purl_type
version

We then pass this array of components into the Gitlab::LicenseScanning::PackageLicenses.fetch which will return the corresponding licenses by querying pm_packages and pm_licenses tables which was added by Update DB schema to store data imported from th... (#373163 - closed).

The Gitlab::LicenseScanning::PipelineComponents.fetch method is required by the new license scanning implementation that is a part of &9400 (closed). This new approach relies on SBoM components instead of a license scanning report and will allow us to remove the need for a separate license scanning job.

Terms and concepts referenced in this MR

component

A Hashie::Mash object (which will most likely be changed to a PORO in the future) that come from the SBOM report for the pipeline. This SBOM is generated by the GitLab gemnasium analyzer.

The conversion of SBOM report to components takes place in Gitlab::LicenseScanning::PipelineComponents#fetch.

Each component is a dependency of a project and consists of the following fields:
- name
- purl_type
- version
pm_packages

The pm_packages table contains every known package, across every package manager (maven, gem, golang, nuget, npm, etc) and there's currently over 5,319,434 entries in the pm_packages table.

The fields of interest in the pm_packages table are:
- purl_type: an integer representing the purl_type of the package. See Enums::Sbom for the complete list of support purl_types.
- name: the name of the package, for example myapi-fastapi or github.com/ocaml/ocamlbuild.
- licenses: a jsonb column containing a serialized list of license information such as:
  - default_licenses - an array of license ids in the pm_licenses table which are returned when we couldn't determine a matching license
  - lowest_version - the lowest version we support.
  - highest_version - the highest version we support.
  - other_versions - an array of versions and the license_ids the correspond to.

Caches used in this MR

This MR adds two methods for caching data:

build_component_versions_cache

We fetch package details from the pm_packages table which only contains the purl_type and name. We need a way to know which versions were requested for each purl_type and name, so we use a hash to store this data for faster lookups.
build_licenses_cache

There's only about 500 licenses in the pm_licenses table, and the data doesn't change often, so we use a cache to avoid a separate sql query for every spdx identifier.

Both of these caches are used by the Gitlab::LicenseScanning::PackageLicenses#compressed_fetch method.

Benchmarks

Benchmark results of fetching N records using both compressed and uncompressed package metadata. The compressed package metadata fetcher (the first table) is being implemented in this MR.

Benchmark results with compressed package metadata
===================================================================
                           user     system      total        real
  100 records          0.017064   0.000331   0.017395 (  0.019686)
  200 records          0.026447   0.000283   0.026730 (  0.031948)
  300 records          0.032604   0.000376   0.032980 (  0.041379)
  400 records          0.039828   0.000596   0.040424 (  0.050058)
  500 records          0.049104   0.000743   0.049847 (  0.060847)
  600 records          0.057560   0.000830   0.058390 (  0.070677)
  700 records          0.065156   0.001347   0.066503 (  0.081667)
  800 records          0.072402   0.000933   0.073335 (  0.090494)
  900 records          0.079092   0.001450   0.080542 (  0.098282)
 1000 records          0.087363   0.001284   0.088647 (  0.110238)
 2000 records          0.171223   0.002407   0.173630 (  0.216396)
 3000 records          0.249895   0.003460   0.253355 (  0.317418)
 4000 records          0.327978   0.004005   0.331983 (  0.410994)
 5000 records          0.447468   0.006151   0.453619 (  0.550493)
===================================================================


Benchmark results with uncompressed package metadata
===================================================================
                           user     system      total        real
  100 records          0.007238   0.000080   0.007318 (  0.010811)
  200 records          0.012162   0.000075   0.012237 (  0.019070)
  300 records          0.019363   0.000179   0.019542 (  0.031903)
  400 records          0.024126   0.000355   0.024481 (  0.039589)
  500 records          0.029296   0.000270   0.029566 (  0.047063)
  600 records          0.035818   0.000891   0.036709 (  0.058822)
  700 records          0.041775   0.000384   0.042159 (  0.065195)
  800 records          0.046606   0.000716   0.047322 (  0.072832)
  900 records          0.052423   0.000537   0.052960 (  0.083973)
 1000 records          0.057706   0.000532   0.058238 (  0.093362)
 2000 records          0.117508   0.002094   0.119602 (  0.203078)
 3000 records          0.167786   0.001746   0.169532 (  0.277484)
 4000 records          0.225191   0.002783   0.227974 (  0.374707)
 5000 records          0.278470   0.002795   0.281265 (  0.470225)
===================================================================

Data characteristics

Raw SQL

https://paste.depesz.com/s/CJF

SELECT
    pm_packages.*
FROM
    pm_packages
WHERE
    (
        pm_packages.purl_type,
        pm_packages.name
    ) IN (
        ( 4, 'github.com/scop/powerline-go' ),
        ...
    );

Query plan

5,000 random records out of 5,246,788

Planning Time: 23.864 ms
Execution Time: 107.283 ms

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Related to #408906 (closed)

Edited May 24, 2023 by Adam Cohen

Update package metadata license lookup to use compressed data