Skip to content

Add class for fetching package licenses

Adam Cohen requested to merge 384888-create-package-license-fetcher into master

What does this MR do and why?

This MR adds a new Gitlab::LicenseScanning::PackageLicenses class which accepts the output of the Gitlab::LicenseScanning::PipelineComponents.fetch and Gitlab::LicenseScanning::BranchComponents.fetch methods added by Add classes for fetching SBOM components (!105994 - merged).

The output from the Gitlab::LicenseScanning::PipelineComponents.fetch and Gitlab::LicenseScanning::BranchComponents.fetch methods consists of an array of name, purl_type, version tuples. Feeding this array of tuples into the Gitlab::LicenseScanning::PackageLicenses.fetch will return the corresponding licenses by querying the DB tables implemented in Update DB schema to store data imported from th... (#373163 - closed).

This new class is required by the new license scanning implementation that is a part of &9400 (closed). This new approach relies on SBoM components instead of a license scanning report and will allow us to remove the need for a separate license scanning job. Specifically, the classes will be used to complete the implementation of ::Gitlab::LicenseScanning::SbomScanner which in turn will be used in place of ::Gitlab::LicenseScanning::ArtifactScanner once the FF associated with the epic is rolled out.

Benchmarks

Fetching sequential list of components:

Benchmark results for fetching components from a total of 100,000 records
====================================================================================================
                         user     system      total        real      "time required for fetching 5k records"
Fetching   10 records  0.003736   0.000000   0.003736 (  0.058958)   5000/10*0.059    = 29.4791
Fetching  100 records  0.004555   0.000050   0.004605 (  0.054065)   5000/100*0.0541  = 2.7033
Fetching  200 records  0.008378   0.000045   0.008423 (  0.036221)   5000/200*0.0362  = 0.9055
Fetching  300 records  0.011200   0.000014   0.011214 (  0.048410)   5000/300*0.0484  = 0.8068
Fetching  400 records  0.013611   0.000000   0.013611 (  0.046416)   5000/400*0.0464  = 0.5802
Fetching  500 records  0.017079   0.000000   0.017079 (  0.074067)   5000/500*0.0741  = 0.7407
Fetching  600 records  0.021759   0.000027   0.021786 (  0.074694)   5000/600*0.0747  = 0.6225
Fetching  700 records  0.022511   0.000909   0.023420 (  0.079611)   5000/700*0.0796  = 0.5687
Fetching  800 records  0.036056   0.001980   0.038036 (  0.106466)   5000/800*0.1065  = 0.6654
Fetching  900 records  0.029420   0.000000   0.029420 (  0.109274)   5000/900*0.1093  = 0.6071
Fetching 1000 records  0.034262   0.003928   0.038190 (  0.124803)   5000/1000*0.1248 = 0.624
Fetching 2000 records  0.063244   0.000000   0.063244 (  0.312869)   5000/2000*0.3129 = 0.7822
Fetching 3000 records  0.104679   0.000000   0.104679 (  0.527759)   5000/3000*0.5278 = 0.8796
Fetching 4000 records  0.123818   0.000803   0.124621 (  1.039835)   5000/4000*1.0398 = 1.2998
Fetching 5000 records  0.165510   0.000668   0.166178 ( 10.276034)   5000/5000*10.276 = 10.276
====================================================================================================
Fastest time for retrieving 5,000 records is 0.5687 by fetching 700 records at a time.

Fetching random list of components:

Benchmark results for fetching random components from a total of 100,000 records
====================================================================================================
                           user     system      total        real   "time required for fetching 5k records"
Fetching   10 records  0.005247   0.000019   0.005266 (  0.043985)    5000/10*0.044 = 21.9925
Fetching  100 records  0.004303   0.000000   0.004303 (  0.150138)   5000/100*0.1501 = 7.5069
Fetching  200 records  0.006906   0.000000   0.006906 (  0.130813)   5000/200*0.1308 = 3.2703
Fetching  300 records  0.011695   0.000000   0.011695 (  0.135146)   5000/300*0.1351 = 2.2524
Fetching  400 records  0.011985   0.006058   0.018043 (  0.124392)   5000/400*0.1244 = 1.5549
Fetching  500 records  0.016380   0.000012   0.016392 (  0.113565)   5000/500*0.1136 = 1.1357
Fetching  600 records  0.019718   0.000045   0.019763 (  0.123913)   5000/600*0.1239 = 1.0326
Fetching  700 records  0.025286   0.000000   0.025286 (  0.122520)   5000/700*0.1225 = 0.8751
Fetching  800 records  0.058518   0.002090   0.060608 (  0.177813)   5000/800*0.1778 = 1.1113
Fetching  900 records  0.029267   0.000000   0.029267 (  0.141707)   5000/900*0.1417 = 0.7873
Fetching 1000 records  0.030135   0.000870   0.031005 (  0.151177)   5000/1000*0.1512 = 0.7559
Fetching 2000 records  0.059159   0.000020   0.059179 (  0.424362)   5000/2000*0.4244 = 1.0609
Fetching 3000 records  0.085405   0.000916   0.086321 (  0.565139)   5000/3000*0.5651 = 0.9419
Fetching 4000 records  0.115936   0.000000   0.115936 (  1.083269)   5000/4000*1.0833 = 1.3541
Fetching 5000 records  0.145682   0.000000   0.145682 ( 11.172354)   5000/5000*11.1724 = 11.1724
====================================================================================================
Fastest time for retrieving 5,000 records is 0.7559 by fetching 1,000 records at a time.

Note: the Gitlab::LicenseScanning::PackageLicenses.fetch method accepts a maximum of 1,000 components, otherwise an ArgumentError exception will be raised. This limitation will be removed as part of Update Gitlab::LicenseScanning::PackageLicenses... (#388107 - closed).

Data characteristics

Raw SQL

https://paste.depesz.com/s/coK

SELECT
    pm_packages.purl_type,
    pm_packages.name,
    pm_package_versions.version,
    array_agg( pm_licenses.spdx_identifier )
FROM
    pm_packages
    JOIN pm_package_versions ON pm_package_versions.pm_package_id = pm_packages.id
    JOIN pm_package_version_licenses ON pm_package_version_licenses.pm_package_version_id = pm_package_versions.id
    JOIN pm_licenses ON pm_licenses.id = pm_package_version_licenses.pm_license_id
WHERE
    (
        pm_packages.purl_type,
        pm_packages.name,
        pm_package_versions.version
    ) IN (
        ( 4, 'beego', 'v1.10.0' ),
        ...
    )
GROUP BY
    pm_packages.purl_type,
    pm_packages.name,
    pm_package_versions.version;

Query plan

Fetching 1,000 random records out of 100,000

https://explain.depesz.com/s/MKL4

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #384888 (closed)

Edited by Adam Cohen

Merge request reports

Loading