Add class for fetching package licenses
What does this MR do and why?
This MR adds a new Gitlab::LicenseScanning::PackageLicenses class, which accepts the output of the Gitlab::LicenseScanning::PipelineComponents.fetch and Gitlab::LicenseScanning::BranchComponents.fetch methods added by Add classes for fetching SBOM components (!105994 - merged).
The output of those two methods is an array of (name, purl_type, version) tuples. Passing this array to Gitlab::LicenseScanning::PackageLicenses.fetch returns the corresponding licenses by querying the DB tables implemented in Update DB schema to store data imported from th... (#373163 - closed).
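To illustrate the data flow described above, here is a minimal, self-contained sketch of a lookup from component tuples to licenses. It does not use the real GitLab classes: `Component`, `LICENSE_INDEX`, and `lookup_licenses` are hypothetical names, the in-memory hash stands in for the database tables, and the license data is illustrative only.

```ruby
# Hypothetical stand-in for one (name, purl_type, version) tuple.
Component = Struct.new(:name, :purl_type, :version)

# Hypothetical in-memory substitute for the license tables:
# [purl_type, name, version] => SPDX identifiers (illustrative data).
LICENSE_INDEX = {
  ['golang', 'beego', 'v1.10.0'] => ['Apache-2.0'],
  ['npm', 'example-lib', '1.2.3'] => ['MIT', 'BSD-3-Clause']
}.freeze

# Returns one hash per component, with an empty license list
# when the tuple is not present in the index.
def lookup_licenses(components)
  components.map do |component|
    key = [component.purl_type, component.name, component.version]
    {
      name: component.name,
      purl_type: component.purl_type,
      version: component.version,
      licenses: LICENSE_INDEX.fetch(key, [])
    }
  end
end

components = [
  Component.new('beego', 'golang', 'v1.10.0'),
  Component.new('not-indexed', 'npm', '0.0.1')
]

lookup_licenses(components).each { |row| puts row.inspect }
```

The actual return shape of Gitlab::LicenseScanning::PackageLicenses.fetch may differ; the sketch only shows the tuple-in, licenses-out idea.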
This new class is required by the new license scanning implementation that is part of &9400 (closed). The new approach relies on SBOM components instead of a license scanning report, which will allow us to remove the separate license scanning job. Specifically, the classes will be used to complete the implementation of ::Gitlab::LicenseScanning::SbomScanner, which in turn will be used in place of ::Gitlab::LicenseScanning::ArtifactScanner once the feature flag associated with the epic is rolled out.
Benchmarks
Fetching a sequential list of components:
Benchmark results for fetching components from a total of 100,000 records
====================================================================================================
                       user       system     total      real        "time required for fetching 5k records"
Fetching 10 records    0.003736   0.000000   0.003736 ( 0.058958)  5000/10*0.059 = 29.4791
Fetching 100 records   0.004555   0.000050   0.004605 ( 0.054065)  5000/100*0.0541 = 2.7033
Fetching 200 records   0.008378   0.000045   0.008423 ( 0.036221)  5000/200*0.0362 = 0.9055
Fetching 300 records   0.011200   0.000014   0.011214 ( 0.048410)  5000/300*0.0484 = 0.8068
Fetching 400 records   0.013611   0.000000   0.013611 ( 0.046416)  5000/400*0.0464 = 0.5802
Fetching 500 records   0.017079   0.000000   0.017079 ( 0.074067)  5000/500*0.0741 = 0.7407
Fetching 600 records   0.021759   0.000027   0.021786 ( 0.074694)  5000/600*0.0747 = 0.6225
Fetching 700 records   0.022511   0.000909   0.023420 ( 0.079611)  5000/700*0.0796 = 0.5687
Fetching 800 records   0.036056   0.001980   0.038036 ( 0.106466)  5000/800*0.1065 = 0.6654
Fetching 900 records   0.029420   0.000000   0.029420 ( 0.109274)  5000/900*0.1093 = 0.6071
Fetching 1000 records  0.034262   0.003928   0.038190 ( 0.124803)  5000/1000*0.1248 = 0.624
Fetching 2000 records  0.063244   0.000000   0.063244 ( 0.312869)  5000/2000*0.3129 = 0.7822
Fetching 3000 records  0.104679   0.000000   0.104679 ( 0.527759)  5000/3000*0.5278 = 0.8796
Fetching 4000 records  0.123818   0.000803   0.124621 ( 1.039835)  5000/4000*1.0398 = 1.2998
Fetching 5000 records  0.165510   0.000668   0.166178 (10.276034)  5000/5000*10.276 = 10.276
====================================================================================================
The fastest extrapolated time for retrieving 5,000 records is 0.5687 seconds, achieved by fetching 700 records at a time.
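The "time required for fetching 5k records" column is an extrapolation: the number of batches needed to cover 5,000 records, multiplied by the measured real time of a single batch. The sketch below reproduces that arithmetic from the sequential results above; the constant and method names are hypothetical, and the extrapolation assumes every batch takes as long as the measured one.

```ruby
# Measured real time (seconds) per batch size, copied from the
# sequential benchmark table.
REAL_SECONDS_BY_BATCH_SIZE = {
  10 => 0.058958, 100 => 0.054065, 200 => 0.036221, 300 => 0.048410,
  400 => 0.046416, 500 => 0.074067, 600 => 0.074694, 700 => 0.079611,
  800 => 0.106466, 900 => 0.109274, 1000 => 0.124803, 2000 => 0.312869,
  3000 => 0.527759, 4000 => 1.039835, 5000 => 10.276034
}.freeze

TARGET_RECORDS = 5_000

# Number of batches needed (as a fraction) times the time of one batch.
def extrapolated_seconds(batch_size, real_seconds, target = TARGET_RECORDS)
  target.to_f / batch_size * real_seconds
end

estimates = REAL_SECONDS_BY_BATCH_SIZE.to_h do |batch_size, real_seconds|
  [batch_size, extrapolated_seconds(batch_size, real_seconds)]
end

# Pick the batch size with the smallest extrapolated total.
best_size, best_seconds = estimates.min_by { |_size, seconds| seconds }

puts format('Best batch size: %d (about %.4f s for %d records)',
            best_size, best_seconds, TARGET_RECORDS)
```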
Fetching a random list of components:
Benchmark results for fetching random components from a total of 100,000 records
====================================================================================================
                       user       system     total      real        "time required for fetching 5k records"
Fetching 10 records    0.005247   0.000019   0.005266 ( 0.043985)  5000/10*0.044 = 21.9925
Fetching 100 records   0.004303   0.000000   0.004303 ( 0.150138)  5000/100*0.1501 = 7.5069
Fetching 200 records   0.006906   0.000000   0.006906 ( 0.130813)  5000/200*0.1308 = 3.2703
Fetching 300 records   0.011695   0.000000   0.011695 ( 0.135146)  5000/300*0.1351 = 2.2524
Fetching 400 records   0.011985   0.006058   0.018043 ( 0.124392)  5000/400*0.1244 = 1.5549
Fetching 500 records   0.016380   0.000012   0.016392 ( 0.113565)  5000/500*0.1136 = 1.1357
Fetching 600 records   0.019718   0.000045   0.019763 ( 0.123913)  5000/600*0.1239 = 1.0326
Fetching 700 records   0.025286   0.000000   0.025286 ( 0.122520)  5000/700*0.1225 = 0.8751
Fetching 800 records   0.058518   0.002090   0.060608 ( 0.177813)  5000/800*0.1778 = 1.1113
Fetching 900 records   0.029267   0.000000   0.029267 ( 0.141707)  5000/900*0.1417 = 0.7873
Fetching 1000 records  0.030135   0.000870   0.031005 ( 0.151177)  5000/1000*0.1512 = 0.7559
Fetching 2000 records  0.059159   0.000020   0.059179 ( 0.424362)  5000/2000*0.4244 = 1.0609
Fetching 3000 records  0.085405   0.000916   0.086321 ( 0.565139)  5000/3000*0.5651 = 0.9419
Fetching 4000 records  0.115936   0.000000   0.115936 ( 1.083269)  5000/4000*1.0833 = 1.3541
Fetching 5000 records  0.145682   0.000000   0.145682 (11.172354)  5000/5000*11.1724 = 11.1724
====================================================================================================
The fastest extrapolated time for retrieving 5,000 records is 0.7559 seconds, achieved by fetching 1,000 records at a time.
Note: the Gitlab::LicenseScanning::PackageLicenses.fetch method accepts a maximum of 1,000 components; passing more raises an ArgumentError. This limitation will be removed as part of Update Gitlab::LicenseScanning::PackageLicenses... (#388107 - closed).
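Until that limitation is removed, a caller with more than 1,000 components would need to split the input. The following is a minimal sketch of one way to do that, not the approach taken in this MR: `fetch_licenses` is a hypothetical stand-in that only mimics the documented limit, and `fetch_in_batches` is an illustrative helper.

```ruby
MAX_COMPONENTS = 1_000

# Hypothetical stand-in for a fetcher with a 1,000-component limit.
# It returns placeholder results instead of querying a database.
def fetch_licenses(components)
  if components.size > MAX_COMPONENTS
    raise ArgumentError, "expected at most #{MAX_COMPONENTS} components, got #{components.size}"
  end

  components.map { |component| component.merge(licenses: []) }
end

# Splits the input into slices no larger than batch_size and
# concatenates the per-slice results in their original order.
def fetch_in_batches(components, batch_size: MAX_COMPONENTS)
  components.each_slice(batch_size).flat_map { |batch| fetch_licenses(batch) }
end

components = Array.new(2_500) do |i|
  { name: "pkg-#{i}", purl_type: 'npm', version: '1.0.0' }
end

results = fetch_in_batches(components)
puts "Fetched #{results.size} results in batches of up to #{MAX_COMPONENTS}"
```

The benchmarks above suggest that batch sizes between roughly 700 and 1,000 gave the lowest extrapolated totals in this test setup, so a default near the limit seems a reasonable starting point.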
Data characteristics
Raw SQL
https://paste.depesz.com/s/coK
SELECT
    pm_packages.purl_type,
    pm_packages.name,
    pm_package_versions.version,
    array_agg( pm_licenses.spdx_identifier )
FROM
    pm_packages
    JOIN pm_package_versions ON pm_package_versions.pm_package_id = pm_packages.id
    JOIN pm_package_version_licenses ON pm_package_version_licenses.pm_package_version_id = pm_package_versions.id
    JOIN pm_licenses ON pm_licenses.id = pm_package_version_licenses.pm_license_id
WHERE
    (
        pm_packages.purl_type,
        pm_packages.name,
        pm_package_versions.version
    ) IN (
        ( 4, 'beego', 'v1.10.0' ),
        ...
    )
GROUP BY
    pm_packages.purl_type,
    pm_packages.name,
    pm_package_versions.version;
Query plan
Fetching 1,000 random records out of 100,000
https://explain.depesz.com/s/MKL4
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #384888 (closed)