Update PackageLicenses.fetch to use batched query
What does this MR do and why?
Add class for fetching package licenses (!107186 - merged) added a new Gitlab::LicenseScanning::PackageLicenses
class to allow us to fetch licenses for packages, however, it's currently limited to accept a maximum of 1,000
components, for the performance reasons provided in this comment.
This MR Updates the Gitlab::LicenseScanning::PackageLicenses.fetch
method so it executes queries in batches of 700 records, in order to allow fetching components of arbitrary length.
Benchmarks
Benchmark results for fetching random components from a total of 100,000 records in batches of 700 records
==========================================================================================================
user system total real
Fetching 10 records 0.003636 0.000148 0.003784 ( 0.059537)
Fetching 100 records 0.005846 0.002001 0.007847 ( 0.272536)
Fetching 200 records 0.007075 0.000000 0.007075 ( 0.188485)
Fetching 300 records 0.009253 0.000809 0.010062 ( 0.159228)
Fetching 400 records 0.014246 0.000102 0.014348 ( 0.157200)
Fetching 500 records 0.019136 0.002015 0.021151 ( 0.143743)
Fetching 600 records 0.020793 0.000053 0.020846 ( 0.130405)
Fetching 700 records 0.024426 0.000107 0.024533 ( 0.149306)
Fetching 800 records 0.028300 0.000000 0.028300 ( 0.137121)
Fetching 900 records 0.032117 0.000067 0.032184 ( 0.142841)
Fetching 1000 records 0.031838 0.001977 0.033815 ( 0.146147)
Fetching 2000 records 0.078207 0.000954 0.079161 ( 0.386699)
Fetching 3000 records 0.101700 0.000132 0.101832 ( 0.382380)
Fetching 4000 records 0.136208 0.000050 0.136258 ( 0.474791)
Fetching 5000 records 0.165363 0.003066 0.168429 ( 0.582525)
Fetching 10000 records 0.313918 0.003850 0.317768 ( 1.090370)
Fetching 20000 records 0.667039 0.006161 0.673200 ( 2.270650)
Fetching 30000 records 1.073682 0.008295 1.081977 ( 3.440063)
Fetching 40000 records 1.423224 0.010339 1.433563 ( 4.609655)
Fetching 50000 records 1.885028 0.014472 1.899500 ( 5.858486)
Fetching 60000 records 1.918566 0.031483 1.950049 ( 6.615673)
==========================================================================================================
Data characteristics
Raw SQL
https://paste.depesz.com/s/coK
SELECT
pm_packages.purl_type,
pm_packages.name,
pm_package_versions.version,
array_agg( pm_licenses.spdx_identifier )
FROM
pm_packages
JOIN pm_package_versions ON pm_package_versions.pm_package_id = pm_packages.id
JOIN pm_package_version_licenses ON pm_package_version_licenses.pm_package_version_id = pm_package_versions.id
JOIN pm_licenses ON pm_licenses.id = pm_package_version_licenses.pm_license_id
WHERE
(
pm_packages.purl_type,
pm_packages.name,
pm_package_versions.version
) IN (
( 4, 'beego', 'v1.10.0' ),
...
)
GROUP BY
pm_packages.purl_type,
pm_packages.name,
pm_package_versions.version;
Query plan
Fetching 700 random records out of 100,000
https://explain.depesz.com/s/rBdx
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.
Related to #388107 (closed)