Skip to content

Update PackageLicenses.fetch to use batched query

Adam Cohen requested to merge 388107-process-components-in-batches into master

What does this MR do and why?

Add class for fetching package licenses (!107186 - merged) added a new Gitlab::LicenseScanning::PackageLicenses class to allow us to fetch licenses for packages, however, it's currently limited to accept a maximum of 1,000 components, for the performance reasons provided in this comment.

This MR Updates the Gitlab::LicenseScanning::PackageLicenses.fetch method so it executes queries in batches of 700 records, in order to allow fetching components of arbitrary length.

Benchmarks

Benchmark results for fetching random components from a total of 100,000 records in batches of 700 records
==========================================================================================================
                            user     system      total        real
Fetching    10 records  0.003636   0.000148   0.003784 (  0.059537)
Fetching   100 records  0.005846   0.002001   0.007847 (  0.272536)
Fetching   200 records  0.007075   0.000000   0.007075 (  0.188485)
Fetching   300 records  0.009253   0.000809   0.010062 (  0.159228)
Fetching   400 records  0.014246   0.000102   0.014348 (  0.157200)
Fetching   500 records  0.019136   0.002015   0.021151 (  0.143743)
Fetching   600 records  0.020793   0.000053   0.020846 (  0.130405)
Fetching   700 records  0.024426   0.000107   0.024533 (  0.149306)
Fetching   800 records  0.028300   0.000000   0.028300 (  0.137121)
Fetching   900 records  0.032117   0.000067   0.032184 (  0.142841)
Fetching  1000 records  0.031838   0.001977   0.033815 (  0.146147)
Fetching  2000 records  0.078207   0.000954   0.079161 (  0.386699)
Fetching  3000 records  0.101700   0.000132   0.101832 (  0.382380)
Fetching  4000 records  0.136208   0.000050   0.136258 (  0.474791)
Fetching  5000 records  0.165363   0.003066   0.168429 (  0.582525)
Fetching 10000 records  0.313918   0.003850   0.317768 (  1.090370)
Fetching 20000 records  0.667039   0.006161   0.673200 (  2.270650)
Fetching 30000 records  1.073682   0.008295   1.081977 (  3.440063)
Fetching 40000 records  1.423224   0.010339   1.433563 (  4.609655)
Fetching 50000 records  1.885028   0.014472   1.899500 (  5.858486)
Fetching 60000 records  1.918566   0.031483   1.950049 (  6.615673)
==========================================================================================================

(benchmarking source code)

Data characteristics

Raw SQL

https://paste.depesz.com/s/coK

SELECT
    pm_packages.purl_type,
    pm_packages.name,
    pm_package_versions.version,
    array_agg( pm_licenses.spdx_identifier )
FROM
    pm_packages
    JOIN pm_package_versions ON pm_package_versions.pm_package_id = pm_packages.id
    JOIN pm_package_version_licenses ON pm_package_version_licenses.pm_package_version_id = pm_package_versions.id
    JOIN pm_licenses ON pm_licenses.id = pm_package_version_licenses.pm_license_id
WHERE
    (
        pm_packages.purl_type,
        pm_packages.name,
        pm_package_versions.version
    ) IN (
        ( 4, 'beego', 'v1.10.0' ),
        ...
    )
GROUP BY
    pm_packages.purl_type,
    pm_packages.name,
    pm_package_versions.version;

Query plan

Fetching 700 random records out of 100,000

https://explain.depesz.com/s/rBdx

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #388107 (closed)

Edited by Adam Cohen

Merge request reports

Loading