Partition package metadata tables
Problem to solve
The schema of package metadata has a boundary around the package registry type: purl_type
. This allows an efficient partitioning of the database for package metadata for insert and select queries which is quite important as the dataset is expected to grow large.
Sample data stats
Using sample import data for the maven package registry loaded into schema introduced in Update DB schema to store data imported from th... (#373163 - closed), we can present some stats.
gitlabhq_development=# select count(*) from pm_packages where purl_type = 5;
count
--------
455576
(1 row)
Per-package version data:
gitlabhq_development=# select count(*) from pm_packages ps join pm_package_versions pvs on ps.id = pvs.pm_package_id where purl_type = 5;
count
---------
7631100
(1 row)
Per-package version license data:
gitlabhq_development=# select count(*) from pm_packages ps join pm_package_versions pvs on ps.id = pvs.pm_package_id join pm_package_version_licenses pvls on pvs.id = pvls.pm_package_version_id where purl_type = 5;
count
---------
7631183
(1 row)
Package version stats:
gitlabhq_development=# select percentile_cont(0.75) within group (order by cnt) as percentile_75, percentile_cont(0.90) within group (order by cnt) as percentile_90, percentile_cont(0.95) within group (order by cnt) as percentile_95, percentile_cont(0.99) within group (order by cnt) as percentile_99 from (select name, count(*) as cnt from pm_packages ps inner join pm_package_versions pvs on ps.id = pvs.pm_package_id where purl_type = 1 group by (purl_type,name)) as A;
percentile_75 | percentile_90 | percentile_95 | percentile_99
---------------+---------------+---------------+---------------
13 | 36 | 64 | 203
For the table schema: see pm_
tables and relations in !102794 (merged) and https://gitlab.com/gitlab-org/gitlab/-/raw/373163-add-package-metadata/db/structure.sql specifically.
Table stats (there is fake data added for each purl_type
so first create a clean maven
-only table):
drop table if exists test_data_pm_packages_5;
create table test_data_pm_packages_5 (
like pm_packages
including defaults
including constraints
including indexes
);
insert into test_data_pm_packages_5 select * from pm_packages where purl_type = 5;
drop table if exists test_data_pm_package_versions_5;
create table test_data_pm_package_versions_5 (
like pm_package_versions
including defaults
including constraints
including indexes
);
insert into test_data_pm_package_versions_5 select pvs.* from pm_packages ps join pm_package_versions pvs on ps.id = pvs.pm_package_id where purl_type = 5;
drop table if exists test_data_pm_package_version_licenses_5;
create table test_data_pm_package_version_licenses_5 (
like pm_package_version_licenses
including defaults
including constraints
including indexes
);
insert into test_data_pm_package_version_licenses_5 select pvls.* from pm_packages ps join pm_package_versions pvs on ps.id = pvs.pm_package_id join pm_package_version_licenses pvls on pvls.pm_package_version_id = pvs.id where purl_type = 5;
SELECT i.relname "table",indexrelname "index",
pg_size_pretty(pg_total_relation_size(relid)) As "Total",
pg_size_pretty(pg_relation_size(relid)) as "Table Size",
pg_size_pretty(pg_relation_size(indexrelid)) "Index Size",
reltuples::bigint "Num rows estimate"
FROM pg_stat_all_indexes i JOIN pg_class c ON i.relid=c.oid
WHERE i.relname like 'test_data_pm_%';
table | index | Total | Table Size | Index Size | Num rows estimate
-----------------------------------------+-----------------------------------------------------------------+---------+------------+------------+-------------------
test_data_pm_package_version_licenses_5 | test_data_pm_package_version_licenses_5_pkey | 991 MB | 322 MB | 298 MB | 0
test_data_pm_package_version_licenses_5 | test_data_pm_package_version_licenses_pm_package_version_id_idx | 991 MB | 322 MB | 213 MB | 0
test_data_pm_package_version_licenses_5 | test_data_pm_package_version_licenses_5_pm_license_id_idx | 991 MB | 322 MB | 158 MB | 0
test_data_pm_package_versions_5 | test_data_pm_package_versions_5_pkey | 1009 MB | 407 MB | 164 MB | 7631146
test_data_pm_package_versions_5 | test_data_pm_package_versions_5_pm_package_id_idx | 1009 MB | 407 MB | 177 MB | 7631146
test_data_pm_package_versions_5 | test_data_pm_package_versions_5_pm_package_id_version_idx | 1009 MB | 407 MB | 261 MB | 7631146
test_data_pm_packages_5 | test_data_pm_packages_5_pkey | 80 MB | 35 MB | 16 MB | 455576
test_data_pm_packages_5 | test_data_pm_packages_5_purl_type_name_idx | 80 MB | 35 MB | 28 MB | 455576
Proposal
purl_type
is a natural partition boundary with partitions created by list. 8 types are currently supported.
Several partitions are possible:
- partition
pm_packages
onpurl_type
- pros: neat partition
- cons: ignores larger tables
- partition
pm_packages
andpm_package_versions
on `purl_type- pros: partitions one of the largest tables
- cons: there's still a large table (pm_package_version_licenses) left over
- partition all 3 tables
pm_package_version_licenses
as above- pros: considers the larger tables
- cons: in order to pass duplicate handling to the database (e.g.
INSERT ... ON CONFLICT DO UPDATE
) the usage of foreign key constraints becomes more complicated as some combination of primary key columns and (in certain cases) unique indexes must exist in the list partition (which currently only accepts one column).
Out of the 3 proposals, the last seems to make the most sense and is a scalable solution when partitioning on purl_type
. The last major "gotcha" or complication (this is true for all 3 proposals) is that when a new purl_type
is added, the schema must be updated to add a new partition over that type.
Implementation plan
Create list partitions on purl_type
and denormalize by adding the purl_type
column to pm_package_versions
and pm_package_version_licenses
so that these tables can be list partitioned as well.
-
update schema using migration guide in https://docs.gitlab.com/ee/development/database/table_partitioning.html -
pm_packages -
remove primary key on id column -
add unique index for foreign keys and inserts with purl_type
-
create list partition on purl_type
-
-
pm_package_versions - denormalize to add
purl_type
to table - add unique index for foreign keys and inserts with
purl_type
- create
list
partition on purl_type
- denormalize to add
-
pm_package_version_licenses - denormalize to add
purl_type
to table - add unique index for foreign keys and inserts with
purl_type
- create
list
partition on purl_type
- denormalize to add
-
add documentation to mention that partitions must be added for new package registry support in package metadata
Testing
-
add bulk insert test script to populate the changed tables (for repeatable results) -
provide table stats -
provide explain on the main anticipated query to fetch spdx_identifiers for a project's dependnecies
Ensure
-
postgresql can still handle duplicates through the on conflict do update
clause -
ensure that indexes utilized are still optimally configured with purl_type
specified as constant