Cardinality error on ingestion of v2 licenses
Summary
A PG::CardinalityViolation error is thrown when ingesting version_format v2 license data.
There are duplicate package names in the exported ndjson files. When ingestion issues a bulk upsert, the above error is thrown because duplicate entries for a unique key are not allowed within the same bulk insert query.
This seems to affect only pypi exports, because package names are not normalized there. Example: pypatchmatch and PyPatchMatch in https://storage.googleapis.com/prod-export-license-bucket-1a6c642fc4de57d4/v2/pypi/1685970065/000000026.ndjson
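To illustrate how the two names in the example can collide, here is a minimal sketch using the name normalization rule from PEP 503. Whether the exporter or the ingestion code applies exactly this rule is an assumption; the issue only states that the names are not normalized. The function name normalize_pypi_name is hypothetical.

```python
import re


def normalize_pypi_name(name: str) -> str:
    """Normalize a project name following PEP 503 (assumed rule)."""
    # Collapse runs of '-', '_' and '.' into a single '-', then lowercase.
    return re.sub(r"[-_.]+", "-", name).lower()


# The two names reported in the issue.
names = ["pypatchmatch", "PyPatchMatch"]

# Both names map to the same normalized form, so the set has one element.
normalized = {normalize_pypi_name(n) for n in names}
print(normalized)
```

If the unique key is compared on the normalized form, both rows target the same key within one bulk upsert, which matches the failure described above.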
Steps to reproduce
n/a
Example Project
n/a
What is the current bug behavior?
A cardinality error is thrown when duplicate package names exist in the same bulk insert dataset.
What is the expected correct behavior?
Duplicates should be filtered and no error should be thrown.
Relevant logs and/or screenshots
Log of failures on staging: https://nonprod-log.gitlab.net/app/r/s/0mkS7
Any document with status fail can be clicked on to see the error and backtrace.
Output of checks
This error is present on staging with the feature flag compressed_package_metadata_synchronization enabled.
Results of GitLab environment info
The compressed_package_metadata_synchronization feature flag must be enabled to turn on v2 ingestion.
Possible fixes
- monolith: deduplicate bulk insert dataset
- exporter: ensure that package names are normalized
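The two options could be combined roughly as follows. This is a sketch in Python for illustration only, not the monolith's actual ingestion code. The record layout with a "name" field, the helper names, the placeholder license values, and the choice to let the last occurrence win are all assumptions.

```python
import re
from typing import Dict, Iterable, List


def normalize_pypi_name(name: str) -> str:
    """Normalize a project name following PEP 503 (assumed rule)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def deduplicate_batch(records: Iterable[dict]) -> List[dict]:
    """Return one record per normalized package name.

    The last record seen for a name wins; output order follows the
    first appearance of each name.
    """
    unique: Dict[str, dict] = {}
    for record in records:
        key = normalize_pypi_name(record["name"])
        # Store a copy with the normalized name so the upsert key is stable.
        unique[key] = {**record, "name": key}
    return list(unique.values())


# Made-up batch: the first two rows differ only in letter case.
batch = [
    {"name": "pypatchmatch", "licenses": ["license-a"]},
    {"name": "PyPatchMatch", "licenses": ["license-b"]},
    {"name": "other-package", "licenses": ["license-c"]},
]

deduped = deduplicate_batch(batch)
print([r["name"] for r in deduped])
```

Deduplicating in the monolith protects ingestion even if an export still contains duplicates, while normalizing in the exporter addresses the source of the duplicates. Which record should win when duplicates carry different license data is left open here.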