Export licenses to version format v2
Why are we doing this work
Imported package metadata contains a large amount of duplication, which is causing database size issues for its consumers. license-exporter should be changed to remove the duplication in the dataset in order to store the most compact version_format possible.
Relevant links
- Research spike: #407454 (closed)
- Discussion of relevant data structure and algorithm: #407454 (comment 1357478354)
Non-functional requirements
- Documentation: n/a
- Feature flag: n/a
- Performance: n/a
- Testing: n/a
Implementation outline
Deduplication is accomplished by grouping the data for a package into licenses-to-versions sets. Because most packages have a single license, they do not need to store any information other than the license name, which applies to the full dataset (e.g. { "rails": "MIT" }).
For packages with multiple licenses-to-versions sets, the data structure has to evolve. The default license-set will still be stored. Other licenses-to-versions combinations need to store the license-set and the full list of versions that correspond to it. The data structure in the above example thus evolves to: { default: "MIT", other: { "Apache": ["7.0.1", "7.0.2", "7.0.3"] } }.
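The grouping described above can be sketched roughly as follows. This is a minimal illustration in Python, not the exporter's actual code: the function name group_by_license_set is hypothetical, and two details are assumptions that the outline does not specify, namely that the license-set covering the most versions becomes the default and that a set with several licenses is keyed by joining the names with a comma.

```python
from collections import defaultdict


def group_by_license_set(version_licenses):
    """Collapse {version: [licenses]} into the compact form described above."""
    groups = defaultdict(list)
    for version, licenses in version_licenses.items():
        # Sort and deduplicate so each license-set has one canonical key.
        key = ",".join(sorted(set(licenses)))  # assumption: comma-joined key
        groups[key].append(version)

    # Assumption: the license-set covering the most versions is the default.
    default = max(groups, key=lambda k: len(groups[k]))
    if len(groups) == 1:
        # Single license-set: only the license name needs to be stored.
        return default

    other = {key: versions for key, versions in groups.items() if key != default}
    return {"default": default, "other": other}


if __name__ == "__main__":
    rails = {
        "6.1.0": ["MIT"],
        "7.0.0": ["MIT"],
        "7.0.1": ["Apache"],
    }
    print({"rails": group_by_license_set(rails)})
```

Because only the exceptions carry explicit version lists, the size of the stored object grows with the number of versions that deviate from the default rather than with the total number of versions.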
Additionally, the maximum version seen so far also needs to be stored so as not to misrepresent versions that have not yet been ingested. For example: if rails licenses have been ingested up to 7.0.0, the database has { rails: { default: MIT } }, and when the caller queries 7.0.1 they will incorrectly infer that the license for this version is MIT. For this reason the maximum version seen so far is also stored. This can be done via a highest_version attribute: { default: "MIT", other: { "Apache": ["7.0.1", "7.0.2", "7.0.3"] }, highest_version: "7.0.0" }
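To illustrate why highest_version matters, here is a hedged sketch of how a consumer might resolve a license from such an entry. The helper names parse_version and license_for are hypothetical, and the version comparison assumes plain dotted numeric versions, which real package ecosystems do not guarantee.

```python
def parse_version(version):
    # Assumption: plain dotted numeric versions such as "7.0.1".
    return tuple(int(part) for part in version.split("."))


def license_for(entry, version):
    """Return the license-set for a version, or None if it is not yet ingested."""
    highest = entry.get("highest_version")
    if highest is not None and parse_version(version) > parse_version(highest):
        # Newer than anything ingested so far: do not fall back to the default.
        return None
    for license_set, versions in entry.get("other", {}).items():
        if version in versions:
            return license_set
    return entry["default"]


if __name__ == "__main__":
    entry = {
        "default": "MIT",
        "other": {"Apache": ["6.0.1"]},
        "highest_version": "7.0.0",
    }
    print(license_for(entry, "6.1.0"))  # default applies
    print(license_for(entry, "6.0.1"))  # explicit exception
    print(license_for(entry, "7.0.1"))  # beyond highest_version
```

Without the highest_version check, the last lookup would fall through to the default and report MIT for a version whose license has never been ingested.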
Sets of licenses shouldn't have duplicates.
Exported data should match the constraints we have in the JSON schema for pm_packages.licenses.
Pseudocode of changes
- the URL written is updated to support version_format v2
  - Old format: https://bucket/v1/purl_type/sequence/chunk.csv
  - New format: https://bucket/v2/purl_type/sequence/chunk.ndjson
- the bucket data is encoded as ndjson
- the data structure output is updated to the above
- the export algorithm is changed to
  - fetch packages which have been updated since a given timestamp together with all of their licenses
  - group license data by licenses together with their corresponding versions
  - output resulting json object to ndjson file
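The URL change and the ndjson encoding listed above could look roughly like this. It is a sketch under stated assumptions: object_url and write_ndjson are hypothetical names, and the per-line record shape (one { package: licenses } object per line) is assumed from the examples in the outline rather than confirmed by it.

```python
import io
import json


def object_url(bucket, purl_type, sequence, chunk, version_format=2):
    """Build the object URL for a chunk; v1 keeps the CSV extension."""
    extension = "ndjson" if version_format == 2 else "csv"
    return f"https://{bucket}/v{version_format}/{purl_type}/{sequence}/{chunk}.{extension}"


def write_ndjson(records, out):
    """Write one compact JSON object per line (newline-delimited JSON)."""
    for record in records:
        out.write(json.dumps(record, separators=(",", ":")) + "\n")


if __name__ == "__main__":
    buffer = io.StringIO()
    write_ndjson(
        [
            {"rails": "MIT"},
            {"example": {"default": "MIT", "other": {"Apache": ["7.0.1"]}}},
        ],
        buffer,
    )
    print(object_url("bucket", "purl_type", "sequence", "chunk"))
    print(buffer.getvalue(), end="")
```

Writing one object per line lets a consumer process a chunk line by line without loading the whole file into memory.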
Filtering unknown licenses
As discussed in the research spike, unknown licenses are a large part of the dataset and do not need to be stored, so deduplicating ingestion should filter these licenses out.
This optimization doesn't apply to packages that have multiple sets of licenses.
For instance, it must export { default: "unknown", other: { "Apache": ["7.0.1", "7.0.2", "7.0.3"] }, highest_version: "7.0.0" } if we only know the license of 7.0.1 to 7.0.2 (Apache).
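The filtering rule above can be expressed as a small predicate. This is an illustrative sketch only: should_export is a hypothetical name, and treating a package with no versions at all as not exportable is an assumption.

```python
UNKNOWN = frozenset({"unknown"})


def should_export(version_licenses):
    """Decide whether a package's {version: [licenses]} data is worth exporting."""
    license_sets = {frozenset(licenses) for licenses in version_licenses.values()}
    if not license_sets:
        # Assumption: a package without any versions is skipped.
        return False
    # Skip only when "unknown" is the single license-set for every version.
    return license_sets != {UNKNOWN}


if __name__ == "__main__":
    print(should_export({"1.0.0": ["unknown"], "1.0.1": ["unknown"]}))
    print(should_export({"7.0.0": ["unknown"], "7.0.1": ["Apache"]}))
```

A package that mixes unknown with known license-sets is still exported in full, since dropping the unknown default would make the remaining versions ambiguous.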
Implementation plan
- Update license-exporter to export in the new v2 format.
- Add SQL queries to fetch new and updated packages using a CURSOR.
- Add these queries to the Database struct type.
- Add a CLI flag to switch between v1 and v2. For backward compatibility, v1 is the default.
- Add NDJSON writer/encoder, similar to the existing CSV writer.
- Refactor ObjectRotator to support both formats, and both writers/encoders.
- Refactor lock file handling to support both formats.
- Update the existing unit tests, and add new ones wherever this is needed.
- Run license-exporter from deployment project.
- Update CI configuration file.
- Rename existing scheduled pipelines.
  - Rename dev export to dev export v1.
  - Rename dev prod to dev prod v1.
- Create new scheduled pipelines.
  - Create new dev export v2 pipeline.
  - Create new prod export v2 pipeline.
Verification steps
Run the export and check the v2 directory of the export bucket.
- Export all.
- Export since a given date (explicitly passed).
- Export since last update (extracting from last export).
Check the lock mechanism.
- Lock file is created when there's none.
- Export is skipped when the lock file exists and it's not outdated.
- Lock file is removed and the export runs when the lock file exists and it's outdated.
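The three lock behaviours listed above can be sketched as follows. This is a simplified local-filesystem illustration, not the exporter's implementation: acquire_lock and max_age_seconds are hypothetical names, creating a fresh lock after removing an outdated one is an assumption, and the check-then-create sequence here is not atomic.

```python
import os
import time


def acquire_lock(path, max_age_seconds, now=None):
    """Return True if the export should run, False if it should be skipped."""
    now = time.time() if now is None else now
    if os.path.exists(path):
        age = now - os.path.getmtime(path)
        if age <= max_age_seconds:
            # Lock exists and is not outdated: skip this export.
            return False
        # Lock exists but is outdated: remove it and continue.
        os.remove(path)
    # No usable lock: create one (assumption: also done after a stale removal).
    with open(path, "w") as lock_file:
        lock_file.write(str(now))
    return True


if __name__ == "__main__":
    import tempfile

    lock_path = os.path.join(tempfile.mkdtemp(), "export.lock")
    print(acquire_lock(lock_path, max_age_seconds=3600))
    print(acquire_lock(lock_path, max_age_seconds=3600))
```

Passing now explicitly makes the outdated-lock case easy to exercise in a test without waiting for the lock to age.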
To be tested on the dev environment before deploying to prod.
Test updates
Update License Sanity test to support v2 URLs with expected minimum sizings.