Add package metadata ingestion for version format v2
What does this MR do and why?
This MR adds functionality to sync version_format v2 license data. This format has one package JSON object per line, with the license data "compressed" under a single attribute.
Two tables are touched in the process of ingestion: pm_packages and pm_licenses.
- License data is collected from the slice of objects passed to the ingestion service and upserted into pm_licenses.
- A map of license spdx_identifiers to their ids is built so it can be used to further compress pm_packages data.
- Package data is compressed by translating license names to their db ids and converting the JSON object under licenses to an array. This dataset is then upserted.
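The steps above can be sketched in plain Ruby. This is an illustration only, not the code in this MR: the field names (name, licenses), the shape of the licenses object (spdx identifier mapped to a list of versions), and the helpers upsert_licenses, compress_package, and ingest are all assumptions. A Hash stands in for the pm_licenses table.

```ruby
require 'json'

# In-memory stand-in for pm_licenses (spdx_identifier => id).
# Hypothetical: the real service upserts into the database instead.
def upsert_licenses(table, spdx_identifiers)
  spdx_identifiers.uniq.each do |spdx|
    table[spdx] ||= table.size + 1
  end
  table
end

# Translate license names to ids and turn the licenses object into an array.
# The [license_id, versions] pair layout is an assumed output shape.
def compress_package(package, license_ids)
  licenses = package.fetch('licenses').map do |spdx, versions|
    [license_ids.fetch(spdx), versions]
  end
  { 'name' => package.fetch('name'), 'licenses' => licenses }
end

# Parse one JSON object per line, upsert licenses first, then compress packages.
def ingest(lines, license_table = {})
  packages = lines.map { |line| JSON.parse(line) }
  spdx_identifiers = packages.flat_map { |pkg| pkg.fetch('licenses').keys }
  license_ids = upsert_licenses(license_table, spdx_identifiers)
  packages.map { |pkg| compress_package(pkg, license_ids) }
end
```

Upserting licenses before packages is what lets every package row reference a license id that already exists.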
How to set up and validate locally
Prepare dataset
Currently only this dataset is available: #409732 (comment 1386970564)
Because the data is not yet in the v2 url format, it needs to be downloaded, converted to have the correct path, and synced in offline mode (by writing the data to vendor/package_metadata_db/v2).
download.rb: download.rb
Run it via: ruby download.rb
Note: Move the download into the GitLab directory.
Run ingestion via rails runner
ingest.rb: ingest.rb
Run this via: bundle exec rails runner ingest.rb
Note: The PM_SYNC_INDEV environment flag controls whether sync runs in the development environment. It is false by default. Enable sync via export PM_SYNC_INDEV=true before running ingest.rb.
Progress
Sync progress can be seen in log/application_json.log, where the sync url is indicated.
Progress can also be observed via checkpoints: bundle exec rails runner 'puts PackageMetadata::Checkpoint.where(version_format: "v2").all.map(&:attributes)'
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #408901 (closed)