Add service for syncing package metadata with external license db
## Problem to solve

The external license database provides the instance with license data. This is stored in object storage (a public bucket or local file). The instance needs to import this data into its database.
## Proposal

Add a package metadata sync service to import external license db data.

Because of the amount of data stored in the data source, the service should keep track of the last synced position so that it doesn't have to re-import all the data in the bucket on each invocation.
### Using a last sync position

The service will open a connection to the correct object/file in the data source (using a dedicated connector) and stream the CSV rows.

Once the CSV stream is open, the service will iterate over the `[package_name, version, spdx_identifiers]` tuples in slices (e.g. 100 tuples at a time) and save these to the database using `PackageMetadata::ImportService`. The database has a unique constraint on the data in the tuple so that duplicate data is not added; the service does not have to take care of duplicates.

Once finished, the service will store the new last sync position.
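A minimal sketch of the streaming-and-slicing step, using Ruby's standard `CSV` library. The payload and names here are illustrative stand-ins; the real service would read from the object-storage connector rather than a string, and the proposal slices 100 tuples at a time (2 here so the slicing is visible with a small sample):

```ruby
require 'csv'

# Illustrative CSV payload standing in for an object-storage stream.
csv_data = <<~CSV
  pkg-a,1.0.0,MIT
  pkg-b,2.1.0,Apache-2.0
  pkg-c,0.9.1,BSD-3-Clause
CSV

# Iterate the [package_name, version, spdx_identifiers] tuples in
# slices; each slice would be handed to the import service.
batches = []
CSV.new(csv_data).each_slice(2) do |tuples|
  batches << tuples
end

batches.each { |batch| puts batch.inspect }
```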
## Implementation Plan
- Add `PackageMetadata::SyncService::Settings` under `ee/app/services`.
  - Provides data on the `base_uri`, supported data formats, and purl_types.
  - For `gcp`, the `base_uri` will be the bucket name.
  - For `offline`, the `base_uri` will likely be a path in a filesystem.
- Add `PackageMetadata::SyncService` under `ee/app/services`.
  - Iterates over all purl_types and formats supported by the instance.
  - Uses `PackageMetadata::Connector` to retrieve the connector for a service defined by `[base_uri, version_format, purl_type]` (2 connectors are currently defined: gcp and offline).
  - Retrieves the last sync position by finding the `PackageMetadata::SyncPosition` for the connection URI defined by `base_uri/version_format/purl_type`.
  - Invokes the connector's `#data_after` method to fetch the data after the last sync position, using `sequence_id` and `chunk_id`.
  - Invokes `PackageMetadata::ImportService` to store slices of 3-tuples of format `[package, version, license]` yielded by the connector.
  - If the `sequence_id` or `chunk_id` changed and the new data was stored successfully, a new sync position is stored.
- Add `PackageMetadata::Checkpoint` to store the last position in the data store.
  - Attributes: `version: smallint`, `purl_type: smallint`, `sequence_id: bigint`, `chunk_id: int`
  - Format: `<base_uri>/<format_version>/<purl_type>/<sequence_id>/<chunk_id>`
  - Example: `https://storage.cloud.google.com/v1/5/1668056400/1`
  - Description and use cases here: #373032 (closed)
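As an illustration of the checkpoint path format above, a hypothetical helper might compose the position URI like this. The `Checkpoint` struct and `position_uri` method are assumptions for the sketch, not the actual model API:

```ruby
# Hypothetical value object mirroring the Checkpoint attributes above.
Checkpoint = Struct.new(:sequence_id, :chunk_id) do
  # Compose <base_uri>/<format_version>/<purl_type>/<sequence_id>/<chunk_id>
  def position_uri(base_uri, format_version, purl_type)
    [base_uri, format_version, purl_type, sequence_id, chunk_id].join('/')
  end
end

checkpoint = Checkpoint.new(1668056400, 1)
puts checkpoint.position_uri('https://storage.cloud.google.com', 'v1', 5)
# => "https://storage.cloud.google.com/v1/5/1668056400/1"
```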
Pseudocode illustrating the `SyncService` points above:

```ruby
module PackageMetadata
  class SyncService
    def execute
      settings = PackageMetadata::SyncService::Settings
      base_uri = settings.base_uri
      data_format_version = settings.data_format_version
      purl_types = settings.supported_purl_types
      import_service = PackageMetadata::ImportService.new

      purl_types.each do |purl_type|
        # Find the last synced position for this format/purl_type pair.
        checkpoint = PackageMetadata::Checkpoint.for_format_and_purl_type(data_format_version, purl_type)

        connector_for(base_uri, data_format_version, purl_type)
          .data_after(checkpoint)
          .each do |file|
            # Import the file's tuples in slices of 100.
            file.each_slice(100) do |data_objects|
              import_service.execute(data_objects)
            end
            # Persist the new sync position once the file is fully imported.
            checkpoint.update(sequence_id: file.sequence, chunk_id: file.chunk)
          end
      end
    end
  end
end
```
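The connector's `#data_after` contract can be sketched as a comparison on `[sequence_id, chunk_id]` pairs. The `DataFile` struct, `Checkpoint` struct, and file list below are illustrative stand-ins for whatever the gcp/offline connectors actually enumerate:

```ruby
# Illustrative stand-ins for a connector-enumerated file and a checkpoint.
DataFile = Struct.new(:sequence, :chunk)
Checkpoint = Struct.new(:sequence_id, :chunk_id)

# Return only files strictly after the checkpoint, relying on
# lexicographic ordering of [sequence, chunk] pairs.
def data_after(files, checkpoint)
  files.select do |file|
    ([file.sequence, file.chunk] <=> [checkpoint.sequence_id, checkpoint.chunk_id]).positive?
  end
end

files = [DataFile.new(100, 1), DataFile.new(100, 2), DataFile.new(101, 1)]
checkpoint = Checkpoint.new(100, 1)
data_after(files, checkpoint).each { |f| puts "#{f.sequence}/#{f.chunk}" }
```

With the sample data, only `100/2` and `101/1` are yielded, since `100/1` is the checkpoint itself.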
## Idempotency and sync position storage

The database schema is structured to skip duplicates. So if an error occurs and the most recent sync position is not saved, restarting at the previous sync position will not cause data corruption, as duplicate rows are ignored.
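The idempotency argument can be illustrated with a toy import that mimics the unique constraint with a set. This is only a model of the behavior; the real service relies on the database's unique index on the tuple (e.g. an insert that ignores conflicts):

```ruby
require 'set'

# Toy table with a "unique constraint" on the [package, version, license] tuple.
rows = Set.new

# Importing a tuple that already exists is a no-op, as with the DB constraint.
import = ->(tuples) { tuples.each { |tuple| rows.add(tuple) } }

batch = [['pkg-a', '1.0.0', 'MIT'], ['pkg-b', '2.1.0', 'Apache-2.0']]

import.call(batch) # first sync
import.call(batch) # restart from the previous position re-imports the batch

puts rows.size
# => 2
```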