Support advisories and affected packages data sync protocol
Why are we doing this work
A new version format is needed for advisory ingestion. The monolith sync service needs to be able to use this format.
Background
The external license database exports a set of deltas representing its internal dataset over time. A delta is written to a GCP bucket as a set of files at a particular timestamp, and that timestamp is the identifier for the delta dataset. The data for a particular dataset is written as a set of chunks, each with an upper limit on its size.
As an example:
If data coming into the external license db looks like the following:
- data at t1
  - rails,[6.1,6.2],MIT
- data at t2
  - rails,[6.3],MIT
Then the exporter writes this to the GCP bucket:
- at t1
  - v1/rubygem/t1/file.csv
  - contents of csv are
    - rails,6.1,MIT
    - rails,6.2,MIT
- at t2
  - v1/rubygem/t2/file.csv
  - contents of csv are
    - rails,6.3,MIT
This format allows both the producer and consumers to be stateless (aside from storing the last synced timestamp).
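The example above can be expressed as a small sketch. The helper name export_delta and the in-memory hash result are assumptions made for illustration and are not the exporter's actual code; splitting into size-limited chunks is left out.

```ruby
# Hypothetical helper, not the exporter's real implementation: expands one
# delta (records of name, versions, license) into the CSV rows written for it.
def export_delta(purl_type, timestamp, records)
  rows = records.flat_map do |name, versions, license|
    # One CSV row per package version.
    versions.map { |version| [name, version, license].join(',') }
  end
  # Chunking by size is omitted; all rows go into a single file.csv.
  { "v1/#{purl_type}/#{timestamp}/file.csv" => rows }
end

export_delta('rubygem', 't1', [['rails', ['6.1', '6.2'], 'MIT']])
export_delta('rubygem', 't2', [['rails', ['6.3'], 'MIT']])
```

Because each call depends only on the delta and its timestamp, neither side needs to remember anything beyond the last timestamp it handled.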
Monolith Sync
The monolith uses checkpoints to store the last synced position. If a checkpoint exists (sequence and chunk match), only the files after this checkpoint are fetched.
The connectors instantiate a CsvFile, a simple enumerable container that offers a lazy enum interface and parses the csv data into a DataObject.
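A minimal sketch of such a container, assuming three-column rows as in the example above. The CsvFile and DataObject defined here are simplified stand-ins, not the classes in the monolith, and skipping blank or malformed rows is an assumed policy.

```ruby
require 'csv'
require 'stringio'

# Simplified stand-in for the real DataObject; the field names are assumed
# from the name,version,license rows shown earlier.
DataObject = Struct.new(:name, :version, :license)

class CsvFile
  include Enumerable

  def initialize(io)
    @io = io
  end

  # Parses one line at a time, so a chunk is never loaded into memory whole.
  def each
    return enum_for(:each) unless block_given?

    @io.rewind # allow the file to be enumerated more than once
    @io.each_line do |line|
      row = CSV.parse_line(line)
      next unless row && row.size == 3 # assumed policy: skip blank or malformed rows

      yield DataObject.new(*row)
    end
  end
end

file = CsvFile.new(StringIO.new("rails,6.1,MIT\nrails,6.2,MIT\n"))
file.lazy.map(&:version).first(1)
```

Including Enumerable gives callers lazy, each_slice, and similar methods without extra code, which suits batch-wise ingestion.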
After ingestion has fully finished, the new checkpoint is saved: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/package_metadata/sync_service.rb#L50
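The checkpoint flow described in this section can be sketched as follows. Checkpoint, FileRef, files_after, and sync are hypothetical names, and two behaviours are assumptions: files are ordered by (sequence, chunk), and everything is fetched when no file matches the checkpoint.

```ruby
# Hypothetical value objects; the real checkpoint is a database record.
Checkpoint = Struct.new(:sequence, :chunk)
FileRef = Struct.new(:sequence, :chunk)

# Returns the files that come after the checkpointed file. If no file matches
# the checkpoint's sequence and chunk, all files are returned (assumption).
def files_after(files, checkpoint)
  sorted = files.sort_by { |f| [f.sequence, f.chunk] }
  index = checkpoint && sorted.index do |f|
    f.sequence == checkpoint.sequence && f.chunk == checkpoint.chunk
  end
  index ? sorted.drop(index + 1) : sorted
end

# Yields each pending file for ingestion and returns the checkpoint to save.
def sync(files, checkpoint)
  last = nil
  files_after(files, checkpoint).each do |file|
    yield file
    last = file
  end
  # The new position is only derived once every file has been ingested.
  last ? Checkpoint.new(last.sequence, last.chunk) : checkpoint
end
```

If the block raises part-way through, sync never returns, so the caller keeps the old checkpoint and the next run starts from the same position.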
Changes
The identifier for this new format is v2 and is part of the path locating file chunks. The following are changed:
- URL
- Storage format
- Object format
1. URL changes
data_type is added to the url, going from v1/<purl_type>/<timestamp>/<chunk>.csv to v2/<purl_type>/[advisories|licenses]/<timestamp>/<chunk>.ndjson.
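As an illustration of the two layouts, here is a small path builder. The method name chunk_path and its keyword arguments are hypothetical and do not appear in the connectors.

```ruby
# Allowed values for data_type in the v2 layout.
DATA_TYPES = %w[advisories licenses].freeze

# Hypothetical helper that builds the object path for one chunk.
def chunk_path(version_format:, purl_type:, timestamp:, chunk:, data_type: nil)
  case version_format
  when 'v1'
    "v1/#{purl_type}/#{timestamp}/#{chunk}.csv"
  when 'v2'
    # v2 inserts the data_type segment and switches the suffix to .ndjson.
    raise ArgumentError, "unknown data_type: #{data_type.inspect}" unless DATA_TYPES.include?(data_type)

    "v2/#{purl_type}/#{data_type}/#{timestamp}/#{chunk}.ndjson"
  else
    raise ArgumentError, "unknown version_format: #{version_format.inspect}"
  end
end

chunk_path(version_format: 'v2', purl_type: 'deb', data_type: 'advisories',
           timestamp: 't1', chunk: '0')
```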
2. Storage format
The storage format has been changed from csv to ndjson.
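With ndjson, each line of a chunk is a complete json document, so a consumer can keep reading line by line as it did with csv. A minimal sketch, using a hypothetical each_ndjson_object helper and assuming blank lines may be skipped:

```ruby
require 'json'
require 'stringio'

# Hypothetical helper: yields one parsed object per non-blank line.
# A malformed line raises JSON::ParserError.
def each_ndjson_object(io)
  return enum_for(:each_ndjson_object, io) unless block_given?

  io.each_line do |line|
    line = line.strip
    next if line.empty? # assumed policy: tolerate blank lines

    yield JSON.parse(line)
  end
end

io = StringIO.new(%({"name":"libxml2"}\n{"name":"rails"}\n))
each_ndjson_object(io).map { |object| object['name'] }
```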
3. Object format
The object is a json with the following fields:
- id - unique identifier for the advisory
- database - indicating which database this advisory came from
- advisory - stores contents of the advisory data
- packages - stores the packages affected by this advisory and ranges affected
The fields for advisory and packages are specified in PackageMetadata::Advisory and PackageMetadata::AffectedPackage.
Example:
{
  "advisory": {
    "id": "CVE-2022-40303",
    "database": "trivy-db",
    "title": "",
    "description": "...",
    "cvss_v3": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H",
    ...
  },
  "packages": [
    {
      "name": "libxml2",
      "purl_type": "deb",
      "dist_version": "10",
      "affected_range": "<2.9.4+dfsg1-7+deb10u5",
      "severity": "..."
    },
    {
      // ...
    }
  ]
}
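One way a single line of this shape could be turned into objects is sketched below. The Advisory and AffectedPackage structs are plain stand-ins for PackageMetadata::Advisory and PackageMetadata::AffectedPackage and carry only the fields visible in the example. The sketch follows the example in reading id and database from inside advisory, and it ignores unknown keys, which is an assumption.

```ruby
require 'json'

# Stand-in structs limited to the fields shown in the example; the elided
# fields are deliberately not guessed.
Advisory = Struct.new(:id, :database, :title, :description, :cvss_v3,
                      keyword_init: true)
AffectedPackage = Struct.new(:name, :purl_type, :dist_version,
                             :affected_range, :severity, keyword_init: true)

# Builds a struct from a string-keyed hash, dropping keys it does not know.
def build(struct_class, attrs)
  known = struct_class.members.map(&:to_s)
  struct_class.new(**attrs.slice(*known).transform_keys(&:to_sym))
end

# Hypothetical parser for one ndjson line.
def parse_advisory_line(line)
  object = JSON.parse(line)
  advisory = build(Advisory, object.fetch('advisory'))
  packages = object.fetch('packages', []).map { |pkg| build(AffectedPackage, pkg) }
  [advisory, packages]
end
```

Using fetch for advisory makes a line without that key fail loudly with a KeyError instead of producing an empty advisory.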
Relevant links
- version format discussion #370780 (closed)
- research spike #394723 (closed)
Non-functional requirements
- Documentation: n/a
- Feature flag: n/a
- Performance: n/a
- Testing: n/a
Implementation plan
- add sync config for advisories
  - add advisories specific data (bucket, offline location, etc.)
- add data objects
- update data object fabrication
Below is the old implementation plan, which was superseded by the plan above after most of the needed functionality was added in Refactor interface between sync protocol and da... (!120795 - merged).
Old implementation plan
Update checkpoint
- create migration to add version_format and data_type to checkpoints
Update connectors (work ongoing in Refactor interface between sync protocol and da... (!120795 - merged))
- extract common CsvFile functionality out of offline and gcp connectors and change this class to DataFile
- update both connectors to accept data_type and select the correct url/path based on it
- update connector iterators to instantiate a DataFile with data_type (e.g. gcp)
- update DataFile to accept a data_type parameter so as to determine file suffix (e.g. for gcp)
  - offline archive_path
  - gcp file_prefix
Update data parsing
- rename PackageMetadata::DataObject to PackageMetadata::LicenseDataObject
- add new object PackageMetadata::AdvisoryDataObject with fields to populate PackageMetadata::Advisory and PackageMetadata::AffectedPackage (similar to PackageMetadata::LicenseDataObject)
- rename .from_csv to .parse https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/package_metadata/data_object.rb#L12
- update .parse to support json as well as csv https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/package_metadata/data_object.rb#L12 based on data_type supplied by connector