Skip advisory scans during initial sync
Problem to solve
When an instance syncs with the Package Metadata DB (PM DB) for the very first time, it triggers vulnerability scans for a very large number of advisories.
- This might not scale, and it possibly impacts the responsiveness of the instance by enqueuing too many jobs.
- It doesn't seem needed. Users should already have regular scans on their projects before continuous scans are enabled.
The same problem applies when all the advisories of the PM DB are exported again. This might be needed to fix export data.
To a lesser extent, a similar problem might occur when an instance catches up with the PM DB after not being able to sync for a significant amount of time.
This problem was raised in #371063 (comment 1494215853).
Overview of advisory scan fan-out. On initial sync, N would be the total number of advisories in the package metadata database.
Links
- Implementation issue for advisory scan worker: Add worker to scan newly ingested advisories (#371063 - closed)
- Continuous scan on advisory ingestion: Trigger vulnerability scans on advisory changes (&10025 - closed).
- Regular scans on SBOM ingestion: SBOM-based dependency scanning findings for def... (&8026 - closed).
Proposal
TBD. See current proposals:
A. Disable the feature flag (FF) until the initial sync is completed
B. Tag initial export in order to skip scans
Pros
- It's explicit.
- It's robust, and supports a scenario where an instance can't sync for a while.
- It supports all sorts of use-cases, like re-exporting for technical reasons without triggering scans.
Cons
- We have to update the export format, the exporter, and the syncer.
C. Skip scans based on updated_at timestamp
Pros
- It supports re-exporting everything.
Cons
- A new field must be added to the exported JSON object.
- It doesn't support re-ingesting everything.
- Instances that can't sync for a while miss scans.
C2. Skip scans based on published date
Same as C (skip scans based on ingestion date), but reuse the existing published_date field.
Pros
- It supports re-exporting everything.
- It supports re-ingesting everything.
Cons
- Instances that can't sync for a while miss scans.
- We miss old advisories recently added to the advisory database.
D. Skip scans when chunk is too large
Yet another option is to skip scans when the ingested chunk contains too many advisories.
TODO: evaluate pros and cons
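A rough sketch of what option D could look like. The threshold value and method names below are assumptions for illustration, not the actual GitLab implementation:

```ruby
# Hypothetical sketch of option D: skip publishing scan events when the
# ingested chunk of advisories is too large. SCAN_CHUNK_THRESHOLD is an
# assumed cutoff that would need tuning against real sync sizes.
SCAN_CHUNK_THRESHOLD = 1_000

def publish_scan_events(advisories)
  if advisories.size > SCAN_CHUNK_THRESHOLD
    # Chunk is too large (e.g. initial sync or bulk re-export):
    # skip scans to avoid enqueuing a flood of jobs.
    return []
  end

  # Otherwise, emit one scan event per advisory.
  advisories.map { |advisory| { event: :scan, advisory_id: advisory[:id] } }
end
```

One obvious con this surfaces: the threshold is a magic number, and a legitimate burst of new advisories just over the cutoff would silently skip scans.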
E. Use a scan queue
Create a queue of advisories to be scanned, with a fixed number of advisories scanned at any one time.
Pros:
- Comprehensive - can scan the entire backlog without worrying about missing something.
- Monolith abstractions already exist for this: the limited-capacity worker and state_machines-activerecord.
- Doesn't need elaborate schemes for denoting initial scans or where scans should start.
- No changes needed on the external service or the protocol side.
Cons:
- Will scan all ingested advisories - so the resources needed to scan all of the ingested data will still be used, there just won't be a thundering herd problem.
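To make the bounded-drain behaviour of option E concrete, here is a minimal standalone sketch. In GitLab this would be backed by a database table and a limited-capacity worker; the class and method names below are assumptions for illustration only:

```ruby
# Hypothetical scan queue: advisories are enqueued on ingestion, and a
# worker drains at most max_in_flight of them per tick, so the backlog
# is scanned fully but never as a thundering herd.
class AdvisoryScanQueue
  def initialize(max_in_flight:)
    @max_in_flight = max_in_flight
    @pending = []
  end

  def enqueue(advisory_id)
    @pending << advisory_id
  end

  # Take up to max_in_flight advisories for scanning; the rest stay queued.
  def dequeue_batch
    @pending.shift(@max_in_flight)
  end

  def remaining
    @pending.size
  end
end
```

The key property is that the full backlog is eventually processed (no advisory is skipped), which is exactly why the resource-usage con above still applies.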
Implementation Plan
Proposal C was chosen for its expediency and the explicit nature of the solution.
- Ingestion
  - Change the ingestion service to only publish scan events for advisories matching `now - pubdate < 1.day`:
    - Update `AdvisoryIngestionTask` to add `pubdate` to the advisory map (https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/package_metadata/ingestion/advisory/advisory_ingestion_task.rb#L20).
    - Update `Advisory::IngestionService` to only publish events for advisories matching the criterion `now - pubdate < 1.day`.
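The filter described in the plan can be sketched as follows. This is a standalone illustration with assumed data shapes (the real service operates on PackageMetadata models, not hashes):

```ruby
require 'time'

# One day in seconds; stands in for Rails' `1.day` outside of ActiveSupport.
ONE_DAY = 24 * 60 * 60

# An advisory is publishable only if it was published within the last day,
# i.e. now - pubdate < 1.day.
def publishable?(advisory, now: Time.now)
  now - advisory[:pubdate] < ONE_DAY
end

# Emit scan events only for recently published advisories; on initial sync
# or bulk re-export, old advisories are filtered out and no scans fan out.
def scan_events(advisories, now: Time.now)
  advisories.select { |a| publishable?(a, now: now) }
            .map { |a| { event: :scan, advisory_id: a[:id] } }
end
```

Note the con carried over from proposal C2: an old advisory that is newly added to the advisory database would also be filtered out by this check.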