Skip advisory scans during initial sync
Problem to solve
When an instance syncs with the Package Metadata DB (PM DB) for the very first time, it triggers vulnerability scans for a very large number of advisories.
- This might not scale, and it possibly impacts the responsiveness of the instance by enqueuing too many jobs.
- It doesn't seem needed. Users should already have regular scans on their projects before continuous scans are enabled.
The same problem applies when all the advisories of the PM DB are exported again. This might be needed to fix export data.
To a lesser extent, a similar problem might occur when an instance catches up with the PM DB after not being able to sync for a significant amount of time.
This problem was raised in #371063 (comment 1494215853).
Overview of advisory scan fan-out. On initial sync, N would be the total number of advisories in the package metadata database.
Links
- Implementation issue for advisory scan worker: Add worker to scan newly ingested advisories (#371063 - closed)
- Continuous scan on advisory ingestion: Trigger vulnerability scans on advisory changes (&10025 - closed).
- Regular scans on SBOM ingestion: SBOM-based dependency scanning findings for def... (&8026 - closed).
Proposal
TBD. See current proposals:
A. Disable the feature flag (FF) until the initial sync is completed
B. Tag initial export in order to skip scans
Pros
- It's explicit.
- It's robust, and supports a scenario where an instance can't sync for a while.
- It supports all sorts of use-cases, like re-exporting for technical reasons without triggering scans.
Cons
- We have to update the export format, the exporter, and the syncer.
C. Skip scans based on updated_at timestamp
Pros
- It supports re-exporting everything.
Cons
- A new field must be added to the exported JSON object.
- It doesn't support re-ingesting everything.
- Instances that can't sync for a while miss scans.
C2. Skip scans based on published date
Same as C (skip scans based on ingestion date), but reuse the existing published_date field.
Pros
- It supports re-exporting everything.
- It supports re-ingesting everything.
Cons
- Instances that can't sync for a while miss scans.
- We miss old advisories recently added to the advisory database.
D. Skip scans when chunk is too large
Yet another option is to skip scans when the ingested chunk contains too many advisories.
TODO: evaluate pros and cons
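A rough sketch of what option D could look like. The threshold value and method names below are assumptions for illustration, not the actual GitLab implementation:

```ruby
# Hypothetical sketch of option D: skip publishing scan events when the
# ingested chunk of advisories is too large. SCAN_CHUNK_THRESHOLD is an
# assumed cutoff that would need tuning against real sync sizes.
SCAN_CHUNK_THRESHOLD = 1_000

def publish_scan_events(advisories)
  if advisories.size > SCAN_CHUNK_THRESHOLD
    # Chunk is too large (e.g. initial sync or bulk re-export):
    # skip scans to avoid enqueuing a flood of jobs.
    return []
  end

  # Otherwise, emit one scan event per advisory.
  advisories.map { |advisory| { event: :scan, advisory_id: advisory[:id] } }
end
```

One obvious con this surfaces: the threshold is a magic number, and a legitimate burst of new advisories just over the cutoff would silently skip scans.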
E. Use a scan queue
Create a queue of advisories to be scanned, with a fixed number of advisories scanned at any one time.
Pros:
- Comprehensive - can scan the entire backlog without worrying about missing something.
- Monolith abstractions already exist for this: the limited-capacity worker and state_machines-activerecord.
- Doesn't need elaborate schemes for denoting initial scans or where scans should start.
- No changes needed on the external service or the protocol side.
Cons:
- Will scan all ingested advisories - so the resources needed to scan all of the ingested data will still be used, there just won't be a thundering herd problem.
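To make the bounded-drain behaviour of option E concrete, here is a minimal standalone sketch. In GitLab this would be backed by a database table and a limited-capacity worker; the class and method names below are assumptions for illustration only:

```ruby
# Hypothetical scan queue: advisories are enqueued on ingestion, and a
# worker drains at most max_in_flight of them per tick, so the backlog
# is scanned fully but never as a thundering herd.
class AdvisoryScanQueue
  def initialize(max_in_flight:)
    @max_in_flight = max_in_flight
    @pending = []
  end

  def enqueue(advisory_id)
    @pending << advisory_id
  end

  # Take up to max_in_flight advisories for scanning; the rest stay queued.
  def dequeue_batch
    @pending.shift(@max_in_flight)
  end

  def remaining
    @pending.size
  end
end
```

The key property is that the full backlog is eventually processed (no advisory is skipped), which is exactly why the resource-usage con above still applies.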
Implementation Plan
Proposal C was chosen for its expediency and the explicit nature of the solution.
- Ingestion
  - Change the ingestion service to only publish scan events for advisories matching `now - pubdate < 1.day`:
    - Update `AdvisoryIngestionTask` to add `pubdate` to the advisory map (https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/package_metadata/ingestion/advisory/advisory_ingestion_task.rb#L20).
    - Update `Advisory::IngestionService` to only publish events for advisories matching the criterion `now - pubdate < 1.day`.
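The filter described in the plan can be sketched as follows. This is a standalone illustration with assumed data shapes (the real service operates on PackageMetadata models, not hashes):

```ruby
require 'time'

# One day in seconds; stands in for Rails' `1.day` outside of ActiveSupport.
ONE_DAY = 24 * 60 * 60

# An advisory is publishable only if it was published within the last day,
# i.e. now - pubdate < 1.day.
def publishable?(advisory, now: Time.now)
  now - advisory[:pubdate] < ONE_DAY
end

# Emit scan events only for recently published advisories; on initial sync
# or bulk re-export, old advisories are filtered out and no scans fan out.
def scan_events(advisories, now: Time.now)
  advisories.select { |a| publishable?(a, now: now) }
            .map { |a| { event: :scan, advisory_id: a[:id] } }
end
```

Note the con carried over from proposal C2: an old advisory that is newly added to the advisory database would also be filtered out by this check.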