Package metadata license sync starting from first sequence and chunk rather than from last sync position
Summary
Package metadata sync for licenses is starting from the first sequence instead of resuming from the last synced position.
PackageMetadata::Checkpoint stores the location of the last delta file synced. After the release of Ingest version_format v2 for licenses (#408901 - closed), the checkpoints were created anew.
Example for maven. License data is stored in the GCP bucket as a series of deltas:
- v1/maven/1677169710/00000000.csv
- v1/maven/1677169710/00000001.csv
- ...
- v1/maven/1678456940/0000000.csv
- ...
- v1/maven/1681480938/0000000.csv
When sync has fully iterated through all the files above, the checkpoint stores sequence 1681480938 and chunk 0 so as to point at the last position.
After deployment of !120027 (merged), sync started from scratch (e.g. v1/maven/1677169710/00000000.csv).
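The checkpoint positioning described above can be sketched in plain Ruby. This is a minimal illustration, not the actual sync code: Position, position_for and files_to_sync are hypothetical names, and only the path layout (v1/<purl_type>/<sequence>/<chunk>.csv) is taken from the example.

```ruby
# Hypothetical model of a checkpoint position: a (sequence, chunk) pair.
Position = Struct.new(:sequence, :chunk) do
  include Comparable

  # Order by sequence first, then by chunk within the sequence.
  def <=>(other)
    [sequence, chunk] <=> [other.sequence, other.chunk]
  end
end

# Parse "v1/<purl_type>/<sequence>/<chunk>.csv" into a Position.
def position_for(path)
  _version, _purl_type, sequence, file = path.split('/')
  Position.new(Integer(sequence, 10), Integer(File.basename(file, '.csv'), 10))
end

# With no checkpoint, everything is synced; otherwise only files
# strictly after the stored position are returned.
def files_to_sync(paths, checkpoint)
  paths.select { |path| checkpoint.nil? || position_for(path) > checkpoint }
end

# Paths copied from the maven example above.
PATHS = [
  'v1/maven/1677169710/00000000.csv',
  'v1/maven/1677169710/00000001.csv',
  'v1/maven/1678456940/0000000.csv',
  'v1/maven/1681480938/0000000.csv'
].freeze

# Expected behavior: an up-to-date checkpoint leaves nothing to re-sync.
UP_TO_DATE = files_to_sync(PATHS, Position.new(1681480938, 0))

# Bug behavior: a missing checkpoint restarts from the first file.
FROM_SCRATCH = files_to_sync(PATHS, nil)
```

With the checkpoint at (1681480938, 0), UP_TO_DATE is empty. When the checkpoint is not found, FROM_SCRATCH contains every delta again, beginning with v1/maven/1677169710/00000000.csv.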
Steps to reproduce
- Access rails console.
- Run PackageMetadata::Checkpoint.all.
- 2 types of checkpoints now exist: advisories and licenses. The advisories checkpoints are up-to-date while licenses are syncing with sequences in the past.
Example Project
What is the current bug behavior?
- Checkpoints for licenses were created on 2023-09-08 and re-synced the full delta dataset.
- Checkpoints for advisories also exist even though this support hasn't yet been added:
  - These checkpoints show created_at times 3-4 months ago, which indicates these are the ones that should point at licenses.
What is the expected correct behavior?
Only one set of licenses checkpoints should exist, and sync should start from the correct last sequence/chunk position.
Relevant logs and/or screenshots
From staging:
Output of checks
This bug happens on GitLab.com
/label reproduced on GitLab.com
Cause
This behaviour occurs because, at the time of adding data_type, the incorrect default was set on existing checkpoints. The checkpoints were inadvertently set to advisories in !118939 (diffs); data_type in the migration should have been 2 to represent licenses.
The incorrect data_type had no effect until !120027 (merged) was deployed. This merge request changed the unique identifier from purl_type to (data_type, version_format, purl_type). As a result, the existing checkpoint was not found the next time checkpoints were looked up by the new unique identifier.
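The lookup miss can be reproduced with a small plain-Ruby model. This is a sketch under stated assumptions, not the application code: the Checkpoint struct and both finder methods are hypothetical, advisories = 1 is inferred from the proposed fix, and the version_format value 'v1' is purely illustrative and kept identical on both sides to isolate the effect of data_type.

```ruby
# Assumption: advisories = 1 (inferred); licenses = 2 (stated in the issue).
DATA_TYPES = { advisories: 1, licenses: 2 }.freeze

# Hypothetical stand-in for a checkpoint row.
Checkpoint = Struct.new(:data_type, :version_format, :purl_type,
                        :sequence, :chunk, keyword_init: true)

# Existing row after the migration: it tracks license deltas but was
# defaulted to the advisories data_type.
CHECKPOINTS = [
  Checkpoint.new(data_type: DATA_TYPES[:advisories], version_format: 'v1',
                 purl_type: 'maven', sequence: 1681480938, chunk: 0)
].freeze

# Old unique identifier: purl_type only.
def find_by_purl_type(checkpoints, purl_type)
  checkpoints.find { |c| c.purl_type == purl_type }
end

# New unique identifier: (data_type, version_format, purl_type).
def find_by_composite_key(checkpoints, data_type:, version_format:, purl_type:)
  wanted = [data_type, version_format, purl_type]
  checkpoints.find { |c| [c.data_type, c.version_format, c.purl_type] == wanted }
end

# The old lookup still finds the row, so the wrong data_type is harmless.
OLD_LOOKUP = find_by_purl_type(CHECKPOINTS, 'maven')

# The new lookup asks for licenses and misses the mislabelled row,
# so the caller would create a fresh checkpoint and sync from scratch.
NEW_LOOKUP = find_by_composite_key(CHECKPOINTS,
                                   data_type: DATA_TYPES[:licenses],
                                   version_format: 'v1',
                                   purl_type: 'maven')
```

Here OLD_LOOKUP returns the existing checkpoint while NEW_LOOKUP is nil, which matches the observed behavior: the mislabelled row is left behind as an advisories checkpoint and a new licenses checkpoint starts at the first sequence.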
Possible fixes
- Switch data_types to change licenses to 1: https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/models/concerns/enums/package_metadata.rb#L25
- Remove checkpoints of type advisories.