Geo: Introduce "verification disabled" state
What does this MR do and why?
Problem 1: Currently, if a resource cannot be verified due to being in object storage, then the resource is marked "verification succeeded" in order to stop verification from happening, and to avoid a permanent loop of "verification failed, resync, repeat". But "verification succeeded" is an incorrect representation in the data and to the sysadmin.
Problem 2: Currently, if a resource cannot be verified because the primary has not checksummed it yet, then the resource falls into a loop of "verification failed, resync, repeat" until the primary checksums it. This is wasteful, though at least the problem is transient.
This MR introduces a "verification disabled" state for these cases. No wasteful loops, and no inaccurate representation of what's verified.
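As a rough illustration of the two cases above, the decision a secondary makes could be sketched as follows. This is not the code in this MR: the method name and keyword arguments are hypothetical, and the integer values for `pending` and `started` are assumptions (only 2, 3, and 4 appear in the validation steps in this description).

```ruby
# Hypothetical sketch of the verification states; not the actual GitLab code.
VERIFICATION_STATES = {
  pending: 0,   # assumption: value not stated in this MR description
  started: 1,   # assumption: value not stated in this MR description
  succeeded: 2,
  failed: 3,
  disabled: 4
}.freeze

# Decide which state a secondary records after a verification attempt.
# All parameter names are illustrative.
def next_verification_state(in_object_storage:, primary_checksum:, local_checksum:)
  # Problem 1: object-stored resources cannot be verified, so park them
  # instead of marking them "succeeded".
  return :disabled if in_object_storage

  # Problem 2: the primary has not checksummed the resource yet, so there is
  # nothing to compare against. Park it instead of failing and resyncing.
  return :disabled if primary_checksum.nil?

  primary_checksum == local_checksum ? :succeeded : :failed
end
```

With this shape, only a real checksum mismatch produces `:failed`, so only a real mismatch can trigger a resync.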
Maintainer: Please don't squash the commits.
Part of #299819 (closed)
Screenshots
See Javiera's screenshots during testing: !87034 (comment 955254106)
How to set up and validate locally
How to validate fix for object stored blobs
First, reproduce the problem on master branch:
- With master branch, and GDK + Geo
- Configure object storage
- Visit `/admin/geo/sites`
- Find the secondary site on the page
- Click the Edit button (pencil icon)
- Check "Allow this secondary site to replicate content on Object Storage" and click Save
- Upload a file in an issue
- On the secondary Rails console, wait until `Upload.last.replicator.registry.synced?` returns `true`
- On the secondary Rails console, output the verification state: `Upload.last.replicator.registry.verification_state`
- 🚫 Notice that it is `2`, for "verification succeeded", which is inaccurate.
Try to repro the problem again on MR branch:
- On the secondary: `git checkout mk/scope-verification-properly; gdk restart rails`
- If `ps aux | grep sidekiq-cluster | grep -v "grep"` returns more than 1 line per running GDK, then kill the processes with `pkill -lf 'sidekiq-cluster'` and wait for GDK to start new ones
- Upload a file in an issue
- On the secondary Rails console, wait until `Upload.last.replicator.registry.synced?` returns `true`
- On the secondary Rails console, output the verification state: `Upload.last.replicator.registry.verification_state`
- ✅ Notice that it is `4`, for "verification disabled", which is accurate.
- On the secondary Rails console, run `Geo::MetricsUpdateWorker.new.perform` to immediately update the status in the UI
- ✅ Notice no failure in the Upload verification progress bar, and notice that the verification progress bar total is 1 less than the replication progress bar
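The last observation (verification total is 1 less than replication total) is what you would expect if resources in the "verification disabled" state are left out of the verification count. A minimal sketch of that arithmetic, assuming this is how the totals relate; the method and constant names are hypothetical and this is not how the status code in this MR is written:

```ruby
# State values taken from the validation steps in this description.
VERIFICATION_FAILED = 3
VERIFICATION_DISABLED = 4

# Hypothetical helper: given the verification_state of every synced registry,
# compute the totals the two progress bars would show.
def progress_totals(verification_states)
  # Disabled resources are not candidates for verification at all.
  verifiable = verification_states.reject { |state| state == VERIFICATION_DISABLED }

  {
    replication_total: verification_states.size,
    verification_total: verifiable.size,
    verification_failed: verifiable.count(VERIFICATION_FAILED)
  }
end

# One verified upload plus one object-stored (disabled) upload.
totals = progress_totals([2, 4])
```

Here `totals[:verification_total]` is one less than `totals[:replication_total]`, and `totals[:verification_failed]` is zero.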
How to validate fix for "not yet checksummed problem"
First, reproduce the problem on master branch.
On the primary:
- Stop Sidekiq so verification doesn't occur automatically: `gdk stop rails-background-jobs`
- Open Rails console: `bin/rails console`
- Clear primary checksum for an upload: `u = Upload.first; u.verification_checksum = nil; u.verification_pending!`
On the secondary:
- Stop Sidekiq so verification doesn't occur automatically: `gdk stop rails-background-jobs`
- Kill any lingering Sidekiq processes if needed: `pkill -lf 'sidekiq-cluster'`
- Open Rails console: `bin/rails console`
- Trigger verification for that upload, then output verification state: `u = Upload.first; u.replicator.verify; u.replicator.registry.verification_state`
- 🚫 Notice that `verification_state` is `3`, meaning "verification failed".
- Refresh this site's status data: `Geo::MetricsUpdateWorker.new.perform`
- In browser, visit `/admin/geo/sites`
- 🚫 Notice 1 failure in the Upload replication progress bar
- 🚫 This represents a transient verification failure when the resource is not yet checksummed on the primary. If we run Sidekiq, this will cause a verification => sync loop until the resource is checksummed on the primary, at which point verification will succeed on the secondary.
Now we can validate the fix on the MR branch.
(Note: this continues from the steps above.) On the secondary:
- `git checkout mk/scope-verification-properly; gdk restart rails-web`
- Exit the already open Rails console (apparently `reload!` isn't enough): `exit`
- Open Rails console: `bin/rails console`
- Resync the upload (because it's currently marked failed sync), verify it, then output verification state: `u = Upload.first; u.replicator.send(:download); u.replicator.verify; u.replicator.registry.verification_state`
- ✅ Notice that `verification_state` is `4`, meaning "verification disabled".
- Refresh this site's status data: `Geo::MetricsUpdateWorker.new.perform`
- In browser, visit `/admin/geo/sites`
- ✅ Notice no failure in the Upload verification progress bar, and notice that the verification progress bar total is 1 less than the replication progress bar
- This shows that there will be no verification => sync loop. When the resource becomes checksummed on the primary, a `checksum_succeeded` event will be created, which causes all secondaries to immediately reverify the resource. That is the exact right time to attempt verification.
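To make the difference in wasted work concrete, here is a toy simulation of the two behaviors. It is only a sketch: the method name, its parameters, and the idea of counting background job "cycles" are hypothetical, and it does not call any real Geo code.

```ruby
# Hypothetical simulation of a secondary waiting for the primary to checksum
# a resource. `cycles_before_primary_checksum` is the number of background
# job cycles that elapse before the primary has a checksum.
def simulate(cycles_before_primary_checksum:, use_disabled_state:)
  resyncs = 0
  state = :pending

  cycles_before_primary_checksum.times do
    if use_disabled_state
      # MR branch: the resource is parked, so nothing is retried or resynced.
      state = :disabled
    else
      # master: verification fails, which triggers a resync and a new attempt.
      state = :failed
      resyncs += 1
    end
  end

  # The primary finishes checksumming. On the MR branch this corresponds to
  # the checksum_succeeded event; on master, the next retry finally passes.
  state = :succeeded

  { final_state: state, resyncs: resyncs }
end

on_master = simulate(cycles_before_primary_checksum: 3, use_disabled_state: false)
on_mr     = simulate(cycles_before_primary_checksum: 3, use_disabled_state: true)
```

Both runs end in `:succeeded`, but `on_master[:resyncs]` grows with the wait while `on_mr[:resyncs]` stays at zero.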
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.