Fix: Geo: Orphaned uploads lead to "Sync timed out after 28800" (!142456) · Merge requests · GitLab.org / GitLab

Kyle Yetter requested to merge 417197-geo-orphaned-uploads-lead-to-sync-timed-out-after-28800 into master Jan 22, 2024

What does this MR do and why?

When Geo replicates an object, file, or blob and the secondary's download from the primary fails, we catch the exception, log it, and mark the object's sync registry as failed. However, if an exception is raised outside the actual HTTP download, for example because of inconsistent data or another unexpected problem, the BlobDownloadService raises it without marking the sync as failed. The sync then stays in the "started" state until a separate cleanup process notices it has been stuck there for hours and marks it as failed.
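To illustrate why a stuck sync takes so long to surface, here is a minimal sketch of a timeout-based cleanup pass. It is not GitLab's actual cleanup code: `StuckRegistry`, `fail_timed_out_syncs`, and `SYNC_TIMEOUT_SECONDS` are hypothetical names, and the sketch assumes the 28800 in the error message is a number of seconds (8 hours).

```ruby
# Hypothetical sketch; names are illustrative and not GitLab's actual code.
# Assumption: the 28800 in "Sync timed out after 28800" is seconds (8 hours).
SYNC_TIMEOUT_SECONDS = 28_800

# Minimal stand-in for a sync registry row.
StuckRegistry = Struct.new(:state, :last_synced_at, :last_sync_failure)

# Marks registries that have sat in :started longer than the timeout as failed.
def fail_timed_out_syncs(registries, now: Time.now)
  registries.each do |registry|
    next unless registry.state == :started
    next unless now - registry.last_synced_at > SYNC_TIMEOUT_SECONDS

    registry.state = :failed
    registry.last_sync_failure = "Sync timed out after #{SYNC_TIMEOUT_SECONDS}"
  end
end
```

Under these assumptions, a registry left in "started" by an unexpected exception is only corrected once the full timeout has elapsed, and the recorded failure message says nothing about the real cause.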

This MR adds extra exception handling that detects unexpected errors (anything under StandardError), marks the replicator's registry as failed, records the error details on the registry, and reports the error to GitLab exception tracking. For now, it still re-raises the exception, so that the system's behavior does not change unexpectedly by swallowing exceptions that used to be raised. It does, however, correct the transfer state of the object in question.
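The pattern described above can be sketched roughly as follows. This is a simplified illustration, not the MR's actual diff: `SketchRegistry`, `SketchDownloadService`, and the `error_tracker` callable are hypothetical stand-ins for the real registry, BlobDownloadService, and GitLab exception tracking.

```ruby
# Hypothetical sketch; these classes are stand-ins, not GitLab's actual code.
class SketchRegistry
  attr_reader :state, :last_sync_failure

  def initialize
    @state = :pending
    @last_sync_failure = nil
  end

  def start!
    @state = :started
  end

  def synced!
    @state = :synced
  end

  def failed!(message)
    @state = :failed
    @last_sync_failure = message
  end
end

class SketchDownloadService
  # error_tracker is any callable that receives the exception
  # (an assumed stand-in for an exception-tracking call).
  def initialize(registry, error_tracker: ->(_error) {})
    @registry = registry
    @error_tracker = error_tracker
  end

  # Runs the given block as the "download" step.
  def execute
    @registry.start!
    yield
    @registry.synced!
  rescue StandardError => error
    # Correct the transfer state and record the details on the registry...
    @registry.failed!("Encountered system exception while attempting to sync: #{error.message}")
    # ...report the error...
    @error_tracker.call(error)
    # ...and re-raise so callers still see the exception, as before.
    raise
  end
end
```

Re-raising after the bookkeeping keeps the existing control flow for callers while ensuring the registry no longer sits in "started" until the timeout.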

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

How to set up and validate locally

Review the issue for full history and additional replication notes. Here is the basic issue replication process:

Replication Details

Preliminary Setup

Establish a primary and a secondary server
In "Admin > Geo > Sites", edit the sites to ensure the primary replicates to the secondary, and turn on object storage replication

On Primary

Create a project and an issue
On the issue, add a comment that attaches an image upload
In rails console, run the following to create a data inconsistency:

u = Upload.last
u.model_id = 129381 # a non-nil ID that matches no existing record
u.save(validate: false)

On Secondary

Go to localhost:3001 (the secondary server's url) in a browser
Visit "Admin > Geo > Sites" and browse the list of resources under the secondary server. Click on "Uploads" toward the end
Click on "Resync All"

Behavior Prior To This MR

After a few moments, you should be able to refresh and see at least one blob record corresponding to the image upload attachment created earlier in these steps:

It will be marked as "started" and will not appear to change until it times out after many hours

In the rails console on the secondary, you can also see this in the data via:

  up = Upload.last
  rep = up.replicator
  reg = rep.registry
  reg.started? # => true
  reg.failed?  # => false
  reg.last_sync_failure # => nil, or something irrelevant

Expected Behavior After This MR

Shortly after "Resync All" is triggered:

Refreshing the sync status in Uploads should show the problem upload record as "failed"

In the rails console on the secondary, you should be able to see this in the data:

  up = Upload.last
  rep = up.replicator
  reg = rep.registry
  reg.started? # => false
  reg.failed?  # => true
  reg.last_sync_failure # => "Encountered system exception while attempting to sync: undefined method `id' for nil:NilClass"


Related to #417197 (closed)

Edited Feb 21, 2024 by Kyle Yetter
