
Fix: Geo: Orphaned uploads lead to "Sync timed out after 28800"

What does this MR do and why?

For object/file/blob downloads during Geo replication, if the secondary's download from the primary fails, we catch the exception, log it, and mark the sync registry as failed for the object. However, if an exception is raised outside the actual HTTP download, for example because of inconsistent data or another unexpected problem, the BlobDownloadService raises it without marking the sync as failed. The sync then stays stuck in the "started" state until a separate cleanup process notices that it has been in that state for hours and marks it as failed, which surfaces as the "Sync timed out after 28800" error in the title.

This MR adds extra exception handling that detects unexpected errors (anything under StandardError), marks the replicator's registry as failed, records the error details on the registry, and reports the error to GitLab exception tracking. For now, it still re-raises the exception, so that code paths where an exception used to be raised do not suddenly see it swallowed. The difference is that the transfer state of the affected object is now corrected.
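The handling described above can be sketched in isolation as follows. This is a minimal illustration, not the code in this MR: `FakeRegistry`, `FakeErrorTracking`, and `SketchDownloadService` are hypothetical stand-ins invented for the example, and only the failure message prefix is taken from the validation output further down.

```ruby
# Hypothetical stand-in for a Geo registry record (NOT GitLab's real class).
class FakeRegistry
  attr_reader :state, :last_sync_failure

  def initialize
    @state = :pending
    @last_sync_failure = nil
  end

  def start!
    @state = :started
  end

  def synced!
    @state = :synced
  end

  def failed!(message:)
    @state = :failed
    @last_sync_failure = message
  end
end

# Hypothetical stand-in for exception tracking; it only collects errors.
module FakeErrorTracking
  def self.reported
    @reported ||= []
  end

  def self.track(error)
    reported << error
  end
end

# Hypothetical service: the block plays the role of the whole sync attempt.
class SketchDownloadService
  def initialize(registry)
    @registry = registry
  end

  def execute
    @registry.start!
    yield
    @registry.synced!
  rescue StandardError => e
    # Correct the transfer state first, so the registry never stays "started".
    @registry.failed!(message: "Encountered system exception while attempting to sync: #{e.message}")
    FakeErrorTracking.track(e)
    # Re-raise so callers see the same exception they saw before.
    raise
  end
end

# Usage: simulate the inconsistent-data failure (calling `id` on nil).
registry = FakeRegistry.new
begin
  SketchDownloadService.new(registry).execute { nil.id }
rescue NoMethodError
  # The exception still propagates; only the registry state differs.
end
puts registry.state
puts registry.last_sync_failure
```

The point of the sketch is the ordering inside the `rescue`: the registry is marked failed and the error is reported before the bare `raise`, so the state is corrected even though the exception continues up the stack.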

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

How to set up and validate locally

Review the issue for full history and additional replication notes. Here is the basic issue replication process:

Replication Details

Preliminary Setup

  • Establish a primary and a secondary server
  • In "Admin > Geo > Sites", edit the sites so that the primary replicates to the secondary, and turn on object storage replication

On Primary

  1. Create a project and an issue
  2. On the issue, add a comment that attaches an image upload
  3. In the Rails console, run the following to create a data inconsistency:
      u = Upload.last
      u.model_id = 129381 # a non-nil ID that matches no existing record
      u.save(validate: false)

On Secondary

  1. Go to localhost:3001 (the secondary server's URL) in a browser
  2. Visit "Admin > Geo > Sites" and browse the list of resources under the secondary server. Click on "Uploads" toward the end of the list
  3. Click on "Resync All"

Behavior Prior To This MR

  1. After a few moments, refresh the page. You should see at least one blob record corresponding to the image upload attached earlier in these steps:
  • It is marked as "started" and does not change until it times out after many hours
  • In the Rails console on the secondary, you can also see this in the data:
      up = Upload.last
      rep = up.replicator
      reg = rep.registry
      reg.started? # => true
      reg.failed?  # => false
      reg.last_sync_failure # => nil, or something irrelevant

Expected Behavior After This MR

  1. Shortly after "Resync All" is triggered:
  • Refreshing the sync status in Uploads should show the problem upload record as "failed"
  • In the Rails console on the secondary, you should be able to see this in the data:
      up = Upload.last
      rep = up.replicator
      reg = rep.registry
      reg.started? # => false
      reg.failed?  # => true
      reg.last_sync_failure # => "Encountered system exception while attempting to sync: undefined method `id' for nil:NilClass"
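To compare the before and after states at a glance, the three console checks can be folded into one small helper. This is a sketch: `sync_state_summary` and `StubRegistry` are hypothetical names, and the helper assumes only that the object responds to `started?`, `failed?`, and `last_sync_failure`, as the registry does in the console sessions above.

```ruby
# Hypothetical helper: summarizes the registry checks used in the steps above.
def sync_state_summary(registry)
  state =
    if registry.failed?
      "failed"
    elsif registry.started?
      "started"
    else
      "other"
    end

  failure = registry.last_sync_failure
  failure ? "#{state}: #{failure}" : state
end

# Stand-in registry for trying the helper outside a GitLab console.
StubRegistry = Struct.new(:started, :failed, :last_sync_failure) do
  def started?
    started
  end

  def failed?
    failed
  end
end

puts sync_state_summary(StubRegistry.new(true, false, nil))
puts sync_state_summary(StubRegistry.new(false, true, "Encountered system exception while attempting to sync"))
```

In a Rails console on the secondary, the same helper could be pasted in and called as `sync_state_summary(Upload.last.replicator.registry)`, assuming the registry exposes those methods as shown above.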


Related to #417197 (closed)

Edited by Kyle Yetter
