Geo::EventWorker fails on delete event if the object doesn't exist
When setting up Geo replication on a new secondary site from an existing primary site with non-trivial data and history, the Geo::EventWorker worker can fail frequently. The failures occur when it receives a "delete" event for an object (like an MR diff or a job artifact) that has expired at the primary, but whose underlying file hasn't replicated to object storage yet.
This comes from the attempt to do a HeadObject in ee/app/services/geo/file_registry_removal_service.rb:155:in `object_file', where if the object doesn't exist the request gets a 404 (or, in AWS when the credentials lack the ListBucket permission, a surprising 403).
The error is:
Expected(200) <=> Actual(403 Forbidden)
excon.error.response
:body => ""
:cookies => [
]
:headers => {
"Content-Type" => "application/xml"
"Date" => "Thu, 11 Jan 2024 21:59:05 GMT"
"Server" => "AmazonS3"
"x-amz-id-2" => "<REDACTED>"
"x-amz-request-id" => "<REDACTED>"
}
:host => "<BUCKET_ENDPOINT>"
:local_address => "SOMETHING"
:local_port => 47480
:method => "HEAD"
:omit_default_port => false
:path => "/merge_request_diffs/mr-XXXXXX/diff-YYYYYY"
:port => 443
:query => nil
:reason_phrase => "Forbidden"
:remote_ip => "SOMETHING"
:scheme => "https"
:status => 403
:status_line => "HTTP/1.1 403 Forbidden\r\n"
AFAICT this is the fog gem, using excon, expecting a 200 and not getting it.
It would be nice if the delete service could either check first and skip the HEAD request if the object doesn't exist, or detect and swallow this particular error when it does occur. The error causes some alarm when first seen, and if monitoring and alerting are configured, a large percentage of Sidekiq jobs can fail during backfill, leading to either a need for alert silences (risky) or alert fatigue.
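The "detect and swallow" option could look roughly like the sketch below. It is not the actual service code: `fetch_object_or_nil` is a hypothetical helper, and the two stub error classes stand in for what are presumably `Excon::Error::NotFound` (404) and `Excon::Error::Forbidden` (403) in the real fog/excon stack. They are defined here only so the example runs without external gems.

```ruby
# Hypothetical stand-ins for the HTTP errors raised by the storage client.
# In the real stack these would presumably be Excon::Error::NotFound and
# Excon::Error::Forbidden; that mapping is an assumption, not confirmed here.
class StubNotFound < StandardError; end
class StubForbidden < StandardError; end

# Hypothetical helper: runs the block that performs the HEAD lookup and
# returns nil instead of raising when the object appears to be missing.
# Any error class not listed in missing_errors still propagates.
def fetch_object_or_nil(missing_errors: [StubNotFound, StubForbidden])
  yield
rescue *missing_errors
  nil
end

# Usage: a missing object (reported as 403 here) yields nil, so the caller
# can skip the delete and let the job succeed.
file = fetch_object_or_nil { raise StubForbidden, "Expected(200) <=> Actual(403 Forbidden)" }
skip_removal = file.nil?
```

One trade-off with this approach: treating every 403 as "object missing" would also hide genuine permission problems, so a real fix might restrict that handling to HEAD requests made during removal, or log the swallowed error.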
Observed in practice in Dedicated; see internal issue