Enqueue NewMergeRequestWorker for broken MRs after Redis Sidekiq outage
Summary
Merge requests created during a period of Redis Sidekiq downtime remain broken indefinitely. MRs in this state have the following properties:
- Display a warning flash message.
- Diff is not viewable.
- Pipeline status is pending.
- Cannot be merged.
This is because NewMergeRequestWorker was never enqueued for these MRs, resulting in (at least) the associated merge_request_diff not being created.
As a workaround, an affected MR can be closed, a new commit pushed to the branch, and a new MR opened, but this is not immediately obvious to the user.
The user impact is not easily observed in error charts or budgets because it produces 404 rather than 5** status codes.
Impact
- The issue affects all MRs created during a Sidekiq outage. Since there is no way to automatically recover, these remain in a broken state.
- Diffs, pipeline state and ability to merge can all be affected.
- A production incident, in which Redis was unavailable for 30 minutes, took several hours to fully recover from.
The issue manifests as 404s on the merge request diff_metadata and diffs_batch endpoints. The incidence declined very slowly over time, but this is most likely due to users applying the workaround. It is visible in this screenshot:
(source)
Recommendation
During the related incident, the affected merge requests were identified by looking for those without merge_request_diffs, then the NewMergeRequestWorker job was enqueued for each. This resolved the issue.
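The manual recovery described above can be sketched as follows. This is a minimal, self-contained illustration, not the script used during the incident: the MergeRequestRecord struct, both helper methods, and the enqueue callable are hypothetical stand-ins. In a real fix the selection would be a database query for MRs without merge_request_diffs, and the callable would enqueue NewMergeRequestWorker (its exact arguments are not stated here, so the MR ID and author ID passed below are an assumption).

```ruby
# Hypothetical in-memory stand-in for a merge request row.
# diff_ids lists the IDs of its merge_request_diffs (empty when none exist).
MergeRequestRecord = Struct.new(:id, :author_id, :diff_ids)

# Returns the merge requests that have no associated diff records.
def merge_requests_missing_diffs(merge_requests)
  merge_requests.select { |mr| mr.diff_ids.empty? }
end

# Calls `enqueue` once per affected merge request and returns the IDs handled.
# `enqueue` is any callable; in production it would enqueue the worker job.
def reenqueue_for_missing_diffs(merge_requests, enqueue)
  merge_requests_missing_diffs(merge_requests).map do |mr|
    enqueue.call(mr.id, mr.author_id) # assumed arguments: MR ID, user ID
    mr.id
  end
end

# Usage: collect the jobs that would be enqueued.
sample_mrs = [
  MergeRequestRecord.new(1, 10, [100]), # healthy: has a diff
  MergeRequestRecord.new(2, 11, []),    # broken: created during the outage
  MergeRequestRecord.new(3, 12, [])     # broken: created during the outage
]
pending_jobs = []
reenqueue_for_missing_diffs(sample_mrs, ->(mr_id, user_id) { pending_jobs << [mr_id, user_id] })
```

Passing the enqueue step in as a callable keeps the selection logic testable without Redis or Sidekiq being available.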
Automatically re-enqueuing this job would allow the system to recover without further intervention in the event of a Sidekiq outage. This could be done by:
- Recording the job ID on the MR.
- Upon a 404 on the diff_metadata or diffs_batch endpoints, checking that the job exists and has not completed.
- Enqueuing the job again if required.
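The steps above could look roughly like the sketch below. All names are hypothetical: TrackedMergeRequest, FakeJobQueue and recover_on_diff_404 are in-memory stand-ins so the logic can run on its own, and none of them is an existing GitLab or Sidekiq API. The sketch also assumes that a job which is unknown, or which completed without producing a diff, should be enqueued again; the proposal does not specify the second case.

```ruby
# Hypothetical stand-in for an MR that records the ID of its creation job.
TrackedMergeRequest = Struct.new(:id, :job_id, :has_diff)

# Hypothetical in-memory job backend used only for illustration.
class FakeJobQueue
  def initialize
    @jobs = {}
    @next_id = 0
  end

  # Records a pending job for the MR and returns its job ID.
  def enqueue(mr_id)
    @next_id += 1
    jid = "jid-#{@next_id}"
    @jobs[jid] = { mr_id: mr_id, state: :pending }
    jid
  end

  # Returns the job state, or nil when the job is unknown
  # (for example, because it was never enqueued during an outage).
  def state(jid)
    @jobs.dig(jid, :state)
  end

  # Simulates a job being lost.
  def forget(jid)
    @jobs.delete(jid)
  end
end

# Intended to be called when a diff endpoint would respond with 404.
def recover_on_diff_404(mr, queue)
  return :diff_present if mr.has_diff
  return :in_progress if queue.state(mr.job_id) == :pending

  # Job missing or finished without a diff: enqueue again and record the new ID.
  mr.job_id = queue.enqueue(mr.id)
  :reenqueued
end

# Usage: an MR created during the outage has no recorded job.
demo_queue = FakeJobQueue.new
demo_mr = TrackedMergeRequest.new(42, nil, false)
recover_on_diff_404(demo_mr, demo_queue) # first 404 enqueues the job
recover_on_diff_404(demo_mr, demo_queue) # later 404s see the pending job and do nothing
```

Checking for a pending job before enqueuing keeps repeated 404s (for example, from page reloads) from piling up duplicate jobs.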
Implementing a generic mechanism similar to Sidekiq Pro's Pro Reliability Client would also help to solve this for other use cases.
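As a rough illustration of what such a generic mechanism might involve, the sketch below buffers jobs in process while the backend is unreachable and delivers them, in order, once it is reachable again. It is not Sidekiq Pro's implementation: BufferingClient, BackendUnavailable, the max_buffer limit and the drop-oldest policy are all assumptions made for this example.

```ruby
# Hypothetical error raised by the backend callable when it cannot be reached.
class BackendUnavailable < StandardError; end

# Hypothetical client that buffers jobs locally during an outage.
class BufferingClient
  attr_reader :buffer

  # backend: callable that delivers one job and raises BackendUnavailable on failure.
  # max_buffer: assumed cap on locally held jobs.
  def initialize(backend, max_buffer: 1000)
    @backend = backend
    @max_buffer = max_buffer
    @buffer = []
  end

  # Queues the job locally, then tries to deliver everything buffered so far.
  def push(job)
    @buffer.shift if @buffer.size >= @max_buffer # assumed policy: drop the oldest job
    @buffer << job
    flush
  end

  # Delivers buffered jobs in order, stopping at the first failure.
  # Returns the number of jobs delivered.
  def flush
    delivered = 0
    until @buffer.empty?
      begin
        @backend.call(@buffer.first)
      rescue BackendUnavailable
        break
      end
      @buffer.shift
      delivered += 1
    end
    delivered
  end
end

# Usage: jobs pushed while the backend is down are held, then sent on recovery.
backend_up = false
sent_jobs = []
demo_backend = lambda do |job|
  raise BackendUnavailable unless backend_up
  sent_jobs << job
end
demo_client = BufferingClient.new(demo_backend)
demo_client.push("new_mr:1") # backend down: stays in the buffer
backend_up = true
demo_client.push("new_mr:2") # backend up: both jobs are delivered in order
```

An in-process buffer is lost if the process restarts during the outage, so a mechanism like this would complement, rather than replace, the re-enqueue check on the MR itself.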
Verification
Successful implementation would be observable on this chart during Sidekiq downtime. A separate index could be used for a non-production environment, where downtime could be simulated.
Alternatively, MRs that are stuck in this state are easily identifiable by the error messages, the lack of pipeline status, and the inability to view code or merge. Deleting the associated diff and checking the MR page should see the MR recover.