Fail stuck jobs when UpdateAllMirrorsWorker Runs

What does this MR do and why?

Problem

Due to a Sidekiq outage, the pull mirroring process got stuck because RepositoryUpdateMirrorWorker jobs failed. Because StuckImportJob runs on a 24-hour schedule, the stuck jobs were not retried after Sidekiq was brought back online.

Solution

When UpdateAllMirrorsWorker runs, fail all jobs that have been stuck for longer than a given time threshold.
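As a rough illustration of the idea (not the actual GitLab implementation, which lives in the Rails codebase), the threshold check could be sketched as follows. The `MirrorState` record, its field names, the status strings, the error message, and the one-hour `STUCK_THRESHOLD` are all assumptions made for this example; the MR does not state them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical threshold; the MR does not state the actual value.
STUCK_THRESHOLD = timedelta(hours=1)


@dataclass
class MirrorState:
    """Hypothetical stand-in for a project's mirror import state."""
    project_id: int
    status: str  # assumed values: "scheduled", "started", "finished", "failed"
    last_update_started_at: datetime
    last_error: Optional[str] = None


def fail_stuck_mirrors(
    mirrors: List[MirrorState],
    now: datetime,
    threshold: timedelta = STUCK_THRESHOLD,
) -> List[MirrorState]:
    """Mark mirrors that have been in progress for longer than `threshold` as failed.

    Returns the mirrors that were transitioned to "failed".
    """
    cutoff = now - threshold
    failed = []
    for mirror in mirrors:
        in_progress = mirror.status in ("scheduled", "started")
        if in_progress and mirror.last_update_started_at < cutoff:
            mirror.status = "failed"
            mirror.last_error = "Mirror update was stuck past the threshold"
            failed.append(mirror)
    return failed
```

Once a mirror is marked as failed, the regular scheduling logic can pick it up again instead of waiting for the next 24-hour StuckImportJob run.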

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Database Query Testing

Original query returns no records as there are no stuck mirrors: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/32622/commands/100679

Simulated Testing

To test at scale, I simulated a scenario with 15k stuck projects. The actual incident had about 6.6k stuck projects, so the simulation uses roughly 2.3x the incident volume.



Query retrieving 15k records: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/32622/commands/100697

Query retrieving 3k records with limit: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/32622/commands/100700
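To make the effect of the limit concrete, here is a small sketch of how 15k stuck records would be split when each query returns at most 3k rows. The function name and the `BATCH_LIMIT` constant are hypothetical; the 3k value is taken from the limited query above, and whether the worker processes one batch per run is an assumption.

```python
from typing import List

# Taken from the 3k limited query above; the constant name is hypothetical.
BATCH_LIMIT = 3_000


def split_into_batches(stuck_ids: List[int], limit: int = BATCH_LIMIT) -> List[List[int]]:
    """Split stuck project IDs into successive batches of at most `limit` items."""
    return [stuck_ids[i:i + limit] for i in range(0, len(stuck_ids), limit)]


# Simulated scenario: 15k stuck projects.
batches = split_into_batches(list(range(15_000)))
print(len(batches))      # 5
print(len(batches[0]))   # 3000
```

Under these assumptions, the simulated 15k backlog would take five limited queries to clear, and the 6.6k backlog from the actual incident would take three.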

Related to #477716

Edited by Olaoluwa Oluro
