Investigate: worker with `sticky` data consistency reads stale data
- Sentry issue: #364635 (closed)
- Incident: gitlab-com/gl-infra/production#7223 (closed)
Description
- MR Switch to `sticky` data consistency for Reposit... (!87995 - merged) changed data consistency from `always` to `sticky` for `RepositoryUpdateMirrorWorker`
- Feature flag from this MR was globally enabled on 2022-05-24
- On the same date, we observed an increased number of StuckImportJob errors
- `RepositoryUpdateMirrorWorker` recorded logs with an error description - https://log.gprd.gitlab.net/goto/b1c8e040-e7ed-11ec-8656-f5f2137823ba
- On 2022-06-09, we reverted data consistency from `sticky` to `always` -> the number of errors significantly decreased
Theory
The simplified chain of events to pull the repository mirror:
1. UpdateAllMirrorsWorker runs regularly by Cron
2. It spawns `ProjectImportScheduleWorker`s for each project that requires pull mirror to be updated
3. `ProjectImportScheduleWorker` changes status of the project to `scheduled`
4. After that, we create a `RepositoryUpdateMirrorWorker` to perform the update
5. `RepositoryUpdateMirrorWorker` checks the status of the project before it starts processing
6. `RepositoryUpdateMirrorWorker` cannot start the update because the project has a `finished` status
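The chain of events above can be reproduced with a toy model. This is only a sketch for illustration: `ToyDatabase` and `start_mirror_update` are hypothetical names, not GitLab code, and replication is reduced to an explicit `sync!` call. It shows how a status check that reads from a lagging replica would still see `finished` even though the primary already has `scheduled`.

```ruby
# Toy model (hypothetical, not GitLab code): one primary and one replica
# that only receives writes when `sync!` is called.
class ToyDatabase
  attr_reader :primary, :replica

  def initialize(status)
    @primary = { status: status }
    @replica = { status: status }
  end

  # Writes always go to the primary.
  def write_status(status)
    @primary[:status] = status
  end

  # Replication is asynchronous; the replica catches up only here.
  def sync!
    @replica = @primary.dup
  end
end

# Hypothetical stand-in for the status check the worker performs
# before it starts processing.
def start_mirror_update(status)
  return "Project was in an inconsistent state: #{status}" unless status == 'scheduled'

  'started'
end

db = ToyDatabase.new('finished')
db.write_status('scheduled') # the status change, applied on the primary only

puts start_mirror_update(db.replica[:status]) # replica has not replayed the write yet
puts start_mirror_update(db.primary[:status]) # primary already has the new status

db.sync!
puts start_mirror_update(db.replica[:status]) # replica has caught up
```

In this model the first check fails with the same message we see in the logs, while the check against the primary and the check after the replica catches up both succeed.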
I think that `RepositoryUpdateMirrorWorker` somehow reads stale data from a replica that has not yet received the `scheduled` update (from step 3). It happens in about 1.2% of cases.
We see that problem in logs: 'Project was in an inconsistent state: finished'.
After we restored data consistency for `RepositoryUpdateMirrorWorker` to `always`, this problem almost disappeared.
A possible reason for this behavior is that we read data from a replica that is not up-to-date. However, this should not happen with `sticky` data consistency. Related code: https://gitlab.com/gitlab-org/gitlab/blob/8c4b269470e817269375f0d972d7eb5aca13566d/lib/gitlab/database/load_balancing/sidekiq_server_middleware.rb#L52
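For reference, here is a minimal sketch of the decision I would expect a `sticky` read to make. It is not the linked middleware: `ToyReplica` and `choose_read_host` are hypothetical names, write-ahead log positions are simplified to plain integers, and falling back to the primary when no position was recorded is an assumption of the sketch. The idea is that the job carries the log position that was current when it was scheduled, and a replica is used only if it has replayed at least that far.

```ruby
# Hypothetical replica that reports how far it has replayed the write-ahead log.
# Real log positions are not plain integers; integers keep the sketch simple.
ToyReplica = Struct.new(:name, :replayed_lsn) do
  def caught_up?(required_lsn)
    replayed_lsn >= required_lsn
  end
end

# Decide where a job should read from.
#   consistency - :always or :sticky
#   job_lsn     - log position recorded when the job was scheduled (may be nil)
#   replicas    - list of ToyReplica
def choose_read_host(consistency, job_lsn, replicas)
  return :primary if consistency == :always
  return :primary if job_lsn.nil? # assumption: no recorded position, stay on the primary

  fresh = replicas.find { |replica| replica.caught_up?(job_lsn) }
  fresh ? fresh.name : :primary
end

replicas = [ToyReplica.new(:replica_a, 90), ToyReplica.new(:replica_b, 120)]

puts choose_read_host(:sticky, 100, replicas) # a replica that replayed past 100
puts choose_read_host(:sticky, 150, replicas) # no replica is far enough along
puts choose_read_host(:always, 100, replicas) # always reads from the primary
```

If the real behavior matched this model, a worker would never read from a replica that is behind the position recorded for its job, which is why the stale `finished` status is surprising.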
Edited by Vasilii Iakliushin