Support concurrent create and delete while refreshing project statistics
What does this MR do and why?
This MR addresses a race condition on the counter that happens when job artifacts are concurrently created or deleted during a refresh. A detailed problem description can be found in #373469.
There are a few scenarios that need to be addressed:
- At the beginning of a refresh, the counter and database column are reset to zero. After this happens, there might still be decrements to the counter that bring it to a negative value. Since the records have already been deleted from the database, the counter will not be corrected, resulting in a net negative value.
- At the beginning of a refresh, the counter and database column are reset to zero. After this happens, there might still be increments arriving at the counter. These record ids might then be rediscovered by the refresh, resulting in a double increment.
- During the refresh, job artifacts might be deleted as a result of regular activity. If this happens before the relevant records are accounted for by the refresh, it leads to a net negative counter value, similar to point 1 above.
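To make the first scenario concrete, here is a minimal, hypothetical simulation in plain Ruby (no Redis involved) of a naive counter that is reset to zero while a delete's decrement is still in flight:

```ruby
# Hypothetical simulation of scenario 1: a decrement that arrives
# after the refresh has reset the counter drives it negative.
counter = 0

# An artifact of size 100 was counted before the refresh started.
counter += 100

# The refresh begins: counter and database column are reset to zero.
counter = 0

# A delete that was already in flight now applies its decrement, but
# the refresh will never re-add the deleted record's size.
counter -= 100

puts counter # => -100, a net negative value that is never corrected
```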
Solution
This MR changes the counter to temporarily use a refresh key during the refresh. During this period, the counter ignores increments that should not affect the counter and deduplicates repeated increments.
The main changes are:
- Modifies `BufferedCounter` to toggle its increment behaviour during a refresh. Several Lua scripts are added for this implementation, because the operations need to happen atomically in Redis.
- Adds a `finalizing` step to the refresh process, where the recalculated value in the refresh key is moved into the existing counter key.
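As a rough sketch of the `finalizing` step (hypothetical names, and a plain Ruby `Hash` standing in for Redis where the real move happens atomically via Lua; this assumes "moved" means the refresh key's value replaces the counter key's value):

```ruby
# Hypothetical sketch: the recalculated value accumulated under the
# temporary refresh key replaces the regular counter key, and the
# refresh key is removed so subsequent increments apply normally.
def finalize!(store)
  store["counter_key"] = store.delete("refresh_key")
end

store = { "counter_key" => 0, "refresh_key" => 150 }
finalize!(store)
puts store["counter_key"] # => 150
```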
The refresh process would look as follows:
- A `BuildArtifactSizeRefresh` record is created in the `created` state. At this point, there is no effect as it is just waiting to be processed.
- `RefreshBuildArtifactsSizeStatisticsWorker` picks up the refresh and moves it into `running`. At this point, new increments are pushed into a separate `refresh_key` in Redis, and deduplicated at the same time. The deduplication ensures that no decrement happens unless there is an existing increment on the same id, to avoid a net negative situation. On the other hand, repeated increments from a job artifact that was created and then rediscovered by the refresh are also deduplicated, to avoid a double increment.
- `BuildArtifactSizeRefresh` goes into the `finalizing` state at the end of the batching of the existing records. This delays the final step that moves the recalculated amount to the counter key. The delay is needed to ensure that job artifacts that were deleted, but whose increments have not yet reached Redis, don't result in a net negative value.
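The deduplication described above can be sketched in plain Ruby. This is a hypothetical in-memory stand-in for the Lua scripts (class and method names are illustrative, not the actual implementation; in Redis each operation runs atomically):

```ruby
require 'set'

# Hypothetical in-memory sketch of the deduplicated refresh counter:
# a Hash-backed total and a Set of seen ids stand in for the Redis
# refresh_key and the atomic Lua scripts.
class RefreshCounterSketch
  def initialize
    @refresh_total = 0     # stand-in for the refresh_key value
    @tracked_ids = Set.new # ids whose increment has been seen
  end

  attr_reader :refresh_total

  # Apply an increment for a record id; a repeated increment for the
  # same id (e.g. an artifact created during the refresh and later
  # rediscovered by the batching) is ignored to avoid double counting.
  def increment(id, amount)
    return unless @tracked_ids.add?(id) # Set#add? returns nil if already present
    @refresh_total += amount
  end

  # Apply a decrement only when a matching increment was seen for the
  # same id, so the total can never go net negative.
  def decrement(id, amount)
    return unless @tracked_ids.delete?(id) # nil when no matching increment
    @refresh_total -= amount
  end
end

counter = RefreshCounterSketch.new
counter.increment(1, 100)  # artifact created during the refresh
counter.increment(1, 100)  # same id rediscovered by the refresh: ignored
counter.decrement(2, 50)   # delete with no matching increment: ignored
counter.decrement(1, 100)  # paired delete: applied
puts counter.refresh_total # => 0
```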
Screenshots or screen recordings
Screenshots are required for UI changes, and strongly recommended for all other merge requests.
How to set up and validate locally
- Enable the feature flag `project_statistics_bulk_increment`.
- Create a project and run a bunch of pipelines with artifacts to set up baseline statistics. Note: make sure `Ci::ArchiveTraceWorker` has completed.
- Validate that `project.reload.job_artifacts.sum(&:size)` is equal to `project.statistics.build_artifacts_size`.
- Validate a refresh without any activity.
  - Enqueue a refresh using `Projects::BuildArtifactsSizeRefresh.enqueue([project])`.
  - Manually enqueue `Projects::ScheduleRefreshBuildArtifactsSizeStatisticsWorker` in Admin - Monitor - Background Jobs - Cron, to avoid waiting for the cron scheduler. This kicks off the refresh process.
  - Verify that:
    - the refresh is in the finalizing state: `Projects::BuildArtifactsSizeRefresh.last.state` is 4.
    - `project.reload.statistics.build_artifacts_size` is 0.
    - `Projects::FinalizeProjectStatisticsRefreshWorker` is enqueued.
  - Wait until `FinalizeProjectStatisticsRefreshWorker` executes and see that `FlushIncrementsWorker` is enqueued. At this point, we can validate that the counter key holds the correct amount: `Gitlab::Counters::BufferedCounter.new(project.statistics, :build_artifacts_size).get` is the same as `project.reload.job_artifacts.sum(&:size)`.
  - Wait until `FlushIncrementsWorker` has completed and validate that `project.reload.job_artifacts.sum(&:size)` is equal to `project.statistics.build_artifacts_size`.
- Validate a refresh while there are deletions and new pipelines.
  - Enqueue a refresh using `Projects::BuildArtifactsSizeRefresh.enqueue([project])`.
  - Before enqueuing `Projects::ScheduleRefreshBuildArtifactsSizeStatisticsWorker`, perform various actions such as calling `DestroyBatchService`, creating new pipelines, and deleting job artifacts.
  - Manually enqueue `Projects::ScheduleRefreshBuildArtifactsSizeStatisticsWorker` in Admin - Monitor - Background Jobs - Cron.
  - Continue creating new pipelines or deleting job artifacts from the console, while the refresh is ongoing.
  - Wait until `Projects::BuildArtifactsSizeRefresh.last.state` is 4.
  - Wait until `Projects::FinalizeProjectStatisticsRefreshWorker` and `FlushIncrementsWorker` have completed.
  - Validate that `project.reload.job_artifacts.sum(&:size)` is equal to `project.statistics.build_artifacts_size`.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #373469