Update package_metadata sync concurrency
What does this MR do and why?
PackageMetadata::SyncWorker
synchronizes csv data from a gcp bucket or file against the GitLab instance's database using bulk insert. The job is long-running (especially on initial import) which causes multiple instances of the worker to run at the same time. The bug affects performance by causing a significant slowdown in bulk upsert query speed. 10x slow-downs are common for when 2 or more workers are doing bulk inserts concurrently. Deadlocks are less frequent but do occur causing the job to fail.
In the normal flow of events the job is started by Sidekiq::Cron
at the next scheduled interval. Multiple jobs are prevented from starting by the Deduplication Middleware.
The error state happens when sidekiq shuts down due to a node restart or a re-deploy (this is frequent). When this happens, the running job is terminated but pushed back to the default
queue to allow another sidekiq worker to pick up the job. Now there are two triggers for the job. (1) A normal start involving Sidekiq::Cron
and (2) a start by another node's worker picking up the job from the default
queue.
This is a rough visualization of what happens.
sequenceDiagram
participant Redis
participant Cron as Sidekiq::Cron
participant DedupeMiddleware
participant Sidekiq
participant Sync as SyncWorker1
participant Sync2 as SyncWorker2
activate Redis
activate Sidekiq
activate Cron
Note right of Cron: (1) cron tries to schedule job
Cron->>DedupeMiddleware: is_duplicate?(sync_job)
DedupeMiddleware-->>Cron: false
Cron->>Sidekiq: schedule sync_job
Sidekiq->>Sync: start
activate Sync
Sync->>Sync: work
Note right of Cron: (2) cron tries to schedule job
Cron->>DedupeMiddleware: is_duplicate?(sync_job)
DedupeMiddleware-->>Cron: true
Cron->>Cron: noop
Note right of Sidekiq: (3) sidekiq node shuts down
Sidekiq--xSync: shutdown
deactivate Sync
Sidekiq->>Redis: add sync_job to default queue
deactivate Cron
deactivate Sidekiq
Note right of Sidekiq: (4) sidekiq node restarts
activate Sidekiq
activate Cron
Sidekiq->>Redis: get default queue jobs
Redis-->>Sidekiq: [..., sync_job, ...]
Sidekiq->>Sync: start
activate Sync
Sync->>Sync: work
Note right of Cron: (5) cron tries to schedule job
Cron->>DedupeMiddleware: is_duplicate?(sync_job)
DedupeMiddleware-->>Cron: false
Cron->>Sidekiq: schedule sync_job
Sidekiq->>Sync2: start
activate Sync2
Sync2->>Sync2: work
deactivate Sync
deactivate Sync2
deactivate Cron
deactivate Sidekiq
deactivate Redis
The changes in this MR
This MR changes the sync job to take a global lock (via Gitlab:ExclusiveLease) so that only a single PackageMetadata::SyncWorker
job may run at any given time.
Alongside the global lock, the job run time is reduced to 5 minutes. This makes job execution more predictable, reduces the chance of the job being interrupted by a sidekiq
restart, and abides by the sidekiq development guidelines regarding a 5 minute maximum: https://docs.gitlab.com/ee/development/sidekiq/worker_attributes.html
The job also self-preempts based on the global lock's ttl. After every checkpoint PackageMetadata::SyncService
checks whether it's out of time. If it is, the job stops.
The deduplication strategy is also removed to ensure that ExclusiveLease
is the SSOT for uniquely identifying the job.
How to set up and validate locally
Before this fix
git checkout master
- start sidekiq and tail the log for sync worker messages (
bundle exec sidekiq | tee -a /tmp/sidekiq.log
, `tail -f /tmp/sidekiq.log| jq 'select(.class == "PackageMetadata::SyncWorker") | .' ) - wait for cron to start the job (can bump frequency to every minute)
- terminate sidekiq
ps | grep sidekiq
,kill -TERM [PID]
(there will be a delay while sidekiq tries to shutdown gracefully) - start sidekiq again
- verify 2 or more
PackageMetadata::SyncWorker
job_status: start
messages (without correspondingjob_status: done
messages almost immediately after) in the logs
After the fix
git checkout 393433-reduce-chance-of-concurrent-sync-jobs
- start sidekiq and tail the log for sync worker messages (
bundle exec sidekiq | tee -a /tmp/sidekiq.log
, `tail -f /tmp/sidekiq.log| jq 'select(.class == "PackageMetadata::SyncWorker") | .' ) - wait for cron to start the job (can bump frequency to every minute)
- terminate sidekiq
ps | grep sidekiq
,kill -TERM [PID]
- start sidekiq again
- verify only 1
PackageMetadata::SyncWorker
job_status: start
message without a correspondingdone
; otherstart
messages are followed by adeduplication
message
Note, to make it more predictable change the code of SyncService#execute
to a simple loop with a log message and sleep. Example:
class SyncService
def execute
count = 0
loop do
count += 1
Gitlab::PackageMetadata::Logger.debug(class: self.class.name, message: "In execute loop (iteration: #{count}")
sleep(2)
end
end
end
Gitlab::PackageMetadata::Logger
debug messages will appear under $GITLAB_DIR/log/package_metadata.log
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.
Related to #393433 (closed)