Sidekiq spawns multiple instances of the package metadata sync job
Summary
When executing via Sidekiq it is possible for several instances of PackageMetadata::SyncWorker to be running at the same time.
This appears to be related to a Sidekiq shutdown/restart stopping job execution without removing the job id from the queue. When Sidekiq is restarted the job is requeued and started again, but another job is also kicked off via cron and the deduplicate option is ignored: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/workers/package_metadata/sync_worker.rb#L12
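For context, the deduplication referenced above is declared directly on the worker. The following is a minimal sketch of that pattern, not a copy of the linked file; the exact options and worker body in sync_worker.rb may differ:

```ruby
# Minimal sketch of how a GitLab worker opts into Sidekiq deduplication.
# Illustrative only; not a copy of the linked sync_worker.rb.
module PackageMetadata
  class SyncWorker
    include ApplicationWorker

    # Drops a newly scheduled duplicate only while an identical job is still
    # waiting to execute. As described above, a job that was interrupted and
    # requeued by a Sidekiq restart is not caught by this check, so a
    # cron-started job can end up running alongside it.
    deduplicate :until_executing
    idempotent!

    def perform
      # Hypothetical entry point shown for completeness; the real worker's
      # body differs.
      SyncService.execute
    end
  end
end
```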
Though this has no effect on data integrity, it does affect query performance: in certain cases multiple concurrently running jobs can cause up to a 10x slow-down of bulk inserts.
Steps to reproduce
This is a semi-reliable way to reproduce the issue.
- modify the cron job to run every minute: https://gitlab.com/gitlab-org/gitlab/-/blob/4a4680538c36f988983e590b91342dd986d877c2/config/initializers/1_settings.rb#L835 (see the sketch after this list)
- run sidekiq locally (`bundle exec sidekiq`)
- wait for the cron job to start (sidekiq.log will show a `start` `job_status` for PackageMetadata::SyncWorker)
- once the job is running, interrupt it with a kill command
- restart sidekiq; you should see 2 starts (`job_status: start`) for PackageMetadata::SyncWorker with different job ids
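For the first step, the local change might look roughly like this; the cron_jobs key name is an assumption, so check the linked initializer for the exact entry:

```ruby
# config/initializers/1_settings.rb -- local-only tweak for reproducing the issue.
# NOTE: 'package_metadata_sync_worker' is an assumed key name; check the linked
# line for the exact cron_jobs entry before editing it.
Settings.cron_jobs['package_metadata_sync_worker']['cron'] = '* * * * *' # every minute
```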
What is the current bug behavior?
Several jobs executing PackageMetadata::SyncWorker run concurrently.
What is the expected correct behavior?
Only a single instance of PackageMetadata::SyncWorker should run.
Relevant logs and/or screenshots
This is a slice of the log showing what happens when jobs are interrupted.
- job with id `db2c023935492790ed4cfccc` starts at 19:00 (UTC)
- it gets interrupted twice (at 19:01 and 19:23), as shown by the `interrupted_count` column
- it is finally interrupted again somewhere between 19:23 and 19:47 and then added to the interrupted queue
- after this, cron starts another job, `92654433867411977c08fe69`, at 20:00, which also gets interrupted twice
- cron starts yet another job, `bb5248fcfe0100295a48820c`, at 21:00
Additionally, the 2nd entry in the linked log slice shows a side effect of multiple SyncWorker jobs running at the same time: the possibility of deadlocks.
Jobs can be correlated with an interrupt by checking the logs for the node associated with the job. For example, looking at the `json.tag` of `db2c023935492790ed4cfccc` and then checking all the logs for that tag, we can see that a Sidekiq shutdown initiated that interrupt.
Possible fixes
1. investigate options for making deduplication ignore the interrupt case
2. ensure that PackageMetadata::SyncWorker uses its own lock to find out whether a job is already executing
3. reduce the chunk size so that more checkpoints can be taken and progress is made more quickly, with fewer interrupts catching the job mid-chunk
For options 1 and 2 there seem to be quite a few edge cases to think about. Preventing the duplicate start but having the interrupted job fail means that no new cron job starts until the deduplication key is removed, and there does not seem to be a straightforward way of removing a dead deduplication key; that approach has edge cases of its own.
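For illustration, option 2 could look something like the sketch below, which guards the sync with Gitlab::ExclusiveLease; the lease key, timeout, and service call are assumptions rather than existing code:

```ruby
# Illustrative only: wrap the sync in an exclusive lease so that a requeued job
# and a cron-started job cannot both make progress at the same time.
LEASE_KEY = 'package_metadata:sync_worker'   # hypothetical key
LEASE_TIMEOUT = 30.minutes.to_i              # hypothetical timeout

def perform
  lease = Gitlab::ExclusiveLease.new(LEASE_KEY, timeout: LEASE_TIMEOUT)
  uuid = lease.try_obtain
  return unless uuid # another instance holds the lease; skip this run

  begin
    SyncService.execute # hypothetical entry point
  ensure
    Gitlab::ExclusiveLease.cancel(LEASE_KEY, uuid)
  end
end
```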
Option 3 seems to be the most feasible and straightforward fix. Reducing the chunk size allows the sync to make more frequent progress and lets the worker check whether its duration is still within a fixed TTL. It also avoids the case where the chunk size is so large that no checkpoint is created before the job is killed and no progress is made at all. A higher-frequency job interval matching the approximate duration doesn't need elaborate locking mechanisms or accounting for interrupts.
Reducing chunk size
A test in a local environment returned the following numbers for different chunk sizes:
| Chunk size | Approx. size per chunk | Full ingestion duration (s) | Avg time to process 1 chunk (s) |
| --- | --- | --- | --- |
| 10k lines | ~260KB | 875.695 | 3.33 |
| 50k lines | ~1.5MB | 919.703 | 17.35 |
| 100k lines | ~2.5MB | 880.572 | 32.60 |
| 500k lines | ~13MB | 880.45 | 146.71 |
| 1m lines | ~26MB | 893.13 | 297.64 |
| 1.5m lines | ~39MB | 863.292 | 431.51 |
A 5 minute TTL is recommended for workers in the low urgency queue. Accounting for differences in resourcing, the safer bet seems to be 50k lines: at ~17s per chunk, even a 10x increase in duration would still allow the service to checkpoint before 5 minutes elapse.
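Spelling out that arithmetic with the numbers from the table:

```ruby
# Back-of-the-envelope check for the 50k-line chunk size, using the measured averages above.
ttl            = 5 * 60  # 300 s budget for low urgency workers
avg_chunk_time = 17.35   # seconds per ~1.5MB / 50k-line chunk, measured locally
slowdown       = 10      # pessimistic factor for slower environments

worst_case = avg_chunk_time * slowdown # => 173.5 s
worst_case < ttl                       # => true, a checkpoint still fits within the TTL
```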
Implementation plan
- update license-exporter to export a smaller (~2MB) chunk: https://gitlab.com/gitlab-org/security-products/license-db/license-exporter/-/blob/main/storage/objectrotator.go#L15
- update PackageMetadata::SyncService (sketched after this list):
  - add a TTL of 5 minutes
  - after every checkpoint, make sure that the elapsed duration is not greater than ttl - approximate_runtime_for_a_chunk: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/package_metadata/sync_service.rb#L38
- update the frequency of the cron job to 5 minutes: https://gitlab.com/gitlab-org/gitlab/-/blob/master/config/initializers/1_settings.rb#L834
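As a rough sketch of the TTL check in the second step, assuming hypothetical each_chunk/ingest/checkpoint! helpers rather than the existing service's API:

```ruby
# Illustrative sketch of the TTL check; constant and method names are hypothetical
# and not taken from the existing sync_service.rb.
TTL_SECONDS          = 5 * 60 # 300 s budget
APPROX_CHUNK_SECONDS = 30     # conservative estimate per chunk

def execute
  started_at = Time.current

  each_chunk do |chunk|
    ingest(chunk)
    checkpoint!(chunk) # persist progress so a follow-up job can resume here

    elapsed = Time.current - started_at
    # Stop once the remaining budget can no longer safely fit another chunk;
    # the next cron run picks up from the last checkpoint.
    break if elapsed > TTL_SECONDS - APPROX_CHUNK_SECONDS
  end
end
```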