Namespaces::ProcessSyncEventsWorker: `until_executed` does not allow the worker to re-schedule itself
Noticed in: https://gitlab.com/gitlab-com/Product/-/issues/2852#note_1011169306
Problem
When there is a large backlog of SyncEvents (for namespaces or projects), the worker re-schedules itself so that it keeps running until the queue is empty. The change introduced via !83286 (merged) broke this mechanism: the re-scheduled job is now "deduplicated", because it is enqueued while the worker itself is still executing.
This shows that the mechanism is not correct: with `until_executed`, a worker that tries to re-schedule itself will never succeed.
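For illustration, a minimal sketch of the re-scheduling pattern that breaks under this strategy (the class name, batch size, and consume_batch helper are hypothetical; only `deduplicate :until_executed` and `perform_async` reflect the actual worker setup):

class ExampleSyncEventsWorker # hypothetical, not the actual worker
  include ApplicationWorker

  idempotent!
  deduplicate :until_executed # duplicate jobs are dropped until this one finishes

  BATCH_SIZE = 1_000

  def perform
    processed = consume_batch(BATCH_SIZE) # hypothetical helper draining SyncEvents

    # A full batch means more events remain, so enqueue a follow-up job.
    # Because this job is still running, the follow-up shares its idempotency
    # key and is dropped ("deduplicated: dropped until executed"), so the
    # workers alone can never drain the backlog.
    self.class.perform_async if processed == BATCH_SIZE
  end
end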
Symptoms
{"severity":"INFO","time":"2022-06-29T20:58:13.627Z","retry":25,"queue":"namespaces_process_sync_events","version":0,"args":[],"class":"Namespaces::ProcessSyncEventsWorker","jid":"cdfa166743baedeb19a50794","created_at":"2022-06-29T20:58:13.621Z","meta.feature_category":"pods","correlation_id":"27538673072803a620ee850db8957ca4","worker_data_consistency":"always","idempotency_key":"resque:gitlab:duplicate:namespaces_process_sync_events:99b01e5f7dc2a9b47ce31029252a21c1a7e02ef97b2a00a70d65582147f34ddb","size_limiter":"validated","enqueued_at":"2022-06-29T20:58:13.626Z","job_size_bytes":2,"pid":23095,"message":"Namespaces::ProcessSyncEventsWorker JID-cdfa166743baedeb19a50794: start","job_status":"start","scheduling_latency_s":0.001591}
{"severity":"INFO","time":"2022-06-29T20:58:25.174Z","retry":25,"queue":"namespaces_process_sync_events","version":0,"args":[],"class":"Namespaces::ProcessSyncEventsWorker","jid":"2b8d251960c7589bf10533c1","created_at":"2022-06-29T20:58:25.172Z","meta.feature_category":"pods","correlation_id":"27538673072803a620ee850db8957ca4","meta.caller_id":"Namespaces::ProcessSyncEventsWorker","meta.root_caller_id":"Namespaces::ProcessSyncEventsWorker","worker_data_consistency":"always","idempotency_key":"resque:gitlab:duplicate:namespaces_process_sync_events:99b01e5f7dc2a9b47ce31029252a21c1a7e02ef97b2a00a70d65582147f34ddb","duplicate-of":"cdfa166743baedeb19a50794","job_size_bytes":2,"pid":23095,"job_status":"deduplicated","message":"Namespaces::ProcessSyncEventsWorker JID-2b8d251960c7589bf10533c1: deduplicated: dropped until executed","deduplication.type":"dropped until executed"}
{"severity":"INFO","time":"2022-06-29T20:58:25.176Z","retry":25,"queue":"namespaces_process_sync_events","version":0,"args":[],"class":"Namespaces::ProcessSyncEventsWorker","jid":"cdfa166743baedeb19a50794","created_at":"2022-06-29T20:58:13.621Z","meta.feature_category":"pods","correlation_id":"27538673072803a620ee850db8957ca4","worker_data_consistency":"always","idempotency_key":"resque:gitlab:duplicate:namespaces_process_sync_events:99b01e5f7dc2a9b47ce31029252a21c1a7e02ef97b2a00a70d65582147f34ddb","size_limiter":"validated","enqueued_at":"2022-06-29T20:58:13.626Z","job_size_bytes":2,"pid":23095,"message":"Namespaces::ProcessSyncEventsWorker JID-cdfa166743baedeb19a50794: done: 11.549015 sec","job_status":"done","scheduling_latency_s":0.001591,"redis_calls":4,"redis_duration_s":0.0066500000000000005,"redis_read_bytes":253,"redis_write_bytes":1065,"redis_cache_calls":1,"redis_cache_duration_s":0.002337,"redis_cache_read_bytes":203,"redis_cache_write_bytes":52,"redis_queues_calls":1,"redis_queues_duration_s":0.000149,"redis_queues_read_bytes":48,"redis_queues_write_bytes":672,"redis_shared_state_calls":2,"redis_shared_state_duration_s":0.004164,"redis_shared_state_read_bytes":2,"redis_shared_state_write_bytes":341,"db_count":3005,"db_write_count":2001,"db_cached_count":0,"db_replica_count":0,"db_primary_count":3005,"db_main_count":3005,"db_main_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":0,"db_main_cached_count":0,"db_main_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_main_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":6.698,"db_main_duration_s":6.698,"db_main_replica_duration_s":0.0,"cpu_s":5.228127,"mem_objects":1594293,"mem_bytes":61860304,"mem_mallocs":246876,"mem_total_bytes":125632024,"worker_id":"sidekiq_0","rate_limiting_gates":[],"extra.namespaces_process_sync_events_worker.estimated_total_events":515929,"extra.namespaces_process_sync_events_worker.consumable_events":1000,"extra.namespaces_process_sync_events_worker.processed_events":1000,"duration_s":11.549015,"completed_at":"2022-06-29T20:58:25.176Z","load_balancing_strategy":"primary","db_duration_s":6.564283}
Impact
The likelihood of this occurring in the production environment is very low because of the high rate of events. However, it might happen on on-premise instances when a large number of events is enqueued at a single moment. When it does, the queue can never be emptied.
Workaround
Revert !83286 (merged)
Workaround (manual)
loop { puts Namespaces::ProcessSyncEventsWorker.new.perform }
{:estimated_total_events=>539371, :consumable_events=>1000, :processed_events=>1000}
{:estimated_total_events=>538371, :consumable_events=>1000, :processed_events=>1000}
{:estimated_total_events=>537371, :consumable_events=>1000, :processed_events=>1000}
{:estimated_total_events=>536371, :consumable_events=>1000, :processed_events=>1000}
# until there are no more events
And then
loop { puts Projects::ProcessSyncEventsWorker.new.perform }
...
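A variant of the manual workaround that stops on its own, assuming both workers' perform returns the stats hash printed above (as the output suggests):

[Namespaces::ProcessSyncEventsWorker, Projects::ProcessSyncEventsWorker].each do |worker|
  loop do
    stats = worker.new.perform
    puts stats
    break if stats[:processed_events].to_i.zero?
  end
end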
Fix
Figure out how to make `until_executed` re-schedule the worker when it is discovered that another job was enqueued (and dropped) during the worker's execution.
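A sketch of what such a fix could look like; the `if_deduplicated: :reschedule_once` option name is an assumption about the API shape (the strategy would re-enqueue the job once after it finishes if a duplicate was dropped during execution), not a confirmed change:

module Namespaces
  class ProcessSyncEventsWorker
    include ApplicationWorker

    idempotent!
    # Hypothetical option: keep dropping duplicates while the job runs, but
    # re-schedule the job once afterwards if any duplicate was dropped.
    deduplicate :until_executed, if_deduplicated: :reschedule_once
  end
end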