Implement fair queueing for LFK
What does this MR do and why?
This change implements fair queueing for the loose foreign keys feature. If the child records of a deleted record batch cannot be cleaned up in 3 consecutive executions then the deleted record batch will be "re-scheduled" 5 minutes later. The re-scheduled might not be the perfect phrase here. We order the records by consume_after
when loading a batch, consume_after
also equals to the created_at
by default. This means that if we don't get so many inserts (deletions), we might load and process the re-scheduled records earlier than 5 minutes.
The aim here is to improve the throughput by tracking the more difficult deletions (many children records) and applying delayed cleanup.
Example scenario: A project is deleted with many ci_builds
records
- A new row is inserted into the
loose_foreign_keys_deleted_records
table. - The worker (in <1 minute) loads a batch of rows, including our project deletion row.
- The worker starts deleting all children records of the
project
which is defined in the LFK config. - The worker starts deleting the
ci_builds
records of the project, at this point the worker is stopped because it's over the limits.- A worker is over the limit when it runs for 30s or it processed a lot's of rows at once.
- For the current batch, the
cleanup_attempts
column is incremented. - In one minute, another worker starts and does the same thing.
- On the 3rd attempt, the worker reschedules the current batch by updating the
consume_after
column.
Cleanup cycle for a projects
batch problematic records:
- 14:05 - Batch1 wasn't cleaned up, set cleanup_attempts = 1.
- 14:06 - Batch1 wasn't cleaned up, set cleanup_attempts = 2.
- 14:07 - Batch1 wasn't cleaned up, set cleanup_attempts = 3.
- 14:08 - Batch1 wasn't cleaned up, reschedule Batch1 for a later time (5 minutes later).
- 14:09 - Batch2 is being processed.
- 14:10 - Batch3 is being processed.
- 14:11 - Batch4 is being processed.
- 14:15 - Batch5 is being processed.
- 14:16 - A mixture of Batch1 and Batch6 is being processed.
There is a chance where Batch1
would run for a very long time, this would delay the other deletions in the table and effectively clog the queue. By delaying Batch1
a little, we give some time to the other deleted records in the table to get cleaned up.
- Note 1: we're not dealing with individual records, the cleanup queries are running on batches. This means that if a batch has one record which is "difficult" to clean up then the whole batch will be rescheduled.
- Note 2: the feature is behind a feature flag:
lfk_fair_queueing
How to set up and validate locally
Modify the app/models/loose_foreign_keys/modification_tracker.rb
file and change the over_limit?
method:
def over_limit?
true
end
Modify the app/services/loose_foreign_keys/process_deleted_records_service.rb
and remove the following line:
break if modification_tracker.over_limit?
This simulates the "many children rows" scenario.
Enable the feature:
Feature.enable(:lfk_fair_queueing)
Insert a DeletedRecord:
record = LooseForeignKeys::DeletedRecord.create!(fully_qualified_table_name: 'public.projects', primary_key_value: 555)
record.reload
Invoke the worker:
LooseForeignKeys::ProcessDeletedRecordsService.new(connection: ApplicationRecord.connection).execute
Verify the increment:
record.reload.cleanup_attempts
Invoke the worker 3 more times and inspect the record:
record.reload
The cleanup_attempts
must be 0 and the consume_after
should be 5 minutes in the future.
Database
All queries derived from .update_by_partition
have similar performance characteristics:
UPDATE
query example: https://explain.depesz.com/s/RihN#html
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.
Related to #343547 (closed)