[CA PoC] Implement periodic data sync worker
Why are we doing this work
When we use the ClickHouse backend for Contribution Analytics, we want to see newly created events with minimal delay. To achieve this, we should periodically send data to ClickHouse.
Implementation plan
Implement a cursor-based worker class that queries the `events` table in batches and inserts the data into the `events` table in the ClickHouse database (a skeleton sketch follows the list below).
- The worker should prefer PostgreSQL replicas.
- Use the `RuntimeLimiter` class to prevent Sidekiq error budget violations.
- Have an upper limit on the number of rows queried per job: 1_000_000 (batch size: 10_000).
- The worker is scheduled every 5 minutes.
- The worker should resume where the previous job stopped.
- Have a feature flag in place.
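A minimal worker skeleton under these requirements could look like the sketch below. The class name, the `:clickhouse_events_sync` feature flag, and the `sync_batches_until_limit_reached` helper are illustrative assumptions; `data_consistency :delayed` is GitLab's existing mechanism for routing worker reads to replicas.

```ruby
# Sketch only: class name, flag name, and helper method are assumptions.
class ClickHouseEventsSyncWorker
  include ApplicationWorker

  idempotent!
  data_consistency :delayed # prefer PostgreSQL replicas for reads

  MAX_ROWS_PER_JOB = 1_000_000
  BATCH_SIZE = 10_000

  # Scheduled via a '*/5 * * * *' cron entry (every 5 minutes).
  def perform
    return unless Feature.enabled?(:clickhouse_events_sync) # hypothetical flag

    sync_batches_until_limit_reached # cursor-based loop, see pseudocode below
  end
end
```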
The resume/restore logic should be implemented with a cursor. The cursor contains the last inserted `events.id` from the PostgreSQL database.
The cursor can be stored in ClickHouse (one table with a single value) or in Redis. Storing the cursor in ClickHouse might be preferable since it reduces the job's external dependencies.
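If we go with ClickHouse storage, a single-row table plus two small helpers might look like this sketch. The `sync_cursors` table, its schema, and the `execute` helper (assumed to run SQL against ClickHouse) are assumptions for illustration:

```ruby
# Hypothetical cursor table: ReplacingMergeTree keeps one row per worker,
# deduplicated on the sorting key, with updated_at as the version column.
CURSOR_TABLE_DDL = <<~SQL
  CREATE TABLE IF NOT EXISTS sync_cursors (
    worker_name LowCardinality(String),
    cursor UInt64,
    updated_at DateTime DEFAULT now()
  ) ENGINE = ReplacingMergeTree(updated_at)
  ORDER BY worker_name
SQL

def load_cursor_from_clickhouse
  # max() tolerates not-yet-merged duplicate rows from ReplacingMergeTree.
  rows = execute("SELECT max(cursor) AS cursor FROM sync_cursors WHERE worker_name = 'events_sync'")
  rows.first&.fetch('cursor', nil) || 0
end

def persist_cursor_in_clickhouse(value)
  execute("INSERT INTO sync_cursors (worker_name, cursor) VALUES ('events_sync', #{Integer(value)})")
end
```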
To avoid syncing the whole table, we could manually set the cursor to an arbitrary value (for example: the last manually synced event from an export, or an `events.id` value from one month ago).
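For instance, seeding the cursor from a Rails console could look like the following, assuming the hypothetical `persist_cursor_in_clickhouse` helper above:

```ruby
# Skip historical data: start syncing from events created in the last month.
seed_id = Event.where('created_at < ?', 1.month.ago).maximum(:id)
persist_cursor_in_clickhouse(seed_id) if seed_id
```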
Example logic implementation (pseudocode):
```ruby
# Resume from the cursor persisted by the previous job.
cursor_value = load_cursor_from_clickhouse

Event.where("id > ?", cursor_value).each_batch(of: 10_000) do |relation|
  sql = format_clickhouse_insert_query(relation) # batched INSERT for ClickHouse
  execute(sql)
  cursor_value = relation.maximum(:id)
  break if over_time? # respect the job's runtime budget
end

persist_cursor_in_clickhouse(cursor_value) # next job resumes from here
```
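The `over_time?` check could delegate to the `RuntimeLimiter` mentioned above; the exact namespace and constructor shown here are assumptions to verify against the actual class:

```ruby
# Assumed API: RuntimeLimiter.new(max_runtime) and #over_time?.
MAX_RUNTIME = 4.minutes # leave headroom inside the 5-minute schedule

def runtime_limiter
  @runtime_limiter ||= Gitlab::Metrics::RuntimeLimiter.new(MAX_RUNTIME)
end

def over_time?
  runtime_limiter.over_time?
end
```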