[CA PoC] Eventual consistency for the events table
Release note
Introducing the New ClickHouse-Based Contribution Analytics
We implemented a new analytics database, leveraging the advanced capabilities of ClickHouse, and the Contribution Analytics on GitLab.com will now run through the ClickHouse Cloud cluster.
Overview
As the PoC nears completion, the ClickHouse events table will soon be fully populated, and new events are already being inserted into ClickHouse automatically.
In this issue, we aim to address the consistency issues that come from the following sources:
- User is deleted: all events records must be deleted.
- Namespace is deleted (covers project deletion): all events records must be deleted.
- Namespace path (group hierarchy) changes (moved or deleted): the path column needs to be updated.
Idea 1: triggers and queues
We already have existing tooling for this; we might need to find a way to safely hook into these classes:
- ::Ci::ProcessSyncEventsService syncs namespace hierarchy changes to the CI DB (it was implemented as part of the DB decomposition work).
- Users::MigrateRecordsToGhostUserService handles user deletion, where the events records (PG) are destroyed in batches.
Idea 2: periodical consistency check
This approach is similar to the VSA consistency check: we periodically scan the events table in ClickHouse and detect consistency issues. When an inconsistency is detected, the service automatically fixes it.
Since the events table may contain billions of entries, scanning the entire table is not going to work well. Instead, we need to scan distinct values, such as distinct author_id and distinct path. We need to explore how to iterate over distinct elements in CH.
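One option for iterating over distinct values is keyset pagination: fetch a small batch of distinct values ordered by the column, then continue from the last value seen. The sketch below is a rough illustration for integer columns such as author_id. The method name each_distinct_integer_batch and the run_query callable are hypothetical; run_query stands in for whichever ClickHouse client we end up using and is assumed to return an array of values. Whether these queries are cheap depends on the sorting key of the events table, which still needs to be verified.

```ruby
# Hypothetical helper: yields batches of distinct values of an integer column
# using keyset pagination, so each query only returns `batch_size` values.
# `run_query` is an assumed callable: it takes a SQL string and returns an
# array of integers.
def each_distinct_integer_batch(column:, batch_size:, run_query:)
  cursor = nil

  loop do
    # Continue after the last value returned by the previous batch.
    where = cursor.nil? ? '' : "WHERE #{column} > #{Integer(cursor)} "
    sql = "SELECT DISTINCT #{column} FROM events #{where}" \
          "ORDER BY #{column} LIMIT #{Integer(batch_size)}"

    values = run_query.call(sql)
    break if values.empty?

    yield values
    cursor = values.last
  end
end
```

Each yielded batch could then be checked against PostgreSQL. Iterating over distinct path values would need the same approach with proper string quoting for the cursor.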
How to detect a hierarchy related inconsistency:
clickhouse_event = { path: '1/2/3/' }
namespace_path = Namespace.find(3).traversal_ids.join('/') + '/'
if clickhouse_event[:path] != namespace_path
  # project or namespace was relocated
end
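The comparison above can be wrapped in a small helper that also covers the case where the namespace no longer exists. This is only a sketch: expected_path and path_status are hypothetical names, and it assumes the caller looks up the namespace in a way that returns nil for a missing record (for example with find_by instead of find) and passes its traversal_ids, or nil.

```ruby
# Builds the path string in the same format as the ClickHouse path column,
# e.g. [1, 2, 3] becomes "1/2/3/".
def expected_path(traversal_ids)
  traversal_ids.join('/') + '/'
end

# Returns :deleted when the namespace no longer exists (traversal_ids is nil),
# :relocated when the stored path differs from the current hierarchy,
# and :consistent otherwise.
def path_status(event_path, traversal_ids)
  return :deleted if traversal_ids.nil?

  event_path == expected_path(traversal_ids) ? :consistent : :relocated
end
```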
How to detect deleted users:
clickhouse_event = { path: '1/2/3/', author_id: 5 }
user = User.find_by(id: clickhouse_event[:author_id])
if user.nil?
  # User was deleted
end
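Looking up users one by one would be slow for large batches of distinct author_id values, so the check would probably run per batch. A minimal sketch, assuming a hypothetical existing_ids_for callable that returns the subset of ids still present in PostgreSQL (in Rails this could be something like User.where(id: ids).pluck(:id)):

```ruby
require 'set'

# Returns the author ids from the batch that no longer exist in PostgreSQL.
# `existing_ids_for` is an assumed callable: given an array of ids, it
# returns the ids that still have a user record.
def deleted_author_ids(author_ids, existing_ids_for:)
  existing = existing_ids_for.call(author_ids).to_set
  author_ids.reject { |id| existing.include?(id) }
end
```

The returned ids are the candidates whose events records would need to be removed from ClickHouse.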
General challenges
Data modification in ClickHouse happens asynchronously, so we need to research the most efficient way of updating and deleting data.
See: https://clickhouse.com/blog/handling-updates-and-deletes-in-clickhouse
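For reference, one of the options is ClickHouse mutations (ALTER TABLE ... DELETE and ALTER TABLE ... UPDATE), which run asynchronously in the background. The sketch below only builds the SQL strings; the helper names are hypothetical and nothing here has been tested against our cluster. Note that ALTER TABLE ... UPDATE cannot modify columns that are part of the table's key, so if path is a key column the update would need a different approach.

```ruby
# Builds a mutation that deletes all events of the given authors.
# Ids are coerced with Integer() so they are safe to interpolate.
def delete_events_by_author_sql(author_ids)
  raise ArgumentError, 'author_ids must not be empty' if author_ids.empty?

  ids = author_ids.map { |id| Integer(id) }.join(', ')
  "ALTER TABLE events DELETE WHERE author_id IN (#{ids})"
end

# Builds a mutation that rewrites one stored path to another.
# Paths must consist of numeric segments, each followed by a slash
# (e.g. "1/2/3/"), which also makes them safe to interpolate.
def update_event_path_sql(old_path, new_path)
  [old_path, new_path].each do |path|
    unless path.match?(%r{\A(\d+/)+\z})
      raise ArgumentError, "unexpected path: #{path.inspect}"
    end
  end

  "ALTER TABLE events UPDATE path = '#{new_path}' WHERE path = '#{old_path}'"
end
```

This only rewrites exact path matches; events under descendant namespaces would need their path prefix rewritten as well.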