# [CA PoC] Implement the ClickHouse based data collector

## Implementation plan
Implement a new DataCollector (`Gitlab::ContributionAnalytics::ClickHouseDataCollector`) which will use the ClickHouse client to invoke SQL queries.
The ClickHouse SQL query and DB schema have already been researched in #414934 (closed).
Since there is no ActiveRecord-like interface for building SQL queries, we need to build the query as a raw string. When interpolating data (ids, dates), ensure that the values are escaped. We can use the escaping utility methods from Arel and ActiveRecord (https://api.rubyonrails.org/classes/ActiveRecord/ConnectionAdapters/Quoting.html).
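To illustrate the escaping concern, here is a minimal, self-contained sketch of a quoting helper in the spirit of `ActiveRecord::ConnectionAdapters::Quoting#quote`; the helper name and the exact escaping rules are assumptions for illustration only — the real collector should call the ActiveRecord/Arel utilities instead:

```ruby
# Hypothetical helper (not the real ActiveRecord method): wraps a value
# in single quotes and escapes backslashes and embedded quotes so it is
# safe to interpolate into a raw ClickHouse SQL string.
def quote_clickhouse_string(value)
  escaped = value.to_s
    .gsub('\\') { '\\\\' }  # escape backslashes first
    .gsub("'") { "\\'" }    # then escape single quotes
  "'#{escaped}'"
end

quote_clickhouse_string("24/")      # => "'24/'"
quote_clickhouse_string("O'Brien")  # embedded quote is escaped, not left raw
```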
The new data collector can be enabled for any group with a feature flag. The merge request should include the feature flag check and the feature flag definition.
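The flag definition would look something like the following YAML; the flag name `clickhouse_data_collection`, the group, and the blank URLs are placeholders, not decided values:

```yaml
# config/feature_flags/development/clickhouse_data_collection.yml
# (all values below are placeholders)
---
name: clickhouse_data_collection
introduced_by_url: # MR URL
rollout_issue_url: # rollout issue URL
milestone: # milestone
type: development
group: group::optimize
default_enabled: false
```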
## How
The !125296 (merged) merge request adds the `events` table; however, this table is not optimized for contribution analytics queries. The `events` table will receive all inserts and updates; the task is to create a downstream table (filled by a materialized view) that is optimized for contribution analytics.
- In the `db/click_house/main` folder, add a new SQL file prefixed with the current timestamp that contains the table creation:

  ```sql
  CREATE TABLE contribution_analytics_events
  (
      id UInt64 DEFAULT 0,
      path String DEFAULT '',
      author_id UInt64 DEFAULT 0,
      target_type LowCardinality(String) DEFAULT '',
      action UInt8 DEFAULT 0,
      created_at Date DEFAULT toDate(now()),
      updated_at DateTime64(6, 'UTC') DEFAULT now()
  )
  ENGINE = MergeTree
  ORDER BY (path, created_at, author_id, id);
  ```
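Creating the file could look like this (the timestamp prefix is illustrative only; use the time the file is actually created):

```shell
# Illustrative only: create the migration-style SQL file with a
# current-timestamp prefix under db/click_house/main.
mkdir -p db/click_house/main
timestamp=$(date +%Y%m%d%H%M%S)
touch "db/click_house/main/${timestamp}_create_contribution_analytics_events.sql"
ls db/click_house/main
```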
- Add another SQL file that "fills up" the table using a materialized view:

  ```sql
  CREATE MATERIALIZED VIEW contribution_analytics_events_mv
  TO contribution_analytics_events
  AS
  SELECT
      id,
      argMax(path, events.updated_at) AS path,
      argMax(author_id, events.updated_at) AS author_id,
      argMax(target_type, events.updated_at) AS target_type,
      argMax(action, events.updated_at) AS action,
      argMax(date(created_at), events.updated_at) AS created_at,
      max(events.updated_at) AS updated_at
  FROM events
  WHERE (("events"."action" = 5 AND "events"."target_type" = '')
      OR ("events"."action" IN (1, 3, 7, 12)
          AND "events"."target_type" IN ('MergeRequest', 'Issue')))
  GROUP BY id;
  ```
At this point, all inserts into `events` will also show up in the new `contribution_analytics_events` table.
You can query the contribution analytics data with the following query:

```sql
SELECT count(*) AS count_all,
    "contribution_analytics_events"."author_id" AS events_author_id,
    "contribution_analytics_events"."target_type" AS events_target_type,
    "contribution_analytics_events"."action" AS events_action
FROM (
    SELECT
        id,
        argMax(author_id, contribution_analytics_events.updated_at) AS author_id,
        argMax(target_type, contribution_analytics_events.updated_at) AS target_type,
        argMax(action, contribution_analytics_events.updated_at) AS action
    FROM contribution_analytics_events
    WHERE startsWith(path, '24/')
        AND "contribution_analytics_events"."created_at" >= '2022-05-01'
        AND "contribution_analytics_events"."created_at" <= '2023-05-07'
    GROUP BY id
) contribution_analytics_events
GROUP BY "contribution_analytics_events"."action",
    "contribution_analytics_events"."target_type",
    "contribution_analytics_events"."author_id";
```
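In the data collector, this query could be assembled as a raw string with the interpolated values escaped. A rough sketch, where the method name `contributions_sql` and the simplified `quote` helper are assumptions (the real code should use the ActiveRecord/Arel quoting utilities):

```ruby
# Simplified stand-in for the ActiveRecord/Arel quoting utilities.
def quote(value)
  "'#{value.to_s.gsub('\\') { '\\\\' }.gsub("'") { "\\'" }}'"
end

# Hypothetical query builder: deduplicates by id in the inner query,
# then aggregates the latest state of each event.
def contributions_sql(traversal_path:, from_date:, to_date:)
  <<~SQL
    SELECT count(*) AS count_all, author_id, target_type, action
    FROM (
      SELECT id,
        argMax(author_id, updated_at) AS author_id,
        argMax(target_type, updated_at) AS target_type,
        argMax(action, updated_at) AS action
      FROM contribution_analytics_events
      WHERE startsWith(path, #{quote(traversal_path)})
        AND created_at >= #{quote(from_date)}
        AND created_at <= #{quote(to_date)}
      GROUP BY id
    ) events
    GROUP BY action, target_type, author_id
  SQL
end

sql = contributions_sql(traversal_path: '24/',
                        from_date: '2022-05-01',
                        to_date: '2023-05-07')
# The resulting string would then be passed to the ClickHouse client,
# e.g. ClickHouse::Client.execute(sql, :main) as in the snippet later
# in this issue.
```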
Note 1: constructing the `path` value: `group.traversal_ids.join('/') + '/'`
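For example (the ids below are made up), a subgroup with id 356 under a root group with id 24 produces a path that the `startsWith(path, '24/')` filter matches:

```ruby
# Illustrative only: traversal_ids values are made up.
traversal_ids = [24, 356]  # root group id 24, subgroup id 356
path = traversal_ids.join('/') + '/'
path  # => "24/356/"
# startsWith(path, '24/') then matches every event under root group 24.
```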
Note 2: usage of the `Date` type: the `contribution_analytics_events` table casts `created_at` to `Date` to save space, which also helps query performance. We only filter the data by day, so there is no need for timestamp filters.
Note 3: complex GROUP BY query: ClickHouse does de-duplication in the background, which means that duplicate values might show up for some time, for example when the same data was imported twice. To address this problem, we need to deduplicate the data by the primary key (the inner `GROUP BY id`) and then run the aggregation (`count`).
Note 4: snippet for filling the `events` table (the materialized view propagates the rows into `contribution_analytics_events`):

```ruby
# Development-only seeding snippet: the string interpolation below is
# not escaped, which is acceptable for seeding local data, but the
# production collector must escape interpolated values (see above).
def format_row(event)
  namespace = event.project.try(:project_namespace) || event.group
  path = namespace.traversal_ids.join('/')
  action = Event.actions[event.action]

  [
    event.id,
    "'#{path}/'",
    event.author_id,
    event.target_id,
    "'#{event.target_type}'",
    action,
    event.created_at.to_f,
    event.updated_at.to_f
  ].join(',')
end

values = []
Event.all.each do |event|
  values << "(#{format_row(event)})"
end

insert_query = <<~SQL
  INSERT INTO events
  (id, path, author_id, target_id, target_type, action, created_at, updated_at)
  VALUES
  #{values.join(',')}
SQL

ClickHouse::Client.execute(insert_query, :main)
```