[ClickHouse] Automatically refresh materialized views
Problem statement
Materialized views in ClickHouse provide a great way to restructure or aggregate data and improve query performance.
Materialized views work as a trigger: as new data arrives in the source table, it is also sent to the materialized view. However, deletes and arbitrary updates are not synced, which means that over time the data in the materialized view can diverge from the source table.
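For example, a mutation on the source table leaves the materialized view's data untouched (illustrative statements only, using the `ClickHouse::Client` calls from the naive implementation below):

```ruby
# An insert into the source table is pushed to the materialized view...
ClickHouse::Client.execute('INSERT INTO events (id, action) VALUES (1, 5)', :main)

# ...but a delete is not, so the copy stored by the materialized view remains:
ClickHouse::Client.execute('ALTER TABLE events DELETE WHERE id = 1', :main)
```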
Example materialized view:
- The `events` table contains all events.
- The `contribution_analytics_events_mv` materialized view contains part of the `events` table for speeding up the Contribution Analytics feature.
- The `contribution_analytics_events_mv` materialized view stores its data in the `contribution_analytics_events` table.
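The definitions could look roughly like this (a sketch reconstructed from the backfill query in the naive implementation below; the column types and exact schema are assumptions, not the production definitions):

```ruby
# Storage table: ReplacingMergeTree so duplicate versions of a row (keyed by
# id) can be collapsed, which the rebuild process relies on (see caveats).
ClickHouse::Client.execute(<<~SQL, :main)
  CREATE TABLE contribution_analytics_events
  (
    id UInt64,
    path String,
    author_id UInt64,
    target_type String,
    action UInt8,
    created_at Date,
    updated_at DateTime64(6)
  )
  ENGINE = ReplacingMergeTree(updated_at)
  ORDER BY id
SQL

# Materialized view: triggers on inserts into `events` and writes the
# filtered rows into the storage table via the TO clause.
ClickHouse::Client.execute(<<~SQL, :main)
  CREATE MATERIALIZED VIEW contribution_analytics_events_mv
  TO contribution_analytics_events
  AS SELECT id, path, author_id, target_type, action,
            DATE(created_at) AS created_at, updated_at
  FROM events
  WHERE ((action = 5) AND (target_type = '')) OR
        ((action IN (1, 3, 7, 12)) AND (target_type IN ('MergeRequest', 'Issue')))
SQL
```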
Proposal
A periodic process could rebuild these materialized views in the background, without downtime. Unfortunately, ClickHouse doesn't provide a standard way of doing this, so we need to roll our own solution.
Algorithm:
1. Determine the materialized view name and the underlying table name (we always create MVs with a separate table).
2. Get the actual `CREATE TABLE` statements from ClickHouse.
3. Create a new table and a new materialized view with a `tmp_` prefix.
4. Once the materialized view is created, new records will start getting populated in our temporary table.
5. Copy the existing data from the source table using the same query that we used in the materialized view's `CREATE` statement.
6. Replace the old materialized view table with the new one.
7. Drop the new materialized view, since it's now pointing to the old table.
8. Drop the new table (after the swap it holds the old data).
Caveats:
- Duplicated data: the materialized view's table should deduplicate (e.g. `ReplacingMergeTree`), because step 5 can copy rows that were already inserted by the temporary materialized view between steps 3 and 5. See the read-time deduplication sketch after this list.
- Temporarily elevated disk usage, because we "duplicate" the materialized view's data.
- Temporarily slower inserts, because of the extra materialized view.
- No transactions: we need a distributed lock to prevent concurrent MV rebuilds (a locking sketch follows the list).
- Batching: for large tables a single insert might take too much time, so we'll need backfill queries that work on ranges (see the batching sketch after the naive implementation).
- Coordination: assuming this process can take a long time, we need a background worker that executes these steps with runtime limits in mind (do some work, then stop, continue later).
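With `ReplacingMergeTree`, duplicates are only collapsed by background merges at an unspecified time, so readers have to deduplicate at query time. A minimal sketch, assuming the schema sketched above (`FINAL` forces merge semantics during the read):

```ruby
# Read with FINAL so duplicate rows produced during the rebuild are collapsed
# at query time instead of waiting for a background merge.
ClickHouse::Client.select(<<~SQL, :main)
  SELECT id, path, author_id, target_type, action, created_at, updated_at
  FROM contribution_analytics_events FINAL
SQL
```

The `argMax`/`GROUP BY` pattern used by the backfill query below achieves the same effect without `FINAL`.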
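For the locking caveat, a sketch built on GitLab's `Gitlab::ExclusiveLease` (the lease key, the timeout, and the `rebuild_materialized_view` helper are placeholders):

```ruby
# Redis-backed exclusive lease so two workers can't rebuild the same MV
# concurrently; the key and TTL are illustrative.
LEASE_KEY = 'clickhouse_mv_rebuild:contribution_analytics_events_mv'

lease = Gitlab::ExclusiveLease.new(LEASE_KEY, timeout: 4.hours.to_i)

if (uuid = lease.try_obtain)
  begin
    rebuild_materialized_view # hypothetical: runs the steps from the naive implementation
  ensure
    Gitlab::ExclusiveLease.cancel(LEASE_KEY, uuid)
  end
end
# If the lease is taken, another rebuild is running; retry later.
```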
Naive implementation
```ruby
require 'securerandom'

# Fetch the CREATE statement for a table or materialized view.
def read_table_definition(table)
  show_table_query = "SHOW CREATE TABLE #{table}"
  result = ClickHouse::Client.select(show_table_query, :main)
  raise 'Table not found' if result.empty?

  result.first['statement']
end

materialized_view_name = 'contribution_analytics_events_mv'
underlying_table_name = 'contribution_analytics_events'

materialized_view_create_query = read_table_definition(materialized_view_name)
underlying_table_create_query = read_table_definition(underlying_table_name)

# Random suffix so a failed or concurrent run doesn't collide on names.
id = SecureRandom.hex(4)
tmp_materialized_view_name = "tmp_#{materialized_view_name}_#{id}"
tmp_underlying_table_name = "tmp_#{underlying_table_name}_#{id}"

puts tmp_materialized_view_name
puts tmp_underlying_table_name

# Create the temporary table.
ClickHouse::Client.execute(
  underlying_table_create_query.gsub(/\b#{Regexp.escape(underlying_table_name)}\b/, tmp_underlying_table_name),
  :main
)

# Create the temporary materialized view. Both names must be replaced: the
# MV's own name and the table name in the TO clause, so the temporary MV
# feeds the temporary table. Word boundaries keep the table name (a prefix
# of the MV name) from matching inside the MV name.
ClickHouse::Client.execute(
  materialized_view_create_query
    .gsub(/\b#{Regexp.escape(materialized_view_name)}\b/, tmp_materialized_view_name)
    .gsub(/\b#{Regexp.escape(underlying_table_name)}\b/, tmp_underlying_table_name),
  :main
)

# Copy the existing data using the same query the materialized view runs.
# TODO: figure out how to batch this
backfill_query = <<~SQL
  INSERT INTO #{tmp_underlying_table_name}
  SELECT
    id,
    argMax(path, events.updated_at) AS path,
    argMax(author_id, events.updated_at) AS author_id,
    argMax(target_type, events.updated_at) AS target_type,
    argMax(action, events.updated_at) AS action,
    argMax(DATE(created_at), events.updated_at) AS created_at,
    max(events.updated_at) AS updated_at
  FROM gitlab_clickhouse_main_production.events
  WHERE ((events.action = 5) AND (events.target_type = '')) OR
        ((events.action IN (1, 3, 7, 12)) AND (events.target_type IN ('MergeRequest', 'Issue')))
  GROUP BY id
SQL

ClickHouse::Client.execute(backfill_query, :main)

# Atomically swap the backfilled table into place, then drop the leftovers:
# after the exchange, the temporary names point at the old objects.
ClickHouse::Client.execute("EXCHANGE TABLES #{underlying_table_name} AND #{tmp_underlying_table_name}", :main)
ClickHouse::Client.execute("DROP TABLE #{tmp_underlying_table_name}", :main)
ClickHouse::Client.execute("DROP TABLE #{tmp_materialized_view_name}", :main)
```