[ClickHouse] Automatically refresh materialized views
Problem statement
Materialized views in ClickHouse provide a great way to restructure or aggregate data and improve query performance.
Materialized views work as a trigger: as new data arrives in the source table, it is also sent to the materialized view. However, deletes and arbitrary updates are not synced, which means that over time the data in the materialized view can diverge from the source table.
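For example, a mutation on the source table leaves the materialized view's data untouched (illustrative statements only, using the `ClickHouse::Client` calls from the naive implementation below):

```ruby
# An insert into the source table is pushed to the materialized view...
ClickHouse::Client.execute('INSERT INTO events (id, action) VALUES (1, 5)', :main)

# ...but a delete is not, so the copy stored by the materialized view remains:
ClickHouse::Client.execute('ALTER TABLE events DELETE WHERE id = 1', :main)
```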
Example materialized view:
- The `events` table contains all events.
- The `contribution_analytics_events_mv` materialized view contains part of the `events` table for speeding up the Contribution Analytics feature.
- The `contribution_analytics_events_mv` materialized view stores its data in the `contribution_analytics_events` table.
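The definitions could look roughly like this (a sketch reconstructed from the backfill query in the naive implementation below; the column types and exact schema are assumptions, not the production definitions):

```ruby
# Storage table: ReplacingMergeTree so duplicate versions of a row (keyed by
# id) can be collapsed, which the rebuild process relies on (see caveats).
ClickHouse::Client.execute(<<~SQL, :main)
  CREATE TABLE contribution_analytics_events
  (
    id UInt64,
    path String,
    author_id UInt64,
    target_type String,
    action UInt8,
    created_at Date,
    updated_at DateTime64(6)
  )
  ENGINE = ReplacingMergeTree(updated_at)
  ORDER BY id
SQL

# Materialized view: triggers on inserts into `events` and writes the
# filtered rows into the storage table via the TO clause.
ClickHouse::Client.execute(<<~SQL, :main)
  CREATE MATERIALIZED VIEW contribution_analytics_events_mv
  TO contribution_analytics_events
  AS SELECT id, path, author_id, target_type, action,
            DATE(created_at) AS created_at, updated_at
  FROM events
  WHERE ((action = 5) AND (target_type = '')) OR
        ((action IN (1, 3, 7, 12)) AND (target_type IN ('MergeRequest', 'Issue')))
SQL
```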
Proposal
A periodic process could rebuild these materialized views in the background, without downtime. Unfortunately, ClickHouse doesn't provide a standard way of doing this, so we need to roll our own solution.
Algorithm:
1. Determine the materialized view name and the underlying table name (we always create MVs with a separate table).
2. Get the actual `CREATE TABLE` statements from ClickHouse.
3. Create a new table and a new materialized view with a `tmp_` prefix.
4. Once the materialized view is created, new records will start getting populated in our temporary table.
5. Copy the existing data from the source table using the same query that we used in the materialized view's `CREATE` statement.
6. Replace the old materialized view table with the new one.
7. Drop the new materialized view, since it's now pointing to the old table.
8. Drop the new table (after the swap it holds the old data).
Caveats:
- Duplicated data: the materialized view's table should deduplicate (e.g. `ReplacingMergeTree`), because step 5 can copy rows that were already inserted by the temporary materialized view between steps 3 and 5. See the read-time deduplication sketch after this list.
- Temporarily elevated disk usage, because we "duplicate" the materialized view's data.
- Temporarily slower inserts, because of the extra materialized view.
- No transactions: we need a distributed lock to prevent concurrent MV rebuilds (a locking sketch follows the list).
- Batching: for large tables a single insert might take too much time, so we'll need backfill queries that work on ranges (see the batching sketch after the naive implementation).
- Coordination: assuming this process can take a long time, we need a background worker that executes these steps with runtime limits in mind (do some work, then stop, continue later).
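With `ReplacingMergeTree`, duplicates are only collapsed by background merges at an unspecified time, so readers have to deduplicate at query time. A minimal sketch, assuming the schema sketched above (`FINAL` forces merge semantics during the read):

```ruby
# Read with FINAL so duplicate rows produced during the rebuild are collapsed
# at query time instead of waiting for a background merge.
ClickHouse::Client.select(<<~SQL, :main)
  SELECT id, path, author_id, target_type, action, created_at, updated_at
  FROM contribution_analytics_events FINAL
SQL
```

The `argMax`/`GROUP BY` pattern used by the backfill query below achieves the same effect without `FINAL`.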
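For the locking caveat, a sketch built on GitLab's `Gitlab::ExclusiveLease` (the lease key, the timeout, and the `rebuild_materialized_view` helper are placeholders):

```ruby
# Redis-backed exclusive lease so two workers can't rebuild the same MV
# concurrently; the key and TTL are illustrative.
LEASE_KEY = 'clickhouse_mv_rebuild:contribution_analytics_events_mv'

lease = Gitlab::ExclusiveLease.new(LEASE_KEY, timeout: 4.hours.to_i)

if (uuid = lease.try_obtain)
  begin
    rebuild_materialized_view # hypothetical: runs the steps from the naive implementation
  ensure
    Gitlab::ExclusiveLease.cancel(LEASE_KEY, uuid)
  end
end
# If the lease is taken, another rebuild is running; retry later.
```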
Naive implementation
```ruby
require 'securerandom'

# Fetch the CREATE statement for a table or materialized view.
def read_table_definition(table)
  show_table_query = "SHOW CREATE TABLE #{table}"
  result = ClickHouse::Client.select(show_table_query, :main)
  raise 'Table not found' if result.empty?

  result.first['statement']
end

materialized_view_name = 'contribution_analytics_events_mv'
underlying_table_name = 'contribution_analytics_events'

materialized_view_create_query = read_table_definition(materialized_view_name)
underlying_table_create_query = read_table_definition(underlying_table_name)

# Random suffix so a failed or concurrent run doesn't collide on names.
id = SecureRandom.hex(4)
tmp_materialized_view_name = "tmp_#{materialized_view_name}_#{id}"
tmp_underlying_table_name = "tmp_#{underlying_table_name}_#{id}"

puts tmp_materialized_view_name
puts tmp_underlying_table_name

# Create the temporary table.
ClickHouse::Client.execute(
  underlying_table_create_query.gsub(/\b#{Regexp.escape(underlying_table_name)}\b/, tmp_underlying_table_name),
  :main
)

# Create the temporary materialized view. Both names must be replaced: the
# MV's own name and the table name in the TO clause, so the temporary MV
# feeds the temporary table. Word boundaries keep the table name (a prefix
# of the MV name) from matching inside the MV name.
ClickHouse::Client.execute(
  materialized_view_create_query
    .gsub(/\b#{Regexp.escape(materialized_view_name)}\b/, tmp_materialized_view_name)
    .gsub(/\b#{Regexp.escape(underlying_table_name)}\b/, tmp_underlying_table_name),
  :main
)

# Copy the existing data using the same query the materialized view runs.
# TODO: figure out how to batch this
backfill_query = <<~SQL
  INSERT INTO #{tmp_underlying_table_name}
  SELECT
    id,
    argMax(path, events.updated_at) AS path,
    argMax(author_id, events.updated_at) AS author_id,
    argMax(target_type, events.updated_at) AS target_type,
    argMax(action, events.updated_at) AS action,
    argMax(DATE(created_at), events.updated_at) AS created_at,
    max(events.updated_at) AS updated_at
  FROM gitlab_clickhouse_main_production.events
  WHERE ((events.action = 5) AND (events.target_type = '')) OR
        ((events.action IN (1, 3, 7, 12)) AND (events.target_type IN ('MergeRequest', 'Issue')))
  GROUP BY id
SQL

ClickHouse::Client.execute(backfill_query, :main)

# Atomically swap the backfilled table into place, then drop the leftovers:
# after the exchange, the temporary names point at the old objects.
ClickHouse::Client.execute("EXCHANGE TABLES #{underlying_table_name} AND #{tmp_underlying_table_name}", :main)
ClickHouse::Client.execute("DROP TABLE #{tmp_underlying_table_name}", :main)
ClickHouse::Client.execute("DROP TABLE #{tmp_materialized_view_name}", :main)
```