Service for rebuilding materialized view
What does this MR do and why?
This MR adds a service for rebuilding ClickHouse materialized view with fresh data. Worker and scheduling will be added as a follow-up.
Context
To get more context, please check the MV docs first: https://clickhouse.com/docs/en/guides/developer/cascading-materialized-views
In ClickHouse, materialized views are INSERT
triggers. The Contribution Analytics feature implements fast reporting queries by using a materialized view on the events
table: contributions_mv
(stores actual data in the contributions
table).
The events
table changes over time as records are updated or deleted, these changes are not picked up by the materialized view so the view becomes stale. The RebuildMaterializedViewService
can rebuild a materialized views in batches without data loss.
The service requires the following input data:
- view_name: name of the materialized view (only ReplacingMergeTree views are supported at the moment).
- tmp_view_name: what will be the temporary view name while we backfill the data.
- view_table_name: name of the table that backs the materialized view (each MV has its own table that holds data).
- tmp_view_table_name: name of the temporary table which will receive the backfill data.
- source_table_name: name of the source table where the MV gets the data.
Procedure:
- Create the tmp view and tmp table based on the existing table definitions (clone).
- At this point the tmp table will start receiving inserts.
- The materialized view contains a
SELECT
statement which is used for populating the records. This raw query can be taken from theinformation_schema
. Load raw query from the DB and store it in a variable. - Use the ClickHouse iterator to batch over source table via the id column ("primary key").
- For each batch, build a query that uses the raw query from the MV definition and replaces the
FROM
clause with our batched queries. This way, the query is limited to the given batch. - Invoke an insert query into the tmp table which inserts data from the previously built query.
- When batching is done, swap the view table and tmp view tables. (atomic)
- Clean up the
tmp
table andtmp
view.
Example batched query which will be used for the INSERT INTO
statement:
SELECT id,
argMax(PATH, events.updated_at) AS PATH,
argMax(author_id, events.updated_at) AS author_id,
argMax(target_type, events.updated_at) AS target_type,
argMax(action, events.updated_at) AS action,
argMax(DATE(created_at), events.updated_at) AS created_at,
max(events.updated_at) AS updated_at
FROM
-- this subquery comes from the batch iterator
(SELECT *
FROM "events"
WHERE "events"."id" >= 72686301
AND "events"."id" <= 82686301) "events"
WHERE ((events.action IN (5,
6))
AND (events.target_type = ''))
OR ((events.action IN (1,
3,
7,
12))
AND (events.target_type IN ('MergeRequest',
'Issue',
'WorkItem')))
GROUP BY id
The job and service can be manually triggered on staging. To make it production-ready, we need to do a following (next MR):
- Redis locking.
- Error handling.
- Stop and resume, rebuilding large MVs can take long time.
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
How to test it
- Ensure that CH is set up and migrated for your gdk: https://docs.gitlab.com/ee/development/database/clickhouse/clickhouse_within_gitlab.html
- Enable the FF:
Feature.enable(:event_sync_worker_for_click_house)
- If your GDK is seeded, you should have some
events
, invoke the sync job:ClickHouse::EventsSyncWorker.new.perform
- See if you have some data in the
contributions
materialized view table:ClickHouse::Client.select("select id from contributions final", :main)
- Delete some records by id (simulating the inconsistency change):
ClickHouse::Client.execute("delete from contributions where id > 160", :main)
- Rebuild the MV:
ClickHouse::RebuildMaterializedViewCronWorker.new.perform
- Records should be restored:
ClickHouse::Client.select("select id from contributions final", :main)
Related to #431453 (closed)