Check if table truncate task must run
## What does this MR do and why?
This MR adds a check that runs if `main` and `ci` run on different databases. It verifies that data has been truncated. If not, it advises running the rake task for truncation.
After switching to a decomposed database, it could happen that a migration fails because data has not been cleaned up. A possible scenario is this:

- The `projects` table on the CI database still has data.
- A DML migration is run for the `projects` table, fixing some duplicate values. Data migrations are only run on the `main` database.
- A DDL migration is run: a unique index is added that relies on the previous DML migration. DDL migrations are run on both databases. Since the DML migration did not run on CI, adding the index will fail on CI.
So we want to encourage self-managed customers to run the rake task `gitlab:db:truncate_legacy_tables:#{database_name}` before they upgrade their installation and run migrations.
GitLab provides a pre-upgrade rake task, which is supposed to run before a self-managed customer upgrades GitLab. In our documentation, we recommend running `sudo gitlab-rake gitlab:check` (see https://docs.gitlab.com/ee/update/plan_your_upgrade.html).
## Database performance
We check, for each to-be-truncated table, whether it contains data. As soon as we find one table with data, we return `true`. So we have two issues to solve:

- Most efficient way of determining a table is empty
- Most efficient way of checking a lot of tables
### Most efficient way of determining a table is empty
The worst-case scenario is that a large table in `main` (`projects`) has been copied to `ci`. So we want to verify that `projects` on `ci` is empty.
The most naive implementation is `SELECT COUNT(*) FROM projects`. This will take a long time because it does a parallel index scan over the whole table.

Another approach is just selecting one random row: `SELECT * FROM projects LIMIT 1`. Much better: about 5 ms.

Third approach: we do not care about the actual data, so `SELECT EXISTS (SELECT * FROM projects)`. Even better: about 1 ms.
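As a sketch (not the actual implementation in this MR), the `EXISTS` probe could be expressed like this in Ruby; `table_empty_sql` is a hypothetical helper name:

```ruby
# Build the cheap emptiness probe discussed above. EXISTS stops at the
# first row it finds, so the size of the table does not matter.
def table_empty_sql(table_name)
  "SELECT EXISTS (SELECT * FROM #{table_name})"
end

puts table_empty_sql("projects")
# In Rails this would be executed with something along the lines of:
#   ActiveRecord::Base.connection.select_value(table_empty_sql("projects"))
```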
### Most efficient way of checking a lot of tables
I considered just checking a few known tables, like `projects` or `ci_builds`. But that does not rule out that a migration causes issues in another table.
PostgreSQL keeps track of a row estimate (`pg_class.reltuples`); I considered using that, but it can still be off. It is not a reliable source if we want to be sure.
So I opted for a big `UNION` query:

```sql
(SELECT EXISTS (SELECT * FROM abuse_events))
UNION
-- ... other tables ...
(SELECT EXISTS (SELECT * FROM zoom_meetings))
```
If these 600 `main` tables in the `ci` database do have data, this query finishes in about 2 seconds. I tested this against postgres.ai, but unfortunately the query does not end up in the history.
## Screenshots or screen recordings
## How to set up and validate locally
The `shards` table belongs to the `main` schema, so it should be empty on `ci`. So we add data to the `shards` table:

- Unlock tables: `bundle exec rake gitlab:db:unlock_writes`
- Insert data into the CI database: `gdk psql -d gitlabhq_development_ci -c "insert into shards (name) values('test');"`
- Test the check script: `bundle exec rake gitlab:check`. It should output `Tables are truncated? ... no`
## MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #401679 (closed)