Come up with an alternative to Postgres cascading deletes/nullify => Loose foreign keys

Problem statement

When eliminating foreign keys between the Main DB and CI DB tables, we'll need to maintain referential integrity on the application layer. Ensuring strong consistency is not optimal due to performance reasons. (2 phase commit)

Example: cleanup associated records with ActiveRecord

Note: we do this in the PoC MR.

class Project < ApplicationRecord
  has_many :builds, class_name: 'Ci::Build', inverse_of: :project, dependent: :delete_all
end

# ...

project.destroy

The following example will not work as expected with two databases since the actual build record deletion happens outside of the scope of the transaction on the Main DB (where the projects table is located). The cross_database_modification_check utility helps to locate these cases.

Transaction begins on the Main database.
Delete the project record.
Delete the associated ci_builds rows.
Transaction committed on the Main database.

The deletion in the ci_builds table is actually not part of the transaction. If a rollback is initiated after the ci_builds records are deleted, those rows will not be restored.

Wrapping the operation with an extra transaction on the CI DB would address the problem however, it may cause long-running transactions which could affect the availability of GitLab.

Current plan

Asynchronous data cleanup using Postgres triggers and a queuing table in Postgres that is picked up by a sidekiq cron worker.

You can see more details in the thread #338535 (comment 655802260) and other comments below that.

The current Draft MR is at !69165 (merged) .

Overall we are referring to this concept as "Loose foreign keys".

Previous Ideas: eventual consistency

Ideas

Option 1: Extend ActiveRecord

By extending ActiveRecord's dependent option, we could remove the associated records in an after commit callback:

class Project < ApplicationRecord
  has_many :builds, class_name: 'Ci::Build', inverse_of: :project, dependent: :delete_all_after_commit
end

In this case, the deletion happens after the main transaction is committed. At the point where the project row is already deleted.

Option 2: Extra module

The change above requires monkey-patching ActiveRecord (contribute upstream later). We can achieve the same thing via a simple Module:

class Project < ApplicationRecord
  include EventuallyConsistentAssociation

  has_many :builds, class_name: 'Ci::Build', inverse_of: :project

  # Requires an association named `builds` to be present
  cleanup_after_commit :builds, with: :delete # or :nullify 
end

Both approaches would mitigate the "association removal inside the transaction" problem however, there is a chance where the cleanup would not finish, for example, the application crashes while deleting the rows. Keeping unused and inaccessible rows in the database costs us disk space and it can cause inconsistency issues later.

To address this problem, a periodical cleanup job is needed to detect orphaned records and delete them efficiently (batching). By using the dependent: :delete_all_after_commit or cleanup_after_commit methods, we can easily detect all ActiveRecord models that needs cleanup and we can build a generic cleanup runner.

# pseudo code
Project.each_batch do |projects|
  min, max, count = projects.pluck('MIN(id), MAX(id), COUNT(id)').first

  # Note: distinct will be slow, we'll need to use loose-index scan
  build_project_id_count = Ci::Build.distinct(:project_id).where(id: min..max).count

  if count != build_project_id_count
    # cleanup is needed
  end
end

Running such a job can be time-consuming so we need to ensure that we scan the minimum amount of rows. In case we want to be eventually consistent between projects and ci_builds we need about 2N index lookups, where the N is the number of projects in the GitLab instance.

Making the cleanup runner more efficient

Let's track the deleted projects and other referenced models from the main DB in a separated table: deleted_$table_name. We create a table for each Model where we have FKs crossing the DB boundary.

Example: deleted_projects

When a project is deleted, during the transaction which deletes the Project record we insert a new row to deleted_projects:

project_id
2
6
7

After the transaction commits, we trigger a background job. Or a scheduled job will pick up the deleted project id or ids.

For each id, the job will perform DELETE statements in the CI DB. When all queries are finished, we can safely clean up the record from deleted_projects. This approach eliminates the periodical table scanning. We might want to run it through from time to time to make sure we cleaned up everything.

Edited Sep 01, 2021 by Dylan Griffith