Add Foreign Key on projects.namespaces_id (!31675) · Merge requests · GitLab.org / GitLab

Yannis Roussos requested to merge 198603-add-foreign-key-on-projects-namespace-id-and-clean-up-ghost-projects into master May 11, 2020

What does this MR do?

Related to #198603 (closed)

As seen in #198593 (comment 276245548), projects can be orphaned when their namespace entries get removed.

At the moment, we have 66 orphaned projects in production:

=> SELECT COUNT(*)
-> FROM projects
-> WHERE NOT EXISTS
-> (
(>  SELECT 1
(>  FROM namespaces
(>  WHERE projects.namespace_id = namespaces.id
(> );

 count
-------
    66

This MR addresses this as follows:

Adds a Foreign Key constraint from projects.namespace_id to namespaces.id that will restrict namespaces from being deleted without all the projects that reference them being deleted first.

The foreign key is initially added with validate: false because there are inconsistent values for the FK column. But even without validating the FK, adding it stops any new updates (deletes) or inserts from breaking the referential integrity enforced by the Foreign Key.

Notice: The FK is added as an on_delete: :restrict foreign key. The reason for that is twofold:
1. There is already a has_many :projects, dependent: :destroy in the Namespace model that we decided in #198603 (closed) to leave as is (Project has a couple before_destroy and after_destroy callbacks that do necessary cleanup at various levels). Adding an on_delete: :cascade would most probably solve the issue of the orphaned projects, but would do so without proper cleanup (projects have a lot of stuff attached to them, both on the database and other layers).
2. We can not leave the Foreign Key without an on_delete directive as add_concurrent_foreign_key sets it to :cascade otherwise.
  
  on_delete: :restrict is more or less similar to a Postgres Foreign Key without an ON DELETE clause, with some minor differences that do not affect the way we run our migrations:
  
  [..] RESTRICT prevents deletion of a referenced row. NO ACTION means that if any referencing rows still exist when the constraint is checked, an error is raised; this is the default behavior if you do not specify anything. The essential difference between these two choices is that NO ACTION allows the check to be deferred until later in the transaction, whereas RESTRICT does not. [..]
Fixes existing inconsistencies:
- Creates a lost-and-found group owned by the Ghost User
- Finds all the Orphaned projects (i.e. those with a namespace_id not matching any namespace)
- Makes them private, marks them as archived and moves them under the lost-and-found group
After all the records in projects contain valid namespace_id values, the Foreign Key is validated.

The idea is that we take the inconsistent orphaned projects out of the circulation, so we can delete them at a later stage using Project.destroy or keep them around as we try to figure out what caused them not to be deleted when the dependent: :destroy fired on the related namespace.

Discussion on Migration scheduling

All three migrations are scheduled as POST migrations.

(a) The rational for using a post migration for adding the Foreign Key is that it is added to the projects table and references the namespaces table.

Both are very large, high-traffic tables and a SHARE ROW EXCLUSIVE lock (which conflicts with and is locked out by writes) is required on the target (namespaces) table.

We have seen in the past migrations waiting to acquire locks on such tables for more than the accepted maximum 3 minutes that we expect regular migrations to last. For that reason this migration has been set as a post migration.

The other two migrations naturally have to follow that and are also scheduled as POST migrations as they have to always run after the first one.

(b) The data migration is not using a Background Migration because there are only 66 affected records on GitLab.com

We iterate over projects with a BATCH_SIZE=10K. With 13.6M projects on GitLab.com ~1360 iterations will be run:

each requires on average ~160ms for checking if orphaned projects are inside the batch (total time required ~220 seconds)
worst case scenario is that 66 of those batches will trigger an update (~200ms each for a total time of ~14 seconds)

In general, we expect less than 5% (=66/13.6M x 10K) of batches to trigger an update
Expected total run time: ~235 seconds (== 220 seconds + 14 seconds)

We should not expect other GitLab instances to have way more orphaned projects, so the expectations set above should hold for even the largest self-hosted instances as well.

Note: we could just update all the projects at once, but that would require a very expensive query for finding all orphaned projects at once (explain run on GitLab.com production):

EXPLAIN (ANALYZE, BUFFERS)
SELECT projects.id
FROM projects
WHERE NOT EXISTS
(
 SELECT 1
 FROM namespaces
 WHERE projects.namespace_id = namespaces.id
)

QUERY PLAN (https://explain.depesz.com/s/LSyG)

 Gather  (cost=1000.87..468311.89 rows=1 width=4) (actual time=3351.870..8365.334 rows=66 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=20560807 read=33993
   I/O Timings: read=2063.350
   ->  Merge Anti Join  (cost=0.87..467311.79 rows=1 width=4) (actual time=4135.495..8047.470 rows=22 loops=3)
         Merge Cond: (projects.namespace_id = namespaces.id)
         Buffers: shared hit=20560807 read=33993
         I/O Timings: read=2063.350
         ->  Parallel Index Only Scan using index_projects_on_namespace_id_and_id on projects  (cost=0.43..204457.93 rows=5658434 width=8)
(actual time=0.078..3384.464 rows=4525693 loops=3)
               Heap Fetches: 398434
               Buffers: shared hit=11690056 read=22228
               I/O Timings: read=1471.513
         ->  Index Only Scan using namespaces_pkey on namespaces  (cost=0.43..174315.17 rows=7123305 width=4) (actual time=0.052..3528.588
rows=7130030 loops=3)
               Heap Fetches: 2848729
               Buffers: shared hit=8870751 read=11765
               I/O Timings: read=591.837
 Planning Time: 0.728 ms
 Execution Time: 8365.482 ms

(c) The final migration to validate the Foreign Key is scheduled in the same release, after the other two migrations are done.

As the data migration is pretty fast and we have not used a background migration, there seems to be no reason to delay the validation and execute it in a followup release.

Migrations running

$ bundle exec rake db:migrate
== 20200511080113 AddProjectsForeignKeyToNamespaces: migrating ================
-- add_foreign_key(:projects, :namespaces, {:column=>:namespace_id, :on_delete=>:restrict, :validate=>false, :name=>"fk_projects_namespace_id"})
   -> 0.0031s
== 20200511080113 AddProjectsForeignKeyToNamespaces: migrated (0.0088s) =======

== 20200511083541 CleanupProjectsWithMissingNamespace: migrating ==============
== 20200511083541 CleanupProjectsWithMissingNamespace: migrated (0.1507s) =====

== 20200511220023 ValidateProjectsForeignKeyToNamespaces: migrating ===========
-- foreign_keys(:projects)
   -> 0.0055s
-- execute("ALTER TABLE projects VALIDATE CONSTRAINT fk_projects_namespace_id;")
   -> 0.0012s
== 20200511220023 ValidateProjectsForeignKeyToNamespaces: migrated (0.0072s) ==

Down

$ bundle exec rake db:rollback STEP=3
== 20200511220023 ValidateProjectsForeignKeyToNamespaces: reverting ===========
== 20200511220023 ValidateProjectsForeignKeyToNamespaces: reverted (0.0000s) ==

== 20200511083541 CleanupProjectsWithMissingNamespace: reverting ==============
== 20200511083541 CleanupProjectsWithMissingNamespace: reverted (0.0000s) =====

== 20200511080113 AddProjectsForeignKeyToNamespaces: reverting ================
-- foreign_keys(:projects)
   -> 0.0037s
-- remove_foreign_key(:projects, {:column=>:namespace_id, :name=>"fk_projects_namespace_id"})
   -> 0.0045s
== 20200511080113 AddProjectsForeignKeyToNamespaces: reverted (0.0130s) =======

When everything is done, we have the following FK constraint active in the database:

gitlabhq_development=# \d projects
...
Foreign-key constraints:
    "fk_projects_namespace_id" FOREIGN KEY (namespace_id) REFERENCES namespaces(id) ON DELETE RESTRICT
...

As expected, the Foreign Key between when the first migration finishes and when the third migration ends is set as NOT VALID

gitlabhq_development=# \d projects
...
Foreign-key constraints:
    "fk_projects_namespace_id" FOREIGN KEY (namespace_id) REFERENCES namespaces(id) ON DELETE RESTRICT NOT VALID
...

Data Migration Implementation Details

The data migration is pretty simple (move all orphaned projects under a new group owned by the Ghost user) but there are some practical considerations that increase its size and force us to duplicate code from our models:

The Ghost User is not guaranteed to exist in all instances
1. We have to create the Ghost user if it does not already exist (or fetch the existing one).
  
  That includes its namespace and everything else generated when User.ghost runs.
  
  The data migration makes sure that the generated user, namespace and user_datails are the same as when User.ghost runs.
2. We have to take into account the the user name and/or its email is already used (that's why we have to duplicate the unique_internal and create_unique_internal while skipping any calls to Users::UpdateService and setting everything else manually).
While creating the lost-and-found group, we have to also make sure that it is unique. We have to be mindful that there may be other groups with that name (in GitLab.com for example, the earliest option available right now is lost-and-found3)

We assign the Ghost User as the owner of the group, so we have to also cover the Member and GroupMember models.

Finally, we have to take into account the fact that the group may be lost-and-found123 while fetching it (but, by design, only one group with that name template may be owned by the Ghost User, which makes our life easier).
The rest of the data migration is pretty simple with the only additional choice made to set all the orphaned projects to private and archived and add their id as a suffix to their name and path so that they are unique inside the lost-and-found group.

The reason for that is that we expect project names to be unique inside their namespace and we could have two orphaned projects with the same name from two different namespaces.

Proof of Concept

Before running any migration, 10 test projects were created with invalid namespace IDs.

gitlabhq_development=# SELECT id, name, path, namespace_id, archived, visibility_level FROM projects WHERE id >= 62;
 id | name  | path  | namespace_id | archived | visibility_level
----+-------+-------+--------------+----------+------------------
 62 | test1 | test1 |        11111 | f        |               10
 63 | test1 | test1 |        11112 | f        |               20
 64 | test2 | test2 |        22222 | f        |               10
 65 | test3 | test3 |        33333 | f        |               20
 66 | test4 | test4 |        44444 | f        |               10
 67 | test5 | test5 |        55555 | f        |               20
 68 | test6 | test6 |        66666 | f        |               10
 69 | test7 | test7 |        77771 | f        |               10
 70 | test7 | test7 |        77772 | f        |               20
 71 | test8 | test8 |        88888 | f        |                0

There are two pairs of projects with the same name (test1 and test7) to showcase the edge case of having orphaned projects with the same name.

In this test setup, there is no Ghost User already created and there are already two namespaces whose names start with lost-and-found.

After the migrations run, the Ghost User and the new Group were properly created, and the Ghost User is the owner of the group:

gitlabhq_development=# SELECT id, name, email FROM users WHERE user_type = 5;
 id |    name    |       email
----+------------+-------------------
 55 | Ghost User | ghost@example.com

gitlabhq_development=# SELECT id, name, type FROM namespaces WHERE id = 72;
 id |      name       | type
----+-----------------+-------
 72 | lost-and-found3 | Group

gitlabhq_development=# SELECT type, source_type, source_id, user_id, access_level FROM members WHERE source_id = 72;
    type     | source_type | source_id | user_id | access_level
-------------+-------------+-----------+---------+--------------
 GroupMember | Namespace   |        72 |      55 |           50

Finally, all the orphaned projects were properly updated and moved to the new group:

gitlabhq_development=# SELECT id, name, path, namespace_id, archived, visibility_level FROM projects WHERE id >= 62;
 id |   name   |   path   | namespace_id | archived | visibility_level
----+----------+----------+--------------+----------+------------------
 62 | test1_62 | test1_62 |           72 | t        |                0
 63 | test1_63 | test1_63 |           72 | t        |                0
 64 | test2_64 | test2_64 |           72 | t        |                0
 65 | test3_65 | test3_65 |           72 | t        |                0
 66 | test4_66 | test4_66 |           72 | t        |                0
 67 | test5_67 | test5_67 |           72 | t        |                0
 68 | test6_68 | test6_68 |           72 | t        |                0
 69 | test7_69 | test7_69 |           72 | t        |                0
 70 | test7_70 | test7_70 |           72 | t        |                0
 71 | test8_71 | test8_71 |           72 | t        |                0

I have additionally checked that the Ghost User and the respective lost-and-found group have the same values for all their attributes as when we generate them by using the code in our models.

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

[-] Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
[-] Tested in all supported browsers
[-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

[-] Label as security and @ mention @gitlab-com/gl-security/appsec
[-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
[-] Security reports checked/validated by a reviewer from the AppSec team

Edited May 31, 2022 by 🤖 GitLab Bot 🤖

Add Foreign Key on projects.namespaces_id