Add Foreign Key on projects.namespaces_id
What does this MR do?
Related to #198603 (closed)
As seen in #198593 (comment 276245548), projects can be orphaned when their namespace entries get removed.
At the moment, we have 66 orphaned projects in production:
=> SELECT COUNT(*)
-> FROM projects
-> WHERE NOT EXISTS
-> (
(> SELECT 1
(> FROM namespaces
(> WHERE projects.namespace_id = namespaces.id
(> );
count
-------
66
This MR addresses this as follows:
-
Adds a Foreign Key constraint from
projects.namespace_id
tonamespaces.id
that will restrict namespaces from being deleted without all the projects that reference them being deleted first.The foreign key is initially added with
validate: false
because there are inconsistent values for the FK column. But even without validating the FK, adding it stops any new updates (deletes) or inserts from breaking the referential integrity enforced by the Foreign Key.Notice: The FK is added as an
on_delete: :restrict
foreign key. The reason for that is twofold:-
There is already a
has_many :projects, dependent: :destroy
in theNamespace
model that we decided in #198603 (closed) to leave as is (Project
has a couplebefore_destroy
andafter_destroy
callbacks that do necessary cleanup at various levels). Adding anon_delete: :cascade
would most probably solve the issue of the orphaned projects, but would do so without proper cleanup (projects have a lot of stuff attached to them, both on the database and other layers). -
We can not leave the Foreign Key without an
on_delete
directive asadd_concurrent_foreign_key
sets it to:cascade
otherwise.on_delete: :restrict
is more or less similar to a Postgres Foreign Key without anON DELETE
clause, with some minor differences that do not affect the way we run our migrations:[..] RESTRICT prevents deletion of a referenced row. NO ACTION means that if any referencing rows still exist when the constraint is checked, an error is raised; this is the default behavior if you do not specify anything. The essential difference between these two choices is that NO ACTION allows the check to be deferred until later in the transaction, whereas RESTRICT does not. [..]
-
-
Fixes existing inconsistencies:
- Creates a
lost-and-found
group owned by the Ghost User - Finds all the Orphaned projects (i.e. those with a
namespace_id
not matching any namespace) - Makes them private, marks them as
archived
and moves them under thelost-and-found
group
- Creates a
-
After all the records in
projects
contain validnamespace_id
values, the Foreign Key is validated.
The idea is that we take the inconsistent orphaned projects out of the circulation, so we can delete them at a later stage using Project.destroy
or keep them around as we try to figure out what caused them not to be deleted when the dependent: :destroy
fired on the related namespace.
Discussion on Migration scheduling
All three migrations are scheduled as POST migrations.
(a) The rational for using a post migration for adding the Foreign Key is that it is added to the projects
table and references the namespaces
table.
Both are very large, high-traffic tables and a SHARE ROW EXCLUSIVE
lock (which conflicts with and is locked out by writes) is required on the target (namespaces
) table.
We have seen in the past migrations waiting to acquire locks on such tables for more than the accepted maximum 3 minutes that we expect regular migrations to last. For that reason this migration has been set as a post migration.
The other two migrations naturally have to follow that and are also scheduled as POST migrations as they have to always run after the first one.
(b) The data migration is not using a Background Migration because there are only 66 affected records on GitLab.com
We iterate over projects with a BATCH_SIZE=10K
. With 13.6M projects on GitLab.com ~1360 iterations will be run:
-
each requires on average ~160ms for checking if orphaned projects are inside the batch (total time required ~220 seconds)
-
worst case scenario is that 66 of those batches will trigger an update (~200ms each for a total time of ~14 seconds)
In general, we expect less than 5% (=66/13.6M x 10K) of batches to trigger an update
-
Expected total run time: ~235 seconds (== 220 seconds + 14 seconds)
We should not expect other GitLab instances to have way more orphaned projects, so the expectations set above should hold for even the largest self-hosted instances as well.
Note: we could just update all the projects at once, but that would require a very expensive query for finding all orphaned projects at once (explain run on GitLab.com production):
EXPLAIN (ANALYZE, BUFFERS)
SELECT projects.id
FROM projects
WHERE NOT EXISTS
(
SELECT 1
FROM namespaces
WHERE projects.namespace_id = namespaces.id
)
QUERY PLAN (https://explain.depesz.com/s/LSyG)
Gather (cost=1000.87..468311.89 rows=1 width=4) (actual time=3351.870..8365.334 rows=66 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=20560807 read=33993
I/O Timings: read=2063.350
-> Merge Anti Join (cost=0.87..467311.79 rows=1 width=4) (actual time=4135.495..8047.470 rows=22 loops=3)
Merge Cond: (projects.namespace_id = namespaces.id)
Buffers: shared hit=20560807 read=33993
I/O Timings: read=2063.350
-> Parallel Index Only Scan using index_projects_on_namespace_id_and_id on projects (cost=0.43..204457.93 rows=5658434 width=8)
(actual time=0.078..3384.464 rows=4525693 loops=3)
Heap Fetches: 398434
Buffers: shared hit=11690056 read=22228
I/O Timings: read=1471.513
-> Index Only Scan using namespaces_pkey on namespaces (cost=0.43..174315.17 rows=7123305 width=4) (actual time=0.052..3528.588
rows=7130030 loops=3)
Heap Fetches: 2848729
Buffers: shared hit=8870751 read=11765
I/O Timings: read=591.837
Planning Time: 0.728 ms
Execution Time: 8365.482 ms
(c) The final migration to validate the Foreign Key is scheduled in the same release, after the other two migrations are done.
As the data migration is pretty fast and we have not used a background migration, there seems to be no reason to delay the validation and execute it in a followup release.
Migrations running
Up
$ bundle exec rake db:migrate
== 20200511080113 AddProjectsForeignKeyToNamespaces: migrating ================
-- add_foreign_key(:projects, :namespaces, {:column=>:namespace_id, :on_delete=>:restrict, :validate=>false, :name=>"fk_projects_namespace_id"})
-> 0.0031s
== 20200511080113 AddProjectsForeignKeyToNamespaces: migrated (0.0088s) =======
== 20200511083541 CleanupProjectsWithMissingNamespace: migrating ==============
== 20200511083541 CleanupProjectsWithMissingNamespace: migrated (0.1507s) =====
== 20200511220023 ValidateProjectsForeignKeyToNamespaces: migrating ===========
-- foreign_keys(:projects)
-> 0.0055s
-- execute("ALTER TABLE projects VALIDATE CONSTRAINT fk_projects_namespace_id;")
-> 0.0012s
== 20200511220023 ValidateProjectsForeignKeyToNamespaces: migrated (0.0072s) ==
Down
$ bundle exec rake db:rollback STEP=3
== 20200511220023 ValidateProjectsForeignKeyToNamespaces: reverting ===========
== 20200511220023 ValidateProjectsForeignKeyToNamespaces: reverted (0.0000s) ==
== 20200511083541 CleanupProjectsWithMissingNamespace: reverting ==============
== 20200511083541 CleanupProjectsWithMissingNamespace: reverted (0.0000s) =====
== 20200511080113 AddProjectsForeignKeyToNamespaces: reverting ================
-- foreign_keys(:projects)
-> 0.0037s
-- remove_foreign_key(:projects, {:column=>:namespace_id, :name=>"fk_projects_namespace_id"})
-> 0.0045s
== 20200511080113 AddProjectsForeignKeyToNamespaces: reverted (0.0130s) =======
When everything is done, we have the following FK constraint active in the database:
gitlabhq_development=# \d projects
...
Foreign-key constraints:
"fk_projects_namespace_id" FOREIGN KEY (namespace_id) REFERENCES namespaces(id) ON DELETE RESTRICT
...
As expected, the Foreign Key between when the first migration finishes and when the third migration ends is set as NOT VALID
gitlabhq_development=# \d projects
...
Foreign-key constraints:
"fk_projects_namespace_id" FOREIGN KEY (namespace_id) REFERENCES namespaces(id) ON DELETE RESTRICT NOT VALID
...
Data Migration Implementation Details
The data migration is pretty simple (move all orphaned projects under a new group owned by the Ghost user) but there are some practical considerations that increase its size and force us to duplicate code from our models:
-
The Ghost User is not guaranteed to exist in all instances
-
We have to create the Ghost user if it does not already exist (or fetch the existing one).
That includes its namespace and everything else generated when
User.ghost
runs.The data migration makes sure that the generated
user
,namespace
anduser_datails
are the same as whenUser.ghost
runs. -
We have to take into account the the user name and/or its email is already used (that's why we have to duplicate the
unique_internal
andcreate_unique_internal
while skipping any calls toUsers::UpdateService
and setting everything else manually).
-
-
While creating the
lost-and-found
group, we have to also make sure that it is unique. We have to be mindful that there may be other groups with that name (in GitLab.com for example, the earliest option available right now islost-and-found3
)We assign the Ghost User as the owner of the group, so we have to also cover the
Member
andGroupMember
models.Finally, we have to take into account the fact that the group may be
lost-and-found123
while fetching it (but, by design, only one group with that name template may be owned by the Ghost User, which makes our life easier). -
The rest of the data migration is pretty simple with the only additional choice made to set all the orphaned projects to private and archived and add their id as a suffix to their name and path so that they are unique inside the
lost-and-found
group.The reason for that is that we expect project names to be unique inside their namespace and we could have two orphaned projects with the same name from two different namespaces.
Proof of Concept
Before running any migration, 10 test projects were created with invalid namespace IDs.
gitlabhq_development=# SELECT id, name, path, namespace_id, archived, visibility_level FROM projects WHERE id >= 62;
id | name | path | namespace_id | archived | visibility_level
----+-------+-------+--------------+----------+------------------
62 | test1 | test1 | 11111 | f | 10
63 | test1 | test1 | 11112 | f | 20
64 | test2 | test2 | 22222 | f | 10
65 | test3 | test3 | 33333 | f | 20
66 | test4 | test4 | 44444 | f | 10
67 | test5 | test5 | 55555 | f | 20
68 | test6 | test6 | 66666 | f | 10
69 | test7 | test7 | 77771 | f | 10
70 | test7 | test7 | 77772 | f | 20
71 | test8 | test8 | 88888 | f | 0
There are two pairs of projects with the same name (test1
and test7
) to showcase the edge case of having orphaned projects with the same name.
In this test setup, there is no Ghost User already created and there are already two namespaces whose names start with lost-and-found
.
After the migrations run, the Ghost User and the new Group were properly created, and the Ghost User is the owner of the group:
gitlabhq_development=# SELECT id, name, email FROM users WHERE user_type = 5;
id | name | email
----+------------+-------------------
55 | Ghost User | ghost@example.com
gitlabhq_development=# SELECT id, name, type FROM namespaces WHERE id = 72;
id | name | type
----+-----------------+-------
72 | lost-and-found3 | Group
gitlabhq_development=# SELECT type, source_type, source_id, user_id, access_level FROM members WHERE source_id = 72;
type | source_type | source_id | user_id | access_level
-------------+-------------+-----------+---------+--------------
GroupMember | Namespace | 72 | 55 | 50
Finally, all the orphaned projects were properly updated and moved to the new group:
gitlabhq_development=# SELECT id, name, path, namespace_id, archived, visibility_level FROM projects WHERE id >= 62;
id | name | path | namespace_id | archived | visibility_level
----+----------+----------+--------------+----------+------------------
62 | test1_62 | test1_62 | 72 | t | 0
63 | test1_63 | test1_63 | 72 | t | 0
64 | test2_64 | test2_64 | 72 | t | 0
65 | test3_65 | test3_65 | 72 | t | 0
66 | test4_66 | test4_66 | 72 | t | 0
67 | test5_67 | test5_67 | 72 | t | 0
68 | test6_68 | test6_68 | 72 | t | 0
69 | test7_69 | test7_69 | 72 | t | 0
70 | test7_70 | test7_70 | 72 | t | 0
71 | test8_71 | test8_71 | 72 | t | 0
I have additionally checked that the Ghost User and the respective lost-and-found
group have the same values for all their attributes as when we generate them by using the code in our models.
Does this MR meet the acceptance criteria?
Conformity
-
Changelog entry - [-] Documentation (if required)
-
Code review guidelines -
Merge request performance guidelines -
Style guides -
Database guides -
Separation of EE specific content
Availability and Testing
- [-] Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- [-] Tested in all supported browsers
- [-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
- [-] Label as security and @ mention
@gitlab-com/gl-security/appsec
- [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- [-] Security reports checked/validated by a reviewer from the AppSec team