Draft: POC Enforce Organization Isolation based on `organization_id` on every table
What does this MR do and why?
Related to #394800 (closed) . This is a POC to evaluate whether we can enforce organization isolation using something like foreign keys. The idea is that every table has an organization_id
and every existing foreign key is converted to also having organization_id
as part of a composite foreign key. So this would basically mean both sides of every foreign key must have the same organization_id
.
For example:
ALTER TABLE ONLY issue_links
ADD CONSTRAINT org_fk_c900194ff2 FOREIGN KEY (organization_id, source_id) REFERENCES issues(organization_id, id) ON UPDATE CASCADE ON DELETE CASCADE;
ALTER TABLE ONLY issue_links
ADD CONSTRAINT org_fk_e71bb44f1f FOREIGN KEY (organization_id, target_id) REFERENCES issues(organization_id, id) ON UPDATE CASCADE ON DELETE CASCADE;
This would ensure that issue links only exist between issues belonging to the same organization.
After deeper investigation we realised we don't need full foreign keys because they have 2 additional costs:
- They require a unique index on the referenced columns. It's expensive to build hundreds of new indexes and all of them would already be unique based on the fact that they are composite of another foreign key so it's just wasteful.
- We don't need to validate all existing rows. All existing rows will have
organization_id=1
so it's wasteful having Postgres validating hundreds of foreign keys that are definitely valid
So we realised that we can implement the equivalent functionality we need with a Postgres trigger (foreign keys are implemented behind the scenes as a CONSTRAINT TRIGGER
anyway).
TODO
-
Add organization_id
column to every table -
Get trigger approach working -
Implement a service that can (given a single organization_id
) find all violations of cross-organization data. This could use the foreign keys we used to create the triggers but it could additionally use loose foreign keys.-
Run periodic worker that iterates over all organization_id
(except for 1 because it will be caught by the other ids) and logs violations -
This service data could also be fetched and displayed on some organization page so that they will be able to know what data may not be working correctly -
Could this service also be repurposed for "proposed" organization moves? Could be something like using the group transfer service and then detecting all the violations. Could we do it in a transaction or something?
-
-
Test out implementing the parent side of the foreign key constraint. This would fail to UPDATE
theorganization_id
in the case of their being existing references to the current primary key. We probably don't need "ON DELETE CASCADE" because that would already be covered by the original non-composite FK. We just need to block updating theorganization_id
unless the references are already updated. How do we solve the chicken and egg problem in a single transaction? -
Is it possible that this kind of tooling is only necessary for foreign keys that could theoretically span multiple organizations? If specific foreign keys were known to imply data that belongs to the same namespace/project
then we would also be confident that they must be in the same organization? So then maybe there is just a class of foreign keys likeissue_issue_links.source_id/target_id
andmerge_requests.source_project_id/target_project_id
that we need to be concerned with? Maybe the generalized solution is only tables that contain multiple foreign keys need to be considered. Anything that has only a single foreign key is considered to be hierachical and must "belong to" the parent record and presumably couldn't logically have a parent in another organization.
Statistics
Storage Added | How it was calculated? | |
---|---|---|
All new columns | 313 GiB |
8 * reltuples of all tables as it adds 8 bytes for every tuple in the DB |
All new indexes | 1.8 TiB |
2 * 3 * above value. This is because it will create a new index of 2 8 byte columns and the value of 3 was just an observation that a few of our indexes are 2-3 times the size of the columns they store. Index size can be from bloat, the actual values and the extra tuple data needed to maintain the index |
Total added | 2.1 TiB = 11% |
Challenges
-
How to convert the FK violations into standard user facing errors -
How to ensure the correct organization_id
is set when creating every record. Need some standard way to extract it from a parent but this may be difficult with code likeFooBar.create!(project_id: project.id, some_column: some_value)
. We could consider adding it toSafeRequestStore
and set it on all models in a hook-
If we have a sticky Postgres connection we could set a value in the session. Then a trigger could rewrite the query or set the value on the tuple -
Or we could amend all INSERT
queries in a query analyzer withorganization_id = ...
-
-
How to handle loose foreign key references -
This MR shows we have ~800 foreign keys that can enforce these constraints. There are only 90 loose foreign keys. Maybe normal foreign keys are good enough. But the question remains around whether we might want to migrate more foreign keys to LFK in the future and we can't validate the data integrity with LFKs. -
Async data validation doesn't seem very useful. We can't present the error when the violation happens and have no real way of dealing with the violating data -
LFKs could be implemented as blocking in development to catch bugs. In production it could be a periodic job that allows us to catch data issues retrospectively and potentially catch application bugs
-
-
-
Is the volume of data even reasonable? -
Is it possible to actually create migrations to create ~600 new columns and indexes and ~800 foreign keys -
How long would it take to run this migration in async indexes? -
Maybe we don't need the index at all since the lookup by id would be an index scan. Downside is that it's not an "index only scan" because we need to read the tuple to get the organization_id
=> more IO on create/update/delete
-
-
How long would it take to run this migration if there was downtime? -
Is it possible to trick Postgres into just believing a foreign key is valid -
Underlying foreign key implementation is just foreign keys. So we could instead just use a trigger instead of validating everything. -
Or we could add a NOT VALID
foreign key and just leave it invalid forever
-
-
How long is index building?
-
-
WHERE organization_id
Possible alternative all read queries have - Downside is all instance wide things like periodic sidekiq workers will have to change
- We also might lose all index only scans because none of the indexes would have
organization_id
in them
Screenshots or screen recordings
Screenshots are required for UI changes, and strongly recommended for all other merge requests.
Before | After |
---|---|
How to set up and validate locally
Numbered steps to set up and validate the change are strongly suggested.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.