Praefect should identify repositories by a persistent unique ID

Praefect currently identifies repositories by their (virtual_storage, relative_path) tuple. This can be problematic:

If a repository's relative path is changed, we need to update all references to the repository in the database. GitLab occasionally changes the relative path. This seems to be done at least for repository deletions and geo replications. This is not ideal as in addition to updating the database references, we need to update the paths on the disk. This is a fairly dangerous operation right now as the path change could succeed on disk but fail to be updated in the database. Praefect would have then lost the repositories. This is a problem already for some existing issues: #3256 (closed).

To fix that, one could use the project IDs from GitLab to identify the repositories and have Praefect manage the relative paths. That way Praefect could simply ignore any relative path changes, as it should not be the client's business to decide where the repositories get stored. The client cares about accessing the correct repository.

Relying on the project IDs from GitLab would be problematic as well though:

If a repository is removed from the virtual storage and then added back again, the existing state of the removed repository might interfere with the newly created one. This can happen for example when migrating the repository off the virtual storage to another repository storage and then back. Since the repositories are identified by (virtual_storage, relative_path), the migrated repository has the same ID as the previously removed one. This case was causing issues in: https://gitlab.com/gitlab-com/account-management/emea/manomano/-/issues/61#note_520567196. The state related to a removed repository should not interfere with the new repository as from the client's perspective, the old repository was already removed.

To fix both issues, we should start identifying repositories by unique IDs. The IDs need to be generated by Praefect as opposed to using GitLab project IDs to prevent old state of removed repositories affecting new ones. GitLab project ID would still be the same when a repository is migrated back and forth from a virtual storage.

The necessary changes should be doable only in Praefect without having to change any callers. GitLab should keep its (storage, relative_path) tuples unique and could still call Praefect with them. When Praefect receives a call, it would check in a look up table which repository id this (storage, relative_path) combination maps to. Rest of the operation remains mostly the same, except they would be using the repository id instead of the (storage, relative_path) combo. The relative path a Gitaly node would use should be managed by Praefect.

For implementation:

We should extend repositories table with an auto-incrementing repository_id column. This gives each repository currently stored on the virtual storages a unique, permanent id. We'd also extend storage_repositories and repository_assignments with the same column. We'd then drop virtual_storage and relative_path columns repository_assignments and storage_repositories and join with repository_id to query the tables.

When calling Gitalys, we'd convert the repository_id in to a relative path, in a similar fashion as hashed storage is doing.

With these in place, we've fixed both problems described in the issue:

Praefect controls the relative paths (hash of Praefect generated id) on the Gitalys and we don't have to rename directories on disks. When a rename RPC comes in, we simply update the database record in repositories table, which serves as the (virtual_storage, relative_path) -> repository_id look up table.

As each repository gets a new unique id on creation, migrating the repository off the storage and back would result a new id being generated. That would avoid having the stale state from the previous removal affect the repository when it is being migrated back

The relative path should likely be in format @praefect/xx/xx/xxxxxxxxxxxxxxxxxxxx to avoid colliding with the existing names in @hashed, which could happen if the Praefect generated ID matches a GitLab project ID.

The difficult part would be the migration of the existing repositories on disk. If this is something we'd like to pursue, we plan out in more detail how we'd migrate the repositories in to the new path. Updating the queries themselves should not be too much work.

We could split this issue in two steps:

Add the repository_id in the schema and update the queries to use it. Reconciler could use this information to identify unreplicated renames, and reattempt them.
Do the bigger work of managing the relative paths in Praefect.

/cc @mjwood @zj-gitlab

Implementation Steps

Release 14.3

Add repository_id to the database schema in repositories, storage_repositories and repository_assignments tables.
Generate a repository ID for every repository stored in repositories table.
Update queries that create new records in these tables to link them correctly via a repository ID.
- CreateRepository* calls proxied by Praefect
- Replication jobs that create new replicas. As existing replication jobs do not carry the repository ID's yet, the query will use the (virtual_storage, relative_path) tuple to link the ID to the new replica.
- praefect set-replication-factor which may create new assignments.
Include the repository ID in the replication jobs:
- Queued on completion of mutator RPCs
- Scheduled by the reconciler

MRs implementing these steps

After this release, we can assume new records are carrying the repository ID. This allows us to link existing records in the following release and update queries to join the records via repository ID rather than the (virtual_storage, relative_path).

Release 14.4

Add a migration that links the existing records in repositories, storage_repositories and repository_assignments via the repository ID. New records are already being linked, so we shouldn't miss any records. (!3901 (merged))
Update every query to join the records via repository_id rather than the (virtual_storage, relative_path) tuple.
- RepositoryStore queries.
- AssignmentStore queries.
- Reconciler
- All tests should also do joins via repository_id. The tests should also ensure they correctly set up repository id in every case so we can make the schema stricter in 14.5.
Update replication job logic to link the new replicas created via the repository_id carried in the jobs rather than looking up by the (virtual_storage, relative_path).
Update Praefect's router to check use the relative path stored for the replicas in the database rather than expecting the path in the request to be the path to use. This ensures backwards compatibility between releases 14.4 and 14.5 when the latter starts generating paths.

This release decouples the client provided relative path from the relative paths of the replicas. As everything is now connected via the repository_id, we can have different relative path internally for the replicas than what the client uses to access the repository.

Release 14.5

Generate a relative path from the repository_id when routing CreateRepository* requests according to the scheme described here. Store the generated relative path with the replicas records in the database as replica_path.
Replication jobs always replicate to the same relative path on all of the Gitaly nodes. Existing repositories keep using the existing paths for all replicas, new repositories will use the Praefect generated path for the replicas.
All queries should now be using repository_id to join the records, so this releases should make (repository_id, storage) the primary key of repository_assignments and storage_repositories.
Drop the orphan repository deletion logic from reconciler as the the new schema doesn't allow for such database records, and the crawler introduced in #3719 (closed) will clean them up.

Release 14.6

Clean up unnecessary database state by dropping virtual_storage and relative_path columns from repository_assignments and storage_repositories. Use the query SELECT * FROM storage_repositories WHERE relative_path LIKE '%<HASH>'; to find the states and delete these entries from the DB

From this release onwards, newly created repositories map to different state than any previously deleted ones. Backup restoration should begin working without #3546 (closed). Deletes can now be made atomic database updates via #3720 (closed). Renames can be implemented by the repository's record in the repositories table avoiding the need to move the repositories in the disk.

Edited Oct 14, 2022 by Gerardo Gutierrez

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information