Draft: POC - improved user mapping
## What does this MR do and why?

This is a POC of the improved user mapping in Direct Transfer. I have added comments to the classes to explain their functionality.

While the POC works, there are a few aspects I believe we should reconsider when implementing the final solution. The following sections explain the changes I would make.
Related to: #443532 (closed)
## Demo Video
https://drive.google.com/file/d/1T-QDZDed2ZwqIItiBvq7ophNnQZgJs_W/view?usp=sharing
## New tables/models

The POC introduces two new tables. Here is a brief description of each.
### import_source_users / Import::SourceUser
This table associates an external user with a user on the destination and controls whether resources from the external user are mapped to placeholder users, importer users, or actual users.
| Name | Description |
|---|---|
| id | |
| placeholder_user_id | Reference to a placeholder user or an importer user. After a user accepts the reassignment, this changes to `null`. |
| assigned_user_id | Reference to the user who accepted being assigned all contributions of the source user. Initially, this is `null`. |
| namespace_id | |
| source_username | |
| source_name | |
| source_user_identifier | |
| source_hostname | |
| import_type | |
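To illustrate how `placeholder_user_id` and `assigned_user_id` interact, here is a hypothetical plain-Ruby sketch (not the actual GitLab code; the struct and method names are illustrative) of resolving which user a contribution is attributed to:

```ruby
# Hypothetical sketch: how an Import::SourceUser record could resolve
# the user a contribution is attributed to. Field names follow the
# table above; the class itself is illustrative, not GitLab code.
SourceUser = Struct.new(:placeholder_user_id, :assigned_user_id, keyword_init: true) do
  # Before reassignment, contributions point at the placeholder (or
  # importer) user; after the real user accepts, they point at them.
  def effective_user_id
    assigned_user_id || placeholder_user_id
  end

  def reassigned?
    !assigned_user_id.nil?
  end
end

pending  = SourceUser.new(placeholder_user_id: 101, assigned_user_id: nil)
accepted = SourceUser.new(placeholder_user_id: nil, assigned_user_id: 42)

puts pending.effective_user_id  # => 101 (placeholder still owns the contributions)
puts accepted.effective_user_id # => 42 (real user after accepting reassignment)
```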
### import_detail / Import::Detail
This table associates an imported resource with the external user via the Import::SourceUser model. This association is needed because, when the limit of placeholder users is reached, contributions from different external users are mapped to a single user (the importer user); therefore, to perform the reassignment, we need a map between the external user and the imported resource.
| Name | Description |
|---|---|
| id | |
| namespace_id | |
| project_id | |
| importable_type | Class name (Example: MergeRequest, Issue, Epic, Note) |
| importable_id | ID of the resource |
| assignee_id | |
| author_id | |
| closed_by_id | |
| created_by_id | |
| last_edited_by_id | |
| latest_closed_by_id | |
| merge_user_id | |
| merged_by_id | |
| owner_id | |
| resolved_by_id | |
| updated_by_id | |
| user_id | |
Notes regarding the table:

- The fields referencing a user (`assignee_id`, `author_id`, etc.) should be linked to an `Import::SourceUser` record, which then references the source user identifier.
- Although polymorphic associations are discouraged, I believe one will be required here; otherwise, ~30 tables must be created.
- Initially, I assumed the table would only contain a single reference to the user, such as `author_id`. I also thought we could eventually add the resource source ID to the same table to support retries. But, upon further investigation, I discovered that some resources have multiple users associated with them. To accommodate this, I added various references to the table, which resulted in a table with several fields that, in most cases, won't be used. A better approach is to have a field that stores the association field name, as in the table below, and use a different table in case we need to store the resource source ID.

| Name | Description |
|---|---|
| id | |
| namespace_id | |
| project_id | |
| importable_type | Class name (Example: MergeRequest, Issue, Epic, Note) |
| importable_id | ID of the resource |
| reference_key | Field name (author_id, user_id, owner_id) |
| reference_value | |
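To make the `reference_key` idea concrete, here is a hypothetical plain-Ruby sketch of how reassignment could use it: each detail row records which user column on the imported resource must be rewritten. All names here are illustrative, not the final schema or GitLab code.

```ruby
# Hypothetical sketch of reassignment driven by the proposed
# reference_key column. Each detail records the user column on the
# imported resource that must be rewritten during reassignment.
Detail = Struct.new(:importable, :reference_key, keyword_init: true)

# Rewrite every recorded user reference from the importer/placeholder
# user to the user who accepted the reassignment.
def reassign!(details, from_user_id:, to_user_id:)
  details.each do |detail|
    key = detail.reference_key.to_sym
    detail.importable[key] = to_user_id if detail.importable[key] == from_user_id
  end
end

# A merge request whose author and merge user were both mapped to
# placeholder/importer user 7 during the import.
merge_request = { author_id: 7, merge_user_id: 7 }
details = [
  Detail.new(importable: merge_request, reference_key: "author_id"),
  Detail.new(importable: merge_request, reference_key: "merge_user_id")
]

reassign!(details, from_user_id: 7, to_user_id: 42)
merge_request # => { author_id: 42, merge_user_id: 42 }
```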
## Getting the user's name and username

Knowing the contributors' names and usernames is essential for creating placeholder users. Currently, Direct Transfer doesn't export this information in the NDJSON files. Although we can change Direct Transfer to export it, the feature still wouldn't work as expected when importing from older GitLab instances that don't export the information.
A workaround implemented in the POC is to fetch the users' information via the members API. However, this approach isn't perfect, as the endpoint doesn't return information about contributors who aren't members. So, to map non-members, we have a few options:
- Fetch the user information via the users API every time a resource is read from the NDJSON files and Direct Transfer doesn't have the user details. The problem with this option is the number of API requests Direct Transfer would have to perform. Also, these calls would likely occur during the pipeline's execution, which isn't ideal, as network errors could cause the pipeline to fail.
- Map the contributions to the Importer User and have an extra pipeline populate the Import::SourceUser records. This approach allows us to fetch user information in batches, reducing the number of requests. However, non-member contributions would be assigned to the Importer User.
## Export user information
In the POC, I updated the groups' import_export.yml to export the user details (id, name, username). However, I didn't like the resulting code. A better approach is to create a new NDJSON relation that exports information about all contributors, and then create a pipeline that consumes this new NDJSON file.

Another reason to create the new relation is that we could use it to report to the user how many placeholder users will be created/required for the migration.
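As a rough illustration, the new contributors relation could be a simple NDJSON file with one JSON object per contributor. This is a minimal sketch under assumed field names (id, name, username, per the POC); the actual relation format would follow Direct Transfer's export conventions:

```ruby
require "json"

# Hypothetical sketch of the proposed contributors NDJSON relation:
# one JSON object per line with the details needed to create
# placeholder users. Field names are assumptions, not a final schema.
contributors = [
  { "id" => 1, "name" => "Alice Example", "username" => "alice" },
  { "id" => 2, "name" => "Bob Example",   "username" => "bob" }
]

ndjson = contributors.map { |c| JSON.generate(c) }.join("\n")

# A consuming pipeline reads it back line by line. Counting the lines
# also gives the number of placeholder users the migration would need,
# which could feed the pre-migration report mentioned above.
parsed = ndjson.each_line.map { |line| JSON.parse(line) }
puts "placeholder users required: #{parsed.size}"
```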
## Discovered problems/limitations
The use of the Importer User doesn't work for associations that have a unique user constraint, for example, assignees, approvals, reviewers, and award_emoji. Because of the unique constraint, only the first contribution will be imported.

For example, to import a merge request assignee, a row needs to be created in the merge_request_assignees table, which has the following structure:
| Column |
|---|
| id |
| merge_request_id |
| user_id |

And a unique index on (`merge_request_id`, `user_id`), which means we can't save two rows with the same `merge_request_id` and `user_id`.
When importing a merge request with multiple assignees mapped to the same Importer User, only the first assignee will be created, since the Importer User ID is the same for all of them.
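The limitation above can be demonstrated with a small plain-Ruby sketch, using a `Set` to stand in for the database's unique index (the helper names are illustrative):

```ruby
require "set"

# Sketch of why a unique index on (merge_request_id, user_id) drops
# assignees once several source users map to the same Importer User.
# The Set stands in for the database's unique index.
unique_index = Set.new

def insert_assignee(index, merge_request_id:, user_id:)
  # Set#add? returns nil when the pair already exists, mimicking the
  # database rejecting a duplicate (merge_request_id, user_id) row.
  index.add?([merge_request_id, user_id]) ? :inserted : :rejected
end

importer_user_id = 99 # two different source assignees, same mapped user

results = [
  insert_assignee(unique_index, merge_request_id: 1, user_id: importer_user_id),
  insert_assignee(unique_index, merge_request_id: 1, user_id: importer_user_id)
]

puts results.inspect # => [:inserted, :rejected] — only the first row survives
```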