Draft: POC - improved user mapping
## What does this MR do and why?

This is a POC of the improved user mapping in Direct Transfer. I have added comments to the classes to explain their functionality.

While the POC works, there are a few aspects I believe we should reconsider when implementing the final solution. The following sections explain the changes I would make.
Related to: #443532 (closed)
## Demo Video
https://drive.google.com/file/d/1T-QDZDed2ZwqIItiBvq7ophNnQZgJs_W/view?usp=sharing
## New tables/models

The POC introduces two new tables. Here is a brief description of each.
### import_source_users / Import::SourceUser
This table associates an external user with a user on the destination and controls whether resources from the external user are mapped to placeholder users, importer users, or actual users.
| Name | Description |
|---|---|
| id | |
| placeholder_user_id | Reference to a placeholder user or an importer user. After a user accepts the reassignment, this changes to `null`. |
| assigned_user_id | Reference to the user who accepted being assigned all contributions of the source user. Initially, this is `null`. |
| namespace_id | |
| source_username | |
| source_name | |
| source_user_identifier | |
| source_hostname | |
| import_type | |
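To illustrate how `placeholder_user_id` and `assigned_user_id` interact, here is a hypothetical plain-Ruby sketch (not the actual GitLab code; the struct and method names are illustrative) of resolving which user a contribution is attributed to:

```ruby
# Hypothetical sketch: how an Import::SourceUser record could resolve
# the user a contribution is attributed to. Field names follow the
# table above; the class itself is illustrative, not GitLab code.
SourceUser = Struct.new(:placeholder_user_id, :assigned_user_id, keyword_init: true) do
  # Before reassignment, contributions point at the placeholder (or
  # importer) user; after the real user accepts, they point at them.
  def effective_user_id
    assigned_user_id || placeholder_user_id
  end

  def reassigned?
    !assigned_user_id.nil?
  end
end

pending  = SourceUser.new(placeholder_user_id: 101, assigned_user_id: nil)
accepted = SourceUser.new(placeholder_user_id: nil, assigned_user_id: 42)

puts pending.effective_user_id  # => 101 (placeholder still owns the contributions)
puts accepted.effective_user_id # => 42 (real user after accepting reassignment)
```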
### import_detail / Import::Detail
This table associates an imported resource with the external user via the Import::SourceUser model. This association is needed because, when the limit of placeholder users is reached, contributions from different external users are mapped to a single user (the importer user); therefore, to perform the reassignment, we need a map between the external user and the imported resource.
| Name | Description |
|---|---|
| id | |
| namespace_id | |
| project_id | |
| importable_type | Class name (Example: MergeRequest, Issue, Epic, Note) |
| importable_id | ID of the resource |
| assignee_id | |
| author_id | |
| closed_by_id | |
| created_by_id | |
| last_edited_by_id | |
| latest_closed_by_id | |
| merge_user_id | |
| merged_by_id | |
| owner_id | |
| resolved_by_id | |
| updated_by_id | |
| user_id | |
Notes regarding the table:

- The fields referencing a user (`assignee_id`, `author_id`, etc.) should be linked to an `Import::SourceUser` record, which then references the source user identifier.
- Although polymorphic associations are discouraged, I believe one will be required here; otherwise, ~30 tables must be created.
- Initially, I assumed the table would only contain a single reference to the user, such as `author_id`. I also thought we could eventually add the resource source ID to the same table to support retries. But, upon further investigation, I discovered that some resources have multiple users associated with them. To accommodate this, I added various references to the table, which resulted in a table with several fields that, in most cases, won't be used. A better approach is to have a field that stores the association field name, as in the table below, and use a different table in case we need to store the resource source ID.

| Name | Description |
|---|---|
| id | |
| namespace_id | |
| project_id | |
| importable_type | Class name (Example: MergeRequest, Issue, Epic, Note) |
| importable_id | ID of the resource |
| reference_key | Field name (author_id, user_id, owner_id) |
| reference_value | |
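To make the `reference_key` idea concrete, here is a hypothetical plain-Ruby sketch of how reassignment could use it: each detail row records which user column on the imported resource must be rewritten. All names here are illustrative, not the final schema or GitLab code.

```ruby
# Hypothetical sketch of reassignment driven by the proposed
# reference_key column. Each detail records the user column on the
# imported resource that must be rewritten during reassignment.
Detail = Struct.new(:importable, :reference_key, keyword_init: true)

# Rewrite every recorded user reference from the importer/placeholder
# user to the user who accepted the reassignment.
def reassign!(details, from_user_id:, to_user_id:)
  details.each do |detail|
    key = detail.reference_key.to_sym
    detail.importable[key] = to_user_id if detail.importable[key] == from_user_id
  end
end

# A merge request whose author and merge user were both mapped to
# placeholder/importer user 7 during the import.
merge_request = { author_id: 7, merge_user_id: 7 }
details = [
  Detail.new(importable: merge_request, reference_key: "author_id"),
  Detail.new(importable: merge_request, reference_key: "merge_user_id")
]

reassign!(details, from_user_id: 7, to_user_id: 42)
merge_request # => { author_id: 42, merge_user_id: 42 }
```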
## Getting the user's name and username

Knowing the contributors' names and usernames is essential for creating placeholder users. Currently, Direct Transfer doesn't export this information in the NDJSON files. Although we can change Direct Transfer to export it, the feature still wouldn't work as expected when importing from older GitLab instances that don't export the information.
A workaround implemented in the POC is to fetch the users' information via the members API. However, this approach isn't perfect, as the endpoint doesn't return information about contributors who aren't members. So, to map non-members, we have a few options:
- Fetch the user information via the users API every time a resource is read from the NDJSON files and Direct Transfer doesn't have the user details. The problem with this option is the number of API requests Direct Transfer would have to perform. Also, these calls would likely occur during the pipeline's execution, which isn't ideal, as network errors could cause the pipeline to fail.
- Map the contributions to the Importer User and have an extra pipeline populate the Import::SourceUser records. This approach allows us to fetch user information in batches, reducing the number of requests. However, non-member contributions would be assigned to the Importer User.
## Export user information
In the POC, I updated the groups' import_export.yml to export the user details (id, name, username). However, I didn't like the resulting code. A better approach is to create a new NDJSON relation that exports information about all contributors, and then create a pipeline that consumes this new NDJSON file.

Another reason to create the new relation is that we could use it to report to the user how many placeholder users will be created/required for the migration.
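As a rough illustration, the new contributors relation could be a simple NDJSON file with one JSON object per contributor. This is a minimal sketch under assumed field names (id, name, username, per the POC); the actual relation format would follow Direct Transfer's export conventions:

```ruby
require "json"

# Hypothetical sketch of the proposed contributors NDJSON relation:
# one JSON object per line with the details needed to create
# placeholder users. Field names are assumptions, not a final schema.
contributors = [
  { "id" => 1, "name" => "Alice Example", "username" => "alice" },
  { "id" => 2, "name" => "Bob Example",   "username" => "bob" }
]

ndjson = contributors.map { |c| JSON.generate(c) }.join("\n")

# A consuming pipeline reads it back line by line. Counting the lines
# also gives the number of placeholder users the migration would need,
# which could feed the pre-migration report mentioned above.
parsed = ndjson.each_line.map { |line| JSON.parse(line) }
puts "placeholder users required: #{parsed.size}"
```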
## Discovered problems/limitations
The use of the Importer User doesn't work for associations that have a unique user constraint, for example, assignees, approvals, reviewers, and award_emoji. Because of the unique constraint, only the first contribution will be imported.

For example, to import a merge request assignee, a row needs to be created in the merge_request_assignees table, which has the following structure:
| Column |
|---|
| id |
| merge_request_id |
| user_id |

And a unique index on (`merge_request_id`, `user_id`), which means we can't save two rows with the same `merge_request_id` and `user_id`.
When importing a merge request with multiple assignees mapped to the same Importer User, only the first assignee will be created, since the Importer User ID is the same for all of them.
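The limitation above can be demonstrated with a small plain-Ruby sketch, using a `Set` to stand in for the database's unique index (the helper names are illustrative):

```ruby
require "set"

# Sketch of why a unique index on (merge_request_id, user_id) drops
# assignees once several source users map to the same Importer User.
# The Set stands in for the database's unique index.
unique_index = Set.new

def insert_assignee(index, merge_request_id:, user_id:)
  # Set#add? returns nil when the pair already exists, mimicking the
  # database rejecting a duplicate (merge_request_id, user_id) row.
  index.add?([merge_request_id, user_id]) ? :inserted : :rejected
end

importer_user_id = 99 # two different source assignees, same mapped user

results = [
  insert_assignee(unique_index, merge_request_id: 1, user_id: importer_user_id),
  insert_assignee(unique_index, merge_request_id: 1, user_id: importer_user_id)
]

puts results.inspect # => [:inserted, :rejected] — only the first row survives
```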