
Add pipeline to populate import source user missing details

What does this MR do and why?

Introduce a new pipeline in Direct Transfer to populate the source_name and source_username for the Import::SourceUser records created without them. Additionally, the pipeline updates the associated placeholder user's name and username.

The new pipeline retrieves user details in batches of 100 users via the GraphQL API. After fetching the details, it updates each Import::SourceUser record and its associated placeholder user, following the same naming patterns used when the records are created.
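
For illustration, a minimal sketch of the fetch-and-update loop is below. graphql_client, the scope arguments, and the query shape are assumptions rather than the pipeline's actual collaborators, and ID handling (e.g. GraphQL global ID conversion) is glossed over:

# Hedged sketch only: `graphql_client` is a hypothetical stand-in and the
# scope arguments are illustrative. The real pipeline also converts between
# plain identifiers and GraphQL global IDs.
USER_DETAILS_QUERY = <<~GRAPHQL
  query($ids: [ID!]) {
    users(ids: $ids) {
      nodes {
        id
        name
        username
      }
    }
  }
GRAPHQL

Import::SourceUser
  .source_users_with_missing_information(
    namespace: group,
    source_hostname: source_hostname,
    import_type: 'gitlab_migration'
  )
  .each_batch(of: 100) do |batch|
    nodes = graphql_client.execute(USER_DETAILS_QUERY, ids: batch.map(&:source_user_identifier))
    details = nodes.index_by { |node| node['id'] }

    batch.each do |source_user|
      detail = details[source_user.source_user_identifier]
      next unless detail

      source_user.update!(source_name: detail['name'], source_username: detail['username'])
      # Rename the placeholder user following the same patterns used when it
      # was created (simplified here; real usernames must stay unique).
      source_user.placeholder_user.update!(name: detail['name'], username: detail['username'])
    end
  end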

For context: for contributions from non-members, Import::SourceUser records are created without a name and username because that information isn't present in the imported NDJSON files. Contributions from members don't have this problem, because their Import::SourceUser records are created in the MembersPipeline, which has access to the user's name and username.

The pipeline does not use the user_contributions relation introduced in #454522 (closed) because that relation is still behind a feature flag and is therefore unavailable on source instances. I created #480245 to roll out the relation, which will later allow us to add a pipeline that uses it.

Related to: #443557 (closed)

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.


How to set up and validate locally

  1. Enable the feature flags: importer_user_mapping and bulk_import_importer_user_mapping

    Feature.enable(:importer_user_mapping)
    Feature.enable(:bulk_import_importer_user_mapping)
  2. Import a new group using Direct Transfer which contains contributions from non-members

    1. Go to New group
    2. Select Import group
    3. Provide the source instance's host URL and an access token
    4. Select the group to be imported
  3. Wait for the migration to finish

  4. Access the Placeholder Users tab

    1. Go to the top-level namespace page
    2. Access the Members section
    3. Click on the Placeholder tab
  5. Check that all placeholder users have names and usernames resembling their source users (a Rails console spot check is shown below).
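
As a spot check from the Rails console, a query like the following should return 0 once the pipeline has backfilled every record (the group ID is illustrative; adjust to your imported group):

group = Group.find(9970) # replace with the imported top-level group's ID
Import::SourceUser
  .where(namespace: group)
  .where('source_name IS NULL OR source_username IS NULL')
  .count
# => 0 once the pipeline has backfilled every record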

SQL Plans

Note: The import_source_users table on production is empty. In Database Lab, the table was populated using the command below, which creates 8,000 source users in each of 4 distinct namespaces, 32,000 records in total.

exec

DO $$
DECLARE temprow RECORD;
BEGIN FOR temprow IN
    SELECT id FROM users LIMIT 8000
  LOOP
    INSERT INTO import_source_users (placeholder_user_id,namespace_id,created_at,updated_at,source_username,source_name,source_user_identifier,source_hostname,import_type) VALUES (temprow.id,9970,'2024-08-22 01:05:58.372040','2024-08-22 01:05:58.372040',null,null, cast(temprow.id as varchar), 'github.com', 'github');
    INSERT INTO import_source_users (placeholder_user_id,namespace_id,created_at,updated_at,source_username,source_name,source_user_identifier,source_hostname,import_type) VALUES (temprow.id,2244,'2024-08-22 01:05:58.372040','2024-08-22 01:05:58.372040',null,null, cast(temprow.id as varchar), 'github.com', 'github');
    INSERT INTO import_source_users (placeholder_user_id,namespace_id,created_at,updated_at,source_username,source_name,source_user_identifier,source_hostname,import_type) VALUES (temprow.id,2247,'2024-08-22 01:05:58.372040','2024-08-22 01:05:58.372040',null,null, cast(temprow.id as varchar), 'github.com', 'github');
    INSERT INTO import_source_users (placeholder_user_id,namespace_id,created_at,updated_at,source_username,source_name,source_user_identifier,source_hostname,import_type) VALUES (temprow.id,2252,'2024-08-22 01:05:58.372040','2024-08-22 01:05:58.372040',null,null, cast(temprow.id as varchar), 'github.com', 'github');
  END LOOP;
END; $$

Ruby method

Import::SourceUser.source_users_with_missing_information(namespace: Group.find(9970), source_hostname: 'github.com', import_type: 'github')
  .each_batch(of: 100, order: :desc) do |batch| 
  batch.each do |source_user|
    puts source_user
  end
end
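
For reference, the two raw SQL statements below are the kind of boundary queries each_batch generates while iterating: the first locates the first batch's starting id, and the second (with OFFSET 100 and an id bound) locates the next batch boundary.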

Raw SQL 1

SELECT
    "import_source_users"."id"
FROM
    "import_source_users"
WHERE
    "import_source_users"."namespace_id" = 9970
    AND "import_source_users"."source_hostname" = 'github.com'
    AND "import_source_users"."import_type" = 'github'
    AND ("import_source_users"."source_name" IS NULL
        OR "import_source_users"."source_username" IS NULL)
ORDER BY
    "import_source_users"."id" DESC
LIMIT 1

Query plan - https://console.postgres.ai/gitlab/gitlab-production-main/sessions/31074/commands/96497

 Limit  (cost=2.77..2.77 rows=1 width=8) (actual time=0.146..0.147 rows=0 loops=1)
   Buffers: shared hit=8
   I/O Timings: read=0.000 write=0.000
   ->  Sort  (cost=2.77..2.77 rows=1 width=8) (actual time=0.145..0.146 rows=0 loops=1)
         Sort Key: import_source_users.id DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=8
         I/O Timings: read=0.000 write=0.000
         ->  Index Scan using index_import_source_users_on_namespace_id_and_status on public.import_source_users  (cost=0.29..2.76 rows=1 width=8) (actual time=0.043..0.043 rows=0 loops=1)
               Index Cond: (import_source_users.namespace_id = 9970)
               Filter: (((import_source_users.source_name IS NULL) OR (import_source_users.source_username IS NULL)) AND (import_source_users.source_hostname = 'github.com'::text) AND (import_source_users.import_type = 'github'::text))
               Rows Removed by Filter: 0
               Buffers: shared hit=5
               I/O Timings: read=0.000 write=0.000

Raw SQL 2

SELECT
    "import_source_users"."id"
FROM
    "import_source_users"
WHERE
    "import_source_users"."namespace_id" = 9970
    AND "import_source_users"."source_hostname" = 'github.com'
    AND "import_source_users"."import_type" = 'gitlab_migration'
    AND ("import_source_users"."source_name" IS NULL
        OR "import_source_users"."source_username" IS NULL)
    AND "import_source_users"."id" <= 8000
ORDER BY
    "import_source_users"."id" DESC
LIMIT 1 OFFSET 100

Query plan - https://postgres.ai/console/gitlab/gitlab-production-main/sessions/31074/commands/96498

 Limit  (cost=2.77..2.78 rows=1 width=8) (actual time=0.059..0.059 rows=0 loops=1)
   Buffers: shared hit=8
   I/O Timings: read=0.000 write=0.000
   ->  Sort  (cost=2.77..2.77 rows=1 width=8) (actual time=0.057..0.058 rows=0 loops=1)
         Sort Key: import_source_users.id DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=8
         I/O Timings: read=0.000 write=0.000
         ->  Index Scan using index_import_source_users_on_namespace_id_and_status on public.import_source_users  (cost=0.29..2.76 rows=1 width=8) (actual time=0.037..0.037 rows=0 loops=1)
               Index Cond: (import_source_users.namespace_id = 9970)
               Filter: (((import_source_users.source_name IS NULL) OR (import_source_users.source_username IS NULL)) AND (import_source_users.id <= 8000) AND (import_source_users.source_hostname = 'github.com'::text) AND (import_source_users.import_type = 'gitlab_migration'::text))
               Rows Removed by Filter: 0
               Buffers: shared hit=5
               I/O Timings: read=0.000 write=0.000