Load placeholder references from Redis into PG

What does this MR do and why?

#443554 (closed) introduced a single table to contain all details of placeholder user contributions.

We will record 1 row of data per imported record that is associated with a user, for every importer.

Our table design has the benefit of being simple, but we believe writing these rows should be optimised so that we make fewer writes to PostgreSQL.

This change writes the placeholder user contribution data to Redis first, where it is later collected and batch-loaded into the PostgreSQL table (#467511 (closed)).
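
To illustrate the idea (a minimal sketch only; the key name, attribute names, and batch size below are assumptions, and the real PushService/LoadService wrap these details):

# Sketch of the buffering pattern, not the actual service internals.
# The Redis key name and the attribute names are illustrative assumptions.
cache_key = Gitlab::Cache::Import::Caching.cache_key_for('placeholder-references/github/1')

# Push side: one cheap Redis SADD per imported record.
reference = {
  model: 'MergeRequest',
  numeric_key: merge_request.id,
  source_user_id: source_user.id,
  user_reference_column: 'author_id'
}.to_json

Gitlab::Cache::Import::Caching.with_redis { |redis| redis.sadd(cache_key, reference) }

# Load side: read the buffered references back and write them to PostgreSQL in
# bulk, turning N single-row INSERTs into a much smaller number of batched ones.
members = Gitlab::Cache::Import::Caching.with_redis { |redis| redis.smembers(cache_key) }

members.each_slice(1_000) do |batch|
  Import::SourceUserPlaceholderReference.insert_all(batch.map { |json| Gitlab::Json.parse(json) })
end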

Example

Every time we import something using a placeholder user, we will call:

Import::PlaceholderReferences::PushService.from_record(
  import_source: 'github', 
  import_uid: 1, 
  source_user: source_user, 
  record: merge_request, 
  user_reference_column: :author_id
).execute

Or, when at that point we only have the IDs of the imported record or the source user:

Import::PlaceholderReferences::PushService.new(
  import_source: 'github', 
  import_uid: 1, 
  source_user_id: 2, 
  source_user_namespace_id: 3, 
  model: MergeRequest, 
  numeric_key: 4,
  user_reference_column: :author_id
).execute

At some point later (for example, at the end of the import stage) we will queue a worker to load them into PostgreSQL:

Gitlab::Import::LoadPlaceholderReferencesWorker.perform_async('github', 1)
# And while we still need to check the feature flag:
Gitlab::Import::LoadPlaceholderReferencesWorker.perform_async('github', 1, { current_user_id: current_user.id } )

Before finalising the import, the importer can check that the placeholder references have been processed. This is in !158536 (merged).
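
For example, a check along these lines (a hypothetical sketch reusing the cache-key helpers from the QA script below, not the code from !158536) could confirm that the Redis set for the import has been drained:

# Hypothetical check: the Redis set for this import should be empty once all
# placeholder references have been loaded into PostgreSQL.
cache_key = Import::PlaceholderReferences::BaseService.new(
  import_source: 'github',
  import_uid: 1
).send(:cache_key)

pending = Gitlab::Cache::Import::Caching.with_redis do |redis|
  redis.scard(Gitlab::Cache::Import::Caching.cache_key_for(cache_key))
end

raise 'placeholder references still pending' unless pending.zero?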

QA

The script below inserts 1_000_000 records into Redis and loads them into PostgreSQL using the services in this MR.

Locally, loading 1 million records took around 2 minutes. 1 million records should be indicative of a "large import" if we're clearing the batches after each import stage, and 2 minutes seems an acceptable indicative benchmark for a local development environment. In this scenario we would make close to 1 million fewer calls to PostgreSQL than if we inserted the records one at a time, because the references are written in bulk batches rather than with one INSERT per record.

import_source = 'github'
import_uid = 1
source_user = Import::SourceUser.first
mock_model = Struct.new(:name, keyword_init: true)

1_000_000.times do |i|
  Import::PlaceholderReferences::PushService.new(
    import_source: import_source,
    import_uid: import_uid,
    source_user_id: source_user.id,
    source_user_namespace_id: source_user.namespace_id,
    model: mock_model.new(name: SecureRandom.hex),
    numeric_key: 1,
    user_reference_column: :author_id
  ).execute

  puts "Pushed #{i}" if (i % 5_000).zero?
end

cache_key = Import::PlaceholderReferences::BaseService.new(
  import_source: import_source,
  import_uid: import_uid
).send(:cache_key)

original_set_count = Gitlab::Cache::Import::Caching.with_redis do |redis|
  redis.scard(Gitlab::Cache::Import::Caching.cache_key_for(cache_key))
end

original_record_count = Import::SourceUserPlaceholderReference.count

start_t = Time.now
Import::PlaceholderReferences::LoadService.new(import_source: import_source, import_uid: import_uid).execute
end_t = Time.now

new_set_count = Gitlab::Cache::Import::Caching.with_redis do |redis|
  redis.scard(Gitlab::Cache::Import::Caching.cache_key_for(cache_key))
end

new_record_count = Import::SourceUserPlaceholderReference.count

puts "set count was #{original_set_count}"
puts "set count is now #{new_set_count}"
puts "loaded records to PG in #{end_t - start_t} seconds"
puts "PG row count increased by #{new_record_count - original_record_count}"

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

Related to #467511 (closed)
