Load placeholder references from Redis into PG
What does this MR do and why?
#443554 (closed) introduced a single table to contain all details of placeholder user contributions.
For every importer, we will record one row of data per imported record that is associated with a user.
Our table design has the benefit of being simple, but we believe the writes should be optimised so that we make fewer writes to PostgreSQL.
This change writes the placeholder user contribution data to Redis first, where it is later collected and batch-loaded into the PostgreSQL table (#467511 (closed)).
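For context, the shape of the technique is: one cheap Redis SADD per imported record at import time, then one bulk INSERT per batch later. The sketch below is purely illustrative and is not the code in this MR; the set key, the PlaceholderReference model and the push_reference/load_references helpers are hypothetical, and it assumes redis-rb and ActiveRecord's insert_all.
require 'json'

# Hypothetical key; the real services build their own cache key.
SET_KEY = 'import:placeholder_references:github:1'

# Push side: serialise the reference and add it to a Redis set.
# One SADD per imported record, no PostgreSQL write yet.
def push_reference(redis, attributes)
  redis.sadd(SET_KEY, attributes.to_json)
end

# Load side: drain the set in batches and bulk-insert each batch,
# turning N single-row INSERTs into roughly N / batch_size bulk INSERTs.
def load_references(redis, batch_size: 1_000)
  loop do
    batch = redis.spop(SET_KEY, batch_size)
    break if batch.nil? || batch.empty?

    PlaceholderReference.insert_all(batch.map { |json| JSON.parse(json) })
  end
end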
Example
Every time we import a record using a placeholder user, we will call:
Import::PlaceholderReferences::PushService.from_record(
  import_source: 'github',
  import_uid: 1,
  source_user: source_user,
  record: merge_request,
  user_reference_column: :author_id
).execute
Or, when at that point we only have the IDs of the imported record and the source user:
Import::PlaceholderReferences::PushService.new(
  import_source: 'github',
  import_uid: 1,
  source_user_id: 2,
  source_user_namespace_id: 3,
  model: MergeRequest,
  numeric_key: 4,
  user_reference_column: :author_id
).execute
At some point later (for example, at the end of an import stage) we will queue a worker to load those references into PostgreSQL:
Gitlab::Import::LoadPlaceholderReferencesWorker.perform_async('github', 1)
# And while we still need to check the feature flag:
Gitlab::Import::LoadPlaceholderReferencesWorker.perform_async('github', 1, { current_user_id: current_user.id })
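The worker itself is thin. A minimal sketch, assuming it simply delegates to the Import::PlaceholderReferences::LoadService used in the QA section below (see the diff for the real implementation and for how current_user_id is actually used):
module Gitlab
  module Import
    class LoadPlaceholderReferencesWorker
      include ApplicationWorker

      # params may carry current_user_id, presumably for the feature flag check
      # while the flag exists (assumption; params is unused in this sketch).
      def perform(import_source, import_uid, params = {})
        ::Import::PlaceholderReferences::LoadService.new(
          import_source: import_source,
          import_uid: import_uid
        ).execute
      end
    end
  end
end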
Before finalising the import, the importer can check that the placeholder references have been processed. This is in !158536 (merged).
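That check is implemented in !158536, not here. Purely as an illustration of one possible signal, reusing only the calls shown in the QA script below, an importer could confirm that the Redis set for the import has been drained:
cache_key = Import::PlaceholderReferences::BaseService.new(
  import_source: 'github',
  import_uid: 1
).send(:cache_key)

pending_count = Gitlab::Cache::Import::Caching.with_redis do |redis|
  redis.scard(Gitlab::Cache::Import::Caching.cache_key_for(cache_key))
end

# All placeholder references have been loaded when nothing remains in the set
# (illustration only; the real check in !158536 may differ).
raise 'placeholder references still pending' unless pending_count.zero?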
QA
The script below inserts 1,000,000 records into Redis and then loads them into PostgreSQL using the services in this MR; run it in a Rails console.
Locally, loading 1 million records took around 2 minutes. 1 million records should be indicative of a "large import" if we're clearing the batches after each import stage, and 2 minutes seems an acceptable indicative benchmark for a local development environment. In this scenario we would make close to 1 million fewer calls to PostgreSQL than if we inserted the records one by one.
import_source = 'github'
import_uid = 1
source_user = Import::SourceUser.first
mock_model = Struct.new(:name, keyword_init: true)

# Push 1,000,000 placeholder references into Redis
1_000_000.times do |i|
  Import::PlaceholderReferences::PushService.new(
    import_source: import_source,
    import_uid: import_uid,
    source_user_id: source_user.id,
    source_user_namespace_id: source_user.namespace_id,
    model: mock_model.new(name: SecureRandom.hex),
    numeric_key: 1,
    user_reference_column: :author_id
  ).execute

  puts "Pushed #{i}" if (i % 5_000).zero?
end

# Inspect the Redis set and the PostgreSQL table before loading
cache_key = Import::PlaceholderReferences::BaseService.new(
  import_source: import_source,
  import_uid: import_uid
).send(:cache_key)

original_set_count = Gitlab::Cache::Import::Caching.with_redis do |redis|
  redis.scard(Gitlab::Cache::Import::Caching.cache_key_for(cache_key))
end
original_record_count = Import::SourceUserPlaceholderReference.count

# Batch-load the cached references into PostgreSQL and time it
start_t = Time.now
Import::PlaceholderReferences::LoadService.new(import_source: import_source, import_uid: import_uid).execute
end_t = Time.now

# Verify the Redis set was drained and the PostgreSQL rows were created
new_set_count = Gitlab::Cache::Import::Caching.with_redis do |redis|
  redis.scard(Gitlab::Cache::Import::Caching.cache_key_for(cache_key))
end
new_record_count = Import::SourceUserPlaceholderReference.count

puts "set count was #{original_set_count}"
puts "set count is now #{new_set_count}"
puts "loaded records to PG in #{end_t - start_t} seconds"
puts "PG row count increased by #{new_record_count - original_record_count}"
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
How to set up and validate locally
See the QA section above for a script that sets up and validates the change locally.
Related to #467511 (closed)