Save details of placeholder contributions to table during import
About
In #443554 (closed) we are introducing a single table to contain all details of placeholder user contributions.
We will record 1 row of data per imported record that is associated with a user, for every importer.
Problem
Our table design has the benefit of being simple, but attempting to write each row to the table seems a bit naïve and is likely to lead to scalability problems at times.
Failures to write this data to PostgreSQL will lead to a kind of data loss for customers.
We should err on the side of "over-engineering".
We can use exclusive lease locks, but these don't guarantee that the data is written; they only protect the resource. Under high contention the lock raises a `FailedToObtainLockError`, which can lead to data loss if the worker continues to fail after retrying. An example is `web_hook_logs`, which is written to once per webhook that is triggered. Because of table contention problems #352245 (closed) we added an exclusive lease lock !80976 (merged). We would then sometimes experience periods of many `FailedToObtainLockError` errors #352245 (comment 856625412). In the case of `web_hook_logs` we realised that many writes that led to `FailedToObtainLockError` could simply be discarded !81770 (merged). But when saving user contributions, we need every row to be written.
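For context, the lock pattern in question looks roughly like the sketch below (not the actual `web_hook_logs` code; the worker and model names are made up): `in_lock` retries a few times and then raises `FailedToObtainLockError`, so if Sidekiq's own retries are also exhausted the row is never written.

```ruby
# Illustrative only: an exclusive lease around a single-row write.
class SavePlaceholderContributionWorker # hypothetical worker
  include ApplicationWorker
  include Gitlab::ExclusiveLeaseHelpers

  LOCK_KEY = 'import:placeholder_contributions'

  def perform(attributes)
    # Retries a limited number of times, then raises
    # FailedToObtainLockError under contention.
    in_lock(LOCK_KEY, ttl: 1.minute, retries: 10) do
      Import::PlaceholderContribution.create!(attributes) # hypothetical model
    end
  end
  # If FailedToObtainLockError keeps being raised, Sidekiq eventually stops
  # retrying the job and this contribution is lost.
end
```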
Proposal?
```mermaid
sequenceDiagram
    participant Importer
    participant Redis
    participant Worker
    participant PostgreSQL
    Importer->>+Redis: Sends each contribution {.... }
    Importer->>+Worker: Queues after each stage
    Worker->>Redis: Pops batches of records
    Worker->>PostgreSQL: Loads batches of records to table
    Importer->>+Redis: Waits until empty set to finish import
```
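On the importer side, "Sends each contribution" could be as simple as appending a serialized payload to a per-import Redis list. A minimal sketch, assuming a plain Redis list; the key name and payload shape are illustrative, not a final schema:

```ruby
require 'json'
require 'redis'

# Push one contribution's details onto a per-import Redis list.
def push_contribution(redis, import_id, contribution)
  key = "import:#{import_id}:placeholder_contributions" # illustrative key name
  redis.rpush(key, contribution.to_json)
end

redis = Redis.new
push_contribution(redis, 42, {
  model: 'MergeRequest',              # the imported record's model
  user_reference_column: 'author_id', # column that references the placeholder user
  record_id: 123                      # the imported record's id
})
```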
Benefits would be:
- Writing to Redis is fast and cheap compared to writing to PostgreSQL.
- Data in Redis gives us some ability to recover from problems during loading to PostgreSQL, as we would have up to 1 day after the data was last touched to get it into PostgreSQL.
- We can control contention by limiting the number of workers that process the Redis queue at a time.
- If a batch size was 1,000, we would make 1,000x fewer insert requests to PostgreSQL (see the worker sketch after this list).
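On the worker side, "Pops batches of records" and "Loads batches of records to table" could look something like the sketch below, assuming the list layout from the earlier sketch; the class, model and batch size are placeholders, and `LPOP` with a count argument needs Redis >= 6.2.

```ruby
require 'json'
require 'redis'

class LoadPlaceholderContributionsWorker # hypothetical worker
  include ApplicationWorker

  BATCH_SIZE = 1_000

  def perform(import_id)
    key = "import:#{import_id}:placeholder_contributions"
    redis = Redis.new # in GitLab this would go through the shared Redis wrappers

    # Pop up to BATCH_SIZE serialized contributions in one round trip.
    payloads = redis.lpop(key, BATCH_SIZE)
    return if payloads.nil? || payloads.empty?

    rows = payloads.map { |json| JSON.parse(json, symbolize_names: true) }

    # One multi-row INSERT instead of BATCH_SIZE individual writes.
    Import::PlaceholderContribution.insert_all(rows) # hypothetical model

    # Keep draining until the list is empty.
    self.class.perform_async(import_id) if redis.llen(key) > 0
  end
end
```

One gap in this sketch: once a batch is popped it only lives in the job arguments, so a production version would likely need to re-push failed batches or use a reliable-queue pattern (e.g. `RPOPLPUSH` into a processing list) so a failed insert doesn't lose the popped rows.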
An alternative to "Queues after each stage" could be an idempotent cron worker that runs every minute and works through the full queue. Or, when the importer "Waits until empty set to finish import", it could also queue idempotent workers, just for good measure, to help ensure everything gets processed.
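A minimal sketch of the cron variant, assuming the drain worker above is safe to run repeatedly; the cron concern, deduplication strategy and key scan are illustrative:

```ruby
require 'redis'

class DrainPlaceholderContributionsCronWorker # hypothetical worker
  include ApplicationWorker
  include CronjobQueue # scheduled via a cron entry, e.g. every minute

  idempotent!
  deduplicate :until_executed

  def perform
    redis = Redis.new
    # Find every import that still has queued contributions and drain it.
    redis.scan_each(match: 'import:*:placeholder_contributions') do |key|
      import_id = key.split(':')[1]
      LoadPlaceholderContributionsWorker.perform_async(import_id)
    end
  end
end
```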