Use threads for small project creation with GPT Data Generator

Nailia Iskhakova requested to merge 380-threads-for-project-creation into master

Speed up horizontal data generation by using the connection_pool gem: threads combined with persistent connections. More info in #380 (closed)

Running our default data config with 2500 projects against a 3k environment, we see roughly a 3x speed improvement (19 minutes 58 seconds down to 6 minutes 21 seconds) 🚀 🌟

--- Old method:
<-> Horizontal data: successfully generated after 19 minutes 58 seconds!
█ GPT data generation finished after 19 minutes 58 seconds.

--- New method (6 threads for 3k env):
<-> Horizontal data: successfully generated after 6 minutes 21 seconds!
█ GPT data generation finished after 6 minutes 21 seconds.

Changes:

  • Use a connection pool for nested group/project creation.
  • Changed the project generation logic. A subgroup is now recreated whenever its project count differs from the expected count. Previously a subgroup was recreated only when it had more projects than expected, because missing projects could be added by name. With threads we can no longer rely on project names, since they are picked out of order. For example, where subgroup_1 previously held project_1, project_2, .. project_5, it may now hold project_6, project_7, .. project_10. This is due to the nature of threads and shared variables.
  • Added 2 new environment variables to tune the connection pool size and timeout. In experiments, size=10 and timeout=60 sec were stable against both a local env and 3k.
  • Connection pool size is scaled by storage node count, on the assumption that larger environments have more nodes and can handle higher RPS.
  • Added documentation on troubleshooting thread errors and on how to tune the pool size or timeout.
  • Limited the number of concurrent threads for subgroup creation. Otherwise, specifying for example 1000 subgroups would create 1000 threads, most of which would fail with a timeout. Increasing the pool size would help, but it would put more strain on the target env and would still fail at some point. The limit is currently 50 threads, based on experiments.
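
For reviewers, here is a rough sketch of the pattern described in the list above: a fixed set of worker threads pulls subgroup names from a queue, and each request borrows a connection from a small pool. It is not the code in this MR. `TinyPool` and `each_in_threads` are hypothetical names, and `TinyPool` is a stdlib stand-in for the connection_pool gem so the snippet runs without extra dependencies. The values 10, 60 and 50 are the ones quoted above.

```ruby
require "timeout"

# Hypothetical stand-in for the connection_pool gem, built on the stdlib Queue.
class TinyPool
  def initialize(size:, timeout:)
    @timeout = timeout
    @slots = Queue.new
    # Strings stand in for persistent HTTP connections.
    size.times { |i| @slots << "conn-#{i}" }
  end

  # Borrow a connection, yield it, and always hand it back.
  # Raises Timeout::Error if no connection frees up within @timeout seconds.
  def with
    conn = Timeout.timeout(@timeout) { @slots.pop }
    begin
      yield conn
    ensure
      @slots << conn
    end
  end
end

# Run the block once per item using at most max_threads worker threads,
# instead of spawning one thread per item.
def each_in_threads(items, max_threads:, &block)
  queue = Queue.new
  items.each { |item| queue << item }
  queue.close # pop returns nil once the closed queue is drained

  workers = Array.new([max_threads, items.size].min) do
    Thread.new do
      while (item = queue.pop)
        block.call(item)
      end
    end
  end
  workers.each(&:join)
end

pool = TinyPool.new(size: 10, timeout: 60)
created = Queue.new # thread-safe collector for the names that were processed
subgroups = (1..200).map { |n| "subgroup_#{n}" }

each_in_threads(subgroups, max_threads: 50) do |name|
  # A real run would send the create-subgroup request over conn here.
  pool.with { |_conn| created << name }
end

puts "created #{created.size} subgroups"
```

With 50 workers and 10 connections, at most 10 requests are in flight at once and the remaining workers wait on the pool. That waiting is why the timeout has to be generous enough for the target environment.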

Closes #380 (closed)

Edited by Nailia Iskhakova