Geo: Increase parallelism of repo sync for cloud migration
Currently, we run a single `Geo::RepositorySyncWorker` scheduler per Geo secondary. This scheduler has a configurable maximum capacity, which defaults to 25; in close-to-ideal conditions, this lets us backfill projects at a rate of ~3,000/hr, or roughly 2.5 TB/day.

To increase our sync rate, we should add parallelism by spinning up one scheduler per shard. We can reduce the default `max_capacity` when we do this.
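As a rough sketch of what per-shard scheduling could look like (the per-shard worker name, the hardcoded shard list, and the fan-out shape below are placeholders for illustration, not the final design):

```ruby
require 'sidekiq'

module Geo
  # Sketch only: the top-level scheduler fans out one scheduling job per
  # shard instead of scheduling every project itself.
  class RepositorySyncWorker
    include Sidekiq::Worker

    def perform
      # In GitLab this would come from the configured repository storages
      # (e.g. Gitlab.config.repositories.storages.keys - an assumption here).
      shard_names = %w[default storage1 storage2]

      shard_names.each do |shard_name|
        RepositoryShardSyncWorker.perform_async(shard_name)
      end
    end
  end

  # Hypothetical per-shard scheduler: runs at most max_capacity sync jobs
  # for projects on its own shard, so even with many shards working in
  # parallel no single shard is ever overloaded.
  class RepositoryShardSyncWorker
    include Sidekiq::Worker

    def perform(shard_name)
      # ... pick up to max_capacity unsynced projects on shard_name and
      # enqueue a per-project sync job for each ...
    end
  end
end
```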
With Gitaly's `fetch_remote` feature turned on, the effect will be that the contents of each repository are streamed directly from the Gitaly server for each shard on the primary to the Gitaly server for the corresponding shard on the secondary. This should allow us to perform more syncs in parallel without overloading the primary in the process: if `max_capacity` is 1, for instance, and we have 10 shards, we get a net parallelism of 10, but no shard will ever be performing more than one backfill job at a time.
In https://gitlab.com/gitlab-com/infrastructure/issues/2381, there is some data suggesting that Geo might actually be able to issue a `git clone` on every project in GitLab.com. This is similar to the project mirror case, where we need to schedule lots of updates for many projects, but it's a bit trickier because we want to clone all of GitLab.com as fast as possible, not just 24,000 projects from external sources.
To do this, we want the following properties:
- We should be able to add worker machines to increase the rate of clones on the client side
- We want to avoid duplicated work (e.g. only issue a `git pull` once if there is an update); see the lease sketch after this list
- We want to avoid killing GitLab.com
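One way to avoid duplicated work could be a short-lived, per-project lease keyed on the project and the update we are fetching, so that only one worker issues the `git pull` for a given change. This is only a sketch under the assumption that we key on something like the last-seen remote SHA; the key names and TTL are made up for illustration.

```ruby
require 'redis'

# Sketch: a Redis-backed lease so only one worker fetches a given update.
# Key names, TTL, and keying on the remote SHA are assumptions for
# illustration, not the actual Geo implementation.
class CloneLease
  LEASE_TTL = 10 * 60 # seconds; long enough to cover a slow clone

  def initialize(redis = Redis.new)
    @redis = redis
  end

  # Returns true if we obtained the lease and should do the work.
  def try_obtain(project_id, remote_sha)
    key = "geo:clone_lease:#{project_id}:#{remote_sha}"

    # SET with NX only succeeds for the first caller; everyone else skips.
    @redis.set(key, Time.now.to_i, nx: true, ex: LEASE_TTL)
  end
end

# Usage inside a hypothetical clone worker:
#   next unless CloneLease.new.try_obtain(project_id, remote_sha)
#   ... issue the git pull / clone ...
```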
Here are some ideas on what we might want to do:

- Separate the cloning into a separate Sidekiq queue (e.g. `geo_repository_clone`)
- Measure the rate at which we are consuming the Sidekiq queue
- Have an adaptive algorithm, like TCP congestion control, that starts slow with a certain target rate, doubles the rate while it is being achieved, and backs off when the achieved rate decreases (see the sketch after this list)
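To make the last two points concrete, here is a rough sketch of a dedicated clone worker plus a TCP-style rate controller. The class names, the `geo_repository_clone` queue wiring, and all constants are assumptions for illustration; only the double-while-keeping-up / back-off-when-falling-behind idea comes from the list above.

```ruby
require 'sidekiq'

# Sketch only: a dedicated worker on its own queue, so clone load can be
# measured and throttled independently of other Geo work.
class GeoRepositoryCloneWorker
  include Sidekiq::Worker
  sidekiq_options queue: :geo_repository_clone, retry: 3

  def perform(project_id)
    # ... issue the git clone / pull for this project ...
  end
end

# Sketch of a TCP-like controller: double the target rate while we keep up,
# back off when the achieved rate falls behind. Constants are made-up
# starting points, not tuned values.
class AdaptiveCloneRate
  INITIAL_RATE = 10      # clones per scheduling tick
  MAX_RATE     = 1_000

  def initialize
    @target_rate = INITIAL_RATE
  end

  # achieved: clones actually completed since the last tick
  def next_target(achieved)
    if achieved >= @target_rate
      # We kept up: double the rate (multiplicative increase).
      @target_rate = [@target_rate * 2, MAX_RATE].min
    else
      # We fell behind: back off towards what we actually achieved.
      @target_rate = [achieved / 2, INITIAL_RATE].max
    end

    @target_rate
  end
end

# A scheduler tick could then enqueue up to the target number of clones:
#   rate = controller.next_target(completed_since_last_tick)
#   pending_project_ids.first(rate).each { |id| GeoRepositoryCloneWorker.perform_async(id) }
```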
Thoughts?
/cc: @pcarranza, @jarv, @dbalexandre, @to1ne, @brodock