# Potentially gigantic cross-database ID pluck in Geo

## Summary
https://gitlab.com/gitlab-org/gitlab-ee/blob/master/app/workers/geo/repository_sync_worker.rb#L27

```ruby
current_node.projects
  .where.not(id: Geo::ProjectRegistry.synced.pluck(:project_id))
  .order(last_repository_updated_at: :desc)
  .limit(db_retrieve_batch_size)
  .pluck(:id)
```
`current_node` is a `GeoNode`, which maps to a table in the main database, as does the `projects` relation, which boils down to `Project.all`. `Geo::ProjectRegistry` is in the separate, read-write, per-Geo-secondary database, so we have to pluck the IDs across databases here. But what happens when a couple of million projects have been synced? What sort of query do we generate?
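To illustrate the scaling problem, here is a minimal sketch in plain Ruby (no Rails required). When an Array is passed to `where.not(id: ...)`, ActiveRecord inlines every value into a `NOT IN (...)` list, so the SQL text grows linearly with the number of synced registry rows. The table/column names below just mirror the worker's query; the ID count is an arbitrary assumption for illustration:

```ruby
# Stand-in for Geo::ProjectRegistry.synced.pluck(:project_id)
# after a couple of million projects have been synced.
synced_ids = (1..2_000_000).to_a

# Approximation of the SQL text ActiveRecord would generate:
# every plucked ID gets inlined into the NOT IN list.
sql = "SELECT \"projects\".\"id\" FROM \"projects\" " \
      "WHERE \"projects\".\"id\" NOT IN (#{synced_ids.join(', ')}) " \
      "ORDER BY \"projects\".\"last_repository_updated_at\" DESC LIMIT 1000"

puts sql.bytesize  # well over 10 MB of SQL for a single batch query
```

A single statement in the tens of megabytes has to be built in Ruby, shipped over the wire, and parsed by Postgres on every run of the worker, and it keeps growing as more projects sync.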
## Steps to reproduce

Sync a few million projects, then try to sync a few more.
## What is the current bug behavior?

Postgres will probably fail with a "query too large" error of some sort.
## What is the expected correct behavior?

We need to be able to scale to several million rows in the `project_registry` table. I'm pretty sure the current approach doesn't.
## Possible fixes

Could we use `where(last_repository_updated_at: nil)` instead?
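An untested sketch of what that rewrite might look like in the worker (this assumes a `NULL` `last_repository_updated_at` reliably identifies projects that still need syncing, which would need verifying against how the column is maintained):

```ruby
# Hypothetical replacement: keep the filter entirely in the main database,
# so no cross-database ID list is built. The generated SQL stays a constant
# size no matter how many rows project_registry holds.
current_node.projects
  .where(last_repository_updated_at: nil)
  .limit(db_retrieve_batch_size)
  .pluck(:id)
```

Note the trade-off: the original query excludes exactly the registry's synced set, while this version relies on main-database state as a proxy for "not yet synced", so the two may not select the same projects in edge cases (e.g. re-sync after a failure).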