Skip to content

Optimize getRepositoryMetadata when querying a single repository

Sami Hiltunen requested to merge smh-optimize-dataloss into master

The query in getRepositoryMetadata is currently slower than it needs to be when getting metadata for a single repository. This is due to Postgres ending up reading more rows than it needs to.

When querying with just repository id, Postgres ends up fully reading both tables in the replicas subquery due to the full join. This is not necessary, as Postgres would only have to get rows related to the repository we are fetching metadata for. This happens as Postgres doesn't push down the repository_id = <id> filter down through the join. We can help Postgres here by applying the filter manually to both sides of the join.

When querying with virtual storage and relative path, Postgres ends up computing the full valid_primaries view and does not realize it could read one row from repositories table and query the view with the repository id from that row. We can again help Postgres by explicitly querying the view with the ID retrieved from the repositories table.

This commit optimizes both cases by shadowing the related tables and applying the filters there as needed. This ensures the tables in the full join and the view are never fully read/computed when we only need the records of a single repository.

Query plans:

GetRepositoryMetadata: before 2781ms, after 0.437ms
GetRepositoryMetadataByPath: before 3012ms, after 0.720ms
GetPartiallyAvailableRepositories: before 6972ms, after 7000ms

Merge request reports

Loading