Lazy array resolver
## What does this MR do?
This adds a batch loader capable of loading collections of values, rather than just single items. This comes at the cost of returning arrays rather than relations, so we cannot use it in connections (with the pagination they provide), but by batching requests we can get significant performance benefits.
## Why do we need this?
If we run the following query:
```graphql
query {
  a: project(fullPath: "x/a") {
    issues {
      nodes { reference }
    }
  }
  b: project(fullPath: "x/b") {
    issues {
      nodes { reference }
    }
  }
}
```
we will need to run a query against the issues table for each project: here twice, once for `WHERE project_id = X` and once for `WHERE project_id = Y`. But we can run these queries at the same time in a union, which reduces the number of queries we need to run. That saves us one DB request in this query, and potentially more as this scales up.
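The batching idea can be illustrated without a database. The following is a minimal plain-Ruby sketch (the class and names here are illustrative, not the actual implementation): callers register a key and get back a lazy value; the first time any value is forced, all pending keys are resolved in a single bulk fetch.

```ruby
# Minimal sketch of a batch loader for collections: callers register a key
# (here a project_id) and all pending keys are resolved in one pass.
class ArrayBatchLoader
  def initialize(&bulk_fetch)
    @bulk_fetch = bulk_fetch # takes an array of keys, returns { key => [rows] }
    @pending = []
    @results = {}
  end

  def load(key)
    @pending << key
    -> { resolve; @results.fetch(key, []) } # lazy thunk, forced on demand
  end

  private

  def resolve
    return if @pending.empty?

    @results.merge!(@bulk_fetch.call(@pending.uniq))
    @pending.clear
  end
end

# In-memory stand-in for the issues table.
ISSUES = [
  { id: 1, project_id: 1, title: 'a' },
  { id: 2, project_id: 2, title: 'b' },
  { id: 3, project_id: 1, title: 'c' }
].freeze

calls = 0
loader = ArrayBatchLoader.new do |project_ids|
  calls += 1 # one "query", however many keys were batched
  ISSUES.select { |i| project_ids.include?(i[:project_id]) }
        .group_by { |i| i[:project_id] }
end

a = loader.load(1)
b = loader.load(2)
p a.call.map { |i| i[:id] } # => [1, 3]
p b.call.map { |i| i[:id] } # => [2]
p calls                     # => 1
```

Both lookups are satisfied by a single bulk fetch, which is the property the real resolver exploits by combining the per-key queries into one SQL union.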
A more real-world example is the following, as proposed in !44703 (merged):
```graphql
query {
  currentUser {
    authoredMergeRequests(state: open) {
      headPipeline {
        jobs {
          name
          status
          duration
          trace {
            sections { name }
          }
        }
      }
    }
  }
}
```
Here we fetch the latest pipeline jobs for all open MRs of the current user. If we implement `Pipeline.jobs` using a caching array resolver, this entire query will only issue 4 (logical) queries: one for the MRs of the current user, one for the pipelines (assuming effective batch-loading), one for the jobs, and then one for the trace sections.
If implemented naively, this would be linear in the number of merge requests, and then linear in the number of trace sections for each MR. If we returned 30 merge requests, each of which has 100 jobs and each job has 5 sections (realistic and conservative numbers), the query counts would be:

| with array-resolver | without |
|---|---|
| 4 | 1 (mrs) + 1 (pipelines) + 30 (jobs) + 100 (sections) = 132 |
## What are the alternatives?
We currently already have some tools for reducing N+1 performance issues, namely preloading using lookahead. The limitations of preloading are:

- it only works for collections nested within collections (e.g. `issues -> assignees`), but not when the nodes are disjoint in the query (as with `project A -> issues` and `project B -> issues` in the example above).
- it only works for associations that Rails knows about, and the benefit disappears as soon as we start accepting arguments to filter the list, or use finders. There are some workarounds for the filtering problem, but they are somewhat ugly and we don't have general support for this pattern. We have no way at all to use preloading if we need to use finders.
## What are the limitations?
This approach is not perfect. It avoids running multiple queries, but:

- it returns arrays, not relations, which means pagination is less efficient: we always have to load the full result set. This makes the approach suitable only for small result sets. That is fine for things like `pipeline.jobs` or `trace.sections`, where the cardinality is low and the client will typically want all results rather than paging through them.
- the queries could be more efficient. If we take a union of three queries and an item is in two of them, we will load a logically identical item twice, from two distinct rows. It may be possible to improve on this.
- each branch of the union is the same query as if we were running each query separately. We get batching, and thus less network overhead and less DB connection contention, but the raw speedup is likely to be modest after taking that overhead into account.
## SQL
If we use this new abstraction to batch issue lookups as follows:
```ruby
class X < Resolvers::CachingArrayResolver
  def model_class
    ::Issue
  end

  def query_input(project_id:)
    project_id
  end

  def query_for(project_id)
    model_class.where(project_id: project_id)
  end
end
```
Use of this resolver produces queries like:
```sql
SELECT "issues".* FROM ((SELECT "issues".*, 0 AS union_member_idx FROM "issues" WHERE "issues"."project_id" = 1 LIMIT 100)
UNION
(SELECT "issues".*, 1 AS union_member_idx FROM "issues" WHERE "issues"."project_id" = 2 LIMIT 100)
UNION
(SELECT "issues".*, 2 AS union_member_idx FROM "issues" WHERE "issues"."project_id" = 3 LIMIT 100)) issues
```
Rather than three separate queries:

```sql
SELECT "issues".* FROM "issues" WHERE "issues"."project_id" = 1
```

and

```sql
SELECT "issues".* FROM "issues" WHERE "issues"."project_id" = 2
```

and

```sql
SELECT "issues".* FROM "issues" WHERE "issues"."project_id" = 3
```
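The shape of the combined query can be sketched as follows. This is illustrative string construction only, not the actual implementation (which builds the union from the `query_for` relations): each branch is tagged with its index as `union_member_idx` so rows can later be routed back to the caller that asked for them.

```ruby
# Sketch: combine per-key queries into one UNION, tagging each branch with
# union_member_idx so rows can be attributed to the key that produced them.
def union_sql(project_ids, limit: 100)
  branches = project_ids.each_with_index.map do |pid, idx|
    %((SELECT "issues".*, #{idx} AS union_member_idx ) +
      %(FROM "issues" WHERE "issues"."project_id" = #{Integer(pid)} LIMIT #{limit}))
  end

  %(SELECT "issues".* FROM (#{branches.join("\nUNION\n")}) issues)
end

puts union_sql([1, 2, 3])
```

The per-branch `LIMIT` keeps any single key from blowing up the size of the combined result set.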
An example query plan for this is as follows:
```
 HashAggregate  (cost=403.20..406.20 rows=300 width=1284) (actual time=14.265..14.269 rows=0 loops=1)
   Group Key: issues.id, issues.title, issues.author_id, issues.project_id, issues.created_at, issues.updated_at, issues.description, issues.milestone_id, issues.iid, issues.updated_by_id, issues.weight, issues.confidential, issues.moved_to_id, issues.due_date, issues.lock_version, issues.title_html, issues.description_html, issues.time_estimate, issues.relative_position, issues.service_desk_reply_to, issues.cached_markdown_version, issues.last_edited_at, issues.last_edited_by_id, issues.discussion_locked, issues.closed_at, issues.closed_by_id, issues.state_id, issues.duplicated_to_id, issues.promoted_to_epic_id, issues.health_status, issues.external_key, issues.sprint_id, issues.issue_type, issues.blocking_issues_count, (0)
   Buffers: shared hit=8 read=4
   I/O Timings: read=14.103
   ->  Append  (cost=0.56..376.95 rows=300 width=1284) (actual time=14.260..14.262 rows=0 loops=1)
         Buffers: shared hit=8 read=4
         I/O Timings: read=14.103
         ->  Limit  (cost=0.56..124.15 rows=100 width=1271) (actual time=14.201..14.202 rows=0 loops=1)
               Buffers: shared read=4
               I/O Timings: read=14.103
               ->  Index Scan using index_issues_on_project_id_and_iid on public.issues  (cost=0.56..299.64 rows=242 width=1271) (actual time=14.199..14.199 rows=0 loops=1)
                     Index Cond: (issues.project_id = 1)
                     Buffers: shared read=4
                     I/O Timings: read=14.103
         ->  Limit  (cost=0.56..124.15 rows=100 width=1271) (actual time=0.045..0.046 rows=0 loops=1)
               Buffers: shared hit=4
               ->  Index Scan using index_issues_on_project_id_and_iid on public.issues issues_1  (cost=0.56..299.64 rows=242 width=1271) (actual time=0.043..0.043 rows=0 loops=1)
                     Index Cond: (issues_1.project_id = 2)
                     Buffers: shared hit=4
         ->  Limit  (cost=0.56..124.15 rows=100 width=1271) (actual time=0.010..0.010 rows=0 loops=1)
               Buffers: shared hit=4
               ->  Index Scan using index_issues_on_project_id_and_iid on public.issues issues_2  (cost=0.56..299.64 rows=242 width=1271) (actual time=0.009..0.009 rows=0 loops=1)
                     Index Cond: (issues_2.project_id = 3)
                     Buffers: shared hit=4
```
Summary:

```
Time: 15.476 ms
  - planning: 0.936 ms
  - execution: 14.540 ms
    - I/O read: 14.103 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 8 (~64.00 KiB) from the buffer pool
  - reads: 4 (~32.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0
```
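Once the combined result set comes back, rows can be routed to the caller that requested them via the `union_member_idx` column each branch was tagged with. A minimal plain-Ruby sketch of that routing step (data and names here are illustrative):

```ruby
# Route rows from the combined UNION result set back to the original keys,
# using the union_member_idx each branch was tagged with.
keys = [1, 2, 3] # project ids, in the order their branches were added
rows = [
  { id: 10, project_id: 1, union_member_idx: 0 },
  { id: 11, project_id: 3, union_member_idx: 2 },
  { id: 12, project_id: 1, union_member_idx: 0 }
]

by_key = Hash.new { |hash, key| hash[key] = [] }
rows.each { |row| by_key[keys[row[:union_member_idx]]] << row }

p by_key[1].map { |r| r[:id] } # => [10, 12]
p by_key[2]                    # => []
p by_key[3].map { |r| r[:id] } # => [11]
```

A key with no matching rows (here key 2) simply gets an empty array, matching the array-of-values contract of the resolver.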
## Does this MR meet the acceptance criteria?
### Conformity
- [ ] Changelog entry
- [ ] Documentation (if required)
- [ ] Code review guidelines
- [ ] Merge request performance guidelines
- [ ] Style guides
- [ ] Database guides
- [ ] Separation of EE specific content
### Availability and Testing
This is an opt-in enhancement, and it includes tests for its guarantees. Since use of this abstraction will result in larger queries (that is the whole point), it is possible that we might see query timeouts. A database review is necessary.
- [ ] Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- [ ] Tested in all supported browsers
- [ ] Informed Infrastructure department of a default or new setting change, if applicable per definition of done