Optimize top-bound lineage search to use index-only scan
What does this MR do and why?
During backend pair programming (timestamp 15:54),
I demonstrated a method of reading a namespace hierarchy in a linear manner by using >=
and <
operators
on traversal_ids arrays. It was noted that this is different from the current query which uses @>
, and
I kinda glossed over it saying that they do the same thing. While both methods do have the same outcome, they
actually achieve them in different ways and I was curious which one was better. So, I tested both methods in
DB lab and compared them. >=
and <
can work with b-tree indices,
while @>
must use a gin index. When we use a b-tree index, we can perform an index-only scan which saves us a few MB of memory.
@>
) works
How the current method (The current method uses the clause WHERE traversal_ids @> '{namespace_id}'
.
This means "find all traversal_ids that contain namespace_id
".
It's equivalent to the following ruby code:
namespaces.find { |namespace| namespace.traversal_ids.include?(id) }
Given a hierarchy like the following:
id | traversal_ids
-----+---------------
108 | {108}
109 | {108,109}
110 | {108,109,110}
111 | {108,111}
112 | {108,111,112}
113 | {108,113}
114 | {108,113,114}
To find all namespaces under 108
, we look for arrays containing 108
.
To find namespaces under 109
, we look for arrays containing 109
, and so on.
This works because each namespace can only have one parent and so we don't need to worry about something like {999, 109}
appearing in the table. However, it means that we need to search the whole traveral_ids array for each row.
>= AND <
) works
How the new method (The new method uses the clause WHERE traversal_ids >= '{traversal_ids}' AND traversal_ids < {traversal_ids_with_rightmost_number_incremented}
.
Postgres knows how to compare arrays and will do so by comparing each item from left-to-right.
So, when we consider a table like this:
id | traversal_ids
-----+---------------
81 | {81}
83 | {81,83}
85 | {81,83,85}
87 | {81,83,84,87}
88 | {81,83,85,88}
84 | {81,83,84}
108 | {108}
109 | {108,109}
110 | {108,109,110}
111 | {108,111}
112 | {108,111,112}
113 | {108,113}
114 | {108,113,114}
If we SELECT * FROM namespaces WHERE traversal_ids >= '{108}' AND traversal_ids < '{109}';
Postgres will check the first element
of every array in the traversal_ids
columns and read it if it is 108 and skip it if it is not.
If we SELECT * FROM namespaces WHERE traversal_ids >= '{108, 109}' AND traversal_ids < '{108, 110}';
, then Postgres will do the following:
- Check if the first element is 108, skip the row if not
- Check if the second element is 109, skip the row if not
- Read the row
This is faster because when checking row 88, we can skip that row after checking the first element, while @>
will need to check the entire array.
Query comparisons
Description | Query | Postgres.ai link | Execution time | Buffer Reads |
---|---|---|---|---|
gitlab-org before |
SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id FROM namespaces WHERE namespaces.type = 'Group' AND traversal_ids @> '{9970}'; |
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/26298/commands/82646 | 702.698 ms (684.318 ms on I/O reads) | ~7.00 MiB |
gitlab-org after |
SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id FROM namespaces WHERE namespaces.type = 'Group' AND traversal_ids >= '{9970}' AND traversal_ids < '{9971}'; |
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/26298/commands/82648 | 27.121 ms (23.910 ms on I/O reads) | ~1.50 MiB |
gitlab-org/govern before |
SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id FROM namespaces WHERE namespaces.type = 'Group' AND traversal_ids @> '{11787569}'; |
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/26298/commands/82663 | 1.044 s (1.022 s on I/O reads) | ~2.20 MiB |
gitlab-org/govern after |
SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id FROM namespaces WHERE namespaces.type = 'Group' AND traversal_ids >= '{9970, 11787569}' AND traversal_ids < '{9970, 11787570}'; |
https://postgres.ai/console/gitlab/gitlab-production-main/sessions/26319/commands/82666 | 108.029 ms (102.110 ms on I/O reads) | ~496.00 KiB |
4249178 (customer with deep hierarchy) before | SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id FROM namespaces WHERE namespaces.type = 'Group' AND traversal_ids @> '{4249178}'; |
https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/26298/commands/82664 | 15.316 s (14.797 s on I/O reads) | ~61.50 MiB |
4249178 (customer with deep hierarchy) after | SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id FROM namespaces WHERE namespaces.type = 'Group' AND traversal_ids >= '{4249178}' AND traversal_ids < '{4249179}'; |
https://postgres.ai/console/gitlab/gitlab-production-main/sessions/26319/commands/82667 | 3.883 s (3.750 s on I/O reads) | ~11.90 MiB |
5778500 (customer with many projects) before | SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id FROM namespaces WHERE namespaces.type = 'Group' AND traversal_ids @> '{5778500}'; |
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/26298/commands/82665 | 2.123 s (2.064 s on I/O reads) | ~7.50 MiB |
5778500 (customer with many projects) after | SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id FROM namespaces WHERE namespaces.type = 'Group' AND traversal_ids >= '{5778500}' AND traversal_ids < '{5778501}'; |
https://postgres.ai/console/gitlab/gitlab-production-main/sessions/26319/commands/82669 | 487.244 ms (476.877 ms on I/O reads) | ~1.70 MiB |
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.