Backend: Update Ci::Catalog::Resource search logic to use Postgres Full Text Search
Summary
Currently, we are using Postgres ILIKE
(fuzzy_search
) with separate trigram indexes on name
and description
in the catalog_resources
table. In order to make the search functionality more performant, we should use the Postgres Full Text Search capability.
Moreover, by combining the terms name
and description
in the search vector, it will allow us to create fewer indices with visibility_level
in #429056 (closed). See #429056 (comment 1633670556) for details.
Context (#429056 (comment 1632426495)):
If you want to search both at the same time then it's best to have an index that combines them both. For supporting issue search we use a PG full text search index on
title
anddescription
with the full text search tokens calculated ahead of time and persisted in theissue_search_data
. We used a separate table for that because I think we wanted to not make theissues
table too large. In your case it can go in the same catalog table. You can see also at https://gitlab.com/gitlab-org/gitlab/-/blob/555e439d3b8e244664810d3a5c5d4db39d0fe67b/app/models/issue.rb#L140 how we define the PG search settings and how we weight title matches more highly than description matches.This kind of full text search is suited to searching human text and creates a much smaller index than a
pg_trgm
but it is not as flexible as trigrams which allow precise substring matching. Full text search will match exact tokens but if we use a well configured tokenization (probably just copyIssueSearchData
) then I think it will be good enough. This will also more closely match how Elasticsearch works when we start allowing search with this.
Proposal
Update the name+description search logic in Ci::Catalog::Resource to use the Postgres Full Text Search functionality instead of the trigram indexes with ILIKE
(fuzzy_search). Remove the existing trigram indexes on name
and description
.
Refer to PgFullTextSearchable and follow a similar approach with regards to generating the search vector and tokenizing the query.
Update (2023-11-08):
Per !136263 (comment 1639977341), we will take the approach to create a new search_vector
column on catalog_resources
(ref Postgres docs). The vector value is automatically generated by the database from the GENERATED ALWAYS AS
directive. The vector applies a greater weight to the name
term than description
. COALESCE
is used to avoid a null
value. The SQL looks like:
ALTER TABLE catalog_resources
ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS
(setweight(to_tsvector('english', COALESCE(name, '')), 'A') ||
setweight(to_tsvector('english', COALESCE(description, '')), 'B')) STORED;
With this approach, we will be able to add indices directly on search_vector
in #429056 (closed).