Backend: Update Ci::Catalog::Resource search logic to use Postgres Full Text Search

Summary

Currently, we are using Postgres ILIKE (fuzzy_search) with separate trigram indexes on name and description in the catalog_resources table. In order to make the search functionality more performant, we should use the Postgres Full Text Search capability.

Moreover, by combining the terms name and description in the search vector, it will allow us to create fewer indices with visibility_level in #429056 (closed). See #429056 (comment 1633670556) for details.

Context (#429056 (comment 1632426495)):

If you want to search both at the same time then it's best to have an index that combines them both. For supporting issue search we use a PG full text search index on title and description with the full text search tokens calculated ahead of time and persisted in the issue_search_data. We used a separate table for that because I think we wanted to not make the issues table too large. In your case it can go in the same catalog table. You can see also at https://gitlab.com/gitlab-org/gitlab/-/blob/555e439d3b8e244664810d3a5c5d4db39d0fe67b/app/models/issue.rb#L140 how we define the PG search settings and how we weight title matches more highly than description matches.

This kind of full text search is suited to searching human text and creates a much smaller index than a pg_trgm but it is not as flexible as trigrams which allow precise substring matching. Full text search will match exact tokens but if we use a well configured tokenization (probably just copy IssueSearchData) then I think it will be good enough. This will also more closely match how Elasticsearch works when we start allowing search with this.

Proposal

Update the name+description search logic in Ci::Catalog::Resource to use the Postgres Full Text Search functionality instead of the trigram indexes with ILIKE (fuzzy_search). Remove the existing trigram indexes on name and description.

Refer to PgFullTextSearchable and follow a similar approach with regards to generating the search vector and tokenizing the query.

Update (2023-11-08):

Per !136263 (comment 1639977341), we will take the approach to create a new search_vector column on catalog_resources (ref Postgres docs). The vector value is automatically generated by the database from the GENERATED ALWAYS AS directive. The vector applies a greater weight to the name term than description. COALESCE is used to avoid a null value. The SQL looks like:

      ALTER TABLE catalog_resources
        ADD COLUMN search_vector tsvector
          GENERATED ALWAYS AS
            (setweight(to_tsvector('english', COALESCE(name, '')), 'A') ||
             setweight(to_tsvector('english', COALESCE(description, '')), 'B')) STORED;

With this approach, we will be able to add indices directly on search_vector in #429056 (closed).

Edited Nov 09, 2023 by Leaminn Ma