Update full-text regex to strip out strings with @
What does this MR do and why?
Fixes timeouts seen in gitlab-com/gl-infra/production#6528 (closed)
The exact query that timed out is:
INSERT INTO issue_search_data (project_id, issue_id, search_vector, created_at, updated_at)
SELECT
project_id,
id,
setweight(to_tsvector('english', LEFT(title, 255)), 'A') || setweight(to_tsvector('english', LEFT(REGEXP_REPLACE(description, '[A-Za-z0-9+/]{50,}', ' ', 'g'), 1048576)), 'B'),
NOW(),
NOW()
FROM issues
WHERE issues.id IN (SELECT "issues"."id" FROM "issues" WHERE "issues"."id" BETWEEN 89570056 AND 89599068 AND "issues"."id" >= 89590728 AND "issues"."id" < 89590948)
ON CONFLICT DO NOTHING
It's actually the to_tsvector
part that's slow and can be reproduced with:
SELECT to_tsvector('english', REPEAT('@t1', 5000));
It gets much slower as the string gets longer. More investigation details in https://gitlab.slack.com/archives/C3NBYFJ6N/p1647415206447629
We haven't really figured out why it is slow but we suspect it may be related to parsing of emails because it only happens with a combination of @
, letters, and numbers.
This won't affect indexing of email addresses because those would include a .
and that won't match the regex. And to_tsvector
is actually fast when there's a period.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.