Standard postgres search frequently times out on GitLab.com
Spun out of #33040 (closed)
We see more than 3,478 QueryCanceled
timeout errors on SearchController#count
and 3,152 QueryCanceled
timeout errors on SearchController#show
every week on GitLab.com.
Timeouts on GitLab.com: https://log.gitlab.net/goto/1463be0a2554f2df293ea3c70267ca58
Examples of timeouts that we're seeing:
SELECT COUNT(*) FROM (SELECT 1 AS one FROM "issues" WHERE "issues"."confidential" = FALSE AND "issues"."project_id" = REDACTED AND ("issues"."title" ILIKE '%REDACTED%' AND "issues"."title" ILIKE '%source%' AND "issues"."title" ILIKE '%REDACTED%' OR "issues"."description" ILIKE '%REDACTED%' AND "issues"."description" ILIKE '%REDACTED%' AND "issues"."description" ILIKE '%REDACTED%') AND "issues"."project_id" IN (SELECT "projects"."id" FROM "projects" WHERE "projects"."id" = REDACTED) LIMIT 100) subquery_for_count
SELECT COUNT(*) FROM (SELECT 1 AS one FROM "issues" INNER JOIN "projects" ON "projects"."id" = "issues"."project_id" LEFT JOIN project_features ON projects.id = project_features.project_id WHERE "issues"."confidential" = FALSE AND (EXISTS (SELECT 1 FROM "project_authorizations" WHERE "project_authorizations"."user_id" = REDACTED AND (project_authorizations.project_id = projects.id) AND (project_authorizations.access_level >= 10)) OR projects.visibility_level IN (10,20)) AND ("project_features"."issues_access_level" IS NULL OR "project_features"."issues_access_level" IN (20,30) OR ("project_features"."issues_access_level" = 10 AND EXISTS (SELECT 1 FROM "project_authorizations" WHERE "project_authorizations"."user_id" = REDACTED AND (project_authorizations.project_id = projects.id) AND (project_authorizations.access_level >= 10)))) AND ("issues"."title" ILIKE '%REDACTED%' OR "issues"."description" ILIKE '%REDACTED%') LIMIT 100) subquery_for_count
SELECT COUNT(*) FROM (SELECT 1 AS one FROM "issues" INNER JOIN "projects" ON "projects"."id" = "issues"."project_id" LEFT JOIN project_features ON projects.id = project_features.project_id WHERE "issues"."confidential" = FALSE AND "projects"."namespace_id" = REDACTED AND (EXISTS (SELECT 1 FROM "project_authorizations" WHERE "project_authorizations"."user_id" = REDACTED AND (project_authorizations.project_id = projects.id) AND (project_authorizations.access_level >= 10)) OR projects.visibility_level IN (10,20)) AND ("project_features"."issues_access_level" IS NULL OR "project_features"."issues_access_level" IN (20,30) OR ("project_features"."issues_access_level" = 10 AND EXISTS (SELECT 1 FROM "project_authorizations" WHERE "project_authorizations"."user_id" = REDACTED AND (project_authorizations.project_id = projects.id) AND (project_authorizations.access_level >= 10)))) AND ("issues"."title" ILIKE '%REDACTED%' AND "issues"."title" ILIKE '%REDACTED%' OR "issues"."description" ILIKE '%REDACTED%' AND "issues"."description" ILIKE '%REDACTED%') AND "issues"."project_id" IN (SELECT "projects"."id" FROM "projects" INNER JOIN routes rs ON rs.source_id = projects.id AND rs.source_type = 'Project' WHERE (EXISTS (SELECT 1 FROM "project_authorizations" WHERE "project_authorizations"."user_id" = REDACTED AND (project_authorizations.project_id = projects.id)) OR projects.visibility_level IN (10,20)) AND "projects"."archived" = FALSE AND (rs.path LIKE 'REDACTED%')) LIMIT 100) subquery_for_count
As @stanhu points out, &153 (closed) is the best solution. I have two minor concerns about this:
- Is this affecting other sites that don't run ElasticSearch? (I'm not sure if this is common or not)
- Its not clear to me what the delivery timeframes of &153 (closed) are. Is it possible that theres a small fix for this issue which would mitigate the problem in the mean time?