Advanced search: Remove prefixed asterisk for path filters
What does this MR do and why?
Background
We are seeing very slow queries (often around 6s) when doing a code search with an additional path filter (for example, return path:lib/api).
The reason is that we use a wildcard to match the path: { wildcard: *<query>* }. The Elasticsearch docs warn against wildcard queries that start with * or ? because of the performance implications.
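To make the difference concrete, the two query shapes can be sketched as plain request-body fragments. This is only an illustration: the field name blob.path and the helper build_path_filter are assumptions made for the example and are not taken from this MR.

```python
def build_path_filter(query, leading_wildcard=False, field="blob.path"):
    """Return an Elasticsearch-style wildcard clause for a path filter.

    `field` is a hypothetical field name used for illustration only.
    """
    # Old behaviour: wildcard on both sides. New behaviour: trailing wildcard only.
    pattern = f"*{query}*" if leading_wildcard else f"{query}*"
    return {"wildcard": {field: {"value": pattern}}}


before = build_path_filter("lib/api", leading_wildcard=True)
after = build_path_filter("lib/api")
print(before)
print(after)
```

The only difference between the two clauses is the leading asterisk in the pattern, which is the part the Elasticsearch docs warn about.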
Fix
Remove the prefixed asterisk so that only the suffix of the path is a wildcard. Because we use the Path Hierarchy Tokenizer, which splits the path according to the delimiter (/), the term can be anywhere after a slash in the path and does not have to be at the start of the file's path.
Examples
- For query lib/api, results with the path */lib/api* will still be returned (including spec/lib/api, for example).
- For query ib/api, results with the path */lib/api* will not be returned but would have previously.
- For query lib/ap, results with the path */lib/api* will still be returned (including spec/lib/api, for example).
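The three examples above can be checked with a small simulation. This is a rough sketch, not the actual analyzer: it assumes the indexed path is tokenized into every suffix that starts right after a delimiter (plus the full path), and it uses Python's fnmatchcase as a stand-in for wildcard matching against a whole term. The file path and the helper names suffix_tokens and matches are hypothetical.

```python
from fnmatch import fnmatchcase


def suffix_tokens(path, delimiter="/"):
    """Emit the full path and every suffix that starts after a delimiter.

    Assumption: this approximates how the path is tokenized at index time.
    """
    parts = path.split(delimiter)
    return [delimiter.join(parts[i:]) for i in range(len(parts))]


def matches(query, path, leading_wildcard=False):
    """Return True if any token of `path` matches the wildcard pattern."""
    pattern = f"*{query}*" if leading_wildcard else f"{query}*"
    return any(fnmatchcase(token, pattern) for token in suffix_tokens(path))


path = "spec/lib/api/entities.rb"  # hypothetical file path
for query in ("lib/api", "ib/api", "lib/ap"):
    old = matches(query, path, leading_wildcard=True)
    new = matches(query, path)
    print(f"{query}: before={old} after={new}")
```

Under these assumptions, only the query ib/api changes behaviour: it matched with the leading asterisk and no longer matches without it, because no token starts with ib/api.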
Result
Comparing the search profiles in production before and after the change, the new query is roughly 470 times faster at finding the matching paths.
Search profile before: 20.5s
Search profile after: 43.9ms
How to set up and validate locally
- Ensure Elasticsearch is running
- Check out master
- Do a search with the path filter, e.g. /search?scope=blobs&search=path%3Aapi%2Fentities
- Check this branch out
- Do a search with the path filter and verify that results are as expected, e.g. /search?scope=blobs&search=path%3Aapi%2Fentities
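For trying other path filters during validation, the search URL can be built rather than hand-encoded. A minimal sketch, assuming the same scope and search parameters as the example URL above; the helper name search_url is hypothetical.

```python
from urllib.parse import urlencode


def search_url(path_filter, base="/search"):
    """Build a blob search URL with a path: filter, percent-encoding ':' and '/'."""
    params = {"scope": "blobs", "search": f"path:{path_filter}"}
    return f"{base}?{urlencode(params)}"


print(search_url("api/entities"))
```

Running the same URL on master and on this branch gives the two result sets to compare.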
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #390177 (closed)