[EE] Support unlimited file search in web UI and API
What does this MR do?
This MR removes the limit of 100 used for blob/wiki blob searches.
Because filename and content search is done through a gitaly request which returns all matches anyway, applying a limit of 100 is not very effective (most of the time is spent doing the content search on the gitaly side) and introduces significant disadvantages to search usage:
- at most `100` matches can be returned both in the web UI and the API - this is quite limiting, especially for the API
- if additional filters are used (e.g. `path:...`), these are applied to the limited first `100` results, which may yield an incomplete (or even empty) set of matches
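
The filter problem described above can be shown with a small, self-contained sketch. This is not the GitLab implementation: `Match`, `filter_by_path`, `LIMIT` and the sample data are hypothetical names introduced only for illustration. It compares truncating the matches first and filtering afterwards with filtering the full set of matches.

```ruby
# Hypothetical stand-in for a search match; not a GitLab class.
Match = Struct.new(:filename, :line)

# 150 matches under app/ followed by 5 under doc/ (made-up sample data).
matches = Array.new(150) { |i| Match.new("app/file_#{i}.rb", "test") } +
          Array.new(5)   { |i| Match.new("doc/page_#{i}.md", "test") }

# Hypothetical path filter, similar in spirit to a `path:` search filter.
def filter_by_path(matches, fragment)
  matches.select { |m| m.filename.include?(fragment) }
end

LIMIT = 100

# Limit first, then filter: the doc/ matches are cut off before the filter runs.
limited_then_filtered = filter_by_path(matches.first(LIMIT), "doc/")

# Filter the full set of matches: all doc/ matches survive.
filtered_all = filter_by_path(matches, "doc/")

puts limited_then_filtered.size # 0
puts filtered_all.size          # 5
```

With the limit applied first, the search reports zero matches even though five exist, which is the "incomplete (or even zero) set of matches" case.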
Changes in this MR:
- removes the limit of `100` and paginates all matches
- removes sorting of filename and content matches together - filename matches are now listed first, then content matches (sorting is already done on the gitaly side)
- applies filters to all results (not only a subset of results); the "binary" utf string is used in filters now (running `utf_encode` on all results is too expensive)
- `FoundBlob` class is moved into a separate file and extended; specifically, fetching and parsing is done lazily - only when some attribute is actually requested - which allows us to use `FoundBlob` for a non-paginated array of matches
- instead of returning an array of tuples `[blob.filename, blob]`, only `blob` is returned now - there is no reason to pass the tuple
- this change is specific to non-elasticsearch search - elasticsearch doesn't use this code
Performance impact
This change adds a relatively small penalty to the search time. The main cost is that a new instance of `FoundBlob` is now initialized for each match, and filters (if used in the search string, which I think is not that common) are applied to all matches. This overhead is marginal for thousands of matches. For big sets of matches, the overhead is still acceptable relative to the time spent by `grep`.
Below are statistics measured on the `linux` repository when tens of thousands and hundreds of thousands of matches are returned.
Time spent in `FileFinder.find` - most of the performance-related changes were made in this method:
search string | w/o MR | with MR |
---|---|---|
test (45 000 matches from grep) | 3.57 | 3.69 |
test Documentation (45 000 matches from grep) | 3.54 | 3.86 |
ab (414 000 matches from grep) | 6.0 | 6.6 |
ab Documentation (414 000 matches from grep) | 6.2 | 8.66 |
Overall request time:
search string | w/o MR | with MR |
---|---|---|
test (45 000 matches from grep) | 4137ms | 4635ms |
test Documentation (45 000 matches from grep) | 8765ms | 9417ms |
ab (414 000 matches from grep) | 6324ms | 7589ms |
ab Documentation (414 000 matches from grep) | 11350ms | 14027ms |
The large 5s penalty for requests which use `...path` in the search string is unrelated to this MR - in this case the commit count takes much longer both with and w/o the MR.
What are the relevant issue numbers?
Closes https://gitlab.com/gitlab-org/gitlab-ce/issues/45915
Does this MR meet the acceptance criteria?
- Changelog entry added, if necessary
- Documentation created/updated
- Tests added for this feature/bug
- Conforms to the code review guidelines
- Conforms to the merge request performance guidelines
- Conforms to the style guides
- Conforms to the database guides
- Link to e2e tests MR added if this MR has Requires e2e tests label. See the Test Planning Process.
- EE specific content should be in the top level `/ee` folder
- For a paid feature, have we considered GitLab.com plans, how it works for groups, and is there a design for promoting it to users who aren't on the correct plan?
- Security reports checked/validated by reviewer