Improve performance of `Sbom::DependenciesFinder` when sorting and filtering
This issue has been extracted from this discussion.
`Sbom::DependenciesFinder` is currently used:
- at the group level by `dependencies_controller`
- at the project level by `dependencies_resolver`
## TL;DR
In general, all the queries are just a bit heavier than the existing ones, but most of them might be acceptable as they still execute between 100 ms and 350 ms in Postgres.ai, which I believe corresponds to our desired 100 ms in gprd.
One exception is the query that sorts and filters `go`, `pipenv`, and `yarn` all together. This one jumped from 350 ms to 500 ms, and it already buffers 1 GB (100 MB more than before). When querying for just one source, it's much faster. Would limiting to one filter only be an option, at least for the near future?
In the long run, none of these queries performs really well as of today. We should investigate what else we could do to improve them, but I think this should be handled on a case-by-case basis with follow-up issues.
These queries concern me. I see the change is behind a disabled feature flag, so we should roll it out very carefully.
I'm also asking other maintainers for ideas on how we could optimize it, as it's not clear to me.
## More details
There was a confirmation (internal) that we had multiple incidents with Postgres.ai in the last 3 days, which explains some of the very high query times you saw. It seems better now, so I've retested the queries:
🆗 **No sorting and no filter**: It looks a bit worse than the old query. I think the time consumption of 238 ms in Postgres.ai is acceptable, as it should run faster on gprd. It had an increase of about 40 MB of buffer usage. It's something to keep an eye on, but I think we can pass this one.
🆗 **Sorting and no filter**: The new query adds about 100 MB of buffers, and it is already at about 500 MB. But the increase in time was only about 20 ms, executing in a total of 291.704 ms. So, once again, like the previous case, we're slowly adding more load to something that already does not perform that well. I think this is also in an acceptable range, given these are Postgres.ai values, but I'd understand if another maintainer rejected it, as it already feels like a gray area.
🤔 **Sorting and filter (a bit concerning)**: Here I think we were comparing different queries in the MR description. The ids tested for the existing query are actually for `go`, `pipenv`, and `yarn`, and they are an incomplete list, as there were only about 30 ids in the example. There are actually 581 ids for these sources:
```sql
gitlabhq_dblab=# SELECT count("sbom_sources"."id") FROM "sbom_sources" WHERE (source->'package_manager'->>'name' IN ('go', 'pipenv', 'yarn')) limit 100;
count|581
```
But in the new query we were filtering by `bundler`, which has about 40 source ids.
As we see in the comparison between the existing and the new query, they perform basically the same for just `bundler`, but the new version buffers more. I think this is expected, as the ids from `sbom_sources` were probably calculated in Rails memory due to `pluck` being called, and now I think we're sending that work to the DB with the `select(:id)` approach (a sketch of the difference follows the comparison below). So, regarding this, it should be OK. Still, we should look into the considerable increase in the new query when searching by 3 sources:
- Existing query (`bundler`): 30 MB buffers, 40 ms
- New query (`bundler`): 150 MB buffers, 40 ms
- Existing query (`go`, `pipenv`, `yarn`): 900 MB buffers, 330 ms
- New query (`go`, `pipenv`, `yarn`): 1 GB buffers, 500 ms
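
To illustrate the `pluck` vs `select(:id)` point above, here is a minimal sketch of the two approaches, assuming `Sbom::Source` and `Sbom::Occurrence` ActiveRecord models over the tables discussed in this issue (the real finder code may differ):

```ruby
# Sketch only: model and column names are assumptions based on the tables
# discussed above, not the actual finder implementation.

package_managers = %w[go pipenv yarn]

# pluck(:id) runs the sbom_sources query immediately and loads all matching
# ids (581 of them for these three sources) into Rails memory; the id list is
# then interpolated into the occurrences query.
source_ids = Sbom::Source
  .where("source -> 'package_manager' ->> 'name' IN (?)", package_managers)
  .pluck(:id)
Sbom::Occurrence.where(source_id: source_ids)

# select(:id) keeps the sources query lazy; ActiveRecord sends it to PostgreSQL
# as a subquery, so the id list never round-trips through Rails, but the
# database now does the extra work (and buffering) itself.
source_ids_relation = Sbom::Source
  .where("source -> 'package_manager' ->> 'name' IN (?)", package_managers)
  .select(:id)
Sbom::Occurrence.where(source_id: source_ids_relation)
```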
## Implementation Plan
- database: Add a new column called `package_manager` (varchar) to the `sbom_occurrences` table (a hedged migration sketch follows this list).
- backend: Update `Sbom::Ingestion::Tasks::IngestOccurrences` to save the `package_manager` while creating the records.
- database: Implement a background migration to set the `package_manager` for already existing occurrence records (see the backfill sketch below).
- database: Create the necessary index to support the filter-by-package-manager query.
- backend: Change the query to use `sbom_occurrences.package_manager` instead of joining to the `sbom_sources` table (see the finder sketch below).
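
A hedged sketch of the database steps (new column and supporting index). The class names, migration base class version, varchar limit, and index columns are all assumptions to be settled during implementation and database review:

```ruby
# Regular migration adding the denormalized column (sketch; names assumed).
class AddPackageManagerToSbomOccurrences < Gitlab::Database::Migration[2.1]
  def change
    add_column :sbom_occurrences, :package_manager, :string, limit: 255
  end
end

# Separate migration creating the index that supports filtering by package
# manager; add_concurrent_index requires running outside a DDL transaction.
class IndexSbomOccurrencesOnPackageManager < Gitlab::Database::Migration[2.1]
  disable_ddl_transaction!

  INDEX_NAME = 'index_sbom_occurrences_on_project_id_and_package_manager'

  def up
    add_concurrent_index :sbom_occurrences, [:project_id, :package_manager], name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :sbom_occurrences, INDEX_NAME
  end
end
```

Whether the index should be scoped per project (as sketched) or combined with the sort columns depends on the final shape of the finder query.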
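
For the backfill of existing occurrence records, a batched background migration along these lines could work. This is a sketch assuming GitLab's `BatchedMigrationJob` pattern; the job name, feature category, and SQL are illustrative only:

```ruby
# Sketch of a batched background migration that copies the package manager
# name from the sbom_sources JSONB payload into the new column.
module Gitlab
  module BackgroundMigration
    class BackfillSbomOccurrencesPackageManager < BatchedMigrationJob
      operation_name :backfill_sbom_occurrences_package_manager
      feature_category :dependency_management

      def perform
        each_sub_batch do |sub_batch|
          connection.execute(<<~SQL)
            UPDATE sbom_occurrences
            SET package_manager = sbom_sources.source -> 'package_manager' ->> 'name'
            FROM sbom_sources
            WHERE sbom_sources.id = sbom_occurrences.source_id
              AND sbom_occurrences.id IN (#{sub_batch.select(:id).to_sql})
          SQL
        end
      end
    end
  end
end
```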
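
Finally, the finder change itself would replace the join-based filter with the denormalized column. The method and parameter names below are assumptions about `Sbom::DependenciesFinder`'s internals, not its actual interface:

```ruby
# Sketch of the filtering step inside Sbom::DependenciesFinder (hypothetical
# method and param names).
def filter_by_package_managers(collection)
  return collection if params[:package_managers].blank?

  # Before (roughly): join to sbom_sources and filter on the JSONB payload,
  # which is what makes the multi-source query buffer ~1 GB.
  #
  #   collection.joins(:source)
  #             .where("sbom_sources.source -> 'package_manager' ->> 'name' IN (?)",
  #                    params[:package_managers])
  #
  # After: filter on the indexed sbom_occurrences.package_manager column.
  collection.where(package_manager: params[:package_managers])
end
```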