Improve performance of `Sbom::DependenciesFinder` when sorting and filtering
This issue has been extracted from this discussion.
`Sbom::DependenciesFinder` is currently used:
- at the group level by `dependencies_controller`
- at the project level by `dependencies_resolver`
## TL;DR
In general, all the queries are just a bit heavier than the existing ones, but most of them might be acceptable as they still execute between 100 ms and 350 ms in Postgres.ai, which I believe corresponds to our desired 100 ms in gprd.
One exception is the query that sorts and filters `go`, `pipenv`, and `yarn` all together. This one jumped from 350 ms to 500 ms, and it already buffers 1 GB (100 MB more than before). When querying for just one source, it's much faster. Would limiting to one filter only be an option, at least for the near future?
In the long run, none of these queries performs really well as of today. We should investigate what else we could do to improve them, but I think this should be handled on a case-by-case basis with follow-up issues.
These queries concern me. I see the change is behind a disabled feature flag, so we should roll it out very carefully.
I'm also asking other maintainers for ideas on how we could optimize it, as it's not clear to me.
## More details
There was a confirmation (internal) that we had multiple incidents with Postgres.ai in the last 3 days, which explains some of the very high query times you saw. It seems better now, so I've retested the queries:
🆗 **No sorting and no filter**: It looks a bit worse than the old query. I think the time consumption of 238 ms in Postgres.ai is acceptable, as it should run faster on gprd. It had an increase of about 40 MB of buffer usage. It's something to keep an eye on, but I think we can pass this one.
🆗 **Sorting and no filter**: The new query adds about 100 MB of buffers, and it is already at about 500 MB. But the increase in time was only about 20 ms, executing in a total of 291.704 ms. So, once again, like the previous case, we're slowly adding more load to something that already does not perform that well. I think this is also in an acceptable range, given these are Postgres.ai values, but I'd understand if another maintainer rejected it, as it already feels like a gray area.
🤔 **Sorting and filter (a bit concerning)**: Here I think we were comparing different queries in the MR description. The ids tested for the existing query are actually for `go`, `pipenv`, and `yarn`, and they are an incomplete list, as there were only about 30 ids in the example. There are actually 581 ids for these sources:
```sql
gitlabhq_dblab=# SELECT count("sbom_sources"."id") FROM "sbom_sources" WHERE (source->'package_manager'->>'name' IN ('go', 'pipenv', 'yarn')) limit 100;
count|581
```
But in the new query we were filtering by `bundler`, which has about 40 source ids.
As we see in the comparison between the existing and the new query, they perform basically the same for just `bundler`, but the new version buffers more. I think this is expected, as the ids from `sbom_sources` were probably calculated in Rails memory due to `pluck` being called, and now I think we're sending that work to the DB with the `select(:id)` approach (a sketch of the difference follows the comparison below). So, regarding this, it should be OK. Still, we should look into the considerable increase in the new query when searching by 3 sources:
- Existing query (`bundler`): 30 MB buffers, 40 ms
- New query (`bundler`): 150 MB buffers, 40 ms
- Existing query (`go`, `pipenv`, `yarn`): 900 MB buffers, 330 ms
- New query (`go`, `pipenv`, `yarn`): 1 GB buffers, 500 ms
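
To illustrate the `pluck` vs `select(:id)` point above, here is a minimal sketch of the two approaches, assuming `Sbom::Source` and `Sbom::Occurrence` ActiveRecord models over the tables discussed in this issue (the real finder code may differ):

```ruby
# Sketch only: model and column names are assumptions based on the tables
# discussed above, not the actual finder implementation.

package_managers = %w[go pipenv yarn]

# pluck(:id) runs the sbom_sources query immediately and loads all matching
# ids (581 of them for these three sources) into Rails memory; the id list is
# then interpolated into the occurrences query.
source_ids = Sbom::Source
  .where("source -> 'package_manager' ->> 'name' IN (?)", package_managers)
  .pluck(:id)
Sbom::Occurrence.where(source_id: source_ids)

# select(:id) keeps the sources query lazy; ActiveRecord sends it to PostgreSQL
# as a subquery, so the id list never round-trips through Rails, but the
# database now does the extra work (and buffering) itself.
source_ids_relation = Sbom::Source
  .where("source -> 'package_manager' ->> 'name' IN (?)", package_managers)
  .select(:id)
Sbom::Occurrence.where(source_id: source_ids_relation)
```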
## Implementation Plan
- database: Add a new column called `package_manager` (varchar) to the `sbom_occurrences` table (a hedged migration sketch follows this list).
- backend: Update `Sbom::Ingestion::Tasks::IngestOccurrences` to save the `package_manager` while creating the records.
- database: Implement a background migration to set the `package_manager` for already existing occurrence records (see the backfill sketch below).
- database: Create the necessary index to support the filter-by-package-manager query.
- backend: Change the query to use `sbom_occurrences.package_manager` instead of joining to the `sbom_sources` table (see the finder sketch below).
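
A hedged sketch of the database steps (new column and supporting index). The class names, migration base class version, varchar limit, and index columns are all assumptions to be settled during implementation and database review:

```ruby
# Regular migration adding the denormalized column (sketch; names assumed).
class AddPackageManagerToSbomOccurrences < Gitlab::Database::Migration[2.1]
  def change
    add_column :sbom_occurrences, :package_manager, :string, limit: 255
  end
end

# Separate migration creating the index that supports filtering by package
# manager; add_concurrent_index requires running outside a DDL transaction.
class IndexSbomOccurrencesOnPackageManager < Gitlab::Database::Migration[2.1]
  disable_ddl_transaction!

  INDEX_NAME = 'index_sbom_occurrences_on_project_id_and_package_manager'

  def up
    add_concurrent_index :sbom_occurrences, [:project_id, :package_manager], name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :sbom_occurrences, INDEX_NAME
  end
end
```

Whether the index should be scoped per project (as sketched) or combined with the sort columns depends on the final shape of the finder query.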
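
For the backfill of existing occurrence records, a batched background migration along these lines could work. This is a sketch assuming GitLab's `BatchedMigrationJob` pattern; the job name, feature category, and SQL are illustrative only:

```ruby
# Sketch of a batched background migration that copies the package manager
# name from the sbom_sources JSONB payload into the new column.
module Gitlab
  module BackgroundMigration
    class BackfillSbomOccurrencesPackageManager < BatchedMigrationJob
      operation_name :backfill_sbom_occurrences_package_manager
      feature_category :dependency_management

      def perform
        each_sub_batch do |sub_batch|
          connection.execute(<<~SQL)
            UPDATE sbom_occurrences
            SET package_manager = sbom_sources.source -> 'package_manager' ->> 'name'
            FROM sbom_sources
            WHERE sbom_sources.id = sbom_occurrences.source_id
              AND sbom_occurrences.id IN (#{sub_batch.select(:id).to_sql})
          SQL
        end
      end
    end
  end
end
```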
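
Finally, the finder change itself would replace the join-based filter with the denormalized column. The method and parameter names below are assumptions about `Sbom::DependenciesFinder`'s internals, not its actual interface:

```ruby
# Sketch of the filtering step inside Sbom::DependenciesFinder (hypothetical
# method and param names).
def filter_by_package_managers(collection)
  return collection if params[:package_managers].blank?

  # Before (roughly): join to sbom_sources and filter on the JSONB payload,
  # which is what makes the multi-source query buffer ~1 GB.
  #
  #   collection.joins(:source)
  #             .where("sbom_sources.source -> 'package_manager' ->> 'name' IN (?)",
  #                    params[:package_managers])
  #
  # After: filter on the indexed sbom_occurrences.package_manager column.
  collection.where(package_manager: params[:package_managers])
end
```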