Enhance how we report multiple findings in one blob

Overview

There is a possibility multiple secrets are found in the same blob when scanning secrets during pre-receive secret detection. The current implementation of the secrets push check is unlikely to populate findings other than the first one with their commit/path occurrences due to the use of find to select the finding that matches the current tree entry when looping through tree entries of a certain commit.

Fortunately, the implementation gracefully degrades to displaying the blob id and line number, so a blob with multiple secrets detected in different lines would not fail to be reported, but will have an output similar to the example below:

remote: Secret leaked in commit: e746c402000ed932a7d8b81870f271fa559fc659
remote:   -- .env:1 | GitLab Personal Access Token
remote:
remote: Secret leaked in blob: fe29d93da4843da433e62711ace82db601eb4f8f
remote:   -- line:2 | GitLab Personal Access Token
remote:
remote: Secret leaked in blob: fe29d93da4843da433e62711ace82db601eb4f8f
remote:   -- line:3 | GitLab Personal Access Token

However, this is not an ideal user experience, and needs to be fixed so the following is displayed instead:

remote: Secret leaked in commit: e746c402000ed932a7d8b81870f271fa559fc659
remote:   -- .env:1 | GitLab Personal Access Token
remote:
remote: Secret leaked in commit: e746c402000ed932a7d8b81870f271fa559fc659
remote:   -- .env:2 | GitLab Personal Access Token
remote:
remote: Secret leaked in commit: e746c402000ed932a7d8b81870f271fa559fc659
remote:   -- .env:3 | GitLab Personal Access Token

Original Discussion

The following discussion from !138790 (merged) should be addressed:

@samihiltunen started a discussion: (+3 comments)

@ahmed.hemdan if I understand correctly, we may report multiple findings for a single blob if the blob contains multiple secrets.

find here would return the first finding that has the matching blob ID and we'd set the occurrence in it. If there was a second finding with the same blob ID, we wouldn't populate the commit and the path in it.

If there were multiple secretes in a single blob, we should probably collapse them together to report all of the secrets in a given file together instead of as separate entries. This could make it easier for the user to read the report.

If we don't have one yet, it would be good to have a test covering the case when there are multiple secrets in a single blob.

Not a functional issue but we could improve slightly by building a hash map of 'blob_id -> finding' ahead of looping the revisions for their tree entries. That way we wouldn't have to search for the correct finding in the list every iteration. The list of findings should likely be generally fairly small though.

Implementation Plan

Update secrets push check to build a hash map of blob_id -> finding ahead of querying tree entries.
Use the hash map while populating commit/path occurrences of a finding.
Add a test to cover the case when there are multiple secrets in a single blob.

Edited Dec 21, 2023 by Ahmed Hemdan