Metric: Scope of metrics for Prompt Library
- Overview
- Summary of the methodology
- Similarity Score
- The Partial Matching Problem
- Additional metrics to better measure matching quality
Overview
We currently use a similarity score to compare a model prediction against a block of developer-written code. While the similarity score is a good general indicator of code suggestion quality, it suffers from the partial matching problem.
Summary of the methodology
Details here: gitlab-org/modelops/applied-ml/code-suggestions&13 (closed)
In this issue, we aim to find metrics that overcome the partial matching problem and thus better evaluate code suggestions in the Prompt Library.
Similarity Score
The diagram above illustrates how the similarity score is computed. At its core, a pretrained embedding model converts any block of text into a fixed-length vector; the similarity is then the dot product of the two L2-normalized vectors. We use cosine similarity, with Vertex AI's text-embedding-gecko model providing the embeddings.
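As a minimal sketch of that computation (not the production code): `embed` below is a hypothetical hashed bag-of-words stand-in for the text-embedding-gecko call, used only so the example runs offline.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Hypothetical stand-in for the text-embedding-gecko call:
    a hashed bag-of-words that yields a fixed-length vector."""
    vec = np.zeros(dim)
    for token in text.split():
        vec[hash(token) % dim] += 1.0
    return vec

def similarity_score(prediction: str, reference: str) -> float:
    """Cosine similarity: dot product of the two L2-normalized vectors."""
    a, b = embed(prediction), embed(reference)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(similarity_score("return a + b", "return a + b"))  # ~1.0 for identical text
```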
The Partial Matching Problem
The similarity score can only measure similarity by treating each block of text as a whole, so it returns a low score when the lengths of the two blocks differ significantly. In other words, when the two blocks of text match only partially, the similarity score will be low. This makes the similarity score misleading for code suggestions, where a partial match can still be a high-quality match (a toy demonstration follows the examples below):
- The suggested code successfully matches the remainder of the function, but carries on to suggest a new function. In this case, we should still consider the suggestion high quality.
- The suggested code is shorter than the developer-written code, but the suggested content is a near-perfect match. The suggestion can be short for many reasons: model output token limits, the developer-written code containing lots of logging or comments, etc.
- The suggested code is shorter than the developer-written code, but the developer-written code is longer only because there is some logging, printing, or comments in between the actual functional code. The functional code is a near-perfect match.
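Here is the toy demonstration of the effect, using the same hypothetical stand-in embedding as in the sketch above, so the exact number is illustrative only: every functional line of the suggestion matches, yet the whole-block score drops noticeably below 1.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Same hypothetical hashed bag-of-words stand-in as in the sketch above.
    vec = np.zeros(dim)
    for token in text.split():
        vec[hash(token) % dim] += 1.0
    return vec

def similarity_score(prediction: str, reference: str) -> float:
    a, b = embed(prediction), embed(reference)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The suggestion reproduces the functional code exactly ...
suggestion = "total = sum(values)\nreturn total / len(values)"
# ... but the developer-written block interleaves logging with it.
reference = (
    "total = sum(values)\n"
    "logging.debug('summing %d values', len(values))\n"
    "return total / len(values)"
)
print(round(similarity_score(suggestion, reference), 3))  # noticeably below 1.0
```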
Additional metrics to better measure matching quality
Cross Similarity Score
We propose an improved version of the similarity score called the "cross similarity score". The algorithm to compute it is:
- Compute the Cross Similarity Matrix (CSM) between the suggested code and the developer-written code.
- Compute the singular values of the CSM: $\{\sigma_0, \sigma_1, \dots, \sigma_m\}$, where $m = \mathrm{rank}(\mathrm{CSM})$.
- Compute the singular values of the perfect match CSM. A perfect match CSM is the CSM between two blocks of identical text. Let $\mathrm{CSM}'$ be the perfect match CSM; we compute $\{\sigma'_0, \sigma'_1, \dots, \sigma'_n\}$, where $n = \mathrm{rank}(\mathrm{CSM}')$.
- For both sets of singular values, compute the ratio of the first singular value over the sum of all singular values: $r = \frac{\sigma_0}{\sum_{i=0}^{m} \sigma_i}$ and $r' = \frac{\sigma'_0}{\sum_{i=0}^{n} \sigma'_i}$.
- The final cross similarity score is then computed as $\frac{1-r}{1-r'}$.
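The sketch below walks through these steps end to end. It is not the implementation referenced in the next section, and it makes two assumptions this description leaves open: the CSM is built line by line (entry $(i, j)$ is the cosine similarity between line $i$ of the suggestion and line $j$ of the reference), and the perfect match CSM compares the reference against itself. All function names here are illustrative, and `embed` is again a hypothetical stand-in for the production embedding model.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Hypothetical stand-in for the production embedding model,
    # normalized so a plain dot product is a cosine similarity.
    vec = np.zeros(dim)
    for token in text.split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def cross_similarity_matrix(block_a: str, block_b: str) -> np.ndarray:
    # Assumption: CSM[i, j] = cosine similarity between line i of block_a
    # and line j of block_b; the granularity is not spelled out above.
    rows = [embed(line) for line in block_a.splitlines() if line.strip()]
    cols = [embed(line) for line in block_b.splitlines() if line.strip()]
    return np.array([[float(np.dot(r, c)) for c in cols] for r in rows])

def spectral_ratio(matrix: np.ndarray) -> float:
    # Ratio of the first singular value over the sum of all singular values.
    sigma = np.linalg.svd(matrix, compute_uv=False)  # descending order
    sigma = sigma[sigma > 1e-12]  # keep the rank(CSM) non-zero values
    return float(sigma[0] / sigma.sum())

def cross_similarity_score(suggestion: str, reference: str) -> float:
    r = spectral_ratio(cross_similarity_matrix(suggestion, reference))
    # Assumption: the perfect match CSM compares the reference to itself.
    r_perfect = spectral_ratio(cross_similarity_matrix(reference, reference))
    return (1 - r) / (1 - r_perfect)  # degenerate if the reference CSM has rank 1
```

Under these assumptions, an identical suggestion gives $r = r'$ and hence a score of exactly 1.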
Implementation
The implementation of this approach is in !33.