Replace exact match with a Jaccard-type metric when cleaning up model reflection
Problem to solve
The current implementation of the algorithm that cleans up model reflection, !366 (merged), relies entirely on an exact-match similarity metric. The drawback of this metric is that we cannot clean snippets when the model reflects the prompt with a small variation, such as an appended counter:
Example:
Prompt1:
The name of this file is test.js
Model reflection:
The name of this file is test-1.js
We can think of both lines as the same since the only difference is the filename. Another example:
Prompt2:
def hello_world():
....
Model reflection:
def hello_world_1():
....
Proposal
Consider implementing a Jaccard-type similarity metric in a separate package, which we can then use to upgrade the algorithm. The function takes two lists as input and returns a floating-point value between 0 and 1. We probably need to account for repeated elements when calculating the metric. This can be done by converting each input list into a list of tuples, where the first element is the number of times the value appears and the second element is the value itself.
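A minimal sketch of what such a function could look like in Python, assuming the inputs are lists of hashable tokens (for example, whitespace-split words). The names `tag_occurrences` and `jaccard_similarity` are hypothetical. Tagging each element with its running occurrence number is only one possible reading of the tuple idea; it is used here so that repeated elements affect the score.

```python
from collections import Counter
from typing import Hashable, Sequence, Set, Tuple


def tag_occurrences(items: Sequence[Hashable]) -> Set[Tuple[int, Hashable]]:
    """Turn a list into a set of (occurrence_number, value) tuples.

    Assumption: ["a", "a", "b"] becomes {(1, "a"), (2, "a"), (1, "b")},
    so repeated values are kept as distinct set members.
    """
    seen: Counter = Counter()
    tagged = set()
    for item in items:
        seen[item] += 1
        tagged.add((seen[item], item))
    return tagged


def jaccard_similarity(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Return |A ∩ B| / |A ∪ B| over the tagged sets, a value in [0, 1]."""
    set_a, set_b = tag_occurrences(a), tag_occurrences(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty inputs are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Example from the problem statement, tokenized on whitespace.
prompt = "The name of this file is test.js".split()
reflection = "The name of this file is test-1.js".split()
score = jaccard_similarity(prompt, reflection)  # 6 shared tokens out of 8
```

With whitespace tokenization, the first example scores 0.75 where exact match would simply report a mismatch. The cleanup algorithm would still need a threshold above which two lines count as the same; choosing that value, and the tokenization itself, is left open here.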