Update the fix-broken-pipeline evaluation for Duo Workflow
What does this merge request do and why?
Completed experiment - https://smith.langchain.com/o/477de7ad-583e-47b6-a1c4-c4a0300e7aca/datasets/db675a33-ddbc-4add-b531-da9ce616afb8
Calculated metrics:
- Recall (0-1):
  - Measures the fraction of expected files that were correctly identified.
  - Range: 0 to 1
  - Higher values are better.
  - Indicates how well the system identifies the files that need modification.
- Jaccard (0-1):
  - Measures the similarity between the expected and actual file sets.
  - Range: 0 to 1
  - Higher values are better.
  - Stricter than Recall: files that were modified but not expected lower the score.
  - Useful for comparing the overlap between predicted and actual file sets.
- Equal-proba:
  - An LLM judge's estimate of the probability that the actual patch solves the given task.
  - Range: 0 to 100, in steps of 10 (i.e., 0, 10, 20, ..., 90, 100)
  - Higher values indicate a higher estimated probability of success.
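The two file-set metrics above can be illustrated with a minimal sketch. This is not the evaluation code from this merge request: the function names, the handling of empty sets, and the example file paths are assumptions made for illustration only.

```python
def recall(expected: set[str], actual: set[str]) -> float:
    """Fraction of expected files that appear among the actually modified files."""
    if not expected:
        return 1.0  # assumption: nothing expected means nothing was missed
    return len(expected & actual) / len(expected)


def jaccard(expected: set[str], actual: set[str]) -> float:
    """Intersection over union of the expected and actual file sets."""
    union = expected | actual
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(expected & actual) / len(union)


# Hypothetical example: both expected files were modified, plus two extra ones.
expected = {".gitlab-ci.yml", "app/main.py"}
actual = {".gitlab-ci.yml", "app/main.py", "README.md", "setup.cfg"}

print(recall(expected, actual))   # 1.0
print(jaccard(expected, actual))  # 0.5
```

In this example Recall stays at 1.0 because every expected file was touched, while Jaccard drops to 0.5 because of the two extra files. This is the gap that the Jaccard step of the analysis is meant to surface.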
How to set up and validate locally
poetry run eli5 duo-workflow evaluate fix-broken-pipeline <path to the predictions generated by https://gitlab.com/gitlab-org/duo-workflow/testing/duo-workflow-tests/-/tree/main?ref_type=heads>
Example file: results.jsonl
Approach to analyze metrics
stateDiagram
classDef event_accept fill:green,color:white,stroke:green
irrelevant_files: too many irrelevant files updated
recall_accept: subset of actual files updated
recall: Recall
class recall_accept event_accept
jaccard: Jaccard Similarity
jaccard_accept: subset of actual files updated
class jaccard_accept event_accept
llm_proba: Equal-proba LLM judge
llm_proba_accept: patches (code changes) look more similar than different
llm_proba_low: patches (code changes) look different
class llm_proba_accept event_accept
[*] --> recall
recall --> irrelevant_files: low value
recall --> recall_accept: acceptable value
irrelevant_files --> [*]: check LLM judge reasoning for details
recall_accept --> jaccard: clarify Recall
jaccard --> irrelevant_files: significantly lower than Recall
jaccard --> jaccard_accept: almost the same or equal
jaccard_accept --> llm_proba: clarify code change quality
llm_proba --> llm_proba_accept: probability >= 75% (approx.)
llm_proba --> llm_proba_low: probability < 75% (approx.)
llm_proba_accept --> [*]: check LLM judge reasoning for what's missing
llm_proba_low --> [*]: check LLM judge reasoning for details
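The flow in the diagram above could be sketched as a small triage function. Only the 75% cut-off (approximate) comes from the diagram; `min_recall` and `max_gap` are hypothetical thresholds standing in for "low value" and "significantly lower than Recall", which this description does not quantify. The returned strings are the state names used in the diagram.

```python
def triage(
    recall: float,
    jaccard: float,
    equal_proba: float,
    min_recall: float = 0.8,        # hypothetical: what counts as an acceptable Recall
    max_gap: float = 0.2,           # hypothetical: "significantly lower than Recall"
    proba_threshold: float = 75.0,  # approximate cut-off taken from the diagram
) -> str:
    """Return the diagram state that a single prediction ends up in."""
    # Step 1: Recall. A low value ends the analysis early.
    if recall < min_recall:
        return "irrelevant_files"
    # Step 2: Jaccard clarifies Recall. A large gap points to extra files.
    if recall - jaccard > max_gap:
        return "irrelevant_files"
    # Step 3: the Equal-proba LLM judge clarifies code change quality.
    if equal_proba >= proba_threshold:
        return "llm_proba_accept"
    return "llm_proba_low"


print(triage(recall=1.0, jaccard=0.5, equal_proba=90))  # irrelevant_files
print(triage(recall=1.0, jaccard=0.9, equal_proba=80))  # llm_proba_accept
print(triage(recall=1.0, jaccard=1.0, equal_proba=40))  # llm_proba_low
```

Every end state still calls for reading the LLM judge reasoning, as the diagram notes; the function only narrows down what to look for.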
Merge request checklist
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.
Edited by Alexander Chueshev