Update the fix-broken-pipeline evaluation for Duo Workflow (!109) · Merge requests · GitLab.org / AI Powered / ELI5

Alexander Chueshev requested to merge ac/dw-evaluation into main Sep 06, 2024

What does this merge request do and why?

Completed experiment - https://smith.langchain.com/o/477de7ad-583e-47b6-a1c4-c4a0300e7aca/datasets/db675a33-ddbc-4add-b531-da9ce616afb8

Calculated metrics:

Recall (0-1):
- Measures the fraction of expected files that were correctly identified.
- Range: 0 to 1
- Higher values are better.
- Indicates how well the system identifies the files that need modification.

Jaccard (0-1):
- Measures the similarity between the expected and actual file sets (size of their intersection divided by size of their union).
- Range: 0 to 1
- Higher values are better.
- Stricter than Recall: modifying files outside the expected set lowers the score.
- Useful for checking how closely the actual file set overlaps the expected one.

Equal-proba (0-100):
- An estimate by an LLM judge of the probability of solving the given task with the actual patch.
- Range: 0 to 100, with a step of 10 (i.e., 0, 10, 20, ..., 90, 100)
- Higher values indicate a higher estimated probability of success.
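
To make the difference between Recall and Jaccard concrete, here is a minimal sketch using the standard set-based definitions. The function names, the handling of empty sets, and the file names are assumptions for illustration; they are not taken from the ELI5 code in this merge request.

```python
def recall(expected: set[str], actual: set[str]) -> float:
    """Fraction of expected files that also appear in the actual set."""
    if not expected:
        return 1.0  # assumption: nothing expected counts as fully recalled
    return len(expected & actual) / len(expected)


def jaccard(expected: set[str], actual: set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = expected | actual
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(expected & actual) / len(union)


# Hypothetical example: both expected files were touched, plus two extra ones.
expected = {".gitlab-ci.yml", "src/app.py"}
actual = {".gitlab-ci.yml", "src/app.py", "README.md", "docs/setup.md"}

print(recall(expected, actual))   # 1.0
print(jaccard(expected, actual))  # 0.5
```

In this example Recall stays at 1.0 because every expected file was modified, while Jaccard drops to 0.5 because of the two extra files.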

How to set up and validate locally

poetry run eli5 duo-workflow evaluate fix-broken-pipeline <path to the predictions generated by https://gitlab.com/gitlab-org/duo-workflow/testing/duo-workflow-tests/-/tree/main?ref_type=heads>

Example file: results.jsonl

Approach to analyze metrics

stateDiagram
    classDef event_accept fill:green,color:white,stroke:green


    irrelevant_files: too many irrelevant files updated
    recall_accept: subset of actual files updated
    recall: Recall

    class recall_accept event_accept

    jaccard: Jaccard Similarity
    jaccard_accept: actual files closely match expected files

    class jaccard_accept event_accept

    llm_proba: Equal-proba LLM judge
    llm_proba_accept: patches (code changes) look more similar than different
    llm_proba_low: patches (code changes) look different

    class llm_proba_accept event_accept

    [*] --> recall
    recall --> irrelevant_files: low value
    recall --> recall_accept: acceptable value
    irrelevant_files --> [*]: check LLM judge reasoning for details

    recall_accept --> jaccard: clarify Recall
    jaccard --> irrelevant_files: significantly lower than Recall
    jaccard --> jaccard_accept: almost the same or equal

    jaccard_accept --> llm_proba: clarify code change quality
    llm_proba --> llm_proba_accept: probability >= 75% (approx.)
    llm_proba --> llm_proba_low: probability < 75% (approx.)
    llm_proba_accept --> [*]: check LLM judge reasoning for what's missing
    llm_proba_low --> [*]: check LLM judge reasoning for details
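
The flow in the diagram above can be sketched as a small triage function. Only the approximate 75% cutoff for the LLM judge comes from the diagram. The `recall_min` and `max_gap` thresholds are hypothetical placeholders for "low value" and "significantly lower than Recall", which the diagram does not quantify, and the function itself is illustrative rather than part of ELI5.

```python
def triage(
    recall: float,
    jaccard: float,
    equal_proba: int,
    recall_min: float = 0.8,  # hypothetical cutoff for a "low" Recall
    max_gap: float = 0.2,     # hypothetical "significantly lower than Recall" margin
    proba_min: int = 75,      # approximate cutoff taken from the diagram
) -> str:
    """Return the name of the diagram state a single result ends up in."""
    # Step 1: Recall. A low value goes straight to the irrelevant_files state.
    if recall < recall_min:
        return "irrelevant_files"
    # Step 2: Jaccard clarifies Recall. A large gap also means irrelevant_files.
    if recall - jaccard > max_gap:
        return "irrelevant_files"
    # Step 3: the LLM judge clarifies the quality of the code change.
    if equal_proba >= proba_min:
        return "llm_proba_accept"
    return "llm_proba_low"


print(triage(recall=1.0, jaccard=0.5, equal_proba=90))  # irrelevant_files
print(triage(recall=1.0, jaccard=0.9, equal_proba=80))  # llm_proba_accept
print(triage(recall=1.0, jaccard=1.0, equal_proba=70))  # llm_proba_low
```

Whatever state a result ends in, the diagram still recommends reading the LLM judge's reasoning for details.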

Merge request checklist

Tests added for new functionality. If not, please raise an issue to follow up.
Documentation added/updated, if needed.

Edited Sep 18, 2024 by Alexander Chueshev
