Update the fix-broken-pipeline evaluation for Duo Workflow

Alexander Chueshev requested to merge ac/dw-evaluation into main

What does this merge request do and why?

Completed experiment - https://smith.langchain.com/o/477de7ad-583e-47b6-a1c4-c4a0300e7aca/datasets/db675a33-ddbc-4add-b531-da9ce616afb8

Calculated metrics:

  • Recall (0-1):
    • Measures the fraction of expected files that appear among the files actually modified.
    • Range: 0 to 1
    • Higher values are better.
    • Indicates how well the system identifies the files that need modification.
  • Jaccard (0-1):
    • Measures the similarity between the expected and actual file sets.
    • Range: 0 to 1
    • Higher values are better.
    • Stricter than Recall: modified files outside the expected set also lower the score.
    • Useful for comparing the overlap between the expected and actual file sets.
  • Equal-proba:
    • An estimate by an LLM judge of the probability of solving the given task with the actual patch.
    • Range: 0 to 100, with a step of 10 (i.e., 0, 10, 20, ..., 90, 100)
    • Higher values indicate a higher estimated probability of success.
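
As a rough illustration of the two file-set metrics above, here is a minimal Python sketch. It is not the implementation in this merge request: the function names, the example file names, and the handling of empty sets are assumptions made for the example only.

```python
def recall(expected: set[str], actual: set[str]) -> float:
    """Fraction of expected files that appear in the actual (modified) set."""
    if not expected:
        # Assumption: with nothing expected, nothing can be missed.
        return 1.0
    return len(expected & actual) / len(expected)


def jaccard(expected: set[str], actual: set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = expected | actual
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(expected & actual) / len(union)


# Hypothetical file sets, for illustration only.
expected_files = {".gitlab-ci.yml", "Gemfile"}
actual_files = {".gitlab-ci.yml", "Gemfile", "README.md", "app.rb"}

recall_score = recall(expected_files, actual_files)    # 2 / 2 = 1.0
jaccard_score = jaccard(expected_files, actual_files)  # 2 / 4 = 0.5
```

In this example every expected file was modified, so Recall is perfect, but the two extra files halve the Jaccard score. That gap is what makes Jaccard the stricter of the two metrics.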

How to set up and validate locally

poetry run eli5 duo-workflow evaluate fix-broken-pipeline <path to the predictions generated by https://gitlab.com/gitlab-org/duo-workflow/testing/duo-workflow-tests/-/tree/main?ref_type=heads>

Example file: results.jsonl

Approach to analyze metrics

stateDiagram
    classDef event_accept fill:green,color:white,stroke:green


    irrelevant_files: too many irrelevant files updated
    recall_accept: most expected files were updated
    recall: Recall

    class recall_accept event_accept

    jaccard: Jaccard Similarity
    jaccard_accept: few or no unexpected files updated

    class jaccard_accept event_accept

    llm_proba: Equal-proba LLM judge
    llm_proba_accept: patches (code changes) look more similar than different
    llm_proba_low: patches (code changes) look different

    class llm_proba_accept event_accept

    [*] --> recall
    recall --> irrelevant_files: low value
    recall --> recall_accept: acceptable value
    irrelevant_files --> [*]: check LLM judge reasoning for details

    recall_accept --> jaccard: clarify Recall
    jaccard --> irrelevant_files: significantly lower than Recall
    jaccard --> jaccard_accept: almost the same or equal

    jaccard_accept --> llm_proba: clarify code change quality
    llm_proba --> llm_proba_accept: probability >= 75% (approx.)
    llm_proba --> llm_proba_low: probability < 75% (approx.)
    llm_proba_accept --> [*]: check LLM judge reasoning for what's missing
    llm_proba_low --> [*]: check LLM judge reasoning for details
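
The decision flow in the diagram can also be sketched as a small Python function. This is only an illustration of the reading order, not part of the evaluation code: the function name `triage` and the `recall_min` and `max_gap` defaults are hypothetical, since the diagram does not define what counts as an "acceptable" Recall or a "significantly lower" Jaccard. Only the approximate 75% threshold comes from the diagram.

```python
def triage(
    recall: float,
    jaccard: float,
    equal_proba: int,
    recall_min: float = 0.8,  # hypothetical cut-off for an "acceptable" Recall
    max_gap: float = 0.2,     # hypothetical limit for "significantly lower than Recall"
    proba_min: int = 75,      # approximate threshold taken from the diagram
) -> str:
    """Return the name of the terminal diagram state for one evaluated example."""
    # Step 1: a low Recall routes straight to irrelevant_files.
    if recall < recall_min:
        return "irrelevant_files"
    # Step 2: a Jaccard score well below Recall also routes to irrelevant_files.
    if recall - jaccard > max_gap:
        return "irrelevant_files"
    # Step 3: the LLM judge estimate separates similar patches from different ones.
    if equal_proba >= proba_min:
        return "llm_proba_accept"
    return "llm_proba_low"


low_recall = triage(recall=0.5, jaccard=0.5, equal_proba=100)
extra_files = triage(recall=1.0, jaccard=0.5, equal_proba=90)
similar_patch = triage(recall=1.0, jaccard=0.9, equal_proba=80)
different_patch = triage(recall=1.0, jaccard=1.0, equal_proba=70)
```

Whichever state is reached, the diagram recommends reading the LLM judge reasoning next, so the returned state mainly tells you what to look for in that reasoning.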

Merge request checklist

  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.
Edited by Alexander Chueshev
