
Implement a CLI command to evaluate Duo Workflow fix-broken-pipeline with LLM judge

Alexander Chueshev requested to merge ac/duo-workflow-llm-judge into main

What does this merge request do and why?

This MR implements a Command Line Interface (CLI) command to evaluate the Duo Workflow "fix-broken-pipeline" using an LLM judge. This implementation serves as a foundation for improving evaluation approaches in the ELI5 project.

Note: this MR requires the work done in gitlab-org/duo-workflow/duo-workflow-service!35 (closed)

How to set up and validate locally

  1. Check out this merge request's branch.
  2. Update your .env file (you can skip DEEPSEEK_API_TOKEN and MISTRAL_API_KEY).
  3. Install dependencies.
    poetry install
  4. Run help.
    poetry run eli5 duo-workflow --help
    poetry run eli5 duo-workflow evaluate-fix-broken-pipeline --help
  5. Run the evaluation.
    poetry run eli5 duo-workflow evaluate-fix-broken-pipeline datasets/duo_workflow/fix-broken-pipeline-v1 --dataset=duo_workflow.fix-broken-pipeline.1
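
The invocation in step 5 has a nested shape: a command group (duo-workflow), a command (evaluate-fix-broken-pipeline), one positional path, and a --dataset option. The following is only a rough sketch of how such a shape could be parsed, written with the standard-library argparse module. The CLI framework this MR actually uses is not stated here, and the names build_parser and predictions are hypothetical, chosen for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the command shape from step 5.
    # This is NOT the ELI5 implementation, only an illustration.
    parser = argparse.ArgumentParser(prog="eli5")
    groups = parser.add_subparsers(dest="group", required=True)

    # Command group: `eli5 duo-workflow ...`
    duo_workflow = groups.add_parser("duo-workflow")
    commands = duo_workflow.add_subparsers(dest="command", required=True)

    # Command: `eli5 duo-workflow evaluate-fix-broken-pipeline ...`
    evaluate = commands.add_parser("evaluate-fix-broken-pipeline")
    # Assumption: the positional argument is the path to the predictions.
    evaluate.add_argument("predictions", help="path to the predictions to evaluate")
    evaluate.add_argument("--dataset", required=True, help="dataset identifier")
    return parser


# Parse the same arguments that step 5 passes on the command line.
args = build_parser().parse_args(
    [
        "duo-workflow",
        "evaluate-fix-broken-pipeline",
        "datasets/duo_workflow/fix-broken-pipeline-v1",
        "--dataset=duo_workflow.fix-broken-pipeline.1",
    ]
)
print(args.group, args.command, args.predictions, args.dataset)
```

Whether the positional argument really names the predictions location is an assumption drawn from the note about externally generated predictions; the --help output from step 4 is the authoritative description.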

Note: This command accepts predictions generated outside of the ELI5 project (see gitlab-org/duo-workflow/duo-workflow-service!35 (closed)). The datasets/duo_workflow/fix-broken-pipeline-v1 dataset is used here for demonstration purposes only. Because this run compares the dataset against itself, all evaluation scores should be 1.
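
The self-comparison check described in the note can be illustrated with a small sketch. Everything in it is hypothetical: stub_judge is an exact-match stand-in for the LLM judge (the real judge calls a model and is not shown in this MR description), and the example strings are made up. It only demonstrates why identical predictions and references are expected to score 1 across the board.

```python
from typing import List


def stub_judge(prediction: str, reference: str) -> int:
    # Hypothetical stand-in for the LLM judge: exact match instead of a model call.
    return 1 if prediction.strip() == reference.strip() else 0


def evaluate(predictions: List[str], references: List[str]) -> List[int]:
    # Score each prediction against the reference at the same index.
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    return [stub_judge(p, r) for p, r in zip(predictions, references)]


# Made-up examples, used as both predictions and references.
examples = [
    "Pin the Ruby version in .gitlab-ci.yml",
    "Add the missing test dependency",
]

# Comparing the data against itself: every score should be 1.
scores = evaluate(examples, examples)
print(scores)
```

A score below 1 in such a run would point at a problem in the evaluation setup itself rather than in the predictions.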

Merge request checklist

  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.
