Implement a CLI command to evaluate Duo Workflow fix-broken-pipeline with LLM judge
What does this merge request do and why?
This MR adds a CLI command that evaluates Duo Workflow "fix-broken-pipeline" predictions using an LLM judge. The command serves as a foundation for improving evaluation approaches in the ELI5 project.
Note: this MR requires the work done in gitlab-org/duo-workflow/duo-workflow-service!35 (closed)
How to set up and validate locally
- Check out this merge request's branch.
- Update your .env file (you can skip DEEPSEEK_API_TOKEN and MISTRAL_API_KEY)
- Install dependencies.
poetry install
- Run help.
poetry run eli5 duo-workflow --help
poetry run eli5 duo-workflow evaluate-fix-broken-pipeline --help
- Run the evaluation.
poetry run eli5 duo-workflow evaluate-fix-broken-pipeline datasets/duo_workflow/fix-broken-pipeline-v1 --dataset=duo_workflow.fix-broken-pipeline.1
Note: This command accepts predictions generated outside of the ELI5 project (see gitlab-org/duo-workflow/duo-workflow-service!35 (closed)). We use the datasets/duo_workflow/fix-broken-pipeline-v1 dataset for demonstration purposes only. All evaluation scores should be 1 because we are comparing the dataset against itself.
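The self-comparison check can be pictured with a minimal sketch. It is not the code in this MR: `exact_match_judge` is a hypothetical stand-in for the LLM judge, and the record fields and sample strings are invented for illustration. It only shows why scoring a dataset against itself is expected to yield 1 for every example.

```python
from typing import Callable

# Hypothetical stand-in for the LLM judge (assumption, not the MR's code):
# returns 1 when the prediction matches the reference exactly, otherwise 0.
def exact_match_judge(prediction: str, reference: str) -> int:
    return 1 if prediction.strip() == reference.strip() else 0


def evaluate(examples: list[dict], judge: Callable[[str, str], int]) -> list[int]:
    """Score each example's prediction against its reference."""
    return [judge(ex["prediction"], ex["reference"]) for ex in examples]


# Invented sample records. In the self-comparison case, each prediction
# is simply a copy of its reference.
references = [
    "Pin the Ruby version in .gitlab-ci.yml",
    "Add the missing test stage",
]
examples = [{"prediction": ref, "reference": ref} for ref in references]

scores = evaluate(examples, exact_match_judge)
print(scores)
```

With predictions generated elsewhere, a score below 1 would point to a real difference from the reference. A score below 1 in the self-comparison run would instead suggest a problem in the evaluation setup.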
Merge request checklist
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.