Run SWE benchmark as part of ELI5
What does this merge request do and why?
This MR adds support for ELI5 to run the SWE benchmark as part of the duo-workflow evaluate swe command.
Ref: #43
How to set up and validate locally
- Check out this merge request's branch.
- Update your .env file.
- Install dependencies:
poetry install --with swebench
- Check the existing command ELI5 provides:
poetry run eli5 duo-workflow evaluate swe --help
- Run the SWE benchmark with custom evaluators:
poetry run eli5 duo-workflow evaluate swe results.jsonl --split=base --run-swe-benchmark
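Before running the command above, it can help to confirm that results.jsonl parses as JSON Lines. The following is a minimal sketch using only the Python standard library. The helper name count_jsonl_records is hypothetical and not part of ELI5, and the sketch assumes nothing about the fields inside each record.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Return the number of JSON records in a JSON Lines file.

    Hypothetical helper, not part of ELI5. Raises ValueError naming
    the first line that is not valid JSON.
    """
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines between records
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}: line {line_number} is not valid JSON: {exc}"
                ) from exc
            count += 1
    return count
```

Calling count_jsonl_records("results.jsonl") from the directory that holds the file returns the record count, or raises an error that points at the first malformed line.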
Note:
- DW results that can be used to check this MR - results.jsonl
- Additional instructions for Mac M1 users - #43 (comment 2164440717)
- If you experience issues with Docker-related Python code, try updating your DOCKER_HOST environment variable. For example: DOCKER_HOST=unix:///Users/ac/.colima/default/docker.sock
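To narrow down Docker connection problems like the one in the last note, a quick check of what DOCKER_HOST points at may be useful. This is a rough sketch using only the Python standard library. The helper name describe_docker_host is hypothetical, and the sketch only inspects unix:// socket paths, it does not try to talk to the Docker daemon.

```python
import os
import stat
from urllib.parse import urlparse


def describe_docker_host(value):
    """Return a short status string for a DOCKER_HOST value.

    Hypothetical helper for local debugging. Only unix:// URLs are
    inspected; other schemes are reported but not checked.
    """
    if not value:
        return "DOCKER_HOST is not set"
    parsed = urlparse(value)
    if parsed.scheme != "unix":
        return f"non-unix scheme '{parsed.scheme}', not checked"
    try:
        mode = os.stat(parsed.path).st_mode
    except OSError:
        return f"socket path does not exist: {parsed.path}"
    if not stat.S_ISSOCK(mode):
        return f"path exists but is not a socket: {parsed.path}"
    return f"socket found: {parsed.path}"
```

For example, print(describe_docker_host(os.environ.get("DOCKER_HOST"))) reports whether the variable is set and whether the socket file it names exists on the local machine.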
Merge request checklist
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.
Edited by Alexander Chueshev