Evaluate Duo Chat on a document-related QA dataset
What does this merge request do and why?
Dataset collection
This MR collects a new, expanded dataset for evaluating Duo Chat on document-related QA tasks, using the following steps:
- Read markdown doc files from the specified directory.
- Filter and process the content of each file.
- Skip files whose content is empty.
- Split the content into sections based on "## " (Header 2) markers.
- Ensure there are at least 3 sections with Header 2.
- Verify that each section has at least 500 characters.
- Use the Anthropic Claude model to generate questions, answers, and relevant context for each file.
- Write the generated data to the output JSONL file.
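The filtering and output steps above could look roughly like the following Python sketch. This is a hypothetical illustration, not the MR's code. The function names, treating "markdown doc files" as *.md files, dropping any text before the first "## " header, counting the header line toward the 500-character minimum, and the generate callable (a stand-in for the Claude call, whose prompt and output fields are not shown here) are all assumptions.

```python
import json
import re
from pathlib import Path
from typing import Callable

MIN_SECTIONS = 3         # at least 3 "## " (Header 2) sections per file
MIN_SECTION_CHARS = 500  # each section needs at least 500 characters

# "## " only at the start of a line; "### " does not match.
H2_SPLIT = re.compile(r"^## ", flags=re.MULTILINE)


def split_h2_sections(content: str) -> list[str]:
    """Split markdown into sections starting at each '## ' header.

    Assumption: text before the first '## ' header is dropped.
    """
    parts = H2_SPLIT.split(content)
    return ["## " + part for part in parts[1:]]


def is_eligible(content: str) -> bool:
    """Non-empty, at least 3 H2 sections, each at least 500 characters."""
    if not content.strip():
        return False
    sections = split_h2_sections(content)
    return len(sections) >= MIN_SECTIONS and all(
        len(section) >= MIN_SECTION_CHARS for section in sections
    )


def collect(docs_dir: Path, output: Path,
            generate: Callable[[str], dict]) -> int:
    """Write one JSONL record per eligible markdown file; return the count.

    `generate` is a hypothetical stand-in for the Claude call that produces
    the questions, answers, and context for a file.
    """
    written = 0
    with output.open("w", encoding="utf-8") as out:
        for path in sorted(docs_dir.rglob("*.md")):
            content = path.read_text(encoding="utf-8")
            if not is_eligible(content):
                continue
            record = generate(content)
            out.write(json.dumps(record) + "\n")
            written += 1
    return written
```

Keeping the eligibility check as a pure function of the file content makes the filters easy to test without calling the model.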
Evaluation
This MR evaluates the accuracy of Duo Chat's answers (predictions) against references, using an LLM as the judge. Given the question (input) and the provided context (reference), the judge LLM scores the accuracy of the prediction from 1 to 4, where 1 means fully inaccurate and 4 means fully accurate.
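A minimal sketch of this LLM-as-judge scoring in Python, assuming the judge is any callable that takes a prompt string and returns the model's text reply. The prompt wording, the "Score: N" reply format, and the names JUDGE_PROMPT, parse_score, and judge_accuracy are hypothetical; only the inputs (question, reference context, prediction) and the 1-to-4 scale come from the description above.

```python
import re
from typing import Callable

# Hypothetical judge prompt; only the inputs and the 1-4 scale come from the MR.
JUDGE_PROMPT = """You are grading an answer to a question about GitLab documentation.

Question:
{question}

Reference context:
{reference}

Answer to grade:
{prediction}

Using only the reference context, rate the accuracy of the answer on a scale
from 1 to 4, where 1 means fully inaccurate and 4 means fully accurate.
End your reply with a line of the form "Score: N"."""

SCORE_PATTERN = re.compile(r"Score:\s*([1-4])\b")


def parse_score(reply: str) -> int:
    """Extract the last 'Score: N' (N in 1..4) from the judge's reply."""
    matches = SCORE_PATTERN.findall(reply)
    if not matches:
        raise ValueError(f"no 1-4 score found in judge reply: {reply!r}")
    return int(matches[-1])


def judge_accuracy(question: str, reference: str, prediction: str,
                   llm: Callable[[str], str]) -> int:
    """Ask the judge LLM to score the prediction against the reference."""
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, prediction=prediction
    )
    return parse_score(llm(prompt))
```

Taking the last match lets the judge reason in its reply before committing to a final score.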
How to set up and validate locally
- Check out this merge request's branch.
- Update the .env file with the required variables.
- Install dependencies:
    mise install # or use asdf
    poetry install
- Check the commands ELI5 provides:
    poetry run eli5 --help
    poetry run eli5 duo-chat --help
- Collect the dataset:
    poetry run eli5 duo-chat collect --help
    poetry run eli5 duo-chat collect cot-qa-docs --help
    poetry run eli5 duo-chat collect cot-qa-docs <PATH_TO_CLONED_GITLAB_DOCS, e.g., gitlab/docs> --output=<PATH_TO_OUTPUT_FILE, e.g., dataset.jsonl>
- Upload the dataset to LangSmith. Note: the dataset is already uploaded to LangSmith (look for duo_chat.cot_qa_docs.1). Please don't run this command unnecessarily: we share the prod and dev instances, and the command can cause unexpected collisions. I'm already working on a fix.
    poetry run eli5 datasets create --help
    poetry run eli5 datasets create duo_chat.cot_qa_docs.1 <PATH_TO_GENERATED_DATASET>
- Run the evaluation:
    poetry run eli5 duo-chat evaluate docs --help
    poetry run eli5 duo-chat evaluate docs
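Because uploading should be avoided unless necessary, it can help to sanity-check the generated JSONL file locally first. The Python sketch below is hypothetical and not part of ELI5: it assumes only that each non-empty line is a JSON object, and counts how many records contain each top-level field so that missing or inconsistent fields stand out.

```python
import json
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    """Parse a JSONL file into a list of objects, skipping blank lines."""
    records: list[dict] = []
    with path.open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines between records
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError(f"line {line_no}: expected a JSON object")
            records.append(record)
    return records


def summarize(records: list[dict]) -> dict[str, int]:
    """Count how many records contain each top-level field."""
    counts: dict[str, int] = {}
    for record in records:
        for key in record:
            counts[key] = counts.get(key, 0) + 1
    return counts
```

For example, summarize(load_jsonl(Path("dataset.jsonl"))) should report the same count for every field when all records share the same shape.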
Merge request checklist
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.