Evaluate Duo Chat on a document-related QA dataset

Alexander Chueshev requested to merge ac/duo-chat-cot-qa-docs-eval into main

What does this merge request do and why?

Dataset collection

This MR collects a new, expanded dataset to evaluate Duo Chat on document-related QA tasks, following these steps:

  1. Read Markdown doc files from the specified directory.
  2. Filter and process the content of each file (a sketch of this filtering logic follows the list).
    1. Check that the content is not empty.
    2. Split the content into sections based on "## " (Header 2) markers.
    3. Ensure there are at least three Header 2 sections.
    4. Verify that each section has at least 500 characters.
  3. Use the Anthropic Claude model to generate questions, answers, and relevant context for each file.
  4. Write the generated data to the output JSONL file.
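
For illustration, a minimal Python sketch of the filtering in step 2; the function and constant names and the exact split logic are assumptions, not the actual code in this MR:

    from pathlib import Path

    MIN_SECTIONS = 3
    MIN_SECTION_CHARS = 500

    def eligible_sections(doc_path: Path) -> list[str] | None:
        """Return the Header 2 sections of a doc if it passes the filters, else None."""
        content = doc_path.read_text(encoding="utf-8").strip()
        if not content:  # 2.1: skip empty files
            return None

        # 2.2: split on Header 2 markers, dropping any preamble before the first "## "
        sections = ["## " + part for part in content.split("\n## ")[1:]]

        if len(sections) < MIN_SECTIONS:  # 2.3: require at least three sections
            return None
        if any(len(section) < MIN_SECTION_CHARS for section in sections):  # 2.4: each >= 500 chars
            return None
        return sections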

Evaluation

This MR evaluates the accuracy of a prediction against a reference using an LLM as judge. The LLM assesses the prediction based on the provided context (reference) and question (input), and assigns a score from 1 to 4, where 1 is fully inaccurate and 4 is fully accurate.
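
For illustration, a minimal sketch of such a judge; the prompt wording, model id, and direct use of the anthropic client are assumptions and may differ from the evaluator implemented in this MR:

    import re

    import anthropic

    JUDGE_PROMPT = """Grade the following answer to a question about GitLab documentation.
    Question: {question}
    Reference context: {reference}
    Prediction: {prediction}
    Rate the accuracy of the prediction from 1 to 4, where 1 is fully inaccurate
    and 4 is fully accurate. Reply with the number only."""

    def judge_accuracy(question: str, reference: str, prediction: str) -> int:
        """Ask the judge model for a 1-4 accuracy score."""
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        response = client.messages.create(
            model="claude-3-5-sonnet-latest",  # placeholder model id
            max_tokens=8,
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    question=question, reference=reference, prediction=prediction
                ),
            }],
        )
        score = re.search(r"[1-4]", response.content[0].text)
        return int(score.group()) if score else 1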

How to set up and validate locally

  1. Check out this merge request's branch, for example:
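    git fetch origin ac/duo-chat-cot-qa-docs-eval
    git checkout ac/duo-chat-cot-qa-docs-eval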

  2. Update the .env file, setting the required variables.

  3. Install dependencies.

    mise install # or use asdf
    poetry install
  4. Check the existing commands ELI5 provides:

    poetry run eli5 --help
    poetry run eli5 duo-chat --help
  5. Collect the dataset:

    poetry run eli5 duo-chat collect --help
    poetry run eli5 duo-chat collect cot-qa-docs --help
    poetry run eli5 duo-chat collect cot-qa-docs <PATH_TO_CLONED_GITLAB_DOCS, e.g., gitlab/docs> --output=<PATH_TO_OUTPUT_FILE, e.g., dataset.jsonl>
  6. Upload the dataset to LangSmith. Note: the dataset is already uploaded to LangSmith (check for duo_chat.cot_qa_docs.1). Please don't run this command unnecessarily: we share the prod and dev instances, and it can create unexpected collisions. I'm already working on a fix.

    poetry run eli5 datasets create --help
    poetry run eli5 datasets create duo_chat.cot_qa_docs.1 <PATH_TO_GENERATED_DATASET>
  7. Run evaluation:

    poetry run eli5 duo-chat evaluate docs --help
    poetry run eli5 duo-chat evaluate docs

Merge request checklist

  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.