Evaluate Duo Chat on a document-related QA dataset

Alexander Chueshev requested to merge ac/duo-chat-cot-qa-docs-eval into main

What does this merge request do and why?

Dataset collection

This MR collects a new, expanded dataset to evaluate Duo Chat on document-related QA tasks, following these steps:

  1. Read Markdown doc files from the specified directory.
  2. Filter and process the content of each file (a sketch of this filtering logic follows the list).
    1. Check that the content is not empty.
    2. Split the content into sections based on "## " (Header 2) markers.
    3. Ensure there are at least three Header 2 sections.
    4. Verify that each section has at least 500 characters.
  3. Use the Anthropic Claude model to generate questions, answers, and relevant context for each file.
  4. Write the generated data to the output JSONL file.
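
For illustration, a minimal Python sketch of the filtering in step 2; the function and constant names and the exact split logic are assumptions, not the actual code in this MR:

    from pathlib import Path

    MIN_SECTIONS = 3
    MIN_SECTION_CHARS = 500

    def eligible_sections(doc_path: Path) -> list[str] | None:
        """Return the Header 2 sections of a doc if it passes the filters, else None."""
        content = doc_path.read_text(encoding="utf-8").strip()
        if not content:  # 2.1: skip empty files
            return None

        # 2.2: split on Header 2 markers, dropping any preamble before the first "## "
        sections = ["## " + part for part in content.split("\n## ")[1:]]

        if len(sections) < MIN_SECTIONS:  # 2.3: require at least three sections
            return None
        if any(len(section) < MIN_SECTION_CHARS for section in sections):  # 2.4: each >= 500 chars
            return None
        return sections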

Evaluation

This MR evaluates the accuracy of a prediction against a reference using an LLM as judge. The LLM assesses the prediction based on the provided context (reference) and question (input), and assigns a score from 1 to 4, where 1 is fully inaccurate and 4 is fully accurate.
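
For illustration, a minimal sketch of such a judge; the prompt wording, model id, and direct use of the anthropic client are assumptions and may differ from the evaluator implemented in this MR:

    import re

    import anthropic

    JUDGE_PROMPT = """Grade the following answer to a question about GitLab documentation.
    Question: {question}
    Reference context: {reference}
    Prediction: {prediction}
    Rate the accuracy of the prediction from 1 to 4, where 1 is fully inaccurate
    and 4 is fully accurate. Reply with the number only."""

    def judge_accuracy(question: str, reference: str, prediction: str) -> int:
        """Ask the judge model for a 1-4 accuracy score."""
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        response = client.messages.create(
            model="claude-3-5-sonnet-latest",  # placeholder model id
            max_tokens=8,
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    question=question, reference=reference, prediction=prediction
                ),
            }],
        )
        score = re.search(r"[1-4]", response.content[0].text)
        return int(score.group()) if score else 1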

How to set up and validate locally

  1. Check out this merge request's branch, for example:
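    git fetch origin ac/duo-chat-cot-qa-docs-eval
    git checkout ac/duo-chat-cot-qa-docs-eval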

  2. Update the .env file, setting the required variables.

  3. Install dependencies.

    mise install # or use asdf
    poetry install
  4. Check the existing commands ELI5 provides:

    poetry run eli5 --help
    poetry run eli5 duo-chat --help
  5. Collect the dataset:

    poetry run eli5 duo-chat collect --help
    poetry run eli5 duo-chat collect cot-qa-docs --help
    poetry run eli5 duo-chat collect cot-qa-docs <PATH_TO_CLONED_GITLAB_DOCS, e.g., gitlab/docs> --output=<PATH_TO_OUTPUT_FILE, e.g., dataset.jsonl>
  6. Upload the dataset to LangSmith. Note: the dataset is already uploaded to LangSmith (check for duo_chat.cot_qa_docs.1). Please don't run this command unnecessarily: we share the prod and dev instances, and it can create unexpected collisions. I'm already working on a fix.

    poetry run eli5 datasets create --help
    poetry run eli5 datasets create duo_chat.cot_qa_docs.1 <PATH_TO_GENERATED_DATASET>
  7. Run evaluation:

    poetry run eli5 duo-chat evaluate docs --help
    poetry run eli5 duo-chat evaluate docs

Merge request checklist

  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.