
Evaluate Duo Chat on an issue/epic-related QA dataset

Alexander Chueshev requested to merge ac/duo-chat-epics-issues-eval into main

What does this merge request do and why?

This MR evaluates the accuracy of Duo Chat predictions against references for issue/epic-related QA datasets using an LLM as the judge. The judge assesses each prediction's correctness, readability, and comprehensiveness based on the provided context (reference) and question (input), and assigns a score from 1 to 4, where 1 is fully inaccurate and 4 is fully accurate.
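As a minimal sketch of that judging flow (this is not the MR's actual implementation; the call_llm helper and the prompt wording below are hypothetical stand-ins for the project's chat-model client):

    # Sketch only: score a prediction against a reference with an LLM judge.
    import re

    JUDGE_PROMPT = """You are grading an answer about a GitLab issue or epic.
    Question: {question}
    Reference (ground truth): {reference}
    Prediction: {prediction}

    Rate the prediction's correctness, readability, and comprehensiveness
    against the reference. Reply with a single integer from 1 (fully
    inaccurate) to 4 (fully accurate)."""

    def call_llm(prompt: str) -> str:
        # Hypothetical placeholder; wire up a real chat-model client here.
        raise NotImplementedError

    def judge(question: str, reference: str, prediction: str) -> int:
        """Ask the judge for a score and parse the first digit in 1-4."""
        reply = call_llm(JUDGE_PROMPT.format(
            question=question, reference=reference, prediction=prediction))
        match = re.search(r"[1-4]", reply)
        if match is None:
            raise ValueError(f"no 1-4 score in judge reply: {reply!r}")
        return int(match.group())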

How to set up and validate locally

  1. Check out this merge request's branch.

  2. Update the .env file, setting the required variables. (A hedged sketch of what it might contain appears after this list.)

  3. Install dependencies.

    mise install # or asdf
    poetry install
  4. Collect Duo Chat completions by running the Rake task as described in #25 (closed). Optionally, to save time, use the file I already collected - 895c66862d37c4d9351ab6030f7f7bc5.jsonl. (A sketch of the expected JSONL shape appears after this list.)

  5. Check the existing commands ELI5 provides:

    poetry run eli5 --help
    poetry run eli5 duo-chat --help
  6. Run the evaluation:

    poetry run eli5 duo-chat evaluate qa-resources --help
    poetry run eli5 duo-chat evaluate qa-resources <PATH to the Rake output, e.g., 895c66862d37c4d9351ab6030f7f7bc5.jsonl> --dataset=duo_chat.cot_qa_resources.1
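For step 2, a hedged sketch of what the .env file might contain. The variable names below are illustrative assumptions only; check the repository for the actual ones (LANGCHAIN_API_KEY and LANGCHAIN_TRACING_V2 are the standard LangSmith tracing settings):

    # Hypothetical contents; the real variable names live in the repo.
    ANTHROPIC_API_KEY=sk-...
    LANGCHAIN_API_KEY=ls-...
    LANGCHAIN_TRACING_V2=true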
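For step 4, a single hypothetical line of the Rake-task output, to illustrate the JSONL shape only; the actual field names may differ:

    {"question": "What is the current status of this epic?", "resource": "epic", "completion": "The epic is in progress and has three open child issues."}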

Here is the completed experiment for the uploaded completions: https://smith.langchain.com/o/477de7ad-583e-47b6-a1c4-c4a0300e7aca/datasets/f0f7c18a-a282-465b-8f16-d5b763365ec4

Merge request checklist

  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.

Closes #25 (closed)

