AI Framework Test Suite - Chat Response Measurement Heuristic (Enhancement)
Problem Statement
The Chat Framework's test suite currently struggles to measure the quality of the Chat's initial responses to user inputs accurately. The existing one-to-one comparison method, which checks each response against a single expected answer, provides limited insight into the relevance and accuracy of the Chat's responses. As a result, it is difficult to evaluate the Chat's performance effectively and identify areas for improvement.
Initial Work: The initial work for this issue was based on the following merge request: [Experimental]: Implement QA Duo Chat evaluator... (gitlab-org/modelops/applied-ml/code-suggestions/ai-assist!431, merged; author: Alexander Chueshev; milestone: 16.6).
Context
As part of the initial work, an API endpoint was added to the AI Gateway. This endpoint accepts Duo Chat questions and answers and evaluates them as either good or bad. The reasons for choosing this approach were as follows:
- Integration with CI Job: One option was to build an integration between the CI job and the AI Gateway, but this approach was not chosen for the initial iteration of this project.
- LLM Prompt Integration: Instead, the same LLM (Large Language Model) prompt added in that merge request can be integrated into the Rails codebase relatively easily (see the sketch after this list).
- Monolith Logic: It was decided to keep the logic within the monolith for the time being. The API endpoint remains an option for later, but maintaining the logic in the monolith was deemed more efficient at this stage.
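To make the LLM prompt integration concrete, the following is a minimal sketch of what an LLM-as-judge evaluator kept in the monolith could look like. The class name, the prompt wording, and the `complete(prompt:)` client interface are assumptions for illustration only, not the implementation from the merge request.

```ruby
# Hypothetical sketch: grades a Duo Chat answer against the retrieved context.
class ChatAnswerEvaluator
  GRADING_PROMPT = <<~PROMPT
    You are a teacher grading a student answer against the provided context.

    Context: %{context}
    Question: %{question}
    Student answer: %{answer}

    Respond with "Grade: CORRECT" or "Grade: INCORRECT", followed by a short
    explanation of your reasoning.
  PROMPT

  # `client` is any object exposing a `complete(prompt:)` method that returns
  # the model's raw completion text; the real client class is not shown here.
  def initialize(client:)
    @client = client
  end

  # Returns the raw verdict, e.g. "Grade: INCORRECT\n\nExplanation: ..."
  def evaluate(context:, question:, answer:)
    prompt = format(GRADING_PROMPT, context: context, question: question, answer: answer)
    @client.complete(prompt: prompt)
  end
end
```

An RSpec example can then assert on the verdict text, which is exactly the shape of the failure shown in the next section.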
Example Output of a Failed RSpec Run

```plaintext
1) GitLab Duo Chat Evaluation evaluation feed incorrect issue data 1 summarize an issue correctly
Failure/Error: expect(evaluation).to match /Grade: CORRECT/i
expected " Grade: INCORRECT\n\nExplanation:\nThe student answer contains some incorrect information compared t...etails about the label and milestone assignment mean the answer should be marked incorrect overall." to match /Grade: CORRECT/i
Diff:
@@ -1,8 +1,15 @@
-/Grade: CORRECT/i
+ Grade: INCORRECT
+
+Explanation:
+The student answer contains some incorrect information compared to the context provided. Specifically:
+- The context does not mention any label like "ai-enablement" for the issue.
+- The context also does not mention any milestone like "milestone1" that the issue is assigned to.
+
+The student answer summarizes the main point that the issue is about evaluating AI provider reliability. However, the additional incorrect details about the label and milestone assignment mean the answer should be marked incorrect overall.
```
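For orientation, a spec shaped roughly like the following could produce the failure above. The `chat_answer_for` and `evaluate` helpers and the `issue` fixture are hypothetical placeholders; the actual suite may wire these up differently.

```ruby
RSpec.describe 'GitLab Duo Chat Evaluation' do
  it 'summarizes an issue correctly' do
    question = 'Summarize this issue'

    # `chat_answer_for` (hypothetical) asks Duo Chat the question about the
    # issue under test and returns the Chat's answer text.
    answer = chat_answer_for(question, issue: issue)

    # `evaluate` (hypothetical) sends question, answer, and the issue context
    # to the evaluator LLM and returns its graded verdict as text.
    evaluation = evaluate(question: question, answer: answer, context: issue_context)

    expect(evaluation).to match(/Grade: CORRECT/i)
  end
end
```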
Exit Criteria:
- New Measurement Heuristic: We aim to use a separate model as an evaluation agent to assess the quality of the Chat's initial responses. This approach should deliver more dependable and precise measurements than the current one-to-one comparison.
- Extend RSpecs for Tool Validation: We will augment the existing RSpec tests to validate that the Chat selects the correct tool for each test case. This treats tool selection as a classic binary classification problem: the expected tool for each test case is predefined, and the test confirms whether the Chat's selection matches it (see the sketch after this list).
- Heuristic and Tool Selection Integration into RSpec/CI/CD: The newly proposed heuristic, together with the tool selection checks, will be integrated into the existing RSpec tests and the CI/CD pipeline for the Chat test suite, keeping the evaluation process consistent and automated.
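To illustrate the tool-validation criterion above, a sketch of such a spec might look like this. The `selected_tool_for` helper and the tool names are hypothetical; the point is that each test case predeclares its expected tool, so the assertion reduces to an exact match.

```ruby
RSpec.describe 'GitLab Duo Chat tool selection' do
  # Each case predefines the tool the Chat is expected to pick, turning
  # tool selection into a simple classification check.
  tool_cases = [
    { question: 'Summarize issue #123', expected_tool: 'IssueReader' },
    { question: 'Explain this code snippet', expected_tool: 'CodeExplainer' }
  ]

  tool_cases.each do |test_case|
    it "selects #{test_case[:expected_tool]} for: #{test_case[:question]}" do
      # `selected_tool_for` (hypothetical) returns the name of the tool the
      # Chat agent chose for the given question.
      expect(selected_tool_for(test_case[:question])).to eq(test_case[:expected_tool])
    end
  end
end
```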
Resources and Links
- This paper is useful for an overview of the common approaches (and pitfalls) to NLG evaluation among practitioners:
- https://arxiv.org/pdf/2303.16634.pdf
- A video about how Docugami and Rechat are tackling these problems
- Decision Record: Chat Evaluation