refactor: encapsulate judges for DuoChat
What does this merge request do and why?
Encapsulate judges for DuoChat. We also plan to do the same for other use cases.
Benefits:
- All logic for a judge can be found in one class in promptlib/duo_chat/judges.py.
- The configuration of each predefined judge (and its different versions) can be easily inspected in promptlib/duo_chat/config.py.
- Judge logic and prompt templating logic are decoupled from Beam. This improves the developer experience and makes adding new judges easier. Here is an example of adding a new judge for DuoChat docs: !746 (merged). Notice how clear it is.
- Users only need to choose one from a pool of predefined judges; they no longer need to build all the models and prompt templates. This removes a lot of user friction.
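To illustrate the shape this encapsulation takes, here is a minimal sketch. It is not the actual code in promptlib/duo_chat/judges.py or promptlib/duo_chat/config.py: the names `JudgeConfig`, `Judge`, `PREDEFINED_JUDGES`, `get_judge`, the prompt wording, and the `Score:` reply format are all hypothetical. Only the judge name and model name are taken from the validation config in this MR.

```python
from dataclasses import dataclass
from string import Template
from typing import Callable, Dict


@dataclass(frozen=True)
class JudgeConfig:
    """Hypothetical config entry describing one predefined judge."""
    name: str
    model_name: str
    prompt_template: str  # uses $question and $answer placeholders


class Judge:
    """Hypothetical judge that owns prompt building and response parsing."""

    def __init__(self, config: JudgeConfig, call_model: Callable[[str, str], str]):
        self.config = config
        # The model call is injected, so this class has no Beam or API dependency.
        self._call_model = call_model

    def build_prompt(self, question: str, answer: str) -> str:
        return Template(self.config.prompt_template).substitute(
            question=question, answer=answer
        )

    def parse_score(self, response: str) -> int:
        # Assumption: the judge model replies with a line such as "Score: 3".
        for line in response.splitlines():
            if line.lower().startswith("score:"):
                return int(line.split(":", 1)[1].strip())
        raise ValueError("no score found in judge response")

    def evaluate(self, question: str, answer: str) -> int:
        prompt = self.build_prompt(question, answer)
        return self.parse_score(self._call_model(self.config.model_name, prompt))


# Hypothetical pool of predefined judges, keyed by the name used in the config.
PREDEFINED_JUDGES: Dict[str, JudgeConfig] = {
    "independent_llm_judge_generic": JudgeConfig(
        name="independent_llm_judge_generic",
        model_name="claude-3-5-sonnet",
        prompt_template=(
            "Rate the answer from 1 to 4.\n"
            "Question: $question\n"
            "Answer: $answer"
        ),
    ),
}


def get_judge(name: str, call_model: Callable[[str, str], str]) -> Judge:
    """Look up a predefined judge by name and bind it to a model caller."""
    return Judge(PREDEFINED_JUDGES[name], call_model)
```

With this layout, a pipeline step only needs to call `get_judge(name, call_model).evaluate(question, answer)`, and adding a judge means adding one config entry (plus a subclass if the parsing differs).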
How to set up and validate locally
Run the pipeline with this config:
{
  "beam_config": {
    "pipeline_options": {
      "runner": "DirectRunner",
      "project": "dev-ai-research-0e2f8974",
      "region": "us-central1",
      "temp_location": "gs://prompt-library/tmp/",
      "save_main_session": false
    }
  },
  "input_source": {
    "type": "bigquery",
    "path": "dev-ai-research-0e2f8974.duo_chat.issue_epic_staging_v1"
  },
  "output_sinks": [
    {
      "type": "bigquery",
      "path": "dev-ai-research-0e2f8974.duo_chat_experiments",
      "prefix": "test_prebuilt_judge"
    }
  ],
  "throttle_sec": 1,
  "batch_size": 16,
  "eval_setup": {
    "answering_models": [
      {
        "name": "duo-chat",
        "parameters": {
          "base_url": "https://staging.gitlab.com"
        },
        "prompt_template_config": {
          "templates": [
            {
              "name": "empty",
              "template_path": "data/prompts/duo_chat/answering/empty.txt.example"
            }
          ]
        }
      },
      {
        "name": "gpt-4o-mini",
        "prompt_template_config": {
          "templates": [
            {
              "name": "claude-3-sonnet",
              "template_path": "data/prompts/duo_chat/answering/code-explanation-simple.txt.example"
            }
          ]
        }
      }
    ],
    "metrics": [
      {
        "name": "similarity_score"
      },
      {
        "name": "independent_llm_judge_generic",
        "model": {
          "name": "claude-3-5-sonnet"
        }
      }
    ]
  }
}
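Before running the pipeline, it can help to confirm which answering models, metrics, and judge models the config selects. The helper below is a hypothetical convenience sketch, not part of the pipeline; it assumes only the key layout visible in the config above (`eval_setup`, `answering_models`, `metrics`, and an optional `model` per metric).

```python
import json


def summarize_eval_setup(config_text: str) -> dict:
    """Return the model and metric names selected by an eval config."""
    config = json.loads(config_text)
    setup = config["eval_setup"]
    return {
        "answering_models": [m["name"] for m in setup["answering_models"]],
        "metrics": [m["name"] for m in setup["metrics"]],
        # Only metrics backed by a predefined judge carry a "model" entry.
        "judge_models": {
            m["name"]: m["model"]["name"]
            for m in setup["metrics"]
            if "model" in m
        },
    }
```

For the config above, the summary would list `duo-chat` and `gpt-4o-mini` as answering models and map `independent_llm_judge_generic` to `claude-3-5-sonnet`.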
Merge request checklist
- I've run the affected pipeline(s) to validate that nothing is broken.
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.
Edited by Hongtao Yang