LLM Judge Calibration: Iterating on Custom Metric for RCA-correctness
Problem to solve
In spot checks, the Correctness score agreed with the human rating 65% of the time overall, and 81% of the time when the gitlab-cli project was excluded. We want to keep iterating on the metric until it is 80-90% accurate even with gitlab-cli included.
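To make the target measurable across iterations, the agreement figures could be computed along the lines of the sketch below. This is only an illustration, not the evaluation code actually used: the record layout (`project`, `judge`, `human`), the function names, and the sample rows are all hypothetical, and the sample rows are made up rather than taken from the real spot check.

```python
from collections import defaultdict


def agreement_rate(records, exclude_projects=()):
    """Fraction of records where the judge verdict equals the human rating.

    Records whose project is listed in exclude_projects are skipped.
    Returns None when no records remain after filtering.
    """
    kept = [r for r in records if r["project"] not in exclude_projects]
    if not kept:
        return None
    matches = sum(1 for r in kept if r["judge"] == r["human"])
    return matches / len(kept)


def agreement_by_project(records):
    """Per-project agreement, useful for spotting outliers such as one project."""
    totals = defaultdict(lambda: [0, 0])  # project -> [matches, count]
    for r in records:
        totals[r["project"]][0] += int(r["judge"] == r["human"])
        totals[r["project"]][1] += 1
    return {project: m / n for project, (m, n) in totals.items()}


# Hypothetical, made-up rows purely to show the expected input shape.
SAMPLE = [
    {"project": "gitlab-cli", "judge": "correct", "human": "incorrect"},
    {"project": "gitlab-cli", "judge": "correct", "human": "correct"},
    {"project": "project-a", "judge": "correct", "human": "correct"},
    {"project": "project-a", "judge": "incorrect", "human": "incorrect"},
]

if __name__ == "__main__":
    print("overall:", agreement_rate(SAMPLE))
    print("without gitlab-cli:", agreement_rate(SAMPLE, exclude_projects=("gitlab-cli",)))
    print("by project:", agreement_by_project(SAMPLE))
```

Reporting the per-project breakdown next to the overall number would show whether a change to the judge prompt helps gitlab-cli specifically or only shifts the average.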
Proposal
Further details
Links / references
Edited by Mon Ray