LLM Judge Calibration: Iterating on the Custom Metric for RCA Correctness

Problem to solve

In spot checks, the Correctness score agreed with the human rating 65% of the time overall, and 81% of the time when the gitlab-cli project was excluded. We want to iterate on the metric until it is 80-90% accurate, even with gitlab-cli included.
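As a rough illustration of how the agreement figures above could be computed, here is a minimal sketch. The `Rating` record, the `agreement_rate` helper, the project names other than gitlab-cli, and the sample rows are all hypothetical and are not taken from the actual spot-check data, so the resulting numbers do not reproduce the 65% and 81% figures.

```python
from typing import Iterable, NamedTuple, Optional


class Rating(NamedTuple):
    """One spot-checked RCA (hypothetical record layout)."""
    project: str
    judge_correct: bool   # verdict from the LLM judge's Correctness score
    human_correct: bool   # verdict from the human rater


def agreement_rate(ratings: Iterable[Rating],
                   exclude_project: Optional[str] = None) -> float:
    """Fraction of ratings where the judge and the human agree.

    If exclude_project is given, ratings from that project are dropped first.
    """
    kept = [r for r in ratings if r.project != exclude_project]
    if not kept:
        raise ValueError("no ratings left after filtering")
    matches = sum(r.judge_correct == r.human_correct for r in kept)
    return matches / len(kept)


# Made-up sample rows, for illustration only.
sample = [
    Rating("gitlab-cli", True, False),
    Rating("gitlab-cli", False, True),
    Rating("project-a", True, True),
    Rating("project-a", False, False),
    Rating("project-b", True, False),
]

overall = agreement_rate(sample)
without_cli = agreement_rate(sample, exclude_project="gitlab-cli")
print(f"overall agreement: {overall:.0%}")
print(f"agreement without gitlab-cli: {without_cli:.0%}")
```

Tracking both numbers on every iteration of the metric would show whether a prompt change closes the gap for gitlab-cli or only improves the other projects.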

Proposal

Further details

Links / references

Edited Jul 22, 2024 by Mon Ray