LLM-as-a-Judge has become the go-to solution for evaluating AI systems. It's intuitive, flexible, and seemingly perfect for complex, subjective tasks that traditional metrics can't handle. Need to score creative writing? Use GPT-4. Want to evaluate conversational quality? Claude it is. The approach feels natural—who better to judge language than a language model?
But beneath this appealing surface lies a fundamental problem: LLM judges are unreliable, expensive, and inconsistent in ways that can completely undermine your evaluation pipeline.
Here's what happens when you run the same evaluation multiple times with an LLM judge:
# Evaluating the same response 5 times
# (llm_judge stands in for whatever judge client you use)
import statistics

response = "Paris is the capital of France and known for the Eiffel Tower."
prompt = "Rate the accuracy of this statement on a scale of 1-10."

scores = []
for _ in range(5):
    score = llm_judge.evaluate(prompt, response)
    scores.append(score)

print(scores)                              # [9, 7, 8, 9, 6]
print(round(statistics.stdev(scores), 1))  # 1.3
The same factual statement receives scores ranging from 6 to 9. That isn't minor variance: a 3-point spread on a 10-point scale, roughly a 40% swing relative to the average score, is large enough to make the evaluation results meaningless. Several factors drive this inconsistency:
Temperature and Sampling: Even with temperature set to 0, LLM APIs aren't fully deterministic. Differences in sampling and serving infrastructure, minor prompt variations, and silent model updates all introduce variance.
Context Sensitivity: LLM judges are heavily influenced by how the rubric is phrased, the order in which items are presented, the length and formatting of the response, and whatever else happens to sit in the context window (see the sketch below).
Subjective Interpretation: Criteria that look objective to humans are interpreted subjectively by the model. "Rate the creativity" can mean different things from one call to the next.
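As a rough illustration of the context-sensitivity point, the following sketch scores the same response under two wordings of the same rubric. It reuses the hypothetical llm_judge client from the earlier example; in practice, the two prompts often return different scores even at temperature 0.

# A minimal sketch of prompt-phrasing sensitivity.
# llm_judge is the same hypothetical judge client used above.
response = "Paris is the capital of France and known for the Eiffel Tower."

rubric_a = "Rate the accuracy of this statement on a scale of 1-10."
rubric_b = "On a 1-10 scale, how factually correct is the following statement?"

score_a = llm_judge.evaluate(rubric_a, response)
score_b = llm_judge.evaluate(rubric_b, response)

# Nothing about the response changed, yet rewording the criterion
# is frequently enough to move the score.
print(score_a, score_b)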
This inconsistency doesn't just create noisy data—it creates systematically wrong conclusions:
# Comparing two models using an LLM judge
# (evaluate_with_llm wraps a single judge call; model_*_outputs hold each model's responses)
import statistics

model_a_scores = [evaluate_with_llm(response) for response in model_a_outputs]
model_b_scores = [evaluate_with_llm(response) for response in model_b_outputs]

print(statistics.mean(model_a_scores))   # Model A average: 7.2
print(statistics.mean(model_b_scores))   # Model B average: 7.8
# Conclusion: Model B is better!
# But with per-score noise of ±1.5, this 0.6-point gap may be nothing but measurement error.
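One way to sanity-check a gap like this is to ask whether it survives the judge's own noise. The sketch below bootstraps the difference in means using only the standard library; it assumes the model_a_scores and model_b_scores lists from the snippet above and is meant as an illustration, not a full statistical treatment.

import random
import statistics

def bootstrap_mean_diff(a, b, n_resamples=10_000, seed=0):
    """Resample both score lists and return a 95% interval for mean(b) - mean(a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        a_sample = [rng.choice(a) for _ in a]
        b_sample = [rng.choice(b) for _ in b]
        diffs.append(statistics.mean(b_sample) - statistics.mean(a_sample))
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples)]

low, high = bootstrap_mean_diff(model_a_scores, model_b_scores)
if low <= 0 <= high:
    print("The gap is within the judge's noise; don't conclude Model B is better.")
else:
    print(f"The gap exceeds the noise: 95% interval is ({low:.2f}, {high:.2f}).")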
Teams make costly decisions based on evaluations that would give different results if run again. That "breakthrough" model might just be measurement noise.
When evaluation scores vary by 20-40% across runs, comparing different approaches becomes impossible. You're optimizing for randomness, not performance.
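To make the "optimizing for randomness" point concrete, here is a small simulation (synthetic scores, not real judge output): two systems of identical quality, each scored with roughly ±1.5 of judge noise on a 50-item eval set, routinely appear to differ by a few tenths of a point, the same order as the "Model B is better" gap above.

# Simulation with synthetic scores: two identical systems plus judge noise.
import random
import statistics

rng = random.Random(42)
TRUE_QUALITY = 7.5   # both systems are genuinely identical
NOISE_STD = 1.5      # per-score judge noise, matching the variance seen earlier
N_ITEMS = 50         # a typical small eval set

for run in range(3):
    a = [TRUE_QUALITY + rng.gauss(0, NOISE_STD) for _ in range(N_ITEMS)]
    b = [TRUE_QUALITY + rng.gauss(0, NOISE_STD) for _ in range(N_ITEMS)]
    gap = statistics.mean(b) - statistics.mean(a)
    print(f"run {run}: apparent gap = {gap:+.2f}")
# Apparent gaps of a few tenths of a point are routine even though
# nothing real separates the two systems.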
Unreliable evaluations create cascading problems: