LLM-as-Judge Rubric Builder
Define your evaluation criteria and a scoring scale, then generate a clean, copy-pasteable LLM-as-judge prompt you can drop into your eval pipeline — with the common pitfalls (position bias, verbosity bias, ties) called out. Turns eval theory into a prompt you can ship.
You are an impartial evaluator. Your job is to score a response for the following task.
# Task
Answer a user question using only the provided context.
# Evaluation criteria
1. Faithfulness: Every claim is supported by the provided context; no hallucinations.
2. Relevance: The answer directly addresses the user’s question.
3. Completeness: The answer covers the key points the question requires.
# Inputs
<user_input>{{INPUT}}</user_input>
<response>{{RESPONSE}}</response>
# Instructions
- Score the response on each criterion using the scale: 1–5 (Likert).
- Judge on substance, not length or verbosity.
- For each criterion, give a one-line rationale before its score.
# Output (JSON)
{
"faithfulness": { "rationale": "<one line>", "score": <number> },
"relevance": { "rationale": "<one line>", "score": <number> },
"completeness": { "rationale": "<one line>", "score": <number> }
}
How it works
- List independent, specific criteria — one judgment each.
- Define a scale with described anchors (e.g. 1 = wrong, 5 = fully correct).
- Ask for a rationale before the score to improve consistency and auditability.
- Mitigate position, verbosity and self-preference bias in the instructions.
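The steps above can be sketched as a small prompt builder. This is a minimal illustration, not the tool's actual implementation: `Criterion`, `build_judge_prompt`, and the `anchors` mapping are hypothetical names, and the output only closely mirrors the example prompt shown earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One independent judgment (hypothetical helper type)."""
    name: str
    description: str


def build_judge_prompt(task, criteria, anchors):
    """Render a judge prompt from a task, a list of criteria, and scale anchors.

    `anchors` maps scores to descriptions, e.g. {1: "wrong", 5: "fully correct"}.
    """
    low, high = min(anchors), max(anchors)
    criteria_lines = [
        f"{i}. {c.name}: {c.description}" for i, c in enumerate(criteria, start=1)
    ]
    anchor_lines = [f"  {score} = {label}" for score, label in sorted(anchors.items())]
    # Doubled braces in an f-string produce literal { and } in the output.
    output_lines = [
        f'"{c.name.lower().replace(" ", "_")}": '
        f'{{ "rationale": "<one line>", "score": <number> }}'
        for c in criteria
    ]
    return "\n".join([
        "You are an impartial evaluator. Your job is to score a response for the following task.",
        "# Task",
        task,
        "# Evaluation criteria",
        *criteria_lines,
        "# Inputs",
        # Plain strings, so the placeholders stay as literal {{INPUT}} / {{RESPONSE}}.
        "<user_input>{{INPUT}}</user_input>",
        "<response>{{RESPONSE}}</response>",
        "# Instructions",
        f"- Score the response on each criterion using the scale: {low}–{high}.",
        *anchor_lines,
        "- Judge on substance, not length or verbosity.",
        "- For each criterion, give a one-line rationale before its score.",
        "# Output (JSON)",
        "{",
        ",\n".join(output_lines),
        "}",
    ])
```

Keeping the criteria as data rather than hand-edited text means the criteria list and the JSON output schema cannot drift apart.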
Frequently asked questions
What is LLM-as-a-judge?
LLM-as-a-judge uses a language model to score or compare the outputs of another model against a rubric, instead of relying solely on humans or rigid string matching. It scales evaluation and handles open-ended outputs, as long as the rubric is clear and known biases are controlled for.
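In a pipeline, this usually starts with filling the prompt's `{{INPUT}}` and `{{RESPONSE}}` placeholders for each example before sending it to the judge model. The sketch below assumes that placeholder syntax; `fill_template` is a hypothetical helper, and the call to the judge model itself is left out. It substitutes in a single pass, so placeholder-like text inside a user input or response is not expanded a second time.

```python
import re

PLACEHOLDER = re.compile(r"\{\{([A-Z_]+)\}\}")


def fill_template(template, values):
    """Replace each {{NAME}} in `template` with values["NAME"] in one pass."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value provided for placeholder {key!r}")
        return values[key]

    # re.sub does not rescan replacement text, so substituted values are
    # inserted verbatim even if they contain "{{...}}" themselves.
    return PLACEHOLDER.sub(substitute, template)
```

A naive chain of `str.replace` calls would let a user input containing the text `{{RESPONSE}}` pull the response into the wrong slot; the single-pass substitution avoids that.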
What makes a good evaluation rubric?
Specific, independent criteria; an explicit scoring scale with described anchor points; and instructions to reason before scoring. Vague criteria like "is it good?" produce noisy scores — "does the answer cite a source for every factual claim?" produces consistent ones.
What biases affect LLM judges?
Common ones are position bias (favoring the first or second option in a comparison), verbosity bias (favoring longer answers), and self-preference (favoring the judge model’s own style). The generated prompt includes mitigations such as an instruction to score on substance, not length; for pairwise comparisons, also randomize the order in which the options are shown.
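For pairwise comparisons, order randomization happens in the pipeline rather than in the prompt text. Below is a minimal sketch, assuming the judge answers with "first", "second", or "tie"; the function names and that verdict vocabulary are illustrative assumptions, not part of the generated prompt.

```python
import random


def randomized_pair(response_a, response_b, rng):
    """Return (first, second, swapped) with the two responses in random order."""
    swapped = rng.random() < 0.5
    if swapped:
        return response_b, response_a, True
    return response_a, response_b, False


def unswap_verdict(verdict, swapped):
    """Map the judge's positional verdict back to the original labels.

    `verdict` is "first", "second", or "tie"; the result is "a", "b", or "tie".
    """
    if verdict == "tie":
        return "tie"
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    picked_first = verdict == "first"
    # If the pair was swapped, the first position held response B.
    return "a" if picked_first != swapped else "b"
```

Passing an explicit `random.Random(seed)` as `rng` keeps an eval run reproducible. A stricter variant judges every pair in both orders and counts a win only when the two verdicts agree.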
Should the judge explain its score?
Yes. Asking the judge to give a short rationale before the numeric score improves consistency and makes results auditable, so you can see why something was marked down rather than trusting an unexplained number.
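To make the rationale usable for auditing, the pipeline can validate the judge's JSON before accepting a score. The sketch below assumes the output shape from the example prompt (one object per criterion with `rationale` and `score`) and a 1–5 scale; `parse_judge_output` is a hypothetical helper name.

```python
import json

CRITERIA = ("faithfulness", "relevance", "completeness")


def parse_judge_output(raw, criteria=CRITERIA, low=1, high=5):
    """Parse judge JSON into {criterion: (score, rationale)}, or raise ValueError."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    results = {}
    for name in criteria:
        entry = data.get(name)
        if not isinstance(entry, dict):
            raise ValueError(f"missing criterion: {name}")
        rationale = entry.get("rationale")
        score = entry.get("score")
        if not isinstance(rationale, str) or not rationale.strip():
            raise ValueError(f"{name}: rationale is missing or empty")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError(f"{name}: score is not a number")
        if not low <= score <= high:
            raise ValueError(f"{name}: score {score} is outside {low}-{high}")
        results[name] = (score, rationale.strip())
    return results
```

Rejecting outputs with an empty rationale keeps every stored score paired with the reason it was given.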