LLM-as-Judge Rubric Builder

Define your evaluation criteria and a scoring scale, then generate a clean, copy-pasteable LLM-as-judge prompt you can drop into your eval pipeline — with the common pitfalls (position bias, verbosity bias, ties) called out. Turns eval theory into a prompt you can ship.

Generated judge prompt

You are an impartial evaluator. Your job is to score a response for the following task.

# Task
Answer a user question using only the provided context.

# Evaluation criteria
1. Faithfulness: Every claim is supported by the provided context; no hallucinations.
2. Relevance: The answer directly addresses the user’s question.
3. Completeness: The answer covers the key points the question requires.

# Inputs
<context>{{CONTEXT}}</context>
<user_input>{{INPUT}}</user_input>
<response>{{RESPONSE}}</response>

# Instructions
- Score the response on each criterion using the scale: 1–5 (Likert).
- Judge on substance, not length or verbosity.
- For each criterion, give a one-line rationale before its score.

# Output (JSON)
{
  "faithfulness": { "rationale": "<one line>", "score": <number> },
  "relevance": { "rationale": "<one line>", "score": <number> },
  "completeness": { "rationale": "<one line>", "score": <number> }
}
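
To use a prompt like this in a pipeline, you fill the `{{...}}` placeholders and then validate the JSON the judge returns. The sketch below shows one way to do both, assuming the judge replies with the JSON shape shown above. The names `render_prompt` and `parse_judge_output` are hypothetical helpers, not part of this builder, and the call to the judge model itself is left out.

```python
import json
import re

CRITERIA = ("faithfulness", "relevance", "completeness")


def render_prompt(template: str, values: dict[str, str]) -> str:
    """Fill {{NAME}} placeholders in a single pass.

    A single pass means placeholder-like text inside a value is never
    expanded again. A placeholder with no matching key raises KeyError.
    """
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], template)


def parse_judge_output(raw: str, criteria=CRITERIA, lo: int = 1, hi: int = 5) -> dict[str, int]:
    """Extract and validate per-criterion scores from the judge's reply."""
    # Tolerate prose around the JSON by slicing from the first "{" to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in judge output")
    data = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError

    scores = {}
    for name in criteria:
        entry = data.get(name)
        if not isinstance(entry, dict):
            raise ValueError(f"missing criterion: {name}")
        if not isinstance(entry.get("rationale"), str):
            raise ValueError(f"missing rationale for: {name}")
        score = entry.get("score")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not lo <= score <= hi:
            raise ValueError(f"score out of range for: {name}")
        scores[name] = score
    return scores


# Shortened template for illustration; the real one is the full prompt above.
template = "<user_input>{{INPUT}}</user_input>\n<response>{{RESPONSE}}</response>"
prompt = render_prompt(template, {"INPUT": "What is the capital of France?", "RESPONSE": "Paris."})
```

Rejecting malformed or out-of-range replies at this step keeps a bad judge response from turning into a silently wrong metric.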

LLM-as-a-judge

Once you're generating open-ended text there's no exact-match "correct" to test against. The scalable answer is LLM-as-a-judge: a strong model, given a clear rubric, scores outputs. Done well it tracks human judgement at a fraction of the cost; done carelessly it produces confident, biased, meaningless numbers.

What makes a rubric work

Specific criteria: score named dimensions (faithfulness, relevance, tone), not a vague "is this good?".
A defined scale: describe what each score means so it's applied consistently.
Reasoning first: ask the judge to explain, then score; committing to a rationale first makes the scores more consistent and lets you audit them.
Reference-guided: grading against a gold answer or the source context is more reliable than grading from the judge's own memory.

The biases to design against

LLM judges favour longer answers, reward their own style, and are swayed by position and confident tone. A good rubric blunts these — and you should still validate the judge against human labels before trusting it, because an uncalibrated judge silently corrupts every downstream metric. This builder turns your criteria and scale into a ready-to-use judge prompt.
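
Validating the judge against human labels can be as simple as scoring the same items both ways and measuring agreement. The sketch below computes exact agreement and unweighted Cohen's kappa by hand, assuming you have two equal-length lists of scores. The function names and sample scores are made up for illustration.

```python
from collections import Counter


def exact_agreement(human: list[int], judge: list[int]) -> float:
    """Fraction of items where the judge gave the same score as the human."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohen_kappa(human: list[int], judge: list[int]) -> float:
    """Unweighted Cohen's kappa: agreement corrected for chance."""
    observed = exact_agreement(human, judge)
    n = len(human)
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(h_counts[k] * j_counts[k] for k in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative scores only, not real evaluation data.
human_scores = [5, 4, 2, 1, 3, 5, 2, 4]
judge_scores = [5, 4, 3, 1, 3, 4, 2, 4]
print(exact_agreement(human_scores, judge_scores))  # 0.75
print(cohen_kappa(human_scores, judge_scores))
```

Unweighted kappa treats a 4-versus-5 disagreement the same as a 1-versus-5 one. For an ordinal scale such as 1–5, a weighted kappa or a rank correlation may be a better fit.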

How it works

List independent, specific criteria — one judgment each.
Define a scale with described anchors (e.g. 1 = wrong, 5 = fully correct).
Ask for a rationale before the score to improve consistency and auditability.
Mitigate position, verbosity and self-preference bias in the instructions.
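
The steps above can be sketched as a small function that turns criteria and a scale into prompt text. This is a rough illustration that follows the layout of the generated prompt on this page, not the builder's actual implementation. `Criterion` and `build_judge_prompt` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str          # e.g. "Faithfulness"
    description: str   # one specific judgment


def build_judge_prompt(task: str, criteria: list[Criterion], scale: str,
                       require_rationale: bool = True) -> str:
    """Assemble a single-response judge prompt from criteria and a scale."""
    lines = [
        "You are an impartial evaluator. Your job is to score a response for the following task.",
        "",
        "# Task",
        task,
        "",
        "# Evaluation criteria",
    ]
    lines += [f"{i}. {c.name}: {c.description}" for i, c in enumerate(criteria, 1)]
    lines += [
        "",
        "# Inputs",
        "<user_input>{{INPUT}}</user_input>",
        "<response>{{RESPONSE}}</response>",
        "",
        "# Instructions",
        f"- Score the response on each criterion using the scale: {scale}.",
        "- Judge on substance, not length or verbosity.",  # verbosity-bias mitigation
    ]
    if require_rationale:
        lines.append("- For each criterion, give a one-line rationale before its score.")

    # One JSON field per criterion, keyed by a lowercase snake_case name.
    fields = []
    for c in criteria:
        key = c.name.lower().replace(" ", "_")
        if require_rationale:
            fields.append(f'  "{key}": {{ "rationale": "<one line>", "score": <number> }}')
        else:
            fields.append(f'  "{key}": {{ "score": <number> }}')
    lines += ["", "# Output (JSON)", "{", ",\n".join(fields), "}"]
    return "\n".join(lines)


prompt = build_judge_prompt(
    task="Answer a user question using only the provided context.",
    criteria=[
        Criterion("Faithfulness", "Every claim is supported by the provided context; no hallucinations."),
        Criterion("Relevance", "The answer directly addresses the user's question."),
    ],
    scale="1–5 (Likert)",
)
print(prompt)
```

Putting the rationale field before the score field in the output schema nudges the judge to write its reasoning first.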

Frequently asked questions

What is LLM-as-a-judge?

LLM-as-a-judge uses a language model to score or compare the outputs of another model against a rubric, instead of relying solely on humans or rigid string matching. It scales evaluation and handles open-ended outputs, as long as the rubric is clear and known biases are controlled for.

What makes a good evaluation rubric?

Specific, independent criteria; an explicit scoring scale with described anchor points; and instructions to reason before scoring. Vague criteria like "is it good?" produce noisy scores — "does the answer cite a source for every factual claim?" produces consistent ones.

What biases affect LLM judges?

Common ones are position bias (favouring the first or second option in a comparison), verbosity bias (favouring longer answers), and self-preference (favouring the judge model's own style). The generated prompt includes mitigations such as scoring on substance, not length. For pairwise comparisons, also randomize the order in which the options are shown to the judge.
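
For pairwise comparisons, a related mitigation is to ask the judge twice with the order swapped and accept a winner only when both runs agree. The sketch below shows that idea, which is a different technique from randomizing the order. It assumes a hypothetical `judge` callable that takes the two responses in display order and returns "A", "B" or "tie".

```python
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]


def debiased_compare(judge: Callable[[str, str], Verdict],
                     first: str, second: str) -> Verdict:
    """Run the judge in both orders; disagreement between runs counts as a tie."""
    forward = judge(first, second)    # `first` shown in slot A
    backward = judge(second, first)   # `second` shown in slot A
    # Translate the swapped run's verdict back to the original labelling.
    unswapped = {"A": "B", "B": "A", "tie": "tie"}[backward]
    return forward if forward == unswapped else "tie"


# Stand-in judges for illustration; a real one would call a model.
def always_picks_slot_a(a: str, b: str) -> Verdict:
    return "A"  # pure position bias


def prefers_cited(a: str, b: str) -> Verdict:
    if ("[source]" in a) == ("[source]" in b):
        return "tie"
    return "A" if "[source]" in a else "B"


print(debiased_compare(always_picks_slot_a, "x", "y"))                 # tie
print(debiased_compare(prefers_cited, "Paris [source]", "Paris, probably"))  # A
```

This doubles the number of judge calls, but a verdict that flips when the order flips is exactly the position-biased signal you want to discard.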

Should the judge explain its score?

Yes. Asking the judge to give a short rationale before the numeric score improves consistency and makes results auditable, so you can see why something was marked down rather than trusting an unexplained number.
