LEARNING PATH · AI Engineering

Master LLM Evaluation

For engineers who need to prove their AI actually works.

Shipping an LLM feature is easy; knowing whether it is good is the hard part. Build the vocabulary of evals, design a judge rubric, implement a metric by hand, then evaluate agents and a real production system.

Speak precisely about offline vs online evals, judges and rubrics
Design an LLM-as-judge rubric that resists gaming
Implement eval metrics like token-level F1 from scratch
Evaluate agents and conversational systems, not just single calls


Handbook · Next up
51 LLM Evals Interview Questions
The vocabulary of evals — start here.
Tool · optional
LLM-as-Judge Rubric Builder
Design an LLM-as-judge rubric interactively.
Challenge
Token-Level F1
Implement a real metric by hand, with tests.
Handbook
The Agent Evaluations Handbook
Evaluate agents — multi-step, tool-using.
AI System Design
Design an AI Agent System
The agent system your evals must judge.
Handbook
The Senior AI Engineer Interview Handbook
Tie evals into senior-scope AI engineering.
