LEARNING PATH · AI Engineering

Master LLM Evaluation

For engineers who need to prove their AI actually works.

Shipping an LLM feature is easy; knowing whether it is good is the hard part. Build the vocabulary of evals, design a judge rubric, implement a metric by hand, then evaluate agents and a real production system.

  • Speak precisely about offline vs. online evals, judges, and rubrics
  • Design an LLM-as-judge rubric that resists gaming
  • Implement eval metrics like token-level F1 from scratch
  • Evaluate agents and conversational systems, not just single calls
  1. Handbook

    51 LLM Evals Interview Questions

    The vocabulary of evals — start here.

  2. Tool · optional

    LLM-as-Judge Rubric Builder

    Design an LLM-as-judge rubric interactively.

  3. Challenge

    Token-Level F1

    Implement a real metric by hand, with tests.

  4. Handbook

    The Agent Evaluations Handbook

    Evaluate multi-step, tool-using agents.

  5. AI System Design

    Design an AI Agent System

    The agent system your evals must judge.

  6. Handbook

    The Senior AI Engineer Interview Handbook

    Tie evals into senior-scope AI engineering.
