AI ENGINEERING
AI Evaluation
How you know an AI system actually works: LLM-as-judge, agent evals, and metrics like token-F1. Evaluation is the half of AI engineering that separates demos from production.
Handbooks (2)
Handbook
51 LLM Evals Interview Questions
Golden sets, LLM-as-judge, regression testing, offline vs online evals, RAG evals, agent evals, red-teaming, and observability — demystified for interviews and production.
Handbook
The Agent Evaluations Handbook
A self-contained handbook on evaluating AI agents: theory, interactive widgets, and practical guidance. It covers trajectory evals, tool-use scoring, LLM-as-judge, observability, and reliability, and is written for PMs, engineers, and founders.