AI ENGINEERING

AI Evaluation

How you know an AI system actually works: LLM-as-judge, agent evals, and metrics like token-F1. This is the half of AI engineering that separates demos from production.

4 pieces · 3 formats

Handbooks (2)

Handbook

51 LLM Evals Interview Questions

Golden sets, LLM-as-judge, regression testing, offline vs online evals, RAG evals, agent evals, red-teaming, and observability — demystified for interviews and production.

AI · Engineering

Handbook

The Agent Evaluations Handbook

A self-contained handbook on evaluating AI agents — theory, interactive widgets, and practical guidance. Trajectory evals, tool-use scoring, LLM-as-judge, observability, and reliability for PMs, engineers, and founders.

AI · Engineering

Coding Challenges (1)

Challenge

Token-Level F1

The metric behind QA evaluation (SQuAD and friends): how well does a predicted answer overlap a reference as a bag of words? Compute token precision and recall, then their harmonic-mean F1. Solve it in Python or TypeScript.

AI Engineering · Evals · NLP
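
The Token-Level F1 challenge above can be sketched in a few lines of Python. This is a minimal illustration, not the challenge's reference solution: the function name `token_f1` is hypothetical, and tokenization is assumed to be lowercasing plus whitespace splitting. SQuAD's official scoring script additionally strips punctuation and articles before comparing.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-words F1 between a predicted answer and a reference answer.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection: each shared token counts min(pred, ref) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)  # share of predicted tokens that match
    recall = overlap / len(ref_tokens)      # share of reference tokens recovered
    return 2 * precision * recall / (precision + recall)


print(round(token_f1("the cat sat", "the cat"), 2))  # 0.8
print(token_f1("Paris", "paris"))                    # 1.0
```

Using `Counter` intersection rather than a set keeps repeated tokens honest: a prediction that repeats a word five times is only credited for as many copies as the reference contains.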

Interactive Tools (1)

Tool

LLM-as-Judge Rubric Builder

Define your evaluation criteria and a scoring scale, then generate a clean, copy-pasteable LLM-as-judge prompt you can drop into your eval pipeline — with the common pitfalls (position bias, verbosity bias, ties) called out. Turns eval theory into a prompt you can ship.

AI · Evals · LLM
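
To make the idea behind the rubric builder concrete, here is a rough sketch of how criteria and a scoring scale might be turned into a pairwise judge prompt. It is not the tool's actual output: the function `build_judge_prompt`, its parameters, the bracketed placeholders, and the prompt wording are all assumptions chosen for illustration. The three guardrail lines correspond to the pitfalls named above (position bias, verbosity bias, ties).

```python
def build_judge_prompt(criteria: dict, scale_min: int = 1, scale_max: int = 5) -> str:
    """Build a pairwise LLM-as-judge prompt template (hypothetical format).

    criteria maps a criterion name to a one-line description.
    """
    if not criteria:
        raise ValueError("at least one criterion is required")
    if scale_min >= scale_max:
        raise ValueError("scale_min must be less than scale_max")

    rubric = "\n".join(f"- {name}: {desc}" for name, desc in criteria.items())

    return (
        "You are an impartial judge comparing two responses to the same question.\n\n"
        f"Score each response on every criterion from {scale_min} (worst) "
        f"to {scale_max} (best):\n"
        f"{rubric}\n\n"
        "Guardrails:\n"
        # Position bias: judges tend to favor whichever response comes first.
        "- The order of Response A and Response B is arbitrary; do not favor either position.\n"
        # Verbosity bias: judges tend to favor longer answers.
        "- Do not reward length by itself; judge only the listed criteria.\n"
        # Ties: allow an explicit tie so the judge is not forced to pick a winner.
        "- If the responses are equally good, answer TIE instead of forcing a winner.\n\n"
        "Question:\n[QUESTION]\n\n"
        "Response A:\n[RESPONSE A]\n\n"
        "Response B:\n[RESPONSE B]\n\n"
        "List the scores for each response, then end with one line: "
        "Verdict: A, Verdict: B, or Verdict: TIE."
    )


prompt = build_judge_prompt({
    "accuracy": "Claims are factually correct.",
    "helpfulness": "The response addresses what was asked.",
})
print(prompt)
```

A prompt instruction alone rarely removes position bias; a common complement is to run each comparison twice with the two responses swapped and keep only verdicts that agree.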
