Detailed answers, visual diagrams & interview tips.
Everything you need to work confidently in AI.
The bedrock. What an LLM actually is, how it differs from what came before, and the invisible mechanics — tokenization, embeddings, foundation models — that shape everything downstream.
An LLM is a deep neural network trained on massive text corpora to predict the next token in a sequence. At its core, it learns statistical patterns of language — grammar, facts, reasoning patterns, and stylistic nuances — by processing billions of text samples.
The "large" refers to both parameter count (often billions to trillions) and training data scale. Modern LLMs like GPT-4, Claude, and Llama use the Transformer architecture and are trained with self-supervised learning: they predict masked or next tokens, building rich internal representations of language.
After pretraining, they're typically fine-tuned with RLHF (Reinforcement Learning from Human Feedback) or similar alignment techniques to follow instructions and be helpful.
| Dimension | Classic Models (n-gram, RNN) | Modern LLMs (Transformer) |
|---|---|---|
| Architecture | n-grams, RNNs, LSTMs | Transformers with self-attention |
| Scale | Millions of parameters | Billions to trillions of parameters |
| Context | Fixed, short window | Thousands to millions of tokens |
| Emergent abilities | None — task-specific | In-context learning, chain-of-thought |
| Transfer | Separate model per task | One model → many tasks |
Classic n-gram models were essentially lookup tables: they computed P(word | previous n-1 words) from counts and that was it. RNNs did learn representations, but were typically trained for one task at a time. LLMs learn generalizable representations that can handle tasks never seen during training.
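To make the lookup-table point concrete, here is a minimal sketch of a count-based bigram model (n = 2). The toy corpus and the function name `train_bigram` are illustrative assumptions, not part of any library.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count-based bigram model: P(next | prev) = count(prev, next) / count(prev)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    # Normalize each row of counts into a probability table.
    return {
        prev: {word: c / sum(ctr.values()) for word, c in ctr.items()}
        for prev, ctr in counts.items()
    }

# Toy corpus (assumption for illustration only).
corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)

print(model["the"])          # probabilities of words seen after "the"
print(model.get("dog", {}))  # unseen context: the table simply has no entry
```

The second lookup shows the limitation: a context that never appeared in training has no row at all, so the model cannot generalize to it.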
A foundation model is a large model trained on broad data at scale, designed to be adapted to a wide range of downstream tasks. The term was coined in 2021 by researchers at Stanford's Center for Research on Foundation Models (CRFM), part of Stanford HAI.
Key properties: Generality (diverse training data), Adaptability (fine-tune or prompt for specific tasks), and Emergent capabilities (behaviors not explicitly programmed). Examples: GPT-4 (language), CLIP (vision-language), Stable Diffusion (image generation).
This paradigm is both powerful (fewer models to train) and risky — a single model's biases propagate everywhere it's deployed.
| Capability | GPT-3 | GPT-4 |
|---|---|---|
| Modality | Text only | Text + Images (multimodal) |
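| Context length | ~2K tokens (4K for GPT-3.5) | 8K–32K at launch; up to 128K with GPT-4 Turbo |
| Reasoning | Mediocre on exams | Around the 90th percentile on a simulated bar exam, with strong SAT and GRE scores (per OpenAI's report) |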
| Instruction following | Often inconsistent | Reliable over long outputs |
| Safety | Limited red-teaming | Extensive RLHF + red-teaming |
The jump was less about architectural novelty and more about scale, data quality, and alignment. GPT-4 unlocked use cases like document analysis, complex code generation, tutoring, and professional-grade writing that GPT-3 couldn't reliably handle.
Tokenization converts raw text into discrete tokens (subword units) that the model processes. Modern LLMs use subword tokenizers like BPE (Byte-Pair Encoding), WordPiece, or SentencePiece.
Why it matters for behavior: The tokenizer determines what atomic units the model sees. Poor tokenization of certain languages (e.g., non-Latin scripts) means more tokens for the same content, degrading performance and effective context capacity.
Why it matters for costs: API pricing is per-token. Non-English text or unusual variable names can tokenize 2–5× less efficiently than standard English, directly inflating costs.
Hidden gotcha: Tokenization artifacts explain why LLMs can't reliably count letters in a word — they see subword chunks, not individual characters.
Embeddings are dense vector representations that map discrete tokens (or sentences, documents) into a continuous vector space where semantic similarity = geometric proximity.
| Where | What embeddings do | Example |
|---|---|---|
| Inside LLMs | First layer converts token IDs to vectors | 768-dim vectors per token |
| RAG pipelines | Documents and queries → vectors in a DB | Pinecone, Qdrant, pgvector |
| Evaluation | Semantic similarity between output and ground truth | BERTScore, cosine sim |
| Clustering | Group related documents | topic modeling |
Classic example: king - man + woman ≈ queen. Good embeddings capture semantic relationships geometrically. Modern contrastive learning produces embeddings where similar meanings are close and dissimilar ones are far apart.
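The analogy above can be checked with cosine similarity. This is a minimal sketch using hand-made 3-dimensional vectors; the numbers are invented for illustration and do not come from a trained embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy vectors (assumed): dims roughly mean [royalty, maleness, femaleness].
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

# king - man + woman, computed component by component.
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Rank every word by similarity to the target vector.
nearest = max(vectors, key=lambda word: cosine(vectors[word], target))
print(nearest)
```

Real embeddings have hundreds or thousands of dimensions, and analogy queries usually exclude the input words from the candidate set.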
How do models actually work under the hood? From handling unknown words to understanding the encoder-decoder split, this section covers the building blocks before the Transformer era — and what led to it.
Modern LLMs effectively eliminated the OOV (out-of-vocabulary) problem through subword tokenization.
| Method | How It Works | Used By |
|---|---|---|
| BPE | Starts with characters, iteratively merges most-frequent pairs | GPT-2, GPT-3, GPT-4 |
| Byte-level BPE | Operates on raw bytes (0–255) — truly universal, handles emoji | GPT-2 onward |
| WordPiece | Similar to BPE but optimizes likelihood of training data | BERT, DistilBERT |
| SentencePiece | Language-agnostic library (BPE or unigram); treats input as a raw character stream with whitespace as an ordinary symbol, no pre-tokenization | T5, LLaMA, multilingual models |
The tradeoff: rare words get split into more tokens, consuming more context and giving the model less direct "understanding" of them as atomic units. This is why LLMs struggle with very rare proper nouns — they see fragments, not whole words.
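The merge procedure in the BPE row can be sketched in a few lines. This is a simplified character-level version under stated assumptions: the word frequencies are a toy example, ties are broken by first occurrence, and production tokenizers add byte fallback, pre-tokenization, and special tokens.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Start from characters and apply `num_merges` greedy merges."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Toy word counts (assumed for illustration).
merges, segmented = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=2)
print(merges)
print(segmented)
```

Frequent suffixes become single tokens after a couple of merges, while rarer words stay split into more pieces, which is the tradeoff described above.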
Seq2seq (sequence-to-sequence) maps an input sequence to an output sequence of potentially different length. Originally proposed by Sutskever et al. (2014) using RNNs.
Problems it solves: Machine translation, summarization, question answering, dialogue — any task where input and output have different lengths and structures.
Key limitation: The fixed-size context vector is an information bottleneck. Long inputs get compressed into the same size vector as short ones, losing information. This is exactly what attention mechanisms (and later Transformers) were designed to fix.
Evolution: Seq2seq + attention → Transformer encoder-decoder (T5) → decoder-only LLMs (GPT). The paradigm still lives in T5, BART, and mBART.
Encoder-decoder: Uses both an encoder stack and a decoder stack. The encoder builds a representation of the input; the decoder generates the output while cross-attending to the encoder's representations. Examples: T5, BART, original Transformer.
Modern trend: Decoder-only models (GPT, Claude, Llama) dominate because they're simpler to scale and can handle both understanding and generation in one architecture.
| | Autoregressive (GPT) | Masked (BERT) |
|---|---|---|
| Training objective | Predict next token: P(xₜ \| x₁…xₜ₋₁) | Predict masked tokens: P(x_masked \| x_unmasked) |
| Attention direction | Causal — left context only | Bidirectional — full context |
| Best for | Generation (text, code) | Understanding (classification, NER) |
| Examples | GPT-2/3/4, Claude, LLaMA | BERT, RoBERTa, DistilBERT |
| Generates naturally? | Yes — token by token | No — predicts all masks at once |
Hybrid approaches: XLNet uses permutation language modeling for bidirectional context with autoregressive generation. T5 uses span corruption as a middle ground. The field converged on autoregressive for large-scale LLMs.
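MLM is BERT's core training objective. During training, ~15% of tokens are randomly selected as prediction targets. In the original BERT recipe, 80% of the selected tokens are replaced with [MASK], 10% with a random token, and 10% are left unchanged.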
Why it works: Bidirectional context forces the model to use both left and right context, building richer representations. It's fully self-supervised — no labeled data needed.
Limitation: The [MASK] token never appears at inference time (train-test mismatch). Also, only ~15% of tokens provide learning signal per example — less sample-efficient than autoregressive training where every token is a target.
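The masking step can be sketched as follows, assuming the commonly cited BERT recipe (select ~15% of positions; of those, 80% become [MASK], 10% a random token, 10% stay unchanged). The function name, the toy vocabulary, and the use of `None` for "no loss at this position" are illustrative choices.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                        # position not selected: no loss here
        labels[i] = tok                     # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK                # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels

vocab = ["a", "b", "c", "d"]
tokens = [vocab[i % 4] for i in range(1000)]
inputs, labels = mask_tokens(tokens, vocab)
num_targets = sum(label is not None for label in labels)
print(num_targets, "of", len(tokens), "positions carry a training signal")
```

Only the selected positions contribute to the loss, which is the sample-efficiency limitation noted above.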
NSP was introduced alongside MLM in the original BERT paper. The model is given two sentences and must predict whether sentence B actually follows sentence A in the corpus.
| Input | Label |
|---|---|
| "The dog barked loudly." + "It was a German Shepherd." | IsNext ✓ |
| "The dog barked loudly." + "Paris is the capital of France." | NotNext ✗ |
Purpose: Help the model understand inter-sentence relationships, useful for QA and natural language inference.
Why it fell out of favor: RoBERTa showed that NSP doesn't help and can even hurt performance. The task is too easy — the model can distinguish random pairs using topic mismatch alone, without learning deep discourse relationships.
What replaced it: RoBERTa dropped NSP entirely. ALBERT replaced it with Sentence Order Prediction (SOP) — both sentences are from the same document but may be swapped, forcing actual coherence understanding.
The mechanism that changed everything. Why Transformers won, how attention actually computes what to focus on, and why multi-head attention is more than just "more is better."
Five reasons Transformers won:
1. Parallelization: RNNs process tokens sequentially — Transformers process all positions in parallel via self-attention, enabling massive GPU utilization.
2. Long-range dependencies: In RNNs, information degrades over distance. In Transformers, self-attention connects every position to every other — path length is O(1) regardless of distance.
3. No information bottleneck: RNN seq2seq compressed the entire input into one fixed-size vector. Transformers maintain per-token representations throughout.
4. Scalability: Parallel nature makes it scale efficiently. This enabled training on orders of magnitude more data.
5. Cleaner gradient flow: Residual connections + layer normalization give much cleaner gradient paths than deep RNNs.
Self-attention is permutation-equivariant: it treats the input as a set, not a sequence. Without positional information, reordering the tokens only reorders the outputs, so the model could not tell "dog bites man" from "man bites dog".
| Method | How It Works | Advantage | Used In |
|---|---|---|---|
| Sinusoidal | Fixed sin/cos functions of different frequencies | Generalizes to unseen lengths | Original Transformer |
| Learned | A position embedding per position, trained | Simple to implement | GPT, BERT |
| RoPE | Rotates Q and K vectors by position | Captures relative positions, extrapolates | LLaMA, Mistral |
| ALiBi | Linear bias on attention scores based on distance | Efficient for long contexts | BLOOM |
The choice of positional encoding significantly impacts a model's ability to handle long contexts. RoPE has become the dominant choice for modern open-source LLMs because it naturally captures relative positions and supports context length extension via techniques like NTK-aware scaling.
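As a reference point for the table, here is a minimal sketch of the sinusoidal scheme from the original Transformer: PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). The function name is an assumption.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Fixed positional encoding for one position, as a list of d_model floats."""
    encoding = []
    for i in range(0, d_model, 2):            # i runs over the even dimension indices
        angle = pos / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))      # even dimension
        encoding.append(math.cos(angle))      # odd dimension
    return encoding[:d_model]                 # trim if d_model is odd

print(sinusoidal_encoding(0, 4))
print(sinusoidal_encoding(1, 4))
```

Because the encoding is a fixed function of the position, it can be evaluated for positions longer than any sequence seen in training, which is the "generalizes to unseen lengths" property in the table.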
Attention is a mechanism that lets each token dynamically focus on the most relevant parts of the input when computing its representation. It computes a weighted sum of values, where weights are determined by the compatibility between queries and keys.
Types of attention in Transformers:
Self-attention: Tokens in a sequence attend to each other — every token can interact with every other token.
Cross-attention: Decoder tokens attend to encoder representations — used in encoder-decoder models.
Causal attention: Restricted self-attention where tokens only attend to past positions — used in all decoder-only LLMs.
Multi-head attention runs several attention operations in parallel, each with its own learned projection matrices, then concatenates and projects the results.
Why multiple heads matter:
One head might focus on syntactic relationships (subject-verb agreement), another on semantic relationships (coreference), and another on positional patterns. A single head can only capture one type of interaction per layer.
Practical details: If model dimension d=512 and we use h=8 heads, each head operates in d/h=64 dimensions. Total compute ≈ same as single-head at d=512, but representation capacity is much richer.
Research finding: Not all heads are equally important — some can be pruned with minimal quality loss, while others are critical for specific capabilities.
The full formula: Attention(Q,K,V) = softmax(QKᵀ / √d_k) · V
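Why divide by √d_k? For query and key components with zero mean and unit variance, the dot product has mean 0 but variance d_k, so typical scores grow like √d_k. Without scaling, softmax would saturate on high-dimensional vectors, producing near-zero gradients and making training unstable. Dividing by √d_k brings the variance back to about 1.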
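The formula can be written out directly. This is a minimal pure-Python sketch for small lists of vectors; the `causal` flag is an added illustration of the causal masking used in decoder-only models, and real implementations use batched matrix multiplications on GPU.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    outputs = []
    for i, q in enumerate(Q):
        # Scaled dot product of this query with every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Mask future positions so they receive zero weight.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Weighted sum of value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return outputs

Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
full = attention(Q, K, V)
masked = attention(Q, K, V, causal=True)
print(full)
print(masked)
```

With the causal mask, the first token can only attend to itself, so its output equals its own value vector.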
Softmax serves three critical purposes in the attention mechanism:
| Purpose | Why It Matters |
|---|---|
| Normalization | Converts raw scores into a probability distribution that sums to 1 — ensures output is a proper weighted average of values |
| Sparsification | Amplifies large differences — high-probability tokens get disproportionately high weights, low ones pushed toward zero — creates focused selection |
| Differentiability | Unlike hard argmax, softmax is smooth everywhere, enabling gradient-based training |
Alternatives being explored:
Linear attention: Removes softmax for O(n) complexity instead of O(n²), but loses the sparsification benefit.
Sigmoid attention: Some recent work replaces softmax with an elementwise sigmoid, so each attention weight is computed independently instead of competing with the other keys.
Sparse attention: Uses top-k or local windows to approximate softmax more efficiently (Longformer, BigBird).
How does the model turn probability distributions into words? Temperature, top-p, beam search — these are the controls that separate generic output from precisely what you wanted.
The dot product appears as the similarity function between query and key vectors: score(q, k) = q · k = Σ(qᵢ × kᵢ)
| Reason for Dot Product | Why It Works |
|---|---|
| Computational efficiency | Dot products are just matrix multiplications — extremely fast on GPUs (QKᵀ is a single batched op) |
| Geometric meaning | Dot product measures vector alignment. Similar directions (semantically related) → high score. Orthogonal → zero score |
| Simplicity | Fewer parameters than additive attention, faster to compute |
| Differentiable | Smooth gradients for backprop through the similarity computation |
The scaling: Raw dot products grow with dimension. For unit-variance components the variance of q · k is proportional to d_k, so typical magnitudes grow like √d_k. Without the ÷ √d_k scaling, softmax saturates at high dimensions, producing near-zero gradients. That's why QKᵀ / √d_k is always seen together.
In practice, QKᵀ is a single matrix multiplication computing every query's dot product with every key simultaneously — the core of efficient Transformer inference.
| | Greedy | Beam Search |
|---|---|---|
| Speed | O(V) per step | O(B·V) per step |
| Quality | Locally optimal at each step | Higher-likelihood sequences overall (still approximate) |
| Use for | Low-latency, deterministic decoding | Translation, summarization, ASR |
| Output style | Deterministic; prone to repetition loops | High-likelihood but often generic, repetitive |
Modern practice: Most LLM apps use sampling (temperature + top-p) rather than beam search, because beam search tends to produce generic text. Beam search remains important for structured output tasks with a narrow correct answer space.
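The difference between the two strategies shows up even on a tiny example. This sketch uses a hand-written next-token table (`TOY_LM`, an assumption standing in for a real model) in which the locally best first token leads to a worse overall sequence.

```python
import math

# Toy "language model": maps a prefix to next-token probabilities (assumed values).
TOY_LM = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"z": 0.9, "w": 0.1},
}

def greedy(steps):
    """Pick the single most likely token at every step."""
    seq, logp = (), 0.0
    for _ in range(steps):
        tok, p = max(TOY_LM[seq].items(), key=lambda kv: kv[1])
        seq, logp = seq + (tok,), logp + math.log(p)
    return seq, logp

def beam_search(steps, beam_width):
    """Keep the `beam_width` highest-scoring partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + (tok,), logp + math.log(p))
            for seq, logp in beams
            for tok, p in TOY_LM[seq].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

greedy_seq, greedy_logp = greedy(2)
beam_seq, beam_logp = beam_search(2, beam_width=2)
print(greedy_seq, math.exp(greedy_logp))
print(beam_seq, math.exp(beam_logp))
```

Greedy commits to "A" (0.6) and ends with sequence probability 0.30, while a beam of width 2 keeps "B" alive and finds "B z" at 0.36. A beam of width 1 reduces to greedy decoding.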
Temperature is a scalar that modifies logits before softmax: softmax(logits / T)
| Temperature | Effect | Use For |
|---|---|---|
| T → 0 | Greedy decoding (argmax) | Deterministic outputs |
| T = 0.0–0.3 | Focused, consistent | Factual QA, code, structured outputs |
| T = 0.7–1.0 | Natural, varied | Conversational AI, writing |
| T > 1.0 | Chaotic, unexpected | Brainstorming (rarely used in production) |
| T → ∞ | Uniform random sampling | Never useful |
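
The effect of temperature is easy to see numerically. A minimal sketch of softmax(logits / T), with made-up logits; T must be strictly positive here, and T → 0 is handled in practice by switching to argmax.

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T before the softmax; requires T > 0."""
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]   # illustrative values
cold = softmax_with_temperature(logits, 0.2)
neutral = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)
for name, probs in [("T=0.2", cold), ("T=1.0", neutral), ("T=2.0", hot)]:
    print(name, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token; higher temperatures flatten the distribution toward uniform.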
Both are filtering strategies applied before sampling to avoid drawing from the long tail of low-probability tokens.
| | Top-k Sampling | Top-p (Nucleus) Sampling |
|---|---|---|
| How it works | Keep the k highest-probability tokens | Keep smallest set where cumulative prob ≥ p |
| Candidate set size | Fixed at k | Adapts dynamically to model confidence |
| When model is confident | Still keeps k candidates | Keeps fewer candidates (maybe 2–3) |
| When model is uncertain | Still keeps k candidates | Keeps more candidates (maybe 50+) |
| Common default | k = 50 | p = 0.9 |
| Preference | Simpler but less adaptive | Preferred in most production systems |
Problem with top-k: If k=50 but 3 tokens have 95% of probability mass, you're still sampling from 47 near-zero-probability tokens — wasting candidate slots on bad choices.
Best practice: Many APIs apply both — top-k to cap maximum candidates, then top-p within that set. Common combination: top-p=0.9, top-k=50.
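The combined filter can be sketched as below: cap the candidates with top-k, keep the smallest prefix whose cumulative probability reaches p, then renormalize. The function name and the toy distributions are assumptions; API implementations may differ in tie-breaking and ordering of the two filters.

```python
def top_k_top_p_filter(probs, k=50, p=0.9):
    """Return a renormalized distribution over the tokens that survive both filters."""
    # Top-k: keep at most k tokens, highest probability first.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Top-p: keep the smallest prefix whose cumulative probability is >= p.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

confident = {"a": 0.7, "b": 0.25, "c": 0.03, "d": 0.02}   # model is sure
uncertain = {"a": 0.3, "b": 0.3, "c": 0.2, "d": 0.2}      # model is unsure

print(top_k_top_p_filter(confident))         # nucleus shrinks to the top tokens
print(top_k_top_p_filter(uncertain))         # nucleus widens
print(top_k_top_p_filter(uncertain, k=2))    # top-k caps the candidate set
```

A sampler would then draw from the returned distribution, for example with `random.choices` over its keys and values.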
Adaptive softmax is an approximation technique for the output softmax layer when vocabulary is very large (100K+ tokens).
The problem: Computing softmax over the entire vocabulary requires computing logits for every token — a huge matrix multiplication that is the computational bottleneck during training.
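Efficiency gain: The vocabulary is split by frequency into a small head cluster of common tokens and one or more tail clusters of rare tokens. Most of the time, the model only needs to compute the full softmax over the small head cluster. Tail clusters are only evaluated when needed, providing 2–10× speedups on the output layer.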
Modern relevance: With BPE vocabularies around 32K–100K and faster hardware, adaptive softmax is less critical than before — but the concept of variable-capacity allocation based on frequency remains influential in efficient NLP.
Cross-entropy loss measures the difference between the model's predicted probability distribution and the true distribution: L = -log(P(correct_token))
| Property | Why It Matters |
|---|---|
| Direct likelihood optimization | Minimizing cross-entropy ≡ maximizing likelihood of training data. The most principled objective for probabilistic models. |
| Information-theoretic meaning | Measures the average number of bits (nats, when using the natural log) needed to encode true data using the model's distribution. Lower = better compression = better model. |
| Relationship to perplexity | Perplexity = e^(cross-entropy). Perplexity of 10 = "model as confused as if 10 equally likely tokens." |
| Gradient behavior | Penalizes confident wrong predictions heavily, moderate gradients for uncertain predictions — encourages calibration naturally. |
Perplexity formula: PPL = exp(H(p,q)) — the standard evaluation metric for language models. Lower is better.
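A short sketch of both quantities, using natural logarithms so that PPL = exp(cross-entropy). The input is the probability the model assigned to each correct token; the values are invented for illustration.

```python
import math

def cross_entropy(correct_token_probs):
    """Average negative log-likelihood (in nats) of the correct tokens."""
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)

def perplexity(correct_token_probs):
    """exp of the average cross-entropy."""
    return math.exp(cross_entropy(correct_token_probs))

uniform_over_ten = [0.1] * 5          # model spreads mass over 10 equally likely tokens
sharper = [0.9, 0.8, 0.95, 0.7, 0.85]

print(perplexity(uniform_over_ten))
print(perplexity(sharper))
```

The first case reproduces the "as confused as 10 equally likely tokens" reading of perplexity.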
The math that makes learning happen. Gradients, backprop, vanishing signals, and why ReLU matters. This section separates engineers who use LLMs from those who understand them.
Embedding layers are lookup tables (matrices) where each row corresponds to a token. During training, the forward pass selects the rows for the token IDs in the batch, and the backward pass sends gradients only to those rows.
Sparse gradients: Rows for tokens absent from the current batch receive zero gradient, so each step updates only a small slice of the matrix.
Key implications:
• Rare tokens learn slowly — fewer gradient updates, less refined embeddings. This is why models struggle with very rare proper nouns.
• The embedding matrix is often the largest single parameter block in the model (vocab_size × hidden_dim).
• Weight tying: Many models share the embedding matrix with the output projection (lm_head), which acts as a regularizer and reduces parameter count.
The Jacobian is the matrix of all first-order partial derivatives of a vector-valued function. For f: ℝⁿ → ℝᵐ, the Jacobian J is an m×n matrix where J_ij = ∂fᵢ/∂xⱼ.
| Role in Backprop | What It Means |
|---|---|
| Gradient transformation | Gradient through a layer = Jᵀ · upstream_gradient. The Jacobian determines how error signals are transformed at each layer. |
| Vanishing gradients | If singular values of J are consistently < 1, gradients shrink exponentially. This causes vanishing gradients in deep networks. |
| Exploding gradients | If singular values > 1 consistently, gradients explode. Gradient clipping is the mitigation. |
| Computational efficiency | Backprop never builds the full m×n Jacobian. It computes vector-Jacobian products (VJPs, Jᵀ·v) directly, at a cost comparable to the layer's forward pass. |
Conditioning: Well-conditioned Jacobians (singular values ≈ 1) lead to stable training. Layer normalization and careful initialization aim to maintain this throughout the network.
The chain rule is the mathematical foundation of backpropagation. For composition f(g(x)), the derivative is f'(g(x)) · g'(x).
Why backprop is efficient: One backward pass yields the gradients for all L layers using O(L) layer-wise products, because each layer reuses the upstream gradient already computed for the layer above it, along with activations cached from the forward pass. Recomputing the chain from scratch for every layer would cost O(L²).
Practical concerns: The chain of multiplications can cause vanishing/exploding gradients. Residual connections (x + f(x)) mitigate this by providing a direct gradient highway — the gradient of x + f(x) w.r.t. x is I + ∂f/∂x, so even if ∂f/∂x vanishes, the identity term I preserves the gradient.
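The gradient-highway argument can be checked numerically on a scalar toy network. This sketch assumes each layer is f(x) = 0.1·tanh(x), chosen only so that the local derivative is always small; it tracks the chain-rule product with and without the skip connection.

```python
import math

def grad_through_stack(x, depth, residual, scale=0.1):
    """Derivative of the stack's output w.r.t. its input, via the chain rule."""
    grad = 1.0
    for _ in range(depth):
        fx = scale * math.tanh(x)                    # layer output f(x)
        local = scale * (1.0 - math.tanh(x) ** 2)    # f'(x), at most `scale`
        if residual:
            x, grad = x + fx, grad * (1.0 + local)   # d(x + f(x))/dx = 1 + f'(x)
        else:
            x, grad = fx, grad * local               # d f(x)/dx = f'(x)
    return grad

plain = grad_through_stack(0.5, depth=50, residual=False)
skip = grad_through_stack(0.5, depth=50, residual=True)
print(plain)   # shrinks toward zero as depth grows
print(skip)    # stays at or above 1 because of the identity term
```

In this toy setting the plain stack multiplies fifty factors no larger than 0.1, while the residual stack multiplies factors of the form 1 + f'(x), so the signal survives.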
ReLU (Rectified Linear Unit): f(x) = max(0, x). Despite its simplicity, it revolutionized deep learning training.
| Activation | Formula | Gradient Issue | Used In |
|---|---|---|---|
| Sigmoid | 1/(1+e⁻ˣ) | Saturates at both ends → vanishing gradient | Old networks, output layers |
| Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | Saturates at ±1 → vanishing gradient | RNNs, old networks |
| ReLU | max(0,x) | Gradient = 1 for x>0, dying ReLU for x<0 | CNNs, ResNets |
| GELU | x·Φ(x) | Smooth, no dying neurons | BERT, GPT-2, GPT-3 |
| SwiGLU | Gated: Swish(Wx) ⊙ (Vx), with Swish(z) = z·σ(z) | Best empirical performance | LLaMA, Mistral, PaLM |
Why ReLU was a breakthrough: The gradient is exactly 1 for every x > 0, so it does not shrink as activations grow. Sigmoid and tanh saturate at both extremes, which multiplies gradients by near-zero factors; ReLU avoids this in the positive region and made much deeper networks trainable.
Why modern LLMs use GELU/SwiGLU: ReLU has a "dying ReLU" problem (neurons stuck at 0) and isn't zero-centered. GELU provides a smooth approximation; SwiGLU adds a gating mechanism that empirically outperforms both.
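The activations in the table are short enough to write out. A scalar sketch: `gelu` uses the exact form x·Φ(x) via `math.erf`, `silu` is Swish with β = 1, and `swiglu` is a one-dimensional toy of the gated form Swish(w·x)·(v·x), where the scalar weights `w` and `v` stand in for the learned matrices.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    # Swish with beta = 1: x * sigmoid(x).
    return x / (1.0 + math.exp(-x))

def swiglu(x, w, v):
    # Gate branch Swish(w * x) multiplied by the linear branch v * x.
    return silu(w * x) * (v * x)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, relu(x), round(gelu(x), 4), round(silu(x), 4))
```

Unlike ReLU, GELU and SiLU pass a small negative signal for negative inputs, so a unit with a negative pre-activation still receives gradient.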
Root cause of vanishing gradients: During backpropagation, gradients are multiplied by the Jacobian at each layer. If these Jacobians have spectral norms consistently < 1 (from saturating activations), the gradient shrinks exponentially with depth.
In PCA (Principal Component Analysis) and spectral methods, eigenvalues and eigenvectors provide the geometric decomposition of the data's variance structure.
| Concept | What It Represents | How It's Used |
|---|---|---|
| Eigenvectors | Principal directions (axes) of maximum variance in the data | Define the new coordinate system for PCA projection |
| Eigenvalues | Magnitude of variance along each eigenvector | Determine how much information each direction captures |
| Top-k eigenvectors | The most informative directions | Keep these to reduce dimensions while preserving structure |
| Eigenvalue ratio | λ_k / Σλᵢ = fraction of variance explained by component k | Sum the top ratios to decide how many dimensions to keep |
Connection to LLMs:
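• LoRA: Exploits the fact that weight update matrices have low intrinsic rank. It learns the low-rank factors directly by gradient descent; an SVD of the update is the analysis tool that shows a few singular directions carry most of the change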
• Embedding visualization: PCA projects 768-dim BERT embeddings to 2D for visualization
• Attention analysis: Eigendecomposition of attention matrices reveals what the model focuses on
How do you make a model better without breaking what it already knows? LoRA, QLoRA, KL divergence, catastrophic forgetting, and knowledge distillation — the practical toolkit for customizing LLMs.
KL divergence (Kullback-Leibler divergence) measures how one probability distribution P differs from a reference distribution Q:
KL(P||Q) = Σ P(x) · log(P(x)/Q(x))
| Property | What It Means |
|---|---|
| Always ≥ 0 | Equals 0 only when P = Q exactly |
| Asymmetric | KL(P\|\|Q) ≠ KL(Q\|\|P). Matters which distribution is the reference. |
| Relation to cross-entropy | KL(P\|\|Q) = H(P,Q) − H(P). Cross-entropy minus entropy. |
Key uses in LLMs:
RLHF constraint: During RLHF training, a KL penalty prevents the fine-tuned model from diverging too far from a frozen reference model (typically the supervised fine-tuned checkpoint). This preserves fluency while improving alignment. Without it, the model can exploit flaws in the reward model (reward hacking) and produce degenerate text that still scores high.
Knowledge distillation: Student model is trained to minimize KL divergence between its output distribution and the teacher's soft outputs.
VAE training: KL divergence regularizes the latent space toward a prior (Gaussian) distribution.
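The definition and the properties in the table can be verified on small discrete distributions. A minimal sketch with natural logs; the two distributions are arbitrary examples, and Q is assumed to be nonzero wherever P is nonzero.

```python
import math

def kl_divergence(p, q):
    """KL(P||Q) = sum_x P(x) * log(P(x) / Q(x)); terms with P(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.9, 0.1]
Q = [0.5, 0.5]

print(kl_divergence(P, Q))                 # forward direction
print(kl_divergence(Q, P))                 # reverse direction gives a different value
print(cross_entropy(P, Q) - entropy(P))    # matches KL(P||Q)
```

The first two lines differ, which is the asymmetry property, and the third reproduces the cross-entropy identity.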
| | LoRA | QLoRA |
|---|---|---|
| Base weights | Frozen, typically in 16-bit precision (fp16/bf16) | Frozen and quantized to 4-bit NF4 |
| Adapter precision | float16/bf16 | bfloat16 |
| Memory savings | 10–100× fewer trainable params | Also reduces base model memory about 4× versus 16-bit |
| Use case | Multi-GPU fine-tuning | Single GPU fine-tuning (48GB for 65B model) |
| Inference overhead | None once BA is merged into W | None once merged, but merging requires dequantizing the base weights first |
Why it works: Weight updates during fine-tuning have low intrinsic rank — most of the change concentrates in a small subspace. LoRA parameterizes the update as a low-rank product, capturing this efficiently.
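The low-rank parameterization can be sketched without any framework. This assumes the usual LoRA form h = W·x + (α/r)·B·(A·x), with A of shape r × d_in and B of shape d_out × r; the tiny matrices are made-up values, and real training would initialize B to zero and learn A and B by gradient descent.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(W, A, B, x, alpha):
    """Frozen path W x plus the scaled low-rank update (alpha / r) * B (A x)."""
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def merge(W, A, B, alpha):
    """Fold the adapter into the weights: W' = W + (alpha / r) * B A."""
    r = len(A)
    BA = matmul(B, A)
    return [[w + (alpha / r) * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

def param_counts(d_out, d_in, r):
    """(full fine-tuning parameters, LoRA parameters) for one weight matrix."""
    return d_out * d_in, r * (d_in + d_out)

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # frozen base weights
A = [[0.5, -1.0, 2.0]]                                     # rank r = 1
B = [[0.1], [0.0], [-0.2]]
x = [1.0, 2.0, 3.0]

adapter_out = lora_forward(W, A, B, x, alpha=2.0)
merged_out = matvec(merge(W, A, B, alpha=2.0), x)
print(adapter_out, merged_out)
print(param_counts(4096, 4096, 8))
```

The merged matrix gives the same output as the separate adapter path, which is why a merged adapter adds no inference cost. For a 4096 × 4096 matrix at rank 8, the adapter trains 65,536 parameters instead of 16,777,216.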
Catastrophic forgetting occurs when a neural network trained on new data loses its ability to perform well on previously learned tasks. New gradients overwrite the weights that encoded old knowledge.
Why it's especially problematic for LLMs: Fine-tuning on a narrow domain can destroy the model's general capabilities. A model fine-tuned heavily on legal text might lose its ability to write code or do math.
| Mitigation Strategy | How It Works | Example |
|---|---|---|
| PEFT (LoRA) | Freeze base weights, so the pretrained parameters are never overwritten and forgetting is greatly reduced | Default approach today |
| EWC | Regularization penalty for changing weights important to prior tasks (via Fisher information) | Elastic Weight Consolidation |
| Replay | Mix samples from original training distribution during fine-tuning | Experience replay |
| Small LR + Layer Freezing | Freeze early layers; very small LR on later layers | Common in transfer learning |
| Multi-task training | Train on new and old tasks simultaneously | Instruction-tuned models |
PEFT (Parameter-Efficient Fine-Tuning) is a family of methods that update only a small subset of model parameters while keeping most pretrained weights frozen.
| PEFT Method | What It Trains | Memory |
|---|---|---|
| LoRA / QLoRA | Low-rank decomposition of weight updates (A and B matrices) | Very low |
| Prefix Tuning | Learnable prefix tokens prepended to each layer | Low |
| Prompt Tuning | Learnable soft prompts at the input level only | Minimal |
| Adapters | Small bottleneck layers inserted between Transformer layers | Low |
| IA³ | Learned rescaling vectors for keys, values, and FFN activations | Minimal |
Why PEFT limits forgetting by construction:
• Frozen base = preserved knowledge (frozen weights can't change, and removing an unmerged adapter restores the original model exactly)
• Small parameter budget = optimization landscape constrained — model can't drift far from original
• Composability: train separate PEFT modules for different tasks, swap at inference. Base model stays pristine.
Tradeoff: Slightly lower peak performance on the target task vs full fine-tuning, but preservation of general capabilities makes this worthwhile in practice.
Knowledge distillation transfers knowledge from a large "teacher" model to a smaller "student" model by training the student to mimic the teacher's output distribution, not just match hard labels.
Why soft labels work better than hard labels: The teacher's probability distribution encodes which wrong answers are "almost right" — richer signal than a one-hot label. This dark knowledge helps the student learn the relationships between classes.
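A common way to train on soft labels is a temperature-scaled KL loss. The sketch below follows the widely used form T² · KL(teacher_T || student_T), where both distributions are softened with the same temperature T; the logits are invented, and in practice this term is usually mixed with the ordinary hard-label cross-entropy.

```python
import math

def softmax(logits, T=1.0):
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """T^2 * KL(softmax(teacher / T) || softmax(student / T))."""
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)   # student prediction at the same temperature
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.2]   # illustrative logits
student = [2.0, 1.5, 0.5]

print(softmax(teacher, T=1.0))   # nearly one-hot
print(softmax(teacher, T=2.0))   # softened: relative order of wrong classes is visible
print(distillation_loss(student, teacher))
```

Raising T exposes how the teacher ranks the non-target classes, which is the extra signal a one-hot label cannot provide.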
Applications in LLMs: Model compression (distilling GPT-4 quality into smaller models), synthetic data generation (large model generates training data for smaller), speculative decoding (small draft model verified by large model).
Limitation: Capacity gap — a 7B model can't fully absorb a 70B model's knowledge. Distillation also can't transfer emergent abilities that require scale.
In LLM fine-tuning, overfitting means the model memorizes the fine-tuning data rather than learning generalizable patterns.
Symptoms: Training loss keeps decreasing but validation loss plateaus or increases; the model generates verbatim training examples.
| Strategy | How It Helps | Priority |
|---|---|---|
| Data diversity | More varied high-quality data is the single most effective defense | ⭐ Highest |
| PEFT / LoRA | Constrains update to low-rank subspace — limits memorization capacity | ⭐ High |
| Early stopping | Monitor validation loss; stop when it starts increasing | ⭐ High |
| Small learning rate | Prevents aggressive weight updates that memorize individual examples | Medium |
| Dropout | Randomly zero activations during training — reduces co-adaptation | Medium |
| Weight decay (L2) | Regularization on parameter magnitudes | Low |
Evaluation strategy: For LLMs, you need task-specific evaluation beyond just loss — measure actual generation quality on held-out examples. Perplexity on training set vs validation set is a quick diagnostic.
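Early stopping from the table reduces to a small loop over validation losses. A minimal sketch with a `patience` parameter (an assumed but common convention): training stops once the loss has failed to improve for that many consecutive evaluations, and the best checkpoint is kept.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, best_loss) at the point where training would stop."""
    best_loss, best_epoch, bad_epochs = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0   # new best checkpoint
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                                            # stop training here
    return best_epoch, best_loss

# Illustrative validation curve: improves, then starts to rise.
val_losses = [2.0, 1.5, 1.2, 1.25, 1.3, 1.1]
print(early_stopping_epoch(val_losses, patience=2))
print(early_stopping_epoch(val_losses, patience=3))
```

With patience 2 the run stops before seeing the later improvement to 1.1; a larger patience trades extra compute for robustness to noisy validation curves.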
From discriminative vs generative to RAG pipelines and chain-of-thought. The concepts every AI practitioner — technical or not — needs to articulate clearly and correctly.
| | Discriminative Models | Generative Models |
|---|---|---|
| What they learn | Decision boundary P(y\|x) | Data distribution P(x) or joint P(x,y) |
| Goal | Given x, predict y | Generate new samples from distribution |
| Examples | Logistic regression, BERT classification, SVMs | GPT, VAEs, GANs, diffusion models |
| Best at | Classification, regression | Generation, density estimation |
| Data requirements | Less — only needs decision boundary | More — needs to model full distribution |
Modern blur: LLMs like GPT are generative models that perform discriminative tasks (classification, NLI) by generating the answer. "Is this review positive or negative?" → GPT generates "Positive." This generative approach to discriminative tasks has been surprisingly effective, blurring the traditional boundary.
Business model difference: Discriminative AI tends to save costs (automating classification). Generative AI tends to create new value (enabling new capabilities). Best products often combine both: generate content (generative) then filter/rank it (discriminative).
Prompt design fundamentally determines what part of the model's learned distribution you're sampling from. Small changes can dramatically shift output quality.
| Technique | How It Works | Example |
|---|---|---|
| Specificity | Vague prompts get average outputs | "Write about AI" vs "Write a 500-word technical post comparing RAG and fine-tuning, targeting ML engineers" |
| Role/Persona | Activates different knowledge patterns | "You are a senior ML engineer" vs bare question |
| Output format | Requesting structure helps organization | "Respond in JSON with keys: summary, pros, cons" |
| Few-shot examples | Often more effective than lengthy instructions | Provide 2–3 input→output pairs before the query |
| Decomposition | Breaking complex tasks into steps | Chain-of-thought: "Think step by step" |
System-level prompt engineering: In production, you design system prompts that set behavioral constraints, inject context, and define output schemas. This is where prompt engineering becomes system design.
Chain-of-thought (CoT) prompting encourages the model to show its reasoning steps before giving a final answer.
| Helpful When | Not Helpful When |
|---|---|
| Math and arithmetic problems | Simple factual recall ("Capital of France?") |
| Multi-step logical reasoning | Tasks where model is already highly confident |
| Complex analysis with many factors | Speed-critical applications (CoT adds tokens) |
| Planning and decision-making | Short-answer classification tasks |
Variants: Tree of Thought (explores multiple reasoning paths), Self-Consistency (sample multiple CoTs, take majority vote), Zero-shot CoT (just add "Let's think step by step").
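Self-Consistency is the easiest variant to sketch: sample several reasoning chains, extract each final answer, and take the majority vote. The code below covers only the voting step and assumes the final answers have already been extracted from the sampled chains; the sample answers are invented.

```python
from collections import Counter

def self_consistency(final_answers):
    """Majority vote over final answers; returns (answer, share of votes)."""
    counts = Counter(answer.strip().lower() for answer in final_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(final_answers)

# Final answers from five sampled chains of thought (illustrative).
sampled = ["42", "42 ", "41", "42", "40"]
answer, agreement = self_consistency(sampled)
print(answer, agreement)
```

The vote share doubles as a rough confidence signal: low agreement across samples suggests the question deserves a retry or a fallback.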
RAG (Retrieval-Augmented Generation) augments LLM generation with retrieved external knowledge.
| Stage | Key Decisions | Common Tools |
|---|---|---|
| Indexing | Chunk size (too small = no context; too large = dilutes relevance) | LangChain, LlamaIndex |
| Retrieval | Embedding model choice, top-k value, hybrid (dense+sparse) | Pinecone, Qdrant, pgvector |
| Reranking | Cross-encoder model, reranking threshold | Cohere Rerank, BGE Reranker |
| Generation | Context window budget, citation format, hallucination mitigation | Any LLM API |
Evaluation metrics: Retrieval quality (Precision@k, recall, MRR), Generation quality (faithfulness, groundedness), End-to-end (user satisfaction).
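The retrieval metrics are simple to compute once each query has a ranked result list and a set of relevant document IDs. A minimal sketch; the document IDs and relevance judgments are invented for illustration.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mean_reciprocal_rank(runs):
    """Average of 1 / rank of the first relevant result over all queries."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),   # first relevant hit at rank 2
    (["d9", "d8", "d5"], {"d5"}),               # first relevant hit at rank 3
]

print(precision_at_k(*runs[0], k=2))
print(recall_at_k(*runs[0], k=4))
print(mean_reciprocal_rank(runs))
```

Precision@k and recall@k describe the whole top-k set, while MRR only rewards how early the first relevant chunk appears, which matters when the context budget is tight.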
Knowledge graphs (KGs) store information as structured triples (entity → relationship → entity) and can significantly enhance RAG pipelines beyond what vector search alone provides.
| Benefit | How KG Adds It |
|---|---|
| Structured relationships | Traverse edges (not just find similar text) — critical for relational queries |
| Multi-hop reasoning | Naturally chains A→B→C queries that confuse vector search |
| Entity disambiguation | Resolves "Apple" (company vs fruit) via entity types |
| Hallucination reduction | Structured facts ground generation in verified triples |
| Completeness signals | Can detect when information is missing → "I don't know" vs hallucinate |
GraphRAG: Microsoft's approach combines vector retrieval for broad context with graph traversal for precise, structured facts. It uses an LLM to build a community-based KG from the documents, then queries that graph with graph algorithms plus LLM generation.
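The multi-hop and completeness rows in the table above can be illustrated with a toy in-memory triple store. The `TripleStore` class and the sample facts are made up for this sketch; a production system would typically sit on a graph database.

```python
from collections import defaultdict


class TripleStore:
    """Toy knowledge graph holding (subject, relation, object) triples."""

    def __init__(self, triples):
        self._out = defaultdict(list)
        for subject, relation, obj in triples:
            self._out[(subject, relation)].append(obj)

    def follow(self, entity: str, relation: str) -> list[str]:
        """Return every object reachable from entity via one relation edge."""
        return list(self._out.get((entity, relation), []))

    def multi_hop(self, start: str, relations: list[str]) -> list[str]:
        """Chain relations in order, e.g. A -> B -> C; an empty list means the fact is missing."""
        frontier = [start]
        for relation in relations:
            frontier = [obj for entity in frontier for obj in self.follow(entity, relation)]
        return frontier
```

An empty result is the completeness signal: the pipeline can answer "I don't know" instead of passing the question to the generator with no grounding.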
What happens at the frontier. Mixture-of-Experts, multimodal models, zero-shot and few-shot learning, context windows — and the practical limits of what scale actually buys you.
| Improvement | How Native Multimodal Achieves It |
|---|---|
| Training stability | Joint pretraining prevents distribution mismatch between independently-trained vision/language models |
| Cross-modal understanding | Cross-modal attention from earliest layers, not just late-stage adapters |
| Parameter efficiency | Single unified model vs maintaining separate vision + language models |
| Interleaved content | Handles documents with mixed text+images naturally (slides, charts, figures) |
MoE replaces the dense feedforward network (FFN) in each Transformer layer with multiple "expert" FFNs and a learned router that activates only a subset per token.
| Benefit | Detail |
|---|---|
| Parameter efficiency | A 1.8T-parameter MoE model that activates only ~280B parameters per token (figures rumored for GPT-4, not confirmed) targets near-dense quality at a fraction of the compute |
| Specialization | Different experts can specialize in different domains, learning more efficiently |
| Training cost | Scales with active parameters, not total: larger knowledge capacity for the same training FLOPs |
Challenges: Load balancing (keeping all experts in use, typically via an auxiliary loss), communication overhead in distributed training, expert collapse (most tokens routing to the same few experts), and higher memory use (every expert must stay loaded in accelerator memory even though only a few run per token).
Examples: Mixtral 8x7B, Switch Transformer, Grok (xAI), rumored GPT-4.
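The routing step described above can be sketched for a single token in plain Python. Assumptions: the router logits are passed in directly (a real model computes them with a learned linear layer over the token's hidden state), the gate weights are a softmax over only the selected experts, and each expert is a plain function from vector to vector.

```python
import math
from typing import Callable, Sequence

Vector = list[float]


def softmax(xs: Sequence[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    token: Vector,
    router_logits: Sequence[float],
    experts: Sequence[Callable[[Vector], Vector]],
    top_k: int = 2,
) -> Vector:
    """Route one token to its top_k experts and mix their outputs by gate weight."""
    # Rank experts by router score and keep the best top_k.
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize the gate weights over the chosen experts only.
    gates = softmax([router_logits[i] for i in chosen])
    output = [0.0] * len(token)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](token)  # only the chosen experts are evaluated
        output = [acc + gate * val for acc, val in zip(output, expert_out)]
    return output
```

Because unchosen experts are never called, compute per token tracks `top_k` rather than the total number of experts, while all expert weights still have to stay in memory.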
Zero-shot means the model performs a task without any task-specific training examples in the prompt. It relies entirely on pretraining knowledge and instruction-following ability.
Why LLMs can do zero-shot: During pretraining on diverse text, the model encounters millions of implicit task demonstrations and learns the pattern of instructions and responses. Instruction tuning (fine-tuning on instruction-following data, as was done for InstructGPT) dramatically improves zero-shot performance.
Limitations: Zero-shot performance varies wildly. Simple classification and translation work well; complex structured output or domain-specific tasks often need examples or fine-tuning.
Few-shot learning provides a small number of input-output examples directly in the prompt before the actual query.
| Why Few-Shot Works | Mechanism |
|---|---|
| In-context learning (ICL) | Examples set up a local pattern that autoregressive generation continues: the model copies the format but generalizes it to the new input |
| Task specification | Examples implicitly define the task more precisely than instructions alone — shows format, style, expected output |
| Distribution anchoring | Shifts the model's output distribution toward the desired pattern without any weight updates |
Best practices: Use diverse, representative examples. Order can matter (recency bias — later examples weigh more). More examples help up to a point, then returns diminish. For structured tasks, consistent formatting is critical.
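The consistent-formatting advice is easiest to follow when the prompt is built by code rather than by hand. A minimal sketch, assuming an "Input:/Output:" layout (one common convention, not a requirement) and a hypothetical helper name:

```python
def build_few_shot_prompt(instruction: str, examples: list[tuple[str, str]], query: str) -> str:
    """Render instruction, worked examples, and the query in one consistent format."""
    parts = [instruction.strip(), ""]
    for source, target in examples:
        parts += [f"Input: {source}", f"Output: {target}", ""]
    # End on an open "Output:" so the model's continuation is the answer.
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)
```

Since order can matter, keeping the examples in a list also makes it cheap to shuffle or reorder them during evaluation.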
The context window is the maximum number of tokens a model can process in a single forward pass — the model's "working memory."
| Model | Context Window | ≈ Pages of Text |
|---|---|---|
| GPT-4o | 128K tokens | ~300 pages |
| Claude 3.5 (Sonnet) | 200K tokens | ~500 pages |
| Gemini 1.5 Pro | 2M tokens | ~5,000 pages |
| Llama 3.1 (open weights) | 128K tokens | ~300 pages |
Practical limits beyond raw size: Self-attention is O(n²) in sequence length, so doubling the context roughly quadruples attention compute. Effective context < maximum context (needle-in-a-haystack tests show degraded recall as context grows). Latency scales with context length.
Best practice: Use RAG to retrieve and inject only relevant context rather than stuffing the entire window. Quality of context >> quantity of context.
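A simple way to apply this is to give retrieved chunks a token budget and keep only the best-scoring ones that fit. In the sketch below the relevance scores are assumed to come from the retriever or reranker, and the default whitespace word count is a crude stand-in for the model's real tokenizer.

```python
from typing import Callable


def pack_context(
    chunks: list[tuple[float, str]],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[str]:
    """Greedily keep the highest-scoring chunks that still fit in the token budget."""
    selected: list[str] = []
    used = 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)  # swap in a real tokenizer for accurate budgeting
        if used + cost <= budget_tokens:
            selected.append(text)
            used += cost
    return selected
```

Greedy packing is not optimal in the knapsack sense, but it is predictable and keeps the most relevant material at the top of the prompt.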
| | Learned Parameters | Hyperparameters |
|---|---|---|
| What they are | Values the model discovers via gradient descent | Values set by the engineer before training |
| Updated by | Optimizer (Adam, SGD) | Never — set manually or via search |
| Examples | Attention weight matrices, embedding vectors, layer norm scales | Learning rate, batch size, number of layers, heads |
| Count in GPT-4 | ~1.8 trillion (rumored) | Dozens of key decisions |
Three types of hyperparameters:
Architecture: Number of layers, hidden dimension, number of attention heads, vocabulary size, FFN intermediate size — define the model's capacity.
Training: Learning rate, batch size, number of epochs, warmup steps, weight decay, dropout rate, gradient clipping — control the training dynamics.
Generation: Temperature, top-p, top-k, max tokens, repetition penalty — control the model's output at inference time.
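The generation hyperparameters interact in a fixed order in most samplers: temperature rescales the logits, top-k truncates the candidate list, and top-p keeps the smallest prefix whose cumulative probability reaches the threshold. The sketch below assumes logits arrive as a token-to-score dictionary, which is a simplification of a real vocabulary-sized tensor; the exact ordering of filters varies between libraries.

```python
import math
import random
from typing import Optional


def sample_token(
    logits: dict[str, float],
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: float = 1.0,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one token after temperature scaling, top-k truncation, and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()), key=lambda kv: kv[1], reverse=True)
    # Top-k: keep only the k most probable tokens (at least one).
    if top_k is not None:
        ranked = ranked[: max(top_k, 1)]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    tokens, weights = zip(*kept)
    # random.choices renormalizes the surviving weights.
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Setting `top_k=1` reduces this to greedy decoding, which is a handy sanity check when debugging sampling behavior.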
Responsible deployment: harm mitigation, output filtering, layered defenses, and the operational pitfalls that sink LLM products in the real world.
No single layer is sufficient. Production-grade LLM systems use a layered defense model — multiple independent mechanisms at different stages of the pipeline.
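The layered model can be sketched as independent checks wrapped around the model call. Everything named here is hypothetical: `generate` stands in for the LLM call, each check returns a refusal reason or `None`, and a real system would also log incidents and use trained classifiers rather than simple functions.

```python
from typing import Callable, Optional, Sequence

# A check returns a refusal reason, or None when the text passes.
Check = Callable[[str], Optional[str]]

REFUSAL = "Sorry, I can't help with that."


def guarded_generate(
    user_input: str,
    generate: Callable[[str], str],
    input_checks: Sequence[Check],
    output_checks: Sequence[Check],
) -> str:
    """Run input checks, call the model, then run output checks; refuse on any failure."""
    for check in input_checks:
        if check(user_input) is not None:
            return REFUSAL  # blocked before the model is ever called
    draft = generate(user_input)
    for check in output_checks:
        if check(draft) is not None:
            return REFUSAL  # the model produced something an output layer rejects
    return draft
```

Keeping input and output checks separate means a bypass of one layer still has to get past the other.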
Response strategies for specific failure modes:
| Failure Mode | Strategy |
|---|---|
| Harmful content (violence, CSAM, hate) | Hard refusal at input and output layers; incident logging; no graceful fallback |
| Hallucination | Ground with RAG; add uncertainty language ("I'm not certain, but..."); cite sources |
| Off-topic drift | System prompt constraints; semantic similarity check on output; graceful redirect |
| Prompt injection | Input sanitization; treat user input as untrusted; separate system/user namespaces |
| Jailbreaks | Layered classifiers; behavioral analysis; model-level RLHF; adversarial fine-tuning |
| PII leakage | Regex + NER-based scrubbing at input; output checks for verbatim training-data leakage; data minimization |
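The regex half of the PII row can be sketched as follows. The two patterns are deliberately simple assumptions (a generic email shape and a US-style 3-3-4 phone number); real deployments need broader patterns plus the NER pass mentioned in the table.

```python
import re

# Illustrative patterns only: they will miss many real-world formats.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def scrub_pii(text: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)
```

Running the same scrubber on both the prompt and the model's reply gives two independent chances to catch a leak.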
Most LLM deployments fail not because of model quality, but because of infrastructure, cost, and expectation gaps that weren't anticipated during prototyping.
Quick-reference checklist before launch:
| Area | Pre-launch Check |
|---|---|
| Cost | Estimate spend at 95th-percentile usage; set API spend alerts; enable caching |
| Latency | Load test at 5× expected peak; stream responses where possible |
| Safety | Red-team before launch; output filters live; PII handling documented |
| Reliability | Fallback path tested; circuit breaker for API failures; retry with backoff |
| Observability | All calls logged; dashboards for error rates and latency; user feedback button |
| Eval | Domain-specific eval suite; baseline metrics set; alerts for regression |
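The reliability row (retry with backoff, tested fallback path) can be sketched in a few lines. `primary` and `fallback` are hypothetical zero-argument callables, for example a main model call and a cheaper backup; the broad `except Exception` is a simplification, since real code should retry only on transient errors such as timeouts or rate limits.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try primary with exponential backoff and jitter, then fall back if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return primary()
        except Exception:  # simplification: narrow this to transient errors in practice
            if attempt == max_attempts - 1:
                break
            # Delay doubles on each attempt; jitter spreads out simultaneous retries.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)
    return fallback()
```

Injecting `sleep` makes the backoff testable without real waiting, which helps when the fallback path has to be exercised in CI.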
From tokenization to transformer internals, from fine-tuning strategies to production safety — this handbook is your compact reference for LLM interviews and real-world AI engineering.