Vibe Engines
Visual Handbook · 2025
50 Questions · 9 Domains
Interview Preparation & Reference

50 LLM
Interview
Questions

Detailed answers, visual diagrams & interview tips.
Everything you need to work confidently in AI.

Transformer Architecture Attention Mechanisms Tokenization Training & Optimization LoRA / QLoRA RAG Pipelines Decoding Strategies Evaluation Metrics Production Deployment
LLM Foundations
Q1–6
Architectures & Mechanisms
Q7–12
Transformers & Attention
Q13–18
Decoding & Generation
Q19–24
Training & Optimization
Q25–30
Evaluation & Fine-Tuning
Q31–36
Generative AI Concepts
Q37–42
Multimodal & Scaling
Q43–48
Safety & Production
Q49–50
Saurabh Singh
AI Engineer & Builder.
Contents

What's
Inside

Part 1 — LLM Foundations
Q1–6
01What is an LLM?
02Modern vs classic language models
03Foundation models explained
04GPT-3 vs GPT-4 changes
05Tokenization & why it matters
06Embeddings in LLM systems
Part 2 — Architectures & Mechanisms
Q7–12
07Handling OOV words
08Seq2seq models
09Encoder vs decoder
10Autoregressive vs masked models
11Masked language modeling (MLM)
12Next sentence prediction (NSP)
Part 3 — Transformers & Attention
Q13–18
13Why Transformers outperformed seq2seq
14Positional encodings
15What is attention?
16Multi-head attention
17Computing attention weights
18Why softmax in attention?
Part 4 — Decoding & Generation
Q19–24
19Dot product in self-attention
20Beam search vs greedy decoding
21Temperature in text generation
22Top-k vs top-p sampling
23Adaptive softmax
24Cross-entropy loss
Part 5 — Training & Optimization
Q25–30
25Gradient flow to embeddings
26The Jacobian in backprop
27Chain rule in deep learning
28ReLU activation function
29Vanishing gradients & mitigations
30Eigenvalues in dimensionality reduction
Part 6 — Evaluation & Fine-Tuning
Q31–36
31KL divergence in ML/LLM evaluation
32LoRA and QLoRA
33Catastrophic forgetting
34PEFT reduces forgetting
35Knowledge distillation
36Overfitting in fine-tuning
Part 7 — Generative AI Concepts
Q37–42
37Generative vs discriminative models
38Explaining AI to a PM
39Prompt design & its influence
40Chain-of-thought prompting
41RAG pipeline stages
42Knowledge graphs in RAG
Part 8 — Multimodal & Scaling
Q43–48
43Gemini multimodal improvements
44Mixture-of-Experts (MoE)
45Zero-shot learning
46Few-shot learning in prompts
47Context window & practical limits
48Hyperparameters vs learned parameters
Part 9 — Safety & Production
Q49–50
49Response strategy for harmful content
50Common production pitfalls
Part One

LLM
Foundations

The bedrock. What an LLM actually is, how it differs from what came before, and the invisible mechanics — tokenization, embeddings, foundation models — that shape everything downstream.

Q1 — What is an LLM? Q2 — Modern vs Classic Q3 — Foundation Models Q4 — GPT-3 → GPT-4 Q5 — Tokenization Q6 — Embeddings
Questions 1 – 6
Saurabh Singh
Q01

In simple terms, what is an LLM (large language model)?

An LLM is a deep neural network trained on massive text corpora to predict the next token in a sequence. At its core, it learns statistical patterns of language — grammar, facts, reasoning patterns, and stylistic nuances — by processing billions of text samples.

The "large" refers to both parameter count (often billions to trillions) and training data scale. Modern LLMs like GPT-4, Claude, and Llama use the Transformer architecture and are trained with self-supervised learning: they predict masked or next tokens, building rich internal representations of language.

After pretraining, they're typically fine-tuned with RLHF (Reinforcement Learning from Human Feedback) or similar alignment techniques to follow instructions and be helpful.

LLM Training Pipeline
Raw Text
Terabytes
Pre-training
Next token prediction
RLHF/SFT
Alignment
Deployed
LLM
→ Interview Tip
Emphasize the self-supervised pretraining + alignment pipeline, not just "big neural net." Interviewers want to hear that you understand the two-phase training process.
Q02

How do modern LLMs differ from older, classic language models?

DimensionClassic Models (n-gram, RNN)Modern LLMs (Transformer)
Architecturen-grams, RNNs, LSTMsTransformers with self-attention
ScaleMillions of parametersBillions to trillions of parameters
ContextFixed, short windowThousands to millions of tokens
Emergent abilitiesNone — task-specificIn-context learning, chain-of-thought
TransferSeparate model per taskOne model → many tasks

Classic models were essentially lookup tables — n-gram models computed P(word | previous n-1 words) and that was it. LLMs are learned, generalizable representations that can handle tasks never seen during training.

→ Key Contrast
n-gram models were lookup tables; LLMs are learned, generalizable representations. The key jump was parallelization + scale + emergent abilities.
Q03

What does the term "foundation model" mean?

A foundation model is a large model trained on broad data at scale, designed to be adapted to a wide range of downstream tasks. The term was coined by Stanford's HAI center.

Foundation Model Paradigm
One Large Foundation Model
Trained on diverse text, code, images
Fine-tuning
Prompting
RAG
Embeddings

Key properties: Generality (diverse training data), Adaptability (fine-tune or prompt for specific tasks), and Emergent capabilities (behaviors not explicitly programmed). Examples: GPT-4 (language), CLIP (vision-language), Stable Diffusion (image generation).

This paradigm is both powerful (fewer models to train) and risky — a single model's biases propagate everywhere it's deployed.

→ Mental Model
Think of it as the "operating system" of modern AI — one base, many applications built on top.
Q04

In practice, what changed going from GPT-3 to GPT-4?

CapabilityGPT-3GPT-4
ModalityText onlyText + Images (multimodal)
Context length~4K tokensUp to 128K tokens
ReasoningMediocre on examsTop percentile on Bar, SAT, GRE
Instruction followingOften inconsistentReliable over long outputs
SafetyLimited red-teamingExtensive RLHF + red-teaming

The jump was less about architectural novelty and more about scale, data quality, and alignment. GPT-4 unlocked use cases like document analysis, complex code generation, tutoring, and professional-grade writing that GPT-3 couldn't reliably handle.

→ Interview Tip
Don't just say "GPT-4 is bigger." Lead with multimodality, 32× context expansion, and alignment improvements. These are the practical capability jumps that matter in production.
Q05

What is tokenization and why does it matter for LLM behavior/costs?

Tokenization converts raw text into discrete tokens (subword units) that the model processes. Modern LLMs use subword tokenizers like BPE (Byte-Pair Encoding), WordPiece, or SentencePiece.

BPE Tokenization Example
Input text:
"Unfamiliarization"
Tokens:
Un
familiar
ization
Count:
3 tokens (not 18 characters)

Why it matters for behavior: The tokenizer determines what atomic units the model sees. Poor tokenization of certain languages (e.g., non-Latin scripts) means more tokens for the same content, degrading performance and effective context capacity.

Why it matters for costs: API pricing is per-token. Non-English text or unusual variable names can tokenize 2–5× less efficiently than standard English, directly inflating costs.

Hidden gotcha: Tokenization artifacts explain why LLMs can't reliably count letters in a word — they see subword chunks, not individual characters.

→ Key Insight
Tokenization is the invisible bottleneck — it affects cost, context usage, multilingual performance, and even reasoning tasks like counting characters.
Q06

Why are embeddings useful, and where do they show up in LLM systems?

Embeddings are dense vector representations that map discrete tokens (or sentences, documents) into a continuous vector space where semantic similarity = geometric proximity.

WhereWhat embeddings doExample
Inside LLMsFirst layer converts token IDs to vectors768-dim vectors per token
RAG pipelinesDocuments and queries → vectors in a DBPinecone, Qdrant, pgvector
EvaluationSemantic similarity between output and ground truthBERTScore, cosine sim
ClusteringGroup related documentstopic modeling

Classic example: king - man + woman ≈ queen. Good embeddings capture semantic relationships geometrically. Modern contrastive learning produces embeddings where similar meanings are close and dissimilar ones are far apart.

→ Core Idea
Embeddings are the bridge between discrete text and the continuous math that neural networks operate on. Without them, gradient-based learning wouldn't work on language.
Part Two

Architectures &
Mechanisms

How do models actually work under the hood? From handling unknown words to understanding the encoder-decoder split, this section covers the building blocks before the Transformer era — and what led to it.

Q7 — OOV Handling Q8 — Seq2seq Q9 — Encoder vs Decoder Q10 — Autoregressive vs Masked Q11 — MLM Q12 — NSP
Questions 7 – 12
Saurabh Singh
Q07

How do models deal with words or strings they've never seen before (OOV)?

Modern LLMs effectively eliminated the OOV (out-of-vocabulary) problem through subword tokenization.

MethodHow It WorksUsed By
BPEStarts with characters, iteratively merges most-frequent pairsGPT-2, GPT-3, GPT-4
Byte-level BPEOperates on raw bytes (0–255) — truly universal, handles emojiGPT-2 onward
WordPieceSimilar to BPE but optimizes likelihood of training dataBERT, DistilBERT
SentencePieceLanguage-agnostic, treats input as raw byte streamT5, LLaMA, multilingual models

The tradeoff: rare words get split into more tokens, consuming more context and giving the model less direct "understanding" of them as atomic units. This is why LLMs struggle with very rare proper nouns — they see fragments, not whole words.

→ Key Tradeoff
Subword tokenization trades vocabulary completeness for vocabulary compactness. You get OOV-free coverage but rare terms get fragmented, reducing the model's effective understanding of them.
Q08

What is a seq2seq model and what problem does it solve?

Seq2seq (sequence-to-sequence) maps an input sequence to an output sequence of potentially different length. Originally proposed by Sutskever et al. (2014) using RNNs.

Seq2Seq Architecture
Input tokens
"Hello world"
Encoder
Compresses to context vector
Output tokens
"Bonjour monde"
Decoder
Generates token by token

Problems it solves: Machine translation, summarization, question answering, dialogue — any task where input and output have different lengths and structures.

Key limitation: The fixed-size context vector is an information bottleneck. Long inputs get compressed into the same size vector as short ones, losing information. This is exactly what attention mechanisms (and later Transformers) were designed to fix.

Evolution: Seq2seq + attention → Transformer encoder-decoder (T5) → decoder-only LLMs (GPT). The paradigm still lives in T5, BART, and mBART.

→ Historical Arc
Seq2seq introduced the encoder-decoder paradigm; attention fixed its information bottleneck. Understanding this evolution matters — interviewers often trace the "why" of Transformers back to here.
Q09

Encoder vs decoder: what does each one do?

Encoder vs Decoder — Side by Side
Encoder
token₁
↕ Bidirectional attention
token₂
↕ Sees all tokens
token₃
Sees full sequence at once.
Great for understanding tasks (BERT).
Decoder
token₁
→ Causal (left-to-right)
token₂
→ Can't see future
[NEXT] ?
Generates token by token.
Great for generation tasks (GPT).

Encoder-decoder: Uses both. The encoder builds a representation of the input; the decoder generates the output while cross-attending to the encoder's representations. Examples: T5, BART, original Transformer.

Modern trend: Decoder-only models (GPT, Claude, Llama) dominate because they're simpler to scale and can handle both understanding and generation in one architecture.

→ One-liner
Encoder = understand everything at once. Decoder = generate one token at a time. Modern LLMs are decoder-only because generation at scale is simpler than maintaining separate encoder/decoder.
Q10

Autoregressive vs masked models: what's the core difference?

Autoregressive (GPT)Masked (BERT)
Training objectivePredict next token: P(xₜ | x₁…xₜ₋₁)Predict masked tokens: P(x_masked | x_unmasked)
Attention directionCausal — left context onlyBidirectional — full context
Best forGeneration (text, code)Understanding (classification, NER)
ExamplesGPT-2/3/4, Claude, LLaMABERT, RoBERTa, DistilBERT
Generates naturally?Yes — token by tokenNo — predicts all masks at once

Hybrid approaches: XLNet uses permutation language modeling for bidirectional context with autoregressive generation. T5 uses span corruption as a middle ground. The field converged on autoregressive for large-scale LLMs.

→ Field Direction
Autoregressive = generation-first. Masked = understanding-first. Modern AI converged on autoregressive for frontier LLMs. But BERT-style models are still dominant for embeddings and classification.
Q11

What is masked language modeling and why is it a good pretraining objective?

MLM is BERT's core training objective. During training, ~15% of tokens are randomly selected:

MLM Token Selection
The
[MASK]
sat
on
chair
80% → [MASK]
10% → random word
10% → unchanged

Why it works: Bidirectional context forces the model to use both left and right context, building richer representations. It's fully self-supervised — no labeled data needed.

Limitation: The [MASK] token never appears at inference time (train-test mismatch). Also, only ~15% of tokens provide learning signal per example — less sample-efficient than autoregressive training where every token is a target.

→ Intuition
MLM is like a cloze test at massive scale — fill in the blank forces holistic understanding. The model can't cheat by just looking left.
Q12

What is next sentence prediction (NSP) and when is it used?

NSP was introduced alongside MLM in the original BERT paper. The model is given two sentences and must predict whether sentence B actually follows sentence A in the corpus.

InputLabel
"The dog barked loudly." + "It was a German Shepherd."IsNext ✓
"The dog barked loudly." + "Paris is the capital of France."NotNext ✗

Purpose: Help the model understand inter-sentence relationships, useful for QA and natural language inference.

Why it fell out of favor: RoBERTa showed that NSP doesn't help and can even hurt performance. The task is too easy — the model can distinguish random pairs using topic mismatch alone, without learning deep discourse relationships.

What replaced it: RoBERTa dropped NSP entirely. ALBERT replaced it with Sentence Order Prediction (SOP) — both sentences are from the same document but may be swapped, forcing actual coherence understanding.

→ Key Lesson
NSP was well-intentioned but too easy to game. Harder objectives like SOP worked better. A good interview answer connects this to the general principle: your pretraining objective shapes what the model actually learns.
Part Three

Transformers &
Attention

The mechanism that changed everything. Why Transformers won, how attention actually computes what to focus on, and why multi-head attention is more than just "more is better."

Q13 — Why Transformers Won Q14 — Positional Encodings Q15 — Attention Explained Q16 — Multi-Head Attention Q17 — Computing Attention Scores Q18 — Why Softmax?
Questions 13 – 18
Saurabh Singh
Q13

Why did Transformers outperform older seq2seq architectures?

RNN (Sequential) vs Transformer (Parallel)
RNN / LSTM
h₁ → h₂ → h₃ → h₄
Sequential — can't parallelize
Gradient vanishes over distance
Fixed-size context bottleneck
Transformer
t₁↔t₂
t₁↔t₃
t₂↔t₄
t₃↔t₄
All positions in parallel
O(1) path between any two tokens
Per-token representations maintained

Five reasons Transformers won:

1. Parallelization: RNNs process tokens sequentially — Transformers process all positions in parallel via self-attention, enabling massive GPU utilization.

2. Long-range dependencies: In RNNs, information degrades over distance. In Transformers, self-attention connects every position to every other — path length is O(1) regardless of distance.

3. No information bottleneck: RNN seq2seq compressed the entire input into one fixed-size vector. Transformers maintain per-token representations throughout.

4. Scalability: Parallel nature makes it scale efficiently. This enabled training on orders of magnitude more data.

5. Cleaner gradient flow: Residual connections + layer normalization give much cleaner gradient paths than deep RNNs.

→ Core Insight
Replace recurrence with attention. It's simpler, more parallel, and scales better. The paper title was literally "Attention Is All You Need."
Q14

What are positional encodings and why are they needed?

Self-attention is permutation-invariant — it treats the input as a set, not a sequence. Without positional information, "dog bites man" and "man bites dog" would produce identical representations.

MethodHow It WorksAdvantageUsed In
SinusoidalFixed sin/cos functions of different frequenciesGeneralizes to unseen lengthsOriginal Transformer
LearnedA position embedding per position, trainedSimple to implementGPT, BERT
RoPERotates Q and K vectors by positionCaptures relative positions, extrapolatesLLaMA, Mistral
ALiBiLinear bias on attention scores based on distanceEfficient for long contextsBLOOM

The choice of positional encoding significantly impacts a model's ability to handle long contexts. RoPE has become the dominant choice for modern open-source LLMs because it naturally captures relative positions and supports context length extension via techniques like NTK-aware scaling.

→ Key Point
Positional encodings inject sequence order into an otherwise order-agnostic architecture. RoPE is the modern standard — know why it's preferred over learned embeddings for long-context tasks.
Q15

What is attention, and what does it enable in Transformers?

Attention is a mechanism that lets each token dynamically focus on the most relevant parts of the input when computing its representation. It computes a weighted sum of values, where weights are determined by the compatibility between queries and keys.

Attention: What Each Vector Means
Q
Query
"What am I looking for?"
K
Key
"What information do I contain?"
V
Value
"What do I actually offer if selected?"

Types of attention in Transformers:

Self-attention: Tokens in a sequence attend to each other — every token can interact with every other token.

Cross-attention: Decoder tokens attend to encoder representations — used in encoder-decoder models.

Causal attention: Restricted self-attention where tokens only attend to past positions — used in all decoder-only LLMs.

→ Intuition
Attention = learned, content-dependent routing of information between tokens. Unlike convolutions (fixed window), attention adaptively selects what to focus on based on the content itself.
Q16

What does "multi-head" attention add beyond single-head attention?

Multi-head attention runs several attention operations in parallel, each with its own learned projection matrices, then concatenates and projects the results.

Multi-Head Attention — Parallel Heads
Input Embeddings
Head 1
Syntax?
Head 2
Semantics?
Head 3
Position?
Head N
…?
Concat → Linear Projection
Output

Why multiple heads matter:

One head might focus on syntactic relationships (subject-verb agreement), another on semantic relationships (coreference), and another on positional patterns. A single head can only capture one type of interaction per layer.

Practical details: If model dimension d=512 and we use h=8 heads, each head operates in d/h=64 dimensions. Total compute ≈ same as single-head at d=512, but representation capacity is much richer.

Research finding: Not all heads are equally important — some can be pruned with minimal quality loss, while others are critical for specific capabilities.

→ Key Insight
Multi-head attention lets the model attend to information from different representation subspaces simultaneously. It's "divide and specialize" — each head learns different types of relationships.
Q17

How are attention weights/scores computed at a high level?

Scaled Dot-Product Attention — Step by Step
Q (Queries)
W_Q × Input
K (Keys)
W_K × Input
V (Values)
W_V × Input
Scores = Q · Kᵀ
Dot product similarity
Scale by √d_k
Prevent gradient saturation
Softmax → Weights (sum=1)
Output = Weights · V

The full formula: Attention(Q,K,V) = softmax(QKᵀ / √d_k) · V

Why divide by √d_k? The raw dot product grows with dimension (expected value increases with d_k). Without scaling, softmax would saturate on high-dimensional vectors, producing near-zero gradients and making training unstable.

→ The Formula
Memorize: Attention(Q,K,V) = softmax(QKᵀ / √d_k) · V. Know what each piece does. The √d_k scaling is a common interview gotcha — know why it's there.
Q18

Why does attention use softmax?

Softmax serves three critical purposes in the attention mechanism:

PurposeWhy It Matters
NormalizationConverts raw scores into a probability distribution that sums to 1 — ensures output is a proper weighted average of values
SparsificationAmplifies large differences — high-probability tokens get disproportionately high weights, low ones pushed toward zero — creates focused selection
DifferentiabilityUnlike hard argmax, softmax is smooth everywhere, enabling gradient-based training

Alternatives being explored:

Linear attention: Removes softmax for O(n) complexity instead of O(n²), but loses sparsification benefit. Sigmoid attention: Some recent work replaces softmax with sigmoid for more independent per-head attention. Sparse attention: Uses top-k or local windows to approximate softmax more efficiently (Longformer, BigBird).

→ Core Reason
Softmax turns raw similarity scores into a learnable, differentiable probability distribution over context. The alternatives (linear, sparse) trade the quality of this distribution for computational efficiency at scale.
Part Four

Decoding &
Generation

How does the model turn probability distributions into words? Temperature, top-p, beam search — these are the controls that separate generic output from precisely what you wanted.

Q19 — Dot Product in Attention Q20 — Beam vs Greedy Q21 — Temperature Q22 — Top-k vs Top-p Q23 — Adaptive Softmax Q24 — Cross-Entropy Loss
Questions 19 – 24
Saurabh Singh
Q19

Where does the dot product show up in self-attention and why?

The dot product appears as the similarity function between query and key vectors: score(q, k) = q · k = Σ(qᵢ × kᵢ)

Reason for Dot ProductWhy It Works
Computational efficiencyDot products are just matrix multiplications — extremely fast on GPUs (QKᵀ is a single batched op)
Geometric meaningDot product measures vector alignment. Similar directions (semantically related) → high score. Orthogonal → zero score
SimplicityFewer parameters than additive attention, faster to compute
DifferentiableSmooth gradients for backprop through the similarity computation

The scaling: Raw dot products grow with dimension — expected value ∝ d_k. Without the ÷ √d_k scaling, softmax saturates at high dimensions, producing near-zero gradients. That's why QKᵀ / √d_k is always seen together.

In practice, QKᵀ is a single matrix multiplication computing every query's dot product with every key simultaneously — the core of efficient Transformer inference.

→ GPU Intuition
Dot product = fast, parallelizable similarity that GPUs love. The entire attention mechanism is essentially a sequence of matrix multiplications, which is why modern hardware (NVIDIA, TPUs) can run it so efficiently.
Q20

Beam search vs greedy decoding: how do they differ and when would you use each?

Greedy vs Beam Search (beam width B=2)
Greedy Decoding
[START]
↓ pick max prob
The
↓ pick max prob
cat
Fast, myopic.
May miss global optimum.
Beam Search (B=2)
[START]
The ✓
A ✓
cat
dog
big✗
small✗
Keeps top-B candidates.
Better global sequences.
GreedyBeam Search
SpeedO(V) per stepO(B·V) per step
QualityLocally optimalGlobally better sequences
Use forConversational AI, creative writingTranslation, summarization, ASR
Output styleDiverse, naturalOften generic, repetitive

Modern practice: Most LLM apps use sampling (temperature + top-p) rather than beam search, because beam search tends to produce generic text. Beam search remains important for structured output tasks with a narrow correct answer space.

→ Production Advice
Beam search finds the most probable sequence; sampling finds diverse, natural-sounding text. For chatbots use sampling. For translation/transcription use beam search.
Q21

What does temperature control in text generation?

Temperature is a scalar that modifies logits before softmax: softmax(logits / T)

Temperature Effect on Token Probability Distribution
T = 0.2 (Low — Focused)
the
85%
a
10%
this
3%
an
2%
Deterministic, repetitive
T = 1.5 (High — Creative)
the
32%
a
26%
this
22%
an
20%
Diverse, surprising
TemperatureEffectUse For
T → 0Greedy decoding (argmax)Deterministic outputs
T = 0.0–0.3Focused, consistentFactual QA, code, structured outputs
T = 0.7–1.0Natural, variedConversational AI, writing
T > 1.0Chaotic, unexpectedBrainstorming (rarely used in production)
T → ∞Uniform random samplingNever useful
→ Mental Model
Temperature controls the explore-exploit tradeoff: low = exploit known patterns, high = explore novel combinations. Always pair with top-p in production — temperature alone can still sample very low probability tokens.
Q22

Top-k vs top-p (nucleus) sampling: how are they different?

Both are filtering strategies applied before sampling to avoid drawing from the long tail of low-probability tokens.

Top-k SamplingTop-p (Nucleus) Sampling
How it worksKeep the k highest-probability tokensKeep smallest set where cumulative prob ≥ p
Candidate set sizeFixed at kAdapts dynamically to model confidence
When model is confidentStill keeps k candidatesKeeps fewer candidates (maybe 2–3)
When model is uncertainStill keeps k candidatesKeeps more candidates (maybe 50+)
Common defaultk = 50p = 0.9
PreferenceSimpler but less adaptivePreferred in most production systems

Problem with top-k: If k=50 but 3 tokens have 95% of probability mass, you're still sampling from 47 near-zero-probability tokens — wasting candidate slots on bad choices.

Best practice: Many APIs apply both — top-k to cap maximum candidates, then top-p within that set. Common combination: top-p=0.9, top-k=50.

→ Production Preference
Top-p adapts to model confidence; top-k uses a fixed cutoff. In practice, top-p is preferred. Use p=0.9 for most tasks. Combine with temperature for fine-grained control.
Q23

What is adaptive softmax and how can it improve efficiency?

Adaptive softmax is an approximation technique for the output softmax layer when vocabulary is very large (100K+ tokens).

The problem: Computing softmax over the entire vocabulary requires computing logits for every token — a huge matrix multiplication that is the computational bottleneck during training.

Adaptive Softmax — Zipf-based Partitioning
Head cluster
~500 most frequent tokens
Full-dimension representations
90%+ of tokens land here
Tail cluster 1
~10,000 medium-freq tokens
Reduced-dimension reps
8% of tokens
Tail cluster 2
Remaining rare tokens
Low-dimension reps
2% of tokens

Efficiency gain: Most of the time, the model only needs to compute the full softmax over the small head cluster. Tail clusters are only evaluated when needed, providing 2–10× speedups on the output layer.

Modern relevance: With BPE vocabularies around 32K–100K and faster hardware, adaptive softmax is less critical than before — but the concept of variable-capacity allocation based on frequency remains influential in efficient NLP.

→ Core Idea
Adaptive softmax exploits Zipf's law — a small fraction of tokens are very frequent, the rest rare. Give frequent tokens full capacity, rare tokens reduced capacity. Trade uniform treatment for efficiency.
Q24

What is cross-entropy loss and why is it common in language modeling?

Cross-entropy loss measures the difference between the model's predicted probability distribution and the true distribution: L = -log(P(correct_token))

PropertyWhy It Matters
Direct likelihood optimizationMinimizing cross-entropy ≡ maximizing likelihood of training data. The most principled objective for probabilistic models.
Information-theoretic meaningMeasures average bits needed to encode true data using model's distribution. Lower = better compression = better model.
Relationship to perplexityPerplexity = e^(cross-entropy). Perplexity of 10 = "model as confused as if 10 equally likely tokens."
Gradient behaviorPenalizes confident wrong predictions heavily, moderate gradients for uncertain predictions — encourages calibration naturally.

Perplexity formula: PPL = exp(H(p,q)) — the standard evaluation metric for language models. Lower is better.

→ Intuition
Cross-entropy = "how surprised is the model by the correct answer?" Lower surprise = better model. A model that assigns 100% probability to the correct token has zero cross-entropy loss.
Part Five

Training &
Optimization

The math that makes learning happen. Gradients, backprop, vanishing signals, and why ReLU matters. This section separates engineers who use LLMs from those who understand them.

Q25 — Gradient Flow to Embeddings Q26 — The Jacobian Q27 — Chain Rule Q28 — ReLU Activation Q29 — Vanishing Gradients Q30 — Eigenvalues in Dim Reduction
Questions 25 – 30
Saurabh Singh
Q25

How do gradients flow to/update embedding vectors during training?

Embedding layers are lookup tables (matrices) where each row corresponds to a token. During training:

Embedding Update During Training
Token ID "cat"
Index = 2847
Embedding Matrix
vocab_size × hidden_dim
Row 2847
[0.2, -0.5, ...]
↑ Backward pass: gradient flows only to row 2847
Other rows = zero gradient this step

Sparse gradients: The backward pass computes gradient only for the rows corresponding to tokens that appeared in the current batch. This is why gradients are sparse — only the rows corresponding to tokens that appeared get updated.

Key implications:

Rare tokens learn slowly — fewer gradient updates, less refined embeddings. This is why models struggle with very rare proper nouns.

• The embedding matrix is often the largest single parameter block in the model (vocab_size × hidden_dim).

Weight tying: Many models share the embedding matrix with the output projection (lm_head), which acts as a regularizer and reduces parameter count.

→ Key Insight
Only tokens present in the batch get their embeddings updated. Rare tokens learn slowly because they appear in few training steps. This is a fundamental limitation that tokenization can partially address by breaking rare words into common subwords.
Q26

What is the Jacobian and why does it matter in backprop?

The Jacobian is the matrix of all first-order partial derivatives of a vector-valued function. For f: ℝⁿ → ℝᵐ, the Jacobian J is an m×n matrix where J_ij = ∂fᵢ/∂xⱼ.

Role in BackpropWhat It Means
Gradient transformationGradient through a layer = Jᵀ · upstream_gradient. The Jacobian determines how error signals are transformed at each layer.
Vanishing gradientsIf singular values of J are consistently < 1, gradients shrink exponentially. This causes vanishing gradients in deep networks.
Exploding gradientsIf singular values > 1 consistently, gradients explode. Gradient clipping is the mitigation.
Computational efficiencyBackprop never builds the full m×n Jacobian — it computes Jacobian-vector products (JVPs) in O(n), not O(n²).

Conditioning: Well-conditioned Jacobians (singular values ≈ 1) lead to stable training. Layer normalization and careful initialization aim to maintain this throughout the network.

→ Practical Meaning
The Jacobian controls how gradients transform as they flow backward through each layer. Large Jacobians → exploding gradients. Small Jacobians → vanishing gradients. Residual connections keep the Jacobian close to identity.
Q27

How is the chain rule applied in deep learning training?

The chain rule is the mathematical foundation of backpropagation. For composition f(g(x)), the derivative is f'(g(x)) · g'(x).

Backpropagation — Chain Rule in Action
Input x
Layer 1
f₁(x)
Layer 2
f₂(·)
Loss L
← Backward pass: ∂L/∂x = (∂L/∂f₂) · (∂f₂/∂f₁) · (∂f₁/∂x)

Why backprop is efficient: Computing the gradient for layer 1 in an L-layer network is O(L) multiplications — not O(L!) (which you'd get by naively applying the chain rule). It reuses intermediate results from the forward pass.

Practical concerns: The chain of multiplications can cause vanishing/exploding gradients. Residual connections (x + f(x)) mitigate this by providing a direct gradient highway — the gradient of x + f(x) w.r.t. x is I + ∂f/∂x, so even if ∂f/∂x vanishes, the identity term I preserves the gradient.

→ Core Efficiency
Backprop = chain rule applied systematically from output to input, reusing intermediate results. The key insight is that it's O(L) not O(L!) because of the dynamic programming structure.
Q28

What is ReLU and why is it a popular activation?

ReLU (Rectified Linear Unit): f(x) = max(0, x). Despite its simplicity, it revolutionized deep learning training.

ActivationFormulaGradient IssueUsed In
Sigmoid1/(1+e⁻ˣ)Saturates at both ends → vanishing gradientOld networks, output layers
Tanh(eˣ-e⁻ˣ)/(eˣ+e⁻ˣ)Saturates at ±1 → vanishing gradientRNNs, old networks
ReLUmax(0,x)Gradient = 1 for x>0, dying ReLU for x<0CNNs, ResNets
GELUx·Φ(x)Smooth, no dying neuronsBERT, GPT-2, GPT-3
SwiGLUGated: x·σ(Wx)·(Vx)Best empirical performanceLLaMA, Mistral, GPT-4

Why ReLU was a breakthrough: The gradient is 1 for x > 0, enabling constant gradient flow in the positive region. Unlike sigmoid/tanh which saturate at both extremes, ReLU maintains a constant gradient, enabling much deeper networks.

Why modern LLMs use GELU/SwiGLU: ReLU has a "dying ReLU" problem (neurons stuck at 0) and isn't zero-centered. GELU provides a smooth approximation; SwiGLU adds a gating mechanism that empirically outperforms both.

→ Modern Usage
ReLU solved the vanishing gradient problem for deep networks. GELU/SwiGLU refined it for Transformers. If asked about activations, mention SwiGLU as what LLaMA/GPT-4 use and explain the gating mechanism briefly.
Q29

What causes vanishing gradients, and how do Transformers mitigate it?

Root cause: During backpropagation, gradients are multiplied by the Jacobian at each layer. If these Jacobians have spectral norms consistently < 1 (from saturating activations), the gradient shrinks exponentially with depth.

Transformer Mitigations for Vanishing Gradients
Residual Connections
x + f(x)
Direct gradient highway. Even if f'(x) → 0, the identity term I preserves gradient.
Layer Normalization
Pre-LN or Post-LN
Prevents activations drifting into saturation regions. Stabilizes gradient magnitudes.
GELU / SwiGLU
Non-saturating
Smooth activations that don't saturate at extremes. No "dying neuron" problem.
Xavier/He Init
Careful initialization
Preserves variance of activations/gradients across layers at initialization.
→ Most Important Mitigation
Residual connections are the single most important mitigation — they create gradient highways. The gradient of x + f(x) w.r.t. x is always at least I regardless of what f does, making very deep Transformers trainable.
Q30

In dimensionality reduction, what do eigenvalues/eigenvectors represent?

In PCA (Principal Component Analysis) and spectral methods, eigenvalues and eigenvectors provide the geometric decomposition of the data's variance structure.

ConceptWhat It RepresentsHow It's Used
EigenvectorsPrincipal directions (axes) of maximum variance in the dataDefine the new coordinate system for PCA projection
EigenvaluesMagnitude of variance along each eigenvectorDetermine how much information each direction captures
Top-k eigenvectorsThe most informative directionsKeep these to reduce dimensions while preserving structure
Eigenvalue ratioλ_k / Σλᵢ = fraction of variance retainedDecide how many dimensions to keep

Connection to LLMs:

LoRA: Exploits the fact that weight update matrices have low intrinsic rank — SVD decomposition finds the eigenvectors of the update subspace

Embedding visualization: PCA projects 768-dim BERT embeddings to 2D for visualization

Attention analysis: Eigendecomposition of attention matrices reveals what the model focuses on

→ Intuition
Eigenvectors = directions that matter. Eigenvalues = how much they matter. Keep the big ones, discard the rest. This is precisely why LoRA works — weight updates during fine-tuning live in a low-eigenvalue subspace.
Part Six

Evaluation &
Fine-Tuning

How do you make a model better without breaking what it already knows? LoRA, QLoRA, KL divergence, catastrophic forgetting, and knowledge distillation — the practical toolkit for customizing LLMs.

Q31 — KL Divergence Q32 — LoRA & QLoRA Q33 — Catastrophic Forgetting Q34 — PEFT vs Full Fine-Tune Q35 — Knowledge Distillation Q36 — Overfitting in Fine-Tuning
Questions 31 – 36
Saurabh Singh
Q31

What is KL divergence and how is it used in ML/LLM evaluation?

KL divergence (Kullback-Leibler divergence) measures how one probability distribution P differs from a reference distribution Q:

KL(P||Q) = Σ P(x) · log(P(x)/Q(x))

PropertyWhat It Means
Always ≥ 0Equals 0 only when P = Q exactly
AsymmetricKL(P||Q) ≠ KL(Q||P). Matters which distribution is the reference.
Relation to cross-entropyKL(P||Q) = H(P,Q) − H(P). Cross-entropy minus entropy.

Key uses in LLMs:

RLHF constraint: During RLHF training, a KL penalty prevents the fine-tuned model from diverging too far from the base model. This preserves fluency while improving alignment. Without it, the model would exploit the reward signal and produce gibberish that scores high.

Knowledge distillation: Student model is trained to minimize KL divergence between its output distribution and the teacher's soft outputs.

VAE training: KL divergence regularizes the latent space toward a prior (Gaussian) distribution.

→ Information-Theoretic Meaning
KL divergence = "how many extra bits do I need because I'm using Q instead of P?" Used everywhere in LLM alignment. The RLHF KL penalty is the key reason models stay coherent during alignment training.
Q32

What are LoRA and QLoRA, and why do people use them?

LoRA — Low-Rank Adaptation of Weight Matrices
W
d × d
Frozen ❄️
+
B
d × r
Trainable
rank r << d
A
r × d
Trainable
=
W + BA
Effective weight
during forward pass
LoRAQLoRA
Base weightsFrozen in full precisionQuantized to 4-bit NF4
Adapter precisionfloat16/bf16bfloat16
Memory savings10–100× fewer trainable paramsAlso reduces base model memory 4–8×
Use caseMulti-GPU fine-tuningSingle GPU fine-tuning (48GB for 65B model)
Inference overheadNone — BA merged into WNone — same merge trick

Why it works: Weight updates during fine-tuning have low intrinsic rank — most of the change concentrates in a small subspace. LoRA parameterizes the update as a low-rank product, capturing this efficiently.

→ Production Summary
LoRA = efficient fine-tuning via low-rank updates. QLoRA = LoRA + 4-bit quantization for consumer GPUs. Key innovation in QLoRA: double quantization + paged optimizers to handle memory spikes. Hot-swapping LoRA adapters on a shared base model is a major production pattern.
Q33

What is catastrophic forgetting and what are common mitigations?

Catastrophic forgetting occurs when a neural network trained on new data loses its ability to perform well on previously learned tasks. New gradients overwrite the weights that encoded old knowledge.

Why it's especially problematic for LLMs: Fine-tuning on a narrow domain can destroy the model's general capabilities. A model fine-tuned heavily on legal text might lose its ability to write code or do math.

Mitigation StrategyHow It WorksExample
PEFT (LoRA)Freeze base weights — can't forget what you don't updateDefault approach today
EWCRegularization penalty for changing weights important to prior tasks (via Fisher information)Elastic Weight Consolidation
ReplayMix samples from original training distribution during fine-tuningExperience replay
Small LR + Layer FreezingFreeze early layers; very small LR on later layersCommon in transfer learning
Multi-task trainingTrain on new and old tasks simultaneouslyInstruction-tuned models
→ Key Insight
Catastrophic forgetting is why you LoRA instead of full fine-tune — frozen weights can't be forgotten. This is a fundamental property, not just an efficiency trick.
Q34

How does PEFT reduce forgetting compared to full fine-tuning?

PEFT (Parameter-Efficient Fine-Tuning) is a family of methods that update only a small subset of model parameters while keeping most pretrained weights frozen.

PEFT MethodWhat It TrainsMemory
LoRA / QLoRALow-rank decomposition of weight updates (A and B matrices)Very low
Prefix TuningLearnable prefix tokens prepended to each layerLow
Prompt TuningLearnable soft prompts at the input level onlyMinimal
AdaptersSmall bottleneck layers inserted between Transformer layersLow
IA³Learned rescaling vectors for keys, values, and FFN activationsMinimal

Why PEFT prevents forgetting by construction:

• Frozen base = preserved knowledge (mathematically — frozen weights can't change)

• Small parameter budget = optimization landscape constrained — model can't drift far from original

• Composability: train separate PEFT modules for different tasks, swap at inference. Base model stays pristine.

Tradeoff: Slightly lower peak performance on the target task vs full fine-tuning, but preservation of general capabilities makes this worthwhile in practice.

→ Core Property
PEFT preserves pretraining by construction — frozen weights are mathematically incapable of forgetting. The new capability is added in the adapter; the old capability lives in the unchanged base.
Q35

What is knowledge distillation and how is it applied to LLMs?

Knowledge distillation transfers knowledge from a large "teacher" model to a smaller "student" model by training the student to mimic the teacher's output distribution, not just match hard labels.

Knowledge Distillation Pipeline
Teacher
GPT-4 (70B)
Soft labels
[0.7, 0.2, 0.1, ...]
+
Student
7B Model
Minimize KL(teacher || student)

Why soft labels work better than hard labels: The teacher's probability distribution encodes which wrong answers are "almost right" — richer signal than a one-hot label. This dark knowledge helps the student learn the relationships between classes.

Applications in LLMs: Model compression (distilling GPT-4 quality into smaller models), synthetic data generation (large model generates training data for smaller), speculative decoding (small draft model verified by large model).

Limitation: Capacity gap — a 7B model can't fully absorb a 70B model's knowledge. Distillation also can't transfer emergent abilities that require scale.

→ Core Idea
Distillation compresses teacher knowledge into a smaller model via soft probability distributions. The "dark knowledge" in the teacher's output distribution is what makes distillation more effective than just training on hard labels.
Q36

What is overfitting in this context, and how do you reduce it?

In LLM fine-tuning, overfitting means the model memorizes the fine-tuning data rather than learning generalizable patterns.

Symptoms: Training loss keeps decreasing but validation loss plateaus or increases; the model generates verbatim training examples.

StrategyHow It HelpsPriority
Data diversityMore varied high-quality data is the single most effective defense⭐ Highest
PEFT / LoRAConstrains update to low-rank subspace — limits memorization capacity⭐ High
Early stoppingMonitor validation loss; stop when it starts increasing⭐ High
Small learning ratePrevents aggressive weight updates that memorize individual examplesMedium
DropoutRandomly zero activations during training — reduces co-adaptationMedium
Weight decay (L2)Regularization on parameter magnitudesLow

Evaluation strategy: For LLMs, you need task-specific evaluation beyond just loss — measure actual generation quality on held-out examples. Perplexity on training set vs validation set is a quick diagnostic.

→ Practical Advice
Overfitting in fine-tuning = memorization. Fight it with data diversity first, PEFT second, early stopping third. If you're using LoRA with a reasonable rank (r=8 or r=16), you already have significant protection against overfitting.
Part Seven

Generative AI
Concepts

From discriminative vs generative to RAG pipelines and chain-of-thought. The concepts every AI practitioner — technical or not — needs to articulate clearly and correctly.

Q37 — Generative vs Discriminative Q38 — Explaining AI to a PM Q39 — Prompt Design Q40 — Chain-of-Thought Q41 — RAG Pipeline Stages Q42 — Knowledge Graphs in RAG
Questions 37 – 42
Saurabh Singh
Q37

Generative vs discriminative models: what's the distinction?

Discriminative ModelsGenerative Models
What they learnDecision boundary P(y|x)Data distribution P(x) or joint P(x,y)
GoalGiven x, predict yGenerate new samples from distribution
ExamplesLogistic regression, BERT classification, SVMsGPT, VAEs, GANs, diffusion models
Best atClassification, regressionGeneration, density estimation
Data requirementsLess — only needs decision boundaryMore — needs to model full distribution

Modern blur: LLMs like GPT are generative models that perform discriminative tasks (classification, NLI) by generating the answer. "Is this review positive or negative?" → GPT generates "Positive." This generative approach to discriminative tasks has been surprisingly effective, blurring the traditional boundary.

The Modern Blur
Discriminative
P(y|x)
LLMs do both →
Generative LLM
Generates answers
← can classify too
Generative
P(x)
→ Modern View
Discriminative = learn boundaries. Generative = learn distributions. LLMs showed generative models can do both. The field is moving toward unified generative models that handle all tasks through generation.
Q38

Discriminative AI vs generative AI: how would you explain it to a PM?

Two Types of AI — PM Mental Model
Discriminative AI
🗂️ Sorting Machine
Answers: "Is this spam?"
"Will this customer churn?"

Takes input, assigns label or score. Mature, predictable, cheaper to run.
📌 Recommendations · Fraud detection · Search ranking · Content moderation
Generative AI
🎨 Creation Machine
Creates: text, images, code, music
that didn't exist before.

More expensive, needs guardrails, enables new product categories.
📌 AI assistants · Code copilots · Content tools · Conversational interfaces

Business model difference: Discriminative AI tends to save costs (automating classification). Generative AI tends to create new value (enabling new capabilities). Best products often combine both: generate content (generative) then filter/rank it (discriminative).

→ PM-Friendly Summary
Discriminative = sort and decide. Generative = create and produce. Best products combine both. This framing also works for technical conversations about system design — use discriminative classifiers as safety guardrails on top of generative outputs.
Q39

How does prompt design influence LLM outputs?

Prompt design fundamentally determines what part of the model's learned distribution you're sampling from. Small changes can dramatically shift output quality.

TechniqueHow It WorksExample
SpecificityVague prompts get average outputs"Write about AI" vs "Write a 500-word technical post comparing RAG and fine-tuning, targeting ML engineers"
Role/PersonaActivates different knowledge patterns"You are a senior ML engineer" vs bare question
Output formatRequesting structure helps organization"Respond in JSON with keys: summary, pros, cons"
Few-shot examplesOften more effective than lengthy instructionsProvide 2–3 input→output pairs before the query
DecompositionBreaking complex tasks into stepsChain-of-thought: "Think step by step"

System-level prompt engineering: In production, you design system prompts that set behavioral constraints, inject context, and define output schemas. This is where prompt engineering becomes system design.

→ Mental Model
Prompt engineering is really about navigating the model's probability space to find the output region you want. You're not programming — you're steering a probability distribution through carefully chosen context.
Q40

What is chain-of-thought prompting, and when is it helpful?

Chain-of-thought (CoT) prompting encourages the model to show its reasoning steps before giving a final answer.

Standard vs Chain-of-Thought
Standard Prompting
Q: If a train travels 60mph for 2.5 hours, how far does it go?

A: 150 miles
Single jump to answer.
Error-prone on complex tasks.
Chain-of-Thought
Q: Same question. Think step by step.

A: Speed = 60mph. Time = 2.5h.
Distance = 60 × 2.5 = 150 miles
Each step constrains the next.
Catches errors early.
Helpful WhenNot Helpful When
Math and arithmetic problemsSimple factual recall ("Capital of France?")
Multi-step logical reasoningTasks where model is already highly confident
Complex analysis with many factorsSpeed-critical applications (CoT adds tokens)
Planning and decision-makingShort-answer classification tasks

Variants: Tree of Thought (explores multiple reasoning paths), Self-Consistency (sample multiple CoTs, take majority vote), Zero-shot CoT (just add "Let's think step by step").

→ Why It Works
CoT works because autoregressive generation is the model's only form of computation — more tokens = more thinking. Each generated step becomes context that constrains subsequent steps, effectively expanding the model's working memory.
Q41

What are the main stages of a RAG pipeline?

RAG (Retrieval-Augmented Generation) augments LLM generation with retrieved external knowledge.

RAG Pipeline — 4 Stages
📄
Stage 1
Indexing
Chunk · Embed · Store in vector DB
🔍
Stage 2
Retrieval
ANN search · Top-k chunks
⚖️
Stage 3
Reranking
Cross-encoder · Refine relevance
✍️
Stage 4
Generation
Context + Query → LLM → Answer
StageKey DecisionsCommon Tools
IndexingChunk size (too small = no context; too large = dilutes relevance)LangChain, LlamaIndex
RetrievalEmbedding model choice, top-k value, hybrid (dense+sparse)Pinecone, Qdrant, pgvector
RerankingCross-encoder model, reranking thresholdCohere Rerank, BGE Reranker
GenerationContext window budget, citation format, hallucination mitigationAny LLM API

Evaluation metrics: Retrieval quality (Precision@k, recall, MRR), Generation quality (faithfulness, groundedness), End-to-end (user satisfaction).

→ Production Reality
RAG = your LLM's external memory. The retrieval quality is usually the bottleneck, not the generation. Invest heavily in chunking strategy, embedding model selection, and reranking before optimizing the LLM step.
Q42

How can a knowledge graph improve retrieval + generation?

Knowledge graphs (KGs) store information as structured triples (entity → relationship → entity) and can significantly enhance RAG pipelines beyond what vector search alone provides.

Vector Search vs Knowledge Graph
Vector Search
Finds similar documents.
Good for: "Tell me about metformin"

❌ Can't answer: "What drugs interact with metformin?" — requires traversing a relationship graph, not finding similar text.
Knowledge Graph
Traverses relationships.
Supports multi-hop:
A → B → C → answer

✅ "Who is CEO of the company that acquired Instagram's maker?" → traverse 3 edges.
BenefitHow KG Adds It
Structured relationshipsTraverse edges (not just find similar text) — critical for relational queries
Multi-hop reasoningNaturally chains A→B→C queries that confuse vector search
Entity disambiguationResolves "Apple" (company vs fruit) via entity types
Hallucination reductionStructured facts ground generation in verified triples
Completeness signalsCan detect when information is missing → "I don't know" vs hallucinate

GraphRAG: Microsoft's approach combines vector retrieval for broad context with graph traversal for precise, structured facts. LLMs are used to build a community-based KG from documents, then queries it with graph algorithms + LLM generation.

→ When to Use KG
KGs add structure where vector search only finds similarity — critical for relational reasoning. Best in domains with well-defined entity relationships: medical, legal, enterprise. Expensive to build and maintain — don't default to KG unless vector search demonstrably fails.
Part Eight

Multimodal &
Scaling

What happens at the frontier. Mixture-of-Experts, multimodal models, zero-shot and few-shot learning, context windows — and the practical limits of what scale actually buys you.

Q43 — Multimodal Models (Gemini) Q44 — Mixture-of-Experts Q45 — Zero-Shot Learning Q46 — Few-Shot Learning Q47 — Context Windows Q48 — Hyperparameters
Questions 43 – 48
Saurabh Singh
Q43

How do multimodal models like Gemini improve stability/efficiency compared to prior approaches?

Bolted-On vs Native Multimodal
Old Approach (Bolted-On)
Frozen Vision Encoder (CLIP)
↓ cross-attention adapter
Language Model (GPT)
Limited cross-modal understanding.
Distribution mismatch.
Native Multimodal (Gemini)
Text
Image
Audio
↓ unified Transformer
Joint Representation
Deep cross-modal attention from layer 1.
Shared representations.
ImprovementHow Native Multimodal Achieves It
Training stabilityJoint pretraining prevents distribution mismatch between independently-trained vision/language models
Cross-modal understandingCross-modal attention from earliest layers, not just late-stage adapters
Parameter efficiencySingle unified model vs maintaining separate vision + language models
Interleaved contentHandles documents with mixed text+images naturally (slides, charts, figures)
→ Key Principle
Native multimodal > bolted-on multimodal because joint training enables deeper cross-modal understanding from the ground up. The "stitching" approach (CLIP + GPT) is fundamentally limited by the shallow integration point.
Q44

How does Mixture-of-Experts (MoE) help scale models?

MoE replaces the dense feedforward network (FFN) in each Transformer layer with multiple "expert" FFNs and a learned router that activates only a subset per token.

Mixture-of-Experts — Token Routing
Input Token
Gating Network (Router)
Selects top-2 of N experts
↓ routes to →
Expert 1 ✓
Active
Expert 2
Idle
Expert 3 ✓
Active
Expert 4
Idle
Expert N
Idle
↓ weighted combination
Output (only 2/N experts computed)
BenefitDetail
Parameter efficiency1.8T parameter MoE model activates only ~280B per token — near-dense performance at fraction of compute
SpecializationDifferent experts can specialize in different domains, learning more efficiently
Training costScales with active parameters, not total. Large knowledge capacity for same training FLOPS

Challenges: Load balancing (ensuring all experts get used — auxiliary loss required), communication overhead in distributed training, expert collapse (all tokens routing to same experts), higher memory (all experts must fit in RAM).

Examples: Mixtral 8x7B, Switch Transformer, Grok (xAI), rumored GPT-4.

→ Core Idea
MoE = scale parameters without scaling compute. It's how you build giant models that run fast. The tradeoff: all experts need to fit in memory even when idle, so memory scales with total params but compute scales with active params.
Q45

What does "zero-shot" mean for LLMs?

Zero-shot means the model performs a task without any task-specific training examples in the prompt. It relies entirely on pretraining knowledge and instruction-following ability.

Zero-Shot vs Few-Shot vs Fine-Tuned Spectrum
ZERO-SHOT
No examples
Just the task
Tests: "Does it just know?"
FEW-SHOT
1–5 examples
in the prompt
Quick prompt tutorial
FINE-TUNED
1000s of examples
weights updated
Maximum accuracy

Why LLMs can do zero-shot: During pretraining on diverse text, the model encounters millions of implicit task demonstrations. It learns the pattern of instructions and responses. Instruction tuning (fine-tuning on instruction-following data like InstructGPT) dramatically improves zero-shot performance.

Limitations: Zero-shot performance varies wildly. Simple classification and translation work well; complex structured output or domain-specific tasks often need examples or fine-tuning.

→ Simple Summary
Zero-shot tests if the model "just knows" how to do it. Few-shot adds a quick tutorial in the prompt. Fine-tuned uses thousands of examples to update weights. Choose based on available data, required accuracy, and cost constraints.
Q46

What does "few-shot" learning look like in prompts and why does it work?

Few-shot learning provides a small number of input-output examples directly in the prompt before the actual query.

Few-Shot Prompt Structure
# Examples (few-shot)
Input: "The movie was fantastic" → Sentiment: Positive
Input: "Terrible waste of time" → Sentiment: Negative
Input: "I loved every minute" → Sentiment: Positive

# Actual query
Input: "Mediocre at best" → Sentiment: ???
Why Few-Shot WorksMechanism
In-context learning (ICL)Examples create a local pattern the model's autoregressive generation follows — "copies the format" but generalizes
Task specificationExamples implicitly define the task more precisely than instructions alone — shows format, style, expected output
Distribution anchoringShifts the model's output distribution toward the desired pattern without any weight updates

Best practices: Use diverse, representative examples. Order can matter (recency bias — later examples weigh more). More examples help up to a point, then returns diminish. For structured tasks, consistent formatting is critical.

→ Why It's Powerful
Few-shot works because Transformers can implicitly learn from examples during a single forward pass. No weight updates — just context. Research suggests ICL partly works through implicit gradient descent in the forward pass (the Transformer "simulates" learning).
Q47

What is a context window and what are the practical limits?

The context window is the maximum number of tokens a model can process in a single forward pass — the model's "working memory."

ModelContext Window≈ Pages of Text
GPT-4o128K tokens~300 pages
Claude 3.5 (Sonnet)200K tokens~500 pages
Gemini 1.5 Pro2M tokens~5,000 pages
LLaMA 3 (open source)128K tokens~300 pages
"Lost in the Middle" — Attention Quality vs Position
Beginning
High recall ✓
Middle
Degraded recall ⚠️
End
High recall ✓
Context window position → Put critical info at start or end

Practical limits beyond raw size: Self-attention is O(n²) — doubling context quadruples compute. Effective context < maximum context (needle-in-a-haystack tests show degraded recall as context grows). Latency scales with context length.

Best practice: Use RAG to retrieve and inject only relevant context rather than stuffing the entire window. Quality of context >> quantity of context.

→ Production Principle
Context windows are growing fast, but effective utilization hasn't kept pace. RAG beats brute-force long context for most use cases. Always put the most important information at the beginning or end of the context, not the middle.
Q48

What is a hyperparameter (vs a learned parameter)?

Learned ParametersHyperparameters
What they areValues the model discovers via gradient descentValues set by the engineer before training
Updated byOptimizer (Adam, SGD)Never — set manually or via search
ExamplesAttention weight matrices, embedding vectors, layer norm scalesLearning rate, batch size, number of layers, heads
Count in GPT-4~1.8 trillion (rumored)Dozens of key decisions

Three types of hyperparameters:

Architecture: Number of layers, hidden dimension, number of attention heads, vocabulary size, FFN intermediate size — define the model's capacity.

Training: Learning rate, batch size, number of epochs, warmup steps, weight decay, dropout rate, gradient clipping — control the training dynamics.

Generation: Temperature, top-p, top-k, max tokens, repetition penalty — control the model's output at inference time.

→ Key Insight
A 70B parameter model has 70B learned parameters, but the hyperparameters that shaped it (learning rate schedule, architecture choices, data mix) are just as important to its quality. Learned parameters = what the model knows. Hyperparameters = the decisions you make about how it learns.
Part IX

Safety & Production

Questions 49 – 50

Responsible deployment: harm mitigation, output filtering, layered defenses, and the operational pitfalls that sink LLM products in the real world.

Output Safety
Guardrails
Production Pitfalls
Monitoring
Human-in-the-Loop
Hallucination Control
Q49

If the model outputs harmful or wrong content, what response strategy do you use?

No single layer is sufficient. Production-grade LLM systems use a layered defense model — multiple independent mechanisms at different stages of the pipeline.

Layered Defense Architecture
LAYER 1
Input Filtering
Reject or transform harmful, off-topic, or adversarial inputs before they reach the model
Prompt injection
Jailbreak detection
PII scrubbing
LAYER 2
System Prompt & Context
Behavioral constraints via carefully engineered system prompts that define scope, tone, and hard refusals
Role definition
Topic boundaries
Safety instructions
LAYER 3
Output Filtering
Post-generation classifiers scan outputs for toxicity, hallucinations, and policy violations before delivery
Toxicity classifier
Format validator
Sensitive data check
LAYER 4
Factual Verification
Ground claims against retrieved evidence (RAG) or run a separate verifier model to detect hallucinations
Source attribution
Confidence scoring
Claim verification
LAYER 5
Human-in-the-Loop
High-stakes decisions (medical, legal, financial) route to human reviewers before acting on model output
Review queues
Confidence threshold
Escalation policy
LAYER 6
Monitoring & Feedback Loop
Log all inputs, outputs, and user feedback. Detect drift, emerging attack patterns, and failure modes at scale
Logging pipeline
A/B testing
Red-teaming

Response strategies for specific failure modes:

Failure ModeStrategy
Harmful content (violence, CSAM, hate)Hard refusal at input and output layers; incident logging; no graceful fallback
HallucinationGround with RAG; add uncertainty language ("I'm not certain, but..."); cite sources
Off-topic driftSystem prompt constraints; semantic similarity check on output; graceful redirect
Prompt injectionInput sanitization; treat user input as untrusted; separate system/user namespaces
JailbreaksLayered classifiers; behavioral analysis; model-level RLHF; adversarial fine-tuning
PII leakageRegex + NER-based scrubbing at input; output diff against training data; data minimization
→ Core Principle
Safety is not a feature — it's a system property. No single filter is reliable enough alone. Build defense-in-depth: assume each layer will occasionally fail, and the combination keeps the system acceptable. Document every layer and its failure rate so you can reason about system-level risk.
Q50

What are the most common pitfalls when deploying LLMs in production?

Most LLM deployments fail not because of model quality, but because of infrastructure, cost, and expectation gaps that weren't anticipated during prototyping.

8 Production Pitfalls
01 · Latency Underestimation
Demos run locally with a single request. Production has concurrent users, cold starts, and token queuing. P99 latency can be 10× the median. Measure tail latency, not averages.
02 · Cost Explosion
Token costs compound fast. A single GPT-4 call that works in demos becomes $10k/month at scale. Model many user journeys, cache aggressively, and right-size — use small models for classification steps.
03 · Prompt Brittleness
Prompts that worked in dev break when user inputs vary. Build a prompt regression suite. Version control your prompts. Test against adversarial inputs before launch.
04 · No Observability
Can't debug what you can't see. Log all inputs, outputs, token counts, and latencies. Tag sessions with user intent labels. Without this, you're flying blind when the model starts misbehaving.
05 · Model Version Drift
API providers silently update model versions. A prompt tuned for gpt-4-0613 may produce different outputs on gpt-4-1106. Pin model versions in production; test before upgrading.
06 · Hallucination at Scale
A 1% hallucination rate sounds fine — until you have 10,000 users/day and 100 confident wrong answers daily. Build domain-specific evals, add retrieval grounding, and set user expectations early.
07 · Context Window Overflow
Conversations grow long. Naive concatenation hits the context limit and truncates — usually at the most important part. Use sliding window, summarization, or a dedicated memory module for long sessions.
08 · No Fallback Strategy
API outages happen. Rate limits hit. What does your product do when the LLM is unavailable? Design graceful degradation: cached answers, simplified rules-based fallback, or a clear "temporarily unavailable" state.

Quick-reference checklist before launch:

AreaPre-launch Check
CostModel 95th-percentile usage; set API spend alerts; enable caching
LatencyLoad test at 5× expected peak; stream responses where possible
SafetyRed-team before launch; output filters live; PII handling documented
ReliabilityFallback path tested; circuit breaker for API failures; retry with backoff
ObservabilityAll calls logged; dashboards for error rates and latency; user feedback button
EvalDomain-specific eval suite; baseline metrics set; alerts for regression
→ Final Word
The gap between "it works in a notebook" and "it works in production at scale" is enormous for LLMs. The model is often the easiest part — the hard parts are the surrounding system: cost management, latency control, safety monitoring, and graceful degradation when things go wrong. Design the system, not just the prompt.
Complete

All 50 Questions.
Covered.

From tokenization to transformer internals, from fine-tuning strategies to production safety — this handbook is your compact reference for LLM interviews and real-world AI engineering.

50
Questions
9
Topic Areas
10+
Visual Diagrams
Saurabh Singh
AI Engineer & Builder
linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7