What does an agentic AI / LLM engineering interview actually test?

Beyond model knowledge, these interviews probe how you reason about building reliable systems on top of LLMs: prompt and context design, retrieval-augmented generation, evaluation pipelines, agent loops and tool use, latency and cost trade-offs, and how you handle non-determinism and failure in production. The bar is whether you can design and operate a generative system, not just call an API.

How do you evaluate an LLM or agentic system?

With a layered approach: a golden dataset with reference answers for regression testing, automated metrics or LLM-as-judge scoring for open-ended outputs, task-level success metrics for agents (did it complete the goal, with how many steps and tool calls), and online monitoring of real traffic. The key interview signal is treating evals as a first-class, version-controlled part of the system rather than a one-off check.

What is the difference between RAG and fine-tuning?

RAG (retrieval-augmented generation) injects relevant documents into the prompt at query time, so knowledge stays fresh and auditable without retraining. Fine-tuning bakes behaviour or style into the model weights. Use RAG for changing or factual knowledge and citations; use fine-tuning for consistent format, tone, or task-specific behaviour. Many production systems combine both.

Is this handbook free?

Yes. The Agentic AI Interview Handbook is completely free to read, with no sign-up required, covering 50 LLM and agentic AI interview questions with worked reasoning.

Visual Handbook · 2025

50 Questions · 9 Domains

Interview Preparation & Reference

50 LLM
Interview
Questions

Detailed answers, visual diagrams & interview tips.
Everything you need to work confidently in AI.

Transformer Architecture Attention Mechanisms Tokenization Training & Optimization LoRA / QLoRA RAG Pipelines Decoding Strategies Evaluation Metrics Production Deployment

LLM Foundations

Q1–6

Architectures & Mechanisms

Q7–12

Transformers & Attention

Q13–18

Decoding & Generation

Q19–24

Training & Optimization

Q25–30

Evaluation & Fine-Tuning

Q31–36

Generative AI Concepts

Q37–42

Multimodal & Scaling

Q43–48

Safety & Production

Q49–50

Saurabh Singh

AI Engineer & Builder.

LinkedIn Medium GitHub

Contents

What's
Inside

Part 1 — LLM Foundations

Q1–6

01What is an LLM?

02Modern vs classic language models

03Foundation models explained

04GPT-3 vs GPT-4 changes

05Tokenization & why it matters

06Embeddings in LLM systems

Part 2 — Architectures & Mechanisms

Q7–12

07Handling OOV words

08Seq2seq models

09Encoder vs decoder

10Autoregressive vs masked models

11Masked language modeling (MLM)

12Next sentence prediction (NSP)

Part 3 — Transformers & Attention

Q13–18

13Why Transformers outperformed seq2seq

14Positional encodings

15What is attention?

16Multi-head attention

17Computing attention weights

18Why softmax in attention?

Part 4 — Decoding & Generation

Q19–24

19Dot product in self-attention

20Beam search vs greedy decoding

21Temperature in text generation

22Top-k vs top-p sampling

23Adaptive softmax

24Cross-entropy loss

Part 5 — Training & Optimization

Q25–30

25Gradient flow to embeddings

26The Jacobian in backprop

27Chain rule in deep learning

28ReLU activation function

29Vanishing gradients & mitigations

30Eigenvalues in dimensionality reduction

Part 6 — Evaluation & Fine-Tuning

Q31–36

31KL divergence in ML/LLM evaluation

32LoRA and QLoRA

33Catastrophic forgetting

34PEFT reduces forgetting

35Knowledge distillation

36Overfitting in fine-tuning

Part 7 — Generative AI Concepts

Q37–42

37Generative vs discriminative models

38Explaining AI to a PM

39Prompt design & its influence

40Chain-of-thought prompting

41RAG pipeline stages

42Knowledge graphs in RAG

Part 8 — Multimodal & Scaling

Q43–48

43Gemini multimodal improvements

44Mixture-of-Experts (MoE)

45Zero-shot learning

46Few-shot learning in prompts

47Context window & practical limits

48Hyperparameters vs learned parameters

Part 9 — Safety & Production

Q49–50

49Response strategy for harmful content

50Common production pitfalls

Part One

LLM
Foundations

The bedrock. What an LLM actually is, how it differs from what came before, and the invisible mechanics — tokenization, embeddings, foundation models — that shape everything downstream.

Q1 — What is an LLM? Q2 — Modern vs Classic Q3 — Foundation Models Q4 — GPT-3 → GPT-4 Q5 — Tokenization Q6 — Embeddings

Questions 1 – 6

Saurabh Singh

LinkedIn Medium GitHub

Q01

In simple terms, what is an LLM (large language model)?

An LLM is a deep neural network trained on massive text corpora to predict the next token in a sequence. At its core, it learns statistical patterns of language — grammar, facts, reasoning patterns, and stylistic nuances — by processing billions of text samples.

The "large" refers to both parameter count (often billions to trillions) and training data scale. Modern LLMs like GPT-4, Claude, and Llama use the Transformer architecture and are trained with self-supervised learning: they predict masked or next tokens, building rich internal representations of language.

After pretraining, they're typically fine-tuned with RLHF (Reinforcement Learning from Human Feedback) or similar alignment techniques to follow instructions and be helpful.

LLM Training Pipeline

Raw Text
Terabytes

→

Pre-training
Next token prediction

→

RLHF/SFT
Alignment

→

Deployed
LLM

→ Interview Tip

Emphasize the self-supervised pretraining + alignment pipeline, not just "big neural net." Interviewers want to hear that you understand the two-phase training process.

Q02

How do modern LLMs differ from older, classic language models?

Dimension	Classic Models (n-gram, RNN)	Modern LLMs (Transformer)
Architecture	n-grams, RNNs, LSTMs	Transformers with self-attention
Scale	Millions of parameters	Billions to trillions of parameters
Context	Fixed, short window	Thousands to millions of tokens
Emergent abilities	None — task-specific	In-context learning, chain-of-thought
Transfer	Separate model per task	One model → many tasks

Classic models were essentially lookup tables — n-gram models computed P(word | previous n-1 words) and that was it. LLMs are learned, generalizable representations that can handle tasks never seen during training.

→ Key Contrast

n-gram models were lookup tables; LLMs are learned, generalizable representations. The key jump was parallelization + scale + emergent abilities.

Q03

What does the term "foundation model" mean?

A foundation model is a large model trained on broad data at scale, designed to be adapted to a wide range of downstream tasks. The term was coined by Stanford's HAI center.

Foundation Model Paradigm

One Large Foundation Model
Trained on diverse text, code, images

→

Fine-tuning

Prompting

RAG

Embeddings

Key properties: Generality (diverse training data), Adaptability (fine-tune or prompt for specific tasks), and Emergent capabilities (behaviors not explicitly programmed). Examples: GPT-4 (language), CLIP (vision-language), Stable Diffusion (image generation).

This paradigm is both powerful (fewer models to train) and risky — a single model's biases propagate everywhere it's deployed.

→ Mental Model

Think of it as the "operating system" of modern AI — one base, many applications built on top.

Q04

In practice, what changed going from GPT-3 to GPT-4?

Capability	GPT-3	GPT-4
Modality	Text only	Text + Images (multimodal)
Context length	~4K tokens	Up to 128K tokens
Reasoning	Mediocre on exams	Top percentile on Bar, SAT, GRE
Instruction following	Often inconsistent	Reliable over long outputs
Safety	Limited red-teaming	Extensive RLHF + red-teaming

The jump was less about architectural novelty and more about scale, data quality, and alignment. GPT-4 unlocked use cases like document analysis, complex code generation, tutoring, and professional-grade writing that GPT-3 couldn't reliably handle.

→ Interview Tip

Don't just say "GPT-4 is bigger." Lead with multimodality, 32× context expansion, and alignment improvements. These are the practical capability jumps that matter in production.

Q05

What is tokenization and why does it matter for LLM behavior/costs?

Tokenization converts raw text into discrete tokens (subword units) that the model processes. Modern LLMs use subword tokenizers like BPE (Byte-Pair Encoding), WordPiece, or SentencePiece.

BPE Tokenization Example

Input text:

"Unfamiliarization"

Tokens:

familiar

ization

Count:

3 tokens (not 18 characters)

Why it matters for behavior: The tokenizer determines what atomic units the model sees. Poor tokenization of certain languages (e.g., non-Latin scripts) means more tokens for the same content, degrading performance and effective context capacity.

Why it matters for costs: API pricing is per-token. Non-English text or unusual variable names can tokenize 2–5× less efficiently than standard English, directly inflating costs.

Hidden gotcha: Tokenization artifacts explain why LLMs can't reliably count letters in a word — they see subword chunks, not individual characters.

→ Key Insight

Tokenization is the invisible bottleneck — it affects cost, context usage, multilingual performance, and even reasoning tasks like counting characters.

Q06

Why are embeddings useful, and where do they show up in LLM systems?

Embeddings are dense vector representations that map discrete tokens (or sentences, documents) into a continuous vector space where semantic similarity = geometric proximity.

Where	What embeddings do	Example
Inside LLMs	First layer converts token IDs to vectors	768-dim vectors per token
RAG pipelines	Documents and queries → vectors in a DB	Pinecone, Qdrant, pgvector
Evaluation	Semantic similarity between output and ground truth	BERTScore, cosine sim
Clustering	Group related documents	topic modeling

Classic example: king - man + woman ≈ queen. Good embeddings capture semantic relationships geometrically. Modern contrastive learning produces embeddings where similar meanings are close and dissimilar ones are far apart.

→ Core Idea

Embeddings are the bridge between discrete text and the continuous math that neural networks operate on. Without them, gradient-based learning wouldn't work on language.

Saurabh Singh — AI Engineer & Builder

linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7

Part Two

Architectures &
Mechanisms

How do models actually work under the hood? From handling unknown words to understanding the encoder-decoder split, this section covers the building blocks before the Transformer era — and what led to it.

Q7 — OOV Handling Q8 — Seq2seq Q9 — Encoder vs Decoder Q10 — Autoregressive vs Masked Q11 — MLM Q12 — NSP

Questions 7 – 12

Saurabh Singh

LinkedIn Medium GitHub

Q07

How do models deal with words or strings they've never seen before (OOV)?

Modern LLMs effectively eliminated the OOV (out-of-vocabulary) problem through subword tokenization.

Method	How It Works	Used By
BPE	Starts with characters, iteratively merges most-frequent pairs	GPT-2, GPT-3, GPT-4
Byte-level BPE	Operates on raw bytes (0–255) — truly universal, handles emoji	GPT-2 onward
WordPiece	Similar to BPE but optimizes likelihood of training data	BERT, DistilBERT
SentencePiece	Language-agnostic, treats input as raw byte stream	T5, LLaMA, multilingual models

The tradeoff: rare words get split into more tokens, consuming more context and giving the model less direct "understanding" of them as atomic units. This is why LLMs struggle with very rare proper nouns — they see fragments, not whole words.

→ Key Tradeoff

Subword tokenization trades vocabulary completeness for vocabulary compactness. You get OOV-free coverage but rare terms get fragmented, reducing the model's effective understanding of them.

Q08

What is a seq2seq model and what problem does it solve?

Seq2seq (sequence-to-sequence) maps an input sequence to an output sequence of potentially different length. Originally proposed by Sutskever et al. (2014) using RNNs.

Seq2Seq Architecture

Input tokens
"Hello world"

→

Encoder
Compresses to context vector

→

Output tokens
"Bonjour monde"

→

Decoder
Generates token by token

Problems it solves: Machine translation, summarization, question answering, dialogue — any task where input and output have different lengths and structures.

Key limitation: The fixed-size context vector is an information bottleneck. Long inputs get compressed into the same size vector as short ones, losing information. This is exactly what attention mechanisms (and later Transformers) were designed to fix.

Evolution: Seq2seq + attention → Transformer encoder-decoder (T5) → decoder-only LLMs (GPT). The paradigm still lives in T5, BART, and mBART.

→ Historical Arc

Seq2seq introduced the encoder-decoder paradigm; attention fixed its information bottleneck. Understanding this evolution matters — interviewers often trace the "why" of Transformers back to here.

Q09

Encoder vs decoder: what does each one do?

Encoder vs Decoder — Side by Side

Encoder

token₁

↕ Bidirectional attention

token₂

↕ Sees all tokens

token₃

Sees full sequence at once.
Great for understanding tasks (BERT).

Decoder

token₁

→ Causal (left-to-right)

token₂

→ Can't see future

[NEXT] ?

Generates token by token.
Great for generation tasks (GPT).

Encoder-decoder: Uses both. The encoder builds a representation of the input; the decoder generates the output while cross-attending to the encoder's representations. Examples: T5, BART, original Transformer.

Modern trend: Decoder-only models (GPT, Claude, Llama) dominate because they're simpler to scale and can handle both understanding and generation in one architecture.

→ One-liner

Encoder = understand everything at once. Decoder = generate one token at a time. Modern LLMs are decoder-only because generation at scale is simpler than maintaining separate encoder/decoder.

Q10

Autoregressive vs masked models: what's the core difference?

	Autoregressive (GPT)	Masked (BERT)
Training objective	Predict next token: P(xₜ \| x₁…xₜ₋₁)	Predict masked tokens: P(x_masked \| x_unmasked)
Attention direction	Causal — left context only	Bidirectional — full context
Best for	Generation (text, code)	Understanding (classification, NER)
Examples	GPT-2/3/4, Claude, LLaMA	BERT, RoBERTa, DistilBERT
Generates naturally?	Yes — token by token	No — predicts all masks at once

Hybrid approaches: XLNet uses permutation language modeling for bidirectional context with autoregressive generation. T5 uses span corruption as a middle ground. The field converged on autoregressive for large-scale LLMs.

→ Field Direction

Autoregressive = generation-first. Masked = understanding-first. Modern AI converged on autoregressive for frontier LLMs. But BERT-style models are still dominant for embeddings and classification.

Q11

What is masked language modeling and why is it a good pretraining objective?

MLM is BERT's core training objective. During training, ~15% of tokens are randomly selected:

MLM Token Selection

The

[MASK]

sat

chair

→

80% → [MASK]

10% → random word

10% → unchanged

Why it works: Bidirectional context forces the model to use both left and right context, building richer representations. It's fully self-supervised — no labeled data needed.

Limitation: The [MASK] token never appears at inference time (train-test mismatch). Also, only ~15% of tokens provide learning signal per example — less sample-efficient than autoregressive training where every token is a target.

→ Intuition

MLM is like a cloze test at massive scale — fill in the blank forces holistic understanding. The model can't cheat by just looking left.

Q12

What is next sentence prediction (NSP) and when is it used?

NSP was introduced alongside MLM in the original BERT paper. The model is given two sentences and must predict whether sentence B actually follows sentence A in the corpus.

Input	Label
"The dog barked loudly." + "It was a German Shepherd."	IsNext ✓
"The dog barked loudly." + "Paris is the capital of France."	NotNext ✗

Purpose: Help the model understand inter-sentence relationships, useful for QA and natural language inference.

Why it fell out of favor: RoBERTa showed that NSP doesn't help and can even hurt performance. The task is too easy — the model can distinguish random pairs using topic mismatch alone, without learning deep discourse relationships.

What replaced it: RoBERTa dropped NSP entirely. ALBERT replaced it with Sentence Order Prediction (SOP) — both sentences are from the same document but may be swapped, forcing actual coherence understanding.

→ Key Lesson

NSP was well-intentioned but too easy to game. Harder objectives like SOP worked better. A good interview answer connects this to the general principle: your pretraining objective shapes what the model actually learns.

Saurabh Singh — AI Engineer & Builder

linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7

Part Three

Transformers &
Attention

The mechanism that changed everything. Why Transformers won, how attention actually computes what to focus on, and why multi-head attention is more than just "more is better."

Q13 — Why Transformers Won Q14 — Positional Encodings Q15 — Attention Explained Q16 — Multi-Head Attention Q17 — Computing Attention Scores Q18 — Why Softmax?

Questions 13 – 18

Saurabh Singh

LinkedIn Medium GitHub

Q13

Why did Transformers outperform older seq2seq architectures?

RNN (Sequential) vs Transformer (Parallel)

RNN / LSTM

h₁ → h₂ → h₃ → h₄

Sequential — can't parallelize

Gradient vanishes over distance

Fixed-size context bottleneck

Transformer

t₁↔t₂

t₁↔t₃

t₂↔t₄

t₃↔t₄

All positions in parallel

O(1) path between any two tokens

Per-token representations maintained

Five reasons Transformers won:

1. Parallelization: RNNs process tokens sequentially — Transformers process all positions in parallel via self-attention, enabling massive GPU utilization.

2. Long-range dependencies: In RNNs, information degrades over distance. In Transformers, self-attention connects every position to every other — path length is O(1) regardless of distance.

3. No information bottleneck: RNN seq2seq compressed the entire input into one fixed-size vector. Transformers maintain per-token representations throughout.

4. Scalability: Parallel nature makes it scale efficiently. This enabled training on orders of magnitude more data.

5. Cleaner gradient flow: Residual connections + layer normalization give much cleaner gradient paths than deep RNNs.

→ Core Insight

Replace recurrence with attention. It's simpler, more parallel, and scales better. The paper title was literally "Attention Is All You Need."

Q14

What are positional encodings and why are they needed?

Self-attention is permutation-invariant — it treats the input as a set, not a sequence. Without positional information, "dog bites man" and "man bites dog" would produce identical representations.

Method	How It Works	Advantage	Used In
Sinusoidal	Fixed sin/cos functions of different frequencies	Generalizes to unseen lengths	Original Transformer
Learned	A position embedding per position, trained	Simple to implement	GPT, BERT
RoPE	Rotates Q and K vectors by position	Captures relative positions, extrapolates	LLaMA, Mistral
ALiBi	Linear bias on attention scores based on distance	Efficient for long contexts	BLOOM

The choice of positional encoding significantly impacts a model's ability to handle long contexts. RoPE has become the dominant choice for modern open-source LLMs because it naturally captures relative positions and supports context length extension via techniques like NTK-aware scaling.

→ Key Point

Positional encodings inject sequence order into an otherwise order-agnostic architecture. RoPE is the modern standard — know why it's preferred over learned embeddings for long-context tasks.

Q15

What is attention, and what does it enable in Transformers?

Attention is a mechanism that lets each token dynamically focus on the most relevant parts of the input when computing its representation. It computes a weighted sum of values, where weights are determined by the compatibility between queries and keys.

Attention: What Each Vector Means

Q
Query

"What am I looking for?"

K
Key

"What information do I contain?"

V
Value

"What do I actually offer if selected?"

Types of attention in Transformers:

Self-attention: Tokens in a sequence attend to each other — every token can interact with every other token.

Cross-attention: Decoder tokens attend to encoder representations — used in encoder-decoder models.

Causal attention: Restricted self-attention where tokens only attend to past positions — used in all decoder-only LLMs.

→ Intuition

Attention = learned, content-dependent routing of information between tokens. Unlike convolutions (fixed window), attention adaptively selects what to focus on based on the content itself.

Q16

What does "multi-head" attention add beyond single-head attention?

Multi-head attention runs several attention operations in parallel, each with its own learned projection matrices, then concatenates and projects the results.

Multi-Head Attention — Parallel Heads

Input Embeddings

Head 1
Syntax?

Head 2
Semantics?

Head 3
Position?

Head N
…?

Concat → Linear Projection

Output

Why multiple heads matter:

One head might focus on syntactic relationships (subject-verb agreement), another on semantic relationships (coreference), and another on positional patterns. A single head can only capture one type of interaction per layer.

Practical details: If model dimension d=512 and we use h=8 heads, each head operates in d/h=64 dimensions. Total compute ≈ same as single-head at d=512, but representation capacity is much richer.

Research finding: Not all heads are equally important — some can be pruned with minimal quality loss, while others are critical for specific capabilities.

→ Key Insight

Multi-head attention lets the model attend to information from different representation subspaces simultaneously. It's "divide and specialize" — each head learns different types of relationships.

Q17

How are attention weights/scores computed at a high level?

Scaled Dot-Product Attention — Step by Step

Q (Queries)
W_Q × Input

K (Keys)
W_K × Input

V (Values)
W_V × Input

→

Scores = Q · Kᵀ
Dot product similarity

→

Scale by √d_k
Prevent gradient saturation

→

Softmax → Weights (sum=1)

→

Output = Weights · V

The full formula: Attention(Q,K,V) = softmax(QKᵀ / √d_k) · V

Why divide by √d_k? The raw dot product grows with dimension (expected value increases with d_k). Without scaling, softmax would saturate on high-dimensional vectors, producing near-zero gradients and making training unstable.

→ The Formula

Memorize: Attention(Q,K,V) = softmax(QKᵀ / √d_k) · V. Know what each piece does. The √d_k scaling is a common interview gotcha — know why it's there.

Q18

Why does attention use softmax?

Softmax serves three critical purposes in the attention mechanism:

Purpose	Why It Matters
Normalization	Converts raw scores into a probability distribution that sums to 1 — ensures output is a proper weighted average of values
Sparsification	Amplifies large differences — high-probability tokens get disproportionately high weights, low ones pushed toward zero — creates focused selection
Differentiability	Unlike hard argmax, softmax is smooth everywhere, enabling gradient-based training

Alternatives being explored:

Linear attention: Removes softmax for O(n) complexity instead of O(n²), but loses sparsification benefit. Sigmoid attention: Some recent work replaces softmax with sigmoid for more independent per-head attention. Sparse attention: Uses top-k or local windows to approximate softmax more efficiently (Longformer, BigBird).

→ Core Reason

Softmax turns raw similarity scores into a learnable, differentiable probability distribution over context. The alternatives (linear, sparse) trade the quality of this distribution for computational efficiency at scale.

Saurabh Singh — AI Engineer & Builder

linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7

Part Four

Decoding &
Generation

How does the model turn probability distributions into words? Temperature, top-p, beam search — these are the controls that separate generic output from precisely what you wanted.

Q19 — Dot Product in Attention Q20 — Beam vs Greedy Q21 — Temperature Q22 — Top-k vs Top-p Q23 — Adaptive Softmax Q24 — Cross-Entropy Loss

Questions 19 – 24

Saurabh Singh

LinkedIn Medium GitHub

Q19

Where does the dot product show up in self-attention and why?

The dot product appears as the similarity function between query and key vectors: score(q, k) = q · k = Σ(qᵢ × kᵢ)

Reason for Dot Product	Why It Works
Computational efficiency	Dot products are just matrix multiplications — extremely fast on GPUs (QKᵀ is a single batched op)
Geometric meaning	Dot product measures vector alignment. Similar directions (semantically related) → high score. Orthogonal → zero score
Simplicity	Fewer parameters than additive attention, faster to compute
Differentiable	Smooth gradients for backprop through the similarity computation

The scaling: Raw dot products grow with dimension — expected value ∝ d_k. Without the ÷ √d_k scaling, softmax saturates at high dimensions, producing near-zero gradients. That's why QKᵀ / √d_k is always seen together.

In practice, QKᵀ is a single matrix multiplication computing every query's dot product with every key simultaneously — the core of efficient Transformer inference.

→ GPU Intuition

Dot product = fast, parallelizable similarity that GPUs love. The entire attention mechanism is essentially a sequence of matrix multiplications, which is why modern hardware (NVIDIA, TPUs) can run it so efficiently.

Q20

Beam search vs greedy decoding: how do they differ and when would you use each?

Greedy vs Beam Search (beam width B=2)

Greedy Decoding

[START]

↓ pick max prob

The

↓ pick max prob

cat

Fast, myopic.
May miss global optimum.

Beam Search (B=2)

[START]

The ✓

A ✓

cat

dog

big✗

small✗

Keeps top-B candidates.
Better global sequences.

	Greedy	Beam Search
Speed	O(V) per step	O(B·V) per step
Quality	Locally optimal	Globally better sequences
Use for	Conversational AI, creative writing	Translation, summarization, ASR
Output style	Diverse, natural	Often generic, repetitive

Modern practice: Most LLM apps use sampling (temperature + top-p) rather than beam search, because beam search tends to produce generic text. Beam search remains important for structured output tasks with a narrow correct answer space.

→ Production Advice

Beam search finds the most probable sequence; sampling finds diverse, natural-sounding text. For chatbots use sampling. For translation/transcription use beam search.

Q21

What does temperature control in text generation?

Temperature is a scalar that modifies logits before softmax: softmax(logits / T)

Temperature Effect on Token Probability Distribution

T = 0.2 (Low — Focused)

the

85%

10%

this

Deterministic, repetitive

T = 1.5 (High — Creative)

the

32%

26%

this

22%

20%

Diverse, surprising

Temperature	Effect	Use For
T → 0	Greedy decoding (argmax)	Deterministic outputs
T = 0.0–0.3	Focused, consistent	Factual QA, code, structured outputs
T = 0.7–1.0	Natural, varied	Conversational AI, writing
T > 1.0	Chaotic, unexpected	Brainstorming (rarely used in production)
T → ∞	Uniform random sampling	Never useful

→ Mental Model

Temperature controls the explore-exploit tradeoff: low = exploit known patterns, high = explore novel combinations. Always pair with top-p in production — temperature alone can still sample very low probability tokens.

Q22

Top-k vs top-p (nucleus) sampling: how are they different?

Both are filtering strategies applied before sampling to avoid drawing from the long tail of low-probability tokens.

	Top-k Sampling	Top-p (Nucleus) Sampling
How it works	Keep the k highest-probability tokens	Keep smallest set where cumulative prob ≥ p
Candidate set size	Fixed at k	Adapts dynamically to model confidence
When model is confident	Still keeps k candidates	Keeps fewer candidates (maybe 2–3)
When model is uncertain	Still keeps k candidates	Keeps more candidates (maybe 50+)
Common default	k = 50	p = 0.9
Preference	Simpler but less adaptive	Preferred in most production systems

Problem with top-k: If k=50 but 3 tokens have 95% of probability mass, you're still sampling from 47 near-zero-probability tokens — wasting candidate slots on bad choices.

Best practice: Many APIs apply both — top-k to cap maximum candidates, then top-p within that set. Common combination: top-p=0.9, top-k=50.

→ Production Preference

Top-p adapts to model confidence; top-k uses a fixed cutoff. In practice, top-p is preferred. Use p=0.9 for most tasks. Combine with temperature for fine-grained control.

Q23

What is adaptive softmax and how can it improve efficiency?

Adaptive softmax is an approximation technique for the output softmax layer when vocabulary is very large (100K+ tokens).

The problem: Computing softmax over the entire vocabulary requires computing logits for every token — a huge matrix multiplication that is the computational bottleneck during training.

Adaptive Softmax — Zipf-based Partitioning

Head cluster
~500 most frequent tokens
Full-dimension representations

90%+ of tokens land here

Tail cluster 1
~10,000 medium-freq tokens
Reduced-dimension reps

8% of tokens

Tail cluster 2
Remaining rare tokens
Low-dimension reps

2% of tokens

Efficiency gain: Most of the time, the model only needs to compute the full softmax over the small head cluster. Tail clusters are only evaluated when needed, providing 2–10× speedups on the output layer.

Modern relevance: With BPE vocabularies around 32K–100K and faster hardware, adaptive softmax is less critical than before — but the concept of variable-capacity allocation based on frequency remains influential in efficient NLP.

→ Core Idea

Adaptive softmax exploits Zipf's law — a small fraction of tokens are very frequent, the rest rare. Give frequent tokens full capacity, rare tokens reduced capacity. Trade uniform treatment for efficiency.

Q24

What is cross-entropy loss and why is it common in language modeling?

Cross-entropy loss measures the difference between the model's predicted probability distribution and the true distribution: L = -log(P(correct_token))

Property	Why It Matters
Direct likelihood optimization	Minimizing cross-entropy ≡ maximizing likelihood of training data. The most principled objective for probabilistic models.
Information-theoretic meaning	Measures average bits needed to encode true data using model's distribution. Lower = better compression = better model.
Relationship to perplexity	Perplexity = e^(cross-entropy). Perplexity of 10 = "model as confused as if 10 equally likely tokens."
Gradient behavior	Penalizes confident wrong predictions heavily, moderate gradients for uncertain predictions — encourages calibration naturally.

Perplexity formula: PPL = exp(H(p,q)) — the standard evaluation metric for language models. Lower is better.

→ Intuition

Cross-entropy = "how surprised is the model by the correct answer?" Lower surprise = better model. A model that assigns 100% probability to the correct token has zero cross-entropy loss.

Saurabh Singh — AI Engineer & Builder

linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7

Part Five

Training &
Optimization

The math that makes learning happen. Gradients, backprop, vanishing signals, and why ReLU matters. This section separates engineers who use LLMs from those who understand them.

Q25 — Gradient Flow to Embeddings Q26 — The Jacobian Q27 — Chain Rule Q28 — ReLU Activation Q29 — Vanishing Gradients Q30 — Eigenvalues in Dim Reduction

Questions 25 – 30

Saurabh Singh

LinkedIn Medium GitHub

Q25

How do gradients flow to/update embedding vectors during training?

Embedding layers are lookup tables (matrices) where each row corresponds to a token. During training:

Embedding Update During Training

Token ID "cat"
Index = 2847

→

Embedding Matrix
vocab_size × hidden_dim

→

Row 2847
[0.2, -0.5, ...]

↑ Backward pass: gradient flows only to row 2847

Other rows = zero gradient this step

Sparse gradients: The backward pass computes gradient only for the rows corresponding to tokens that appeared in the current batch. This is why gradients are sparse — only the rows corresponding to tokens that appeared get updated.

Key implications:

• Rare tokens learn slowly — fewer gradient updates, less refined embeddings. This is why models struggle with very rare proper nouns.

• The embedding matrix is often the largest single parameter block in the model (vocab_size × hidden_dim).

• Weight tying: Many models share the embedding matrix with the output projection (lm_head), which acts as a regularizer and reduces parameter count.

→ Key Insight

Only tokens present in the batch get their embeddings updated. Rare tokens learn slowly because they appear in few training steps. This is a fundamental limitation that tokenization can partially address by breaking rare words into common subwords.

Q26

What is the Jacobian and why does it matter in backprop?

The Jacobian is the matrix of all first-order partial derivatives of a vector-valued function. For f: ℝⁿ → ℝᵐ, the Jacobian J is an m×n matrix where J_ij = ∂fᵢ/∂xⱼ.

Role in Backprop	What It Means
Gradient transformation	Gradient through a layer = Jᵀ · upstream_gradient. The Jacobian determines how error signals are transformed at each layer.
Vanishing gradients	If singular values of J are consistently < 1, gradients shrink exponentially. This causes vanishing gradients in deep networks.
Exploding gradients	If singular values > 1 consistently, gradients explode. Gradient clipping is the mitigation.
Computational efficiency	Backprop never builds the full m×n Jacobian — it computes Jacobian-vector products (JVPs) in O(n), not O(n²).

Conditioning: Well-conditioned Jacobians (singular values ≈ 1) lead to stable training. Layer normalization and careful initialization aim to maintain this throughout the network.

→ Practical Meaning

The Jacobian controls how gradients transform as they flow backward through each layer. Large Jacobians → exploding gradients. Small Jacobians → vanishing gradients. Residual connections keep the Jacobian close to identity.

Q27

How is the chain rule applied in deep learning training?

The chain rule is the mathematical foundation of backpropagation. For composition f(g(x)), the derivative is f'(g(x)) · g'(x).

Backpropagation — Chain Rule in Action

Input x

→

Layer 1
f₁(x)

→

Layer 2
f₂(·)

→

Loss L

← Backward pass: ∂L/∂x = (∂L/∂f₂) · (∂f₂/∂f₁) · (∂f₁/∂x)

Why backprop is efficient: Computing the gradient for layer 1 in an L-layer network is O(L) multiplications — not O(L!) (which you'd get by naively applying the chain rule). It reuses intermediate results from the forward pass.

Practical concerns: The chain of multiplications can cause vanishing/exploding gradients. Residual connections (x + f(x)) mitigate this by providing a direct gradient highway — the gradient of x + f(x) w.r.t. x is I + ∂f/∂x, so even if ∂f/∂x vanishes, the identity term I preserves the gradient.

→ Core Efficiency

Backprop = chain rule applied systematically from output to input, reusing intermediate results. The key insight is that it's O(L) not O(L!) because of the dynamic programming structure.

Q28

What is ReLU and why is it a popular activation?

ReLU (Rectified Linear Unit): f(x) = max(0, x). Despite its simplicity, it revolutionized deep learning training.

Activation	Formula	Gradient Issue	Used In
Sigmoid	1/(1+e⁻ˣ)	Saturates at both ends → vanishing gradient	Old networks, output layers
Tanh	(eˣ-e⁻ˣ)/(eˣ+e⁻ˣ)	Saturates at ±1 → vanishing gradient	RNNs, old networks
ReLU	max(0,x)	Gradient = 1 for x>0, dying ReLU for x<0	CNNs, ResNets
GELU	x·Φ(x)	Smooth, no dying neurons	BERT, GPT-2, GPT-3
SwiGLU	Gated: x·σ(Wx)·(Vx)	Best empirical performance	LLaMA, Mistral, GPT-4

Why ReLU was a breakthrough: The gradient is 1 for x > 0, enabling constant gradient flow in the positive region. Unlike sigmoid/tanh which saturate at both extremes, ReLU maintains a constant gradient, enabling much deeper networks.

Why modern LLMs use GELU/SwiGLU: ReLU has a "dying ReLU" problem (neurons stuck at 0) and isn't zero-centered. GELU provides a smooth approximation; SwiGLU adds a gating mechanism that empirically outperforms both.

→ Modern Usage

ReLU solved the vanishing gradient problem for deep networks. GELU/SwiGLU refined it for Transformers. If asked about activations, mention SwiGLU as what LLaMA/GPT-4 use and explain the gating mechanism briefly.

Q29

What causes vanishing gradients, and how do Transformers mitigate it?

Root cause: During backpropagation, gradients are multiplied by the Jacobian at each layer. If these Jacobians have spectral norms consistently < 1 (from saturating activations), the gradient shrinks exponentially with depth.

Transformer Mitigations for Vanishing Gradients

Residual Connections
x + f(x)

Direct gradient highway. Even if f'(x) → 0, the identity term I preserves gradient.

Layer Normalization
Pre-LN or Post-LN

Prevents activations drifting into saturation regions. Stabilizes gradient magnitudes.

GELU / SwiGLU
Non-saturating

Smooth activations that don't saturate at extremes. No "dying neuron" problem.

Xavier/He Init
Careful initialization

Preserves variance of activations/gradients across layers at initialization.

→ Most Important Mitigation

Residual connections are the single most important mitigation — they create gradient highways. The gradient of x + f(x) w.r.t. x is always at least I regardless of what f does, making very deep Transformers trainable.

Q30

In dimensionality reduction, what do eigenvalues/eigenvectors represent?

In PCA (Principal Component Analysis) and spectral methods, eigenvalues and eigenvectors provide the geometric decomposition of the data's variance structure.

Concept	What It Represents	How It's Used
Eigenvectors	Principal directions (axes) of maximum variance in the data	Define the new coordinate system for PCA projection
Eigenvalues	Magnitude of variance along each eigenvector	Determine how much information each direction captures
Top-k eigenvectors	The most informative directions	Keep these to reduce dimensions while preserving structure
Eigenvalue ratio	λ_k / Σλᵢ = fraction of variance retained	Decide how many dimensions to keep

Connection to LLMs:

• LoRA: Exploits the fact that weight update matrices have low intrinsic rank — SVD decomposition finds the eigenvectors of the update subspace

• Embedding visualization: PCA projects 768-dim BERT embeddings to 2D for visualization

• Attention analysis: Eigendecomposition of attention matrices reveals what the model focuses on

→ Intuition

Eigenvectors = directions that matter. Eigenvalues = how much they matter. Keep the big ones, discard the rest. This is precisely why LoRA works — weight updates during fine-tuning live in a low-eigenvalue subspace.

Saurabh Singh — AI Engineer & Builder

linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7

Part Six

Evaluation &
Fine-Tuning

How do you make a model better without breaking what it already knows? LoRA, QLoRA, KL divergence, catastrophic forgetting, and knowledge distillation — the practical toolkit for customizing LLMs.

Q31 — KL Divergence Q32 — LoRA & QLoRA Q33 — Catastrophic Forgetting Q34 — PEFT vs Full Fine-Tune Q35 — Knowledge Distillation Q36 — Overfitting in Fine-Tuning

Questions 31 – 36

Saurabh Singh

LinkedIn Medium GitHub

Q31

What is KL divergence and how is it used in ML/LLM evaluation?

KL divergence (Kullback-Leibler divergence) measures how one probability distribution P differs from a reference distribution Q:

KL(P||Q) = Σ P(x) · log(P(x)/Q(x))

Property	What It Means
Always ≥ 0	Equals 0 only when P = Q exactly
Asymmetric	KL(P\|\|Q) ≠ KL(Q\|\|P). Matters which distribution is the reference.
Relation to cross-entropy	KL(P\|\|Q) = H(P,Q) − H(P). Cross-entropy minus entropy.

Key uses in LLMs:

RLHF constraint: During RLHF training, a KL penalty prevents the fine-tuned model from diverging too far from the base model. This preserves fluency while improving alignment. Without it, the model would exploit the reward signal and produce gibberish that scores high.

Knowledge distillation: Student model is trained to minimize KL divergence between its output distribution and the teacher's soft outputs.

VAE training: KL divergence regularizes the latent space toward a prior (Gaussian) distribution.

→ Information-Theoretic Meaning

KL divergence = "how many extra bits do I need because I'm using Q instead of P?" Used everywhere in LLM alignment. The RLHF KL penalty is the key reason models stay coherent during alignment training.

Q32

What are LoRA and QLoRA, and why do people use them?

LoRA — Low-Rank Adaptation of Weight Matrices

W
d × d
Frozen ❄️

B
d × r
Trainable

rank r << d

A
r × d
Trainable

W + BA
Effective weight
during forward pass

	LoRA	QLoRA
Base weights	Frozen in full precision	Quantized to 4-bit NF4
Adapter precision	float16/bf16	bfloat16
Memory savings	10–100× fewer trainable params	Also reduces base model memory 4–8×
Use case	Multi-GPU fine-tuning	Single GPU fine-tuning (48GB for 65B model)
Inference overhead	None — BA merged into W	None — same merge trick

Why it works: Weight updates during fine-tuning have low intrinsic rank — most of the change concentrates in a small subspace. LoRA parameterizes the update as a low-rank product, capturing this efficiently.

→ Production Summary

LoRA = efficient fine-tuning via low-rank updates. QLoRA = LoRA + 4-bit quantization for consumer GPUs. Key innovation in QLoRA: double quantization + paged optimizers to handle memory spikes. Hot-swapping LoRA adapters on a shared base model is a major production pattern.

Q33

What is catastrophic forgetting and what are common mitigations?

Catastrophic forgetting occurs when a neural network trained on new data loses its ability to perform well on previously learned tasks. New gradients overwrite the weights that encoded old knowledge.

Why it's especially problematic for LLMs: Fine-tuning on a narrow domain can destroy the model's general capabilities. A model fine-tuned heavily on legal text might lose its ability to write code or do math.

Mitigation Strategy	How It Works	Example
PEFT (LoRA)	Freeze base weights — can't forget what you don't update	Default approach today
EWC	Regularization penalty for changing weights important to prior tasks (via Fisher information)	Elastic Weight Consolidation
Replay	Mix samples from original training distribution during fine-tuning	Experience replay
Small LR + Layer Freezing	Freeze early layers; very small LR on later layers	Common in transfer learning
Multi-task training	Train on new and old tasks simultaneously	Instruction-tuned models

→ Key Insight

Catastrophic forgetting is why you LoRA instead of full fine-tune — frozen weights can't be forgotten. This is a fundamental property, not just an efficiency trick.

Q34

How does PEFT reduce forgetting compared to full fine-tuning?

PEFT (Parameter-Efficient Fine-Tuning) is a family of methods that update only a small subset of model parameters while keeping most pretrained weights frozen.

PEFT Method	What It Trains	Memory
LoRA / QLoRA	Low-rank decomposition of weight updates (A and B matrices)	Very low
Prefix Tuning	Learnable prefix tokens prepended to each layer	Low
Prompt Tuning	Learnable soft prompts at the input level only	Minimal
Adapters	Small bottleneck layers inserted between Transformer layers	Low
IA³	Learned rescaling vectors for keys, values, and FFN activations	Minimal

Why PEFT prevents forgetting by construction:

• Frozen base = preserved knowledge (mathematically — frozen weights can't change)

• Small parameter budget = optimization landscape constrained — model can't drift far from original

• Composability: train separate PEFT modules for different tasks, swap at inference. Base model stays pristine.

Tradeoff: Slightly lower peak performance on the target task vs full fine-tuning, but preservation of general capabilities makes this worthwhile in practice.

→ Core Property

PEFT preserves pretraining by construction — frozen weights are mathematically incapable of forgetting. The new capability is added in the adapter; the old capability lives in the unchanged base.

Q35

What is knowledge distillation and how is it applied to LLMs?

Knowledge distillation transfers knowledge from a large "teacher" model to a smaller "student" model by training the student to mimic the teacher's output distribution, not just match hard labels.

Knowledge Distillation Pipeline

Teacher
GPT-4 (70B)

↓

Soft labels
[0.7, 0.2, 0.1, ...]

Student
7B Model

Minimize KL(teacher || student)

Why soft labels work better than hard labels: The teacher's probability distribution encodes which wrong answers are "almost right" — richer signal than a one-hot label. This dark knowledge helps the student learn the relationships between classes.

Applications in LLMs: Model compression (distilling GPT-4 quality into smaller models), synthetic data generation (large model generates training data for smaller), speculative decoding (small draft model verified by large model).

Limitation: Capacity gap — a 7B model can't fully absorb a 70B model's knowledge. Distillation also can't transfer emergent abilities that require scale.

→ Core Idea

Distillation compresses teacher knowledge into a smaller model via soft probability distributions. The "dark knowledge" in the teacher's output distribution is what makes distillation more effective than just training on hard labels.

Q36

What is overfitting in this context, and how do you reduce it?

In LLM fine-tuning, overfitting means the model memorizes the fine-tuning data rather than learning generalizable patterns.

Symptoms: Training loss keeps decreasing but validation loss plateaus or increases; the model generates verbatim training examples.

Strategy	How It Helps	Priority
Data diversity	More varied high-quality data is the single most effective defense	⭐ Highest
PEFT / LoRA	Constrains update to low-rank subspace — limits memorization capacity	⭐ High
Early stopping	Monitor validation loss; stop when it starts increasing	⭐ High
Small learning rate	Prevents aggressive weight updates that memorize individual examples	Medium
Dropout	Randomly zero activations during training — reduces co-adaptation	Medium
Weight decay (L2)	Regularization on parameter magnitudes	Low

Evaluation strategy: For LLMs, you need task-specific evaluation beyond just loss — measure actual generation quality on held-out examples. Perplexity on training set vs validation set is a quick diagnostic.

→ Practical Advice

Overfitting in fine-tuning = memorization. Fight it with data diversity first, PEFT second, early stopping third. If you're using LoRA with a reasonable rank (r=8 or r=16), you already have significant protection against overfitting.

Saurabh Singh — AI Engineer & Builder

linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7

Part Seven

Generative AI
Concepts

From discriminative vs generative to RAG pipelines and chain-of-thought. The concepts every AI practitioner — technical or not — needs to articulate clearly and correctly.

Q37 — Generative vs Discriminative Q38 — Explaining AI to a PM Q39 — Prompt Design Q40 — Chain-of-Thought Q41 — RAG Pipeline Stages Q42 — Knowledge Graphs in RAG

Questions 37 – 42

Saurabh Singh

LinkedIn Medium GitHub

Q37

Generative vs discriminative models: what's the distinction?

	Discriminative Models	Generative Models
What they learn	Decision boundary P(y\|x)	Data distribution P(x) or joint P(x,y)
Goal	Given x, predict y	Generate new samples from distribution
Examples	Logistic regression, BERT classification, SVMs	GPT, VAEs, GANs, diffusion models
Best at	Classification, regression	Generation, density estimation
Data requirements	Less — only needs decision boundary	More — needs to model full distribution

Modern blur: LLMs like GPT are generative models that perform discriminative tasks (classification, NLI) by generating the answer. "Is this review positive or negative?" → GPT generates "Positive." This generative approach to discriminative tasks has been surprisingly effective, blurring the traditional boundary.

The Modern Blur

Discriminative
P(y|x)

LLMs do both →

Generative LLM
Generates answers

← can classify too

Generative
P(x)

→ Modern View

Discriminative = learn boundaries. Generative = learn distributions. LLMs showed generative models can do both. The field is moving toward unified generative models that handle all tasks through generation.

Q38

Discriminative AI vs generative AI: how would you explain it to a PM?

Two Types of AI — PM Mental Model

Discriminative AI

🗂️ Sorting Machine

Answers: "Is this spam?"
"Will this customer churn?"

Takes input, assigns label or score. Mature, predictable, cheaper to run.

📌 Recommendations · Fraud detection · Search ranking · Content moderation

Generative AI

🎨 Creation Machine

Creates: text, images, code, music
that didn't exist before.

More expensive, needs guardrails, enables new product categories.

📌 AI assistants · Code copilots · Content tools · Conversational interfaces

Business model difference: Discriminative AI tends to save costs (automating classification). Generative AI tends to create new value (enabling new capabilities). Best products often combine both: generate content (generative) then filter/rank it (discriminative).

→ PM-Friendly Summary

Discriminative = sort and decide. Generative = create and produce. Best products combine both. This framing also works for technical conversations about system design — use discriminative classifiers as safety guardrails on top of generative outputs.

Q39

How does prompt design influence LLM outputs?

Prompt design fundamentally determines what part of the model's learned distribution you're sampling from. Small changes can dramatically shift output quality.

Technique	How It Works	Example
Specificity	Vague prompts get average outputs	"Write about AI" vs "Write a 500-word technical post comparing RAG and fine-tuning, targeting ML engineers"
Role/Persona	Activates different knowledge patterns	"You are a senior ML engineer" vs bare question
Output format	Requesting structure helps organization	"Respond in JSON with keys: summary, pros, cons"
Few-shot examples	Often more effective than lengthy instructions	Provide 2–3 input→output pairs before the query
Decomposition	Breaking complex tasks into steps	Chain-of-thought: "Think step by step"

System-level prompt engineering: In production, you design system prompts that set behavioral constraints, inject context, and define output schemas. This is where prompt engineering becomes system design.

→ Mental Model

Prompt engineering is really about navigating the model's probability space to find the output region you want. You're not programming — you're steering a probability distribution through carefully chosen context.

Q40

What is chain-of-thought prompting, and when is it helpful?

Chain-of-thought (CoT) prompting encourages the model to show its reasoning steps before giving a final answer.

Standard vs Chain-of-Thought

Standard Prompting

Q: If a train travels 60mph for 2.5 hours, how far does it go?

A: 150 miles

Single jump to answer.
Error-prone on complex tasks.

Chain-of-Thought

Q: Same question. Think step by step.

A: Speed = 60mph. Time = 2.5h.
Distance = 60 × 2.5 = 150 miles

Each step constrains the next.
Catches errors early.

Helpful When	Not Helpful When
Math and arithmetic problems	Simple factual recall ("Capital of France?")
Multi-step logical reasoning	Tasks where model is already highly confident
Complex analysis with many factors	Speed-critical applications (CoT adds tokens)
Planning and decision-making	Short-answer classification tasks

Variants: Tree of Thought (explores multiple reasoning paths), Self-Consistency (sample multiple CoTs, take majority vote), Zero-shot CoT (just add "Let's think step by step").

→ Why It Works

CoT works because autoregressive generation is the model's only form of computation — more tokens = more thinking. Each generated step becomes context that constrains subsequent steps, effectively expanding the model's working memory.

Q41

What are the main stages of a RAG pipeline?

RAG (Retrieval-Augmented Generation) augments LLM generation with retrieved external knowledge.

RAG Pipeline — 4 Stages

📄

Stage 1

Indexing

Chunk · Embed · Store in vector DB

→

🔍

Stage 2

Retrieval

ANN search · Top-k chunks

→

⚖️

Stage 3

Reranking

Cross-encoder · Refine relevance

→

✍️

Stage 4

Generation

Context + Query → LLM → Answer

Stage	Key Decisions	Common Tools
Indexing	Chunk size (too small = no context; too large = dilutes relevance)	LangChain, LlamaIndex
Retrieval	Embedding model choice, top-k value, hybrid (dense+sparse)	Pinecone, Qdrant, pgvector
Reranking	Cross-encoder model, reranking threshold	Cohere Rerank, BGE Reranker
Generation	Context window budget, citation format, hallucination mitigation	Any LLM API

Evaluation metrics: Retrieval quality (Precision@k, recall, MRR), Generation quality (faithfulness, groundedness), End-to-end (user satisfaction).

→ Production Reality

RAG = your LLM's external memory. The retrieval quality is usually the bottleneck, not the generation. Invest heavily in chunking strategy, embedding model selection, and reranking before optimizing the LLM step.

Q42

How can a knowledge graph improve retrieval + generation?

Knowledge graphs (KGs) store information as structured triples (entity → relationship → entity) and can significantly enhance RAG pipelines beyond what vector search alone provides.

Vector Search vs Knowledge Graph

Vector Search

Finds similar documents.
Good for: "Tell me about metformin"

❌ Can't answer: "What drugs interact with metformin?" — requires traversing a relationship graph, not finding similar text.

Knowledge Graph

Traverses relationships.
Supports multi-hop:
A → B → C → answer

✅ "Who is CEO of the company that acquired Instagram's maker?" → traverse 3 edges.

Benefit	How KG Adds It
Structured relationships	Traverse edges (not just find similar text) — critical for relational queries
Multi-hop reasoning	Naturally chains A→B→C queries that confuse vector search
Entity disambiguation	Resolves "Apple" (company vs fruit) via entity types
Hallucination reduction	Structured facts ground generation in verified triples
Completeness signals	Can detect when information is missing → "I don't know" vs hallucinate

GraphRAG: Microsoft's approach combines vector retrieval for broad context with graph traversal for precise, structured facts. LLMs are used to build a community-based KG from documents, then queries it with graph algorithms + LLM generation.

→ When to Use KG

KGs add structure where vector search only finds similarity — critical for relational reasoning. Best in domains with well-defined entity relationships: medical, legal, enterprise. Expensive to build and maintain — don't default to KG unless vector search demonstrably fails.

Saurabh Singh — AI Engineer & Builder

linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7

Part Eight

Multimodal &
Scaling

What happens at the frontier. Mixture-of-Experts, multimodal models, zero-shot and few-shot learning, context windows — and the practical limits of what scale actually buys you.

Q43 — Multimodal Models (Gemini) Q44 — Mixture-of-Experts Q45 — Zero-Shot Learning Q46 — Few-Shot Learning Q47 — Context Windows Q48 — Hyperparameters

Questions 43 – 48

Saurabh Singh

LinkedIn Medium GitHub

Q43

How do multimodal models like Gemini improve stability/efficiency compared to prior approaches?

Bolted-On vs Native Multimodal

Old Approach (Bolted-On)

Frozen Vision Encoder (CLIP)

↓ cross-attention adapter

Language Model (GPT)

Limited cross-modal understanding.
Distribution mismatch.

Native Multimodal (Gemini)

Text

Image

Audio

↓ unified Transformer

Joint Representation

Deep cross-modal attention from layer 1.
Shared representations.

Improvement	How Native Multimodal Achieves It
Training stability	Joint pretraining prevents distribution mismatch between independently-trained vision/language models
Cross-modal understanding	Cross-modal attention from earliest layers, not just late-stage adapters
Parameter efficiency	Single unified model vs maintaining separate vision + language models
Interleaved content	Handles documents with mixed text+images naturally (slides, charts, figures)

→ Key Principle

Native multimodal > bolted-on multimodal because joint training enables deeper cross-modal understanding from the ground up. The "stitching" approach (CLIP + GPT) is fundamentally limited by the shallow integration point.

Q44

How does Mixture-of-Experts (MoE) help scale models?

MoE replaces the dense feedforward network (FFN) in each Transformer layer with multiple "expert" FFNs and a learned router that activates only a subset per token.

Mixture-of-Experts — Token Routing

Input Token

↓

Gating Network (Router)
Selects top-2 of N experts

↓ routes to →

Expert 1 ✓
Active

Expert 2
Idle

Expert 3 ✓
Active

Expert 4
Idle

Expert N
Idle

↓ weighted combination

Output (only 2/N experts computed)

Benefit	Detail
Parameter efficiency	1.8T parameter MoE model activates only ~280B per token — near-dense performance at fraction of compute
Specialization	Different experts can specialize in different domains, learning more efficiently
Training cost	Scales with active parameters, not total. Large knowledge capacity for same training FLOPS

Challenges: Load balancing (ensuring all experts get used — auxiliary loss required), communication overhead in distributed training, expert collapse (all tokens routing to same experts), higher memory (all experts must fit in RAM).

Examples: Mixtral 8x7B, Switch Transformer, Grok (xAI), rumored GPT-4.

→ Core Idea

MoE = scale parameters without scaling compute. It's how you build giant models that run fast. The tradeoff: all experts need to fit in memory even when idle, so memory scales with total params but compute scales with active params.

Q45

What does "zero-shot" mean for LLMs?

Zero-shot means the model performs a task without any task-specific training examples in the prompt. It relies entirely on pretraining knowledge and instruction-following ability.

Zero-Shot vs Few-Shot vs Fine-Tuned Spectrum

ZERO-SHOT

No examples
Just the task

Tests: "Does it just know?"

FEW-SHOT

1–5 examples
in the prompt

Quick prompt tutorial

FINE-TUNED

1000s of examples
weights updated

Maximum accuracy

Why LLMs can do zero-shot: During pretraining on diverse text, the model encounters millions of implicit task demonstrations. It learns the pattern of instructions and responses. Instruction tuning (fine-tuning on instruction-following data like InstructGPT) dramatically improves zero-shot performance.

Limitations: Zero-shot performance varies wildly. Simple classification and translation work well; complex structured output or domain-specific tasks often need examples or fine-tuning.

→ Simple Summary

Zero-shot tests if the model "just knows" how to do it. Few-shot adds a quick tutorial in the prompt. Fine-tuned uses thousands of examples to update weights. Choose based on available data, required accuracy, and cost constraints.

Q46

What does "few-shot" learning look like in prompts and why does it work?

Few-shot learning provides a small number of input-output examples directly in the prompt before the actual query.

Few-Shot Prompt Structure

          # Examples (few-shot)

          Input: "The movie was fantastic" → Sentiment: Positive

          Input: "Terrible waste of time" → Sentiment: Negative

          Input: "I loved every minute" → Sentiment: Positive

          # Actual query

          Input: "Mediocre at best" → Sentiment: ???

Why Few-Shot Works	Mechanism
In-context learning (ICL)	Examples create a local pattern the model's autoregressive generation follows — "copies the format" but generalizes
Task specification	Examples implicitly define the task more precisely than instructions alone — shows format, style, expected output
Distribution anchoring	Shifts the model's output distribution toward the desired pattern without any weight updates

Best practices: Use diverse, representative examples. Order can matter (recency bias — later examples weigh more). More examples help up to a point, then returns diminish. For structured tasks, consistent formatting is critical.

→ Why It's Powerful

Few-shot works because Transformers can implicitly learn from examples during a single forward pass. No weight updates — just context. Research suggests ICL partly works through implicit gradient descent in the forward pass (the Transformer "simulates" learning).

Q47

What is a context window and what are the practical limits?

The context window is the maximum number of tokens a model can process in a single forward pass — the model's "working memory."

Model	Context Window	≈ Pages of Text
GPT-4o	128K tokens	~300 pages
Claude 3.5 (Sonnet)	200K tokens	~500 pages
Gemini 1.5 Pro	2M tokens	~5,000 pages
LLaMA 3 (open source)	128K tokens	~300 pages

"Lost in the Middle" — Attention Quality vs Position

Beginning
High recall ✓

Middle
Degraded recall ⚠️

End
High recall ✓

Context window position → Put critical info at start or end

Practical limits beyond raw size: Self-attention is O(n²) — doubling context quadruples compute. Effective context < maximum context (needle-in-a-haystack tests show degraded recall as context grows). Latency scales with context length.

Best practice: Use RAG to retrieve and inject only relevant context rather than stuffing the entire window. Quality of context >> quantity of context.

→ Production Principle

Context windows are growing fast, but effective utilization hasn't kept pace. RAG beats brute-force long context for most use cases. Always put the most important information at the beginning or end of the context, not the middle.

Q48

What is a hyperparameter (vs a learned parameter)?

	Learned Parameters	Hyperparameters
What they are	Values the model discovers via gradient descent	Values set by the engineer before training
Updated by	Optimizer (Adam, SGD)	Never — set manually or via search
Examples	Attention weight matrices, embedding vectors, layer norm scales	Learning rate, batch size, number of layers, heads
Count in GPT-4	~1.8 trillion (rumored)	Dozens of key decisions

Three types of hyperparameters:

Architecture: Number of layers, hidden dimension, number of attention heads, vocabulary size, FFN intermediate size — define the model's capacity.

Training: Learning rate, batch size, number of epochs, warmup steps, weight decay, dropout rate, gradient clipping — control the training dynamics.

Generation: Temperature, top-p, top-k, max tokens, repetition penalty — control the model's output at inference time.

→ Key Insight

A 70B parameter model has 70B learned parameters, but the hyperparameters that shaped it (learning rate schedule, architecture choices, data mix) are just as important to its quality. Learned parameters = what the model knows. Hyperparameters = the decisions you make about how it learns.

Saurabh Singh — AI Engineer & Builder

linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7

Part IX

Safety & Production

Questions 49 – 50

Responsible deployment: harm mitigation, output filtering, layered defenses, and the operational pitfalls that sink LLM products in the real world.

Output Safety

Guardrails

Production Pitfalls

Monitoring

Human-in-the-Loop

Hallucination Control

Q49

If the model outputs harmful or wrong content, what response strategy do you use?

No single layer is sufficient. Production-grade LLM systems use a layered defense model — multiple independent mechanisms at different stages of the pipeline.

Layered Defense Architecture

LAYER 1

Input Filtering

Reject or transform harmful, off-topic, or adversarial inputs before they reach the model

Prompt injection
Jailbreak detection
PII scrubbing

↓

LAYER 2

System Prompt & Context

Behavioral constraints via carefully engineered system prompts that define scope, tone, and hard refusals

Role definition
Topic boundaries
Safety instructions

↓

LAYER 3

Output Filtering

Post-generation classifiers scan outputs for toxicity, hallucinations, and policy violations before delivery

Toxicity classifier
Format validator
Sensitive data check

↓

LAYER 4

Factual Verification

Ground claims against retrieved evidence (RAG) or run a separate verifier model to detect hallucinations

Source attribution
Confidence scoring
Claim verification

↓

LAYER 5

Human-in-the-Loop

High-stakes decisions (medical, legal, financial) route to human reviewers before acting on model output

Review queues
Confidence threshold
Escalation policy

↓

LAYER 6

Monitoring & Feedback Loop

Log all inputs, outputs, and user feedback. Detect drift, emerging attack patterns, and failure modes at scale

Logging pipeline
A/B testing
Red-teaming

Response strategies for specific failure modes:

Failure Mode	Strategy
Harmful content (violence, CSAM, hate)	Hard refusal at input and output layers; incident logging; no graceful fallback
Hallucination	Ground with RAG; add uncertainty language ("I'm not certain, but..."); cite sources
Off-topic drift	System prompt constraints; semantic similarity check on output; graceful redirect
Prompt injection	Input sanitization; treat user input as untrusted; separate system/user namespaces
Jailbreaks	Layered classifiers; behavioral analysis; model-level RLHF; adversarial fine-tuning
PII leakage	Regex + NER-based scrubbing at input; output diff against training data; data minimization

→ Core Principle

Safety is not a feature — it's a system property. No single filter is reliable enough alone. Build defense-in-depth: assume each layer will occasionally fail, and the combination keeps the system acceptable. Document every layer and its failure rate so you can reason about system-level risk.

Q50

What are the most common pitfalls when deploying LLMs in production?

Most LLM deployments fail not because of model quality, but because of infrastructure, cost, and expectation gaps that weren't anticipated during prototyping.

8 Production Pitfalls

01 · Latency Underestimation

Demos run locally with a single request. Production has concurrent users, cold starts, and token queuing. P99 latency can be 10× the median. Measure tail latency, not averages.

02 · Cost Explosion

Token costs compound fast. A single GPT-4 call that works in demos becomes $10k/month at scale. Model many user journeys, cache aggressively, and right-size — use small models for classification steps.

03 · Prompt Brittleness

Prompts that worked in dev break when user inputs vary. Build a prompt regression suite. Version control your prompts. Test against adversarial inputs before launch.

04 · No Observability

Can't debug what you can't see. Log all inputs, outputs, token counts, and latencies. Tag sessions with user intent labels. Without this, you're flying blind when the model starts misbehaving.

05 · Model Version Drift

API providers silently update model versions. A prompt tuned for gpt-4-0613 may produce different outputs on gpt-4-1106. Pin model versions in production; test before upgrading.

06 · Hallucination at Scale

A 1% hallucination rate sounds fine — until you have 10,000 users/day and 100 confident wrong answers daily. Build domain-specific evals, add retrieval grounding, and set user expectations early.

07 · Context Window Overflow

Conversations grow long. Naive concatenation hits the context limit and truncates — usually at the most important part. Use sliding window, summarization, or a dedicated memory module for long sessions.

08 · No Fallback Strategy

API outages happen. Rate limits hit. What does your product do when the LLM is unavailable? Design graceful degradation: cached answers, simplified rules-based fallback, or a clear "temporarily unavailable" state.

Quick-reference checklist before launch:

Area	Pre-launch Check
Cost	Model 95th-percentile usage; set API spend alerts; enable caching
Latency	Load test at 5× expected peak; stream responses where possible
Safety	Red-team before launch; output filters live; PII handling documented
Reliability	Fallback path tested; circuit breaker for API failures; retry with backoff
Observability	All calls logged; dashboards for error rates and latency; user feedback button
Eval	Domain-specific eval suite; baseline metrics set; alerts for regression

→ Final Word

The gap between "it works in a notebook" and "it works in production at scale" is enormous for LLMs. The model is often the easiest part — the hard parts are the surrounding system: cost management, latency control, safety monitoring, and graceful degradation when things go wrong. Design the system, not just the prompt.

Saurabh Singh — AI Engineer & Builder

linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7

Complete

All 50 Questions.
Covered.

From tokenization to transformer internals, from fine-tuning strategies to production safety — this handbook is your compact reference for LLM interviews and real-world AI engineering.

Questions

Topic Areas

10+

Visual Diagrams

Saurabh Singh

AI Engineer & Builder

linkedin.com/in/iamsausi medium.com/@sausi github.com/sausi-7

RUN IT YOURSELF

Softmax with temperature

How an LLM turns raw scores (logits) into next-token probabilities — the softmax, with the temperature knob that makes sampling greedier or more random. Real Python, running live. Edit the temperature and hit Run.

HOW TO READ THE CODE — 4 IDEAS

Logits are raw, unbounded scores — one per candidate token.
Softmax exponentiates and normalises them into probabilities that sum to 1 (steps 1, 3).
Subtracting the max first is a numerical-stability trick (step 2), it doesn't change the result.
Temperature < 1 sharpens the distribution (greedier); > 1 flattens it (more random).

CPython · WebAssembly

Finished this one? 0 / 30 Handbooks done

Explore the topic

See this alongside everything else on the same subject — handbooks, system designs, challenges and tools, in one place.

AI Agents & Tools Interview Prep

50 LLMInterviewQuestions

What'sInside

LLMFoundations

In simple terms, what is an LLM (large language model)?

How do modern LLMs differ from older, classic language models?

What does the term "foundation model" mean?

In practice, what changed going from GPT-3 to GPT-4?

What is tokenization and why does it matter for LLM behavior/costs?

Why are embeddings useful, and where do they show up in LLM systems?

Architectures &Mechanisms

How do models deal with words or strings they've never seen before (OOV)?

What is a seq2seq model and what problem does it solve?

Encoder vs decoder: what does each one do?

Autoregressive vs masked models: what's the core difference?

What is masked language modeling and why is it a good pretraining objective?

What is next sentence prediction (NSP) and when is it used?

Transformers &Attention

Why did Transformers outperform older seq2seq architectures?

What are positional encodings and why are they needed?

What is attention, and what does it enable in Transformers?

What does "multi-head" attention add beyond single-head attention?

How are attention weights/scores computed at a high level?

Why does attention use softmax?

Decoding &Generation

Where does the dot product show up in self-attention and why?

Beam search vs greedy decoding: how do they differ and when would you use each?

What does temperature control in text generation?

Top-k vs top-p (nucleus) sampling: how are they different?

What is adaptive softmax and how can it improve efficiency?

What is cross-entropy loss and why is it common in language modeling?

Training &Optimization

How do gradients flow to/update embedding vectors during training?

What is the Jacobian and why does it matter in backprop?

How is the chain rule applied in deep learning training?

What is ReLU and why is it a popular activation?

What causes vanishing gradients, and how do Transformers mitigate it?

In dimensionality reduction, what do eigenvalues/eigenvectors represent?

Evaluation &Fine-Tuning

What is KL divergence and how is it used in ML/LLM evaluation?

What are LoRA and QLoRA, and why do people use them?

What is catastrophic forgetting and what are common mitigations?

How does PEFT reduce forgetting compared to full fine-tuning?

What is knowledge distillation and how is it applied to LLMs?

What is overfitting in this context, and how do you reduce it?

Generative AIConcepts

Generative vs discriminative models: what's the distinction?

Discriminative AI vs generative AI: how would you explain it to a PM?

How does prompt design influence LLM outputs?

What is chain-of-thought prompting, and when is it helpful?

What are the main stages of a RAG pipeline?

How can a knowledge graph improve retrieval + generation?

Multimodal &Scaling

How do multimodal models like Gemini improve stability/efficiency compared to prior approaches?

How does Mixture-of-Experts (MoE) help scale models?

What does "zero-shot" mean for LLMs?

What does "few-shot" learning look like in prompts and why does it work?

What is a context window and what are the practical limits?

What is a hyperparameter (vs a learned parameter)?

Safety & Production

If the model outputs harmful or wrong content, what response strategy do you use?

What are the most common pitfalls when deploying LLMs in production?

All 50 Questions.Covered.

Softmax with temperature

Explore the topic

More Handbooks

Explore more from Vibe Engines

Get the next one in your inbox.

50 LLM
Interview
Questions

What's
Inside

LLM
Foundations

Architectures &
Mechanisms

Transformers &
Attention

Decoding &
Generation

Training &
Optimization

Evaluation &
Fine-Tuning

Generative AI
Concepts

Multimodal &
Scaling

All 50 Questions.
Covered.