Paper 01 · ~6 min video · NeurIPS 2017

Paper Breakdown

Attention Is All You Need, explained.

Q: What is the Transformer architecture?

The Transformer is a neural network architecture introduced in 'Attention Is All You Need' (2017) that relies entirely on self-attention instead of recurrence or convolutions to process sequences. It became the foundation for GPT, BERT, Claude, Gemini, and nearly every modern large language model.

Q: What is self-attention?

Self-attention lets every element in a sequence look directly at every other element simultaneously and weigh how relevant each one is, instead of passing information one step at a time like a recurrent network. It's computed via Query, Key, and Value projections: compare a query against every key to get relevance weights, then blend the values accordingly.

Q: What is multi-head attention?

Multi-head attention runs the self-attention mechanism multiple times in parallel (8 times in the original paper), each with its own learned projections, so the model can attend to different kinds of relationships — grammar, coreference, position — at once. The results are concatenated and combined into one output.

The AI assistants you've most likely used, such as ChatGPT, Gemini, and Claude, all trace back to one 15-page paper from 2017. This is a narrated, animated walk through the Transformer: self-attention, Query/Key/Value, multi-head attention, positional encoding, the full architecture, and the real results that made the field stop and pay attention.

The old way: AI as a game of telephone

Before this paper, the dominant sequence models read sentences the way you'd play a game of telephone. One word passes its meaning to the next, which passes it to the next, in a strict single-file line. That is a recurrent neural network (RNN). By the time the model reaches word thirty, the signal from word one has often degraded or vanished entirely.

Worse, because each word has to wait for the one before it, this process can't be parallelized within a single sequence. Throwing more GPUs at an RNN doesn't make one sentence process faster — it's fundamentally, stubbornly sequential.
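To make the "single-file line" concrete, here is a minimal sketch of a recurrent update. It is a toy scalar recurrence, not the paper's baseline: the function names and the weights `w_h` and `w_x` are made up for illustration. The point is only that each step needs the state from the step before it, and that a signal injected at the first step shrinks as it is passed along.

```python
import math

def rnn_step(h_prev, x, w_h=0.5, w_x=1.0):
    # Toy scalar recurrence (hypothetical weights): the new state is a
    # function of the previous state and the current input.
    return math.tanh(w_h * h_prev + w_x * x)

def run_rnn(inputs):
    h = 0.0
    states = []
    for x in inputs:
        # This loop cannot be parallelized across time steps:
        # step t cannot start until step t-1 has produced h.
        h = rnn_step(h, x)
        states.append(h)
    return states

# A signal at the first position only, followed by silence.
states = run_rnn([1.0, 0.0, 0.0, 0.0, 0.0])
for t, h in enumerate(states, start=1):
    print(f"step {t}: state = {h:.4f}")
```

With these toy weights the state decays at every step after the first, which is the telephone-game effect in miniature. Real RNNs with gating (LSTM, GRU) slow this decay but keep the step-by-step dependency.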

The big idea: self-attention

The authors asked a deceptively simple question: what if no word had to wait in line at all? What if every word could look directly at every other word, all at once, no matter how far apart they are?

It's like being at a party where you can tune into any conversation in the room instantly, and turn the volume up or down on each one depending on how relevant it is to you right now. That mechanism is self-attention — and the paper's entire bet was that you don't need recurrence, and you don't need convolutions. Attention alone is enough.

Query, Key, Value

So how does one word actually "look at" another? With three simple ideas: a Query, a Key, and a Value.

Think of it like searching a library. Your Query is what you're looking for. Every word carries a Key, like a label on a book's spine — you compare your query against every key to get a relevance score. Then you pull from each word's Value, its actual content, weighted by how relevant it was. Highly relevant words you read closely; irrelevant ones you barely glance at. Do this for every word against every other word, and each word ends up with a representation soaked in exactly the context that matters.
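The library analogy maps onto a few lines of code. The sketch below handles a single query against three keys, using plain Python lists and made-up numbers. In the real model the queries, keys, and values are learned projections of the word embeddings, and the scores are also scaled, which this sketch leaves out to keep the three steps visible.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # 1. Compare the query against every key (dot product = relevance score).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # 2. Turn the scores into weights that sum to 1.
    weights = softmax(scores)
    # 3. Blend the values, weighted by relevance.
    dim = len(values[0])
    blended = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, blended

# Illustrative numbers only.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, blended = attend(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("blended:", [round(b, 3) for b in blended])
```

The second key is orthogonal to the query, so its value contributes little to the blend; the first and third keys match the query equally and share most of the weight.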

Scaled Dot-Product Attention

The paper's core formula: Attention(Q,K,V) = softmax( QK^T / √d_k ) V.

Comparing a query and key with a dot product can produce very large numbers when the vectors are long, and large scores push the softmax into a near all-or-nothing regime where gradients become extremely small, which is terrible for learning. So the paper divides every score by the square root of the key dimension, √d_k, before the softmax: one line of math that keeps training stable.
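A small sketch of the formula, written with plain Python lists so it runs without any tensor library. The `scale` flag is an addition for illustration, not part of the paper: it lets you switch the √d_k division off and see how much sharper the softmax becomes. The inputs are made-up vectors with d_k = 64.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V, scale=True):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of rows."""
    d_k = len(K[0])
    divisor = math.sqrt(d_k) if scale else 1.0
    weights = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / divisor for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
        for row in weights
    ]
    return weights, output

# One query and three fairly similar keys (illustrative values).
d_k = 64
Q = [[1.0] * d_k]
K = [[1.0] * d_k, [0.9] * d_k, [0.8] * d_k]
V = [[1.0], [2.0], [3.0]]

scaled_w, scaled_out = scaled_dot_product_attention(Q, K, V, scale=True)
raw_w, raw_out = scaled_dot_product_attention(Q, K, V, scale=False)

print("with scaling:   ", [round(w, 4) for w in scaled_w[0]])
print("without scaling:", [round(w, 4) for w in raw_w[0]])
```

Without the division, almost all of the weight lands on the first key even though the three keys are close to one another. With it, the weights stay spread out enough for gradients to flow to all three.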

Multi-Head Attention

Instead of computing attention once, the Transformer runs it eight times in parallel, each with its own learned projections. One head might specialize in grammar, another in what a pronoun refers to, another in position — like sending eight detectives to examine the same sentence, each hunting for a different kind of clue, then pooling their notes into one conclusion.
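Here is a rough sketch of the "eight detectives" idea at toy scale. The paper uses 8 heads with a model width of 512, so each head works in 64 dimensions; this sketch uses 2 heads and a width of 8 so it runs instantly. The projection matrices are random stand-ins for weights that would be learned in training, and helper names such as `random_matrix` are hypothetical.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]           # transpose K
    scores = matmul(Q, K_t)                        # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    # Stand-in for a learned projection matrix.
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, num_heads, rng):
    d_model = len(X[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Each head gets its own query, key, and value projections.
        W_q = random_matrix(d_model, d_head, rng)
        W_k = random_matrix(d_model, d_head, rng)
        W_v = random_matrix(d_model, d_head, rng)
        head_outputs.append(
            attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        )
    # Pool the notes: concatenate the heads per position, then mix them.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(X))]
    W_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, W_o)

rng = random.Random(0)
X = random_matrix(4, 8, rng)       # 4 tokens, model width 8 (toy sizes)
out = multi_head_attention(X, num_heads=2, rng=rng)
print(len(out), "tokens x", len(out[0]), "features")
```

Because each head works in a smaller subspace (width divided by the number of heads), running several heads costs roughly the same as one full-width attention pass.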

The full architecture

Stack that idea into a full architecture and you get the Transformer: an encoder that reads the input, and a decoder that writes the output, each built from six identical layers. Every encoder layer has two sub-layers, self-attention and then a small position-wise feed-forward network, and each sub-layer is wrapped in a residual connection followed by layer normalization so information flows cleanly even six layers deep. Decoder layers add a third sub-layer that attends over the encoder's output.
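The wrapper around each sub-layer follows the pattern LayerNorm(x + Sublayer(x)). The sketch below shows it for a single token vector. It is simplified on purpose: the layer norm here has no learned gain and bias, and `toy_feed_forward` is a hypothetical stand-in for the real two-layer feed-forward network.

```python
import math

def layer_norm(x, eps=1e-6):
    # Normalize one vector to zero mean and unit variance.
    # (Learned gain and bias parameters are omitted in this sketch.)
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_feed_forward(x):
    # Hypothetical stand-in for a sub-layer: a ReLU followed by a fixed scale.
    return [max(0.0, v) * 0.5 for v in x]

def sublayer_connection(x, sublayer):
    # Residual connection: add the sub-layer's output back onto its input,
    # then normalize the sum.
    residual = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual)

x = [1.0, -2.0, 3.0, 0.5]
y = sublayer_connection(x, toy_feed_forward)
print([round(v, 4) for v in y])
```

The residual path means the input always has a direct route to the output, so a layer only has to learn a correction rather than rebuild the representation from scratch.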

The decoder gets one extra rule: it's masked so it can't peek at future tokens while generating, or it could just cheat by reading the answer it's supposed to produce.
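One way to picture the mask: before the softmax, every score that points at a future position is replaced with negative infinity, so its weight comes out as exactly zero. The sketch below uses uniform made-up scores for four positions to make the pattern easy to see; the function names are illustrative.

```python
import math

def causal_mask(n):
    # Position i may look at position j only if j is not in the future.
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    # Disallowed positions get -inf, which becomes weight 0 after exp().
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because each position can always see itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

n = 4
scores = [[0.0] * n for _ in range(n)]   # uniform scores, for illustration
mask = causal_mask(n)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]

for row in weights:
    print([round(w, 2) for w in row])
```

The first position can only attend to itself, the second splits its attention over the first two, and so on down to the last position, which sees everything before it.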

Positional encoding

If every word looks at every other word all at once, how does the model know what order they're in? Attention on its own has no sense of sequence. The fix is almost poetic: give every word a unique positional fingerprint made of overlapping sine and cosine waves at different frequencies — like a wristwatch reading of exactly where it sits in the sentence. Add that fingerprint in before anything else happens, and order is baked in without a single recurrent step.
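The fingerprint comes from the paper's formula: even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine. A minimal sketch with plain lists follows; the sizes (50 positions, width 16) and the all-zeros embedding are arbitrary choices for illustration.

```python
import math

def positional_encoding(num_positions, d_model):
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):
            # i is the even dimension index, so i / d_model equals 2k / d_model
            # for the k-th sine/cosine pair.
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))   # even dimension
            row.append(math.cos(angle))   # odd dimension
        pe.append(row[:d_model])
    return pe

pe = positional_encoding(num_positions=50, d_model=16)

# The encoding is added to the token embedding before the first layer.
embedding = [0.0] * 16                    # placeholder embedding for one token
position = 3
encoded = [e + p for e, p in zip(embedding, pe[position])]

print([round(v, 3) for v in pe[0][:4]])
print([round(v, 3) for v in pe[1][:4]])
```

Low dimensions oscillate quickly and high dimensions slowly, a bit like the second, minute, and hour hands of the wristwatch in the analogy, so each position gets its own combination of readings.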

Why it's faster

A recurrent network needs one sequential step per token, so a hundred-word sentence takes a hundred unavoidable steps. Self-attention connects any two words in exactly one step, no matter the distance — the whole sequence gets processed at once, like a room full of people all talking simultaneously instead of one by one.

The real results

On WMT 2014 English-to-German translation, the Transformer (big) scored 28.4 BLEU, beating every previous best, including model ensembles, by more than two points. Even the base model outscored every earlier system in the table below at a fraction of the training compute. On English-to-French the big model set a new single-model record of 41.8 BLEU, trained in 3.5 days on eight GPUs.

Model	EN→DE BLEU	Training cost
ByteNet	23.75	—
GNMT + RL	24.6	2.3 · 10¹⁹ FLOPs
ConvS2S	25.16	9.6 · 10¹⁸ FLOPs
MoE	26.03	2.0 · 10¹⁹ FLOPs
Transformer (base)	27.3	3.3 · 10¹⁸ FLOPs
Transformer (big)	28.4	2.3 · 10¹⁹ FLOPs
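To put the "fraction of the compute" point in numbers, the short script below redoes the arithmetic from the table above for the base model. The figures are copied from the table and nothing else is assumed. Note that the big model's listed cost matches GNMT + RL's, so the compute saving shows most clearly for the base model.

```python
# (model, EN-DE BLEU, training cost in FLOPs), copied from the table above.
rows = [
    ("ByteNet", 23.75, None),
    ("GNMT + RL", 24.6, 2.3e19),
    ("ConvS2S", 25.16, 9.6e18),
    ("MoE", 26.03, 2.0e19),
    ("Transformer (base)", 27.3, 3.3e18),
    ("Transformer (big)", 28.4, 2.3e19),
]

base = next(r for r in rows if r[0] == "Transformer (base)")
earlier = [r for r in rows if not r[0].startswith("Transformer")]

comparison = {}
for name, bleu, cost in earlier:
    bleu_gain = base[1] - bleu
    # ByteNet has no training cost listed, so its ratio stays undefined.
    cost_ratio = base[2] / cost if cost is not None else None
    comparison[name] = (round(bleu_gain, 2), cost_ratio)

for name, (gain, ratio) in comparison.items():
    ratio_text = "n/a" if ratio is None else f"{ratio:.2f}x"
    print(f"{name}: base is +{gain} BLEU at {ratio_text} the training cost")
```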

Skeptics might say it only works for translation — so the authors threw it at a completely different task, English constituency parsing, with almost no tuning. It still beat nearly every specialized model built specifically for that job. The architecture wasn't a translation trick; it was a general-purpose idea.

Why it still matters

That general-purpose idea is why, years later, this paper still matters. GPT, BERT, Claude, Gemini — strip away the branding, and the Transformer is the engine underneath every one of them. One architecture, one core idea — let every element look directly at every other element — quietly became the foundation of modern AI.

→ Worth knowing

Most modern LLMs (GPT-family, Claude, Gemini) use a decoder-only variant of this architecture, with no separate encoder. The self-attention and multi-head attention ideas from this paper remain at the core, as does the need to inject position information, although the exact positional scheme often differs from the sinusoidal one described here.

Frequently asked

Quick answers

What is the Transformer architecture?

A neural network architecture that relies entirely on self-attention instead of recurrence or convolutions to process sequences. It's the foundation of GPT, BERT, Claude, Gemini, and nearly every modern LLM.

What is self-attention?

A mechanism that lets every element in a sequence look directly at every other element at once and weigh relevance, computed via Query/Key/Value: compare a query against every key, then blend the values by relevance.

What is multi-head attention?

Running self-attention multiple times in parallel (8 in the original paper), each with its own learned projections, so the model attends to different kinds of relationships at once — then combining the results.

Why is self-attention faster than RNNs?

An RNN needs one sequential step per token. Self-attention connects any two positions in a constant number of operations and processes the whole sequence in parallel.

Is this the architecture behind ChatGPT?

Yes — GPT models, BERT, Claude, and Gemini are all built on the Transformer this paper introduced, typically a decoder-only variant of it.
