The Transformer Architecture Explained
The transformer powers every major LLM. Learn how attention, residual connections, layer normalization, and feed-forward networks combine into the architecture behind GPT, Claude, and Gemini.
In June 2017, eight Google researchers published a paper titled "Attention Is All You Need." It proposed a new architecture called the transformer that replaced recurrence and convolution with a single mechanism: attention. Within two years, it had become the foundation of every frontier language model. Today, GPT, Claude, Gemini, Llama, and Mistral are all transformers.
Here's how the architecture works, piece by piece.
Before transformers: the RNN bottleneck
Recurrent neural networks (RNNs) and their improved variants (LSTMs, GRUs) processed text sequentially — one token at a time, passing a hidden state forward. This had two problems:
- No parallelism: Each step depended on the previous step's output. You couldn't process token 50 until tokens 1-49 were done. GPUs, which excel at parallel computation, were underutilized.
- Information decay: Information from early tokens had to survive being passed through dozens of sequential steps. Despite gating mechanisms in LSTMs, long-range dependencies were still hard to capture.
The transformer solved both problems. Attention gives every token direct access to every other token (no decay), and all attention computations happen in parallel (full GPU utilization).
The original architecture: encoder-decoder
The 2017 paper proposed an encoder-decoder architecture for machine translation:
- Encoder: Processes the input sentence (e.g., French). Each layer applies self-attention (every input token attends to every other input token) followed by a feed-forward network. The output is a rich representation of the input.
- Decoder: Generates the output sentence (e.g., English) one token at a time. Each layer applies causal self-attention (each output token attends only to previous output tokens), then cross-attention (each output token attends to the encoder's representations), then a feed-forward network.
This design was elegant for translation. But the field quickly discovered that you could get remarkable results with just half the architecture.
Three flavors of transformer
| Architecture | Example Models | Use Case |
|--------------|----------------|----------|
| Encoder-only | BERT, RoBERTa | Understanding: classification, NER, search |
| Decoder-only | GPT, Claude, Llama | Generation: chatbots, code, creative writing |
| Encoder-decoder | T5, BART, original transformer | Sequence-to-sequence: translation, summarization |
The modern LLM landscape is dominated by decoder-only transformers. GPT-1 (2018) showed that a decoder-only transformer, pre-trained on next-token prediction, could be fine-tuned for many tasks. GPT-2 (2019) showed it could perform tasks zero-shot. GPT-3 (2020) showed it could do so remarkably well. The rest is scaling history.
Inside a transformer layer
Every transformer layer contains the same two sub-components; a model simply stacks dozens to hundreds of these identical layers:
1. Multi-head self-attention
Each token generates Query, Key, and Value vectors. The attention mechanism computes how much each token should attend to every other token, then produces a weighted combination of Value vectors. Multi-head attention runs this process multiple times in parallel with different learned projections, capturing different types of relationships simultaneously.
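The process above can be sketched in a few lines of NumPy. This is an illustrative toy, not a production implementation: the function names and weight shapes are assumptions, and it omits causal masking, biases, and batching.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x: (seq, d_model); all projection matrices are (d_model, d_model)."""
    seq, d_model = x.shape
    d_head = d_model // n_heads

    # Project to Q, K, V, then split into heads: (n_heads, seq, d_head).
    # Each head gets its own slice of the projections, so each head can
    # learn to capture a different type of relationship.
    def split(m):
        return m.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # Scaled dot-product attention, computed for all heads in parallel:
    # every token scores every other token, then takes a weighted
    # combination of Value vectors.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    weights = softmax(scores, axis=-1)
    out = weights @ v                                    # (n_heads, seq, d_head)

    # Concatenate heads back together and apply the output projection.
    out = out.transpose(1, 0, 2).reshape(seq, d_model)
    return out @ w_o
```

Note that every token's scores against every other token are computed in one matrix multiplication, which is exactly the parallelism RNNs lacked.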
For details on the attention math, see our deep dive on how the attention mechanism works.
2. Feed-forward network (FFN)
After attention, each token's representation passes through a two-layer neural network (typically with a hidden dimension 4x the model dimension). This operates on each token independently — no interaction between tokens.
Research suggests the FFN layers serve as knowledge stores. While attention routes information between tokens, FFNs transform individual representations and are where much of the model's factual knowledge is encoded. Studies have shown that specific neurons in FFN layers activate for specific facts ("The Eiffel Tower is in [Paris]") and that editing these neurons can change the model's factual outputs.
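As a sketch, the position-wise FFN is just two matrix multiplications with a nonlinearity between them. ReLU is used here, as in the original paper; modern models typically use GELU or gated variants like SwiGLU. The function name and shapes are illustrative:

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: expand to the hidden dimension (typically
    4 * d_model), apply a nonlinearity, project back down.
    x: (seq, d_model). Each row (token) is transformed independently,
    so there is no cross-token interaction here."""
    hidden = np.maximum(0, x @ w1 + b1)  # ReLU
    return hidden @ w2 + b2
```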
The critical connective tissue
Two mechanisms make deep transformers trainable:
Residual connections (skip connections)
Each sub-component (attention, FFN) adds its output to its input rather than replacing it:
output = sublayer(x) + x
Without residual connections, a 96-layer model would be nearly impossible to train — gradients would vanish or explode as they propagated through dozens of nonlinear transformations. The skip connection provides a direct gradient highway from output to input, enabling very deep networks.
This is the same idea from ResNets (2015) applied to transformers. It's simple but fundamental — without it, scaling beyond a few layers would fail.
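The one-line formula above is the entire mechanism. A minimal sketch, with the sublayer passed in as a function for illustration:

```python
import numpy as np

def residual(sublayer, x):
    # output = sublayer(x) + x: the identity path carries the input
    # (and, during training, the gradient) straight through the block,
    # regardless of what the sublayer does.
    return sublayer(x) + x
```

Even if a sublayer contributes nothing, chaining 96 of these blocks still passes the input through unchanged, which is why gradients survive extreme depth.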
Layer normalization
After (or before) each sub-component, the representation is normalized:
LayerNorm(x) = γ · (x - μ) / √(σ² + ε) + β
This stabilizes training by preventing activations from growing or shrinking across layers. The original transformer used post-norm (normalize after the residual addition). Modern models like GPT and Llama use pre-norm (normalize before the sub-component), which is more stable for very deep networks.
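The formula translates directly to code. A NumPy sketch of the normalization and of both orderings (function names are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """gamma * (x - mean) / sqrt(var + eps) + beta, normalizing over the
    feature dimension of each token. gamma and beta are learned."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def post_norm_block(x, sublayer, gamma, beta):
    # Original 2017 ordering: add the residual first, then normalize.
    return layer_norm(x + sublayer(x), gamma, beta)

def pre_norm_block(x, sublayer, gamma, beta):
    # GPT/Llama ordering: normalize the input to the sublayer, keep the
    # residual path untouched -- more stable for very deep stacks.
    return x + sublayer(layer_norm(x, gamma, beta))
```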
The combination of residual connections + layer normalization is what lets transformers scale to 120+ layers. Remove either one and training collapses.
Positional encoding: Teaching order
Attention is inherently permutation-invariant: it treats the input as a set, not a sequence. "The cat sat on the mat" and "mat the on sat cat the" produce the same attention scores, just shuffled (before positional information is added).
Since word order matters, the original transformer added sinusoidal positional encodings to the input embeddings — fixed patterns of sines and cosines that encode each position uniquely.
Modern models use Rotary Position Embeddings (RoPE), which encode relative positions directly into the attention computation. RoPE enables better generalization to sequence lengths longer than those seen during training, which is critical for long-context models.
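The original sinusoidal scheme can be sketched directly from the paper's formulas, PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)); the function name is illustrative:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed positional encodings: each position gets a unique pattern
    of sines and cosines at geometrically spaced frequencies, added to
    the token embeddings (no learned parameters)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dims: (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even indices: sine
    pe[:, 1::2] = np.cos(angles)               # odd indices: cosine
    return pe
```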
Scaling laws: From 117M to trillions
The transformer's power lies not just in its architecture but in how predictably it scales:
| Model | Year | Parameters | Layers | Context |
|-------|------|------------|--------|---------|
| GPT-1 | 2018 | 117M | 12 | 512 |
| GPT-2 | 2019 | 1.5B | 48 | 1,024 |
| GPT-3 | 2020 | 175B | 96 | 2,048 |
| GPT-4 | 2023 | ~1.8T (MoE) | ~120 | 128,000 |
Each generation increased parameters by 10-100x and performance improved reliably. This isn't luck — it follows empirically observed scaling laws (Kaplan et al., 2020): model performance improves as a predictable power law function of compute, parameters, and data.
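The power-law form can be illustrated with a toy function. The exponent, scale, and floor here are placeholders chosen for illustration, not the fitted constants from Kaplan et al.:

```python
def power_law_loss(compute, loss_floor=0.0, scale=1.0, alpha=0.05):
    """Illustrative scaling-law shape: loss falls as a power of compute.
    All constants are placeholders, not empirical fits."""
    return loss_floor + scale * compute ** (-alpha)
```

The key property is that every doubling of compute cuts the reducible loss by the same constant factor, 2^(-alpha), which is what makes performance predictable across many orders of magnitude.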
The Mixture of Experts (MoE) architecture used in GPT-4 and others adds a twist: instead of one massive feed-forward network, the model has multiple "expert" FFNs and a routing mechanism that activates only a subset for each token. This increases total parameters (and knowledge capacity) without proportionally increasing compute per token.
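A toy sketch of top-k expert routing makes the compute argument concrete. The router design, expert count, and renormalization details below are illustrative assumptions; real MoE layers batch this work and add load-balancing losses:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x, router_w, experts, k=2):
    """Route each token to its top-k experts and combine their outputs,
    weighted by renormalized router scores. experts is a list of
    callables (each a small FFN). Only k of len(experts) experts run
    per token, so compute per token stays roughly flat while total
    parameters grow with the expert count."""
    outputs = np.zeros_like(x)
    for t, token in enumerate(x):
        scores = softmax(token @ router_w)         # (n_experts,)
        top = np.argsort(scores)[-k:]              # indices of top-k experts
        weights = scores[top] / scores[top].sum()  # renormalize over chosen k
        for w, e in zip(weights, top):
            outputs[t] += w * experts[e](token)
    return outputs
```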
Why this architecture won
The transformer succeeded because of a rare combination:
- Parallelism: Fully utilizes modern GPU hardware (unlike RNNs)
- Expressiveness: Attention can model any pairwise relationship
- Scalability: Performance improves predictably with scale
- Simplicity: The same block, repeated. No complex recurrence or gating
- Flexibility: The same architecture handles text, code, images, audio, and video
Nine years after the original paper, no fundamentally different architecture has displaced it. There are innovations within the framework (MoE, RoPE, GQA, FlashAttention), but the core — attention + FFN + residual connections + layer norm — remains.
Explore the layer stack interactively
See how data flows through the transformer stack in our interactive layer visualization — adjust the depth slider and watch how representations evolve from raw tokens to rich contextual encodings.