Transformer Architecture Explained
In 2017, researchers at Google published "Attention Is All You Need", a paper that dropped recurrence entirely and built a sequence model around attention alone. Seven years later, most headline AI systems (GPT-4, Gemini, Llama, and recent versions of Stable Diffusion) build on this architecture.
1. Why Not RNNs?
Before Transformers, sequence models relied on Recurrent Neural Networks (RNNs) and their variants (LSTM, GRU). These process tokens one at a time, maintaining a hidden state that summarises everything seen so far.
Two fundamental problems:
- Sequential bottleneck: You cannot parallelise over token positions — step N depends on step N−1. On modern GPUs with thousands of cores, most hardware sits idle.
- Vanishing gradients / long-range forgetting: Gradients can shrink exponentially as they flow backward through hundreds of steps, so relationships between distant tokens (e.g., a pronoun and its antecedent 500 tokens apart) are hard to learn.
The Transformer solves both by computing relationships between all token pairs simultaneously — with full parallelism and no distance penalty.
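To make the vanishing-gradient problem concrete, consider a toy linear recurrence with a scalar hidden state, h_t = w * h_(t-1) + x_t. This is an illustrative simplification rather than a real RNN cell, and the weight w = 0.9 and the function name are assumptions chosen for the example. By the chain rule, the gradient of the final state with respect to the first picks up one factor of w per step:

```python
def gradient_through_time(w: float, steps: int) -> float:
    """Return d h_T / d h_0 for the toy recurrence h_t = w * h_{t-1} + x_t.

    Each step contributes one factor of w (chain rule). The loop is also
    inherently sequential: iteration t needs the result of iteration t - 1.
    """
    grad = 1.0
    for _ in range(steps):
        grad *= w
    return grad


if __name__ == "__main__":
    for steps in (10, 100, 500):
        print(steps, gradient_through_time(0.9, steps))
```

With w = 0.9 the factor after 500 steps is on the order of 1e-23, so a signal from 500 tokens back contributes almost nothing to a weight update.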
2. Tokens and Embeddings
Text is split into tokens, typically sub-word pieces produced by BPE (Byte Pair Encoding) or SentencePiece. The word "unbelievable" might be three tokens: un, believ, able.
Each token index is mapped to a dense vector of d_model dimensions (GPT-3 uses 12,288). This is the embedding. It starts random and is trained to encode semantic meaning, so similar concepts end up near each other in this high-dimensional space.
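The lookup itself is simple and can be sketched in a few lines. The three-entry vocabulary, the tiny d_model = 4, and the helper name embed are all assumptions for illustration; a real tokenizer has tens of thousands of entries, and training (not shown) would adjust the table rows.

```python
import random

# Hypothetical three-entry vocabulary, using the sub-word pieces from the text.
vocab = {"un": 0, "believ": 1, "able": 2}
d_model = 4  # tiny for readability; real models use hundreds or thousands

random.seed(0)
# The embedding table has one row per token index, initialised randomly.
embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(d_model)]
    for _ in range(len(vocab))
]


def embed(tokens: list[str]) -> list[list[float]]:
    """Map each token to its index, then look up that row of the table."""
    return [embedding_table[vocab[tok]] for tok in tokens]


vectors = embed(["un", "believ", "able"])
print(len(vectors), len(vectors[0]))  # 3 tokens, each a d_model-long vector
```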
3. Self-Attention
Self-attention asks: "For each token, how much should it attend to every other token?" It computes this by deriving three vectors from each token's embedding via learned projections:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I contribute?"
Attention(Q, K, V) = softmax( Q · Kᵀ / √d_k ) · V
The dot product Q · Kᵀ gives a score for every (query, key) pair: how relevant token j is to token i. Dividing by √d_k prevents the softmax from saturating when dimensions are large. The softmax turns each row of scores into a probability distribution, and multiplying by V produces a weighted average of all values.
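The formula can be traced step by step in plain Python. This is a minimal sketch for readability, not an efficient implementation: it uses nested lists instead of a tensor library, and the toy input sets Q = K = V directly, whereas a real layer derives each from its own learned projection.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K: lists of n vectors of length d_k. V: list of n vectors.
    Returns (outputs, weights), where weights[i][j] is how much
    token i attends to token j.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy input: three tokens, d_k = 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = attention(X, X, X)
for row in attn:
    print([round(w, 3) for w in row])
```

Each printed row sums to 1. The first token attends equally to itself and the third token, and less to the second, because its query is orthogonal to the second token's key.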
4. Multi-Head Attention
A single attention head computes one set of attention weights, so it tends to capture one type of relationship at a time. Multi-head attention runs h independent attention mechanisms ("heads") in parallel, each with its own learned W_Q, W_K, W_V matrices.
MultiHead(Q, K, V) = Concat(head_1, …, head_h) · W_O
Different heads learn to specialise: one might track syntactic dependencies, another semantic similarity, another coreference. GPT-3 uses 96 attention heads per layer.
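A small sketch of the concatenate-and-project step, under some assumptions: the weights are random rather than learned, the sizes (3 tokens, d_model = 8, h = 2) are toy values, and each head projects into d_model / h dimensions, which is the common convention. The helper names are made up for the example.

```python
import math
import random


def matmul(A, B):
    """Multiply an (n x m) matrix by an (m x p) matrix (lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(X, heads, W_O):
    """Run every head on X, concatenate per token, then project with W_O."""
    head_outputs = []
    for W_Q, W_K, W_V in heads:
        head_outputs.append(attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)))
    # Concat(head_1, ..., head_h): join the heads' vectors for each token.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)


rng = random.Random(0)
n_tokens, d_model, h = 3, 8, 2
d_head = d_model // h  # each head works in a smaller subspace

X = random_matrix(n_tokens, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(h)]
W_O = random_matrix(h * d_head, d_model, rng)

Y = multi_head_attention(X, heads, W_O)
print(len(Y), len(Y[0]))  # same shape as X: n_tokens x d_model
```

Because each head works in d_model / h dimensions, the total cost is similar to a single full-width head, while the heads remain free to learn different attention patterns.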
5. Positional Encoding
Attention treats its input as a set, not a sequence — there is no inherent notion of order. To re-introduce order, a positional encoding is added to each token's embedding before the first layer.
The original paper used fixed sinusoidal encodings:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Each sine/cosine pair oscillates at a different frequency, so each position gets a unique fingerprint. Many modern models instead use Rotary Position Embedding (RoPE) (used in Llama, GPT-NeoX) or ALiBi, which encode relative rather than absolute positions and can improve generalisation to sequences longer than those seen during training.
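The sinusoidal scheme from the original paper puts a sine on each even dimension and a cosine of the same frequency on the following odd dimension. A minimal sketch, assuming an even d_model; the function names are chosen for the example:

```python
import math


def positional_encoding(pos: int, d_model: int) -> list[float]:
    """Sinusoidal encoding for one position (d_model assumed even).

    Even dimensions use sine, odd dimensions use cosine, and each
    (sin, cos) pair shares one frequency.
    """
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe


def add_positions(embeddings: list[list[float]]) -> list[list[float]]:
    """Add the encoding for position p to the p-th token embedding."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(vec, positional_encoding(pos, d_model))]
        for pos, vec in enumerate(embeddings)
    ]


print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]: sin(0) = 0, cos(0) = 1
```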
6. Feed-Forward Layer
After attention, each token's representation passes through an identical position-wise feed-forward network — a two-layer MLP with a non-linear activation:
FFN(x) = max(0, x · W_1 + b_1) · W_2 + b_2   (ReLU, as in the original paper)
W_1 ∈ ℝ^(d_model × d_ff) , W_2 ∈ ℝ^(d_ff × d_model)
Typically d_ff = 4 × d_model
While attention mixes information across positions, the FFN processes each token independently. Research suggests the FFN layers act as key-value memories that store factual knowledge about the world. With d_ff = 4 × d_model, they hold roughly ⅔ of each layer's parameters.
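Both the per-token computation and the parameter share can be checked with a short sketch. It assumes a ReLU activation (the choice in the original paper; later models often use others) and counts only the large weight matrices, ignoring biases, layer norm and embeddings. The function names are made up for the example.

```python
def relu(v: list[float]) -> list[float]:
    return [max(0.0, x) for x in v]


def linear(x: list[float], W: list[list[float]], b: list[float]) -> list[float]:
    """Compute x W + b, with W stored as rows of shape (len(x), len(b))."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]


def ffn(x, W_1, b_1, W_2, b_2):
    """Two-layer MLP applied to a single token vector."""
    return linear(relu(linear(x, W_1, b_1)), W_2, b_2)


def ffn_parameter_fraction(d_model: int, d_ff: int) -> float:
    """Share of one layer's weight-matrix parameters held by the FFN.

    Attention has W_Q, W_K, W_V, W_O (4 * d_model^2 in total);
    the FFN has W_1 and W_2 (2 * d_model * d_ff in total).
    """
    attn = 4 * d_model * d_model
    ff = 2 * d_model * d_ff
    return ff / (attn + ff)


print(ffn_parameter_fraction(512, 4 * 512))  # 8 d^2 / 12 d^2, i.e. about 0.667
```

Since ffn takes one token vector at a time, applying it to a sequence is just a loop (or a batched matrix multiply) over positions, with the same weights at every position.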
7. Encoder and Decoder Stacks
The original Transformer (for translation) had two stacks:
- Encoder: Processes the input sequence. Each layer has (1) multi-head self-attention + (2) FFN, with Add & Norm (residual connection + layer normalisation) around each sub-layer.
- Decoder: Generates the output token by token. Each layer has (1) masked self-attention (each position attends only to itself and earlier positions), (2) cross-attention over the encoder output, and (3) FFN, again with Add & Norm around each sub-layer.
Residual connections (x + SubLayer(x)) let gradients flow directly to early layers, enabling very deep networks to train. Layer normalisation stabilises activations.
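Two of the building blocks above are small enough to sketch directly: the Add & Norm wrapper and the causal mask used by the decoder's masked self-attention. This sketch assumes the post-norm arrangement of the original paper, LayerNorm(x + SubLayer(x)), and omits layer norm's learned gain and bias; the sub-layer named double is a hypothetical stand-in for attention or the FFN.

```python
import math


def layer_norm(x: list[float], eps: float = 1e-5) -> list[float]:
    """Normalise one token vector to zero mean and unit variance.

    The learned gain and bias of a full layer norm are omitted here.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x: list[float], sublayer) -> list[float]:
    """Post-norm residual block: LayerNorm(x + SubLayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def causal_mask(n: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def double(v: list[float]) -> list[float]:
    """Hypothetical sub-layer standing in for attention or the FFN."""
    return [2.0 * a for a in v]


print(add_and_norm([1.0, 2.0, 3.0], double))
print(causal_mask(3))
```

In masked self-attention, scores at positions where the mask is False are set to a very large negative number before the softmax, so those positions receive effectively zero weight.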
8. GPT, BERT, and Variants
- BERT (Google, 2018) — encoder-only. Pre-trained by masking random tokens and predicting them (Masked Language Model). Excellent for classification, NER, question answering. Bidirectional context.
- GPT series (OpenAI) — decoder-only. Pre-trained by predicting the next token (causal language model). Excellent for generation. OpenAI has not disclosed GPT-4's size; unconfirmed estimates put it at around 1.7 trillion parameters in a Mixture-of-Experts architecture.
- T5 / BART — encoder-decoder. Used for translation, summarisation, seq2seq tasks.
- Vision Transformer (ViT) — image split into 16×16 patches, each treated as a token. Transformers are now widely used in computer vision too.
- Diffusion Transformers (DiT) — power Stable Diffusion 3, Sora. Transformer backbone replaces the U-Net in diffusion models.
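The ViT idea of treating patches as tokens can be sketched as follows. This is an illustration under assumptions: a single-channel image stored as a list of rows, a hypothetical 224 × 224 input size, and no learned projection of the flattened patches (a real ViT applies one before the first layer).

```python
def patchify(image: list[list[float]], patch: int = 16) -> list[list[float]]:
    """Split an image into flattened patch x patch tokens.

    `image` is a list of rows (height x width, single channel for
    simplicity); height and width are assumed divisible by `patch`.
    """
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch row by row into a single vector.
            tokens.append(
                [image[r][c] for r in range(top, top + patch) for c in range(left, left + patch)]
            )
    return tokens


image = [[0.0] * 224 for _ in range(224)]  # hypothetical 224 x 224 input
tokens = patchify(image)
print(len(tokens), len(tokens[0]))  # 196 tokens of 256 values each
```

A 224 × 224 image yields 14 × 14 = 196 patches, so the Transformer sees a sequence of 196 tokens, which is short by language-model standards.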
9. Why Scale Changes Everything
The empirical observation driving modern AI is the scaling law: loss decreases as a power law with both model size and training data. Multiply compute tenfold and the loss drops by a predictable amount.
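A power law has a simple consequence that is easy to check numerically: every tenfold increase in the resource multiplies the loss by the same constant factor. The sketch below uses a toy form L(C) = a * C^(-b) in terms of compute only; the constants a and b are made up for illustration, since real values are fitted to training runs and differ between studies.

```python
def power_law_loss(compute: float, a: float = 10.0, b: float = 0.05) -> float:
    """Toy scaling law L(C) = a * C^(-b).

    The constants a and b are made up for illustration; real values
    are fitted to training runs and differ between studies.
    """
    return a * compute ** (-b)


base = power_law_loss(1e20)
for factor in (1, 10, 100):
    loss = power_law_loss(1e20 * factor)
    print(factor, round(loss / base, 4))
```

Under this toy law each tenfold step scales the loss by 10^(-b), which is why the curve appears as a straight line on a log-log plot.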
But something qualitative also seems to happen at scale. Abilities that are largely absent in small models (multi-step maths, in-context learning, code generation) can appear fairly abruptly once models pass a certain size. These are called emergent abilities.
Understanding why scale produces intelligence — not just that it does — remains one of the deepest open questions in AI.