🤖 Machine Learning · Deep Learning
📅 March 2026 ⏱ ~12 min read 🔴 Advanced

Transformer Architecture Explained

In 2017, researchers at Google published "Attention Is All You Need", a paper that discarded recurrent networks entirely and built its model around a mechanism called self-attention. Nearly a decade later, most major AI systems, including GPT-4, Gemini, Llama, and Stable Diffusion, build on this architecture or on its attention mechanism.

1. Why Not RNNs?

Before Transformers, sequence models relied on Recurrent Neural Networks (RNNs) and their variants (LSTM, GRU). These process tokens one at a time, maintaining a hidden state that summarises everything seen so far.

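Two fundamental problems: computation is sequential, so training cannot be parallelised across positions; and information from distant tokens must survive many update steps, so long-range dependencies tend to fade.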

The Transformer solves both by computing relationships between all token pairs simultaneously — with full parallelism and no distance penalty.
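The sequential bottleneck is easy to see in a minimal sketch of a vanilla RNN. This assumes a plain tanh cell; the names rnn_forward, W_xh, W_hh and b_h are illustrative and not taken from any particular library.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b_h):
    """Run a vanilla tanh RNN over a sequence x of shape (seq_len, d_in)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in x:
        # Step t needs the hidden state from step t-1, so this loop
        # cannot be parallelised across positions.
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))              # 5 tokens, 4 input features (toy sizes)
W_xh = rng.normal(size=(4, 3)) * 0.1
W_hh = rng.normal(size=(3, 3)) * 0.1
b_h = np.zeros(3)
states = rnn_forward(x, W_xh, W_hh, b_h)
print(states.shape)  # (5, 3)
```

Information from the first token reaches the last one only after passing through every intermediate update, which is where the distance penalty comes from.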

2. Tokens and Embeddings

Text is split into tokens — typically sub-word pieces using BPE (Byte Pair Encoding) or SentencePiece. The word "unbelievable" might be three tokens: un, believ, able.
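As a rough illustration of sub-word splitting, the sketch below uses greedy longest-match against a fixed vocabulary. This is a simplification and not BPE itself, which learns merge rules from data; the function name and the toy vocabulary are hypothetical.

```python
def greedy_subword_tokenize(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No vocabulary piece matched: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens

toy_vocab = {"un", "believ", "able", "be", "a"}  # hypothetical vocabulary
tokens = greedy_subword_tokenize("unbelievable", toy_vocab)
print(tokens)  # ['un', 'believ', 'able']
```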

Each token index is mapped to a dense vector d_model dimensions long (GPT-3 uses 12 288 dimensions). This is the embedding. It starts random and is trained to encode semantic meaning — similar concepts end up near each other in this high-dimensional space.
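The lookup itself is just row indexing into a matrix. The sketch below assumes toy sizes and made-up token ids; in a real model the table is a trained parameter rather than random noise.

```python
import numpy as np

vocab_size, d_model = 1000, 16      # toy sizes, far smaller than a real model
rng = np.random.default_rng(0)

# One row per token in the vocabulary; random at initialisation.
embedding_table = rng.normal(scale=0.02, size=(vocab_size, d_model))

token_ids = np.array([17, 256, 42])  # hypothetical token ids
X = embedding_table[token_ids]       # row lookup: one vector per token
print(X.shape)  # (3, 16)
```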

Famous example: in well-trained word embeddings, king − man + woman ≈ queen. The embedding space captures some analogies as vector arithmetic.
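The arithmetic can be shown with a hand-built toy. The two-dimensional vectors below are invented for illustration on two made-up axes (royalty, gender); they are not real trained embeddings.

```python
import numpy as np

# Hand-built toy vectors on two made-up axes: (royalty, gender).
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = vectors["king"] - vectors["man"] + vectors["woman"]

# Exclude the three query words, as analogy evaluations usually do.
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```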

3. Self-Attention

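Self-attention asks: "For each token, how much should it attend to every other token?" It computes this by deriving three vectors from each token's embedding via learned projections: a query (what the token is looking for), a key (what the token offers for matching), and a value (the content it passes on).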

Scaled Dot-Product Attention

Q = X · W_Q , K = X · W_K , V = X · W_V

Attention(Q, K, V) = softmax( Q · Kᵀ / √d_k ) · V

The dot product Q · Kᵀ gives a score for every (query, key) pair — how relevant token j is to token i. Dividing by √d_k prevents the softmax from saturating when dimensions are large. The softmax turns scores into a probability distribution, and multiplying by V produces a weighted average of all values.
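The formula maps almost line for line onto NumPy. The following is a minimal single-head sketch without masking or batching; the toy sizes and random weights are assumptions made for the example.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (N, N): one score per (query, key) pair
    weights = softmax(scores, axis=-1)    # each row is a probability distribution
    return weights @ V, weights           # weighted average of the values

rng = np.random.default_rng(0)
N, d_model, d_k = 4, 8, 8                 # toy sizes
X = rng.normal(size=(N, d_model))         # one row per token
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(output.shape)           # (4, 8)
print(weights.sum(axis=-1))   # every row sums to 1
```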

Computational cost: attention is O(N²) in sequence length N. For N = 4096 tokens that is about 16.8 million query-key dot products per head, per layer. Much research (sparse attention, linear attention, FlashAttention) focuses on reducing this cost or the memory traffic behind it.
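The quadratic growth is easy to check with a few lines of arithmetic. The memory figure assumes the full score matrix is materialised at 2 bytes per entry (fp16) for a single head in a single layer; both are assumptions for illustration.

```python
N = 4096

pairs = N * N                  # one dot product per (query, key) pair
bytes_fp16 = pairs * 2         # assumption: 2 bytes per score (fp16)

print(pairs)                   # 16777216
print(bytes_fp16 / 2**20)      # 32.0 (MiB for one head in one layer)

# Doubling the sequence length quadruples the number of pairs.
growth = (2 * N) ** 2 // pairs
print(growth)                  # 4
```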

4. Multi-Head Attention

A single attention head computes one set of attention weights, so it tends to capture one kind of relationship at a time. Multi-head attention runs h independent attention mechanisms ("heads") in parallel, each with its own learned W_Q, W_K, W_V matrices.

Multi-Head Attention

head_i = Attention(Q·W_Qᵢ, K·W_Kᵢ, V·W_Vᵢ)

MultiHead(Q, K, V) = Concat(head_1, …, head_h) · W_O

Different heads learn to specialise: one might track syntactic dependencies, another semantic similarity, another coreference. GPT-3 uses 96 attention heads per layer.

[Figure: multi-head attention. Head 1 (syntax), Head 2 (coreference), Head 3 (semantics), Head … (position) → Output: Concat + W_O]
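A minimal NumPy sketch of multi-head self-attention follows. It assumes the common implementation trick of storing all per-head projections as column slices of one d_model × d_model matrix, which is equivalent to giving each head its own smaller matrices; the toy sizes and random weights are illustrative.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Self-attention with h heads over X of shape (N, d_model)."""
    N, d_model = X.shape
    d_head = d_model // h  # assumes d_model is divisible by h

    def split_heads(M):
        # (N, d_model) -> (h, N, d_head): head i gets its own slice of columns.
        return M.reshape(N, h, d_head).transpose(1, 0, 2)

    Q, K, V = (split_heads(X @ W) for W in (W_Q, W_K, W_V))
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (h, N, N)
    heads = softmax(scores, axis=-1) @ V                    # (h, N, d_head)
    concat = heads.transpose(1, 0, 2).reshape(N, d_model)   # Concat(head_1, ..., head_h)
    return concat @ W_O

rng = np.random.default_rng(0)
N, d_model, h = 6, 16, 4                                    # toy sizes
X = rng.normal(size=(N, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)  # (6, 16)
```

Each head works in a d_model / h dimensional subspace, so the total cost stays close to that of one full-width head.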

5. Positional Encoding

Attention treats its input as a set, not a sequence — there is no inherent notion of order. To re-introduce order, a positional encoding is added to each token's embedding before the first layer.

The original paper used fixed sinusoidal encodings:

Sinusoidal positional encoding

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Each pair of dimensions oscillates at a different frequency, so each position gets a unique fingerprint. Many modern models instead use Rotary Position Embedding (RoPE, used in Llama and GPT-NeoX) or ALiBi, which encode relative rather than absolute positions; this can help a model generalise to sequences longer than those seen during training.
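The sinusoidal formulas can be sketched directly in NumPy. The function name is illustrative, and the sketch assumes an even d_model.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of fixed sinusoidal encodings."""
    positions = np.arange(max_len)[:, None]           # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]     # the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)   # odd dimensions:  PE(pos, 2i+1)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16)
# At position 0 every sine term is 0 and every cosine term is 1.
print(pe[0, :4])
```

The table is added element-wise to the token embeddings, so it must have the same width d_model.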

6. Feed-Forward Layer

After attention, each token's representation passes through an identical position-wise feed-forward network — a two-layer MLP with a non-linear activation:

Position-wise FFN

FFN(x) = max(0, x · W_1 + b_1) · W_2 + b_2

W_1 ∈ ℝ^(d_model × d_ff) , W_2 ∈ ℝ^(d_ff × d_model)
Typically d_ff = 4 × d_model
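The formula above can be sketched in NumPy as follows, with toy sizes and random weights as assumptions. The final check confirms that the network is position-wise: running one row on its own gives the same result as running it inside the full sequence.

```python
import numpy as np

def position_wise_ffn(x, W_1, b_1, W_2, b_2):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied to every row of x."""
    hidden = np.maximum(0.0, x @ W_1 + b_1)   # ReLU activation
    return hidden @ W_2 + b_2

rng = np.random.default_rng(0)
d_model = 8                 # toy size
d_ff = 4 * d_model
W_1 = rng.normal(size=(d_model, d_ff)) * 0.1
b_1 = np.zeros(d_ff)
W_2 = rng.normal(size=(d_ff, d_model)) * 0.1
b_2 = np.zeros(d_model)

X = rng.normal(size=(5, d_model))             # 5 token representations
out = position_wise_ffn(X, W_1, b_1, W_2, b_2)

# Feeding token 2 on its own matches its row in the full output.
alone = position_wise_ffn(X[2:3], W_1, b_1, W_2, b_2)
print(out.shape)                       # (5, 8)
print(np.allclose(out[2], alone[0]))   # True
```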

While attention mixes information across positions, the FFN layer processes each token independently. Research suggests the FFN layers act as key-value memories that store factual knowledge about the world. With d_ff = 4 × d_model they hold roughly ⅔ of the parameters in each layer.
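The two-thirds figure follows from counting weight matrices in one layer. The arithmetic below is a back-of-the-envelope sketch that ignores biases, layer-norm parameters and embeddings.

```python
d_model = 12288                 # GPT-3 width, as quoted in the text
d_ff = 4 * d_model

attention_params = 4 * d_model * d_model   # W_Q, W_K, W_V and W_O
ffn_params = 2 * d_model * d_ff            # W_1 and W_2

ffn_fraction = ffn_params / (attention_params + ffn_params)
print(round(ffn_fraction, 3))   # 0.667
```

The ratio is 8 d_model² to 12 d_model², so it does not depend on the width as long as d_ff = 4 × d_model.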

7. Encoder and Decoder Stacks

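The original Transformer (for translation) had two stacks: an encoder that reads the source sentence, and a decoder that generates the translation one token at a time while attending to the encoder's output.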

[Figure: Input (Embedding + PE) → ×N layers (Self-Attn + FFN) → Output (Linear + Softmax)]

Residual connections (x + SubLayer(x)) let gradients flow directly to early layers, enabling very deep networks to train. Layer normalisation stabilises activations.
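A minimal sketch of the residual-plus-normalisation wrapper follows. It uses the post-norm arrangement LayerNorm(x + SubLayer(x)) from the original paper; many later models normalise before the sub-layer instead. The stand-in sub-layer and all names here are hypothetical.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each row to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_sublayer(x, sublayer, gamma, beta):
    """Post-norm wrapper: LayerNorm(x + SubLayer(x))."""
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
d_model = 8                                        # toy size
x = rng.normal(size=(4, d_model))
W = rng.normal(size=(d_model, d_model)) * 0.1      # hypothetical stand-in sub-layer
gamma, beta = np.ones(d_model), np.zeros(d_model)

y = residual_sublayer(x, lambda t: t @ W, gamma, beta)
print(y.shape)  # (4, 8)
```

Because x is added back unchanged, the gradient with respect to x always includes an identity term, whatever the sub-layer does.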

8. GPT, BERT, and Variants

9. Why Scale Changes Everything

The empirical observation driving modern AI is the scaling law: test loss falls as a power law in model size, training data, and compute. Multiply compute tenfold and the loss drops by a roughly predictable factor.
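The shape of a power law can be illustrated with a toy function. The exponent below is a hypothetical value chosen only for illustration and is not a fitted constant from any published scaling study.

```python
def power_law_loss(scale, alpha, loss_at_unit_scale=1.0):
    """Idealised power law: L(scale) = L(1) * scale ** (-alpha)."""
    return loss_at_unit_scale * scale ** (-alpha)

alpha = 0.05  # hypothetical exponent, for illustration only

for scale in (1.0, 10.0, 100.0, 1000.0):
    print(scale, round(power_law_loss(scale, alpha), 4))

# Each tenfold increase multiplies the loss by the same factor, 10 ** (-alpha).
step_ratio = power_law_loss(10.0, alpha) / power_law_loss(1.0, alpha)
print(round(step_ratio, 4))  # 0.8913
```

On a log-log plot such a curve is a straight line, which is what makes extrapolation to larger scales feasible.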

But something qualitative also seems to happen at scale. Abilities that are weak or absent in small models, such as multi-step maths, in-context learning, and code generation, appear fairly abruptly once models pass a certain size. These are called emergent abilities.

GPT-3 scale: 175 billion parameters, 96 layers, 96 attention heads per layer, d_model = 12 288. Trained on ~300 billion tokens. Training cost was estimated at $4–12 million.
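These figures are consistent with a rough parameter count. The sketch below uses the common approximation of 12 · d_model² weights per layer (4 d_model² for attention plus 8 d_model² for the FFN) and ignores biases, layer norms and positional embeddings. The vocabulary size of 50,257 is an assumption not stated in the text.

```python
n_layers, d_model, n_heads = 96, 12288, 96
vocab_size = 50257                      # assumption: GPT-2/GPT-3 BPE vocabulary size

per_layer = 12 * d_model ** 2           # 4 d^2 attention + 8 d^2 feed-forward
embedding = vocab_size * d_model        # token embedding table
total = n_layers * per_layer + embedding

head_dim = d_model // n_heads
print(head_dim)                         # 128
print(round(total / 1e9, 1))            # 174.6 (billion parameters)
```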

Understanding why scale produces intelligence — not just that it does — remains one of the deepest open questions in AI.