🤖 Machine Learning · Deep Learning

📅 March 2026 ⏱ ~12 min read 🔴 Advanced

Transformer Architecture Explained

In 2017, researchers at Google published "Attention Is All You Need", a paper that discarded recurrent networks entirely and built a sequence model around self-attention alone. In the years since, nearly every breakthrough in AI (GPT-4, Gemini, Llama, Stable Diffusion) has built on this architecture.

1. Why Not RNNs?

Before Transformers, sequence models relied on Recurrent Neural Networks (RNNs) and their variants (LSTM, GRU). These process tokens one at a time, maintaining a hidden state that summarises everything seen so far.

Two fundamental problems:

Sequential bottleneck: You cannot parallelise over token positions — step N depends on step N−1. On modern GPUs with thousands of cores, most hardware sits idle.
Vanishing gradients / long-range forgetting: Gradients shrink exponentially as they flow backward through hundreds of steps. Relationships between distant tokens (e.g., a pronoun and its antecedent 500 tokens apart) are lost.

The Transformer solves both by computing relationships between all token pairs simultaneously — with full parallelism and no distance penalty.

2. Tokens and Embeddings

Text is split into tokens — typically sub-word pieces using BPE (Byte Pair Encoding) or SentencePiece. The word "unbelievable" might be three tokens: un, believ, able.

Each token index is mapped to a dense vector d_model dimensions long (GPT-3 uses 12 288 dimensions). This is the embedding. It starts random and is trained to encode semantic meaning — similar concepts end up near each other in this high-dimensional space.
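As a minimal sketch of the lookup described above, the embedding layer can be pictured as a table with one row per token id. The vocabulary size, the width and the token ids here are toy assumptions, not values from any real tokenizer.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 10   # toy vocabulary size (assumption)
d_model = 4       # toy embedding width; real models use far larger values

# The embedding table: one row per token id, randomly initialised.
# In a real model these rows are updated by gradient descent during training.
embedding_table = rng.normal(scale=0.02, size=(vocab_size, d_model))

# Hypothetical token ids standing in for "un", "believ", "able".
token_ids = np.array([3, 7, 1])

# The lookup is plain row indexing: result has shape (sequence_length, d_model).
X = embedding_table[token_ids]
print(X.shape)  # (3, 4)
```

The resulting matrix X, with one row per token, is the input that later sections project into queries, keys and values.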

Famous example: in well-trained word embeddings such as word2vec, king − man + woman ≈ queen. The embedding space captures analogies as vector arithmetic.
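The analogy can be illustrated with hand-built toy vectors. These are not learned embeddings: the two axes (roughly "royalty" and "gender") and all values are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hand-built 2-D toy vectors (illustrative, not learned):
# axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([0.2,  0.1]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = vectors["king"] - vectors["man"] + vectors["woman"]

# Nearest word by cosine similarity, excluding the three query words.
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(nearest)  # queen
```

In real embedding spaces the match is approximate, which is why the nearest neighbour is searched for rather than tested for equality.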

3. Self-Attention

Self-attention asks: "For each token, how much should it attend to every other token?" It computes this by deriving three vectors from each token's embedding via learned projections:

Query (Q): "What am I looking for?"
Key (K): "What do I contain?"
Value (V): "What information do I contribute?"

Scaled Dot-Product Attention
$$Q = X \cdot W_Q, \quad K = X \cdot W_K, \quad V = X \cdot W_V$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q \cdot K^{\top}}{\sqrt{d_k}}\right) \cdot V$$

The dot product Q · Kᵀ gives a score for every (query, key) pair — how relevant token j is to token i. Dividing by √d_k prevents the softmax from saturating when dimensions are large. The softmax turns scores into a probability distribution, and multiplying by V produces a weighted average of all values.
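The computation described above can be sketched in a few lines of NumPy. This is a single-head, unbatched illustration with random weights. The sizes and the function names `softmax` and `scaled_dot_product_attention` are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (N, d_k); V: (N, d_v). Returns the output and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (N, N): relevance of token j to token i
    weights = softmax(scores, axis=-1)   # each row is a probability distribution
    return weights @ V, weights          # weighted average of the value vectors

rng = np.random.default_rng(0)
N, d_model, d_k = 5, 8, 4                # toy sizes (assumption)
X = rng.normal(size=(N, d_model))        # stand-in for token embeddings
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

out, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape, weights.shape)          # (5, 4) (5, 5)
```

Each row of `weights` sums to 1, so each output row is a convex combination of the rows of V.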

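Computational cost: Attention is O(N²) in sequence length N. For N = 4096 tokens, that is about 16.8 million query-key dot products per head, per layer. Much research targets this cost: sparse and linear attention reduce the number of scores computed, while FlashAttention keeps the computation exact but cuts memory traffic.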
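The quadratic growth is easy to check with a little arithmetic. The sketch below counts the entries of one N × N score matrix and the memory needed to store it. The 2 bytes per score (16-bit floats) is an assumption for illustration.

```python
def attention_score_count(n_tokens):
    """Number of (query, key) dot products in one N x N attention matrix."""
    return n_tokens * n_tokens

def score_matrix_bytes(n_tokens, bytes_per_score=2):
    """Memory for one score matrix; 2 bytes per score assumes 16-bit floats."""
    return attention_score_count(n_tokens) * bytes_per_score

for n in (1024, 4096, 16384):
    mib = score_matrix_bytes(n) / 2**20
    print(f"N={n}: {attention_score_count(n)} scores, {mib:.0f} MiB")
```

Doubling the sequence length quadruples both numbers, and the totals apply per head and per layer.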

4. Multi-Head Attention

A single attention head produces only one set of attention weights per token, so it tends to capture one type of relationship at a time. Multi-head attention runs h independent attention mechanisms ("heads") in parallel, each with its own learned W_Q, W_K, W_V matrices.

Multi-Head Attention
$$\mathrm{head}_i = \mathrm{Attention}(Q \cdot W_{Q,i},\; K \cdot W_{K,i},\; V \cdot W_{V,i})$$
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) \cdot W_O$$

Different heads learn to specialise: one might track syntactic dependencies, another semantic similarity, another coreference. GPT-3 uses 96 attention heads per layer.
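A minimal NumPy sketch of the multi-head computation follows. It loops over the heads for clarity, whereas real implementations batch them into one tensor operation. It also assumes self-attention, so queries, keys and values are all derived from the same input X. The sizes and the function name `multi_head_attention` are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (N, d_model); W_Q, W_K, W_V: (h, d_model, d_k); W_O: (h * d_k, d_model)."""
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):   # one iteration per head
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(weights @ V)              # (N, d_k)
    # Concatenate the heads along the feature axis, then mix them with W_O.
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
N, d_model, h = 6, 16, 4                       # toy sizes (assumption)
d_k = d_model // h                             # a common choice: d_k = d_model / h
X = rng.normal(size=(N, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)                               # (6, 16)
```

Choosing d_k = d_model / h keeps the total cost close to that of a single full-width head.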

[Figure: multi-head attention. Head 1 (Syntax), Head 2 (Coreference), Head 3 (Semantics), Head … (Position) → Output (Concat + W_O)]

5. Positional Encoding

Attention treats its input as a set, not a sequence — there is no inherent notion of order. To re-introduce order, a positional encoding is added to each token's embedding before the first layer.

The original paper used fixed sinusoidal encodings:

Sinusoidal positional encoding
$$PE(pos, 2i) = \sin\!\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE(pos, 2i+1) = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$

Each dimension oscillates at a different frequency, so each position gets a unique fingerprint. Many modern models instead use Rotary Position Embedding (RoPE), used in Llama and GPT-NeoX, or ALiBi. Both encode relative rather than absolute positions, which can help generalisation to sequences longer than those seen during training.
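The sinusoidal formulas translate directly into NumPy. The sketch below assumes an even d_model. The function name and the sizes are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of fixed encodings; d_model must be even."""
    positions = np.arange(max_len)[:, None]         # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]   # the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16)

# The encoding is added to the token embeddings before the first layer, e.g.
# X = X + pe[:X.shape[0]] for an embedding matrix X of shape (N, d_model).
```

At position 0 every sine entry is 0 and every cosine entry is 1. Later positions diverge at a rate set by each dimension's frequency.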

6. Feed-Forward Layer

After attention, each token's representation passes through an identical position-wise feed-forward network — a two-layer MLP with a non-linear activation:

Position-wise FFN
$$\mathrm{FFN}(x) = \max(0,\; x \cdot W_1 + b_1) \cdot W_2 + b_2$$
$$W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}, \quad W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$$
Typically $d_{ff} = 4 \times d_{model}$

While attention mixes information across positions, the FFN layer processes each token independently. Research suggests the FFN layers act as key-value memories that store factual knowledge about the world. With d_ff = 4 × d_model, they hold roughly ⅔ of the parameters in each layer (8·d_model² of 12·d_model², ignoring biases).
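A short NumPy sketch of the position-wise FFN, followed by the weight count behind the two-thirds figure. The toy sizes are assumptions. The count ignores biases, embeddings and normalisation parameters, and assumes four d_model × d_model attention matrices (W_Q, W_K, W_V, W_O) per layer.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Two-layer MLP with ReLU, applied to each row (token) independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
N, d_model = 5, 8                 # toy sizes (assumption)
d_ff = 4 * d_model
X = rng.normal(size=(N, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = ffn(X, W1, b1, W2, b2)
print(out.shape)                  # (5, 8)

# Per-layer weight counts, ignoring biases.
attention_weights = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
ffn_weights = 2 * d_model * d_ff            # W_1 and W_2
ffn_fraction = ffn_weights / (ffn_weights + attention_weights)
print(round(ffn_fraction, 3))     # 0.667
```

Because the FFN acts row by row, feeding it a single token gives the same result as that token's row in the full output.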

7. Encoder and Decoder Stacks

The original Transformer (for translation) had two stacks:

Encoder: Processes the input sequence. Each layer has (1) multi-head self-attention + (2) FFN, with Add & Norm (residual connection + layer normalisation) around each sub-layer.
Decoder: Generates the output token by token. Each layer has (1) masked self-attention (each position can attend only to itself and earlier positions), (2) cross-attention over the encoder output, and (3) FFN, again with Add & Norm around each sub-layer.
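The masking in the decoder's self-attention can be sketched as follows. Scores for future positions are set to negative infinity before the softmax, so they receive zero weight. The function names and the random scores are illustrative assumptions.

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) matrix: True where position i may attend to position j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax over each row, with disallowed positions forced to zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # diagonal keeps the max finite
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n = 4
scores = rng.normal(size=(n, n))          # stand-in for Q @ K.T / sqrt(d_k)
weights = masked_softmax(scores, causal_mask(n))
print(np.round(weights, 2))
```

The first row puts all its weight on the first token, and every entry above the diagonal is exactly zero.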

[Figure: Input (Embedding + PE) → ×N layers (Self-Attn + FFN) → Output (Linear + Softmax)]

Residual connections (x + SubLayer(x)) let gradients flow directly to early layers, enabling very deep networks to train. Layer normalisation stabilises activations.
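The Add & Norm step can be sketched as a small wrapper around any sub-layer. This follows the ordering of the original paper, LayerNorm(x + SubLayer(x)), and omits the learnable gain and bias of layer normalisation for brevity. The `np.tanh` sub-layer is only a stand-in for attention or the FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each row to zero mean and unit variance (no learnable gain or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalisation."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))            # 3 tokens, d_model = 8 (toy sizes)
y = add_and_norm(x, np.tanh)           # np.tanh stands in for a real sub-layer
print(y.shape)                         # (3, 8)
```

Many later models apply the normalisation before the sub-layer instead, but the residual path x + SubLayer(x) is the same in both arrangements.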

8. GPT, BERT, and Variants

BERT (Google, 2018): encoder-only. Pre-trained by masking random tokens and predicting them (Masked Language Model). Excellent for classification, NER, question answering. Bidirectional context.
GPT series (OpenAI): decoder-only. Pre-trained by predicting the next token (causal language model). Excellent for generation. OpenAI has not published GPT-4's size; unofficial estimates put it at ~1.7 trillion parameters with a Mixture-of-Experts architecture.
T5 / BART: encoder-decoder. Used for translation, summarisation, seq2seq tasks.
Vision Transformer (ViT): the image is split into fixed-size patches (16×16 pixels in the base model), each treated as a token. Transformers now dominate computer vision too.
Diffusion Transformers (DiT): power Stable Diffusion 3 and Sora. A Transformer backbone replaces the U-Net in diffusion models.

9. Why Scale Changes Everything

The empirical observation driving modern AI is the scaling law: test loss falls as a power law in model size, training data, and compute. Multiply compute tenfold and the loss drops by a predictable amount.
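A power law means that every tenfold increase multiplies the loss by the same constant factor. The sketch below shows this for model size only, using the form L(N) = (N_c / N)^α. The constants `n_c` and `alpha` are illustrative placeholders, not fitted values.

```python
def power_law_loss(n_params, n_c=1.0e14, alpha=0.08):
    """L(N) = (N_c / N) ** alpha; n_c and alpha are illustrative placeholders."""
    return (n_c / n_params) ** alpha

small = power_law_loss(1e9)     # hypothetical 1B-parameter model
large = power_law_loss(1e10)    # hypothetical 10B-parameter model
ratio = large / small           # equals 10 ** -alpha, whatever the starting size

print(f"loss ratio after 10x more parameters: {ratio:.3f}")
```

On a log-log plot this relationship is a straight line with slope −α, which is what makes extrapolation to larger models possible.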

But something qualitative happens at scale. Abilities that are entirely absent in small models — multi-step maths, in-context learning, code generation — suddenly appear at certain parameter counts. These are called emergent abilities.

GPT-3 scale: 175 billion parameters, 96 layers, 96 attention heads per layer, d_model = 12 288. Trained on ~300 billion tokens. Training cost was estimated at $4–12 million.
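The 175 billion figure can be roughly recovered from the layer count and d_model alone. This is a back-of-the-envelope sketch that assumes d_ff = 4 × d_model and four d_model × d_model attention matrices per layer, and ignores embeddings, biases and layer-norm parameters.

```python
def approx_transformer_params(n_layers, d_model, d_ff_multiplier=4):
    """Rough weight count: attention (4 * d^2) plus FFN (2 * d * d_ff) per layer."""
    attention = 4 * d_model * d_model                  # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * (d_ff_multiplier * d_model)    # W_1 and W_2
    return n_layers * (attention + ffn)

gpt3_estimate = approx_transformer_params(n_layers=96, d_model=12288)
print(f"{gpt3_estimate / 1e9:.1f}B")  # 173.9B
```

The estimate lands within about 1% of the quoted total. The remainder is mostly the embedding table and the smaller terms left out here.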

Understanding why scale produces intelligence — not just that it does — remains one of the deepest open questions in AI.