Neural Network Backpropagation
A neural network learns by repeatedly asking: "which weights, if nudged, would reduce the error?" Backpropagation is the efficient algorithm that answers this question for every weight simultaneously, using nothing more than the chain rule of calculus.
1. Forward Pass
A neural network is a function composition. Each layer computes a linear transformation followed by a nonlinearity:
The forward pass propagates input x through all L layers to produce output ŷ = a^(L). Nothing about "learning" yet — this is just function evaluation.
Compute all activations from input to output. Store intermediate values (needed for backprop).
Propagate error signal from output back to input. Compute gradient of loss w.r.t. every weight.
2. Loss Functions
The loss L(ŷ, y) measures how wrong the prediction is. It must be differentiable so we can compute gradients.
- Mean squared error (regression): L = (1/N) Σ (ŷ_i − y_i)². Gradient: ∂L/∂ŷ = 2(ŷ − y)/N.
- Binary cross-entropy (classification): L = −[y·log(ŷ) + (1−y)·log(1−ŷ)]. Gradient w.r.t. sigmoid output: ∂L/∂ŷ = ŷ − y (elegant).
- Categorical cross-entropy (multi-class): L = −Σ_c y_c · log(ŷ_c). Combined with softmax: ∂L/∂z_c = ŷ_c − y_c.
3. The Chain Rule
If f and g are differentiable functions and h(x) = f(g(x)), then:
A neural network is a composition of many functions. The loss depends on the final output, which depends on the last layer's weights, which depend on the previous layer's activations — all the way back to the input. The chain rule lets us compute ∂L/∂w for any weight w by multiplying local gradients along the path.
4. Backpropagation
Define the "delta" error signal δ^(l) = ∂L/∂z^(l). The backprop algorithm computes these deltas from output to input:
This is the full algorithm. In code, modern frameworks compute this via a computation graph where every operation registers its backward function. The "backward pass" traverses the graph in reverse topological order.
5. Gradient Descent
Once gradients are known, weights are updated to reduce the loss:
Variants address SGD's limitations:
- Momentum: Accumulate gradient history (exponential moving average). Smooths oscillations: v = βv − α∇L, then W += v.
- RMSProp: Divide by running RMSE of gradients — larger gradients get smaller steps. Helps with different scale parameters.
- Adam (most popular): Combines momentum (first moment) + RMSProp (second moment). Self-tuning per parameter. Default: β₁=0.9, β₂=0.999, ε=1e-8.
Mini-batch gradient descent: Rather than one sample (stochastic) or all samples (batch), use batches of 32–256. This gives a good noise/compute tradeoff — noisy enough to escape local minima, efficient enough for GPU parallelism.
6. Activation Functions & Vanishing Gradients
The choice of σ determines how gradients flow through many layers.
- Sigmoid σ(x) = 1/(1+e^−x): Output ∈ (0,1), gradient ≤ 0.25. After 10 layers, gradients are (0.25)¹⁰ ≈ 10⁻⁶ — the vanishing gradient problem.
- Tanh: Centered at 0 (better than sigmoid), but max gradient 1 — still vanishes in deep nets.
- ReLU max(0,x): Gradient is exactly 1 for x>0, 0 for x<0. Cheap, non-saturating — solved the vanishing gradient problem for deep networks. Default choice.
- Leaky ReLU: Small slope (0.01x) for x<0, fixing the "dying ReLU" problem where neurons output 0 forever.
- GELU x·Φ(x): Smooth approximation to ReLU, used in transformers (BERT, GPT).
7. Automatic Differentiation
Modern frameworks (PyTorch, JAX, TensorFlow) implement reverse-mode automatic differentiation — a generalisation of backpropagation to arbitrary computation graphs.
Every operation on a tensor records a backward function in a tape/graph. During the backward pass, the engine traverses this graph and accumulates gradients using the chain rule, entirely automatically. You write the forward computation; gradients come for free.
Forward-mode autodiff (where you propagate a tangent, not gradient) is efficient when input dimension << output dimension. Reverse-mode is efficient when output dimension << input (i.e. computing scalar loss gradient wrt millions of weights). Neural network training always uses reverse mode.