Neural Network Backpropagation

A neural network learns by repeatedly asking: "which weights, if nudged, would reduce the error?" Backpropagation is the efficient algorithm that answers this question for every weight simultaneously, using nothing more than the chain rule of calculus.

1. Forward Pass

A neural network is a function composition. Each layer computes a linear transformation followed by a nonlinearity:

Layer computation z^(l) = W^(l) · a^(l−1) + b^(l) // pre-activation a^(l) = σ(z^(l)) // post-activation (e.g. ReLU) Shapes: W^(l) ∈ ℝ^(n_l × n_{l-1}), b^(l) ∈ ℝ^(n_l)

The forward pass propagates input x through all L layers to produce output ŷ = a^(L). Nothing about "learning" yet — this is just function evaluation.

Forward pass →

Compute all activations from input to output. Store intermediate values (needed for backprop).

← Backward pass

Propagate error signal from output back to input. Compute gradient of loss w.r.t. every weight.

2. Loss Functions

The loss L(ŷ, y) measures how wrong the prediction is. It must be differentiable so we can compute gradients.

Mean squared error (regression): L = (1/N) Σ (ŷ_i − y_i)². Gradient: ∂L/∂ŷ = 2(ŷ − y)/N.
Binary cross-entropy (classification): L = −[y·log(ŷ) + (1−y)·log(1−ŷ)]. Gradient w.r.t. sigmoid output: ∂L/∂ŷ = ŷ − y (elegant).
Categorical cross-entropy (multi-class): L = −Σ_c y_c · log(ŷ_c). Combined with softmax: ∂L/∂z_c = ŷ_c − y_c.

3. The Chain Rule

If f and g are differentiable functions and h(x) = f(g(x)), then:

Chain rule dh/dx = (df/dg) \cdot (dg/dx) Or in partial derivative notation: \partialL/\partialw = \partialL/\partiala \cdot \partiala/\partialz \cdot \partialz/\partialw

A neural network is a composition of many functions. The loss depends on the final output, which depends on the last layer's weights, which depend on the previous layer's activations — all the way back to the input. The chain rule lets us compute ∂L/∂w for any weight w by multiplying local gradients along the path.

Why this is efficient: Naïvely computing ∂L/∂w_ij for every weight independently would require one full forward pass per weight — prohibitive for millions of weights. Backpropagation reuses intermediate computations so one backward pass computes all gradients simultaneously in O(P) time, matching the forward pass cost.

4. Backpropagation

Define the "delta" error signal δ^(l) = ∂L/∂z^(l). The backprop algorithm computes these deltas from output to input:

Backprop equations // Output layer delta: δ^(L) = \partialL/\partiala^(L) ⊙ σ'(z^(L)) // Hidden layer delta (propagate backward): δ^(l) = (W^(l+1)ᵀ \cdot δ^(l+1)) ⊙ σ'(z^(l)) // Gradients for this layer's parameters: \partialL/\partialW^(l) = δ^(l) \cdot (a^(l-1))ᵀ \partialL/\partialb^(l) = δ^(l) ⊙ = element-wise product ᵀ = transpose

This is the full algorithm. In code, modern frameworks compute this via a computation graph where every operation registers its backward function. The "backward pass" traverses the graph in reverse topological order.

5. Gradient Descent

Once gradients are known, weights are updated to reduce the loss:

SGD weight update W \leftarrow W - α \cdot \partialL/\partialW α = learning rate (typically 1e-4 to 1e-2)

Variants address SGD's limitations:

Momentum: Accumulate gradient history (exponential moving average). Smooths oscillations: v = βv − α∇L, then W += v.
RMSProp: Divide by running RMSE of gradients — larger gradients get smaller steps. Helps with different scale parameters.
Adam (most popular): Combines momentum (first moment) + RMSProp (second moment). Self-tuning per parameter. Default: β₁=0.9, β₂=0.999, ε=1e-8.

Mini-batch gradient descent: Rather than one sample (stochastic) or all samples (batch), use batches of 32–256. This gives a good noise/compute tradeoff — noisy enough to escape local minima, efficient enough for GPU parallelism.

6. Activation Functions & Vanishing Gradients

The choice of σ determines how gradients flow through many layers.

Sigmoid σ(x) = 1/(1+e^−x): Output ∈ (0,1), gradient ≤ 0.25. After 10 layers, gradients are (0.25)¹⁰ ≈ 10⁻⁶ — the vanishing gradient problem.
Tanh: Centered at 0 (better than sigmoid), but max gradient 1 — still vanishes in deep nets.
ReLU max(0,x): Gradient is exactly 1 for x>0, 0 for x<0. Cheap, non-saturating — solved the vanishing gradient problem for deep networks. Default choice.
Leaky ReLU: Small slope (0.01x) for x<0, fixing the "dying ReLU" problem where neurons output 0 forever.
GELU x·Φ(x): Smooth approximation to ReLU, used in transformers (BERT, GPT).

Batch normalisation: Normalise layer inputs to zero mean/unit variance during training. Reduces internal covariate shift, stabilises gradients, allows higher learning rates. Makes very deep networks trainable.

7. Automatic Differentiation

Modern frameworks (PyTorch, JAX, TensorFlow) implement reverse-mode automatic differentiation — a generalisation of backpropagation to arbitrary computation graphs.

Every operation on a tensor records a backward function in a tape/graph. During the backward pass, the engine traverses this graph and accumulates gradients using the chain rule, entirely automatically. You write the forward computation; gradients come for free.

PyTorch example (conceptual) x = torch.tensor([2.0], requires_grad=True) y = x ** 3 + 2 * x // y = 12 y.backward() // dy/dx = 3x² + 2 = 14 print(x.grad) // tensor([14.])

Forward-mode autodiff (where you propagate a tangent, not gradient) is efficient when input dimension << output dimension. Reverse-mode is efficient when output dimension << input (i.e. computing scalar loss gradient wrt millions of weights). Neural network training always uses reverse mode.

Gradient checkpointing: For very deep or memory-constrained models, don't store all intermediate activations. Recompute them during backward pass from checkpointed values. Trades compute for memory (useful for training large language models).