Machine Learning Fundamentals — Gradient Descent, Backprop & Neural Networks from Scratch

Every neural network is just a composition of matrix multiplications and non-linearities trained by gradient descent. This post unpacks the maths — SGD, momentum, Adam, backpropagation chain rule, softmax cross-entropy — and ties each concept to an interactive simulation.

1. Gradient Descent

Machine learning reduces to one universal problem: find the parameter vector θ that minimises a scalar loss function L(θ). Gradient descent does this by repeatedly stepping in the direction of steepest descent:

SGD and Variants

Vanilla SGD: θ ← θ − η · ∇L(θ)

Momentum (heavy-ball):
  v ← β v − η · ∇L(θ)
  θ ← θ + v

Adam (adaptive moments):
  m ← β₁ m + (1−β₁) g   (1st moment)
  v ← β₂ v + (1−β₂) g²  (2nd moment)
  θ ← θ − η · m̂ / (√v̂ + ε)

The loss landscape for small networks is a surface in high-dimensional space. Saddle points are far more common than local minima — modern optimisers escape them through noise and adaptive step-sizes.

2. Backpropagation — the Chain Rule Applied

Backprop is nothing more than the chain rule of calculus applied systematically to a computation graph. For a loss L = f(g(x)):

Chain Rule — Forward & Backward Pass

Forward: z¹ = W¹x + b¹, a¹ = σ(z¹), z² = W²a¹ + b², ŷ = softmax(z²)
Loss: L = −Σ yᵢ log ŷᵢ (cross-entropy)

Backward: ∂L/∂W² = (∂L/∂ŷ)(∂ŷ/∂z²)(∂z²/∂W²) = (ŷ − y)ᵀ · a¹
∂L/∂W¹ = (W²ᵀ δ²) ⊙ σ′(z¹) · xᵀ  (where δ² = ŷ − y)

The key insight: gradients flow backwards through each layer, each multiplied by the local Jacobian. With ReLU activations (σ′ = 0 or 1) the gradient either passes through unmodified or is zero — which is both the source of efficiency and the "dying ReLU" pathology.

// Minimal backprop for a 2-layer network (JavaScript pseudocode)
function backward(x, y, W1, W2, b1, b2) {
  // Forward
  const z1 = matMul(W1, x).add(b1);
  const a1 = relu(z1);
  const z2 = matMul(W2, a1).add(b2);
  const yHat = softmax(z2);

  // Output layer gradient (softmax + cross-entropy simplifies nicely)
  const dz2 = yHat.sub(y);                    // (C × 1)
  const dW2 = matMul(dz2, a1.T);             // (C × H)

  // Hidden layer gradient
  const da1 = matMul(W2.T, dz2);             // (H × 1)
  const dz1 = da1.mul(reluGrad(z1));         // element-wise ReLU′
  const dW1 = matMul(dz1, x.T);             // (H × D)

  return { dW1, dW2, db1: dz1, db2: dz2 };
}

3. Convolutional Intuition

A convolutional layer slides a small kernel (e.g. 3×3) over the input. Each position produces one scalar by computing a dot product between the kernel weights and the local receptive field. This parameter sharing — re-using the same kernel at every location — gives CNNs translation invariance and dramatically reduces parameter count compared with fully-connected alternatives.

4. Reinforcement Learning & Evolution

Algorithms at a Glance

Stochastic Gradient Descent Adam optimiser Backpropagation Chain rule Softmax cross-entropy ReLU /Tanh Convolutional layers Q-learning Policy gradient Genetic algorithm Kohonen SOM CART Gini impurity

Why does ReLU work better than sigmoid? The sigmoid saturates at 0 and 1 — gradients vanish for large activations (the "vanishing gradient" problem). ReLU is linear for positive inputs, so gradients flow through without shrinkage. Variants like Leaky ReLU (slope 0.01 for x < 0) and GELU (Gaussian-error linear unit, used in transformers) address the "dying ReLU" problem where neurons permanently output zero.