Spotlight #22 — Statistics & Probability: From Coin Flips to Bayesian Inference

Probability as a Mathematical Framework

Statistics and probability are not the same discipline. Probability is pure mathematics: given a model, what outcomes are likely? Statistics inverts this: given outcomes, what model is likely? The simulations in this spotlight sit at the boundary between the two — each one lets you see a theoretical result emerge from repeated random experiments, making the abstract concrete.

What unites them is randomness with structure. Coin flips are unpredictable individually, yet 10 000 flips produce a histogram indistinguishable from a normal curve. A drunk's random walk around the city is chaos, yet the fraction of time spent in each neighbourhood converges to a precise fraction. Randomness and order are not opposites — they are the same phenomenon viewed at different time scales.

Normal Distribution Bayes' Theorem OLS Regression Markov Chains Maxwell-Boltzmann Bootstrap Resampling Monte Carlo

Layer 1: The Universal Convergence — Central Limit Theorem

Central Limit Theorem Simulation

The Central Limit Theorem (CLT) is arguably the most important result in all of statistics. It states that the sum of a large number of independent random variables — regardless of their individual distributions — converges to a normal distribution. Roll a die once and you get a flat distribution. Average two dice rolls and peaks emerge. Average thirty and you have a bell curve so smooth that normal-distribution tables give answers accurate to six decimal places.

This is why the normal distribution appears everywhere in nature: body heights, measurement errors, stock returns, IQ scores, and the random velocity of gas molecules all emerge from summing many small independent contributions. The CLT is the mathematical reason that the Gaussian is not just a convenience but a law of large numbers.

Central Limit Theorem — Statement & Convergence Rate

Classic CLT (Lindeberg-Lévy):
  X₁, X₂, ..., Xₙ — i.i.d. with mean μ, variance σ²
  S̄ₙ = (X₁ + X₂ + ... + Xₙ) / n  (sample mean)

  As n → ∞:
    √n · (S̄ₙ − μ) / σ  →  N(0, 1)  (in distribution)

  Equivalently:
    S̄ₙ ~ N(μ, σ²/n)  approximately, for large n

Standard error of the mean:
  SE = σ / √n
  (halving SE requires 4× more samples)

Berry-Esseen bound (convergence rate):
  sup_x |P(√n(S̄ₙ−μ)/σ ≤ x) − Φ(x)| ≤ C · ρ / (σ³ · √n)
  ρ = E[|X − μ|³]  (third absolute moment)
  C ≈ 0.4748  (tight constant, Shevtsova 2010)

Key insight: convergence is faster when ρ/σ³ is small
  — symmetric distributions converge faster than skewed ones
  — lighter tails converge faster than heavy tails

Source distributions in the simulation:
  Uniform U(0,1):    μ=0.5, σ²=1/12,  fast CLT
  Exponential(1):    μ=1,   σ²=1,      moderate (skewed)
  Bimodal:           μ=0.5, σ²≈0.25,   slow (bimodal)
  Poisson(λ=1):      μ=1,   σ²=1,      moderate

The Central Limit Theorem simulation lets you choose from five source distributions — flat, exponential, bimodal, Poisson, and right-skewed — and watch histograms of sample means accumulate in real time. Drag the sample size slider from n = 1 to n = 200 and see the convergence unfold: a bimodal distribution's sample means look normally distributed by n = 30, while a heavy-tailed distribution needs n = 100 to converge cleanly.

Why does this matter? The t-test, ANOVA, Pearson correlation, and most classical statistics rely on sample means. The CLT is the foundational justification for applying these tests to non-normal data — as long as your sample size is large enough, the sampling distribution of the mean is approximately normal regardless of the population shape.

Layer 2: Updating Beliefs — Bayesian Inference

Bayesian Inference Simulation

Classical statistics asks: given a hypothesis, what is the probability of the observed data? Bayesian statistics asks the inverse: given the data, what is the probability the hypothesis is true? These sound like the same question, but they give different answers, and confusing them is the source of many errors in scientific literature.

Bayes' theorem provides the update rule: multiply the prior probability by the likelihood of the data, then normalise. Each piece of evidence reshapes the probability distribution over hypotheses, pulling it toward consistent explanations. Start with complete uncertainty and after enough observations, even a vague prior concentrates sharply on the true value.

Bayes' Theorem — Statement & Medical Testing Example

Bayes' theorem:
  P(H | E) = P(E | H) · P(H) / P(E)

  H = hypothesis (e.g. patient has disease)
  E = evidence  (e.g. test result is positive)
  P(H)       = prior probability of H
  P(E | H)   = likelihood: P(positive test | disease)
  P(E)       = marginal likelihood (normalisation constant)
  P(H | E)   = posterior probability of H given E

Law of total probability:
  P(E) = P(E|H)·P(H) + P(E|¬H)·P(¬H)

Medical test example (in simulation):
  Disease prevalence:  P(H)       = 0.01   (1 in 100)
  Test sensitivity:    P(E|H)     = 0.99   (true positive rate)
  Test specificity:    P(¬E|¬H)   = 0.95   (true negative rate)
  → False positive rate: P(E|¬H)  = 0.05

  P(H|E) = 0.99 × 0.01 / [0.99×0.01 + 0.05×0.99]
         = 0.0099 / (0.0099 + 0.0495)
         = 0.0099 / 0.0594
         ≈ 0.167  ← only 16.7% chance of disease!

Why this is non-intuitive:
  False positives dominate when prevalence is low.
  The test "looks" 99% accurate but the posterior is only 17%.
  This is called the base-rate neglect fallacy.

The Bayesian Inference simulation visualises prior, likelihood, and posterior distributions as draggable curves. The medical test scenario illustrates why even a 99%-accurate test has a low positive predictive value when the condition is rare — a result that surprises students and clinicians alike. Flip the coin test to watch the posterior sharpen from 50/50 uncertainty toward the true bias after each observed flip.

Prosecutor's fallacy: P(evidence | innocent) is not the same as P(innocent | evidence). DNA found at a crime scene may have a 1-in-a-million random-match probability — but that only becomes the probability of innocence after accounting for how many people were actually in the suspect pool.

Layer 3: Counterintuitive Probability — The Birthday Paradox

Birthday Paradox Simulation

How many people do you need in a room before two of them share a birthday? Most people guess somewhere around 180 — roughly half of 365. The correct answer is 23. This dramatic under-estimation is the birthday paradox, and it reveals how poorly human intuition handles the combinatorial explosion of pairwise comparisons.

With 23 people there are C(23,2) = 253 distinct pairs. Each pair independently has a 1/365 chance of sharing a birthday. The probability that none of the 253 pairs share a birthday is approximately e⁻²⁵³/³⁶⁵ ≈ 0.50. For 30 people it rises to 70%, for 50 people to 97%, and for 70 people to 99.9%.

Birthday Problem — Exact Formula & Approximation

Exact probability that at least two people share a birthday:
  n = number of people,  d = 365 days

  P(match | n) = 1 − P(no match | n)
  P(no match | n) = d/d · (d−1)/d · (d−2)/d · ... · (d−n+1)/d
                  = d! / [(d−n)! · dⁿ]

Stirling-based approximation (n ≪ d):
  P(no match) ≈ exp(−n(n−1) / (2d))
  P(match)    ≈ 1 − exp(−n²/(2d))  for n ≪ d

Key values:
  n = 23 → P(match) ≈ 50.7%
  n = 30 → P(match) ≈ 70.6%
  n = 50 → P(match) ≈ 97.0%
  n = 70 → P(match) ≈ 99.9%

Generalisation — finding a collision in a set of d items:
  Expected number of samples before first collision ≈ √(π·d/2)
  For d = 365: √(π·365/2) ≈ 23.9  (the "birthday bound")

Cryptographic relevance:
  A hash function producing n-bit digests has d = 2ⁿ possible outputs.
  A collision is expected after ≈ 2^(n/2) hashes.
  SHA-256 (n=256): collision after ~2¹²⁸ hashes — computationally infeasible.

The Birthday Paradox simulation runs both an exact analytical curve and a live Monte Carlo experiment. The room view shows 365 slots around a ring, filling as people enter and flashing red when a collision is found. Run 10 000 trials instantly with the batch button and watch the empirical frequency settle exactly on the theoretical curve.

Layer 4: Fitting Lines — Linear Regression

Ordinary Least Squares Simulation

Given a scatter of data points, linear regression finds the line that minimises the sum of squared vertical residuals — the distances from each point to the fitted line. This Ordinary Least Squares (OLS) criterion has a closed-form solution, making it the most widely used statistical method in science, economics, and machine learning.

R², the coefficient of determination, measures how much variance in y is explained by x: R² = 1.0 means a perfect fit; R² = 0 means the regression line does no better than guessing the mean. Pearson's r is the square root of R² (signed), measuring the direction and strength of the linear relationship. An influential outlier can shift the regression line dramatically — a live demonstration is worth a thousand words.

Ordinary Least Squares — Closed-Form Solution

OLS objective:
  Minimise SSE = Σᵢ (yᵢ − ŷᵢ)²  where ŷᵢ = a + b·xᵢ

Closed-form solution (simple linear regression):
  b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
    = Cov(x, y) / Var(x)

  a = ȳ − b·x̄  (intercept through centroid)

  Equivalently in matrix form (multiple regression):
    β̂ = (XᵀX)⁻¹ Xᵀy
    X = [1, x₁; 1, x₂; ...; 1, xₙ]  (design matrix)

Goodness-of-fit metrics:
  SST = Σ(yᵢ − ȳ)²           (total sum of squares)
  SSR = Σ(ŷᵢ − ȳ)²           (regression sum of squares)
  SSE = Σ(yᵢ − ŷᵢ)²          (residual sum of squares)
  R²  = 1 − SSE/SST = SSR/SST  ∈ [0, 1]

Pearson correlation:
  r = Cov(x,y) / [σₓ · σᵧ]  ∈ [−1, 1]
  b = r · σᵧ / σₓ            (slope ↔ correlation relationship)

Gauss-Markov assumptions for BLUE estimators:
  1. Linearity: y = Xβ + ε
  2. Strict exogeneity: E[ε | X] = 0
  3. Homoskedasticity: Var(ε | X) = σ²I
  4. No perfect multicollinearity: rank(X) = p

The Linear Regression simulation turns your mouse into a data editor: click to add points, drag to move them, and watch the OLS line, R² score, Pearson r, and residual squares update in real time. Drag one outlier point far from the cluster and watch R² collapse — making vivid the fragility of correlation to influential observations. Five presets demonstrate perfect fit, zero correlation, quadratic patterns (which OLS cannot capture), and classic outlier scenarios.

Correlation ≠ causation: OLS measures linear association, not cause-and-effect. Ice cream sales and drowning deaths are correlated (both increase in summer), but eating ice cream does not cause drowning. Always consider potential confounders before interpreting a regression slope as causal.

Layer 5: Random Walks with Memory — Markov Chains

Markov Chain Simulation

A Markov chain is a stochastic process where the future state depends only on the present state — not on how you arrived there. This "memoryless" Markov property is surprisingly powerful: it underlies Google's PageRank, financial option pricing, speech recognition, protein folding models, and Monte Carlo integration methods.

Given a transition matrix P where P[i][j] is the probability of moving from state i to state j, a Markov chain has a unique stationary distribution π satisfying πP = π. Starting from any initial state, repeated transitions converge to π — a mathematical guarantee with profound implications for simulation, optimisation, and network analysis.

Markov Chains — Transition Matrix & Stationary Distribution

Transition matrix P (n×n, row-stochastic):
  P[i][j] ≥ 0,   Σⱼ P[i][j] = 1  for all i

State probability vector at step t:
  π(t) = π(0) · P^t

Stationary (steady-state) distribution π*:
  π* · P = π*    (left eigenvector for eigenvalue 1)
  Σᵢ π*ᵢ = 1

Power iteration convergence (for ergodic chains):
  π(t) → π*  as t → ∞
  Rate governed by second-largest eigenvalue λ₂:
    convergence time ∝ 1 / (1 − |λ₂|)

Weather preset (in simulation):
  States: Sunny, Cloudy, Rainy
  P = [0.70  0.20  0.10]   (from Sunny)
      [0.30  0.40  0.30]   (from Cloudy)
      [0.20  0.30  0.50]   (from Rainy)
  Stationary: π* ≈ [0.46, 0.28, 0.26]

PageRank preset:
  d = 0.85 (damping factor)
  P_PR = d · A_norm + (1−d) · (1/n) · 1·1ᵀ
  Stationary π* = Page importance scores

The Markov Chains simulation draws animated transition arrows between states, with arc widths proportional to transition probabilities. An animated walker hops between states at configurable speed, while bar charts compare the empirical visit frequency against the theoretical stationary distribution — converging on π* before your eyes. Five presets cover weather transitions, PageRank, gambler's ruin, genetics, and a random four-state chain.

Layer 6: Physics of Probability — Maxwell-Boltzmann Distribution

Maxwell-Boltzmann Distribution Simulation

Statistical mechanics is where physics and probability meet. A gas containers holds ~10²³ molecules, each moving at a different speed. No deterministic description is feasible, but statistical description is exact: the fraction of molecules with speeds in [v, v+dv] follows the Maxwell-Boltzmann distribution, derived by Maxwell (1860) using only three assumptions: isotropy, independence of velocity components, and energy conservation.

The result reveals that the distribution of molecular speeds is not peaked at zero — most molecules are moving significantly faster than the thermal speed √(k_BT/m). At higher temperatures, the peak shifts right and the distribution broadens; at lower temperatures, it narrows. This has direct consequences for chemical reaction rates (Arrhenius), the escape of atmosphere from planets, and the operation of gas lasers.

Maxwell-Boltzmann Speed Distribution

Probability density function:
  f(v) = 4π · (m / 2πk_BT)^(3/2) · v² · exp(−mv² / 2k_BT)

  where:
    m   = molecular mass [kg]
    k_B = Boltzmann constant = 1.380649 × 10⁻²³ J/K
    T   = absolute temperature [K]

Characteristic speeds:
  Most probable speed:  v_p  = √(2k_BT / m)
  Mean speed:             = √(8k_BT / πm)  = v_p · √(4/π)
  RMS speed:            v_rms = √(3k_BT / m)  = v_p · √(3/2)

Ordering: v_p <  < v_rms  (ratio ≈ 1 : 1.128 : 1.225)

Normalisation (sanity check):
  ∫₀^∞ f(v) dv = 1  ✓

Temperature scaling:
  v_p ∝ √T   — doubling temperature increases most probable speed by √2
  v_p ∝ 1/√m — nitrogen (28 u) moves ~3.4× slower than hydrogen (2 u) at same T

Atmospheric escape (Jeans escape):
  Escape velocity from Earth: v_esc ≈ 11.2 km/s
  High-energy tail f(v) determines loss rate
  Hydrogen (m=2u) escapes; nitrogen (m=28u) does not

The Maxwell-Boltzmann simulation models a gas of hard-sphere particles colliding elastically in a 2D box. The real-time histogram of particle speeds updates with each collision, converging on the MB curve as the simulation runs. Adjust the temperature slider and watch both the theoretical curve and the particle velocities respond — making viscerally clear how temperature is just the average kinetic energy of chaotic motion.

Connection to the CLT: The Maxwell-Boltzmann distribution is itself a consequence of the CLT applied to many small momentum transfers. The velocity components vₓ, vy, vz each follow a normal distribution N(0, k_BT/m) — because each is the sum of many tiny impulses from collisions. The speed |v| is then the magnitude of a 3D Gaussian vector, giving the χ distribution which reduces to Maxwell-Boltzmann.

Layer 7: Resampling Without Assumptions — Bootstrap

Bootstrap Resampling Simulation

Classical statistics assumes you know the population distribution — then derives formulas for confidence intervals and p-values under that assumption. But what if you cannot assume normality? What if you have only 15 data points and want a confidence interval for the median, or a complex statistic like the Gini coefficient?

Bootstrap resampling, introduced by Bradley Efron in 1979, sidesteps distributional assumptions entirely. The idea is simple: treat your sample as a stand-in for the population and draw thousands of new samples with replacement from it. The distribution of your statistic across these re-samples approximates its true sampling distribution — no formulas required.

Bootstrap Confidence Intervals — Percentile Method

Algorithm (nonparametric bootstrap):
  1. Given original data: x = {x₁, x₂, ..., xₙ}
  2. For b = 1 to B (B = 10 000 typical):
       Sample x*_b = {x*₁, ..., x*ₙ} with replacement from x
       Compute statistic θ̂*_b = f(x*_b)  (e.g. mean, median, R²)
  3. Sort {θ̂*_1, ..., θ̂*_B}
  4. 95% CI = [θ̂*_{0.025·B}, θ̂*_{0.975·B}]  (percentile method)

Bias-corrected and accelerated (BCa) interval:
  Adjusts for bias and skewness in the bootstrap distribution
  Preferred over simple percentile method for small samples

Why it works (Efron's insight):
  The empirical distribution F̂ₙ converges to true F as n → ∞
  Bootstrap world:  F̂ₙ → F̂ₙ  (known exactly)
  Real world:       F̂ₙ → F    (unknown target)
  The two sampling variabilities are asymptotically equal

Key advantages:
  • Works for any statistic (mean, median, correlation, Gini, etc.)
  • No normality assumption
  • Handles complex sampling designs
  • Confidence intervals for ML model performance (cross-val)

The Bootstrap Resampling simulation visualises each of the 1 000+ resamples as a small dot on a dot plot, building up the bootstrap distribution live. A draggable outlier slider lets you inject a single extreme value into the original sample and watch confidence intervals widen in real time — demonstrating graphically how robust bootstrap CIs are compared to t-intervals when assumptions are violated.

All Seven Simulations at a Glance

Probability · Statistics

Central Limit Theorem

Five source distributions, live histogram accumulation, normal overlay, σ/√n standard error.

Bayesian · Statistics

Bayesian Inference

Prior × likelihood → posterior, medical test scenario, coin-flip sequential updating.

Probability · Combinatorics

Birthday Paradox

Exact P(n) curve, Monte Carlo ring view, batch mode 10 000 trials, collision flash.

Statistics · ML

Linear Regression (OLS)

Drag-and-drop scatter plot, live R²/r/slope/SSE, residual lines, five presets.

Probability · Networks

Markov Chains

Animated transition graph, random walker, empirical vs theoretical stationary distribution.

Statistical Mechanics

Maxwell-Boltzmann Gas

Hard-sphere particle collisions, live speed histogram, v_p / ⟨v⟩ / v_rms markers.

Statistics · Inference

Bootstrap Resampling

Nonparametric CIs, live resample dot plot, outlier injection, BCa vs percentile comparison.

The Deeper Connection

These seven simulations are not isolated topics — they form a coherent intellectual chain. The CLT explains why normal distributions appear everywhere, which justifies the t-test and OLS regression. Bayesian inference provides the framework for updating any probability estimate, including the parameters of a regression model. Markov chains underlie MCMC (Markov Chain Monte Carlo), a technique used to sample from complex Bayesian posteriors. The birthday paradox is a special case of the collision probability in hash tables and cryptocurrency mining. And the Maxwell-Boltzmann distribution is the CLT applied to 10²³ gas molecules — the same mathematics that explains sample means also governs thermodynamics.

Statistics is not a collection of formulas to memorise but a way of thinking about uncertainty, evidence, and inference. Each simulation here builds one room in that mental architecture — and together they form a home for quantitative reasoning.