What Is Machine Learning?
Machine learning is the science of getting computers to learn from data without being explicitly programmed. Understanding the three main paradigms (supervised, unsupervised, and reinforcement learning) plus the bias–variance tradeoff gives you a solid framework for reasoning about most ML systems in production.
What Machine Learning Actually Does
Classical software takes explicit rules as input and produces answers. Machine learning flips that: it takes examples (inputs + correct answers) and produces rules (a model that can answer new questions).
Classical software: rules + data → answers
ML: data + answers → rules (model)
Technically, ML searches for a function f such that f(x) ≈ y on the training pairs (x, y) and that also generalises to unseen values of x.
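As a minimal sketch of "finding f", the snippet below fits a straight line to a handful of (x, y) pairs by ordinary least squares and then applies it to an input it has never seen. The data values and the names `fit_line`, `train_x`, and `train_y` are illustrative assumptions, not part of any particular library.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimising the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training pairs that roughly follow y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.1, 2.9, 5.2, 6.8, 9.1]

slope, intercept = fit_line(train_x, train_y)


def f(x):
    """The learned rule: a line with the fitted slope and intercept."""
    return slope * x + intercept


# x = 10 was never in the training data; the model extrapolates to it.
prediction = f(10.0)
print(slope, intercept, prediction)  # roughly 1.99, 1.04, 20.94
```

The "rules" here are just two numbers, the slope and the intercept, which the program derived from examples instead of having them written in by hand.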
Supervised Learning
The most common paradigm. Every training example has a labelled answer. The algorithm minimises the difference between its predictions and those labels.
Regression
Predict a continuous number. House prices, temperature, stock returns.
Classification
Predict a category. Spam/not spam, cat/dog, disease/no disease.
Ranking
Order items by relevance. Search results, recommendation feeds.
Common algorithms: linear/logistic regression, decision trees, random forests, gradient-boosted trees (XGBoost), support vector machines, and neural networks.
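To make "minimises the difference between its predictions and those labels" concrete, here is a small sketch of logistic regression trained with plain gradient descent on a one-feature classification task. The dataset (hours studied versus pass/fail), the learning rate, and the epoch count are assumptions chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, labels, learning_rate=0.3, epochs=5000):
    """Fit weight w and bias b by gradient descent on the mean log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, labels):
            error = sigmoid(w * x + b) - y  # prediction minus label
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


# Hypothetical labelled examples: hours studied -> failed (0) or passed (1).
hours = [0.5, 1.0, 1.5, 2.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic(hours, passed)


def predict(x):
    """Classify as 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


print([predict(h) for h in hours])
```

The same loop structure (predict, measure the error against the label, nudge the parameters) underlies most of the algorithms listed above, although production libraries use far more efficient optimisers.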
Unsupervised Learning
No labels — the algorithm must find structure in the data by itself. It groups similar examples, compresses representations, or detects anomalies without being told the "right answer".
- Clustering (K-means, DBSCAN): Group customers by buying behaviour; identify cell types in single-cell RNA-seq data.
- Dimensionality reduction (PCA, t-SNE, UMAP): Compress 1000 features into 2D for visualisation.
- Generative models (VAEs, GANs, diffusion): Learn the distribution of the data and sample new examples — the basis of image synthesis and language models.
- Anomaly detection: A model learns what "normal" looks like; deviations flag fraud or equipment failure.
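The clustering idea can be sketched in a few lines. The following is a bare-bones version of K-means (alternating an assignment step and a centroid-update step) on made-up customer data. The starting centroids are passed in explicitly to keep the example deterministic; real implementations usually pick them at random or with a seeding heuristic.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    distances = [squared_distance(point, c) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, initial_centroids, iterations=10):
    centroids = list(initial_centroids)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest_centroid(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    labels = [nearest_centroid(point, centroids) for point in points]
    return centroids, labels


# Hypothetical customers: (visits per month, average basket size).
customers = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2), (8.0, 8.0), (8.5, 7.5), (9.0, 8.2)]

centroids, labels = kmeans(customers, initial_centroids=[customers[0], customers[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

No labels were supplied at any point: the two groups emerge purely from the distances between examples.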
Reinforcement Learning
An agent takes actions in an environment and receives rewards. The goal is to learn the policy (the action-selection strategy) that maximises cumulative reward over time.
Unlike in supervised learning, there are no (x, y) pairs: the agent must discover which actions lead to reward through trial and error, often with long delays between action and reward.
Applications: game-playing AI (AlphaGo, OpenAI Five), robot locomotion, data-centre cooling optimisation, RLHF (fine-tuning language models to be helpful and safe).
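A toy illustration of learning from delayed reward: the sketch below runs tabular Q-learning with epsilon-greedy exploration on a five-cell corridor, where only the final cell pays out. The environment, the hyperparameters, and all names are assumptions invented for this example; the systems mentioned above use far richer environments and function approximation.

```python
import random

N_STATES = 5        # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, +1)  # action index 0 = step left, index 1 = step right


def step(state, action_index):
    """Move within the corridor; reward 1.0 only on reaching the last cell."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon, or when both actions look equal.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action)
            # Target: immediate reward plus discounted value of the next state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
# Greedy policy for the non-terminal cells: 1 means "step right".
policy = [0 if left > right else 1 for left, right in q_table[:-1]]
print(policy)  # [1, 1, 1, 1]
```

The agent is never told that "right" is correct. The reward at the end of the corridor propagates backwards through the value estimates until every cell prefers stepping right.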
Bias–Variance Tradeoff
Under squared-error loss, a model's expected prediction error can be decomposed into three parts:
- Bias: Error from wrong assumptions in the model (e.g., fitting a line to quadratic data). A high-bias model underfits — it's too simple.
- Variance: Error from sensitivity to small fluctuations in training data. A high-variance model overfits — it memorises rather than generalising.
- Irreducible noise: Randomness in the data that no model can remove.
Increasing model complexity (more parameters, higher polynomial degree) lowers bias but raises variance. The art of ML is finding the sweet spot given the amount of available data.
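The tradeoff can be estimated empirically by retraining a model on many freshly sampled training sets and looking at how its prediction at one fixed point behaves. The sketch below does this for two deliberately extreme models: a constant predictor (very simple) and a 1-nearest-neighbour predictor (very flexible). These two models stand in for "low" and "high" complexity instead of polynomial degree so that the example needs only the standard library. The true function, noise level, and query point are assumptions.

```python
import random
import statistics


def true_function(x):
    return x * x


def sample_training_set(rng, n=20, noise_sd=0.1):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_function(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def constant_model(xs, ys, query):
    """Very simple model: always predict the mean training label."""
    return statistics.fmean(ys)


def nearest_neighbour_model(xs, ys, query):
    """Very flexible model: copy the label of the closest training point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - query))
    return ys[nearest]


def bias_and_variance(model, query=0.9, trials=2000, seed=0):
    """Estimate squared bias and variance of the model's prediction at query."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(trials):
        xs, ys = sample_training_set(rng)
        predictions.append(model(xs, ys, query))
    mean_prediction = statistics.fmean(predictions)
    bias_squared = (mean_prediction - true_function(query)) ** 2
    variance = statistics.pvariance(predictions, mu=mean_prediction)
    return bias_squared, variance


simple_bias2, simple_var = bias_and_variance(constant_model)
flexible_bias2, flexible_var = bias_and_variance(nearest_neighbour_model)
print("constant model:   bias^2 =", simple_bias2, " variance =", simple_var)
print("nearest neighbour: bias^2 =", flexible_bias2, " variance =", flexible_var)
```

In this setup the constant model should show the larger squared bias and the nearest-neighbour model the larger variance, which is the pattern described above.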
Overfitting and Regularisation
An overfitted model performs very well on training data but poorly on new examples — it has "memorised" rather than "learned".
Common fixes
- More data: The most reliable fix. Models overfit when training data is scarce relative to model size.
- L2 regularisation (weight decay): Adds a penalty proportional to the square of each weight to the loss: L = L_data + λΣwᵢ². Forces weights to stay small.
- Dropout: Randomly zero out neurons during training so no single neuron becomes indispensable. Reduces co-adaptation between neurons.
- Early stopping: Monitor validation loss during training and stop when it starts to rise.
- Data augmentation: Artificially expand the training set by transforming examples (flip images, add noise, back-translate text).
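Two of these fixes are easy to sketch in isolation. The first function below solves the L2-regularised least-squares problem for a single-weight model y = w·x, where setting the derivative of Σ(w·xᵢ − yᵢ)² + λw² to zero gives w = Σxᵢyᵢ / (Σxᵢ² + λ). The second picks the epoch an early-stopping rule would keep, given a list of validation losses. The data, the `patience` parameter, and the function names are illustrative assumptions.

```python
def ridge_weight(xs, ys, lam):
    """Weight minimising sum((w*x - y)^2) + lam * w^2 for the model y = w*x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def early_stopping_epoch(validation_losses, patience=2):
    """Return the epoch with the best validation loss seen before stopping.

    Training stops once `patience` epochs pass without a new best loss.
    """
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


# Hypothetical data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(ridge_weight(xs, ys, lam=0.0))   # 2.0: no penalty recovers the exact fit
print(ridge_weight(xs, ys, lam=14.0))  # 1.0: the penalty shrinks the weight

# Hypothetical validation losses: improvement stalls after epoch 2.
val_losses = [0.9, 0.7, 0.6, 0.62, 0.65, 0.5]
print(early_stopping_epoch(val_losses))  # 2
```

Note that early stopping never sees the later loss of 0.5 because training halts first. Choosing `patience` is itself a tradeoff between wasted compute and stopping too soon.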
How Models Generalise
Generalisation is the core mystery of ML. By classical theory, heavily overparameterised models, such as the 175-billion-parameter GPT-3 trained on roughly 300 billion tokens, should overfit catastrophically.
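In practice they often don't. Researchers describe the pattern as double descent: as model size grows past the interpolation threshold (the point at which the model can fit the training data exactly), test error falls again. Double descent names the observation rather than explaining it; proposed explanations include SGD's implicit bias toward flat minima and the structure present in real-world data.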
The practical upshot: if your data is large and diverse enough, bigger models often generalise better, not worse. This is counter-intuitive but now well-established empirically.
Choosing the Right Method
- Have labelled data, predict a number or class? → Supervised learning. Start with gradient-boosted trees (XGBoost/LightGBM) for tabular data, neural networks for images/text/audio.
- No labels, want to find structure? → Unsupervised learning. Cluster with K-means; visualise with UMAP; compress with an autoencoder.
- Sequential decisions with a reward signal? → Reinforcement learning. Much harder to train; use it only when the problem is inherently sequential.
- Very little labelled data? → Semi-supervised learning or fine-tuning a pre-trained model (transfer learning). Foundation models like GPT or CLIP can often be adapted to a new task with as few as several hundred labelled examples.