K-Means & DBSCAN — Two Philosophies of Clustering
Clustering is unsupervised learning: group data into meaningful subsets without labels. K-Means (MacQueen 1967) partitions by minimising within-cluster variance — simple, fast, but assumes spherical clusters of similar size. DBSCAN (Ester et al., 1996) defines clusters by density — discovers arbitrary shapes and explicitly labels noise. Knowing when to use each, how to tune their parameters, and how to evaluate results separates competent data analysis from cargo-cult machine learning.
1. K-Means Algorithm
K-Means minimises the within-cluster sum of squared distances (inertia) via Lloyd's EM-like iteration:
2. K-Means++ Initialisation
Random centroid initialisation often leads to poor local minima. K-Means++ selects initial centroids with a probability proportional to squared distance from already-chosen centroids:
3. Choosing K: Elbow and Silhouette
4. DBSCAN Algorithm
DBSCAN defines clusters as dense regions separated by sparse ones. Two parameters: ε (neighbourhood radius) and MinPts (density threshold):
5. Tuning ε and MinPts
6. Comparison and When to Use Each
K-Means ✓
When K is known, clusters are roughly spherical and similar size, data is numeric, speed matters (O(nK) per iter). Great for colour quantisation, customer segmentation.
K-Means ✗
Fails on elongated/ring clusters, unequal densities, data with outliers. Sensitive to scale — always normalise. Cannot identify noise points.
DBSCAN ✓
Arbitrary cluster shapes, automatic noise detection, no need to specify K. Works on geographic point clouds, anomaly detection, image segmentation with spatial coherence.
DBSCAN ✗
Struggles with variable-density clusters (HDBSCAN fixes this). High-dimensional data: ε-neighbourhoods become meaningless (curse of dimensionality).
7. JavaScript Implementation
// ── K-Means ────────────────────────────────────────
function kMeans(points, K, maxIter = 300) {
const n = points.length, d = points[0].length;
// K-Means++ init
const centroids = [points[Math.floor(Math.random()*n)].slice()];
while (centroids.length < K) {
const dists = points.map(p => Math.min(...centroids.map(c => dist2(p, c))));
const sum = dists.reduce((a, b) => a + b, 0);
let r = Math.random() * sum;
const idx = dists.findIndex(d => (r -= d) <= 0);
centroids.push(points[idx < 0 ? n-1 : idx].slice());
}
let labels = new Int32Array(n);
for (let iter = 0; iter < maxIter; iter++) {
// E step: assign
let changed = 0;
for (let i = 0; i < n; i++) {
let best = 0, bestD = Infinity;
for (let k = 0; k < K; k++) {
const dd = dist2(points[i], centroids[k]);
if (dd < bestD) { bestD = dd; best = k; }
}
if (labels[i] !== best) { labels[i] = best; changed++; }
}
if (!changed) break;
// M step: recompute centroids
for (let k = 0; k < K; k++) centroids[k].fill(0);
const counts = new Int32Array(K);
for (let i = 0; i < n; i++) {
for (let j = 0; j < d; j++) centroids[labels[i]][j] += points[i][j];
counts[labels[i]]++;
}
for (let k = 0; k < K; k++)
if (counts[k]) for (let j = 0; j < d; j++) centroids[k][j] /= counts[k];
}
return {labels, centroids};
}
// ── DBSCAN ────────────────────────────────────────
function dbscan(points, eps, minPts) {
const n = points.length;
const labels = new Int32Array(n).fill(-1); // -1 = unvisited
let cluster = 0;
const eps2 = eps * eps;
function neighbours(i) {
return points.reduce((acc, p, j) =>
(dist2(points[i], p) <= eps2 ? acc.push(j) : 0, acc), []);
}
for (let i = 0; i < n; i++) {
if (labels[i] !== -1) continue;
const nb = neighbours(i);
if (nb.length < minPts) { labels[i] = 0; continue; } // noise (0)
const c = ++cluster;
labels[i] = c;
const Q = [...nb];
while (Q.length) {
const q = Q.pop();
if (labels[q] === 0) labels[q] = c; // border point, reassign
if (labels[q] !== -1) continue;
labels[q] = c;
const qNb = neighbours(q);
if (qNb.length >= minPts) Q.push(...qNb);
}
}
return labels; // 0 = noise, 1…K = cluster ids
}
function dist2(a, b) {
let s = 0;
for (let i = 0; i < a.length; i++) s += (a[i]-b[i])**2;
return s;
}
8. HDBSCAN and Spectral Clustering
HDBSCAN
Hierarchical DBSCAN builds a cluster hierarchy by varying ε continuously. Extracts stable clusters of varying density. Works where DBSCAN's fixed ε fails. State of the art for general clustering.
Spectral Clustering
Build a similarity graph, compute graph Laplacian eigenvectors, then run K-Means in eigenspace. Discovers clusters of any shape. Expensive O(n³) but powerful for manifold data.
Gaussian Mixture Models
Soft K-Means: every point has fractional membership in each cluster. EM algorithm fits Gaussians. Explicitly models cluster shape and uncertainty; BIC selects K.
OPTICS
Like DBSCAN but produces a reachability plot (ordering) that encodes the hierarchical cluster structure, allowing multiple ε thresholds to be read off a single pass.