Article
Machine Learning · Algorithms · ⏱ ≈ 12 хв читання

K-Means & DBSCAN — Two Philosophies of Clustering

Clustering is unsupervised learning: group data into meaningful subsets without labels. K-Means (MacQueen 1967) partitions by minimising within-cluster variance — simple, fast, but assumes spherical clusters of similar size. DBSCAN (Ester et al., 1996) defines clusters by density — discovers arbitrary shapes and explicitly labels noise. Knowing when to use each, how to tune their parameters, and how to evaluate results separates competent data analysis from cargo-cult machine learning.

1. K-Means Algorithm

K-Means minimises the within-cluster sum of squared distances (inertia) via Lloyd's EM-like iteration:

Objective (minimise): J = Σₖ Σ_{x ∈ Cₖ} ||x − μₖ||² Lloyd's algorithm: INIT: choose K initial centroids μ₁ … μₖ (random or K-Means++) E step: assign each x to nearest centroid c(x) = argminₖ ||x − μₖ||² M step: recompute centroids μₖ = mean of all x assigned to Cₖ REPEAT until centroids stop moving (or max_iter) Convergence: guaranteed to converge (J never increases), but may converge to local minimum → run multiple times, keep best J. Complexity: O(n·K·d·iterations) n = data points, K = clusters, d = dimensions

2. K-Means++ Initialisation

Random centroid initialisation often leads to poor local minima. K-Means++ selects initial centroids with a probability proportional to squared distance from already-chosen centroids:

K-Means++ init: 1. Choose first centroid μ₁ uniformly at random 2. For i = 2 … K: D(x) = min_j ||x − μⱼ||² (distance to nearest chosen centroid) Choose next centroid from data with probability ∝ D(x)² 3. Proceed with standard Lloyd's iteration Theoretical guarantee: E[J_kmeans++] ≤ 8(ln K + 2) · J_optimal In practice: K-Means++ typically achieves inertia within 5–10% of the optimum on first attempt, much better than random init's ~50% gap.

3. Choosing K: Elbow and Silhouette

Inertia (elbow method): Run K-Means for K = 1, 2, … K_max Plot inertia vs K: look for "elbow" where gains diminish Problem: elbow is often ambiguous or non-existent Silhouette score (per-point): a(i) = mean intra-cluster distance for point i b(i) = mean distance to nearest OTHER cluster s(i) = (b(i) − a(i)) / max(a(i), b(i)) ∈ [−1, 1] s(i) ≈ 1 → well clustered s(i) ≈ 0 → on cluster boundary s(i) < 0 → misclassified Mean silhouette score: choose K that maximises it. Complexity: O(n²) per K → expensive for large datasets; use approximate version (sample-based) at scale.

4. DBSCAN Algorithm

DBSCAN defines clusters as dense regions separated by sparse ones. Two parameters: ε (neighbourhood radius) and MinPts (density threshold):

Definitions: ε-neighbourhood N_ε(p) = {q : dist(p,q) ≤ ε} Core point: |N_ε(p)| ≥ MinPts Border point: in N_ε of a core point, but not core itself Noise: not a core point and not in any core's N_ε Algorithm: FOR each unvisited point p: IF p is core point: expand cluster from p (BFS/DFS through density-connected points) ELSE: mark as noise (may be reassigned if found in N_ε of core) Density-reachability (transitive): p is density-reachable from q if there is a chain q → p₁ → p₂ → … → p where each step is core + within ε Complexity: O(n · log n) with spatial index, O(n²) naive

5. Tuning ε and MinPts

MinPts rule of thumb: MinPts ≥ d + 1 (d = dimensions) MinPts = 2 → minimum spanning tree (too loose) Common choice: MinPts = 2·d ε selection via k-NN distance plot: 1. Compute k-th nearest-neighbour distance for every point (k=MinPts) 2. Sort distances and plot 3. Choose ε at the "knee" of the sorted distance curve → separates dense regions from sparse Effect of ε: ε too small → most points noise, many micro-clusters ε too large → all points in one cluster Effect of MinPts: MinPts too small → noisy, irregular cluster shapes MinPts too large → only very dense regions qualify as clusters

6. Comparison and When to Use Each

K-Means ✓

When K is known, clusters are roughly spherical and similar size, data is numeric, speed matters (O(nK) per iter). Great for colour quantisation, customer segmentation.

K-Means ✗

Fails on elongated/ring clusters, unequal densities, data with outliers. Sensitive to scale — always normalise. Cannot identify noise points.

DBSCAN ✓

Arbitrary cluster shapes, automatic noise detection, no need to specify K. Works on geographic point clouds, anomaly detection, image segmentation with spatial coherence.

DBSCAN ✗

Struggles with variable-density clusters (HDBSCAN fixes this). High-dimensional data: ε-neighbourhoods become meaningless (curse of dimensionality).

7. JavaScript Implementation

// ── K-Means ────────────────────────────────────────
function kMeans(points, K, maxIter = 300) {
  const n = points.length, d = points[0].length;
  // K-Means++ init
  const centroids = [points[Math.floor(Math.random()*n)].slice()];
  while (centroids.length < K) {
    const dists = points.map(p => Math.min(...centroids.map(c => dist2(p, c))));
    const sum = dists.reduce((a, b) => a + b, 0);
    let r = Math.random() * sum;
    const idx = dists.findIndex(d => (r -= d) <= 0);
    centroids.push(points[idx < 0 ? n-1 : idx].slice());
  }

  let labels = new Int32Array(n);
  for (let iter = 0; iter < maxIter; iter++) {
    // E step: assign
    let changed = 0;
    for (let i = 0; i < n; i++) {
      let best = 0, bestD = Infinity;
      for (let k = 0; k < K; k++) {
        const dd = dist2(points[i], centroids[k]);
        if (dd < bestD) { bestD = dd; best = k; }
      }
      if (labels[i] !== best) { labels[i] = best; changed++; }
    }
    if (!changed) break;
    // M step: recompute centroids
    for (let k = 0; k < K; k++) centroids[k].fill(0);
    const counts = new Int32Array(K);
    for (let i = 0; i < n; i++) {
      for (let j = 0; j < d; j++) centroids[labels[i]][j] += points[i][j];
      counts[labels[i]]++;
    }
    for (let k = 0; k < K; k++)
      if (counts[k]) for (let j = 0; j < d; j++) centroids[k][j] /= counts[k];
  }
  return {labels, centroids};
}

// ── DBSCAN ────────────────────────────────────────
function dbscan(points, eps, minPts) {
  const n = points.length;
  const labels = new Int32Array(n).fill(-1); // -1 = unvisited
  let cluster = 0;
  const eps2 = eps * eps;

  function neighbours(i) {
    return points.reduce((acc, p, j) =>
      (dist2(points[i], p) <= eps2 ? acc.push(j) : 0, acc), []);
  }

  for (let i = 0; i < n; i++) {
    if (labels[i] !== -1) continue;
    const nb = neighbours(i);
    if (nb.length < minPts) { labels[i] = 0; continue; } // noise (0)
    const c = ++cluster;
    labels[i] = c;
    const Q = [...nb];
    while (Q.length) {
      const q = Q.pop();
      if (labels[q] === 0) labels[q] = c; // border point, reassign
      if (labels[q] !== -1) continue;
      labels[q] = c;
      const qNb = neighbours(q);
      if (qNb.length >= minPts) Q.push(...qNb);
    }
  }
  return labels; // 0 = noise, 1…K = cluster ids
}

function dist2(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += (a[i]-b[i])**2;
  return s;
}

8. HDBSCAN and Spectral Clustering

HDBSCAN

Hierarchical DBSCAN builds a cluster hierarchy by varying ε continuously. Extracts stable clusters of varying density. Works where DBSCAN's fixed ε fails. State of the art for general clustering.

Spectral Clustering

Build a similarity graph, compute graph Laplacian eigenvectors, then run K-Means in eigenspace. Discovers clusters of any shape. Expensive O(n³) but powerful for manifold data.

Gaussian Mixture Models

Soft K-Means: every point has fractional membership in each cluster. EM algorithm fits Gaussians. Explicitly models cluster shape and uncertainty; BIC selects K.

OPTICS

Like DBSCAN but produces a reachability plot (ordering) that encodes the hierarchical cluster structure, allowing multiple ε thresholds to be read off a single pass.