Computer Vision Basics: From Pixels to Understanding

Computer vision, the task of teaching machines to interpret visual information, is one of the most successful applications of deep learning. From digit recognition to real-time object detection in self-driving cars, convolutional neural networks (CNNs) now match or exceed human accuracy on several benchmark visual tasks. Understanding the mathematical substrate reveals why these systems are so powerful and where they still fail.

1. Image Representation and Preprocessing

A digital image is a 2D array (or a 3D tensor for colour) of integer pixel values. For an RGB image of height H and width W, the tensor has shape [H × W × 3], and each channel stores values 0–255 (8 bits per channel).

Colour spaces:
  RGB: (R, G, B), additive colour model, device-specific
  HSV: (Hue 0–360°, Saturation 0–1, Value 0–1), closer to human perception; useful for colour-based segmentation and tracking
  Grayscale: I = 0.299R + 0.587G + 0.114B (luminance weighting)
  YCbCr: used in JPEG compression; Y = luma, Cb/Cr = chrominance channels

Normalization (standard practice before neural network input):
  x_norm = (x − μ_channel) / σ_channel (per-channel normalization)
  ImageNet statistics, for pixel values first scaled to [0, 1]: μ = [0.485, 0.456, 0.406], σ = [0.229, 0.224, 0.225]
  Without normalization, gradients can become large and training unstable.

Image augmentation (training data expansion):
  Random horizontal flip, random crop to 224×224, colour jitter, rotation, random erasing. These artificially expand the training set and improve generalisation.
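The preprocessing steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production pipeline: the function names are hypothetical, the toy image is random data, and the ImageNet statistics are assumed to apply to pixel values that have already been scaled to [0, 1].

```python
import numpy as np

# ImageNet channel statistics quoted in the text; assumed to apply to values in [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_grayscale(rgb_uint8):
    """Luminance-weighted grayscale: I = 0.299R + 0.587G + 0.114B."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return rgb_uint8.astype(np.float32) @ weights  # shape [H, W]

def normalize(rgb_uint8):
    """Scale 0-255 integers to [0, 1], then standardise each channel."""
    x = rgb_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD  # broadcasts over [H, W, 3]

def random_horizontal_flip(image, rng, p=0.5):
    """Flip left-right with probability p (one simple augmentation)."""
    return image[:, ::-1] if rng.random() < p else image

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)  # toy stand-in image
gray = to_grayscale(image)
x_norm = normalize(image)
flipped = random_horizontal_flip(image, rng, p=1.0)
```

In a training loop the flip would be applied with a fresh random draw per sample, so the network rarely sees exactly the same input twice.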

2. Convolutions and Filters

A convolution passes a small kernel (filter) over the image, computing a weighted sum at each location. This operation is the core building block of CNNs:

Discrete 2D convolution (in the form used by CNNs; strictly a cross-correlation, since the kernel is not flipped):
  (I * K)[i,j] = Σ_m Σ_n I[i+m, j+n] · K[m,n]

3×3 Gaussian blur kernel:
  K = (1/16) × [[1,2,1], [2,4,2], [1,2,1]]  → smooths noise (low-pass filter)

Sobel edge detector (horizontal gradient):
  K_x = [[-1, 0, +1], [-2, 0, +2], [-1, 0, +1]]
  G_x, G_y: responses to K_x and to the vertical kernel K_y (the transpose of K_x)
  |∇I| = √(G_x² + G_y²)  → edge magnitude
  θ = arctan(G_y / G_x)  → edge direction

Output feature map size:
  Input: H × W × C_in
  Kernel: k×k, C_out filters, stride s, padding p
  Output: (⌊(H + 2p − k)/s⌋ + 1) × (⌊(W + 2p − k)/s⌋ + 1) × C_out
  For k=3, s=1, p=1 (same padding): output size = H × W (unchanged)
  For k=3, s=2, p=1 (strided): output size ≈ H/2 × W/2 (spatial halving)
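To make the sliding-window sum and the output-size formula concrete, here is a small NumPy sketch. It follows the I[i+m, j+n] · K[m,n] indexing given in the text (no kernel flip), uses explicit loops for clarity rather than speed, and assumes the vertical Sobel kernel is the transpose of K_x. The function names are illustrative.

```python
import numpy as np

def conv_output_size(n, k, s=1, p=0):
    """Output length along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

def correlate2d(image, kernel, stride=1, padding=0):
    """Sliding weighted sum with the I[i+m, j+n] * K[m, n] indexing from the text."""
    image = np.pad(image, padding)  # zero padding on all sides
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)
    return out

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T  # assumption: vertical kernel is the transpose of K_x
GAUSS = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=np.float64) / 16.0

# Toy image: dark left half, bright right half, i.e. a single vertical edge.
img = np.zeros((6, 6))
img[:, 3:] = 1.0

gx = correlate2d(img, SOBEL_X)
gy = correlate2d(img, SOBEL_Y)
magnitude = np.sqrt(gx**2 + gy**2)
direction = np.arctan2(gy, gx)  # arctan2 avoids division by zero when gx == 0
blurred = correlate2d(img, GAUSS, padding=1)
```

On this toy image the horizontal response is non-zero only in the two output columns that straddle the edge, and the vertical response is zero everywhere, which is what an edge detector should report for a purely vertical edge.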

3. CNN Architecture

A CNN stacks multiple types of layers to progressively learn higher-level features:

Convolutional layer: Learns kernels (weights) that activate on specific patterns. Early layers detect edges and textures; deeper layers detect object parts and complete objects.
Activation function: ReLU (Rectified Linear Unit): f(x) = max(0, x). Introduces non-linearity. Leaky ReLU and GELU are also used in modern networks.
Batch Normalisation: Normalises activations over each mini-batch (per channel) to mean 0 and standard deviation 1, then scales and shifts them with learnable parameters γ and β. Stabilises training and allows higher learning rates.
Pooling: Max pooling subsamples feature maps: it reduces spatial dimensions, enlarges the receptive field of later layers, and provides some translation invariance. A 2×2 max pool with stride 2 halves both dimensions.
Fully connected (FC) layer: Final layers flatten the feature volume and learn global combinations for classification.
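The non-convolutional layers above are simple enough to write out directly. The following NumPy sketch shows ReLU, training-mode batch normalisation, and 2×2 max pooling on a batch laid out as [N, H, W, C]. The layout, the function names, and the omission of running statistics for inference are simplifying assumptions.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm over a [N, H, W, C] batch, one mean/variance per channel."""
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta  # learnable scale and shift

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on [N, H, W, C]; H and W must be even."""
    n, h, w, c = x.shape
    return x.reshape(n, h // 2, 2, w // 2, 2, c).max(axis=(2, 4))

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 4, 16))  # toy activations
gamma = np.ones(16)
beta = np.zeros(16)
normed = batch_norm(x, gamma, beta)
y = max_pool_2x2(relu(normed))  # spatial size 4x4 -> 2x2
```

With γ = 1 and β = 0 the normalised activations have per-channel mean close to 0 and standard deviation close to 1; during training the network is free to learn other values.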

VGG-16 architecture (2014):
  Input: 224×224×3
  Block 1: 2× [Conv 3×3, 64] + MaxPool 2×2 → 112×112×64
  Block 2: 2× [Conv 3×3, 128] + MaxPool → 56×56×128
  Block 3: 3× [Conv 3×3, 256] + MaxPool → 28×28×256
  Block 4: 3× [Conv 3×3, 512] + MaxPool → 14×14×512
  Block 5: 3× [Conv 3×3, 512] + MaxPool → 7×7×512
  FC 4096 → FC 4096 → FC 1000 (softmax) → class probabilities
  Total parameters: ~138M

ResNet insight (2015, won ILSVRC):
  Skip connections: x → Conv → BN → ReLU → Conv → BN → (+x) → ReLU
  Allows training networks of 50–152+ layers: the skip connections give gradients a direct path to earlier layers, which mitigates vanishing gradients.
  Key: the network learns the residual F(x) = H(x) − x rather than H(x) directly.
  Top-5 ImageNet error: 3.57% (below the ~5.1% estimated for a human annotator)
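The VGG-16 shapes and the ~138M parameter figure can be checked with plain arithmetic. The sketch below walks the five blocks and three fully connected layers, assuming 3×3 convolutions with padding 1 and a bias term on every layer. The constant and function names are illustrative.

```python
# (number of 3x3 conv layers, output channels) for the five VGG-16 blocks
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
FC_SIZES = [4096, 4096, 1000]

def vgg16_summary(input_size=224, in_channels=3):
    """Return the feature-map shape after each block and the total parameter count."""
    size, channels, params = input_size, in_channels, 0
    shapes = []
    for n_convs, out_channels in VGG16_BLOCKS:
        for _ in range(n_convs):
            # 3x3 conv with padding 1 keeps the spatial size; count weights plus biases
            params += 3 * 3 * channels * out_channels + out_channels
            channels = out_channels
        size //= 2  # 2x2 max pool with stride 2 halves height and width
        shapes.append((size, size, channels))
    features = size * size * channels  # flattened feature volume fed to the first FC layer
    for width in FC_SIZES:
        params += features * width + width
        features = width
    return shapes, params

shapes, total_params = vgg16_summary()
print(shapes[-1], total_params)
```

Most of the total sits in the first fully connected layer (7·7·512 inputs by 4096 outputs), which is one reason later architectures replaced large FC heads with global average pooling.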

4. Classical Feature Detection

Before deep learning, hand-crafted feature detectors dominated computer vision and remain relevant for lightweight applications and geometric tasks:

Harris Corner Detector (1988): Computes the structure tensor M of image gradients over a local window. At a corner, both eigenvalues of M are large; along an edge, only one is. Decision: R = det(M) − k·trace(M)², where k is an empirical constant (typically 0.04–0.06). R > threshold → corner.
HOG (Histogram of Oriented Gradients, 2005): Divide image into cells, compute gradient orientation histogram per cell, normalise across overlapping blocks. Introduced by Dalal & Triggs for pedestrian detection, paired with a linear SVM. Still used as a lightweight feature input to SVMs.
SIFT (Scale-Invariant Feature Transform, 1999/2004): Detects keypoints in a scale space (Difference of Gaussians) and computes a 128-dimensional descriptor that is invariant to scale and rotation and robust to moderate illumination changes. Widely used in image stitching, panoramas, and 3D reconstruction (COLMAP).
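As one worked instance of the detectors above, the Harris response R = det(M) − k·trace(M)² can be sketched in NumPy. This version makes simplifying assumptions: central-difference gradients via np.gradient, an unweighted 3×3 summation window instead of a Gaussian, and k = 0.05. The helper names are hypothetical.

```python
import numpy as np

def box_sum_3x3(a):
    """Sum over each 3x3 neighbourhood (zero padded); a crude stand-in for a Gaussian window."""
    p = np.pad(a, 1)
    h, w = a.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3))

def harris_response(image, k=0.05):
    """R = det(M) - k * trace(M)^2, with M the windowed structure tensor."""
    iy, ix = np.gradient(image.astype(np.float64))  # derivatives along rows, then columns
    sxx = box_sum_3x3(ix * ix)
    syy = box_sum_3x3(iy * iy)
    sxy = box_sum_3x3(ix * iy)
    det = sxx * syy - sxy ** 2
    trace = sxx + syy
    return det - k * trace ** 2

# Toy image: a bright square on a dark background.
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0
R = harris_response(img)
```

On this image R is positive at the square's corners, negative along the middle of its edges (one dominant gradient direction), and zero in flat regions, matching the eigenvalue argument in the text.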

5. Object Detection: YOLO and R-CNN

Object detection requires both classifying objects and localising them with bounding boxes. Two major paradigms:

Two-stage: R-CNN family (Region-based CNN)
  1. A proposal stage generates candidate regions of interest (~2000 per image via selective search in the original R-CNN; a learned Region Proposal Network, RPN, in Faster R-CNN)
  2. Each RoI is feature-extracted and classified
  Faster R-CNN (2015): RPN shares the convolutional backbone with the detector head
  ~5 fps on GPU, ~70 mAP on PASCAL VOC
  Accurate but relatively slow for real-time use

One-stage: YOLO (You Only Look Once, Redmon et al. 2016)
  Single forward pass through the network.
  Image divided into an S×S grid (e.g. 7×7 in the original YOLO; 13×13 at the coarsest scale of YOLOv3 for 416×416 input)
  Each cell predicts B bounding boxes, each with a confidence score, plus C class probabilities.
  Output tensor: S×S×(B×5 + C), where 5 = {x, y, w, h, confidence}
  YOLOv8 (2023): ~50+ fps at 640×640 (depends on hardware and model size), up to ~53 mAP on COCO for the largest variant
  Anchor-free, C2f modules, NMS (Non-Maximum Suppression) post-processing

IoU (Intersection over Union):
  IoU = Area(A∩B) / Area(A∪B)
  Metric for bounding box quality: IoU > 0.5 is typically considered "correct"

mAP (mean Average Precision): area under the precision-recall curve, averaged over all classes (and, in the COCO protocol, over IoU thresholds from 0.5 to 0.95)
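IoU and non-maximum suppression are short enough to write out in pure Python. The sketch below assumes boxes in (x1, y1, x2, y2) corner format and uses the greedy form of NMS; the grid values S=7, B=2, C=20 in the usage lines are example settings (those of the original YOLO on PASCAL VOC), and the function names are illustrative.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap a kept one too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

def yolo_output_shape(s, b, c):
    """Shape of the S x S x (B*5 + C) prediction tensor."""
    return (s, s, b * 5 + c)

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)              # indices of surviving boxes
shape = yolo_output_shape(7, 2, 20)    # example grid settings
```

Here the first two boxes overlap with IoU 81/119 ≈ 0.68, so the lower-scoring one is suppressed, while the third box is disjoint from both and survives.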

6. Semantic and Instance Segmentation

Rather than bounding boxes, segmentation assigns a class label to every pixel:

Semantic segmentation: Every pixel is labelled with a class such as "sky", "road", or "person". It does not distinguish between different instances of the same class (all cars are labelled "car"). FCN (Fully Convolutional Network) and DeepLab (with dilated convolutions and CRF post-processing) are standard reference architectures.
Instance segmentation: Separate mask per object instance — each individual car gets its own mask. Mask R-CNN adds a mask prediction head to Faster R-CNN, producing binary segmentation masks for each detected instance at minimal additional cost.
Panoptic segmentation: Combines semantic (for background "stuff") and instance (for foreground "things") — a single unified labelling. State-of-the-art systems include Panoptic-FPN, DETR-based models.

Encoder–decoder (U-Net) architecture: Encodes image to a bottleneck representation (contracting path), then decodes back to full resolution (expanding path) with skip connections carrying high-resolution features from encoder. Originally designed for biomedical image segmentation with limited data. Skip connections are crucial: the decoder needs both semantic context (from deep layers) and spatial detail (from early layers) to place precise boundaries.

7. Modern Vision: Transformers and Beyond

Vision Transformers (ViT, 2020) apply the self-attention mechanism of NLP transformers directly to images:

The image is divided into patches of 16×16 pixels, each flattened and linearly embedded as a "token".
Self-attention computes pairwise token interactions — a global receptive field from the first layer, unlike CNNs that build up gradually.
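When pre-trained on large datasets (ImageNet-21k, JFT-300M, and later JFT-3B), ViT matches or outperforms CNNs at scale.
Hybrid models such as CvT combine convolutional locality bias with attention-based global context, while ConvNeXt is a purely convolutional network redesigned with transformer-style choices.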
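The patch-to-token step and the global attention pattern can be sketched in NumPy. This is a deliberately reduced illustration: a single attention head with random weights, no class token, no position embeddings, and an embedding width (64) chosen for the example rather than taken from any ViT configuration.

```python
import numpy as np

def patchify(image, patch=16):
    """Split [H, W, C] into flattened non-overlapping patches: [N, patch*patch*C]."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product attention over all tokens."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # [N, N]: every token attends to every token
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))              # toy stand-in image
patches = patchify(image)                      # 14 x 14 = 196 patches of 16*16*3 = 768 values
d_model = 64                                   # illustrative width, not a real ViT setting
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
tokens = patches @ w_embed                     # linear patch embedding
wq, wk, wv = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
out, attn = self_attention(tokens, wq, wk, wv)
```

The 196×196 attention matrix is what gives the first layer a global receptive field: each row holds one patch's weights over every other patch, at a cost that grows quadratically with the number of patches.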

CLIP (Contrastive Language–Image Pre-training, OpenAI 2021): Jointly trains an image encoder and a text encoder on 400M image-text pairs with a contrastive objective. Performs zero-shot classification by comparing an image embedding with the embeddings of candidate text descriptions. Its encoders underpin DALL-E 2 and the text conditioning in Stable Diffusion.
Segment Anything Model (SAM, Meta 2023): Promptable segmentation via point, box, or (experimentally) text prompts. Trained on over 1 billion masks (the SA-1B dataset). Generalises to many unseen objects and domains with no fine-tuning.
Open-vocabulary detection: Models like Grounding DINO detect arbitrary classes from text prompts, not just a fixed category set — moving toward true open-world understanding.
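The zero-shot classification step described for CLIP reduces to cosine similarity followed by a softmax. The sketch below shows only that comparison: the embeddings are random stand-ins rather than outputs of real encoders, the 512-dimensional size and the temperature of 0.01 are illustrative assumptions, and the function names are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length so that dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_probs(image_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between one image and each text prompt."""
    sims = l2_normalize(text_embs) @ l2_normalize(image_emb)
    logits = sims / temperature
    logits -= logits.max()  # numerical stability
    e = np.exp(logits)
    return e / e.sum()

rng = np.random.default_rng(0)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = rng.normal(size=(3, 512))                   # stand-ins for text-encoder outputs
image_emb = text_embs[1] + 0.1 * rng.normal(size=512)   # stand-in image embedding near "dog"
probs = zero_shot_probs(image_emb, text_embs)
prediction = labels[int(np.argmax(probs))]
```

Because the class set is defined only by the text prompts, adding a new category means adding a new prompt and embedding it; no retraining of the encoders is involved.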