Computer Vision Basics: From Pixels to Understanding
Computer vision, the task of teaching machines to interpret visual information, is one of the most successful applications of deep learning. From digit recognition to real-time object detection in self-driving cars, convolutional neural networks (CNNs) now match or exceed human accuracy on several benchmark tasks, such as ImageNet classification. Understanding the mathematical substrate reveals why these systems are so powerful and where they still fail.
1. Image Representation and Preprocessing
A digital image is a 2D array (or a 3D tensor for colour) of pixel values. An RGB image of height H and width W is a tensor of shape [H × W × 3]; with the usual 8 bits per channel, each value is an integer in the range 0–255.
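As a minimal sketch of this representation, the NumPy snippet below builds a small random RGB image and applies two common preprocessing steps: scaling to [0, 1] and per-channel standardisation. The image size, the random contents, and the variable names are illustrative assumptions, not part of the original text.

```python
import numpy as np

# Hypothetical 4x6 RGB image filled with random 8-bit values (H=4, W=6, 3 channels).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)

height, width, channels = image.shape

# Scale integer pixel values 0..255 to floats in [0, 1].
scaled = image.astype(np.float32) / 255.0

# Per-channel standardisation: subtract the channel mean, divide by the channel std.
mean = scaled.mean(axis=(0, 1))
std = scaled.std(axis=(0, 1))
standardised = (scaled - mean) / (std + 1e-8)

# Many deep learning libraries expect channels first: [3, H, W] instead of [H, W, 3].
chw = np.transpose(standardised, (2, 0, 1))

print(image.shape, image.dtype)  # (4, 6, 3) uint8
print(chw.shape)                 # (3, 4, 6)
```

In practice the mean and standard deviation are usually computed once over the training set rather than per image; the per-image version here only keeps the example self-contained.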
2. Convolutions and Filters
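A convolution passes a small kernel (filter) over the image, computing a weighted sum at each location. For an image I and a kernel K, the output at position (i, j) is S(i, j) = Σ_m Σ_n I(i + m, j + n) · K(m, n), where the sums run over the kernel's rows and columns. This operation is the core building block of CNNs.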
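The sliding weighted sum can be sketched in a few lines of NumPy. This is an illustrative, unoptimised version under two assumptions: it uses "valid" mode (no padding, stride 1), and, like most CNN libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation. The function name `conv2d` and the Sobel-style test kernel are choices made for this example.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over a 2D `image` and return the weighted sum at each position.

    Valid mode, stride 1, no kernel flip (cross-correlation, as in CNN layers).
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)
    return out

# 5x5 image with a vertical edge: two dark columns followed by three bright ones.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=np.float64)

# Sobel-style kernel that responds to horizontal intensity changes.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)

response = conv2d(image, sobel_x)
print(response)
# [[4. 4. 0.]
#  [4. 4. 0.]
#  [4. 4. 0.]]
```

The response is large where the window straddles the edge and zero where the window covers a uniform region, which is the sense in which a kernel "detects" a pattern.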
3. CNN Architecture
A CNN stacks multiple types of layers to progressively learn higher-level features:
- Convolutional layer: Learns kernels (weights) that activate on specific patterns. Early layers detect edges and textures; deeper layers detect object parts and complete objects.
- Activation function: ReLU (Rectified Linear Unit), f(x) = max(0, x), introduces non-linearity. Variants such as Leaky ReLU and GELU are also used in modern networks.
- Batch Normalisation: Normalises each channel's activations over the mini-batch to mean 0 and standard deviation 1, then scales and shifts the result with learnable parameters γ and β. This stabilises training considerably and allows higher learning rates.
- Pooling: Max pooling subsamples feature maps by keeping the maximum value in each window. It reduces spatial dimensions, enlarges the receptive field of later layers, and provides some translation invariance. A 2×2 max pool with stride 2 halves both height and width.
- Fully connected (FC) layer: Final layers flatten the feature volume and learn global combinations for classification.
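The element-wise and subsampling layers in the list above are simple enough to sketch directly in NumPy. The snippet assumes a channels-last layout [N, H, W, C], training-mode batch statistics, and even spatial dimensions; the function names are illustrative and do not correspond to any particular library's API.

```python
import numpy as np

def relu(x):
    """ReLU: f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalise each channel over batch and spatial axes, then scale and shift.

    x has shape [N, H, W, C]; gamma and beta have shape [C].
    """
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on [N, H, W, C]; H and W must be even."""
    n, h, w, c = x.shape
    blocks = x.reshape(n, h // 2, 2, w // 2, 2, c)
    return blocks.max(axis=(2, 4))

rng = np.random.default_rng(0)
features = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 4, 2))  # hypothetical feature maps

normed = batch_norm(features, gamma=np.ones(2), beta=np.zeros(2))
activated = relu(normed)
pooled = max_pool_2x2(activated)

print(features.shape, "->", pooled.shape)  # (8, 4, 4, 2) -> (8, 2, 2, 2)
```

At inference time, batch normalisation layers normally use running averages collected during training instead of the statistics of the current batch; that bookkeeping is omitted here.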
4. Classical Feature Detection
Before deep learning, hand-crafted feature detectors dominated computer vision and remain relevant for lightweight applications and geometric tasks:
- Harris Corner Detector (1988): Computes the structure tensor M of image gradients over a local window. At a corner, both eigenvalues of M are large; along an edge, only one is. Decision: R = det(M) − k·trace(M)², with k an empirical constant (typically 0.04–0.06). R > threshold → corner.
- HOG (Histogram of Oriented Gradients, 2005): Divide image into cells, compute gradient orientation histogram per cell, normalise across overlapping blocks. Introduced by Dalal & Triggs for pedestrian detection, where the descriptors fed a linear SVM. Still used as feature input to SVMs in lightweight pipelines.
- SIFT (Scale-Invariant Feature Transform, 1999/2004): Detects keypoints in a scale space (Difference of Gaussians) and computes a 128-dimensional descriptor that is invariant to scale and rotation and robust to moderate illumination changes. Widely used in image stitching, panoramas, 3D reconstruction (COLMAP).
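To make the Harris response concrete, here is a minimal NumPy sketch on a synthetic image. It makes several simplifying assumptions that are not part of the original description: gradients come from `np.gradient` (central differences), the window is an unweighted 3×3 box rather than a Gaussian, and k = 0.05. The helper names `box_sum` and `harris_response` are hypothetical.

```python
import numpy as np

def box_sum(a, radius):
    """Sum over each pixel's (2*radius+1)^2 neighbourhood, zero-padded at the border."""
    padded = np.pad(a, radius)
    out = np.zeros(a.shape, dtype=np.float64)
    size = 2 * radius + 1
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out

def harris_response(gray, k=0.05, radius=1):
    """Return R = det(M) - k * trace(M)^2 for every pixel of a 2D image."""
    iy, ix = np.gradient(gray.astype(np.float64))  # derivatives along rows, columns
    # Entries of the structure tensor M = [[sxx, sxy], [sxy, syy]].
    sxx = box_sum(ix * ix, radius)
    syy = box_sum(iy * iy, radius)
    sxy = box_sum(ix * iy, radius)
    det = sxx * syy - sxy ** 2
    trace = sxx + syy
    return det - k * trace ** 2

# Synthetic test image: a bright square on a dark background.
gray = np.zeros((20, 20))
gray[5:15, 5:15] = 1.0

r = harris_response(gray)
print(r[5, 5] > 0)    # corner of the square: positive response
print(r[10, 5] < 0)   # middle of an edge: negative response
print(r[0, 0] == 0)   # flat region: zero response
```

The sign pattern matches the eigenvalue argument: two strong gradient directions give a large determinant, while a single direction leaves the determinant near zero so the trace term dominates.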
5. Object Detection: YOLO and R-CNN
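Object detection requires both classifying objects and localising them with bounding boxes. Two major paradigms dominate: two-stage detectors in the R-CNN family, which first propose candidate regions and then classify and refine each one, and single-stage detectors such as YOLO, which predict boxes and class scores in a single pass over the image.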
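Localisation quality is commonly scored with intersection over union (IoU) between a predicted box and a ground-truth box. The text above does not name a metric, so treat this as an assumed, standard choice; the box format (x_min, y_min, x_max, y_max) and the function name `iou` are likewise illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (if any).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes get an empty intersection.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

prediction = (0.0, 0.0, 2.0, 2.0)
ground_truth = (1.0, 1.0, 3.0, 3.0)
print(iou(prediction, ground_truth))  # overlap 1, union 7 -> 1/7
```

A detection is typically counted as correct when its IoU with a ground-truth box of the same class exceeds a chosen threshold such as 0.5.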
6. Semantic and Instance Segmentation
Rather than bounding boxes, segmentation assigns a class label to every pixel:
- Semantic segmentation: Every pixel is labelled with a class such as "sky", "road", or "person". It does not distinguish between different instances of the same class (all cars are labelled "car"). FCN (Fully Convolutional Network) and DeepLab (with dilated convolutions and CRF post-processing) are standard reference architectures.
- Instance segmentation: Separate mask per object instance, so each individual car gets its own mask. Mask R-CNN adds a mask prediction head to Faster R-CNN, producing a binary segmentation mask for each detected instance with only a small computational overhead.
- Panoptic segmentation: Combines semantic labels (for background "stuff") and instance labels (for foreground "things") into a single unified labelling. Representative systems include Panoptic FPN and DETR-based models.
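The difference between the three output formats is easiest to see as arrays. The sketch below uses a toy 4×6 scene with two hypothetical car instances; the class ids and the choice to encode the panoptic result as a (class id, instance id) pair per pixel are assumptions made for illustration, not a standard file format.

```python
import numpy as np

# Hypothetical class ids for a toy scene.
BACKGROUND, CAR = 0, 1

# Instance segmentation output: one boolean mask per detected object.
car_a = np.zeros((4, 6), dtype=bool)
car_a[1:3, 0:2] = True
car_b = np.zeros((4, 6), dtype=bool)
car_b[1:3, 4:6] = True
instance_masks = np.stack([car_a, car_b])   # shape [num_instances, H, W]
instance_classes = np.array([CAR, CAR])

# Semantic segmentation output: one class id per pixel, instances merged.
semantic = np.full((4, 6), BACKGROUND, dtype=np.int64)
for mask, cls in zip(instance_masks, instance_classes):
    semantic[mask] = cls

# Panoptic-style output: class id plus instance id per pixel (0 = no instance).
instance_ids = np.zeros((4, 6), dtype=np.int64)
for idx, mask in enumerate(instance_masks, start=1):
    instance_ids[mask] = idx
panoptic = np.stack([semantic, instance_ids], axis=-1)  # shape [H, W, 2]

print(semantic)
print(instance_ids)
```

Both cars share class id 1 in `semantic`, while `instance_ids` keeps them apart as 1 and 2.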
7. Modern Vision: Transformers and Beyond
Vision Transformers (ViT, 2020) apply the self-attention mechanism of NLP transformers directly to images:
- Image divided into fixed-size patches (for example 16×16 pixels each), each flattened and linearly embedded as a "token".
- Self-attention computes pairwise token interactions — a global receptive field from the first layer, unlike CNNs that build up gradually.
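- Pre-trained on large datasets (ImageNet-21k, JFT-300M), ViT matches or outperforms comparable CNNs; with less pre-training data it tends to lag behind them because it lacks their built-in locality bias.
- Hybrid models such as CvT combine convolutional locality bias with attention-based global context. ConvNeXt goes the other way: it is a pure convolutional network modernised with design choices borrowed from Transformers.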
- CLIP (Contrastive Language–Image Pre-training, OpenAI 2021): Jointly trains an image encoder and a text encoder on 400M image-text pairs. Can perform zero-shot classification by comparing an image embedding with the embeddings of text descriptions of each class. CLIP embeddings are also used for text conditioning in DALL-E 2 and Stable Diffusion.
- Segment Anything Model (SAM, Meta 2023): Promptable segmentation from point or box prompts (text prompts were explored in the paper). Trained on over 1 billion masks from 11 million images (the SA-1B dataset). Often generalises to unseen objects and domains without fine-tuning.
- Open-vocabulary detection: Models like Grounding DINO detect arbitrary classes named in text prompts rather than a fixed category set, a step toward open-world understanding.
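The first two ViT steps in the list above, patch tokenisation and self-attention, can be sketched in NumPy as follows. This is a deliberately reduced illustration: it uses random weights, a single attention head, and an arbitrary embedding width of 64, and it omits the class token, positional embeddings, layer normalisation, and MLP blocks of a real ViT. All function and variable names are hypothetical.

```python
import numpy as np

def patchify(image, patch):
    """Split an [H, W, C] image into flattened non-overlapping patches.

    Returns an array of shape [num_patches, patch * patch * C], row-major over the grid.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of patch size"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # [grid_rows, grid_cols, patch, patch, C]
    return grid.reshape(-1, patch * patch * c)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over [num_tokens, d] tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # every token scored against every token
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))       # hypothetical input image
patches = patchify(image, patch=16)     # 14 x 14 = 196 patches of 16*16*3 = 768 values

d_model = 64                            # illustrative embedding width
w_embed = rng.normal(scale=0.02, size=(768, d_model))
tokens = patches @ w_embed              # linear patch embedding

w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)

print(patches.shape, tokens.shape, weights.shape)  # (196, 768) (196, 64) (196, 196)
```

The 196×196 attention matrix shows the global receptive field directly: each patch token attends to every other patch in a single layer, at a cost that grows quadratically with the number of patches.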