💻 Computer Science · AI
📅 March 2026 · ⏱ 12 min · 🟡 Intermediate

Computer Vision Basics: From Pixels to Understanding

Computer vision, the task of teaching machines to interpret visual information, is one of the most successful applications of deep learning. From digit recognition to real-time object detection in self-driving cars, convolutional neural networks (CNNs) now match or exceed human accuracy on several benchmark visual tasks. Understanding the mathematics underneath shows why these systems are so powerful and where they still fail.

1. Image Representation and Preprocessing

A digital image is a 2D array (or 3D tensor for colour) of integer pixel values. For an 8-bit RGB image of height H and width W, the tensor has shape [H × W × 3], and each channel stores values 0–255.

Colour spaces:
  RGB: (R, G, B); additive colour model, device-specific
  HSV: (Hue 0–360°, Saturation 0–1, Value 0–1); closer to human perception, useful for colour-based segmentation and tracking
  Grayscale: I = 0.299R + 0.587G + 0.114B (luminance weighting)
  YCbCr: used in JPEG compression; Y = luma, Cb/Cr = chrominance channels

Normalization (standard practice before neural network input):
  x_norm = (x − μ_channel) / σ_channel   (per-channel normalization)
  ImageNet statistics, for pixels first scaled from 0–255 to [0, 1]: μ = [0.485, 0.456, 0.406], σ = [0.229, 0.224, 0.225]
  Without normalization: large gradients and unstable training

Image augmentation (training data expansion):
  Random horizontal flip, random crop to 224×224, colour jitter, rotation, random erasing
  These artificially expand the training set and improve generalisation.
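As a minimal sketch of the grayscale and normalization formulas above, the following uses NumPy on a tiny synthetic image. The function names and the 2×2 test image are illustrative assumptions, not part of any particular library; the ImageNet statistics are applied after scaling pixels to [0, 1].

```python
import numpy as np

# ImageNet channel statistics, defined for pixel values scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_grayscale(rgb_uint8: np.ndarray) -> np.ndarray:
    """Luminance-weighted grayscale: I = 0.299R + 0.587G + 0.114B."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return rgb_uint8.astype(np.float32) @ weights  # [H, W, 3] -> [H, W]


def normalize(rgb_uint8: np.ndarray) -> np.ndarray:
    """Scale 0-255 pixels to [0, 1], then standardise each channel."""
    x = rgb_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD  # broadcasts over the channel axis


# Hypothetical 2x2 RGB image: red, green, blue and white pixels.
image = np.array(
    [[[255, 0, 0], [0, 255, 0]],
     [[0, 0, 255], [255, 255, 255]]],
    dtype=np.uint8,
)
gray = to_grayscale(image)   # shape (2, 2)
norm = normalize(image)      # shape (2, 2, 3), roughly zero-centred per channel
```

Note that the green pixel comes out brightest of the three primaries in `gray`, which reflects the larger luminance weight on G.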

2. Convolutions and Filters

A convolution passes a small kernel (filter) over the image, computing a weighted sum at each location. This operation is the core building block of CNNs:

Discrete 2D convolution (in the form CNNs use, without flipping the kernel; strictly speaking a cross-correlation):
  (I * K)[i,j] = Σ_m Σ_n I[i+m, j+n] · K[m,n]

3×3 Gaussian blur kernel:
  K = (1/16) × [[1, 2, 1],
                [2, 4, 2],
                [1, 2, 1]]
  → Smooths noise (low-pass filter)

Sobel edge detector (horizontal gradient):
  K_x = [[-1, 0, +1],
         [-2, 0, +2],
         [-1, 0, +1]]
  G_x = I * K_x and G_y = I * K_y, where K_y is the transpose of K_x (vertical gradient)
  |∇I| = √(G_x² + G_y²) → edge magnitude
  θ = atan2(G_y, G_x) → edge direction (the full-quadrant form of arctan(G_y / G_x), defined even when G_x = 0)

Output feature map size:
  Input: H × W × C_in
  Kernel: k×k, C_out filters, stride s, padding p
  Output: [⌊(H + 2p − k)/s⌋ + 1] × [⌊(W + 2p − k)/s⌋ + 1] × C_out
  For k=3, s=1, p=1 ("same" padding): output size = H × W (unchanged)
  For k=3, s=2, p=1 (strided): output size ≈ H/2 × W/2 (spatial halving)
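The sliding-window sum and the output-size formula can be sketched in a few lines of NumPy. This is a deliberately naive, single-channel version written for readability rather than speed; `conv_output_size`, `conv2d` and the synthetic step-edge image are hypothetical names and data chosen for illustration. It uses `np.arctan2` for the edge direction so that pixels with G_x = 0 do not cause a division by zero.

```python
import numpy as np


def conv_output_size(n: int, k: int, s: int = 1, p: int = 0) -> int:
    """Output size along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1


def conv2d(image: np.ndarray, kernel: np.ndarray,
           stride: int = 1, padding: int = 0) -> np.ndarray:
    """Single-channel sum of I[i+m, j+n] * K[m, n] (no kernel flip)."""
    k = kernel.shape[0]  # assumes a square k x k kernel
    padded = np.pad(image.astype(np.float64), padding)  # zero padding
    out_h = conv_output_size(image.shape[0], k, stride, padding)
    out_w = conv_output_size(image.shape[1], k, stride, padding)
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(padded[r:r + k, c:c + k] * kernel)
    return out


SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T  # vertical-gradient counterpart

# Synthetic 6x6 image with a vertical step edge: left half 0, right half 1.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

g_x = conv2d(image, SOBEL_X, padding=1)
g_y = conv2d(image, SOBEL_Y, padding=1)
magnitude = np.sqrt(g_x**2 + g_y**2)
direction = np.arctan2(g_y, g_x)
```

Away from the image border, `g_x` responds only at the two columns adjacent to the step, and `g_y` is zero there because the image does not change vertically. The responses along the outer border come from the zero padding, not from the image content.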

3. CNN Architecture

A CNN stacks multiple types of layers to progressively learn higher-level features:

VGG-16 architecture (2014):
  Input: 224×224×3
  Block 1: 2× [Conv 3×3, 64] + MaxPool 2×2 → 112×112×64
  Block 2: 2× [Conv 3×3, 128] + MaxPool → 56×56×128
  Block 3: 3× [Conv 3×3, 256] + MaxPool → 28×28×256
  Block 4: 3× [Conv 3×3, 512] + MaxPool → 14×14×512
  Block 5: 3× [Conv 3×3, 512] + MaxPool → 7×7×512
  FC 4096 → FC 4096 → FC 1000 (softmax) → class probabilities
  Total parameters: ~138M

ResNet insight (2015, won ILSVRC):
  Skip connections: x → Conv → BN → ReLU → Conv → BN → (+x) → ReLU
  Allows training networks of 50–152+ layers without vanishing gradients.
  Key: the network learns the residual F(x) = H(x) − x rather than H(x) directly.
  Top-5 ImageNet error: 3.57% (surpassing human ~5.1%)
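The ~138M figure for VGG-16 can be checked with simple arithmetic. The sketch below assumes 3×3 convolutions with padding 1 (so only pooling changes the spatial size) and one bias per filter or output unit; the constant and function names are illustrative.

```python
# (number of conv layers, output channels) for the five VGG-16 blocks
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
FC_SIZES = [4096, 4096, 1000]


def vgg16_summary(input_size: int = 224, in_channels: int = 3):
    """Return (final feature-map shape, total parameter count)."""
    size, channels, params = input_size, in_channels, 0
    for n_convs, out_channels in VGG16_BLOCKS:
        for _ in range(n_convs):
            # 3x3 weights per (input, output) channel pair, plus one bias per filter
            params += 3 * 3 * channels * out_channels + out_channels
            channels = out_channels
        size //= 2  # 2x2 max pooling with stride 2 halves H and W
    features = size * size * channels  # flattened input to the first FC layer
    for width in FC_SIZES:
        params += features * width + width  # weights plus biases
        features = width
    return (size, size, channels), params


shape, total = vgg16_summary()
print(shape, f"{total:,}")  # (7, 7, 512) 138,357,544
```

Most of the parameters sit in the first fully connected layer (7·7·512 × 4096, about 103M), which is one reason later architectures replaced it with global pooling.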

4. Classical Feature Detection

Before deep learning, hand-crafted feature detectors dominated computer vision and remain relevant for lightweight applications and geometric tasks:

5. Object Detection: YOLO and R-CNN

Object detection requires both classifying objects and localising them with bounding boxes. Two major paradigms:

Two-stage: R-CNN family (Region-based CNN)
  1. A proposal stage generates candidate regions of interest (RoIs): ~2000 per image from selective search in the original R-CNN, or from a learned Region Proposal Network (RPN) in Faster R-CNN
  2. Each RoI is feature-extracted and classified independently
  Faster R-CNN (2015): RPN shares the convolutional backbone with the detector head
  ~5 fps on GPU, ~70 mAP on PASCAL VOC 2007
  Accurate but relatively slow for real-time use

One-stage: YOLO (You Only Look Once, Redmon 2016)
  Single forward pass through the network.
  Image divided into an S×S grid (e.g. 13×13 at the coarsest scale for YOLOv3 at 416×416 input)
  Each cell predicts B bounding boxes with confidence, plus C class probabilities.
  Output tensor (original YOLO): S×S×(B×5 + C), where 5 = {x, y, w, h, confidence}
  YOLOv8 (2023): ~50+ fps at 640×640, ~53 mAP on COCO
  Anchor-free, C2f modules, NMS (Non-Maximum Suppression) post-processing

IoU (Intersection over Union):
  IoU = Area(A∩B) / Area(A∪B)
  Metric for bounding box quality: IoU > 0.5 typically considered "correct"

mAP (mean Average Precision):
  Area under the precision-recall curve, averaged over all classes and, in COCO-style evaluation, over IoU thresholds from 0.50 to 0.95
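IoU and NMS are simple enough to sketch in plain Python. The version below assumes boxes in corner format (x1, y1, x2, y2) and uses greedy NMS, a common formulation rather than the exact post-processing of any specific detector; the helper names and the example boxes are hypothetical. The `yolo_output_shape` call uses S=7, B=2, C=20, the settings of the original YOLO on PASCAL VOC, purely as an illustration of the tensor-size formula.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


def yolo_output_shape(s: int, b: int, c: int) -> Tuple[int, int, int]:
    """Shape of the S x S x (B*5 + C) prediction tensor."""
    return (s, s, b * 5 + c)


# Two heavily overlapping detections and one separate detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)                 # indices of surviving boxes
grid_shape = yolo_output_shape(7, 2, 20)  # (7, 7, 30)
```

Here the first two boxes overlap with IoU = 81/119 ≈ 0.68, so the lower-scoring one is suppressed and `kept` holds indices 0 and 2.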

6. Semantic and Instance Segmentation

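Rather than predicting bounding boxes, semantic segmentation assigns a class label to every pixel; instance segmentation additionally separates individual objects of the same class: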

Encoder–decoder (U-Net) architecture:
  The contracting path encodes the image to a bottleneck representation; the expanding path decodes it back to full resolution, with skip connections carrying high-resolution features from the encoder.
  Originally designed for biomedical image segmentation with limited data.
  Skip connections are crucial: the decoder needs both semantic context (from deep layers) and spatial detail (from early layers) to place precise boundaries.
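To make the role of a skip connection concrete, here is a shape-level sketch of a single U-Net level in NumPy. It omits all convolutions and learned weights, uses nearest-neighbour upsampling as a stand-in for the learned up-convolution, and assumes even spatial dimensions; the function and variable names are illustrative only.

```python
import numpy as np


def max_pool_2x2(x: np.ndarray) -> np.ndarray:
    """Downsample [H, W, C] with the max over non-overlapping 2x2 windows."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))


def upsample_2x(x: np.ndarray) -> np.ndarray:
    """Nearest-neighbour upsampling of [H, W, C] by a factor of 2."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


rng = np.random.default_rng(0)

# Encoder feature map at full resolution: fine spatial detail.
encoder_features = rng.standard_normal((64, 64, 16))

# Contracting path: halve the resolution (stand-in for the deeper layers).
deep = max_pool_2x2(encoder_features)      # (32, 32, 16)

# Expanding path: bring the deep features back to the encoder resolution.
upsampled = upsample_2x(deep)              # (64, 64, 16)

# Skip connection: concatenate encoder detail with decoder context on channels.
decoder_input = np.concatenate([encoder_features, upsampled], axis=-1)  # (64, 64, 32)
```

The upsampled map alone is blocky, since each 2×2 window shares one value; concatenating the original encoder features is what gives the following decoder layers access to pixel-level detail.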

7. Modern Vision: Transformers and Beyond

Vision Transformers (ViT, 2020) apply the self-attention mechanism of NLP transformers directly to images: