GPU Parallel Computing — SIMT, Warp Divergence, and the Roofline Model
A modern discrete GPU executes tens of thousands of threads
simultaneously. An NVIDIA H100 has 16,896 CUDA cores and peaks at ~60
TFLOPS of FP32. Achieving even 50% of peak performance requires
understanding the execution model — SIMT, warps, memory hierarchy, and
occupancy — and how to avoid the performance killers: warp divergence,
memory bank conflicts, and global memory latency.
1. SIMD vs SIMT Execution
CPU SIMD (Single Instruction, Multiple Data): One instruction operates
on a VECTOR of 4/8/16 data elements simultaneously. e.g., AVX-512:
single instruction adds 16 floats at once. Programmer explicitly
vectorizes; each element is independently addressable. Branch
divergence: handled by masking some lanes. GPU SIMT (Single
Instruction, Multiple Threads): One instruction operates on 32 THREADS
simultaneously (a warp). Each thread has ITS OWN registers, program
counter, and stack. Threads appear independent to the programmer. Key
difference from SIMD: SIMD: programmer manages vector width
explicitly. SIMT: hardware auto-vectorizes 32 threads with the same
PC. SIMT enables: - Each thread can independently branch (but see warp
divergence below) - Each thread has its own register file (no explicit
masking needed) - Threads within a warp can voluntarily synchronize
(__syncthreads)
2. Warps, Thread Blocks, and Grids
Thread hierarchy (CUDA terminology): Grid (all thread blocks for one
kernel launch) └── Block (up to 1024 threads, share shared memory,
sync with __syncthreads) └── Warp (32 threads — the scheduling unit)
└── Thread (independent register file + PC) Warp divergence: When
threads within a warp take DIFFERENT branches (if/else): The warp
executes BOTH paths with inactive lanes masked off. Effectively
serializes: 32 threads execute path A, then path B. Performance
penalty: up to 2× slowdown at 50/50 branching. Example — avoid this:
if (threadIdx.x % 2 == 0) { doA(); } else { doB(); } → Half of every
warp diverges. Better: restructure so thread blocks are either all-A
or all-B. Key numbers (NVIDIA Ampere/Hopper): Threads per warp: 32 Max
threads per block: 1024 Max warps per SM: 48–64 Warp size has been 32
since G80 (2006); unlikely to change.
Amortizing warp scheduling latency: Global memory
latency is ~300–500 cycles. GPUs tolerate this by switching to another
ready warp while the first warp waits for memory. This
latency hiding only works if there are enough other warps
ready — hence the importance of occupancy.
3. GPU Memory Hierarchy
Level
Size
Latency
Scope
Bandwidth
Registers
~64K per SM
~1 cycle
Thread-private
~120+ TFLOPS equivalent
Shared memory / L1
~48–228 KB/SM
~20–40 cycles
Block-shared
~TB/s aggregate
L2 cache
~50 MB (H100)
~200 cycles
GPU-wide
~7 TB/s
Global (HBM)
~16–80 GB
~300–500 cycles
GPU-wide, host-visible
3.35 TB/s (H100 SXM)
Shared Memory Bank Conflicts
Shared memory is divided into 32 BANKS (one per warp thread).
Successive 4-byte words map to successive banks: bank = (addr/4) % 32.
Conflict-free access: each warp thread hits a different bank → 1
cycle. N-way bank conflict: N threads all access bank B → N serialized
accesses. Common mistake: float tile[32][32]; // Thread i accesses
tile[i][j]: column-wise stride = 1 (bank = 0,1,...31) ✓ // Thread i
accesses tile[j][i]: row-wise access, all threads hit bank j % 32 // →
32-way bank conflict! → 32× slower Fix: add padding: float
tile[32][33]; — shifts each row by 1 word, spreading accesses across
different banks.
4. The Roofline Model
The roofline model (Williams, Waterman, Patterson
2009) is a simple visual tool to understand whether a kernel is
compute-bound or memory-bound. Two "roofs" define the performance
ceiling:
Arithmetic Intensity (AI) = FLOPs ÷ Bytes of memory traffic Roofline
equation: Attainable_GFLOPs = min(Peak_GFLOPs, AI ×
Peak_Bandwidth_GB/s) Example: NVIDIA H100 SXM Peak FP32 throughput:
~60 TFLOPS = 60,000 GFLOPS HBM bandwidth: ~3,350 GB/s Ridge point AI:
60,000 / 3,350 ≈ 17.9 FLOPs/byte A kernel with AI < 17.9 is
memory-bound (bandwidth-limited). A kernel with AI > 17.9 is
compute-bound (FLOP-limited). Common kernel arithmetic intensities:
Vector copy (y = x): AI = 0.08 (2 bytes, ~0 FLOPs) → deeply
memory-bound DAXPY (y = αx + y): AI = 0.125 → memory-bound Sparse
matrix-vector: AI ~ 0.25–2.0 → memory-bound Dense matrix mult: AI ~
n/2 (grows with n) → compute-bound for large n Optimization strategy:
Memory-bound → reduce bytes (half-precision, tiling, fusion, caching)
Compute-bound → reduce FLOPs (better algorithms) or add parallelism
5. Occupancy and Limiting Resources
Occupancy = Active warps per SM / Maximum warps per SM Resources
limiting occupancy (per SM, choose block config to maximize): 1.
Shared memory: allocating more shared memory per block → fewer blocks
fit. 2. Registers: more registers per thread → fewer warps fit. 3. Max
blocks: hardware cap on concurrent blocks per SM. Example: SM can hold
max 48 warps. A block of 256 threads = 8 warps. If each block uses 16
KB shared memory and SM has 64 KB: → Max 4 blocks = 32 warps →
occupancy = 32/48 = 67%. If block is 128 threads (4 warps), only 8 KB
shared: → 8 blocks = 32 warps → occupancy = 67% (same, limited
differently). Reduce to 4 KB → 16 blocks = 48 warps → 100% occupancy.
Registers: if a kernel uses 64 registers/thread, and SM has 64K
registers: 64K / 64 = 1024 threads = 32 warps → occupancy limited.
Important: 100% occupancy does not guarantee peak performance! A
compute-bound kernel may run fast at 50% occupancy if it keeps all
functional units busy. Use NVIDIA Nsight Compute to profile. NVIDIA
CUDA occupancy calculator formula: active_blocks = min(
floor(max_shared / shared_per_block), floor(max_registers /
(reg_per_thread × block_size)), max_blocks_per_sm ) active_warps =
active_blocks × ceil(block_size / 32)
6. WebGPU Compute Shaders
WebGPU (available in Chrome 113+, Firefox Nightly) exposes GPU compute
shaders in the browser using WGSL (WebGPU Shading Language):
Dispatch this shader with
passEncoder.dispatchWorkgroups(Math.ceil(n/256), 1, 1).
Each workgroup processes 256 elements in ~8 barrier-separated passes,
with shared memory eliminating global memory round-trips between
passes.
7. Key Optimization Patterns
Coalesced global memory access: Warp threads should
access consecutive memory addresses (stride-1). Non-coalesced access
fragments into multiple transactions. Ensure struct-of-arrays (SoA)
layout instead of array-of-structs (AoS) when possible.
Thread coarsening: Assign multiple elements per
thread to reduce scheduling overhead and improve instruction-level
parallelism within a thread. Useful when occupancy is already high.
Kernel fusion: Combine multiple memory-bound
kernels into one to avoid repeated global memory reads/writes
between kernels. Can increase arithmetic intensity toward the
compute roof.
Tiling for matrix multiplication: Load tiles from
global memory into shared memory, compute partial sums, accumulate
in registers. Converts O(N³) global bytes to O(N³/T) with tile size
T, dramatically increasing arithmetic intensity.
Warp-level primitives: Use shuffle instructions
(__shfl_xor_sync) for communication within a warp
without shared memory — zero latency, no bank conflicts.
Asynchronous copies (Ampere+):memcpy_async overlaps data loading with computation
using a software pipeline (prefetch two tiles alternately).
Mixed precision: FP16 has 2× throughput, BF16/TF32
Tensor Cores give 4–8× throughput for matrix multiply. Use FP16/BF16
accumulation where precision allows.
CPU vs GPU: A high-end CPU (32 cores, AVX-512)
delivers ~3 TFLOPS FP32. A discrete GPU (H100) delivers ~60 TFLOPS.
But throughput is only realized for parallel workloads with sufficient
arithmetic intensity — sequential code with dependencies runs much
faster on the CPU.