GPU Parallel Computing — SIMT, Warp Divergence, and the Roofline Model

A modern discrete GPU executes tens of thousands of threads simultaneously. An NVIDIA H100 has 16,896 CUDA cores and peaks at ~60 TFLOPS of FP32. Achieving even 50% of peak performance requires understanding the execution model — SIMT, warps, memory hierarchy, and occupancy — and how to avoid the performance killers: warp divergence, memory bank conflicts, and global memory latency.

1. SIMD vs SIMT Execution

CPU SIMD (Single Instruction, Multiple Data): One instruction operates on a VECTOR of 4/8/16 data elements simultaneously. e.g., AVX-512: single instruction adds 16 floats at once. Programmer explicitly vectorizes; each element is independently addressable. Branch divergence: handled by masking some lanes. GPU SIMT (Single Instruction, Multiple Threads): One instruction operates on 32 THREADS simultaneously (a warp). Each thread has ITS OWN registers, program counter, and stack. Threads appear independent to the programmer. Key difference from SIMD: SIMD: programmer manages vector width explicitly. SIMT: hardware auto-vectorizes 32 threads with the same PC. SIMT enables: - Each thread can independently branch (but see warp divergence below) - Each thread has its own register file (no explicit masking needed) - Threads within a warp can voluntarily synchronize (__syncthreads)

2. Warps, Thread Blocks, and Grids

Thread hierarchy (CUDA terminology): Grid (all thread blocks for one kernel launch) └── Block (up to 1024 threads, share shared memory, sync with __syncthreads) └── Warp (32 threads — the scheduling unit) └── Thread (independent register file + PC) Warp divergence: When threads within a warp take DIFFERENT branches (if/else): The warp executes BOTH paths with inactive lanes masked off. Effectively serializes: 32 threads execute path A, then path B. Performance penalty: up to 2× slowdown at 50/50 branching. Example — avoid this: if (threadIdx.x % 2 == 0) { doA(); } else { doB(); } → Half of every warp diverges. Better: restructure so thread blocks are either all-A or all-B. Key numbers (NVIDIA Ampere/Hopper): Threads per warp: 32 Max threads per block: 1024 Max warps per SM: 48–64 Warp size has been 32 since G80 (2006); unlikely to change.

Amortizing warp scheduling latency: Global memory latency is ~300–500 cycles. GPUs tolerate this by switching to another ready warp while the first warp waits for memory. This latency hiding only works if there are enough other warps ready — hence the importance of occupancy.

3. GPU Memory Hierarchy

Level	Size	Latency	Scope	Bandwidth
Registers	~64K per SM	~1 cycle	Thread-private	~120+ TFLOPS equivalent
Shared memory / L1	~48–228 KB/SM	~20–40 cycles	Block-shared	~TB/s aggregate
L2 cache	~50 MB (H100)	~200 cycles	GPU-wide	~7 TB/s
Global (HBM)	~16–80 GB	~300–500 cycles	GPU-wide, host-visible	3.35 TB/s (H100 SXM)

Shared Memory Bank Conflicts

Shared memory is divided into 32 BANKS (one per warp thread). Successive 4-byte words map to successive banks: bank = (addr/4) % 32. Conflict-free access: each warp thread hits a different bank \to 1 cycle. N-way bank conflict: N threads all access bank B \to N serialized accesses. Common mistake: float tile[32][32]; // Thread i accesses tile[i][j]: column-wise stride = 1 (bank = 0,1,...31) ✓ // Thread i accesses tile[j][i]: row-wise access, all threads hit bank j % 32 // \to 32-way bank conflict! \to 32\times slower Fix: add padding: float tile[32][33]; — shifts each row by 1 word, spreading accesses across different banks.

4. The Roofline Model

The roofline model (Williams, Waterman, Patterson 2009) is a simple visual tool to understand whether a kernel is compute-bound or memory-bound. Two "roofs" define the performance ceiling:

Arithmetic Intensity (AI) = FLOPs \div Bytes of memory traffic Roofline equation: Attainable_GFLOPs = min(Peak_GFLOPs, AI \times Peak_Bandwidth_GB/s) Example: NVIDIA H100 SXM Peak FP32 throughput: ~60 TFLOPS = 60,000 GFLOPS HBM bandwidth: ~3,350 GB/s Ridge point AI: 60,000 / 3,350 \approx 17.9 FLOPs/byte A kernel with AI < 17.9 is memory-bound (bandwidth-limited). A kernel with AI > 17.9 is compute-bound (FLOP-limited). Common kernel arithmetic intensities: Vector copy (y = x): AI = 0.08 (2 bytes, ~0 FLOPs) \to deeply memory-bound DAXPY (y = αx + y): AI = 0.125 \to memory-bound Sparse matrix-vector: AI ~ 0.25-2.0 \to memory-bound Dense matrix mult: AI ~ n/2 (grows with n) \to compute-bound for large n Optimization strategy: Memory-bound \to reduce bytes (half-precision, tiling, fusion, caching) Compute-bound \to reduce FLOPs (better algorithms) or add parallelism

5. Occupancy and Limiting Resources

Occupancy = Active warps per SM / Maximum warps per SM Resources limiting occupancy (per SM, choose block config to maximize): 1. Shared memory: allocating more shared memory per block \to fewer blocks fit. 2. Registers: more registers per thread \to fewer warps fit. 3. Max blocks: hardware cap on concurrent blocks per SM. Example: SM can hold max 48 warps. A block of 256 threads = 8 warps. If each block uses 16 KB shared memory and SM has 64 KB: \to Max 4 blocks = 32 warps \to occupancy = 32/48 = 67%. If block is 128 threads (4 warps), only 8 KB shared: \to 8 blocks = 32 warps \to occupancy = 67% (same, limited differently). Reduce to 4 KB \to 16 blocks = 48 warps \to 100% occupancy. Registers: if a kernel uses 64 registers/thread, and SM has 64K registers: 64K / 64 = 1024 threads = 32 warps \to occupancy limited. Important: 100% occupancy does not guarantee peak performance! A compute-bound kernel may run fast at 50% occupancy if it keeps all functional units busy. Use NVIDIA Nsight Compute to profile. NVIDIA CUDA occupancy calculator formula: active_blocks = min( floor(max_shared / shared_per_block), floor(max_registers / (reg_per_thread \times block_size)), max_blocks_per_sm ) active_warps = active_blocks \times ceil(block_size / 32)

6. WebGPU Compute Shaders

WebGPU (available in Chrome 113+, Firefox Nightly) exposes GPU compute shaders in the browser using WGSL (WebGPU Shading Language):

// WGSL compute shader — parallel prefix sum (one pass)
@group(0) @binding(0) var<storage, read>  input  : array<f32>;
@group(0) @binding(1) var<storage, read_write> output : array<f32>;
@group(0) @binding(2) var<uniform> params : struct { n: u32 };

var<workgroup> shared_data : array<f32, 256>;

@compute @workgroup_size(256, 1, 1)
fn main(
  @builtin(global_invocation_id) gid : vec3<u32>,
  @builtin(local_invocation_id)  lid : vec3<u32>
) {
  let i = gid.x;
  let l = lid.x;

  // Load into workgroup shared memory
  shared_data[l] = select(0.0f, input[i], i < params.n);
  workgroupBarrier();

  // Hillis-Steel parallel scan (log₂ 256 = 8 iterations)
  for (var offset = 1u; offset < 256u; offset <<= 1u) {
    if (l >= offset) {
      shared_data[l] += shared_data[l - offset];
    }
    workgroupBarrier();
  }

  if (i < params.n) { output[i] = shared_data[l]; }
}

Dispatch this shader with passEncoder.dispatchWorkgroups(Math.ceil(n/256), 1, 1). Each workgroup processes 256 elements in ~8 barrier-separated passes, with shared memory eliminating global memory round-trips between passes.

7. Key Optimization Patterns

Coalesced global memory access: Warp threads should access consecutive memory addresses (stride-1). Non-coalesced access fragments into multiple transactions. Ensure struct-of-arrays (SoA) layout instead of array-of-structs (AoS) when possible.
Thread coarsening: Assign multiple elements per thread to reduce scheduling overhead and improve instruction-level parallelism within a thread. Useful when occupancy is already high.
Kernel fusion: Combine multiple memory-bound kernels into one to avoid repeated global memory reads/writes between kernels. Can increase arithmetic intensity toward the compute roof.
Tiling for matrix multiplication: Load tiles from global memory into shared memory, compute partial sums, accumulate in registers. Converts O(N³) global bytes to O(N³/T) with tile size T, dramatically increasing arithmetic intensity.
Warp-level primitives: Use shuffle instructions (__shfl_xor_sync) for communication within a warp without shared memory — zero latency, no bank conflicts.
Asynchronous copies (Ampere+): memcpy_async overlaps data loading with computation using a software pipeline (prefetch two tiles alternately).
Mixed precision: FP16 has 2× throughput, BF16/TF32 Tensor Cores give 4–8× throughput for matrix multiply. Use FP16/BF16 accumulation where precision allows.

CPU vs GPU: A high-end CPU (32 cores, AVX-512) delivers ~3 TFLOPS FP32. A discrete GPU (H100) delivers ~60 TFLOPS. But throughput is only realized for parallel workloads with sufficient arithmetic intensity — sequential code with dependencies runs much faster on the CPU.

⚙️ Explore Algorithms →