GPU · Computer Architecture · Parallel Computing
📅 Квітень 2026 ⏱ ≈ 13 хв читання 🎯 Advanced

GPU Parallel Computing — SIMT, Warp Divergence, and the Roofline Model

A modern discrete GPU executes tens of thousands of threads simultaneously. An NVIDIA H100 has 16,896 CUDA cores and peaks at ~60 TFLOPS of FP32. Achieving even 50% of peak performance requires understanding the execution model — SIMT, warps, memory hierarchy, and occupancy — and how to avoid the performance killers: warp divergence, memory bank conflicts, and global memory latency.

1. SIMD vs SIMT Execution

CPU SIMD (Single Instruction, Multiple Data): One instruction operates on a VECTOR of 4/8/16 data elements simultaneously. e.g., AVX-512: single instruction adds 16 floats at once. Programmer explicitly vectorizes; each element is independently addressable. Branch divergence: handled by masking some lanes. GPU SIMT (Single Instruction, Multiple Threads): One instruction operates on 32 THREADS simultaneously (a warp). Each thread has ITS OWN registers, program counter, and stack. Threads appear independent to the programmer. Key difference from SIMD: SIMD: programmer manages vector width explicitly. SIMT: hardware auto-vectorizes 32 threads with the same PC. SIMT enables: - Each thread can independently branch (but see warp divergence below) - Each thread has its own register file (no explicit masking needed) - Threads within a warp can voluntarily synchronize (__syncthreads)

2. Warps, Thread Blocks, and Grids

Thread hierarchy (CUDA terminology): Grid (all thread blocks for one kernel launch) └── Block (up to 1024 threads, share shared memory, sync with __syncthreads) └── Warp (32 threads — the scheduling unit) └── Thread (independent register file + PC) Warp divergence: When threads within a warp take DIFFERENT branches (if/else): The warp executes BOTH paths with inactive lanes masked off. Effectively serializes: 32 threads execute path A, then path B. Performance penalty: up to 2× slowdown at 50/50 branching. Example — avoid this: if (threadIdx.x % 2 == 0) { doA(); } else { doB(); } → Half of every warp diverges. Better: restructure so thread blocks are either all-A or all-B. Key numbers (NVIDIA Ampere/Hopper): Threads per warp: 32 Max threads per block: 1024 Max warps per SM: 48–64 Warp size has been 32 since G80 (2006); unlikely to change.
Amortizing warp scheduling latency: Global memory latency is ~300–500 cycles. GPUs tolerate this by switching to another ready warp while the first warp waits for memory. This latency hiding only works if there are enough other warps ready — hence the importance of occupancy.

3. GPU Memory Hierarchy

Level Size Latency Scope Bandwidth
Registers ~64K per SM ~1 cycle Thread-private ~120+ TFLOPS equivalent
Shared memory / L1 ~48–228 KB/SM ~20–40 cycles Block-shared ~TB/s aggregate
L2 cache ~50 MB (H100) ~200 cycles GPU-wide ~7 TB/s
Global (HBM) ~16–80 GB ~300–500 cycles GPU-wide, host-visible 3.35 TB/s (H100 SXM)

Shared Memory Bank Conflicts

Shared memory is divided into 32 BANKS (one per warp thread). Successive 4-byte words map to successive banks: bank = (addr/4) % 32. Conflict-free access: each warp thread hits a different bank → 1 cycle. N-way bank conflict: N threads all access bank B → N serialized accesses. Common mistake: float tile[32][32]; // Thread i accesses tile[i][j]: column-wise stride = 1 (bank = 0,1,...31) ✓ // Thread i accesses tile[j][i]: row-wise access, all threads hit bank j % 32 // → 32-way bank conflict! → 32× slower Fix: add padding: float tile[32][33]; — shifts each row by 1 word, spreading accesses across different banks.

4. The Roofline Model

The roofline model (Williams, Waterman, Patterson 2009) is a simple visual tool to understand whether a kernel is compute-bound or memory-bound. Two "roofs" define the performance ceiling:

Arithmetic Intensity (AI) = FLOPs ÷ Bytes of memory traffic Roofline equation: Attainable_GFLOPs = min(Peak_GFLOPs, AI × Peak_Bandwidth_GB/s) Example: NVIDIA H100 SXM Peak FP32 throughput: ~60 TFLOPS = 60,000 GFLOPS HBM bandwidth: ~3,350 GB/s Ridge point AI: 60,000 / 3,350 ≈ 17.9 FLOPs/byte A kernel with AI < 17.9 is memory-bound (bandwidth-limited). A kernel with AI > 17.9 is compute-bound (FLOP-limited). Common kernel arithmetic intensities: Vector copy (y = x): AI = 0.08 (2 bytes, ~0 FLOPs) → deeply memory-bound DAXPY (y = αx + y): AI = 0.125 → memory-bound Sparse matrix-vector: AI ~ 0.25–2.0 → memory-bound Dense matrix mult: AI ~ n/2 (grows with n) → compute-bound for large n Optimization strategy: Memory-bound → reduce bytes (half-precision, tiling, fusion, caching) Compute-bound → reduce FLOPs (better algorithms) or add parallelism

5. Occupancy and Limiting Resources

Occupancy = Active warps per SM / Maximum warps per SM Resources limiting occupancy (per SM, choose block config to maximize): 1. Shared memory: allocating more shared memory per block → fewer blocks fit. 2. Registers: more registers per thread → fewer warps fit. 3. Max blocks: hardware cap on concurrent blocks per SM. Example: SM can hold max 48 warps. A block of 256 threads = 8 warps. If each block uses 16 KB shared memory and SM has 64 KB: → Max 4 blocks = 32 warps → occupancy = 32/48 = 67%. If block is 128 threads (4 warps), only 8 KB shared: → 8 blocks = 32 warps → occupancy = 67% (same, limited differently). Reduce to 4 KB → 16 blocks = 48 warps → 100% occupancy. Registers: if a kernel uses 64 registers/thread, and SM has 64K registers: 64K / 64 = 1024 threads = 32 warps → occupancy limited. Important: 100% occupancy does not guarantee peak performance! A compute-bound kernel may run fast at 50% occupancy if it keeps all functional units busy. Use NVIDIA Nsight Compute to profile. NVIDIA CUDA occupancy calculator formula: active_blocks = min( floor(max_shared / shared_per_block), floor(max_registers / (reg_per_thread × block_size)), max_blocks_per_sm ) active_warps = active_blocks × ceil(block_size / 32)

6. WebGPU Compute Shaders

WebGPU (available in Chrome 113+, Firefox Nightly) exposes GPU compute shaders in the browser using WGSL (WebGPU Shading Language):

// WGSL compute shader — parallel prefix sum (one pass)
@group(0) @binding(0) var<storage, read>  input  : array<f32>;
@group(0) @binding(1) var<storage, read_write> output : array<f32>;
@group(0) @binding(2) var<uniform> params : struct { n: u32 };

var<workgroup> shared_data : array<f32, 256>;

@compute @workgroup_size(256, 1, 1)
fn main(
  @builtin(global_invocation_id) gid : vec3<u32>,
  @builtin(local_invocation_id)  lid : vec3<u32>
) {
  let i = gid.x;
  let l = lid.x;

  // Load into workgroup shared memory
  shared_data[l] = select(0.0f, input[i], i < params.n);
  workgroupBarrier();

  // Hillis-Steel parallel scan (log₂ 256 = 8 iterations)
  for (var offset = 1u; offset < 256u; offset <<= 1u) {
    if (l >= offset) {
      shared_data[l] += shared_data[l - offset];
    }
    workgroupBarrier();
  }

  if (i < params.n) { output[i] = shared_data[l]; }
}

Dispatch this shader with passEncoder.dispatchWorkgroups(Math.ceil(n/256), 1, 1). Each workgroup processes 256 elements in ~8 barrier-separated passes, with shared memory eliminating global memory round-trips between passes.

7. Key Optimization Patterns

CPU vs GPU: A high-end CPU (32 cores, AVX-512) delivers ~3 TFLOPS FP32. A discrete GPU (H100) delivers ~60 TFLOPS. But throughput is only realized for parallel workloads with sufficient arithmetic intensity — sequential code with dependencies runs much faster on the CPU.
⚙️ Explore Algorithms →