WebGPU Compute Shaders: General GPU Programming in the Browser

1. WebGPU vs WebGL: Why Compute Matters

WebGL exposes OpenGL ES 2.0/3.0 — a graphics-only API where shaders can only read/write through the graphics pipeline (vertex → rasterisation → fragment). There is no way to write back to arbitrary memory from a shader. Compute-style tasks had to be hacked via render-to-texture and reading pixel colours as output.

WebGPU exposes Vulkan/Metal/D3D12 semantics with a first-class compute pipeline separate from the render pipeline. A compute shader can read from and write to arbitrary storage buffers, making proper GPGPU workloads possible for the first time in the browser:

WebGL compute hacks vs WebGPU native compute:

WebGL "GPGPU":
  1. Pack data into an RGBA float texture
  2. Render a fullscreen quad with a shader reading that texture
  3. Write the result to another texture via an FBO
  4. Read back with readPixels() if needed
  → Requires packing/unpacking data into RGBA texels, limited to four float32 components per texel

WebGPU compute:
  1. Put data in a storage buffer (arbitrary struct layout)
  2. Dispatch a compute shader
  3. The shader reads/writes the buffer directly
  4. Map the buffer back to the CPU if needed
  → Arbitrary types (u32, f32, vec3f, custom structs), random access, atomics

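Browser support (2025): WebGPU ships by default in Chrome 113+ and Edge 113+ on desktop. Firefox enabled it by default on Windows in version 141; on other platforms availability varies by release and it may still require the dom.webgpu.enabled preference. Safari enables it by default from Safari 26 (macOS 26, iOS 26); earlier releases exposed it only as a feature flag. Check for navigator.gpu to detect availability. For browsers without WebGPU, the fallback is WebGL2 with transform feedback or render-to-texture, since the proposed WebGL 2.0 Compute extension never shipped.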
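A minimal detection sketch for the paragraph above. Only navigator.gpu is a real API; the helper names hasWebGPU and pickBackend, and the backend labels they return, are illustrative assumptions.

```javascript
// True when the environment exposes the WebGPU entry point (navigator.gpu).
// `nav` is injectable so the check can be exercised outside a browser.
function hasWebGPU(nav = globalThis.navigator) {
  return typeof nav === "object" && nav !== null && nav.gpu != null;
}

// Hypothetical helper: map coarse capability flags to a backend name.
function pickBackend({ webgpu = false, webgl2 = false } = {}) {
  if (webgpu) return "webgpu";
  if (webgl2) return "webgl2";
  return "cpu";
}

// Example with stand-in navigator objects:
console.log(pickBackend({ webgpu: hasWebGPU({ gpu: {} }), webgl2: true })); // "webgpu"
console.log(pickBackend({ webgpu: hasWebGPU({}), webgl2: true }));          // "webgl2"
```

The presence of navigator.gpu is only a first check: requestAdapter() can still resolve to null on unsupported hardware, so real code should handle that case as well.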

2. GPU Architecture for Compute

Understanding the GPU execution model is essential for writing efficient compute shaders. A GPU is not a faster CPU — it is a massively parallel SIMD processor designed to execute thousands of threads simultaneously, hiding memory latency through thread switching.

    // Simplified GPU hierarchy (vendor names vary)
    GPU
    └── Streaming Multiprocessors (SM) / Compute Units (CU): e.g. 76 on RTX 4080
        └── CUDA Cores / Shader Processors: 128 per SM on Ampere and Ada
            └── Warp / Wave = 32 threads (NVIDIA) or 32/64 threads (AMD)
                └── All threads in a warp execute the SAME instruction

    WebGPU terminology:
      Workgroup  = group of threads that share fast local memory and can synchronise with barriers
      Invocation = a single thread (one execution of the shader)
      Global ID  = unique ID across all invocations for this dispatch
      Local ID   = ID within the workgroup (e.g. 0..63)

    Dispatch call: dispatchWorkgroups(x, y, z)
      → Launches x × y × z workgroups
      → Each workgroup runs workgroup_size threads
      → Total threads = x × y × z × workgroup_size
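The dispatch arithmetic above can be sketched in a few lines. The function names workgroupCount and planDispatch are illustrative; the 65535 figure is WebGPU's default maxComputeWorkgroupsPerDimension limit, which a real application would read from device.limits.

```javascript
// Default WebGPU limit for each dispatch dimension (maxComputeWorkgroupsPerDimension).
const DEFAULT_MAX_WORKGROUPS_PER_DIM = 65535;

// Number of workgroups needed so that every element gets one invocation.
function workgroupCount(elementCount, workgroupSize) {
  return Math.ceil(elementCount / workgroupSize);
}

// Summarise a 1D dispatch: how many threads launch and how many have no element.
function planDispatch(elementCount, workgroupSize, maxPerDim = DEFAULT_MAX_WORKGROUPS_PER_DIM) {
  const groups = workgroupCount(elementCount, workgroupSize);
  if (groups > maxPerDim) {
    throw new RangeError(`${groups} workgroups exceeds the per-dimension limit ${maxPerDim}`);
  }
  const totalInvocations = groups * workgroupSize;
  return { groups, totalInvocations, idleInvocations: totalInvocations - elementCount };
}

console.log(planDispatch(1_000_000, 64)); // 15625 groups, no idle invocations
console.log(planDispatch(1000, 64));      // 16 groups, 1024 invocations, 24 idle
```

The idle invocations in the last workgroup are why shaders need a bounds guard; when the group count exceeds the per-dimension limit, the usual remedy is to spread the dispatch over the y dimension.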

The key performance principle: keep threads busy with useful work and avoid thread divergence (different threads taking different if/else branches). When threads in the same warp diverge, the hardware executes each branch in turn with the non-participating threads masked off, so a two-way split can roughly halve throughput for that section of code.

3. WGSL: The WebGPU Shading Language

WGSL (WebGPU Shading Language) is a statically-typed, safety-first shading language. Unlike GLSL/HLSL, it has no implicit conversions between concrete types, no undefined behaviour from uninitialized variables (they are zero-initialised), and no pointer arithmetic. These restrictions are deliberate: they allow safe execution in the browser sandbox.

    // Minimal compute shader: adds two arrays element-wise
    @group(0) @binding(0) var<storage, read> a : array<f32>;
    @group(0) @binding(1) var<storage, read> b : array<f32>;
    @group(0) @binding(2) var<storage, read_write> result : array<f32>;

    @compute @workgroup_size(64)
    fn main(@builtin(global_invocation_id) gid : vec3u) {
      let i = gid.x;
      if (i >= arrayLength(&a)) { return; } // bounds guard
      result[i] = a[i] + b[i];
    }

    Key WGSL features:
      @group(g) @binding(b)             bind group slot
      var<storage, read>                read-only storage buffer
      var<storage, read_write>          read-write storage buffer
      var<workgroup>                    workgroup-shared memory
      var<uniform>                      uniform buffer (small, read-only constants)
      @builtin(global_invocation_id)    vec3u: workgroup_id * workgroup_size + local_invocation_id
      @builtin(local_invocation_id)     vec3u within workgroup
      @builtin(workgroup_id)            vec3u: which workgroup this is
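To make the relationship between workgroups, local IDs and the bounds guard concrete, here is a CPU emulation of a 1D dispatch of the element-wise add shader. It is a sketch for reasoning only: the function name emulateAdd is hypothetical, and the two nested loops stand in for invocations that the GPU would run in parallel.

```javascript
// CPU emulation of a 1D dispatch of the element-wise add shader.
// Each (workgroup, local) pair corresponds to one GPU invocation.
function emulateAdd(a, b, workgroupSize = 64) {
  const result = new Float32Array(a.length);
  const groups = Math.ceil(a.length / workgroupSize);
  let skipped = 0; // invocations stopped by the bounds guard
  for (let wid = 0; wid < groups; wid++) {
    for (let lid = 0; lid < workgroupSize; lid++) {
      const gid = wid * workgroupSize + lid; // global_invocation_id.x
      if (gid >= a.length) { skipped++; continue; } // bounds guard
      result[gid] = a[gid] + b[gid];
    }
  }
  return { result, skipped };
}

const { result, skipped } = emulateAdd(
  Float32Array.of(1, 2, 3, 4, 5),
  Float32Array.of(10, 20, 30, 40, 50),
  4,
);
console.log(Array.from(result), skipped); // [11, 22, 33, 44, 55] with 3 guarded invocations
```

Five elements with a workgroup size of 4 launch two workgroups, so three of the eight invocations fall past the end of the array and exit early.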

WGSL Type System

    // Scalar types
    bool, i32, u32, f32, f16   // f16 requires the shader-f16 feature and `enable f16;`

    // Vector types
    vec2f / vec2<f32>, vec3f, vec4f
    vec3i, vec3u               // int / uint vectors

    // Matrix types (matCxR: C columns, R rows)
    mat4x4f = 4 columns of vec4f (column-major)
    mat3x4f = 3 columns of vec4f (3 cols, 4 rows: unusual)

    // Struct (used for buffer layout)
    struct Particle {
      pos:  vec3f,   // offset 0  (vec3f: size 12, align 16)
      vel:  vec3f,   // offset 16
      mass: f32,     // offset 28: fits in the tail of vel's 16-byte slot
    }                // total size 32 bytes; a trailing f32 pad would grow it to 48

    // Array in storage buffer (runtime-sized)
    @group(0) @binding(0) var<storage, read_write> particles : array<Particle>;

    // Access:
    let p = particles[gid.x];
    particles[gid.x].vel += dt * force / p.mass;
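Struct sizes are easy to get wrong by hand, so it can help to compute them. The sketch below applies WGSL's layout rule (each member is placed at the next multiple of its alignment, and the struct size is rounded up to the largest member alignment) to a small table of types. The names WGSL_TYPES, roundUp and structLayout are illustrative, and only a handful of scalar and vector types are covered.

```javascript
// Size and alignment (bytes) of a few WGSL host-shareable types.
const WGSL_TYPES = {
  f32:   { size: 4,  align: 4 },
  u32:   { size: 4,  align: 4 },
  i32:   { size: 4,  align: 4 },
  vec2f: { size: 8,  align: 8 },
  vec3f: { size: 12, align: 16 },
  vec4f: { size: 16, align: 16 },
};

const roundUp = (alignment, n) => Math.ceil(n / alignment) * alignment;

// Compute member offsets and total struct size for a buffer layout.
// `members` is an ordered list of [name, wgslType] pairs.
function structLayout(members) {
  let offset = 0;
  let structAlign = 1;
  const offsets = {};
  for (const [name, type] of members) {
    const { size, align } = WGSL_TYPES[type];
    offset = roundUp(align, offset); // place member at its required alignment
    offsets[name] = offset;
    offset += size;
    structAlign = Math.max(structAlign, align);
  }
  return { offsets, size: roundUp(structAlign, offset), align: structAlign };
}

const compact = structLayout([["pos", "vec3f"], ["vel", "vec3f"], ["mass", "f32"]]);
const padded = structLayout([["pos", "vec3f"], ["vel", "vec3f"], ["mass", "f32"], ["_pad", "f32"]]);
console.log(compact.offsets, compact.size); // pos 0, vel 16, mass 28; size 32
console.log(padded.size);                   // 48
```

Note the effect of a trailing f32 after mass: because mass already fills the last 4 bytes of the second 16-byte slot, the extra field starts a new slot and the stride grows from 32 to 48 bytes. The JavaScript side must allocate and write buffers using the same stride.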

4. Setting Up a Compute Pipeline

A WebGPU compute workflow in JavaScript involves five steps: device initialisation, shader compilation, pipeline creation, resource binding, and dispatch.

    // Step 1: Get GPU device
    const adapter = await navigator.gpu.requestAdapter();
    if (!adapter) throw new Error('No WebGPU adapter available');
    const device = await adapter.requestDevice();

    // Step 2: Compile compute shader
    // (wgslString holds the WGSL source, e.g. the element-wise add shader from section 3)
    const shaderModule = device.createShaderModule({ code: wgslString });

    // Step 3: Create compute pipeline
    const pipeline = device.createComputePipeline({
      layout: 'auto', // auto-generate bind group layout from shader
      compute: {
        module: shaderModule,
        entryPoint: 'main',
      },
    });

    // Step 4: Create storage buffers
    const N = 1_000_000;
    const bufA = device.createBuffer({
      size: N * 4, // N float32 values = N * 4 bytes
      usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
    });
    const bufB = device.createBuffer({
      size: N * 4,
      usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
    });
    const bufResult = device.createBuffer({
      size: N * 4,
      usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
    });

    // Upload input data to bufA and bufB
    device.queue.writeBuffer(bufA, 0, new Float32Array(N).fill(1.0));
    device.queue.writeBuffer(bufB, 0, new Float32Array(N).fill(2.0));

    // Step 5: Create bind group and dispatch
    const bindGroup = device.createBindGroup({
      layout: pipeline.getBindGroupLayout(0),
      entries: [
        { binding: 0, resource: { buffer: bufA } },
        { binding: 1, resource: { buffer: bufB } },
        { binding: 2, resource: { buffer: bufResult } },
      ],
    });

    const encoder = device.createCommandEncoder();
    const pass = encoder.beginComputePass();
    pass.setPipeline(pipeline);
    pass.setBindGroup(0, bindGroup);
    pass.dispatchWorkgroups(Math.ceil(N / 64)); // workgroup_size=64
    pass.end();
    device.queue.submit([encoder.finish()]);

Reading Results Back to CPU

    // GPU buffers cannot be mapped directly: must copy to a staging buffer
    const readBuf = device.createBuffer({
      size: N * 4,
      usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
    });

    const enc2 = device.createCommandEncoder();
    enc2.copyBufferToBuffer(bufResult, 0, readBuf, 0, N * 4);
    device.queue.submit([enc2.finish()]);

    await readBuf.mapAsync(GPUMapMode.READ);
    const data = new Float32Array(readBuf.getMappedRange());
    console.log(data[0]); // use results
    readBuf.unmap();

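Staging buffer pattern: GPU compute results live in GPU-side memory (VRAM on discrete GPUs) that JavaScript cannot read directly. WebGPU only allows MAP_READ in combination with COPY_DST, so a storage buffer can never be mapped itself; you always copy into a staging buffer first. The promise returned by mapAsync resolves once the GPU has finished the submitted work that writes the buffer and its contents are visible to the CPU. This is the main CPU-GPU synchronisation point: awaiting it does not block the main thread, but the code after the await cannot run until the GPU catches up, so frequent readbacks serialise CPU and GPU work.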

5. Storage Buffers and Bind Groups

Storage buffers are the primary data containers for compute shaders. They hold arbitrary structs and arrays, support random access, and allow both read and write. Bind groups organise the connection between JavaScript-allocated buffers and shader binding slots.

    // Buffer usage flags (can be OR-combined)
    GPUBufferUsage.STORAGE    // readable/writable by compute/render shaders
    GPUBufferUsage.UNIFORM    // uniform buffer (fast, small, max 64KB on some HW)
    GPUBufferUsage.COPY_SRC   // can be source of copyBufferToBuffer
    GPUBufferUsage.COPY_DST   // can be destination of writeBuffer / copyBuffer
    GPUBufferUsage.MAP_READ   // CPU can mapAsync for reading (staging pattern)
    GPUBufferUsage.MAP_WRITE  // CPU can mapAsync for writing (upload staging)
    GPUBufferUsage.VERTEX     // vertex attribute buffer for render pipeline
    GPUBufferUsage.INDEX      // index buffer for render pipeline

    // Shared compute+render buffer (particle positions for both):
    const particleBuf = device.createBuffer({
      size: N * 32,                   // 32 bytes per Particle; must equal the WGSL struct stride
      usage: GPUBufferUsage.STORAGE   // compute can read/write
           | GPUBufferUsage.VERTEX,   // render pipeline reads as vertex attrib
    });
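Not every flag combination is legal, which is what forces the staging pattern. The sketch below encodes the two mapping rules from the WebGPU specification (MAP_READ may only be paired with COPY_DST, MAP_WRITE only with COPY_SRC). The Usage table copies the spec's numeric flag values so that the code runs outside a browser; checkUsage is a hypothetical helper, not part of the API, and it does not cover every validation rule.

```javascript
// Numeric values of GPUBufferUsage flags as defined by the WebGPU spec,
// reproduced here so the sketch runs without a browser.
const Usage = {
  MAP_READ: 0x0001, MAP_WRITE: 0x0002, COPY_SRC: 0x0004, COPY_DST: 0x0008,
  INDEX: 0x0010, VERTEX: 0x0020, UNIFORM: 0x0040, STORAGE: 0x0080,
};

// Returns null when the combination is allowed, otherwise a reason string.
function checkUsage(usage) {
  if (usage === 0) return "usage must not be empty";
  if ((usage & Usage.MAP_READ) && (usage & ~(Usage.MAP_READ | Usage.COPY_DST))) {
    return "MAP_READ may only be combined with COPY_DST";
  }
  if ((usage & Usage.MAP_WRITE) && (usage & ~(Usage.MAP_WRITE | Usage.COPY_SRC))) {
    return "MAP_WRITE may only be combined with COPY_SRC";
  }
  return null;
}

console.log(checkUsage(Usage.MAP_READ | Usage.COPY_DST)); // null (valid staging buffer)
console.log(checkUsage(Usage.MAP_READ | Usage.STORAGE));  // "MAP_READ may only be combined with COPY_DST"
console.log(checkUsage(Usage.STORAGE | Usage.VERTEX));    // null (shared compute + render buffer)
```

In a browser, createBuffer performs this validation itself and reports a validation error; a check like this is only useful for catching mistakes earlier or in tests.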

Bind Group Layout

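With layout: 'auto', WebGPU infers the bind group layout from the resource declarations in the shader. The catch is that a bind group created from one pipeline's auto-generated layout cannot be used with another pipeline. For production code where several pipelines share the same resources, explicit layouts allow bind groups to be reused across pipelines, which reduces CPU overhead per frame.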
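As an illustration of the explicit route, the sketch below builds a plain descriptor object of the shape that device.createBindGroupLayout() accepts. The helper name storageLayoutDescriptor is an assumption for this example; the constant falls back to 0x4, the spec value of GPUShaderStage.COMPUTE, when run outside a browser.

```javascript
// GPUShaderStage.COMPUTE is 0x4 in the WebGPU spec; fall back to that value outside a browser.
const COMPUTE_STAGE = globalThis.GPUShaderStage?.COMPUTE ?? 0x4;

// Build a bind group layout descriptor for a list of storage buffers.
// `access` holds one entry per binding, in binding order: "read" or "read_write".
function storageLayoutDescriptor(access) {
  return {
    entries: access.map((mode, binding) => ({
      binding,
      visibility: COMPUTE_STAGE,
      buffer: { type: mode === "read" ? "read-only-storage" : "storage" },
    })),
  };
}

// Matches the element-wise add shader: a and b read-only, result read_write.
const addLayout = storageLayoutDescriptor(["read", "read", "read_write"]);
console.log(JSON.stringify(addLayout.entries[2]));
```

In a browser, the descriptor would be passed to device.createBindGroupLayout(), and the resulting layout to device.createPipelineLayout({ bindGroupLayouts: [...] }), whose result replaces 'auto' in the pipeline descriptor. Every pipeline built with that layout can then share the same bind groups.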

6. Workgroups, Shared Memory & Synchronisation

The workgroup is the fundamental unit of collaboration in a compute shader. Threads within a workgroup can communicate through workgroup-shared memory, a small, fast scratchpad shared by all invocations in the group. WebGPU guarantees at least 16 KiB (device.limits.maxComputeWorkgroupStorageSize), and ~32 KB is typical on desktop hardware. Access is much faster than storage buffer access, often by an order of magnitude or more in latency.

    // Parallel reduction using shared memory (compute min of N floats)
    @group(0) @binding(0) var<storage, read> data : array<f32>;
    @group(0) @binding(1) var<storage, read_write> result : array<f32>;

    const WS = 256u;
    var<workgroup> scratch : array<f32, WS>; // workgroup-shared memory

    @compute @workgroup_size(WS)
    fn reduce_min(
      @builtin(global_invocation_id) gid : vec3u,
      @builtin(local_invocation_id) lid : vec3u,
      @builtin(workgroup_id) wid : vec3u,
    ) {
      let i = gid.x;
      let n = arrayLength(&data);

      // Phase 1: load from storage buffer into fast shared memory
      scratch[lid.x] = select(1e38, data[i], i < n); // fill OOB with a large sentinel
      workgroupBarrier(); // wait for ALL threads to finish loading

      // Phase 2: tree reduction in shared memory
      var stride = WS / 2u;
      loop {
        if (stride == 0u) { break; }
        if (lid.x < stride) {
          scratch[lid.x] = min(scratch[lid.x], scratch[lid.x + stride]);
        }
        workgroupBarrier();
        stride /= 2u;
      }

      // Thread 0 writes the workgroup result to output
      // (one partial minimum per workgroup; reduce `result` again for the global minimum)
      if (lid.x == 0u) { result[wid.x] = scratch[0]; }
    }
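The tree reduction can be traced on the CPU to check the indexing. In this sketch each inner loop over lid stands for all invocations running between two barriers, and a Float32Array stands in for workgroup memory. The names workgroupMin and reduceMin are illustrative, and the code assumes a non-empty input and a power-of-two workgroup size.

```javascript
// Emulate one workgroup's tree reduction in lock-step.
// Each pass over `lid` stands for all invocations running between two barriers.
function workgroupMin(data, groupIndex, WS) {
  const scratch = new Float32Array(WS); // stands in for var<workgroup> memory
  // Phase 1: every invocation loads one element (or a large sentinel when out of range).
  for (let lid = 0; lid < WS; lid++) {
    const i = groupIndex * WS + lid;
    scratch[lid] = i < data.length ? data[i] : 1e38;
  }
  // Phase 2: halve the active range until one value remains.
  for (let stride = WS >> 1; stride > 0; stride >>= 1) {
    for (let lid = 0; lid < stride; lid++) {
      scratch[lid] = Math.min(scratch[lid], scratch[lid + stride]);
    }
    // workgroupBarrier() would go here on the GPU
  }
  return scratch[0];
}

// Full reduction: one partial result per workgroup, then reduce the partials again.
function reduceMin(data, WS = 256) {
  const groups = Math.ceil(data.length / WS);
  const partials = Float32Array.from({ length: groups }, (_, g) => workgroupMin(data, g, WS));
  return partials.length <= 1 ? partials[0] : reduceMin(partials, WS);
}

console.log(reduceMin(Float32Array.of(4, -2, 9, 0.5, 7), 4)); // -2
```

The sequential emulation is equivalent to the parallel version because, within one stride step, invocations only write indices below stride and only read the upper half from indices at or above it. On the GPU each recursion level corresponds to one more dispatch over the partial results.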

Barriers

    workgroupBarrier()   // control barrier + memory barrier for workgroup memory
                         // ALL threads must reach this call before any continues
                         // makes writes to var<workgroup> memory visible across the workgroup
    storageBarrier()     // control barrier + memory barrier for storage buffers
                         // orders storage writes between invocations of the SAME workgroup only
                         // use when invocations exchange data through a storage buffer

    // ⚠️ Barriers must be called from uniform control flow: all invocations
    // in a workgroup MUST execute the same barriers. WGSL's uniformity analysis
    // rejects a barrier inside a conditional or loop that some threads could
    // skip, so such a shader fails to compile instead of hanging the GPU.

Choosing Workgroup Size

The optimal workgroup size balances hardware occupancy and shared memory usage. Common choices:

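64: a safe default. It is a multiple of NVIDIA's 32-thread warp (two warps per workgroup), matches AMD's 64-thread wave, and works well on Intel GPUs
128 or 256: good for reductions and algorithms that benefit from large shared memory blocks
Maximum (device.limits.maxComputeInvocationsPerWorkgroup): typically 256 (mobile) to 1024 (desktop). The WebGPU default limit is 256, so larger sizes must be requested through requiredLimits when creating the device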

Subgroup operations (2025): The subgroups WebGPU extension (shipping 2025) exposes warp-level primitives like subgroupAdd, subgroupMin, subgroupBallot. These are faster than shared-memory reductions because they use hardware shuffle instructions within a warp, eliminating shared memory writes entirely for the innermost reduction level.

7. Common Compute Patterns

Parallel Array Map

The simplest pattern: one invocation per element, each doing independent work. Ideal for element-wise operations — perfectly parallel, no communication needed.

    // dispatch: Math.ceil(N / workgroup_size) workgroups
    @compute @workgroup_size(64)
    fn map(@builtin(global_invocation_id) gid: vec3u) {
      let i = gid.x;
      if (i >= arrayLength(&data)) { return; }
      output[i] = expensive_function(data[i]);
    }

Prefix Sum (Scan)

Prefix sum is used for stream compaction and histogram building. It is not trivially parallel, because each output depends on all previous elements. The parallel scan algorithm breaks this dependency chain into O(log N) steps:

    // Parallel exclusive prefix sum: three-phase approach
    // Phase 1: workgroup-local scan → partial sums in shared memory
    // Phase 2: scan the partial sums (recursively) → offsets per workgroup
    // Phase 3: add workgroup offset to each element in the original workgroup

    // This is the core of GPU stream compaction:
    // "keep only elements satisfying a predicate and write them compactly"
    // Used in: particle death/birth, ray tracing active ray compaction,
    //          physics broadphase collision list building
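The three phases are easier to follow in a CPU reference implementation. This is a sketch of the block structure only: the per-block scan in phase 1 is a plain sequential loop here, whereas a GPU version would use a parallel workgroup-local scan. The names scanBlock, exclusiveScan and compact are illustrative.

```javascript
// Exclusive scan of values[start..end) in place; returns the block's total.
function scanBlock(values, start, end) {
  let running = 0;
  for (let i = start; i < end; i++) {
    const v = values[i];
    values[i] = running;
    running += v;
  }
  return running;
}

// Three-phase exclusive prefix sum over blocks of `blockSize` elements.
function exclusiveScan(input, blockSize = 256) {
  const out = Uint32Array.from(input);
  const blocks = Math.ceil(out.length / blockSize);
  if (blocks === 0) return out;

  // Phase 1: scan each block independently and record its total.
  const blockSums = new Uint32Array(blocks);
  for (let b = 0; b < blocks; b++) {
    blockSums[b] = scanBlock(out, b * blockSize, Math.min((b + 1) * blockSize, out.length));
  }
  // Phase 2: scan the block totals (recursively) to get one offset per block.
  const offsets = blocks > 1 ? exclusiveScan(blockSums, blockSize) : new Uint32Array(1);
  // Phase 3: add each block's offset to its elements.
  for (let b = 1; b < blocks; b++) {
    const end = Math.min((b + 1) * blockSize, out.length);
    for (let i = b * blockSize; i < end; i++) out[i] += offsets[b];
  }
  return out;
}

// Stream compaction: keep elements that satisfy `keep`, preserving order.
function compact(items, keep) {
  const flags = items.map((x) => (keep(x) ? 1 : 0));
  const dest = exclusiveScan(flags, 4); // scanned flags give each survivor its output slot
  const last = items.length - 1;
  const count = items.length ? dest[last] + flags[last] : 0;
  const out = new Array(count);
  items.forEach((x, i) => { if (flags[i]) out[dest[i]] = x; });
  return out;
}

console.log(Array.from(exclusiveScan([3, 1, 7, 0, 4, 1, 6, 3], 4))); // [0, 3, 4, 11, 11, 15, 16, 22]
console.log(compact([5, -2, 8, -1, 3], (x) => x > 0));               // [5, 8, 3]
```

In the compaction step every surviving element knows its destination index without talking to any other element, which is what makes the final scatter perfectly parallel on the GPU.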

Atomic Operations

    // Atomic operations in WGSL (integer only: no atomic float in WebGPU)
    var<storage, read_write> counter : atomic<u32>;
    var<storage, read_write> histogram : array<atomic<u32>, 256>;

    // In shader:
    atomicAdd(&counter, 1u);                    // thread-safe increment
    let old = atomicMin(&histogram[bin], val);  // returns old value
    atomicStore(&counter, 0u);                  // reset to zero
    let v = atomicLoad(&counter);               // thread-safe read

    // No atomic float: use an integer encoding
    //   sums:    fixed point, e.g. atomicAdd(&acc, u32(f32_value * scale)) for non-negative values
    //   min/max: bitcast<u32>(f32_value) preserves ordering for non-negative floats,
    //            so integer atomicMax / atomicMin work on the encoded values
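For min/max over floats that may be negative, a plain bit reinterpretation is not enough because negative floats sort in reverse as unsigned integers. One widely used workaround, shown here as an assumption beyond what the text describes, flips the sign bit of non-negative values and all bits of negative values. The sketch checks the mapping on the CPU with typed arrays; encodeOrdered and decodeOrdered are hypothetical names, and NaN is not handled.

```javascript
const f32 = new Float32Array(1);
const u32 = new Uint32Array(f32.buffer); // shares bytes with f32 (a JS-side "bitcast")

// Map an f32 to a u32 whose unsigned ordering matches the float ordering.
function encodeOrdered(value) {
  f32[0] = value;
  const bits = u32[0];
  // Negative floats: flip every bit. Non-negative floats: set the sign bit.
  return (bits & 0x80000000 ? ~bits : bits | 0x80000000) >>> 0;
}

// Inverse of encodeOrdered.
function decodeOrdered(key) {
  u32[0] = key & 0x80000000 ? key & 0x7fffffff : ~key;
  return f32[0];
}

// Math.max / Math.min over the keys play the role of atomicMax / atomicMin.
const samples = [-3.5, 2.25, -0.5, 7];
const keys = samples.map(encodeOrdered);
console.log(decodeOrdered(Math.max(...keys))); // 7
console.log(decodeOrdered(Math.min(...keys))); // -3.5
```

The same two branches translate directly to WGSL using bitcast<u32> and bitwise operators, after which atomicMax and atomicMin on the encoded keys behave like their floating-point counterparts.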

8. Use Cases: Physics, ML & Image Processing

Physics Simulation (Particle Systems)

A WebGPU particle simulation uses two storage buffers (ping-pong between frames) and a compute shader per physics step. No CPU readback is needed when the same buffer is bound as a vertex buffer for rendering:

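    // Per-particle compute shader (simplified N-body gravity)
    // num_particles, G and dt are assumed to be supplied elsewhere, e.g. via a uniform buffer
    @compute @workgroup_size(64)
    fn simulate(@builtin(global_invocation_id) gid: vec3u) {
      let i = gid.x;
      if (i >= num_particles) { return; } // bounds guard for the last workgroup
      var p = particles_in[i];
      var force = vec3f(0.0);
      for (var j = 0u; j < num_particles; j++) {
        if (j == i) { continue; }
        let diff = particles_in[j].pos - p.pos;
        let d2 = dot(diff, diff) + 0.01; // softening
        force += diff * (G * p.mass * particles_in[j].mass / (d2 * sqrt(d2)));
      }
      p.vel += dt * force / p.mass;
      p.pos += dt * p.vel;
      particles_out[i] = p;
    }

    // Indicative timings quoted for this example (the all-pairs loop is O(N²),
    // so results depend heavily on N and hardware; measure your own workload):
    // 100 000 particles, 64 iterations: ~8ms on RTX 3080
    // Same simulation on CPU (single-threaded JS): ~600ms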
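A CPU reference of the same step is useful for validating GPU output on a small particle count. The sketch below mirrors the shader's arithmetic using plain objects instead of packed buffers; the function name stepNBody and the object shape { pos, vel, mass } are assumptions for illustration.

```javascript
// CPU reference for one all-pairs gravity step, mirroring the shader's maths.
// Returns a new array (the "out" buffer) and leaves the input untouched,
// which is the same ping-pong discipline the GPU version relies on.
function stepNBody(particlesIn, G, dt) {
  return particlesIn.map((p, i) => {
    const force = [0, 0, 0];
    particlesIn.forEach((q, j) => {
      if (j === i) return; // no self-interaction
      const diff = [q.pos[0] - p.pos[0], q.pos[1] - p.pos[1], q.pos[2] - p.pos[2]];
      const d2 = diff[0] ** 2 + diff[1] ** 2 + diff[2] ** 2 + 0.01; // softening
      const s = (G * p.mass * q.mass) / (d2 * Math.sqrt(d2));
      for (let k = 0; k < 3; k++) force[k] += diff[k] * s;
    });
    const vel = p.vel.map((v, k) => v + (dt * force[k]) / p.mass);
    const pos = p.pos.map((x, k) => x + dt * vel[k]);
    return { pos, vel, mass: p.mass };
  });
}

// Two equal masses at rest attract each other symmetrically.
const before = [
  { pos: [-1, 0, 0], vel: [0, 0, 0], mass: 1 },
  { pos: [1, 0, 0], vel: [0, 0, 0], mass: 1 },
];
const after = stepNBody(before, 1.0, 0.01);
console.log(after[0].vel[0] > 0, after[1].vel[0] < 0); // true true
```

Comparing a handful of particles between this reference and the values read back from the GPU (within a floating-point tolerance, since the GPU works in f32) catches most indexing and layout mistakes.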

Machine Learning Inference

Transformer and CNN inference runs directly in the browser using WebGPU compute shaders. Libraries like TensorFlow.js and ONNX Runtime Web use WebGPU backends that implement matrix multiply (GEMM), convolution, and activation functions as compute shaders. A 7B-parameter quantised LLM can run at 10-20 tokens/s on mid-range gaming GPUs via WebGPU.

Image Processing

    // 2D convolution using workgroup tiles (e.g. Gaussian blur)
    const TILE = 16u;
    var<workgroup> tile : array<array<vec4f, TILE + 4>, TILE + 4>; // 16×16 tile + 2-pixel halo per side

    @compute @workgroup_size(TILE, TILE)
    fn blur(
      @builtin(global_invocation_id) gid: vec3u,
      @builtin(local_invocation_id) lid: vec3u,
    ) {
      // Load tile + halo (border pixels for kernel overlap) into shared memory
      // workgroupBarrier() so the whole tile is visible before filtering
      // Apply separable 5×5 Gaussian kernel in shared memory
      // Write result to output texture (write-only storage texture)
    }

    // Advantage over fragment shader blur:
    // - Tile-based access avoids redundant texture fetches across threads
    // - With the halo in shared memory, each workgroup reads every pixel it needs
    //   from VRAM once (only halo pixels are re-read by neighbouring workgroups)
    // - No render pipeline setup: just dispatch + read result
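The "load tile + halo" step is the fiddly part, because 256 invocations have to fill a 20×20 tile. One possible scheme, sketched here on the CPU, has each invocation walk the padded tile in strides of 256 slots. The strided scheme, the clamp-to-edge addressing and the name haloLoads are all assumptions for illustration; the shader skeleton above does not prescribe them.

```javascript
const TILE = 16;                 // workgroup is TILE x TILE invocations
const HALO = 2;                  // a 5-tap kernel needs 2 extra texels per side
const PADDED = TILE + 2 * HALO;  // 20: edge length of the shared-memory tile

const clamp = (v, lo, hi) => Math.min(Math.max(v, lo), hi);

// Source texels that one invocation copies into the shared tile.
// (wx, wy): workgroup id; (lx, ly): local id; width/height: image size.
function haloLoads(wx, wy, lx, ly, width, height) {
  const loads = [];
  const threads = TILE * TILE;
  // Walk the padded tile in strides of `threads` so 256 invocations cover 400 slots.
  for (let slot = ly * TILE + lx; slot < PADDED * PADDED; slot += threads) {
    const tx = slot % PADDED;
    const ty = Math.floor(slot / PADDED);
    loads.push({
      tile: [tx, ty],
      // Clamp-to-edge addressing for texels outside the image.
      src: [
        clamp(wx * TILE + tx - HALO, 0, width - 1),
        clamp(wy * TILE + ty - HALO, 0, height - 1),
      ],
    });
  }
  return loads;
}

console.log(haloLoads(1, 1, 0, 0, 64, 64)); // invocation (0,0) fills slots 0 and 256
```

With this scheme invocations 0 to 143 load two texels and the rest load one, so every slot of the padded tile is written exactly once before the barrier.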

Performance reality check: WebGPU compute is not always faster than CPU. The GPU has very high memory bandwidth but also a high fixed cost to start work: encoding and submitting a command buffer typically adds on the order of 0.1–1 ms, and reading results back adds more. For workloads smaller than ~10 000 elements or with complex data-dependent branching, the CPU may be faster. The sweet spot for WebGPU compute is large, regular, parallelisable workloads: particle physics, n-body, convolution, sorting, reduction.