WebGPU Compute Shaders: General-Purpose GPU Programming in the Browser
WebGPU's compute pipeline unlocks general-purpose GPU programming
directly in the browser, with no graphics pipeline required. Particle
physics, neural network inference, image processing, and procedural
generation that would take seconds on the CPU can complete in
milliseconds when thousands of GPU cores run in parallel. This article
covers the complete compute workflow: WGSL shaders, workgroup
topology, storage buffers, synchronisation, and practical patterns for
simulation-scale workloads.
1. WebGPU vs WebGL: Why Compute Matters
WebGL exposes OpenGL ES 2.0/3.0, a graphics-only API where shaders
can only read and write through the graphics pipeline (vertex →
rasterisation → fragment). There is no way to write back to arbitrary
memory from a shader. Compute-style tasks had to be hacked via
render-to-texture, reading pixel colours back as output.
WebGPU exposes Vulkan/Metal/D3D12 semantics with a first-class
compute pipeline separate from the render pipeline. A
compute shader can read from and write to arbitrary storage buffers,
making proper GPGPU workloads possible for the first time in the
browser:
WebGL compute hacks vs WebGPU native compute:
WebGL "GPGPU":
1. Pack data into an RGBA float texture
2. Render a fullscreen quad with a shader reading that texture
3. Write the result to another texture via an FBO
4. Read back with readPixels() if needed
→ Requires packing/unpacking data into RGBA texels, limited to float32×4 components
WebGPU compute:
1. Put data in a storage buffer (arbitrary struct layout)
2. Dispatch the compute shader
3. The shader reads/writes the buffer directly
4. Map the buffer back to the CPU if needed
→ Arbitrary types (u32, f32, vec3f, custom structs), random access, atomics
Browser support (2025): WebGPU is shipping in Chrome
113+ and Edge 113+. Firefox has been enabling it platform by platform;
where it is not yet on by default it sits behind the
dom.webgpu.enabled preference. Safari turned it on by default in
Safari 26; earlier releases exposed it only as a feature flag. Use
navigator.gpu to detect availability. For older browsers,
WebGL2 with transform feedback or render-to-texture is the
fallback.
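Detection via navigator.gpu can be sketched as follows. This is a minimal sketch, not a definitive implementation: the helper names hasWebGPU and requestDeviceOrNull are illustrative, and the navigator object is passed in as a parameter only so the check can be exercised outside a browser.

```javascript
// Synchronous check: is the WebGPU entry point exposed at all?
function hasWebGPU(nav = globalThis.navigator) {
  return Boolean(nav && nav.gpu);
}

// Asynchronous check: the API can exist while no usable adapter is available
// (for example on a blocklisted driver), so requestAdapter() may resolve to null.
async function requestDeviceOrNull(nav = globalThis.navigator) {
  if (!hasWebGPU(nav)) return null;
  const adapter = await nav.gpu.requestAdapter();
  return adapter ? adapter.requestDevice() : null;
}
```

A caller would typically branch on the result: use the returned device when it is non-null and switch to the WebGL2 path otherwise.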
2. GPU Architecture for Compute
Understanding the GPU execution model is essential for writing
efficient compute shaders. A GPU is not a faster CPU: it is a
massively parallel SIMD processor designed to execute thousands of
threads simultaneously, hiding memory latency through thread
switching.
// Simplified GPU hierarchy (vendor names vary)
GPU
└── Streaming Multiprocessors (SM) / Compute Units (CU): e.g. 76 on RTX 4080
    └── CUDA Cores / Shader Processors: 128 per SM on Ampere and Ada
        └── Warp / Wave = 32 (NVIDIA) or 64 (AMD) threads
            └── All threads in a warp execute the SAME instruction
WebGPU terminology:
Workgroup = group of threads that share fast local memory and can synchronise with barriers
Invocation = a single thread (one execution of the shader)
Global ID = unique ID across all invocations for this dispatch
Local ID = ID within the workgroup (e.g. 0..63)
Dispatch call: dispatchWorkgroups(x, y, z)
→ Launches x×y×z workgroups
→ Each workgroup runs workgroup_size threads
→ Total threads = x × y × z × workgroup_size
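The dispatch arithmetic above can be checked on the host before calling dispatchWorkgroups. The helper below is a small sketch for the one-dimensional case; the name dispatchSize and the shape of its return value are assumptions made for illustration.

```javascript
// For a 1D dispatch: how many workgroups are needed to cover elementCount items,
// how many invocations actually run, and how many of them have nothing to do.
function dispatchSize(elementCount, workgroupSize = 64) {
  const workgroups = Math.ceil(elementCount / workgroupSize);
  const invocations = workgroups * workgroupSize;
  return { workgroups, invocations, idle: invocations - elementCount };
}

// 1000 elements with 64 threads per workgroup:
// 16 workgroups, 1024 invocations, 24 of which must exit early.
const plan = dispatchSize(1000, 64);
```

The idle invocations are the reason compute shaders usually start with a bounds guard: the last workgroup runs past the end of the data.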
The key performance principle: keep threads busy with useful work and
avoid thread divergence (threads in the same warp taking
different if/else branches). A divergent warp executes both branches
one after the other with part of its threads masked off, which for a
simple two-way branch can roughly halve throughput.
3. WGSL: The WebGPU Shading Language
WGSL (WebGPU Shading Language) is a statically-typed, safety-first
shading language. Unlike GLSL/HLSL, it has no implicit type
conversions, no undefined behaviour from uninitialized variables, and
no pointer arithmetic. This is by design, to allow safe execution in the
browser sandbox.
// Minimal compute shader: adds two arrays element-wise
@group(0) @binding(0) var<storage, read> a : array<f32>;
@group(0) @binding(1) var<storage, read> b : array<f32>;
@group(0) @binding(2) var<storage, read_write> result : array<f32>;
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3u) {
  let i = gid.x;
  if (i >= arrayLength(&a)) { return; } // bounds guard
  result[i] = a[i] + b[i];
}
Key WGSL features:
@group(g) @binding(b) → bind group slot
var<storage, read> → read-only storage buffer
var<storage, read_write> → read-write storage buffer
var<workgroup> → workgroup-shared memory
var<uniform> → uniform buffer (small, read-only constants)
@builtin(global_invocation_id) → vec3u: workgroup_id * workgroup_size + local_invocation_id
@builtin(local_invocation_id) → vec3u within the workgroup
@builtin(workgroup_id) → vec3u: which workgroup this is
WGSL Type System
// Scalar types
bool, i32, u32, f32, f16 // f16 requires the shader-f16 feature and "enable f16;"
// Vector types
vec2f / vec2<f32>, vec3f, vec4f
vec3i, vec3u // int / uint vectors
// Matrix types
mat4x4f = 4 columns of vec4f (column-major)
mat3x4f = 3 columns of vec4f (3 cols, 4 rows; unusual)
// Struct (used for buffer layout)
struct Particle {
  pos: vec3f, // offset 0 (vec3f is 12 bytes but 16-byte aligned)
  vel: vec3f, // offset 16
  mass: f32, // offset 28, fills the tail of vel's 16-byte slot
} // 32 bytes in total; an extra trailing f32 would round the struct up to 48
// Array in storage buffer (runtime-sized)
@group(0) @binding(0) var<storage, read_write> particles : array<Particle>;
// Access:
let p = particles[gid.x];
particles[gid.x].vel += dt * force / p.mass;
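On the JavaScript side the same struct has to be written into a typed array with matching offsets. The sketch below assumes a 32-byte Particle layout with pos at byte 0, vel at byte 16 and mass at byte 28 (the layout WGSL produces for two vec3f fields followed by one f32). The function name packParticles and the plain-object input shape are illustrative.

```javascript
const PARTICLE_FLOATS = 8; // 32 bytes per particle / 4 bytes per f32

// Packs [{ pos: [x, y, z], vel: [x, y, z], mass }] into a Float32Array
// that can be uploaded with device.queue.writeBuffer().
function packParticles(particles) {
  const data = new Float32Array(particles.length * PARTICLE_FLOATS);
  particles.forEach((p, i) => {
    const base = i * PARTICLE_FLOATS;
    data.set(p.pos, base);      // floats 0..2, byte offset 0
    // float 3 is the padding that aligns vel to 16 bytes
    data.set(p.vel, base + 4);  // floats 4..6, byte offset 16
    data[base + 7] = p.mass;    // byte offset 28
  });
  return data;
}

const packed = packParticles([{ pos: [1, 2, 3], vel: [4, 5, 6], mass: 7 }]);
```

If the struct gains or reorders fields, the offsets here have to be recomputed from WGSL's alignment rules; a mismatch shows up as scrambled positions rather than as an error.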
4. Setting Up a Compute Pipeline
A WebGPU compute workflow in JavaScript involves five steps: device
initialisation, shader compilation, pipeline creation, resource
binding, and dispatch.
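The five steps can be sketched end to end for the element-wise add shader from section 3. This is a minimal sketch under stated assumptions, not a definitive implementation: the names ADD_WGSL, initDevice and addArraysOnGPU are illustrative, inputs are assumed to be Float32Arrays of equal length, and error handling is omitted. The buffer it returns plays the role of bufResult in the readback code that follows.

```javascript
// WGSL source for the element-wise add (same shader as in section 3).
const ADD_WGSL = `
@group(0) @binding(0) var<storage, read> a : array<f32>;
@group(0) @binding(1) var<storage, read> b : array<f32>;
@group(0) @binding(2) var<storage, read_write> result : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3u) {
  let i = gid.x;
  if (i >= arrayLength(&a)) { return; }
  result[i] = a[i] + b[i];
}`;

// Step 1: device initialisation (browser only; throws if WebGPU is missing).
async function initDevice() {
  const adapter = await navigator.gpu?.requestAdapter();
  if (!adapter) throw new Error('WebGPU not available');
  return adapter.requestDevice();
}

// Steps 2-5. Nothing here waits for the GPU: the work is only queued.
function addArraysOnGPU(device, a, b) {
  const n = a.length;
  const byteSize = n * Float32Array.BYTES_PER_ELEMENT;

  // Step 2: shader compilation
  const module = device.createShaderModule({ code: ADD_WGSL });

  // Step 3: pipeline creation
  const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: { module, entryPoint: 'main' },
  });

  // Step 4: resource binding (upload inputs, allocate output, build bind group)
  const upload = (src) => {
    const buf = device.createBuffer({
      size: byteSize,
      usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
    });
    device.queue.writeBuffer(buf, 0, src);
    return buf;
  };
  const bufA = upload(a);
  const bufB = upload(b);
  const bufResult = device.createBuffer({
    size: byteSize,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC, // COPY_SRC for readback
  });
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: bufA } },
      { binding: 1, resource: { buffer: bufB } },
      { binding: 2, resource: { buffer: bufResult } },
    ],
  });

  // Step 5: dispatch (one invocation per element, 64 per workgroup)
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil(n / 64));
  pass.end();
  device.queue.submit([encoder.finish()]);
  return bufResult;
}
```

In a browser this would be called as const bufResult = addArraysOnGPU(await initDevice(), a, b), with N equal to a.length.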
// Storage buffers cannot be mapped directly: copy to a staging buffer first
const readBuf = device.createBuffer({
  size: N * 4,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});
const enc2 = device.createCommandEncoder();
enc2.copyBufferToBuffer(bufResult, 0, readBuf, 0, N * 4);
device.queue.submit([enc2.finish()]);
await readBuf.mapAsync(GPUMapMode.READ);
const data = new Float32Array(readBuf.getMappedRange());
console.log(data[0]); // use (or copy) results before unmapping
readBuf.unmap(); // the mapped view is detached after this call
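Staging buffer pattern: GPU compute results live in
GPU memory (VRAM on discrete cards), which JavaScript cannot read
directly. MAP_READ usage can only be combined with COPY_DST, so a
storage buffer can never be mapped itself and you always need a
separate staging buffer. The mapAsync call returns a promise that
resolves once the GPU has finished the work queued before it and the
buffer contents are available in CPU-readable memory. This is the main
CPU-GPU synchronisation point: it does not block the main thread, but
code awaiting it pays the full round-trip latency.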
5. Storage Buffers and Bind Groups
Storage buffers are the primary data containers for compute shaders.
They hold arbitrary structs and arrays, support random access, and
allow both read and write. Bind groups organise the connection between
JavaScript-allocated buffers and shader binding slots.
// Buffer usage flags (can be OR-combined)
GPUBufferUsage.STORAGE // readable/writable by compute/render shaders
GPUBufferUsage.UNIFORM // uniform buffer (fast, small; default binding limit is 64 KiB)
GPUBufferUsage.COPY_SRC // can be source of copyBufferToBuffer
GPUBufferUsage.COPY_DST // can be destination of writeBuffer / copyBufferToBuffer
GPUBufferUsage.MAP_READ // CPU can mapAsync for reading (staging pattern)
GPUBufferUsage.MAP_WRITE // CPU can mapAsync for writing (upload staging)
GPUBufferUsage.VERTEX // vertex attribute buffer for render pipeline
GPUBufferUsage.INDEX // index buffer for render pipeline
// Shared compute+render buffer (particle positions for both):
const particleBuf = device.createBuffer({
  // 32 bytes per Particle: pos and vel each take a 16-byte slot,
  // with mass stored in the last 4 bytes of the second slot
  size: N * 32,
  usage: GPUBufferUsage.STORAGE // compute can read/write
    | GPUBufferUsage.VERTEX, // render pipeline reads as vertex attrib
});
Bind Group Layout
With layout: 'auto', WebGPU infers the bind group layout
from the shader. A bind group built from an inferred layout only works
with the pipeline it came from. For production code with multiple
pipelines sharing the same resources, explicit layouts allow bind
group reuse across pipelines, which is important for reducing CPU
overhead per frame.
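An explicit layout for the three-buffer add shader from section 3 might look like the sketch below. The function name createSharedLayout is an assumption for illustration; the binding numbers and buffer types mirror that shader's two read-only inputs and one read-write output.

```javascript
// Builds one bind group layout plus a pipeline layout that several
// compute pipelines can share.
function createSharedLayout(device) {
  const bindGroupLayout = device.createBindGroupLayout({
    entries: [
      { binding: 0, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'read-only-storage' } },
      { binding: 1, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'read-only-storage' } },
      { binding: 2, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'storage' } },
    ],
  });
  const pipelineLayout = device.createPipelineLayout({
    bindGroupLayouts: [bindGroupLayout],
  });
  return { bindGroupLayout, pipelineLayout };
}
```

Each pipeline then passes pipelineLayout as its layout instead of 'auto', and a bind group created with bindGroupLayout can be set on any of them.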
6. Workgroups, Shared Memory & Synchronisation
The workgroup is the fundamental unit of collaboration in a compute
shader. Threads within a workgroup can communicate through
workgroup-shared memory, a small, fast scratchpad shared by all
invocations in the group. WebGPU guarantees at least 16 KB
(device.limits.maxComputeWorkgroupStorageSize); hardware typically
offers 32 KB or more. Access is much faster than storage buffer access
(up to roughly 100× lower latency, depending on hardware).
// Parallel reduction using shared memory (compute min of N floats)
@group(0) @binding(0) var<storage, read> data : array<f32>;
@group(0) @binding(1) var<storage, read_write> result : array<f32>;
const WS = 256u;
// "shared" is a reserved word in WGSL, so the scratch array is named sdata
var<workgroup> sdata : array<f32, WS>; // workgroup-shared memory
@compute @workgroup_size(WS)
fn reduce_min(
  @builtin(global_invocation_id) gid : vec3u,
  @builtin(local_invocation_id) lid : vec3u,
  @builtin(workgroup_id) wid : vec3u,
) {
  let i = gid.x;
  let n = arrayLength(&data);
  // Phase 1: load from storage buffer into fast shared memory
  sdata[lid.x] = select(1e38, data[i], i < n); // fill OOB with a large sentinel
  workgroupBarrier(); // wait for ALL threads to finish loading
  // Phase 2: tree reduction in shared memory
  var stride = WS / 2u;
  loop {
    if (stride == 0u) { break; }
    if (lid.x < stride) {
      sdata[lid.x] = min(sdata[lid.x], sdata[lid.x + stride]);
    }
    workgroupBarrier();
    stride /= 2u;
  }
  // Thread 0 writes one partial minimum per workgroup to the output
  if (lid.x == 0u) { result[wid.x] = sdata[0]; }
}
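Because reduce_min writes one value per workgroup, a full reduction needs repeated dispatches, each one consuming the previous pass's output until a single value remains. The host-side bookkeeping can be sketched as below; the function name reductionPlan is illustrative.

```javascript
// Returns the number of workgroups to dispatch in each pass when reducing
// n values with a shader that emits one result per workgroup.
function reductionPlan(n, workgroupSize = 256) {
  const passes = [];
  let remaining = n;
  while (remaining > 1) {
    const groups = Math.ceil(remaining / workgroupSize);
    passes.push(groups);
    remaining = groups; // each workgroup leaves one value for the next pass
  }
  return passes;
}

// One million floats with 256 threads per workgroup: three passes.
const millionPlan = reductionPlan(1_000_000); // [3907, 16, 1]
```

The pass count grows logarithmically with the input size, so even very large inputs need only a handful of dispatches.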
Barriers
workgroupBarrier()
// execution barrier + memory barrier for workgroup-shared memory
// ALL threads in the workgroup must reach this call before any continues
// makes earlier writes to var<workgroup> memory visible to the whole workgroup
storageBarrier()
// the same execution barrier, but it orders storage buffer accesses instead
// only synchronises within one workgroup, never across workgroups
// Warning: barriers must be called from uniform control flow. Placing
// workgroupBarrier() inside a conditional or loop that some threads skip
// is rejected by WGSL's uniformity analysis when the shader is compiled.
// All invocations in a workgroup MUST execute the same barriers.
Choosing Workgroup Size
The optimal workgroup size balances hardware occupancy and shared
memory usage. Common choices:
64: a safe default. It is a multiple of the NVIDIA warp (32, so two
warps per workgroup) and also works well on Intel GPUs
128 or 256: good for reductions and algorithms
that benefit from large shared memory blocks
Maximum (device.limits.maxComputeInvocationsPerWorkgroup):
256 by default, up to 1024 on desktop GPUs when a higher limit is
requested through requiredLimits
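The choice can be made at run time from the device limits. The sketch below picks the largest power of two that fits both the preferred size and the limits; the function name pickWorkgroupSize is an assumption, while maxComputeInvocationsPerWorkgroup and maxComputeWorkgroupSizeX are the limit names exposed on device.limits.

```javascript
// Largest power-of-two 1D workgroup size that does not exceed the
// preferred size or the device limits.
function pickWorkgroupSize(limits, preferred = 256) {
  const cap = Math.min(
    preferred,
    limits.maxComputeInvocationsPerWorkgroup,
    limits.maxComputeWorkgroupSizeX,
  );
  let size = 1;
  while (size * 2 <= cap) size *= 2;
  return size;
}

// With the default WebGPU limits (256 for both), asking for 1024 yields 256.
const chosen = pickWorkgroupSize(
  { maxComputeInvocationsPerWorkgroup: 256, maxComputeWorkgroupSizeX: 256 },
  1024,
);
```

One simple way to use the result is to interpolate it into the WGSL source before creating the shader module, for example `@workgroup_size(${chosen})`.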
Subgroup operations (2025): The
subgroups WebGPU feature (shipping 2025) exposes
warp-level primitives like subgroupAdd,
subgroupMin, subgroupBallot. It is optional: request 'subgroups' in
requiredFeatures and add "enable subgroups;" to the shader. These are
faster than shared-memory reductions because they use hardware shuffle
instructions within a warp, eliminating shared memory writes entirely
for the innermost reduction level.
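Since optional features are not available on every adapter, it helps to filter the wish list against adapter.features before requesting the device. The helper below is a sketch; its name supportedFeatures and the default wish list are assumptions.

```javascript
// Keeps only the optional features this adapter actually supports.
// adapter.features is a set-like object, so has() works as on a Set.
function supportedFeatures(adapter, wanted = ['subgroups', 'shader-f16']) {
  return wanted.filter((name) => adapter.features.has(name));
}

// Browser usage (not executed here):
//   const features = supportedFeatures(adapter);
//   const device = await adapter.requestDevice({ requiredFeatures: features });
//   const useSubgroups = features.includes('subgroups');
```

The shader source then needs two variants, or a shared-memory fallback path, selected by whether 'subgroups' made it into the list.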
7. Common Compute Patterns
Parallel Array Map
The simplest pattern: one invocation per element, each doing
independent work. Ideal for element-wise operations: perfectly
parallel, no communication needed.
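// dispatch: Math.ceil(N / workgroup_size) workgroups
@group(0) @binding(0) var<storage, read> data : array<f32>;
@group(0) @binding(1) var<storage, read_write> output : array<f32>;
// Placeholder for the per-element work
fn expensive_function(x : f32) -> f32 { return sqrt(abs(x)); }
@compute @workgroup_size(64)
fn map(@builtin(global_invocation_id) gid: vec3u) {
  let i = gid.x;
  if (i >= arrayLength(&data)) { return; }
  output[i] = expensive_function(data[i]);
}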
Prefix Sum (Scan)
Prefix sum is used for stream compaction and histogram
building. It is not trivially parallel: each element depends on all
previous elements. The parallel scan algorithm decomposes the
dependency into O(log N) passes:
// Parallel exclusive prefix sum: three-phase approach
// Phase 1: workgroup-local scan → partial sums in shared memory
// Phase 2: scan the partial sums (recursively) → offsets per workgroup
// Phase 3: add workgroup offset to each element in the original workgroup
// This is the core of GPU stream compaction:
// "keep only elements satisfying a predicate and write them compactly"
// Used in: particle death/birth, ray tracing active ray compaction,
// physics broadphase collision list building
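The three phases are easier to follow in a sequential CPU reference, where each block stands in for one workgroup. The sketch below is not GPU code and makes no claim about how any particular library implements scan; the names exclusiveScanBlocked and compact are illustrative, and the tiny default block size is chosen only to keep the example readable.

```javascript
// Exclusive prefix sum computed block by block, mirroring the GPU phases.
function exclusiveScanBlocked(input, blockSize = 4) {
  const n = input.length;
  const out = new Array(n).fill(0);
  const blockSums = [];

  // Phase 1: independent local scan inside each block (one workgroup each)
  for (let start = 0; start < n; start += blockSize) {
    let running = 0;
    for (let i = start; i < Math.min(start + blockSize, n); i++) {
      out[i] = running;
      running += input[i];
    }
    blockSums.push(running);
  }

  // Phase 2: scan the per-block totals (recursively) to get block offsets
  const offsets = blockSums.length > 1
    ? exclusiveScanBlocked(blockSums, blockSize)
    : [0];

  // Phase 3: add each block's offset to its elements
  for (let i = 0; i < n; i++) out[i] += offsets[Math.floor(i / blockSize)];
  return out;
}

// Stream compaction: scan the keep-flags to find each survivor's output slot.
function compact(values, keep) {
  const flags = values.map((v) => (keep(v) ? 1 : 0));
  const dest = exclusiveScanBlocked(flags);
  const result = [];
  values.forEach((v, i) => { if (flags[i]) result[dest[i]] = v; });
  return result;
}

const scanned = exclusiveScanBlocked([3, 1, 4, 1, 5, 9, 2, 6]); // [0, 3, 4, 8, 9, 14, 23, 25]
const alive = compact([5, -1, 7, -3, 2], (v) => v > 0);         // [5, 7, 2]
```

On the GPU, phase 1 and phase 3 are each a single dispatch over the data, and phase 2 recurses on an array that is blockSize times smaller at every level.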
Atomic Operations
// Atomic operations in WGSL (integer only: no atomic float in WebGPU)
@group(0) @binding(0) var<storage, read_write> counter : atomic<u32>;
@group(0) @binding(1) var<storage, read_write> histogram : array<atomic<u32>, 256>;
// In shader:
atomicAdd(&counter, 1u); // thread-safe increment
let old = atomicMin(&histogram[bin], val); // returns old value
atomicStore(&counter, 0u); // reset to zero
let v = atomicLoad(&counter); // thread-safe read
// No atomic float: use an integer encoding
// fixed point: encoded = u32(f32_value * scale) (non-negative values only)
// raw bits: encoded = bitcast<u32>(f32_value) (order-preserving for non-negative floats)
// Then integer atomicMax / atomicMin on encoded values
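When negative values are possible, the raw bit pattern no longer sorts correctly as an unsigned integer. A common workaround, shown here on the CPU side as a sketch, remaps the bits so that unsigned comparison matches float comparison. The function names are illustrative, and this is one possible encoding rather than anything WebGPU prescribes; the same bit operations can be written in WGSL with bitcast.

```javascript
const f32View = new Float32Array(1);
const u32View = new Uint32Array(f32View.buffer); // same 4 bytes, viewed as u32

// Maps a float to a u32 key whose unsigned order matches the float order:
// negative floats get all bits flipped, non-negative floats get the sign bit set.
function floatToOrderedU32(x) {
  f32View[0] = x;
  const bits = u32View[0];
  return (bits & 0x80000000) ? (~bits >>> 0) : ((bits | 0x80000000) >>> 0);
}

// Inverse mapping, used after reading the atomic result back.
function orderedU32ToFloat(key) {
  u32View[0] = (key & 0x80000000) ? (key & 0x7fffffff) : (~key >>> 0);
  return f32View[0];
}

const keys = [-2, -1, 0, 1, 2].map(floatToOrderedU32);
```

With keys encoded this way, atomicMin and atomicMax on the u32 values select the same element as a float minimum or maximum would.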
8. Use Cases: Physics, ML & Image Processing
Physics Simulation (Particle Systems)
A WebGPU particle simulation uses two storage buffers (ping-pong
between frames) and a compute shader per physics step. No CPU readback
is needed when the same buffer is bound as a vertex buffer for
rendering:
// Per-particle compute shader (simplified N-body gravity)
// Assumed bindings: particles_in (read), particles_out (read_write),
// with num_particles, G and dt supplied as constants or uniforms
@compute @workgroup_size(64)
fn simulate(@builtin(global_invocation_id) gid: vec3u) {
  let i = gid.x;
  if (i >= num_particles) { return; } // bounds guard for the last workgroup
  var p = particles_in[i];
  var force = vec3f(0.0);
  for (var j = 0u; j < num_particles; j++) {
    if (j == i) { continue; }
    let diff = particles_in[j].pos - p.pos;
    let d2 = dot(diff, diff) + 0.01; // softening
    force += diff * (G * p.mass * particles_in[j].mass / (d2 * sqrt(d2)));
  }
  p.vel += dt * force / p.mass;
  p.pos += dt * p.vel;
  particles_out[i] = p;
}
// Quoted timings (they vary with hardware and driver):
// 100 000 particles, 64 iterations: ~8ms on RTX 3080
// Same simulation on CPU (single-threaded JS): ~600ms
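The ping-pong arrangement amounts to two bind groups that swap the roles of the two buffers on alternate frames. A minimal sketch follows, assuming binding 0 is particles_in and binding 1 is particles_out (the binding numbers are not given in the article); the function name createPingPong and the shape of the returned object are illustrative.

```javascript
// Pre-builds both bind groups so no allocation happens per frame.
function createPingPong(device, pipeline, bufA, bufB) {
  const make = (src, dst) => device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: src } }, // particles_in (assumed slot)
      { binding: 1, resource: { buffer: dst } }, // particles_out (assumed slot)
    ],
  });
  const bindGroups = [make(bufA, bufB), make(bufB, bufA)];
  const buffers = [bufA, bufB];
  return {
    // Even frames read A and write B; odd frames read B and write A.
    bindGroupFor: (frame) => bindGroups[frame % 2],
    // The buffer holding this frame's results, to bind as the vertex buffer.
    outputFor: (frame) => buffers[(frame + 1) % 2],
  };
}
```

Per frame, the compute pass calls setBindGroup(0, pingPong.bindGroupFor(frame)) and the render pass draws from pingPong.outputFor(frame), so the data never leaves the GPU.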
Machine Learning Inference
Transformer and CNN inference runs directly in the browser using
WebGPU compute shaders. Libraries like TensorFlow.js and ONNX Runtime
Web use WebGPU backends that implement matrix multiply (GEMM),
convolution, and activation functions as compute shaders. A
7B-parameter quantised LLM can run at 10-20 tokens/s on mid-range
gaming GPUs via WebGPU.
Image Processing
// 2D convolution using workgroup tiles (e.g. Gaussian blur)
const TILE = 16u;
// 16×16 tile plus a 2-pixel halo on every side for the 5-tap kernel
var<workgroup> tile : array<array<vec4f, TILE + 4u>, TILE + 4u>;
@compute @workgroup_size(TILE, TILE)
fn blur(
  @builtin(global_invocation_id) gid: vec3u,
  @builtin(local_invocation_id) lid: vec3u,
) {
  // Load tile + halo (border pixels for kernel overlap) into shared memory
  // Apply separable 5×5 Gaussian kernel in shared memory
  // Write result to output texture (write-only storage texture)
}
// Advantage over fragment shader blur:
// - Tile-based access avoids redundant texture fetches across threads
// - With the halo in shared memory, each workgroup reads each of its pixels
//   from VRAM only once (only halo pixels are fetched again by neighbouring tiles)
// - No render pipeline setup: just dispatch + read result
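The kernel weights and the tile size with halo are usually computed on the host and passed to the shader. The sketch below assumes a 5-tap kernel (radius 2) and a sigma of 1.0, which the article does not specify; the function names gaussianKernel1D and tileWithHalo are illustrative.

```javascript
// Normalised 1D Gaussian weights; a separable blur applies them once
// horizontally and once vertically instead of using a full 2D kernel.
function gaussianKernel1D(radius = 2, sigma = 1.0) {
  const weights = [];
  let sum = 0;
  for (let x = -radius; x <= radius; x++) {
    const w = Math.exp(-(x * x) / (2 * sigma * sigma));
    weights.push(w);
    sum += w;
  }
  return weights.map((w) => w / sum); // weights now add up to 1
}

// Side length of the shared-memory tile: the workgroup tile plus a halo
// of `radius` pixels on each side.
function tileWithHalo(tile, radius) {
  return tile + 2 * radius;
}

const blurWeights = gaussianKernel1D(2, 1.0);
const sharedSide = tileWithHalo(16, 2); // 20, matching TILE + 4 in the shader
```

Separability is what makes the shared-memory approach pay off: five taps per axis replace twenty-five taps per pixel.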
Performance reality check: WebGPU compute is not
always faster than CPU. The GPU has very high memory bandwidth but
also high latency to start work: submitting a command buffer has roughly
0.1 to 1 ms of overhead. For workloads smaller than ~10 000 elements or with
complex data-dependent branching, the CPU may be faster. The sweet
spot for WebGPU compute is large, regular, parallelisable workloads:
particle physics, n-body, convolution, sorting, reduction.