Instanced Rendering & LOD: Draw Millions of Objects at 60 fps
A modern game or simulation should render hundreds of thousands of trees, grass blades, particles, or simulation objects without dropping below 60 fps. Two GPU techniques make this possible: instanced rendering collapses many draw calls into one, and level-of-detail (LOD) substitutes distant objects with cheaper approximations. This article explains both at the GPU level and shows how to implement them in WebGL 2 and WebGPU.
1. The Draw-Call Bottleneck
Every call to gl.drawElements() (or its equivalents)
incurs CPU overhead: the driver validates state, encodes
commands into a command buffer, and may have to synchronise
resources before the work is submitted to the GPU queue. On a
modern desktop, the practical budget is typically 5,000–50,000
draw calls per frame before the CPU saturates, regardless of how
fast the GPU itself is.
For a forest scene with 200,000 trees, issuing one draw call per tree is impractical: it would need several times the upper end of that budget every frame. The solutions are:
- Batching: merge mesh geometry into a few large vertex buffers, grouped by material. Reduces N calls to ~N/batch_size calls.
- Instanced rendering: issue one draw call with a repeat count; the GPU fetches per-instance data (transform, colour) automatically.
- Indirect draw: store draw arguments in a GPU buffer; the CPU never reads back the count. Enables fully GPU-driven culling and LOD selection.
2. Instanced Rendering: gl_InstanceID
In WebGL 2 (which exposes OpenGL ES 3.0), instanced rendering is invoked with:
// WebGL 2 — draw 100,000 trees with one call
gl.drawElementsInstanced(
gl.TRIANGLES, // primitive type
indexCount, // indices in base mesh
gl.UNSIGNED_INT, // index type
0, // byte offset
100_000 // instanceCount
);
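Inside the vertex shader, the built-in variable
gl_InstanceID (GLSL ES 3.0) contains the index of the
current instance (0 … instanceCount−1). It can be used to fetch
per-instance data from a uniform array or a texture. The shader below
takes the other common route and reads the transform from an instanced
vertex attribute, so it does not need gl_InstanceID at all: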
#version 300 es
// Vertex shader (GLSL ES 3.0); the #version directive has to be the very first line
in vec3 a_position; // base mesh vertex
in vec3 a_normal;
// Per-instance data (divisor = 1)
in mat4 a_instanceMatrix; // 4 consecutive vec4 attributes
uniform mat4 u_viewProj;
out vec3 v_normal;
void main() {
mat3 normalMat = transpose(inverse(mat3(a_instanceMatrix)));
v_normal = normalize(normalMat * a_normal);
gl_Position = u_viewProj * a_instanceMatrix * vec4(a_position, 1.0);
}
The per-instance matrix occupies 4 attribute slots. Setting the attribute divisor to 1 tells the GPU to advance the instance data pointer once per instance rather than once per vertex:
// Bind per-instance transform buffer
gl.bindBuffer(gl.ARRAY_BUFFER, instanceMatrixBuffer);
const bytesPerMatrix = 16 * 4;
for (let i = 0; i < 4; i++) {
const attribLoc = matrixAttribLocation + i;
gl.enableVertexAttribArray(attribLoc);
gl.vertexAttribPointer(attribLoc, 4, gl.FLOAT, false,
bytesPerMatrix, i * 16); // each column = 4 floats × 4 bytes
gl.vertexAttribDivisor(attribLoc, 1); // advance once per INSTANCE
}
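As a rough sketch of how the buffer bound above might be filled, the helper below writes one column-major 4×4 matrix per instance into a Float32Array. The name buildInstanceMatrices and the { x, y, z, yaw, scale } instance shape are assumptions made for illustration; they are not part of WebGL.

```javascript
// Hypothetical helper: one column-major mat4 per instance
// (rotation about Y by `yaw`, uniform `scale`, then translation).
function buildInstanceMatrices(instances) {
  const out = new Float32Array(instances.length * 16);
  instances.forEach(({ x, y, z, yaw = 0, scale = 1 }, i) => {
    const c = Math.cos(yaw) * scale;
    const s = Math.sin(yaw) * scale;
    out.set([
      c, 0, -s, 0,     // column 0: rotated, scaled X axis
      0, scale, 0, 0,  // column 1: scaled Y axis
      s, 0, c, 0,      // column 2: rotated, scaled Z axis
      x, y, z, 1,      // column 3: translation
    ], i * 16);
  });
  return out;
}

// Usage (needs a WebGL 2 context `gl` and an existing buffer object):
// gl.bindBuffer(gl.ARRAY_BUFFER, instanceMatrixBuffer);
// gl.bufferData(gl.ARRAY_BUFFER, buildInstanceMatrices(trees), gl.STATIC_DRAW);
```

Each group of four floats is one matrix column, which is why the attribute loop reads column i at byte offset i * 16.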
3. Per-Instance Data Layout
Each instance needs at least a 4×4 transform matrix (64 bytes). Commonly added per-instance data:
- Transform matrix — 64 bytes (4 × vec4)
- Colour / tint — 16 bytes (vec4, RGBA)
- LOD blend factor — 4 bytes (float)
- Animation frame / time offset — 4 bytes (float)
- Custom flags — 4 bytes (uint, bit-packed)
Total: 92 bytes, usually padded to 96 bytes per instance so that every record stays 16-byte aligned. For 1,000,000 instances that is 96 MB, comfortably within modern GPU VRAM. Upload once; update only dirty instances per frame.
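One possible way to lay out such a record is sketched below, with the fields listed above written into a single interleaved ArrayBuffer. The offsets, the 4 bytes of trailing padding, and the names INSTANCE_STRIDE, OFFSET and writeInstance are assumptions for illustration rather than a fixed format.

```javascript
// Assumed layout: 92 bytes of data + 4 bytes of padding = 96-byte stride.
const INSTANCE_STRIDE = 96;
const OFFSET = { matrix: 0, colour: 64, lodBlend: 80, timeOffset: 84, flags: 88 };

// Write one instance record into `buffer` (an ArrayBuffer) at slot `index`.
function writeInstance(buffer, index, inst) {
  const base = index * INSTANCE_STRIDE;
  new Float32Array(buffer, base + OFFSET.matrix, 16).set(inst.matrix);
  new Float32Array(buffer, base + OFFSET.colour, 4).set(inst.colour);
  const view = new DataView(buffer, base, INSTANCE_STRIDE);
  view.setFloat32(OFFSET.lodBlend, inst.lodBlend, true);     // little-endian
  view.setFloat32(OFFSET.timeOffset, inst.timeOffset, true);
  view.setUint32(OFFSET.flags, inst.flags, true);            // bit-packed flags
}

// Updating a single dirty instance on the GPU (WebGL 2, `gl` assumed):
// gl.bufferSubData(gl.ARRAY_BUFFER, index * INSTANCE_STRIDE,
//                  new Uint8Array(buffer, index * INSTANCE_STRIDE, INSTANCE_STRIDE));
```

Keeping the stride a multiple of 16 bytes means the same buffer could later be read as a storage buffer without repacking.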
Packing into a Texture
An alternative to attribute buffers is a
data texture: pack matrices as RGBA32F pixels, look
them up in the vertex shader using texelFetch() with the
instance ID:
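// Pack: N matrices → RGBA32F texture, 4 texels (one per column) per matrix.
// Choose a width W that is a multiple of 4; height = ceil(4N / W).
// Look up in vertex shader:
int texWidth = textureSize(u_instanceTex, 0).x;
int baseTexel = gl_InstanceID * 4;
ivec2 texel = ivec2(baseTexel % texWidth, baseTexel / texWidth); // wrap onto the next row
mat4 transform = mat4(
texelFetch(u_instanceTex, texel + ivec2(0, 0), 0),
texelFetch(u_instanceTex, texel + ivec2(1, 0), 0),
texelFetch(u_instanceTex, texel + ivec2(2, 0), 0),
texelFetch(u_instanceTex, texel + ivec2(3, 0), 0)
);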
This is particularly efficient when the instance data is already produced on the GPU, for example by a WebGPU compute shader writing to a storage texture, because it avoids the CPU→GPU transfer entirely.
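A data texture is limited by MAX_TEXTURE_SIZE in each dimension, so large instance counts have to wrap across several rows. The sketch below shows one way to pick the texture dimensions and to find the first texel of a given instance; instanceTextureLayout, texelOf and the default limit of 4096 are assumptions for illustration.

```javascript
// Choose texture dimensions for `instanceCount` matrices, 4 RGBA32F texels each.
// `maxTextureSize` should come from gl.getParameter(gl.MAX_TEXTURE_SIZE).
function instanceTextureLayout(instanceCount, maxTextureSize = 4096) {
  const texelsPerMatrix = 4;
  const matricesPerRow = Math.floor(maxTextureSize / texelsPerMatrix);
  const cols = Math.max(1, Math.min(instanceCount, matricesPerRow));
  return {
    width: cols * texelsPerMatrix,                        // always a multiple of 4
    height: Math.max(1, Math.ceil(instanceCount / cols)),
  };
}

// First texel of an instance's matrix; the other three columns follow in the same row.
function texelOf(instanceId, width) {
  const base = instanceId * 4;
  return { x: base % width, y: Math.floor(base / width) };
}

// Upload (WebGL 2, `gl` assumed; `data` holds width * height * 4 floats):
// gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA32F, width, height, 0, gl.RGBA, gl.FLOAT, data);
```

Because the width is a multiple of 4, a matrix never straddles two rows, which keeps the shader-side address arithmetic to one modulo and one division.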
4. Frustum & Occlusion Culling
Frustum Culling
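Even with instancing, drawing 1,000,000 invisible trees wastes vertex shader cycles. Frustum culling rejects instances whose bounding sphere lies entirely outside the view frustum. The frustum is the intersection of 6 half-spaces; with each plane stored as an inward-pointing unit normal n and offset d, a sphere with centre c and radius r is rejected as soon as n · c + d < −r for any one plane.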
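For CPU-side culling, iterate over all instances and append the matrices of the visible ones to a compact buffer, then draw only N_visible instances. CPU culling is simple, but its cost grows linearly with the instance count and the compacted buffer has to be re-uploaded every frame, which limits scalability.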
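A minimal CPU-side sketch of the sphere test and the compaction step follows. It assumes planes are given as [nx, ny, nz, d] with unit normals pointing into the frustum, and that spheres and matrices live in flat Float32Arrays; the function names are illustrative.

```javascript
// planes: six [nx, ny, nz, d] entries, unit normals pointing INTO the frustum.
function sphereInFrustum(planes, cx, cy, cz, radius) {
  for (const [nx, ny, nz, d] of planes) {
    // Signed distance of the centre; fully outside if farther than `radius` behind a plane.
    if (nx * cx + ny * cy + nz * cz + d < -radius) return false;
  }
  return true; // inside or intersecting (conservative)
}

// spheres: [cx, cy, cz, r] per instance; matrices: 16 floats per instance.
// Copies visible matrices to the front of `outMatrices` and returns the visible count.
function cullAndCompact(planes, spheres, matrices, outMatrices) {
  let visible = 0;
  const count = spheres.length / 4;
  for (let i = 0; i < count; i++) {
    const s = i * 4;
    if (!sphereInFrustum(planes, spheres[s], spheres[s + 1], spheres[s + 2], spheres[s + 3])) continue;
    outMatrices.set(matrices.subarray(i * 16, i * 16 + 16), visible * 16);
    visible++;
  }
  return visible;
}
```

The returned count would then be passed as instanceCount to gl.drawElementsInstanced() after uploading the first visible * 16 floats of the output array.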
GPU-Driven Culling
The modern approach moves culling entirely to a compute shader:
- Compute shader reads all instance bounding spheres from a storage buffer.
- Per-instance thread: test against frustum planes; if visible, atomically increment a counter and write to a compact output buffer (stream compaction).
- Write the count to an indirect draw argument buffer.
- Call drawIndirect(): the GPU draws exactly the culled instance count without the CPU knowing the number.
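This pipeline keeps data entirely on the GPU and eliminates the CPU
readback bottleneck. In WebGPU this is straightforward. WebGL 2 has
neither compute shaders nor indirect draws: the
WEBGL_draw_instanced_base_vertex_base_instance extension
only adds base-vertex and base-instance offsets to instanced draws, so
the instance count still has to come from the CPU.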
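The cull pass described in the list above could look roughly like the following WebGPU sketch. The WGSL source is held in a JavaScript string; the binding numbers, struct names and the workgroup size of 64 are assumptions for illustration, and the shader has not been tuned or validated against any particular engine.

```javascript
// Hypothetical WGSL cull pass: one thread per instance, stream compaction via atomicAdd.
const cullShaderWGSL = `
struct DrawArgs {
  vertexCount   : u32,
  instanceCount : atomic<u32>,  // incremented once per visible instance
  firstVertex   : u32,
  firstInstance : u32,
}
struct CullParams {
  planes        : array<vec4f, 6>,  // xyz = inward unit normal, w = offset d
  instanceCount : u32,
}
@group(0) @binding(0) var<uniform> cullParams : CullParams;
@group(0) @binding(1) var<storage, read> spheres : array<vec4f>;         // xyz centre, w radius
@group(0) @binding(2) var<storage, read_write> drawArgs : DrawArgs;
@group(0) @binding(3) var<storage, read_write> visibleIds : array<u32>;  // compacted indices

@compute @workgroup_size(64)
fn cullMain(@builtin(global_invocation_id) gid : vec3u) {
  let i = gid.x;
  if (i >= cullParams.instanceCount) { return; }
  let s = spheres[i];
  for (var p = 0u; p < 6u; p++) {
    let pl = cullParams.planes[p];
    if (dot(pl.xyz, s.xyz) + pl.w < -s.w) { return; }  // fully outside this plane
  }
  let slot = atomicAdd(&drawArgs.instanceCount, 1u);
  visibleIds[slot] = i;
}`;

// Number of workgroups needed to cover every instance.
function workgroupCount(instanceCount, workgroupSize = 64) {
  return Math.ceil(instanceCount / workgroupSize);
}

// Usage (inside a WebGPU frame; `device` and `computePass` assumed to exist):
// const module = device.createShaderModule({ code: cullShaderWGSL });
// computePass.dispatchWorkgroups(workgroupCount(instanceCount));
```

In a setup like this, instanceCount in the argument buffer has to be reset to zero before each cull pass, and the order of visibleIds is not deterministic because threads claim slots in whatever order they finish.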
Occlusion Culling
Occlusion culling additionally rejects objects hidden behind nearer geometry. Hierarchical Z-buffer (Hi-Z) occlusion culling builds a mip-pyramid of the depth buffer; each instance's bounding box is tested at the mip level where its screen-space footprint covers only a few texels. If the nearest depth of the bounding box is farther than the max-depth stored in every texel it covers, the object is occluded. This is the technique used by Frostbite, Killzone, and other AAA engines and titles to cut draw calls in massive scenes.
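The idea can be illustrated on the CPU with a small sketch, assuming a square power-of-two depth buffer in which larger values are farther away. The names buildHiZ and isOccluded are illustrative; a real implementation would run both steps on the GPU.

```javascript
// Build a max-depth pyramid: each coarser texel stores the farthest depth of its 2×2 children.
function buildHiZ(depth, size) {
  const levels = [{ size, data: depth }];
  while (size > 1) {
    const prev = levels[levels.length - 1];
    const half = size >> 1;
    const data = new Float32Array(half * half);
    for (let y = 0; y < half; y++) {
      for (let x = 0; x < half; x++) {
        const i = 2 * y * size + 2 * x;
        data[y * half + x] = Math.max(
          prev.data[i], prev.data[i + 1], prev.data[i + size], prev.data[i + size + 1]);
      }
    }
    levels.push({ size: half, data });
    size = half;
  }
  return levels;
}

// rect: inclusive screen bounds in level-0 pixels; nearestDepth: closest depth of the box.
function isOccluded(levels, rect, nearestDepth) {
  const extent = Math.max(rect.maxX - rect.minX, rect.maxY - rect.minY) + 1;
  // Pick the level where the rectangle spans at most 2×2 texels.
  const level = Math.min(levels.length - 1, Math.ceil(Math.log2(extent)));
  const { size, data } = levels[level];
  let farthest = 0;
  for (let y = rect.minY >> level; y <= rect.maxY >> level; y++) {
    for (let x = rect.minX >> level; x <= rect.maxX >> level; x++) {
      farthest = Math.max(farthest, data[y * size + x]);
    }
  }
  return nearestDepth > farthest; // everything already drawn there is closer
}
```

The test is conservative: a single far texel under the footprint is enough to keep the object, so it can produce false "visible" results but not false "occluded" ones.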
5. Discrete LOD & Screen-Space Error
A level-of-detail (LOD) system stores multiple pre-simplified versions of a mesh and selects among them based on the object's distance or, better, its projected screen size (written screenSize in the rest of this article).
Screen-space size is the correct metric: a small helicopter at 10 m and a large skyscraper at 1 km may project to the same screen diameter and deserve the same LOD. A raw distance threshold ignores object size, so small objects switch too late and large ones switch too early.
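One common way to compute such a metric is sketched below. It assumes screenSize is the projected bounding-sphere diameter expressed as a fraction of the viewport height under a perspective projection; that definition, the threshold values and the function names are assumptions for illustration.

```javascript
// Assumed metric: projected sphere diameter / viewport height.
// radius and distance share the same world units; fovY is the vertical field of view in radians.
function screenSize(radius, distance, fovY) {
  return radius / (distance * Math.tan(fovY / 2));
}

// thresholds are in descending order; returns 0 for the most detailed LOD.
// An index equal to thresholds.length means "smaller than every threshold".
function selectLOD(size, thresholds) {
  for (let lod = 0; lod < thresholds.length; lod++) {
    if (size >= thresholds[lod]) return lod;
  }
  return thresholds.length;
}

// Hypothetical thresholds for LOD0..LOD2; anything smaller falls through to index 3.
const LOD_THRESHOLDS = [0.25, 0.1, 0.03];
```

With this definition a 2 m helicopter at 10 m and a 200 m skyscraper at 1 km have the same radius-to-distance ratio and therefore land in the same LOD, which is the behaviour argued for above.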
Nanite-Style Virtual Geometry
Unreal Engine 5's Nanite takes LOD to its limit: every triangle cluster in the scene has a precomputed screen-space error bound, and the runtime selects the coarsest level whose error stays below 1 pixel. No hand-authored LOD levels; the entire mesh hierarchy is built offline using a DAG (Directed Acyclic Graph) of progressively simplified cluster groups. The GPU culls and selects at cluster granularity in a compute shader, then rasterises with a custom software rasteriser for small (sub-pixel) triangles. This is far beyond what WebGL can do today, but WebGPU's compute pipelines are the necessary first step.
LOD with Instancing
Combining LOD and instancing requires sorting instances into per-LOD buckets and issuing one instanced draw call per LOD per material. With 4 LOD levels and 3 materials, that's 12 draw calls — still far fewer than one call per object. GPU-driven approaches can do this sort in a compute shader using parallel prefix-sum.
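The bucketing step amounts to a counting sort. A CPU sketch under the assumption that a LOD index has already been chosen for every instance is shown below; bucketByLOD is an illustrative name, and a GPU version would replace the loops with a parallel prefix sum.

```javascript
// lodOfInstance: Uint8Array holding the chosen LOD index of each instance.
function bucketByLOD(lodOfInstance, lodCount) {
  // 1. Histogram: how many instances fall into each LOD.
  const counts = new Uint32Array(lodCount);
  for (const lod of lodOfInstance) counts[lod]++;

  // 2. Exclusive prefix sum: where each LOD's run starts in the sorted list.
  const offsets = new Uint32Array(lodCount);
  for (let l = 1; l < lodCount; l++) offsets[l] = offsets[l - 1] + counts[l - 1];

  // 3. Scatter instance indices into their LOD's run.
  const cursor = offsets.slice();
  const sorted = new Uint32Array(lodOfInstance.length);
  lodOfInstance.forEach((lod, i) => { sorted[cursor[lod]++] = i; });

  return { counts, offsets, sorted };
}
```

Each LOD then needs one instanced draw with instanceCount = counts[l], reading its instance indices starting at offsets[l].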
6. CLOD & Geomorphing
Discrete LOD causes a visible popping artefact when an object switches between levels — a sudden change in vertex count and position. Two techniques eliminate this:
Alpha LOD (Dithered Crossfade)
Render both the current and next LOD level simultaneously, crossfading with a screen-space dithered alpha mask. This spreads the transition over a screenSize range [d₁, d₂]. Unity's LOD Group component offers this as its Cross Fade mode; the Three.js LOD object switches levels discretely, so the dither has to be implemented in the material. Crossfading doubles the draw calls in the crossfade band but avoids hard popping.
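The dither mask can be illustrated with a 4×4 ordered (Bayer) threshold matrix. The sketch below emulates on the CPU what a fragment shader would do with discard; the matrix choice, the function names and the way the fade factor is derived from the screenSize range are assumptions for illustration.

```javascript
// Standard 4×4 Bayer matrix; values 0..15 become thresholds in (0, 1).
const BAYER_4X4 = [
  [0, 8, 2, 10],
  [12, 4, 14, 6],
  [3, 11, 1, 9],
  [15, 7, 13, 5],
];

// 0 at fadeStart (only the current LOD), 1 at fadeEnd (only the next LOD); fadeStart > fadeEnd.
function crossfadeFactor(size, fadeStart, fadeEnd) {
  const t = (fadeStart - size) / (fadeStart - fadeEnd);
  return Math.min(1, Math.max(0, t));
}

// The next (coarser) LOD keeps the pixels whose threshold lies below the fade factor...
function drawsNextLOD(px, py, fade) {
  const threshold = (BAYER_4X4[py & 3][px & 3] + 0.5) / 16;
  return threshold < fade;
}

// ...and the current LOD keeps exactly the complementary set.
function drawsCurrentLOD(px, py, fade) {
  return !drawsNextLOD(px, py, fade);
}
```

Because the two masks are complementary, every pixel is shaded by exactly one of the two LODs, so the crossfade needs no alpha blending or sorting.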
Geomorphing (CLOD)
Continuous LOD (CLOD) geomorphs vertex positions toward their target LOD configuration over time. Each LOD mesh records, for each vertex, where that vertex maps in the next simpler LOD (or that it simply disappears). The vertex shader linearly interpolates each vertex from its current position toward that target with a morph factor t in [0, 1], i.e. position = mix(current, target, t).
This requires storing two position sets per vertex (doubling the position data) and a mesh simplifier that records the collapse map. The Progressive Mesh algorithm by Hoppe (1996) is the canonical way to generate the per-vertex morph target data.
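As a small CPU illustration of the interpolation, the sketch below morphs a flat position array toward its per-vertex targets. The names geomorph and morphFactor, and the choice to drive the factor from a screenSize range, are assumptions; in practice the same lerp would run per vertex in the vertex shader.

```javascript
// 0 while the object is larger than morphStart, 1 once it has shrunk to morphEnd.
function morphFactor(size, morphStart, morphEnd) {
  const t = (morphStart - size) / (morphStart - morphEnd);
  return Math.min(1, Math.max(0, t));
}

// positions / targets: flat [x, y, z, ...] arrays of equal length.
// targets[i] is where the vertex ends up in the next simpler LOD.
function geomorph(positions, targets, t) {
  const out = new Float32Array(positions.length);
  for (let i = 0; i < positions.length; i++) {
    out[i] = positions[i] + (targets[i] - positions[i]) * t;
  }
  return out;
}

// GLSL equivalent, per vertex: vec3 p = mix(a_position, a_morphTarget, u_morph);
```

At t = 1 every vertex sits exactly where the simpler mesh places it, so swapping to that mesh at this point causes no visible pop.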
Terrain CLOD: ROAM and Geoclipmap
Terrain LOD is a special case: adaptive tessellation based on the view frustum and surface curvature. ROAM (Duchaineau et al., 1997) refines a binary triangle tree, splitting and merging triangles each frame according to screen-space error. The Geoclipmap technique (Losasso & Hoppe, SIGGRAPH 2004) uses nested, axis-aligned rings of regular grid geometry centred on the camera, with each outer ring at half the resolution of the ring inside it. Ring transitions are geomorphed to avoid seams. Nested-grid schemes of this kind are used for terrain in many open-world game engines.
7. Impostors & Billboard Sprites
When objects become very small on screen (screenSize < 0.005), even a 100-triangle mesh wastes vertex-shader cycles. The cheapest representation is an impostor: a single quad (2 triangles) textured with a pre-rendered image of the object.
Camera-Facing Billboards
A billboard quad always faces the camera. There are three variants:
- Spherical billboard: rotated to face the camera on all axes. Used for particle effects, lens flares, distant trees viewed from above.
- Cylindrical billboard: rotates only about the world Y axis, turning in the XZ plane to face the camera. Used for trees and characters (preserves upright orientation).
- Fixed-orientation billboard: always axis-aligned (e.g., health bars above characters). Implemented in screen space.
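The first two variants differ only in whether the camera direction is flattened onto the ground plane before the basis is built. A sketch under the assumptions of a right-handed, Y-up world and array-based vectors follows; billboardBasis is an illustrative name.

```javascript
const sub = (a, b) => [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
const cross = (a, b) => [
  a[1] * b[2] - a[2] * b[1],
  a[2] * b[0] - a[0] * b[2],
  a[0] * b[1] - a[1] * b[0],
];
const normalize = (v) => {
  const len = Math.hypot(v[0], v[1], v[2]);
  return [v[0] / len, v[1] / len, v[2] / len];
};

// Returns the quad's right/up/forward axes in world space.
// cylindrical = true keeps the quad upright (rotation about world Y only).
function billboardBasis(objectPos, cameraPos, cylindrical) {
  const toCamera = sub(cameraPos, objectPos);
  if (cylindrical) toCamera[1] = 0;          // ignore the camera's height
  const forward = normalize(toCamera);       // quad normal, pointing at the camera
  const right = normalize(cross([0, 1, 0], forward));
  const up = cross(forward, right);
  return { right, up, forward };
}
```

Both variants degenerate when the camera is directly above the object (forward parallel to world up, or zero after flattening), so a real implementation would need a fallback axis for that case.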
Multi-View Impostors (Octahedral Impostors)
Pre-render the object from a hemisphere of viewpoints (typically 8×8 = 64 directions) and pack the images into a texture atlas. At runtime, pick the two nearest pre-rendered views based on the current camera direction and blend between them. The result is convincing from all angles at a cost of one texture sample per blended view. This technique (popularised by Amplify Impostors for Unity and by UE4 foliage workflows) renders complex tree canopies faithfully with a single two-triangle quad.
Signed-Distance-Field (SDF) Impostors
Instead of storing colour, store a signed distance field in the impostor texture. In the fragment shader, use the SDF for pixel-accurate alpha blending and normal reconstruction — giving smooth silhouettes that do not pixelate when zoomed. Used for fonts, particles, and small foliage at intermediate distances.
8. WebGPU: Indirect & Multi-Indirect Draw
WebGPU (available in Chrome 113+) exposes indirect draws through drawIndirect() and drawIndexedIndirect(), enabling GPU-driven rendering pipelines that are not possible in WebGL.
Indirect Draw in WebGPU
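// CPU: create indirect draw argument buffer
// Layout: [ vertexCount, instanceCount, firstVertex, firstInstance ]
const indirectBuffer = device.createBuffer({
size: 4 * 4, // 4 uint32 values
usage: GPUBufferUsage.INDIRECT | GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
// Seed the static arguments; instanceCount starts at 0 and is filled in by the cull pass.
// vertexCount (vertices in the base mesh) is assumed to be defined by the caller.
device.queue.writeBuffer(indirectBuffer, 0, new Uint32Array([vertexCount, 0, 0, 0]));
// Compute shader (WGSL) accumulates instanceCount during culling;
// the instanceCount member must be declared as atomic<u32>:
// @group(0) @binding(2) var<storage, read_write> indirect: DrawIndirectArgs;
// let slot = atomicAdd(&indirect.instanceCount, 1u); // once per visible instance
// Render pass: no CPU knowledge of count needed
passEncoder.drawIndirect(indirectBuffer, 0);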
Multi-Draw Indirect
WebGPU's drawIndirect() issues one draw. For multiple LOD
buckets or mesh clusters, multi-draw indirect issues an array of draw
commands from a GPU buffer in a single API call. It is not part of
core WebGPU; it is exposed through an experimental
multi-draw-indirect feature, available in Chrome Canary
behind flags. Native engines rely on the equivalent call to schedule
Nanite-style virtual geometry on modern hardware.
Complete GPU-Driven Pipeline
- Upload: all instance transforms, bounding spheres, LOD thresholds to storage buffers (once or on dirty-only update).
- Cull compute pass: frustum + Hi-Z occlusion cull; write visible instances per LOD to compact buffers; write per-LOD counts to indirect args buffer.
- Sort pass: optional GPU radix sort by material ID to improve cache coherence.
- Render pass: drawIndirect() per LOD chunk, for a total of ~10–20 draw calls regardless of scene complexity.
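For indexed meshes, the per-LOD argument buffer used in the last step could be built along the lines of the sketch below. It assumes all LOD meshes share one index buffer and one vertex buffer; the lodMeshes field names and the helper names are illustrative, while the five-value layout is the one drawIndexedIndirect() reads.

```javascript
// drawIndexedIndirect() reads 5 × 32-bit values per draw:
// [ indexCount, instanceCount, firstIndex, baseVertex, firstInstance ]
const INDEXED_ARGS_U32 = 5;
const INDEXED_ARGS_BYTES = INDEXED_ARGS_U32 * 4; // 20 bytes per draw

// lodMeshes: [{ indexCount, firstIndex, baseVertex }, ...], one entry per LOD (hypothetical shape).
function buildIndirectArgs(lodMeshes) {
  const args = new Uint32Array(lodMeshes.length * INDEXED_ARGS_U32);
  lodMeshes.forEach((mesh, i) => {
    // instanceCount = 0: the cull pass fills it in. firstInstance = 0: non-zero values
    // require the optional "indirect-first-instance" device feature.
    args.set([mesh.indexCount, 0, mesh.firstIndex, mesh.baseVertex, 0], i * INDEXED_ARGS_U32);
  });
  return args;
}

// Byte offset of LOD `i` inside the indirect buffer.
function indirectOffset(lodIndex) {
  return lodIndex * INDEXED_ARGS_BYTES;
}

// Render pass (WebGPU objects assumed to exist):
// for (let i = 0; i < lodMeshes.length; i++) {
//   pass.drawIndexedIndirect(indirectBuffer, indirectOffset(i));
// }
```

The CPU issues the same fixed number of draws every frame; draws whose instanceCount was left at zero by the cull pass simply render nothing.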