BVH: Bounding Volume Hierarchy for Ray Tracing & Collision Detection

1. The Acceleration Problem

A path tracer shoots millions of rays per frame. Each ray must find the nearest intersection with scene geometry. Without acceleration, testing one ray against N triangles costs O(N). With N = 10 million triangles and 4 million rays per frame at 60 fps, that is 2.4 × 10¹⁵ triangle tests per second — roughly 10⁶× beyond current hardware capability.

Acceleration structures solve this by culling large parts of the scene before per-triangle tests. A BVH reduces expected ray cost from O(N) to O(log N) for uniformly distributed scenes — a 10-million-triangle scene requires ~23 bounding box tests plus a handful of triangle tests per ray, rather than 10 million.

Alternative acceleration structures: Kd-trees split space at hyperplanes — they adapt to irregular geometry but are harder to build and traverse on GPUs. Uniform grids are simple but waste memory on sparse scenes. BVH has become dominant because it adapts well to complex geometry, is efficient on GPUs (coherent memory access), and maps directly to GPU hardware RT units (RTX, RDNA3 RT, Apple Pro RT).

2. Axis-Aligned Bounding Boxes

Each node in a BVH contains an AABB (Axis-Aligned Bounding Box) — the smallest box with faces parallel to the coordinate planes that encloses all primitives in that node's subtree.

// AABB definition struct AABB { vec3 min; // corner with smallest x, y, z vec3 max; // corner with largest x, y, z }; // Build AABB from a list of triangles AABB computeAABB(Triangle* tris, int count) { AABB box = { +INF, -INF }; for each tri in tris: for each vertex v in tri: box.min = componentMin(box.min, v); box.max = componentMax(box.max, v); return box; } // Merge two AABBs AABB merge(AABB a, AABB b) { return { componentMin(a.min, b.min), componentMax(a.max, b.max) }; }

Ray-AABB intersection (slab method)

Testing a ray against an AABB is extremely cheap — 6 divisions and a few comparisons — making it the fast-reject step before expensive triangle tests:

// Ray-AABB intersection — slab method // Ray: P(t) = origin + t\cdotdirection bool rayAABB(Ray ray, AABB box, float tMin, float tMax) { for axis in {x, y, z}: float invD = 1.0 / ray.direction[axis]; float t0 = (box.min[axis] - ray.origin[axis]) * invD; float t1 = (box.max[axis] - ray.origin[axis]) * invD; if (invD < 0) swap(t0, t1); tMin = max(tMin, t0); tMax = min(tMax, t1); if (tMax < tMin) return false; // missed slab \to miss box return true; } Cost: ~12 FP operations (vs ~20-50 for a triangle intersection)

3. BVH Construction: Strategies

A BVH is a binary tree. Each internal node holds an AABB and pointers to two children. Leaf nodes hold one or more primitives. The key question is: how do we partition a set of primitives into two groups to form two child nodes?

Centroid Median Split

The simplest approach: compute the centroids of all primitives in the current node. Split along the axis with the greatest centroid spread, at the median centroid value. Each primitive goes to the child whose bounding box contains its centroid.

// Median split (simple, fast, but suboptimal quality) 1. Find AABB of all primitive centroids. 2. Choose split axis = axis with largest extent. 3. Sort primitives by centroid along split axis. 4. Split at median \to two roughly equal child sets. Build time: O(N log N) for one level (sorting dominates) Full tree: O(N log² N) Quality: adequate but not SAH-optimal Typical use: quick rebuild during animation, prototypes

Object Partitioning vs Spatial Partitioning

Object partitioning (what BVH does) assigns each primitive to exactly one node. Spatial partitioning (kd-trees) clips primitives at the split plane — a triangle spanning the boundary goes to both children. Object partitioning is simpler and avoids primitive duplication; spatial partitioning can be tighter for large triangles.

4. Surface Area Heuristic (SAH)

The Surface Area Heuristic chooses the split that minimises the expected cost of a random ray hitting the node. It uses the geometric fact that the probability of a random ray hitting a convex object is proportional to its surface area (for uniformly distributed rays in a bounding sphere).

// SAH cost model for splitting node N into children L and R SAH cost = C_traverse // traversal cost + (SA(L)/SA(N)) \times N_L \times C_intersect // left child expected cost + (SA(R)/SA(N)) \times N_R \times C_intersect // right child expected cost SA(X) = surface area of AABB of set X N_L = number of primitives in left child N_R = number of primitives in right child C_traverse \approx 1-2 (relative to C_intersect = 1) The optimal split minimises this cost over all candidate split positions. SAH search (binned approximation): Divide centroid range into K=32 bins along each axis. For each bin boundary (3 axes \times 31 splits = 93 candidate splits): Compute N_L, SA(L), N_R, SA(R) in O(N) sweep. Choose the split with minimum SAH cost. Total build cost: O(N log N) — comparable to median split Quality: 2-3\times fewer node visits per ray vs median split

Binned SAH (Wald 2007) is the standard in production renderers. It approximates the exact SAH optimum (which requires O(N log N) per level) at nearly the same quality using only K=32 bins. Embree, Optix, and virtually all production path tracers use binned SAH or variants of it.

Leaf Termination

Recursion stops when a node has ≤ N_leaf primitives (typically 1–4 triangles). Smaller leaves improve traversal quality (fewer false positives) but increase node count and memory. Setting N_leaf = 4 and using 4-wide SIMD to test all 4 triangles simultaneously is a common optimisation in CPU renderers (packet tracing).

5. Ray-BVH Traversal

Given a ray and a BVH root, find the nearest intersection. The standard algorithm uses a stack to avoid recursion (GPU-friendly):

// Iterative BVH traversal with stack Hit traverse(Ray ray, BVH bvh) { int stack[64]; int top = 0; stack[top++] = 0; // start at root node Hit bestHit = MISS; float tBest = +INF; while (top > 0) { BVHNode node = bvh.nodes[stack[--top]]; if (node.isLeaf) { for each primitive in node.leaf: Hit h = intersect(ray, primitive); if (h.t < tBest) { bestHit = h; tBest = h.t; } } else { // Test both children; push closer one LAST (it's popped first) float tL, tR; bool hitL = rayAABB(ray, bvh.nodes[node.left].aabb, 0, tBest, &tL); bool hitR = rayAABB(ray, bvh.nodes[node.right].aabb, 0, tBest, &tR); if (hitL && hitR) { // Push farther child first (so nearer is on top) if (tL < tR) { stack[top++] = node.right; stack[top++] = node.left; } else { stack[top++] = node.left; stack[top++] = node.right; } } else if (hitL) stack[top++] = node.left; else if (hitR) stack[top++] = node.right; } } return bestHit; }

The near-child-first ordering is critical for performance: by visiting the nearer child first, tBest is updated to a smaller value sooner, allowing farther child AABBs to be culled more aggressively ("early exit with tighter t").

Packet and SIMD Traversal

CPU renderers test 4 or 8 rays in parallel (a "ray packet") using SSE/AVX SIMD. All rays in a packet share the same traversal stack as long as they follow the same path through the BVH — "frustum culling" at each node using the packet's shared bounding frustum. Divergent packets fall back to individual traversal. This gives 2–4× speedup on coherent rays (primary rays from a camera) but less benefit on scattered secondary rays.

6. GPU BVH: LBVH with Morton Codes

CPU SAH builds (top-down, sequential) are too slow for dynamic scenes that need rebuild every frame. Linear BVH (LBVH) builds the tree bottom-up from a Morton code sort in ~5–20 ms on a GPU — fast enough for real-time skeletal animation or particle systems.

// LBVH construction overview Step 1: Compute Morton code for each primitive centroid Normalise centroid to [0,1]³ Interleave bits of x, y, z coordinates \to 30-bit or 63-bit integer Morton code Z-orders primitives in 3D space z_order(0.5, 0.5, 0.5) = 0b 101010101010... (interleaved bits) Step 2: Sort primitives by Morton code [O(N) radix sort on GPU] Step 3: Build binary radix tree bottom-up Each internal node splits at the highest differing bit in its range. Parallel construction: each thread handles one internal node. O(N) nodes built in O(log N) parallel passes. Step 4: Refit AABBs bottom-up [one atomic min/max per leaf parent] Total GPU time: ~5-20 ms for 1M triangles (vs 200-500 ms for CPU SAH)

LBVH quality is lower than SAH — Morton-ordered trees can have poor splits when primitives cluster — but is 20–50× faster to build. HLBVH (Hierarchical LBVH) uses SAH splitting for the top levels of the tree and LBVH for subtrees, combining build speed with good traversal quality.

Karras 2012 algorithm: The fully parallel LBVH construction algorithm (Karras, NVidia, "Maximising Parallelism in the Construction of BVHs, Octrees, and k-d Trees") is the basis of Optix's GPU BVH builder. Each of the N−1 internal nodes is assigned to one GPU thread that determines its split point from the Morton code bit differences — no serial dependencies between nodes.

7. TLAS / BLAS in Modern RT APIs

DirectX Raytracing (DXR), Vulkan Ray Tracing, and Metal RT use a two-level BVH structure that separates geometric detail from instance transforms:

Two-Level Acceleration Structure: BLAS (Bottom-Level Acceleration Structure): - One per unique mesh/geometry - Contains all triangles of the mesh in local object space - Built once (static) or periodically (animated) - Driver stores in GPU-optimal format (opaque to renderer) TLAS (Top-Level Acceleration Structure): - Single tree for the entire scene - Each leaf references one BLAS + a 4\times3 world transform matrix - Rebuilt every frame when instances move (fast, only N_instances nodes) - Ray traced against TLAS first; on hit, transforms ray to object space, continues traversal into referenced BLAS Benefit: 1000 identical trees \to 1 BLAS, 1000 TLAS leaves Without instancing: 1000 \times N_triangles in one flat BVH Instancing saves 1000\times BLAS memory and rebuild cost

In DXR, BuildRaytracingAccelerationStructure() takes a D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_INPUTS descriptor specifying geometry lists (for BLAS) or instance descriptors (for TLAS). The driver handles all internal BVH format details, Morton sorting, and SAH — the application only provides geometry and instances.

Update vs Rebuild

A BLAS can be updated (refit without structural change) much faster than rebuilt from scratch. Update is correct when vertices move slightly (animation), but quality degrades as motion increases. The heuristic: use update if vertex displacement is < 10% of primitive size; otherwise rebuild. RTX 4090 refit: ~0.5 ms for 1M triangles; rebuild: ~5 ms.

8. BVH for Collision Detection

Game engines use AABB trees (a variant of BVH) for broad-phase collision detection — quickly finding pairs of objects that might be colliding before running expensive narrow-phase geometry tests (GJK, EPA).

// Broad-phase BVH collision detection // Query: find all objects whose AABBs overlap query AABB void queryOverlap(BVHNode node, AABB query, List<int>& results) { if (!aabbOverlap(node.aabb, query)) return; // prune subtree if (node.isLeaf) { results.push(node.objectID); return; } queryOverlap(node.left, query, results); queryOverlap(node.right, query, results); } // Complexity: O(log N + K) where K = number of overlapping objects // For N=10000 objects with typical scene density: K\approx4 on average // vs brute force: O(N²/2) = 50M tests

Dynamic BVH (DBVH)

For game physics, objects move every frame. Rebuilding the full BVH each frame is wasteful — instead, a dynamic BVH uses fat AABBs (extended by a "margin" in each direction) and lazy refitting:

Each moving object expands its AABB by a velocity-proportional margin: fat_aabb = aabb.expand(v × Δt × 2)
The tree is only restructured for an object when its AABB grows outside its stored fat AABB.
Restructuring uses tree rotations (similar to AVL rotations) to maintain balance.

Bullet Physics and Box2D use a dynamic AABB tree (incrementally maintained) for broad-phase. Update cost per moved object is O(log N) amortised, much better than full rebuilds.

Bounding Volume Choices Beyond AABB

Spheres: Simplest intersection test, but poor fit for thin triangles. Used in early game engines and for dynamic objects.
OBB (Oriented Bounding Box): Tightest fit, 16 separating axis tests for OBB-OBB. Better quality but 5–10× slower intersection test.
k-DOP (Discrete Oriented Polytope): Convex polytope with k/2 pairs of parallel faces at fixed orientations. 18-DOP (k=18) gives excellent fit for most models with only modest extra test cost. Used in Unreal Engine's collision system.

BVH in production: Embree (Intel, open source) is the CPU BVH reference — used in Blender Cycles, Arnold, and Corona. Optix (NVIDIA) provides GPU BVH built on top of RTX hardware. PBRT-v4 includes a clean educational BVH implementation. For WebGL path tracing, the entire BVH must be serialised into textures and traversed in GLSL — a flat array representation with index-based navigation rather than pointers.