Case Study 33-2: Profiling a Real Program — Finding the 80% Hotspot
The Problem Nobody Wants to Solve
A ray tracer has been in production for two years. It is "fast enough" — until it is not. A new scene complexity requirement has made the render time go from 30 seconds to 8 minutes. The developer has a vague feeling that the AABB (axis-aligned bounding box) intersection test is "probably the problem," but has not verified this. This case study demonstrates the complete profiling workflow: from unknown bottleneck to targeted optimization, using perf at each stage.
The program: a CPU ray tracer that tests rays against a BVH (Bounding Volume Hierarchy) of 10 million triangles.
Baseline: 8 minutes 14 seconds for a 1920×1080 render with 64 samples per pixel.
Stage 1: Wall-Clock Profiling with perf stat
# First question: what kind of bottleneck is this?
perf stat -e cycles,instructions,branch-misses,L1-dcache-load-misses,LLC-load-misses \
./raytrace scene_complex.sce
# Output:
# 494,020,000,000 cycles (8m14s × 5.4 GHz)
# 312,432,456,123 instructions IPC = 0.63
# 8,943,234,567 branch-misses (2.9% rate — significant!)
# 34,892,345,678 L1-dcache-load-misses (11.2% rate — very high)
# 12,384,921,345 LLC-load-misses (3.9% rate — DRAM heavy)
IPC = 0.63 tells us there is a significant bottleneck. Two candidates: - Branch mispredictions: 8.9 billion × ~20 cycles each = 178 billion stall cycles (36% of total) - LLC misses: 12.4 billion × ~200 cycles each = 2.48 trillion stall cycles (503% of total?)
Wait — 503% is impossible. Let's recalculate: 12.4B LLC misses × 200 cycles = 2.48T cycles. But total cycles = 494B. So not all LLC misses stall for 200 cycles — they overlap in the out-of-order window. The actual stall from LLC misses is approximately min(LLC_misses × latency, total_cycles) — the ROB can have multiple outstanding misses, hiding some latency.
The correct interpretation: both branch mispredictions and cache misses are significant. We need to find which code is responsible.
Stage 2: Hotspot Identification with perf record/report
# Sample at 4000 Hz for 60 seconds of execution
perf record -g -F 4000 ./raytrace scene_complex.sce &
sleep 60
kill %1
perf report --stdio --sort=comm,dso,sym | head -30
# Output:
#
# Overhead Command Shared Object Symbol
# ─────────────────────────────────────────────────────────────────
# 78.23% raytrace raytrace [.] bvh_intersect_aabb
# 9.12% raytrace raytrace [.] triangle_intersect
# 4.31% raytrace raytrace [.] shade_point
# 3.87% raytrace raytrace [.] generate_ray
# 2.14% raytrace raytrace [.] sample_texture
# ...
78% of samples are in bvh_intersect_aabb. The developer's intuition was right, but now we have confirmation and a number: this is an 78% hotspot. Optimizing it by 4× would reduce total runtime by ~59%, taking 8m14s → ~3m22s.
Stage 3: Instruction-Level Analysis with perf annotate
perf annotate bvh_intersect_aabb --stdio
# Output (showing high-sample instructions):
#
# 0.12 : bvh_intersect_aabb:
# 0.08 : push rbp
# 0.04 : mov rbp, rsp
# ...
# 31.47 : movss xmm0, [rdi+0x10] ← HOT: loads node->min.x
# 0.23 : subss xmm0, xmm1 ← computed tmin_x
# 28.91 : movss xmm2, [rdi+0x1C] ← HOT: loads node->max.x
# 0.18 : subss xmm2, xmm3
# 0.14 : divss xmm0, xmm4 ← t_min_x = (min.x - ray.ox) / ray.dx
# 0.21 : divss xmm2, xmm4
# ...
# 2.34 : cmp eax, 0
# 8.43 : jz .no_intersect ← mispredicted branch
Two problems identified:
1. Lines 31% and 29%: Loads from node->min.x and node->max.x are stalling — LLC misses. The BVH nodes are not cache-friendly.
2. 8.43%: Branch misprediction on the intersection test result.
3. DIV instructions: Not a significant fraction here (0.14%), but will become visible after fixing the main bottlenecks.
Stage 4: Understanding the Data Structure
The BVH node is defined as:
typedef struct BVHNode {
Vec3 min; // 12 bytes
Vec3 max; // 12 bytes
int left_child; // 4 bytes
int right_child; // 4 bytes
int triangle_id; // 4 bytes
int flags; // 4 bytes
// Total: 40 bytes per node
} BVHNode;
Two nodes = 80 bytes — more than one cache line. Traversal accesses left_child and right_child to go deeper in the tree, which are in a different cache line than the min/max bounds if the struct is laid out naively.
More importantly: BVH traversal accesses nodes in non-sequential order (determined by ray-box intersection). Random access to 10M nodes × 40 bytes = 400 MB → every AABB load is an LLC miss.
Stage 5: Fix 1 — AABB Layout Optimization
Observation: The tight inner loop only needs min and max (24 bytes). left_child, right_child, and triangle_id are only needed when descending or at a leaf. Separate the bounds from the tree structure:
; ORIGINAL: AoS layout
; node: [min.x][min.y][min.z][max.x][max.y][max.z][left][right][tri_id][flags]
; 4 4 4 4 4 4 4 4 4 4
; ← 40 bytes, straddles cache lines ─────────────────────────────────►
; OPTIMIZED: Separate arrays
; aabb_min_x[N], aabb_min_y[N], aabb_min_z[N] (SoA for min)
; aabb_max_x[N], aabb_max_y[N], aabb_max_z[N] (SoA for max)
; bvh_children[N][2] (tree structure, separate)
; For SIMD: test 4 AABBs simultaneously
; Load 4 consecutive min_x values: one cache line = 16 floats → 4 AABBs free
section .bss
align 64
aabb_min_x: resd MAX_NODES ; all node min.x values
aabb_min_y: resd MAX_NODES
aabb_min_z: resd MAX_NODES
aabb_max_x: resd MAX_NODES
aabb_max_y: resd MAX_NODES
aabb_max_z: resd MAX_NODES
bvh_left: resd MAX_NODES
bvh_right: resd MAX_NODES
bvh_flags: resd MAX_NODES
With SoA + SIMD, the intersection test can check 8 AABBs per iteration:
; test_aabb_8: test ray against 8 AABB nodes simultaneously
; ymm0 = ray_ox broadcast, ymm1 = ray_oy, ymm2 = ray_oz
; ymm3 = inv_dx broadcast, ymm4 = inv_dy, ymm5 = inv_dz
; rdi = base index into SoA arrays
; Returns: mask in eax (bit i set if node i intersects ray)
test_aabb_8:
; Load 8 min.x values (consecutive in memory → one load, zero extra misses)
vmovaps ymm6, [aabb_min_x + rdi*4]
vsubps ymm6, ymm6, ymm0 ; min.x - ray.ox
vmulps ymm6, ymm6, ymm3 ; t_min_x = (min.x - ox) / dx
vmovaps ymm7, [aabb_max_x + rdi*4]
vsubps ymm7, ymm7, ymm0
vmulps ymm7, ymm7, ymm3 ; t_max_x
vmovaps ymm8, [aabb_min_y + rdi*4]
vsubps ymm8, ymm8, ymm1
vmulps ymm8, ymm8, ymm4 ; t_min_y
vmovaps ymm9, [aabb_max_y + rdi*4]
vsubps ymm9, ymm9, ymm1
vmulps ymm9, ymm9, ymm4 ; t_max_y
; Compute interval: tmin = max(t_min_x, t_min_y), tmax = min(t_max_x, t_max_y)
vmaxps ymm10, ymm6, ymm8 ; tmin = max(tmin_x, tmin_y)
vminps ymm11, ymm7, ymm9 ; tmax = min(tmax_x, tmax_y)
; [repeat for z component]
vmovaps ymm12, [aabb_min_z + rdi*4]
; ... (omitted for brevity)
; Final test: tmin <= tmax and tmax >= 0
vcmpleps ymm13, ymm10, ymm11 ; tmin <= tmax?
vcmpgeps ymm14, ymm11, [zero_vec] ; tmax >= 0?
vandps ymm13, ymm13, ymm14 ; both conditions
vmovmskps eax, ymm13 ; extract 8-bit intersection mask
; eax bit i = 1 if node (rdi+i) intersects the ray
ret
Stage 6: Fix 2 — Branchless Intersection
The mispredicted branch on jz .no_intersect is responsible for 8.43% of samples. Replace with branchless processing using the mask from the SIMD test:
; ORIGINAL: branch on intersection test
call test_aabb
test eax, eax
jz .no_intersect ; 50% miss rate for random scene
; push child onto stack
.no_intersect:
; continue
; BRANCHLESS: always push, let mask determine validity
; Process 8 nodes, push valid children without branching
call test_aabb_8 ; eax = 8-bit mask
; For each set bit in eax, push the corresponding child index onto stack
; Use BSF/TZCNT to iterate over set bits:
.push_loop:
bsf ecx, eax ; ecx = index of lowest set bit
jz .push_done ; no more bits set
btr eax, ecx ; clear that bit
; Push child node index (rdi + ecx) onto BVH traversal stack
mov [r15 + r14*4], ecx ; r15 = stack base, r14 = stack pointer
inc r14d
jmp .push_loop
.push_done:
This eliminates the mispredicted branch entirely. The BSF loop runs as many iterations as there are intersecting nodes — typically 1–3, so it is short.
Stage 7: Measure After Each Fix
# After SoA AABB layout:
perf stat -e cycles,instructions,LLC-load-misses,branch-misses ./raytrace_v2
# v2 output:
# cycles: 187,234,512,000
# instructions: 398,234,456,123 IPC = 2.13 (was 0.63)
# LLC-misses: 1,234,512,345 (was 12.4B — 10× reduction!)
# branch-misses: 8,943,234,567 (unchanged — still 8.9B)
# Time: 34.7s (was 494s — 14.2× faster!)
# After branchless intersection:
perf stat -e cycles,instructions,LLC-load-misses,branch-misses ./raytrace_v3
# v3 output:
# cycles: 143,512,345,000
# instructions: 423,512,345,678 IPC = 2.95
# LLC-misses: 1,234,512,345 (unchanged — already fixed)
# branch-misses: 234,512,345 (was 8.9B — 38× reduction!)
# Time: 26.6s (was 34.7s after SoA — additional 1.3× speedup)
Combined improvement: 494s → 26.6s → 18.6× total speedup.
Final Profile
perf report --stdio --sort=sym --stdio ./raytrace_v3 | head -10
# Samples: 12K of event 'cycles'
# Overhead Symbol
# 31.23% [.] triangle_intersect ← now the new hotspot
# 22.17% [.] bvh_intersect_aabb (still 2nd, but much faster)
# 18.94% [.] shade_point
# 12.43% [.] sample_texture
# 8.21% [.] generate_ray
bvh_intersect_aabb dropped from 78% of samples to 22%. triangle_intersect is now the dominant cost at 31%. The next optimization pass would address that — but that is a different case study.
The Methodology Review
The complete workflow that took a 2-line code change to achieve an 18× speedup:
perf statfirst: IPC = 0.63, high LLC miss rate → memory or branch boundperf record/report: 78% of time in one function — verified the hotspotperf annotate: Two specific instructions with 31% and 29% sample counts → LLC misses on BVH node loads- Root cause analysis: AoS layout with 40-byte nodes, random access pattern, 400 MB working set
- Fix 1: SoA layout + SIMD → 14.2× speedup, LLC misses drop 10×
- Fix 2: Branchless intersection → 1.3× additional speedup, branch misses drop 38×
- Re-profile: New hotspot identified for next iteration
The developer's initial intuition was correct — but without profiling, they might have optimized the wrong thing (e.g., the division operations, which were 0.14% of time) or applied the right optimization in the wrong way. The 18× speedup came from two structural changes, both guided by specific measurement data.
🔍 Investigation Note: The profiling workflow described here — stat → record → annotate → fix → repeat — is the standard approach at every major performance engineering team. The tools are free (Linux perf), the methodology is systematic, and the results are verifiable. Performance optimization without profiling is guesswork; performance optimization with profiling is engineering.