Case Study 33-2: Profiling a Real Program — Finding the 80% Hotspot

Open Assembly Language Project

Case Study 33-2: Profiling a Real Program — Finding the 80% Hotspot

The Problem Nobody Wants to Solve

A ray tracer has been in production for two years. It is "fast enough" — until it is not. A new scene complexity requirement has made the render time go from 30 seconds to 8 minutes. The developer has a vague feeling that the AABB (axis-aligned bounding box) intersection test is "probably the problem," but has not verified this. This case study demonstrates the complete profiling workflow: from unknown bottleneck to targeted optimization, using perf at each stage.

The program: a CPU ray tracer that tests rays against a BVH (Bounding Volume Hierarchy) of 10 million triangles.

Baseline: 8 minutes 14 seconds for a 1920×1080 render with 64 samples per pixel.

Stage 1: Wall-Clock Profiling with perf stat

# First question: what kind of bottleneck is this?
perf stat -e cycles,instructions,branch-misses,L1-dcache-load-misses,LLC-load-misses \
          ./raytrace scene_complex.sce

# Output:
#  494,020,000,000    cycles              (8m14s × 5.4 GHz)
#  312,432,456,123    instructions        IPC = 0.63
#    8,943,234,567    branch-misses       (2.9% rate — significant!)
#   34,892,345,678    L1-dcache-load-misses (11.2% rate — very high)
#   12,384,921,345    LLC-load-misses      (3.9% rate — DRAM heavy)

IPC = 0.63 tells us there is a significant bottleneck. Two candidates: - Branch mispredictions: 8.9 billion × ~20 cycles each = 178 billion stall cycles (36% of total) - LLC misses: 12.4 billion × ~200 cycles each = 2.48 trillion stall cycles (503% of total?)

Wait — 503% is impossible. Let's recalculate: 12.4B LLC misses × 200 cycles = 2.48T cycles. But total cycles = 494B. So not all LLC misses stall for 200 cycles — they overlap in the out-of-order window. The actual stall from LLC misses is approximately min(LLC_misses × latency, total_cycles) — the ROB can have multiple outstanding misses, hiding some latency.

The correct interpretation: both branch mispredictions and cache misses are significant. We need to find which code is responsible.

Stage 2: Hotspot Identification with perf record/report

# Sample at 4000 Hz for 60 seconds of execution
perf record -g -F 4000 ./raytrace scene_complex.sce &
sleep 60
kill %1
perf report --stdio --sort=comm,dso,sym | head -30

# Output:
#
#  Overhead  Command      Shared Object   Symbol
#  ─────────────────────────────────────────────────────────────────
#    78.23%  raytrace     raytrace        [.] bvh_intersect_aabb
#     9.12%  raytrace     raytrace        [.] triangle_intersect
#     4.31%  raytrace     raytrace        [.] shade_point
#     3.87%  raytrace     raytrace        [.] generate_ray
#     2.14%  raytrace     raytrace        [.] sample_texture
#     ...

78% of samples are in bvh_intersect_aabb. The developer's intuition was right, but now we have confirmation and a number: this is an 78% hotspot. Optimizing it by 4× would reduce total runtime by ~59%, taking 8m14s → ~3m22s.

Stage 3: Instruction-Level Analysis with perf annotate

perf annotate bvh_intersect_aabb --stdio

# Output (showing high-sample instructions):
#
#  0.12 :   bvh_intersect_aabb:
#  0.08 :   push   rbp
#  0.04 :   mov    rbp, rsp
#  ...
# 31.47 :   movss  xmm0, [rdi+0x10]   ← HOT: loads node->min.x
#  0.23 :   subss  xmm0, xmm1         ← computed tmin_x
# 28.91 :   movss  xmm2, [rdi+0x1C]   ← HOT: loads node->max.x
#  0.18 :   subss  xmm2, xmm3
#  0.14 :   divss  xmm0, xmm4         ← t_min_x = (min.x - ray.ox) / ray.dx
#  0.21 :   divss  xmm2, xmm4
#  ...
#  2.34 :   cmp    eax, 0
#  8.43 :   jz     .no_intersect      ← mispredicted branch

Two problems identified: 1. Lines 31% and 29%: Loads from node->min.x and node->max.x are stalling — LLC misses. The BVH nodes are not cache-friendly. 2. 8.43%: Branch misprediction on the intersection test result. 3. DIV instructions: Not a significant fraction here (0.14%), but will become visible after fixing the main bottlenecks.

Stage 4: Understanding the Data Structure

The BVH node is defined as:

typedef struct BVHNode {
    Vec3 min;        // 12 bytes
    Vec3 max;        // 12 bytes
    int left_child;  // 4 bytes
    int right_child; // 4 bytes
    int triangle_id; // 4 bytes
    int flags;       // 4 bytes
    // Total: 40 bytes per node
} BVHNode;

Two nodes = 80 bytes — more than one cache line. Traversal accesses left_child and right_child to go deeper in the tree, which are in a different cache line than the min/max bounds if the struct is laid out naively.

More importantly: BVH traversal accesses nodes in non-sequential order (determined by ray-box intersection). Random access to 10M nodes × 40 bytes = 400 MB → every AABB load is an LLC miss.

Stage 5: Fix 1 — AABB Layout Optimization

Observation: The tight inner loop only needs min and max (24 bytes). left_child, right_child, and triangle_id are only needed when descending or at a leaf. Separate the bounds from the tree structure:

; ORIGINAL: AoS layout
; node: [min.x][min.y][min.z][max.x][max.y][max.z][left][right][tri_id][flags]
;        4       4       4      4      4      4      4     4      4      4
;        ← 40 bytes, straddles cache lines ─────────────────────────────────►

; OPTIMIZED: Separate arrays
; aabb_min_x[N], aabb_min_y[N], aabb_min_z[N]  (SoA for min)
; aabb_max_x[N], aabb_max_y[N], aabb_max_z[N]  (SoA for max)
; bvh_children[N][2]                            (tree structure, separate)

; For SIMD: test 4 AABBs simultaneously
; Load 4 consecutive min_x values: one cache line = 16 floats → 4 AABBs free

section .bss
align 64
aabb_min_x: resd MAX_NODES      ; all node min.x values
aabb_min_y: resd MAX_NODES
aabb_min_z: resd MAX_NODES
aabb_max_x: resd MAX_NODES
aabb_max_y: resd MAX_NODES
aabb_max_z: resd MAX_NODES
bvh_left:   resd MAX_NODES
bvh_right:  resd MAX_NODES
bvh_flags:  resd MAX_NODES

With SoA + SIMD, the intersection test can check 8 AABBs per iteration:

; test_aabb_8: test ray against 8 AABB nodes simultaneously
; ymm0 = ray_ox broadcast, ymm1 = ray_oy, ymm2 = ray_oz
; ymm3 = inv_dx broadcast, ymm4 = inv_dy, ymm5 = inv_dz
; rdi = base index into SoA arrays
; Returns: mask in eax (bit i set if node i intersects ray)

test_aabb_8:
    ; Load 8 min.x values (consecutive in memory → one load, zero extra misses)
    vmovaps ymm6, [aabb_min_x + rdi*4]
    vsubps  ymm6, ymm6, ymm0            ; min.x - ray.ox
    vmulps  ymm6, ymm6, ymm3            ; t_min_x = (min.x - ox) / dx

    vmovaps ymm7, [aabb_max_x + rdi*4]
    vsubps  ymm7, ymm7, ymm0
    vmulps  ymm7, ymm7, ymm3            ; t_max_x

    vmovaps ymm8, [aabb_min_y + rdi*4]
    vsubps  ymm8, ymm8, ymm1
    vmulps  ymm8, ymm8, ymm4            ; t_min_y

    vmovaps ymm9, [aabb_max_y + rdi*4]
    vsubps  ymm9, ymm9, ymm1
    vmulps  ymm9, ymm9, ymm4            ; t_max_y

    ; Compute interval: tmin = max(t_min_x, t_min_y), tmax = min(t_max_x, t_max_y)
    vmaxps  ymm10, ymm6, ymm8           ; tmin = max(tmin_x, tmin_y)
    vminps  ymm11, ymm7, ymm9           ; tmax = min(tmax_x, tmax_y)

    ; [repeat for z component]
    vmovaps ymm12, [aabb_min_z + rdi*4]
    ; ... (omitted for brevity)

    ; Final test: tmin <= tmax and tmax >= 0
    vcmpleps ymm13, ymm10, ymm11        ; tmin <= tmax?
    vcmpgeps ymm14, ymm11, [zero_vec]   ; tmax >= 0?
    vandps  ymm13, ymm13, ymm14         ; both conditions
    vmovmskps eax, ymm13                ; extract 8-bit intersection mask
    ; eax bit i = 1 if node (rdi+i) intersects the ray
    ret

Stage 6: Fix 2 — Branchless Intersection

The mispredicted branch on jz .no_intersect is responsible for 8.43% of samples. Replace with branchless processing using the mask from the SIMD test:

; ORIGINAL: branch on intersection test
    call    test_aabb
    test    eax, eax
    jz      .no_intersect       ; 50% miss rate for random scene
    ; push child onto stack
.no_intersect:
    ; continue

; BRANCHLESS: always push, let mask determine validity
; Process 8 nodes, push valid children without branching
    call    test_aabb_8         ; eax = 8-bit mask
    ; For each set bit in eax, push the corresponding child index onto stack
    ; Use BSF/TZCNT to iterate over set bits:

.push_loop:
    bsf     ecx, eax            ; ecx = index of lowest set bit
    jz      .push_done          ; no more bits set
    btr     eax, ecx            ; clear that bit

    ; Push child node index (rdi + ecx) onto BVH traversal stack
    mov     [r15 + r14*4], ecx  ; r15 = stack base, r14 = stack pointer
    inc     r14d
    jmp     .push_loop
.push_done:

This eliminates the mispredicted branch entirely. The BSF loop runs as many iterations as there are intersecting nodes — typically 1–3, so it is short.

Stage 7: Measure After Each Fix

# After SoA AABB layout:
perf stat -e cycles,instructions,LLC-load-misses,branch-misses ./raytrace_v2

# v2 output:
# cycles:        187,234,512,000
# instructions:  398,234,456,123    IPC = 2.13 (was 0.63)
# LLC-misses:      1,234,512,345   (was 12.4B — 10× reduction!)
# branch-misses:   8,943,234,567   (unchanged — still 8.9B)
# Time: 34.7s (was 494s — 14.2× faster!)

# After branchless intersection:
perf stat -e cycles,instructions,LLC-load-misses,branch-misses ./raytrace_v3

# v3 output:
# cycles:        143,512,345,000
# instructions:  423,512,345,678    IPC = 2.95
# LLC-misses:      1,234,512,345   (unchanged — already fixed)
# branch-misses:     234,512,345   (was 8.9B — 38× reduction!)
# Time: 26.6s (was 34.7s after SoA — additional 1.3× speedup)

Combined improvement: 494s → 26.6s → 18.6× total speedup.

Final Profile

perf report --stdio --sort=sym --stdio ./raytrace_v3 | head -10

# Samples: 12K of event 'cycles'
#  Overhead  Symbol
#    31.23%  [.] triangle_intersect   ← now the new hotspot
#    22.17%  [.] bvh_intersect_aabb   (still 2nd, but much faster)
#    18.94%  [.] shade_point
#    12.43%  [.] sample_texture
#     8.21%  [.] generate_ray

bvh_intersect_aabb dropped from 78% of samples to 22%. triangle_intersect is now the dominant cost at 31%. The next optimization pass would address that — but that is a different case study.

The Methodology Review

The complete workflow that took a 2-line code change to achieve an 18× speedup:

perf stat first: IPC = 0.63, high LLC miss rate → memory or branch bound
perf record/report: 78% of time in one function — verified the hotspot
perf annotate: Two specific instructions with 31% and 29% sample counts → LLC misses on BVH node loads
Root cause analysis: AoS layout with 40-byte nodes, random access pattern, 400 MB working set
Fix 1: SoA layout + SIMD → 14.2× speedup, LLC misses drop 10×
Fix 2: Branchless intersection → 1.3× additional speedup, branch misses drop 38×
Re-profile: New hotspot identified for next iteration

The developer's initial intuition was correct — but without profiling, they might have optimized the wrong thing (e.g., the division operations, which were 0.14% of time) or applied the right optimization in the wrong way. The 18× speedup came from two structural changes, both guided by specific measurement data.

🔍 Investigation Note: The profiling workflow described here — stat → record → annotate → fix → repeat — is the standard approach at every major performance engineering team. The tools are free (Linux perf), the methodology is systematic, and the results are verifiable. Performance optimization without profiling is guesswork; performance optimization with profiling is engineering.