Chapter 33 Exercises: Performance Analysis and Optimization

Open Assembly Language Project

Chapter 33 Exercises: Performance Analysis and Optimization

Exercise 1 — RDTSC Microbenchmark

Write a microbenchmark harness in NASM that: a) Uses RDTSC/LFENCE to measure wall-clock cycles b) Runs a warm-up phase of 100 iterations before timing c) Runs 10,000 timed trials and records the minimum (not average) cycle count d) Uses the minimum to avoid OS interrupt noise

Measure the following instructions (each in its own loop of 1000 iterations to amortize RDTSC overhead): - ADD RAX, RBX - IMUL RAX, RBX - DIV RBX (64-bit) - ADDSS XMM0, XMM1 - MULSS XMM0, XMM1 - SQRTSS XMM0, XMM1 - VFMADD231SS XMM0, XMM1, XMM2

For each: report measured latency (cycles per instruction, assuming a serial dependency chain). Compare with Agner Fog's tables for your CPU.

Exercise 2 — perf stat Bottleneck Classification

For each of the following programs, run perf stat with the events below and classify the bottleneck:

perf stat -e cycles,instructions,branch-misses,L1-dcache-load-misses,LLC-load-misses ./program

Programs to analyze: a) A sequential sum of a 1 GB array b) A random-access hash table lookup (1 million lookups in a 256 MB table) c) A tight arithmetic loop: for(i=0; i<N; i++) r = r*r*r + 1.7; (single accumulator) d) A quicksort of 10 million random integers e) A loop that computes the Mandelbrot set (complex multiply and add, conditional exit)

For each program, state: the dominant bottleneck (memory bandwidth, cache misses, branch mispredictions, computation, ILP), and the evidence from perf stat that supports your diagnosis.

Exercise 3 — IPC Analysis

a) Write a loop that computes the square root of a float using the Newton-Raphson method (no SQRTSS instruction — just multiplies and adds): x = initial_guess x = x * (1.5 - 0.5 * a * x * x) // one iteration x = x * (1.5 - 0.5 * a * x * x) // second iteration (high accuracy)

b) Measure IPC with perf stat -e instructions,cycles.

c) Identify the critical path through the computation. How many cycles does the critical path predict?

d) Compare predicted IPC with measured IPC. If they differ, explain why.

e) Rewrite to process 8 values simultaneously using AVX2 (VMULPS, VADDPS, etc.). What IPC do you achieve?

Exercise 4 — Dependency Chain Breaking

The following loop has a critical dependency problem:

; Compute sum of f(x[i]) for i = 0..N-1
; f(x) = x*x + 3*x + 1.7  (scalar float)
.loop:
    movss   xmm0, [rsi + rcx*4]
    mulss   xmm0, xmm0          ; x^2
    movss   xmm1, [rsi + rcx*4]
    mulss   xmm1, [C3]           ; 3x
    addss   xmm0, xmm1          ; x^2 + 3x
    addss   xmm0, [C17]         ; + 1.7
    addss   xmm2, xmm0          ; accumulate (DEPENDENCY: xmm2)
    dec     rcx
    jnz     .loop

a) Identify the critical path. What limits throughput? b) Rewrite using 4 independent accumulators (xmm2, xmm3, xmm4, xmm5). c) Rewrite using 4 independent accumulators AND 4-way loop unrolling (processing 4 elements per iteration to avoid redundant loads). d) Benchmark all three versions for N=10 million. Report cycles/element and speedup. e) Use llvm-mca to predict the throughput of each version before benchmarking. How accurate is the prediction?

Exercise 5 — Division by Constant

a) Write a NASM function that computes n / 7 for a 32-bit unsigned integer n using the DIV instruction. b) Compute the "magic number" for dividing by 7 (or use libdivide.h to generate it) and write a version using IMUL and shift. c) Benchmark both versions (1 billion divisions). What speedup do you observe? d) Repeat for divisors: 3, 5, 7, 10, 13, 17. Tabulate: divisor, magic number, shift amount, speedup. e) For which divisors does the reciprocal multiply give exact results for all 32-bit inputs? How would you verify this? f) Extend to signed division by -7. How does the algorithm change?

Exercise 6 — Loop Fusion and Fission

Given two loops:

// Loop A: compute B[i] = f(A[i])
for (int i = 0; i < N; i++) B[i] = sqrt(A[i]);

// Loop B: compute sum of B[i] * C[i]
for (int i = 0; i < N; i++) sum += B[i] * C[i];

a) Implement both as two separate loops in NASM (with the intermediate B[] array). b) Implement as a fused single loop (compute sum += sqrt(A[i]) * C[i] directly, no B[] array). c) For N=10 million floats, benchmark both implementations. Which is faster? d) Analyze: why does fusion help (or hurt) in terms of cache usage and register pressure? e) Would the fused version be faster or slower if sqrt were replaced by a trivial multiply? Explain. f) Now consider the reverse: loop fission (splitting one loop into two). When would splitting help?

Exercise 7 — CMOV vs. Branch

a) Implement a linear search function two ways: - Branch version: if (arr[i] == target) { found = i; break; } - Branchless version: processes all elements with CMOV, records last matching index

b) For the branchless version, use: nasm cmp [rsi + rcx*4], edi ; arr[i] == target? lea rdx, [rcx] cmove rax, rdx ; if equal, update found index

c) Benchmark both for: - Sequential data where the target is always at index N/2 (predictable branch) - Random data with the target present at a random position (unpredictable) - Data where the target is never found (predictable not-taken)

d) Report: cycles per element for each scenario and each version. When does CMOV win?

e) Add a third version: SSE2/AVX2 vectorized search using PCMPEQD and PMOVMSKB. How does it compare?

Exercise 8 — llvm-mca Analysis

Use llvm-mca to analyze the following loops. For each: 1. Run llvm-mca -mcpu=<your CPU> on the loop body 2. Record: throughput (cycles per iteration), IPC, bottleneck resource 3. Predict what optimization would help 4. Implement the optimization and verify with llvm-mca and RDTSC

Loops to analyze: a) Serial FP sum: addss xmm0, [rsi]; add rsi, 4; dec rcx; jnz .loop b) Interleaved loads and stores: reading from array A, writing to array B with a transform c) A loop with two IMUL instructions in the dependency chain d) A loop with mixed integer and floating-point operations

For each analysis, check whether port utilization or instruction latency is the limiting factor.

Exercise 9 — Code Alignment Experiment

a) Write a tight inner loop of exactly 12 instructions (to stress the 16-byte instruction fetch boundary). b) Run the loop with the first instruction at byte offsets 0, 1, 2, 4, 8, 12, 13, 14, 15, 16 within a 64-byte region (use NOP padding to shift the alignment). c) Benchmark each alignment. Plot: alignment offset vs. execution time. d) At which offsets does performance degrade? Does it correlate with 16-byte or 64-byte boundaries? e) Use perf stat -e dsb2mite_switches.penalty_cycles to measure frontend stalls caused by misalignment. f) Now apply ALIGN 16 before the loop in NASM. Verify that the loop starts at a 16-byte boundary by examining the disassembly (objdump -d).

Exercise 10 — Full Optimization Pipeline

Starting from the following naive function, apply the full optimization workflow:

// Compute weighted sum: sum(weights[i] * values[i]) for i = 0..N-1
// Both arrays are float32, N = 10 million
float weighted_sum(float *weights, float *values, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += weights[i] * values[i];
    return sum;
}

Step-by-step: a) Profile: Write a C wrapper and compile with gcc -O0. Run perf stat. Record baseline IPC, cycles, L1 miss rate. b) Identify bottleneck: Is it memory bandwidth, computation, or ILP? c) NASM v1: Write a direct NASM translation. Measure. d) NASM v2: Add 8 independent accumulators. Measure speedup. e) NASM v3: Convert to FMA (VFMADD231SS). Measure speedup. f) NASM v4: Convert to AVX2 (VFMADD231PS with YMM registers, 8 floats per instruction). Use 4 YMM accumulators. Measure speedup. g) NASM v5: Add software prefetch (PREFETCHT0 256 bytes ahead). Does it help? h) Compare with compiler: Compile the C version with gcc -O3 -march=native. How does it compare to your best NASM version?

Report: speedup at each step, and the perf stat metrics that guided each optimization decision.

Exercise 11 — Performance Counter Deep Dive

Write a benchmark that generates each of the following bottlenecks deliberately, then use perf to confirm which counter is elevated:

a) L1 cache miss: Access a 256 KB array with stride 4096 bytes (one cache line per access). b) L3 cache miss / DRAM latency: Pointer chase through 512 MB of randomly shuffled pointers. c) Branch misprediction: Predict if (rand() % 2) in a tight loop. d) Port 0 saturation: Loop with 4 VFMADD231PS instructions per iteration (all route to port 0). e) Frontend bound: Large loop that does not fit in the µop cache (~1500 µops threshold).

For each: write the benchmark, run perf stat with appropriate events, and confirm that the expected counter is elevated while others are normal.

Exercise 12 — Agner Fog Tables

Using Agner Fog's instruction tables for your CPU (available at agner.org/optimize):

a) For each instruction, look up latency, reciprocal throughput, and execution port: - ADDSS, MULSS, VFMADD231SS - ADDPS YMM, MULPS YMM, VFMADD231PS YMM - MOVAPS, VMOVAPS YMM - IMUL r64, r64 - DIV r64 - LEA r64, [r + r*8] - CMOVZ r64, r64

b) For a dot product loop body (VMOVAPS load + VFMADD231PS + loop overhead): - Calculate the theoretical throughput limit (FLOPs/cycle) assuming infinite accumulators - Calculate the minimum accumulators needed to reach that throughput - Verify with llvm-mca

c) For a matrix multiply inner loop using YMM FMA: - What is the theoretical peak GFLOP/s for your CPU (single core)? - What fraction of peak does your best tiled implementation achieve? - What bottleneck prevents reaching 100% of theoretical peak?