Chapter 32 Exercises: The Memory Hierarchy


Exercise 1 — Cache Line Arithmetic

A 32 KB 8-way set-associative L1 data cache uses 64-byte cache lines.

a) How many total cache lines does this cache hold?
b) How many sets does it have?
c) For a 64-bit address accessing byte 0x12345678:
   - What are the offset bits (which byte within the cache line)?
   - What are the set index bits (which set)?
   - What are the tag bits?
d) Two addresses, A and B, map to the same set if and only if their set index bits match. Do 0x00000000 and 0x00008000 map to the same set? Explain.


Exercise 2 — Miss Rate Estimation

Write a small C or NASM program that iterates over a 32 MB array of int32 values with a configurable stride, summing the elements it visits. Measure execution time for strides of 1, 2, 4, 8, 16, 32, and 64 elements.

a) Plot or tabulate stride vs. execution time (or cycles per element).
b) At what stride does performance first plateau? Why?
c) Estimate the cache miss rate at each stride (assume 4-byte reads, 64-byte cache lines).
d) Use perf stat -e L1-dcache-load-misses to verify your estimates.


Exercise 3 — Row-Major vs. Column-Major Access

Write two programs that sum all elements of a 4096×4096 matrix of float32:
- sum_rowmajor.asm: iterate rows outer, columns inner (row-major, stride 1)
- sum_colmajor.asm: iterate columns outer, rows inner (column-major, stride 4096×4 bytes)

a) Measure execution time of both versions with perf stat.
b) Record L1-dcache-load-misses and LLC-load-misses for both.
c) Calculate the theoretical miss rates for each (assume 32 KB L1, 256 KB L2, 8 MB L3).
d) The column-major version accesses a new cache line for every element. How many total cache lines does it touch? How many does the row-major version touch?
e) At what array size does the difference between row-major and column-major become significant? Why?
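The exercise asks for NASM; a C sketch of the two access patterns makes the loop order explicit before you translate it:

```c
#include <stddef.h>

/* Sum an n*n row-major float matrix two ways. Both return the same
   value; only the memory access pattern differs. */
static float sum_rowmajor(const float *m, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)        /* rows outer */
        for (size_t j = 0; j < n; j++)    /* columns inner: m[i][j], stride 1 */
            s += m[i * n + j];
    return s;
}

static float sum_colmajor(const float *m, size_t n) {
    float s = 0.0f;
    for (size_t j = 0; j < n; j++)        /* columns outer */
        for (size_t i = 0; i < n; i++)    /* rows inner: stride n*4 bytes */
            s += m[i * n + j];
    return s;
}
```

In the column-major version, consecutive inner-loop iterations are n*4 bytes apart, so every access lands on a different cache line.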


Exercise 4 — False Sharing

a) Write a multithreaded C program (using pthreads) with two counters:

// Version A: false sharing
struct {
    int counter0;
    int counter1;
} shared;
// Thread 0 increments shared.counter0 in a loop
// Thread 1 increments shared.counter1 in a loop

// Version B: padded (no false sharing)
struct {
    int counter0;
    char pad0[60];
    int counter1;
    char pad1[60];
} shared;

b) Run both versions with 2 threads, 100 million iterations each. Compare execution times.
c) Use perf stat -e cache-misses to measure cache miss rates for both versions.
d) Write the Version B fix in NASM, using align 64 to separate the counters onto different cache lines.
e) Explain: can false sharing occur with two variables in the same function's local variables (on the same thread's stack)? Why or why not?


Exercise 5 — MESI Protocol Tracing

Trace the MESI state of a cache line through the following sequence. Two cores share a write-back L2 cache. Use states M/E/S/I for each core's L1 cache.

Initial state: cache line X = I in both cores.

Step  Operation         Core 0 L1   Core 1 L1   Explanation
  1   Core 0 reads X
  2   Core 1 reads X
  3   Core 0 writes X
  4   Core 1 reads X
  5   Core 1 writes X
  6   Core 0 reads X

For each step, state: (a) the state transition, (b) whether a cache miss occurred, (c) whether any coherence traffic (invalidation, writeback, transfer) was generated.


Exercise 6 — Blocking/Tiling

Implement a tiled matrix-vector multiply for an N×N matrix and N-vector.

a) First implement the naive version (no tiling).
b) Then implement a tiled version with tile size T×T for the matrix.
c) Choose T so that one tile of the matrix plus the corresponding T elements of the input and output vectors fit in L1 cache (32 KB).
   - What is the maximum T? (each element is float32 = 4 bytes)
d) Benchmark both versions for N=2048 using perf stat.
e) Report execution time, L1 miss rate, L2 miss rate, and LLC miss rate for both versions.
f) What is the compute-to-memory ratio (FLOPs per byte loaded) for the tiled vs. untiled versions?
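One way to sketch the idea in C, using 1-D column strips rather than the full T×T blocks the exercise asks for (2-D blocking adds an outer row loop in the same spirit):

```c
#include <stddef.h>

/* y = A*x for an n*n row-major float matrix, naive version. */
static void matvec_naive(const float *A, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        y[i] = acc;
    }
}

/* Tiled over columns: a T-element slice of x stays hot in L1 while
   every row consumes it, instead of streaming all n elements of x
   through the cache once per row. */
static void matvec_tiled(const float *A, const float *x, float *y,
                         size_t n, size_t T) {
    for (size_t i = 0; i < n; i++) y[i] = 0.0f;
    for (size_t j0 = 0; j0 < n; j0 += T) {          /* column strip */
        size_t jend = j0 + T < n ? j0 + T : n;
        for (size_t i = 0; i < n; i++) {
            float acc = y[i];
            for (size_t j = j0; j < jend; j++)
                acc += A[i * n + j] * x[j];
            y[i] = acc;
        }
    }
}
```

Because each row's additions happen in the same j order in both versions, the results are bitwise identical, which makes correctness easy to verify before benchmarking.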


Exercise 7 — Non-Temporal Stores

a) Write two memset functions in NASM:
- memset_normal: uses MOV [rdi], rax in a loop
- memset_nt: uses MOVNTI [rdi], rax with SFENCE at the end

b) Benchmark both for buffer sizes 256 KB, 1 MB, 4 MB, 16 MB, and 64 MB.
c) At what buffer size does memset_nt become faster? Explain why.
d) Add a third version using VMOVNTPS with YMM registers. Does it outperform the 64-bit version? By how much?
e) For the NT version: what happens if you omit the SFENCE? Write a test that exposes the ordering issue (hint: another thread checking when memory has been zeroed).
f) When would non-temporal stores be a bad choice? Give two specific scenarios.
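The exercise asks for NASM; the same instructions are also reachable from C through intrinsics. A sketch assuming an x86-64 compiler, where _mm_stream_si64 emits MOVNTI:

```c
#include <immintrin.h>
#include <stddef.h>

/* Zero a buffer with non-temporal 64-bit stores. Requirements: dst is
   8-byte aligned and bytes is a multiple of 8. */
static void memset_nt_zero(void *dst, size_t bytes) {
    long long *p = (long long *)dst;
    for (size_t i = 0; i < bytes / 8; i++)
        _mm_stream_si64(p + i, 0);   /* MOVNTI: bypasses the cache */
    _mm_sfence();                    /* make NT stores visible before
                                        any later stores, per part (e) */
}
```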


Exercise 8 — AoS to SoA Conversion

Given this particle simulation structure:

typedef struct {
    float x, y, z;     // position
    float vx, vy, vz;  // velocity
    float mass;
    float charge;
} Particle;

Particle particles[1000000];

a) Draw the memory layout of the first three particles.
b) If a gravity simulation only needs x, y, z, and mass: what fraction of cache space is "wasted" with the AoS layout?
c) Convert the struct to SoA layout in NASM:

section .bss
; Your SoA arrays here

d) Write NASM code to update all X positions: x[i] += vx[i] for i = 0..N-1, using AVX2 VADDPS on 8 floats at a time.
e) Write the same loop for the AoS layout. Why is the SoA version more efficient?
f) Measure both implementations for N=1,000,000. What speedup do you observe?
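A C sketch of the two layouts and the x-update loop from parts (d) and (e); the SoA version is a plain stride-1 loop that an AVX2-enabled compiler can turn into VADDPS:

```c
#include <stddef.h>

enum { N = 1024 };   /* exercise uses 1,000,000 */

/* SoA layout: one array per field. */
static float px[N], py[N], pz[N];
static float vx[N], vy[N], vz[N];
static float mass[N], charge[N];

/* Touches only the two arrays it needs; stride-1, trivially
   vectorizable. */
static void update_x_soa(size_t n) {
    for (size_t i = 0; i < n; i++)
        px[i] += vx[i];
}

/* AoS equivalent: each iteration drags a whole 32-byte Particle
   through the cache to use 8 of its bytes. */
typedef struct { float x, y, z, vx, vy, vz, mass, charge; } Particle;

static void update_x_aos(Particle *p, size_t n) {
    for (size_t i = 0; i < n; i++)
        p[i].x += p[i].vx;
}
```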


Exercise 9 — Prefetching Linked Lists

Linked list traversal is the classic case for software prefetching because nodes are scattered in memory.

a) Implement a linked list with nodes allocated using malloc (or sys_mmap). Create a list of 1 million nodes, each containing a 64-bit integer value.
b) Write a list traversal that sums all values — baseline (no prefetch).
c) Add software prefetch: PREFETCHT0 [current->next->next] to bring the next-next node into cache while processing the current node.
d) Benchmark both versions. What speedup do you observe?
e) Experiment with prefetch distance: prefetch 1, 2, 4, 8 nodes ahead. Plot prefetch distance vs. execution time.
f) Explain why the optimal prefetch distance is related to DRAM latency and the time spent processing one node.
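A C sketch of parts (b) and (c), using __builtin_prefetch (which GCC and Clang compile to a PREFETCH instruction) in place of hand-written NASM:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct node {
    int64_t value;
    struct node *next;
} node;

/* Baseline: each iteration stalls on the load of n->next. */
static int64_t sum_list(const node *n) {
    int64_t s = 0;
    for (; n; n = n->next)
        s += n->value;
    return s;
}

/* Prefetch two nodes ahead. The NULL check matters because we must
   dereference n->next to form the address; the prefetch itself is
   harmless even for a bad address, but the dereference is not. */
static int64_t sum_list_prefetch(const node *n) {
    int64_t s = 0;
    for (; n; n = n->next) {
        if (n->next)
            __builtin_prefetch(n->next->next, 0, 3);  /* rw=0, PREFETCHT0 */
        s += n->value;
    }
    return s;
}
```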


Exercise 10 — Working Set Analysis

For each of the following programs, estimate the working set size and identify which cache level it fits in (L1, L2, L3, DRAM):

a) A function that computes the dot product of two 1000-element float64 vectors. b) A function that multiplies two 512×512 float32 matrices (naive implementation). c) A function that does a binary search in a sorted array of 10 million int64 values. d) A function that counts word frequencies in a 100 MB text file using a hash table of 1 million entries. e) A real-time audio processing loop that applies 10 FIR filter taps to a buffer of 256 float32 samples.

For each:
1. Calculate the working set size in bytes.
2. State which cache level it fits in (assume L1 = 32 KB, L2 = 256 KB, L3 = 8 MB).
3. Identify the dominant cache behavior (cold misses only, L1 cycling, L2 spill, DRAM-bound).
4. Suggest one optimization to improve cache behavior.


Exercise 11 — Cache Miss Cascade

Write a program that demonstrates a "cache miss cascade": a data structure where following pointers causes repeated cache misses. Link the nodes in a randomly shuffled order (shuffle the indices before linking), so that each next-pointer dereference lands on a random memory location and misses the cache.

a) Create a circular linked list of N=1,000,000 nodes. Arrange the nodes randomly in memory (shuffle the indices before allocating).
b) Traverse the list, measuring total cycles with RDTSC.
c) Calculate the average cycles per pointer chase.
d) Compare with the latency numbers from the chapter (L1=4, L2=12, L3=40, DRAM=200 cycles). Which level is being missed at?
e) Now create the same list with nodes in sequential order (no shuffling). Compare traversal times and explain the difference.
f) For the shuffled version, can prefetching help? Why or why not?
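A C sketch of the shuffled-cycle construction: linking consecutive elements of any random permutation into a ring guarantees a single cycle that visits every slot exactly once, so the chase loop cannot accidentally fall into a short sub-cycle.

```c
#include <stdlib.h>
#include <stddef.h>

/* next[i] holds the index of the node after i. Following next[] from
   slot 0 visits all n slots in shuffled order: the miss cascade. */
static void build_random_cycle(size_t *next, size_t n, unsigned seed) {
    size_t *perm = malloc(n * sizeof *perm);
    srand(seed);
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {           /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)                 /* close the ring */
        next[perm[i]] = perm[(i + 1) % n];
    free(perm);
}

/* Chase the cycle for a given number of steps; time this with RDTSC
   and divide by steps for cycles per pointer chase. */
static size_t chase(const size_t *next, size_t start, size_t steps) {
    size_t p = start;
    for (size_t i = 0; i < steps; i++)
        p = next[p];
    return p;
}
```

For part (e), skip the shuffle (leave perm as the identity) and the same harness measures the sequential layout.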


Exercise 12 — Memory Bandwidth Measurement

Write a NASM bandwidth benchmark that measures peak read, write, and copy bandwidth:

a) Read bandwidth: load all elements of a large array (use VMOVAPS to load, but sum into a single accumulator to prevent the loads from being optimized away).
b) Write bandwidth: store zeros to all elements of a large array (use VMOVAPS with xmm0 = 0).
c) Copy bandwidth: copy one large array to another (VMOVAPS load + store).
d) Run for array sizes 32 KB, 256 KB, 8 MB, and 256 MB. Report bandwidth in GB/s for each level.
e) Compare your measurements with the theoretical limits: L1 bandwidth ~1000 GB/s, DDR5 ~65 GB/s.
f) Use perf stat -e mem_load_retired.l3_miss to confirm DRAM is being accessed for the 256 MB case.