Chapter 32 Quiz: The Memory Hierarchy

1. A CPU accesses a single byte from an address not in any cache. How many bytes are transferred from DRAM to L1 cache?

A) 1 byte (only the requested byte) B) 8 bytes (one 64-bit word) C) 64 bytes (one cache line) D) 4096 bytes (one memory page)

Answer: C — Caches transfer data in units of cache lines (64 bytes on all modern x86-64 systems). Even if only 1 byte is needed, the entire 64-byte aligned block containing it is fetched and stored in cache.


2. What is the approximate DRAM access latency on a modern system?

A) 1–4 cycles B) 12–20 cycles C) 40–50 cycles D) 100–300 cycles

Answer: D — DRAM latency is approximately 60–80 nanoseconds, which at 4 GHz corresponds to 240–320 cycles. L1 is ~4 cycles, L2 ~12, L3 ~40. DRAM sits off-chip behind the memory controller and is roughly 50–100× slower than L1.


3. What does N-way set-associative mean for a cache?

A) The cache has N separate banks that operate in parallel B) Each cache set holds N cache lines, and any of the N can hold a given address C) The cache uses N hash functions to reduce conflict misses D) N addresses can be accessed simultaneously without contention

Answer: B — In an N-way set-associative cache, memory addresses are mapped to sets (by address bits), and each set has N "ways" or slots. A cache line from memory can be stored in any of the N ways of its set. Higher associativity reduces conflict misses at the cost of more complex tag comparison logic.


4. Which type of cache miss occurs the first time a program ever accesses a particular cache line?

A) Conflict miss B) Capacity miss C) Compulsory (cold) miss D) Coherence miss

Answer: C — A compulsory (cold) miss occurs the very first time any address in a cache line is accessed — the data has never been in cache before. These are irreducible: even an infinite cache would still incur them. Prefetching can hide the latency but cannot eliminate the miss itself.


5. Array A contains 1 million int32 values (4 MB). A program sums all elements with stride-1 access. How many L1 cache misses occur (approximately), assuming L1 = 32 KB and each cache line = 64 bytes?

A) 1,000,000 (one per element) B) 62,500 (one per cache line) C) 512 (L1 can hold 512 lines; those are the only misses) D) 0 (hardware prefetcher eliminates all misses)

Answer: B — With stride-1 access, each cache line (64 bytes = 16 int32s) is loaded once and used for 16 consecutive elements. Total misses = 1,000,000 / 16 = 62,500. The hardware prefetcher hides the latency but does not eliminate the misses — the data must still be transferred from L2/L3/DRAM; the difference is that it arrives before the CPU stalls.


6. Two threads run on different cores. Thread 0 writes counter0 and Thread 1 writes counter1. Both variables are adjacent in memory on the same 64-byte cache line. What happens?

A) The writes succeed without issue because they are to different addresses B) One thread must wait until the other finishes, as if they shared a lock C) Each write invalidates the other core's copy of the cache line (false sharing), causing coherence traffic and reduced performance D) The CPU detects the false sharing and automatically moves the variables to different cache lines

Answer: C — This is false sharing. Although the variables are different, they occupy the same cache line. In the MESI protocol, when Core 0 writes, Core 1's copy of the line is invalidated (I state). Core 1's next write causes a cache miss, fetches the line, then invalidates Core 0's copy. The two cores ping-pong the cache line back and forth, generating coherence traffic on every write. Performance can degrade by 10–50× compared to using separate cache lines.


7. What is the primary purpose of the MOVNTPS instruction?

A) A fast version of MOVAPS that executes in fewer cycles B) A store that bypasses the cache, writing directly to memory via write-combining buffers C) A load that prefetches the cache line into L1 and L2 D) A store that writes to all cores' caches simultaneously

Answer: B — Non-temporal (NT) stores bypass the cache hierarchy. Instead of loading the cache line before writing (read-for-ownership), they write through write-combining buffers to memory. This is faster for write-only, large-buffer operations (memset, video frame writes) because it avoids read-for-ownership and maximizes write bandwidth. However, NT stores are weakly ordered and require SFENCE to ensure visibility.


8. What must follow a series of non-temporal stores to ensure they are visible to other processors?

A) LFENCE B) MFENCE C) SFENCE D) CPUID

Answer: C — SFENCE (Store Fence) ensures all prior stores, including non-temporal stores held in write-combining buffers, are flushed and made globally visible before any subsequent stores execute. LFENCE is for load ordering; MFENCE is a full fence (both loads and stores). For NT stores specifically, SFENCE is the correct and sufficient instruction.


9. A program accesses a 2D array stored in row-major order using column-major traversal (outer loop iterates columns, inner loop iterates rows). The array is 4096×4096 float32 (64 MB). What is the access stride in bytes?

A) 4 bytes (float32 size) B) 64 bytes (one cache line) C) 16,384 bytes (one row = 4096 × 4 bytes) D) 4,194,304 bytes

Answer: C — In row-major storage, A[0][0] and A[1][0] (first elements of consecutive rows) are separated by one row = 4096 × 4 bytes = 16,384 bytes. Column-major traversal accesses A[0][j], A[1][j], A[2][j] — each 16,384 bytes apart. This means every access is a new cache line, eliminating all spatial locality.


10. Structure of Arrays (SoA) layout is preferred over Array of Structures (AoS) when:

A) Each operation accesses all fields of a single object at once B) Objects have many fields and only a few are accessed in hot loops C) The program is single-threaded D) The data fits entirely in L1 cache

Answer: B — SoA groups the same field across all objects contiguously. When a hot loop only uses a few fields (e.g., only x, y, z of a particle with 8 fields), SoA ensures those fields are packed densely in cache. With AoS, loading an object's x field also loads the other 7 fields into cache, wasting 7/8 of the cache line. SoA also enables SIMD processing: 8 consecutive x values fit perfectly in a YMM register.


11. The hardware prefetcher is LEAST effective for which access pattern?

A) Sequential array traversal B) Fixed-stride array access (e.g., every 8th element) C) Pointer chasing through a randomly shuffled linked list D) Reading a matrix row by row

Answer: C — The hardware prefetcher works by detecting regular stride patterns. For sequential and fixed-stride access, it reliably predicts the next cache line to fetch. Pointer chasing through a randomly shuffled linked list has no regular stride — the next address to access depends on the current memory value (data-dependent). The prefetcher cannot know what to prefetch because the next address is unknown until the current load completes.


12. What is the Flush+Reload side-channel technique used for?

A) Flushing the TLB and reloading page tables after a context switch B) Flushing a cache line with CLFLUSH, then measuring the reload time to determine if another process accessed that line C) Flushing the µop cache and reloading instructions from memory D) Flushing write-combining buffers before a memory barrier

Answer: B — Flush+Reload is a cache side-channel attack: the attacker flushes a specific cache line using CLFLUSH, waits for a victim process to run, then measures how long it takes to reload that address. A fast reload (cache hit) means the victim accessed that line; a slow reload (cache miss) means it did not. This technique is used in Spectre and related attacks to exfiltrate secrets through memory access patterns.


13. Cache tiling (blocking) primarily addresses which type of cache miss?

A) Compulsory misses (cold misses on first access) B) Capacity misses (working set too large for cache) C) Conflict misses (two addresses mapping to the same set) D) Coherence misses (other cores invalidating lines)

Answer: B — Tiling divides the computation into blocks small enough to fit the active working set in cache. For a naive matrix multiply, the entire matrix B must be accessed for each row of A — far exceeding L2 capacity. With tiling, only one B tile is active at a time. The same B tile elements are reused many times before eviction, eliminating capacity misses during that tile's computation.


14. What is PREFETCHNTA used for, and how does it differ from PREFETCHT0?

A) PREFETCHNTA is faster because it only prefetches to L1 while PREFETCHT0 prefetches to all levels B) PREFETCHNTA prefetches with "non-temporal" hint, loading into L1 but minimizing L2/L3 pollution; PREFETCHT0 prefetches to L1, L2, and L3 C) PREFETCHNTA is deprecated; PREFETCHT0 should always be used instead D) PREFETCHNTA triggers a hardware prefetch unit while PREFETCHT0 is a software hint only

Answer: B — PREFETCHNTA (the NTA hint: non-temporal with respect to all cache levels) indicates that the data will be used once and should not fill all cache levels. It loads into L1 with minimal insertion into L2/L3, avoiding eviction of frequently used data. Use it when reading large data that will be processed exactly once (e.g., a memcpy source). PREFETCHT0 fills L1, L2, and L3, appropriate for data that will be reused within a moderate time window.


15. A benchmark measures DDR5 memory bandwidth at 50 GB/s for a large array traversal, while the theoretical peak is 76.8 GB/s. What explains this gap?

A) The CPU is executing too many instructions, consuming cycles that could be used for memory access B) DRAM requires refresh cycles, during which no reads occur; row activation overhead limits sustained bandwidth below theoretical peak C) The cache is too small to buffer the transfers efficiently D) The OS is intercepting every memory access for security checking

Answer: B — Theoretical memory bandwidth assumes continuous, perfectly pipelined access. Real workloads incur overhead from row activation (tRCD), precharge (tRP), and periodic DRAM refresh (tREFI). Additionally, the memory controller has finite request queue depth, and random access patterns reduce efficiency below sequential bandwidth. Typical sustained bandwidth for sequential access is 65–85% of theoretical peak; random access can be 20–40%.