Chapter 5: Further Reading
Essential Sources
1. Jeff Dean and Luiz André Barroso, "The Tail at Scale" (Communications of the ACM, 2013)
The paper that made every systems engineer think about p99 latency instead of average latency. Dean and Barroso show that in large-scale distributed systems, tail latency (the slowest 1% of requests) is dominated by rare events — garbage collection pauses, cache misses, background processes — that are invisible in average-case analysis. For ML serving systems, the implications are direct: a model that meets the latency SLA at p50 but violates it at p99 is not production-ready. The paper introduces hedged requests, tied requests, and other techniques for taming tail latency. Short (7 pages), accessible, and essential reading for anyone building serving systems.
Reading guidance: Read the entire paper — it is concise and every section is relevant. Pay particular attention to Table 1 ("Latency of common operations in a modern datacenter") and the "Why Variability Exists" section. Then reconsider the profiling and latency analysis from this chapter with p99 in mind instead of median.
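The hedged-request idea is easy to build intuition for with a toy simulation: issue a duplicate of each request and keep whichever response arrives first. The sketch below is our illustration, not code from the paper; the lognormal latency distribution is an arbitrary stand-in for real tail behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Per-request latency: mostly fast, occasionally hit by a slow event
# (GC pause, background work). Lognormal is a stand-in distribution.
base = rng.lognormal(mean=1.0, sigma=0.8, size=n)

# Hedged request: fire a duplicate and keep the first response, i.e.
# the minimum of two independent latency draws.
backup = rng.lognormal(mean=1.0, sigma=0.8, size=n)
hedged = np.minimum(base, backup)

p99_base = np.percentile(base, 99)
p99_hedged = np.percentile(hedged, 99)
print(f"p99 single: {p99_base:.1f} ms, p99 hedged: {p99_hedged:.1f} ms")
```

In practice the paper recommends firing the backup only after the first request has been outstanding longer than, say, the p95 latency; that captures most of the tail benefit while adding only a few percent of extra load.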
2. Samuel Williams, Andrew Waterman, and David Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures" (Communications of the ACM, 2009)
The original roofline model paper. Williams et al. introduce the elegant idea that computational performance is bounded by the minimum of peak compute throughput and memory bandwidth times arithmetic intensity — creating a "roofline" on a log-log plot. The paper applies the model to four multicore architectures and shows how different optimizations (SIMD, software prefetching, loop tiling) move a computation toward the roofline. Although the paper predates the GPU-dominated deep learning era, the model transfers directly: the A100's roofline is the same concept with different numbers. Understanding the roofline model is the single most effective way to reason about GPU utilization.
Reading guidance: Focus on Sections 1-3 (the roofline model itself) and Section 5 (optimization strategies). The specific architectures analyzed (Sun Niagara2, Intel Clovertown) are outdated, but the methodology is timeless. After reading, compute the roofline for your GPU (e.g., A100: 312 TFLOP/s fp16, 2039 GB/s HBM bandwidth) and plot the operations from Section 5.9 of this chapter.
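The roofline itself is a one-line formula: attainable throughput is the minimum of peak compute and bandwidth times arithmetic intensity. A minimal sketch using the A100 numbers quoted above (the helper function name is ours):

```python
# Roofline: attainable FLOP/s is bounded by compute or memory,
# whichever binds first at a given arithmetic intensity (FLOPs/byte).
PEAK_FLOPS = 312e12   # A100 fp16 tensor-core peak, FLOP/s
PEAK_BW = 2039e9      # A100 HBM bandwidth, bytes/s

def attainable_flops(intensity_flops_per_byte):
    return min(PEAK_FLOPS, PEAK_BW * intensity_flops_per_byte)

# Ridge point: intensity at which a kernel stops being memory-bound.
ridge = PEAK_FLOPS / PEAK_BW  # roughly 153 FLOPs/byte for the A100

for ai in (1, 10, ridge, 500):
    print(f"AI={ai:7.1f} FLOP/B -> {attainable_flops(ai) / 1e12:6.1f} TFLOP/s")
```

Any kernel whose arithmetic intensity falls left of the ridge point is memory-bound; no amount of compute optimization will help until its bytes-per-FLOP ratio improves.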
3. Jeff Johnson, Matthijs Douze, and Hervé Jégou, "Billion-Scale Similarity Search with GPUs" (IEEE Transactions on Big Data, 2021)
The FAISS paper. Johnson et al. describe the system design, GPU implementations, and algorithmic choices behind FAISS — Meta's library for approximate nearest neighbor search that powers recommendation, search, and retrieval systems at scale. The paper covers IVF indexing, product quantization, and GPU-accelerated distance computation, with benchmarks on datasets up to one billion vectors. For data scientists building recommendation systems or retrieval-augmented generation (RAG) pipelines, FAISS is the most important library in this chapter.
Reading guidance: Start with Sections 1-2 (problem statement and algorithm overview). Section 3 (GPU implementation) is valuable for understanding why GPU-accelerated ANN is 5-10x faster than CPU implementations. Section 5 (experiments) provides benchmark numbers you can use to calibrate your own cost estimates. Pair this with the FAISS documentation and tutorials at the official GitHub repository.
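The IVF idea at the heart of the paper can be illustrated without the library: partition the database around coarse centroids, and at query time scan only the `nprobe` nearest partitions instead of every vector. A simplified numpy sketch (randomly chosen centroids stand in for the k-means training FAISS actually uses):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_db, n_lists, nprobe = 32, 5000, 50, 5

db = rng.standard_normal((n_db, d)).astype(np.float32)

# "Train" a coarse quantizer: random database vectors as centroids
# (FAISS uses k-means here).
centroids = db[rng.choice(n_db, n_lists, replace=False)]
assign = np.argmin(((db[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
lists = [np.where(assign == i)[0] for i in range(n_lists)]

def ivf_search(q, k=5):
    # Probe only the nprobe nearest inverted lists, not all n_db vectors.
    near = np.argsort(((centroids - q) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([lists[i] for i in near])
    dist = ((db[cand] - q) ** 2).sum(-1)
    return cand[np.argsort(dist)[:k]]

q = rng.standard_normal(d).astype(np.float32)
print(ivf_search(q))
```

This is an approximation: a true neighbor living in an unprobed list is missed, which is exactly the recall-versus-speed trade-off the paper's experiments quantify.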
4. Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (NeurIPS, 2022)
FlashAttention demonstrates that algorithmic improvements in memory access patterns can provide 2-4x speedups for attention computation without any approximation — contradicting the assumption that faster attention requires approximate methods. The key insight is that standard attention implementations are memory-bound because they materialize the $n \times n$ attention matrix in HBM. FlashAttention computes attention in blocks that fit in GPU SRAM, reducing HBM I/O from $\Theta(nd + n^2)$ to $\Theta(n^2 d^2 / M)$, where $M$ is the SRAM size. This paper is the best illustration of the roofline model's practical implications: understanding the memory hierarchy led to a breakthrough that pure algorithmic complexity analysis missed.
Reading guidance: Read Sections 1-3 for the algorithm and I/O analysis. The key figure is Figure 1, which shows the standard attention's memory access pattern vs. FlashAttention's tiled pattern. Section 4 (extensions to multi-head attention, masking, dropout) shows the engineering required to make a theoretically simple idea production-ready. This paper is directly relevant to Chapter 10 (transformers) and Chapter 26 (training at scale).
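The core trick is the online softmax: process K and V in tiles, carrying a running row-max and normalizer so the result is exact without ever materializing the $n \times n$ score matrix. A simplified single-head numpy sketch (no Q tiling, no fp16, none of the paper's kernel engineering):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Reference implementation: materializes the full n x n matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(-1, keepdims=True))
    return (P / P.sum(-1, keepdims=True)) @ V

def blockwise_attention(Q, K, V, block=16):
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(V)
    m = np.full(n, -np.inf)   # running row max
    l = np.zeros(n)           # running softmax normalizer
    for j in range(0, K.shape[0], block):
        S = (Q @ K[j:j + block].T) * scale      # n x block tile of scores
        m_new = np.maximum(m, S.max(-1))
        # Rescale previous accumulators to the new max, then add this tile.
        alpha = np.exp(m - m_new)
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(-1)
        out = out * alpha[:, None] + P @ V[j:j + block]
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 8)) for _ in range(3))
assert np.allclose(blockwise_attention(Q, K, V), naive_attention(Q, K, V))
```

The rescaling by `alpha` is what makes the tiled softmax exact rather than approximate; on a GPU the tile lives in SRAM, which is where the I/O savings come from.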
5. Graham Cormode and S. Muthukrishnan, "An Improved Data Stream Summary: The Count-Min Sketch and Its Applications" (Journal of Algorithms, 2005)
The original Count-Min Sketch paper. Cormode and Muthukrishnan introduce a beautifully simple data structure — a 2D array of counters updated by multiple hash functions — that provides approximate frequency counts for any element in a data stream using fixed memory. The paper proves tight error bounds and demonstrates applications to heavy hitters, quantile estimation, and join size estimation. For data scientists working with streaming data, event counting, or real-time analytics at scale, the Count-Min Sketch is one of the most practical tools in the sublinear-algorithms literature.
Reading guidance: Sections 1-3 cover the data structure and proofs. Section 4 (applications) shows how the sketch extends beyond simple frequency counting. The proofs are short and elegant — worth reading in full. After the paper, explore the broader streaming algorithms literature: HyperLogLog for cardinality estimation, MinHash for set similarity, and the Misra-Gries algorithm for frequent items.
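The data structure itself fits in a dozen lines. A minimal Python sketch (a keyed cryptographic hash stands in for the paper's pairwise-independent hash family):

```python
import hashlib
import numpy as np

class CountMinSketch:
    def __init__(self, width=2048, depth=5):
        self.width, self.depth = width, depth
        self.table = np.zeros((depth, width), dtype=np.int64)

    def _buckets(self, item):
        # One hash function per row; the row index salts the hash.
        for row in range(self.depth):
            h = hashlib.blake2b(item.encode(), salt=row.to_bytes(8, "little"))
            yield row, int.from_bytes(h.digest()[:8], "little") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row, col] += count

    def query(self, item):
        # Each row overestimates (collisions only add), so the minimum
        # over rows is the tightest available upper bound on the count.
        return min(self.table[row, col] for row, col in self._buckets(item))

cms = CountMinSketch()
for word in ["the", "cat", "the", "sat", "the"]:
    cms.add(word)
print(cms.query("the"))  # an upper bound on the true count of 3
```

Note the one-sided error: the sketch never undercounts, and the paper's bounds say the overcount exceeds $\varepsilon \cdot N$ (for total count $N$) with probability at most $\delta$ when width is $\lceil e/\varepsilon \rceil$ and depth is $\lceil \ln(1/\delta) \rceil$.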