Chapter 5: Key Takeaways

  1. Computational cost is a first-class design constraint, not an afterthought. The most common failure mode of senior data scientists is not building a bad model — it is building a good model that is too slow, too expensive, or too large to deploy. Before choosing an algorithm, estimate its cost at your target scale. If brute-force nearest neighbor search takes 50ms at 10M items and your budget is 15ms, no amount of engineering can make it work. You need an approximate method.
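
The estimate above can be done on the back of an envelope before writing any model code. The sketch below is illustrative only: the bandwidth figure is an assumption, not a measurement, and real latency also includes arithmetic and distance-computation overhead.

```python
# Back-of-envelope latency floor for brute-force nearest neighbor search.
# Every query must stream the full index through memory at least once, so
# memory bandwidth alone gives a lower bound. Hardware numbers are assumed.

def brute_force_latency_ms(n_items, dim, gb_per_sec=100.0, bytes_per_float=4):
    bytes_scanned = n_items * dim * bytes_per_float
    return bytes_scanned / (gb_per_sec * 1e9) * 1e3

# 10M items of 128-dim float32 at an assumed ~100 GB/s effective bandwidth
latency = brute_force_latency_ms(10_000_000, 128)
print(f"{latency:.1f} ms per query")  # ~51 ms: over a 15 ms budget before any arithmetic
```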

  2. Constant-factor improvements do not survive scaling. An $O(n^2)$ algorithm that is twice as fast as an $O(n \log n)$ algorithm at $n = 100$ will be thousands of times slower at $n = 10^6$. The asymptotic complexity class determines whether an approach is viable at production scale. Profiling reveals the current bottleneck; complexity analysis reveals the future bottleneck.
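
A quick numeric check makes this concrete. The cost functions below are synthetic, with the constant chosen so the quadratic method starts out exactly twice as fast at $n = 100$:

```python
import math

def nlogn(n):
    return n * math.log2(n)

# Pick a constant so the quadratic algorithm is exactly 2x FASTER at n = 100
c = nlogn(100) / (2 * 100**2)

def quadratic(n):
    return c * n**2

for n in (100, 10**4, 10**6):
    ratio = quadratic(n) / nlogn(n)
    print(f"n={n:>9,}: quadratic / nlogn cost ratio = {ratio:,.1f}x")
```

The initial 2x advantage is gone by $n = 10^4$ and has inverted into a roughly 1,700x deficit by $n = 10^6$: the constant factor never enters the asymptotic verdict.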

  3. Approximate methods (LSH, HNSW, IVF, streaming sketches) trade exactness for tractability. Exact nearest neighbor search in high dimensions is $O(nd)$ per query — there is no way around it. Approximate methods reduce this to $O(d \log n)$ or $O(d \cdot n^\rho)$ with $\rho < 1$, at the cost of occasionally missing the true nearest neighbor. The tradeoff is tunable: more hash tables, more probed cells, or higher beam width increases recall at the cost of latency. In practice, 95% recall at 1ms is almost always preferable to 100% recall at 50ms.
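
A minimal random-hyperplane LSH sketch shows the mechanism: nearby vectors (by cosine similarity) tend to land in the same bucket, so each query scans one bucket instead of all $n$ items. Sizes and bit counts here are illustrative, and a production index would use multiple tables and tuned parameters:

```python
import math
import random

random.seed(0)
DIM, N_BITS = 16, 8  # illustrative sizes, not tuned values

# One random hyperplane per signature bit.
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_BITS)]

def signature(vec):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0.0
                 for plane in planes)

# Index: hash every item into a bucket keyed by its signature.
data = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(2000)]
buckets = {}
for i, vec in enumerate(data):
    buckets.setdefault(signature(vec), []).append(i)

def query(vec):
    """Scan only the matching bucket instead of all n items."""
    candidates = buckets.get(signature(vec), [])
    return min(candidates, key=lambda i: math.dist(vec, data[i])) if candidates else None

avg_bucket = len(data) / len(buckets)
print(f"{len(buckets)} buckets, ~{avg_bucket:.1f} items scanned per query vs {len(data)}")
```

The recall knob is visible in the structure: adding hash tables or probing neighboring buckets recovers misses at the cost of scanning more candidates.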

  4. The roofline model reveals whether computation is limited by arithmetic throughput or memory bandwidth. Large matrix multiplications (the core of neural network forward passes) are compute-bound and achieve near-peak GPU utilization. Element-wise operations (ReLU, layer norm, softmax) are memory-bound — the GPU spends most of its time waiting for data. This asymmetry explains why kernel fusion (reducing memory traffic) often provides larger speedups than hardware upgrades (increasing arithmetic throughput).
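
The comparison can be made quantitative with arithmetic intensity (FLOPs per byte moved). The hardware numbers below are illustrative assumptions, roughly A100-class, not measurements:

```python
# Roofline sketch: an operation is memory-bound when its arithmetic intensity
# falls below the hardware's compute/bandwidth ratio (the "ridge point").
PEAK_FLOPS = 312e12   # assumed 312 TFLOP/s (fp16 tensor cores)
BANDWIDTH = 2.0e12    # assumed 2.0 TB/s HBM bandwidth
RIDGE = PEAK_FLOPS / BANDWIDTH  # ~156 FLOPs/byte

def matmul_intensity(n, dtype_bytes=2):
    # n x n matmul: 2n^3 FLOPs; reads two n x n matrices, writes one.
    return (2 * n**3) / (3 * n**2 * dtype_bytes)

def elementwise_intensity(dtype_bytes=2):
    # e.g. ReLU: 1 FLOP per element; one read plus one write.
    return 1 / (2 * dtype_bytes)

print(f"ridge point:       {RIDGE:.0f} FLOPs/byte")
print(f"4096x4096 matmul:  {matmul_intensity(4096):.0f} FLOPs/byte (compute-bound)")
print(f"elementwise ReLU:  {elementwise_intensity():.2f} FLOPs/byte (memory-bound)")
```

Fusing an element-wise op into the preceding matmul kernel eliminates its read and write entirely, which is why fusion beats faster arithmetic for memory-bound work.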

  5. Transformer attention is $O(n^2 d)$, but the quadratic term dominates only at long sequences. At typical sequence lengths (512–2048 tokens), the linear projection terms ($O(nd^2)$) dominate total FLOPs. The KV-cache reduces autoregressive generation from $O(T^3 d)$ to $O(T^2 d)$ by storing and reusing key/value projections. Understanding this cost structure is essential for choosing between model architectures and for predicting serving costs.
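
The crossover is easy to compute. The counts below use standard approximations ($8nd^2$ for the four projections, $4n^2 d$ for scores plus weighted values) for a single layer at batch size 1; they are illustrative, not a full model accounting:

```python
def attention_flops(n, d):
    proj = 8 * n * d**2  # four d x d projections applied to n tokens
    attn = 4 * n**2 * d  # QK^T scores plus attention-weighted values
    return proj, attn

for n in (512, 2048, 32_768):
    proj, attn = attention_flops(n, d=4096)
    print(f"n={n:>6}: projections are {proj / attn:.2f}x the quadratic term")
```

The ratio is $2d / n$, so for $d = 4096$ the quadratic term overtakes the projections only past $n = 8192$ tokens.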

  6. Amdahl's Law sets a hard ceiling on the speedup from parallelism. If 5% of a training pipeline is serial (data loading, gradient aggregation, logging), the maximum speedup with any number of GPUs is $20\times$. Adding GPUs beyond the point of diminishing returns wastes money. Profile the pipeline to identify the serial fraction before scaling hardware.
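
The ceiling follows directly from the formula, $S = 1 / (s + (1 - s)/N)$ for serial fraction $s$ and $N$ workers:

```python
def amdahl_speedup(serial_fraction, n_workers):
    """Maximum speedup when a fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

for gpus in (8, 64, 1024):
    print(f"{gpus:>5} GPUs: {amdahl_speedup(0.05, gpus):.1f}x")
print(f"ceiling: {1 / 0.05:.0f}x")  # limit as n_workers grows without bound
```

With a 5% serial fraction, 64 GPUs already deliver over 15x; the next 960 GPUs buy barely 4x more.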

  7. Profile first, optimize second. Intuition about bottlenecks is unreliable. Profiling reveals that candidate retrieval consumes 70% of pipeline latency, or that data loading (not model training) is the bottleneck. Amdahl's Law then tells you the maximum payoff from optimizing each component. The combination of profiling and Amdahl's Law is the quantitative discipline that separates efficient engineering from guesswork.
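
The discipline can be sketched in a few lines: time each stage, then apply Amdahl's Law to rank optimization targets. The stage names and durations below are hypothetical stand-ins (sleeps in place of real work), not a profile of any actual system:

```python
import time

def profile(stage_fn):
    """Time one pipeline stage with a monotonic high-resolution clock."""
    start = time.perf_counter()
    stage_fn()
    return time.perf_counter() - start

# Hypothetical pipeline; sleep durations stand in for real work.
stages = {
    "retrieval": lambda: time.sleep(0.07),
    "ranking":   lambda: time.sleep(0.02),
    "logging":   lambda: time.sleep(0.01),
}

timings = {name: profile(fn) for name, fn in stages.items()}
total = sum(timings.values())

for name, t in sorted(timings.items(), key=lambda kv: -kv[1]):
    frac = t / total
    # Amdahl: even infinitely speeding up this stage caps total speedup at 1/(1-frac).
    print(f"{name:>10}: {frac:5.1%} of latency, max pipeline speedup {1 / (1 - frac):.1f}x")
```

Read top to bottom, the output is a ranked to-do list: the 70% stage is worth up to 3.3x end to end, while the 10% stage can never deliver more than 1.1x no matter how much effort it absorbs.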