Chapter 5: Exercises

Exercises are graded by difficulty:

  • One star (*): Apply the technique from the chapter to a new dataset or scenario
  • Two stars (**): Extend the technique or combine it with a previous chapter's methods
  • Three stars (***): Derive a result, implement from scratch, or design a system component
  • Four stars (****): Research-level problems that connect to open questions in the field


Asymptotic Analysis

Exercise 5.1 (*)

For each of the following operations, state the time complexity in Big-O notation in terms of the given parameters:

(a) Computing the Gram matrix $K_{ij} = x_i^T x_j$ for $n$ data points in $\mathbb{R}^d$.

(b) Evaluating a random forest with $T$ trees, each of depth at most $\log_2 n$, on a single input with $d$ features.

(c) Running one epoch of stochastic gradient descent on a logistic regression model with $n$ training samples and $d$ features, using mini-batches of size $b$.

(d) Computing the full singular value decomposition of an $n \times d$ matrix (assume $n > d$).

(e) Computing the top-$r$ truncated SVD of the same $n \times d$ matrix using randomized methods.


Exercise 5.2 (*)

A recommendation system has $n = 5{,}000{,}000$ users and $m = 500{,}000$ items. The user-item interaction matrix has density $\rho = 0.02\%$ (i.e., $\rho \cdot nm$ non-zero entries).

(a) How much memory (in GB) does the dense matrix require in float32?

(b) How much memory does the sparse matrix require in CSR format? (Account for the data array, column index array, and row pointer array.)

(c) What is the speedup of sparse matrix-vector product over dense matrix-vector product for this matrix?

(d) At what density $\rho^*$ does the CSR representation use the same amount of memory as the dense one?
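Parts (a), (b), and (d) are pure bookkeeping. A small calculator may help check your arithmetic (a sketch assuming 4-byte float32 values and 4-byte int32 indices, as in SciPy's default CSR layout):

```python
def csr_bytes(n_rows: int, nnz: int, data_bytes: int = 4, index_bytes: int = 4) -> int:
    """Memory for CSR: data array + column index array + row pointer array."""
    data = nnz * data_bytes               # non-zero values
    indices = nnz * index_bytes           # one column index per non-zero
    indptr = (n_rows + 1) * index_bytes   # one pointer per row, plus one
    return data + indices + indptr

def dense_bytes(n_rows: int, n_cols: int, data_bytes: int = 4) -> int:
    return n_rows * n_cols * data_bytes

n, m, rho = 5_000_000, 500_000, 0.0002
nnz = int(rho * n * m)                               # 500M non-zeros
print(f"dense:  {dense_bytes(n, m) / 1e9:,.0f} GB")  # 10,000 GB
print(f"sparse: {csr_bytes(n, nnz) / 1e9:,.2f} GB")  # ~4.02 GB
```

Part (d) then reduces to solving `dense_bytes == csr_bytes` for the number of non-zeros.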


Exercise 5.3 (**)

Consider the following Python function that computes pairwise Euclidean distances:

import numpy as np

def pairwise_distances_naive(X: np.ndarray) -> np.ndarray:
    """Compute pairwise Euclidean distances (naive implementation)."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = np.sqrt(np.sum((X[i] - X[j]) ** 2))
            D[j, i] = D[i, j]
    return D

(a) What is the time complexity of this function in terms of $n$ (number of points) and $d$ (dimensionality)?

(b) Rewrite the function using the identity $\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2x^T y$ and vectorized numpy operations. What is the time complexity of your implementation?

(c) Benchmark both implementations for $n \in \{100, 500, 1000, 2000\}$ and $d = 128$. Plot the timing results and verify they match your complexity analysis.
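For part (b), one possible vectorization of the stated identity looks like the following (a sketch; the clamp guards against small negative values produced by floating-point round-off):

```python
import numpy as np

def pairwise_distances_vectorized(X: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances via ||x||^2 + ||y||^2 - 2 x^T y."""
    sq = np.sum(X ** 2, axis=1)                      # (n,) squared norms
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(D2, 0.0, out=D2)                      # clamp round-off negatives
    return np.sqrt(D2)
```

The asymptotic complexity is unchanged; the win comes from replacing the Python-level double loop with a single BLAS matrix multiply.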


Exercise 5.4 (**)

The training complexity of k-means is $O(nKde)$ where $n$ is the number of data points, $K$ is the number of clusters, $d$ is the dimensionality, and $e$ is the number of iterations.

(a) Mini-batch k-means uses a random subset of size $b \ll n$ at each iteration. What is its per-iteration complexity?

(b) If mini-batch k-means requires $e'$ iterations to converge (where typically $e' > e$ due to noisier updates), under what condition on $e'/e$ is mini-batch k-means faster than standard k-means?

(c) StreamRec wants to cluster 5M user embeddings ($d = 256$) into $K = 1{,}000$ clusters for IVF index construction. Standard k-means takes 25 iterations to converge. Mini-batch k-means (with $b = 10{,}000$) takes 100 iterations. Which is faster? By what factor?
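Parts (b) and (c) reduce to comparing $nKde$ against $bKde'$. A tiny helper makes the comparison explicit (illustrative numbers below, not the exercise's):

```python
def kmeans_distance_ops(n_points: int, K: int, d: int, iters: int) -> int:
    """Scalar operations ~ points x clusters x dims x iterations."""
    return n_points * K * d * iters

# example: full-batch vs. a mini-batch run that needs 4x the iterations
full = kmeans_distance_ops(1_000_000, 100, 64, 25)
mini = kmeans_distance_ops(10_000, 100, 64, 100)
print(f"mini-batch speedup: {full / mini:.0f}x")
```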


Memory and Hardware

Exercise 5.5 (*)

An A100 GPU has 80 GB of HBM2e memory. You want to fine-tune a model with the following memory requirements:

  • Model parameters: 7B parameters in float16 (2 bytes each)
  • Optimizer states (Adam): 2 copies of parameters in float32
  • Gradient buffer: 1 copy of parameters in float16
  • Activation memory: estimated at 4 GB for batch size 1

(a) Compute the total memory required. Does it fit on a single A100?

(b) If you use gradient checkpointing (trading compute for memory, reducing activation memory by 5x), does the model now fit?

(c) If you use LoRA (freezing the base model and training only rank-16 adapter matrices for the attention projections), estimate the new optimizer state memory. Assume the model has 32 layers, each with 4 attention projection matrices of dimension $4096 \times 4096$.
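For part (a), the byte counting can be scripted (a sketch that counts only the four buffers listed above):

```python
def finetune_memory_gb(n_params: float, activation_gb: float) -> float:
    """Sum the four training-time buffers for mixed-precision Adam."""
    weights = n_params * 2      # fp16 parameters, 2 bytes each
    adam    = n_params * 4 * 2  # two fp32 optimizer states (m and v)
    grads   = n_params * 2      # fp16 gradient buffer
    return (weights + adam + grads) / 1e9 + activation_gb

print(finetune_memory_gb(7e9, 4.0), "GB")  # 88.0 GB, vs. 80 GB of HBM
```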


Exercise 5.6 (**)

Compute the arithmetic intensity (FLOPs per byte) of the following operations, assuming float16 (2 bytes per element):

(a) Matrix-vector product $y = Ax$ where $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$.

(b) Matrix-matrix product $C = AB$ where $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$, $C \in \mathbb{R}^{m \times n}$.

(c) Element-wise addition $z = x + y$ where $x, y, z \in \mathbb{R}^n$.

(d) For an A100 GPU with peak 312 TFLOP/s (fp16) and 2039 GB/s memory bandwidth, determine which of (a)-(c) are compute-bound vs. memory-bound at $m = n = k = 4096$.
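A helper for parts (b) and (d), counting one read or write per matrix operand (a common simplifying assumption that ignores caching and re-use):

```python
def matmul_intensity(m: int, n: int, k: int, elem_bytes: int = 2) -> float:
    """FLOPs per byte for C = A @ B, reading A and B once and writing C once."""
    flops = 2 * m * n * k
    bytes_moved = elem_bytes * (m * k + k * n + m * n)
    return flops / bytes_moved

ridge = 312e12 / 2039e9                     # ~153 FLOPs/byte on an A100 (fp16)
print(matmul_intensity(4096, 4096, 4096))   # ~1365: well above the ridge
print(matmul_intensity(4096, 1, 4096))      # <1: matvec is memory-bound
```

An operation is compute-bound when its intensity exceeds the ridge point (peak FLOP/s divided by bandwidth), and memory-bound otherwise.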


Exercise 5.7 (**)

The KV-cache for a transformer model with $L$ layers, model dimension $d$, and sequence length $t$ requires $2 \times L \times t \times d \times \text{sizeof(dtype)}$ bytes.

(a) For LLaMA-7B ($L = 32$, $d = 4096$, float16), compute the KV-cache size at $t = 2048$ and $t = 32768$.

(b) An A100 has 80 GB. After loading the model weights (7B params $\times$ 2 bytes = 14 GB), how many concurrent requests can the GPU serve at $t = 2048$? At $t = 32768$?

(c) Propose two strategies to increase the number of concurrent requests. For each, state the tradeoff.
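The stated formula translates directly into a helper for parts (a) and (b):

```python
def kv_cache_gb(L: int, d: int, t: int, dtype_bytes: int = 2) -> float:
    """2 (K and V) x layers x tokens x model dim x bytes per element."""
    return 2 * L * t * d * dtype_bytes / 1e9

print(kv_cache_gb(32, 4096, 2048))    # ~1.07 GB per request
print(kv_cache_gb(32, 4096, 32768))   # ~17.2 GB per request
```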


Approximate Methods

Exercise 5.8 (*)

An LSH index uses $L = 10$ hash tables, each with $k = 16$-bit hash codes (random hyperplane projections for cosine similarity). The database contains $n = 1{,}000{,}000$ vectors.

(a) What is the expected number of candidates per query, assuming uniformly distributed hash codes?

(b) If the query vector has cosine similarity 0.8 with its true nearest neighbor, what is the probability that this neighbor appears in at least one of the $L$ hash tables? (Use the collision probability formula for random hyperplane LSH.)

(c) How would you adjust $L$ and $k$ to achieve >99% probability of finding a neighbor with cosine similarity $\geq 0.8$?
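For parts (b) and (c), recall that a random hyperplane separates two vectors at angle $\theta$ with probability $\theta/\pi$. A sketch of the resulting hit probability:

```python
import math

def lsh_hit_probability(sim: float, k: int, L: int) -> float:
    """P(neighbor collides in at least one of L tables of k random hyperplanes)."""
    p = 1.0 - math.acos(sim) / math.pi   # single-hyperplane collision probability
    return 1.0 - (1.0 - p ** k) ** L     # at least one full k-bit match

print(lsh_hit_probability(0.8, 16, 10))  # roughly 0.23 with these parameters
```

Sweeping `L` upward (or `k` downward) shows how to trade candidate-set size for recall in part (c).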


Exercise 5.9 (**)

Implement an HNSW-like greedy graph search (simplified) with the following specification:

from typing import Dict, List

import numpy as np

def greedy_search(
    query: np.ndarray,
    graph: Dict[int, List[int]],
    vectors: np.ndarray,
    entry_point: int,
    ef: int = 64
) -> List[int]:
    """Greedy search on a nearest-neighbor graph.

    Starting from entry_point, greedily traverse the graph, maintaining
    a priority queue of the ef closest visited nodes. Stop when no
    unvisited neighbor is closer than the farthest node in the queue.

    Args:
        query: Query vector of shape (dim,).
        graph: Adjacency list mapping node ID to neighbor IDs.
        vectors: Array of all vectors, shape (n, dim).
        entry_point: Starting node for the search.
        ef: Search beam width.

    Returns:
        List of node IDs for the ef closest nodes found.
    """
    pass  # Your implementation here

(a) Implement the function.

(b) Build a random k-regular graph (each node connected to $k = 32$ random nodes) on 10,000 random vectors in $\mathbb{R}^{128}$.

(c) Benchmark the recall@10 and query latency of your greedy search against brute-force. How does recall change with ef?
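For part (c), a brute-force reference and a recall metric might look like this (helper names are my own):

```python
import numpy as np

def brute_force_topk(query: np.ndarray, vectors: np.ndarray, k: int = 10) -> np.ndarray:
    """Exact k nearest neighbors by squared L2 distance."""
    d2 = np.sum((vectors - query) ** 2, axis=1)
    return np.argsort(d2)[:k]

def recall_at_k(approx_ids, exact_ids, k: int = 10) -> float:
    """Fraction of the true top-k recovered by the approximate search."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k
```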


Exercise 5.10 (**)

Product quantization splits a $d$-dimensional vector into $m$ sub-vectors of dimension $d/m$ each, and quantizes each sub-vector to its nearest centroid in a codebook of size $K$.

(a) What is the compression ratio (bytes for the original vector / bytes for the PQ code) when $d = 256$, $m = 32$, $K = 256$, and the original dtype is float32?

(b) Implement asymmetric distance computation (ADC): given a full-precision query vector and a PQ-encoded database vector, compute the approximate L2 distance using pre-computed sub-vector centroid distances.

(c) Benchmark the accuracy of PQ-based nearest neighbor search against exact search on 100,000 random vectors in $\mathbb{R}^{256}$, varying $m \in \{8, 16, 32, 64\}$. Plot recall@10 vs. compression ratio.
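Part (a) is a ratio of code sizes. A helper (ignoring the codebooks themselves, whose storage is amortized over the whole database):

```python
import math

def pq_compression_ratio(d: int, m: int, K: int, orig_bytes: int = 4) -> float:
    """Bytes of the raw vector divided by bytes of its PQ code."""
    code_bits = m * math.ceil(math.log2(K))   # one centroid id per sub-vector
    return (d * orig_bytes) / (code_bits / 8)

print(pq_compression_ratio(256, 32, 256))
```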


Exercise 5.11 (***)

The Count-Min Sketch has additive error bounded by $\varepsilon n$ with probability $\geq 1 - \delta$, using $O(\frac{1}{\varepsilon} \ln \frac{1}{\delta})$ space.

(a) Prove that the Count-Min Sketch never underestimates: $\hat{c}(x) \geq c(x)$ for all items $x$.

(b) Prove the error bound: $\Pr[\hat{c}(x) \leq c(x) + \varepsilon n] \geq 1 - \delta$ when $w = \lceil e / \varepsilon \rceil$ and $d = \lceil \ln(1/\delta) \rceil$.

(c) The Count-Min Sketch overestimates rare items. Propose and implement a conservative update strategy that reduces overestimation while maintaining the no-underestimate guarantee.
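A minimal baseline to build part (c) on (a sketch; the tuple-hash rows below stand in for the pairwise-independent hash functions the proofs assume):

```python
import numpy as np

class CountMinSketch:
    """d independent rows of w counters; the estimate is the row-wise minimum."""

    def __init__(self, w: int, d: int, seed: int = 0):
        self.w = w
        self.table = np.zeros((d, w), dtype=np.int64)
        rng = np.random.default_rng(seed)
        self.salts = [int(s) for s in rng.integers(1, 2**31, size=d)]

    def _cols(self, x):
        return [hash((s, x)) % self.w for s in self.salts]

    def update(self, x, count: int = 1):
        for row, col in enumerate(self._cols(x)):
            self.table[row, col] += count

    def query(self, x) -> int:
        # taking the min over rows never underestimates the true count (part a)
        return int(min(self.table[row, col] for row, col in enumerate(self._cols(x))))
```

The conservative update of part (c) modifies `update` so that only counters equal to the current minimum are raised.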


Transformer Complexity

Exercise 5.12 (*)

For a transformer with $L$ layers, model dimension $d$, $h$ attention heads, FFN expansion factor $r = 4$, and sequence length $n$:

(a) Write the total FLOPs for a forward pass as a function of $L$, $d$, $h$, $r$, and $n$.

(b) At what sequence length $n^*$ does the $O(n^2 d)$ attention term dominate the $O(nd^2)$ linear projection term?

(c) For GPT-3 ($L = 96$, $d = 12288$, $h = 96$, $r = 4$, vocab = 50257), compute the total FLOPs for a forward pass at $n = 2048$.

(d) How does the FLOP count change if the FFN expansion factor is reduced from 4 to 8/3 (as in LLaMA's SwiGLU)?


Exercise 5.13 (**)

During autoregressive generation with KV-cache:

(a) Derive the per-token FLOPs at generation step $t$ (not the prefill step, but each subsequent token).

(b) Show that the total FLOPs for generating $T$ tokens (after a prompt of length $P$) is $O(T \cdot (P + T) \cdot d + T \cdot d^2)$.

(c) For LLaMA-7B generating 512 tokens after a 1024-token prompt, compute the total generation FLOPs. Compare with the prefill FLOPs for the 1024-token prompt.


Exercise 5.14 (***)

Linear attention replaces the $\text{softmax}(QK^T)V$ computation with $\phi(Q)(\phi(K)^T V)$, where $\phi$ is a feature map. By computing $\phi(K)^T V$ first (a $d \times d$ matrix), the per-token computation drops from $O(nd)$ to $O(d^2)$.

(a) Verify that standard attention cannot be rewritten in this form (i.e., softmax attention does not factor as an outer product of features).

(b) Implement linear attention with the ELU+1 feature map: $\phi(x) = \text{ELU}(x) + 1$. Compare its output with standard attention on a random sequence.

(c) Benchmark the wall-clock time of linear attention vs. standard attention for sequence lengths $n \in \{256, 1024, 4096, 16384\}$ with $d = 128$. At what $n$ does linear attention become faster?

(d) Discuss the quality tradeoff: why has linear attention not replaced standard attention in practice, despite its better complexity?


Profiling and System Design

Exercise 5.15 (*)

A training pipeline has the following stage timings:

  Stage            Time (s)
  Data loading     12
  Preprocessing    8
  Forward pass     25
  Backward pass    45
  Gradient sync    15
  Optimizer step   5

(a) What fraction of time is spent on computation (forward + backward) vs. overhead?

(b) Using Amdahl's Law, what is the maximum speedup achievable by parallelizing only the forward and backward passes across 8 GPUs?

(c) If data loading can be fully overlapped with computation (via prefetching), what is the new maximum speedup from 8-GPU parallelism?
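Parts (b) and (c) are direct applications of Amdahl's Law; a helper for the arithmetic:

```python
def amdahl_speedup(total: float, parallel: float, workers: int) -> float:
    """Speedup when only the `parallel` portion is split across workers."""
    serial = total - parallel
    return total / (serial + parallel / workers)

stages = {"load": 12, "pre": 8, "fwd": 25, "bwd": 45, "sync": 15, "opt": 5}
total = sum(stages.values())                                # 110 s
print(amdahl_speedup(total, stages["fwd"] + stages["bwd"], 8))
```

For part (c), subtract the overlapped data-loading time from `total` before recomputing.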


Exercise 5.16 (**)

Meridian Financial is comparing two credit scoring models:

  Property                  Gradient Boosting   Transformer
  AUC                       0.821               0.833
  Inference latency (p50)   2 ms                85 ms
  Inference latency (p99)   8 ms                340 ms
  Model size                12 MB               450 MB
  Training time             15 min              6 hours
  Interpretability          High (SHAP)         Low

The SLA requires p99 latency $\leq$ 50ms. The system serves 1,000 requests/second.

(a) Can the transformer model meet the SLA without modification? If not, what techniques could reduce its latency?

(b) Quantize the transformer from float32 to int8. Estimate the new latency assuming 4x throughput improvement and proportional latency reduction. Does it now meet the SLA?

(c) If model distillation reduces the transformer to a 3-layer model with $d = 128$ (from 12 layers, $d = 512$), estimate the latency reduction factor. Would the distilled model meet the SLA?

(d) Write a cost-benefit analysis: is the 0.012 (1.2-point) AUC improvement worth the additional engineering effort to make the transformer production-ready?


Exercise 5.17 (**)

Implement a profiler class that wraps arbitrary Python functions and produces a roofline analysis:

class RooflineProfiler:
    """Profile a computation and analyze it against the roofline model.

    Usage:
        profiler = RooflineProfiler(peak_gflops=312_000, bandwidth_gb_s=2039)
        result = profiler.profile(my_function, args, flops=..., bytes_accessed=...)
    """
    pass  # Your implementation here

(a) Implement the class with methods for profiling individual operations and generating a roofline plot.

(b) Profile the following operations: dense matrix multiply, sparse matrix-vector product, batch normalization, and multi-head attention. Plot all four on the roofline diagram.

(c) Based on your roofline analysis, which operations would benefit most from kernel fusion? Which would benefit most from a faster GPU?


Exercise 5.18 (***)

Design a complete cost estimation framework for StreamRec that takes the following inputs and produces a hardware recommendation:

  • Number of users, items, and interactions
  • Embedding dimension
  • Model architecture (MF, two-tower, transformer)
  • Serving SLA (latency, throughput, availability)
  • Training frequency (hourly, daily, weekly)
  • Budget (monthly cloud spend)

class MLCostEstimator:
    """End-to-end cost estimation for ML systems."""

    def estimate_training_cost(self, ...) -> TrainingCostEstimate:
        """Estimate training time, hardware, and cloud cost."""
        pass

    def estimate_serving_cost(self, ...) -> ServingCostEstimate:
        """Estimate serving hardware and cost per request."""
        pass

    def recommend_architecture(self, ...) -> ArchitectureRecommendation:
        """Recommend model architecture given constraints."""
        pass

(a) Implement the framework.

(b) Run it for StreamRec at three scales: 1M users, 10M users, and 100M users.

(c) At what user count does a two-tower model become more cost-effective than matrix factorization? At what count does a transformer-based model become justifiable?


Streaming and Online Algorithms

Exercise 5.19 (*)

StreamRec receives a stream of user engagement events. Implement a HyperLogLog sketch to estimate the number of distinct users who engaged with content in the last hour, using registers of $O(\log \log n)$ bits each.

(a) Implement the HyperLogLog algorithm with $m = 1024$ registers.

(b) Process a stream of 1M events from 50,000 distinct users. Report the estimated cardinality and the relative error.

(c) How does the error change with the number of registers $m$? Plot the relative error vs. $m$ for $m \in \{16, 64, 256, 1024, 4096\}$.


Exercise 5.20 (**)

The reservoir sampling algorithm maintains a uniform random sample of size $k$ from a stream of unknown length. Prove that after processing $n \geq k$ elements, each of the $n$ stream elements appears in the reservoir with probability $k/n$.

(a) Write the proof by induction on $n$.

(b) Implement reservoir sampling and use it to maintain a sample of 1,000 StreamRec interaction events from a stream of 10M events. Verify empirically that the sample distribution matches the stream distribution.

(c) Weighted reservoir sampling assigns a weight $w_i$ to each element and samples proportionally. Implement the A-Res algorithm (Efraimidis & Spirakis, 2006) and use it to oversample rare engagement events (completions) relative to common events (impressions).
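A starting point for part (b) (Algorithm R; the induction in part (a) tracks the `j < k` branch below):

```python
import random

def reservoir_sample(stream, k: int, seed: int = 0) -> list:
    """Algorithm R: uniform random sample of size k from a stream."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(i + 1)    # uniform index in [0, i]
            if j < k:
                reservoir[j] = item     # keep item with probability k/(i+1)
    return reservoir
```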


Exercise 5.21 (***)

Implement a Frequent Items algorithm (Misra-Gries or Space-Saving) that identifies items appearing more than $n/k$ times in a stream, using $O(k)$ space.

(a) Implement the Misra-Gries algorithm.

(b) Prove that it identifies all items with frequency above $n/k$ (no false negatives) but may have false positives.

(c) Apply it to a stream of 10M StreamRec content view events with a Zipfian distribution. How many false positives does it produce for $k = 100$?
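A compact version for part (a) (a sketch; the returned counter values underestimate true frequencies by at most $n/k$, which is the key fact for part (b)):

```python
def misra_gries(stream, k: int) -> dict:
    """Track up to k-1 candidate heavy hitters with counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # decrement all counters; drop any that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```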


Integrated Problems

Exercise 5.22 (**)

Combine complexity analysis with information theory (Chapter 4). The mutual information between a feature $X$ and a target $Y$ can be estimated using the KSG estimator, which requires computing k-nearest neighbors for each sample in a joint $(X, Y)$ space.

(a) What is the time complexity of computing MI via the KSG estimator with brute-force k-NN, for $n$ samples in $d$-dimensional joint space?

(b) If StreamRec has 47 features and 2M samples, estimate the total time to compute MI for all features using brute-force k-NN.

(c) Propose an approximate MI estimator that replaces brute-force k-NN with ANN. Implement it using FAISS and benchmark the speedup vs. accuracy tradeoff.


Exercise 5.23 (***)

In Chapter 1, you implemented SVD-based matrix factorization. The truncated SVD of an $n \times m$ matrix to rank $r$ costs $O(nmr)$.

(a) For StreamRec ($n = 5\text{M}$, $m = 200\text{K}$, $r = 64$), estimate the FLOP count and wall-clock time for truncated SVD on a single CPU (assume 50 GFLOP/s).

(b) Randomized SVD (Halko, Martinsson, Tropp, 2011) reduces the cost to $O(nm \log r + (n + m) r^2)$. Compute the speedup at StreamRec scale.

(c) Implement randomized SVD and verify that it produces embeddings whose cosine similarities are within 1% of the exact SVD embeddings, for the top-100 most similar item pairs.
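A compact version of the Halko et al. range-finder for part (c) (a sketch without the power iterations the paper recommends for slowly decaying spectra):

```python
import numpy as np

def randomized_svd(A: np.ndarray, r: int, oversample: int = 10, seed: int = 0):
    """Project onto a random range estimate of A, then take an exact small SVD."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], r + oversample))
    Q, _ = np.linalg.qr(A @ Omega)                 # orthonormal basis for range(A @ Omega)
    U_small, S, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :r], S[:r], Vt[:r]
```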


Exercise 5.24 (***)

Design and implement a benchmark that compares the end-to-end latency of three recommendation architectures:

  1. Matrix factorization + brute force retrieval: Chapter 1 embeddings, exact inner product search
  2. Matrix factorization + ANN retrieval: Same embeddings, FAISS HNSW
  3. Two-tower neural network + ANN retrieval: Separate user and item encoders (2-layer MLPs)

For each architecture, measure:

  • Index build time
  • Median and p99 query latency
  • Recall@20 compared to brute force
  • Memory footprint

Produce a summary table and recommend the best architecture for (a) 50K items, (b) 500K items, (c) 5M items.


Exercise 5.25 (****)

The Johnson-Lindenstrauss lemma states that $n$ points in $\mathbb{R}^d$ can be projected to $\mathbb{R}^k$ with $k = O(\varepsilon^{-2} \log n)$, preserving all pairwise distances to within a $(1 \pm \varepsilon)$ factor.

(a) Prove the lemma for the case of random Gaussian projections (Hint: use sub-Gaussian concentration).

(b) Implement JL projection and verify empirically: for 10,000 points in $\mathbb{R}^{1000}$, find the minimum $k$ such that all pairwise distances are preserved within 10%.

(c) Apply JL projection as a preprocessing step for k-NN classification on a real dataset (e.g., MNIST). Plot classification accuracy vs. projected dimension $k$.

(d) Connect JL to the curse of dimensionality: why does k-NN degrade in high dimensions, and how does JL projection help (or not help)?
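For part (b), the projection itself is a one-liner (a sketch; the $1/\sqrt{k}$ scaling makes squared norms correct in expectation):

```python
import numpy as np

def jl_project(X: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Random Gaussian projection of the rows of X down to k dimensions."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ R
```

Sweeping `k` and checking the worst-case pairwise distance ratio gives the empirical minimum dimension asked for.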


Exercise 5.26 (****)

The communication complexity of distributed SGD with $P$ workers and model size $d$ is $O(Pd)$ per iteration (AllReduce). Gradient compression techniques reduce this.

(a) Implement top-k sparsification: each worker sends only the $k$ largest gradient components (with $k = 0.001d$). What is the new communication cost?

(b) Implement error feedback (Stich et al., 2018): accumulate the unsent gradient components and add them to the next iteration's gradient. Prove that this converges (in expectation) at the same rate as uncompressed SGD.

(c) Benchmark distributed training of a simple MLP with $d = 10^6$ parameters, comparing uncompressed AllReduce, top-k sparsification (with and without error feedback), and random sparsification. Plot convergence (loss vs. wall-clock time).

(d) At what ratio of computation to communication bandwidth does gradient compression become beneficial? Derive the condition analytically and verify empirically.
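For parts (a) and (b), one shape for the compressor (a sketch; the `frac` parameter and the returned residual are my own naming):

```python
import numpy as np

def topk_sparsify(grad: np.ndarray, frac: float = 0.001):
    """Keep only the largest-magnitude fraction of gradient components."""
    k = max(1, int(frac * grad.size))
    idx = np.argpartition(np.abs(grad), -k)[-k:]   # indices of the top-k entries
    sparse = np.zeros_like(grad)
    sparse[idx] = grad[idx]
    error = grad - sparse                          # residual for error feedback
    return sparse, error
```

With error feedback, `error` is added to the next iteration's gradient before sparsifying again.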


Exercise 5.27 (****)

Optimal transport between two discrete distributions of $n$ points each requires solving a linear program with $O(n^2)$ variables. The Sinkhorn algorithm approximates the solution using $O(n^2 / \varepsilon)$ operations, where $\varepsilon$ is the entropic regularization parameter.

(a) Implement the Sinkhorn algorithm for computing the regularized optimal transport distance between two point clouds.

(b) Apply it to measuring distribution shift: compare the embedding distributions of StreamRec users in January vs. February. Use this to detect concept drift.

(c) The sliced Wasserstein distance approximates the Wasserstein distance using $O(Ln \log n)$ operations, where $L$ is the number of random projections. Implement it and compare accuracy with Sinkhorn at varying $n$.

(d) For monitoring concept drift in a production system processing 100K samples per batch, which approach (Sinkhorn vs. sliced Wasserstein) is feasible? Justify with complexity analysis.
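A minimal version of the Sinkhorn iteration for part (a) (a sketch without the log-domain stabilization that production implementations need for small $\varepsilon$):

```python
import numpy as np

def sinkhorn(a: np.ndarray, b: np.ndarray, C: np.ndarray,
             eps: float = 0.1, n_iters: int = 200) -> float:
    """Entropic OT cost: alternate row/column scalings of K = exp(-C / eps)."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)               # match column marginals
        u = a / (K @ v)                 # match row marginals
    P = u[:, None] * K * v[None, :]     # transport plan
    return float(np.sum(P * C))         # regularized OT cost
```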