Chapter 5: Quiz

Test your understanding of computational complexity and scalability concepts. Answers follow each question.


Question 1

What is the time complexity of computing the pairwise distance matrix for $n$ points in $\mathbb{R}^d$?

**Answer:** $O(n^2 d)$. There are $\binom{n}{2} = O(n^2)$ pairs, and each distance computation requires $O(d)$ operations (subtraction, squaring, and summing over $d$ dimensions).
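The counting argument can be checked directly. A minimal pure-Python sketch (a real implementation would vectorize this with a BLAS-backed library):

```python
import math

def pairwise_distances(points):
    """Full n x n Euclidean distance matrix.

    The nested loop visits O(n^2) pairs, and each distance sums over
    d coordinates, giving O(n^2 d) total work.
    """
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # O(n^2) pairs
            sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))  # O(d)
            dist[i][j] = dist[j][i] = math.sqrt(sq)
    return dist

D = pairwise_distances([(0.0, 0.0), (3.0, 4.0), (0.0, 4.0)])
print(D[0][1])  # 5.0
```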

Question 2

True or False: An $O(n^2)$ algorithm is always faster than an $O(n^3)$ algorithm for the same input.

**Answer:** **False.** Big-O describes asymptotic behavior — the growth rate as $n \to \infty$. For small $n$, the constant factors matter. An $O(n^3)$ algorithm with a small constant (e.g., $0.001 n^3$) can be faster than an $O(n^2)$ algorithm with a large constant (e.g., $1000 n^2$) for all $n < 10^6$. The asymptotic advantage of $O(n^2)$ only dominates at sufficiently large $n$.
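To make the constant-factor point concrete, here is a toy cost model using the hypothetical constants from the answer:

```python
def cubic_cost(n):      # hypothetical O(n^3) algorithm, small constant
    return 0.001 * n**3

def quadratic_cost(n):  # hypothetical O(n^2) algorithm, large constant
    return 1000 * n**2

# Crossover where 0.001*n^3 = 1000*n^2, i.e. n = 1,000,000.
print(cubic_cost(10_000) < quadratic_cost(10_000))            # True
print(cubic_cost(100_000_000) < quadratic_cost(100_000_000))  # False
```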

Question 3

k-means clustering is NP-hard in general. What does Lloyd's algorithm actually guarantee?

**Answer:** Lloyd's algorithm guarantees convergence to a **local minimum** of the k-means objective (sum of squared distances from each point to its assigned centroid). It does not guarantee the global minimum. The quality of the solution depends on initialization, which is why practical implementations use careful seeding (k-means++) and multiple random restarts, returning the best result. The NP-hardness means that no polynomial-time algorithm can guarantee the global optimum in all cases (unless P = NP).
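A minimal sketch of Lloyd's algorithm in pure Python, with a deliberately naive initialization (the first $k$ points) to underline that the result is only a local minimum; practical code would use k-means++ and restarts:

```python
def lloyd(points, k, iters=50):
    """Lloyd's algorithm: alternate assignment and update steps.

    Converges to a local minimum of the k-means objective; the result
    depends on initialization (naively the first k points here).
    """
    centroids = list(points[:k])
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[j]
                     for j, cl in enumerate(clusters)]
    return centroids

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(lloyd(data, k=2))  # two centroids, one per well-separated group
```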

Question 4

A Count-Min Sketch with width $w = 1000$ and depth $d = 5$ processes a stream of $n = 10^7$ events. What is the additive error bound, and with what probability?

**Answer:** The additive error bound is $\varepsilon n$ where $\varepsilon = e / w = e / 1000 \approx 0.00272$. So the error is at most $\varepsilon n \approx 27{,}183$ events. This holds with probability $\geq 1 - e^{-d} = 1 - e^{-5} \approx 0.9933$. The sketch guarantees $\hat{c}(x) \leq c(x) + 27{,}183$ for any item $x$ with probability at least 99.3%.
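A minimal sketch of the data structure, using SHA-256-derived hashes for simplicity (real implementations use cheap pairwise-independent hash families):

```python
import hashlib
import math

class CountMinSketch:
    """d rows of w counters; estimates never under-count, and the
    over-count is at most (e/w)*n with probability >= 1 - e^(-d)."""

    def __init__(self, width=1000, depth=5):
        self.w, self.d = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, item, row):
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def add(self, item):
        for r in range(self.d):
            self.table[r][self._bucket(item, r)] += 1

    def estimate(self, item):
        # The minimum across rows is the least-collided counter.
        return min(self.table[r][self._bucket(item, r)] for r in range(self.d))

print(round(math.e / 1000 * 10**7))  # 27183: the additive error bound eps*n
```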

Question 5

The transformer self-attention mechanism has complexity $O(n^2 d)$. For a model with $d = 4096$, at approximately what sequence length does the $O(n^2 d)$ attention term equal the $O(nd^2)$ linear projection term?

**Answer:** The attention term is $\approx 4n^2 d$ FLOPs and the linear projection term is $\approx 8nd^2$ FLOPs (QKV + output projection). Setting $4n^2 d = 8nd^2$ and solving: $n = 2d = 8{,}192$. At $n \approx 8{,}192$, the two terms are roughly equal. Below this, linear projections dominate; above this, attention dominates. (The exact crossover depends on the precise FLOP accounting, but $n \approx 2d$ is the correct order.)
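The crossover can be verified with the FLOP model from the answer (the constants 4 and 8 are the rough accounting used above):

```python
d = 4096

def attention_flops(n, d):
    return 4 * n * n * d    # QK^T scores plus the attention-weighted sum of V

def projection_flops(n, d):
    return 8 * n * d * d    # Q, K, V projections plus the output projection

n_cross = 2 * d             # solving 4*n^2*d = 8*n*d^2
print(n_cross)                                                      # 8192
print(attention_flops(n_cross, d) == projection_flops(n_cross, d))  # True
```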

Question 6

Explain the difference between data parallelism and model parallelism in one sentence each.

**Answer:** **Data parallelism:** The model is replicated on each device, and the training data is partitioned across devices so each device processes a different mini-batch and gradients are aggregated. **Model parallelism:** The model is partitioned across devices so each device holds a different subset of the model's parameters, and data flows sequentially through the devices.

Question 7

If 10% of a training pipeline is serial (non-parallelizable), what is the maximum speedup achievable with 64 GPUs according to Amdahl's Law?

**Answer:**
$$\text{Speedup} = \frac{1}{0.10 + \frac{0.90}{64}} = \frac{1}{0.10 + 0.01406} = \frac{1}{0.11406} \approx 8.77\times$$
The maximum speedup is approximately $8.8\times$, far less than the $64\times$ that the GPU count might suggest. The 10% serial fraction limits the achievable speedup to at most $1/0.10 = 10\times$, regardless of the number of GPUs.
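As a one-liner, including the limiting behavior as processor count grows:

```python
def amdahl_speedup(serial_fraction, processors):
    """Amdahl's Law: serial work caps the speedup at 1/serial_fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

print(round(amdahl_speedup(0.10, 64), 2))     # 8.77
print(round(amdahl_speedup(0.10, 10**9), 2))  # 10.0: the asymptotic ceiling
```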

Question 8

What is the KV-cache, and why does it reduce autoregressive generation complexity from $O(T^3 d)$ to $O(T^2 d)$?

**Answer:** The **KV-cache** stores the key and value projections for all previously generated tokens, avoiding redundant recomputation. Without the cache, generating token $t$ requires recomputing attention over all $t$ previous tokens, including recomputing their Q, K, V projections — resulting in $O(t^2 d)$ work per token and $O(T^3 d)$ total for $T$ tokens. With the cache, only the new token's query is computed; the keys and values for previous tokens are retrieved from the cache. This reduces per-token attention to $O(td)$, and total generation cost to $\sum_{t=1}^{T} O(td) = O(T^2 d)$.
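The two totals can be compared numerically; with the per-token costs from the answer (constants dropped), the ratio works out to exactly $(2T+1)/3$:

```python
d, T = 4096, 1024

# No cache: step t redoes projections and attention over all t tokens: ~t^2*d work.
flops_no_cache = sum(t * t * d for t in range(1, T + 1))  # ~ T^3 * d / 3
# With cache: step t only attends over t cached keys/values: ~t*d work.
flops_cache = sum(t * d for t in range(1, T + 1))         # ~ T^2 * d / 2

print(flops_no_cache / flops_cache)  # 683.0, i.e. (2T+1)/3
```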

Question 9

Name three strategies for making nearest neighbor search sub-linear in the number of data points. For each, state whether it is exact or approximate.

**Answer:**
1. **KD-tree** — Exact (in low dimensions); partitions space with axis-aligned hyperplanes. Degrades to $O(n)$ in high dimensions.
2. **Locality-Sensitive Hashing (LSH)** — Approximate; hashes similar points to the same bucket with high probability. Query time is $O(n^\rho)$ where $\rho < 1$.
3. **HNSW (Hierarchical Navigable Small World)** — Approximate; builds a multi-layer proximity graph and uses greedy search. Query time is $O(\log n \cdot d)$ empirically.

Other valid answers: IVF (approximate), ball trees (exact in low dimensions), random projection trees (approximate).

Question 10

A matrix has dimensions $100{,}000 \times 50{,}000$ with 0.1% density. What is the approximate speedup of sparse matrix-vector product over dense matrix-vector product?

**Answer:** Dense matrix-vector product requires $2 \times 100{,}000 \times 50{,}000 = 10^{10}$ FLOPs. Sparse matrix-vector product requires $2 \times \text{nnz}$ FLOPs, where $\text{nnz} = 0.001 \times 100{,}000 \times 50{,}000 = 5{,}000{,}000$. So sparse requires $10^7$ FLOPs. The speedup is $10^{10} / 10^7 = 1000\times$. In practice, the speedup is somewhat less ($100$–$300\times$) due to irregular memory access patterns in sparse operations, but the order of magnitude is correct.
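The FLOP counts from the answer, spelled out:

```python
rows, cols, density = 100_000, 50_000, 0.001

dense_flops = 2 * rows * cols        # one multiply-add per matrix entry
nnz = int(density * rows * cols)     # 5,000,000 stored nonzeros
sparse_flops = 2 * nnz               # one multiply-add per nonzero

print(dense_flops // sparse_flops)   # 1000
```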

Question 11

What does the roofline model's "ridge point" represent?

**Answer:** The ridge point is the arithmetic intensity (FLOPs per byte) at which the computation transitions from **memory-bound** to **compute-bound**. It equals peak compute (FLOPs/s) divided by peak memory bandwidth (bytes/s). Below the ridge point, performance is limited by how fast data can be moved to the processor. Above it, performance is limited by the processor's arithmetic throughput. For an A100 GPU (312 TFLOP/s fp16, 2039 GB/s), the ridge point is approximately 153 FLOPs/byte.
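The A100 figure quoted above, computed directly:

```python
peak_compute = 312e12    # A100 fp16 tensor-core peak, FLOP/s
peak_bandwidth = 2039e9  # A100 HBM2e peak, bytes/s

ridge_point = peak_compute / peak_bandwidth
print(round(ridge_point))  # 153 FLOPs/byte
```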

Question 12

A recommendation system must serve 10,000 requests per second with p99 latency under 100ms. The item catalog contains 10M items with 256-dimensional embeddings. Can brute-force nearest neighbor search meet this requirement? Explain.

**Answer:** **No.** Brute-force search requires $2 \times 10^7 \times 256 = 5.12 \times 10^9$ FLOPs per query. At 100 GFLOP/s (single-thread CPU), this takes $\approx 51$ms per query — already at the budget limit for p50, and p99 will be substantially higher. At 10,000 QPS, the total compute demand is $5.12 \times 10^{13}$ FLOP/s, requiring multiple high-end GPUs just for retrieval. ANN methods (HNSW, IVF-PQ) reduce per-query compute to $O(\log n \cdot d)$ or $O(\text{nprobe} \cdot n/K \cdot d)$, achieving <1ms latency, which is essential at this scale.

Question 13

True or False: Amortized $O(1)$ means every individual operation takes $O(1)$ time.

**Answer:** **False.** Amortized $O(1)$ means the *average* cost per operation over a sequence of operations is $O(1)$. Individual operations may take much longer. For example, appending to a Python list is amortized $O(1)$, but occasional resizes cost $O(n)$. The guarantee is that the *total* cost of $m$ operations is $O(m)$.
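The CPython list makes this concrete: most appends are cheap, but a reallocation (visible as a jump in `sys.getsizeof`) occasionally copies the whole backing array. This sketch counts those reallocations; the exact count is a CPython implementation detail:

```python
import sys

xs = []
reallocations = 0
prev_size = sys.getsizeof(xs)
for _ in range(10_000):
    xs.append(None)
    size = sys.getsizeof(xs)
    if size != prev_size:  # the backing array was reallocated: an O(n) copy
        reallocations += 1
        prev_size = size

print(reallocations)  # far fewer than 10,000: most appends reused spare capacity
```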

Question 14

Feature subset selection over $d$ features is NP-hard because there are $2^d$ possible subsets. What practical alternatives do ML practitioners use?

**Answer:** Practical alternatives include:
1. **Greedy forward selection** — Start with no features; iteratively add the feature that improves the objective most. $O(d^2)$ model evaluations.
2. **Greedy backward elimination** — Start with all features; iteratively remove the least important. $O(d^2)$ model evaluations.
3. **L1 regularization (Lasso)** — Adds an $\ell_1$ penalty that drives some coefficients to exactly zero, effectively selecting features during training. Convex optimization, polynomial time.
4. **Mutual information ranking** — Rank features by MI with the target (Chapter 4) and select the top $k$. $O(nd)$ for $d$ features.
5. **Tree-based importance** — Train a random forest or gradient boosted model and use feature importance scores. $O(nkd \log n)$.

None of these guarantee the optimal subset, but they run in polynomial time and work well in practice.
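Greedy forward selection is short enough to sketch; `score` stands in for any subset objective (e.g., validation accuracy), and the toy additive objective below is purely illustrative:

```python
def forward_select(features, score, k):
    """Greedy forward selection: k rounds, each scanning the remaining
    features, so O(d*k) score evaluations (O(d^2) when k ~ d)."""
    selected, remaining = [], list(features)
    for _ in range(k):
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy objective: each feature contributes a fixed, known gain.
gains = {"a": 3.0, "b": 1.0, "c": 2.0}
print(forward_select(gains, lambda s: sum(gains[f] for f in s), k=2))  # ['a', 'c']
```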

Question 15

A streaming algorithm processes $n$ elements using $O(\log n)$ space. A batch algorithm for the same problem uses $O(n)$ space but produces exact results. When is the streaming algorithm preferred?

**Answer:** The streaming algorithm is preferred when:
1. **Data does not fit in memory** — $n$ is so large that $O(n)$ space is infeasible (e.g., billions of events per day).
2. **Data arrives continuously** — There is no "end" to the dataset; it is an unbounded stream.
3. **Single-pass requirement** — Data can only be read once (e.g., network packets, sensor readings).
4. **Approximate answers suffice** — The application can tolerate bounded error (e.g., approximate distinct counts, approximate frequency estimation).
5. **Real-time results needed** — The batch algorithm would require storing all data before processing, introducing unacceptable delay.

The batch algorithm is preferred when exact results are required and the data fits comfortably in memory.

Question 16

The arithmetic intensity of an element-wise operation (e.g., ReLU) on a vector of $n$ float32 values is approximately $1/8$ FLOPs/byte. Is this operation compute-bound or memory-bound on a modern GPU?

**Answer:** **Memory-bound.** The ridge point for a modern GPU (e.g., A100) is approximately 153 FLOPs/byte. An arithmetic intensity of $1/8 = 0.125$ FLOPs/byte is far below the ridge point. The GPU's arithmetic units are idle more than 99% of the time, waiting for data to arrive from memory. This is why element-wise operations benefit enormously from kernel fusion (combining multiple element-wise ops into a single memory pass) but not from faster arithmetic hardware.
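The comparison against the ridge point, in numbers:

```python
n = 1_000_000
relu_flops = n                  # one max(0, x) per element
bytes_moved = 4 * n + 4 * n     # read float32 input, write float32 output

intensity = relu_flops / bytes_moved
print(intensity)                # 0.125 FLOPs/byte

ridge_point = 312e12 / 2039e9   # A100 ridge point, ~153 FLOPs/byte
print(intensity < ridge_point)  # True: firmly memory-bound
```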

Question 17

Explain why the product quantization compression ratio of $4d/m$ is independent of the codebook size $K$.

**Answer:** In product quantization, each sub-vector is replaced by a $\lceil \log_2 K \rceil$-bit code (the index of its nearest centroid). With $K = 256$, each code is exactly 1 byte ($\lceil \log_2 256 \rceil = 8$ bits). Since there are $m$ sub-vectors, the compressed representation is $m$ bytes. The original vector is $d \times 4$ bytes (float32). The compression ratio is $4d / m$, which depends on $d$ and $m$ but not on $K$ — because the standard choice of $K = 256$ always uses exactly 1 byte per sub-code. If $K$ were different (e.g., $K = 65536 = 2^{16}$), each code would be 2 bytes, and the ratio would change to $4d / (2m) = 2d/m$.
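The byte accounting as a small function:

```python
import math

def pq_compression_ratio(d, m, K=256):
    """float32 bytes (4d) over PQ code bytes: m sub-codes, each
    log2(K) bits rounded up to whole bytes."""
    original_bytes = 4 * d
    bytes_per_code = math.ceil(math.log2(K) / 8)
    return original_bytes / (m * bytes_per_code)

print(pq_compression_ratio(d=128, m=8))           # 64.0, i.e. 4d/m with K=256
print(pq_compression_ratio(d=128, m=8, K=65536))  # 32.0: 2-byte codes halve it
```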

Question 18

A team wants to train a model on 100M data points. The training algorithm has complexity $O(n^2 d)$. Using 1000 GFLOP/s of compute and $d = 100$, estimate the training time. Is this feasible?

**Answer:** FLOPs $= 2 \times (10^8)^2 \times 100 = 2 \times 10^{18}$. At $10^{12}$ FLOPs/s (1000 GFLOP/s), the time is $2 \times 10^{18} / 10^{12} = 2 \times 10^6$ seconds $\approx 23$ days. This is likely infeasible for most practical settings — 23 days of continuous computation on a high-end machine, with no room for iteration, hyperparameter tuning, or debugging. The team should either find an $O(n \log n)$ or $O(n)$ algorithm for the task, subsample the data, or use an approximation that reduces the dependence on $n^2$.
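The arithmetic, step by step:

```python
n, d = 10**8, 100
flops = 2 * n**2 * d     # O(n^2 d) with a constant of about 2
rate = 1e12              # 1000 GFLOP/s

seconds = flops / rate
print(seconds / 86_400)  # ~23.1 days
```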

Question 19

What is an "embarrassingly parallel" workload? Give two examples from machine learning.

**Answer:** An **embarrassingly parallel** workload is one where the work can be divided into independent sub-tasks with no communication or synchronization between them. The speedup scales linearly with the number of processors (serial fraction $\approx 0$). Examples:
1. **Batch inference** — Scoring predictions for a large set of inputs. Each input is processed independently; there are no dependencies between predictions.
2. **Hyperparameter search** — Evaluating different hyperparameter configurations. Each trial trains an independent model; no information flows between trials (in grid search or random search; Bayesian optimization is not embarrassingly parallel).

Other valid examples: cross-validation folds, data augmentation, random forest tree construction (partially), ensemble model inference.
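Batch inference is the canonical case: each input maps through the model independently, so a plain parallel map captures the whole pattern. A sketch with a stand-in `predict` function:

```python
from concurrent.futures import ThreadPoolExecutor

def predict(x):
    # Stand-in for a model forward pass; each input is independent.
    return x * x

inputs = list(range(8))
with ThreadPoolExecutor(max_workers=4) as pool:
    # No communication or shared state between tasks: the serial fraction is ~0.
    outputs = list(pool.map(predict, inputs))

print(outputs)  # [0, 1, 4, 9, 16, 25, 36, 49]
```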

Question 20

You have two ANN indices for the same dataset. Index A has recall@10 = 0.95 and median query latency = 2.1ms. Index B has recall@10 = 0.82 and median query latency = 0.3ms. How would you choose between them for a production recommendation system?

**Answer:** The choice depends on the system constraints and the cost of recall errors.

**Choose Index A (high recall, higher latency)** if:
- The latency budget accommodates 2.1ms for retrieval (e.g., total budget is 100ms)
- Missing the true top-10 items significantly impacts recommendation quality or revenue
- The system is not extremely latency-sensitive (e.g., not real-time bidding)

**Choose Index B (lower recall, lower latency)** if:
- The latency budget is tight (e.g., 5ms total for retrieval)
- The downstream ranker can compensate for recall gaps (if you retrieve 500 candidates and only need 20 final recommendations, 82% recall at the retrieval stage may be sufficient)
- The system serves very high QPS, where every millisecond matters for infrastructure cost

In practice, most recommendation systems choose a configuration along the recall-latency Pareto frontier that meets the latency SLA. The recall gap often has less impact than expected because (1) the ranker re-scores candidates, and (2) the "missing" items are typically similar to the retrieved items.
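Measuring the recall side of the trade-off is a one-liner; here `retrieved` is an index's top-k result and `ground_truth` would come from exact brute-force search on a held-out query set (the IDs below are hypothetical):

```python
def recall_at_k(retrieved, ground_truth, k=10):
    """Fraction of the true top-k neighbors the index actually returned."""
    return len(set(retrieved[:k]) & set(ground_truth[:k])) / k

# Hypothetical IDs: the index missed one of the five true neighbors.
print(recall_at_k([1, 2, 3, 9, 5], [1, 2, 3, 4, 5], k=5))  # 0.8
```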