Answers to Selected Exercises

Advanced Data Science: Deep Learning, Causal Inference, and Production Systems at Scale

This appendix provides worked solutions for selected exercises from each chapter. Problems are chosen to illustrate key concepts and common pitfalls. Two- and three-star problems are prioritized to support deeper understanding.


Part I: Mathematical and Computational Foundations

Chapter 1: Linear Algebra for Machine Learning

Exercise 1.4 (★★) Show that the rank of a matrix equals the number of non-zero singular values. Use this to determine the effective rank of the StreamRec user-item matrix when singular values below 0.01 are treated as zero.

Solution:

The SVD of an m × n matrix A is A = UΣV^T, where Σ contains singular values σ₁ ≥ σ₂ ≥ ... ≥ σ_{min(m,n)} ≥ 0. Because U and V^T are orthogonal, hence invertible, multiplying by them preserves rank, so rank(A) = rank(Σ); and the rank of the diagonal matrix Σ is exactly its number of non-zero entries.

import numpy as np

# Simulate a sparse user-item matrix (1000 users, 500 items)
np.random.seed(42)
n_users, n_items, n_latent = 1000, 500, 25
U_true = np.random.randn(n_users, n_latent)
V_true = np.random.randn(n_items, n_latent)
R = U_true @ V_true.T + 0.1 * np.random.randn(n_users, n_items)

_, sigmas, _ = np.linalg.svd(R, full_matrices=False)

exact_rank = np.sum(sigmas > 0)
effective_rank = np.sum(sigmas > 0.01 * sigmas[0])  # relative threshold

print(f"Exact rank: {exact_rank}")       # 500 (full rank due to noise)
print(f"Effective rank: {effective_rank}")  # ~25 (the true latent dimension)

# Energy captured by top-k singular values
energy = np.cumsum(sigmas**2) / np.sum(sigmas**2)
k_90 = np.searchsorted(energy, 0.90) + 1
print(f"Singular values for 90% energy: {k_90}")  # ~25

The effective rank reveals the true latent dimensionality (25), which is the number of singular values capturing meaningful signal rather than noise.


Exercise 1.7 (★★★) Derive the gradient of the matrix factorization loss L = ||R - UV^T||²_F with respect to U and V. Show that the gradient with respect to U is -2(R - UV^T)V.

Solution:

Let E = R - UV^T. The Frobenius norm squared is:

L = ||E||²_F = tr(E^T E) = tr((R - UV^T)^T (R - UV^T))

Expanding:

L = tr(R^T R) - 2tr(R^T UV^T) + tr(VU^T UV^T)

Taking the derivative with respect to U using matrix calculus rules (∂tr(AXB)/∂X = A^T B^T):

∂L/∂U = -2RV + 2UV^T V = -2(R - UV^T)V = -2EV

Similarly:

∂L/∂V = -2R^T U + 2VU^T U = -2(R - UV^T)^T U = -2E^T U

# Verification via numerical gradient
def mf_loss(U, V, R):
    return np.sum((R - U @ V.T)**2)

U = np.random.randn(100, 10)
V = np.random.randn(50, 10)
R = np.random.randn(100, 50)

# Analytical gradient
E = R - U @ V.T
grad_U_analytical = -2 * E @ V

# Numerical gradient (finite differences)
eps = 1e-5
grad_U_numerical = np.zeros_like(U)
for i in range(U.shape[0]):
    for j in range(U.shape[1]):
        U[i, j] += eps
        loss_plus = mf_loss(U, V, R)
        U[i, j] -= 2 * eps
        loss_minus = mf_loss(U, V, R)
        U[i, j] += eps
        grad_U_numerical[i, j] = (loss_plus - loss_minus) / (2 * eps)

print(f"Max difference: {np.max(np.abs(grad_U_analytical - grad_U_numerical)):.2e}")
# Output: ~1e-6 (numerical precision)

Exercise 1.9 (★★) Explain why PCA is equivalent to projecting data onto the top-k eigenvectors of the covariance matrix, and also equivalent to the best rank-k approximation via SVD (Eckart-Young theorem).

Solution:

Let X be the n × d centered data matrix. The covariance matrix is C = (1/n) X^T X. PCA finds directions of maximum variance, i.e., eigenvectors of C.

The SVD of X gives X = UΣV^T. Then X^T X = VΣ²V^T, so the eigenvectors of X^T X (and C) are the right singular vectors V. Projecting onto the top-k columns of V is identical to the rank-k truncated SVD approximation X_k = U_k Σ_k V_k^T, which by the Eckart-Young theorem minimizes ||X - X_k||_F over all rank-k matrices.

Thus PCA, eigendecomposition of the covariance, and truncated SVD are three views of the same operation.
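The equivalences are easy to verify numerically; a quick check on synthetic data (variable names are ours):

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(200, 8) @ np.random.randn(8, 8)  # correlated features
X = X - X.mean(axis=0)                               # PCA requires centering

# View 1: top-3 eigenvectors of the covariance matrix
C = (X.T @ X) / len(X)
eigvals, eigvecs = np.linalg.eigh(C)                 # eigh returns ascending order
W_eig = eigvecs[:, np.argsort(eigvals)[::-1][:3]]

# View 2: top-3 right singular vectors of X
U, S, Vt = np.linalg.svd(X, full_matrices=False)
W_svd = Vt[:3].T

# Same 3-dimensional subspace (comparing projectors is sign-invariant)
print(np.allclose(W_eig @ W_eig.T, W_svd @ W_svd.T))  # True

# Projecting onto V_3 and mapping back reproduces the rank-3 truncated SVD
X3 = (U[:, :3] * S[:3]) @ Vt[:3]
print(np.allclose(X @ W_svd @ W_svd.T, X3))           # True
```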


Chapter 2: Multivariate Calculus and Optimization

Exercise 2.3 (★★) Implement SGD, SGD with momentum, and Adam from scratch for minimizing f(x, y) = x² + 10y². Plot the optimization trajectories and explain why Adam converges faster on this ill-conditioned problem.

Solution:

The Hessian is diag(2, 20), so its condition number is 20/2 = 10, making this moderately ill-conditioned. The gradient is ∇f = [2x, 20y].

import numpy as np
import matplotlib.pyplot as plt

def grad_f(params):
    x, y = params
    return np.array([2*x, 20*y])

def sgd(x0, lr=0.05, steps=100):
    path = [x0.copy()]
    x = x0.copy()
    for _ in range(steps):
        x -= lr * grad_f(x)
        path.append(x.copy())
    return np.array(path)

def sgd_momentum(x0, lr=0.05, beta=0.9, steps=100):
    path = [x0.copy()]
    x, v = x0.copy(), np.zeros(2)
    for _ in range(steps):
        v = beta * v + grad_f(x)
        x -= lr * v
        path.append(x.copy())
    return np.array(path)

def adam(x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    path = [x0.copy()]
    x = x0.copy()
    m, v = np.zeros(2), np.zeros(2)
    for t in range(1, steps + 1):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
        path.append(x.copy())
    return np.array(path)

x0 = np.array([5.0, 5.0])
path_sgd = sgd(x0)
path_mom = sgd_momentum(x0)
path_adam = adam(x0)

# Adam converges fastest because its per-parameter adaptive learning rate
# effectively rescales the gradient: the y-direction (large gradient, large
# second moment) gets a smaller effective learning rate, while the
# x-direction (small gradient) gets a larger effective rate. This
# counteracts the ill-conditioning.

Adam's key advantage: the v_hat term (running average of squared gradients) normalizes each parameter's update by its gradient magnitude. For the y-coordinate (gradient = 20y), v_hat is large, shrinking the step. For x (gradient = 2x), v_hat is small, preserving a larger step. This per-parameter adaptation effectively preconditions the optimization, approximating second-order information.


Exercise 2.6 (★★★) Derive the backpropagation equations for a two-layer network with ReLU activation. Draw the computational graph and trace the gradient flow.

Solution:

Network: z₁ = W₁x + b₁, a₁ = ReLU(z₁), z₂ = W₂a₁ + b₂, ŷ = σ(z₂), L = -[y log ŷ + (1-y) log(1-ŷ)]

Backward pass using chain rule:

  1. ∂L/∂z₂ = ŷ - y (standard result for sigmoid + cross-entropy)
  2. ∂L/∂W₂ = (ŷ - y) a₁^T
  3. ∂L/∂b₂ = ŷ - y
  4. ∂L/∂a₁ = W₂^T (ŷ - y)
  5. ∂L/∂z₁ = W₂^T (ŷ - y) ⊙ 𝟙(z₁ > 0) [ReLU derivative is indicator]
  6. ∂L/∂W₁ = [W₂^T (ŷ - y) ⊙ 𝟙(z₁ > 0)] x^T
  7. ∂L/∂b₁ = W₂^T (ŷ - y) ⊙ 𝟙(z₁ > 0)

The ReLU derivative ∂ReLU/∂z = 𝟙(z > 0) acts as a "gate" — gradients flow through neurons with positive pre-activation and are blocked by neurons with negative pre-activation. This is the mechanism behind dead neurons: if z₁ is always negative for some neuron, the gradient is always zero and the neuron never updates.
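The seven equations can be checked directly against finite differences. The following standalone sketch (ours, using a single example and a scalar output) implements them verbatim:

```python
import numpy as np

np.random.seed(0)
d_in, d_hidden = 4, 8
x = np.random.randn(d_in, 1)
y = 1.0
W1 = np.random.randn(d_hidden, d_in); b1 = np.zeros((d_hidden, 1))
W2 = np.random.randn(1, d_hidden);    b2 = np.zeros((1, 1))

def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1
    a1 = np.maximum(0, z1)
    z2 = W2 @ a1 + b2
    y_hat = 1 / (1 + np.exp(-z2))
    return z1, a1, z2, y_hat

z1, a1, z2, y_hat = forward(W1, b1, W2, b2)

# Equations 1-7 from the solution
dz2 = y_hat - y                       # (1)
dW2 = dz2 @ a1.T                      # (2)
db2 = dz2                             # (3)
da1 = W2.T @ dz2                      # (4)
dz1 = da1 * (z1 > 0)                  # (5) ReLU gate
dW1 = dz1 @ x.T                       # (6)
db1 = dz1                             # (7)

# Finite-difference check on one entry of W1
def loss(W1, b1, W2, b2):
    y_hat = forward(W1, b1, W2, b2)[3]
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
num = (loss(W1p, b1, W2, b2) - loss(W1m, b1, W2, b2)) / (2 * eps)
print(f"|numerical - analytical| = {abs(num - dW1[0, 0]):.2e}")
```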


Exercise 2.8 (★★) Explain why saddle points are more problematic than local minima in high-dimensional optimization, and describe how SGD noise helps escape them.

Solution:

At a critical point, the Hessian determines the local geometry. If, following the random-matrix heuristic, each of the d Hessian eigenvalues is positive or negative with equal probability, the chance that a random critical point is a local minimum (all d eigenvalues positive) is roughly (1/2)^d, which is exponentially small. Most critical points in high dimensions are therefore saddle points, with a mix of positive and negative eigenvalues.

SGD's minibatch noise provides random perturbations that push the parameters along negative-curvature directions, enabling escape from saddle points. The noise magnitude scales with the learning rate and inversely with the batch size. This is one reason smaller batches can improve generalization: they retain enough noise to escape saddle points and sharp minima and settle in flatter regions.
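A minimal illustration (ours) on the classic saddle f(x, y) = x² - y²: deterministic gradient descent started on the stable manifold y = 0 converges to the saddle, while adding gradient noise escapes it.

```python
import numpy as np

# f(x, y) = x^2 - y^2: positive curvature in x, negative curvature in y.
def grad(p):
    return np.array([2 * p[0], -2 * p[1]])

def run(noise_std, steps=100, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    p = np.array([1.0, 0.0])   # starts exactly on the stable manifold y = 0
    for _ in range(steps):
        p = p - lr * (grad(p) + noise_std * rng.standard_normal(2))
    return p

p_gd = run(noise_std=0.0)    # deterministic GD: y stays exactly 0
p_sgd = run(noise_std=0.1)   # noise kicks y off 0; negative curvature amplifies it

print(p_gd)           # ~[0, 0]: converged to the saddle
print(abs(p_sgd[1]))  # large: escaped along the negative-curvature direction
```

Since f is unbounded below, the escaped iterate keeps growing; the point is only that the saddle is left behind.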


Chapter 3: Probability Theory and Statistical Inference

Exercise 3.2 (★★) Derive the MLE for the Bernoulli parameter p from first principles. Show that maximizing the log-likelihood is equivalent to minimizing binary cross-entropy loss.

Solution:

Given n observations x₁, ..., xₙ where xᵢ ∈ {0, 1}:

Likelihood: L(p) = ∏ᵢ p^xᵢ (1-p)^(1-xᵢ)

Log-likelihood: ℓ(p) = Σᵢ [xᵢ log p + (1-xᵢ) log(1-p)]

Setting ∂ℓ/∂p = 0: Σxᵢ/p - Σ(1-xᵢ)/(1-p) = 0, giving p̂_MLE = (1/n) Σxᵢ = x̄

The binary cross-entropy loss is: BCE = -(1/n) Σ [xᵢ log p + (1-xᵢ) log(1-p)] = -ℓ(p)/n

Since minimizing -ℓ(p)/n is equivalent to maximizing ℓ(p), minimizing binary cross-entropy IS maximum likelihood estimation.

import numpy as np
from scipy.optimize import minimize_scalar

np.random.seed(42)
data = np.random.binomial(1, 0.7, size=1000)

# Analytical MLE
p_mle = data.mean()  # 0.706

# Numerical optimization of negative log-likelihood
nll = lambda p: -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))
result = minimize_scalar(nll, bounds=(0.001, 0.999), method='bounded')

print(f"Analytical MLE: {p_mle:.4f}")
print(f"Numerical MLE:  {result.x:.4f}")
# Both yield ~0.706

Exercise 3.5 (★★★) Implement bootstrap confidence intervals for the median engagement time in StreamRec data. Compare percentile, BCa, and studentized bootstrap intervals.

Solution:

import numpy as np

def bootstrap_ci_percentile(data, stat_fn, n_boot=10000, alpha=0.05):
    """Percentile bootstrap confidence interval."""
    boot_stats = np.array([
        stat_fn(np.random.choice(data, size=len(data), replace=True))
        for _ in range(n_boot)
    ])
    return np.percentile(boot_stats, [100*alpha/2, 100*(1-alpha/2)])

def bootstrap_ci_bca(data, stat_fn, n_boot=10000, alpha=0.05):
    """BCa (bias-corrected and accelerated) bootstrap CI."""
    from scipy.stats import norm

    theta_hat = stat_fn(data)
    boot_stats = np.array([
        stat_fn(np.random.choice(data, size=len(data), replace=True))
        for _ in range(n_boot)
    ])

    # Bias correction
    z0 = norm.ppf(np.mean(boot_stats < theta_hat))

    # Acceleration (jackknife)
    n = len(data)
    jack_stats = np.array([
        stat_fn(np.delete(data, i)) for i in range(n)
    ])
    jack_mean = jack_stats.mean()
    acc = np.sum((jack_mean - jack_stats)**3) / (
        6 * np.sum((jack_mean - jack_stats)**2)**1.5
    )

    # Adjusted quantiles
    z_alpha = norm.ppf(alpha / 2)
    z_1alpha = norm.ppf(1 - alpha / 2)
    a1 = norm.cdf(z0 + (z0 + z_alpha) / (1 - acc * (z0 + z_alpha)))
    a2 = norm.cdf(z0 + (z0 + z_1alpha) / (1 - acc * (z0 + z_1alpha)))

    return np.percentile(boot_stats, [100*a1, 100*a2])

# Simulate right-skewed engagement time data (log-normal)
np.random.seed(42)
engagement_times = np.random.lognormal(mean=3.0, sigma=1.2, size=500)

ci_pct = bootstrap_ci_percentile(engagement_times, np.median)
ci_bca = bootstrap_ci_bca(engagement_times, np.median)

print(f"Percentile CI: [{ci_pct[0]:.2f}, {ci_pct[1]:.2f}]")
print(f"BCa CI:        [{ci_bca[0]:.2f}, {ci_bca[1]:.2f}]")
# BCa corrects for bias and skewness in the bootstrap distribution,
# producing more accurate intervals for skewed statistics like the median.

The BCa method is preferred because it corrects for two sources of inaccuracy: bias (the bootstrap distribution may not be centered on the true parameter) and skewness (the bootstrap distribution may be asymmetric).
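The studentized (bootstrap-t) interval the exercise also asks for needs a standard error for each resample's statistic; since the median has no simple SE formula, a nested bootstrap is the usual workaround. A minimal sketch (ours; replication counts are kept deliberately small because the nested loop costs n_boot × n_inner resamples):

```python
import numpy as np

def bootstrap_ci_studentized(data, stat_fn, n_boot=500, n_inner=50,
                             alpha=0.05, seed=0):
    """Studentized (bootstrap-t) CI; an inner bootstrap estimates each SE."""
    rng = np.random.default_rng(seed)
    n = len(data)
    theta_hat = stat_fn(data)

    # SE of the statistic on the original sample
    se_hat = np.std([stat_fn(rng.choice(data, n)) for _ in range(n_inner)])

    t_stats = []
    for _ in range(n_boot):
        sample = rng.choice(data, n)
        theta_b = stat_fn(sample)
        se_b = np.std([stat_fn(rng.choice(sample, n)) for _ in range(n_inner)])
        if se_b > 0:
            t_stats.append((theta_b - theta_hat) / se_b)
    t_lo, t_hi = np.percentile(t_stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

    # Note the quantile reversal: the upper t quantile sets the lower endpoint
    return theta_hat - t_hi * se_hat, theta_hat - t_lo * se_hat

np.random.seed(42)
engagement_times = np.random.lognormal(mean=3.0, sigma=1.2, size=500)
lo, hi = bootstrap_ci_studentized(engagement_times, np.median)
print(f"Studentized CI: [{lo:.2f}, {hi:.2f}]")
```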


Chapter 4: Information Theory for Data Science

Exercise 4.3 (★★) Compute the mutual information between two features and engagement for the StreamRec dataset. Show a case where MI and Pearson correlation disagree.

Solution:

import numpy as np
from sklearn.metrics import mutual_info_score

np.random.seed(42)
n = 10000

# Feature 1: Linear relationship with engagement
user_age = np.random.normal(35, 10, n)
engagement_linear = 0.5 * user_age + np.random.normal(0, 5, n)

# Feature 2: Nonlinear (quadratic) relationship
user_activity_hour = np.random.uniform(0, 24, n)
# Engagement peaks at noon and midnight (U-shaped)
engagement_nonlinear = 10 * np.cos(2 * np.pi * user_activity_hour / 24)**2 + \
                       np.random.normal(0, 2, n)

# Pearson correlation
corr_linear = np.corrcoef(user_age, engagement_linear)[0, 1]
corr_nonlinear = np.corrcoef(user_activity_hour, engagement_nonlinear)[0, 1]

# Mutual information (discretize for MI computation)
def estimate_mi(x, y, bins=30):
    """Estimate MI by discretization."""
    x_binned = np.digitize(x, np.linspace(x.min(), x.max(), bins))
    y_binned = np.digitize(y, np.linspace(y.min(), y.max(), bins))
    return mutual_info_score(x_binned, y_binned)

mi_linear = estimate_mi(user_age, engagement_linear)
mi_nonlinear = estimate_mi(user_activity_hour, engagement_nonlinear)

print(f"Linear:    Corr={corr_linear:.3f}, MI={mi_linear:.3f}")
print(f"Nonlinear: Corr={corr_nonlinear:.3f}, MI={mi_nonlinear:.3f}")
# Linear:    Corr=0.707, MI=0.62
# Nonlinear: Corr≈0.00,  MI=0.45

Activity hour has near-zero correlation with engagement (the relationship is symmetric around the mean, so positive and negative deviations cancel). But mutual information correctly detects the strong nonlinear dependency. This is why MI-based feature selection can find relationships that correlation-based methods miss.


Chapter 5: Computational Complexity and Scalability

Exercise 5.2 (★★) Profile the naive brute-force nearest neighbor search for StreamRec candidate retrieval. Then implement FAISS IVF and measure the speedup at 95% recall.

Solution:

import numpy as np
import time
# import faiss  # In practice

np.random.seed(42)
n_items = 200_000
d = 128
n_queries = 1000
k = 100

item_embeddings = np.random.randn(n_items, d).astype('float32')
query_embeddings = np.random.randn(n_queries, d).astype('float32')

# Brute force: O(n * d) per query
start = time.time()
for q in query_embeddings[:10]:  # Time 10 queries
    distances = np.sum((item_embeddings - q)**2, axis=1)
    top_k = np.argpartition(distances, k)[:k]
brute_time = (time.time() - start) / 10

print(f"Brute force: {brute_time*1000:.1f}ms per query")
# ~50-100ms per query (too slow for 100ms total latency budget)

# With FAISS IVF (conceptual — actual FAISS code):
# quantizer = faiss.IndexFlatL2(d)
# index = faiss.IndexIVFFlat(quantizer, d, 1024)  # nlist = 1024 clusters
# index.train(item_embeddings)
# index.add(item_embeddings)
# index.nprobe = 32  # Search 32 of 1024 clusters
# D, I = index.search(query_embeddings, k)
# ~2-5ms per query at >95% recall

# The speedup comes from partitioning the vector space into clusters
# (Voronoi cells) and only searching nearby clusters. With 1024 clusters
# and nprobe=32, we search only 3% of the items: O(n * nprobe / n_clusters * d).

Exercise 5.5 (★★★) Use the roofline model to determine whether a transformer self-attention operation is compute-bound or memory-bound on an A100 GPU. Calculate the arithmetic intensity.

Solution:

For self-attention with sequence length n=2048, dimension d=1024, on A100 (312 TFLOPS fp16, 2TB/s HBM bandwidth):

The QK^T matmul: 2 * n * n * d = 2 * 2048 * 2048 * 1024 ≈ 8.59 × 10⁹ FLOPs.

Memory at fp16 (2 bytes per element): read Q (n×d = 4MB) and K (n×d = 4MB), write QK^T (n×n = 8MB) = 16MB total.

Arithmetic intensity = 8.59 × 10⁹ / 1.6 × 10⁷ ≈ 512 FLOPs/byte.

The A100's ridge point = 312 × 10¹² / 2 × 10¹² = 156 FLOPs/byte.

Since 512 > 156, the QK^T matmul is compute-bound at this sequence length. At n=512 the intensity falls to nd/(2d + n) ≈ 205, still above the ridge but approaching it; below n ≈ 370 the matmul itself becomes memory-bound. The strongly memory-bound step is the softmax over the n×n score matrix, which performs only a few FLOPs per element against a full read and write of the matrix. This is why FlashAttention, which never materializes QK^T in HBM, delivers large speedups, increasingly so at long sequence lengths where the n² intermediate dominates memory traffic.
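The arithmetic is easy to redo for other settings; a small helper (ours) parameterizes sequence length and element width, taking the published A100 figures for the ridge point:

```python
def qk_arithmetic_intensity(n, d, bytes_per_elem):
    """FLOPs per HBM byte for the QK^T matmul (read Q, K; write n x n scores)."""
    flops = 2 * n * n * d
    bytes_moved = bytes_per_elem * (2 * n * d + n * n)
    return flops / bytes_moved

RIDGE_A100_FP16 = 312e12 / 2e12  # 156 FLOPs/byte

for n in (512, 2048):
    ai = qk_arithmetic_intensity(n, d=1024, bytes_per_elem=2)  # fp16
    bound = "compute" if ai > RIDGE_A100_FP16 else "memory"
    print(f"n={n}: {ai:.0f} FLOPs/byte -> {bound}-bound")
```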


Part II: Deep Learning

Chapter 6: Neural Networks from Scratch

Exercise 6.3 (★★★) Implement a complete two-layer MLP in numpy with forward pass, backward pass, and gradient checking. Train it on the StreamRec click prediction task.

Solution:

import numpy as np

class NumpyMLP:
    def __init__(self, d_in, d_hidden, d_out):
        # He initialization
        self.W1 = np.random.randn(d_in, d_hidden) * np.sqrt(2.0 / d_in)
        self.b1 = np.zeros(d_hidden)
        self.W2 = np.random.randn(d_hidden, d_out) * np.sqrt(2.0 / d_hidden)
        self.b2 = np.zeros(d_out)

    def forward(self, X):
        self.X = X
        self.z1 = X @ self.W1 + self.b1
        self.a1 = np.maximum(0, self.z1)  # ReLU
        self.z2 = self.a1 @ self.W2 + self.b2
        self.y_hat = 1 / (1 + np.exp(-self.z2))  # Sigmoid
        return self.y_hat

    def backward(self, y):
        m = y.shape[0]
        # Output layer
        dz2 = (self.y_hat - y.reshape(-1, 1)) / m
        self.dW2 = self.a1.T @ dz2
        self.db2 = np.sum(dz2, axis=0)
        # Hidden layer
        da1 = dz2 @ self.W2.T
        dz1 = da1 * (self.z1 > 0).astype(float)  # ReLU gradient
        self.dW1 = self.X.T @ dz1
        self.db1 = np.sum(dz1, axis=0)

    def gradient_check(self, X, y, eps=1e-5):
        """Verify analytical gradients against numerical gradients."""
        def bce_loss():
            y_hat = self.forward(X)
            t = y.reshape(-1, 1)
            return -np.mean(t * np.log(y_hat + 1e-8) +
                            (1 - t) * np.log(1 - y_hat + 1e-8))

        self.forward(X)
        self.backward(y)
        for name, param, grad in [
            ('W1', self.W1, self.dW1), ('W2', self.W2, self.dW2)
        ]:
            idx = tuple(np.random.randint(0, s) for s in param.shape)
            param[idx] += eps
            loss_plus = bce_loss()
            param[idx] -= 2 * eps
            loss_minus = bce_loss()
            param[idx] += eps
            numerical = (loss_plus - loss_minus) / (2 * eps)
            analytical = grad[idx]
            rel_error = abs(numerical - analytical) / (abs(numerical) + abs(analytical) + 1e-8)
            print(f"{name}{list(idx)}: analytical={analytical:.6f}, "
                  f"numerical={numerical:.6f}, rel_error={rel_error:.2e}")

Gradient checking should show relative errors below 1e-5 for all parameters. Errors above 1e-3 indicate a bug in the backward pass.


Chapter 7: Training Deep Networks

Exercise 7.4 (★★) Compare training dynamics with and without batch normalization. Plot loss curves, gradient norms per layer, and activation distributions at initialization and after 1000 steps.

Solution:

The key observations:

  1. Loss curves: Without BN, training is slower and more sensitive to learning rate. With BN, convergence is faster and more stable across learning rates.
  2. Gradient norms: Without BN, gradient norms vary widely across layers (deeper layers may have much smaller gradients). With BN, gradient norms are more uniform.
  3. Activation distributions: Without BN, activations shift and scale unpredictably during training. With BN, activations maintain approximately zero mean and unit variance throughout training.

import torch
import torch.nn as nn

class MLPWithBN(nn.Module):
    def __init__(self, d_in, d_hidden, n_layers):
        super().__init__()
        layers = []
        for i in range(n_layers):
            layers.append(nn.Linear(d_in if i == 0 else d_hidden, d_hidden))
            layers.append(nn.BatchNorm1d(d_hidden))
            layers.append(nn.ReLU())
        layers.append(nn.Linear(d_hidden, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# Training loop comparison (pseudocode for space):
# model_no_bn = MLP(d_in=64, d_hidden=256, n_layers=8)
# model_bn = MLPWithBN(d_in=64, d_hidden=256, n_layers=8)
# Train both with identical data, LR=0.01, log gradient norms per layer.
# BN model converges ~2x faster and tolerates LR up to 0.1.

The modern understanding is that BN's benefit comes less from reducing "internal covariate shift" (the original hypothesis) and more from smoothing the loss landscape, making the optimization problem better conditioned.


Chapter 8: Convolutional Neural Networks

Exercise 8.3 (★★) Calculate the number of parameters and receptive field size for a VGG-style network with 3 layers of 3×3 convolutions versus a single 7×7 convolution. Explain why stacked small kernels are preferred.

Solution:

For a single channel with C filters:

  • Single 7×7 conv: Parameters = 7 × 7 × C × C = 49C². Receptive field = 7×7.
  • Three 3×3 convs: Parameters = 3 × (3 × 3 × C × C) = 27C². Receptive field = 7×7 (each layer adds 2 pixels per dimension).

Three stacked 3×3 convolutions have the same receptive field but 45% fewer parameters (27C² vs 49C²). Additionally, three layers with ReLU activations between them introduce three nonlinearities instead of one, making the function class richer. This insight from VGGNet (Simonyan & Zisserman, 2014) remains a design principle in modern architectures.
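The parameter counts are quick to confirm in PyTorch (C = 64 chosen arbitrarily for illustration):

```python
import torch.nn as nn

C = 64
conv7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)
conv3x3_stack = nn.Sequential(
    *[nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False) for _ in range(3)]
)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(conv7))          # 49 * 64^2 = 200704
print(n_params(conv3x3_stack))  # 27 * 64^2 = 110592
```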


Chapter 9: Recurrent Networks and Sequence Modeling

Exercise 9.2 (★★★) Derive why the vanilla RNN suffers from vanishing gradients by computing the Jacobian product chain. Show that the LSTM cell state provides an additive gradient path that mitigates this.

Solution:

In a vanilla RNN, h_t = tanh(W_hh h_{t-1} + W_xh x_t). The gradient of the loss at time T with respect to h_t requires:

∂h_T/∂h_t = ∏_{k=t}^{T-1} ∂h_{k+1}/∂h_k = ∏_{k=t}^{T-1} diag(1 - h_{k+1}²) W_hh

Each factor has spectral norm at most ||W_hh|| · max_i(1 - h²_{k+1,i}). If these norms stay below 1, the product's norm decays exponentially in T - t, driving gradients to zero.

In an LSTM, the cell state update is c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t. The gradient path through the cell state is:

∂c_T/∂c_t = ∏_{k=t}^{T-1} diag(f_{k+1})

Since forget gate values f_k ∈ (0, 1), this product shrinks but can remain close to 1 when f_k ≈ 1 (the gate is "open"). Critically, this is an additive path (no squashing nonlinearity on c_t), analogous to the residual connection in ResNets. The gradient flows through an unimpeded highway.
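A back-of-the-envelope simulation makes the contrast concrete; the hidden states and forget-gate logits below are random stand-ins (ours), not a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 50, 32

# Vanilla RNN path: prod_k diag(1 - h_{k+1}^2) W_hh, with surrogate states
W_hh = rng.standard_normal((d, d)) / np.sqrt(d)
J = np.eye(d)
for _ in range(T):
    h = np.tanh(rng.standard_normal(d))
    J = np.diag(1 - h**2) @ W_hh @ J
print(np.linalg.norm(J))     # vanishingly small after 50 steps

# LSTM cell-state path: prod_k diag(f_k) with mostly-open forget gates
f = 1 / (1 + np.exp(-rng.normal(4.0, 0.5, size=(T, d))))  # sigmoid(~4) ~ 0.98
cell_path = np.prod(f, axis=0)
print(cell_path.mean())      # order 0.3-0.5: the gradient survives 50 steps
```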


Chapter 10: The Transformer Architecture

Exercise 10.5 (★★★) Implement scaled dot-product attention from scratch in PyTorch. Verify that your implementation matches torch.nn.functional.scaled_dot_product_attention.

Solution:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Args:
        Q: (batch, n_heads, seq_len_q, d_k)
        K: (batch, n_heads, seq_len_k, d_k)
        V: (batch, n_heads, seq_len_k, d_v)
        mask: (batch, 1, seq_len_q, seq_len_k) or broadcastable
    Returns:
        output: (batch, n_heads, seq_len_q, d_v)
        attention_weights: (batch, n_heads, seq_len_q, seq_len_k)
    """
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attention_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output, attention_weights

# Verification
torch.manual_seed(42)
B, H, N, D = 2, 8, 64, 64
Q = torch.randn(B, H, N, D)
K = torch.randn(B, H, N, D)
V = torch.randn(B, H, N, D)

out_custom, _ = scaled_dot_product_attention(Q, K, V)
out_pytorch = F.scaled_dot_product_attention(Q, K, V)

print(f"Max diff: {(out_custom - out_pytorch).abs().max():.2e}")
# Output: ~1e-6 (floating point precision)

The scaling factor 1/√d_k is critical: without it, for large d_k the dot products grow in magnitude, pushing the softmax into regions with extremely small gradients.
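The effect is easy to demonstrate with a toy check (ours): dot products of d_k = 512 random unit-variance vectors have standard deviation √d_k ≈ 22.6, which drives the unscaled softmax to near one-hot rows:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k, n = 512, 16
q = torch.randn(n, d_k)
k = torch.randn(n, d_k)

scores = q @ k.T                     # entries have std ~ sqrt(d_k) ~ 22.6
attn_raw = F.softmax(scores, dim=-1)
attn_scaled = F.softmax(scores / d_k ** 0.5, dim=-1)

# Unscaled rows collapse toward one-hot, where softmax gradients vanish
print(attn_raw.max(dim=-1).values.mean())     # close to 1.0
print(attn_scaled.max(dim=-1).values.mean())  # well below 1.0: mass is spread
```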


Chapter 11: Large Language Models

Exercise 11.3 (★★) Build a simple RAG pipeline for StreamRec that answers natural language queries about content. Measure retrieval accuracy at different chunk sizes (256, 512, 1024 tokens).

Solution:

# Conceptual implementation (requires embedding model and LLM API)
from dataclasses import dataclass
from typing import List

@dataclass
class RAGPipeline:
    chunk_size: int
    overlap: int
    top_k: int = 5

    def chunk_documents(self, documents: List[str]) -> List[str]:
        """Split documents into overlapping chunks."""
        chunks = []
        for doc in documents:
            words = doc.split()
            for i in range(0, len(words), self.chunk_size - self.overlap):
                chunk = " ".join(words[i:i + self.chunk_size])
                if chunk:
                    chunks.append(chunk)
        return chunks

    def retrieve(self, query_embedding, chunk_embeddings, chunks):
        """Return top-k chunks by cosine similarity."""
        import numpy as np
        sims = query_embedding @ chunk_embeddings.T
        top_idx = np.argsort(sims)[-self.top_k:][::-1]
        return [chunks[i] for i in top_idx]

    def generate(self, query, context_chunks):
        """Format prompt with retrieved context."""
        context = "\n---\n".join(context_chunks)
        prompt = f"""Answer the question using only the provided context.

Context:
{context}

Question: {query}
Answer:"""
        return prompt

# Empirical finding: chunk_size=512 with overlap=64 typically gives the
# best retrieval recall. Too small (256) fragments context; too large
# (1024) dilutes the relevant passage with irrelevant text.

Chapter 12: Generative Models

Exercise 12.4 (★★★) Derive the ELBO from the marginal log-likelihood. Show that maximizing the ELBO minimizes KL(q(z|x) || p(z|x)).

Solution:

Start with the marginal log-likelihood:

log p(x) = log ∫ p(x, z) dz

Introduce an approximate posterior q(z|x):

log p(x) = log ∫ p(x, z) [q(z|x) / q(z|x)] dz
         = log E_q [p(x, z) / q(z|x)]
         ≥ E_q [log p(x, z) / q(z|x)]   (Jensen's inequality)
         = E_q [log p(x|z)] - KL(q(z|x) || p(z)) = ELBO

The gap between log p(x) and the ELBO is exactly:

log p(x) - ELBO = KL(q(z|x) || p(z|x))

Since KL ≥ 0, the ELBO is indeed a lower bound. Maximizing the ELBO is equivalent to minimizing KL(q(z|x) || p(z|x)), i.e., making q approximate the true posterior.

For the VAE, q(z|x) = N(μ_φ(x), σ²_φ(x)) (encoder output), and the ELBO decomposes into:

ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)), so the VAE training loss (the negative ELBO) is the familiar sum of a reconstruction loss and a KL regularization term.

The KL term has a closed-form solution for two Gaussians.
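For a diagonal Gaussian q = N(μ, σ²) against the N(0, 1) prior, that closed form is KL = ½ Σ (μ² + σ² − log σ² − 1). A quick Monte Carlo check (our toy numbers) confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.7, 1.3

# Closed form: KL(N(mu, sigma^2) || N(0, 1))
kl_closed = 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1)

# Monte Carlo estimate of E_q[log q(z) - log p(z)]
z = rng.normal(mu, sigma, size=1_000_000)
log_q = -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
log_p = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)  # agree to about three decimal places
```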


Chapter 13: Transfer Learning and Foundation Models

Exercise 13.3 (★★) Implement a two-tower retrieval model with pretrained encoders and contrastive (InfoNCE) loss. Evaluate at different negative sampling ratios.

Solution:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    def __init__(self, user_dim, item_dim, embed_dim=128):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim)
        )
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim)
        )
        self.temperature = nn.Parameter(torch.tensor(0.07))

    def forward(self, user_features, item_features):
        user_emb = F.normalize(self.user_tower(user_features), dim=-1)
        item_emb = F.normalize(self.item_tower(item_features), dim=-1)
        return user_emb, item_emb

    def info_nce_loss(self, user_emb, item_emb):
        """In-batch negatives InfoNCE loss."""
        logits = user_emb @ item_emb.T / self.temperature
        labels = torch.arange(logits.size(0), device=logits.device)
        return F.cross_entropy(logits, labels)

With in-batch negatives (batch size B), each positive pair has B-1 negatives. Larger batch sizes provide more negatives, improving representation quality up to a point (diminishing returns beyond ~4096).
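A standalone toy run of the in-batch objective (synthetic embeddings, ours, rather than the towers above) shows the loss rewarding aligned user-item pairs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, D = 8, 16
user_emb = F.normalize(torch.randn(B, D), dim=-1)
# Positives: item embeddings are noisy copies of their user's embedding
item_emb = F.normalize(user_emb + 0.1 * torch.randn(B, D), dim=-1)

labels = torch.arange(B)                 # positive for user i is item i
loss = F.cross_entropy(user_emb @ item_emb.T / 0.07, labels)
print(loss.item())                       # small: diagonal logits dominate

# Break the pairing: shuffled items make the "positives" wrong
perm = torch.randperm(B)
loss_bad = F.cross_entropy(user_emb @ item_emb[perm].T / 0.07, labels)
print(loss_bad.item())                   # typically much larger
```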


Chapter 14: Graph Neural Networks

Exercise 14.2 (★★) Implement a GCN layer from the adjacency matrix formulation and verify against PyTorch Geometric's GCNConv.

Solution:

import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """GCN layer: H' = D̃^(-1/2) Ã D̃^(-1/2) H W"""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.W = nn.Parameter(torch.randn(in_features, out_features) * 0.01)

    def forward(self, H, A):
        # Add self-loops: Ã = A + I
        A_hat = A + torch.eye(A.size(0), device=A.device)
        # Symmetric normalization: D̃^(-1/2) Ã D̃^(-1/2)
        deg = A_hat.sum(dim=1)  # degree vector of D̃
        D_inv_sqrt = torch.diag(deg.rsqrt())
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
        # Message passing + linear transform
        return A_norm @ H @ self.W

The normalized adjacency D̃^(-1/2) Ã D̃^(-1/2) ensures that aggregation is symmetric and does not change the scale of representations based on node degree.
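Two sanity checks of the normalization on a toy path graph (ours). PyTorch Geometric's GCNConv implements the same propagation rule, so with identical weights the layer above should match its output up to numerical precision:

```python
import torch

# 4-node path graph: 0-1-2-3
A = torch.tensor([[0., 1., 0., 0.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [0., 0., 1., 0.]])
A_hat = A + torch.eye(4)                       # add self-loops
D_inv_sqrt = torch.diag(A_hat.sum(dim=1).rsqrt())
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt

print(torch.allclose(A_norm, A_norm.T))        # True: a symmetric operator

evals = torch.linalg.eigvalsh(A_norm)
print(evals)   # all eigenvalues in (-1, 1], top exactly 1: no scale blow-up
```

The top eigenvalue is exactly 1 (with eigenvector D̃^(1/2)·1), so repeated application of A_norm cannot inflate representation scale.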


Part III: Causal Inference

Chapter 15: Beyond Prediction

Exercise 15.2 (★★) Construct a Simpson's paradox example using StreamRec data. Show that the overall association between recommendation and engagement reverses within user segments.

Solution:

import numpy as np
import pandas as pd

np.random.seed(42)

# Heavy users: recommended more AND engage more (confounder)
heavy = pd.DataFrame({
    'segment': 'heavy', 'recommended': np.random.binomial(1, 0.8, 600),
    'base_engage': np.random.binomial(1, 0.7, 600)
})
heavy['engaged'] = np.where(
    heavy['recommended'] == 1,
    np.random.binomial(1, 0.65, 600),  # rec'd: 65%
    np.random.binomial(1, 0.70, 600)   # not rec'd: 70%
)

# Light users: recommended less AND engage less
light = pd.DataFrame({
    'segment': 'light', 'recommended': np.random.binomial(1, 0.2, 400),
    'base_engage': np.random.binomial(1, 0.3, 400)
})
light['engaged'] = np.where(
    light['recommended'] == 1,
    np.random.binomial(1, 0.25, 400),  # rec'd: 25%
    np.random.binomial(1, 0.30, 400)   # not rec'd: 30%
)

data = pd.concat([heavy, light], ignore_index=True)

# Overall: recommendation appears to INCREASE engagement
overall = data.groupby('recommended')['engaged'].mean()
print("Overall:", overall.to_dict())
# {0: ~0.43, 1: ~0.58} — looks like recs help!

# Within each segment: recommendation DECREASES engagement
for seg in ['heavy', 'light']:
    seg_data = data[data['segment'] == seg]
    print(f"{seg}:", seg_data.groupby('recommended')['engaged'].mean().to_dict())
# heavy: {0: ~0.70, 1: ~0.65} — recs hurt!
# light: {0: ~0.30, 1: ~0.25} — recs hurt!

The paradox arises because user type confounds both recommendation and engagement. Heavy users receive more recommendations (positive association) and engage more (positive association), creating a spurious overall positive relationship even though the within-segment causal effect is negative.


Chapter 16: The Potential Outcomes Framework

Exercise 16.3 (★★★) Simulate a dataset with known potential outcomes. Compute the true ATE, then show that the naive difference-in-means estimator is biased under confounded treatment assignment.

Solution:

import numpy as np

np.random.seed(42)
n = 5000

# Confounder: user engagement propensity
X = np.random.normal(0, 1, n)

# Potential outcomes (known in simulation)
Y0 = 2 + 1.5 * X + np.random.normal(0, 1, n)  # Control
Y1 = 2 + 1.5 * X + 3.0 + np.random.normal(0, 1, n)  # Treated (true effect = 3.0)

true_ate = np.mean(Y1 - Y0)  # Should be ~3.0

# Confounded treatment: higher X → more likely to be treated
propensity = 1 / (1 + np.exp(-1.5 * X))
T = np.random.binomial(1, propensity)

# Observed outcome
Y = T * Y1 + (1 - T) * Y0

# Naive estimator (biased)
naive_ate = Y[T == 1].mean() - Y[T == 0].mean()

# Regression adjustment (unbiased if model is correct)
from numpy.linalg import lstsq
X_design = np.column_stack([np.ones(n), T, X])
beta, _, _, _ = lstsq(X_design, Y, rcond=None)
adjusted_ate = beta[1]

print(f"True ATE:     {true_ate:.3f}")    # ~3.00
print(f"Naive ATE:    {naive_ate:.3f}")    # ~4.25 (biased upward)
print(f"Adjusted ATE: {adjusted_ate:.3f}") # ~3.00 (unbiased)

The naive estimate is biased upward because treated units have higher X (by design of the propensity), and higher X leads to higher Y regardless of treatment. The naive estimator conflates the treatment effect with the confounding effect of X.


Chapter 17: Graphical Causal Models

Exercise 17.4 (★★) Use the backdoor criterion to determine which variables to condition on in a given DAG. Implement using the DoWhy library.

Solution:

Consider the StreamRec DAG: User_Preference (U) → Recommendation (R); U → Engagement (E); R → E; Platform_Activity (P) → U; P → R. Additionally, Item_Popularity (I) → R; I → E (a second confounder).

Backdoor paths from R to E: R ← U → E, R ← I → E, and R ← P → U → E.

A valid adjustment set: {U, I} — U blocks both paths through it (R ← U → E and R ← P → U → E), and I blocks R ← I → E.

An invalid adjustment set: {P} alone — it blocks R ← P → U → E but leaves R ← U → E and R ← I → E open.

# Conceptual DoWhy implementation
# import dowhy
# model = dowhy.CausalModel(
#     data=df,
#     treatment='recommendation',
#     outcome='engagement',
#     graph="""
#         digraph {
#             user_preference -> recommendation;
#             user_preference -> engagement;
#             recommendation -> engagement;
#             platform_activity -> user_preference;
#             platform_activity -> recommendation;
#             item_popularity -> recommendation;
#             item_popularity -> engagement;
#         }
#     """
# )
# identified = model.identify_effect()
# print(identified)  # Backdoor adjustment on {user_preference, item_popularity}
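The blocking logic can also be checked by hand, without DoWhy. The sketch below hard-codes the backdoor paths listed above; it is valid only because every intermediate node on these paths is a fork (non-collider), so a path is blocked iff the adjustment set contains one of its intermediate nodes:

```python
# Backdoor paths from R (recommendation) to E (engagement) in the DAG above
backdoor_paths = [
    ['R', 'U', 'E'],       # R <- U -> E
    ['R', 'I', 'E'],       # R <- I -> E
    ['R', 'P', 'U', 'E'],  # R <- P -> U -> E
]

def blocks_all_backdoors(adjustment_set):
    """True iff every listed path has an intermediate node in the set.
    Only correct when no intermediate node is a collider."""
    return all(any(v in adjustment_set for v in path[1:-1])
               for path in backdoor_paths)

print(blocks_all_backdoors({'U', 'I'}))  # True  — valid adjustment set
print(blocks_all_backdoors({'P'}))       # False — leaves R <- U -> E open
print(blocks_all_backdoors({'U'}))       # False — leaves R <- I -> E open
```

In a DAG with colliders, this shortcut fails (conditioning on a collider opens a path), which is why a general d-separation implementation is preferable in practice.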

Chapter 18: Causal Estimation Methods

Exercise 18.3 (★★★) Implement IPW estimation for the StreamRec recommendation effect. Check covariate balance before and after weighting.

Solution:

import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(42)
n = 5000

# Confounders
X = np.random.randn(n, 3)  # user features

# Propensity (treatment depends on confounders)
true_propensity = 1 / (1 + np.exp(-(0.8*X[:,0] + 0.5*X[:,1] - 0.3*X[:,2])))
T = np.random.binomial(1, true_propensity)

# Outcome (true ATE = 2.0)
Y = 1.0 + 0.5*X[:,0] + 0.3*X[:,1] + 2.0*T + np.random.normal(0, 1, n)

# Step 1: Estimate propensity scores
ps_model = LogisticRegression()
ps_model.fit(X, T)
e_hat = ps_model.predict_proba(X)[:, 1]
e_hat = np.clip(e_hat, 0.05, 0.95)  # Trim extreme weights

# Step 2: IPW estimator
w1 = T / e_hat
w0 = (1 - T) / (1 - e_hat)
ipw_ate = np.mean(w1 * Y) - np.mean(w0 * Y)

# Step 3: Covariate balance check
def standardized_mean_diff(X, T, weights=None):
    """Compute SMD for each covariate before/after weighting."""
    smds = []
    for j in range(X.shape[1]):
        # Unweighted pooled SD in both cases, by convention, so that
        # before- and after-weighting SMDs share the same denominator
        pooled_std = np.sqrt((X[T==1,j].var() + X[T==0,j].var()) / 2)
        if weights is None:
            mean_t = X[T==1, j].mean()
            mean_c = X[T==0, j].mean()
        else:
            mean_t = np.average(X[T==1, j], weights=weights[T==1])
            mean_c = np.average(X[T==0, j], weights=weights[T==0])
        smds.append(abs(mean_t - mean_c) / pooled_std)
    return smds

weights = np.where(T == 1, 1/e_hat, 1/(1-e_hat))
smd_before = standardized_mean_diff(X, T)
smd_after = standardized_mean_diff(X, T, weights)

print(f"Naive ATE: {Y[T==1].mean() - Y[T==0].mean():.3f}")  # ~2.5 (biased)
print(f"IPW ATE:   {ipw_ate:.3f}")                            # ~2.0 (corrected)
print(f"SMD before: {[f'{s:.3f}' for s in smd_before]}")      # [0.5+, ...]
print(f"SMD after:  {[f'{s:.3f}' for s in smd_after]}")       # [<0.1, ...]

Good balance is indicated by all SMDs < 0.1 after weighting. If balance is poor, the propensity score model may be misspecified.


Chapter 19: Causal Machine Learning

Exercise 19.2 (★★★) Estimate CATEs using S-learner and T-learner. Show a case where they disagree and explain why.

Solution:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

np.random.seed(42)
n = 5000
X = np.random.randn(n, 5)
T = np.random.binomial(1, 0.5, n)

# True CATE: treatment effect depends linearly on X[:,0]
tau = 2.0 + 3.0 * X[:, 0]  # heterogeneous: roughly -4 to +8 for X[:,0] in [-2, 2]
Y = 1.0 + 0.5*X[:,0] + tau*T + np.random.normal(0, 1, n)

# S-Learner: single model with T as a feature
X_s = np.column_stack([X, T])
s_model = GradientBoostingRegressor(n_estimators=200).fit(X_s, Y)
cate_s = (s_model.predict(np.column_stack([X, np.ones(n)])) -
          s_model.predict(np.column_stack([X, np.zeros(n)])))

# T-Learner: separate models for treated and control
t_model1 = GradientBoostingRegressor(n_estimators=200).fit(X[T==1], Y[T==1])
t_model0 = GradientBoostingRegressor(n_estimators=200).fit(X[T==0], Y[T==0])
cate_t = t_model1.predict(X) - t_model0.predict(X)

# Correlation with true CATE
print(f"S-learner corr: {np.corrcoef(tau, cate_s)[0,1]:.3f}")  # ~0.85
print(f"T-learner corr: {np.corrcoef(tau, cate_t)[0,1]:.3f}")  # ~0.92

The T-learner often outperforms the S-learner when treatment effects are heterogeneous because the S-learner's single model may under-regularize the treatment effect (the treatment indicator is just one of many features, and the model may not give it special treatment). However, the T-learner doubles the model complexity and can suffer when treatment/control groups have different covariate distributions.


Part IV: Bayesian and Temporal Data Science

Chapter 20: Bayesian Thinking

Exercise 20.3 (★★) Derive the Beta-Binomial conjugate update and show that the posterior mean is a weighted average of the prior mean and the MLE.

Solution:

Prior: p ~ Beta(α, β). Data: k successes in n trials.

Posterior: p | data ~ Beta(α + k, β + n - k)

Posterior mean = (α + k) / (α + β + n) = [(α + β)/(α + β + n)] × [α/(α+β)] + [n/(α + β + n)] × [k/n]

This is a weighted average of: the prior mean α/(α+β) (weight proportional to α+β, the "prior sample size") and the MLE k/n (weight proportional to n, the actual sample size). As n → ∞, the posterior mean converges to the MLE — the data overwhelms the prior. For small n, the prior pulls the estimate toward the prior mean (shrinkage).
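The identity is easy to verify numerically (the prior and data values below are arbitrary):

```python
alpha, beta, n, k = 2.0, 5.0, 40, 18

posterior_mean = (alpha + k) / (alpha + beta + n)
prior_mean = alpha / (alpha + beta)
mle = k / n
w_prior = (alpha + beta) / (alpha + beta + n)  # weight on the prior mean

weighted = w_prior * prior_mean + (1 - w_prior) * mle
print(posterior_mean, weighted)  # identical: 20/47 ≈ 0.4255
```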


Chapter 21: Bayesian Modeling in Practice

Exercise 21.5 (★★★) Build a hierarchical model for StreamRec category engagement rates. Diagnose convergence using R-hat, ESS, and divergences.

Solution:

# Conceptual PyMC implementation
import pymc as pm
import arviz as az
import numpy as np

# Data: engagement counts per category
np.random.seed(42)
n_categories = 25
true_mu = 0.3
true_sigma = 0.15
true_rates = np.clip(np.random.normal(true_mu, true_sigma, n_categories), 0.01, 0.99)
n_observations = np.random.randint(50, 500, n_categories)
successes = np.random.binomial(n_observations, true_rates)

with pm.Model() as hierarchical_model:
    # Hyperpriors
    mu = pm.Beta('mu', alpha=2, beta=5)
    kappa = pm.HalfNormal('kappa', sigma=50)

    # Category-level rates (partial pooling)
    alpha = mu * kappa
    beta = (1 - mu) * kappa
    theta = pm.Beta('theta', alpha=alpha, beta=beta, shape=n_categories)

    # Likelihood
    obs = pm.Binomial('obs', n=n_observations, p=theta, observed=successes)

    # Sample
    trace = pm.sample(2000, tune=1000, chains=4, target_accept=0.95)

# Diagnostics
summary = az.summary(trace, var_names=['mu', 'kappa'])
# Check: all R-hat < 1.01, ESS > 400, no divergences
# If divergences occur, increase target_accept or reparameterize

Key diagnostics: (1) R-hat < 1.01 for all parameters, (2) ESS > 100 per chain, (3) zero divergences. If divergences appear, reparameterizing the Beta distribution (e.g., using a non-centered parameterization) often resolves the issue.


Chapter 22: Bayesian Optimization and Sequential Decisions

Exercise 22.3 (★★) Implement Thompson sampling for a 5-armed bandit with Beta-Bernoulli rewards. Compare cumulative regret against epsilon-greedy and UCB1.

Solution:

import numpy as np

class BetaBernoulliBandit:
    def __init__(self, true_probs):
        self.true_probs = true_probs
        self.k = len(true_probs)

    def pull(self, arm):
        return np.random.binomial(1, self.true_probs[arm])

def thompson_sampling(bandit, n_rounds):
    alphas = np.ones(bandit.k)
    betas = np.ones(bandit.k)
    regret = []
    best_prob = max(bandit.true_probs)
    for t in range(n_rounds):
        samples = np.random.beta(alphas, betas)
        arm = np.argmax(samples)
        reward = bandit.pull(arm)
        alphas[arm] += reward
        betas[arm] += 1 - reward
        regret.append(best_prob - bandit.true_probs[arm])
    return np.cumsum(regret)

def epsilon_greedy(bandit, n_rounds, eps=0.1):
    counts = np.zeros(bandit.k)
    rewards = np.zeros(bandit.k)
    regret = []
    best_prob = max(bandit.true_probs)
    for t in range(n_rounds):
        if np.random.random() < eps:
            arm = np.random.randint(bandit.k)
        else:
            arm = np.argmax(rewards / np.maximum(counts, 1))
        reward = bandit.pull(arm)
        counts[arm] += 1
        rewards[arm] += reward
        regret.append(best_prob - bandit.true_probs[arm])
    return np.cumsum(regret)

bandit = BetaBernoulliBandit([0.3, 0.5, 0.7, 0.4, 0.6])
ts_regret = thompson_sampling(bandit, 10000)
eg_regret = epsilon_greedy(bandit, 10000)
# Thompson sampling achieves O(sqrt(K * T * log T)) regret,
# substantially lower than epsilon-greedy's O(epsilon * T) linear regret.
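The exercise also asks for UCB1; here is a sketch (the bandit class is repeated from above so the snippet runs standalone):

```python
import numpy as np

class BetaBernoulliBandit:
    def __init__(self, true_probs):
        self.true_probs = true_probs
        self.k = len(true_probs)
    def pull(self, arm):
        return np.random.binomial(1, self.true_probs[arm])

def ucb1(bandit, n_rounds):
    """UCB1: after pulling each arm once, pick argmax of mean + exploration bonus."""
    counts = np.zeros(bandit.k)
    rewards = np.zeros(bandit.k)
    regret = []
    best_prob = max(bandit.true_probs)
    for t in range(n_rounds):
        if t < bandit.k:
            arm = t  # initialization: pull each arm once
        else:
            bonus = np.sqrt(2 * np.log(t) / counts)
            arm = np.argmax(rewards / counts + bonus)
        reward = bandit.pull(arm)
        counts[arm] += 1
        rewards[arm] += reward
        regret.append(best_prob - bandit.true_probs[arm])
    return np.cumsum(regret)

np.random.seed(0)
bandit = BetaBernoulliBandit([0.3, 0.5, 0.7, 0.4, 0.6])
ucb_regret = ucb1(bandit, 10000)
# Like Thompson sampling, UCB1's expected regret grows only logarithmically in T,
# in contrast to fixed-epsilon greedy's linear regret.
```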

Chapter 23: Advanced Time Series and Temporal Models

Exercise 23.4 (★★) Implement walk-forward validation for a StreamRec daily engagement forecaster. Compare MAE and calibration of prediction intervals.

Solution:

Walk-forward validation respects temporal ordering: train on data up to time t, forecast t+1 through t+h, advance window. Never evaluate on data that was available during training.

import numpy as np

def walk_forward_evaluate(data, model_fn, initial_train=365, horizon=7, step=7):
    """Walk-forward cross-validation for time series."""
    results = {'mae': [], 'coverage_90': []}

    for start in range(initial_train, len(data) - horizon, step):
        train = data[:start]
        test = data[start:start + horizon]

        # Model returns point forecast and 90% prediction interval
        forecast, lower, upper = model_fn(train, horizon)

        mae = np.mean(np.abs(test - forecast))
        coverage = np.mean((test >= lower) & (test <= upper))

        results['mae'].append(mae)
        results['coverage_90'].append(coverage)

    print(f"Mean MAE: {np.mean(results['mae']):.3f}")
    print(f"90% PI coverage: {np.mean(results['coverage_90']):.3f}")
    # Good calibration: coverage should be ~0.90
    return results

If 90% prediction intervals cover less than 90% of observations, the model is overconfident. If coverage is much above 90%, intervals are too wide (the model is underconfident or the method is conservative).
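A concrete end-to-end run with a hypothetical seasonal-naive baseline (the data-generating process and the Gaussian 90% interval below are illustrative assumptions, not the chapter's model):

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(730)
# Simulated daily engagement: slow trend + weekly seasonality + noise
data = 100 + 0.02 * t + 10 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 3, 730)

def seasonal_naive(train, horizon, season=7, z90=1.645):
    """Forecast = value one season ago; Gaussian 90% interval from seasonal residuals."""
    forecast = np.array([train[len(train) - season + h % season]
                         for h in range(horizon)])
    resid = train[season:] - train[:-season]
    hw = z90 * resid.std()
    return forecast, forecast - hw, forecast + hw

maes, coverages = [], []
for start in range(365, len(data) - 7, 7):
    train, test = data[:start], data[start:start + 7]
    f, lo, hi = seasonal_naive(train, 7)
    maes.append(np.mean(np.abs(test - f)))
    coverages.append(np.mean((test >= lo) & (test <= hi)))

print(f"Mean MAE: {np.mean(maes):.2f}")
print(f"90% PI coverage: {np.mean(coverages):.2f}")  # close to 0.90 here
```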


Part V: Production ML Systems

Chapter 24: ML System Design

Exercise 24.2 (★★) Design the latency budget for a three-stage recommendation system (retrieval → ranking → re-ranking) that must respond within 200ms at p99.

Solution:

Total budget: 200ms at p99. Decomposition:

Stage                           | Budget | Justification
Feature fetch                   | 20 ms  | Redis feature store, pre-computed features
Retrieval (ANN)                 | 30 ms  | FAISS/HNSW over 200K items → 500 candidates
Ranking (transformer)           | 80 ms  | Score 500 candidates with neural model
Re-ranking (fairness/diversity) | 20 ms  | Light post-processing on top 50
Network overhead                | 30 ms  | Load balancer, serialization, deserialization
Safety buffer                   | 20 ms  | For tail latency spikes
Total                           | 200 ms |

Key design decisions: (1) Retrieval must use ANN, not brute force. (2) Ranking model size is constrained by the 80ms budget — limits model depth. (3) Features must be pre-computed and cached, not computed on-demand. (4) Measure p99, not mean — tail latency matters for user experience.


Chapter 25: Data Infrastructure

Exercise 25.4 (★★★) Implement a point-in-time join that correctly prevents data leakage when constructing training features. Demonstrate the bug that occurs without it.

Solution:

import pandas as pd
import numpy as np

# Events table: user actions with timestamps
events = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'timestamp': pd.to_datetime(['2024-01-01', '2024-01-05',
                                  '2024-01-10', '2024-01-02', '2024-01-08']),
    'action_count_7d': [3, 7, 12, 2, 5]  # Feature: actions in last 7 days
})

# Labels: engagement at prediction time
labels = pd.DataFrame({
    'user_id': [1, 2],
    'prediction_time': pd.to_datetime(['2024-01-06', '2024-01-03']),
    'engaged': [1, 0]
})

# WRONG: naive join takes the latest feature value (may be from the future!)
wrong_features = labels.merge(
    events.sort_values('timestamp').groupby('user_id').last().reset_index(),
    on='user_id'
)
# User 1 gets action_count_7d=12 (from Jan 10), but prediction is Jan 6!

# CORRECT: point-in-time join
def point_in_time_join(labels_df, features_df, keys, time_col_label,
                       time_col_feature):
    """Join features as-of the label timestamp."""
    result = []
    for _, label_row in labels_df.iterrows():
        mask = (features_df[keys[0]] == label_row[keys[0]]) & \
               (features_df[time_col_feature] <= label_row[time_col_label])
        valid = features_df[mask].sort_values(time_col_feature)
        if len(valid) > 0:
            feature_row = valid.iloc[-1]
            result.append({**label_row.to_dict(), **feature_row.to_dict()})
    return pd.DataFrame(result)

correct_features = point_in_time_join(
    labels, events, ['user_id'], 'prediction_time', 'timestamp'
)
# User 1 correctly gets action_count_7d=7 (from Jan 5, before prediction on Jan 6)

Without point-in-time joins, future information leaks into training data. The model appears to perform well offline but fails in production where future features are unavailable.
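In pandas, the same as-of semantics are available directly via pd.merge_asof, which is far faster than the row-wise loop above (both frames must be sorted on their time keys):

```python
import pandas as pd

events = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'timestamp': pd.to_datetime(['2024-01-01', '2024-01-05', '2024-01-10',
                                 '2024-01-02', '2024-01-08']),
    'action_count_7d': [3, 7, 12, 2, 5],
})
labels = pd.DataFrame({
    'user_id': [1, 2],
    'prediction_time': pd.to_datetime(['2024-01-06', '2024-01-03']),
    'engaged': [1, 0],
})

pit = pd.merge_asof(
    labels.sort_values('prediction_time'),
    events.sort_values('timestamp'),
    left_on='prediction_time', right_on='timestamp',
    by='user_id', direction='backward',  # latest feature at or before prediction
)
print(pit[['user_id', 'prediction_time', 'action_count_7d']])
# user 1 → 7 (Jan 5), user 2 → 2 (Jan 2), matching the manual join
```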


Chapter 27: ML Pipeline Orchestration

Exercise 27.3 (★★) Design an idempotent training pipeline task. Show what goes wrong without idempotency when a retry occurs after a partial failure.

Solution:

Without idempotency: if a training task writes the model checkpoint to a shared location but crashes before logging the evaluation metrics, a retry re-trains the model and overwrites the checkpoint. If another downstream task read the first (now-overwritten) checkpoint, we have an inconsistency.

Idempotent design:

import hashlib
import json
from pathlib import Path

def train_model_idempotent(config: dict, data_path: str, output_dir: str):
    """Idempotent training task: same inputs always produce same output path."""
    # Deterministic output path based on inputs
    config_hash = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()[:12]
    model_path = Path(output_dir) / f"model_{config_hash}"

    # Skip if already completed (idempotency check)
    if (model_path / "COMPLETE").exists():
        print(f"Model already trained at {model_path}")
        return model_path

    # Train to a temporary path, then atomic rename
    tmp_path = Path(output_dir) / f"model_{config_hash}_tmp"
    tmp_path.mkdir(parents=True, exist_ok=True)

    # ... training logic writes to tmp_path ...

    # Atomic completion: rename + marker file
    if model_path.exists():
        import shutil
        shutil.rmtree(model_path)
    tmp_path.rename(model_path)
    (model_path / "COMPLETE").touch()

    return model_path

Key principles: (1) deterministic output path from inputs, (2) skip if already complete, (3) write to temp then atomic rename, (4) completion marker file.


Chapter 28: ML Testing and Validation

Exercise 28.2 (★★) Design three behavioral tests (invariance, directional, minimum functionality) for a recommendation model.

Solution:

def test_invariance_username(model, user_features, item_features):
    """Changing the user's display name should not change recommendations."""
    recs_original = model.recommend(user_features, item_features, k=10)
    user_features_modified = user_features.copy()
    user_features_modified['display_name'] = 'RandomTestName123'
    recs_modified = model.recommend(user_features_modified, item_features, k=10)
    assert recs_original == recs_modified, "Recommendations changed with name change!"

def test_directional_genre_preference(model, user_features, item_features):
    """Adding 10 sci-fi interactions should increase sci-fi recommendations."""
    recs_before = model.recommend(user_features, item_features, k=50)
    sci_fi_count_before = sum(1 for r in recs_before if r['genre'] == 'sci-fi')

    user_features_enhanced = add_genre_interactions(user_features, 'sci-fi', 10)
    recs_after = model.recommend(user_features_enhanced, item_features, k=50)
    sci_fi_count_after = sum(1 for r in recs_after if r['genre'] == 'sci-fi')

    assert sci_fi_count_after > sci_fi_count_before, \
        f"Sci-fi recs did not increase: {sci_fi_count_before} → {sci_fi_count_after}"

def test_mft_cold_start(model, item_features):
    """Model must produce non-empty recommendations for a brand-new user."""
    cold_user = create_empty_user_profile()
    recs = model.recommend(cold_user, item_features, k=10)
    assert len(recs) == 10, f"Cold start user got {len(recs)} recs, expected 10"
    assert len(set(r['item_id'] for r in recs)) == 10, "Duplicate recommendations!"

Chapter 29: Continuous Training and Deployment

Exercise 29.3 (★★) Design an automatic rollback policy for a canary deployment. Specify which metrics to monitor and what thresholds trigger rollback.

Solution:

Metric                        | Source       | Threshold                  | Window | Action
Error rate (5xx)              | Serving logs | > 1% (baseline + 0.5%)     | 5 min  | Immediate rollback
p99 latency                   | Serving logs | > 250ms (125% of baseline) | 10 min | Immediate rollback
Click-through rate            | Event stream | < 90% of baseline          | 30 min | Alert, confirm rollback
Null prediction rate          | Model output | > 5%                       | 5 min  | Immediate rollback
Prediction distribution shift | Model output | PSI > 0.25 vs. shadow      | 60 min | Alert, investigate

The automatic rollback system should: (1) monitor canary and control populations separately, (2) use running statistical tests (not just thresholds) to account for natural variance, (3) require multiple metric violations for non-critical rollbacks to avoid false alarms, (4) trigger immediate rollback for safety-critical metrics (errors, latency) without waiting for statistical significance.


Chapter 30: Monitoring, Observability, and Incident Response

Exercise 30.2 (★★) Design a monitoring dashboard with four layers: business metrics, model metrics, data metrics, and system metrics. Specify alerts for each.

Solution:

Business layer: Revenue per session, conversion rate, user retention (7-day). Alert: any metric drops > 5% week-over-week.

Model layer: Prediction distribution (mean, std, percentiles), feature importance stability, calibration (ECE), accuracy on labeled subsets. Alert: prediction mean shifts > 2 standard deviations from rolling baseline; ECE > 0.05.

Data layer: Feature completeness, schema violations, freshness (time since last update), distributional shift (PSI per feature). Alert: any feature PSI > 0.25; data freshness > 2x expected cadence; schema violations > 0.

System layer: Serving latency (p50, p95, p99), throughput (requests/sec), GPU memory utilization, CPU utilization, error rates. Alert: p99 latency > SLA; error rate > 0.1%; GPU memory > 90%.

Critical design principle: business metrics detect problems that model metrics miss (and vice versa). A model may maintain accuracy while business outcomes degrade because user behavior shifted.
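The PSI referenced in the data-layer alert can be computed as follows; binning by baseline quantiles and the small clipping constant are conventional choices, not the only valid ones:

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a baseline and a current sample."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch out-of-range current values
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)
print(f"Same distribution:  {psi(baseline, rng.normal(0.0, 1, 10_000)):.3f}")  # ~0
print(f"Mean shifted +0.5σ: {psi(baseline, rng.normal(0.5, 1, 10_000)):.3f}")  # clearly elevated
```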


Part VI: Responsible and Rigorous Data Science

Chapter 31: Fairness in Machine Learning

Exercise 31.3 (★★★) Demonstrate the impossibility theorem: show that demographic parity, equalized odds, and calibration cannot be simultaneously satisfied except in trivial cases.

Solution:

import numpy as np

# Two groups: A (60% base rate) and B (40% base rate)
n = 10000
np.random.seed(42)

group = np.random.choice(['A', 'B'], n, p=[0.5, 0.5])
base_rate = np.where(group == 'A', 0.6, 0.4)
y_true = np.random.binomial(1, base_rate)

# Perfectly calibrated model (predicts true probability)
y_pred_prob = base_rate

# Check fairness criteria:
# 1. Calibration: P(Y=1 | score=s, G=g) = s for all s, g ✓ (by construction)

# 2. Demographic parity: P(score > t | G=A) = P(score > t | G=B)?
# For threshold t=0.5: P(positive | A) = 1.0, P(positive | B) = 0.0
# VIOLATED because base rates differ

# 3. Equalized odds: P(score > t | Y=1, G=A) = P(score > t | Y=1, G=B)?
# For threshold t=0.5: TPR_A = 1.0, TPR_B = 0.0
# VIOLATED because scores are deterministic by group

# Chouldechova (2017); Kleinberg, Mullainathan & Raghavan (2016):
# When base rates differ between groups, calibration is incompatible with
# equal FPR and equal FNR (equalized odds) except when the model is
# perfectly accurate or the base rates are equal.

This impossibility result means that organizations must choose which fairness criterion to prioritize based on the application context. There is no "universally fair" model when base rates differ.


Chapter 32: Privacy-Preserving Data Science

Exercise 32.2 (★★) Train a StreamRec model with differential privacy using Opacus at epsilon values of 1, 3, and 8. Measure the accuracy-privacy tradeoff.

Solution:

The typical pattern: at epsilon=1 (strong privacy), accuracy drops 5-15% relative to the non-private baseline. At epsilon=8 (weaker privacy), the drop is 1-3%. The tradeoff depends on dataset size (more data mitigates the noise impact), model complexity (simpler models are more robust to DP noise), and the specific task.

Key implementation details for DP-SGD with Opacus: (1) replace BatchNorm with GroupNorm (BatchNorm leaks information across samples), (2) clip per-sample gradients to bound sensitivity, (3) add calibrated Gaussian noise to gradients. The privacy accountant tracks the cumulative privacy cost across training epochs.
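A library-free sketch of the DP-SGD mechanics — per-sample gradient clipping plus Gaussian noise, items (2) and (3) above — on a logistic regression. This illustrates the mechanism only; it does not track a privacy budget the way Opacus's accountant does, and all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.normal(0, 1, (n, d))
w_true = rng.normal(0, 1, d)
y = rng.binomial(1, 1 / (1 + np.exp(-X @ w_true)))

def dp_sgd(X, y, clip=1.0, noise_mult=1.0, lr=0.1, epochs=20, batch=100, seed=1):
    """One DP-SGD loop: clip each per-sample gradient, add noise to the batch sum."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))
        for s in range(0, len(y), batch):
            b = order[s:s + batch]
            p = 1 / (1 + np.exp(-X[b] @ w))
            g = (p - y[b])[:, None] * X[b]                 # per-sample gradients
            norms = np.linalg.norm(g, axis=1, keepdims=True)
            g = g / np.maximum(1.0, norms / clip)          # clip: bounds sensitivity
            noise = rng.normal(0, noise_mult * clip, w.shape)  # calibrated noise
            w -= lr * (g.sum(axis=0) + noise) / len(b)
    return w

w_dp = dp_sgd(X, y)
acc = np.mean(((X @ w_dp) > 0) == y)
print(f"Training accuracy with DP noise: {acc:.3f}")
```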


Chapter 33: Rigorous Experimentation at Scale

Exercise 33.4 (★★★) Show that peeking at A/B test results inflates the false positive rate. Implement mSPRT as a valid alternative for continuous monitoring.

Solution:

import numpy as np
from scipy import stats

def simulate_peeking(n_sims=10000, n_samples=10000, peek_interval=100):
    """Show that peeking inflates false positive rate under the null."""
    false_positives = 0
    for _ in range(n_sims):
        control = np.random.normal(0, 1, n_samples)
        treatment = np.random.normal(0, 1, n_samples)  # H0: no effect

        for n in range(peek_interval, n_samples + 1, peek_interval):
            _, p_value = stats.ttest_ind(treatment[:n], control[:n])
            if p_value < 0.05:
                false_positives += 1
                break

    return false_positives / n_sims

actual_fpr = simulate_peeking()
print(f"FPR with peeking: {actual_fpr:.3f}")  # ~0.23 (should be 0.05!)

Peeking at 100 checkpoints inflates the FPR from 5% to approximately 23%. The mSPRT controls this by using a mixture likelihood ratio that accounts for continuous monitoring.
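A sketch of the mSPRT under the same null simulation. It uses the standard normal-mixture likelihood ratio with a Gaussian mixing prior of variance tau2 over the effect; the prior variance and peeking schedule are illustrative choices:

```python
import numpy as np

def msprt_fpr(n_sims=2000, n_samples=10000, peek=100, alpha=0.05,
              tau2=1.0, seed=0):
    """False positive rate under H0 when monitoring with the mSPRT at every peek."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        # Per-pair difference z_i = treatment_i - control_i; Var(z_i) = 2 under H0
        z = rng.normal(0, 1, n_samples) - rng.normal(0, 1, n_samples)
        v = 2.0
        ns = np.arange(peek, n_samples + 1, peek)
        zbar = np.cumsum(z)[ns - 1] / ns
        # Mixture likelihood ratio for N(0, v) vs N(theta, v), theta ~ N(0, tau2)
        lam = np.sqrt(v / (v + ns * tau2)) * np.exp(
            ns**2 * tau2 * zbar**2 / (2 * v * (v + ns * tau2)))
        if np.any(lam >= 1 / alpha):  # reject the first time Lambda crosses 1/alpha
            rejections += 1
    return rejections / n_sims

print(f"FPR with mSPRT monitoring: {msprt_fpr():.3f}")  # stays at or below alpha
```

Because the mixture likelihood ratio is a nonnegative martingale under H0, Ville's inequality bounds the probability it ever crosses 1/alpha by alpha, no matter how often you peek.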


Chapter 34: Uncertainty Quantification

Exercise 34.3 (★★) Apply temperature scaling to calibrate a StreamRec recommendation model. Measure ECE before and after.

Solution:

import numpy as np
from scipy.optimize import minimize_scalar

def ece(y_true, y_prob, n_bins=15):
    """Expected Calibration Error."""
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    ece_value = 0.0
    for i in range(n_bins):
        mask = (y_prob > bin_boundaries[i]) & (y_prob <= bin_boundaries[i+1])
        if mask.sum() > 0:
            avg_confidence = y_prob[mask].mean()
            avg_accuracy = y_true[mask].mean()
            ece_value += mask.sum() * abs(avg_confidence - avg_accuracy)
    return ece_value / len(y_true)

def temperature_scale(logits, y_true):
    """Find optimal temperature T that minimizes NLL on validation set."""
    def nll(T):
        scaled = 1 / (1 + np.exp(-logits / T))
        return -np.mean(y_true * np.log(scaled + 1e-8) +
                       (1 - y_true) * np.log(1 - scaled + 1e-8))
    result = minimize_scalar(nll, bounds=(0.1, 10.0), method='bounded')
    return result.x

# Simulated overconfident model
np.random.seed(42)
logits = np.random.randn(5000) * 2  # Raw logits
y_prob_uncalibrated = 1 / (1 + np.exp(-logits))
y_true = np.random.binomial(1, 1 / (1 + np.exp(-logits * 0.5)))

T_opt = temperature_scale(logits, y_true)
y_prob_calibrated = 1 / (1 + np.exp(-logits / T_opt))

print(f"Temperature: {T_opt:.2f}")                      # ~1.85
print(f"ECE before: {ece(y_true, y_prob_uncalibrated):.4f}")  # ~0.07
print(f"ECE after:  {ece(y_true, y_prob_calibrated):.4f}")    # ~0.02

Temperature scaling is a single-parameter post-hoc calibration method. T > 1 indicates the model was overconfident (most common); T < 1 indicates underconfidence.


Chapter 35: Interpretability and Explainability

Exercise 35.2 (★★) Compute SHAP values for a credit scoring model and generate a natural language adverse action notice.

Solution:

The adverse action notice must comply with ECOA/FCRA: when credit is denied, the lender must provide the top reasons for denial. SHAP values provide a principled ranking: the features with the largest negative SHAP contributions (pushing the score below the threshold) are the top adverse action reasons.

Template: "Your application was not approved. The primary factors in this decision were: (1) {feature_1} — your {value_description_1}, (2) {feature_2} — your {value_description_2}, ..."

Map SHAP features to consumer-facing language: "debt_to_income_ratio" → "Your ratio of monthly debt payments to income"; "delinquency_count_24m" → "The number of late payments on your credit report in the past 24 months."
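The ranking-and-templating step can be sketched as follows; the SHAP values and feature mapping below are illustrative placeholders, not the output of a real credit model:

```python
import numpy as np

# Hypothetical SHAP values for one declined applicant
feature_names = ['debt_to_income_ratio', 'delinquency_count_24m',
                 'credit_history_length', 'utilization_rate']
shap_values = np.array([-0.42, -0.31, 0.10, -0.05])
consumer_language = {
    'debt_to_income_ratio': 'Your ratio of monthly debt payments to income',
    'delinquency_count_24m': 'The number of late payments on your credit '
                             'report in the past 24 months',
    'credit_history_length': 'The length of your credit history',
    'utilization_rate': 'The share of your available credit currently in use',
}

# Top adverse action reasons: the largest NEGATIVE SHAP contributions
order = np.argsort(shap_values)
reasons = [consumer_language[feature_names[i]]
           for i in order if shap_values[i] < 0][:2]
notice = ("Your application was not approved. The primary factors in this "
          "decision were: "
          + "; ".join(f"({k + 1}) {r}" for k, r in enumerate(reasons)) + ".")
print(notice)
```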


Part VII: Leadership and Synthesis

Chapter 37: Reading Research Papers

Exercise 37.2 (★★) Apply the three-pass reading strategy to the "Attention Is All You Need" paper. Identify the key claims, evaluate the experimental methodology, and list what the paper does NOT show.

Solution:

Pass 1 (5 minutes): Title, abstract, introduction, section headings, conclusion. Key claim: a model based solely on attention mechanisms achieves state-of-the-art translation quality while being more parallelizable than RNNs.

Pass 2 (30 minutes): Figures, equations, experiments. Architecture: encoder-decoder with multi-head self-attention, positional encoding, layer norm, feed-forward blocks. Key innovation: replacing recurrence with self-attention. Results: BLEU scores on WMT 2014 EN-DE (28.4) and EN-FR (41.0).

Pass 3 (critical evaluation): Strengths — clear architectural contribution, strong results, ablation study (Table 3 tests number of heads, dimension, dropout). Weaknesses — (1) only evaluated on translation (claims are broader than evidence), (2) no analysis of what the model learns (attention patterns barely examined), (3) computational cost comparison uses FLOPs but not wall-clock time on actual hardware, (4) positional encoding choice (sinusoidal) is not well justified beyond "it works."

What the paper does NOT show: generalization to other tasks, scaling behavior, interpretability of attention weights, performance on low-resource languages.


Chapter 38: The Staff Data Scientist

Exercise 38.3 (★★) Write an RFC (Request for Comments) proposing the adoption of a new model architecture for a recommendation system. Include the problem statement, proposed solution, alternatives considered, migration plan, and risks.

Solution:

The RFC should follow this structure: (1) Problem — current model latency exceeds SLA during peak hours; (2) Proposed solution — replace deep cross network with a lighter two-tower model with cached embeddings; (3) Alternatives — model distillation (rejected: still too slow for retrieval), caching current model outputs (rejected: staleness issues), horizontal scaling (rejected: cost); (4) Migration plan — shadow mode for 2 weeks, canary at 10% for 1 week, progressive rollout over 2 weeks; (5) Risks — potential regression in long-tail recommendation quality (mitigated by A/B test with statistical power for subgroup analysis); (6) Success criteria — p99 latency < 150ms, no more than 1% regression in engagement metrics.


Chapter 39: Building and Leading a Data Science Organization

Exercise 39.2 (★★) Design an ROI framework for measuring the business impact of a data science team. Address both direct (model-driven revenue) and indirect (decision quality, speed) contributions.

Solution:

Direct ROI (measurable): Revenue from model-driven features (recommendation uplift × revenue per engagement), cost savings from automation (hours saved × labor cost), fraud/risk reduction (losses prevented vs. baseline).

Indirect ROI (estimated): Decision velocity (time from question to data-driven answer, before and after DS team), experiment throughput (experiments per quarter), insight adoption rate (% of DS recommendations implemented by product teams).

Framework: Track a portfolio of projects on a 2×2 matrix of (business impact × technical complexity). The healthy DS team has a mix: quick wins (high impact, low complexity), strategic bets (high impact, high complexity), and capability builders (lower immediate impact, builds reusable infrastructure). Avoid vanity projects (high complexity, low impact).

Reporting cadence: Monthly impact summaries tied to business metrics, not model metrics. Leadership cares about revenue, cost, and risk — not AUC.