Exercises — Chapter 4: The Math Behind ML


Exercise 4.1: Distribution Identification (Conceptual)

For each of the following scenarios, identify the most appropriate probability distribution (normal, binomial, or Poisson) and justify your choice.

a) The number of customer support chats a StreamFlow subscriber initiates per week.

b) The distribution of monthly viewing hours across all StreamFlow subscribers.

c) Out of 50 subscribers who received a retention offer, how many will still be active in 30 days (assume each has an independent 70% probability of staying).

d) The number of server errors that occur per hour on StreamFlow's recommendation engine.

e) A hospital patient's blood pressure reading.


Exercise 4.2: Bayes' Theorem — StreamFlow Churn Alert (Conceptual + Math)

StreamFlow's early warning system flags subscribers who might churn. You know the following:

  • 8.2% of subscribers actually churn each month (the base rate)
  • When a subscriber is about to churn, the system correctly flags them 75% of the time (sensitivity)
  • When a subscriber is not about to churn, the system incorrectly flags them 12% of the time (false positive rate)

a) A subscriber gets flagged. What is the probability they will actually churn? Use Bayes' theorem and show your work.

b) Does this result surprise you? Why is the posterior probability so much lower than the 75% sensitivity?

c) Suppose StreamFlow improves the system so the false positive rate drops to 3%. Recompute the posterior. How much does this improvement matter?
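Once you have worked parts (a) and (c) by hand, you can check the arithmetic with a short helper (the function name `bayes_posterior` is illustrative, not from the chapter):

```python
def bayes_posterior(prior, sensitivity, false_positive_rate):
    """P(churn | flagged) via Bayes' theorem.

    prior: P(churn); sensitivity: P(flag | churn);
    false_positive_rate: P(flag | no churn).
    """
    numerator = sensitivity * prior
    evidence = numerator + false_positive_rate * (1 - prior)
    return numerator / evidence

print(bayes_posterior(0.082, 0.75, 0.12))  # part (a)
print(bayes_posterior(0.082, 0.75, 0.03))  # part (c)
```

Use it only to verify your hand-worked answers, not as a substitute for them.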


Exercise 4.3: Matrix Dimensions (Conceptual)

StreamFlow has 2.4 million subscribers and 47 engineered features.

a) What is the shape of the feature matrix $\mathbf{X}$?

b) What is the shape of the weight vector $\mathbf{w}$ for a linear model?

c) What is the shape of the prediction vector $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$?

d) What is the shape of $\mathbf{X}^T\mathbf{X}$? What does this matrix represent conceptually?

e) A colleague says "I tried to multiply $\mathbf{X}^T$ by $\mathbf{y}$ and got a dimension error." The target vector $\mathbf{y}$ has shape (2400000,). What is the shape of $\mathbf{X}^T$, and what shape would $\mathbf{X}^T \mathbf{y}$ produce?
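You can sanity-check your shape reasoning on a toy matrix; here 5 subscribers and 3 features stand in for 2,400,000 and 47:

```python
import numpy as np

# Toy stand-in: 5 subscribers, 3 features (instead of 2.4M x 47)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # feature matrix, shape (n, d)
w = rng.normal(size=3)        # weight vector, shape (d,)
y = rng.normal(size=5)        # target vector, shape (n,)

print((X @ w).shape)    # (5,)   -> one prediction per subscriber
print((X.T @ X).shape)  # (3, 3) -> feature-by-feature matrix
print((X.T @ y).shape)  # (3,)   -> one entry per feature
```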


Exercise 4.4: Dot Product Interpretation (Conceptual + Code)

Given the following weight vector and two customer feature vectors:

import numpy as np

weights = np.array([-0.015, -0.025, 0.18, 0.12, -0.08])
# Features: [tenure_months, usage_hours, support_tickets, plan_downgrades, logins_per_week]

customer_a = np.array([36, 55, 0, 0, 5])
customer_b = np.array([3, 8, 4, 2, 1])

a) Compute the dot product (churn score) for each customer by hand. Show your work.

b) Verify your results with numpy.

c) Which features contribute most to Customer B's high churn score? Break down the contribution of each feature (weight times feature value).

d) A product manager asks: "Can we reduce Customer B's churn risk by giving them a free month?" Which feature would this intervention affect, and by how much would the score change?
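For part (b), a sketch like the following checks your hand arithmetic (the arrays are repeated from above so the cell runs on its own):

```python
import numpy as np

weights = np.array([-0.015, -0.025, 0.18, 0.12, -0.08])
customer_a = np.array([36, 55, 0, 0, 5])
customer_b = np.array([3, 8, 4, 2, 1])

# Part (b): churn scores as dot products
print(weights @ customer_a)
print(weights @ customer_b)

# Part (c): per-feature contributions (weight * feature value)
print(weights * customer_b)
```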


Exercise 4.5: Gradient Descent Step-by-Step (Math + Code)

Consider a simple loss function with one parameter: $L(w) = (w - 5)^2 + 3$.

a) What is the minimum value of $L$, and at what value of $w$ does it occur?

b) Compute the derivative $\frac{dL}{dw}$.

c) Starting at $w_0 = 0$ with learning rate $\alpha = 0.1$, compute the first 5 gradient descent steps by hand. Fill in this table:

Step | $w_t$ | $L(w_t)$ | $\frac{dL}{dw}$ | $w_{t+1}$
-----|-------|----------|-----------------|----------
0    | 0     |          |                 |
1    |       |          |                 |
2    |       |          |                 |
3    |       |          |                 |
4    |       |          |                 |

d) Write numpy code to verify your hand calculations and continue for 50 iterations. Plot $w_t$ vs. iteration and $L(w_t)$ vs. iteration.
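One possible starting point for part (d) is sketched below; the plotting calls (matplotlib) are left to you:

```python
import numpy as np

def loss(w):
    return (w - 5) ** 2 + 3

def grad(w):
    return 2 * (w - 5)

alpha, w = 0.1, 0.0
history = [(w, loss(w))]
for _ in range(50):
    w = w - alpha * grad(w)        # one gradient descent step
    history.append((w, loss(w)))

ws, losses = zip(*history)
print(ws[:3])      # first iterates: w moves from 0 toward 5
print(losses[-1])  # should be very close to the minimum, L = 3
```

The first few iterates should match your hand-filled table exactly.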


Exercise 4.6: Learning Rate Experiment (Code)

Using the gradient descent implementation from the chapter, run experiments on the synthetic StreamFlow data (the data-generation code is reproduced below for convenience).

import numpy as np
from sklearn.preprocessing import StandardScaler

# Generate data
rng = np.random.default_rng(42)
n = 200
tenure = rng.uniform(1, 60, n)
usage = rng.normal(40, 15, n)
tickets = rng.poisson(0.5, n)
y = -0.02 * tenure - 0.03 * usage + 0.15 * tickets + rng.normal(0, 0.1, n)
X = StandardScaler().fit_transform(np.column_stack([tenure, usage, tickets]))
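If you don't have the chapter's implementation at hand, a minimal batch gradient descent on MSE along these lines will do (the function name, return values, and lack of an intercept term are choices of this sketch, not the chapter's):

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, max_iter=10_000, tol=1e-8):
    """Minimal batch gradient descent on MSE; returns weights and loss history."""
    n, d = X.shape
    w = np.zeros(d)
    losses = []
    for _ in range(max_iter):
        residual = y - X @ w
        losses.append(np.mean(residual ** 2))
        grad = -(2 / n) * X.T @ residual   # gradient of MSE
        w = w - lr * grad
        if len(losses) > 1 and abs(losses[-2] - losses[-1]) < tol:
            break                          # converged under the tolerance
    return w, losses
```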

a) Run gradient descent with learning rates [0.001, 0.01, 0.1, 0.5, 1.0]. For each, record: (i) the number of iterations to converge (tolerance $10^{-8}$, max 10,000), and (ii) the final MSE.

b) Which learning rate converges fastest? Which diverges?

c) Plot all five loss curves on the same axes (use a log scale for the y-axis). What visual pattern distinguishes a learning rate that is too large?

d) Implement a simple learning rate schedule: start with $\alpha = 0.5$ and multiply by 0.999 after each iteration. Does this converge faster than any fixed learning rate?


Exercise 4.7: Log-Loss Sensitivity (Math + Code)

a) Compute the log-loss for the following predictions and fill in the table:

Predicted P(churn) | Actual | Log-loss contribution
-------------------|--------|----------------------
0.99               | 1      |
0.90               | 1      |
0.50               | 1      |
0.10               | 1      |
0.01               | 1      |

b) Plot log-loss contribution vs. predicted probability for the case where actual = 1. What is the shape of this curve?

c) Why does log-loss go to infinity as the predicted probability approaches 0 for a positive example? What practical problem does this create, and how do implementations handle it?

d) A model outputs the following predictions for 6 customers:

y_true = np.array([1, 0, 1, 0, 1, 0])
y_pred = np.array([0.8, 0.3, 0.6, 0.1, 0.55, 0.45])

Compute the log-loss. Then change the prediction for customer 6 from 0.45 to 0.05. How much does the overall log-loss change? Why is the impact so large (or small)?
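A sketch for part (d) follows; note the `eps` clipping, which is also the standard answer to the practical problem raised in part (c):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy; predictions are clipped away from 0 and 1
    so that log() never sees an exact 0."""
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 0, 1, 0])
y_pred = np.array([0.8, 0.3, 0.6, 0.1, 0.55, 0.45])
print(log_loss(y_true, y_pred))

y_pred_alt = y_pred.copy()
y_pred_alt[5] = 0.05   # customer 6: 0.45 -> 0.05
print(log_loss(y_true, y_pred_alt))
```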


Exercise 4.8: MSE vs. Log-Loss Showdown (Code)

Create a synthetic binary classification problem:

rng = np.random.default_rng(42)
n = 500
X = rng.normal(0, 1, (n, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, n) > 0).astype(float)

a) Fit two models:

  • A linear regression (which minimizes MSE), using its predictions as probabilities (clipped to [0.01, 0.99])
  • A logistic regression (which minimizes log-loss)

b) For each model, compute: MSE, log-loss, accuracy (threshold 0.5), and the range of predicted probabilities (min, max).

c) Which model produces better-calibrated probabilities? Plot a reliability diagram: bucket predictions into 10 bins, and for each bin, plot the mean predicted probability vs. the actual positive rate.

d) Explain in 2-3 sentences why MSE-trained models tend to produce predictions clustered near 0.5 while log-loss-trained models produce more extreme predictions.
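The binning for the reliability diagram in part (c) can be sketched like this (the helper name is illustrative); plot `mean_pred` against `frac_pos` and compare to the diagonal:

```python
import numpy as np

def reliability_bins(y_true, y_prob, n_bins=10):
    """Mean predicted probability and actual positive rate per bin."""
    edges = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    mean_pred, frac_pos = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():                       # skip empty bins
            mean_pred.append(y_prob[mask].mean())
            frac_pos.append(y_true[mask].mean())
    return np.array(mean_pred), np.array(frac_pos)
```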


Exercise 4.9: L1 vs. L2 Feature Selection (Code)

Generate a dataset with 20 features, where only 5 are actually predictive:

rng = np.random.default_rng(42)
n = 300
X = rng.normal(0, 1, (n, 20))
# Only features 0, 3, 7, 12, 18 matter
true_weights = np.zeros(20)
true_weights[[0, 3, 7, 12, 18]] = [2.0, -1.5, 1.0, -0.8, 0.5]
y = X @ true_weights + rng.normal(0, 0.5, n)

a) Fit a Ridge model with alpha=1.0. How many weights are exactly zero? How many are close to zero (absolute value < 0.01)?

b) Fit a Lasso model with alpha=0.1. How many weights are exactly zero? Which features survived?

c) Do the surviving Lasso features match the 5 truly predictive features? Which true features, if any, were incorrectly zeroed out?

d) Sweep alpha from 0.001 to 10.0 (use np.logspace(-3, 1, 50)) for both Ridge and Lasso. For each alpha, count the number of non-zero coefficients. Plot "number of non-zero weights vs. alpha" for both models on the same axes. What story does this plot tell?
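For the sweep in part (d), a small helper keeps the loop tidy (the function name and the near-zero tolerance are choices of this sketch):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def count_nonzero_coefs(model_cls, alphas, X, y, tol=1e-10):
    """Fit one model per alpha and count coefficients that are not (near) zero."""
    counts = []
    for a in alphas:
        model = model_cls(alpha=a).fit(X, y)
        counts.append(int(np.sum(np.abs(model.coef_) > tol)))
    return counts
```

Call it once with `Ridge` and once with `Lasso` over `np.logspace(-3, 1, 50)`, then plot both count lists against alpha.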


Exercise 4.10: Gradient Descent with Regularization (Code)

Extend the gradient descent implementation from the chapter to support L2 regularization.

a) The gradient of the L2-regularized MSE loss is:

$$\nabla L_{\text{ridge}} = -(2/n) \mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) + 2\lambda\mathbf{w}$$

Implement this in a function gradient_descent_ridge(X, y, lambda_reg, lr, n_iter).

b) Run your implementation on the Exercise 4.6 data with lambda_reg values of 0, 0.01, 0.1, and 1.0. Compare the final weights.

c) Verify your results against sklearn.linear_model.Ridge with the corresponding alpha values. Do the weights match?

d) How does regularization affect convergence speed? Plot the loss curves for all four lambda_reg values on the same axes.
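One possible shape for part (a) is sketched below; it implements the gradient exactly as given and fits no intercept:

```python
import numpy as np

def gradient_descent_ridge(X, y, lambda_reg, lr=0.1, n_iter=1000):
    """Gradient descent on MSE + lambda * ||w||^2 (one possible version)."""
    n, d = X.shape
    w = np.zeros(d)
    losses = []
    for _ in range(n_iter):
        residual = y - X @ w
        losses.append(np.mean(residual ** 2) + lambda_reg * np.sum(w ** 2))
        grad = -(2 / n) * X.T @ residual + 2 * lambda_reg * w
        w = w - lr * grad
    return w, losses
```

A hint for part (c): sklearn's `Ridge` penalizes $\|\mathbf{w}\|^2$ against the *unnormalized* sum of squared errors, so its `alpha` corresponds to `n * lambda_reg` under the gradient above; compare with `Ridge(alpha=n * lambda_reg, fit_intercept=False)`.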


Exercise 4.11: The Metro General Bayes Chain (Conceptual + Code)

A patient at Metro General is discharged. You will update the readmission probability three times as new information arrives.

Start: Base readmission rate = 15% (prior).

Update 1: The patient has diabetes. Among readmitted patients, 40% have diabetes. Among non-readmitted patients, 15% have diabetes.

Update 2: Using the posterior from Update 1 as the new prior, the patient's discharge HbA1c is elevated. Among readmitted diabetic patients, 70% have elevated HbA1c. Among non-readmitted diabetic patients, 30% have elevated HbA1c.

Update 3: Using the posterior from Update 2 as the new prior, the patient has strong family support at home. Among readmitted patients (with diabetes and elevated HbA1c), 25% have strong family support. Among non-readmitted patients (with same conditions), 60% have strong support.

a) Compute the posterior after each update. Show your work.

b) Write numpy code to automate the sequential Bayes updates.

c) Does the order of the updates matter? Swap the order of Updates 1 and 2 and verify the final posterior is the same (it should be, assuming conditional independence given the class).

d) Plot the posterior probability after each update as a bar chart. Which piece of evidence had the largest impact?
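For part (b), the three updates reuse one Bayes step (the helper name is mine); work the numbers by hand first, then use this to check:

```python
import numpy as np

def bayes_update(prior, p_evidence_given_pos, p_evidence_given_neg):
    """One Bayes update: returns P(readmit | evidence)."""
    num = p_evidence_given_pos * prior
    return num / (num + p_evidence_given_neg * (1 - prior))

posterior = 0.15                   # base readmission rate (prior)
updates = [(0.40, 0.15),           # Update 1: diabetes
           (0.70, 0.30),           # Update 2: elevated HbA1c
           (0.25, 0.60)]           # Update 3: strong family support
for p_pos, p_neg in updates:
    posterior = bayes_update(posterior, p_pos, p_neg)
    print(round(posterior, 4))
```

For part (c), reorder the `updates` list and confirm the final posterior is unchanged.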


Exercise 4.12: Convexity Check (Conceptual + Code)

a) A function $f$ is convex if for any two points $x_1, x_2$ and any $t \in [0, 1]$:

$$f(t x_1 + (1-t) x_2) \leq t f(x_1) + (1-t) f(x_2)$$

In plain English, what does this mean geometrically?

b) For each of the following, determine whether the function is convex. Justify your answer.

  • $f(w) = w^2$
  • $f(w) = |w|$
  • $f(w) = w^2 \sin(w)$
  • $f(w) = \max(0, 1 - w)$ (the hinge loss for a single point)

c) Why does convexity matter for gradient descent? What guarantee does it provide that non-convex functions do not?

d) Write numpy code that empirically checks convexity for MSE loss on a simple linear regression problem. Generate 1000 random pairs of weight vectors, check the convexity condition for $t = 0.5$, and report how many pairs satisfy it.
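A sketch for part (d); the data dimensions and the $10^{-12}$ tolerance for floating-point equality are choices of this sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.3, 100)

def mse(w):
    return np.mean((y - X @ w) ** 2)

satisfied = 0
for _ in range(1000):
    w1, w2 = rng.normal(size=3), rng.normal(size=3)
    mid = 0.5 * w1 + 0.5 * w2
    # convexity condition at t = 0.5, with a small floating-point tolerance
    if mse(mid) <= 0.5 * mse(w1) + 0.5 * mse(w2) + 1e-12:
        satisfied += 1
print(satisfied)   # MSE is convex, so all 1000 pairs should satisfy it
```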


Exercise 4.13: Build Your Own Loss Function (Code + Conceptual)

StreamFlow's business team says: "We care much more about missing a churner (false negative) than about incorrectly flagging a loyal customer (false positive). A missed churner costs us \$200 in lifetime value. A false alarm costs us \$15 in unnecessary retention offers."

a) Define an asymmetric loss function that reflects these costs. Write it as a Python function.

b) Compute this loss for the following predictions:

y_true = np.array([1, 0, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0.7, 0.3, 0.4, 0.6, 0.8, 0.1, 0.2, 0.3])
# Threshold at 0.5 for classification

c) Compare with standard log-loss. Which model decisions does the asymmetric loss penalize more heavily?

d) How could you incorporate this asymmetric cost structure into a scikit-learn model without writing a custom loss function? (Hint: think about class_weight or sample_weight.)
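One way to write the loss for part (a) is as a total dollar cost of thresholded decisions (the constants come straight from the business team; the function name is illustrative):

```python
import numpy as np

COST_FN, COST_FP = 200.0, 15.0   # dollars: missed churner, false alarm

def asymmetric_cost(y_true, y_pred, threshold=0.5):
    """Total dollar cost: $200 per missed churner, $15 per false alarm."""
    y_hat = (y_pred >= threshold).astype(int)
    false_negatives = np.sum((y_true == 1) & (y_hat == 0))
    false_positives = np.sum((y_true == 0) & (y_hat == 1))
    return COST_FN * false_negatives + COST_FP * false_positives

y_true = np.array([1, 0, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0.7, 0.3, 0.4, 0.6, 0.8, 0.1, 0.2, 0.3])
print(asymmetric_cost(y_true, y_pred))
```

A probability-weighted variant (penalizing confident mistakes more) is also defensible; the thresholded version above matches the exercise's 0.5 cutoff.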