Quiz — Chapter 4: The Math Behind ML

Test your understanding of the mathematical foundations covered in this chapter. Answers are in Appendix B.


Question 1. StreamFlow's monthly churn rate is 8.2%. If you randomly select 50 subscribers, which probability distribution best models the number who will churn?

  • (a) Normal
  • (b) Poisson
  • (c) Binomial
  • (d) Uniform

Question 2. A readmission prediction model at Metro General has the following characteristics: 15% base readmission rate, 80% sensitivity (true positive rate), and 10% false positive rate. A patient is flagged as high-risk by the model. What is the posterior probability of readmission?

  • (a) 80%
  • (b) 58.5%
  • (c) 41.4%
  • (d) 15%

Show your work using Bayes' theorem.
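After working through Bayes' theorem by hand, you can check your arithmetic with a short snippet (all rates below come straight from the question):

```python
# Bayes' theorem check for Question 2.
# P(readmit | flagged) = P(flagged | readmit) * P(readmit) / P(flagged)
prior = 0.15        # base readmission rate
sensitivity = 0.80  # P(flagged | readmit), the true positive rate
fpr = 0.10          # P(flagged | no readmit), the false positive rate

# Law of total probability: overall chance a patient is flagged.
p_flagged = sensitivity * prior + fpr * (1 - prior)

posterior = sensitivity * prior / p_flagged
print(f"P(readmit | flagged) = {posterior:.3f}")
```

Run it only after committing to an answer.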


Question 3. The StreamFlow feature matrix has shape (2400000, 47). What is the shape of $\mathbf{X}^T \mathbf{X}$?

  • (a) (2400000, 2400000)
  • (b) (47, 47)
  • (c) (47, 2400000)
  • (d) (2400000, 47)
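The shape rule is easiest to see on a toy matrix. This sketch uses a made-up 3×2 matrix rather than the 2400000×47 one from the question; the pattern (n, d) → (?, ?) is the same:

```python
# Toy illustration of the shape of X^T X using a small 3x2 matrix.
X = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

n, d = len(X), len(X[0])

# Each entry of X^T X is a dot product between two columns of X,
# so the product has one row and one column per feature.
XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(d)]
       for a in range(d)]

print(len(XtX), len(XtX[0]))  # shape of X^T X
```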

Question 4. In gradient descent, the update rule is $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \nabla L(\mathbf{w}_t)$. If the gradient at the current point is $[0.5, -0.3, 0.1]$ and the learning rate is 0.01, what is the update to the weights?

  • (a) Add [0.005, -0.003, 0.001]
  • (b) Subtract [0.005, -0.003, 0.001]
  • (c) Add [0.5, -0.3, 0.1]
  • (d) Subtract [0.5, -0.3, 0.1]

Write the new weights if $\mathbf{w}_t = [1.0, 2.0, 3.0]$.
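To verify your new weights, the update rule translates directly into code (values taken from the question):

```python
# One gradient-descent step for Question 4.
w = [1.0, 2.0, 3.0]       # current weights w_t
grad = [0.5, -0.3, 0.1]   # gradient of the loss at w_t
alpha = 0.01              # learning rate

# Update rule: w_{t+1} = w_t - alpha * grad(L(w_t))
w_new = [wi - alpha * gi for wi, gi in zip(w, grad)]
print(w_new)
```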


Question 5. Your gradient descent training shows the following loss values over iterations: 2.5, 2.4, 2.3, 2.5, 2.8, 3.2, 4.1, 6.0, 15.3, NaN. What is the most likely cause?

  • (a) The loss function is non-convex
  • (b) The learning rate is too high
  • (c) The features are not scaled
  • (d) The model has too many parameters
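The loss pattern in the question is easy to reproduce on a toy problem. The quadratic loss and the two learning rates below are illustrative choices, not values from the chapter:

```python
# Gradient descent on L(w) = w^2, whose gradient is 2w.
def losses(alpha, steps=8, w=2.0):
    """Return the loss at each step for a given learning rate."""
    out = []
    for _ in range(steps):
        out.append(w * w)
        w = w - alpha * (2 * w)  # gradient-descent update
    return out

print(losses(0.1))  # a modest step size: loss shrinks every step
print(losses(1.1))  # an overly large step size: loss blows up
```

Compare the two printed sequences to the one in the question before settling on an answer.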

Question 6. Which loss function should you use for StreamFlow's binary churn classification problem?

  • (a) Mean Squared Error (MSE)
  • (b) Mean Absolute Error (MAE)
  • (c) Cross-entropy (log-loss)
  • (d) Hinge loss

Explain in one sentence why the other options are less appropriate.


Question 7. A churn model predicts P(churn) = 0.95 for a customer who actually churns. Another model predicts P(churn) = 0.60 for the same customer. What is the log-loss contribution for each prediction?

  • Model A (p=0.95): ___
  • Model B (p=0.60): ___

Which model receives a larger penalty, and by what factor?
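Once you have written out the log-loss formula for a positive example (true label $y = 1$), the two contributions can be checked numerically:

```python
import math

# Log-loss contribution for a customer who actually churned (y = 1):
# loss = -log(p), where p is the predicted churn probability.
loss_a = -math.log(0.95)  # Model A
loss_b = -math.log(0.60)  # Model B

print(f"Model A: {loss_a:.4f}")
print(f"Model B: {loss_b:.4f}")
print(f"penalty ratio: {loss_b / loss_a:.1f}x")
```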


Question 8. What is the key geometric difference between L1 (Lasso) and L2 (Ridge) regularization?

  • (a) L1 uses a circular constraint region; L2 uses a diamond
  • (b) L1 uses a diamond constraint region; L2 uses a circle
  • (c) L1 penalizes large weights more; L2 penalizes small weights more
  • (d) L1 and L2 produce identical results with different computation methods

Why does this geometric difference cause L1 to produce sparse (zero) weights while L2 produces small-but-nonzero weights?


Question 9. You are training a linear regression with gradient descent. After 5,000 iterations, the loss has plateaued at 0.45. A colleague suggests scaling the features. Which of the following is true?

  • (a) Scaling cannot affect the final MSE because it is a linear transformation
  • (b) Scaling may allow gradient descent to converge to a lower MSE
  • (c) Scaling will definitely reduce the MSE
  • (d) Scaling only matters for regularized models

Question 10. A Poisson distribution with $\lambda = 2.5$ models the number of support tickets per customer per month. What is the probability that a customer files exactly 0 tickets?

  • (a) $e^{-2.5} \approx 0.082$
  • (b) $1 - e^{-2.5} \approx 0.918$
  • (c) $2.5^0 / 0! \approx 1.0$
  • (d) $0$
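You can confirm your chosen option by evaluating the Poisson probability mass function, $P(k) = e^{-\lambda} \lambda^k / k!$, directly:

```python
import math

# Poisson pmf: P(k) = exp(-lam) * lam**k / k!
lam = 2.5  # average tickets per customer per month
k = 0      # exactly zero tickets
p_zero = math.exp(-lam) * lam**k / math.factorial(k)

print(f"P(0 tickets) = {p_zero:.4f}")
```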

Question 11. The dot product $\mathbf{w} \cdot \mathbf{x}$ in a linear model equals 2.3. After applying a sigmoid function, the output is approximately 0.91. What does 0.91 represent in a churn classification context?

  • (a) The model's MSE for this customer
  • (b) The predicted probability that this customer will churn
  • (c) The customer's feature importance score
  • (d) The gradient of the loss at this point
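The 0.91 figure itself is quick to reproduce from the dot product given in the question:

```python
import math

# Sigmoid squashes the linear score w . x into the interval (0, 1).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

score = 2.3  # the dot product w . x from the question
print(f"sigmoid({score}) = {sigmoid(score):.3f}")
```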

Question 12. You fit a Lasso model (alpha=0.5) on a dataset with 30 features. The model sets 18 weights to exactly zero. You then increase alpha to 2.0. What do you expect?

  • (a) Fewer than 18 zero weights
  • (b) Exactly 18 zero weights
  • (c) More than 18 zero weights
  • (d) All 30 weights become zero
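A toy sketch of soft-thresholding, the operation behind Lasso's exact zeros, shows the effect of raising alpha. (With orthonormal features the Lasso solution is exactly soft-thresholded least squares; the weight values below are made up for illustration, not taken from the dataset in the question.)

```python
# Soft-thresholding: shrink each weight toward zero by alpha,
# clipping at zero. This is what produces Lasso's exact zeros.
def soft_threshold(w, alpha):
    return [max(abs(wi) - alpha, 0.0) * (1 if wi > 0 else -1) for wi in w]

# Hypothetical unregularized weights.
w_ols = [3.1, -0.4, 1.7, 0.2, -2.6, 0.9]

for alpha in (0.5, 2.0):
    w_lasso = soft_threshold(w_ols, alpha)
    n_zero = sum(1 for wi in w_lasso if wi == 0.0)
    print(f"alpha={alpha}: {n_zero} zero weights -> {w_lasso}")
```

Watch how the zero count moves as alpha grows, then answer the question.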

Question 13. True or False (explain each briefly):

a) The gradient of a loss function at a minimum is always exactly zero.

b) If a loss function is convex, gradient descent is guaranteed to find the global minimum (given a suitable learning rate and sufficient iterations).

c) MSE and log-loss will rank the same set of binary classification models in the same order (best to worst).

d) L2 regularization is equivalent to placing a Gaussian prior on the weights in Bayesian terms.

e) The learning rate in gradient descent should always be as large as possible without causing divergence.


Question 14. Match each scenario to the most appropriate loss function:

  Scenario                                                 Loss function
  1. Predicting house prices                               (a) Log-loss
  2. Classifying emails as spam/not-spam                   (b) MSE
  3. Finding the maximum-margin separator between classes  (c) Hinge loss
  4. Predicting house prices with many outliers            (d) MAE / Huber

Question 15. A data scientist computes the following: $\nabla L = [0.0, 0.0, 0.0]$. They conclude they have found the optimal weights.

a) Is this conclusion necessarily correct? Under what condition is it guaranteed to be correct?

b) Name two situations where a zero gradient does not indicate the global optimum.

c) For the loss functions used in linear regression and logistic regression, is the zero-gradient conclusion valid? Why or why not?
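A one-line counterexample is worth keeping in mind while answering part (b). The cubic below is a standard illustration, not a function from the chapter:

```python
# f(x) = x**3 has gradient 3*x**2, which is zero at x = 0,
# yet x = 0 is neither a minimum nor a maximum (it is an
# inflection point / 1-D analogue of a saddle).
def f(x):
    return x ** 3

def grad_f(x):
    return 3 * x ** 2

print(grad_f(0.0))               # gradient vanishes at x = 0 ...
print(f(-0.1), f(0.0), f(0.1))   # ... but f keeps decreasing to the left
```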