Chapter 8 Quiz: Supervised Learning — Regression


Multiple Choice

Question 1. What is the primary difference between classification and regression in supervised learning?

  • (a) Classification uses labeled data; regression does not.
  • (b) Classification predicts categorical outcomes; regression predicts continuous numerical values.
  • (c) Regression is more accurate than classification.
  • (d) Classification uses decision trees; regression uses linear models.

Question 2. In the linear regression equation y = b₀ + b₁x, what does the slope coefficient b₁ represent?

  • (a) The predicted value of y when x equals zero.
  • (b) The percentage of variance in y explained by x.
  • (c) The expected change in y for each one-unit increase in x.
  • (d) The strength of the correlation between x and y.
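
The slope's meaning can be verified numerically. A minimal sketch using ordinary least squares on synthetic, noise-free data (the data values here are illustrative):

```python
# Fit y = b0 + b1*x by ordinary least squares on synthetic data
# where the true relationship is y = 3 + 2x (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 + 2.0 * x for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# b1 = cov(x, y) / var(x); b0 = mean(y) - b1 * mean(x)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x


def pred(x):
    return b0 + b1 * x


# Moving x up by exactly one unit changes the prediction by exactly b1.
print(b0, b1)                    # recovers intercept 3.0 and slope 2.0
print(pred(5.0) - pred(4.0))     # equals b1
```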

Question 3. Tom builds a polynomial regression model with degree 15. His training R² is 0.97 and his test R² is 0.41. What is the most likely diagnosis?

  • (a) Underfitting — the model is too simple for the data.
  • (b) Overfitting — the model memorized training noise rather than learning the true pattern.
  • (c) Data leakage — future information was included in the training set.
  • (d) Multicollinearity — the features are too correlated with each other.

Question 4. Which regularization technique can set coefficients exactly to zero, effectively performing automatic feature selection?

  • (a) Ridge (L2) regularization
  • (b) Lasso (L1) regularization
  • (c) Elastic Net regularization
  • (d) Polynomial regularization

Question 5. What is multicollinearity, and why is it a problem in multiple regression?

  • (a) When the target variable has multiple modes; it prevents the model from converging.
  • (b) When predictor variables are highly correlated with each other; it makes individual coefficient estimates unreliable.
  • (c) When there are more features than observations; it causes overfitting.
  • (d) When the residuals are not normally distributed; it invalidates the model's predictions.

Question 6. In a Random Forest regression model, what is the purpose of training each tree on a different random sample of the data (bootstrap aggregation)?

  • (a) To reduce computational time by using smaller datasets.
  • (b) To ensure each tree learns a different subset of the feature space.
  • (c) To reduce variance by ensuring the trees are diverse, so their errors cancel out when averaged.
  • (d) To increase the training R² by giving each tree a unique perspective.

Question 7. Which of the following best describes gradient boosting?

  • (a) Training multiple independent models and averaging their predictions.
  • (b) Training models sequentially, where each new model focuses on correcting the errors of previous models.
  • (c) Using a single deep decision tree with gradient-based optimization.
  • (d) Applying gradient descent to optimize a linear regression model.

Question 8. A demand forecasting model has an MAE of 30 units and an RMSE of 55 units. What does the large gap between MAE and RMSE suggest?

  • (a) The model is underfitting.
  • (b) The model has some large outlier errors — it is usually accurate but occasionally very wrong.
  • (c) The model is overfitting to the training data.
  • (d) The features are not predictive enough.
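
The relationship between MAE and RMSE can be reproduced on a toy residual set. A hedged sketch (the error values below are invented, not from any model in the chapter):

```python
import math

# Mostly small errors plus one large outlier error (illustrative values).
errors = [10, -12, 8, -9, 11, -10, 9, -120]

mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

# RMSE squares errors before averaging, so a single large miss
# inflates RMSE far more than it inflates MAE.
print(mae, rmse)  # MAE = 23.625, RMSE ≈ 43.4
```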

Question 9. A model predicts daily sales with a MAPE of 8 percent. For a product that typically sells 500 units per day, what is the expected average absolute error in units?

  • (a) 8 units
  • (b) 40 units
  • (c) 80 units
  • (d) 400 units
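
Converting MAPE to expected units is a single multiplication. A quick arithmetic sketch with illustrative numbers (deliberately not the values used in the question):

```python
# MAPE is the mean of |error| / actual, so for a product with a stable
# typical volume, expected absolute error ≈ MAPE * typical volume.
mape = 0.05           # 5 percent (illustrative)
typical_volume = 200  # units per day (illustrative)

expected_abs_error = mape * typical_volume
print(expected_abs_error)  # 10.0 units
```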

Question 10. When creating lag features for a time series regression model, why is it critical to use shifted (past) values rather than current or future values?

  • (a) Current values would make the model too complex.
  • (b) Using current or future values constitutes data leakage — the model would use information that wouldn't be available at prediction time.
  • (c) Lag features only work with past values due to mathematical constraints.
  • (d) Current values are always identical to the target variable.
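
A lag feature is just the same series shifted forward in time. A minimal sketch in plain Python (a pandas `Series.shift` call would do the same thing; the sales figures are illustrative):

```python
# Daily sales series; sales[t] is the value observed on day t.
sales = [100, 120, 90, 110, 130, 95]

# Lag-1 feature: yesterday's sales. It is undefined for the first day.
lag1 = [None] + sales[:-1]

# Pairing each day's target with its lag-1 predictor never exposes the
# model to the current or a future value -- doing so would be leakage.
rows = [(x, y) for x, y in zip(lag1, sales) if x is not None]
print(rows)  # [(100, 120), (120, 90), (90, 110), (110, 130), (130, 95)]
```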

Question 11. Ravi presents a demand model to Athena's board. Which of the following presentations would be most effective for securing executive support?

  • (a) "Our model achieves an R² of 0.88 with a standard error of 14.3 units."
  • (b) "Our model reduces forecast error by 35 percent compared to the existing method, translating to an estimated $3.6 million in annual savings from reduced overstock and stockout costs."
  • (c) "We trained five different algorithms and XGBoost outperformed Random Forest by 0.03 on the RMSE metric."
  • (d) "The p-values for all features in our regression model are below 0.05, indicating statistical significance."

Question 12. In Athena's demand forecasting problem, under-predicting demand (stockouts) costs approximately $44 per unit in lost margin, while over-predicting demand (overstock) costs approximately $0.03 per unit per day in holding costs. What does this asymmetry imply for the forecasting model?

  • (a) The model should be optimized to minimize RMSE, which penalizes large errors equally.
  • (b) The model should be biased slightly toward over-prediction, since the cost of under-prediction is much higher per unit.
  • (c) The model should be biased toward under-prediction to minimize inventory carrying costs.
  • (d) The asymmetry is irrelevant because the model should always target the most accurate prediction.
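
The scale of the asymmetry can be made concrete with the per-unit costs from the question. A sketch in which the holding period is an assumption (30 days is illustrative, not a figure from the chapter):

```python
# Per-unit cost of each error direction, from the question:
stockout_cost = 44.0          # dollars per unit of under-prediction
holding_cost_per_day = 0.03   # dollars per unit per day of over-prediction
avg_days_held = 30            # assumed holding period (illustrative)

cost_under = stockout_cost
cost_over = holding_cost_per_day * avg_days_held

# Even over a month of holding, one unit of overstock costs far less
# than one unit of stockout.
print(cost_under, cost_over)  # 44.0 vs 0.9 dollars per unit
```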

Question 13. Which of the following is NOT a valid strategy for handling multicollinearity?

  • (a) Dropping one of the highly correlated features.
  • (b) Combining correlated features into a single composite feature.
  • (c) Using Ridge regularization.
  • (d) Increasing the polynomial degree to capture the correlation.

Question 14. A log transform is applied to the target variable in a regression model. What must the analyst remember when interpreting the model's predictions?

  • (a) The coefficients now represent percentage changes rather than absolute changes.
  • (b) The predictions must be exponentiated (converted back to the original scale) before being used for business decisions.
  • (c) The R² will always be higher on the log scale.
  • (d) Both (a) and (b).
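
The back-transformation is a one-liner, but forgetting it leaves predictions on the wrong scale entirely. A minimal sketch (the sales figure is illustrative):

```python
import math

actual_sales = 500.0

# A model trained on log(y) produces predictions on the log scale.
log_pred = math.log(actual_sales)  # pretend the model output this value

# The raw log-scale prediction is off by orders of magnitude as a unit
# count; exponentiating restores the original scale.
pred_units = math.exp(log_pred)
print(log_pred, pred_units)  # ~6.21 on the log scale vs 500.0 units
```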

Question 15. Zillow's Zestimate had a median error of approximately 5 percent for on-market homes. Why was this level of accuracy sufficient for consumer engagement but insufficient for iBuying?

  • (a) Because consumers don't care about accuracy.
  • (b) Because iBuying margins were narrower than the model's error range, so even average-size errors could eliminate profitability.
  • (c) Because the Zestimate used linear regression, which is not accurate enough for transactions.
  • (d) Because the model was only trained on sold homes, not listed homes.

Question 16. Safety stock is calculated as z × σ × √(lead time). If a more accurate forecasting model reduces σ (the forecast error standard deviation) from 40 to 25, by what percentage does safety stock decrease?

  • (a) 15 percent
  • (b) 37.5 percent
  • (c) 56.25 percent
  • (d) 62.5 percent
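
Because safety stock scales linearly with σ, its percentage change equals the percentage change in σ regardless of z or lead time. A sketch with illustrative inputs (deliberately not the values used in the question):

```python
import math

z = 1.65         # service-level factor (illustrative)
lead_time = 4.0  # days (illustrative)


def safety_stock(sigma):
    # safety stock = z * sigma * sqrt(lead time)
    return z * sigma * math.sqrt(lead_time)


before = safety_stock(50.0)
after = safety_stock(30.0)

pct_decrease = (before - after) / before * 100
print(pct_decrease)  # 40.0 -- identical to the percentage drop in sigma
```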

Question 17. Which of the following is the best description of the difference between correlation and causation in the context of regression analysis?

  • (a) Regression coefficients measure correlation; experiments measure causation. A positive regression coefficient between email frequency and customer lifetime value does not prove that more emails cause higher value.
  • (b) Regression coefficients always measure causation if the model's R² is above 0.80.
  • (c) Correlation and causation are the same thing when the p-value is below 0.01.
  • (d) Causation can only be established using neural networks, not regression.

True or False

Question 18. True or False: Adding more features to a linear regression model will always increase the R² on the training data, even if the features are random noise.


Question 19. True or False: XGBoost (gradient boosting) is typically the best algorithm for interpretability in regulated industries where model decisions must be explained.


Question 20. True or False: In time series regression, randomly splitting data into training and test sets (rather than using a temporal split) can lead to unrealistically optimistic performance estimates due to data leakage.
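
A temporal split is easy to state in code: train strictly on the past, test on the future. A minimal sketch (the data is a stand-in for time-ordered observations):

```python
# Time-ordered observations; the index plays the role of the day number.
data = list(range(100))  # stand-in for 100 days of observations

# Temporal split: the first 80 days for training, the last 20 for testing.
split = 80
train, test = data[:split], data[split:]

# Every training day precedes every test day, so no future information
# reaches the model -- unlike a random shuffle of the same data.
print(max(train) < min(test))  # True
```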


Short Answer

Question 21. Explain why Walmart's demand forecasting uses asymmetric loss functions for perishable food items. What would go wrong if they used symmetric RMSE instead?


Question 22. Tom's initial polynomial regression model (degree 15) had a training R² of 0.97 and a test R² of 0.41. After applying Ridge regularization, the training R² dropped to 0.79 but the test R² improved to 0.76. Explain, in language a business executive would understand, why a lower training score can indicate a better model.


Question 23. Athena's demand model uses lag features (yesterday's sales, last week's same-day sales) as predictors. Explain the potential problem if the model is used to generate forecasts more than one day into the future. How would you address this challenge?


Question 24. A marketing team interprets a regression coefficient of 2.3 for "email_frequency" (emails per week) predicting customer lifetime value as evidence that sending more emails will increase CLV. Identify the flaw in this reasoning and suggest how the company could establish whether the relationship is causal.


Question 25. Compare the DemandForecaster class from this chapter with the ChurnClassifier from Chapter 7. Identify two key differences in how the models are evaluated, and explain why these differences exist.


The answer key is available in Appendix B. For Question 25, reference both the Chapter 7 and Chapter 8 code implementations.