Chapter 8 Exercises: Supervised Learning — Regression

DataField.Dev

Section A: Recall and Comprehension

Exercise 8.1 Define the following terms in your own words, using no more than two sentences each: (a) regression, (b) residual, (c) overfitting, (d) regularization, (e) feature engineering.

Exercise 8.2 Explain the difference between classification and regression in supervised learning. For each of the following business problems, state whether it is a classification problem, a regression problem, or could be framed as either, and justify your answer:

(a) Predicting whether a customer will respond to an email campaign
(b) Predicting how much a customer will spend in the next quarter
(c) Predicting the optimal price for a hotel room on a given night
(d) Predicting whether a loan applicant will default
(e) Predicting how many days until a machine needs maintenance

Exercise 8.3 In a linear regression equation y = b₀ + b₁x, explain the business meaning of:

(a) The slope (b₁)
(b) The intercept (b₀)
(c) The residual for a specific observation

Use the example of predicting daily coat sales from average temperature to make your explanation concrete.
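
As a starting point for thinking about slope, intercept, and residual, the sketch below fits a one-variable least-squares line by hand. The temperature and coat-sales figures are made up for illustration and are not taken from the chapter.

```python
# Hypothetical data: average daily temperature (°C) and coats sold that day.
temps = [-5.0, 0.0, 5.0, 10.0, 15.0, 20.0]
coats = [95.0, 82.0, 68.0, 55.0, 41.0, 30.0]


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


b0, b1 = fit_simple_ols(temps, coats)

# Residual = actual sales minus the sales the line predicts for that day.
predicted = [b0 + b1 * t for t in temps]
residuals = [actual - pred for actual, pred in zip(coats, predicted)]

print(f"intercept b0 = {b0:.1f} coats at 0 °C")
print(f"slope b1 = {b1:.2f} coats per additional degree")
```

With this invented data the slope comes out negative, which matches the intuition that warmer days sell fewer coats; the intercept is the predicted sales on a 0 °C day.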

Exercise 8.4 List three assumptions of linear regression. For each, describe a realistic business scenario where the assumption is violated and explain the consequence of the violation.

Exercise 8.5 Explain the difference between Ridge (L2) and Lasso (L1) regularization. In what business scenario would you prefer Lasso over Ridge? Why?

Exercise 8.6 Describe the difference between bagging (Random Forest) and boosting (Gradient Boosting) in your own words. Use an analogy that a non-technical business stakeholder would understand.

Exercise 8.7 Define R², MAE, RMSE, and MAPE. For each metric, describe one scenario where that metric would be the most appropriate choice for evaluating a demand forecasting model.
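
To make the four definitions concrete, here is a minimal sketch that computes each metric from first principles. The function name `regression_metrics` and the sample numbers are illustrative assumptions, not code from the chapter.

```python
import math


def regression_metrics(actual, predicted):
    """Compute R², MAE, RMSE, and MAPE (in percent) for paired values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE divides by the actual value, so it is undefined when an actual is 0.
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n

    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot

    return {"r2": r2, "mae": mae, "rmse": rmse, "mape": mape}


# Hypothetical weekly demand (units) and a model's forecasts.
weekly_actual = [100.0, 120.0, 80.0, 200.0]
weekly_forecast = [110.0, 115.0, 90.0, 160.0]
metrics = regression_metrics(weekly_actual, weekly_forecast)
print(metrics)
```

Note how the single 40-unit miss pulls RMSE well above MAE: squaring the errors makes RMSE more sensitive to large misses.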

Section B: Application

Exercise 8.8: Interpreting Regression Coefficients A multiple regression model for predicting monthly revenue (in thousands of dollars) at an e-commerce company produces the following coefficients:

Feature	Coefficient
Intercept	120.0
Marketing spend ($K)	2.3
Number of products listed	0.05
Average customer rating (1-5)	45.0
Is holiday month (0/1)	85.0
Competitor price index	-1.8

(a) Interpret each coefficient in plain business language.
(b) Based on these coefficients, what is the predicted monthly revenue for a month with $50K marketing spend, 2,000 products, 4.2 average rating, no holiday, and a competitor price index of 100?
(c) The model's R² is 0.72. What does this mean? Is it "good enough" for business planning? What additional information would you need to make that judgment?
(d) The marketing spend and number of products listed have a correlation of 0.85. What problem might this cause, and how would you address it?
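
One way to check the arithmetic in part (b) is to treat the model as an intercept plus a weighted sum of feature values. The sketch below does that with the coefficients from the table; the dictionary keys are hypothetical labels chosen for readability.

```python
INTERCEPT = 120.0  # revenue in $K when every feature is zero

# Coefficients from the exercise table; key names are illustrative.
coefficients = {
    "marketing_spend_k": 2.3,
    "products_listed": 0.05,
    "avg_rating": 45.0,
    "is_holiday": 85.0,
    "competitor_price_index": -1.8,
}


def predict_revenue_k(features):
    """Return predicted monthly revenue in thousands of dollars."""
    return INTERCEPT + sum(
        coefficients[name] * value for name, value in features.items()
    )


scenario = {
    "marketing_spend_k": 50,
    "products_listed": 2000,
    "avg_rating": 4.2,
    "is_holiday": 0,
    "competitor_price_index": 100,
}
predicted_revenue_k = predict_revenue_k(scenario)
print(f"Predicted revenue: ${predicted_revenue_k:.1f}K")
```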

Exercise 8.9: Overfitting Diagnosis A data scientist presents the following model comparison results for a customer lifetime value prediction:

Model	Training R²	Test R²	Training MAE	Test MAE
Linear Regression	0.61	0.58	$142	$149
Random Forest (depth=5)	0.75	0.71	$98	$112
Random Forest (depth=20)	0.96	0.63	$24	$138
XGBoost (300 trees)	0.89	0.74	$56	$104
Polynomial (degree=8)	0.93	0.42	$38	$195

(a) Which models show signs of overfitting? How can you tell?
(b) Which model would you recommend for deployment? Justify your choice.
(c) For the Random Forest (depth=20), suggest two specific changes that might improve test performance.
(d) Why does the Polynomial (degree=8) model have the worst test performance despite having the second-best training performance?
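
A quick way to scan for overfitting is to rank the models by the gap between training and test R². The sketch below does this for the R² values in the table; it is one simple diagnostic, not a complete model-selection procedure.

```python
# (training R², test R²) for each model, copied from the table above.
results = {
    "Linear Regression": (0.61, 0.58),
    "Random Forest (depth=5)": (0.75, 0.71),
    "Random Forest (depth=20)": (0.96, 0.63),
    "XGBoost (300 trees)": (0.89, 0.74),
    "Polynomial (degree=8)": (0.93, 0.42),
}

# A large positive gap means the model fits training data far better
# than unseen data, which is the signature of overfitting.
gaps = {
    model: round(train_r2 - test_r2, 2)
    for model, (train_r2, test_r2) in results.items()
}

ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)
for model, gap in ranked:
    print(f"{model:28s} gap = {gap:.2f}")
```

The gap alone does not pick a winner: a model can have a moderate gap and still post the best test score, so read it alongside the test metrics.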

Exercise 8.10: Feature Engineering for Hotel Pricing You are building a regression model to predict nightly hotel room prices. Your raw data includes:

- Date of stay
- Hotel star rating (1-5)
- Number of rooms in hotel
- City
- Distance from city center (km)
- Guest review score (1-10)
- Number of reviews
- Whether breakfast is included

(a) Propose five engineered features that could improve model performance. For each, explain the business intuition behind the feature.
(b) Which pairs of features might exhibit multicollinearity? Why?
(c) How would you encode the "city" variable if there are 500 unique cities in the dataset? Discuss at least two approaches with their trade-offs.
(d) Propose one interaction feature and explain why you expect the interaction effect to be meaningful.
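
For part (c), two encodings that avoid creating 500 one-hot columns are frequency encoding and smoothed target (mean) encoding. The sketch below illustrates both on a tiny invented dataset; the city names, prices, and the `smoothing` value are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical (city, nightly price) training rows.
bookings = [
    ("Lisbon", 120.0), ("Lisbon", 140.0),
    ("Oslo", 210.0), ("Oslo", 190.0), ("Oslo", 200.0),
    ("Quito", 90.0),
]


def frequency_encoding(rows):
    """Map each city to its share of the training rows."""
    counts = Counter(city for city, _ in rows)
    return {city: count / len(rows) for city, count in counts.items()}


def smoothed_target_encoding(rows, smoothing=5.0):
    """Map each city to its mean price, shrunk toward the global mean.

    Cities with few rows are pulled more strongly toward the global mean,
    which limits overfitting to rare categories.
    """
    global_mean = sum(price for _, price in rows) / len(rows)
    sums = defaultdict(float)
    counts = Counter()
    for city, price in rows:
        sums[city] += price
        counts[city] += 1
    return {
        city: (sums[city] + smoothing * global_mean) / (counts[city] + smoothing)
        for city in counts
    }


city_frequency = frequency_encoding(bookings)
city_target = smoothed_target_encoding(bookings)
print(city_frequency)
print(city_target)
```

Target encoding uses the label, so the mapping should be computed on training data only and then applied to the test set; otherwise the test score is inflated by leakage.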

Exercise 8.11: Time Series Feature Engineering You are forecasting weekly sales for a grocery store. Your raw data includes only the date and total weekly sales for the past three years.

(a) Create a list of at least eight features you would engineer from the date alone (no external data sources).
(b) For each feature, explain what pattern it is designed to capture.
(c) When creating lag features, why is it important to use .shift() rather than including the current week's value?
(d) If you wanted to add external data sources to improve the model, suggest three that would be especially valuable for grocery demand. For each, explain how you would integrate it with the existing data.
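
The sketch below shows how a few calendar and lag features might be built with pandas. The six weeks of sales are invented, and the column names are illustrative rather than taken from the chapter's code.

```python
import pandas as pd

# Hypothetical weekly sales; weeks end on Sundays.
sales = pd.DataFrame({
    "week_ending": pd.date_range("2023-01-01", periods=6, freq="W"),
    "sales": [100, 110, 95, 130, 125, 140],
})

# Calendar features derived from the date alone.
sales["month"] = sales["week_ending"].dt.month
sales["quarter"] = sales["week_ending"].dt.quarter
sales["week_of_year"] = sales["week_ending"].dt.isocalendar().week.astype(int)

# Lag feature: last week's sales. shift(1) moves values down one row, so
# each row only sees information available before that week began.
sales["lag_1"] = sales["sales"].shift(1)

# Rolling mean of the three previous weeks. Shifting first keeps the
# current week's sales out of its own feature.
sales["rolling_mean_3"] = sales["sales"].shift(1).rolling(3).mean()

print(sales)
```

The first rows contain missing values because there is no earlier week to look back on; those rows are typically dropped before training.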

Exercise 8.12: Athena Product Category Analysis Athena's demand forecasting model achieves the following MAPE by product category:

Category	MAPE (%)	Sales Volume (units/month)
Basic outerwear	7.2	45,000
Athletic shoes	11.5	32,000
Fashion accessories	24.8	18,000
Home goods (stable)	5.1	62,000
Seasonal decorations	31.4	8,000
Electronics accessories	9.8	28,000

(a) Which categories meet a "good" forecasting threshold (MAPE < 15%)? Which do not?
(b) For the fashion accessories category, why might the MAPE be so high? Suggest three specific strategies to improve forecast accuracy for this category.
(c) Seasonal decorations have the worst MAPE but the lowest sales volume. Should Athena invest in improving this forecast? Defend your answer using a cost-benefit framework.
(d) If Athena can only invest in improving forecasting for one category, which would you recommend and why? Consider both MAPE and volume.
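
One possible lens for parts (c) and (d) is to approximate how many units per month each category mis-forecasts by multiplying MAPE by volume. This is a rough sketch: it treats MAPE as if it applied uniformly to total volume and ignores unit margins, stockout costs, and markdown costs, all of which a full cost-benefit analysis would need.

```python
# (MAPE in percent, units per month) from the table above.
categories = {
    "Basic outerwear": (7.2, 45_000),
    "Athletic shoes": (11.5, 32_000),
    "Fashion accessories": (24.8, 18_000),
    "Home goods (stable)": (5.1, 62_000),
    "Seasonal decorations": (31.4, 8_000),
    "Electronics accessories": (9.8, 28_000),
}

# Approximate units mis-forecast per month = MAPE fraction × volume.
approx_error_units = {
    name: round(mape / 100 * volume)
    for name, (mape, volume) in categories.items()
}

ranked = sorted(approx_error_units.items(), key=lambda item: item[1], reverse=True)
for name, units in ranked:
    print(f"{name:26s} ~{units:,} units/month")
```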

Exercise 8.13: Safety Stock Calculation Using the safety stock formula from Section 8.11:

safety_stock = z × sigma × sqrt(lead_time_days)

A product has the following characteristics:

- Forecast error standard deviation (sigma): 35 units/day
- Supplier lead time: 14 days
- Current service level target: 95% (z = 1.65)

(a) Calculate the required safety stock.
(b) If the company raises its service level target to 99% (z = 2.33), how much does safety stock increase? Express as both absolute units and percentage increase.
(c) A new, more accurate demand model reduces the forecast error standard deviation from 35 to 22 units/day. Recalculate safety stock at the 95% service level. How many fewer units of safety stock are needed?
(d) Each unit of safety stock costs $12 per day to hold. What is the daily holding cost savings from the improved model in part (c)?
(e) The improved model cost $200,000 to develop and deploy. Based on the holding cost savings alone, what is the payback period?
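
The formula translates directly into a small helper, which makes it easy to rerun the calculation as the inputs change. The function and variable names below are illustrative; the numbers come from the exercise.

```python
import math


def safety_stock(z, sigma_per_day, lead_time_days):
    """safety_stock = z × sigma × sqrt(lead_time_days), in units."""
    return z * sigma_per_day * math.sqrt(lead_time_days)


baseline = safety_stock(z=1.65, sigma_per_day=35, lead_time_days=14)
improved = safety_stock(z=1.65, sigma_per_day=22, lead_time_days=14)

# Holding-cost savings and payback, using the figures from parts (d) and (e).
units_saved = baseline - improved
daily_savings = units_saved * 12.0        # $12 per unit per day
payback_days = 200_000 / daily_savings    # $200,000 development cost

print(f"Baseline safety stock: {baseline:.1f} units")
print(f"Improved safety stock: {improved:.1f} units")
print(f"Payback period: {payback_days:.0f} days")
```

Because safety stock is linear in both z and sigma, a given percentage reduction in forecast error produces the same percentage reduction in safety stock.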

Section C: Analysis and Evaluation

Exercise 8.14: The Zillow Zestimate Dilemma Read Case Study 2 (Zillow's Zestimate). The Zestimate had a median error of approximately 5 percent for on-market homes.

(a) Calculate the range of possible true values for a home with a Zestimate of $600,000, assuming the median error applies. What is the dollar range?
(b) If Zillow's iBuying margin was 3 percent, what maximum prediction error could Zillow tolerate before losing money on a transaction? (Ignore renovation and transaction costs for simplicity.)
(c) The 5 percent median error means half of predictions had errors larger than 5 percent. If the error distribution is approximately normal with a median of 5 percent and a standard deviation of 4 percent, what percentage of Zillow's purchases would be expected to have errors greater than 10 percent?
(d) Given your analysis, was it mathematically reasonable for Zillow to believe iBuying could be profitable? What risk management strategy could have made the venture viable?
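
For part (c), the upper-tail probability of a normal distribution can be computed with the standard library alone. The sketch below takes the exercise's simplification at face value: errors are treated as normally distributed, so the median equals the mean.

```python
import math


def normal_tail_probability(threshold, mean, std_dev):
    """Return P(X > threshold) for X ~ Normal(mean, std_dev)."""
    z = (threshold - mean) / std_dev
    # The complementary error function gives the upper tail of the
    # standard normal: P(Z > z) = 0.5 * erfc(z / sqrt(2)).
    return 0.5 * math.erfc(z / math.sqrt(2))


# Exercise values: mean (= median) 5 percent, standard deviation 4 percent.
share_above_10 = normal_tail_probability(threshold=10.0, mean=5.0, std_dev=4.0)
print(f"Share of purchases with error above 10%: {share_above_10:.1%}")
```

The same helper can be reused for part (b) by passing the break-even margin as the threshold.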

Exercise 8.15: Model Selection Under Constraints You are a data scientist at a mid-size insurance company. You need to build a regression model to predict claim amounts for auto insurance policies. The model will be used to set premium prices and must pass regulatory review.

(a) Regulators require that the model be "explainable" — the company must be able to explain why any individual customer received their specific premium. Which regression algorithms from this chapter would and would not meet this requirement?
(b) Your initial dataset has 120 features, but you suspect many are redundant or irrelevant. Which regularization approach would you use, and why?
(c) Claim amounts are highly right-skewed (most claims are small, but a few are very large). How would you handle this in your regression approach? Discuss at least two strategies.
(d) Auto insurance claim amounts are influenced by factors that change over time (vehicle technology, road conditions, repair costs). How would you monitor and maintain your model post-deployment?

Exercise 8.16: The Cost of Prediction Error An airline uses a regression model to predict the number of passengers who will show up for each flight (some ticketed passengers don't show). The airline overbooks flights based on these predictions.

(a) What is the cost of the model over-predicting no-shows (predicting more no-shows than actually occur)?
(b) What is the cost of under-predicting no-shows?
(c) Which type of error is more costly for the airline? Which is more costly for customers?
(d) How should the airline's loss function differ from symmetric RMSE? Propose a modified loss function that accounts for the asymmetry.
(e) Should the airline optimize for minimizing their costs, minimizing customer inconvenience, or some balance? How does this connect to the broader themes of responsible AI discussed in this textbook?
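
For part (d), one simple family of asymmetric losses charges a different per-passenger cost depending on the direction of the error. The sketch below is one such option; the cost weights are hypothetical placeholders and would need to be estimated from the airline's actual compensation and revenue figures.

```python
def asymmetric_no_show_loss(actual, predicted, over_cost=4.0, under_cost=1.0):
    """Average cost of no-show forecast errors with direction-specific weights.

    over_cost:  cost per passenger when no-shows are over-predicted
                (too many seats oversold, so passengers may be bumped).
    under_cost: cost per passenger when no-shows are under-predicted
                (seats that could have been sold fly empty).
    Both weights are illustrative assumptions.
    """
    total = 0.0
    for actual_no_shows, predicted_no_shows in zip(actual, predicted):
        error = predicted_no_shows - actual_no_shows
        if error > 0:
            total += over_cost * error
        else:
            total += under_cost * (-error)
    return total / len(actual)


# Same two-passenger miss in each direction, very different cost.
print(asymmetric_no_show_loss([10], [12]))  # over-predicted by 2
print(asymmetric_no_show_loss([10], [8]))   # under-predicted by 2
```

A model trained to minimize this loss will deliberately predict fewer no-shows than a model trained on RMSE, trading some empty seats for fewer denied boardings.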

Section D: Python Application

Exercise 8.17: Build a Price Prediction Model Using the concepts and code patterns from Section 8.10, build a regression model to predict product prices. Use the following synthetic data generator:

import numpy as np
import pandas as pd

np.random.seed(42)
n = 2000

data = pd.DataFrame({
    'brand_tier': np.random.choice([1, 2, 3], n, p=[0.3, 0.5, 0.2]),
    'material_quality': np.random.uniform(1, 10, n),
    'customer_rating': np.clip(np.random.normal(4.0, 0.8, n), 1, 5),
    'num_reviews': np.random.exponential(50, n).astype(int) + 1,
    'is_seasonal': np.random.choice([0, 1], n, p=[0.7, 0.3]),
    'weight_kg': np.random.uniform(0.1, 15, n),
})

data['price'] = (
    20 + data['brand_tier'] * 30
    + data['material_quality'] * 8
    + data['customer_rating'] * 15
    + np.log1p(data['num_reviews']) * 5
    + data['is_seasonal'] * 20
    + data['weight_kg'] * 2
    + data['brand_tier'] * data['material_quality'] * 2  # interaction
    + np.random.normal(0, 15, n)  # noise
)

(a) Split the data 80/20 into training and test sets.
(b) Train a Linear Regression, Ridge, and Random Forest model.
(c) Calculate R², MAE, RMSE, and MAPE for each model on the test set.
(d) Create the interaction feature brand_tier × material_quality and retrain the Linear Regression. Does performance improve? Why?
(e) Use Lasso to identify which features are most important. Do the results align with the data generation formula?
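
As a hint for part (d), the sketch below shows on a separate, smaller synthetic dataset how adding an interaction column changes a linear fit. It uses NumPy's least-squares solver and in-sample R² for brevity, so it is not a substitute for the train/test evaluation the exercise asks for; the data-generating formula here is a simplified assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

tier = rng.integers(1, 4, n).astype(float)   # brand tier 1, 2, or 3
quality = rng.uniform(1, 10, n)
# True relationship includes a tier × quality interaction term.
price = 20 + 30 * tier + 8 * quality + 2 * tier * quality + rng.normal(0, 15, n)


def in_sample_r2(features, target):
    """Fit ordinary least squares with an intercept and return in-sample R²."""
    design = np.column_stack([np.ones(len(target)), features])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coefs
    return 1.0 - residuals.var() / target.var()


r2_main_effects = in_sample_r2(np.column_stack([tier, quality]), price)
r2_with_interaction = in_sample_r2(
    np.column_stack([tier, quality, tier * quality]), price
)

print(f"R² without interaction: {r2_main_effects:.3f}")
print(f"R² with interaction:    {r2_with_interaction:.3f}")
```

In-sample R² can never fall when a column is added, so the meaningful check is whether the improvement also appears on held-out data.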

Exercise 8.18: Demand Forecasting Extension Using the DemandForecaster class from Section 8.10:

(a) Modify the generate_athena_sales_data function to include a "pandemic shock" — a 60 percent drop in demand for 8 weeks starting in March 2023, followed by a gradual recovery. Retrain the models and observe how they handle this disruption.
(b) Add a new feature: "days_since_last_promotion" (the number of days since the last promotion at each store). Does this feature improve model performance?
(c) Implement a custom loss function in the calculate_business_impact method where stockout costs are 3x overstock costs. How does this change the business impact analysis? How should this asymmetry be reflected in the model itself?
(d) Create a visualization that shows the model's MAPE by month. Are there specific months where the model performs especially well or poorly? What business factors might explain the pattern?
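
For part (a), one way to model the shock is a per-week demand multiplier applied to whatever sales column the data generator produces. The sketch below is independent of the chapter's `generate_athena_sales_data` internals; the exact start date in March 2023, the linear recovery shape, and the 12-week recovery length are all assumptions.

```python
from datetime import date, timedelta


def pandemic_multiplier(
    week_start,
    shock_start=date(2023, 3, 6),  # assumed first Monday of the shock
    shock_weeks=8,
    drop=0.60,
    recovery_weeks=12,             # assumed length of the gradual recovery
):
    """Return the factor to multiply baseline demand by for a given week."""
    if week_start < shock_start:
        return 1.0

    weeks_in = (week_start - shock_start).days // 7
    if weeks_in < shock_weeks:
        return 1.0 - drop  # full 60 percent drop during the shock

    weeks_into_recovery = weeks_in - shock_weeks
    if weeks_into_recovery >= recovery_weeks:
        return 1.0

    # Linear climb from (1 - drop) back to 1.0 over the recovery window.
    return (1.0 - drop) + drop * (weeks_into_recovery + 1) / recovery_weeks


for offset in (-1, 0, 7, 8, 13, 20):
    week = date(2023, 3, 6) + timedelta(weeks=offset)
    print(week, round(pandemic_multiplier(week), 2))
```

Multiplying the generated weekly sales by this factor before retraining lets you compare model error inside and outside the shock window.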

Answers to selected exercises are available in Appendix B.