Chapter 34 Exercises — Predictive Models: Regression and Classification
These exercises progress from individual skill building through full-system integration. Work through each tier in order; later tiers assume the skills from earlier ones.
Tier 1: Foundations
These exercises verify that you can execute the basic scikit-learn workflow for both regression and classification. Focus on correctness over efficiency — make the code run and produce sensible output before worrying about performance.
Exercise 1.1 — Your First Linear Regression
Using scikit-learn, build a linear regression model to predict monthly revenue from the number of sales transactions.
import pandas as pd
import numpy as np
np.random.seed(42)
n = 80
transactions = np.random.randint(50, 400, size=n)
revenue = transactions * 425 + np.random.normal(0, 8000, size=n)
df = pd.DataFrame({
"monthly_transactions": transactions,
"monthly_revenue": revenue
})
Tasks:
a) Split the data into 80% training and 20% test sets using train_test_split. Set random_state=42.
b) Fit a LinearRegression model on the training data.
c) Print the model's coefficient and intercept. Interpret what the coefficient means in plain English (e.g., "Each additional transaction is associated with...").
d) Calculate R², MAE, and RMSE on the test set. Print each metric with a label.
e) If R² = 0.91, what does that mean? Write a one-sentence interpretation.
Exercise 1.2 — Classification with Logistic Regression
A retail company tracks whether customers made a repeat purchase within 90 days. Build a binary classifier.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
np.random.seed(7)
n = 200
days_since_first_purchase = np.random.randint(1, 730, size=n)
total_orders = np.random.randint(1, 15, size=n)
avg_order_value = np.random.uniform(25, 300, size=n)
# Repeat purchasers tend to have more orders and higher AOV
log_odds = -2.5 + 0.25 * total_orders + 0.005 * avg_order_value - 0.001 * days_since_first_purchase
prob = 1 / (1 + np.exp(-log_odds))
repeat_purchase = (np.random.random(n) < prob).astype(int)
df = pd.DataFrame({
"days_since_first_purchase": days_since_first_purchase,
"total_orders": total_orders,
"avg_order_value": avg_order_value,
"repeat_purchase": repeat_purchase
})
Tasks:
a) Create feature matrix X and target vector y. Split 80/20 with random_state=42 and stratify=y.
b) Scale features using StandardScaler. Fit on training data only; transform both training and test.
c) Train a LogisticRegression model with random_state=42.
d) Print the confusion matrix (formatted clearly, with labels).
e) Print accuracy, precision, recall, and F1 score. Which metric matters most if the goal is to identify as many likely repeat purchasers as possible for a targeted promotion? Explain.
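Task (b)'s fit-on-train-only rule is the step most often gotten wrong. A minimal sketch of the pattern on tiny made-up arrays (the variable names mirror the exercise, the numbers do not):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrices standing in for the exercise's X_train / X_test (made-up values)
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.0, 300.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

# Training columns now have mean 0 and std 1; test columns usually will not
print(X_train_scaled.mean(axis=0), X_test_scaled)
```

Fitting the scaler on the full dataset instead would leak test-set statistics into training, which is exactly the mistake Exercise 4.4 asks you to diagnose later.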
Exercise 1.3 — Decision Tree Basics
Build a decision tree classifier on the same dataset from Exercise 1.2. Compare its performance to logistic regression.
Tasks:
a) Train a DecisionTreeClassifier with max_depth=4 and random_state=42 on the same training split you used in Exercise 1.2 (no scaling needed for decision trees).
b) Print accuracy, precision, recall, and F1 on the test set.
c) Print feature importances as a formatted table showing feature name and importance score, sorted descending.
d) Now train a second decision tree with max_depth=None (no limit). What happens to training accuracy vs. test accuracy? What does this demonstrate?
e) In one sentence each: what is overfitting, and why does max_depth help prevent it?
Exercise 1.4 — Evaluating Regression Models
You are evaluating three models that predict quarterly sales. Their metrics on the test set are:
| Model | R² | MAE | RMSE |
|---|---|---|---|
| Model A | 0.88 | $4,200 | $6,800 |
| Model B | 0.91 | $5,100 | $5,950 |
| Model C | 0.85 | $3,900 | $9,400 |
Tasks:
a) Which model would you choose if the primary concern is minimizing the worst-case prediction errors? Explain using the metrics.
b) Which model would you choose if you need reliable average-case accuracy for budgeting? Explain.
c) Model C has a much higher RMSE than MAE relative to the other models. What does this suggest about Model C's error distribution?
d) Write Python code (no training required) to compute R², MAE, and RMSE given two arrays: y_true and y_pred. Use only sklearn.metrics.
Exercise 1.5 — The Train/Test Split
This exercise explores why holding out test data is essential.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
np.random.seed(0)
X = np.random.randn(30, 1)
y = 3 * X.ravel() + np.random.randn(30) * 0.5
Tasks:
a) Fit a LinearRegression on all 30 samples. Calculate R² on those same 30 samples. Record the score.
b) Now split 70/30 (train/test) using train_test_split with random_state=0. Fit on train, evaluate on test. Record the test R².
c) Repeat part (b) but with random_state=1, random_state=2, and random_state=3. Print all four test R² values.
d) The score in part (a) should be higher than in part (b). Explain why, and explain why this makes part (a)'s score misleading for assessing real-world performance.
e) Why does the test score vary across different random states in part (c)? What does this suggest about conclusions drawn from a single train/test split?
Tier 1 Answer Key
Exercise 1.1:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np
X = df[["monthly_transactions"]]
y = df["monthly_revenue"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print(f"Coefficient: {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
# Interpretation: each additional transaction is associated with ~$425 more monthly revenue.
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R²: {r2:.3f}")
print(f"MAE: ${mae:,.0f}")
print(f"RMSE: ${rmse:,.0f}")
R² = 0.91 means the model explains 91% of the variance in monthly revenue from transaction count alone; the remaining 9% comes from factors the model has not captured.
Exercise 1.4c: Model C's RMSE is disproportionately large relative to its MAE. RMSE penalizes large errors more heavily than MAE. This means Model C makes occasional very large prediction errors even though its typical error (MAE) is the smallest of the three. It has a skewed error distribution — most predictions are close but a few are far off.
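Exercise 1.4d (one possible solution; any equivalent use of sklearn.metrics is fine — the check values below are made up):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def regression_report(y_true, y_pred):
    """Return R², MAE, and RMSE for paired arrays of actuals and predictions."""
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return r2, mae, rmse

# Quick sanity check with made-up numbers (every error is exactly 10)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])
r2, mae, rmse = regression_report(y_true, y_pred)
print(f"R²: {r2:.3f}  MAE: {mae:.1f}  RMSE: {rmse:.1f}")
```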
Tier 2: Applied
These exercises ask you to make real decisions — choosing the right algorithm, interpreting output in business terms, and handling the preprocessing steps that real data requires.
Exercise 2.1 — Feature Engineering from Dates
Customer records include a signup_date and a last_purchase_date. These raw dates are not useful features directly, but you can engineer numeric features from them.
import pandas as pd
import numpy as np
reference_date = pd.Timestamp("2024-01-01")
df = pd.DataFrame({
"customer_id": range(1, 101),
"signup_date": pd.date_range("2020-01-01", periods=100, freq="15D"),
"last_purchase_date": pd.date_range("2023-01-01", periods=100, freq="4D"),
"total_lifetime_spend": np.random.uniform(100, 5000, 100),
"churned": np.random.randint(0, 2, 100)
})
Tasks:
a) Create a new column days_since_signup = number of days between signup_date and reference_date.
b) Create days_since_last_purchase = number of days between last_purchase_date and reference_date.
c) Create spend_per_day_active = total_lifetime_spend divided by days_since_signup. Handle the division carefully — what happens if days_since_signup is 0?
d) Build a logistic regression model using these three engineered features plus total_lifetime_spend to predict churned. Print classification report.
e) Which of the four features has the largest absolute coefficient after scaling? Interpret that finding in plain English.
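Parts (a) through (c) reduce to Timestamp subtraction plus .dt.days. A minimal sketch on a two-row made-up frame, including one way to handle the division-by-zero case part (c) asks about:

```python
import pandas as pd
import numpy as np

reference_date = pd.Timestamp("2024-01-01")
toy = pd.DataFrame({
    "signup_date": pd.to_datetime(["2023-01-01", "2024-01-01"]),  # made-up dates
    "total_lifetime_spend": [365.0, 500.0],
})

# Timestamp minus a datetime column yields a timedelta column; .dt.days converts it
toy["days_since_signup"] = (reference_date - toy["signup_date"]).dt.days

# Replace 0 with NaN before dividing so a same-day signup yields NaN, not an error
toy["spend_per_day_active"] = (
    toy["total_lifetime_spend"] / toy["days_since_signup"].replace(0, np.nan)
)
print(toy[["days_since_signup", "spend_per_day_active"]])
```

Whether NaN, 0, or the raw spend is the right fill for a day-zero customer is a judgment call worth stating explicitly in your answer.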
Exercise 2.2 — Dealing with Categorical Features
Your dataset contains a mix of numeric and categorical variables. Scikit-learn requires numeric input.
import pandas as pd
import numpy as np
np.random.seed(3)
n = 150
df = pd.DataFrame({
"contract_type": np.random.choice(["Monthly", "Annual", "Multi-Year"], n),
"industry": np.random.choice(["Retail", "Healthcare", "Finance", "Manufacturing"], n),
"monthly_spend": np.random.uniform(500, 10000, n),
"support_tickets_ytd": np.random.randint(0, 25, n),
"months_as_customer": np.random.randint(1, 60, n),
"churned": np.random.randint(0, 2, n)
})
Tasks:
a) Use pd.get_dummies() to encode contract_type and industry. Set drop_first=True. Print the resulting column names.
b) What does drop_first=True do, and why is it important for linear models?
c) Build a LogisticRegression model with the encoded features. Scale the numeric columns but not the dummy variables. Hint: use ColumnTransformer or encode separately.
d) Print the confusion matrix and F1 score.
e) A colleague suggests just replacing "Monthly" with 1, "Annual" with 2, and "Multi-Year" with 3 (label encoding). What problem does this create for a logistic regression model?
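For part (c), one way to scale only the numeric columns is a ColumnTransformer inside a Pipeline. A sketch on a stand-in frame generated the same way as the exercise's data (n reduced for brevity):

```python
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

np.random.seed(3)
n = 40  # small stand-in for the exercise's 150-row frame
df = pd.DataFrame({
    "contract_type": np.random.choice(["Monthly", "Annual", "Multi-Year"], n),
    "industry": np.random.choice(["Retail", "Healthcare", "Finance", "Manufacturing"], n),
    "monthly_spend": np.random.uniform(500, 10000, n),
    "support_tickets_ytd": np.random.randint(0, 25, n),
    "months_as_customer": np.random.randint(1, 60, n),
    "churned": np.random.randint(0, 2, n),
})

numeric_cols = ["monthly_spend", "support_tickets_ytd", "months_as_customer"]
categorical_cols = ["contract_type", "industry"]

# Scale numerics; one-hot the categoricals, dropping the first level
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(drop="first"), categorical_cols),
])
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression(random_state=42))])
clf.fit(df[numeric_cols + categorical_cols], df["churned"])
print(clf.predict(df[numeric_cols + categorical_cols])[:5])
```

OneHotEncoder(drop="first") plays the same role as pd.get_dummies(drop_first=True) from part (a), but lives inside the pipeline, so the encoding is learned from training data only.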
Exercise 2.3 — Regression with Multiple Features
Predict annual customer revenue from account data.
import pandas as pd
import numpy as np
np.random.seed(11)
n = 120
df = pd.DataFrame({
"num_users": np.random.randint(5, 200, n),
"contract_months": np.random.randint(12, 60, n),
"support_tier": np.random.choice([1, 2, 3], n), # 1=Basic, 2=Pro, 3=Enterprise
"integration_count": np.random.randint(0, 10, n),
"onboarding_score": np.random.uniform(1, 10, n),
})
df["annual_revenue"] = (
df["num_users"] * 180
+ df["contract_months"] * 50
+ df["support_tier"] * 4000
+ df["integration_count"] * 800
+ df["onboarding_score"] * 300
+ np.random.normal(0, 3000, n)
)
Tasks:
a) Fit a LinearRegression model (train/test 80/20, random_state=42). Print R², MAE, RMSE.
b) Print a coefficient table: feature name, coefficient value, sorted by absolute coefficient descending. What is the single most influential feature?
c) Add a feature users_x_support = num_users * support_tier (an interaction term). Refit and compare R² to part (a). Did the interaction term improve the model?
d) What does a negative R² indicate? (It is possible — when does it happen?)
e) If you increased num_users by 10 while holding all other features constant, how much would the model predict annual revenue to increase? Use the coefficient from part (b).
Exercise 2.4 — Threshold Selection
A model predicts customer churn probability. The default threshold of 0.50 classifies probabilities above 0.50 as "churn." But this threshold is a choice, not a law.
import numpy as np
np.random.seed(99)
# Simulated: true labels and model probability scores
y_true = np.random.randint(0, 2, 300)
# Simulate a moderately good model
y_prob = np.clip(y_true * 0.5 + np.random.uniform(0, 0.6, 300), 0, 1)
Tasks:
a) Using threshold = 0.50, compute and print precision, recall, and F1 for the positive class (churn = 1).
b) Repeat for thresholds of 0.30, 0.40, 0.60, and 0.70. Present results in a formatted table.
c) Your company's retention team can handle outreach to at most 80 customers per month. There are 300 customers in this dataset. Which threshold maximizes the number of true churners identified among those 80? Write the logic to determine this.
d) Define in one sentence each: precision, recall, and why there is an inherent tradeoff between them when you adjust the threshold.
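Part (b)'s sweep is just a loop over candidate cutoffs. A compact sketch using small made-up label and score arrays in place of the exercise's y_true and y_prob:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up labels and probability scores for illustration
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3])

print(f"{'thresh':>6} {'prec':>6} {'rec':>6} {'f1':>6}")
for t in (0.3, 0.4, 0.5, 0.6, 0.7):
    y_pred = (y_prob >= t).astype(int)  # classify by the candidate cutoff
    print(f"{t:6.2f} {precision_score(y_true, y_pred, zero_division=0):6.2f} "
          f"{recall_score(y_true, y_pred):6.2f} {f1_score(y_true, y_pred):6.2f}")
```

Notice how lowering the threshold raises recall and tends to lower precision; that tradeoff is the subject of part (d).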
Exercise 2.5 — Random Forest Feature Importance
Build a Random Forest classifier and use feature importances to identify the most predictive signals.
import pandas as pd
import numpy as np
np.random.seed(55)
n = 250
df = pd.DataFrame({
"recency_days": np.random.randint(1, 365, n),
"frequency_orders": np.random.randint(1, 50, n),
"monetary_total": np.random.uniform(50, 10000, n),
"avg_days_between_orders": np.random.uniform(5, 90, n),
"pct_orders_returned": np.random.uniform(0, 0.5, n),
"num_product_categories": np.random.randint(1, 10, n),
"newsletter_subscriber": np.random.randint(0, 2, n),
"support_ticket_count": np.random.randint(0, 20, n),
})
# Simulate churn: driven primarily by recency and frequency
log_odds = (
-1.0
+ 0.008 * df["recency_days"]
- 0.05 * df["frequency_orders"]
+ 0.003 * df["pct_orders_returned"] * 100
)
prob = 1 / (1 + np.exp(-log_odds))
df["churned"] = (np.random.random(n) < prob).astype(int)
Tasks:
a) Train a RandomForestClassifier with n_estimators=100, max_depth=6, random_state=42 on 80% of the data.
b) Print a feature importance table, sorted descending. Which three features are most important?
c) Retrain using only the top 3 features. Compare accuracy and F1 to the full model. Is simplifying the model justified?
d) What is one advantage of Random Forest over a single decision tree besides typically higher accuracy?
e) Feature importances from a Random Forest can sometimes be misleading when features are correlated. Briefly explain why.
Tier 3: Integration
These exercises combine multiple concepts within a single workflow: data preparation, model training, evaluation, and business interpretation.
Exercise 3.1 — The Full Churn Pipeline
Build a complete churn prediction pipeline from raw data to actionable output.
import pandas as pd
import numpy as np
np.random.seed(77)
n = 400
df = pd.DataFrame({
"customer_id": [f"C{i:04d}" for i in range(1, n + 1)],
"segment": np.random.choice(["SMB", "Mid-Market", "Enterprise"], n,
p=[0.5, 0.35, 0.15]),
"months_active": np.random.randint(1, 84, n),
"mrr": np.random.uniform(200, 8000, n),
"logins_last_30d": np.random.randint(0, 60, n),
"tickets_open": np.random.randint(0, 8, n),
"nps_score": np.random.choice(range(0, 11), n),
"last_upsell_days": np.random.randint(0, 730, n),
})
df["churned"] = (
(df["logins_last_30d"] < 5).astype(int)
| (df["nps_score"] < 4).astype(int)
| (df["tickets_open"] > 4).astype(int)
).clip(0, 1)
Tasks:
a) Encode the segment column using one-hot encoding. Drop customer_id from features.
b) Build a pipeline that scales numeric features and trains a LogisticRegression. Use sklearn.pipeline.Pipeline.
c) Evaluate with 5-fold stratified cross-validation. Report mean and standard deviation of F1 score.
d) After fitting on 80% train data, generate a DataFrame of predictions for the test set with columns: customer_id, actual_churned, predicted_churned, churn_probability. Sort by churn_probability descending.
e) Print the top 10 highest-risk customers. Which customer segment appears most frequently in the top 10?
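Parts (b) and (c) combine into a few lines once the features are numeric. A skeleton on made-up stand-in arrays (swap in the encoded features and churned target built above):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                          # stand-in feature matrix
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # stand-in binary target

# Scaling lives inside the pipeline, so each CV fold scales on its own train split
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(random_state=42)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")
```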
Exercise 3.2 — Revenue Forecasting with Cross-Validation
Compare linear regression with a random forest regressor on a revenue prediction problem. Use cross-validation to assess which generalizes better.
import pandas as pd
import numpy as np
np.random.seed(13)
n = 200
df = pd.DataFrame({
"team_size": np.random.randint(2, 50, n),
"deal_size_avg": np.random.uniform(5000, 100000, n),
"pipeline_coverage": np.random.uniform(1.0, 6.0, n),
"avg_sales_cycle_days": np.random.randint(14, 180, n),
"win_rate": np.random.uniform(0.10, 0.60, n),
"marketing_spend": np.random.uniform(10000, 200000, n),
})
df["quarterly_revenue"] = (
df["team_size"] * df["win_rate"] * df["deal_size_avg"] * 3
+ df["marketing_spend"] * 1.8
- df["avg_sales_cycle_days"] * 500
+ np.random.normal(0, 50000, n)
).clip(0)
Tasks:
a) Use cross_val_score with cv=5 and scoring="r2" on both a LinearRegression and a RandomForestRegressor(n_estimators=100, random_state=42). Print mean ± std R² for each.
b) Use cross_val_score with scoring="neg_mean_absolute_error". Convert to positive MAE. Which model achieves lower average MAE?
c) The RandomForestRegressor has a hyperparameter n_estimators. Train forests with 10, 50, 100, 200, and 500 estimators. Plot (or print in a table) how mean cross-validated R² changes. At what point does adding more trees stop helping?
d) Does R² of 0.75 in cross-validation mean the model is "good"? Write a 3-4 sentence answer addressing: what it measures, what it does not measure, and what additional information you would want before deploying.
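Part (c)'s comparison is a loop over forest sizes. A sketch on small made-up data (substitute the exercise's feature matrix and quarterly_revenue target, and the full list of estimator counts):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(13)
X = rng.uniform(size=(120, 3))                                   # stand-in features
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=120)  # stand-in target

results = {}
for n_trees in (10, 50, 100):
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=42)
    results[n_trees] = cross_val_score(rf, X, y, cv=5, scoring="r2").mean()
    print(f"{n_trees:4d} trees: mean CV R2 = {results[n_trees]:.3f}")
```

Expect diminishing returns: the gain from 100 to 500 trees is usually far smaller than the gain from 10 to 50, while training time grows roughly linearly.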
Exercise 3.3 — Interpreting a Deployed Model
Your logistic regression churn model has been trained and is in production. A stakeholder asks you to explain three specific predictions to the sales team.
Use the coefficients from a trained model (train on Exercise 3.1's dataset, or similar). For each of these customers, calculate and explain their churn probability:
| Customer | months_active | mrr | logins_last_30d | tickets_open | nps_score |
|---|---|---|---|---|---|
| Alpha Corp | 48 | 4,200 | 2 | 5 | 3 |
| Beta Inc | 12 | 800 | 45 | 0 | 9 |
| Gamma LLC | 6 | 1,500 | 18 | 2 | 6 |
Tasks:
a) Train a logistic regression model on Exercise 3.1's data (without the segment columns for simplicity). Print scaled coefficients for each numeric feature.
b) For Alpha Corp, explain in plain English why the model predicts the probability it does. Reference specific features.
c) If you could convince Alpha Corp to do one thing to reduce their churn risk, what would it be? Base this on the feature coefficients.
d) Write a function explain_prediction(model, scaler, feature_names, customer_data) that prints a human-readable explanation of the top 3 factors driving a customer's churn probability.
Exercise 3.4 — Handling Class Imbalance
Churn datasets are often imbalanced — 85-95% of customers do not churn. This can fool a naive model into predicting "not churn" for everyone.
import pandas as pd
import numpy as np
np.random.seed(22)
n = 1000
df = pd.DataFrame({
"feature_1": np.random.randn(n),
"feature_2": np.random.randn(n),
"feature_3": np.random.randn(n),
"feature_4": np.random.randn(n),
})
# Only 8% churn rate
log_odds = -3.0 + 0.8 * df["feature_1"] - 0.6 * df["feature_2"]
prob = 1 / (1 + np.exp(-log_odds))
df["churned"] = (np.random.random(n) < prob).astype(int)
print(df["churned"].value_counts()) # Expect ~920 non-churn, ~80 churn
Tasks:
a) Train a LogisticRegression on 80% of this data. Report accuracy, precision, recall, and F1. What is wrong with this result?
b) Retrain with class_weight="balanced". How do the metrics change?
c) Explain why accuracy is a misleading metric for imbalanced classification. What would a model that always predicts "not churn" score for accuracy on this dataset?
d) A colleague suggests "we should just delete most of the non-churn customers to balance the dataset." What is one risk of this approach?
e) Which metric — precision, recall, or F1 — is most appropriate as a primary metric for an imbalanced churn problem, and why? (There is no single right answer — defend your choice.)
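The change in part (b) is a single argument. A sketch on a made-up imbalanced set showing the recall shift you should expect (evaluated in-sample here purely for brevity):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(22)
X = rng.normal(size=(600, 2))
# Imbalanced target driven by the first feature (made-up relationship)
y = (rng.random(600) < 1 / (1 + np.exp(-(-2.5 + 1.5 * X[:, 0])))).astype(int)

plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight="balanced").fit(X, y)

# Balanced weighting trades precision for recall on the rare class
print("recall (plain):   ", recall_score(y, plain.predict(X)))
print("recall (balanced):", recall_score(y, balanced.predict(X)))
```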
Tier 4: Challenge
These exercises require combining multiple chapter concepts into complete analytical systems, making judgment calls with imperfect information, and communicating results to non-technical stakeholders.
Exercise 4.1 — Model Selection Workflow
You are advising a B2B software company on which predictive model to deploy for churn detection. You have run the following 5-fold cross-validation results:
| Model | Mean F1 | Std F1 | Mean AUC | Std AUC | Train Time (s) | Interpretable? |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.68 | 0.04 | 0.82 | 0.03 | 0.1 | Yes |
| Decision Tree (depth=5) | 0.65 | 0.06 | 0.77 | 0.05 | 0.2 | Yes |
| Random Forest (100 trees) | 0.73 | 0.03 | 0.86 | 0.02 | 4.2 | Partially |
| Gradient Boosting | 0.74 | 0.03 | 0.87 | 0.02 | 18.7 | No |
Tasks:
a) Write a structured recommendation (3-4 paragraphs) selecting one model. Justify your choice by explicitly addressing: performance, variance (stability), interpretability, and operational concerns.
b) The sales team wants to know "why" a customer is flagged as high-risk for each case. Which models support this requirement well, and which do not?
c) The company has 5,000 customers and reruns predictions weekly. Which model's training time matters least? Which would you worry about at 500,000 customers?
d) If you had to choose between Random Forest with F1=0.73 ± 0.03 and Gradient Boosting with F1=0.74 ± 0.03, what additional information would you want before making a final recommendation?
Exercise 4.2 — Building a Reusable Prediction Module
Write a Python module churn_predictor.py with the following components:
a) A ChurnPredictor class with:
- __init__(self, model_type="logistic") supporting "logistic", "tree", and "forest"
- fit(self, X, y) — preprocesses (scales) and trains the model
- predict_proba(self, X) — returns churn probabilities for new customers
- get_top_risk_customers(self, X, customer_ids, threshold=0.60) — returns a DataFrame of customer IDs and probabilities where probability exceeds threshold, sorted descending
- get_feature_report(self, feature_names) — prints model coefficients (for logistic) or feature importances (for forest/tree)
b) An evaluate_model(predictor, X_test, y_test) standalone function that prints accuracy, precision, recall, F1, and AUC in a formatted table.
c) An if __name__ == "__main__" block that generates synthetic churn data, trains the model, evaluates it, and prints the top 10 risk customers.
Requirements: full docstrings on all methods, PEP 8 formatting, try/except for model type validation in __init__.
Exercise 4.3 — Communicating Results
You have built a logistic regression model with these test-set metrics and business parameters:
- Accuracy: 84%
- Precision: 71%
- Recall: 58%
- F1: 64%
- AUC: 0.81
- 30-day retention campaign cost: $150 per customer
- Average MRR of a churned customer: $1,200
- Estimated retention rate when contacted: 40%
Tasks:
a) Write an executive summary (150-200 words, no technical jargon) for the VP of Customer Success explaining what the model does, how reliable it is, and why it is worth using.
b) Calculate the expected ROI of using the model vs. a baseline strategy of contacting all customers. Assume a customer base of 500 with 8% actual churn rate. The model identifies 70 customers as high-risk (with precision 71%).
c) The VP asks: "Can't we just be 100% accurate?" Write a brief, honest response explaining why perfect prediction is not achievable and what the practical ceiling on accuracy is for this type of problem.
d) List three specific questions you would ask about the data quality before deploying this model in production.
Exercise 4.4 — Diagnosing Model Problems
A colleague shares a model they built. Diagnose what is wrong in each scenario.
Scenario A:
Training accuracy: 99.4%
Test accuracy: 61.2%
Scenario B:
Churn rate in dataset: 7%
Model accuracy: 93.2%
Model recall on churn class: 0.04
Scenario C:
# Colleague's code:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Fit+transform entire dataset
X_train, X_test = X_scaled[:800], X_scaled[800:]
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
Scenario D:
Feature importances (Random Forest):
customer_id: 0.58
account_number: 0.21
annual_revenue: 0.12
...
Tasks:
For each scenario, identify the problem and explain how to fix it. Be specific — name the technical issue (overfitting, data leakage, class imbalance, etc.) and describe exactly what the colleague should change.
Tier 5: Stretch
These exercises go beyond the chapter's explicit content. They require research, creative problem formulation, or connecting predictive modeling to broader business systems.
Exercise 5.1 — From Model to Decision System
A logistic regression model predicts churn probability for 2,000 customers. You need to design a decision system that uses these probabilities to assign each customer to one of three intervention tracks:
- Track A (High-Touch): Personal call from account manager — capacity: 50 customers/month, cost: $800
- Track B (Automated Sequence): Personalized email series — capacity: 300 customers/month, cost: $35
- Track C (No Action): Customer is not contacted
Known economics: average MRR = $1,800, average customer lifetime if retained = 18 months, intervention success rate = Track A: 55%, Track B: 20%.
Design and implement this decision system:
a) Define a prioritize_interventions(customer_probs_df, capacity_a=50, capacity_b=300) function that returns a DataFrame with an assigned track for each customer.
b) Calculate the expected monthly revenue saved by Track A interventions and Track B interventions separately. Calculate total intervention cost. Calculate net ROI.
c) How should the threshold for Track A change if capacity increases to 100? If success rate drops to 35%? Write general logic rather than hard-coded numbers.
d) What business information would you need to extend this system to also account for customer lifetime value (i.e., prioritizing high-MRR customers differently)?
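For part (b), the per-customer economics reduce to an expected-value calculation. A sketch using the figures stated above (the function name and the 70% example probability are illustrative, not prescribed):

```python
# Economics stated in the exercise
MRR = 1800              # average monthly recurring revenue, dollars
LIFETIME_MONTHS = 18    # average remaining lifetime if retained

def expected_net_value(churn_prob, success_rate, cost):
    """Expected revenue saved by one intervention, net of its cost."""
    revenue_at_risk = MRR * LIFETIME_MONTHS
    return churn_prob * success_rate * revenue_at_risk - cost

# Example: a hypothetical customer with a 70% churn probability
print(f"Track A net: ${expected_net_value(0.70, 0.55, 800):,.0f}")
print(f"Track B net: ${expected_net_value(0.70, 0.20, 35):,.0f}")
```

Ranking customers by this quantity per track, then filling each track's capacity greedily, is one reasonable basis for prioritize_interventions.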
Exercise 5.2 — Temporal Validation
Standard train_test_split assigns rows to train and test at random. For time-ordered business data this can be misleading, because a random split lets the model train on future rows and then "predict" past events, something that can never happen in production.
Research TimeSeriesSplit from scikit-learn and redesign the cross-validation in Exercise 3.2 using temporal splits:
a) Explain why random train/test splitting is problematic for time-ordered data in 3-4 sentences.
b) Implement TimeSeriesSplit with 5 splits on the revenue dataset from Exercise 3.2. Print R² for each fold.
c) Compare the mean R² and variance from TimeSeriesSplit vs. standard KFold. Which is more conservative (lower estimate)? Why would you trust it more for deployment decisions?
d) Generate a visualization (or ASCII table) showing which rows are train and which are test for each of the 5 TimeSeriesSplit folds.
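TimeSeriesSplit never trains on rows that come after the test rows. You can see the fold structure directly by printing index ranges on a small stand-in array:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered rows (stand-in data)
tscv = TimeSeriesSplit(n_splits=5)

# Each fold trains on an expanding prefix and tests on the block that follows it
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"Fold {fold}: train rows 0-{train_idx[-1]}, "
          f"test rows {test_idx[0]}-{test_idx[-1]}")
```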
Exercise 5.3 — The Baseline Problem
Before deploying any model, you need to beat a baseline. Design and test three baselines for the churn prediction problem from Exercise 3.4:
a) Always-No baseline: Predict 0 (not churn) for every customer. Compute accuracy, precision, recall, F1.
b) Always-Yes baseline: Predict 1 (churn) for every customer. Compute the same metrics.
c) Random baseline: Predict churn with probability equal to the observed churn rate. Run 100 trials and report mean ± std for each metric.
d) Stratified baseline: Use DummyClassifier(strategy="stratified"). How does this compare to your manual random baseline?
e) Write 2-3 sentences explaining what it means for a model to "beat the baseline" and why this matters practically. Is a model with 84% accuracy on an 8%-churn dataset impressive?
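Part (d)'s DummyClassifier baselines take only a few lines. A sketch on a made-up imbalanced target:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(22)
X = rng.normal(size=(500, 3))                    # stand-in features
y = (rng.random(500) < 0.08).astype(int)         # ~8% positive rate (made up)

majority = DummyClassifier(strategy="most_frequent").fit(X, y)
stratified = DummyClassifier(strategy="stratified", random_state=0).fit(X, y)

# The majority-class baseline scores ~92% accuracy without learning anything
print("most_frequent accuracy:", accuracy_score(y, majority.predict(X)))
print("stratified accuracy:   ", accuracy_score(y, stratified.predict(X)))
```

Any candidate model should clear these numbers on recall and F1, not just accuracy, before it earns deployment.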
Exercise 5.4 — Model Monitoring in Production
A churn model was trained in January and deployed in February. It is now October.
Research the concept of "model drift" (data drift and concept drift) and answer the following:
a) What is data drift? Give a concrete example of what data drift would look like in a customer churn dataset 8 months after training.
b) What is concept drift? Give a concrete example where the relationship between features and churn changes over time in ways that would invalidate a January-trained model by October.
c) Design a monthly model monitoring report. List five metrics you would track and explain what change in each metric would trigger a model retraining.
d) Implement a check_feature_drift(X_train, X_new, feature_names, threshold=0.20) function that flags any feature where the mean has shifted by more than threshold standard deviations between training and current data. Print a warning for flagged features.
Exercise 5.5 — Connecting Supply Chain and Predictive Models
Chapter 32 built an inventory analytics system for Acme Corp. Chapter 34 builds predictive models. Connect these two systems.
a) Describe (in plain English, no code required) how you would use the supply chain data from Chapter 32 as features in a predictive model. Specifically:
- What would the prediction target be? (Give two options.)
- What features from the supply chain system would be predictive?
- What new data would you need that was not in the Chapter 32 system?
b) The Chapter 32 reorder alert system identifies which items are below their reorder point. Design a classification model that predicts, 14 days in advance, which items will go below their reorder point — before they actually do. What features would you include?
c) Implement a simplified version: use the daily demand data and lead time statistics from Chapter 32 to build a LogisticRegression that predicts whether a SKU will stock out in the next 7 days. Generate synthetic features and a target label, then train and evaluate the model.
d) What is the business value of predicting stockouts 14 days in advance vs. detecting them in real time? Express this in terms the operations team would find compelling.