Chapter 7 Exercises: Supervised Learning -- Classification
Section A: Recall and Comprehension
Exercise 7.1 Define the following terms in your own words, using no more than two sentences each: (a) classification, (b) binary classification, (c) supervised learning, (d) feature engineering, (e) class imbalance.
Exercise 7.2 Explain the difference between classification and regression. Provide two business examples for each that were not mentioned in the chapter.
Exercise 7.3 Describe the six steps of the classification workflow presented in Section 7.2. For each step, identify one common mistake that could undermine the project.
Exercise 7.4 In your own words, explain why logistic regression is recommended as the first model to try for binary classification. List at least three reasons.
Exercise 7.5 Compare and contrast decision trees and random forests. What specific problem with individual decision trees does the random forest address, and how does it address it?
Exercise 7.6 Explain the difference between bagging (as used in random forests) and boosting (as used in gradient boosting). Use an analogy other than the ones provided in the chapter.
Exercise 7.7 Define each cell of the confusion matrix (TP, TN, FP, FN) in the context of a fraud detection model where the positive class is "fraudulent transaction."
Section B: Metrics and Interpretation
Exercise 7.8: The Accuracy Paradox
A credit card company processes 10,000 transactions per day. On average, 50 are fraudulent (0.5 percent). An analyst builds a model and reports: "Our fraud detection model is 99.5 percent accurate!"
- (a) Explain how this model could achieve 99.5 percent accuracy while catching zero fraud.
- (b) What metric would you use instead of accuracy to evaluate this model? Why?
- (c) If the model achieves 80 percent recall and 60 percent precision, what does this mean in practical terms? How many fraudulent transactions does it catch, and how many legitimate transactions does it incorrectly flag?
Exercise 7.9: Precision vs. Recall Tradeoffs
For each of the following business applications, state whether you would prioritize precision or recall, and explain your reasoning:
- (a) Email spam filtering for a CEO's inbox
- (b) Screening mammograms for breast cancer
- (c) Predicting which products will be best-sellers for inventory stocking
- (d) Identifying money laundering in bank transactions
- (e) Recommending movies on a streaming platform
- (f) Flagging student essays for potential plagiarism
Exercise 7.10: Threshold Selection
A customer churn model produces the following results at different thresholds on a test set of 2,000 customers (400 actual churners, 1,600 non-churners):
| Threshold | TP | FP | FN | TN |
|---|---|---|---|---|
| 0.30 | 340 | 480 | 60 | 1,120 |
| 0.50 | 260 | 200 | 140 | 1,400 |
| 0.70 | 160 | 60 | 240 | 1,540 |
- (a) Calculate precision, recall, and F1 for each threshold.
- (b) If a retention offer costs $15 per customer and each lost customer represents $400 in annual revenue, which threshold maximizes net value? Show your work.
- (c) A colleague suggests using the threshold with the highest F1 score. Do you agree? Why or why not?
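The per-threshold arithmetic can be sketched directly from the table above. This is a minimal illustration, not a full solution: the net-value line assumes every correctly targeted churner accepts the offer and is saved, a simplification worth questioning in your answer to part (b).

```python
# Sketch of the per-threshold arithmetic from the exercise table.
# Simplifying assumption (revisit in part (b)): every true positive
# who receives the $15 offer is actually retained.
rows = [  # (threshold, TP, FP, FN, TN)
    (0.30, 340, 480, 60, 1120),
    (0.50, 260, 200, 140, 1400),
    (0.70, 160, 60, 240, 1540),
]
OFFER_COST = 15     # retention offer cost per targeted customer
CHURN_VALUE = 400   # annual revenue at stake per churner

for t, tp, fp, fn, tn in rows:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    net = tp * CHURN_VALUE - (tp + fp) * OFFER_COST
    print(f"threshold={t:.2f}  precision={precision:.3f}  "
          f"recall={recall:.3f}  f1={f1:.3f}  net=${net:,}")
```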
Exercise 7.11: AUC-ROC Interpretation
Your team has built two models for predicting loan defaults:
- Model A: AUC = 0.92, but the loan officers find its predictions "opaque and hard to trust"
- Model B: AUC = 0.84, but each prediction comes with a clear list of the top three risk factors
Write a one-paragraph recommendation for which model to deploy and why. Consider both the technical performance gap and the organizational factors discussed in this chapter.
Section C: Coding Exercises
Exercise 7.12: Data Exploration
Using the generate_athena_churn_data() function from the chapter, generate a dataset of 10,000 customers. Write code to:
- (a) Calculate the churn rate for each loyalty tier. Which tier has the highest churn rate?
- (b) Create a summary table showing the mean and median of each numerical feature, broken down by churn status (churned vs. not churned).
- (c) Identify the three features with the strongest correlation to the churned target variable.
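One way to approach all three parts is with pandas groupby and correlation calls. The toy DataFrame below is only a stand-in; swap in the output of generate_athena_churn_data(). The column names here (loyalty_tier, days_since_last_purchase, purchase_count_12m, churned) are assumed to match the chapter's dataset.

```python
# Sketch of the groupby patterns on a tiny stand-in DataFrame
# (replace df with the generated Athena data).
import pandas as pd

df = pd.DataFrame({
    "loyalty_tier": ["gold", "gold", "silver", "silver", "bronze", "bronze"],
    "days_since_last_purchase": [10, 95, 30, 120, 60, 150],
    "purchase_count_12m": [12, 2, 9, 1, 7, 1],
    "churned": [0, 1, 0, 1, 0, 1],
})

# (a) churn rate per loyalty tier
churn_by_tier = df.groupby("loyalty_tier")["churned"].mean()

# (b) mean and median of each numeric feature by churn status
numeric_cols = ["days_since_last_purchase", "purchase_count_12m"]
summary = df.groupby("churned")[numeric_cols].agg(["mean", "median"])

# (c) features ranked by absolute correlation with the target
corrs = (df.corr(numeric_only=True)["churned"]
         .drop("churned").abs().sort_values(ascending=False))
print(churn_by_tier, summary, corrs, sep="\n\n")
```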
Exercise 7.13: Logistic Regression from Scratch
Using the synthetic Athena data:
- (a) Train a logistic regression model using only two features: days_since_last_purchase and purchase_count_12m. Report its AUC-ROC.
- (b) Add all remaining features. Report the new AUC-ROC. How much did the additional features improve performance?
- (c) Examine the logistic regression coefficients. Which feature has the largest positive coefficient (most increases churn probability)? Which has the largest negative coefficient (most decreases churn probability)? Do these make business sense?
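A minimal sketch of the fit-score-inspect loop, using scikit-learn's make_classification as a stand-in for the Athena data (the scaler is included because logistic regression coefficients are only comparable across features on a common scale):

```python
# Sketch: fit logistic regression, report AUC-ROC, inspect coefficients.
# Stand-in data; substitute the Athena train/test split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# One coefficient per feature; sign indicates direction of effect.
coefs = model.named_steps["logisticregression"].coef_[0]
print(f"AUC-ROC: {auc:.3f}")
print("coefficients:", coefs.round(3))
```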
Exercise 7.14: Decision Tree Visualization
Train a decision tree classifier on the Athena data with max_depth=3. Using scikit-learn's export_text() function, print the tree rules. Answer:
- (a) What is the first (root) split? Why do you think the algorithm chose this feature?
- (b) Follow the path for a customer with days_since_last_purchase=120, purchase_count_12m=1, and avg_order_value=30. What class does the tree predict?
- (c) Would a retention manager find these rules intuitive? Why or why not?
# Hint:
from sklearn.tree import DecisionTreeClassifier, export_text

dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X_train, y_train)
print(export_text(dt, feature_names=feature_names))
Exercise 7.15: Overfitting Demonstration
Train two decision tree classifiers on the Athena data:
- Tree A: max_depth=None (no limit)
- Tree B: max_depth=5
For each tree:
- (a) Report the training accuracy and test accuracy.
- (b) Report the number of leaves in the tree (use tree.get_n_leaves()).
- (c) Which tree is overfit? What evidence supports your conclusion?
- (d) What is the approximate performance gap between training and test accuracy for the overfit tree?
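The comparison can be sketched as follows, on stand-in data rather than the Athena split; the flip_y label noise is added here so the train-test gap is clearly visible even in a toy example.

```python
# Sketch of the overfitting comparison (swap in the Athena train/test split).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, flip_y=0.2,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

results = {}
for name, depth in [("Tree A (no limit)", None), ("Tree B (depth 5)", 5)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[name] = (tree.score(X_train, y_train),   # train accuracy
                     tree.score(X_test, y_test),     # test accuracy
                     tree.get_n_leaves())            # leaf count
    print(name, results[name])
```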
Exercise 7.16: Random Forest Hyperparameter Exploration
Train four random forest classifiers on the Athena data, varying n_estimators:
- 10 trees
- 50 trees
- 200 trees
- 500 trees
Hold all other hyperparameters constant. For each, report the AUC-ROC on the test set.
- (a) At what point does adding more trees stop meaningfully improving performance?
- (b) Does training time increase linearly with the number of trees? Measure and report.
- (c) Given these results, what number of trees would you recommend for production deployment? Justify your choice.
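A sweep-with-timing loop might look like the sketch below, again on stand-in data; time.perf_counter is used because it is the standard-library choice for measuring short elapsed intervals.

```python
# Sketch of the n_estimators sweep with fit timing (stand-in data).
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

aucs = {}
for n in (10, 50, 200, 500):
    start = time.perf_counter()
    rf = RandomForestClassifier(n_estimators=n, random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    aucs[n] = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
    print(f"n_estimators={n:>3}  AUC={aucs[n]:.4f}  fit_time={elapsed:.2f}s")
```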
Exercise 7.17: Feature Engineering Challenge
Starting with the raw Athena data, engineer at least three new features not already in the dataset. Examples to consider:
- Revenue per month of tenure
- A binary flag for "high returner" (return rate > some threshold)
- Interaction between purchase trend and email engagement
Train a random forest model with and without your engineered features. Report the AUC-ROC for each. Did your features improve the model? Which engineered feature had the highest importance?
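Two of the suggested features can be engineered and checked in a few lines. The column names (total_revenue, tenure_months, return_rate) and the 0.25 returner threshold below are illustrative stand-ins, not columns guaranteed to exist in the Athena data.

```python
# Sketch: engineer a ratio feature and a binary flag, then inspect
# importances. Toy data; column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "total_revenue": [1200, 300, 80, 950, 40, 610],
    "tenure_months": [24, 6, 8, 19, 2, 12],
    "return_rate": [0.02, 0.30, 0.45, 0.05, 0.50, 0.10],
    "churned": [0, 1, 1, 0, 1, 0],
})
df["revenue_per_month"] = df["total_revenue"] / df["tenure_months"]
df["high_returner"] = (df["return_rate"] > 0.25).astype(int)  # threshold is illustrative

X = df.drop(columns="churned")
rf = RandomForestClassifier(random_state=42).fit(X, df["churned"])
importances = (pd.Series(rf.feature_importances_, index=X.columns)
               .sort_values(ascending=False))
print(importances)
```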
Exercise 7.18: Class Imbalance Strategies
Using the Athena data:
- (a) Train a logistic regression model on the raw imbalanced data (no class weights, no resampling). Report precision, recall, and F1 for the churn class.
- (b) Train the same model with class_weight='balanced'. Report the same metrics.
- (c) Use SMOTE to resample the training data and train a third model. Report the same metrics.
- (d) Compare all three approaches. Which strategy produces the best recall? Which produces the best precision? Which would you recommend for Athena's use case, and why?
# Hint for SMOTE:
# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
Exercise 7.19: The Complete ChurnClassifier
Run the full ChurnClassifier pipeline from the chapter. Then extend it with one of the following enhancements:
- (a) Add a plot_roc_curves() method that plots ROC curves for all three models on the same chart using matplotlib.
- (b) Add a cross_validate() method that performs 5-fold cross-validation and reports mean and standard deviation for AUC-ROC.
- (c) Add a top_risk_customers() method that returns the top N highest-risk customers from the test set, sorted by churn probability.
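For option (b), the core call is scikit-learn's cross_val_score with scoring="roc_auc". A stand-alone sketch, with a stand-in model and dataset in place of the ChurnClassifier internals:

```python
# Sketch of 5-fold cross-validated AUC-ROC for enhancement (b).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="roc_auc")
print(f"AUC-ROC: {scores.mean():.3f} +/- {scores.std():.3f}")
```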
Section D: Business Application
Exercise 7.20: ML Canvas for a New Use Case
Choose one of the following classification problems and complete the ML Canvas (introduced in Chapter 6):
- (a) Predicting employee attrition for a 5,000-person company
- (b) Predicting which free-trial users will convert to paid subscribers for a SaaS product
- (c) Predicting which insurance claims are likely fraudulent
Your canvas should address: value proposition, prediction target, data sources, features (at least eight), training data, model output, decision integration, evaluation metrics, failure modes, and monitoring plan.
Exercise 7.21: The Threshold Decision
You are the VP of Marketing at a subscription box company. Your data science team presents a churn model with AUC = 0.86. They ask you to decide on a classification threshold. Your team provides the following information:
- Monthly subscription: $39.99
- Average customer lifetime: 14 months
- Customer acquisition cost: $65
- Retention offer cost: $25 (one free box)
- Estimated save rate if targeted: 35 percent
- (a) Calculate the customer lifetime value (CLV) for an average customer.
- (b) Calculate the cost of a false negative (missed churner) and a false positive (unnecessary retention offer).
- (c) Based on these economics, would you prefer a lower or higher threshold? Explain your reasoning.
- (d) At what threshold would the cost of false positives start to outweigh the benefit of catching additional true positives? Describe conceptually how you would determine this.
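The arithmetic for parts (a) and (b) can be sketched as below. Note the simplifying assumption, which your answer should question: a saved customer is treated as restarting a full average lifetime.

```python
# Sketch of the unit economics for parts (a)-(b). The "saved customer
# restarts a full average lifetime" assumption is a simplification.
MONTHLY_FEE = 39.99
AVG_LIFETIME_MONTHS = 14
OFFER_COST = 25.0
SAVE_RATE = 0.35

clv = MONTHLY_FEE * AVG_LIFETIME_MONTHS  # (a) simple CLV
fp_cost = OFFER_COST                     # (b) a wasted retention offer
# A false negative forfeits the expected value of targeting the churner:
fn_cost = SAVE_RATE * clv - OFFER_COST
print(f"CLV=${clv:.2f}  FP cost=${fp_cost:.2f}  FN cost=${fn_cost:.2f}")
```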
Exercise 7.22: Communicating Model Results
Your churn model produces the following results on a test set of 5,000 customers:
Confusion Matrix:
|  | Predicted: Stay | Predicted: Churn |
|---|---|---|
| Actual: Stayed | 3,800 | 250 |
| Actual: Churned | 350 | 600 |
Prepare a one-page executive briefing for the CMO that includes:
- (a) A plain-English explanation of what these numbers mean (avoid jargon like "true positive" and "false negative")
- (b) The recommended action for each of the four groups in the matrix
- (c) An honest assessment of the model's limitations
- (d) A recommendation for next steps
Exercise 7.23: Feature Selection as a Business Conversation
A healthcare insurance company wants to predict which members are at risk of not renewing their policy. The data science team proposes using the following features:
- Age, gender, zip code
- Number of claims filed in the past year
- Total claim dollar amount
- Number of customer service calls
- Type of plan (individual, family, employer-sponsored)
- Whether the member visited the online portal in the last 90 days
- Credit score
For each feature:
- (a) Explain why it might be predictive of non-renewal.
- (b) Identify any ethical or legal concerns with using it (consider fair lending laws, disparate impact, and Chapter 25's bias discussion).
- (c) Recommend whether to include it, exclude it, or include it with safeguards.
Exercise 7.24: Debugging a Poorly Performing Model
Your team builds a classification model to predict whether a customer will respond to a promotional email. The model achieves AUC = 0.52 on the test set (barely better than random guessing). Propose at least five specific diagnoses and corresponding remedies. Consider:
- Data quality issues
- Target variable definition problems
- Feature engineering gaps
- Data leakage possibilities
- Fundamental problem framing issues
Exercise 7.25: Algorithm Selection
For each of the following business scenarios, recommend which classification algorithm(s) to try first and explain your reasoning:
- (a) A bank needs to explain every credit decision to regulators
- (b) A startup has 500 labeled examples and needs a quick prototype
- (c) An e-commerce company has 50 million transactions and needs maximum accuracy
- (d) A hospital needs to classify medical images (this is a preview -- think about what you've learned so far and what might be different)
- (e) A nonprofit with limited technical resources needs to identify donors most likely to give this year
Section E: Challenge Problems
Exercise 7.26: Build Your Own Classifier
Choose a publicly available binary classification dataset (suggestions: the Kaggle Titanic dataset, the UCI Heart Disease dataset, or the Telco Customer Churn dataset). Adapt the ChurnClassifier class from the chapter to work with your chosen dataset. Report:
- (a) Dataset description and target variable
- (b) Feature engineering steps you performed
- (c) Model comparison results (at least three models)
- (d) Feature importance analysis
- (e) Optimal threshold based on a business scenario you define
- (f) A one-paragraph "business report" explaining your findings to a non-technical stakeholder
Exercise 7.27: The Cost of Delayed Prediction
Athena's model uses a 180-day churn definition. Ravi's team debates whether a 90-day window would be better (faster intervention) or worse (less data, less reliable predictions).
- (a) What are the business advantages of a shorter prediction window?
- (b) What are the data science disadvantages?
- (c) Design an experiment that would test both windows and determine which creates more business value. Describe your methodology, success metrics, and timeline.
Exercise 7.28: Multi-Model Ensemble
Build a simple ensemble classifier that combines the predictions of logistic regression, random forest, and XGBoost through weighted averaging of predicted probabilities. Experiment with different weight combinations (e.g., equal weights vs. AUC-weighted). Does the ensemble outperform the best individual model? Under what conditions might it not?
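The weighted-averaging mechanics can be sketched with two models; add XGBoost as the third per the exercise (it is omitted here only to keep the sketch free of non-scikit-learn dependencies). The data is a stand-in.

```python
# Sketch of weighted probability averaging with two stand-in models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

probas, aucs = [], []
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=42)):
    model.fit(X_train, y_train)
    p = model.predict_proba(X_test)[:, 1]
    probas.append(p)
    aucs.append(roc_auc_score(y_test, p))

equal = np.mean(probas, axis=0)                 # equal weights
weights = np.array(aucs) / sum(aucs)            # AUC-proportional weights
weighted = np.average(probas, axis=0, weights=weights)

auc_equal = roc_auc_score(y_test, equal)
auc_weighted = roc_auc_score(y_test, weighted)
print("individual AUCs:", [round(a, 4) for a in aucs])
print("equal:", round(auc_equal, 4), " weighted:", round(auc_weighted, 4))
```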
Selected solutions are available in Appendix B: Answers to Selected Exercises.