# Chapter 34: Key Takeaways — Predictive Models: Regression and Classification

## The Central Insight
Most business prediction problems fall into one of two categories: predicting a number (regression) or predicting a category (classification). Both use the same fundamental workflow — features in, trained model, predictions out — but they require different algorithms, different evaluation metrics, and different ways of communicating results to stakeholders.
The hardest part of applied machine learning in business settings is not the modeling. It is asking the right question, building honest evaluation, and translating probability scores into decisions that real people can act on.
## Algorithm Selection Guide

| Problem Type | Output | Algorithm | When to Use |
|---|---|---|---|
| Predict a quantity | Continuous number | `LinearRegression` | Continuous target, interpretability needed |
| Predict a category | Class label + probability | `LogisticRegression` | Binary outcome, interpretability critical |
| Classify with complex patterns | Class label + probability | `DecisionTreeClassifier` | Stakeholders need to see decision rules |
| Highest-accuracy classification | Class label + probability | `RandomForestClassifier` | Performance priority, some interpretability |
| Predict a quantity with complex patterns | Continuous number | `RandomForestRegressor` | Non-linear relationships, high dimensionality |
**Default starting point:** logistic regression for classification, linear regression for regression. Only move to more complex models if cross-validation shows meaningful improvement.
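That check can be run directly: evaluate the simple and the complex model under the same cross-validation and compare mean ± std. A minimal sketch, using `make_classification` as a synthetic stand-in for a real feature table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data — replace with your own X and y
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", random_state=42)),
])
complex_model = RandomForestClassifier(n_estimators=100, random_state=42)

baseline_f1 = cross_val_score(baseline, X, y, cv=cv, scoring="f1")
complex_f1 = cross_val_score(complex_model, X, y, cv=cv, scoring="f1")

# Only switch if the gain is clearly larger than fold-to-fold noise
print(f"Logistic:      {baseline_f1.mean():.3f} ± {baseline_f1.std():.3f}")
print(f"Random forest: {complex_f1.mean():.3f} ± {complex_f1.std():.3f}")
```

If the random forest's mean F1 sits within one standard deviation of the logistic baseline, the simpler, more interpretable model is usually the better choice.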
## The Universal scikit-learn Workflow

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# 1. Prepare features and target
X = df[feature_columns]
y = df[target_column]

# 2. Split (always stratify for classification)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# 3. Scale (fit on train only — never on test)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, not fit_transform

# 4. Train
model = LogisticRegression(class_weight="balanced", random_state=42)
model.fit(X_train_scaled, y_train)

# 5. Evaluate
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]
print(classification_report(y_test, y_pred))
print(f"AUC: {roc_auc_score(y_test, y_prob):.3f}")
```
**The most common mistake:** calling `fit_transform()` on the full dataset before splitting. This leaks test-set statistics into the scaler, producing optimistic evaluation scores that do not hold in production.
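The difference is easy to demonstrate. In the leaky version, the scaler's mean and standard deviation are computed from rows that later end up in the test set; in the correct version they come from the training rows only. A sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(200, 3))  # synthetic features
y = rng.integers(0, 2, size=200)

# WRONG: the scaler sees test rows before the split
X_leaky = StandardScaler().fit_transform(X)

# RIGHT: split first, fit the scaler on the training rows only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train-only statistics differ from full-data statistics
print("Train-only mean:", scaler.mean_)
print("Full-data mean: ", X.mean(axis=0))
```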
## Regression Metrics Reference

| Metric | Formula | Interpretation | Good Value |
|---|---|---|---|
| R² | 1 − SS_res / SS_tot | Fraction of variance explained | Closer to 1 is better; 0 means model = mean; negative means model is worse than the mean |
| MAE | mean(\|y − ŷ\|) | Average error, in the target's own units | As low as possible, judged against the target's typical scale |
| RMSE | sqrt(mean((y − ŷ)²)) | Like MAE, but penalizes large errors more | As low as possible; RMSE much larger than MAE signals outlier errors |
```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R²: {r2:.3f}")
print(f"MAE: {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
```
## Classification Metrics Reference
| Metric | Formula | When It Matters |
|---|---|---|
| Accuracy | (TP + TN) / Total | Only meaningful when classes are balanced |
| Precision | TP / (TP + FP) | When false positives are costly |
| Recall | TP / (TP + FN) | When false negatives are costly |
| F1 | 2 × (Precision × Recall) / (Precision + Recall) | When both precision and recall matter |
| AUC | Area under ROC curve | Model ranking quality independent of threshold |
**The precision-recall tradeoff:** lowering the classification threshold increases recall (you flag more true positives) at the cost of precision (you also flag more false positives). There is no free lunch.

**For imbalanced datasets:** always report recall and F1 alongside accuracy. A model that always predicts the majority class achieves high accuracy while having zero recall on the minority class.
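The tradeoff can be read directly off a fitted model with `precision_recall_curve`, which evaluates every candidate threshold at once. A sketch on synthetic imbalanced data (about 15% positives):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data — replace with your own X and y
X, y = make_classification(n_samples=1000, weights=[0.85], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

# One (precision, recall) pair per candidate threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

# Sample every 20th threshold: as it rises, recall falls
for t, p, r in zip(thresholds[::20], precision[::20], recall[::20]):
    print(f"threshold {t:.2f}: precision {p:.2f}, recall {r:.2f}")
```

Walking down the printed table is exactly the business decision from this chapter: pick the threshold whose precision/recall balance matches the cost of each error type.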
## Confusion Matrix Layout

```
                     Predicted Negative      Predicted Positive
Actual Negative      True Negative (TN)      False Positive (FP)
Actual Positive      False Negative (FN)     True Positive (TP)
```
```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(f"True Negatives:  {cm[0, 0]}")
print(f"False Positives: {cm[0, 1]}")
print(f"False Negatives: {cm[1, 0]}")
print(f"True Positives:  {cm[1, 1]}")
```
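Every classification metric in the table above can be recomputed from these four counts, which makes a useful cross-check against `classification_report`. A worked example with small illustrative label vectors:

```python
from sklearn.metrics import confusion_matrix, f1_score

# Small illustrative example: 10 true labels vs 10 predictions
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # row-major order: TN, FP, FN, TP

precision = tp / (tp + fp)  # 3 / (3 + 1) = 0.75
recall = tp / (tp + fn)     # 3 / (3 + 1) = 0.75
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
print(f"sklearn f1={f1_score(y_true, y_pred):.2f}")  # same value
```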
## Cross-Validation: The Right Way to Evaluate

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"F1:  {f1_scores.mean():.3f} ± {f1_scores.std():.3f}")
print(f"AUC: {auc_scores.mean():.3f} ± {auc_scores.std():.3f}")
```
**Why cross-validation over a single split:**

- A single train/test split gives one estimate, which can be lucky or unlucky
- 5-fold CV gives five estimates from five different test sets
- The standard deviation tells you how stable the model's performance is
- High variance across folds (std > 0.10 for F1) means the model is sensitive to which data it sees — a warning sign
## Feature Engineering Patterns
| Pattern | Example | Why It Works |
|---|---|---|
| Recency | Days since last purchase | Captures behavioral change better than raw dates |
| Rate | Spend per day active | Normalizes for customer tenure |
| Interaction | Team size × complexity score | Captures synergies raw features miss |
| Ratio | Support tickets per order | Expresses one behavior relative to another |
| Flag | is_new_client (0/1) | Captures categorical boundaries as binary |
| Lag | Previous month's orders | Temporal context for prediction |
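Several of these patterns reduce to a few pandas expressions. A sketch on a tiny hypothetical customer frame; the column names (`last_purchase`, `first_seen`, `total_spend`) are illustrative, not from the chapter's Acme dataset:

```python
import pandas as pd

# Hypothetical customer-level frame — column names are illustrative
df = pd.DataFrame({
    "customer": ["A", "B", "C"],
    "last_purchase": pd.to_datetime(["2023-11-01", "2023-06-15", "2023-12-20"]),
    "first_seen": pd.to_datetime(["2023-01-01", "2023-05-01", "2023-12-01"]),
    "total_spend": [1200.0, 300.0, 150.0],
})
as_of = pd.Timestamp("2023-12-31")  # snapshot date for the features

# Recency: days since last purchase
df["days_since_purchase"] = (as_of - df["last_purchase"]).dt.days

# Rate: spend per day of tenure
df["tenure_days"] = (as_of - df["first_seen"]).dt.days
df["spend_per_day"] = df["total_spend"] / df["tenure_days"]

# Flag: new clients (under 60 days of tenure) as a 0/1 column
df["is_new_client"] = (df["tenure_days"] < 60).astype(int)

print(df[["customer", "days_since_purchase", "spend_per_day", "is_new_client"]])
```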
## Preventing Overfitting

**Signs of overfitting:**

- Training accuracy >> test accuracy (gap > 10-15%)
- Test score varies wildly across cross-validation folds
- Simple models (logistic regression) perform nearly as well as complex ones — the extra complexity is fitting noise
**Controls for each algorithm:**

| Algorithm | Overfitting Control | How It Works |
|---|---|---|
| `LogisticRegression` | `C` parameter (lower = stronger regularization) | Penalizes large coefficients |
| `DecisionTreeClassifier` | `max_depth`, `min_samples_leaf` | Prevents trees from memorizing individual samples |
| `RandomForestClassifier` | `max_depth`, `min_samples_leaf`, `n_estimators` | Averaging many weak trees reduces variance |
```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Logistic regression with regularization
LogisticRegression(C=0.1, penalty="l2")   # Stronger regularization
LogisticRegression(C=10.0, penalty="l2")  # Weaker regularization

# Decision tree with depth limit
DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)

# Random forest with depth limit
RandomForestClassifier(n_estimators=100, max_depth=6, min_samples_leaf=5)
```
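The effect of these controls shows up directly in the train/test gap: an unconstrained tree typically scores 100% on its own training data, while a depth-capped tree narrows the gap. A sketch on synthetic data with deliberate label noise so there is something to memorize:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise — replace with your own X and y
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

results = {}
for depth in [None, 5]:  # unconstrained vs depth-capped
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    train_acc, test_acc = results[depth]
    print(f"max_depth={depth}: train {train_acc:.2f}, "
          f"test {test_acc:.2f}, gap {train_acc - test_acc:+.2f}")
```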
## Handling Class Imbalance
When the positive class (churners, fraud, defaults) is rare, classifiers tend to ignore it:
```python
# Option 1: class_weight parameter (simplest, always try first)
LogisticRegression(class_weight="balanced")
RandomForestClassifier(class_weight="balanced")

# Option 2: Adjust the threshold (no retraining required)
y_prob = model.predict_proba(X_test)[:, 1]
threshold = 0.30  # Lower than the default 0.50 to catch more positives
y_pred = (y_prob >= threshold).astype(int)

# Option 3: stratify= in train_test_split (always do this)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.20, random_state=42
)
```
## Feature Importance: Random Forest

```python
import pandas as pd

importance_df = pd.DataFrame({
    "feature": feature_columns,
    "importance": model.feature_importances_,
}).sort_values("importance", ascending=False)
print(importance_df.head(10).to_string(index=False))
```
**Caveats:**

- Feature importances reflect predictive power in the training data, not causal relationships
- Highly correlated features split importance between them — one may appear unimportant even if it is not
- Use importances to guide investigation, not as final business conclusions
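One complementary technique, not covered in the chapter but worth knowing: `sklearn.inspection.permutation_importance` measures how much the test score drops when each feature's values are shuffled, which sidesteps some biases of impurity-based importances. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 8 features, only 3 of them informative
X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature 10 times and measure the drop in test score
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:+.3f} "
          f"± {result.importances_std[i]:.3f}")
```

Features whose mean importance is near zero (within its own std) can be shuffled without hurting the model, a much stronger statement than a small impurity-based importance.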
## Using a Pipeline

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", random_state=42)),
])

# Pipeline handles fit/transform correctly in cross-validation
scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1")

# Fits scaler and model together on train, applies only transform on test
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```
A Pipeline prevents data leakage in cross-validation by ensuring the scaler is refitted on each fold's training data.
## Churn Model Quick Reference

The complete Acme churn workflow in five steps:

```
acme_customers.xlsx  }
acme_sales_2023.csv  } --> build_churn_dataset()
support_tickets.csv  }            |
                                  v
                 customer-level feature DataFrame
               (one row per customer, 90-day window)
                                  |
                     train_and_compare_models()
                  (logistic, tree, random forest)
                (5-fold stratified cross-validation)
                                  |
                         select best model
                                  |
                    generate_churn_risk_report()
                 (top-N at-risk accounts, sorted by
                    predicted churn probability)
                                  |
                      deliver to Sandra's team
                     (prioritized outreach list)
```
## One-Sentence Summaries
- Linear Regression: Predict a number from other numbers, assuming a roughly linear relationship.
- Logistic Regression: Predict the probability of a yes/no outcome; coefficients are directly interpretable.
- Decision Tree: A series of if/then rules learned from data; readable by humans, prone to overfitting.
- Random Forest: Average many decision trees trained on random subsets; less interpretable but more accurate.
- Train/Test Split: Never evaluate a model on the same data you used to train it.
- Cross-Validation: Run train/test split multiple times on different portions to get a stable estimate.
- R²: Fraction of variance in the target explained by the model — not a measure of business usefulness.
- Precision: Of all customers flagged at-risk, what fraction actually churned?
- Recall: Of all customers who actually churned, what fraction did the model flag?
- AUC: How well does the model rank customers from highest to lowest risk, independent of any threshold?
- Data Leakage: Using information in training that would not be available at prediction time.
- Overfitting: The model memorized the training data instead of learning patterns that generalize.
- Feature Engineering: Creating informative new variables from raw data — often more valuable than the algorithm choice.
- `class_weight="balanced"`: Instructs the model to treat minority class errors as proportionally more costly.
## Chapter Checklist
Before moving to Chapter 35, you should be able to:
- [ ] Explain the difference between regression and classification, and identify which type applies to a given business question
- [ ] Write the complete scikit-learn workflow: split, scale (on train only), fit, predict, evaluate
- [ ] Calculate and interpret R², MAE, and RMSE for a regression model
- [ ] Build a confusion matrix and calculate precision, recall, and F1 from it
- [ ] Explain why accuracy is misleading for imbalanced classification
- [ ] Train a logistic regression, decision tree, and random forest on the same dataset
- [ ] Run 5-fold stratified cross-validation and interpret mean ± std results
- [ ] Extract and interpret feature importances from a Random Forest
- [ ] Apply `class_weight="balanced"` for imbalanced classification
- [ ] Use `sklearn.pipeline.Pipeline` to prevent data leakage in cross-validation
- [ ] Engineer at least two features from raw date or categorical columns
- [ ] Communicate model results and limitations to a non-technical stakeholder