# Chapter 34: Key Takeaways — Predictive Models: Regression and Classification

## The Central Insight
Most business prediction problems fall into one of two categories: predicting a number (regression) or predicting a category (classification). Both use the same fundamental workflow — features in, trained model, predictions out — but they require different algorithms, different evaluation metrics, and different ways of communicating results to stakeholders.
The hardest part of applied machine learning in business settings is not the modeling. It is asking the right question, building honest evaluation, and translating probability scores into decisions that real people can act on.
## Algorithm Selection Guide

| Problem Type | Output | Algorithm | When to Use |
|---|---|---|---|
| Predict a quantity | Continuous number | `LinearRegression` | Continuous target, interpretability needed |
| Predict a category | Class label + probability | `LogisticRegression` | Binary outcome, interpretability critical |
| Classify with complex patterns | Class label + probability | `DecisionTreeClassifier` | Stakeholders need to see decision rules |
| Highest-accuracy classification | Class label + probability | `RandomForestClassifier` | Performance priority, some interpretability |
| Predict a quantity with complex patterns | Continuous number | `RandomForestRegressor` | Non-linear relationships, high dimensionality |
**Default starting point:** logistic regression for classification, linear regression for regression. Only move to more complex models if cross-validation shows meaningful improvement.
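That check can be run directly: evaluate the simple and the complex model under the same cross-validation and compare mean ± std. A minimal sketch, using `make_classification` as a synthetic stand-in for a real feature table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data — replace with your own X and y
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", random_state=42)),
])
complex_model = RandomForestClassifier(n_estimators=100, random_state=42)

baseline_f1 = cross_val_score(baseline, X, y, cv=cv, scoring="f1")
complex_f1 = cross_val_score(complex_model, X, y, cv=cv, scoring="f1")

# Only switch if the gain is clearly larger than fold-to-fold noise
print(f"Logistic:      {baseline_f1.mean():.3f} ± {baseline_f1.std():.3f}")
print(f"Random forest: {complex_f1.mean():.3f} ± {complex_f1.std():.3f}")
```

If the random forest's mean F1 sits within one standard deviation of the logistic baseline, the simpler, more interpretable model is usually the better choice.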
## The Universal scikit-learn Workflow

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# 1. Prepare features and target
X = df[feature_columns]
y = df[target_column]

# 2. Split (always stratify for classification)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# 3. Scale (fit on train only — never on test)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, not fit_transform

# 4. Train
model = LogisticRegression(class_weight="balanced", random_state=42)
model.fit(X_train_scaled, y_train)

# 5. Evaluate
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]
print(classification_report(y_test, y_pred))
print(f"AUC: {roc_auc_score(y_test, y_prob):.3f}")
```
**The most common mistake:** calling `fit_transform()` on the full dataset before splitting. This leaks test-set statistics into the scaler, producing optimistic evaluation scores that do not hold in production.
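The difference is easy to demonstrate. In the leaky version, the scaler's mean and standard deviation are computed from rows that later end up in the test set; in the correct version they come from the training rows only. A sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(200, 3))  # synthetic features
y = rng.integers(0, 2, size=200)

# WRONG: the scaler sees test rows before the split
X_leaky = StandardScaler().fit_transform(X)

# RIGHT: split first, fit the scaler on the training rows only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train-only statistics differ from full-data statistics
print("Train-only mean:", scaler.mean_)
print("Full-data mean: ", X.mean(axis=0))
```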
## Regression Metrics Reference

| Metric | Formula | Interpretation | Good Value |
|---|---|---|---|
| R² | 1 − SS_res / SS_tot | Fraction of variance explained | Closer to 1 is better; 0 means model = mean; negative means model is worse than the mean |
| MAE | mean(\|y − ŷ\|) | Average error, in the target's own units | As low as possible, judged against the target's typical scale |
| RMSE | sqrt(mean((y − ŷ)²)) | Like MAE, but penalizes large errors more | As low as possible; RMSE much larger than MAE signals outlier errors |
```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R²: {r2:.3f}")
print(f"MAE: {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
```
## Classification Metrics Reference
| Metric | Formula | When It Matters |
|---|---|---|
| Accuracy | (TP + TN) / Total | Only meaningful when classes are balanced |
| Precision | TP / (TP + FP) | When false positives are costly |
| Recall | TP / (TP + FN) | When false negatives are costly |
| F1 | 2 × (Precision × Recall) / (Precision + Recall) | When both precision and recall matter |
| AUC | Area under ROC curve | Model ranking quality independent of threshold |
**The precision-recall tradeoff:** lowering the classification threshold increases recall (you flag more true positives) at the cost of precision (you also flag more false positives). There is no free lunch.

**For imbalanced datasets:** always report recall and F1 alongside accuracy. A model that always predicts the majority class achieves high accuracy while having zero recall on the minority class.
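The tradeoff can be read directly off a fitted model with `precision_recall_curve`, which evaluates every candidate threshold at once. A sketch on synthetic imbalanced data (about 15% positives):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data — replace with your own X and y
X, y = make_classification(n_samples=1000, weights=[0.85], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

# One (precision, recall) pair per candidate threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

# Sample every 20th threshold: as it rises, recall falls
for t, p, r in zip(thresholds[::20], precision[::20], recall[::20]):
    print(f"threshold {t:.2f}: precision {p:.2f}, recall {r:.2f}")
```

Walking down the printed table is exactly the business decision from this chapter: pick the threshold whose precision/recall balance matches the cost of each error type.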
## Confusion Matrix Layout

```
                     Predicted Negative      Predicted Positive
Actual Negative      True Negative (TN)      False Positive (FP)
Actual Positive      False Negative (FN)     True Positive (TP)
```
```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(f"True Negatives:  {cm[0, 0]}")
print(f"False Positives: {cm[0, 1]}")
print(f"False Negatives: {cm[1, 0]}")
print(f"True Positives:  {cm[1, 1]}")
```
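Every classification metric in the table above can be recomputed from these four counts, which makes a useful cross-check against `classification_report`. A worked example with small illustrative label vectors:

```python
from sklearn.metrics import confusion_matrix, f1_score

# Small illustrative example: 10 true labels vs 10 predictions
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # row-major order: TN, FP, FN, TP

precision = tp / (tp + fp)  # 3 / (3 + 1) = 0.75
recall = tp / (tp + fn)     # 3 / (3 + 1) = 0.75
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
print(f"sklearn f1={f1_score(y_true, y_pred):.2f}")  # same value
```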
## Cross-Validation: The Right Way to Evaluate

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"F1:  {f1_scores.mean():.3f} ± {f1_scores.std():.3f}")
print(f"AUC: {auc_scores.mean():.3f} ± {auc_scores.std():.3f}")
```
**Why cross-validation over a single split:**

- A single train/test split gives one estimate, which can be lucky or unlucky
- 5-fold CV gives five estimates from five different test sets
- The standard deviation tells you how stable the model's performance is
- High variance across folds (std > 0.10 for F1) means the model is sensitive to which data it sees — a warning sign
## Feature Engineering Patterns
| Pattern | Example | Why It Works |
|---|---|---|
| Recency | Days since last purchase | Captures behavioral change better than raw dates |
| Rate | Spend per day active | Normalizes for customer tenure |
| Interaction | Team size × complexity score | Captures synergies raw features miss |
| Ratio | Support tickets per order | Expresses one behavior relative to another |
| Flag | is_new_client (0/1) | Captures categorical boundaries as binary |
| Lag | Previous month's orders | Temporal context for prediction |
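Several of these patterns reduce to a few pandas expressions. A sketch on a tiny hypothetical customer frame; the column names (`last_purchase`, `first_seen`, `total_spend`) are illustrative, not from the chapter's Acme dataset:

```python
import pandas as pd

# Hypothetical customer-level frame — column names are illustrative
df = pd.DataFrame({
    "customer": ["A", "B", "C"],
    "last_purchase": pd.to_datetime(["2023-11-01", "2023-06-15", "2023-12-20"]),
    "first_seen": pd.to_datetime(["2023-01-01", "2023-05-01", "2023-12-01"]),
    "total_spend": [1200.0, 300.0, 150.0],
})
as_of = pd.Timestamp("2023-12-31")  # snapshot date for the features

# Recency: days since last purchase
df["days_since_purchase"] = (as_of - df["last_purchase"]).dt.days

# Rate: spend per day of tenure
df["tenure_days"] = (as_of - df["first_seen"]).dt.days
df["spend_per_day"] = df["total_spend"] / df["tenure_days"]

# Flag: new clients (under 60 days of tenure) as a 0/1 column
df["is_new_client"] = (df["tenure_days"] < 60).astype(int)

print(df[["customer", "days_since_purchase", "spend_per_day", "is_new_client"]])
```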
## Preventing Overfitting

**Signs of overfitting:**

- Training accuracy >> test accuracy (gap > 10-15%)
- Test score varies wildly across cross-validation folds
- Simple models (logistic regression) perform nearly as well as complex ones — the extra complexity is fitting noise
**Controls for each algorithm:**

| Algorithm | Overfitting Control | How It Works |
|---|---|---|
| `LogisticRegression` | `C` parameter (lower = stronger regularization) | Penalizes large coefficients |
| `DecisionTreeClassifier` | `max_depth`, `min_samples_leaf` | Prevents trees from memorizing individual samples |
| `RandomForestClassifier` | `max_depth`, `min_samples_leaf`, `n_estimators` | Averaging many weak trees reduces variance |
```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Logistic regression with regularization
LogisticRegression(C=0.1, penalty="l2")   # Stronger regularization
LogisticRegression(C=10.0, penalty="l2")  # Weaker regularization

# Decision tree with depth limit
DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)

# Random forest with depth limit
RandomForestClassifier(n_estimators=100, max_depth=6, min_samples_leaf=5)
```
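The effect of these controls shows up directly in the train/test gap: an unconstrained tree typically scores 100% on its own training data, while a depth-capped tree narrows the gap. A sketch on synthetic data with deliberate label noise so there is something to memorize:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise — replace with your own X and y
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

results = {}
for depth in [None, 5]:  # unconstrained vs depth-capped
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    train_acc, test_acc = results[depth]
    print(f"max_depth={depth}: train {train_acc:.2f}, "
          f"test {test_acc:.2f}, gap {train_acc - test_acc:+.2f}")
```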
## Handling Class Imbalance
When the positive class (churners, fraud, defaults) is rare, classifiers tend to ignore it:
```python
# Option 1: class_weight parameter (simplest, always try first)
LogisticRegression(class_weight="balanced")
RandomForestClassifier(class_weight="balanced")

# Option 2: Adjust the threshold (no retraining required)
y_prob = model.predict_proba(X_test)[:, 1]
threshold = 0.30  # Lower than the default 0.50 to catch more positives
y_pred = (y_prob >= threshold).astype(int)

# Option 3: stratify= in train_test_split (always do this)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.20, random_state=42
)
```
## Feature Importance: Random Forest

```python
import pandas as pd

importance_df = pd.DataFrame({
    "feature": feature_columns,
    "importance": model.feature_importances_,
}).sort_values("importance", ascending=False)
print(importance_df.head(10).to_string(index=False))
```
**Caveats:**

- Feature importances reflect predictive power in the training data, not causal relationships
- Highly correlated features split importance between them — one may appear unimportant even if it is not
- Use importances to guide investigation, not as final business conclusions
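One complementary technique, not covered in the chapter but worth knowing: `sklearn.inspection.permutation_importance` measures how much the test score drops when each feature's values are shuffled, which sidesteps some biases of impurity-based importances. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 8 features, only 3 of them informative
X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature 10 times and measure the drop in test score
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:+.3f} "
          f"± {result.importances_std[i]:.3f}")
```

Features whose mean importance is near zero (within its own std) can be shuffled without hurting the model, a much stronger statement than a small impurity-based importance.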
## Using a Pipeline

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", random_state=42)),
])

# Pipeline handles fit/transform correctly in cross-validation
scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1")

# Fits scaler and model together on train, applies only transform on test
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```
A Pipeline prevents data leakage in cross-validation by ensuring the scaler is refitted on each fold's training data.
## Churn Model Quick Reference

The complete Acme churn workflow in five steps:

```
acme_customers.xlsx  }
acme_sales_2023.csv  } --> build_churn_dataset()
support_tickets.csv  }            |
                                  v
                 customer-level feature DataFrame
               (one row per customer, 90-day window)
                                  |
                     train_and_compare_models()
                  (logistic, tree, random forest)
                (5-fold stratified cross-validation)
                                  |
                         select best model
                                  |
                    generate_churn_risk_report()
                 (top-N at-risk accounts, sorted by
                    predicted churn probability)
                                  |
                      deliver to Sandra's team
                     (prioritized outreach list)
```
## One-Sentence Summaries
- Linear Regression: Predict a number from other numbers, assuming a roughly linear relationship.
- Logistic Regression: Predict the probability of a yes/no outcome; coefficients are directly interpretable.
- Decision Tree: A series of if/then rules learned from data; readable by humans, prone to overfitting.
- Random Forest: Average many decision trees trained on random subsets; less interpretable but more accurate.
- Train/Test Split: Never evaluate a model on the same data you used to train it.
- Cross-Validation: Run train/test split multiple times on different portions to get a stable estimate.
- R²: Fraction of variance in the target explained by the model — not a measure of business usefulness.
- Precision: Of all customers flagged at-risk, what fraction actually churned?
- Recall: Of all customers who actually churned, what fraction did the model flag?
- AUC: How well does the model rank customers from highest to lowest risk, independent of any threshold?
- Data Leakage: Using information in training that would not be available at prediction time.
- Overfitting: The model memorized the training data instead of learning patterns that generalize.
- Feature Engineering: Creating informative new variables from raw data — often more valuable than the algorithm choice.
- `class_weight="balanced"`: Instructs the model to treat minority class errors as proportionally more costly.
## Chapter Checklist
Before moving to Chapter 35, you should be able to:
- [ ] Explain the difference between regression and classification, and identify which type applies to a given business question
- [ ] Write the complete scikit-learn workflow: split, scale (on train only), fit, predict, evaluate
- [ ] Calculate and interpret R², MAE, and RMSE for a regression model
- [ ] Build a confusion matrix and calculate precision, recall, and F1 from it
- [ ] Explain why accuracy is misleading for imbalanced classification
- [ ] Train a logistic regression, decision tree, and random forest on the same dataset
- [ ] Run 5-fold stratified cross-validation and interpret mean ± std results
- [ ] Extract and interpret feature importances from a Random Forest
- [ ] Apply `class_weight="balanced"` for imbalanced classification
- [ ] Use `sklearn.pipeline.Pipeline` to prevent data leakage in cross-validation
- [ ] Engineer at least two features from raw date or categorical columns
- [ ] Communicate model results and limitations to a non-technical stakeholder