Key Takeaways: Logistic Regression and Classification — Predicting Categories

This is your reference card for Chapter 27. It covers classification, the sigmoid function, confusion matrices, and the critical precision-recall tradeoff that governs every classification decision.


The Core Idea

Logistic regression is linear regression wrapped in a sigmoid function. It computes a weighted sum of features (like linear regression), then squishes the result through the sigmoid to produce a probability between 0 and 1. This probability is then thresholded to make a classification decision.

Score     = intercept + w1*x1 + w2*x2 + ... + wn*xn    (linear part)
Probability = 1 / (1 + e^(-Score))                      (sigmoid)
Prediction = "positive" if Probability >= threshold      (decision)
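The three steps above can be traced in a few lines of Python. The weights and feature values here are hypothetical, chosen only to make the arithmetic visible:

```python
import math

def sigmoid(score: float) -> float:
    """Squash any real-valued score into a probability in (0, 1)."""
    return 1 / (1 + math.exp(-score))

# Hypothetical intercept, weights, and feature values for illustration.
intercept, w1, w2 = -1.0, 0.8, 0.5
x1, x2 = 2.0, 1.0

score = intercept + w1 * x1 + w2 * x2                           # linear part: 1.1
probability = sigmoid(score)                                    # sigmoid: ~0.75
prediction = "positive" if probability >= 0.5 else "negative"   # decision
print(f"{probability:.3f} -> {prediction}")
```

Note that `sigmoid(0)` is exactly 0.5: a score of zero means the model is maximally uncertain, and larger positive or negative scores push the probability toward 1 or 0.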

Key Concepts

  • Classification: Predicting a category (yes/no, high/low, spam/not spam), as opposed to regression which predicts a number.

  • Sigmoid function: Maps any number to the range (0, 1). S-shaped curve that squashes extreme values.

  • Threshold: The probability cutoff for classification decisions. Default is 0.5, but the optimal threshold depends on the costs of different types of errors.

  • predict_proba: Returns probability estimates, which are more informative than binary predictions. Use probabilities to rank, communicate uncertainty, and set custom thresholds.

  • Confusion matrix: Four-cell table showing TP, FP, TN, FN that reveals the types of errors, not just the count.

  • Precision: TP / (TP + FP). "When the model says yes, how often is it right?"

  • Recall (sensitivity): TP / (TP + FN). "Of all actual positives, how many does the model catch?"

  • F1-score: Harmonic mean of precision and recall. Only high when both are high.

  • Class imbalance: When one class dominates, accuracy is misleading. Use precision, recall, and F1 instead.


The scikit-learn Workflow

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict (binary)
y_pred = model.predict(X_test)

# Predict (probabilities)
y_proba = model.predict_proba(X_test)[:, 1]

# Evaluate
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

The Confusion Matrix

                  Predicted
              Positive    Negative
Actual  Pos     TP          FN      ← Recall = TP/(TP+FN)
        Neg     FP          TN

                ↑
         Precision = TP/(TP+FP)

Cell   Meaning              Also Called
TP     Correct positive     Hit
TN     Correct negative     Correct rejection
FP     Incorrect positive   False alarm, Type I error
FN     Missed positive      Miss, Type II error
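The metric formulas above are just arithmetic on these four cells. A worked example with hypothetical counts for a 100-example test set:

```python
# Hypothetical confusion-matrix counts (100 test examples in total).
TP, FP, TN, FN = 30, 10, 55, 5

precision = TP / (TP + FP)                         # 30/40 = 0.75
recall    = TP / (TP + FN)                         # 30/35 ≈ 0.857
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (TP + TN) / (TP + FP + TN + FN)        # 85/100 = 0.85

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

Notice that the F1-score (0.80 here) sits between precision and recall but closer to the smaller of the two, which is why it is only high when both are high.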

Precision vs. Recall Tradeoff

Lower threshold → More positives predicted
                → Higher recall (fewer misses)
                → Lower precision (more false alarms)

Higher threshold → Fewer positives predicted
                 → Lower recall (more misses)
                 → Higher precision (fewer false alarms)

When to prioritize recall: Missing a positive case is costly (cancer screening, security threats, fraud detection).

When to prioritize precision: False alarms are costly (spam filtering, criminal accusations, surgical recommendations).
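The tradeoff is easy to see by sweeping the threshold over the same set of predicted probabilities. The labels and probabilities below are a toy example, not output from a real model:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy true labels and hypothetical predicted probabilities.
y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_proba = np.array([0.9, 0.7, 0.6, 0.4, 0.55, 0.35, 0.3, 0.2, 0.1, 0.05])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_proba >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, the 0.3 threshold catches every positive (recall 1.0) at the cost of false alarms, while the 0.7 threshold makes no false alarms (precision 1.0) but misses half the positives.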


Handling Class Imbalance

# Strategy 1: Balanced class weights
model = LogisticRegression(class_weight='balanced', max_iter=1000)

# Strategy 2: Custom threshold
y_proba = model.predict_proba(X_test)[:, 1]
y_pred_custom = (y_proba >= 0.3).astype(int)

# Strategy 3: Better metrics (not just accuracy)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

The baseline test for imbalanced data: If your model's accuracy doesn't beat "always predict the majority class," it hasn't learned anything useful.
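The baseline test takes only a few lines to run. Here the labels are hypothetical (90 negatives, 10 positives) to show why 90% accuracy can mean nothing:

```python
import numpy as np

# Hypothetical imbalanced test labels: 90 negatives, 10 positives.
y_test = np.array([0] * 90 + [1] * 10)

majority_class = np.bincount(y_test).argmax()          # 0
baseline_preds = np.full_like(y_test, majority_class)
baseline_accuracy = (baseline_preds == y_test).mean()  # 0.90

print(f"A model must beat {baseline_accuracy:.0%} accuracy "
      "before it has learned anything.")
```

On this data, "always predict negative" scores 90% accuracy while catching zero positives, which is exactly why recall and the confusion matrix matter here.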


Interpreting Coefficients

Coefficient            Meaning
Positive               Higher feature value increases probability of positive class
Negative               Higher feature value decreases probability of positive class
Larger absolute value  Stronger association

Important: The effect on probability is not linear — it depends on where you are on the sigmoid curve. The coefficient is constant in log-odds space, not in probability space.

Odds ratio: exp(coefficient). An odds ratio of 1.5 means a one-unit increase in the feature multiplies the odds by 1.5.
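A quick numeric check of the odds-ratio formula, using a made-up coefficient value:

```python
import math

# Hypothetical fitted coefficient for one feature (in log-odds space).
coef = 0.405

odds_ratio = math.exp(coef)          # ≈ 1.5

# A one-unit increase in the feature multiplies the odds by the odds ratio.
base_odds = 2.0                      # e.g. p = 2/3 means odds of 2:1
new_odds = base_odds * odds_ratio    # ≈ 3.0, i.e. odds of about 3:1
print(f"odds ratio: {odds_ratio:.2f}, odds go from {base_odds} to {new_odds:.1f}")
```

This is why the coefficient is constant in log-odds space: each one-unit increase multiplies the odds by the same factor, even though the change in probability varies along the sigmoid.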


Regression vs. Classification: When to Use Which

Task                Model                Output               Use When
Predict a number    Linear Regression    Continuous value     You need a specific numerical prediction
Predict a category  Logistic Regression  Probability + class  You need a yes/no decision or risk ranking

You can convert regression to classification (e.g., "if predicted rate >= 80%, classify as high") but logistic regression is purpose-built for classification tasks.


Common Pitfalls

  1. Using accuracy alone for imbalanced data. Always check precision, recall, and the confusion matrix.
  2. Ignoring predict_proba. Binary predictions lose important information about confidence and ranking.
  3. Using the default 0.5 threshold without thinking. The optimal threshold depends on the costs of FP vs. FN.
  4. Confusing logistic regression with linear regression. Despite the name, logistic regression is for classification.
  5. Not scaling features. Logistic regression with regularization (default in scikit-learn) can produce different results depending on feature scales.
  6. Interpreting coefficients as linear effects on probability. The effect depends on where you are on the sigmoid curve.

What You Should Be Able to Do Now

  • [ ] Explain why linear regression is inappropriate for classification
  • [ ] Describe how the sigmoid function converts a score to a probability
  • [ ] Fit a logistic regression model with scikit-learn
  • [ ] Use predict_proba to get probability outputs
  • [ ] Construct and read a confusion matrix
  • [ ] Calculate precision and recall from a confusion matrix
  • [ ] Choose between precision and recall based on error costs
  • [ ] Adjust the classification threshold for different scenarios
  • [ ] Recognize class imbalance and choose appropriate evaluation metrics
  • [ ] Compare model performance to the majority-class baseline

The Decision Framework

When choosing how to evaluate your classifier, ask:

  1. Is the data imbalanced? If yes, don't trust accuracy alone.
  2. Which error is worse: FP or FN? This determines whether to prioritize precision or recall.
  3. Do I need a decision or a ranking? If ranking, use predict_proba.
  4. What threshold serves the application? Set it based on error costs, not convenience.
  5. Does the model beat the baseline? If it can't outperform "always predict the majority class," it isn't useful.

You're ready for Chapter 28, where you'll learn about decision trees — a completely different kind of model that makes predictions by asking a series of yes/no questions about the features. Decision trees handle nonlinear relationships naturally and are among the most interpretable models in machine learning.