Key Takeaways: Logistic Regression

One-Sentence Summary

Logistic regression models the probability of a binary outcome by applying the sigmoid function to a linear combination of predictors; its coefficients are interpretable as odds ratios, and its performance is evaluated with confusion matrices, ROC curves, and AUC — making it the foundational classification technique that underpins AI/machine learning classification systems.

Core Concepts at a Glance

| Concept | Definition | Why It Matters |
|---|---|---|
| Logistic regression | Models the log-odds of a binary outcome as a linear function of predictors: $\ln(P/(1-P)) = b_0 + b_1 x_1 + \cdots$ | Extends regression to yes/no outcomes; the simplest classifier |
| Sigmoid function | $P = 1/(1+e^{-z})$ — maps any real number to a probability between 0 and 1 | Solves the problem of linear regression predicting impossible probabilities |
| Odds ratio | $e^{b_i}$ — the multiplicative change in odds for a one-unit increase in $x_i$ | The primary way to interpret logistic regression coefficients |
| Confusion matrix | A 2$\times$2 table cross-tabulating predicted vs. actual outcomes (TP, FP, FN, TN) | The foundation for all classification evaluation metrics |
| ROC curve / AUC | Plots sensitivity vs. 1 − specificity at all thresholds; AUC summarizes overall discrimination | Evaluates the model independent of any specific threshold |

The Logistic Regression Procedure

Step by Step

1. Confirm binary outcome. Ensure the response variable is binary (0/1, yes/no, success/failure).

2. Explore the data:
   - Compare predictor distributions between the two outcome groups (box plots, bar charts)
   - Check the outcome rate (base rate) — imbalanced classes require special attention

3. Fit the model using `sm.Logit(y, X).fit()` (statsmodels) or `LogisticRegression().fit(X, y)` (sklearn).

4. Interpret coefficients:
   - Exponentiate each coefficient: $e^{b_i}$ is the odds ratio
   - "For each one-unit increase in $x_i$, the odds of the outcome are multiplied by $e^{b_i}$, holding all other variables constant"
   - Check p-values and 95% CIs for the odds ratios

5. Evaluate the model:
   - Confusion matrix at the chosen threshold
   - Accuracy, sensitivity, specificity, precision, F1 score
   - ROC curve and AUC

6. Choose the threshold:
   - Consider the relative costs of false positives vs. false negatives
   - The threshold is a values decision, not just a statistical one

7. Check fairness:
   - Evaluate error rates separately for each relevant group
   - Equal accuracy does not guarantee equal error rates

Key Python Code

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_curve, auc)

# Assumes df is a DataFrame with predictors x1, x2, x3 and a 0/1 outcome y

# --- STATSMODELS (inference: coefficients, p-values, CIs) ---
X_sm = sm.add_constant(df[['x1', 'x2', 'x3']])
model = sm.Logit(df['y'], X_sm).fit()
print(model.summary())

# Odds ratios
print("Odds Ratios:")
print(np.exp(model.params).round(3))
print("95% CI for Odds Ratios:")
print(np.exp(model.conf_int()).round(3))

# --- SKLEARN (prediction and evaluation) ---
X_train, X_test, y_train, y_test = train_test_split(
    df[['x1', 'x2', 'x3']], df['y'], test_size=0.25, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)              # class labels at the 0.5 threshold
y_prob = clf.predict_proba(X_test)[:, 1]  # predicted probabilities

# Confusion matrix and per-class metrics
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))

# ROC and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
print(f"AUC: {auc(fpr, tpr):.3f}")
```
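
Step 6 of the procedure says the threshold should reflect the relative costs of errors, which the block above leaves at sklearn's default of 0.5. A minimal sketch of cost-based threshold selection, using made-up toy scores and hypothetical per-error costs (1 per false positive, 5 per false negative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def expected_cost(y_true, y_prob, threshold, cost_fp=1.0, cost_fn=5.0):
    """Total misclassification cost at a given probability threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return cost_fp * fp + cost_fn * fn

# Toy labels and predicted probabilities standing in for y_test / y_prob
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.6, 0.55, 0.9])

# Sweep candidate thresholds and keep the cheapest one
thresholds = np.linspace(0.05, 0.95, 19)
costs = [expected_cost(y_true, y_prob, t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"Best threshold: {best:.2f}")  # 0.25 for this toy data
```

Because a missed positive costs five times a false alarm here, the cheapest cutoff lands well below the default 0.5; with equal costs the same sweep favors a higher one.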

Excel Procedure

| Step | Action |
|---|---|
| 1. Enter data | Binary outcome (0/1) in one column; predictors in separate columns |
| 2. Note | Excel's Data Analysis ToolPak does NOT include logistic regression. Use the Solver add-in to maximize the log-likelihood, or use Python/R |
| 3. Alternative | Export data to CSV and analyze in Python (recommended) |

The Threshold Concept: Thinking in Odds

Probability, odds, and log-odds are three representations of the same quantity. Logistic regression models log-odds because they range from $-\infty$ to $+\infty$, making them suitable for a linear model. The odds ratio ($e^{b}$) is the most interpretable form of the coefficient.

| Currency | Range | Formula from P | Formula to P |
|---|---|---|---|
| Probability ($P$) | 0 to 1 |  |  |
| Odds | 0 to $\infty$ | $\text{Odds} = \frac{P}{1-P}$ | $P = \frac{\text{Odds}}{1+\text{Odds}}$ |
| Log-odds (logit) | $-\infty$ to $+\infty$ | $\text{Logit} = \ln\left(\frac{P}{1-P}\right)$ | $P = \frac{1}{1+e^{-\text{Logit}}}$ |

| Key Insight | Details |
|---|---|
| $P = 0.50$ corresponds to odds = 1 and logit = 0 | The midpoint, where the outcome is equally likely |
| Coefficients are additive on the log-odds scale | $b_1 = 0.5$ means log-odds increase by 0.5 per unit |
| Odds ratios are multiplicative on the odds scale | $e^{0.5} \approx 1.65$ means odds are multiplied by 1.65 per unit |
| Probability changes are not constant | The same log-odds change produces different probability changes depending on the starting probability |
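
These conversions are easy to verify numerically. A quick sketch (plain Python, nothing beyond the formulas in the table) showing the round trip and the last key insight, that the same log-odds change moves probability by different amounts:

```python
import math

def prob_to_logit(p):
    """ln(P / (1 - P)): probability -> log-odds."""
    return math.log(p / (1 - p))

def logit_to_prob(z):
    """1 / (1 + e^-z): log-odds -> probability (the sigmoid)."""
    return 1 / (1 + math.exp(-z))

# P = 0.50 corresponds to odds = 1 and logit = 0
print(prob_to_logit(0.5))   # 0.0
print(0.5 / (1 - 0.5))      # odds = 1.0

# The same +1.0 change in log-odds, different probability changes:
for p in (0.1, 0.5, 0.9):
    p_new = logit_to_prob(prob_to_logit(p) + 1.0)
    print(f"P = {p:.1f} -> {p_new:.3f} (change {p_new - p:+.3f})")
```

Starting from 0.1 the probability rises by about 0.13; from 0.5 by about 0.23; from 0.9 by only about 0.06.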

Key Formulas

| Formula | Description |
|---|---|
| $\ln\left(\frac{P}{1-P}\right) = b_0 + b_1 x_1 + \cdots + b_k x_k$ | Logistic regression equation |
| $P = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \cdots)}}$ | Sigmoid function (probability from log-odds) |
| $e^{b_i}$ | Odds ratio for predictor $x_i$ |
| $\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$ | Overall correct rate |
| $\text{Sensitivity} = \frac{TP}{TP + FN}$ | True positive rate (recall) |
| $\text{Specificity} = \frac{TN}{TN + FP}$ | True negative rate |
| $\text{Precision} = \frac{TP}{TP + FP}$ | Positive predictive value |
| $F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ | Harmonic mean of precision and recall |

Confusion Matrix Reference

|  | Actually Positive | Actually Negative |
|---|---|---|
| Predicted Positive | True Positive (TP) | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN) |
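
Plugging a made-up set of counts into this matrix (TP = 40, FP = 10, FN = 20, TN = 30) shows how the formulas above fall out:

```python
tp, fp, fn, tn = 40, 10, 20, 30  # hypothetical counts

accuracy    = (tp + tn) / (tp + fp + fn + tn)  # overall correct rate
sensitivity = tp / (tp + fn)                   # recall, true positive rate
specificity = tn / (tn + fp)                   # true negative rate
precision   = tp / (tp + fp)                   # positive predictive value
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"Accuracy:    {accuracy:.3f}")     # 0.700
print(f"Sensitivity: {sensitivity:.3f}")  # 0.667
print(f"Specificity: {specificity:.3f}")  # 0.750
print(f"Precision:   {precision:.3f}")    # 0.800
print(f"F1:          {f1:.3f}")           # 0.727
```

Note that accuracy (0.70) hides a sensitivity of only 0.67: one in three actual positives is missed.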

Metric Quick Reference

| Metric | Formula | Question It Answers | Best For |
|---|---|---|---|
| Accuracy | (TP+TN)/Total | How often is the model right overall? | Balanced classes |
| Sensitivity | TP/(TP+FN) | Of actual positives, how many are caught? | When missing positives is costly |
| Specificity | TN/(TN+FP) | Of actual negatives, how many are correctly identified? | When false alarms are costly |
| Precision | TP/(TP+FP) | Of predicted positives, how many are real? | When acting on predictions is expensive |
| F1 Score | Harmonic mean of precision and recall | Is precision balanced against recall? | Imbalanced classes |
| AUC | Area under ROC curve | How well does the model discriminate? | Comparing models; threshold-independent |

ROC and AUC Interpretation

| AUC Range | Model Quality |
|---|---|
| 0.50 | No better than random guessing |
| 0.60-0.70 | Poor |
| 0.70-0.80 | Acceptable |
| 0.80-0.90 | Excellent |
| 0.90-1.00 | Outstanding |

AUC interpretation: The probability that a randomly chosen positive case gets a higher predicted probability than a randomly chosen negative case.
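
That pairwise interpretation can be checked directly against the area computed from the ROC curve. A small sketch with toy scores (a tie would count as half a win):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.2, 0.4, 0.6, 0.3, 0.7, 0.9])

# Pairwise definition: P(score of random positive > score of random negative)
pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc_pairwise = wins / (len(pos) * len(neg))

print(auc_pairwise, roc_auc_score(y_true, y_prob))  # the two values agree
```

Here 7 of the 9 (positive, negative) pairs are ranked correctly, so both computations give AUC = 7/9 ≈ 0.78.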

Odds Ratio Interpretation

| $e^{b}$ | Meaning |
|---|---|
| $e^b = 1$ | No effect |
| $e^b > 1$ | Higher odds (positive association) |
| $e^b < 1$ | Lower odds (negative association) |
| $e^b = 2.0$ | Odds double per unit increase |
| $e^b = 0.5$ | Odds halve per unit increase |
| $e^b = 1.42$ | 42% increase in odds per unit |
| $e^b = 0.75$ | 25% decrease in odds per unit |
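
A common trap with this table: an odds ratio of 2.0 roughly doubles the probability only when the baseline probability is small. A quick sketch:

```python
def apply_odds_ratio(p, odds_ratio):
    """Multiply the odds of p by odds_ratio, then convert back to probability."""
    odds = p / (1 - p)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)

# OR = 2.0 applied at different baseline probabilities
for p in (0.01, 0.10, 0.50, 0.90):
    print(f"P = {p:.2f} -> {apply_odds_ratio(p, 2.0):.3f}")
```

At a baseline of 0.01 the probability nearly doubles (to about 0.020), but 0.50 rises only to 0.667, and 0.90 only to 0.947: doubling the odds never doubles a large probability.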

Common Mistakes

| Mistake | Correction |
|---|---|
| Using linear regression for binary outcomes | Use logistic regression — linear regression predicts values outside [0, 1] |
| Interpreting $b_1$ as a change in probability | $b_1$ is a change in log-odds; interpret $e^{b_1}$ as an odds ratio |
| Confusing odds ratio with probability | OR = 2.0 means odds double, NOT that probability doubles |
| Relying on accuracy alone | Use sensitivity, specificity, precision, F1, and AUC — especially with imbalanced classes |
| Using the default 0.5 threshold without thinking | Choose the threshold based on the relative costs of FP vs. FN |
| Claiming a "race-neutral" algorithm is fair | Proxy variables can carry racial bias even when race is excluded from the model |
| Comparing coefficients across different scales | A coefficient of 0.5 for "age in years" is not comparable to 0.5 for "income in thousands" |
| Ignoring differential error rates | Overall accuracy can be equal while FPR/FNR differ dramatically between groups |
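
The last mistake, differential error rates, is straightforward to audit in code. A sketch of a per-group check with fabricated toy data (the `group` labels are illustrative; in practice they would come from the dataset):

```python
import numpy as np

def group_error_rates(y_true, y_pred, group):
    """False positive and false negative rates, computed separately per group."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates = {}
    for g in np.unique(group):
        t, p = y_true[group == g], y_pred[group == g]
        fpr = float(np.mean(p[t == 0] == 1))  # of actual negatives, share flagged
        fnr = float(np.mean(p[t == 1] == 0))  # of actual positives, share missed
        rates[g] = (fpr, fnr)
    return rates

# Toy data: both groups are 75% accurate, but the errors fall on different sides
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 0, 1, 1, 0]
group  = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = group_error_rates(y_true, y_pred, group)
for g, (fpr, fnr) in rates.items():
    print(f"group {g}: FPR = {fpr:.2f}, FNR = {fnr:.2f}")
```

Here group A absorbs the false positives and group B the false negatives: equal accuracy, unequal errors.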

Connections

| Connection | Details |
|---|---|
| Ch. 9 (Bayes' Theorem) | Sensitivity = $P(\text{test}+ \mid \text{condition}+)$; Bayes' theorem combines sensitivity, specificity, and the base rate to obtain predictive values such as $P(\text{condition}+ \mid \text{test}+)$ |
| Ch. 22 (Simple Regression) | Logistic regression extends the regression framework to binary outcomes; same idea of "how does x predict y" |
| Ch. 23 (Multiple Regression) | Same multi-predictor framework, same "holding other variables constant" interpretation, same confounding issues |
| Ch. 13 (Hypothesis Testing) | Type I error = false positive; Type II error = false negative; the sensitivity-specificity tradeoff parallels the $\alpha$-$\beta$ tradeoff |
| Ch. 17 (Power and Effect Sizes) | Odds ratio is the effect size measure for logistic regression; confidence intervals for ORs provide both significance and magnitude |
| AI/ML (Theme 3) | Logistic regression is the simplest classifier; a neural network can be viewed as stacked logistic regressions; evaluation metrics apply to all classifiers |
| Ethics (Theme 6) | Classification errors affect people differently; differential error rates by group = algorithmic bias; threshold choice is a values decision |