Key Takeaways: Logistic Regression

One-Sentence Summary

Logistic regression models the probability of a binary outcome by applying the sigmoid function to a linear combination of predictors; its coefficients are interpretable as odds ratios, and its performance is evaluated with confusion matrices, ROC curves, and AUC — making it the foundational classification technique that underpins AI/machine learning classification systems.

Core Concepts at a Glance

| Concept | Definition | Why It Matters |
|---|---|---|
| Logistic regression | Models the log-odds of a binary outcome as a linear function of predictors: $\ln(P/(1-P)) = b_0 + b_1 x_1 + \cdots$ | Extends regression to yes/no outcomes; the simplest classifier |
| Sigmoid function | $P = 1/(1+e^{-z})$ — maps any real number to a probability between 0 and 1 | Solves the problem of linear regression predicting impossible probabilities |
| Odds ratio | $e^{b_i}$ — the multiplicative change in odds for a one-unit increase in $x_i$ | The primary way to interpret logistic regression coefficients |
| Confusion matrix | A 2$\times$2 table cross-tabulating predicted vs. actual outcomes (TP, FP, FN, TN) | The foundation for all classification evaluation metrics |
| ROC curve / AUC | Plots sensitivity vs. 1 − specificity at all thresholds; AUC summarizes overall discrimination | Evaluates the model independent of any specific threshold |

The Logistic Regression Procedure

Step by Step

1. Confirm binary outcome. Ensure the response variable is binary (0/1, yes/no, success/failure).

2. Explore the data:
   - Compare predictor distributions between the two outcome groups (box plots, bar charts)
   - Check the outcome rate (base rate) — imbalanced classes require special attention

3. Fit the model using `sm.Logit(y, X).fit()` (statsmodels) or `LogisticRegression().fit(X, y)` (sklearn).

4. Interpret coefficients:
   - Exponentiate each coefficient: $e^{b_i}$ is the odds ratio
   - "For each one-unit increase in $x_i$, the odds of the outcome are multiplied by $e^{b_i}$, holding all other variables constant"
   - Check p-values and 95% CIs for the odds ratios

5. Evaluate the model:
   - Confusion matrix at the chosen threshold
   - Accuracy, sensitivity, specificity, precision, F1 score
   - ROC curve and AUC

6. Choose the threshold:
   - Consider the relative costs of false positives vs. false negatives
   - The threshold is a values decision, not just a statistical one

7. Check fairness:
   - Evaluate error rates separately for each relevant group
   - Equal accuracy does not guarantee equal error rates

Key Python Code

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_curve, auc)

# Assumes df is a DataFrame with predictors x1, x2, x3 and a 0/1 outcome y

# --- STATSMODELS (inference: coefficients, p-values, CIs) ---
X_sm = sm.add_constant(df[['x1', 'x2', 'x3']])
model = sm.Logit(df['y'], X_sm).fit()
print(model.summary())

# Odds ratios
print("Odds Ratios:")
print(np.exp(model.params).round(3))
print("95% CI for Odds Ratios:")
print(np.exp(model.conf_int()).round(3))

# --- SKLEARN (prediction and evaluation) ---
X_train, X_test, y_train, y_test = train_test_split(
    df[['x1', 'x2', 'x3']], df['y'], test_size=0.25, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)              # class labels at the 0.5 threshold
y_prob = clf.predict_proba(X_test)[:, 1]  # predicted probabilities

# Confusion matrix and per-class metrics
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))

# ROC and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
print(f"AUC: {auc(fpr, tpr):.3f}")
```
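
Step 6 of the procedure says the threshold should reflect the relative costs of errors, which the block above leaves at sklearn's default of 0.5. A minimal sketch of cost-based threshold selection, using made-up toy scores and hypothetical per-error costs (1 per false positive, 5 per false negative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def expected_cost(y_true, y_prob, threshold, cost_fp=1.0, cost_fn=5.0):
    """Total misclassification cost at a given probability threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return cost_fp * fp + cost_fn * fn

# Toy labels and predicted probabilities standing in for y_test / y_prob
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.6, 0.55, 0.9])

# Sweep candidate thresholds and keep the cheapest one
thresholds = np.linspace(0.05, 0.95, 19)
costs = [expected_cost(y_true, y_prob, t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"Best threshold: {best:.2f}")  # 0.25 for this toy data
```

Because a missed positive costs five times a false alarm here, the cheapest cutoff lands well below the default 0.5; with equal costs the same sweep favors a higher one.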

Excel Procedure

| Step | Action |
|---|---|
| 1. Enter data | Binary outcome (0/1) in one column; predictors in separate columns |
| 2. Note | Excel's Data Analysis ToolPak does NOT include logistic regression. Use the Solver add-in to maximize the log-likelihood, or use Python/R |
| 3. Alternative | Export data to CSV and analyze in Python (recommended) |

The Threshold Concept: Thinking in Odds

Probability, odds, and log-odds are three representations of the same quantity. Logistic regression models log-odds because they range from $-\infty$ to $+\infty$, making them suitable for a linear model. The odds ratio ($e^{b}$) is the most interpretable form of the coefficient.

| Currency | Range | Formula from P | Formula to P |
|---|---|---|---|
| Probability ($P$) | 0 to 1 |  |  |
| Odds | 0 to $\infty$ | $\text{Odds} = \frac{P}{1-P}$ | $P = \frac{\text{Odds}}{1+\text{Odds}}$ |
| Log-odds (logit) | $-\infty$ to $+\infty$ | $\text{Logit} = \ln\left(\frac{P}{1-P}\right)$ | $P = \frac{1}{1+e^{-\text{Logit}}}$ |

| Key Insight | Details |
|---|---|
| $P = 0.50$ corresponds to odds = 1 and logit = 0 | The midpoint, where the outcome is equally likely |
| Coefficients are additive on the log-odds scale | $b_1 = 0.5$ means log-odds increase by 0.5 per unit |
| Odds ratios are multiplicative on the odds scale | $e^{0.5} \approx 1.65$ means odds are multiplied by 1.65 per unit |
| Probability changes are not constant | The same log-odds change produces different probability changes depending on the starting probability |
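
These conversions are easy to verify numerically. A quick sketch (plain Python, nothing beyond the formulas in the table) showing the round trip and the last key insight, that the same log-odds change moves probability by different amounts:

```python
import math

def prob_to_logit(p):
    """ln(P / (1 - P)): probability -> log-odds."""
    return math.log(p / (1 - p))

def logit_to_prob(z):
    """1 / (1 + e^-z): log-odds -> probability (the sigmoid)."""
    return 1 / (1 + math.exp(-z))

# P = 0.50 corresponds to odds = 1 and logit = 0
print(prob_to_logit(0.5))   # 0.0
print(0.5 / (1 - 0.5))      # odds = 1.0

# The same +1.0 change in log-odds, different probability changes:
for p in (0.1, 0.5, 0.9):
    p_new = logit_to_prob(prob_to_logit(p) + 1.0)
    print(f"P = {p:.1f} -> {p_new:.3f} (change {p_new - p:+.3f})")
```

Starting from 0.1 the probability rises by about 0.13; from 0.5 by about 0.23; from 0.9 by only about 0.06.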

Key Formulas

| Formula | Description |
|---|---|
| $\ln\left(\frac{P}{1-P}\right) = b_0 + b_1 x_1 + \cdots + b_k x_k$ | Logistic regression equation |
| $P = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \cdots)}}$ | Sigmoid function (probability from log-odds) |
| $e^{b_i}$ | Odds ratio for predictor $x_i$ |
| $\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$ | Overall correct rate |
| $\text{Sensitivity} = \frac{TP}{TP + FN}$ | True positive rate (recall) |
| $\text{Specificity} = \frac{TN}{TN + FP}$ | True negative rate |
| $\text{Precision} = \frac{TP}{TP + FP}$ | Positive predictive value |
| $F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ | Harmonic mean of precision and recall |

Confusion Matrix Reference

|  | Actually Positive | Actually Negative |
|---|---|---|
| Predicted Positive | True Positive (TP) | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN) |
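
Plugging a made-up set of counts into this matrix (TP = 40, FP = 10, FN = 20, TN = 30) shows how the formulas above fall out:

```python
tp, fp, fn, tn = 40, 10, 20, 30  # hypothetical counts

accuracy    = (tp + tn) / (tp + fp + fn + tn)  # overall correct rate
sensitivity = tp / (tp + fn)                   # recall, true positive rate
specificity = tn / (tn + fp)                   # true negative rate
precision   = tp / (tp + fp)                   # positive predictive value
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"Accuracy:    {accuracy:.3f}")     # 0.700
print(f"Sensitivity: {sensitivity:.3f}")  # 0.667
print(f"Specificity: {specificity:.3f}")  # 0.750
print(f"Precision:   {precision:.3f}")    # 0.800
print(f"F1:          {f1:.3f}")           # 0.727
```

Note that accuracy (0.70) hides a sensitivity of only 0.67: one in three actual positives is missed.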

Metric Quick Reference

| Metric | Formula | Question It Answers | Best For |
|---|---|---|---|
| Accuracy | (TP+TN)/Total | How often is the model right overall? | Balanced classes |
| Sensitivity | TP/(TP+FN) | Of actual positives, how many are caught? | When missing positives is costly |
| Specificity | TN/(TN+FP) | Of actual negatives, how many are correctly identified? | When false alarms are costly |
| Precision | TP/(TP+FP) | Of predicted positives, how many are real? | When acting on predictions is expensive |
| F1 Score | Harmonic mean of precision and recall | Is precision balanced against recall? | Imbalanced classes |
| AUC | Area under ROC curve | How well does the model discriminate? | Comparing models; threshold-independent |

ROC and AUC Interpretation

| AUC Range | Model Quality |
|---|---|
| 0.50 | No better than random guessing |
| 0.60-0.70 | Poor |
| 0.70-0.80 | Acceptable |
| 0.80-0.90 | Excellent |
| 0.90-1.00 | Outstanding |

AUC interpretation: The probability that a randomly chosen positive case gets a higher predicted probability than a randomly chosen negative case.
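
That pairwise interpretation can be checked directly against the area computed from the ROC curve. A small sketch with toy scores (a tie would count as half a win):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.2, 0.4, 0.6, 0.3, 0.7, 0.9])

# Pairwise definition: P(score of random positive > score of random negative)
pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc_pairwise = wins / (len(pos) * len(neg))

print(auc_pairwise, roc_auc_score(y_true, y_prob))  # the two values agree
```

Here 7 of the 9 (positive, negative) pairs are ranked correctly, so both computations give AUC = 7/9 ≈ 0.78.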

Odds Ratio Interpretation

| $e^{b}$ | Meaning |
|---|---|
| $e^b = 1$ | No effect |
| $e^b > 1$ | Higher odds (positive association) |
| $e^b < 1$ | Lower odds (negative association) |
| $e^b = 2.0$ | Odds double per unit increase |
| $e^b = 0.5$ | Odds halve per unit increase |
| $e^b = 1.42$ | 42% increase in odds per unit |
| $e^b = 0.75$ | 25% decrease in odds per unit |
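
A common trap with this table: an odds ratio of 2.0 roughly doubles the probability only when the baseline probability is small. A quick sketch:

```python
def apply_odds_ratio(p, odds_ratio):
    """Multiply the odds of p by odds_ratio, then convert back to probability."""
    odds = p / (1 - p)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)

# OR = 2.0 applied at different baseline probabilities
for p in (0.01, 0.10, 0.50, 0.90):
    print(f"P = {p:.2f} -> {apply_odds_ratio(p, 2.0):.3f}")
```

At a baseline of 0.01 the probability nearly doubles (to about 0.020), but 0.50 rises only to 0.667, and 0.90 only to 0.947: doubling the odds never doubles a large probability.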

Common Mistakes

| Mistake | Correction |
|---|---|
| Using linear regression for binary outcomes | Use logistic regression — linear regression predicts values outside [0, 1] |
| Interpreting $b_1$ as a change in probability | $b_1$ is a change in log-odds; interpret $e^{b_1}$ as an odds ratio |
| Confusing odds ratio with probability | OR = 2.0 means odds double, NOT that probability doubles |
| Relying on accuracy alone | Use sensitivity, specificity, precision, F1, and AUC — especially with imbalanced classes |
| Using the default 0.5 threshold without thinking | Choose the threshold based on the relative costs of FP vs. FN |
| Claiming a "race-neutral" algorithm is fair | Proxy variables can carry racial bias even when race is excluded from the model |
| Comparing coefficients across different scales | A coefficient of 0.5 for "age in years" is not comparable to 0.5 for "income in thousands" |
| Ignoring differential error rates | Overall accuracy can be equal while FPR/FNR differ dramatically between groups |
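
The last mistake, differential error rates, is straightforward to audit in code. A sketch of a per-group check with fabricated toy data (the `group` labels are illustrative; in practice they would come from the dataset):

```python
import numpy as np

def group_error_rates(y_true, y_pred, group):
    """False positive and false negative rates, computed separately per group."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates = {}
    for g in np.unique(group):
        t, p = y_true[group == g], y_pred[group == g]
        fpr = float(np.mean(p[t == 0] == 1))  # of actual negatives, share flagged
        fnr = float(np.mean(p[t == 1] == 0))  # of actual positives, share missed
        rates[g] = (fpr, fnr)
    return rates

# Toy data: both groups are 75% accurate, but the errors fall on different sides
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 0, 1, 1, 0]
group  = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = group_error_rates(y_true, y_pred, group)
for g, (fpr, fnr) in rates.items():
    print(f"group {g}: FPR = {fpr:.2f}, FNR = {fnr:.2f}")
```

Here group A absorbs the false positives and group B the false negatives: equal accuracy, unequal errors.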

Connections

| Connection | Details |
|---|---|
| Ch. 9 (Bayes' Theorem) | Sensitivity = $P(\text{test}+ \mid \text{condition}+)$; Bayes' theorem combines sensitivity, specificity, and the base rate to obtain predictive values such as $P(\text{condition}+ \mid \text{test}+)$ |
| Ch. 22 (Simple Regression) | Logistic regression extends the regression framework to binary outcomes; same idea of "how does x predict y" |
| Ch. 23 (Multiple Regression) | Same multi-predictor framework, same "holding other variables constant" interpretation, same confounding issues |
| Ch. 13 (Hypothesis Testing) | Type I error = false positive; Type II error = false negative; the sensitivity-specificity tradeoff parallels the $\alpha$-$\beta$ tradeoff |
| Ch. 17 (Power and Effect Sizes) | Odds ratio is the effect size measure for logistic regression; confidence intervals for ORs provide both significance and magnitude |
| AI/ML (Theme 3) | Logistic regression is the simplest classifier; a neural network can be viewed as stacked logistic regressions; evaluation metrics apply to all classifiers |
| Ethics (Theme 6) | Classification errors affect people differently; differential error rates by group = algorithmic bias; threshold choice is a values decision |