Learning Objectives

  • Explain why linear regression fails for binary outcomes
  • Interpret logistic regression coefficients as odds ratios
  • Fit and interpret a logistic regression model in Python
  • Evaluate model performance using a confusion matrix and accuracy metrics
  • Connect logistic regression to classification in AI/machine learning

Chapter 24: Logistic Regression: When the Outcome Is Yes or No

"All models are wrong, but some are useful." — George E. P. Box (1987)

Chapter Overview

In the last two chapters, you learned to predict how much: how many ER visits a community will have, how many minutes a user will watch, how many points a player will score. The response variable was always numerical and continuous. You drew a line (or a plane) through the data, and it worked.

But what happens when the question isn't "how much?" but "will it happen?"

  • Will this patient be readmitted to the hospital within 30 days? (Yes or No)
  • Will this subscriber cancel their account? (Yes or No)
  • Will this defendant reoffend within two years? (Yes or No)
  • Will this shot go in? (Yes or No)

These are binary outcomes — the response variable has exactly two possible values. And here's the problem: the regression models you learned in Chapters 22 and 23 don't work for binary outcomes. If you try to force a straight line through yes/no data, you'll get predictions that are impossible — probabilities below 0 or above 1. The model will literally tell you there's a -15% chance of something happening, or a 120% chance.

We need a different tool. That tool is logistic regression, and it might be the single most important technique in this entire textbook.

I'm not exaggerating. Logistic regression is the bridge between statistics and machine learning. Every spam filter in your inbox? Logistic regression (or its descendants). Medical diagnosis algorithms? Logistic regression. Loan approval systems? Logistic regression. When an AI "classifies" something — is this email spam, is this tumor malignant, should this loan be approved — the simplest version of that process is logistic regression. This is Theme 3 in its purest form: this is literally how AI classifies things.

And with that power comes an important caution. When a classification algorithm makes different errors for different groups of people — approving loans at different rates by race, or flagging defendants at different rates by ethnicity — that's Theme 6, bias in classification. The math behind those algorithms is the math in this chapter.

The key conceptual challenge — the threshold concept for this chapter — is thinking in odds. You're used to probabilities: "there's a 75% chance of rain." But logistic regression works in a different currency: odds and log-odds. Shifting from "75% probability" to "3-to-1 odds" to "log-odds of 1.10" is a genuine conceptual leap. Once you make it, logistic regression makes perfect sense. Until you make it, it feels like alphabet soup.

We're going to make that leap together.

In this chapter, you will learn to:

  • Explain why linear regression fails for binary outcomes
  • Interpret logistic regression coefficients as odds ratios
  • Fit and interpret a logistic regression model in Python
  • Evaluate model performance using a confusion matrix and accuracy metrics
  • Connect logistic regression to classification in AI/machine learning

Fast Track: If you're comfortable with the regression framework from Chapters 22-23, skim Sections 24.1-24.2, then focus on Sections 24.4 (odds and log-odds), 24.6 (interpreting coefficients), and 24.8 (confusion matrix). Complete quiz questions 1, 10, and 15 to verify.

Deep Dive: After this chapter, read Case Study 1 (Maya's hospital readmission model — a complete logistic regression workflow with Python) for a medical application, then Case Study 2 (James's predictive policing algorithm — the logistic regression behind the risk scores) for a critical evaluation of classification bias. Both include full Python code.


24.1 The Problem with Linear Regression for Binary Outcomes (Productive Struggle)

Before we introduce logistic regression, let's see why we need it. Try this puzzle.

The Prediction Problem

Sam is analyzing Daria Williams's basketball shots. For each shot, he records:

  • Distance from the basket (in feet)
  • Whether the shot went in (1 = made, 0 = missed)

Here's a small sample:

Distance (ft) Made?
3 1
5 1
8 1
10 0
12 1
15 0
18 0
20 0
22 1
25 0

Sam tries to fit a simple linear regression: $\widehat{\text{Made}} = b_0 + b_1 \times \text{Distance}$.

Question 1: What would the predicted value be for a shot from 2 feet? What about from 40 feet?

Question 2: What's wrong with these predictions?

Question 3: The response variable is 0 or 1. What do you want the model to predict — a 0 or 1, or something else?

Take a minute to think about this before reading on.


The answer to Question 3 is the key: you don't want the model to predict exactly 0 or 1. You want it to predict the probability that the shot goes in. A shot from 3 feet might have a 90% probability of going in. A shot from 25 feet might have a 15% probability. You want a number between 0 and 1.

And that's exactly the problem with linear regression. A straight line doesn't stop at 0 or 1. Extend it far enough in either direction, and it will predict probabilities greater than 1 (impossible) or less than 0 (even more impossible). For Sam's data, a linear regression might predict a 110% chance of making a layup and a -20% chance of making a half-court shot. Neither makes any sense.

Here's what this looks like visually. Imagine plotting distance on the x-axis and the 0/1 outcomes on the y-axis. The data points form two horizontal bands — a row of 1s across the top and a row of 0s across the bottom. Now draw a straight line through them. The line starts high on the left (near the basket) and slopes downward to the right (far from the basket). But it doesn't stop at 1 on the left — it keeps going up. And it doesn't stop at 0 on the right — it keeps going down.

import numpy as np
import matplotlib.pyplot as plt

# Sam's shot data
distance = np.array([3, 5, 8, 10, 12, 15, 18, 20, 22, 25])
made = np.array([1, 1, 1, 0, 1, 0, 0, 0, 1, 0])

# Fit a linear regression (the WRONG approach)
from numpy.polynomial.polynomial import polyfit
b0, b1 = polyfit(distance, made, 1)

# Generate predictions
x_range = np.linspace(-5, 40, 100)
y_pred_linear = b0 + b1 * x_range

plt.figure(figsize=(10, 6))
plt.scatter(distance, made, color='navy', s=100, zorder=5,
            label='Actual shots (0=miss, 1=made)')
plt.plot(x_range, y_pred_linear, color='red', linewidth=2,
         linestyle='--', label='Linear regression (WRONG)')
plt.axhline(y=0, color='gray', linestyle=':', alpha=0.5)
plt.axhline(y=1, color='gray', linestyle=':', alpha=0.5)
plt.fill_between(x_range, 1, 1.5, alpha=0.1, color='red',
                 label='Impossible zone (prob > 1)')
plt.fill_between(x_range, -0.5, 0, alpha=0.1, color='red',
                 label='Impossible zone (prob < 0)')
plt.xlabel('Distance from basket (feet)', fontsize=12)
plt.ylabel('Probability of making the shot', fontsize=12)
plt.title('Why Linear Regression Fails for Binary Outcomes',
          fontsize=14)
plt.legend(loc='upper right')
plt.ylim(-0.3, 1.3)
plt.tight_layout()
plt.show()

The shaded red zones are impossible territory. A probability cannot be negative. A probability cannot exceed 1. Yet the linear model happily ventures into both.

What You Need to Know

Linear regression fails for binary outcomes because it can predict values outside the range [0, 1]. We need a model that:

  1. Always produces predictions between 0 and 1
  2. Can handle the S-shaped relationship between a predictor and a probability
  3. Works within the familiar regression framework (so we can use multiple predictors, interpret coefficients, and test hypotheses)

Logistic regression delivers all three.


24.2 The Sigmoid Function: A Curve That Stays in Bounds

Here's the solution, and it's elegantly simple: instead of fitting a straight line, we fit a curve — specifically, the sigmoid function (also called the logistic function):

$$P(Y = 1) = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$$

Don't let this formula intimidate you. Let's break it down piece by piece.

The expression $b_0 + b_1 x$ should look familiar — it's the exact same linear combination from simple regression. The magic is what happens around it. The $e^{-(\cdot)}$ and the fraction $\frac{1}{1+(\cdot)}$ work together to squish any value — no matter how large or how small — into the range between 0 and 1.

Here's the key insight: the sigmoid function is an S-shaped curve that:

  • Approaches 0 (but never reaches it) on the far left
  • Approaches 1 (but never reaches it) on the far right
  • Passes through 0.5 in the middle
  • Is steepest near the middle and flattens out at the extremes

Think of it like a dimmer switch. At some point in the middle, the light goes from mostly off to mostly on. But it transitions smoothly — there's no sudden jump. That's what the sigmoid does for probabilities.

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """The sigmoid (logistic) function."""
    return 1 / (1 + np.exp(-z))

z = np.linspace(-8, 8, 200)

plt.figure(figsize=(10, 6))
plt.plot(z, sigmoid(z), color='navy', linewidth=3)
plt.axhline(y=0.5, color='gray', linestyle=':', alpha=0.7,
            label='P = 0.5')
plt.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
plt.axhline(y=1, color='gray', linestyle='-', alpha=0.3)
plt.xlabel('z = b₀ + b₁x  (linear predictor)', fontsize=12)
plt.ylabel('P(Y = 1)  (predicted probability)', fontsize=12)
plt.title('The Sigmoid Function: Always Between 0 and 1',
          fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Why the sigmoid works:

  • When $b_0 + b_1 x$ is a large positive number (say, +10), $e^{-10}$ is tiny, so $P \approx \frac{1}{1+0} \approx 1$
  • When $b_0 + b_1 x$ is a large negative number (say, -10), $e^{10}$ is huge, so $P \approx \frac{1}{1+22026} \approx 0$
  • When $b_0 + b_1 x = 0$, $e^0 = 1$, so $P = \frac{1}{1+1} = 0.5$

The sigmoid is a mathematical guarantee: no matter what values we plug in, the output is always a valid probability. Problem solved.
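You can verify that guarantee numerically. This is a minimal sketch; the `sigmoid` helper simply mirrors the function defined in the listing above:

```python
import numpy as np

def sigmoid(z):
    """The sigmoid (logistic) function."""
    return 1 / (1 + np.exp(-z))

# Even extreme linear-predictor values map to valid probabilities
for z in [-30, -10, 0, 10, 30]:
    print(f"z = {z:>4}: P = {sigmoid(z):.6f}")
```

No matter how far the linear predictor wanders, the output never leaves the interval between 0 and 1.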

Historical Note

The logistic function was introduced by the Belgian mathematician Pierre-François Verhulst in 1838 to model population growth. He called it the "courbe logistique." It was later adopted for regression by Joseph Berkson (1944) and David Cox (1958). The same Berkson whose collider bias we mentioned in Chapter 23 — he was busy.


24.3 From Probability to Odds: A Different Way to Think About Chance

Before we can write the logistic regression equation, we need to introduce a new way of expressing probability: odds.

You already understand probability. If there's a 75% chance of rain, P(rain) = 0.75. Simple.

Odds express the same information differently. Instead of "what's the chance it happens?" odds ask "how many times more likely is it to happen than not happen?"

$$\text{Odds} = \frac{P(\text{event})}{1 - P(\text{event})} = \frac{P(\text{event})}{P(\text{not event})}$$

If the probability of rain is 0.75, the odds of rain are:

$$\text{Odds} = \frac{0.75}{0.25} = 3$$

We'd say the odds are "3 to 1" — rain is three times more likely than no rain.

The Betting Analogy

If you've ever watched sports or heard someone talk about betting, you've encountered odds. "The Lakers are 3-to-1 favorites" means the bookmaker thinks the Lakers are three times more likely to win than to lose.

Here's a quick translation table:

Probability Odds In Words
0.50 1.0 Even odds (50-50)
0.75 3.0 3 to 1 in favor
0.90 9.0 9 to 1 in favor
0.25 0.33 3 to 1 against (or 1 to 3)
0.10 0.11 9 to 1 against (or 1 to 9)
0.99 99.0 99 to 1 in favor
0.01 0.01 99 to 1 against

Notice something important: probability ranges from 0 to 1, but odds range from 0 to infinity. That's an improvement — we've eliminated the upper bound problem. But we still have the lower bound stuck at 0.

Log-Odds: The Final Transformation

Now take one more step. Instead of odds, take the natural logarithm of the odds:

$$\text{Log-odds} = \ln\left(\frac{P}{1-P}\right)$$

This quantity is also called the logit (pronounced "LOW-jit").

Probability Odds Log-odds (logit)
0.01 0.01 -4.60
0.10 0.11 -2.20
0.25 0.33 -1.10
0.50 1.00 0.00
0.75 3.00 1.10
0.90 9.00 2.20
0.99 99.0 4.60

Now look at the range: log-odds go from negative infinity to positive infinity. We've transformed a quantity that was stuck between 0 and 1 into a quantity that can take any value on the number line. And that is a quantity we can model with a straight line.

Threshold Concept: Thinking in Odds

This is the conceptual leap of this chapter. You've been thinking in probabilities your whole life: "there's a 30% chance of X." Now you need to think in three currencies simultaneously:

Currency Range Value for an impossible event Value when it's "even"
Probability 0 to 1 P = 0 P = 0.50
Odds 0 to $\infty$ Odds = 0 Odds = 1
Log-odds $-\infty$ to $+\infty$ Logit = $-\infty$ Logit = 0

Probabilities are easiest to understand. Odds are useful for comparison (the odds ratio). Log-odds are what the math needs — they let us use the familiar linear regression framework.

The conversion chain works both ways:

Probability $\rightarrow$ Odds: $\text{Odds} = \frac{P}{1-P}$

Odds $\rightarrow$ Log-odds: $\text{Logit} = \ln(\text{Odds})$

Log-odds $\rightarrow$ Odds: $\text{Odds} = e^{\text{Logit}}$

Odds $\rightarrow$ Probability: $P = \frac{\text{Odds}}{1 + \text{Odds}}$
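The conversion chain translates directly into a few Python helpers. This is a sketch — the function names (`prob_to_odds` and friends) are our own, not from any library:

```python
import math

def prob_to_odds(p):
    """Probability -> odds: P / (1 - P)."""
    return p / (1 - p)

def odds_to_logit(odds):
    """Odds -> log-odds: ln(odds)."""
    return math.log(odds)

def logit_to_odds(logit):
    """Log-odds -> odds: e^logit."""
    return math.exp(logit)

def odds_to_prob(odds):
    """Odds -> probability: odds / (1 + odds)."""
    return odds / (1 + odds)

# Round trip: 75% probability -> 3-to-1 odds -> logit of about 1.10 -> back
p = 0.75
odds = prob_to_odds(p)       # 3.0
logit = odds_to_logit(odds)  # ln(3) ≈ 1.10
print(odds, round(logit, 2), odds_to_prob(logit_to_odds(logit)))
```

The round trip recovers the original 0.75, which is exactly why the three currencies are interchangeable.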

If this feels confusing right now, that's completely normal. The payoff comes in the next section when you see how all three currencies work together in the logistic regression equation.


24.4 The Logistic Regression Equation

Now we can write the logistic regression model. And it's surprisingly familiar.

$$\ln\left(\frac{P}{1-P}\right) = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$

Read this aloud: "The log-odds of the event equals a linear combination of the predictors."

That's it. The left side is the logit — the log of the odds. The right side is the exact same linear combination you used in Chapters 22 and 23. Same framework. Different response variable.

This is why the log-odds transformation was necessary. By converting the probability (stuck between 0 and 1) into log-odds (free to be anything), we made it possible to use the linear regression machinery. The logistic regression model is a linear model — just not for the probability directly. It's linear for the log-odds.

Recovering the Probability

To get back to a probability prediction, we reverse the transformation:

$$P(Y = 1) = \frac{e^{b_0 + b_1 x_1 + \cdots + b_k x_k}}{1 + e^{b_0 + b_1 x_1 + \cdots + b_k x_k}} = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \cdots + b_k x_k)}}$$

That second expression is the sigmoid function from Section 24.2. The circle is complete: the logistic regression model feeds the linear combination through the sigmoid to produce a probability.

A Worked Example

Let's say Sam fits a logistic regression to Daria's shot data and gets:

$$\ln\left(\frac{P(\text{made})}{1-P(\text{made})}\right) = 1.8 - 0.12 \times \text{Distance}$$

For a shot from 10 feet:

  1. Log-odds: $1.8 - 0.12(10) = 1.8 - 1.2 = 0.6$
  2. Odds: $e^{0.6} = 1.82$ (the odds of making it are 1.82 to 1)
  3. Probability: $P = \frac{1.82}{1 + 1.82} = \frac{1.82}{2.82} = 0.646$, or about 65%

For a shot from 25 feet:

  1. Log-odds: $1.8 - 0.12(25) = 1.8 - 3.0 = -1.2$
  2. Odds: $e^{-1.2} = 0.301$ (the odds of making it are about 1 to 3.3)
  3. Probability: $P = \frac{0.301}{1 + 0.301} = \frac{0.301}{1.301} = 0.231$, or about 23%

Notice that both predictions are between 0 and 1. The sigmoid function guarantees this no matter what distance we plug in. A shot from 100 feet? The log-odds would be $1.8 - 12 = -10.2$, giving a probability of $\frac{1}{1+e^{10.2}} = 0.00004$, or about 0.004%. Extremely unlikely, but not negative.
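The worked example is easy to verify in Python. The coefficients ($b_0 = 1.8$, $b_1 = -0.12$) come from the fitted model in the text; the three-step chain below mirrors the hand calculation:

```python
import numpy as np

b0, b1 = 1.8, -0.12  # Sam's fitted coefficients (from the text)

def predict_prob(distance):
    """Log-odds -> odds -> probability, step by step."""
    log_odds = b0 + b1 * distance
    odds = np.exp(log_odds)
    return odds / (1 + odds)

for d in [10, 25, 100]:
    print(f"{d:>3} ft: P(made) = {predict_prob(d):.4f}")
```

The 10-foot and 25-foot predictions match the hand-computed 65% and 23%, and the 100-foot shot comes out vanishingly small but still a valid probability.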


24.5 Spaced Review 1: Sensitivity and Specificity Revisited (from Ch. 9)

Retrieval Practice — try to answer before reading on.

In Chapter 9, you learned about sensitivity and specificity in the context of medical testing. Can you recall:

  1. What does sensitivity measure? Write the formula using conditional probability notation.
  2. What does specificity measure?
  3. What's the difference between a false positive and a false negative?
  4. Why does the base rate (prevalence) matter for interpreting a positive test result?

Here's why this matters now: in Chapter 9, sensitivity and specificity described how well a diagnostic test performs. In this chapter, you'll see that a logistic regression model is a diagnostic test — it predicts "yes" or "no" based on input data. And the same vocabulary applies.

Sensitivity (also called the true positive rate or recall) is P(model predicts positive | actually positive). In Chapter 9, this was P(test positive | has disease). Now it might be:

  • P(model predicts readmission | patient was actually readmitted) — Maya's problem
  • P(model predicts churn | user actually churned) — Alex's problem
  • P(model predicts reoffend | defendant actually reoffended) — James's problem
  • P(model predicts shot made | shot actually went in) — Sam's problem

Specificity (also called the true negative rate) is P(model predicts negative | actually negative). The complementary quantity.

The connection to Chapter 9 goes even deeper. Remember Bayes' theorem? When you asked "given that the test is positive, what's the probability the person actually has the disease?" — that was the positive predictive value (PPV). In classification, the equivalent question is: "given that the model predicts yes, what's the probability it's actually yes?" That's called precision in machine learning. Same concept, different name.

We'll formalize all of this with the confusion matrix in Section 24.8. But the foundation was laid in Chapter 9.


24.6 Interpreting Logistic Regression Coefficients: Odds Ratios

In linear regression, interpreting coefficients was straightforward: "for each one-unit increase in $x$, the predicted $y$ increases by $b_1$." In logistic regression, the interpretation requires an extra step — and this is where the odds ratio becomes essential.

What the Coefficient Means (Three Levels)

Consider a logistic regression with one predictor:

$$\ln\left(\frac{P}{1-P}\right) = b_0 + b_1 x$$

Level 1: Log-odds interpretation (technically correct but hard to feel)

"For each one-unit increase in $x$, the log-odds of the event increase by $b_1$."

True, but what does "increase in log-odds" feel like? Not much. Nobody thinks in log-odds.

Level 2: Odds ratio interpretation (the one you'll actually use)

Exponentiate the coefficient: $e^{b_1}$ gives the odds ratio.

"For each one-unit increase in $x$, the odds of the event are multiplied by $e^{b_1}$."

This is much more interpretable. If $b_1 = 0.5$, then $e^{0.5} = 1.65$, meaning the odds increase by a factor of 1.65 (or 65%) for each one-unit increase in $x$.

Level 3: Probability interpretation (approximate, but most intuitive)

The change in probability depends on where you start, which makes it trickier. Unlike linear regression, where the effect is constant, in logistic regression the effect on probability varies depending on the current probability. The sigmoid curve is steepest near P = 0.5 and flattest near P = 0 or P = 1.

The Odds Ratio ($e^{b_1}$)

The odds ratio (OR) is the multiplicative change in odds for a one-unit increase in the predictor. Here's how to interpret it:

$e^{b_1}$ Interpretation
$e^{b_1} = 1$ No effect (odds unchanged)
$e^{b_1} > 1$ Higher odds (positive association)
$e^{b_1} < 1$ Lower odds (negative association)
$e^{b_1} = 2.0$ The odds double for each one-unit increase
$e^{b_1} = 0.5$ The odds are halved for each one-unit increase

Maya's Example

Maya is predicting whether patients will be readmitted to the hospital within 30 days. Her logistic regression includes the number of previous admissions as a predictor.

$$\ln\left(\frac{P(\text{readmitted})}{1-P(\text{readmitted})}\right) = -2.1 + 0.35 \times \text{PreviousAdmissions}$$

The coefficient for PreviousAdmissions is $b_1 = 0.35$.

Log-odds interpretation: "For each additional previous admission, the log-odds of readmission increase by 0.35."

Odds ratio: $e^{0.35} = 1.42$

Odds ratio interpretation: "For each additional previous admission, the odds of readmission are multiplied by 1.42 — a 42% increase in odds."

This is how medical researchers would report it: "Each prior admission was associated with a 42% increase in the odds of 30-day readmission (OR = 1.42)."
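Maya's odds ratio, and the predicted probabilities it implies, can be computed in a few lines. The coefficients are the ones from the text; the loop just feeds different admission counts through the sigmoid:

```python
import numpy as np

b0, b1 = -2.1, 0.35  # Maya's fitted coefficients (from the text)

odds_ratio = np.exp(b1)
print(f"Odds ratio per additional admission: {odds_ratio:.2f}")  # ~1.42

for prev in range(5):
    log_odds = b0 + b1 * prev
    p = 1 / (1 + np.exp(-log_odds))
    print(f"{prev} previous admissions: P(readmit) = {p:.3f}")
```

Notice that each additional admission multiplies the odds by the same 1.42, but the jump in probability differs from row to row — a preview of the common mistake discussed below.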

Multiple Predictors

With multiple predictors, the interpretation follows the same "holding other variables constant" logic from Chapter 23:

$$\ln\left(\frac{P}{1-P}\right) = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$

"For each one-unit increase in $x_1$, the odds of the event are multiplied by $e^{b_1}$, holding all other variables constant."

The threshold concept from Chapter 23 carries forward directly. In fact, the "holding other variables constant" interpretation is even more important here because the same confounders that distort linear regression also distort logistic regression.

Common Mistake

Don't confuse the odds ratio with a probability ratio. If $e^{b_1} = 2$, the odds double — but the probability does NOT double. Going from 10% probability to 20% probability is a doubling of probability but a change in odds from 1/9 to 1/4 (a factor of 2.25). Going from 40% to 80% is also a doubling of probability but a change in odds from 2/3 to 4 (a factor of 6). The odds ratio is constant across the range of the predictor; the probability change is not.
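The arithmetic in that warning is easy to confirm. A quick sketch, where `odds` is just the ratio $P/(1-P)$:

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

# Doubling the probability does NOT double the odds:
print(odds(0.20) / odds(0.10))  # 10% -> 20%: odds grow by a factor of 2.25
print(odds(0.80) / odds(0.40))  # 40% -> 80%: odds grow by a factor of 6
```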


24.7 Spaced Review 2: From Simple to Multiple to Logistic Regression (from Ch. 22-23)

Retrieval Practice — trace the evolution.

Feature Ch. 22: Simple Regression Ch. 23: Multiple Regression Ch. 24: Logistic Regression
Response variable Numerical (continuous) Numerical (continuous) ?
Equation $\hat{y} = b_0 + b_1 x$ $\hat{y} = b_0 + b_1 x_1 + \cdots + b_k x_k$ ?
Coefficient interpretation Change in $y$ per unit of $x$ Change in $y$ per unit of $x_i$, holding others constant ?
Model fit measure $R^2$ Adjusted $R^2$, F-test ?

Fill in the blanks before reading on.

Here's the completed table:

Feature Ch. 22: Simple Ch. 23: Multiple Ch. 24: Logistic
Response variable Numerical (continuous) Numerical (continuous) Binary (0/1)
Equation $\hat{y} = b_0 + b_1 x$ $\hat{y} = b_0 + b_1 x_1 + \cdots$ $\ln\left(\frac{P}{1-P}\right) = b_0 + b_1 x_1 + \cdots$
Coefficient interpretation Change in $y$ per unit of $x$ Change in $y$ per unit of $x_i$, holding others constant Change in log-odds per unit of $x_i$; $e^{b_i}$ = odds ratio
Model fit measure $R^2$ Adjusted $R^2$, F-test Confusion matrix, AUC

The pattern is clear: logistic regression inherits the multi-predictor framework from Chapter 23 and applies it to binary outcomes. The linear combination is the same. The interpretation requires the odds ratio step. And model evaluation shifts from "how well does the line fit?" to "how well does the model classify?"

Note one important technical difference: in linear regression, you minimize the sum of squared residuals (least squares). In logistic regression, there are no residuals in the usual sense — the outcome is 0 or 1, not continuous. Instead, logistic regression uses maximum likelihood estimation: it finds the coefficients that make the observed data most probable. The details are beyond this chapter, but the key takeaway is that the fitting algorithm is different even though the equation looks similar.


24.8 The Confusion Matrix: Evaluating Classification

When you predict a number (like ER visit rates), you evaluate your model by asking "how close were the predictions?" That's $R^2$ — the proportion of variability explained.

When you predict a category (like readmitted vs. not readmitted), you evaluate your model by asking "how often were the predictions correct?" And the tool for answering this question is the confusion matrix.

Building the Confusion Matrix

A logistic regression model produces a predicted probability for each observation: "this patient has a 72% probability of readmission." To turn this into a yes/no prediction, you need a threshold — typically 0.5. If the predicted probability exceeds the threshold, predict "yes." Otherwise, predict "no."

Once you've made predictions for every observation, you can cross-tabulate the predictions against the actual outcomes:

Actually Positive (Y = 1) Actually Negative (Y = 0)
Predicted Positive True Positive (TP) False Positive (FP)
Predicted Negative False Negative (FN) True Negative (TN)

Four cells. Four possible outcomes. Two of them are correct (TP and TN) and two are errors (FP and FN).

This should look familiar. In Chapter 9, you encountered the same 2x2 structure with medical tests: TP (correctly identified disease), FP (false alarm), FN (missed diagnosis), and TN (correctly cleared). The confusion matrix is the same idea, but now the "test" is a logistic regression model rather than a medical screening.

The Metrics That Come From the Confusion Matrix

From these four numbers, we can compute an entire family of performance metrics:

Accuracy — the overall correct rate:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

Accuracy is intuitive but can be misleading. If 95% of patients are not readmitted, a model that simply predicts "not readmitted" for everyone would have 95% accuracy — and be completely useless. This is called the accuracy paradox, and it's why accuracy alone is never sufficient.

Sensitivity (Recall / True Positive Rate) — of all actual positives, how many did we catch?

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

This is the same sensitivity from Chapter 9. It answers: "of all the patients who were readmitted, what proportion did the model correctly predict would be readmitted?"

Specificity (True Negative Rate) — of all actual negatives, how many did we correctly identify?

$$\text{Specificity} = \frac{TN}{TN + FP}$$

Same specificity from Chapter 9. It answers: "of all the patients who were not readmitted, what proportion did the model correctly predict would not be readmitted?"

Precision (Positive Predictive Value) — of all predicted positives, how many were actually positive?

$$\text{Precision} = \frac{TP}{TP + FP}$$

This is the PPV from Chapter 9. It answers: "when the model says 'this patient will be readmitted,' how often is it right?"

F1 Score — the harmonic mean of precision and recall:

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

The F1 score balances precision and recall into a single number. It's the go-to metric when you care about both false positives and false negatives.

A Worked Example: Maya's Readmission Model

Maya's logistic regression model produces predictions for 200 patients. Using a 0.5 threshold, here are the results:

Actually Readmitted Not Readmitted
Predicted Readmitted 35 (TP) 15 (FP)
Predicted Not Readmitted 10 (FN) 140 (TN)

Let's compute every metric:

  • Accuracy = (35 + 140) / 200 = 175/200 = 0.875 (87.5%)
  • Sensitivity = 35 / (35 + 10) = 35/45 = 0.778 (77.8%)
  • Specificity = 140 / (140 + 15) = 140/155 = 0.903 (90.3%)
  • Precision = 35 / (35 + 15) = 35/50 = 0.700 (70.0%)
  • F1 Score = 2 $\times$ (0.700 $\times$ 0.778) / (0.700 + 0.778) = 2 $\times$ 0.545 / 1.478 = 0.737
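The same metrics can be computed directly from the four cell counts. This is a sketch using Maya's confusion matrix from the table above; in practice sklearn's `classification_report` would produce the same numbers from raw predictions:

```python
TP, FP, FN, TN = 35, 15, 10, 140  # Maya's confusion matrix

accuracy    = (TP + TN) / (TP + FP + FN + TN)
sensitivity = TP / (TP + FN)   # recall / true positive rate
specificity = TN / (TN + FP)   # true negative rate
precision   = TP / (TP + FP)   # positive predictive value
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"Accuracy:    {accuracy:.3f}")
print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"Precision:   {precision:.3f}")
print(f"F1 score:    {f1:.3f}")
```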

What do these numbers tell us? The model catches about 78% of readmissions (sensitivity), correctly identifies about 90% of non-readmissions (specificity), and when it flags someone for readmission, it's right about 70% of the time (precision).

Is that good enough? It depends on the context. If missing a readmission means a patient doesn't get follow-up care and ends up back in the ER, you might want higher sensitivity — even at the cost of more false positives. If flagging someone for readmission means expensive intervention resources, you might want higher precision.

The Sensitivity-Specificity Tradeoff

You can't maximize both sensitivity and specificity simultaneously. Lowering the threshold (from 0.5 to, say, 0.3) will catch more true positives — but it will also create more false positives. You're shifting the balance between "catching everything" (high sensitivity) and "avoiding false alarms" (high specificity).

This tradeoff is not a mathematical nuisance. It's an ethical decision. In cancer screening, you want very high sensitivity (don't miss any cancers), even at the cost of some false positives (unnecessary biopsies). In spam filtering, you want very high specificity (don't send real emails to spam), even at the cost of some false negatives (a few spam emails in your inbox). The threshold you choose reflects your values, not just your data.
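You can watch the tradeoff happen by sweeping the threshold over the same set of predictions. This sketch uses made-up probabilities and outcomes, chosen only to illustrate the pattern:

```python
import numpy as np

# Hypothetical predicted probabilities and true outcomes
y_prob = np.array([0.9, 0.8, 0.7, 0.6, 0.45, 0.4, 0.35, 0.2, 0.1, 0.05])
y_true = np.array([1,   1,   1,   0,   1,    0,   0,    1,   0,   0])

for threshold in [0.3, 0.5, 0.7]:
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    print(f"threshold {threshold}: sensitivity = {tp/(tp+fn):.2f}, "
          f"specificity = {tn/(tn+fp):.2f}")
```

Lowering the threshold raises sensitivity and lowers specificity; raising it does the reverse. The data never changes — only where you draw the line.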


24.9 ROC Curve and AUC: One Number to Rule Them All

The sensitivity-specificity tradeoff creates a question: how do we evaluate the model itself, independent of the threshold we choose?

Enter the ROC curve (Receiver Operating Characteristic curve). It was developed during World War II by radar operators who needed to distinguish real enemy aircraft (true positives) from random noise (false positives). The name stuck.

How the ROC Curve Works

The ROC curve plots sensitivity (y-axis) against 1 - specificity (x-axis) — that is, the true positive rate versus the false positive rate — for every possible threshold from 0 to 1.

Here's the intuition:

  • At threshold = 0: You predict "positive" for everyone. Sensitivity = 1 (you catch every positive), but specificity = 0 (you also call every negative a positive). This is the top-right corner of the plot.
  • At threshold = 1: You predict "negative" for everyone. Specificity = 1 (you correctly identify every negative), but sensitivity = 0 (you miss every positive). This is the bottom-left corner.
  • In between: The curve traces all the intermediate tradeoffs.

A perfect model would produce a curve that goes straight up from (0, 0) to (0, 1) and then across to (1, 1) — perfect sensitivity at every specificity level. A useless model (no better than random guessing) would produce a diagonal line from (0, 0) to (1, 1).

The better the model, the more the curve bows toward the upper-left corner.

AUC: Area Under the Curve

The AUC (Area Under the ROC Curve) collapses the entire ROC curve into a single number between 0 and 1:

AUC Interpretation
0.50 No better than random guessing (coin flip)
0.60-0.70 Poor discrimination
0.70-0.80 Acceptable discrimination
0.80-0.90 Excellent discrimination
0.90-1.00 Outstanding discrimination

The AUC has a beautiful probabilistic interpretation: it equals the probability that a randomly chosen positive case has a higher predicted probability than a randomly chosen negative case. In Maya's context: if you pick a random patient who was readmitted and a random patient who was not, the AUC is the probability that the model assigns a higher readmission probability to the readmitted patient.
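That interpretation can be checked by brute force: compare every (positive, negative) pair and count how often the positive case gets the higher score. A sketch with hypothetical scores; `roc_auc_score` is sklearn's one-call AUC function:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_prob = np.array([0.9, 0.7, 0.4, 0.5, 0.3, 0.2, 0.6, 0.8])

# Brute-force pairwise comparison (ties count half)
pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
pairwise_auc = wins / (len(pos) * len(neg))

print(pairwise_auc, roc_auc_score(y_true, y_prob))  # the two numbers match
```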

ROC and AUC in Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Example: Maya's readmission model
# True outcomes (0 = not readmitted, 1 = readmitted)
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0,
                   0, 1, 0, 0, 1, 0, 0, 0, 1, 0])

# Predicted probabilities from the logistic regression model
y_prob = np.array([0.82, 0.71, 0.35, 0.65, 0.22, 0.15, 0.88, 0.42, 0.59, 0.28,
                   0.11, 0.76, 0.19, 0.31, 0.63, 0.08, 0.27, 0.14, 0.55, 0.33])

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
roc_auc = auc(fpr, tpr)

# Plot
plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, color='navy', linewidth=2,
         label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--',
         label='Random guessing (AUC = 0.5)')
plt.xlabel('False Positive Rate (1 - Specificity)', fontsize=12)
plt.ylabel('True Positive Rate (Sensitivity)', fontsize=12)
plt.title("ROC Curve: Maya's Readmission Model", fontsize=14)
plt.legend(loc='lower right', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"AUC: {roc_auc:.3f}")

The ROC curve gives you the complete picture of model performance — not just at one threshold, but at every threshold. And the AUC gives you a single summary number that you can use to compare models.


24.10 Fitting Logistic Regression in Python

Let's put it all together and fit a logistic regression model from start to finish. We'll use both statsmodels (for statistical inference — coefficients, p-values, confidence intervals) and sklearn (for machine learning — predictions, confusion matrix, ROC curve).

Alex's Churn Prediction Model

Alex Rivera at StreamVibe wants to predict which users will cancel their subscriptions (churn). For each of 500 subscribers, Alex has:

  • monthly_hours: average hours watched per month
  • months_subscribed: how long they've been a subscriber
  • support_tickets: number of customer support tickets filed
  • churned: 1 = cancelled, 0 = still subscribed

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_curve, auc, accuracy_score)
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# ============================================================
# ALEX'S CHURN PREDICTION — FULL LOGISTIC REGRESSION WORKFLOW
# ============================================================

# Generate realistic churn data
np.random.seed(42)
n = 500

monthly_hours = np.random.normal(25, 10, n).clip(1, 60)
months_subscribed = np.random.exponential(18, n).clip(1, 72).astype(int)
support_tickets = np.random.poisson(2, n)

# True churn probability (logistic model with known parameters)
log_odds = (-1.5
            - 0.08 * monthly_hours
            + 0.03 * months_subscribed  # slight fatigue effect
            + 0.45 * support_tickets)   # complaints increase churn
prob_churn = 1 / (1 + np.exp(-log_odds))
churned = np.random.binomial(1, prob_churn)

df = pd.DataFrame({
    'monthly_hours': monthly_hours.round(1),
    'months_subscribed': months_subscribed,
    'support_tickets': support_tickets,
    'churned': churned
})

print("Dataset overview:")
print(df.describe().round(2))
print(f"\nChurn rate: {df['churned'].mean():.1%}")

Step 1: Explore the Data

# Churn rates by support ticket count
print("\nChurn rate by support tickets:")
print(df.groupby('support_tickets')['churned'].mean().round(3))

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for i, col in enumerate(['monthly_hours', 'months_subscribed',
                          'support_tickets']):
    axes[i].boxplot([df[df['churned']==0][col],
                     df[df['churned']==1][col]],
                    labels=['Stayed', 'Churned'])
    axes[i].set_title(f'{col} by Churn Status')
    axes[i].set_ylabel(col)

plt.suptitle("Alex's Churn Analysis: Predictors by Outcome",
             fontsize=14)
plt.tight_layout()
plt.show()

Step 2: Fit with statsmodels (for inference)

# Add constant for intercept
X_sm = sm.add_constant(df[['monthly_hours', 'months_subscribed',
                            'support_tickets']])

# Fit logistic regression
logit_model = sm.Logit(df['churned'], X_sm).fit()
print(logit_model.summary())

# Odds ratios with 95% confidence intervals
print("\n--- Odds Ratios ---")
odds_ratios = np.exp(logit_model.params)
ci = np.exp(logit_model.conf_int())
ci.columns = ['OR 2.5%', 'OR 97.5%']
results = pd.DataFrame({
    'Odds Ratio': odds_ratios.round(3),
    'OR 2.5%': ci['OR 2.5%'].round(3),
    'OR 97.5%': ci['OR 97.5%'].round(3),
    'p-value': logit_model.pvalues.round(4)
})
print(results)

Step 3: Interpret the coefficients

Here's how Alex would interpret the output (using hypothetical results close to the true parameters):

  • monthly_hours (OR $\approx$ 0.92): "For each additional hour of monthly viewing, the odds of churning decrease by about 8%, holding other variables constant. Users who watch more are less likely to leave."

  • months_subscribed (OR $\approx$ 1.03): "For each additional month of subscription, the odds of churning increase by about 3%, holding other variables constant. This suggests a small subscriber fatigue effect."

  • support_tickets (OR $\approx$ 1.57): "For each additional support ticket, the odds of churning increase by about 57%, holding other variables constant. Customer complaints are the strongest predictor of churn."
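Those percentages come straight from exponentiating the coefficients. A quick sketch using the simulation's true parameters (the fitted estimates will differ slightly from these):

```python
import numpy as np

# Convert log-odds coefficients to odds ratios and percent changes in odds.
# These are the simulation's true parameters, not the fitted estimates.
coefs = {'monthly_hours': -0.08, 'months_subscribed': 0.03,
         'support_tickets': 0.45}

for name, b in coefs.items():
    odds_ratio = np.exp(b)               # multiplicative change in odds
    pct_change = (odds_ratio - 1) * 100  # percent change per one-unit increase
    print(f"{name}: OR = {odds_ratio:.3f} ({pct_change:+.0f}% odds per unit)")
```

The pattern generalizes: a negative coefficient gives an odds ratio below 1 (odds shrink), a positive coefficient gives one above 1 (odds grow).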

Step 4: Fit with sklearn (for prediction and evaluation)

# Split data into training and testing sets
X = df[['monthly_hours', 'months_subscribed', 'support_tickets']]
y = df['churned']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit logistic regression
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(X_train, y_train)

# Predict on test set
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # probability of churn

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Full classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred,
      target_names=['Stayed', 'Churned']))

# Accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
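Every number in that report can be recovered by hand from the four confusion-matrix cells. Here is a sketch with a hypothetical 2×2 matrix (the counts are invented for illustration, not Alex's actual results):

```python
import numpy as np

# Hypothetical confusion matrix in sklearn's layout:
# rows = actual (0, 1), columns = predicted (0, 1)
cm = np.array([[80, 10],    # actual stayed:  80 TN, 10 FP
               [15, 45]])   # actual churned: 15 FN, 45 TP
tn, fp, fn, tp = cm.ravel()

accuracy    = (tp + tn) / cm.sum()   # overall fraction correct
sensitivity = tp / (tp + fn)         # recall for the positive class
specificity = tn / (tn + fp)         # recall for the negative class
precision   = tp / (tp + fp)         # how often "churn" calls are right

print(f"Accuracy:    {accuracy:.3f}")
print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"Precision:   {precision:.3f}")
```

Working the metrics out by hand once is the fastest way to internalize what `classification_report` is printing.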

Step 5: ROC Curve

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, color='navy', linewidth=2,
         label=f'Logistic Regression (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--',
         label='Random guessing')
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title("ROC Curve: Alex's Churn Prediction Model", fontsize=14)
plt.legend(loc='lower right', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Two Libraries, Two Purposes

| Library | Use When | Strengths |
| --- | --- | --- |
| statsmodels (sm.Logit) | You want to understand the model | Coefficients, standard errors, p-values, confidence intervals, odds ratios |
| sklearn (LogisticRegression) | You want to use the model for prediction | Train/test split, confusion matrix, ROC curve, AUC, cross-validation |

In a real project, you'd often use both: statsmodels for inference and sklearn for evaluation.


24.11 Sam's Shot Prediction Model: Distance, Defenders, and Fatigue

Sam has been waiting for this chapter. He wants to predict whether Daria Williams will make a shot based on game context.

For 150 shots, Sam recorded:

  • distance: feet from the basket
  • defender_dist: feet between Daria and the nearest defender
  • quarter: game quarter (1-4, a rough proxy for fatigue)
  • made: 1 = made, 0 = missed

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix, classification_report

# ============================================================
# SAM'S SHOT PREDICTION MODEL
# ============================================================

np.random.seed(24)
n_shots = 150

distance = np.random.uniform(3, 28, n_shots)
defender_dist = np.random.uniform(1, 12, n_shots)
quarter = np.random.choice([1, 2, 3, 4], n_shots)

# True model: closer shots, farther defenders, earlier quarters
# all increase P(made)
log_odds = (2.0
            - 0.10 * distance
            + 0.15 * defender_dist
            - 0.20 * quarter)
prob_made = 1 / (1 + np.exp(-log_odds))
made = np.random.binomial(1, prob_made)

shots = pd.DataFrame({
    'distance': distance.round(1),
    'defender_dist': defender_dist.round(1),
    'quarter': quarter,
    'made': made
})

# Fit the model
X = sm.add_constant(shots[['distance', 'defender_dist', 'quarter']])
model = sm.Logit(shots['made'], X).fit()
print(model.summary())

# Odds ratios
print("\n--- Odds Ratios ---")
or_df = pd.DataFrame({
    'Coefficient': model.params.round(4),
    'Odds Ratio': np.exp(model.params).round(3),
    'p-value': model.pvalues.round(4)
})
print(or_df)

Sam's interpretation:

"Each additional foot of distance from the basket decreases the odds of making the shot by about 10% (OR $\approx$ 0.90). Each additional foot of space from the nearest defender increases the odds by about 16% (OR $\approx$ 1.16). And each quarter of the game reduces the odds by about 18% (OR $\approx$ 0.82), suggesting that fatigue matters."

Prediction for a specific shot: Daria is 15 feet from the basket, the nearest defender is 6 feet away, and it's the 3rd quarter.

# Predict a specific shot
log_odds_specific = (model.params['const']
                     + model.params['distance'] * 15
                     + model.params['defender_dist'] * 6
                     + model.params['quarter'] * 3)

prob_specific = 1 / (1 + np.exp(-log_odds_specific))
print(f"\nPredicted probability of making the shot: {prob_specific:.1%}")

This is the kind of analysis that NBA teams run in real time. Every shot, every play, a logistic regression (or a more complex model built on the same foundation) estimates the probability of success. When a commentator says "that was a low-probability shot," they're channeling logistic regression.


24.12 James's Predictive Policing Algorithm: The Model Behind the Risk Score

Here's the part of this chapter that Professor Washington has been building toward since Chapter 1.

The predictive policing algorithm that James has been studying? It's logistic regression.

The algorithm takes defendant characteristics — age at first arrest, number of prior offenses, severity of current charge, employment status — feeds them into a logistic regression model, and outputs a probability: the probability of reoffending within two years. That probability becomes the "risk score" that judges use to make bail and sentencing decisions.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix, classification_report

# ============================================================
# JAMES'S ALGORITHM AUDIT — THE PREDICTIVE MODEL
# ============================================================

np.random.seed(606)
n = 800

# Defendant characteristics
age_first_arrest = np.random.normal(22, 5, n).clip(14, 45).astype(int)
prior_offenses = np.random.poisson(2, n)
employed = np.random.binomial(1, 0.55, n)
race = np.random.choice([0, 1], n, p=[0.45, 0.55])  # 0=White, 1=Black

# True recidivism model
log_odds = (-0.5
            - 0.04 * age_first_arrest
            + 0.40 * prior_offenses
            - 0.30 * employed
            + 0.25 * race)  # race effect in the system
prob_reoffend = 1 / (1 + np.exp(-log_odds))
reoffended = np.random.binomial(1, prob_reoffend)

defendants = pd.DataFrame({
    'age_first_arrest': age_first_arrest,
    'prior_offenses': prior_offenses,
    'employed': employed,
    'race': race,
    'reoffended': reoffended
})

# Fit the algorithm (logistic regression)
X = sm.add_constant(defendants[['age_first_arrest', 'prior_offenses',
                                 'employed', 'race']])
model = sm.Logit(defendants['reoffended'], X).fit()
print(model.summary())

# Odds ratios
print("\n--- Odds Ratios ---")
or_df = pd.DataFrame({
    'Odds Ratio': np.exp(model.params).round(3),
    'p-value': model.pvalues.round(4)
})
print(or_df)

James examines the output:

  • prior_offenses (OR $\approx$ 1.49): "Each additional prior offense increases the odds of reoffending by 49%. This is a legitimate predictor."
  • employed (OR $\approx$ 0.74): "Being employed reduces the odds of reoffending by about 26%. This reflects the literature on desistance."
  • race (OR $\approx$ 1.28): "Being Black increases the predicted odds of reoffending by 28%, holding all other variables constant."

That last line stops James cold. The algorithm includes race — and even after controlling for criminal history and employment, race still predicts the outcome. Is this because race causes reoffending (obviously not) or because race is correlated with unmeasured factors (neighborhood policing intensity, socioeconomic barriers) that the model doesn't capture?

Checking for Differential Error Rates

But the coefficients are only part of the story. James also needs to check whether the model makes different types of errors for different groups. This is where the confusion matrix, broken down by race, becomes essential.

# Predictions
defendants['pred_prob'] = model.predict(X)
defendants['pred_reoffend'] = (defendants['pred_prob'] >= 0.5).astype(int)

# Confusion matrices by race
for race_val, race_name in [(0, 'White'), (1, 'Black')]:
    subset = defendants[defendants['race'] == race_val]
    cm = confusion_matrix(subset['reoffended'], subset['pred_reoffend'])
    tn, fp, fn, tp = cm.ravel()

    fpr = fp / (fp + tn)  # False Positive Rate
    fnr = fn / (fn + tp)  # False Negative Rate

    print(f"\n--- {race_name} Defendants ---")
    print(f"  False Positive Rate: {fpr:.1%}")
    print(f"  (Predicted to reoffend but didn't)")
    print(f"  False Negative Rate: {fnr:.1%}")
    print(f"  (Predicted NOT to reoffend but did)")
    print(f"  Accuracy: {(tp+tn)/len(subset):.1%}")

Theme 6: Bias in Classification

If the false positive rate is higher for Black defendants than for White defendants, the algorithm is systematically over-predicting risk for Black individuals. A Black defendant who would not have reoffended is more likely to be classified as high-risk and denied bail or given a harsher sentence.

This isn't a hypothetical concern. ProPublica's 2016 analysis of COMPAS, a real-world, proprietary risk assessment tool built on the same statistical foundations you're learning here, found exactly this pattern: the false positive rate was nearly twice as high for Black defendants as for White defendants.

The mathematical framework you're learning in this chapter is the same framework used to build (and to audit) these algorithms. Understanding logistic regression doesn't just make you a better statistician. It makes you a more informed citizen.


24.13 Logistic Regression as the Simplest Classifier: The AI/ML Connection

Let's zoom out and connect what you've learned to the bigger picture.

Every time an AI system classifies something — spam vs. not spam, fraudulent vs. legitimate, cat vs. dog — it's doing a more complex version of what you just did. And logistic regression is where it all starts.

The Classification Pipeline

  1. Input: Features (predictors) — the $x$ variables
  2. Model: A function that maps features to a probability
  3. Threshold: A cutoff that converts the probability into a class label
  4. Output: A prediction — "spam" or "not spam"

In logistic regression, the model is the sigmoid function applied to a linear combination of the features. In a neural network, the output is typically still a sigmoid (or its multi-class cousin, softmax), but it sits on top of many hidden layers of weighted sums and activation functions like ReLU, with thousands or millions of parameters. But the structure is the same: features in, probability out.
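The whole four-step pipeline fits in a few lines. The weights and features below are arbitrary illustration values, not a fitted model:

```python
import numpy as np

def classify(features, weights, bias, threshold=0.5):
    """Generic classification pipeline: features -> probability -> label."""
    z = np.dot(weights, features) + bias  # model: linear combination
    prob = 1 / (1 + np.exp(-z))           # model: sigmoid -> probability
    label = int(prob >= threshold)        # threshold: probability -> class
    return prob, label

# Arbitrary illustrative inputs and weights
prob, label = classify(features=np.array([2.0, 0.5]),
                       weights=np.array([1.2, -0.8]),
                       bias=-1.0)
print(f"P = {prob:.3f}, predicted class = {label}")
```

Swap the one-line linear combination for a deep stack of layers and you have a neural network classifier; the input, probability, and threshold steps are unchanged.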

Where Logistic Regression Sits in the ML Landscape

| Model | Complexity | What It Does | When to Use |
| --- | --- | --- | --- |
| Logistic regression | Low | Linear boundary | When the relationship between predictors and log-odds is roughly linear |
| Decision tree | Medium | Nested if-then rules | When you want interpretability and have complex interactions |
| Random forest | High | Ensemble of decision trees | When you want accuracy and have lots of data |
| Neural network | Very high | Layers of weighted sums + activation functions | When you have massive data and complex patterns |

Logistic regression is the foundation. Every other classifier is, in some sense, an attempt to do what logistic regression does — but with more flexibility. And when you evaluate those models? You use the same tools: confusion matrix, accuracy, precision, recall, F1, ROC curve, AUC. The evaluation framework you learned in this chapter applies to all classifiers.

Applications Everywhere

| Domain | Positive Class | Negative Class | Features |
| --- | --- | --- | --- |
| Email | Spam | Not spam | Word frequencies, sender, links |
| Medicine | Disease present | Disease absent | Symptoms, test results, history |
| Finance | Loan default | Loan repaid | Income, credit score, debt ratio |
| Marketing | Customer churns | Customer stays | Usage, complaints, tenure |
| Criminal justice | Reoffends | Doesn't reoffend | History, demographics, employment |

In every case, the model learns which features predict the outcome, assigns weights (coefficients), and outputs a probability. In every case, the performance is evaluated with the same metrics. And in every case, the ethical questions are the same: Who benefits from the classification? Who is harmed by errors? Are the error rates equal across groups?

Theme 3: This Is How AI Classifies Things

When someone tells you that "an AI algorithm" denied their loan application, or that "a machine learning model" flagged their email as spam, or that "a predictive algorithm" determined their bail amount — the simplest version of what happened is logistic regression. The features went in. The sigmoid function produced a probability. A threshold turned it into a yes or no.

Understanding logistic regression means you understand the fundamental architecture of classification. Everything else is variation on this theme.


24.14 Spaced Review 3: Bayes' Theorem and Classification (from Ch. 9)

Retrieval Practice — connect the threads.

In Chapter 9, you learned Bayes' theorem: $P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$.

You used it to answer questions like: "Given a positive test result, what's the probability the patient actually has the disease?"

Now consider: "Given that a logistic regression model predicts 'high risk,' what's the probability the defendant actually reoffends?"

This is the same question. The base rate (prevalence) still matters. If only 5% of defendants reoffend, even a model with 90% sensitivity will have a low positive predictive value — just like the medical screening examples in Chapter 9.

The connection between Bayes' theorem and classification is this: the confusion matrix gives you the ingredients for Bayesian reasoning about model predictions.

From the confusion matrix:

  • Sensitivity = P(model says positive | actually positive) = P(B|A)
  • False positive rate = P(model says positive | actually negative) = P(B|not A)
  • Base rate = P(actually positive) = P(A)

Plug these into Bayes' theorem and you get the precision (PPV):

$$\text{Precision} = P(\text{actually positive} | \text{model says positive}) = \frac{\text{Sensitivity} \times \text{Base Rate}}{\text{Sensitivity} \times \text{Base Rate} + \text{FPR} \times (1 - \text{Base Rate})}$$

This is exactly the Bayes' theorem calculation from Chapter 9, applied to model predictions instead of medical tests. The base rate fallacy from Chapter 9 is still lurking: a model can have high sensitivity and still produce mostly false positives when the base rate is low.
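Here is that calculation with concrete numbers (hypothetical: sensitivity 0.90, false positive rate 0.20, base rate 5%):

```python
# Precision (PPV) from Bayes' theorem, with illustrative values
sensitivity = 0.90   # P(flagged | actually positive)
fpr = 0.20           # P(flagged | actually negative)
base_rate = 0.05     # P(actually positive) -- the prevalence

ppv = (sensitivity * base_rate) / (
    sensitivity * base_rate + fpr * (1 - base_rate))

print(f"Precision (PPV): {ppv:.1%}")
```

Despite 90% sensitivity, roughly four out of five flagged cases are false positives here, because negatives outnumber positives nineteen to one. That is the base rate fallacy from Chapter 9, now wearing machine learning clothes.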


24.15 Progressive Project: Fit a Logistic Regression

Your Data Detective Portfolio — Chapter 24

If your dataset has a binary outcome variable (or if you can create one), this is your chance to apply logistic regression.

Option A: Natural binary outcome If your dataset already contains a binary variable (e.g., smoker/non-smoker, passed/failed, employed/unemployed, survived/died), use it as the response variable.

Option B: Create a binary outcome If your dataset is entirely numerical, create a binary variable by splitting a numerical variable at a meaningful threshold (e.g., "above median income" vs. "below median income," "BMI > 30" vs. "BMI $\leq$ 30").
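A minimal sketch of Option B, using a hypothetical income column as a stand-in for whatever numeric variable your dataset contains:

```python
import pandas as pd

# Hypothetical numeric data; column names are placeholders for your dataset
df = pd.DataFrame({'income': [28, 35, 52, 61, 44, 90, 73, 39]})

# Split at the median to create a 0/1 response variable
df['high_income'] = (df['income'] > df['income'].median()).astype(int)
print(df)
```

A median split guarantees roughly balanced classes, which makes accuracy easier to interpret; a substantive threshold (like BMI > 30) is more meaningful but may leave the classes imbalanced.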

Your tasks:

  1. Fit a logistic regression with at least two predictors using statsmodels
  2. Report and interpret the odds ratios for each predictor — what do they mean in context?
  3. Create a confusion matrix for your model's predictions
  4. Calculate accuracy, sensitivity, specificity, and precision — write a sentence interpreting each
  5. Generate an ROC curve and report the AUC
  6. Discuss: Is accuracy alone sufficient for evaluating your model? Why or why not?
  7. Reflect: Could your model be biased against any group? What would you check?

Suggested datasets and binary outcomes:

  • CDC BRFSS: smoker status, diabetes diagnosis, hypertension
  • Gapminder: above/below median life expectancy
  • U.S. College Scorecard: graduation rate above/below threshold
  • World Happiness Report: above/below median happiness score
  • NOAA Climate Data: extreme weather event (yes/no)

Add this analysis as a new section in your Jupyter notebook portfolio.


24.16 Common Mistakes and Misconceptions

| Mistake | Why It's Wrong | Correction |
| --- | --- | --- |
| Using linear regression for binary outcomes | Predictions outside [0, 1]; violates regression assumptions | Use logistic regression |
| Interpreting the coefficient as a change in probability | The coefficient is a change in log-odds, not probability | Report the odds ratio ($e^{b}$); interpret as multiplicative change in odds |
| Saying "the odds ratio is the probability" | OR = 1.5 does NOT mean P = 1.5 or P = 0.15 | OR = 1.5 means the odds increase by 50% per unit |
| Using only accuracy to evaluate the model | Accuracy can be misleading with imbalanced classes | Report sensitivity, specificity, precision, F1, and AUC |
| Ignoring the threshold choice | The default 0.5 threshold is arbitrary | Choose threshold based on the costs of false positives vs. false negatives |
| Assuming logistic regression proves causation | Same caveat as linear regression — observational data has confounders | Say "associated with," not "causes"; acknowledge unmeasured confounders |
| Comparing odds ratios across predictors on different scales | OR = 2.0 for age (years) vs. OR = 1.5 for income (thousands) — not comparable | Standardize predictors first, or compare practical impact instead |
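To see concretely why an odds ratio is not a probability, here is a sketch converting odds back to probability (illustrative numbers):

```python
# An odds ratio of 1.5 multiplies the odds, not the probability.
p_before = 0.20
odds_before = p_before / (1 - p_before)   # probability 0.20 -> odds 0.25

odds_after = odds_before * 1.5            # apply OR = 1.5
p_after = odds_after / (1 + odds_after)   # convert back to probability

print(f"Probability: {p_before:.3f} -> {p_after:.3f}")
print(f"Odds:        {odds_before:.3f} -> {odds_after:.3f}")
```

The odds rise by 50%, but the probability moves only from 0.200 to about 0.273, a gain of roughly 7 percentage points. The gap between the two grows even larger as the baseline probability increases.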

24.17 Ethical Analysis: Who Bears the Cost of Classification Errors?

Ethical Analysis Block

Logistic regression — and classification more broadly — forces a decision: is this observation "positive" or "negative"? And that decision has consequences.

Consider three scenarios:

Scenario 1: Medical diagnosis. A logistic regression predicts whether a tumor is malignant. A false negative (missed cancer) could be fatal. A false positive (unnecessary biopsy) is painful and expensive but survivable. Most people would choose high sensitivity even at the cost of lower specificity.

Scenario 2: Criminal justice. A logistic regression predicts whether a defendant will reoffend. A false positive (wrongly classified as high-risk) means someone who wouldn't have reoffended loses their freedom. A false negative (wrongly classified as low-risk) means someone who would have reoffended is released. Who bears the cost of each error? Does your answer change depending on the defendant's race?

Scenario 3: Loan approval. A logistic regression predicts whether a borrower will default. A false positive (wrongly denied a loan) means someone who would have repaid is denied an opportunity. A false negative (wrongly approved) means the bank loses money. Banks optimize for their own costs — but the societal cost of denying loans to creditworthy applicants from marginalized communities is much harder to quantify.

Discussion questions:

  1. In each scenario, who decides the threshold — and who bears the consequences of that decision?
  2. If a model has equal accuracy across racial groups but different false positive rates, is it "fair"?
  3. Should a model ever include race as a predictor? What about variables that are proxies for race (zip code, school district)?
  4. James argues that "the algorithm is neutral — it just follows the data." Professor Washington challenges: "The data itself encodes historical injustice." Who is right?


24.18 Chapter Summary

Let's trace the logic of this chapter from start to finish.

The problem: Linear regression fails for binary outcomes because it predicts values outside [0, 1].

The solution: The sigmoid function maps any real number to a probability between 0 and 1.

The model: Logistic regression models the log-odds of the outcome as a linear function of the predictors: $\ln(P/(1-P)) = b_0 + b_1 x_1 + \cdots$

The interpretation: Exponentiate the coefficient to get the odds ratio: $e^{b_i}$ is the multiplicative change in odds for a one-unit increase in $x_i$, holding other variables constant.

The evaluation: The confusion matrix gives TP, FP, FN, TN. From these: accuracy, sensitivity, specificity, precision, F1 score. The ROC curve and AUC summarize performance across all thresholds.

The connection to AI: Logistic regression is the simplest classifier. Every AI classification system — spam filters, medical diagnosis, loan approval, predictive policing — is a more complex version of what you learned in this chapter. The evaluation tools (confusion matrix, ROC, AUC) apply to all classifiers.

The ethical dimension: Classification errors are not symmetric. Who bears the cost depends on the domain, and different groups may bear different costs. Evaluating fairness requires checking error rates separately for each group.


What's Next

In Chapter 25, you'll learn to communicate all the statistical analyses you've built throughout this textbook — including the logistic regression results from this chapter. How do you present a confusion matrix to a non-technical audience? How do you explain an odds ratio to a hospital administrator or a judge? How do you write a clear, honest, and compelling data story?

The analysis means nothing if nobody understands it. Chapter 25 is where you learn to make them understand.


Chapter 24 learning objectives — check your understanding:

  • [ ] I can explain why linear regression fails for binary outcomes
  • [ ] I can convert between probability, odds, and log-odds
  • [ ] I can interpret a logistic regression coefficient as an odds ratio
  • [ ] I can fit a logistic regression model in Python using both statsmodels and sklearn
  • [ ] I can construct and interpret a confusion matrix
  • [ ] I can calculate accuracy, sensitivity, specificity, precision, and F1 score
  • [ ] I can explain what an ROC curve shows and interpret the AUC
  • [ ] I can connect logistic regression to classification in AI/machine learning
  • [ ] I can identify ethical concerns with classification algorithms, especially differential error rates