
Learning Objectives

  • Explain why linear regression is inappropriate for classification problems
  • Describe how the sigmoid function maps any number to a probability between 0 and 1
  • Fit a logistic regression model using scikit-learn and interpret its outputs
  • Use predict_proba to obtain probability estimates and explain why probabilities are more useful than binary predictions
  • Construct and interpret a confusion matrix with true positives, false positives, true negatives, and false negatives
  • Explain the tradeoff between precision and recall and why it matters
  • Recognize and handle class imbalance in classification problems
  • Apply logistic regression to classify countries as "high" vs "low" vaccination

Chapter 27: Logistic Regression and Classification — Predicting Categories

"Prediction is very difficult, especially if it's about the future." — Niels Bohr, physicist


Chapter Overview

In Chapter 26, you predicted a number: given a country's GDP and healthcare spending, what vaccination rate should you expect? That's regression — the target is continuous.

Now the question changes. Instead of "What vaccination rate?" you ask: "Will this country have high vaccination or low vaccination?" You're not predicting a number on a sliding scale — you're predicting a category. High or low. Yes or no. Spam or not spam. Malignant or benign.

This is classification, and it's everywhere in data science. Should the bank approve this loan? Will this customer churn? Is this email spam? Does this medical image show a tumor? Each of these is a classification problem — the answer is a category, not a number.

You might think: can't I just use linear regression? Set up a scale where 0 = "low vaccination" and 1 = "high vaccination," then fit a line? You could try, but as you'll see in Section 27.1, it doesn't work well. The predictions can go below 0 or above 1 — which makes no sense for probabilities — and the model doesn't understand that it's predicting a category.

Instead, you'll learn logistic regression — a model specifically designed for classification. Despite its name (it has "regression" in it), logistic regression is a classification algorithm. It predicts the probability that an observation belongs to a particular category, and then uses that probability to make a classification decision.

In this chapter, you will learn to:

  1. Explain why linear regression fails for classification (all paths)
  2. Describe how the sigmoid function converts numbers to probabilities (all paths)
  3. Fit a logistic regression model using scikit-learn (all paths)
  4. Use predict_proba to get probability estimates (all paths)
  5. Construct and interpret confusion matrices (all paths)
  6. Explain the precision-recall tradeoff (all paths)
  7. Recognize class imbalance and its effects on model evaluation (standard + deep dive)
  8. Apply logistic regression to classify countries as high vs. low vaccination (all paths)

27.1 Why Not Just Use Linear Regression?

Let's try. Suppose you have data on whether customers renewed their subscription (1 = yes, 0 = no) and you want to predict renewal from customer satisfaction scores.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

np.random.seed(42)
n = 100

satisfaction = np.random.uniform(1, 10, n)
prob_renew = 1 / (1 + np.exp(-(satisfaction - 5)))
renewed = (np.random.random(n) < prob_renew).astype(int)

# Fit linear regression to a 0/1 target
X = satisfaction.reshape(-1, 1)
lin_model = LinearRegression().fit(X, renewed)

x_plot = np.linspace(0, 11, 200).reshape(-1, 1)
y_plot = lin_model.predict(x_plot)

plt.figure(figsize=(10, 6))
plt.scatter(satisfaction, renewed, alpha=0.5, color='steelblue')
plt.plot(x_plot, y_plot, color='coral', linewidth=2,
         label='Linear regression')
plt.axhline(y=0, color='gray', linestyle='--', alpha=0.3)
plt.axhline(y=1, color='gray', linestyle='--', alpha=0.3)
plt.xlabel('Customer Satisfaction (1-10)')
plt.ylabel('Renewed (0 or 1)')
plt.title('Linear Regression for Classification: The Problem')
plt.legend()
plt.ylim(-0.3, 1.3)
plt.show()

See the problems?

  1. Predictions go below 0 and above 1. For customers with very low satisfaction, the model predicts a negative probability of renewal. For very high satisfaction, it predicts more than 100%. Neither makes sense.

  2. The model doesn't respect the binary nature of the target. It tries to fit a straight line to data that only takes values 0 and 1. The line doesn't capture the S-shaped pattern in the data — the rapid transition from "probably not" to "probably yes" around the middle satisfaction range.

  3. The predicted values aren't probabilities. They're just numbers on a line. You can't interpret a prediction of 0.73 as a 73% probability of renewal, because the model doesn't constrain its outputs to be proper probabilities.

We need a model that outputs values between 0 and 1 — values we can interpret as probabilities. Enter the sigmoid function.


27.2 The Sigmoid Function: Turning a Line Into a Probability

The sigmoid function (also called the logistic function — hence "logistic regression") is a mathematical curve that takes any number and squishes it into the range 0 to 1:

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-8, 8, 200)

plt.figure(figsize=(10, 6))
plt.plot(z, sigmoid(z), color='steelblue', linewidth=2)
plt.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
plt.xlabel('Input (z)')
plt.ylabel('Output: sigmoid(z)')
plt.title('The Sigmoid Function')
plt.grid(True, alpha=0.3)
plt.show()

The sigmoid has a characteristic S-shape:

  • When z is very negative (like -10), sigmoid(z) is very close to 0
  • When z is very positive (like +10), sigmoid(z) is very close to 1
  • When z = 0, sigmoid(z) = 0.5 exactly
  • The transition from near-0 to near-1 happens smoothly
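These properties are easy to verify numerically with the sigmoid defined above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(-10))   # very close to 0
print(sigmoid(10))    # very close to 1
print(sigmoid(0))     # exactly 0.5

# The output stays strictly between 0 and 1 across a wide input range
z = np.linspace(-30, 30, 601)
print(np.all((sigmoid(z) > 0) & (sigmoid(z) < 1)))
```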

Here's the key insight: logistic regression computes a linear combination of features (just like linear regression) and then passes the result through the sigmoid function. The linear part captures the relationship between features and outcome. The sigmoid part squishes the result into a probability.

Linear regression:   prediction = intercept + w1*x1 + w2*x2
Logistic regression: probability = sigmoid(intercept + w1*x1 + w2*x2)

That's it. Logistic regression is linear regression wrapped in a sigmoid. The linear part calculates a score; the sigmoid converts that score to a probability between 0 and 1.
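To make the two-step recipe concrete, here is a hand computation. The intercept and weights below are invented for illustration, not fitted values:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative (not fitted) parameters and one observation's features
intercept, w1, w2 = -2.0, 0.5, 1.2
x1, x2 = 4.0, 1.0

score = intercept + w1 * x1 + w2 * x2   # the linear part: 1.2
probability = sigmoid(score)            # the sigmoid part: about 0.769
print(f"score = {score:.2f}, probability = {probability:.3f}")
```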


27.3 From Probability to Prediction: The Threshold

The sigmoid gives you a probability. But classification needs a decision: high or low? Yes or no? Spam or not spam?

To convert a probability into a category, you choose a threshold (also called a decision boundary). The default is 0.5:

  • If probability >= 0.5, predict the positive class (e.g., "high vaccination")
  • If probability < 0.5, predict the negative class (e.g., "low vaccination")

probabilities = np.array([0.12, 0.45, 0.52, 0.78, 0.91, 0.33])
threshold = 0.5
predictions = (probabilities >= threshold).astype(int)

for prob, pred in zip(probabilities, predictions):
    label = "High" if pred == 1 else "Low"
    print(f"  Probability: {prob:.2f} -> Prediction: {label}")

But 0.5 isn't always the right threshold. We'll come back to this important point later in the chapter.


27.4 Fitting Logistic Regression with scikit-learn

The scikit-learn workflow is nearly identical to linear regression. The only difference is the import and the model class:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Simulated customer data
np.random.seed(42)
n = 300

customers = pd.DataFrame({
    'satisfaction': np.random.uniform(1, 10, n),
    'months_active': np.random.uniform(1, 60, n),
    'support_tickets': np.random.poisson(3, n)
})

# Renewal probability depends on features
logit = (
    0.5 * customers['satisfaction'] -
    0.02 * customers['months_active'] -
    0.3 * customers['support_tickets'] - 1
)
customers['renewed'] = (
    np.random.random(n) < sigmoid(logit)
).astype(int)

# Define features and target
X = customers[['satisfaction', 'months_active', 'support_tickets']]
y = customers['renewed']

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit logistic regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate
print(f"Training accuracy: {model.score(X_train, y_train):.3f}")
print(f"Test accuracy:     {model.score(X_test, y_test):.3f}")

Notice: .score() for a classifier returns accuracy — the fraction of predictions that are correct. For regression, .score() returned R-squared. Same method name, different metric.
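You can check this equivalence yourself: for a classifier, .score() matches accuracy_score applied to the model's own predictions. The toy data below is invented just for the check:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Small synthetic binary problem
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# For a classifier, .score() is exactly accuracy on the predictions
print(model.score(X_test, y_test))
print(accuracy_score(y_test, model.predict(X_test)))
```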


27.5 Probabilities vs. Predictions: Why predict_proba Matters

Here's something that separates good data scientists from beginners: probabilities are more useful than binary predictions.

When a model says "this customer will renew" or "this customer will not renew," you lose nuance. A customer with a 51% probability of renewal and one with a 99% probability both get labeled "will renew" — but the two situations are very different.

The predict_proba method gives you the underlying probabilities:

# Binary predictions
predictions = model.predict(X_test)

# Probability predictions
probabilities = model.predict_proba(X_test)

# predict_proba returns two columns:
# Column 0: probability of class 0 (did not renew)
# Column 1: probability of class 1 (renewed)
print("First 5 predictions:")
print(f"{'Prediction':>12}  {'P(Not Renew)':>14}  {'P(Renew)':>10}")
for i in range(5):
    print(f"{predictions[i]:>12}  {probabilities[i, 0]:>14.3f}  "
          f"{probabilities[i, 1]:>10.3f}")

Why are probabilities better? Three reasons:

  1. You can rank cases. Sort customers by renewal probability to focus retention efforts on the borderline cases (probability around 0.3-0.7), not the definite renewals or definite losses.

  2. You can set your own threshold. Maybe you want to be aggressive about retention — flag anyone with renewal probability below 0.7 (not just below 0.5). Probabilities give you that flexibility.

  3. You can communicate uncertainty. "This country has a 62% probability of high vaccination" is more informative than "This country will have high vaccination." Decision-makers can factor the uncertainty into their choices.
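Point 1 can be sketched in a few lines: sorting by predicted probability surfaces the borderline cases. The probabilities and the 0.3-0.7 cutoffs below are illustrative, not outputs of the chapter's model:

```python
import pandas as pd

# Hypothetical renewal probabilities for ten customers
probs = pd.Series(
    [0.05, 0.98, 0.45, 0.62, 0.91, 0.33, 0.55, 0.12, 0.70, 0.40],
    name='p_renew'
)

# Borderline customers: neither safe bets nor lost causes
borderline = probs[(probs >= 0.3) & (probs <= 0.7)].sort_values()
print(borderline)
```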


27.6 Interpreting Logistic Regression Coefficients

In linear regression, a coefficient of 2.5 means "a one-unit increase in the feature is associated with a 2.5-unit increase in the target." Simple.

In logistic regression, interpretation is slightly trickier because of the sigmoid. A coefficient of 0.5 does not mean "a one-unit increase in the feature increases the probability by 0.5." The effect on probability depends on where you start — a one-unit change near the middle of the sigmoid has a larger effect on probability than the same change near the extremes.

What stays simple is the direction:

  • Positive coefficient: Higher values of this feature increase the probability of the positive class
  • Negative coefficient: Higher values of this feature decrease the probability of the positive class
  • Larger absolute value: Stronger association with the outcome

print("Coefficients:")
for feat, coef in zip(X.columns, model.coef_[0]):
    direction = "increases" if coef > 0 else "decreases"
    print(f"  {feat}: {coef:.3f} "
          f"(higher {feat} {direction} renewal probability)")

For a more precise interpretation, logistic regression coefficients can be converted to odds ratios by exponentiating them. An odds ratio greater than 1 means the feature increases the odds of the positive outcome; less than 1 means it decreases the odds:

print("\nOdds ratios:")
for feat, coef in zip(X.columns, model.coef_[0]):
    odds_ratio = np.exp(coef)
    print(f"  {feat}: {odds_ratio:.3f}")

An odds ratio of 1.65 for satisfaction means: a one-unit increase in satisfaction multiplies the odds of renewal by 1.65 (a 65% increase in odds).
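The "multiplies the odds" claim can be verified numerically: pick any starting score, bump the feature by one unit, and compare the odds before and after. The 0.5 coefficient here is the value used to simulate the data, not necessarily the fitted one:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

coef = 0.5    # coefficient for satisfaction
z = -0.3      # arbitrary starting score

p_before = sigmoid(z)
p_after = sigmoid(z + coef)    # one-unit increase in satisfaction

odds_before = p_before / (1 - p_before)
odds_after = p_after / (1 - p_after)

# The odds ratio equals exp(coef), regardless of the starting point
print(odds_after / odds_before)
print(np.exp(coef))
```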


27.7 The Confusion Matrix: Detailed Error Analysis

Accuracy tells you the overall percentage of correct predictions. But it hides important details. A model that gets 90% accuracy sounds great — until you learn that 90% of the data belongs to one class, and the model just predicts that class every time.

The confusion matrix breaks down predictions into four categories:

                      Predicted
                  Positive    Negative
Actual  Positive    TP          FN
        Negative    FP          TN

  • True Positive (TP): Model predicted positive, and it was actually positive. (Correct)
  • True Negative (TN): Model predicted negative, and it was actually negative. (Correct)
  • False Positive (FP): Model predicted positive, but it was actually negative. (Type I error, "false alarm")
  • False Negative (FN): Model predicted negative, but it was actually positive. (Type II error, "miss")

from sklearn.metrics import confusion_matrix, classification_report

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

print("Confusion Matrix:")
print(f"  TN={cm[0,0]}  FP={cm[0,1]}")
print(f"  FN={cm[1,0]}  TP={cm[1,1]}")

print(f"\nAccuracy: {(cm[0,0]+cm[1,1])/cm.sum():.3f}")

Making It Visual

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(cm, cmap='Blues')

labels = ['Not Renewed', 'Renewed']
ax.set_xticks([0, 1])
ax.set_yticks([0, 1])
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')

for i in range(2):
    for j in range(2):
        ax.text(j, i, str(cm[i, j]),
                ha='center', va='center', fontsize=18)

plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()

Why the Confusion Matrix Matters

Consider a medical screening test. The confusion matrix tells you:

  • False Negatives (missed cases): A patient has cancer, but the test says they don't. This could be fatal. In medicine, false negatives are often the most dangerous error.

  • False Positives (false alarms): A healthy patient is told they might have cancer. This causes anxiety and unnecessary follow-up tests. Costly and stressful, but not fatal.

The type of error matters more than the number of errors. A model with 95% accuracy but many false negatives might be more dangerous than a model with 90% accuracy but few false negatives. The confusion matrix helps you see which errors your model is making.


27.8 Precision and Recall: Two Sides of the Same Coin

Two key metrics emerge from the confusion matrix:

Precision: Of all the cases the model predicted as positive, how many were actually positive?

Precision = TP / (TP + FP)

High precision means: "When the model says yes, it's usually right." Few false alarms.

Recall (also called sensitivity): Of all the actually positive cases, how many did the model correctly identify?

Recall = TP / (TP + FN)

High recall means: "The model catches most of the positive cases." Few misses.

from sklearn.metrics import precision_score, recall_score

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f"Precision: {precision:.3f}")
print(f"  'When the model predicts renewal, it's right "
      f"{precision*100:.0f}% of the time'")
print(f"\nRecall: {recall:.3f}")
print(f"  'The model catches {recall*100:.0f}% of actual renewals'")
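To see exactly where these numbers come from, the same metrics can be computed by counting true/false positives directly. This is a small hand-made example, not the chapter's fitted model:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat  = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_hat == 1) & (y_true == 1))   # 2 correct positive calls
fp = np.sum((y_hat == 1) & (y_true == 0))   # 1 false alarm
fn = np.sum((y_hat == 0) & (y_true == 1))   # 2 misses

# Manual formulas agree with scikit-learn
print(tp / (tp + fp), precision_score(y_true, y_hat))
print(tp / (tp + fn), recall_score(y_true, y_hat))
```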

The Precision-Recall Tradeoff

Here's the tension: increasing precision typically decreases recall, and vice versa.

Why? Because precision and recall pull in opposite directions through the threshold:

  • Raise the threshold (e.g., only predict "positive" if probability > 0.8): The model becomes pickier — it only predicts positive when it's very confident. Precision goes up (fewer false alarms), but recall goes down (more misses).

  • Lower the threshold (e.g., predict "positive" if probability > 0.3): The model becomes more aggressive — it predicts positive for borderline cases too. Recall goes up (fewer misses), but precision goes down (more false alarms).

# Demonstrate the precision-recall tradeoff
proba = model.predict_proba(X_test)[:, 1]

thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
print(f"{'Threshold':>10}  {'Precision':>10}  {'Recall':>8}")
for t in thresholds:
    preds = (proba >= t).astype(int)
    p = precision_score(y_test, preds, zero_division=0)
    r = recall_score(y_test, preds, zero_division=0)
    print(f"{t:>10.1f}  {p:>10.3f}  {r:>8.3f}")

Which Matters More?

It depends on the consequences of each type of error:

Scenario            Priority         Why
Cancer screening    High recall      Missing cancer (FN) is worse than a false alarm (FP)
Spam filtering      High precision   Marking real email as spam (FP) is worse than letting spam through (FN)
Criminal justice    High precision   Convicting an innocent person (FP) is worse than letting a guilty person go (FN)
Fraud detection     High recall      Missing fraud (FN) costs money; investigating false alarms (FP) costs less

There's no universally correct answer. The right balance depends on the domain and the consequences.


27.9 The Classification Report: All Metrics at Once

scikit-learn's classification_report gives you precision, recall, and F1-score all at once:

print(classification_report(y_test, y_pred,
      target_names=['Not Renewed', 'Renewed']))

This prints a clean table with all the metrics you need. The F1-score combines precision and recall into a single number (their harmonic mean), which is useful when you want a balanced measure.
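The harmonic mean penalizes imbalance between the two metrics more than a plain average would. A quick illustration with made-up precision and recall values:

```python
precision, recall = 0.9, 0.3

arithmetic_mean = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)

# The harmonic mean is dragged down by the low recall
print(round(arithmetic_mean, 3))  # 0.6
print(round(f1, 3))               # 0.45
```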


27.10 Class Imbalance: The Elephant in the Room

Here's a scenario that trips up nearly every beginning data scientist.

You build a model to detect fraudulent credit card transactions. Only 0.5% of transactions are fraudulent. Your model achieves 99.5% accuracy. Amazing?

No. A model that predicts "not fraud" for every transaction would also achieve 99.5% accuracy. It would catch zero fraud. It would be completely useless.

This is the class imbalance problem: when one class is much more common than the other, accuracy becomes misleading.

# Simulate imbalanced data
np.random.seed(42)
n = 1000
X_imb = np.random.normal(0, 1, (n, 3))
y_imb = np.zeros(n)
y_imb[:50] = 1  # Only 5% positive class

print(f"Class distribution:")
print(f"  Negative (0): {(y_imb==0).sum()} ({(y_imb==0).mean()*100:.1f}%)")
print(f"  Positive (1): {(y_imb==1).sum()} ({(y_imb==1).mean()*100:.1f}%)")

# Baseline: always predict majority class
baseline_acc = (y_imb == 0).mean()
print(f"\nBaseline accuracy (always predict 0): {baseline_acc:.1%}")

What to Do About Class Imbalance

Several strategies:

  1. Don't use accuracy as your primary metric. Use precision, recall, F1-score, or the confusion matrix instead.

  2. Use class_weight='balanced' in scikit-learn. This tells the model to pay more attention to the minority class:

model_balanced = LogisticRegression(
    class_weight='balanced', max_iter=1000
)
model_balanced.fit(X_train, y_train)

  3. Adjust the threshold. Instead of the default 0.5, lower the threshold to catch more positive cases (at the cost of more false positives).

  4. Report multiple metrics. Always present the confusion matrix, precision, and recall alongside accuracy.
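A minimal end-to-end sketch of the class_weight strategy on simulated imbalanced data (the data-generating process below is invented for illustration, so exact numbers will differ from real problems):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000
X = rng.normal(0, 1, (n, 2))
# Rare positive class (~5%): high feature sums plus noise cross a cutoff
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1, n) > 2.85).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, random_state=0
)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(
    class_weight='balanced', max_iter=1000
).fit(X_tr, y_tr)

# Reweighting typically lifts minority-class recall
# (at the cost of more false positives)
print("plain recall:   ", recall_score(y_te, plain.predict(X_te)))
print("balanced recall:", recall_score(y_te, balanced.predict(X_te)))
```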


27.11 Project Milestone: Classifying Countries as High vs. Low Vaccination

Time to apply everything to the progressive project. We'll reframe the vaccination prediction as a classification problem.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix,
    classification_report, accuracy_score)

# Simulated country data
np.random.seed(42)
n = 150

countries = pd.DataFrame({
    'gdp_per_capita': np.random.lognormal(9, 1.2, n),
    'health_spending_pct': np.random.uniform(2, 12, n),
    'education_index': np.random.uniform(0.3, 0.95, n),
    'urban_pct': np.random.uniform(15, 95, n)
})

# Generate vaccination rate
vacc_rate = (
    20 +
    5 * np.log(countries['gdp_per_capita'] / 1000) +
    1.5 * countries['health_spending_pct'] +
    30 * countries['education_index'] +
    0.08 * countries['urban_pct'] +
    np.random.normal(0, 6, n)
).clip(15, 100)

# Convert to binary: high (>= 80%) vs low (< 80%)
countries['high_vaccination'] = (vacc_rate >= 80).astype(int)

print("Class distribution:")
print(countries['high_vaccination'].value_counts())
print(f"\n{countries['high_vaccination'].mean():.1%} of countries "
      f"have high vaccination")

Building the Classifier

features = ['gdp_per_capita', 'health_spending_pct',
            'education_index', 'urban_pct']
X = countries[features]
y = countries['high_vaccination']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: always predict the majority class
majority_class = y_train.mode()[0]
baseline_acc = (y_test == majority_class).mean()
print(f"Baseline accuracy: {baseline_acc:.3f}")

# Logistic regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print(f"\nModel accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(f"  TN={cm[0,0]}  FP={cm[0,1]}")
print(f"  FN={cm[1,0]}  TP={cm[1,1]}")

print(f"\n{classification_report(y_test, y_pred, target_names=['Low', 'High'])}")

Interpreting the Results

# What did the model learn?
print("Feature coefficients:")
for feat, coef in zip(features, model.coef_[0]):
    direction = "+" if coef > 0 else ""
    print(f"  {feat}: {direction}{coef:.4f}")

print(f"\nIntercept: {model.intercept_[0]:.4f}")

The coefficients tell us which country indicators are most strongly associated with high vaccination status. Education index likely has the largest positive coefficient, suggesting it's the strongest predictor.

Probabilities for Policy Insight

# Show probability predictions for sample countries
sample_countries = X_test.head(8).copy()
sample_countries['actual'] = y_test.head(8).values
sample_countries['probability'] = y_proba[:8]
sample_countries['predicted'] = y_pred[:8]

print("\nSample predictions:")
print(sample_countries[['actual', 'probability',
    'predicted']].to_string())

Notice how the probabilities give you much more information than the binary predictions. A country with a 0.52 probability of high vaccination is very different from one with a 0.98 probability, even though both are classified as "high" with the default 0.5 threshold.


27.12 Choosing the Right Threshold

The default threshold of 0.5 isn't always appropriate. Consider:

Scenario A: A public health organization wants to identify countries that might need vaccination support. Missing a country that actually needs help (false negative) is worse than incorrectly flagging a country that's doing fine (false positive). Lower the threshold — flag more countries, accept more false alarms.

Scenario B: You have limited resources and can only help 10 countries. You want to be confident that the countries you flag actually need help. Raise the threshold — be more selective, accept more misses.

print(f"{'Threshold':>10}  {'Accuracy':>9}  {'Precision':>10}  "
      f"{'Recall':>8}  {'F1':>6}")

for t in [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
    preds = (y_proba >= t).astype(int)
    acc = accuracy_score(y_test, preds)
    prec = precision_score(y_test, preds, zero_division=0)
    rec = recall_score(y_test, preds, zero_division=0)
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0
    print(f"{t:>10.1f}  {acc:>9.3f}  {prec:>10.3f}  "
          f"{rec:>8.3f}  {f1:>6.3f}")

The table reveals the tradeoff. As the threshold increases, precision goes up (fewer false alarms) but recall goes down (more misses). The right threshold depends on the costs of each type of error in your specific context.


27.13 Comparing Regression and Classification Approaches

We now have two models for the vaccination data:

Approach                 Model               Question                     Output
Regression (Ch. 26)      LinearRegression    "What vaccination rate?"     A number (e.g., 87.3%)
Classification (Ch. 27)  LogisticRegression  "High or low vaccination?"   A category + probability

Which is better? Neither — they answer different questions.

Use regression when you need a specific number (e.g., forecasting the actual vaccination rate for budgeting purposes).

Use classification when you need a decision (e.g., should we send a vaccination support team to this country?).

Use probability outputs when you need to communicate uncertainty or rank cases (e.g., "These 15 countries are most likely to have low vaccination — prioritize them for outreach").

In practice, many data scientists build both models and use whichever one best fits the decision at hand.


27.14 When Logistic Regression Fails

Like linear regression, logistic regression has limitations:

  1. It assumes a linear decision boundary. The boundary between "high" and "low" vaccination is a flat hyperplane in feature space. If the true boundary is curved or irregular, logistic regression will underfit.

  2. It assumes features contribute independently. If the effect of GDP on vaccination depends on the education level (an interaction), basic logistic regression misses this.

  3. It's sensitive to feature scaling. Unlike linear regression, logistic regression with regularization (the default in scikit-learn) can produce different results depending on whether features are scaled. For consistent results, scale your features.

  4. It struggles with many irrelevant features. Adding noise features can degrade performance, just as with linear regression.
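For point 3, the standard remedy is to put a scaler and the model in a single pipeline, so scaling is learned from the training data only. A sketch with invented features on very different scales:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
# Features on wildly different scales (e.g., GDP vs. an index in [0, 1])
gdp = rng.lognormal(9, 1.0, n)
edu = rng.uniform(0.3, 0.95, n)
X = np.column_stack([gdp, edu])
y = (edu + rng.normal(0, 0.1, n) > 0.6).astype(int)

# Scaling happens inside the pipeline, applied consistently at fit
# and predict time
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)
print(pipe.score(X, y))
```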

Despite these limitations, logistic regression is one of the most widely used classification methods in medicine, social science, and business — largely because its outputs (probabilities and interpretable coefficients) are so useful for decision-making.


27.15 Chapter Summary

You've learned to predict categories, not just numbers. Here's what you now understand:

Classification predicts which category an observation belongs to, not what number to assign it. The target is categorical (yes/no, high/low, spam/not spam).

Logistic regression is linear regression wrapped in a sigmoid function. The linear part computes a score from features; the sigmoid converts that score to a probability between 0 and 1.

predict_proba gives you probabilities, which are more informative than binary predictions. Use probabilities to rank cases, set custom thresholds, and communicate uncertainty.

The confusion matrix breaks predictions into four categories: true positives, false positives, true negatives, and false negatives. This reveals which kinds of errors the model makes, not just how many.

Precision measures the reliability of positive predictions (few false alarms). Recall measures the completeness of positive detection (few misses). They trade off against each other through the threshold.

Class imbalance makes accuracy misleading. When one class dominates, a model can achieve high accuracy by always predicting the majority class. Use precision, recall, and F1-score instead.

The threshold converts probabilities to decisions. Choosing the threshold depends on the relative costs of false positives and false negatives in your specific context.

In Chapter 28, you'll learn about decision trees — a completely different approach to both regression and classification that makes decisions by asking a series of yes/no questions about the features.


Connections to What You've Learned

Concept from This Chapter          Foundation from Earlier
Sigmoid function                   Probability between 0 and 1 (Chapter 20)
Logistic regression coefficients   Linear regression coefficients (Chapter 26)
Confusion matrix                   Contingency tables (Chapter 23)
Precision and recall               Type I and Type II errors (Chapter 23)
Class imbalance                    Baseline models (Chapter 25)
Train-test split and evaluation    Model evaluation framework (Chapter 25)
Feature interpretation             Correlation vs. causation (Chapter 24)

Looking Ahead

Next Chapter                   What You'll Learn
Chapter 28: Decision Trees     A visual, intuitive model that handles nonlinear relationships
Chapter 29: Evaluating Models  Cross-validation, ROC curves, and choosing the right metrics
Chapter 30: ML Workflow        The complete pipeline from data to deployment