Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories

How to use these exercises: Part A tests your understanding of classification concepts, the sigmoid function, and error types. Part B applies these concepts to real-world scenarios where the costs of errors differ. Part C involves coding logistic regression models and evaluating them. Part D challenges you with threshold selection, class imbalance, and ethical considerations.

Difficulty key: ⭐ Foundational | ⭐⭐ Intermediate | ⭐⭐⭐ Advanced | ⭐⭐⭐⭐ Extension


Part A: Conceptual Understanding


Exercise 27.1: Regression vs. classification

Classify each problem as regression or classification. Explain your reasoning.

  1. Predicting the temperature tomorrow in degrees Fahrenheit
  2. Predicting whether it will rain tomorrow (yes/no)
  3. Predicting a student's letter grade (A, B, C, D, F)
  4. Predicting how many minutes a flight will be delayed
  5. Predicting whether a tumor is malignant or benign
  6. Predicting a restaurant's star rating (1-5 stars)
Guidance
  1. **Regression.** Temperature is a continuous number.
  2. **Classification.** Rain/no rain is a binary category.
  3. **Classification.** Letter grades are discrete categories (though they have an ordering).
  4. **Regression.** Minutes is a continuous number.
  5. **Classification.** Malignant/benign is a binary category.
  6. **Could be either.** If the ratings are treated as ordered categories (1, 2, 3, 4, 5), it's classification. If treated as a continuous scale, it's regression. Many practitioners use regression for this since the numbers have meaningful order and distances.

Exercise 27.2: The sigmoid function

Without running code, determine the approximate output of the sigmoid function for each input:

  1. sigmoid(0)
  2. sigmoid(5)
  3. sigmoid(-5)
  4. sigmoid(100)
  5. sigmoid(-100)

Then explain in plain language what the sigmoid function does and why it's useful for classification.

Guidance
  1. sigmoid(0) = 0.5 (exactly the midpoint)
  2. sigmoid(5) ≈ 0.993 (very close to 1)
  3. sigmoid(-5) ≈ 0.007 (very close to 0)
  4. sigmoid(100) ≈ 1.000 (essentially 1)
  5. sigmoid(-100) ≈ 0.000 (essentially 0)
The sigmoid function takes any number (from negative infinity to positive infinity) and maps it to a value between 0 and 1. This is useful for classification because we can interpret the output as a probability — the probability that an observation belongs to the positive class. It converts a linear score (which could be any number) into a proper probability (which must be between 0 and 1).
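These values are easy to verify with a few lines of Python (a quick check, not part of the exercise):

```python
import numpy as np

def sigmoid(z):
    # Map any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

for z in [0, 5, -5, 100, -100]:
    print(f"sigmoid({z:>4}) = {sigmoid(z):.3f}")
```

Note the symmetry: sigmoid(-z) = 1 - sigmoid(z), which is why the outputs for 5 and -5 mirror each other around 0.5.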

Exercise 27.3: Reading a confusion matrix

A spam filter produces the following confusion matrix:

                Predicted
              Spam    Not Spam
Actual Spam    180       20
   Not Spam     30      770
  1. Identify the TP, FP, TN, and FN values
  2. Calculate the accuracy
  3. Calculate the precision for spam detection
  4. Calculate the recall for spam detection
  5. What type of error is more harmful here: false positives or false negatives?
Guidance
  1. TP = 180 (correctly identified spam), FP = 30 (real email marked as spam), TN = 770 (correctly identified real email), FN = 20 (spam that got through).
  2. Accuracy = (180 + 770) / 1000 = 950/1000 = 95.0%
  3. Precision = TP / (TP + FP) = 180 / (180 + 30) = 180/210 = 85.7%
  4. Recall = TP / (TP + FN) = 180 / (180 + 20) = 180/200 = 90.0%
  5. **False positives are more harmful.** A false positive means a legitimate email is sent to the spam folder — the user might miss an important message (job offer, medical result, etc.). A false negative means spam gets through — annoying but rarely catastrophic. Most spam filters prioritize high precision (few real emails in spam) over high recall (catching every spam).
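The arithmetic can be reproduced directly from the four matrix entries (a quick sketch):

```python
# Spam filter confusion matrix, with spam as the positive class
tp, fn = 180, 20   # actual spam: caught, missed
fp, tn = 30, 770   # actual not-spam: wrongly flagged, correctly passed

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
print(f"Accuracy: {accuracy:.1%}, Precision: {precision:.1%}, Recall: {recall:.1%}")
# Accuracy: 95.0%, Precision: 85.7%, Recall: 90.0%
```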

Exercise 27.4: Precision vs. recall ⭐⭐

For each scenario, state whether you would prioritize precision or recall, and explain why:

  1. A fire alarm system in a hospital
  2. A recommendation engine suggesting movies
  3. A test for detecting COVID-19 during a pandemic
  4. An automated hiring tool that screens resumes
  5. A self-driving car's pedestrian detection system
Guidance
  1. **Recall.** Missing a real fire (false negative) could be fatal. False alarms (false positives) are disruptive but not deadly. Better to have a few false alarms than to miss a real fire.
  2. **Precision.** Recommending a bad movie (false positive) is mildly annoying. Missing a good movie (false negative) is fine — there are plenty of other movies to suggest. No need for high recall; users prefer accurate suggestions.
  3. **Recall.** Missing an infected person (false negative) allows them to spread the virus. A false positive leads to unnecessary quarantine — costly but less dangerous than letting an infected person circulate.
  4. **This is debatable, but lean toward precision.** Rejecting a good candidate (false negative) is costly but can be partially addressed by the candidate reapplying. Advancing a clearly unqualified candidate (false positive) wastes interviewer time and may displace better candidates. However, if the tool systematically disadvantages certain groups (creating biased false negatives), recall for those groups matters immensely.
  5. **Recall, strongly.** Missing a pedestrian (false negative) could be fatal. Braking for a false detection (false positive) causes a minor inconvenience. Self-driving systems must detect virtually all pedestrians.

Exercise 27.5: The threshold effect ⭐⭐

A model produces the following probability predictions for 10 patients being tested for a disease:

Patient   P(Disease)   Actual
1         0.92         Disease
2         0.15         No disease
3         0.78         Disease
4         0.45         No disease
5         0.62         Disease
6         0.88         Disease
7         0.33         No disease
8         0.55         Disease
9         0.28         Disease
10        0.71         No disease
  1. Using a threshold of 0.5, classify each patient and compute the confusion matrix
  2. Using a threshold of 0.3, classify each patient and compute the confusion matrix
  3. How did recall change between the two thresholds? Precision?
  4. Which threshold is more appropriate for medical screening and why?
Guidance
**Threshold 0.5:** Predict disease if P >= 0.5.
  - Predictions: 1=D, 2=N, 3=D, 4=N, 5=D, 6=D, 7=N, 8=D, 9=N, 10=D
  - TP=5 (patients 1, 3, 5, 6, 8), FP=1 (patient 10), FN=1 (patient 9), TN=3 (patients 2, 4, 7)
  - Precision = 5/6 = 83.3%, Recall = 5/6 = 83.3%
**Threshold 0.3:** Predict disease if P >= 0.3.
  - Predictions: 1=D, 2=N, 3=D, 4=D, 5=D, 6=D, 7=D, 8=D, 9=N, 10=D
  - TP=5 (1, 3, 5, 6, 8), FP=3 (4, 7, 10), FN=1 (patient 9, whose P=0.28 is still below 0.3), TN=1 (2)
  - Precision = 5/8 = 62.5%, Recall = 5/6 = 83.3%
Recall is unchanged at 83.3% between the two thresholds, but precision drops: the 0.3 threshold catches the same disease cases while flagging more false positives. For medical screening, the lower threshold is more appropriate because missing disease (false negative) is more dangerous than a false alarm (false positive). You want high recall.
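The counts for both thresholds can be checked with a short loop over the table (a verification sketch):

```python
# Ten patients from the table: predicted probability and actual label (1 = disease)
probs  = [0.92, 0.15, 0.78, 0.45, 0.62, 0.88, 0.33, 0.55, 0.28, 0.71]
actual = [1, 0, 1, 0, 1, 1, 0, 1, 1, 0]

for t in (0.5, 0.3):
    pred = [int(p >= t) for p in probs]
    tp = sum(1 for p, a in zip(pred, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(pred, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(pred, actual) if p == 0 and a == 1)
    tn = sum(1 for p, a in zip(pred, actual) if p == 0 and a == 0)
    print(f"threshold {t}: TP={tp} FP={fp} FN={fn} TN={tn}  "
          f"precision={tp/(tp+fp):.1%} recall={tp/(tp+fn):.1%}")
# threshold 0.5: TP=5 FP=1 FN=1 TN=3  precision=83.3% recall=83.3%
# threshold 0.3: TP=5 FP=3 FN=1 TN=1  precision=62.5% recall=83.3%
```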

Exercise 27.6: Class imbalance awareness ⭐⭐

A credit card fraud detection model achieves 99.8% accuracy. The dataset contains 0.2% fraudulent transactions.

  1. What accuracy would a model that always predicts "not fraud" achieve?
  2. Is the 99.8% accuracy meaningful? Why or why not?
  3. What metrics would you use instead of accuracy?
  4. If the model catches 70% of fraud but falsely flags 0.5% of legitimate transactions, is it useful? Calculate the numbers for a day with 100,000 transactions, 200 of which are fraudulent.
Guidance
  1. Always predicting "not fraud" gives 99.8% accuracy — identical to the model.
  2. No, it's not meaningful. The model might just be predicting "not fraud" for almost everything. Accuracy is useless for highly imbalanced classes.
  3. Recall (what fraction of actual fraud is caught), precision (what fraction of flagged transactions are actually fraud), and the confusion matrix.
  4. Out of 100,000 transactions (200 fraud, 99,800 legitimate):
     - Catches 70% of fraud: 140 true positives, 60 missed frauds
     - Falsely flags 0.5% of legitimate: 499 false positives
     - The model flags 140 + 499 = 639 transactions for review
     - Precision: 140/639 = 21.9% (most flagged transactions aren't fraud)
     - Recall: 140/200 = 70%
     - Is it useful? Yes — 140 frauds caught, even though each flag requires investigation. The 60 missed frauds are concerning but the model still catches the majority. The question is whether the cost of investigating 499 false positives is worth catching 140 frauds.
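The day's numbers in part 4 can be computed directly (a small sketch of the arithmetic):

```python
# One day: 100,000 transactions, 200 of them fraudulent
n_fraud, n_legit = 200, 99_800

tp = round(n_fraud * 0.70)    # fraud caught        -> 140
fn = n_fraud - tp             # fraud missed        -> 60
fp = round(n_legit * 0.005)   # legitimate flagged  -> 499
flagged = tp + fp             # total flagged       -> 639

print(f"Flagged for review: {flagged}")
print(f"Precision: {tp / flagged:.1%}")   # 21.9%
print(f"Recall:    {tp / n_fraud:.1%}")   # 70.0%
```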

Exercise 27.7: Why linear regression fails for classification

In your own words, explain two specific problems with using linear regression (instead of logistic regression) for a binary classification problem. Use concrete examples.

Guidance
**Problem 1: Predictions outside [0, 1].** Linear regression produces predictions that can be any number — including negative values and values above 1. If the model predicts a -0.15 probability of disease, that's meaningless. A probability must be between 0 and 1. Example: when predicting whether a student passes (1) or fails (0) from study hours, the linear model might predict -0.2 for 0 hours or 1.3 for 15 hours.
**Problem 2: The relationship isn't linear.** In many classification problems, the probability transitions rapidly from near-0 to near-1 over a narrow range of the feature (an S-shaped curve). A straight line can't capture this shape — it either overestimates probabilities at the extremes or underestimates the transition in the middle. The sigmoid function naturally captures this S-shape.
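Problem 1 is easy to demonstrate on synthetic data (a sketch; the dataset and the pass threshold of 7 hours are made up for illustration):

```python
# Fit plain linear regression to a 0/1 pass/fail target and
# inspect its predictions at the extremes of the feature range
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
hours = rng.uniform(0, 15, 200).reshape(-1, 1)
# Students pass if study hours plus some noise exceed 7
passed = (hours.ravel() + rng.normal(0, 2, 200) > 7).astype(int)

lin = LinearRegression().fit(hours, passed)
lo, hi = lin.predict([[0.0], [15.0]])
print(f"predicted 'probability' at 0 hours:  {lo:.2f}")   # typically below 0
print(f"predicted 'probability' at 15 hours: {hi:.2f}")   # typically above 1
```

Because the straight line must pass through the flat regions near 0 and 1, its extrapolations at the ends of the range fall outside [0, 1].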

Part B: Applied Scenarios ⭐⭐


Exercise 27.8: Designing a classification system

You're building a system to predict whether online reviews are genuine or fake. Fake reviews can be either overly positive (fake 5-star reviews) or overly negative (fake 1-star reviews from competitors).

  1. Define the target and features for this classification problem
  2. What would the baseline accuracy be if 15% of reviews are fake?
  3. Should you prioritize precision or recall? Consider both the platform's perspective and the reviewer's perspective.
  4. How might false positives (flagging genuine reviews as fake) harm the platform?
  5. How might false negatives (letting fake reviews through) harm the platform?
Guidance
  1. Target: fake (1) or genuine (0). Features: review length, time since purchase, reviewer history, number of reviews by this reviewer, sentiment extremity, similarity to other reviews, reviewer account age.
  2. Baseline: always predict "genuine" = 85% accuracy.
  3. From the platform's perspective: slightly favor recall (catching fake reviews protects trust). From the reviewer's perspective: favor precision (being wrongly accused of faking is frustrating and alienating). A balanced approach is needed.
  4. False positives: genuine reviewers are flagged, their reviews removed, and they become angry. This damages the platform's relationship with contributors and discourages honest reviews.
  5. False negatives: fake reviews remain visible, misleading consumers and undermining trust in the platform. Businesses gaming the system gain an unfair advantage.

Exercise 27.9: Interpreting logistic regression output ⭐⭐

A logistic regression model predicts whether a loan applicant will default. The model outputs:

Coefficients:
  income:        -0.0002
  credit_score:  -0.015
  loan_amount:    0.00005
  prior_defaults:  1.8

Intercept: 5.2
  1. Interpret the sign of each coefficient (which features increase default probability?)
  2. Which feature has the strongest association with default? (Be careful about scale.)
  3. A customer has income=$80,000, credit_score=720, loan_amount=$25,000, prior_defaults=0. Calculate the logit score and the predicted probability.
  4. Same customer but with 2 prior defaults. How does the probability change?
Guidance
  1. Income (negative): higher income → lower default probability. Credit score (negative): higher credit score → lower default probability. Loan amount (positive): larger loans → higher default probability. Prior defaults (positive): more past defaults → higher default probability.
  2. Can't tell directly from raw coefficients because scales differ. Income is in tens of thousands, credit score in hundreds, loan amount in tens of thousands, prior defaults 0-5. Need standardized coefficients for comparison. The prior_defaults coefficient (1.8) looks large, but it operates on a 0-5 scale. Credit_score's -0.015 operates on a 300-850 scale, so its effect is actually -0.015 * 550 ≈ -8.25 across the range.
  3. Logit = 5.2 + (-0.0002)(80000) + (-0.015)(720) + (0.00005)(25000) + 1.8(0) = 5.2 - 16 - 10.8 + 1.25 + 0 = -20.35. Probability = sigmoid(-20.35) ≈ 0.000 (essentially zero default risk).
  4. Logit = -20.35 + 1.8(2) = -16.75. Probability = sigmoid(-16.75) ≈ 0.000 — still essentially zero. The strong negative contributions from income and credit score dominate. This customer is low risk regardless.
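The logit arithmetic in parts 3 and 4 can be checked in a few lines (a verification sketch using the stated coefficients):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

coefs = {'income': -0.0002, 'credit_score': -0.015,
         'loan_amount': 0.00005, 'prior_defaults': 1.8}
intercept = 5.2

def default_logit(customer):
    # Linear score before the sigmoid is applied
    return intercept + sum(coefs[k] * v for k, v in customer.items())

customer = {'income': 80_000, 'credit_score': 720,
            'loan_amount': 25_000, 'prior_defaults': 0}
z0 = default_logit(customer)
z2 = default_logit({**customer, 'prior_defaults': 2})
print(f"0 prior defaults: logit = {z0:.2f}, P(default) = {sigmoid(z0):.2e}")
print(f"2 prior defaults: logit = {z2:.2f}, P(default) = {sigmoid(z2):.2e}")
```

Two prior defaults raise the logit by 3.6, but both scores are so far below zero that the predicted default probability stays essentially zero.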

Exercise 27.10: The cost of errors in healthcare ⭐⭐

A classification model screens mammograms for breast cancer. Results on a test set of 10,000 mammograms (200 have cancer):

Confusion Matrix:
                Predicted
              Cancer    No Cancer
Actual Cancer   170        30
   No Cancer    800      9,000
  1. Calculate accuracy, precision, and recall
  2. This model has low precision. What does that mean in practice?
  3. This model has high recall. What does that mean in practice?
  4. Is this tradeoff acceptable for a screening test? Why or why not?
  5. What happens if you raise the threshold to improve precision?
Guidance
  1. Accuracy = (170 + 9000) / 10000 = 91.7%. Precision = 170 / (170 + 800) = 17.5%. Recall = 170 / (170 + 30) = 85.0%.
  2. Low precision (17.5%) means: of every 100 patients flagged for follow-up, only about 18 actually have cancer. Most flagged patients are false positives who will undergo unnecessary biopsies and anxiety.
  3. High recall (85%) means: the model catches 85% of actual cancers. 30 of the 200 cancers are missed (15%) — these patients won't receive early treatment.
  4. For a screening test, this tradeoff is generally acceptable. The purpose of screening is to identify potential cases for further evaluation — high recall is critical because missing cancer can be fatal. The false positives lead to additional testing, which is costly and stressful but not dangerous. However, the 15% miss rate is concerning and could be improved.
  5. Raising the threshold would increase precision (fewer false positives) but decrease recall (more missed cancers). Given the stakes, this is a dangerous tradeoff in cancer screening.

Exercise 27.11: Comparing models with different error profiles ⭐⭐

Two models predict customer churn:

Metric      Model A   Model B
Accuracy    85%       82%
Precision   60%       45%
Recall      50%       80%
F1          54.5%     57.6%
  1. Which model catches more actual churners?
  2. Which model has fewer false alarms?
  3. If it costs $50 to contact a customer for retention but a lost customer costs $500 in lost revenue, which model saves more money?
  4. Explain why accuracy alone is insufficient for choosing between these models.
Guidance
  1. **Model B** — recall of 80% vs. 50%. It catches 80% of customers who will churn.
  2. **Model A** — precision of 60% vs. 45%. Of the customers it flags, 60% actually churn (vs. 45% for Model B).
  3. Suppose 100 customers, 20 of whom will churn.
     - Model A: catches 10 churners (recall 50%), flags 10/0.6 ≈ 17 total. Cost = 17 * $50 = $850 in contacts, plus 10 * $500 = $5,000 in lost revenue from the 10 missed churners: $5,850 total.
     - Model B: catches 16 churners (recall 80%), flags 16/0.45 ≈ 36 total. Cost = 36 * $50 = $1,800 in contacts, plus 4 * $500 = $2,000 in lost revenue from the 4 missed churners: $3,800 total.
     **Model B saves more money** despite lower accuracy and precision.
  4. Accuracy doesn't tell you about the types of errors. In churn prediction, the cost of missing a churner ($500) far exceeds the cost of a false alarm ($50). Accuracy treats all errors equally, which is inappropriate when error costs are asymmetric.
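The cost comparison in part 3 can be sketched as code, assuming (as the guidance does) that every contacted churner is retained:

```python
# 100 customers, 20 will churn; $50 per retention contact, $500 per lost customer
churners, contact_cost, loss = 20, 50, 500

for name, precision, recall in [('Model A', 0.60, 0.50),
                                ('Model B', 0.45, 0.80)]:
    caught = round(churners * recall)       # churners correctly flagged
    contacted = round(caught / precision)   # total customers contacted
    missed = churners - caught              # churners who slip through
    cost = contacted * contact_cost + missed * loss
    print(f"{name}: contact {contacted}, miss {missed}, total cost ${cost:,}")
# Model A: contact 17, miss 10, total cost $5,850
# Model B: contact 36, miss 4, total cost $3,800
```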

Exercise 27.12: Probability calibration ⭐⭐

Your logistic regression model says a customer has a 0.72 probability of churning. Your manager asks: "So there's a 72% chance they'll leave?"

  1. Is this interpretation correct? What does the 0.72 actually mean?
  2. How would you test whether the model's probabilities are well-calibrated?
  3. Why are calibrated probabilities important for business decisions?
Guidance
  1. Ideally, yes — a well-calibrated model's 0.72 means approximately 72% of customers with this score will actually churn. But in practice, model probabilities are not always well-calibrated. The 0.72 is the model's estimate, not a guaranteed probability. It's more accurate to say "the model assigns this customer a risk score of 0.72 on a 0-1 scale."
  2. Group predictions into buckets (0.0-0.1, 0.1-0.2, etc.) and compare the average predicted probability to the actual fraction of positives in each bucket. If the model says "0.7 probability" and about 70% of such cases are positive, it's well-calibrated. This is called a calibration plot or reliability diagram.
  3. Calibrated probabilities are essential for decisions that depend on the *magnitude* of risk. If you're deciding whether to offer a customer a $100 discount to prevent $500 in lost revenue, you need to know the actual probability of churn, not just a relative score. Well-calibrated probabilities enable expected-value calculations.
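The bucket check in part 2 can be sketched on synthetic data (assumed data: the "model" here is perfectly calibrated by construction, so predicted and observed rates should match within sampling noise):

```python
import numpy as np

rng = np.random.default_rng(42)
proba = rng.uniform(0, 1, 5000)                           # stand-in model probabilities
churned = (rng.uniform(0, 1, 5000) < proba).astype(int)   # outcomes drawn at those rates

# Group predictions into ten buckets and compare predicted vs. observed
bins = np.clip((proba * 10).astype(int), 0, 9)
for b in range(10):
    mask = bins == b
    print(f"bucket {b/10:.1f}-{(b+1)/10:.1f}: "
          f"mean predicted {proba[mask].mean():.2f}, "
          f"observed rate {churned[mask].mean():.2f}")
```

scikit-learn's `sklearn.calibration.calibration_curve` performs the same bucketing and returns the per-bucket pairs for plotting a reliability diagram.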

Part C: Coding Exercises ⭐⭐


Exercise 27.13: Basic logistic regression

Fit a logistic regression model to predict whether a student passes a course:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

np.random.seed(42)
n = 200
students = pd.DataFrame({
    'study_hours': np.random.uniform(0, 15, n),
    'attendance_pct': np.random.uniform(40, 100, n),
})
logit = -4 + 0.3 * students['study_hours'] + 0.04 * students['attendance_pct']
students['passed'] = (np.random.random(n) < 1/(1+np.exp(-logit))).astype(int)

Tasks:
  1. Split into 80/20 train/test sets
  2. Calculate the baseline accuracy (most frequent class)
  3. Fit a LogisticRegression model
  4. Report accuracy on train and test sets
  5. Print the confusion matrix
  6. Calculate precision and recall for "passed"

Guidance
from sklearn.metrics import confusion_matrix, precision_score, recall_score

X = students[['study_hours', 'attendance_pct']]
y = students['passed']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

baseline = y_train.value_counts().max() / len(y_train)
print(f"Baseline accuracy: {baseline:.3f}")

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Train accuracy: {model.score(X_train, y_train):.3f}")
print(f"Test accuracy:  {model.score(X_test, y_test):.3f}")
print(f"Confusion matrix:\n{confusion_matrix(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")

Exercise 27.14: Using predict_proba

Using the model from Exercise 27.13:

  1. Get probability predictions for the test set
  2. Print the probabilities for the first 10 test observations alongside actual values
  3. Sort test observations by probability and identify the 5 most uncertain predictions (probabilities closest to 0.5)
  4. Create a histogram of predicted probabilities
Guidance
import matplotlib.pyplot as plt

proba = model.predict_proba(X_test)[:, 1]

print("First 10 predictions:")
for i in range(10):
    print(f"  P(pass)={proba[i]:.3f}, Actual={y_test.iloc[i]}")

# Most uncertain
uncertainty = np.abs(proba - 0.5)
uncertain_idx = np.argsort(uncertainty)[:5]
print("\nMost uncertain:")
for i in uncertain_idx:
    print(f"  P(pass)={proba[i]:.3f}, Actual={y_test.iloc[i]}")

plt.hist(proba, bins=20, edgecolor='black', alpha=0.7)
plt.xlabel('P(Pass)')
plt.ylabel('Count')
plt.title('Distribution of Predicted Probabilities')
plt.show()

Exercise 27.15: Threshold exploration ⭐⭐

Using the model from Exercise 27.13, explore how different thresholds affect classification performance:

Tasks:
  1. Calculate precision, recall, and F1-score for thresholds from 0.2 to 0.8 (step 0.1)
  2. Create a plot showing precision and recall vs. threshold
  3. At what threshold are precision and recall approximately equal?
  4. If "missing a student who needs help" (false negative) is worse than "over-flagging students for support" (false positive), what threshold would you recommend?

Guidance
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score

thresholds = np.arange(0.2, 0.85, 0.1)
precisions, recalls, f1s = [], [], []

for t in thresholds:
    preds = (proba >= t).astype(int)
    precisions.append(precision_score(y_test, preds, zero_division=0))
    recalls.append(recall_score(y_test, preds, zero_division=0))
    f1s.append(f1_score(y_test, preds, zero_division=0))

plt.plot(thresholds, precisions, 'o-', label='Precision')
plt.plot(thresholds, recalls, 's-', label='Recall')
plt.plot(thresholds, f1s, '^--', label='F1')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.legend()
plt.title('Precision-Recall Tradeoff')
plt.show()
For the help scenario (false negatives are worse), use a lower threshold (0.3-0.4) to maximize recall, accepting lower precision.

Exercise 27.16: Confusion matrix visualization ⭐⭐

Create a visually clear confusion matrix heatmap with labels:

Tasks:
  1. Fit a logistic regression model on the student data
  2. Create a confusion matrix using sklearn
  3. Visualize it as a heatmap with counts and labels
  4. Add a title showing the overall accuracy

Guidance
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(cm, cmap='Blues')
plt.colorbar(im)

labels = ['Failed', 'Passed']
ax.set_xticks([0, 1])
ax.set_yticks([0, 1])
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')

for i in range(2):
    for j in range(2):
        color = 'white' if cm[i,j] > cm.max()/2 else 'black'
        ax.text(j, i, str(cm[i,j]), ha='center',
                va='center', fontsize=20, color=color)

acc = (cm[0,0]+cm[1,1])/cm.sum()
plt.title(f'Confusion Matrix (Accuracy: {acc:.1%})')
plt.tight_layout()
plt.show()

Exercise 27.17: Handling class imbalance ⭐⭐

Create an imbalanced dataset and compare model performance with and without class weighting:

np.random.seed(42)
n = 1000
X_imb = pd.DataFrame({
    'feature1': np.random.normal(0, 1, n),
    'feature2': np.random.normal(0, 1, n),
})
# Only 5% positive class
logit = -3 + 1.5 * X_imb['feature1'] + X_imb['feature2']
y_imb = (np.random.random(n) < 1/(1+np.exp(-logit))).astype(int)

Tasks:
  1. Check the class distribution
  2. Split and fit a standard LogisticRegression
  3. Fit a LogisticRegression with class_weight='balanced'
  4. Compare confusion matrices, precision, and recall for both models
  5. Which model is better at detecting the minority class?

Guidance
print(f"Class distribution: {pd.Series(y_imb).value_counts().to_dict()}")

Xtr, Xte, ytr, yte = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42)

m_std = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
m_bal = LogisticRegression(class_weight='balanced',
                           max_iter=1000).fit(Xtr, ytr)

for name, m in [('Standard', m_std), ('Balanced', m_bal)]:
    pred = m.predict(Xte)
    print(f"\n{name}:")
    print(confusion_matrix(yte, pred))
    print(f"Precision: {precision_score(yte, pred, zero_division=0):.3f}")
    print(f"Recall: {recall_score(yte, pred, zero_division=0):.3f}")
The balanced model should have higher recall (catches more minority class cases) but potentially lower precision (more false positives). For detecting rare events, the balanced model is usually preferred.

Exercise 27.18: Complete vaccination classification ⭐⭐

Build a complete classification pipeline for the vaccination project:

np.random.seed(42)
n = 150
countries = pd.DataFrame({
    'gdp_per_capita': np.random.lognormal(9, 1.2, n),
    'health_spending': np.random.uniform(2, 12, n),
    'education_index': np.random.uniform(0.3, 0.95, n),
})
vacc = 30 + 5*np.log(countries['gdp_per_capita']/1000) + 2*countries['health_spending'] + 25*countries['education_index'] + np.random.normal(0,6,n)
countries['high_vacc'] = (vacc.clip(15,100) >= 80).astype(int)

Tasks:
  1. Report the class distribution
  2. Split 80/20, fit a logistic regression
  3. Print the classification report
  4. Identify the most important feature (largest absolute coefficient)
  5. Print probability predictions for 5 countries with actual labels
  6. Write a 2-sentence summary of the model's performance

Guidance
from sklearn.metrics import classification_report

X = countries[['gdp_per_capita', 'health_spending', 'education_index']]
y = countries['high_vacc']

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
ypred = model.predict(Xte)

print(f"Class distribution: {y.value_counts().to_dict()}")
print(classification_report(yte, ypred, target_names=['Low', 'High']))

for f, c in sorted(zip(X.columns, model.coef_[0]),
                   key=lambda x: abs(x[1]), reverse=True):
    print(f"  {f}: {c:.4f}")

Exercise 27.19: Comparing regression and classification approaches ⭐⭐⭐

Using the vaccination data from Exercise 27.18, build both a linear regression model (predicting vaccination rate as a number) and a logistic regression model (predicting high/low). Compare:

  1. Which approach gives more information?
  2. Can you convert the regression predictions into classifications?
  3. Which model's errors are more interpretable?
Guidance
from sklearn.linear_model import LinearRegression

# Regression model
y_reg = vacc.clip(15, 100)
Xtr_r, Xte_r, ytr_r, yte_r = train_test_split(
    X, y_reg, test_size=0.2, random_state=42)
reg_model = LinearRegression().fit(Xtr_r, ytr_r)
reg_pred = reg_model.predict(Xte_r)

# Convert regression to classification
reg_class = (reg_pred >= 80).astype(int)
actual_class = (yte_r >= 80).astype(int)

print("Regression -> Classification:")
print(confusion_matrix(actual_class, reg_class))

# Direct classification
print("\nDirect Classification:")
print(confusion_matrix(yte, ypred))
Regression gives more information (actual predicted rate) and can be converted to classification at any threshold. Direct classification is simpler but loses nuance. Regression errors (off by X percentage points) are more interpretable than classification errors.

Exercise 27.20: Feature scaling effect ⭐⭐⭐

Investigate whether feature scaling affects logistic regression results:

Tasks:
  1. Fit a logistic regression on unscaled vaccination data
  2. Scale features using StandardScaler and fit again
  3. Compare the coefficients (they should change)
  4. Compare the predictions (they might differ slightly)
  5. Explain why scaling can matter more for logistic regression than for linear regression

Guidance
from sklearn.preprocessing import StandardScaler

# Unscaled
m1 = LogisticRegression(max_iter=1000).fit(Xtr, ytr)

# Scaled
scaler = StandardScaler()
Xtr_s = scaler.fit_transform(Xtr)
Xte_s = scaler.transform(Xte)
m2 = LogisticRegression(max_iter=1000).fit(Xtr_s, ytr)

print("Unscaled coefficients:", m1.coef_[0].round(5))
print("Scaled coefficients:", m2.coef_[0].round(5))

pred1 = m1.predict(Xte)
pred2 = m2.predict(Xte_s)
print(f"Prediction differences: {(pred1 != pred2).sum()}")
Scaling matters for logistic regression because scikit-learn's implementation uses regularization by default (C parameter). Regularization penalizes large coefficients, and without scaling, features on larger scales get penalized more. This means unscaled logistic regression can produce suboptimal results for features with very different scales.

Part D: Synthesis and Critical Thinking ⭐⭐⭐


Exercise 27.21: The threshold as a policy decision

You build a model to flag potentially fraudulent insurance claims. The model outputs probabilities, and you need to choose a threshold. For each stakeholder, explain what threshold they would prefer and why:

  1. The insurance company's fraud investigation team (limited staff)
  2. The company's customer satisfaction department
  3. The company's CEO (concerned about both fraud losses and customer retention)
  4. A consumer protection advocate
Guidance
  1. **Fraud team (high threshold, ~0.7-0.8).** With limited staff, they want to investigate only the most likely cases. High precision means less wasted investigation time.
  2. **Customer satisfaction (high threshold, ~0.8+).** They want to minimize false accusations. Falsely flagging a legitimate claim damages customer trust.
  3. **CEO (moderate threshold, ~0.5-0.6).** Balance between catching fraud (saves money) and not alienating customers. Would likely want a cost-benefit analysis showing the optimal threshold.
  4. **Consumer advocate (high threshold, ~0.8+).** Protect consumers from wrongful denial of legitimate claims. Would argue that the burden of proof should be on the company, not the customer.

Exercise 27.22: Ethical implications of classification in criminal justice ⭐⭐⭐

A classification model predicts whether a person is "high risk" or "low risk" for recidivism (committing another crime) to inform parole decisions.

  1. What features might such a model use?
  2. Why might the training data be biased?
  3. If the model has different false positive rates for different racial groups, what are the implications?
  4. Should such a model be used? Under what conditions, if any?
  5. How does the choice of threshold directly affect people's liberty?
Guidance
  1. Prior criminal history, age at first offense, employment status, education level, substance abuse history, family situation.
  2. Historical criminal justice data reflects decades of biased policing and sentencing. Arrests are not a neutral measure of criminal behavior — over-policing of certain communities inflates arrest rates. The model learns these biases.
  3. If the model has a higher false positive rate for Black defendants (as documented in ProPublica's analysis of the COMPAS tool), then Black individuals are more likely to be wrongly classified as high risk and denied parole. This perpetuates systemic inequality.
  4. Highly debatable. If used, conditions should include: transparency about the model, mandatory bias audits, human override capability, prohibition on using race as a feature (including proxies), and ensuring the model augments rather than replaces human judgment. Many argue such tools should not be used at all given the documented biases.
  5. At a lower threshold, more people are classified as "high risk" — more people are kept in prison or denied parole. The threshold directly determines who is free and who is not. This is perhaps the clearest example of why the threshold is not just a technical parameter but a moral and policy decision.

Exercise 27.23: Model comparison essay ⭐⭐⭐

Write a short essay (200-300 words) comparing linear regression and logistic regression. Address: - When to use each - How they're similar - How they're different - What a beginner is most likely to confuse about them

Guidance A strong essay should cover: Both are linear models that learn weighted combinations of features. The key difference is the output — regression produces a continuous number, classification produces a probability/category. They use the same scikit-learn workflow (fit/predict/score). Common confusions: logistic regression is a classification algorithm despite having "regression" in its name; logistic regression's coefficients can't be interpreted the same way as linear regression's; accuracy is not R-squared; predict vs. predict_proba.

Exercise 27.24: When probabilities beat predictions ⭐⭐⭐

Give three real-world scenarios where probability outputs from logistic regression are more useful than binary yes/no predictions. For each, explain what decision the probability enables that a binary prediction wouldn't.

Guidance
  1. **Medical triage:** A probability of disease (0.82 vs. 0.23) helps doctors prioritize patients. A binary "positive" for both patients would put them in the same queue despite very different risk levels.
  2. **Customer retention:** Rank customers by churn probability to focus the retention budget. A customer with P(churn) = 0.65 might get a discount offer, while P(churn) = 0.95 might not be worth the investment (likely to leave anyway).
  3. **Insurance pricing:** Set premiums proportional to the predicted probability of a claim. A binary prediction (will/won't claim) can't support risk-based pricing.

Exercise 27.25: Building a complete classification report ⭐⭐⭐⭐

Choose a classification problem that interests you (spam detection, disease screening, credit approval, student success, etc.) and design a complete analysis:

  1. Define the problem (target, features, data source)
  2. State the baseline accuracy
  3. Identify whether precision or recall is more important and explain why
  4. Describe what the confusion matrix entries mean in your specific context
  5. Recommend a threshold and justify it based on error costs
  6. Identify one potential source of bias in the training data
  7. State one ethical concern about deploying this model
Guidance This is an open-ended exercise. Strong answers will demonstrate understanding of all chapter concepts applied to a specific, well-defined problem. Key elements: the baseline accuracy provides context (is 85% accuracy impressive or trivial?), the precision-recall priority depends on specific error costs (a strong answer explains the reasoning clearly), and the ethical concern should be specific and thoughtful (not just "bias might exist" but identifying a specific mechanism for bias).