Chapter 25 Exercises: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
How to use these exercises: Part A tests your understanding of core modeling concepts — what models are, why they simplify, and the difference between prediction and explanation. Part B applies the concepts to real-world scenarios. Part C involves Python code for train-test splitting and baselines. Part D pushes toward deeper analysis of the bias-variance tradeoff and ethical considerations.
Difficulty key: ⭐ Foundational | ⭐⭐ Intermediate | ⭐⭐⭐ Advanced | ⭐⭐⭐⭐ Extension
Part A: Conceptual Understanding
Exercise 25.1 — Models as simplifications ⭐
For each of the following, explain what a model of this system would need to include and what it could safely ignore. Then explain why the simplification is useful.
- A model predicting how long a pizza delivery will take
- A model predicting whether a student will graduate on time
- A model predicting tomorrow's temperature
Guidance
1. **Pizza delivery:** Include: distance, time of day, order backlog. Safely ignore: the driver's shoe size, what music is playing, the color of the delivery vehicle. The simplification lets the pizza app give you a useful estimate without knowing everything about the universe.
2. **Student graduation:** Include: GPA, credit hours completed, major, part-time work hours. Safely ignore: favorite color, number of library visits, cafeteria meal choice. The simplification helps advisors identify at-risk students for early intervention.
3. **Temperature:** Include: yesterday's temperature, season, historical averages, weather front movements. Safely ignore: the exact state of every air molecule. The simplification makes weather prediction computationally feasible while still being useful for planning.
Exercise 25.2 — Prediction vs. explanation ⭐
Classify each scenario as primarily a prediction task or an explanation task. Justify your answer.
- A bank wants to know which loan applicants will default so it can reject risky applications.
- A public health researcher wants to know whether a new vaccine reduces infection rates.
- A streaming service wants to recommend movies users will enjoy.
- An economist wants to understand why some countries grow faster than others.
- A retailer wants to forecast next month's sales to plan inventory.
Guidance
1. **Prediction.** The bank cares about accuracy of the forecast (will they default?), not necessarily understanding *why* someone defaults.
2. **Explanation.** The researcher wants to establish a causal mechanism — does the vaccine actually work? Interpretability and causal reasoning matter more than raw prediction accuracy.
3. **Prediction.** The service wants to accurately predict which movies you'll like. The "why" is less important than getting the recommendation right.
4. **Explanation.** The economist wants to understand causal mechanisms — which factors drive growth? Interpretable coefficients matter.
5. **Prediction.** The retailer needs an accurate number for planning. Understanding *why* sales go up or down is secondary to getting the forecast right.
Exercise 25.3 — Supervised vs. unsupervised ⭐
For each task, state whether it is supervised or unsupervised learning. Identify the features and target if supervised, or describe the structure being sought if unsupervised.
- Predicting house prices based on square footage and location
- Grouping customers into segments based on purchasing behavior
- Classifying emails as spam or not spam based on email content
- Reducing a dataset with 50 variables down to 5 key components
- Predicting whether a patient has diabetes based on blood test results
Guidance
1. **Supervised (regression).** Features: square footage, location. Target: house price.
2. **Unsupervised (clustering).** No target — you're looking for natural groupings in the data.
3. **Supervised (classification).** Features: word frequencies, sender info, etc. Target: spam/not spam.
4. **Unsupervised (dimensionality reduction).** No target — you're finding a simpler representation of the data.
5. **Supervised (classification).** Features: blood glucose, BMI, age, etc. Target: diabetes/no diabetes.
Exercise 25.4 — Identifying features and targets ⭐
For each modeling problem, list at least 4 plausible features and identify the target variable. Also state whether the problem is regression or classification.
- Predicting the price of a used car
- Predicting whether a flight will be delayed
- Predicting a student's final exam score
- Predicting whether a tumor is malignant or benign
Guidance
1. **Used car price (regression).** Target: sale price. Features: mileage, year, make/model, condition rating, number of previous owners.
2. **Flight delay (classification — delayed or not).** Target: delayed/on-time. Features: airline, departure airport, time of day, weather conditions, day of week.
3. **Exam score (regression).** Target: exam score. Features: homework average, attendance rate, hours studied, previous exam scores.
4. **Tumor classification (classification).** Target: malignant/benign. Features: tumor size, cell shape uniformity, clump thickness, mitoses count.
Exercise 25.5 — The cardinal rule ⭐
A friend says: "I trained my model on all the data and then tested it on the same data. It got 99% accuracy!" Explain to your friend, in plain language, why this evaluation is unreliable. Use an analogy to make the point clear.
Guidance
Testing on training data is like taking a test where you've already seen all the questions and answers. Getting a high score doesn't prove you understand the material — it only proves you can memorize. The model might have memorized the specific data points rather than learning the underlying patterns. To truly evaluate the model, you need to test it on data it has never seen — just like a real exam uses new questions. The friend should split the data into training and test sets and only evaluate on the test set.
Exercise 25.6 — Overfitting vs. underfitting ⭐
A data scientist trains three models on the same dataset. Here are the results:
| Model | Training Accuracy | Test Accuracy |
|---|---|---|
| Model A | 55% | 53% |
| Model B | 88% | 85% |
| Model C | 99% | 62% |
- Which model is overfitting? How can you tell?
- Which model is underfitting? How can you tell?
- Which model has the best generalization? Why?
Guidance
1. **Model C is overfitting.** Training accuracy (99%) is much higher than test accuracy (62%) — a gap of 37 percentage points. The model has memorized the training data but can't generalize.
2. **Model A is underfitting.** Both training and test accuracy are low (55% and 53%). The model is too simple to capture the patterns in the data.
3. **Model B has the best generalization.** Training and test accuracy are both relatively high and close to each other (88% vs 85% — only 3 points apart). The model has captured real patterns without excessive memorization.
Exercise 25.7 — The bias-variance tradeoff in everyday life ⭐⭐
For each analogy, explain whether the situation reflects high bias, high variance, or a good balance:
- A doctor who diagnoses every patient with the flu, regardless of symptoms
- A chef who follows a completely different recipe every time someone orders the same dish
- A weather app that gives reasonably accurate forecasts that are occasionally off by a few degrees
- A student who memorizes practice exams word-for-word but can't answer questions phrased differently
Guidance
1. **High bias.** The doctor uses an overly simplistic model (everything = flu) and ignores important information. Consistent but consistently wrong for most patients.
2. **High variance.** The chef's output is unpredictable — each version is wildly different. High sensitivity to the specific conditions of each attempt.
3. **Good balance.** Reasonably accurate with small errors. Low bias (generally correct) and low variance (consistently close).
4. **High variance (overfitting).** The student has memorized specific examples rather than learning general patterns. Performance is excellent on seen material and poor on new material.
Exercise 25.8 — Baseline thinking ⭐⭐
For each problem, define an appropriate baseline model and explain what it would predict:
- Predicting whether a coin flip will be heads or tails (50/50 coin)
- Predicting tomorrow's stock price
- Predicting whether a passenger survived the Titanic (where ~38% survived)
- Predicting a country's vaccination rate (mean rate = 82%)
Guidance
1. **Random guess (50/50).** The baseline predicts heads or tails randomly. Expected accuracy: 50%. Any model should beat 50%.
2. **Yesterday's price.** A simple baseline: predict that tomorrow's price equals today's price. This is surprisingly hard to beat for daily stock prices.
3. **Always predict "did not survive."** Since 62% of passengers died, predicting "did not survive" for everyone gives ~62% accuracy. A useful model must exceed this.
4. **Always predict 82%.** Predicting the mean for every country gives a baseline MAE that any useful model must beat.
Exercise 25.9 — The Goldilocks zone ⭐⭐
Explain, using the "connecting dots" analogy from the chapter, what happens at each stage:
- You draw a horizontal flat line through a scatter plot that clearly shows an upward trend
- You draw a straight line that captures the upward trend but misses some curvature
- You draw a smooth curve that captures the overall shape of the data
- You draw a jagged line that passes through every single point
Which of these represents underfitting? Overfitting? A good fit? Which would you expect to perform best on new data?
Guidance
1. **Severe underfitting.** Ignores the trend entirely. High bias, low variance.
2. **Mild underfitting (or acceptable fit).** Captures the main trend but misses some detail. Depending on the actual data, this might be good enough.
3. **Good fit.** Captures the real pattern without fitting noise. Best expected performance on new data.
4. **Overfitting.** Fits every point including noise. Will perform poorly on new data because it's responding to random fluctuations rather than real patterns.
Option 3 would perform best on new data.
Part B: Applied Scenarios ⭐⭐
Exercise 25.10 — Framing a prediction problem
You work for a hospital and have a dataset with the following columns: patient_age, blood_pressure, cholesterol, exercise_hours_per_week, smoker (yes/no), and heart_disease (yes/no).
- Identify the target and features for predicting heart disease
- Is this regression or classification?
- What would a good baseline model be?
- What test_size would you choose and why?
- What ethical concerns should you consider?
Guidance
1. Target: heart_disease (yes/no). Features: patient_age, blood_pressure, cholesterol, exercise_hours_per_week, smoker.
2. Classification — the target is a category (yes/no), not a continuous number.
3. Always predict "no heart disease" if the majority class is "no." You'd need to know the class balance. If 85% of patients don't have heart disease, predicting "no" for everyone gives 85% accuracy — your model needs to beat that.
4. An 80/20 split is reasonable. With medical data, you want enough test data to be confident in the results (maybe even 70/30), especially since false negatives (missing heart disease) have severe consequences.
5. Ethical concerns: false negatives could lead to missed diagnoses. The model might perform differently across demographic groups. Training data might underrepresent certain populations. The model should augment, not replace, doctor judgment.
Exercise 25.11 — When models mislead ⭐⭐
A company builds a model to predict employee performance ratings based on features like years of experience, education level, and department. The model achieves 87% accuracy on the test set.
- What could go wrong if this model is used for hiring or promotion decisions?
- What biases might be in the training data?
- Would you classify this as a prediction task or an explanation task? Why does that distinction matter here?
- What baseline would you compare the 87% against?
Guidance
1. The model could perpetuate existing biases in performance reviews. If certain demographic groups historically received lower ratings due to bias (not actual performance), the model will learn and reproduce that bias.
2. Performance ratings are subjective and may reflect manager biases. Departments may be rated on different scales. Years of experience may correlate with age, introducing age discrimination.
3. This is primarily an explanation task — you want to understand what actually drives performance, not just predict a subjective rating. The distinction matters because a prediction model might achieve high accuracy by learning bias patterns, while an explanation model would reveal those biases.
4. Compare against always predicting the most common rating category. If "meets expectations" is 70% of ratings, the baseline is 70%. Seen against that baseline, 87% is a real but modest improvement, and it might still embed bias.
Exercise 25.12 — Choosing features wisely ⭐⭐
You're building a model to predict whether a student will pass a course. You have access to many variables. For each potential feature below, discuss whether you should include it and why:
- Number of assignments submitted
- Student's name
- Student's ZIP code
- Midterm exam score
- Whether the student passed a different course last semester
- Student's ID number
Guidance
1. **Include.** Directly relevant — assignment completion is a strong predictor of course performance.
2. **Exclude.** Names don't predict course performance, and including them could introduce biases (names can correlate with ethnicity, gender, etc.).
3. **Maybe, with caution.** ZIP code correlates with socioeconomic status, which may predict performance, but using it could reinforce inequalities. Consider the ethical implications carefully.
4. **Include.** A strong, legitimate predictor of final course outcome.
5. **Include.** Prior academic performance is relevant. But consider: could this create a cycle where students who struggled once are permanently disadvantaged?
6. **Exclude.** ID numbers are arbitrary and carry no predictive information. Including them would just add noise.
Exercise 25.13 — The bias-variance tradeoff in practice ⭐⭐
A team is building a model to predict housing prices. They try three approaches:
- Approach A: Use only square footage (1 feature)
- Approach B: Use square footage, bedrooms, bathrooms, and neighborhood (4 features)
- Approach C: Use 200 features including carpet color, mailbox style, distance to nearest fire hydrant, etc.
Predict the likely bias and variance characteristics of each approach. Which is most likely to overfit? Underfit? Achieve the best generalization?
Guidance
- **Approach A:** High bias, low variance. Too simple — many important factors are ignored. Likely to underfit.
- **Approach B:** Moderate bias, moderate variance. Includes the most important features without overloading. Most likely to achieve good generalization.
- **Approach C:** Low bias, high variance. With 200 features (many irrelevant), the model will find spurious patterns in the training data. Most likely to overfit, especially if the training set isn't very large.
Exercise 25.14 — Interpreting train-test results ⭐⭐
Five models are evaluated on the same dataset. Analyze each result:
| Model | Training R² | Test R² | Verdict |
|---|---|---|---|
| Model 1 | 0.15 | 0.12 | ? |
| Model 2 | 0.78 | 0.75 | ? |
| Model 3 | 0.99 | 0.35 | ? |
| Model 4 | 0.85 | 0.83 | ? |
| Model 5 | 0.95 | 0.94 | ? |
- Fill in the "Verdict" column (underfitting, good fit, overfitting)
- Which model would you deploy? Why?
- Why might Model 5 not always be the best choice, even though it has the best test score?
Guidance
1. Model 1: Underfitting (both low). Model 2: Good fit. Model 3: Severe overfitting (huge gap). Model 4: Good fit. Model 5: Excellent fit.
2. Model 5 has the best test performance. However, Models 2 and 4 are also strong options with good generalization.
3. Model 5 might be more complex than Model 4, and if the improvement from 0.83 to 0.94 doesn't matter practically, the simpler model (Model 4) might be preferred for interpretability, speed, or maintainability. Also, test R² of 0.94 might indicate slight data leakage or an unusually easy test set — it warrants investigation.
Part C: Coding Exercises ⭐⭐
Exercise 25.15 — Train-test split in practice
Using the code below as a starting point, perform a train-test split and calculate the baseline MAE for predicting exam scores:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Simulated student data
np.random.seed(42)
n = 200
data = pd.DataFrame({
'study_hours': np.random.uniform(1, 15, n),
'attendance_pct': np.random.uniform(50, 100, n),
'previous_gpa': np.random.uniform(1.5, 4.0, n),
'exam_score': None # You'll generate this
})
# Generate exam_score based on features (with noise)
data['exam_score'] = (
3 * data['study_hours'] +
0.4 * data['attendance_pct'] +
8 * data['previous_gpa'] +
np.random.normal(0, 5, n)
)
Tasks:
1. Define X (features) and y (target)
2. Split into 80/20 training/test sets with random_state=42
3. Calculate the mean of y_train (the baseline prediction)
4. Calculate the baseline MAE on the test set
5. Print a summary of your results
Guidance
X = data[['study_hours', 'attendance_pct', 'previous_gpa']]
y = data['exam_score']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
baseline_pred = y_train.mean()
baseline_mae = np.abs(y_test - baseline_pred).mean()
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Baseline prediction (mean): {baseline_pred:.2f}")
print(f"Baseline MAE: {baseline_mae:.2f}")
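As a natural follow-up, it is worth checking how much a real model improves on this baseline. A minimal standalone sketch (it rebuilds the same simulated student data; the choice of LinearRegression is just an illustration, not part of the exercise):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Rebuild the simulated student data from the exercise
np.random.seed(42)
n = 200
data = pd.DataFrame({
    'study_hours': np.random.uniform(1, 15, n),
    'attendance_pct': np.random.uniform(50, 100, n),
    'previous_gpa': np.random.uniform(1.5, 4.0, n),
})
data['exam_score'] = (3 * data['study_hours'] + 0.4 * data['attendance_pct']
                      + 8 * data['previous_gpa'] + np.random.normal(0, 5, n))

X = data[['study_hours', 'attendance_pct', 'previous_gpa']]
y = data['exam_score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: always predict the training mean
baseline_mae = np.abs(y_test - y_train.mean()).mean()

# A simple model fit on the same split
model = LinearRegression().fit(X_train, y_train)
model_mae = np.abs(y_test - model.predict(X_test)).mean()

print(f"Baseline MAE: {baseline_mae:.2f}")
print(f"Model MAE:    {model_mae:.2f}")
```

If the model's MAE is not clearly below the baseline's, the model has not learned anything useful — which is exactly why the baseline is computed first.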
Exercise 25.16 — Visualizing overfitting
Create a visualization that demonstrates overfitting using polynomial regression. Generate noisy data from a simple function, then fit polynomials of degrees 1, 3, and 12. Plot all three fits on the same or side-by-side plots.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
X = np.sort(np.random.uniform(0, 6, 25))
y = np.sin(X) + np.random.normal(0, 0.3, 25)
Tasks:
1. Fit polynomial models of degree 1, 3, and 12
2. Plot each fit alongside the data points
3. Add titles indicating "Underfitting," "Good Fit," and "Overfitting"
4. Which model would you expect to perform best on new data?
Guidance
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
X_plot = np.linspace(0, 6, 200)
for ax, degree, label in zip(axes, [1, 3, 12],
                             ['Underfitting (deg 1)', 'Good Fit (deg 3)', 'Overfitting (deg 12)']):
    coeffs = np.polyfit(X, y, degree)
    y_plot = np.polyval(coeffs, X_plot)
    ax.scatter(X, y, color='steelblue', zorder=5)
    ax.plot(X_plot, y_plot, color='coral', linewidth=2)
    ax.set_title(label)
    ax.set_ylim(-2, 2.5)
plt.tight_layout()
plt.show()
The degree-3 model best captures the sinusoidal pattern and would perform best on new data.
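The visual judgment can be backed up with a number: draw fresh data from the same underlying process (something the plot alone cannot do) and measure each polynomial's error on it. A standalone sketch under the same seed and noise assumptions as above:

```python
import numpy as np

# Original training sample, as in the exercise
np.random.seed(0)
X = np.sort(np.random.uniform(0, 6, 25))
y = np.sin(X) + np.random.normal(0, 0.3, 25)

# Fresh data from the same process, never seen during fitting
X_new = np.random.uniform(0, 6, 200)
y_new = np.sin(X_new) + np.random.normal(0, 0.3, 200)

mse = {}
for degree in [1, 3, 12]:
    coeffs = np.polyfit(X, y, degree)           # fit on the 25 training points
    mse[degree] = np.mean((y_new - np.polyval(coeffs, X_new)) ** 2)
    print(f"Degree {degree:2d}: MSE on new data = {mse[degree]:.3f}")
```

The degree-3 fit should show the lowest error on the fresh sample: degree 1 misses the curvature (bias), while degree 12 chases the noise in the 25 training points (variance).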
Exercise 25.17 — Comparing train and test scores ⭐⭐
Using the student exam data from Exercise 25.15, fit polynomial regression models of increasing complexity and track both training and test scores to observe overfitting:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
Tasks:
1. Use only study_hours as the feature (for visualization purposes)
2. Fit polynomial models of degrees 1, 2, 3, 5, 10, and 15
3. Record the R² score on both training and test data for each degree
4. Plot training R² and test R² vs. polynomial degree
5. At what degree does overfitting become apparent?
Guidance
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
X_1d = data[['study_hours']]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_1d, y, test_size=0.2, random_state=42)
degrees = [1, 2, 3, 5, 10, 15]
train_scores, test_scores = [], []
for d in degrees:
    poly = PolynomialFeatures(d)
    X_tr_p = poly.fit_transform(X_tr)
    X_te_p = poly.transform(X_te)
    model = LinearRegression().fit(X_tr_p, y_tr)
    train_scores.append(model.score(X_tr_p, y_tr))
    test_scores.append(model.score(X_te_p, y_te))
plt.plot(degrees, train_scores, 'o-', label='Training')
plt.plot(degrees, test_scores, 's-', label='Test')
plt.xlabel('Polynomial Degree')
plt.ylabel('R² Score')
plt.title('Overfitting: Train vs Test')
plt.legend()
plt.show()
Overfitting typically becomes apparent around degree 5-10, where training R² keeps rising but test R² drops.
Exercise 25.18 — The effect of training set size ⭐⭐
Investigate how training set size affects model performance. Using the exam data:
- Fix the test set (use the same 20% test split)
- Train a LinearRegression model using increasing amounts of training data: 10%, 20%, 40%, 60%, 80%, and 100% of the training set
- Record the test R² for each training set size
- Plot test R² vs. training set size
- What pattern do you observe? How does this relate to the bias-variance tradeoff?
Guidance
from sklearn.linear_model import LinearRegression
fractions = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
test_r2 = []
for frac in fractions:
    n_samples = int(len(X_train) * frac)
    X_sub = X_train.iloc[:n_samples]
    y_sub = y_train.iloc[:n_samples]
    model = LinearRegression().fit(X_sub, y_sub)
    test_r2.append(model.score(X_test, y_test))
plt.plot(fractions, test_r2, 'o-')
plt.xlabel('Fraction of Training Data Used')
plt.ylabel('Test R²')
plt.title('Learning Curve')
plt.show()
Performance generally improves with more training data, then plateaus. More data reduces variance (the model becomes more stable), which is why bigger datasets tend to reduce overfitting.
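The variance-reduction claim can be made concrete by measuring the train-test gap at different training sizes. A standalone sketch with synthetic data (5 informative features plus 15 irrelevant ones; all names and numbers here are illustrative, not from the chapter's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

def train_test_gap(n_train, n_test=200):
    """Train R² minus test R² for a 20-feature linear model."""
    X = rng.normal(size=(n_train + n_test, 20))   # 5 informative + 15 noise features
    y = X[:, :5] @ np.array([3.0, 2.0, 2.0, 1.0, 1.0]) + rng.normal(0, 2, n_train + n_test)
    X_tr, X_te = X[:n_train], X[n_train:]
    y_tr, y_te = y[:n_train], y[n_train:]
    m = LinearRegression().fit(X_tr, y_tr)
    return m.score(X_tr, y_tr) - m.score(X_te, y_te)

sizes = [30, 60, 120, 500]
gaps = [train_test_gap(n) for n in sizes]
for n, g in zip(sizes, gaps):
    print(f"n_train={n:4d}: train-test R² gap = {g:.3f}")
```

With only 30 training rows, the 20-feature model nearly memorizes its training data and the gap is large; with 500 rows the gap shrinks toward zero. More data stabilizes the fit, which is the variance half of the tradeoff.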
Exercise 25.19 — Baseline comparison ⭐⭐
Build a complete baseline analysis for the progressive project's vaccination data:
# Simulated country indicators
np.random.seed(42)
n = 150
countries = pd.DataFrame({
'gdp_per_capita': np.random.lognormal(9, 1.2, n),
'health_spending_pct': np.random.uniform(2, 12, n),
'education_index': np.random.uniform(0.3, 0.95, n),
'vaccination_rate': None
})
countries['vaccination_rate'] = (
10 * np.log(countries['gdp_per_capita'] / 1000) +
2 * countries['health_spending_pct'] +
30 * countries['education_index'] +
np.random.normal(0, 8, n)
).clip(20, 100)
Tasks:
1. Define features and target
2. Perform an 80/20 train-test split
3. Calculate the mean and median baselines
4. Compute MAE for both baselines on the test set
5. Which baseline is better? Why might the median baseline outperform the mean?
Guidance
X = countries[['gdp_per_capita', 'health_spending_pct',
'education_index']]
y = countries['vaccination_rate']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
mean_baseline = y_train.mean()
median_baseline = y_train.median()
mae_mean = np.abs(y_test - mean_baseline).mean()
mae_median = np.abs(y_test - median_baseline).mean()
print(f"Mean baseline MAE: {mae_mean:.2f}")
print(f"Median baseline MAE: {mae_median:.2f}")
The median baseline often outperforms the mean when the target distribution is skewed, because the median minimizes the mean absolute error. If vaccination rates are left-skewed (piling up at high values), the median will be a better baseline.
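The claim that the median minimizes mean absolute error is easy to verify empirically. A quick sketch with a skewed synthetic sample (the distribution parameters are arbitrary, chosen only to produce a left-skewed shape):

```python
import numpy as np

rng = np.random.default_rng(42)
# Left-skewed sample: most values pile up high, with a tail toward low values
rates = np.clip(100 - rng.lognormal(2.5, 0.6, 1000), 0, 100)

mean_mae = np.abs(rates - rates.mean()).mean()
median_mae = np.abs(rates - np.median(rates)).mean()
print(f"MAE around the mean:   {mean_mae:.2f}")
print(f"MAE around the median: {median_mae:.2f}")
```

The MAE around the median is never larger than the MAE around the mean; the more skewed the sample, the bigger the difference. (The mean, by contrast, minimizes the mean *squared* error.)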
Exercise 25.20 — Random state exploration ⭐⭐
The random_state parameter in train_test_split controls the random split. Investigate how different splits affect your baseline results:
- Run train_test_split with random_state values 0 through 9
- For each split, compute the baseline MAE (mean baseline)
- Calculate the mean and standard deviation of the 10 MAE values
- What does the standard deviation tell you about the reliability of a single train-test split?
Guidance
maes = []
for rs in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=rs)
    baseline = y_tr.mean()  # baseline must come from this split's own training data
    maes.append(np.abs(y_te - baseline).mean())
print(f"MAE across splits: {np.mean(maes):.2f} +/- {np.std(maes):.2f}")
If the standard deviation is large relative to the mean, a single train-test split gives unreliable estimates. This motivates cross-validation (Chapter 29), where you average results across multiple splits.
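As a preview of that idea, the averaging can be done systematically with `KFold`, so every observation serves as test data exactly once. A standalone sketch with stand-in numbers (the synthetic target below is only for illustration, not the chapter's dataset):

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
y_all = rng.normal(82, 10, 150)   # stand-in target, e.g. vaccination rates

maes = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(y_all):
    baseline = y_all[train_idx].mean()          # baseline from this fold's training part
    maes.append(np.abs(y_all[test_idx] - baseline).mean())

print(f"Cross-validated baseline MAE: {np.mean(maes):.2f} +/- {np.std(maes):.2f}")
```

Averaging over the five folds gives a more stable estimate than any single split, and the spread across folds tells you how much a single split can mislead.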
Part D: Synthesis and Critical Thinking ⭐⭐⭐
Exercise 25.21 — The model simplification paradox
George Box said "All models are wrong, but some are useful." Yet we also say "simple models often beat complex ones." Isn't this contradictory? If all models are wrong, shouldn't we try to make them as accurate (complex) as possible?
Write a paragraph explaining why the answer is no, using the bias-variance tradeoff to support your argument.
Guidance
The paradox dissolves when you consider generalization. Yes, all models are wrong — they simplify reality. But adding complexity to reduce wrongness on the training data often increases wrongness on new data (overfitting). The bias-variance tradeoff tells us that beyond a certain point, additional complexity reduces bias (better training fit) but increases variance (worse generalization). The goal isn't to be perfectly right on the data you have — it's to be usefully right on data you haven't seen. Simple models are more stable, more interpretable, and often generalize better. "Wrong but useful" is better than "precisely fitted to noise."
Exercise 25.22 — Ethical analysis: predictive policing ⭐⭐⭐
A city considers using a machine learning model to predict where crimes will occur, deploying more police to high-prediction areas.
- What target variable and features might such a model use?
- What biases could exist in the training data (historical crime data)?
- How might the model create a feedback loop?
- Using the prediction vs. explanation framework, why is explanation especially important here?
- What safeguards would you recommend?
Guidance
1. Target: number of crimes in an area. Features: past crime rates, demographic data, economic indicators, time of day, day of week.
2. Historical crime data reflects where police were deployed and what they chose to enforce, not where crime actually occurred. Areas with more policing have more recorded crime, creating bias.
3. The model predicts more crime in heavily policed areas, which sends more police there, which records more crime, which increases future predictions — a self-reinforcing cycle.
4. Explanation matters because you need to understand *why* the model predicts crime in certain areas. If the answer is "because that's where police were before" rather than "because those conditions lead to crime," the model is perpetuating bias, not preventing crime.
5. Safeguards: regular bias audits, transparency about model features, excluding race/ethnicity proxies, community input, human oversight on deployment decisions, regular evaluation of whether the model reduces or increases disparity.
Exercise 25.23 — The "just predict the average" challenge ⭐⭐⭐
Many real-world machine learning models barely beat the baseline. Research suggests that for some prediction tasks (stock prices, political elections, economic forecasts), simple baselines perform nearly as well as complex models.
- Why might this be the case?
- Does this mean machine learning is useless for these problems?
- What does it tell us about the "signal-to-noise ratio" in these domains?
- How should a data scientist communicate this finding to stakeholders who expected a breakthrough?
Guidance
1. These domains have very high irreducible noise — the outcomes are influenced by countless unpredictable factors (human behavior, random events, chaotic systems). When noise dominates signal, even the best model can't do much better than the baseline.
2. Not useless — even small improvements over the baseline can be valuable at scale. A 0.5% improvement in stock prediction, applied to billions of dollars, is significant. But expectations should be calibrated.
3. Low signal-to-noise ratio. The patterns in the data are weak compared to the randomness. This is a property of the domain, not a failure of the model.
4. Be honest: "The data suggests that this outcome is largely unpredictable with available features. Our model improves on guessing by X%, which translates to Y dollars/lives/units. More sophisticated models showed diminishing returns. We recommend focusing on improving data quality or identifying new features rather than adding model complexity."
Exercise 25.24 — Design your own modeling problem ⭐⭐⭐
Choose a topic that interests you (sports, music, health, environment, etc.) and design a complete supervised learning problem:
- Define the question in plain language
- Identify the target variable (regression or classification?)
- List 5-8 plausible features
- Describe what training data you would need and where you might find it
- Define an appropriate baseline model
- Describe potential ethical concerns
- Predict whether this will be a high-signal or low-signal problem and explain why
Guidance
This is an open-ended exercise. A strong answer should demonstrate understanding of all major concepts from the chapter: clear target/feature identification, thoughtful baseline selection, awareness of bias-variance tradeoff implications (based on the number of features vs. likely data size), and consideration of ethical issues. The signal prediction should relate to how predictable the domain is — predicting physical quantities (e.g., crop yield from weather) is typically higher signal than predicting human behavior (e.g., whether someone will vote).
Exercise 25.25 — Reflection: what changes now? ⭐⭐⭐⭐
Looking back at Chapters 1 through 24, you were learning to understand data that exists. Starting now, you're learning to predict data that doesn't exist yet.
Write a reflection (3-5 paragraphs) on:
- How the shift from description to prediction changes what "success" means in data analysis
- Why the concepts from earlier chapters (data cleaning, visualization, statistical thinking) are still essential for modeling
- One real-world prediction problem you would like to solve and how you would frame it using the vocabulary from this chapter
- What concerns you most about using models to make predictions about people