Key Takeaways: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
This is your reference card for Chapter 25. It contains the threshold concept that changes how you think about every model you build, every prediction you read, and every data-driven claim you encounter.
The Threshold Concept
A model is a deliberate simplification of reality. All models are wrong, but some are useful.
This means:
- Every model throws away complexity on purpose, keeping only the patterns that matter for the task at hand
- You should never expect a model to be perfectly right — the question is whether it's useful enough for the decision you need to make
- The art of modeling is choosing what to keep and what to discard
- A model that captures noise is worse than one that ignores it — simplification is a feature, not a bug
Key Concepts
- Model: A simplified representation of a complex process, used to make predictions or understand relationships. Like a map — useful precisely because it doesn't show everything.
- Prediction: Using a model to forecast outcomes for new, unseen data. The goal is accuracy. "What will happen?"
- Explanation: Using a model to understand which factors drive outcomes and how. The goal is interpretability. "Why does it happen?"
- Supervised learning: Learning from labeled data (features + known answers). The model learns the pattern connecting inputs to outputs, then applies it to new inputs.
- Unsupervised learning: Finding structure in data without labels. Clustering, dimensionality reduction, anomaly detection.
- Features (X): The input variables used to make predictions. Also called predictors, inputs, or independent variables.
- Target (y): The variable being predicted. Also called response, output, label, or dependent variable.
- Regression: Predicting a continuous number (price, temperature, vaccination rate).
- Classification: Predicting a category (spam/not spam, high/low, disease/healthy).
Training and Testing
- Training set: Data the model learns from (typically 80%)
- Test set: Data used to evaluate the model (typically 20%)
- The cardinal rule: Never let the model see the test data during training
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
Overfitting vs. Underfitting
| Condition | Training Performance | Test Performance | Problem |
|---|---|---|---|
| Underfitting | Poor | Poor | Model too simple |
| Good fit | Good | Good (similar) | Just right |
| Overfitting | Excellent | Poor | Model too complex |
The diagnostic: Compare training and test scores. A large gap signals overfitting.
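The gap diagnostic takes only a few lines to see in practice. The sketch below is illustrative, not from the chapter: it uses a synthetic sine-wave dataset and a decision tree (both our choices) to compare a deliberately shallow model against an unlimited-depth one.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data -- purely illustrative
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

gaps = {}
for depth in (2, None):  # shallow tree vs. unlimited depth
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    gaps[depth] = train_r2 - test_r2  # large gap = overfitting
    print(f"max_depth={depth}: train R^2 = {train_r2:.2f}, "
          f"test R^2 = {test_r2:.2f}, gap = {gaps[depth]:.2f}")
```

The unlimited-depth tree memorizes the training noise, so its training score is near-perfect while its test score drops — exactly the gap the diagnostic looks for.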
The Bias-Variance Tradeoff
Bias: Error from overly simplistic assumptions (underfitting)
Variance: Error from sensitivity to specific training data (overfitting)
Total Error = Bias² + Variance + Irreducible Noise
Simple model: High bias, low variance → consistently wrong
Complex model: Low bias, high variance → unpredictably wrong
Best model: Balance of both → usually approximately right
The dartboard analogy:
- High bias, low variance: tight cluster, far from the bullseye
- Low bias, high variance: scattered around the bullseye
- Low bias, low variance: tight cluster near the bullseye (the goal)
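The decomposition can also be measured directly. In this sketch (our own setup, assuming a sine-wave "reality" and polynomial fits), we refit each model on many fresh noisy samples and estimate bias² and variance at a single point:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(x)  # the "reality" our models simplify

x_train = np.linspace(0, 3, 20)
x0 = 1.5  # point where we measure the error decomposition
results = {}

for degree in (1, 9):  # simple model (line) vs. complex model (degree-9 polynomial)
    preds = []
    for _ in range(300):  # refit on many fresh noisy draws of the same process
        y_noisy = true_f(x_train) + rng.normal(0, 0.4, x_train.size)
        coefs = np.polyfit(x_train, y_noisy, degree)
        preds.append(np.polyval(coefs, x0))
    preds = np.array(preds)
    bias_sq = (preds.mean() - true_f(x0)) ** 2  # how far off we are on average
    variance = preds.var()                       # how much the fit jumps around
    results[degree] = (bias_sq, variance)
    print(f"degree={degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

The line is consistently wrong (high bias², low variance); the degree-9 polynomial is unpredictably wrong (low bias², high variance) — the dartboard patterns, in numbers.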
Baseline Models
Always establish a baseline before building a complex model:
| Task | Baseline Strategy | What It Tells You |
|---|---|---|
| Regression | Predict the mean of training targets | The minimum error any useful model must beat |
| Classification | Predict the most frequent class | The minimum accuracy any useful model must beat |
If your model can't beat the baseline, it isn't learning anything useful.
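scikit-learn ships these baselines ready-made in DummyClassifier and DummyRegressor. A minimal sketch, on made-up imbalanced data of our own (the features deliberately carry no signal):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Toy imbalanced data -- illustrative only; features carry no signal here
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = (rng.uniform(size=500) < 0.2).astype(int)  # ~80% class 0, ~20% class 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

baseline = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
baseline.fit(X_train, y_train)
acc = baseline.score(X_test, y_test)
print(f"Baseline accuracy: {acc:.2f}")  # roughly 0.80 -- the bar any real model must clear
```

On data this imbalanced, "always predict class 0" already scores around 80% accuracy, which is why raw accuracy without a baseline is so easy to misread.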
The scikit-learn Pattern
Every model in scikit-learn follows the same API:
```python
from sklearn.some_module import SomeModel

model = SomeModel()                    # Create
model.fit(X_train, y_train)            # Train
predictions = model.predict(X_test)    # Predict
score = model.score(X_test, y_test)    # Evaluate
```
The Machine Learning Workflow
1. Frame the problem → What are you predicting? Features? Regression or classification?
2. Prepare the data → Clean, select features, handle missing values
3. Split the data → Training set and test set
4. Establish baseline → What's the simplest possible prediction?
5. Choose a model → Start simple
6. Train the model → model.fit(X_train, y_train)
7. Evaluate the model → model.score(X_test, y_test) — compare to baseline
8. Iterate → If underfitting: more complexity. If overfitting: less complexity.
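The whole workflow fits on one screen. This sketch uses synthetic data and linear regression (our choices, standing in for a real problem), and follows the numbered steps above:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 1-2. Frame & prepare: synthetic data with a genuine linear signal (illustrative)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + 5 + rng.normal(0, 2, size=200)

# 3. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Establish a baseline: always predict the training mean
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

# 5-6. Choose a simple model and train it
model = LinearRegression().fit(X_train, y_train)

# 7. Evaluate both on held-out data and compare
print(f"Baseline R^2: {baseline.score(X_test, y_test):.2f}")
print(f"Model R^2:    {model.score(X_test, y_test):.2f}")
```

The baseline scores near zero R² by construction, and a useful model must clearly beat it — step 8 (iterate) only makes sense relative to that comparison.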
Three Ethical Questions for Every Model
- Who does this model affect? Models that influence decisions about people carry higher stakes.
- What biases might be in the training data? Historical data reflects historical inequalities.
- What happens when the model is wrong? The consequences of errors are rarely symmetric.
Common Pitfalls
- Evaluating on training data instead of test data
- Confusing prediction accuracy with causal explanation
- Assuming more complex is always better
- Ignoring the baseline when interpreting model performance
- Not checking whether the model performs differently across subgroups
- Treating model outputs as ground truth rather than useful estimates
What You Should Be Able to Do Now
- [ ] Define what a model is and why simplification is intentional
- [ ] Distinguish prediction from explanation and choose the right goal
- [ ] Identify features and targets in a supervised learning problem
- [ ] Split data into training and test sets with train_test_split
- [ ] Explain overfitting and underfitting to a non-technical audience
- [ ] Describe the bias-variance tradeoff and its practical implications
- [ ] Build a baseline model and explain why it matters
- [ ] Frame a new problem as a supervised learning task
- [ ] Ask the right ethical questions before building a model about people
The Sentence That Summarizes This Chapter
A model is a deliberate simplification of reality. The goal is not to be perfectly right on the data you have — it's to be usefully right on data you haven't seen yet.
You're ready for Chapter 26, where you'll build your first predictive model: linear regression. You'll take the scatter plots from Chapter 24 (where you measured correlations) and turn them into prediction machines. The concepts from this chapter — features, targets, train-test splits, baselines, overfitting — will all come into play with real code and real results.