Chapter 25 Quiz: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff

Instructions: This quiz tests your understanding of Chapter 25. Answer all questions before checking the solutions. For multiple choice, select the best answer. For short answer questions, aim for 2-4 clear sentences. Total points: 95.


Section 1: Multiple Choice (10 questions, 4 points each)


Question 1. George Box's famous statement "All models are wrong, but some are useful" means:

  • (A) Models should never be trusted because they are always inaccurate
  • (B) Models are deliberate simplifications of reality that can still provide valuable predictions or insights
  • (C) Only the most complex models are useful; simple models are always wrong
  • (D) You should keep adding features until the model is no longer wrong
**Answer: (B).** Box's insight is that every model simplifies reality — no model captures all the complexity of the real world. But simplification is a feature, not a bug. By capturing the most important patterns while ignoring irrelevant detail, a model can make useful predictions. (A) goes too far — models can be trusted for specific purposes. (C) reverses the lesson — simpler models are often more useful. (D) contradicts the message — you can never make a model "not wrong."

Question 2. A model designed to understand which factors drive employee satisfaction is primarily:

  • (A) A prediction model
  • (B) An explanation model
  • (C) An unsupervised learning model
  • (D) A baseline model
**Answer: (B).** Understanding which factors drive an outcome is an explanation task. The goal is interpretability — knowing *why* employees are satisfied and which factors matter most, not just predicting a satisfaction score. A prediction model would focus on accuracy; this task focuses on understanding mechanisms.

Question 3. In supervised learning, the term "features" refers to:

  • (A) The variable you are trying to predict
  • (B) The input variables used to make predictions
  • (C) The accuracy of the model
  • (D) The training and test sets combined
**Answer: (B).** Features are the input variables (also called predictors or independent variables) that the model uses to make predictions. (A) describes the target, not the features. (C) and (D) are unrelated concepts.

Question 4. Why must you evaluate a model on data it has never seen during training?

  • (A) Because training data is always low quality
  • (B) Because a model that memorizes training data may not generalize to new situations
  • (C) Because scikit-learn requires it
  • (D) Because test data is always more representative than training data
**Answer: (B).** A model might achieve high accuracy on training data by memorizing specific examples rather than learning general patterns. Testing on unseen data reveals whether the model has truly learned patterns that transfer to new situations — this is generalization. (A) is false — training data quality varies. (C) is false — it's a statistical principle, not a software requirement. (D) is false — both sets are random samples from the same data.

Question 5. A model has a training accuracy of 97% and a test accuracy of 52%. This model is:

  • (A) Underfitting
  • (B) Overfitting
  • (C) Well-generalized
  • (D) A strong baseline
**Answer: (B).** The large gap between training accuracy (97%) and test accuracy (52%) is the hallmark of overfitting. The model has memorized the training data (high training score) but cannot generalize to new data (low test score). Underfitting would show poor performance on *both* sets. A well-generalized model would show similar scores on both sets.
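This train-versus-test diagnosis can be sketched as a rough rule of thumb. The 0.15 gap and 0.60 floor below are illustrative choices for this example, not standard cutoffs:

```python
def diagnose(train_score, test_score, gap=0.15, floor=0.60):
    """Rough heuristic; the gap and floor thresholds are illustrative."""
    if train_score - test_score > gap:
        return "overfitting"        # memorized training data
    if train_score < floor and test_score < floor:
        return "underfitting"       # poor fit on both sets
    return "reasonable fit"         # similar, adequate scores

print(diagnose(0.97, 0.52))  # the model in this question -> overfitting
print(diagnose(0.55, 0.53))  # low everywhere -> underfitting
print(diagnose(0.74, 0.71))  # close scores -> reasonable fit
```

In practice the acceptable gap depends on the problem and the metric, which is why these thresholds should be read as a sketch rather than a rule.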

Question 6. In the bias-variance tradeoff, a model with high bias:

  • (A) Is too complex and memorizes noise
  • (B) Has too many features
  • (C) Makes overly simplistic assumptions and misses real patterns
  • (D) Requires more test data
**Answer: (C).** High bias means the model's assumptions are too rigid or simplistic to capture the real patterns in the data. It underfits — ignoring important signals. (A) describes high variance, not high bias. (B) describes a symptom of potential overfitting. (D) is unrelated.

Question 7. Which of the following is an example of unsupervised learning?

  • (A) Predicting tomorrow's stock price from historical data
  • (B) Classifying emails as spam or not spam
  • (C) Grouping customers into segments based on purchasing behavior
  • (D) Predicting student test scores from study hours
**Answer: (C).** Customer segmentation is unsupervised learning — you're finding natural groups in the data without predefined labels. There's no "correct answer" to learn from; you're discovering structure. (A), (B), and (D) all have known targets (price, spam/not spam, test scores), making them supervised learning.
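Segmentation can be sketched in miniature. Below is a toy one-dimensional k-means on hypothetical yearly-spend figures; the numbers are invented, and a real project would use something like scikit-learn's `KMeans` instead:

```python
def kmeans_1d(values, centers, iterations=10):
    """Toy 1-D k-means: assign each value to its nearest center,
    then move each center to the mean of its group."""
    for _ in range(iterations):
        groups = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            groups[nearest].append(v)
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centers)

# Hypothetical yearly spend; no labels, yet two groups emerge
spend = [120, 130, 125, 900, 950, 880]
print(kmeans_1d(spend, centers=[0, 1000]))  # [125.0, 910.0]
```

Note that nothing told the algorithm which customers are "budget" or "premium" — the structure comes from the data alone, which is exactly what makes this unsupervised.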

Question 8. A baseline model for a regression problem typically:

  • (A) Uses the most complex algorithm available
  • (B) Predicts the mean (or median) of the training target for every observation
  • (C) Uses all available features
  • (D) Achieves at least 90% accuracy
**Answer: (B).** A baseline for regression is the simplest possible prediction — usually the mean or median of the training data, applied to every observation. Any useful model must beat this baseline. (A), (C), and (D) describe goals of more sophisticated models, not baselines. Baselines are intentionally simple.
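A mean baseline is short enough to write by hand. The sketch below uses only the standard library, with invented scores for illustration (scikit-learn's `DummyRegressor` does the same thing):

```python
from statistics import mean

# Hypothetical training and test targets (e.g., exam scores)
y_train = [62, 70, 75, 80, 88]
y_test = [65, 72, 90]

# Baseline: predict the training mean for every test observation
baseline_pred = mean(y_train)

# Mean absolute error of the baseline on the test set
mae = mean(abs(y - baseline_pred) for y in y_test)
print(f"Baseline prediction: {baseline_pred}")
print(f"Baseline MAE: {mae:.2f}")
```

Any real model's error should come in below this baseline MAE; if it doesn't, the model hasn't learned anything the mean doesn't already capture.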

Question 9. What does test_size=0.2 mean in train_test_split?

  • (A) 20% of the data goes to the training set
  • (B) 20% of the data goes to the test set
  • (C) The model will be tested 20 times
  • (D) 20 features will be used
**Answer: (B).** `test_size=0.2` means 20% of the data is reserved for testing and 80% is used for training. This is the most common split ratio.
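The 80/20 arithmetic can be sketched without scikit-learn. The shuffle-then-slice below mimics what `train_test_split` does, with the seed playing the role of `random_state` (real code should still use the library function, which also keeps features and targets aligned):

```python
import random

# Illustrative dataset of 100 rows (indices stand in for rows)
rows = list(range(100))

random.seed(42)           # reproducible shuffle, like random_state=42
random.shuffle(rows)

test_size = 0.2
n_test = int(len(rows) * test_size)      # 20 rows held out
test, train = rows[:n_test], rows[n_test:]

print(len(train), len(test))  # 80 20
```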

Question 10. Which statement best describes the relationship between model complexity and generalization?

  • (A) More complex models always generalize better because they capture more patterns
  • (B) Simpler models always generalize better because they have fewer parameters
  • (C) There is an optimal level of complexity; too simple underfits, too complex overfits
  • (D) Complexity has no effect on generalization
**Answer: (C).** The bias-variance tradeoff dictates that there is a sweet spot. Too simple (high bias) means missing real patterns. Too complex (high variance) means fitting noise. The optimal model is complex enough to capture real patterns but not so complex that it memorizes noise. Neither extreme is ideal.

Section 2: True/False (4 questions, 5 points each)


Question 11. True or False: A model that achieves 100% accuracy on training data is always a good model.

**Answer: False.** 100% training accuracy is often a sign of overfitting — the model has memorized the training data, including its noise. What matters is test accuracy, not training accuracy. A model with 100% training accuracy and 50% test accuracy has learned nothing useful about the underlying patterns.

Question 12. True or False: In supervised learning, the model learns by comparing its predictions to known correct answers.

**Answer: True.** Supervised learning uses labeled data — data where the correct answers (targets) are known. The model makes predictions, compares them to the actual values, and adjusts to reduce the error. The "supervision" comes from having the correct answers to learn from.

Question 13. True or False: The bias-variance tradeoff means you must choose between a model that is consistently wrong (high bias) and a model that is unpredictably wrong (high variance).

**Answer: True (with nuance).** This is essentially correct. High bias means systematically wrong (consistently misses the pattern). High variance means unstably wrong (predictions change dramatically depending on the specific training data). The tradeoff means reducing one tends to increase the other. The goal is to find a balance point with acceptable levels of both.

Question 14. True or False: Increasing the amount of training data always reduces overfitting.

**Answer: False.** More training data generally reduces overfitting: the model sees more examples and is less likely to memorize any particular quirk, and with enough data even complex models can generalize well. But "always" is too strong. If the model is fundamentally too complex for the problem, adding data helps but may not fully eliminate overfitting, and if the new data introduces additional noise, the benefit may be limited.

Section 3: Short Answer (3 questions, 5 points each)


Question 15. Explain the difference between regression and classification in supervised learning. Give one example of each.

**Answer:** **Regression** predicts a continuous numerical value. Example: predicting a house's sale price from its features (square footage, bedrooms, location). The target is a number on a continuous scale. **Classification** predicts a category or class label. Example: predicting whether an email is spam or not spam based on its content. The target is a discrete category, not a number. The key distinction is the nature of the target: continuous numbers for regression, discrete categories for classification.

Question 16. What is data leakage, and why is it dangerous?

**Answer:** **Data leakage** occurs when information from the test set (or from the future) is used during model training. This gives the model an unfair advantage, making it appear to perform better than it actually would on truly new data. It's dangerous because it creates overconfident evaluations — you think your model is good, but when you deploy it in the real world (where it can't peek at future data), its performance drops. The most common form is failing to properly separate training and test data.
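A common concrete form of leakage is computing preprocessing statistics on all the data, test set included, instead of on the training set alone. A minimal sketch with invented numbers:

```python
from statistics import mean, stdev

train = [10.0, 12.0, 11.0, 13.0]
test = [30.0, 29.0]  # test distribution differs; illustrative values

# LEAKY: scaling statistics computed on train AND test together,
# so information about the test set sneaks into preprocessing
m_all, s_all = mean(train + test), stdev(train + test)
leaky = [(x - m_all) / s_all for x in test]

# CORRECT: statistics fitted on training data only, then applied to test
m, s = mean(train), stdev(train)
correct = [(x - m) / s for x in test]

print("leaky:  ", [round(v, 2) for v in leaky])
print("correct:", [round(v, 2) for v in correct])
```

The two versions give different scaled values, and the leaky version makes the test points look closer to the training distribution than they really are.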

Question 17. Why is it important to start with a baseline model before building a complex one?

**Answer:** A baseline model establishes the minimum performance that any useful model must beat. Without a baseline, you cannot judge whether your model is actually learning patterns or whether its performance is trivially achievable. For example, if 90% of emails are not spam, a model with 90% accuracy might seem impressive until you realize it could achieve that by simply predicting "not spam" for everything. Baselines keep you honest and provide context for evaluating model quality.
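The 90%-not-spam example can be checked directly. A minimal majority-class baseline using only the standard library, with invented labels (scikit-learn's `DummyClassifier` with `strategy="most_frequent"` is the library equivalent):

```python
from collections import Counter

# Hypothetical dataset: 90% of emails are not spam
labels = ["not spam"] * 90 + ["spam"] * 10

# Majority-class baseline: always predict the most frequent label
majority, count = Counter(labels).most_common(1)[0]
baseline_accuracy = count / len(labels)

print(majority)           # not spam
print(baseline_accuracy)  # 0.9
```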

Section 4: Applied Scenarios (2 questions, 7.5 points each)


Question 18. A company builds a model to predict which customers will cancel their subscriptions. The model achieves 95% accuracy on test data. However, only 5% of customers actually cancel.

  1. What is the baseline accuracy for this problem?
  2. Is the model's 95% accuracy impressive? Why or why not?
  3. What additional information would you need to properly evaluate this model?
**Answer:**
1. **Baseline accuracy: 95%.** If you predict "will NOT cancel" for every customer, you'd be right 95% of the time (since only 5% actually cancel). This is the most-frequent-class baseline.
2. **No, 95% is not impressive** — it matches the baseline exactly. The model might simply be predicting "will not cancel" for everyone, which requires no machine learning at all. It hasn't demonstrated any ability to identify the 5% who will cancel.
3. **You need to know:** (a) how many actual cancelers the model correctly identifies (recall/sensitivity); (b) how many predicted cancelers actually cancel (precision); (c) the confusion matrix showing true positives, false positives, true negatives, and false negatives. Overall accuracy is misleading when classes are highly imbalanced — you need metrics that evaluate performance on the minority class specifically.
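These metrics fall straight out of confusion-matrix counts. The counts below are invented to show how a churn model can post high accuracy while catching fewer than half the cancelers:

```python
# Hypothetical confusion-matrix counts for 1,000 customers (50 cancel)
tp, fp = 20, 30   # predicted "will cancel": 20 right, 30 wrong
fn, tn = 30, 920  # predicted "will not cancel": 30 missed cancelers

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of predicted cancelers, fraction who cancel
recall = tp / (tp + fn)     # of actual cancelers, fraction we catch

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Here accuracy is 0.94, yet precision and recall are both 0.40: the model misses most of the very customers it was built to find.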

Question 19. You are building a model to predict exam scores and you have 500 students in your dataset. You try two approaches:

  • Model A: Uses 3 features (study hours, attendance, prior GPA). Training R²=0.72, Test R²=0.70.
  • Model B: Uses 50 features (including birth month, eye color, phone brand, etc.). Training R²=0.94, Test R²=0.45.
  1. Which model is overfitting? How do you know?
  2. Which model would you deploy? Why?
  3. What does Model B's performance tell you about many of its 50 features?
  4. How does this illustrate the bias-variance tradeoff?
**Answer:**
1. **Model B is overfitting.** The gap between training R² (0.94) and test R² (0.45) is 0.49 — nearly half the scale. The model has memorized training data rather than learning generalizable patterns.
2. **Deploy Model A.** Training and test R² are close (0.72 vs 0.70), indicating good generalization. The model captures real patterns that transfer to new data. While 0.70 is lower than Model B's training score, it's much higher than Model B's test score, which is what actually matters.
3. **Many of Model B's 50 features are noise.** Features like birth month, eye color, and phone brand have no real relationship with exam scores. But with 50 features and only 500 observations, the model can find spurious patterns — random correlations that exist in the training data but don't generalize.
4. **Model A** has moderate bias (only 3 features, so it misses some real patterns) but low variance (stable, generalizable). **Model B** has low bias (50 features let it fit the training data closely) but high variance (wildly different performance on training vs. test data). The bias-variance tradeoff shows that Model A's slightly higher bias is more than compensated by its much lower variance.
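The point about noise features can be demonstrated: with a small sample, a feature that is pure random noise can still show a nonzero sample correlation with the target. Everything below is simulated for illustration:

```python
import random

random.seed(0)
n = 20  # small sample, so chance correlations are easy to find

scores = [random.gauss(75, 10) for _ in range(n)]  # simulated exam scores
noise = [random.random() for _ in range(n)]        # pure-noise "feature"

def corr(xs, ys):
    # Pearson correlation, computed by hand
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(f"sample correlation with pure noise: {corr(scores, noise):+.2f}")
```

Give a model 50 such noise features and only 500 rows, and some of those chance correlations will look like signal to the training procedure, which is exactly the high-variance failure Model B exhibits.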

Section 5: Code Analysis (1 question, 5 points)


Question 20. What is wrong with the following code? Identify the error and explain why it leads to unreliable evaluation.

```python
from sklearn.linear_model import LinearRegression

# Load data
X = df[['feature1', 'feature2', 'feature3']]
y = df['target']

# Train model on ALL data
model = LinearRegression()
model.fit(X, y)

# Evaluate model on the SAME data
score = model.score(X, y)
print(f"Model R² score: {score:.3f}")
```
**Answer:**
**The error:** The model is trained and evaluated on the same data. There is no train-test split. The code uses all the data for training (`model.fit(X, y)`) and then evaluates on that same data (`model.score(X, y)`).
**Why this is unreliable:** The reported R² score reflects how well the model fits data it has already seen, not how well it would perform on new, unseen data. This is like testing a student on questions they've already studied — a high score doesn't prove understanding. The model might have memorized the training data (overfitting), and the reported score would overestimate its real-world performance.
**The fix:** Split the data first:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"Model R² score: {score:.3f}")
```