Chapter 26 Quiz: Linear Regression — Your First Predictive Model

Instructions: This quiz tests your understanding of Chapter 26. Answer all questions before checking the solutions. For multiple choice, select the best answer. For short answer questions, aim for 2-4 clear sentences. Total points: 100.


Section 1: Multiple Choice (10 questions, 4 points each)


Question 1. Linear regression finds the line that minimizes:

  • (A) The sum of the residuals
  • (B) The sum of the squared residuals
  • (C) The number of points above the line
  • (D) The correlation coefficient
Answer **Correct: (B)** Linear regression minimizes the sum of squared residuals (least squares). (A) is wrong because positive and negative errors cancel: any line through the point of means has residuals summing to exactly zero, so the raw sum cannot distinguish among infinitely many candidate lines. (C) is wrong — linear regression doesn't count points on either side. (D) is wrong — while the line of best fit is related to the correlation, the optimization criterion is squared residuals.
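To see the cancellation concretely, here is a small sketch (with made-up data) using NumPy's `polyfit` for the least-squares line: the raw residuals of the best-fit line sum to essentially zero, while the squared sum is the quantity least squares actually minimizes.

```python
import numpy as np

# Hypothetical data points (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# np.polyfit with degree 1 returns the least-squares slope and intercept
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Raw residuals cancel: their sum is ~0, so it can't rank candidate lines
print(round(abs(residuals.sum()), 8))    # 0.0
# The sum of squared residuals is what least squares actually minimizes
print((residuals ** 2).sum() > 0)        # True
```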

Question 2. A linear regression model predicts employee salary: Salary = 30000 + 2500 * YearsExperience. The slope of 2500 means:

  • (A) Every employee earns $2,500
  • (B) Employees start at $2,500
  • (C) Each additional year of experience is associated with $2,500 higher salary
  • (D) The model is 2,500 times more accurate than the baseline
Answer **Correct: (C)** The slope represents the expected change in the target (salary) for a one-unit increase in the feature (years of experience). Each additional year is associated with $2,500 more salary, according to the model. (A) confuses the slope with a fixed value. (B) describes the intercept ($30,000). (D) is nonsensical.
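The slope and intercept can be read directly off the model from the question, sketched here as a plain Python function (`predicted_salary` is a name chosen for illustration):

```python
def predicted_salary(years):
    # The model from the question: intercept 30000, slope 2500
    return 30000 + 2500 * years

print(predicted_salary(0))                         # 30000 — the intercept
print(predicted_salary(5) - predicted_salary(4))   # 2500 — the slope
```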

Question 3. An R-squared value of 0.63 means:

  • (A) The model is 63% accurate
  • (B) 63% of the variance in the target is explained by the features
  • (C) The model makes correct predictions 63% of the time
  • (D) The correlation between features and target is 0.63
Answer **Correct: (B)** R-squared measures the proportion of variance in the target that is explained by the model's features. 63% explained means 37% is left unexplained (due to other factors, noise, or nonlinearity). (A) and (C) confuse R² with accuracy — R² is not about right/wrong predictions. (D) confuses R² with r; R² is the square of the correlation coefficient in simple regression, but the values are different (r would be about 0.79).
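A quick arithmetic check of why (D) is wrong — in simple (one-feature) regression, r is the square root of R², so the two numbers differ:

```python
import math

r_squared = 0.63
r = math.sqrt(r_squared)   # relationship holds only for simple regression
print(round(r, 2))         # 0.79 — r and R² are different values
```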

Question 4. A residual is defined as:

  • (A) The slope of the regression line
  • (B) The difference between the actual value and the predicted value
  • (C) The average of all predictions
  • (D) The R-squared value
Answer **Correct: (B)** Residual = Actual - Predicted. It measures how far off the model's prediction is for a single data point. Positive residuals mean the model underestimated; negative residuals mean it overestimated.
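The definition in one line of code, with hypothetical actual/predicted values showing both signs:

```python
# Residual = Actual - Predicted, computed per data point
actual = [12.0, 8.0]
predicted = [10.0, 9.5]
residuals = [a - p for a, p in zip(actual, predicted)]
print(residuals)  # [2.0, -1.5] — an underestimate, then an overestimate
```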

Question 5. In multiple regression, each coefficient represents:

  • (A) The total effect of that feature on the target
  • (B) The effect of that feature, holding all other features constant
  • (C) The correlation between that feature and the target
  • (D) The importance of that feature relative to all others
Answer **Correct: (B)** In multiple regression, each coefficient represents the expected change in the target for a one-unit increase in that feature, *holding all other features constant*. This "holding constant" part is crucial and distinguishes multiple regression from simple correlation. (A) omits the "holding constant" qualifier. (C) describes bivariate correlation, not a regression coefficient. (D) is only true for standardized coefficients on comparable scales.
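The "holding constant" distinction can be demonstrated with synthetic data (all numbers below are assumptions chosen for the demo): regressing y on x2 alone inflates its slope because x2 is correlated with x1, while multiple regression recovers x2's own effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)    # x2 is correlated with x1
y = 2.0 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=n)

# Simple regression of y on x2 alone: the slope absorbs x1's effect
simple_slope = np.polyfit(x2, y, 1)[0]

# Multiple regression: x2's coefficient with x1 held constant
X = np.column_stack([np.ones(n), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

print(simple_slope > 2.0)            # True — inflated well past the true 1.0
print(abs(coefs[2] - 1.0) < 0.2)     # True — close to the true effect of 1.0
```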

Question 6. Multicollinearity is a problem because:

  • (A) It makes the model's predictions inaccurate
  • (B) It makes individual coefficient estimates unstable and hard to interpret
  • (C) It prevents the model from fitting
  • (D) It always causes overfitting
Answer **Correct: (B)** When features are highly correlated with each other, the model has difficulty separating their individual effects. Coefficients become unstable — they can change dramatically with small changes in the data. However, multicollinearity typically does NOT significantly affect prediction accuracy (A is wrong), the model still fits (C is wrong), and it doesn't necessarily cause overfitting (D is wrong).
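A sketch of this instability, using synthetic data where one feature is nearly a copy of another (the specific scales and seed are arbitrary choices for the demo): individual coefficients wobble between fits, but their sum and the resulting predictions stay stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is nearly a copy of x1
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), x1, x2])
coef_a, *_ = np.linalg.lstsq(X, y, rcond=None)

# Refit after a tiny perturbation of the targets
coef_b, *_ = np.linalg.lstsq(X, y + rng.normal(scale=0.1, size=n), rcond=None)

# Individual coefficients shift, but their sum — and the predictions — barely move
print(abs((coef_a[1] + coef_a[2]) - 3.0) < 0.1)      # True
print(np.abs(X @ coef_a - X @ coef_b).max() < 0.5)   # True
```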

Question 7. Which scikit-learn method trains the model on data?

  • (A) model.predict(X_train, y_train)
  • (B) model.score(X_train, y_train)
  • (C) model.fit(X_train, y_train)
  • (D) model.transform(X_train, y_train)
Answer **Correct: (C)** `model.fit(X_train, y_train)` trains the model by learning the relationship between features (X_train) and target (y_train). `predict` makes predictions on new data. `score` evaluates the model. `transform` is used by preprocessing objects like scalers, not models.
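The three methods in sequence, on toy data that lies exactly on the line y = 2x + 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on y = 2x + 1
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression()
model.fit(X_train, y_train)                # trains: learns slope and intercept
preds = model.predict(np.array([[5.0]]))   # predicts on new data
r2 = model.score(X_train, y_train)         # evaluates: R² for regressors

print(round(preds[0], 6), round(r2, 6))    # 11.0 1.0
```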

Question 8. You fit a linear regression model and get Training R² = 0.92 and Test R² = 0.88. What should you conclude?

  • (A) The model is severely overfitting
  • (B) The model generalizes well — performance is similar on both sets
  • (C) The model is underfitting
  • (D) The test set is too small
Answer **Correct: (B)** A small gap between training R² (0.92) and test R² (0.88) — only 0.04 — indicates the model is generalizing well. It performs almost as well on unseen data as on training data. Severe overfitting would show a much larger gap (e.g., 0.95 vs. 0.50). Underfitting would show low values on both sets.

Question 9. Feature scaling (standardization) in linear regression:

  • (A) Always improves prediction accuracy
  • (B) Changes the prediction formula but not the predictions themselves
  • (C) Is required for the model to work
  • (D) Removes multicollinearity
Answer **Correct: (B)** For basic linear regression, scaling changes the coefficient values (making them comparable across features) but produces identical predictions. The model compensates for the scaling through the intercept and coefficient values. (A) is wrong — scaling doesn't improve basic linear regression accuracy. (C) is wrong — linear regression works fine without scaling. (D) is wrong — scaling doesn't fix multicollinearity.
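This can be verified directly on synthetic data with wildly mixed feature scales (the scales and coefficients below are arbitrary demo choices): the coefficients change, the predictions do not.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3)) * np.array([1.0, 100.0, 0.01])   # mixed scales
y = X @ np.array([2.0, 0.03, 50.0]) + rng.normal(size=50)

raw = LinearRegression().fit(X, y)
scaler = StandardScaler()
scaled = LinearRegression().fit(scaler.fit_transform(X), y)

# Coefficients differ (the scaled ones are per standard deviation)...
print(np.allclose(raw.coef_, scaled.coef_))                              # False
# ...but predictions are identical to floating-point precision
print(np.allclose(raw.predict(X), scaled.predict(scaler.transform(X))))  # True
```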

Question 10. A residual plot shows a clear U-shaped pattern. This suggests:

  • (A) The model is a perfect fit
  • (B) The relationship is linear
  • (C) The relationship is nonlinear and the model is missing a pattern
  • (D) There are outliers in the data
Answer **Correct: (C)** A pattern in the residual plot indicates that the model is systematically wrong in a predictable way — there's a pattern it's not capturing. A U-shape specifically suggests a quadratic (curved) relationship that a linear model can't capture. A good residual plot shows random scatter around zero with no pattern.
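The U shape can be produced on demand by fitting a straight line to purely quadratic data:

```python
import numpy as np

# Quadratic data fit with a straight line
x = np.linspace(-3, 3, 61)
y = x ** 2
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Positive at both ends, negative in the middle: the U shape in numbers
print(residuals[0] > 0, residuals[30] < 0, residuals[-1] > 0)  # True True True
```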

Section 2: True/False (4 questions, 5 points each)


Question 11. True or False: A linear regression model with an R² of 0.95 proves that the features cause the target variable.

Answer **False.** R² measures the strength of the statistical association, not causation. A model can have a very high R² and still be capturing correlational rather than causal relationships. Confounding variables, reverse causation, or spurious correlations can all produce high R² values without any causal mechanism. Causation requires additional evidence beyond model fit (Chapter 24).

Question 12. True or False: Adding more features to a linear regression model always increases the training R².

Answer **True.** Adding a feature can never decrease training R² — at worst, the model assigns it a coefficient of zero and ignores it. In practice, even random noise features slightly increase training R² because the model finds coincidental patterns. This is one reason why training R² alone is misleading and why test R² matters.
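A quick sketch of this effect with synthetic data (the sample size, coefficients, and seed are arbitrary demo choices): appending a column of pure noise still doesn't lower training R².

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 80
X = rng.normal(size=(n, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(size=n)

r2_before = LinearRegression().fit(X, y).score(X, y)

# Append a pure-noise column and refit
X_plus = np.column_stack([X, rng.normal(size=n)])
r2_after = LinearRegression().fit(X_plus, y).score(X_plus, y)

print(r2_after >= r2_before)   # True — even though the new feature is noise
```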

Question 13. True or False: If two features are highly correlated, you should always remove one before fitting a linear regression model.

Answer **False.** The right response depends on your goal. For prediction, multicollinearity usually doesn't hurt — the model's overall predictions remain accurate even if individual coefficients are unstable. For explanation (interpreting coefficients), multicollinearity is problematic and you might remove one feature. "Always remove" is too strong — it depends on the context.

Question 14. True or False: The intercept of a regression model always has a meaningful real-world interpretation.

Answer **False.** The intercept is the predicted target value when all features equal zero. If zero isn't a realistic value for the features (e.g., a person with 0 years of age, a house with 0 square feet, a country with $0 GDP), the intercept is a mathematical artifact with no meaningful interpretation. It's a necessary part of the equation but may not correspond to any real scenario.

Section 3: Short Answer (3 questions, 5 points each)


Question 15. Explain why squaring residuals (instead of taking absolute values) is the standard approach in linear regression. What is one advantage and one disadvantage of squaring?

Answer **Advantage:** Squaring penalizes large errors disproportionately — a residual of 10 contributes 100 to the sum, while a residual of 2 contributes only 4. This means the regression line is strongly pulled toward reducing the largest errors. Squaring also produces a smooth, differentiable objective function with a unique solution, making the optimization mathematically clean. **Disadvantage:** The sensitivity to large errors also means the method is sensitive to outliers. A single extreme data point can pull the regression line substantially because its large squared residual dominates the sum. This can distort the model's fit to the rest of the data.
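The disproportionate penalty in two lines, using the residual values from the answer:

```python
residuals = [10, 2]
print([r ** 2 for r in residuals])   # [100, 4] — the large error dominates
print([abs(r) for r in residuals])   # [10, 2] — penalties stay proportional
```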

Question 16. A model predicts house prices with R² = 0.30. Your colleague says the model is useless. Respond with an argument for why R² = 0.30 might still be valuable.

Answer R² = 0.30 means the model explains 30% of the variance in house prices — 70% is unexplained. While this leaves much variance unaccounted for, the model may still be useful if: (1) It significantly outperforms the baseline (predicting the mean price), reducing average prediction errors meaningfully. (2) In domains where many unobserved factors influence outcomes, 30% explained variance is substantial — house prices depend on countless factors from local school quality to neighborhood trends that may not be in the data. (3) Even imperfect predictions can improve business decisions — a buyer with a rough price estimate is better off than one with no estimate. The relevant question is not "is R² high?" but "does this model help make better decisions than the alternative?"

Question 17. Explain why you should always fit a StandardScaler on the training data only and then apply it to the test data, rather than fitting it on the entire dataset.

Answer Fitting the scaler on the entire dataset means the scaler's parameters (mean and standard deviation) include information from the test set. When you then scale the training data using these parameters, the training data has been influenced by test set statistics — this is a form of data leakage. The model indirectly "sees" information about the test data during training, which can lead to overly optimistic performance estimates. The correct approach: `scaler.fit_transform(X_train)` then `scaler.transform(X_test)` — the scaler learns the distribution from training data only and applies the same transformation to test data.

Section 4: Applied Scenarios (2 questions, 10 points each)


Question 18. You build a model to predict crop yield (tons per acre) using rainfall (inches) and temperature (degrees F):

Yield = 2.1 + 0.15 * Rainfall - 0.08 * Temperature

Training R² = 0.71, Test R² = 0.65

  1. Interpret the rainfall coefficient.
  2. Interpret the temperature coefficient. Does this make physical sense?
  3. Is the model overfitting? Justify your answer.
  4. A farmer in a desert asks you to predict yield for 2 inches of rain and 110 degrees F. What would you predict, and what caveat would you add?
Answer
  1. For each additional inch of rainfall, crop yield increases by 0.15 tons per acre, holding temperature constant.
  2. For each additional degree F of temperature, yield decreases by 0.08 tons per acre, holding rainfall constant. This could make physical sense — excessive heat can stress crops and reduce yields. However, the true relationship might be nonlinear (moderate warmth helps, extreme heat hurts).
  3. Mild overfitting at most. The gap between training R² (0.71) and test R² (0.65) is 0.06 — small enough to be acceptable. The model appears to generalize reasonably well.
  4. Prediction: 2.1 + 0.15(2) - 0.08(110) = 2.1 + 0.3 - 8.8 = -6.4 tons/acre. This is negative — physically impossible. **Caveat:** The model was likely trained on moderate conditions and is being extrapolated to extreme values (very low rainfall, very high temperature). The linear assumption breaks down at extremes, producing nonsensical predictions. This prediction should not be trusted.
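The extrapolation arithmetic from part 4, sketched as a function (`predict_yield` is a name chosen for illustration):

```python
def predict_yield(rainfall, temp_f):
    # The model from the question
    return 2.1 + 0.15 * rainfall - 0.08 * temp_f

y_hat = predict_yield(2, 110)
print(round(y_hat, 2))   # -6.4 — a physically impossible (negative) yield
```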

Question 19. Two teams build models to predict vaccination rates:

Team A's model: Uses 3 features (GDP, health spending, education index). Training R² = 0.68, Test R² = 0.64. Coefficients are stable and interpretable.

Team B's model: Uses 30 features (including internet users, phone subscriptions, CO2 emissions, military spending, etc.). Training R² = 0.94, Test R² = 0.55.

  1. Which team's model is overfitting?
  2. Which model would you trust for making predictions on new countries?
  3. Why does Team B's model have a higher training R² but lower test R²?
  4. What advice would you give Team B?
Answer
  1. **Team B is overfitting.** The gap between training R² (0.94) and test R² (0.55) is 0.39 — very large. The model has memorized training data.
  2. **Team A's model.** Its test R² (0.64) is higher than Team B's test R² (0.55), and the small gap between training and test shows good generalization.
  3. Team B's 30 features give the model enough flexibility to fit the training data very closely — including noise and coincidental patterns. These noise patterns don't generalize, so test performance drops. With 30 features, the model can always find *something* to fit, even if much of it is meaningless.
  4. Reduce the number of features dramatically. Start with the most plausible predictors (GDP, health spending, education — the features Team A used). Add features one at a time, only keeping those that improve test R². Consider whether features like military spending or CO2 emissions have any plausible relationship with vaccination rates. Simplicity almost always wins when you don't have enough data to support complexity.

Section 5: Code Analysis (1 question, 5 points)


Question 20. The following code contains a data leakage error. Identify the error and explain why it produces overly optimistic results.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Scale ALL data first (ERROR!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Then split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Train and evaluate
model = LinearRegression()
model.fit(X_train, y_train)
print(f"Test R²: {model.score(X_test, y_test):.3f}")
Answer **The error:** The scaler is fit on the entire dataset (`scaler.fit_transform(X)`) before the train-test split. This means the scaler's mean and standard deviation include information from the test set. When the training data is scaled using these parameters, it indirectly incorporates test set information — this is data leakage. **Why it's overly optimistic:** The model benefits from having its training data scaled using statistics that include test data, giving it a slight unfair advantage. The test R² will be higher than it would be in a true deployment scenario where future data statistics are unknown. **The fix:** Split first, then scale:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit on train only
X_test_scaled = scaler.transform(X_test)         # Transform test

model = LinearRegression()
model.fit(X_train_scaled, y_train)
print(f"Test R²: {model.score(X_test_scaled, y_test):.3f}")