Quiz: Multiple Regression
Test your understanding of multiple regression coefficients, adjusted $R^2$, the F-test, multicollinearity, dummy variables, residual diagnostics, and interaction terms. Try to answer each question before revealing the answer.
1. The main advantage of multiple regression over simple regression is:
(a) It always produces a higher $R^2$ (b) It allows you to control for confounding variables statistically (c) It proves causation while simple regression cannot (d) It requires fewer assumptions
Answer
**(b) It allows you to control for confounding variables statistically.** While (a) is technically true ($R^2$ can never decrease when you add predictors), this is a side effect, not the main advantage. The core benefit is the ability to estimate the effect of one variable while holding others constant, which partially controls for confounders. Multiple regression does NOT prove causation (c) — only randomized experiments do that — and it actually requires *more* assumptions than simple regression, not fewer (d).

2. In the equation $\hat{y} = 12 + 3.5x_1 - 2.1x_2 + 0.8x_3$, the coefficient 3.5 means:
(a) $y$ increases by 3.5 for each unit increase in $x_1$ (b) $y$ increases by 3.5 for each unit increase in $x_1$, holding $x_2$ and $x_3$ constant (c) $x_1$ is 3.5 times more important than $x_3$ (d) There is a 3.5-unit correlation between $x_1$ and $y$
Answer
**(b) $y$ increases by 3.5 for each unit increase in $x_1$, holding $x_2$ and $x_3$ constant.** The phrase "holding other variables constant" is essential. Without it, you're describing a simple regression coefficient (a), which is a different quantity. Option (c) is wrong because coefficients depend on the scale of measurement and cannot be directly compared. Option (d) confuses a regression coefficient with a correlation coefficient.

3. A researcher fits a simple regression and finds the coefficient for advertising spending is 4.2. After adding product quality and customer satisfaction to the model, the coefficient drops to 1.8. This most likely happened because:
(a) The multiple regression model is wrong (b) Advertising doesn't really affect sales (c) Some of advertising's apparent effect was actually due to product quality and customer satisfaction, which are correlated with advertising (d) The researcher made a calculation error
Answer
**(c) Some of advertising's apparent effect was actually due to product quality and customer satisfaction, which are correlated with advertising.** This is the core phenomenon of multiple regression: coefficients change because the simple regression coefficient captures both the direct effect and the indirect effects through correlated variables. The multiple regression coefficient isolates the unique effect. The coefficient dropping from 4.2 to 1.8 doesn't mean the model is wrong — it means the model is more accurately separating the effects.

4. Simpson's Paradox occurs when:
(a) A trend that appears in each subgroup reverses when the subgroups are combined (b) The residuals of a regression are not normally distributed (c) Two predictors have a correlation greater than 0.90 (d) The sample size is too small for the number of predictors
Answer
**(a) A trend that appears in each subgroup reverses when the subgroups are combined.** Simpson's Paradox is the motivation for multiple regression: analyzing data without accounting for a confounding variable can reverse the apparent direction of a relationship. The kidney stone treatment example in Section 23.1 demonstrates this — Treatment A is better for both small and large stones, yet Treatment B appears better overall because of how patients were allocated to treatments.

5. Adjusted $R^2$ differs from $R^2$ because adjusted $R^2$:
(a) Is always larger than $R^2$ (b) Penalizes the model for adding predictors that don't improve fit enough (c) Only increases when a new predictor is statistically significant (d) Measures the correlation between $y$ and $\hat{y}$
Answer
**(b) Penalizes the model for adding predictors that don't improve fit enough.** $R^2$ always increases (or stays the same) when a predictor is added, but adjusted $R^2$ can decrease if the new predictor doesn't explain enough additional variability to justify the lost degree of freedom. Option (a) is backwards — adjusted $R^2$ is always $\leq$ $R^2$. Option (c) is approximately true in practice but not exactly — adjusted $R^2$ can increase even if the predictor's individual p-value is slightly above 0.05. Option (d) describes the multiple correlation coefficient $R$, not adjusted $R^2$.

6. The formula for adjusted $R^2$ is $R^2_{\text{adj}} = 1 - \frac{(1-R^2)(n-1)}{n-k-1}$. For $n = 50$, $k = 4$, and $R^2 = 0.60$, adjusted $R^2$ equals:
(a) 0.564 (b) 0.596 (c) 0.600 (d) 0.636
Answer
**(a) 0.564.** $R^2_{\text{adj}} = 1 - \frac{(1 - 0.60)(50 - 1)}{50 - 4 - 1} = 1 - \frac{0.40 \times 49}{45} = 1 - \frac{19.6}{45} = 1 - 0.4356 = 0.5644 \approx 0.564$. The gap between $R^2$ (0.60) and adjusted $R^2$ (0.564) reflects the penalty for 4 predictors with only 50 observations. With more observations, this gap would shrink.

7. The F-test in multiple regression tests:
(a) Whether each individual predictor is significant (b) Whether the model explains any variability at all ($H_0$: all $\beta_i = 0$) (c) Whether the residuals are normally distributed (d) Whether the predictors are correlated with each other
Answer
**(b) Whether the model explains any variability at all ($H_0$: all $\beta_i = 0$).** The F-test is an omnibus test: it asks whether *at least one* predictor has a real relationship with $y$. It does NOT test individual predictors (a) — that's what the individual t-tests do. It does NOT test assumptions (c) or multicollinearity (d). A significant F-test means the model as a whole is useful; a non-significant F-test means there's no evidence that any of the predictors matter.

8. A regression output shows F-statistic = 28.4 ($p < 0.001$), but the individual predictor $x_2$ has $p = 0.34$. This means:
(a) The model is contradictory and should be discarded (b) The overall model is significant, but $x_2$ doesn't add much beyond what the other predictors already explain (c) There must be a computational error (d) $x_2$ should definitely be removed from the model
Answer
**(b) The overall model is significant, but $x_2$ doesn't add much beyond what the other predictors already explain.** This is perfectly normal. The F-test says at least one predictor matters; it doesn't require ALL predictors to be significant. The predictor $x_2$ might be redundant with other predictors (multicollinearity) or genuinely unrelated to $y$. Option (d) is tempting but too strong — there might be theoretical reasons to keep $x_2$ even if it's not statistically significant, and removing variables based solely on p-values is a form of data-driven model building that can lead to bias.

9. Multicollinearity is a problem because:
(a) It makes the overall model predictions inaccurate (b) It inflates the standard errors of individual coefficients, making them hard to interpret (c) It violates the assumption that residuals are normally distributed (d) It causes $R^2$ to decrease
Answer
**(b) It inflates the standard errors of individual coefficients, making them hard to interpret.** Multicollinearity does NOT affect overall model predictions (a) — the model can still predict well even with highly correlated predictors. It does NOT violate residual assumptions (c) or decrease $R^2$ (d). Its main damage is to individual coefficient estimates: the standard errors balloon, coefficients become unstable (sensitive to small data changes), and it becomes impossible to separate the individual effects of correlated predictors.

10. A VIF of 15 for a predictor means:
(a) The predictor is 15 times more important than the others (b) The predictor's coefficient variance is 15 times larger than it would be without multicollinearity (c) The predictor has a correlation of 0.15 with $y$ (d) 15% of the predictor's variance is unique
Answer
**(b) The predictor's coefficient variance is 15 times larger than it would be without multicollinearity.** VIF stands for Variance Inflation Factor — it literally measures how much the variance (and thus the standard error) of a coefficient is inflated due to correlations with other predictors. A VIF of 15 means the standard error is $\sqrt{15} \approx 3.87$ times larger than it would be if the predictor were uncorrelated with all other predictors. This is above the threshold of 10, indicating severe multicollinearity that should be addressed.

11. For a categorical variable with 5 categories, how many indicator (dummy) variables should be included in the regression model?
(a) 5 (b) 4 (c) 6 (d) 1
Answer
**(b) 4.** You always need $k - 1$ indicator variables for a categorical variable with $k$ categories. The excluded category becomes the reference category — its effect is captured by the intercept. Including all $k$ indicator variables would create perfect multicollinearity (they sum to 1 for every observation, which is perfectly collinear with the constant term), and the model would fail.

12. In a model where the dummy variable C(region)[T.South] has a coefficient of $-3,200$ and the reference category is "North," the correct interpretation is:
(a) The South region has a predicted value of $-3,200$ (b) Observations from the South are predicted to have a value $3,200$ lower than the North, holding other variables constant (c) Moving from North to South decreases $y$ by 3,200 units total (d) The South region has 3,200 fewer observations than the North
Answer
**(b) Observations from the South are predicted to have a value $3,200$ lower than the North, holding other variables constant.** The dummy variable coefficient represents the *difference* from the reference category, not an absolute value (a). The phrase "holding other variables constant" is critical — it's the difference between South and North observations that are otherwise identical on all other predictors in the model. Option (c) ignores the conditional nature of the interpretation. Option (d) confuses coefficients with sample sizes.

13. A researcher includes both "height in inches" and "height in centimeters" as predictors. This will:
(a) Double the predictive power of height (b) Cause perfect multicollinearity and the model will fail (c) Be fine as long as both are significant (d) Slightly improve $R^2$
Answer
**(b) Cause perfect multicollinearity and the model will fail.** Height in centimeters is a perfect linear function of height in inches ($\text{cm} = 2.54 \times \text{inches}$). This creates perfect multicollinearity — the $\mathbf{X}^T\mathbf{X}$ matrix is singular and cannot be inverted to compute the least squares estimates. Python will either drop one variable automatically or raise an error. This is an extreme (but illustrative) example of multicollinearity.

14. A residual plot shows a fan shape — residuals spread out as predicted values increase. This indicates:
(a) Non-linearity (b) Non-independence (c) Heteroscedasticity (unequal variance) (d) Non-normality
Answer
**(c) Heteroscedasticity (unequal variance).** A fan or funnel shape means the variability of residuals changes across the range of predicted values — the model predicts some ranges more precisely than others. This violates the "E" in LINE (Equal variance). Common remedies include transforming the response variable (often $\log(y)$) or using weighted least squares. This does NOT indicate non-linearity (a), which would show a curved pattern, or non-independence (b), which typically shows trends or cycles in time-ordered data.

15. You fit a model predicting employee productivity from training hours, experience, and department (5 departments). The model has $n = 200$ observations. The degrees of freedom for the residuals are:
(a) 200 - 3 - 1 = 196 (b) 200 - 5 - 1 = 194 (c) 200 - 6 - 1 = 193 (d) 200 - 7 - 1 = 192
Answer
**(c) 200 - 6 - 1 = 193.** The model has $k = 6$ predictors: training_hours (1), experience (1), and department (4 dummy variables for 5 categories). The residual degrees of freedom are $n - k - 1 = 200 - 6 - 1 = 193$. Remember: a categorical variable with 5 categories requires 4 dummy variables, so it uses 4 degrees of freedom, not 1.

16. An interaction term between $x_1$ and $x_2$ in a regression model means:
(a) $x_1$ and $x_2$ are correlated with each other (b) The effect of $x_1$ on $y$ depends on the value of $x_2$ (c) $x_1$ and $x_2$ together explain more variance than either alone (d) $x_1$ causes $x_2$
Answer
**(b) The effect of $x_1$ on $y$ depends on the value of $x_2$.** An interaction means the slope for one variable changes depending on the level of another. For example, the effect of study hours on exam scores might be larger for students who also sleep well (positive interaction between study hours and sleep quality). This is different from mere correlation (a), which can exist without interactions. Option (c) describes the additive benefit of multiple predictors, not interactions. Option (d) confuses interaction with causation between predictors.

17. In the model $\hat{y} = 10 + 2x_1 + 5x_2 + 3x_1 x_2$, what is the effect of a one-unit increase in $x_1$ when $x_2 = 4$?
(a) 2 (b) 5 (c) 14 (d) 17
Answer
**(c) 14.** The effect of $x_1$ is its coefficient plus the interaction coefficient times $x_2$: $\frac{\partial \hat{y}}{\partial x_1} = 2 + 3 x_2 = 2 + 3(4) = 14$. When there's an interaction term, the "effect" of a variable is no longer a single number — it changes depending on the value of the interacting variable. At $x_2 = 0$, the effect would be just 2; at $x_2 = 4$, it's 14.

18. Which model building strategy starts with all potential predictors and removes them one at a time?
(a) Forward selection (b) Backward elimination (c) Substantive (theory-driven) model building (d) Stepwise regression
Answer
**(b) Backward elimination.** Backward elimination starts with a full model and removes the least significant predictor at each step. Forward selection (a) starts with no predictors and adds them. Substantive model building (c) is based on theory, not data-driven removal. Stepwise regression (d) combines forward and backward steps. The recommended approach is substantive model building, with forward or backward selection as secondary tools.

19. A social scientist finds that after controlling for income and education, the effect of race on health outcomes is reduced but still statistically significant. The most appropriate conclusion is:
(a) Race directly causes health disparities (b) Income and education are more important than race (c) After accounting for differences in income and education, race is still associated with health outcomes, suggesting additional unmeasured factors or direct effects (d) The regression is invalid because race cannot be randomly assigned
Answer
**(c) After accounting for differences in income and education, race is still associated with health outcomes, suggesting additional unmeasured factors or direct effects.** This is similar to James's criminal justice analysis. The coefficient shrinks (because some of the racial disparity is explained by income and education differences), but it doesn't disappear. Option (a) is too strong — regression cannot prove direct causation. Option (b) makes an unsupported comparison. Option (d) is incorrect — regression can and does study variables that cannot be randomized; it just can't establish causation for them.

20. You're advising a student who has $n = 40$ observations and wants to include 8 predictors. You should:
(a) Encourage them — more predictors always improve the model (b) Warn them that they have only 5 observations per predictor, well below the 10-15 recommended, and suggest reducing the number of predictors (c) Tell them this is fine as long as $R^2 > 0.50$ (d) Suggest they use forward selection to include all 8 anyway

Answer

**(b) Warn them that they have only 5 observations per predictor, well below the 10-15 recommended, and suggest reducing the number of predictors.** With $40 / 8 = 5$ observations per predictor, the model is at high risk of overfitting: coefficients will be unstable and $R^2$ will be inflated. Option (a) is wrong because adding predictors can never decrease $R^2$, yet that says nothing about genuine predictive power. Option (c) is wrong because no $R^2$ threshold compensates for too few observations. Option (d) misunderstands forward selection, which is meant to limit the number of predictors, not to guarantee that all of them enter the model.
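As a closing check, the arithmetic behind several of the numerical questions (adjusted $R^2$ in question 6, the VIF-to-standard-error relationship in question 10, residual degrees of freedom in question 15, and the interaction effect in question 17) can be verified with a few lines of plain Python. The function name `adjusted_r2` is just for this sketch, not from any library:

```python
import math

# Question 6: R^2_adj = 1 - (1 - R^2)(n - 1) / (n - k - 1)
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.60, 50, 4), 3))  # 0.564

# Question 10: a VIF of 15 inflates the coefficient's variance 15-fold,
# so the standard error grows by sqrt(15)
print(round(math.sqrt(15), 2))             # 3.87

# Question 15: 2 numeric predictors + 4 dummies for 5 departments,
# so k = 6 and residual df = n - k - 1
n, k = 200, 2 + 4
print(n - k - 1)                           # 193

# Question 17: effect of x1 in y-hat = 10 + 2*x1 + 5*x2 + 3*x1*x2
# is 2 + 3*x2; evaluate it at x2 = 4
b1, b12, x2 = 2, 3, 4
print(b1 + b12 * x2)                       # 14
```

Running the snippet reproduces the answer-key values 0.564, 3.87, 193, and 14.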