Quiz: Chapter 11
Linear Models Revisited
Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.
Question 1 (Multiple Choice)
Which of the following best describes why ordinary least squares (OLS) regression fails with 200 features and 500 observations?
- A) OLS cannot handle more than 100 features
- B) OLS has enough degrees of freedom to fit noise, resulting in overfitting
- C) OLS requires features to be normally distributed
- D) OLS only works with continuous features, not categorical
Answer: B) OLS has enough degrees of freedom to fit noise, resulting in overfitting. With features approaching the number of observations, OLS can find coefficient values that fit the training data nearly perfectly --- including noise --- while failing to generalize to unseen data. The training R-squared will be deceptively high while test R-squared is low.
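To see this concretely, here is a minimal sketch on synthetic data (shapes mirror the question, but the dataset is invented pure noise, so any fit to it is overfitting by construction):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))   # 500 observations, 200 features
y = rng.normal(size=500)          # target is pure noise: no real signal

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
ols = LinearRegression().fit(X_train, y_train)

train_r2 = ols.score(X_train, y_train)  # well above zero despite zero signal
test_r2 = ols.score(X_test, y_test)     # near or below zero
print(f"train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}")
```

With roughly 200 features against 375 training rows, OLS "explains" a large share of pure noise in-sample while the test R-squared collapses.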
Question 2 (Multiple Choice)
Ridge regression (L2) differs from Lasso regression (L1) in that:
- A) Ridge can drive coefficients to exactly zero; Lasso cannot
- B) Lasso can drive coefficients to exactly zero; Ridge cannot
- C) Ridge requires feature scaling; Lasso does not
- D) Lasso minimizes squared residuals; Ridge minimizes absolute residuals
Answer: B) Lasso can drive coefficients to exactly zero; Ridge cannot. The L1 penalty (sum of absolute values) has a geometric property that allows coefficients to reach exactly zero at the corner of the constraint region. The L2 penalty (sum of squares) shrinks coefficients toward zero but never reaches it. This makes Lasso a simultaneous feature selection and regularization method.
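The zeroing behavior is easy to observe on synthetic data (a sketch; the signal structure and alpha values are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
# Only the first three features carry signal; the other 17 are noise
y = 3 * X[:, 0] + 2 * X[:, 1] + X[:, 2] + rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

ridge_zeros = int(np.sum(ridge.coef_ == 0))  # Ridge shrinks but never hits zero
lasso_zeros = int(np.sum(lasso.coef_ == 0))  # Lasso zeroes most noise features
print(f"Ridge zero coefficients: {ridge_zeros}, Lasso zero coefficients: {lasso_zeros}")
```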
Question 3 (Short Answer)
Explain why feature scaling is critical for regularized models but not for unregularized OLS.
Answer: Regularization penalizes large coefficients. If features are on different scales, the coefficient magnitudes reflect the scale of measurement, not the importance of the feature. A feature measured in dollars (0--100,000) will have a tiny coefficient, while a feature measured as a proportion (0--1) will have a large coefficient. Regularization penalizes the proportion feature's coefficient more heavily, not because it is less important, but because of an accident of scale. OLS without regularization has no penalty term, so coefficient scale does not affect the optimization.
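The scale effect can be demonstrated directly (a sketch with two invented features constructed to carry exactly equal signal per standard deviation):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n = 300
dollars = rng.uniform(0, 100_000, n)   # dollar-scale feature
proportion = rng.uniform(0, 1, n)      # proportion-scale feature
# Each feature contributes exactly 1 unit of y per standard deviation
y = dollars / dollars.std() + proportion / proportion.std() + rng.normal(size=n)

ridge = Ridge(alpha=10.0).fit(np.column_stack([dollars, proportion]), y)
# Effect per standard deviation as recovered by the model
dollar_effect = ridge.coef_[0] * dollars.std()
prop_effect = ridge.coef_[1] * proportion.std()
print(f"dollars: {dollar_effect:.2f}, proportion: {prop_effect:.2f}")
```

The dollar feature's effect survives almost untouched (its raw coefficient is tiny, so the squared-coefficient penalty barely registers), while the proportion feature's equally real effect is shrunk, purely because of scale.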
Question 4 (Multiple Choice)
In scikit-learn's LogisticRegression, the parameter C controls regularization strength. Which statement is correct?
- A) Large C means strong regularization
- B) Small C means weak regularization
- C) C = 1/alpha, so large C means weak regularization
- D) C and alpha are the same parameter with different names
Answer: C) C = 1/alpha, so large C means weak regularization. This is the inverse convention: large C reduces the regularization penalty, allowing the model more freedom to fit the data (closer to unregularized). Small C increases the penalty, forcing more coefficient shrinkage. This inverse relationship confuses many practitioners.
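A quick sanity check on synthetic data (the C values are illustrative) shows coefficients growing as C grows, i.e., as the penalty weakens:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Larger C -> weaker regularization -> larger coefficients
norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C).fit(X, y)
    norms[C] = float(np.abs(clf.coef_).sum())
print(norms)
```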
Question 5 (Multiple Choice)
You fit a logistic regression with L1 penalty on a dataset with 100 features. The model selects 23 non-zero coefficients. A colleague argues that the 77 zeroed features are "proven to be irrelevant." What is wrong with this argument?
- A) Nothing; Lasso correctly identifies irrelevant features
- B) Lasso selects features based on the specific regularization strength; a different C value might include or exclude different features
- C) Lasso only works with continuous features, so the zeroed features might be categorical
- D) The 77 features were probably highly correlated with the 23 selected ones, not irrelevant
Answer: B) Lasso selects features based on the specific regularization strength; a different C value might include or exclude different features. Additionally, option D raises a valid concern: among correlated features, Lasso tends to pick one arbitrarily and zero the others. A feature with a zero coefficient may be predictive but redundant with a selected feature, not irrelevant. Lasso performs feature selection conditional on the regularization strength and the specific correlations in the data.
Question 6 (Short Answer)
A model achieves 91.8% accuracy on a churn prediction task where the churn rate is 8.2%. Is this a good model? Explain.
Answer: No. With an 8.2% churn rate, a model that predicts "no churn" for every observation achieves 91.8% accuracy. This is the majority-class baseline, and it is useless because it catches zero churners. Accuracy is misleading for imbalanced classification problems. Metrics like AUC-ROC, precision, recall, and F1 score are more informative because they measure the model's ability to discriminate between classes, not just its agreement with the dominant class.
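The baseline arithmetic in plain Python (counts invented to match the stated churn rate):

```python
# All-negative baseline on a 1,000-row dataset with an 8.2% churn rate
n, churners = 1000, 82
y_true = [1] * churners + [0] * (n - churners)
y_pred = [0] * n  # predict "no churn" for everyone

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / churners
print(accuracy, recall)  # 0.918 0.0
```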
Question 7 (Multiple Choice)
Elastic Net combines L1 and L2 penalties. When is Elastic Net preferred over pure Lasso?
- A) When you have very few features
- B) When features are independent and uncorrelated
- C) When groups of correlated features exist and you want to retain all members of important groups
- D) When you want to maximize the number of zero coefficients
Answer: C) When groups of correlated features exist and you want to retain all members of important groups. Pure Lasso tends to select one feature from a correlated group and zero the rest. Elastic Net's L2 component distributes the weight among correlated features, providing Lasso-like sparsity while handling correlated groups more gracefully.
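The group-retention behavior can be sketched on synthetic data (the correlated trio, noise features, and alpha/l1_ratio values are all invented for illustration; exact counts vary with the penalty settings):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
signal = rng.normal(size=200)
# Three nearly identical copies of one signal, plus five noise features
trio = [signal + 0.05 * rng.normal(size=200) for _ in range(3)]
noise = [rng.normal(size=200) for _ in range(5)]
X = np.column_stack(trio + noise)
y = 3 * signal + rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.3).fit(X, y)

lasso_active = int(np.sum(lasso.coef_[:3] != 0))  # tends to pick from the trio sparsely
enet_active = int(np.sum(enet.coef_[:3] != 0))    # tends to keep the group
print(f"Lasso keeps {lasso_active} of the trio; Elastic Net keeps {enet_active}")
```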
Question 8 (Multiple Choice)
Which of the following is the correct way to apply StandardScaler in a train-test workflow?
- A) Fit the scaler on the entire dataset, then split into train and test
- B) Fit the scaler on the training set, transform both training and test sets
- C) Fit separate scalers on the training and test sets independently
- D) Scale the features manually using the global mean and standard deviation
Answer: B) Fit the scaler on the training set, transform both training and test sets. Fitting on the entire dataset (option A) leaks test-set statistics into the training process. Fitting separate scalers (option C) puts the training and test features on different scales, so the model's coefficients no longer mean the same thing at test time. A scikit-learn Pipeline handles this automatically: the pipeline's fit() runs fit_transform() on the scaler using training data only, and predict() applies just transform() (with the stored training statistics) to new data.
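A minimal Pipeline sketch (synthetic data; the point is where the scaler statistics come from, not the accuracy number):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit() learns the scaler's mean/std from the training split only;
# score()/predict() reuse those same statistics on the test features
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
acc = pipe.score(X_test, y_test)
print(f"test accuracy: {acc:.2f}")
```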
Question 9 (Short Answer)
A logistic regression model has a coefficient of 0.45 for the feature support_tickets (after StandardScaler). Interpret this coefficient.
Answer: A one-standard-deviation increase in support tickets is associated with a 0.45 increase in the log-odds of the positive class (e.g., churn). Converting to an odds ratio: exp(0.45) = 1.57, meaning a one-standard-deviation increase in support tickets is associated with 57% higher odds of churn, holding all other features constant. Because StandardScaler was applied, "one standard deviation" has a concrete meaning tied to the training data's distribution.
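The odds-ratio conversion, as a one-liner:

```python
import math

coef = 0.45                  # coefficient on the standardized feature
odds_ratio = math.exp(coef)
print(round(odds_ratio, 2))  # 1.57: ~57% higher odds per 1-SD increase
```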
Question 10 (Multiple Choice)
You plot a Lasso coefficient path and observe that Feature A activates (becomes non-zero) at alpha = 0.5, while Feature B activates at alpha = 0.1. What can you conclude?
- A) Feature A is more important than Feature B
- B) Feature B is more important than Feature A
- C) Feature A has a larger numeric range than Feature B
- D) Features A and B are correlated
Answer: A) Feature A is more important than Feature B. In the Lasso path, features that activate at higher alpha values survive stronger regularization: at alpha = 0.5, only the most predictive features justify a non-zero coefficient against the penalty. Feature A enters the model while the penalty is still heavy, indicating it carries more predictive value. (As Question 5 notes, correlations among features can distort the activation order, so the path is evidence of importance, not proof.)
Question 11 (Multiple Choice)
Which of the following is NOT a valid reason for a hospital to choose logistic regression over XGBoost for readmission prediction?
- A) Clinicians need to understand why individual patients are flagged
- B) Logistic regression always has higher AUC than XGBoost
- C) Regulatory requirements mandate interpretable models
- D) Faster retraining enables more frequent model updates
Answer: B) Logistic regression always has higher AUC than XGBoost. This is false --- on most complex tabular datasets, XGBoost achieves higher AUC. The other three options are all legitimate reasons to prefer logistic regression despite its typically lower predictive performance.
Question 12 (Short Answer)
Explain the difference between coefficient shrinkage (Ridge) and coefficient elimination (Lasso). Give a scenario where each is preferable.
Answer: Ridge shrinks all coefficients toward zero but keeps them non-zero, reducing the influence of each feature without removing any. Lasso drives some coefficients to exactly zero, effectively removing features from the model. Ridge is preferable when most features are expected to be relevant and you want stable estimates (e.g., a model with 15 carefully selected clinical features). Lasso is preferable when you suspect most features are noise and want automatic feature selection (e.g., a model with 500 one-hot-encoded features after broad feature engineering).
Question 13 (Multiple Choice)
A logistic regression with class_weight='balanced' is trained on a dataset with 92% negative and 8% positive observations. Compared to the default (no class weight), the balanced model will tend to have:
- A) Higher accuracy, lower recall
- B) Lower accuracy, higher recall
- C) Higher accuracy, higher recall
- D) Lower accuracy, lower recall
Answer: B) Lower accuracy, higher recall. The balanced option upweights the minority class in the loss function, making the model more sensitive to positive examples. This shifts the effective decision boundary, catching more positives (higher recall) at the cost of more false positives (lower precision and lower overall accuracy). On a dataset where 92% are negative, a model that predicts all-negative achieves 92% accuracy; the balanced model will sacrifice some of that accuracy to detect the 8% positive class.
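The trade-off can be checked on synthetic imbalanced data (a sketch; the dataset parameters are invented to mimic the 92/8 split):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# ~92% negative / ~8% positive, with some label noise so classes overlap
X, y = make_classification(n_samples=2000, weights=[0.92], flip_y=0.05,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

results = {}
for cw in (None, "balanced"):
    clf = LogisticRegression(class_weight=cw).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    results[cw] = (accuracy_score(y_te, pred), recall_score(y_te, pred))
print(results)  # balanced: accuracy drops, recall rises
```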
Question 14 (Short Answer)
You have built a logistic regression baseline with AUC = 0.82. A colleague builds an XGBoost model with AUC = 0.84. The colleague says: "XGBoost wins; we should deploy it." Write a response outlining the considerations beyond AUC that should inform the deployment decision.
Answer: A 2-point AUC difference is meaningful but must be weighed against operational costs. Consider: (1) Interpretability --- can stakeholders understand why individual predictions are made? Logistic regression provides coefficients; XGBoost requires SHAP. (2) Training and serving costs --- logistic regression trains in seconds, XGBoost in minutes, which matters for frequent retraining. (3) Debugging and maintenance --- a linear model is easier to audit when something goes wrong in production. (4) The business impact of 2 AUC points --- translate it into dollars or outcomes (e.g., additional caught churners) and compare against the operational overhead. A 2-point improvement that translates to $50K/year may not justify doubled engineering effort.
Question 15 (Multiple Choice --- Select Two)
Which TWO of the following are consequences of multicollinearity in linear regression?
- A) The model's overall predictive accuracy on test data always decreases
- B) Individual coefficient estimates become unstable across different random samples
- C) The model fails to converge during training
- D) The sum of coefficients for correlated features may be stable even though individual coefficients are not
- E) Feature scaling resolves multicollinearity
Answer: B) and D). Multicollinearity causes individual coefficients to swing wildly across different subsets of the data (high variance), even though the model's predictions and the combined effect of correlated features may remain stable. Option A is incorrect because multicollinearity does not necessarily reduce predictive accuracy --- it makes coefficients uninterpretable but the predictions may still be reasonable. Option C is incorrect because OLS always converges (it has a closed-form solution), though numerical precision may suffer. Option E is incorrect because scaling does not change the correlation structure between features.
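Both consequences (B and D) show up clearly in a simulation with two nearly identical features (a sketch; the 0.01 noise scale is invented to force near-collinearity):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

individual_coefs, coef_sums = [], []
for seed in range(20):
    r = np.random.default_rng(seed)
    x1 = r.normal(size=100)
    x2 = x1 + 0.01 * r.normal(size=100)   # nearly identical to x1
    y = x1 + x2 + r.normal(size=100)
    model = LinearRegression().fit(np.column_stack([x1, x2]), y)
    individual_coefs.append(model.coef_[0])  # unstable across samples
    coef_sums.append(model.coef_.sum())      # stable across samples

print("std of individual coefficient:", np.std(individual_coefs))
print("std of coefficient sum:       ", np.std(coef_sums))
```

Across resamples, the individual coefficient swings by orders of magnitude more than the sum, which stays near the true combined effect of 2.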
This quiz supports Chapter 11: Linear Models Revisited. Return to the chapter to review concepts.