Chapter 34 Quiz — Predictive Models: Regression and Classification

Instructions: Answer all questions. For multiple choice, select the single best answer. For True/False questions, provide a one-sentence justification. For short answer questions, write 2-4 sentences. The answer key with full explanations follows all questions.


Part A: Multiple Choice (12 questions)

Question 1

You are predicting whether a customer will purchase a premium subscription (yes or no). Which type of model is appropriate?

A) Linear regression
B) Classification
C) Clustering
D) Time series decomposition


Question 2

A linear regression model has R² = 0.82 on the training set and R² = 0.41 on the test set. What does this most likely indicate?

A) The model is underfitting — it needs more features
B) The model is overfitting — it has memorized the training data
C) R² cannot be calculated on a test set
D) The model is performing well; a difference of 0.41 is normal


Question 3

You train a decision tree with max_depth=None on 500 samples. Training accuracy reaches 100%. Which action is most likely to improve test-set performance?

A) Increase the number of features
B) Add more training data only
C) Reduce max_depth to 4 or 5
D) Switch to linear regression


Question 4

Your churn dataset has 950 non-churners and 50 churners. You train a logistic regression model that predicts "not churn" for every customer. What is its accuracy?

A) 5%
B) 50%
C) 95%
D) The accuracy cannot be determined without more information


Question 5

Which scikit-learn function is used to scale features so that each has mean = 0 and standard deviation = 1?

A) MinMaxScaler
B) Normalizer
C) StandardScaler
D) RobustScaler


Question 6

In a confusion matrix for churn prediction, a "False Negative" means:

A) The model predicted churn, and the customer did not churn
B) The model predicted no churn, and the customer did not churn
C) The model predicted no churn, and the customer did churn
D) The model predicted churn, and the customer did churn


Question 7

You are building a model to flag high-risk loan applications for manual review. A false negative (predicting "safe" when the loan will default) costs the bank $50,000. A false positive (flagging a safe loan) costs $500 in review time. Which metric should you optimize?

A) Accuracy
B) Precision
C) Recall
D) Specificity


Question 8

When performing train/test split, you fit a StandardScaler on the training data and then transform both the training and test sets using that fitted scaler. Why is it wrong to fit the scaler on the full dataset before splitting?

A) The scaler cannot handle test data
B) It would allow test data statistics to influence model training, causing data leakage
C) The scaler requires more data than a single split provides
D) StandardScaler does not support fitting on subsets of data


Question 9

A Random Forest classifier with 100 trees reports these feature importances:

login_frequency:      0.34
days_since_purchase:  0.28
account_age_days:     0.19
mrr:                  0.12
support_tickets:      0.07

Which feature would you examine first if your goal is to design a customer engagement intervention?

A) mrr — because it represents the most revenue at stake
B) login_frequency — because it has the highest importance score
C) account_age_days — because older customers are always lower risk
D) support_tickets — because it is the most actionable


Question 10

Cross-validation with 5 folds produces R² scores of: 0.78, 0.81, 0.74, 0.83, 0.76. What is the mean R² and standard deviation?

A) Mean = 0.784, Std = 0.034
B) Mean = 0.782, Std = 0.034
C) Mean = 0.784, Std = 0.038
D) Mean = 0.780, Std = 0.030


Question 11

You have trained a logistic regression churn model with a threshold of 0.50. Your retention team can only follow up with 60 customers per month. Your dataset has 800 customers. What is the correct approach?

A) Retrain the model with fewer customers
B) Lower the threshold to 0.30 to increase the number of customers flagged
C) Sort customers by predicted churn probability descending and select the top 60
D) Raise the threshold to 0.90 to improve precision


Question 12

Which of the following is an example of feature engineering?

A) Splitting the dataset into training and test sets
B) Calculating "days since last purchase" from a raw purchase date column
C) Choosing between logistic regression and random forest
D) Printing feature importances after training


Part B: True / False with Justification (4 questions)

For each statement, write True or False and provide a one-sentence justification.


Question 13

True or False: A model with higher training accuracy than test accuracy is always overfitting and should be discarded.


Question 14

True or False: Logistic regression outputs probabilities between 0 and 1, making it suitable for both ranking customers by risk and for binary classification at a chosen threshold.


Question 15

True or False: Decision trees require feature scaling (StandardScaler or similar) before training because they are sensitive to the absolute magnitude of feature values.


Question 16

True or False: R² can be negative, which indicates the model performs worse than simply predicting the mean of the target variable for every observation.


Part C: Short Answer (4 questions)


Question 17

Explain the difference between precision and recall. Give a business example where high recall is more important than high precision, and a separate example where high precision is more important than high recall.


Question 18

What is the purpose of cross-validation, and why is it preferable to a single train/test split for assessing model performance?


Question 19

Priya is building a customer churn model for Acme Corp. She trains a Random Forest on historical customer data and achieves AUC = 0.89. Sandra Chen asks her: "What does 0.89 AUC mean in plain English?" Write the explanation Priya should give.


Question 20

You have built a regression model to predict annual contract value (ACV) from prospect characteristics. The model has MAE = $8,400 and RMSE = $22,600. The average ACV in your dataset is $45,000. Interpret these two metrics and explain what the gap between MAE and RMSE tells you about the model's error distribution.



Answer Key


Part A: Multiple Choice

Question 1 — Answer: B

Predicting "yes or no" is a binary classification problem. The output is a category (two classes), not a continuous number. Linear regression produces continuous numeric output and is not appropriate for categorical prediction tasks. Clustering is unsupervised (no labels), and time series decomposition addresses temporal patterns in sequential data.


Question 2 — Answer: B

When training accuracy is substantially higher than test accuracy, the model has overfit — it has memorized patterns specific to the training data that do not generalize. A gap of 0.41 between R² scores is not normal; it is a diagnostic sign that the model is too complex relative to the data available. Underfitting (choice A) would produce low accuracy on both training and test sets, not high training accuracy.


Question 3 — Answer: C

max_depth=None allows the tree to grow until every training sample is correctly classified, which guarantees 100% training accuracy but produces a model that memorizes noise rather than learning patterns. Constraining max_depth to 4 or 5 forces the tree to capture only the strongest signals, which typically improves generalization. Adding more training data (choice B) can help but does not address the fundamental issue of an unconstrained tree.
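A minimal sketch of this effect, using synthetic noisy data (the feature construction and random seed are illustrative assumptions, not values from the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data: one weak signal feature plus pure-noise features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# The unconstrained tree scores 100% on the training set but drops on the
# test set; the depth-4 tree often generalizes better on noisy data like this
print("deep train:", deep.score(X_tr, y_tr))
print("deep test: ", deep.score(X_te, y_te))
print("shallow test:", shallow.score(X_te, y_te))
```

The unconstrained tree's perfect training accuracy is the symptom; the train/test gap is the diagnosis.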


Question 4 — Answer: C

If 950 of 1,000 customers are non-churners, a model that always predicts "not churn" will be correct 950/1,000 = 95% of the time. This illustrates why accuracy is a misleading metric for imbalanced datasets — high accuracy can coexist with complete failure to identify the minority class (churners). The model's recall for the churn class would be 0.0.
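The arithmetic can be checked directly with a minimal sketch:

```python
import numpy as np

# 950 non-churners (label 0) and 50 churners (label 1)
y_true = np.array([0] * 950 + [1] * 50)
y_pred = np.zeros(1000, dtype=int)  # always predict "not churn"

accuracy = (y_pred == y_true).mean()
churn_recall = (y_pred[y_true == 1] == 1).mean()

print(accuracy)      # 0.95
print(churn_recall)  # 0.0 -- the model catches zero churners
```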


Question 5 — Answer: C

StandardScaler transforms each feature to zero mean and unit standard deviation (z-score normalization). MinMaxScaler scales to a [0, 1] range. Normalizer scales each sample to unit norm (not each feature). RobustScaler uses median and interquartile range, which is more resistant to outliers but does not guarantee zero mean and unit variance.
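A quick check of the zero-mean, unit-variance property (toy values chosen for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

X_scaled = StandardScaler().fit_transform(X)

# Each column (feature) now has mean 0 and standard deviation 1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```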


Question 6 — Answer: C

A False Negative is a case where the model predicted the negative class (no churn) but the true label was the positive class (churn). The model missed a real churner. In a confusion matrix:

- True Positive: predicted churn, actually churned
- True Negative: predicted no churn, actually did not churn
- False Positive: predicted churn, actually did not churn
- False Negative: predicted no churn, actually did churn
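For reference, scikit-learn's confusion_matrix lays out these four cells with actual labels as rows and predicted labels as columns (the tiny label arrays below are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0]  # 1 = churned, 0 = did not churn
y_pred = [1, 0, 0, 1, 0, 0]

# Layout for binary labels {0, 1}:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # false negatives sit at cm[1, 0]: 2 actual churners predicted "no churn"
```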


Question 7 — Answer: C

When the cost of a false negative is dramatically higher than the cost of a false positive ($50,000 vs. $500 — a 100:1 ratio), recall is the metric to optimize. Recall measures what fraction of actual positives (real defaults) the model captures. Precision measures what fraction of predicted positives are actually positive — useful when false positives are costly, but here false negatives are the primary concern.


Question 8 — Answer: B

Fitting the scaler on the full dataset before splitting means the scaler's mean and standard deviation are computed using test data statistics. This leaks information about the test set into the preprocessing step, giving an artificially optimistic evaluation. In production, you will not have future data available when fitting the scaler — so the scaler should only ever see training data.
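The leakage-free pattern looks like this in code (synthetic data and seed are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(200, 2))
y = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_tr)   # fit on training data ONLY
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)       # reuse the training set's mean and std

# Training columns are exactly standardized; test columns are only approximately
# so, because the test set never influenced the scaler's statistics
print(X_tr_s.mean(axis=0).round(6))
print(X_te_s.mean(axis=0).round(3))
```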


Question 9 — Answer: B

login_frequency has the highest importance score (0.34), meaning the Random Forest found it to be the most predictive feature for distinguishing churners from non-churners. For designing an engagement intervention, the most predictive feature is the most valuable starting point — low login frequency is both a strong churn signal and potentially actionable (you can build interventions to increase engagement).


Question 10 — Answer: A

Mean = (0.78 + 0.81 + 0.74 + 0.83 + 0.76) / 5 = 3.92 / 5 = 0.784

Standard deviation: deviations from the mean are -0.004, +0.026, -0.044, +0.046, -0.024. Squared: 0.000016, 0.000676, 0.001936, 0.002116, 0.000576. Variance = 0.00532 / 5 = 0.001064. Std = sqrt(0.001064) ≈ 0.033 (population standard deviation; with the sample formula, dividing by n - 1 = 4, it is ≈ 0.036).

Choice A pairs the correct mean (0.784) with the standard deviation closest to the population value among the options, so A is the best answer.
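The calculation can be verified with numpy; note the ddof argument, which switches between the population and sample formulas:

```python
import numpy as np

scores = np.array([0.78, 0.81, 0.74, 0.83, 0.76])

print(round(float(scores.mean()), 3))       # 0.784
print(round(float(scores.std()), 4))        # 0.0326 (population std, ddof=0)
print(round(float(scores.std(ddof=1)), 4))  # 0.0365 (sample std, ddof=1)
```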


Question 11 — Answer: C

The correct approach is to generate churn probability scores for all 800 customers, sort them descending by probability, and select the top 60. This maximizes the number of true churners reached within the capacity constraint. Lowering the threshold (choice B) would increase the count of flagged customers beyond 60, which exceeds capacity. Raising the threshold (choice D) would reduce the count below 60, leaving capacity unused.
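A sketch of the ranking step (random probabilities stand in for real model output, and the variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
churn_probs = rng.uniform(size=800)  # stand-in for model.predict_proba(X)[:, 1]
customer_ids = np.arange(800)

# Sort by predicted churn probability, highest first, and take the top 60
top_60 = customer_ids[np.argsort(churn_probs)[::-1][:60]]

print(len(top_60))  # exactly the retention team's monthly capacity
```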


Question 12 — Answer: B

Feature engineering is the process of creating new informative features from raw data. Calculating "days since last purchase" from a raw date column creates a numeric feature that is directly usable by a model and encodes meaningful business information (recency). Splitting data (choice A) is data preparation. Choosing an algorithm (choice C) is model selection. Printing importances (choice D) is model evaluation.
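A sketch of this transformation in pandas (the column names, dates, and reference date are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "last_purchase": pd.to_datetime(["2024-01-10", "2024-02-20", "2023-12-01"]),
})

# Engineer a recency feature from the raw date column
as_of = pd.Timestamp("2024-03-01")
df["days_since_purchase"] = (as_of - df["last_purchase"]).dt.days

print(df["days_since_purchase"].tolist())  # [51, 10, 91]
```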


Part B: True / False

Question 13 — Answer: False

Some gap between training and test accuracy is expected and normal — the key question is whether the gap is small enough that the test performance is still useful for the business purpose; only when the gap is large (the model memorizes training data without generalizing) should it be discarded.


Question 14 — Answer: True

Logistic regression applies the sigmoid function to produce output values strictly between 0 and 1, which can be interpreted as probabilities; these probability scores can be used directly to rank customers from highest to lowest risk, and a threshold (default 0.5 or any chosen value) converts them to binary class predictions.


Question 15 — Answer: False

Decision trees split data based on feature value thresholds (greater than / less than comparisons), so they are invariant to monotonic transformations like scaling — the relative ordering of values is what matters, not their magnitude, which means StandardScaler is unnecessary for decision tree models (though it does not hurt).


Question 16 — Answer: True

R² = 1 - (SS_residual / SS_total), where SS_total is the variance of the target and SS_residual is the sum of squared prediction errors; when a model's predictions are so poor that its SS_residual exceeds SS_total, R² becomes negative, indicating the model is worse than the trivial baseline of always predicting the mean.
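A tiny worked example of a negative R² (toy values chosen so the arithmetic is easy to check by hand):

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0, 4.0]      # target mean is 2.5
y_baseline = [2.5, 2.5, 2.5, 2.5]  # always predict the mean
y_bad = [4.0, 3.0, 2.0, 1.0]       # systematically backwards

print(r2_score(y_true, y_baseline))  # 0.0 -- the trivial baseline
print(r2_score(y_true, y_bad))       # -3.0 -- worse than predicting the mean
```

Here SS_residual for the backwards predictions is 20 while SS_total is 5, giving R² = 1 - 20/5 = -3.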


Part C: Short Answer

Question 17 — Model Answer:

Precision is the fraction of customers the model flagged as churners who actually churned: out of all positive predictions, how many were correct? Recall is the fraction of actual churners the model successfully identified: out of all true churners, how many did the model catch?

High recall is more important than high precision in medical screening — you want to catch every cancer case even if that means some false alarms, because missing a true case has catastrophic consequences. High precision is more important than high recall in fraud detection for account freezing — you want to avoid freezing legitimate accounts (false positives cause severe customer harm), even if you miss some fraudulent transactions.
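The two definitions can be made concrete with scikit-learn (the labels below are made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score

# 10 customers: 1 = churned, 0 = did not churn
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # model flags 3 customers

print(precision_score(y_true, y_pred))  # 2 of 3 flagged are real churners: ~0.667
print(recall_score(y_true, y_pred))     # 2 of 4 real churners caught: 0.5
```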


Question 18 — Model Answer:

Cross-validation splits the data into k folds and trains k separate models, each evaluated on a different fold of held-out data. This produces k performance estimates from k different test sets, giving a more stable and representative assessment of how the model will generalize.

A single train/test split is sensitive to which samples happen to land in the test set — a lucky or unlucky split can produce an estimate that is significantly better or worse than the model's true performance. Cross-validation averages across multiple splits, reducing this variance. The standard deviation across folds also provides a measure of how sensitive the model's performance is to which data it sees — high variance across folds is a warning sign.
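In scikit-learn, cross_val_score handles the fold splitting and per-fold evaluation in one call (the synthetic dataset below is an illustrative assumption):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

print(len(scores))                   # one R² per fold
print(scores.mean(), scores.std())  # average performance and fold-to-fold spread
```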


Question 19 — Model Answer:

Priya might explain it this way: "Imagine I randomly select one customer who actually churned and one who did not. AUC = 0.89 means that 89% of the time, the model correctly assigns a higher churn probability to the customer who churned. A completely random guess would score 0.50. A perfect model would score 1.00. So 0.89 means the model is quite good at distinguishing churners from non-churners — significantly better than chance, though not perfect. It does not mean 89% of predictions are correct; it means the model's ranking of customers from high to low churn risk is accurate 89% of the time."


Question 20 — Model Answer:

MAE = $8,400 means the model's typical prediction error is about $8,400 — on an average contract value of $45,000, that is roughly an 18.7% typical error. RMSE = $22,600 is nearly three times the MAE.

The gap between MAE and RMSE is diagnostic: RMSE penalizes large errors more than MAE does (by squaring them before averaging). A large RMSE relative to MAE indicates the model makes occasional very large errors — the error distribution has heavy tails. Most predictions are close (MAE ~$8,400), but some are wildly off (pulling RMSE up to $22,600). For a sales forecasting use case, those outlier predictions are important to investigate: they may represent a certain type of deal or prospect segment the model has not learned to handle correctly.
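The MAE/RMSE gap can be reproduced on hypothetical errors (the values below are made up to show the effect, not taken from the question):

```python
import numpy as np

# Hypothetical prediction errors: most are small, two are very large outliers
errors = np.array([1000, -2000, 1500, -500, 2000, -1000, 60000, -55000], dtype=float)

mae = np.abs(errors).mean()
rmse = np.sqrt((errors ** 2).mean())

print(mae)   # 15375.0
print(rmse)  # much larger: squaring amplifies the two outlier errors
```

Removing the two outliers would pull MAE and RMSE much closer together, which is exactly the diagnostic the answer describes.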