Chapter 7 Quiz: Supervised Learning -- Classification
Multiple Choice
Question 1. Which of the following best describes the purpose of classification in a business context?
- (a) Predicting a continuous numerical value, such as next quarter's revenue.
- (b) Assigning observations to discrete categories to inform a specific business action.
- (c) Grouping customers into clusters based on shared characteristics.
- (d) Reducing the number of features in a dataset.
Question 2. A company's churn model achieves 95 percent accuracy on a dataset where 95 percent of customers do not churn. Which statement is most accurate?
- (a) The model is performing excellently and should be deployed immediately.
- (b) The model may be predicting "no churn" for every customer, which would render it useless despite the high accuracy score.
- (c) Accuracy above 90 percent always indicates a reliable model.
- (d) The model should be retrained with more data to push accuracy above 98 percent.
Question 3. What does the sigmoid function do in logistic regression?
- (a) It converts predicted probabilities into class labels.
- (b) It maps a linear combination of features to a probability value between 0 and 1.
- (c) It selects the most important features for the model.
- (d) It prevents the model from overfitting.
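The mapping that Question 3 asks about can be sketched in a few lines of Python; the coefficients and intercept below are made-up values for illustration, not output from any fitted model:

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# A linear combination of features (hypothetical weights and intercept):
# z = w1*x1 + w2*x2 + b
z = 0.4 * 2.0 + (-0.3) * 1.0 + 0.1
p = sigmoid(z)   # probability of the positive class

print(sigmoid(0))     # 0.5 -- a score of zero maps to a 50/50 probability
print(round(p, 3))    # 0.646
```

Note that the sigmoid outputs a probability; converting that probability to a class label is a separate thresholding step.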
Question 4. A decision tree classifier is trained with no constraints on depth or minimum leaf size. On the training data, it achieves 99.8 percent accuracy. On the test data, it achieves 71 percent accuracy. What is the most likely explanation?
- (a) The training data contains errors.
- (b) The model is underfitting -- it is too simple to capture the patterns.
- (c) The model is overfitting -- it has memorized the training data rather than learning generalizable patterns.
- (d) The test data is from a different distribution than the training data.
Question 5. Which of the following is the primary mechanism by which random forests reduce overfitting compared to individual decision trees?
- (a) They use deeper trees than individual decision trees.
- (b) They combine many diverse trees (trained on different bootstrap samples with random feature subsets) and aggregate their predictions.
- (c) They apply stronger regularization to each individual tree.
- (d) They automatically remove outliers from the training data.
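The variance-reduction idea behind option (b) can be illustrated with a toy simulation. There are no real trees here: each "tree" is just a noisy but unbiased estimator of the same quantity, and the noise level is an arbitrary assumption:

```python
import random

random.seed(0)

true_p = 0.7
# Each "tree" is a noisy estimator of the same quantity; in a real
# random forest the diversity comes from bootstrap samples and
# random feature subsets rather than injected Gaussian noise.
trees = [true_p + random.gauss(0, 0.2) for _ in range(200)]

avg_single_error = sum(abs(t - true_p) for t in trees) / len(trees)
ensemble_error = abs(sum(trees) / len(trees) - true_p)

print(avg_single_error, ensemble_error)  # averaging shrinks the error
```

Individual trees are high-variance; averaging many diverse trees cancels much of that variance, which is why the ensemble generalizes better than any single unconstrained tree.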
Question 6. How does gradient boosting differ from random forests in how it builds its ensemble of trees?
- (a) Gradient boosting builds trees independently and averages them; random forests build trees sequentially.
- (b) Gradient boosting builds trees sequentially, with each tree correcting the errors of the previous ones; random forests build trees independently.
- (c) Gradient boosting uses deeper trees; random forests use shallower trees.
- (d) There is no fundamental difference; they are two names for the same approach.
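The sequential, error-correcting pattern in option (b) can be sketched with a toy booster whose weak learner is just a constant value (a deliberately simplified stand-in for a shallow tree):

```python
# Toy gradient boosting on residuals with constant-value weak learners.
y = [3.0, 5.0, 7.0]          # targets
pred = [0.0, 0.0, 0.0]       # ensemble prediction, built up sequentially
lr = 0.5                     # learning rate

for _ in range(10):
    # Each round fits a weak learner to the CURRENT residuals ...
    resid = [yi - pi for yi, pi in zip(y, pred)]
    stump = sum(resid) / len(resid)   # weak learner: predict mean residual
    # ... and adds a damped version of it to the ensemble.
    pred = [pi + lr * stump for pi in pred]

print([round(p, 2) for p in pred])   # converges toward the target mean, 5.0
```

Contrast with a random forest, where every tree is fit independently to (a resample of) the original targets and the results are averaged at the end.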
Question 7. In a churn prediction model, a false positive means:
- (a) The model correctly predicted that a customer would churn.
- (b) The model predicted that a customer would churn, but the customer actually stayed.
- (c) The model predicted that a customer would stay, but the customer actually churned.
- (d) The model correctly predicted that a customer would stay.
Question 8. A model has the following confusion matrix results: TP = 120, FP = 80, FN = 30, TN = 770. What is the precision of this model?
- (a) 120 / (120 + 30) = 0.80
- (b) 120 / (120 + 80) = 0.60
- (c) 120 / (120 + 770) = 0.13
- (d) (120 + 770) / (120 + 80 + 30 + 770) = 0.89
Question 9. Using the same confusion matrix from Question 8, what is the recall?
- (a) 120 / (120 + 30) = 0.80
- (b) 120 / (120 + 80) = 0.60
- (c) 120 / (120 + 770) = 0.13
- (d) (120 + 770) / (120 + 80 + 30 + 770) = 0.89
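As a quick check on the definitions behind Questions 8 and 9, the metrics can be computed directly from the stated counts:

```python
# Confusion-matrix counts from Question 8.
TP, FP, FN, TN = 120, 80, 30, 770

precision = TP / (TP + FP)   # of customers flagged as churners, how many churned
recall    = TP / (TP + FN)   # of actual churners, how many were flagged
accuracy  = (TP + TN) / (TP + FP + FN + TN)

print(precision, recall, accuracy)   # 0.6 0.8 0.89
```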
Question 10. A bank must explain every lending decision to regulators. Which classification algorithm would be most appropriate for the primary model?
- (a) XGBoost, because it achieves the highest accuracy.
- (b) A deep neural network, because it can learn complex patterns.
- (c) Logistic regression or a constrained decision tree, because they produce interpretable outputs that can be explained to regulators.
- (d) Random forest, because it is the most robust to overfitting.
Question 11. What is the purpose of one-hot encoding in classification?
- (a) To scale all features to the range 0-1.
- (b) To convert categorical features into a numerical format that machine learning algorithms can process.
- (c) To balance the classes in the target variable.
- (d) To reduce the dimensionality of the feature space.
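A minimal one-hot encoding sketch in pure Python (in practice you would typically use `pandas.get_dummies` or scikit-learn's `OneHotEncoder`; the loyalty-tier values below are hypothetical):

```python
# Each category becomes its own 0/1 indicator column.
categories = ["Bronze", "Silver", "Gold"]
tiers = ["Gold", "Bronze", "Gold"]   # a hypothetical categorical feature

encoded = [[1 if tier == cat else 0 for cat in categories] for tier in tiers]

print(encoded)   # [[0, 0, 1], [1, 0, 0], [0, 0, 1]]
```

The result is purely numerical, so algorithms that expect numeric inputs can process it, and no artificial ordering (Bronze < Silver < Gold) is imposed on the categories.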
Question 12. SMOTE (Synthetic Minority Over-sampling Technique) addresses class imbalance by:
- (a) Removing examples from the majority class until both classes are equal.
- (b) Creating synthetic examples of the minority class by interpolating between existing minority class examples.
- (c) Adjusting the classification threshold to favor the minority class.
- (d) Weighting the loss function to penalize misclassification of the minority class more heavily.
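The interpolation idea behind SMOTE can be sketched in a few lines; the two minority-class points below are hypothetical, and a real implementation (e.g., imbalanced-learn's `SMOTE`) interpolates between a point and one of its k nearest minority-class neighbors:

```python
import random

random.seed(0)

# Two existing minority-class examples (hypothetical 2-feature points).
a = [1.0, 3.0]
b = [2.0, 5.0]

# SMOTE-style synthesis: pick a random point on the line segment
# between a minority example and a minority-class neighbor.
lam = random.random()
synthetic = [ai + lam * (bi - ai) for ai, bi in zip(a, b)]

print(synthetic)   # lies between a and b on every feature
```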
Question 13. An AUC-ROC score of 0.50 indicates that the model:
- (a) Achieves perfect separation between the two classes.
- (b) Is performing no better than random guessing.
- (c) Has 50 percent accuracy.
- (d) Has a 50/50 split between precision and recall.
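One way to see why 0.50 means "no better than random guessing": AUC-ROC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A sketch of that pairwise definition (a hand-rolled stand-in for scikit-learn's `roc_auc_score`):

```python
from itertools import product

def auc(scores_pos, scores_neg):
    """AUC-ROC as the probability that a random positive example
    is scored above a random negative one (ties count as 0.5)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(scores_pos, scores_neg))
    return wins / (len(scores_pos) * len(scores_neg))

# A model that gives every example the same score carries no ranking
# information, so its AUC is 0.5 -- the same as random guessing.
print(auc([0.7, 0.7], [0.7, 0.7]))   # 0.5
# A model that always ranks positives above negatives: AUC = 1.0.
print(auc([0.9, 0.8], [0.3, 0.1]))   # 1.0
```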
Question 14. When splitting data into training and test sets for a churn prediction model that uses time-based features, which approach is most appropriate?
- (a) Random splitting, ensuring equal churn rates in both sets.
- (b) Temporal splitting, where the training set contains earlier observations and the test set contains later observations.
- (c) Alphabetical splitting by customer name.
- (d) Stratified splitting by loyalty tier.
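A minimal temporal-split sketch (the monthly records are hypothetical, and the 75/25 cutoff is an arbitrary choice):

```python
# Hypothetical (observation_month, features, churn_label) records.
rows = [
    ("2023-01", {"orders": 4}, 0),
    ("2023-02", {"orders": 1}, 1),
    ("2023-03", {"orders": 3}, 0),
    ("2023-04", {"orders": 0}, 1),
]

# Temporal split: train on earlier observations, test on later ones,
# so the evaluation mimics predicting the future from the past.
rows.sort(key=lambda r: r[0])
cutoff = int(len(rows) * 0.75)
train, test = rows[:cutoff], rows[cutoff:]

print([r[0] for r in train])   # ['2023-01', '2023-02', '2023-03']
print([r[0] for r in test])    # ['2023-04']
```

A random split would let information from later periods leak into training, inflating the measured performance.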
Question 15. Which of the following is the best explanation for why you should fit a feature scaler (e.g., StandardScaler) on the training data only, then apply it to the test data?
- (a) Fitting on the full dataset is computationally more expensive.
- (b) Fitting on the full dataset introduces data leakage -- the test set statistics influence the training process, making performance estimates unreliable.
- (c) Fitting on the training data only produces better-scaled features.
- (d) There is no practical difference; fitting on the full dataset is equally valid.
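The leakage-free pattern in option (b) can be sketched without scikit-learn by computing the scaling statistics by hand (`StandardScaler` does the equivalent internally; the feature values below are made up):

```python
from statistics import mean, pstdev

train = [10.0, 12.0, 14.0, 16.0]
test = [11.0, 40.0]   # includes an extreme value unseen in training

# Fit the scaler on the training data ONLY: its mean and standard
# deviation become part of the trained pipeline, like model weights.
mu, sigma = mean(train), pstdev(train)

train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]   # apply, never re-fit

print(mu)   # 13.0 -- computed without ever looking at the test set
```

Had `mu` and `sigma` been computed on the full dataset, the test value 40.0 would have shifted them, quietly leaking test-set information into training.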
Short Answer
Question 16. Professor Okonkwo states: "The model doesn't make the decision. It informs the decision." In two to three sentences, explain what she means and why this distinction matters for a business deploying a churn prediction model.
Question 17. Explain the precision-recall tradeoff using the Athena churn prediction example. In your answer, specify: (a) what precision measures in this context, (b) what recall measures, (c) why you typically cannot maximize both simultaneously, and (d) which metric Athena should prioritize given that a retention offer costs $20 and a lost customer costs $340.
Question 18. A colleague argues: "We should always use the most complex model available because it will always perform better." Provide a counterargument of three to four sentences, citing specific concepts from this chapter.
Question 19. Describe two ways that domain expertise (business knowledge) influences the classification workflow beyond just interpreting results. Provide specific examples from the Athena churn prediction case.
Question 20. Athena's VP of Operations initially pushes back on the churn model, asking: "What do we do with a list of likely churners?" Explain why this pushback was valuable and how the team resolved it. What would have happened if the model had been deployed without addressing this concern?
Scenario Analysis
Question 21. You are building a classification model to predict whether manufacturing equipment will fail within the next 7 days (predictive maintenance). Consider the following:
- (a) Is this a binary or multiclass classification problem?
- (b) Describe the cost asymmetry. Is a false positive (predicting failure that doesn't happen) or a false negative (missing an actual failure) more costly? Why?
- (c) Given your answer to (b), should the model's threshold be set above or below 0.5? Explain.
- (d) Name two features that might be predictive in this domain (not mentioned in the chapter).
Question 22. Two data scientists present competing churn models to Athena's leadership:
- Data Scientist A presents an XGBoost model with AUC = 0.87 and says: "This model is 87 percent accurate at distinguishing churners from non-churners."
- Data Scientist B presents a logistic regression model with AUC = 0.82 and says: "This model identifies 75 percent of actual churners, and when it flags someone, it's right 55 percent of the time. Here are the top five risk factors it uses."
If you were the CMO, which presentation would you find more useful, and why? Which common mistake does Data Scientist A make in their presentation?
Question 23. A fintech startup has 200 labeled examples of loan defaults (positive class) and 9,800 examples of non-defaults (negative class). They train a random forest classifier and achieve the following results:
- Accuracy: 97.5 percent
- Precision: 0.40
- Recall: 0.15
- AUC-ROC: 0.72
- (a) Explain why the accuracy appears high but the model is performing poorly for the business objective.
- (b) Suggest three specific strategies to improve the model's performance on the default (positive) class.
- (c) If the startup implemented only one change immediately, which would you recommend and why?
Question 24. A retail company discovers that its churn prediction model performs well overall (AUC = 0.84) but performs significantly worse for customers acquired through the mobile app (AUC = 0.67) than for customers acquired through the website (AUC = 0.88).
- (a) What are two possible explanations for this performance disparity?
- (b) How should the company handle this issue before deploying the model?
- (c) What does this situation illustrate about the concept of "overall accuracy" as discussed in this chapter?
Question 25. Consider the complete lifecycle of Athena's churn prediction model from initial scoping (Chapter 6) through deployment (this chapter). Map each of the following events to the appropriate stage of the ML project lifecycle:
- (a) Ravi's team debates whether to define churn as 90 days or 180 days of inactivity.
- (b) The team discovers that Bronze-tier customers are underrepresented in the training data.
- (c) The XGBoost model achieves AUC = 0.85 on the held-out test set.
- (d) The VP of Operations asks: "What do we do with a list of likely churners?"
- (e) Three months after deployment, the model's precision drops by 15 percent.
Answer key for selected questions is available in Appendix B: Answers to Selected Exercises.