Chapter 2 Quiz
The Machine Learning Workflow
Test your understanding of the ML lifecycle, problem framing, data leakage, baselines, and evaluation. Answers are in Appendix B.
Question 1
Which of the following best describes why problem framing is considered the most important step in the ML workflow?
(a) It determines which algorithm to use, and algorithm choice is the primary driver of model performance.
(b) It defines what the model predicts, how success is measured, and what action is taken — errors at this stage invalidate everything downstream.
(c) It is the most time-consuming step, so getting it right saves the most calendar time.
(d) It is the step where most data leakage is introduced, so getting it right prevents leakage.
Question 2
A model predicts customer churn with 94% accuracy on a dataset where 6% of customers churn. What can you conclude?
(a) The model is performing well — 94% accuracy is high.
(b) You cannot conclude anything from accuracy alone; a model that predicts "no churn" for everyone would achieve 94% accuracy.
(c) The model is definitely overfitting because the accuracy is close to 100%.
(d) The model needs more training data to improve beyond 94%.
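Before checking your answer, it can help to see the arithmetic concretely. The sketch below uses made-up labels with the same 6% churn rate as the question and measures a "model" that never predicts churn:

```python
# Simulated labels: 6 churners out of 100 customers (illustrative numbers)
labels = [1] * 6 + [0] * 94

# A "model" that always predicts the majority class (no churn)
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(accuracy)  # 0.94 -- high accuracy, yet zero churners are caught
```

The baseline matches the quoted 94% exactly, which is why accuracy alone tells you nothing here.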
Question 3
What is a "stupid baseline" and why is it useful?
(a) A model that intentionally predicts the wrong class, used to measure the worst-case scenario.
(b) A trivially simple model (e.g., predict majority class) that sets the minimum performance floor any useful model must beat.
(c) A model trained on a random subset of features, used to estimate feature importance.
(d) A model trained on the test set, used to establish the upper bound on performance.
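A stupid baseline takes one line to build. Assuming scikit-learn is available, its DummyClassifier with strategy="most_frequent" is a standard way to get the majority-class floor (the toy data below is illustrative):

```python
from sklearn.dummy import DummyClassifier

X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 0, 1, 1]  # majority class is 0

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

# Ignores the inputs entirely and always predicts the majority class
print(baseline.predict([[10], [20]]))
print(baseline.score(X, y))  # the accuracy floor any real model must beat
```

Any candidate model that cannot beat this score is adding complexity without adding value.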
Question 4
You are building a churn prediction model. Which of the following features would constitute target leakage?
(a) tenure_months — how long the customer has been subscribed
(b) cancellation_reason — the reason the customer gave for canceling
(c) hours_watched_last_30d — hours of content watched in the past 30 days
(d) plan_type — the customer's current subscription plan
Question 5
What is the difference between offline evaluation and online evaluation?
(a) Offline evaluation uses the training set; online evaluation uses the test set.
(b) Offline evaluation measures model performance on held-out historical data; online evaluation measures business impact in production through experiments like A/B tests.
(c) Offline evaluation is done without a computer; online evaluation uses real-time compute.
(d) Offline evaluation uses cross-validation; online evaluation uses a single train/test split.
Question 6
A data scientist fits a StandardScaler on the entire dataset (train + test combined), then performs a train/test split. What type of leakage is this?
(a) Target leakage
(b) Temporal leakage
(c) Train/test contamination
(d) This is not leakage — scaling is a preprocessing step, not a modeling step.
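The question describes the wrong order of operations. A minimal sketch of the correct order, assuming scikit-learn and toy data, is: split first, then fit the scaler on the training portion only, so test-set statistics never reach the preprocessing step.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = [[float(i)] for i in range(20)]
y = [0] * 10 + [1] * 10

# Split FIRST, so the test set cannot influence preprocessing statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
scaler.fit(X_train)                   # mean/std come from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)   # test data reuses the training statistics
```

Fitting the scaler before the split lets the test set's mean and variance leak into the features every model sees.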
Question 7
Why is a temporal split often more appropriate than a random split for evaluating ML models?
(a) Temporal splits always produce larger training sets.
(b) Temporal splits prevent the model from being tested on data from time periods it has already seen during training, which mirrors how the model will be used in production.
(c) Random splits cause data leakage, while temporal splits never do.
(d) Temporal splits are required by all ML frameworks; random splits are deprecated.
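For reference while answering, a temporal split can be sketched in a few lines: order the rows by event time and cut at a date, so the model trains only on the past and is evaluated only on the future (the rows and cutoff below are made up):

```python
# Sketch of a temporal split: order rows by event time, then cut at a date
rows = [
    {"date": "2024-01-05", "x": 1.0, "y": 0},
    {"date": "2024-02-10", "x": 2.0, "y": 0},
    {"date": "2024-03-15", "x": 3.0, "y": 1},
    {"date": "2024-04-20", "x": 4.0, "y": 1},
]
rows.sort(key=lambda r: r["date"])  # ISO dates sort chronologically as strings

cutoff = "2024-03-01"
train = [r for r in rows if r["date"] < cutoff]   # past: model trains here
test = [r for r in rows if r["date"] >= cutoff]   # future: model is evaluated here
print(len(train), len(test))
```

A random split would shuffle future rows into the training set, which a deployed model will never have.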
Question 8
In the three-way train/validation/test split, what is the purpose of the validation set?
(a) It provides a final, unbiased estimate of model performance.
(b) It is used for hyperparameter tuning, feature selection, and model comparison; these repeated decisions gradually spend its value as an unbiased estimate of performance.
(c) It is a backup copy of the training data in case the original is corrupted.
(d) It is used to calibrate the model's predicted probabilities.
Question 9
A model achieves AUC-ROC of 0.98 on its first evaluation. According to the chapter, what should your first reaction be?
(a) Celebrate — this is excellent performance.
(b) Report the results to stakeholders immediately.
(c) Be suspicious and investigate for data leakage, because real-world prediction problems rarely achieve this level without leakage.
(d) Add more features to push performance to 0.99.
Question 10
Which of the following is NOT one of the five problem framing questions described in Section 2.2?
(a) What are we predicting?
(b) What algorithm should we use?
(c) What information is available at prediction time?
(d) What action will be taken based on the prediction?
Question 11
A model is deployed and initially performs well. After three months, performance degrades significantly. Which of the following is the most likely explanation?
(a) The model's code has a memory leak that causes predictions to degrade over time.
(b) The relationship between features and the target has changed (concept drift), or the input feature distributions have shifted (data drift).
(c) The test set was too small, and the initial evaluation was unreliable.
(d) The model was overfit from the beginning, but it took three months for the overfitting to manifest.
Question 12
Sculley et al. (2015) described the "hidden technical debt" in ML systems. According to the chapter, approximately what fraction of a production ML system is the actual model code?
(a) About 50%
(b) About 25%
(c) About 5%
(d) About 80%
Question 13 (Short Answer)
Explain why the stratify parameter in train_test_split is important for the StreamFlow churn problem specifically. What could go wrong without it?
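As a starting point for your answer, this sketch (with toy labels at the same 6% churn rate) shows what stratification guarantees about the split:

```python
from sklearn.model_selection import train_test_split

# Toy labels mirroring StreamFlow's class balance: 6% churners
y = [1] * 6 + [0] * 94
X = [[i] for i in range(100)]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# Stratification preserves the 6% churn rate in both halves, so churners
# cannot end up concentrated in (or absent from) the test set by chance.
print(sum(y_te), len(y_te))
```

Consider what your evaluation metrics would mean if, without stratify, a random split happened to place very few churners in the test set.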
Question 14 (Short Answer)
A team builds a churn model with three features: hours_watched_last_30d, days_since_last_login, and hours_watched_next_30d. The model achieves AUC-ROC of 0.94. Identify the problematic feature, explain why it is problematic, and predict what will happen when this model is deployed to production.
Question 15 (Short Answer)
You deploy a model using batch scoring (monthly predictions). Your stakeholder says: "The model predicted a 15% churn probability for subscriber #12345 on March 1st. On March 3rd, they called to cancel. The model was wrong — it should have predicted a higher probability."
Is the stakeholder's criticism valid? Explain why or why not, and describe how you would communicate model performance to this stakeholder in a way that correctly sets expectations.
Question 16
A team trains five different models on the same training data, evaluates each on the validation set, selects the best one, and then evaluates the selected model on the test set. The test AUC is 0.82, while the validation AUC was 0.85. Which of the following best explains the gap?
(a) The test set is too small to produce reliable estimates.
(b) Model selection using the validation set introduces optimism bias; the test set provides a less biased estimate of true generalization performance.
(c) The model is overfitting to the training data, and the test set reveals this.
(d) The validation and test sets have different data distributions due to random chance.
Question 17 (Short Answer)
Explain the difference between a "stupid baseline" (like majority class prediction) and a "business heuristic baseline" (like the retention team's existing rule). Why is it important to compute both? In what scenario could a model beat the stupid baseline but not the business heuristic, and what would you recommend in that case?
Question 18
Which of the following is the strongest signal that your model may have data leakage?
(a) The model has high variance across cross-validation folds.
(b) The model achieves AUC-ROC of 0.97 on the test set, far exceeding published benchmarks for the same problem.
(c) The model's performance improves when you add more training data.
(d) The model's most important feature has a moderate correlation with the target variable.
Question 19
A canary deployment involves:
(a) Training the model on the most recent data and evaluating on older data.
(b) Deploying the model to a small percentage of users first, monitoring metrics, and gradually ramping to full deployment.
(c) Running the model in a sandbox environment that simulates production traffic.
(d) Deploying two competing models simultaneously and letting users choose which predictions they prefer.
Question 20 (Short Answer)
StreamFlow's engineering team changes how they log watch events. Previously, one event was logged per continuous viewing session. Now, one event is logged per episode watched. Explain how this change could affect a deployed churn model that uses sessions_last_30d as a feature. What monitoring signal would detect this problem? What would you do about it?
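One concrete form your monitoring answer could take: compare the recent distribution of the feature against its training-time distribution. The sketch below is a deliberately simple mean-shift check; the numbers, threshold, and variable names are all hypothetical.

```python
# Hypothetical drift check: compare the recent mean of sessions_last_30d
# against the training-time mean (all numbers below are illustrative)
train_mean = 12.0                      # avg sessions/30d at training time
recent_values = [34, 29, 41, 38, 30]   # after the logging change: one event per episode
recent_mean = sum(recent_values) / len(recent_values)

relative_shift = abs(recent_mean - train_mean) / train_mean
ALERT_THRESHOLD = 0.5  # alert if the mean moves by more than 50%
print(relative_shift > ALERT_THRESHOLD)  # the feature has clearly drifted
```

Real monitoring systems typically compare full distributions (e.g., with population stability index or KS tests) rather than means alone, but the principle is the same: the feature's production distribution no longer matches what the model was trained on.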