Quiz: Chapter 1

From Analysis to Prediction


Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.


Question 1 (Multiple Choice)

A marketing analyst computes the average click-through rate for email campaigns sent in Q3 2024, broken down by subject line length. This is an example of:

  • A) Predictive modeling
  • B) Inferential statistics
  • C) Descriptive analytics
  • D) Unsupervised learning

Answer: C) Descriptive analytics. The analyst is summarizing what happened in historical data with no attempt to explain causality or predict future outcomes.


Question 2 (Multiple Choice)

Which of the following best describes the bias-variance tradeoff?

  • A) More training data always reduces both bias and variance
  • B) Increasing model complexity tends to decrease bias but increase variance
  • C) Bias and variance are independent --- reducing one has no effect on the other
  • D) High bias means the model memorizes the training data

Answer: B) Increasing model complexity tends to decrease bias but increase variance. A more flexible model can capture more complex patterns (lower bias) but becomes more sensitive to the specific training data (higher variance).
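The tradeoff can be seen numerically. In this sketch (synthetic data, not from the chapter), polynomials of increasing degree are fit to the same noisy sample: training error falls as the degree grows, reflecting shrinking bias, while the high-degree fit's sensitivity to this particular sample is the rising variance.

```python
import numpy as np

# Synthetic illustration: fit polynomials of increasing degree to one
# noisy sample. Training MSE falls with degree (lower bias), but the
# high-degree fit chases the noise in this specific sample (higher variance).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.shape)

mses = {}
for degree in (1, 3, 15):
    coeffs = np.polyfit(x, y, degree)                      # least-squares fit
    mses[degree] = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree:2d}: training MSE = {mses[degree]:.4f}")
```

Training error alone always rewards the most complex model; only error on held-out data would reveal where added variance starts to hurt.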


Question 3 (Short Answer)

Explain the difference between a training set and a test set. Why is it important to evaluate a model's performance on the test set rather than the training set?

Answer: The training set is the portion of data used to fit the model's parameters --- the model learns from it. The test set is a portion of data withheld from training, used solely to evaluate how well the model generalizes to unseen data. Evaluating on the training set measures how well the model memorizes, not how well it predicts. A model that overfits will perform well on training data but poorly on test data, and only test set evaluation reveals this gap.
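A minimal sketch of the mechanics of the split, using made-up data and plain NumPy (in practice a library helper such as scikit-learn's `train_test_split` is typical):

```python
import numpy as np

# Hypothetical dataset: 1000 rows, 5 features, binary labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

indices = rng.permutation(len(X))        # shuffle row indices once
n_test = int(0.2 * len(X))               # hold out 20% for evaluation
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train, y_train = X[train_idx], y[train_idx]   # fit the model on these rows
X_test, y_test = X[test_idx], y[test_idx]       # touch these only to evaluate
print(X_train.shape, X_test.shape)
```

The key discipline is that the test rows are never used during fitting; they stand in for the unseen data the model will face in production.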


Question 4 (Multiple Choice)

A model has a training accuracy of 99.2% and a test accuracy of 63.8%. This model is most likely:

  • A) Underfitting
  • B) Overfitting
  • C) Well-calibrated
  • D) Biased toward the majority class

Answer: B) Overfitting. The large gap between training accuracy (99.2%) and test accuracy (63.8%) indicates the model has memorized the training data but fails to generalize to new data --- the hallmark of overfitting (high variance).
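The memorization failure behind such a gap can be made concrete with a toy "lookup table" model on synthetic data (all values here are invented for illustration): it is perfect on rows it has seen and reduced to majority-class guessing on rows it has not.

```python
import numpy as np

# Toy "model" that memorizes (id -> label) pairs from training data.
rng = np.random.default_rng(1)
train_ids = rng.permutation(1000)[:500]          # 500 unique training rows
train_labels = rng.integers(0, 2, size=500)
test_ids = np.arange(1000, 1500)                 # entirely unseen rows
test_labels = rng.integers(0, 2, size=500)

lookup = dict(zip(train_ids.tolist(), train_labels.tolist()))  # memorize
majority = int(np.bincount(train_labels).argmax())

def predict(row_id):
    # Perfect recall on seen rows; blind majority guess on unseen ones.
    return lookup.get(row_id, majority)

train_acc = np.mean([predict(i) == l for i, l in zip(train_ids.tolist(), train_labels)])
test_acc = np.mean([predict(i) == l for i, l in zip(test_ids.tolist(), test_labels)])
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Real overfit models are less extreme than a lookup table, but the signature is the same: a large train-test gap.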


Question 5 (Multiple Choice)

You are building a model to predict whether a customer will purchase a specific product. Which feature would constitute target leakage?

  • A) The customer's purchase history from the prior 12 months
  • B) The customer's age and geographic region
  • C) Whether the customer added the product to their cart (in the session where the purchase decision is being predicted)
  • D) The customer's account creation date

Answer: C) Whether the customer added the product to their cart. Adding a product to the cart is part of the purchase process itself --- it occurs as part of the outcome being predicted. Using it as a feature means the model is effectively using the target to predict the target. Options A, B, and D are all legitimately available before the prediction point.


Question 6 (Short Answer)

Define supervised learning and unsupervised learning. Give one example of each from the anchor examples introduced in this chapter (StreamFlow, Metro General, or TurbineTech).

Answer: In supervised learning, the model learns from labeled data where the correct answer (target variable) is provided for each training example. Example: predicting whether a StreamFlow subscriber will churn within 30 days (binary classification with a known label). In unsupervised learning, the model discovers structure in data without labels. Example: clustering TurbineTech's wind turbines by their operating characteristics to identify groups with similar behavior patterns, without pre-defining what the groups should be.


Question 7 (Multiple Choice)

Which statement best captures the "ML mindset" described in this chapter?

  • A) A model is useful if it achieves a high R-squared on the training data
  • B) A model is useful if it correctly specifies the data-generating process
  • C) A model is useful if it makes predictions accurate enough to drive better decisions than the current alternative
  • D) A model is useful if all its coefficients are statistically significant at the 0.05 level

Answer: C) A model is useful if it makes predictions accurate enough to drive better decisions than the current alternative. This captures the pragmatic, engineering-oriented worldview of ML: models do not need to be "true" or perfect --- they need to be better than the status quo and create measurable value.


Question 8 (Multiple Choice)

StreamFlow has a monthly churn rate of 8.2%. A model that predicts "no churn" for every subscriber achieves what accuracy?

  • A) 8.2%
  • B) 50.0%
  • C) 82.0%
  • D) 91.8%

Answer: D) 91.8%. If 8.2% of subscribers churn, then 91.8% do not. Predicting "no churn" for everyone is correct 91.8% of the time. This is the "stupid baseline" that any useful model must beat --- and it illustrates why accuracy alone is misleading for imbalanced classification problems.
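The arithmetic, spelled out:

```python
# Majority-class baseline: predict "no churn" for every subscriber.
churn_rate = 0.082                       # 8.2% of subscribers churn
baseline_accuracy = 1 - churn_rate       # correct for every non-churner
print(f"baseline accuracy: {baseline_accuracy:.1%}")   # 91.8%
```

Any candidate churn model must beat this number to justify its existence, which is why metrics like precision and recall matter for imbalanced problems.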


Question 9 (Short Answer)

A colleague says: "My model has an R-squared of 0.92, so it's a great predictor." Explain why this statement might be misleading. What additional information would you need to evaluate the model's predictive performance?

Answer: An R-squared of 0.92 computed on training data measures how well the model explains the variance in the data it was fit to --- not how well it predicts new data. A highly overfit model can achieve near-perfect training R-squared while performing terribly on unseen data. To evaluate predictive performance, you need the R-squared (or another metric like MAE or RMSE) computed on a held-out test set that the model was never trained on. The gap between training and test R-squared reveals the degree of overfitting.
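A sketch with synthetic data shows the gap directly: R-squared is computed from its definition (1 minus the ratio of residual to total sum of squares) on both the training sample and a held-out sample, using a deliberately overfit polynomial.

```python
import numpy as np

def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Synthetic data: a simple linear relationship plus noise.
rng = np.random.default_rng(7)
x_train = np.sort(rng.uniform(0, 1, 15))
y_train = 2 * x_train + rng.normal(0, 0.2, 15)
x_test = np.sort(rng.uniform(0, 1, 15))
y_test = 2 * x_test + rng.normal(0, 0.2, 15)

coeffs = np.polyfit(x_train, y_train, 12)   # deliberately overfit
r2_train = r_squared(y_train, np.polyval(coeffs, x_train))
r2_test = r_squared(y_test, np.polyval(coeffs, x_test))
print(f"train R^2 = {r2_train:.3f}, test R^2 = {r2_test:.3f}")
```

The training R-squared is near-perfect by construction; only the held-out figure says anything about predictive value.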


Question 10 (Multiple Choice)

Which of the following is NOT a valid framing of TurbineTech's predictive maintenance problem?

  • A) Binary classification: Will this turbine fail within 72 hours?
  • B) Regression: How many hours until this turbine fails?
  • C) Anomaly detection: Is this turbine's sensor pattern abnormal?
  • D) Clustering: Which turbines are most similar to each other?
  • E) All of the above are valid framings

Answer: E) All of the above are valid framings. Binary classification predicts imminent failure. Regression estimates remaining useful life. Anomaly detection flags unusual behavior without requiring labeled failure data. Clustering groups turbines by operating characteristics, which can inform maintenance scheduling. The "right" framing depends on the business context and available data, not on a single correct answer.


Question 11 (Short Answer)

Explain the difference between prediction and inference using the Metro General Hospital example. Give a specific question that represents each.

Answer: Prediction asks "what will happen?" --- for Metro General, this is "Will this patient be readmitted within 30 days of discharge?" The goal is an accurate forecast for each individual patient. Inference asks "why does it happen?" --- for Metro General, this is "Does scheduling a follow-up appointment within 7 days of discharge reduce the probability of readmission?" The goal is to understand causal mechanisms. Prediction optimizes for accuracy on new data; inference optimizes for understanding the data-generating process. A model can be excellent at one and poor at the other.


Question 12 (Multiple Choice)

The six-question "framing checklist" introduced in this chapter includes all of the following EXCEPT:

  • A) What is the prediction target?
  • B) What algorithm should we use?
  • C) What features are available at prediction time?
  • D) How will the prediction be used?
  • E) What is the observation unit?

Answer: B) What algorithm should we use? The framing checklist focuses on the problem definition: business question, prediction target, observation unit, prediction horizon, available features, and how the prediction will be used. Algorithm selection comes later, after the problem is properly framed. Jumping to "which algorithm?" before framing is one of the most common mistakes in data science.


Question 13 (Multiple Choice)

A model with high bias and low variance will tend to:

  • A) Perform well on training data but poorly on test data
  • B) Perform poorly on both training and test data
  • C) Perform well on both training and test data
  • D) Perform differently each time it is retrained on different samples

Answer: B) Perform poorly on both training and test data. High bias means the model is too simple to capture the real patterns --- it underfits, so both training and test error are high. Its predictions are stable across different training samples (low variance) but systematically off-target (high bias). This is distinct from overfitting (high variance), where training error is low but test error is high.


Question 14 (Short Answer)

What is generalization in the context of machine learning? Why is it the central goal of predictive modeling?

Answer: Generalization is a model's ability to perform well on data it was not trained on --- data drawn from the same underlying distribution but never seen during training. It is the central goal because the entire purpose of predictive modeling is to make accurate predictions about the future (or about new observations), not to describe the past. A model that only works on data it has already seen is just a lookup table. Every technique in ML --- train/test splits, cross-validation, regularization --- exists to improve or measure generalization.


Question 15 (Multiple Choice --- Select Two)

Which TWO of the following scenarios describe a problem best solved with supervised learning?

  • A) Grouping customers into segments based on purchasing behavior, with no predefined segment labels
  • B) Predicting the sale price of a house based on its features
  • C) Detecting unusual network traffic patterns without labeled examples of attacks
  • D) Classifying emails as spam or not-spam using a labeled training dataset
  • E) Reducing a 500-feature dataset to its 10 most important principal components

Answer: B) and D). Both have a defined target variable (sale price, spam label) and labeled training data. Option A is clustering (unsupervised), Option C is anomaly detection (typically unsupervised), and Option E is dimensionality reduction (unsupervised). Supervised learning requires labeled examples where the correct answer is known.


This quiz supports Chapter 1: From Analysis to Prediction. Return to the chapter to review concepts.