Chapter 29 Quiz: Evaluating Models
Instructions: This quiz tests your understanding of Chapter 29. Answer all questions before checking the solutions. For multiple choice, select the best answer. For short answer questions, aim for 2-4 clear sentences. Total points: 100.
Section 1: Multiple Choice (10 questions, 4 points each)
Question 1. A fraud detection model achieves 99.8% accuracy on a dataset where 0.2% of transactions are fraudulent. What does this accuracy most likely indicate?
- (A) The model is excellent at detecting fraud
- (B) The model is predicting "not fraud" for almost every transaction
- (C) The model has perfect precision for the fraud class
- (D) The dataset is too small for meaningful evaluation
Answer
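The trap is easy to reproduce in a few lines. A toy sketch (the counts below are invented to match the question's 0.2% fraud rate):

```python
# Hypothetical labels: 2 fraud cases out of 1000 transactions (0.2%)
y_true = [1] * 2 + [0] * 998

# A "model" that always predicts "not fraud"
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_fraud = 0 / 2  # it catches none of the 2 fraud cases

print(accuracy)      # 0.998
print(recall_fraud)  # 0.0
```

Whenever the positive class is rare, report precision and recall for that class alongside accuracy.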
**Correct: (B)** When 99.8% of transactions are non-fraudulent, a model that simply predicts "not fraud" for every transaction achieves 99.8% accuracy. The high accuracy is driven entirely by the majority class and tells us nothing about the model's ability to detect fraud. This is the accuracy paradox with imbalanced classes — accuracy is dominated by whichever class has more examples.

Question 2. In a confusion matrix, a False Negative (FN) represents:
- (A) A case that was actually negative but predicted as positive
- (B) A case that was actually positive but predicted as negative
- (C) A case that was correctly predicted as negative
- (D) A case that was correctly predicted as positive
Answer
**Correct: (B)** A False Negative is "false" because the prediction was wrong, and "negative" because the model predicted the negative class. The actual class is positive, but the model missed it. In medical terms, this is a missed diagnosis. In fraud detection, it's fraud that slipped through.
- **(A)** describes a False Positive (FP)
- **(C)** describes a True Negative (TN)
- **(D)** describes a True Positive (TP)

Question 3. Precision is defined as:
- (A) TP / (TP + FN)
- (B) TP / (TP + FP)
- (C) TN / (TN + FP)
- (D) (TP + TN) / Total
Answer
**Correct: (B)** Precision = TP / (TP + FP). It answers: "Of everything the model flagged as positive, what fraction was actually positive?" High precision means few false alarms.
- **(A)** is recall (sensitivity, true positive rate)
- **(C)** is specificity (true negative rate)
- **(D)** is accuracy

Question 4. A medical screening test has precision = 0.20 and recall = 0.95. This means:
- (A) The test is useless because precision is so low
- (B) The test catches 95% of sick patients but generates many false alarms
- (C) The test is correct 20% of the time and wrong 80%
- (D) The test misses 95% of sick patients
Answer
**Correct: (B)** Recall = 0.95 means the test catches 95% of truly sick patients — excellent for a screening test where missing sick patients is dangerous. Precision = 0.20 means that of all patients flagged as positive, only 20% actually have the disease — lots of false alarms. But in medical screening, this is often acceptable: the false positives get additional testing (inconvenient but not dangerous), while the true positives get potentially life-saving treatment. Low precision with high recall is appropriate for screening tests.
- **(A)** ignores that low precision is acceptable in screening contexts
- **(C)** confuses precision with accuracy
- **(D)** reverses what recall means

Question 5. The F1 score of a model with precision = 0.90 and recall = 0.10 is approximately:
- (A) 0.50
- (B) 0.18
- (C) 0.90
- (D) 0.45
Answer
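A two-line check of the harmonic versus arithmetic mean, using the values from the question:

```python
precision, recall = 0.90, 0.10

# Harmonic mean (F1) heavily penalizes imbalance; arithmetic mean hides it
f1 = 2 * precision * recall / (precision + recall)
arithmetic = (precision + recall) / 2

print(round(f1, 2))          # 0.18
print(round(arithmetic, 2))  # 0.5
```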
**Correct: (B)** F1 = 2 × (0.90 × 0.10) / (0.90 + 0.10) = 2 × 0.09 / 1.00 = 0.18. The harmonic mean heavily penalizes the extreme imbalance between precision and recall. The arithmetic mean would be (0.90 + 0.10)/2 = 0.50, which would misleadingly suggest a mediocre model. The F1 of 0.18 correctly reveals that the model is nearly useless — it has high precision because it almost never predicts positive, so the rare predictions it does make are usually correct, but it misses 90% of actual positives.

Question 6. What does AUC (Area Under the ROC Curve) measure?
- (A) The accuracy of the model at the default threshold
- (B) The probability that the model ranks a random positive example higher than a random negative example
- (C) The total number of correct predictions divided by total predictions
- (D) The harmonic mean of precision and recall
Answer
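The ranking interpretation can be computed directly. A sketch with made-up scores: over all (positive, negative) pairs, count how often the positive example is scored higher, with ties counting half.

```python
# Toy scores: higher should mean "more likely positive"
pos_scores = [0.9, 0.8, 0.35]        # scores given to actual positives
neg_scores = [0.7, 0.3, 0.2, 0.1]    # scores given to actual negatives

# AUC = fraction of (positive, negative) pairs ranked correctly
wins = 0.0
for p in pos_scores:
    for n in neg_scores:
        if p > n:
            wins += 1
        elif p == n:
            wins += 0.5  # ties count as half

auc = wins / (len(pos_scores) * len(neg_scores))
print(auc)  # 11 of 12 pairs ranked correctly -> ~0.917
```

This pairwise count agrees with the area under the ROC curve computed by `sklearn.metrics.roc_auc_score`.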
**Correct: (B)** AUC represents the probability that a randomly chosen positive example receives a higher predicted probability than a randomly chosen negative example. It evaluates the model's *ranking* ability across all possible thresholds, making it threshold-independent. An AUC of 1.0 means the model perfectly separates positives from negatives; 0.5 means it's no better than random.

Question 7. In 5-fold cross-validation, what fraction of the data is used for testing in each fold?
- (A) 50%
- (B) 30%
- (C) 20%
- (D) 10%
Answer
**Correct: (C)** In k-fold cross-validation, the data is split into k equal parts. Each fold uses one part (1/k) for testing and the remaining k-1 parts for training. With k=5, each fold uses 1/5 = 20% for testing and 80% for training. Over all 5 folds, every sample is tested exactly once.

Question 8. Why is stratified k-fold cross-validation preferred over regular k-fold for classification problems?
- (A) It's faster to compute
- (B) It ensures each fold maintains approximately the same class distribution as the full dataset
- (C) It uses more data for training
- (D) It produces higher accuracy scores
Answer
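A plain-Python sketch of the idea behind stratification (the 5%-positive dataset is invented; scikit-learn's `StratifiedKFold` does this for you): split each class's indices into folds separately, so every fold inherits the overall class ratio.

```python
import random

# 100 samples, 5% positive -- imbalanced
y = [1] * 5 + [0] * 95
k = 5

random.seed(0)
idx = list(range(100))
random.shuffle(idx)

# Regular k-fold: deal shuffled indices into 5 folds;
# a fold can end up with very few (even zero) positives
plain_folds = [idx[i::k] for i in range(k)]

# Stratified: split each class separately, then combine,
# so every fold gets exactly 1 positive and 19 negatives here
pos = [i for i in idx if y[i] == 1]
neg = [i for i in idx if y[i] == 0]
strat_folds = [pos[i::k] + neg[i::k] for i in range(k)]

print([sum(y[i] for i in fold) for fold in plain_folds])  # positives per fold, varies
print([sum(y[i] for i in fold) for fold in strat_folds])  # [1, 1, 1, 1, 1]
```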
**Correct: (B)** Stratified k-fold preserves the class proportions in each fold. Without stratification, a fold might randomly contain very few (or zero) examples of the minority class, making evaluation on that fold unreliable. This is especially important with imbalanced classes: if your dataset is 5% positive and you use regular k-fold, one fold might have 0% positive examples, producing meaningless metrics for that fold.

Question 9. A model has R-squared = 0.0 on the test set. This means:
- (A) The model is making perfectly wrong predictions
- (B) The model explains no more variance than simply predicting the mean for everyone
- (C) The model has zero accuracy
- (D) The model's predictions and actual values are negatively correlated
Answer
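A minimal check of the definition R² = 1 − SS_res/SS_tot, using an invented target vector: predicting the mean makes SS_res equal SS_tot, so R² is exactly 0.

```python
# R^2 = 1 - SS_res / SS_tot; predicting the mean gives R^2 = 0
y_true = [3.0, 5.0, 7.0, 9.0]
mean_y = sum(y_true) / len(y_true)  # 6.0

y_pred = [mean_y] * len(y_true)  # "model" that always predicts the mean

ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(r2)  # 0.0
```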
**Correct: (B)** R² = 0 means the model's predictions are no better than always predicting the mean of the target variable. The model has learned nothing useful. R² can be negative (worse than the mean), zero (equal to the mean), or positive (better than the mean, up to 1.0 for perfect predictions).
- **(A)** would correspond to a very negative R²
- **(C)** R² is a regression metric, not accuracy
- **(D)** negative correlation would produce negative R²

Question 10. Which of the following is the most reliable way to compare two models?
- (A) Compare accuracy on the training set
- (B) Compare accuracy on a single test set
- (C) Compare cross-validated F1 scores (or the metric appropriate for your problem)
- (D) Compare the number of features each model uses
Answer
**Correct: (C)** Cross-validated scores are more reliable than a single train/test split because they average over multiple data splits, reducing the effect of random variation. Using an appropriate metric (F1 for imbalanced classification, rather than accuracy) ensures you're measuring what actually matters.
- **(A)** Training set performance tells you nothing about generalization
- **(B)** A single test set is susceptible to the "luck of the draw"
- **(D)** The number of features doesn't directly measure model quality

Section 2: True/False (4 questions, 4 points each)
Question 11. True or False: A model can have high accuracy but low recall for the positive class.
Answer
**True.** This is exactly the accuracy paradox. If 99% of samples are negative, a model that always predicts negative has 99% accuracy but 0% recall for the positive class. High accuracy and low recall coexist whenever the majority class dominates the accuracy calculation.

Question 12. True or False: If Model A has a higher AUC than Model B, then Model A is guaranteed to have higher accuracy at the default threshold of 0.5.
Answer
**False.** AUC measures the model's ranking ability *across all thresholds*, not its performance at any specific threshold. A model with higher AUC is a better ranker overall, but it might have lower accuracy at one particular threshold (like 0.5) than a model with lower AUC. AUC and accuracy at a specific threshold measure different things.

Question 13. True or False: Increasing the classification threshold from 0.5 to 0.7 will always increase precision.
Answer
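A toy demonstration (the scores and labels are invented): raising the threshold shrinks the flagged set to the most confident predictions, and precision rises.

```python
# (probability, true label) pairs; 1 = positive
scores = [(0.95, 1), (0.85, 1), (0.65, 0), (0.60, 1), (0.55, 0), (0.40, 0)]

def precision_at(threshold):
    # Precision among items flagged positive at this threshold
    flagged = [(s, y) for s, y in scores if s >= threshold]
    return sum(y for _, y in flagged) / len(flagged)

print(precision_at(0.5))  # flags 5 items, 3 truly positive -> 0.6
print(precision_at(0.7))  # flags 2 items, both positive -> 1.0
```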
**True** (in almost all practical cases). A higher threshold means the model requires stronger evidence before predicting positive. Fewer items are flagged as positive, so the ones that are flagged tend to be more confidently positive, meaning precision increases. Recall decreases because fewer true positives meet the higher threshold. There are degenerate edge cases where this doesn't hold, but for well-behaved probability-producing classifiers, the trade-off is consistent.

Question 14. True or False: RMSE is always greater than or equal to MAE for the same set of predictions.
Answer
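A quick numeric check (the errors are invented): one large error pushes RMSE above MAE, while equal-magnitude errors make the two coincide.

```python
import math

errors = [1.0, 1.0, 10.0]  # one large error

mae = sum(abs(e) for e in errors) / len(errors)           # 4.0
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(mae, rmse)  # RMSE > MAE because the large error is squared

# Equal-magnitude errors make them coincide
equal = [2.0, -2.0, 2.0]
mae2 = sum(abs(e) for e in equal) / len(equal)
rmse2 = math.sqrt(sum(e * e for e in equal) / len(equal))
print(mae2, rmse2)  # both 2.0
```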
**True.** RMSE squares the errors before averaging and then takes the square root, which gives extra weight to large errors. By the mathematical property that the root-mean-square of a set of numbers is always greater than or equal to the arithmetic mean of their absolute values (a consequence of Jensen's inequality), RMSE ≥ MAE. They are equal only when all errors have the same magnitude.

Section 3: Short Answer (3 questions, 6 points each)
Question 15. Explain the precision-recall trade-off using a concrete example. Describe a scenario where you would deliberately sacrifice precision to gain recall, and explain why.
Answer
Consider screening blood donations for hepatitis. Precision = "of the donations we flag as contaminated, how many actually are?" Recall = "of all contaminated donations, how many do we catch?" You would sacrifice precision to gain recall here — flag more donations as suspicious, even if many turn out to be safe, because the cost of a false negative (contaminated blood reaching a patient) is catastrophic, while the cost of a false positive (discarding a safe donation) is just a wasted unit of blood. Concretely: lowering the classification threshold from 0.5 to 0.2 might increase recall from 0.90 to 0.99 (catching almost all contaminated blood) while decreasing precision from 0.80 to 0.50 (half the flagged donations are actually safe). That's a worthwhile trade-off because the consequences of missing contamination far outweigh the cost of extra testing.

Question 16. What is the difference between using a single train/test split and k-fold cross-validation for model evaluation? When is a single split acceptable, and when is cross-validation essential?
Answer
A single train/test split gives one estimate of model performance that depends on which specific samples landed in each set — a different random seed produces a different score. K-fold cross-validation averages over k different splits, giving a more stable and reliable estimate. The standard deviation across folds also tells you how much performance varies. A single split is acceptable when you have a very large dataset (tens of thousands of samples or more) where any reasonable split gives similar results, or when computational cost makes cross-validation impractical. Cross-validation is essential when the dataset is small or medium (hundreds to low thousands), when you're comparing models that might have similar performance, or when you need to report reliable performance estimates (e.g., in a paper or stakeholder presentation).

Question 17. A colleague shows you two models: Model A has accuracy = 0.88 and F1 = 0.45. Model B has accuracy = 0.82 and F1 = 0.72. Which model would you recommend, and what does the discrepancy between accuracy and F1 tell you about the dataset?
Answer
Model B is almost certainly better, despite lower accuracy. The discrepancy between accuracy and F1 tells you the dataset has imbalanced classes. Model A's high accuracy (0.88) likely comes from correctly predicting the majority class, while its low F1 (0.45) reveals poor performance on the minority class. Model B sacrifices some overall accuracy to be much better at handling both classes (F1 = 0.72). The fact that accuracy and F1 disagree is itself diagnostic: it signals class imbalance. In balanced datasets, accuracy and F1 tend to track each other closely. When they diverge, it means the model is doing well on one class and poorly on another — and accuracy hides this by weighting all correct predictions equally.

Section 4: Applied Scenarios (2 questions, 5 points each)
Question 18. You've built a model to predict whether students will drop out of an online course. The confusion matrix is:
| | Predicted: Stay | Predicted: Drop |
|---|---|---|
| Actual: Stay | 420 | 80 |
| Actual: Drop | 30 | 70 |
Calculate accuracy, precision (for the "Drop" class), recall (for the "Drop" class), and F1 (for the "Drop" class). Then answer: if the university can only afford to send intervention emails to 150 students per semester, is this model's performance acceptable? What would you recommend?
Answer
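Before working through the numbers by hand, a runnable sketch of the four metrics computed from the question's confusion matrix:

```python
# Confusion matrix from the question (positive class = "Drop")
tp, fp = 70, 80   # predicted Drop: correctly / incorrectly
fn, tn = 30, 420  # predicted Stay: missed dropouts / correct stays

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3))   # 0.817
print(round(precision, 3))  # 0.467
print(round(recall, 3))     # 0.7
print(round(f1, 3))         # 0.56
```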
- TP = 70 (correctly predicted drop), FP = 80 (predicted drop, actually stayed), FN = 30 (predicted stay, actually dropped), TN = 420 (correctly predicted stay)
- Accuracy = (420 + 70) / 600 = 490/600 = 0.817
- Precision (Drop) = 70 / (70 + 80) = 70/150 = 0.467
- Recall (Drop) = 70 / (70 + 30) = 70/100 = 0.700
- F1 (Drop) = 2 × (0.467 × 0.700) / (0.467 + 0.700) = 0.654/1.167 = 0.560

The model flags 150 students as potential dropouts (70 TP + 80 FP), which exactly matches the university's intervention capacity. Precision of 0.467 means that of the 150 emails sent, about 70 go to students who actually need help and 80 go to students who would have stayed anyway. Recall of 0.700 means 30 students who will drop out don't receive intervention.

Recommendation: The model is a reasonable starting point — it catches 70% of dropouts and happens to produce exactly the number of interventions the university can handle. However, raising the threshold could improve precision (fewer wasted emails) at the cost of recall. The university should decide: is it better to reach fewer students with higher confidence, or more students with more wasted effort?

Question 19. You're comparing three models for a customer churn prediction problem:
| Model | Accuracy | Precision | Recall | F1 | AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.85 | 0.62 | 0.58 | 0.60 | 0.82 |
| Decision Tree | 0.83 | 0.55 | 0.71 | 0.62 | 0.79 |
| Random Forest | 0.87 | 0.68 | 0.64 | 0.66 | 0.86 |
Which model would you recommend and why? Consider that the company wants to proactively contact customers who might churn to offer them retention deals (each contact costs $50, and each retained customer is worth $500).
Answer
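One way to make the trade-off concrete is an expected-value sketch. The assumptions here are hypothetical and deliberately simple: every contacted true churner is retained, and there are 100 churners in the scored population.

```python
# Hypothetical economics from the question: each contact costs $50,
# each retained customer is worth $500.
value, cost, churners = 500, 50, 100

def net_value(precision, recall):
    # Churners reached, and total contacts implied by this precision
    reached = recall * churners
    contacts = reached / precision
    return reached * value - contacts * cost

print(net_value(0.68, 0.64))  # Random Forest
print(net_value(0.55, 0.71))  # Decision Tree: higher despite lower precision
```

Under these toy assumptions, the higher-recall model comes out ahead, which matches the intuition that a $500 retained customer easily pays for several $50 false-alarm contacts.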
The Random Forest is the best choice overall: it has the highest accuracy, precision, F1, and AUC. Its recall (0.64) is moderate — it catches 64% of churners. However, the Decision Tree deserves consideration because it has the highest recall (0.71). Given the economics — retaining a customer is worth $500 while each contact costs $50 — the value of catching a churner ($500) is 10x the cost of a false alarm ($50). This asymmetry favors recall over precision. The Decision Tree catches more churners (recall = 0.71 vs. 0.64 for Random Forest), and the extra false positives are cheap ($50 each). A sophisticated recommendation: deploy the Random Forest but lower its classification threshold to increase recall toward 0.71+, even at the cost of precision. Or better yet, use the Random Forest's predicted probabilities to rank all customers by churn risk and contact the top N, where N is determined by the budget.

Section 5: Code Analysis (1 question, 6 points)
Question 20. This code attempts to do cross-validation but contains a subtle methodological error. Identify the problem and explain why it produces overly optimistic results.
```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Scale the ENTIRE dataset first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Then cross-validate
model = LogisticRegression()
scores = cross_val_score(model, X_scaled, y, cv=5)
print(f"Mean accuracy: {scores.mean():.3f}")
```
Answer
The problem is **data leakage**. The `StandardScaler` is fit on the entire dataset (`X`) before cross-validation begins. This means the scaler learns the mean and standard deviation from *all* samples, including those that will later be used as test samples in each fold. In each cross-validation fold, the test samples' scaling is influenced by knowledge of the test data itself (their values contributed to the overall mean and standard deviation). This gives the model a subtle advantage it wouldn't have in production, where it would only know the training data's statistics. The correct approach is to scale inside each fold, so the scaler only sees training data:

```python
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
scores = cross_val_score(pipe, X, y, cv=5)
```

Using a `Pipeline` ensures that scaling is fit only on training data within each fold and then applied to the test data — preventing leakage. This is exactly what Chapter 30 is about.