Quiz: Chapter 17

Class Imbalance and Cost-Sensitive Learning


Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.


Question 1 (Multiple Choice)

A churn prediction dataset has 50,000 subscribers, of which 4,100 churned (8.2%). A model that predicts "no churn" for every subscriber achieves:

  • A) 0% accuracy
  • B) 8.2% accuracy
  • C) 50% accuracy
  • D) 91.8% accuracy

Answer: D) 91.8% accuracy. If the model predicts the majority class (no churn) for every example, it correctly classifies all 45,900 non-churners and misclassifies all 4,100 churners. Accuracy = 45,900 / 50,000 = 91.8%. This demonstrates why accuracy is meaningless for imbalanced problems --- a model with zero predictive value still appears to perform well.
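A quick sketch, using the counts from the question, confirms the arithmetic:

```python
# Majority-class baseline: predict "no churn" for every subscriber.
n_total = 50_000
n_churned = 4_100
n_stayed = n_total - n_churned  # 45,900 non-churners

# Every non-churner is classified correctly; every churner is missed.
baseline_accuracy = n_stayed / n_total
print(baseline_accuracy)  # 0.918
```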


Question 2 (Multiple Choice)

SMOTE generates synthetic minority-class examples by:

  • A) Duplicating random minority examples
  • B) Adding Gaussian noise to minority examples
  • C) Interpolating between a minority example and one of its k nearest minority-class neighbors
  • D) Randomly generating points within the bounding box of the minority class

Answer: C) Interpolating between a minority example and one of its k nearest minority-class neighbors. SMOTE picks a minority example, finds its k nearest neighbors among other minority examples (default k=5), randomly selects one neighbor, and creates a new point at a random position on the line segment between the original and the neighbor. This produces unique synthetic examples that lie in the same region of feature space as the real minority class.
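The interpolation step can be sketched in a few lines of NumPy. This is a toy illustration of the core idea, not the imbalanced-learn implementation (which also handles neighbor search and sampling ratios):

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_point(x, minority_neighbors, rng):
    """Synthesize one point on the segment between x and a random neighbor."""
    neighbor = minority_neighbors[rng.integers(len(minority_neighbors))]
    lam = rng.random()  # random position along the segment, in [0, 1)
    return x + lam * (neighbor - x)

x = np.array([1.0, 2.0])                                    # a minority example
neighbors = np.array([[1.5, 2.5], [0.5, 1.0], [2.0, 2.0]])  # its k=3 minority neighbors
synthetic = smote_point(x, neighbors, rng)
```

Because the new point is a convex combination of two real minority examples, it stays inside the region of feature space the minority class already occupies.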


Question 3 (Short Answer)

A colleague applies SMOTE to the full training set before running 5-fold cross-validation. Why does this produce an inflated performance estimate?

Answer: When SMOTE is applied before splitting into folds, synthetic examples generated from real examples in the training portion may be nearly identical to real examples that land in the validation fold. This is because SMOTE interpolates between neighbors --- if point A is in the training fold and point B is in the validation fold, a synthetic point created from A and B will be very similar to B. The model effectively "sees" validation data during training, which is a form of data leakage that inflates performance estimates.
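The correct ordering can be sketched as follows, with naive duplication standing in for SMOTE (the leakage argument is identical): split into folds first, then oversample only the training portion of each fold.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (np.arange(100) % 10 == 0).astype(int)  # 10% positives, spread evenly
fold_ids = np.arange(100) // 20             # 5 contiguous folds

for k in range(5):
    train = fold_ids != k
    X_tr, y_tr = X[train], y[train]

    # Oversample positives INSIDE the training folds only; fold k is untouched.
    pos = np.where(y_tr == 1)[0]
    extra = rng.choice(pos, size=(y_tr == 0).sum() - len(pos), replace=True)
    X_bal = np.vstack([X_tr, X_tr[extra]])
    y_bal = np.concatenate([y_tr, y_tr[extra]])
    # fit model on (X_bal, y_bal); evaluate on the untouched fold k
```

With real SMOTE, the same discipline is usually enforced by putting the oversampler and the model in a single pipeline that is fit per fold, so no synthetic point is ever derived from a validation example.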


Question 4 (Multiple Choice)

For a manufacturing failure prediction problem where a missed failure costs $500,000 and a false alarm costs $5,000, the break-even precision is:

  • A) 0.01 (1%)
  • B) 0.10 (10%)
  • C) 0.50 (50%)
  • D) 0.99 (99%)

Answer: A) 0.01 (1%). Break-even precision = FP_cost / (FP_cost + FN_cost) = $5,000 / ($5,000 + $500,000) = 0.0099, approximately 1%. Any model with precision above 1% saves money on average, because the cost of one missed failure ($500K) outweighs the cost of 99 false alarms (99 x $5K = $495K).
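The computation, sketched:

```python
# Break-even precision = FP_cost / (FP_cost + FN_cost)
fp_cost, fn_cost = 5_000, 500_000
break_even = fp_cost / (fp_cost + fn_cost)
print(round(break_even, 4))  # 0.0099

# Sanity check at ~1% precision:
# one caught failure ($500K) outweighs 99 false alarms ($495K).
assert fn_cost > 99 * fp_cost
```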


Question 5 (Multiple Choice)

Which technique does NOT require retraining the model?

  • A) SMOTE
  • B) class_weight='balanced'
  • C) Threshold tuning on the precision-recall curve
  • D) Random undersampling

Answer: C) Threshold tuning on the precision-recall curve. Threshold tuning uses the model's existing predicted probabilities and simply changes the decision boundary. All other options change the training process: SMOTE and undersampling change the training data, and class_weight changes the loss function.
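A minimal sketch of the distinction, using toy probabilities (no model is retrained; only the cutoff on the same scores moves):

```python
import numpy as np

proba = np.array([0.02, 0.10, 0.30, 0.55, 0.80])  # existing model scores
y_true = np.array([0, 1, 0, 1, 1])

# Threshold tuning just moves the decision boundary over fixed scores.
default_preds = (proba >= 0.50).astype(int)  # misses the 0.10 positive
tuned_preds = (proba >= 0.05).astype(int)    # catches it, at a precision cost

recall_default = (default_preds & y_true).sum() / y_true.sum()
recall_tuned = (tuned_preds & y_true).sum() / y_true.sum()
```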


Question 6 (Short Answer)

A model achieves AUC-PR of 0.45 on a dataset with 8% positive rate. Is this good or bad? How do you know?

Answer: An AUC-PR of 0.45 is moderately good for this imbalance level. The baseline AUC-PR for a random classifier equals the positive rate, which is 0.08. An AUC-PR of 0.45 is about 5.6 times the random baseline, indicating the model has learned meaningful patterns. However, it also means there is substantial room for improvement --- a perfect classifier would have AUC-PR of 1.0. For comparison, AUC-ROC on the same data might report 0.85+, which would look much more impressive but would be less informative about the minority class.
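The comparison against the random baseline, as a quick check:

```python
# For AUC-PR, a random classifier's baseline equals the positive rate.
positive_rate = 0.08
auc_pr = 0.45
lift_over_random = auc_pr / positive_rate
print(round(lift_over_random, 1))  # 5.6
```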


Question 7 (Multiple Choice)

StreamFlow's churn scenario has a false-negative cost of $180 and a false-positive cost of $5. The profit-optimal classification threshold is likely:
StreamFlow's churn scenario has a false-negative cost of $180 and a false-positive cost of $5. The profit-optimal classification threshold is likely:

  • A) Close to 0.50 (the default)
  • B) Higher than 0.50
  • C) Much lower than 0.50
  • D) Exactly equal to the positive rate (0.082)

Answer: C) Much lower than 0.50. When false negatives are much more expensive than false positives (36:1 cost ratio), the optimal threshold shifts far below 0.50 to prioritize recall over precision. The model should flag any subscriber with even a modest probability of churning, because the cost of missing a churner ($180) far exceeds the cost of a wasted retention offer ($5). In practice, the optimal threshold is often in the 0.03-0.05 range for this cost structure.


Question 8 (Short Answer)

Explain why class_weight='balanced' in scikit-learn is equivalent to a specific form of cost-sensitive learning. What cost ratio does it implicitly assume?

Answer: class_weight='balanced' sets the weight for each class inversely proportional to its frequency: weight = n_samples / (n_classes * n_samples_in_class). For a dataset with 8.2% positive rate, this gives positive examples about 11x the weight of negative examples. This implicitly assumes that the cost ratio (FN:FP) equals the imbalance ratio (11:1). If the true business cost ratio differs --- for example, 36:1 for StreamFlow --- then class_weight='balanced' is either too aggressive or not aggressive enough.
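The implied cost ratio can be computed directly, here using the churn counts from Question 1:

```python
# scikit-learn's 'balanced' heuristic: n_samples / (n_classes * class_count)
n_samples, n_classes = 50_000, 2
n_pos, n_neg = 4_100, 45_900  # 8.2% positive rate

w_pos = n_samples / (n_classes * n_pos)  # ~6.10
w_neg = n_samples / (n_classes * n_neg)  # ~0.54
print(round(w_pos / w_neg, 1))  # 11.2 -- the implicitly assumed FN:FP cost ratio
```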


Question 9 (Multiple Choice)

For which model type does SMOTE provide the least benefit compared to class_weight?

  • A) Logistic Regression
  • B) k-Nearest Neighbors
  • C) Support Vector Machine with RBF kernel
  • D) Gradient Boosted Trees

Answer: D) Gradient Boosted Trees. Decision trees split on feature thresholds, and the optimal split location is determined by impurity reduction. SMOTE creates points along line segments between neighbors, which is geometrically meaningful for distance-based models (k-NN, SVM) and linear models (logistic regression), but tree splits only care about axis-aligned boundaries. Class weights achieve a similar effect for trees by adjusting impurity gain calculations, without the overhead and potential noise of synthetic data generation.


Question 10 (Short Answer)

A colleague says "I used random undersampling and my recall went from 0.40 to 0.85, so it clearly works." What important information is missing from this claim, and what metric would you check?

Answer: The claim is missing precision. Random undersampling typically increases recall at the expense of precision, because the model sees a more balanced training set and becomes more liberal in predicting the positive class. The colleague should report precision, F1, and AUC-PR alongside recall. If precision dropped from 0.60 to 0.08, the model may be producing an unacceptable number of false positives. The most informative check is AUC-PR, which measures ranking quality independent of any threshold and reveals whether the model is actually better at distinguishing the classes or just predicting "positive" more often.
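Hypothetical confusion counts (assumed here for illustration, with 100 actual positives in the validation set) show how the colleague's recall gain can hide a precision collapse:

```python
def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)

# Assumed counts before and after random undersampling.
p_before, r_before = precision_recall(tp=40, fp=27, fn=60)
p_after, r_after = precision_recall(tp=85, fp=978, fn=15)

print(round(r_before, 2), round(r_after, 2))  # 0.4 0.85
print(round(p_before, 2), round(p_after, 2))  # 0.6 0.08
```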


Question 11 (Multiple Choice)

When tuning the classification threshold, you should optimize on:

  • A) The training set
  • B) The test set
  • C) A held-out validation set
  • D) The full dataset

Answer: C) A held-out validation set. Optimizing the threshold on the training set leads to overfitting (the threshold is tuned to noise in the training data). Optimizing on the test set violates the purpose of the test set as an unbiased estimate of generalization performance. A held-out validation set, or the out-of-fold predictions from cross-validation, provides an honest estimate of how the threshold will perform on unseen data.
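A sketch of cost-based threshold selection on held-out predictions (toy probability values; the costs are the StreamFlow figures from Question 7):

```python
import numpy as np

fn_cost, fp_cost = 180, 5  # StreamFlow-style costs

# Held-out validation (or out-of-fold) predictions -- toy values here.
val_proba = np.array([0.02, 0.04, 0.07, 0.20, 0.60, 0.90])
val_y = np.array([0, 1, 0, 1, 1, 1])

def total_cost(threshold):
    preds = val_proba >= threshold
    fn = np.sum(~preds & (val_y == 1))
    fp = np.sum(preds & (val_y == 0))
    return fn * fn_cost + fp * fp_cost

grid = np.arange(0.01, 1.0, 0.01)
best_threshold = min(grid, key=total_cost)  # lands far below 0.50 for these costs
```

The chosen threshold is then frozen and reported once against the untouched test set.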


Question 12 (Short Answer)

A dataset has an imbalance ratio of 232:1 (equipment failure prediction). Would you recommend SMOTE or threshold tuning as the primary strategy? Justify in 3-4 sentences.

Answer: Threshold tuning should be the primary strategy. At 232:1 imbalance, SMOTE would need to generate roughly 230 synthetic examples for every real failure, creating a training set dominated by interpolated data that may not represent actual failure modes. The resulting synthetic examples risk introducing noise, especially if the minority class is not well-clustered in feature space. Threshold tuning, by contrast, works with the model's existing probability estimates and directly optimizes for the extreme cost asymmetry that typically accompanies such severe imbalance. Cost-weighted training (via sample_weight) is a reasonable complement to threshold tuning.


Question 13 (Multiple Choice)

The break-even precision formula is:

  • A) FN_cost / FP_cost
  • B) FP_cost / FN_cost
  • C) FP_cost / (FP_cost + FN_cost)
  • D) FN_cost / (FP_cost + FN_cost)

Answer: C) FP_cost / (FP_cost + FN_cost). Break-even precision is the minimum precision at which the model's predictions add value. At this precision, the expected cost of acting on a positive prediction equals the expected benefit. For StreamFlow: $5 / ($5 + $180) = 0.027. Any precision above 2.7% means the retention offers are profitable on average.
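Why the formula has this form: at the break-even precision, the expected waste and expected saving per flagged case cancel (assuming, as the scenario does, that acting on a true positive recovers the full FN cost). A quick check with the StreamFlow numbers:

```python
fp_cost, fn_cost = 5, 180
p_star = fp_cost / (fp_cost + fn_cost)

# Per flagged case: waste if it's a false alarm, saving if it's a true churner.
expected_waste = (1 - p_star) * fp_cost
expected_saving = p_star * fn_cost
print(round(p_star, 3))  # 0.027
```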


Question 14 (Short Answer)

Hospital readmission rates are 15% for Group A and 28% for Group C. If you use a single classification threshold of 0.20 for both groups, which group will have higher recall? Explain why.

Answer: Group C will have higher recall. If the model is well-calibrated, patients in Group C (28% base rate) will tend to have higher predicted probabilities than patients in Group A (15% base rate), because they genuinely have higher readmission risk. A threshold of 0.20 will therefore capture a larger fraction of Group C's actual positives. This means the model catches more readmissions in the group with higher base rate while underserving the group with lower base rate --- which may or may not be the desired behavior depending on the intervention goals.


Question 15 (Multiple Choice)

You are comparing four imbalance-handling strategies. Which result ordering (by expected profit) is most likely for a problem with 8% positive rate and 36:1 FN:FP cost ratio?

  • A) SMOTE > threshold tuning > class_weight > default
  • B) Threshold tuning > custom cost weights > SMOTE > default
  • C) Default > class_weight > SMOTE > threshold tuning
  • D) Class_weight > SMOTE > threshold tuning > default

Answer: B) Threshold tuning > custom cost weights > SMOTE > default. When the cost ratio (36:1) significantly exceeds the imbalance ratio (11:1), threshold tuning directly optimizes for the cost structure and typically produces the highest expected profit. Custom cost weights encode the cost ratio into the loss function, making them the second-best option. SMOTE uses the imbalance ratio implicitly and does not account for the cost asymmetry. The default model at threshold 0.50 treats both error types equally, which is worst for this cost structure.


This quiz covers Chapter 17: Class Imbalance and Cost-Sensitive Learning. Return to the chapter for review.