Exercises: Chapter 13
Tree-Based Methods
Exercise 1: Gini Impurity by Hand (Conceptual)
A node in a decision tree contains 100 samples: 70 belong to class A and 30 belong to class B.
a) Calculate the Gini impurity of this node. Show your work.
b) The tree considers splitting on a feature temperature. The split produces:
- Left child: 50 samples (45 class A, 5 class B)
- Right child: 50 samples (25 class A, 25 class B)
Calculate the Gini impurity of each child node and the weighted average impurity after the split.
c) Calculate the impurity decrease (sometimes called the Gini gain) of this split.
d) A different split on the feature pressure produces:
- Left child: 60 samples (55 class A, 5 class B)
- Right child: 40 samples (15 class A, 25 class B)
Calculate the impurity decrease of this split. Which split is better: temperature or pressure?
e) Repeat parts (b) through (d) using entropy instead of Gini impurity. Does the ranking of splits change?
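To check your hand calculations, a small helper sketch (the function names here are illustrative, not from the chapter):

```python
import numpy as np

def gini(counts):
    """Gini impurity 1 - sum(p_k^2) for a node with the given class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Entropy -sum(p_k * log2(p_k)), treating 0 * log2(0) as 0."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def weighted_impurity(children, impurity=gini):
    """Size-weighted average impurity over a list of child-node class counts."""
    sizes = [sum(c) for c in children]
    total = sum(sizes)
    return sum(n / total * impurity(c) for n, c in zip(sizes, children))

# Parent node from part (a): 70 class A, 30 class B
print(round(gini([70, 30]), 4))
# Temperature split from part (b)
print(round(weighted_impurity([[45, 5], [25, 25]]), 4))
```

The gain of a split is then the parent impurity minus the weighted child impurity.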
Exercise 2: Overfitting Demonstration (Code)
Using the breast cancer dataset from scikit-learn, demonstrate overfitting in decision trees.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
a) Train an unrestricted DecisionTreeClassifier(random_state=42). Report training accuracy, test accuracy, tree depth, and number of leaves.
b) Train trees with max_depth values of 1, 2, 3, 4, 5, 7, 10, 15, and 20. For each, report training accuracy and 5-fold cross-validation accuracy.
c) At which depth does the cross-validation accuracy peak? What is the gap between training and CV accuracy at that depth versus the unrestricted tree?
d) Use cost_complexity_pruning_path on the unrestricted tree and find the optimal ccp_alpha via 5-fold cross-validation. How many leaves does the optimal pruned tree have?
e) Compare the test accuracy of: (i) the unrestricted tree, (ii) the tree with optimal max_depth, and (iii) the cost-complexity-pruned tree. Which performs best?
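Part (d) is the fiddliest step; one way to wire up cost-complexity pruning (a sketch, assuming the train/test split above):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Candidate alphas from the pruning path of the unrestricted tree;
# the last alpha prunes everything down to a single leaf, so drop it.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
alphas = path.ccp_alphas[:-1]

# 5-fold CV accuracy for each candidate alpha
cv_scores = [
    cross_val_score(
        DecisionTreeClassifier(random_state=42, ccp_alpha=a), X_train, y_train, cv=5
    ).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(cv_scores))]
pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
pruned.fit(X_train, y_train)
print(best_alpha, pruned.get_n_leaves(), pruned.score(X_test, y_test))
```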
Exercise 3: Random Forest Hyperparameters (Code)
Using the breast cancer dataset from Exercise 2:
a) Train a RandomForestClassifier with n_estimators=200, random_state=42, oob_score=True. Report OOB accuracy and test accuracy.
b) Vary n_estimators over [10, 25, 50, 100, 200, 500]. Plot OOB accuracy versus number of trees. At what point does performance plateau?
c) Vary max_features over [2, 4, 'sqrt', 'log2', 10, 20, None]. Which value produces the best OOB accuracy? How does it compare to the default 'sqrt', which considers floor(sqrt(30)) = 5 features at each split?
d) Vary min_samples_leaf over [1, 2, 5, 10, 25, 50]. What is the relationship between min_samples_leaf and the average number of leaves per tree?
e) Using the best combination from (b), (c), and (d), report the final test accuracy. How much did tuning improve over the defaults?
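A sketch of the OOB loop for part (b); parts (c) and (d) follow the same pattern with a different hyperparameter:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

oob_by_size = {}
for n in [10, 25, 50, 100, 200, 500]:
    rf = RandomForestClassifier(
        n_estimators=n, oob_score=True, random_state=42, n_jobs=-1
    )
    rf.fit(X_train, y_train)
    oob_by_size[n] = rf.oob_score_  # accuracy on out-of-bag predictions

for n, score in oob_by_size.items():
    print(n, round(score, 4))
```

Note that at very small n_estimators, some training rows may never be out of bag, and scikit-learn will warn that the OOB estimate is unreliable.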
Exercise 4: Feature Importance Comparison (Code)
Using the same breast cancer data and your tuned Random Forest from Exercise 3:
a) Compute impurity-based (MDI) feature importance. List the top 5 features.
b) Compute permutation importance on the test set with n_repeats=20. List the top 5 features.
c) Do the top 5 lists agree? For any features that appear in one list but not the other, explain why the ranking might differ.
d) The breast cancer dataset has 30 features, many of which are correlated (e.g., mean radius, mean perimeter, and mean area all measure size). How does correlation between features affect each importance method differently?
e) Train a logistic regression on the scaled breast cancer data. Use the absolute values of the coefficients as a measure of feature importance. Compare the top 5 features from logistic regression to the top 5 from permutation importance. Do they agree?
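A starting point for parts (a) and (b); the untuned forest below is a stand-in for your tuned model from Exercise 3:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

# MDI importance: total impurity decrease per feature, computed at fit time
mdi_top5 = np.argsort(rf.feature_importances_)[::-1][:5]

# Permutation importance: drop in test-set score when a feature is shuffled
perm = permutation_importance(
    rf, X_test, y_test, n_repeats=20, random_state=42, n_jobs=-1
)
perm_top5 = np.argsort(perm.importances_mean)[::-1][:5]

print("MDI: ", [data.feature_names[i] for i in mdi_top5])
print("Perm:", [data.feature_names[i] for i in perm_top5])
```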
Exercise 5: Scaling Invariance Proof (Code)
Demonstrate empirically that decision trees are invariant to strictly monotonic feature transformations.
a) Train a DecisionTreeClassifier(max_depth=5, random_state=42) on the breast cancer data. Record all predictions on the test set.
b) Apply StandardScaler to the data and train the same tree. Are the test predictions identical?
c) Apply np.log1p() to all features (since they are non-negative) and train the same tree. Are the test predictions identical?
d) Apply np.square() to all features (strictly increasing here because the features are non-negative) and train the same tree. Are the test predictions identical?
e) Now repeat parts (a) through (d) with LogisticRegression(random_state=42, max_iter=10000). Are the predictions identical across transformations? Explain why trees are invariant to these transformations but logistic regression is not.
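A sketch for parts (a) through (d); each transformation is applied identically to the train and test sets:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

def tree_preds(transform):
    """Fit the same tree on transformed data; return its test predictions."""
    clf = DecisionTreeClassifier(max_depth=5, random_state=42)
    clf.fit(transform(X_train), y_train)
    return clf.predict(transform(X_test))

baseline = tree_preds(lambda A: A)
transforms = {
    "standardized": StandardScaler().fit(X_train).transform,
    "log1p": np.log1p,
    "square": np.square,  # strictly increasing here: all features >= 0
}
for name, f in transforms.items():
    print(name, np.array_equal(tree_preds(f), baseline))
```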
Exercise 6: Out-of-Bag Error Analysis (Code)
a) Train a RandomForestClassifier(n_estimators=500, oob_score=True, random_state=42) on the breast cancer training set. Report the OOB accuracy.
b) Run 5-fold cross-validation on the same training set with the same hyperparameters. Report the mean CV accuracy.
c) Compare the OOB accuracy to the CV accuracy. Are they close? Which one is typically the more pessimistic estimate?
d) Access the oob_decision_function_ attribute. This gives the OOB predicted probabilities for each training sample. Calculate the OOB AUC-ROC.
e) For each training sample, determine how many trees included it in their bootstrap sample and how many left it out. What is the average number of trees that exclude each sample? (Hint: a bootstrap sample of size n excludes any given row with probability (1 - 1/n)^n ≈ e^(-1) ≈ 0.368, so expect roughly 0.368 * n_estimators.)
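For part (e), the theoretical value can be checked with a plain NumPy simulation rather than digging into scikit-learn's internal bootstrap bookkeeping (the sample counts below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_trees = 455, 500  # roughly the training-set size, 500 trees

# For each simulated tree, draw a bootstrap sample and count which rows it misses
excluded = np.zeros(n_samples)
for _ in range(n_trees):
    idx = rng.integers(0, n_samples, size=n_samples)  # n draws with replacement
    in_bag = np.zeros(n_samples, dtype=bool)
    in_bag[idx] = True
    excluded += ~in_bag

# Fraction of trees excluding an average row; compare to (1 - 1/n)^n ~ 0.368
print(round(excluded.mean() / n_trees, 4))
```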
Exercise 7: Random Forest vs. Single Tree Stability (Code)
This exercise demonstrates why ensembles are more stable than single trees.
a) Create 20 bootstrap samples from the breast cancer training data (use resample with random_state=0 through random_state=19). For each bootstrap sample, train a DecisionTreeClassifier(max_depth=5, random_state=42) and record its test accuracy.
b) Calculate the mean and standard deviation of the 20 test accuracies.
c) Repeat part (a) but train a RandomForestClassifier(n_estimators=200, max_depth=5, random_state=42) on each bootstrap sample.
d) Calculate the mean and standard deviation of the 20 Random Forest test accuracies.
e) Compare the standard deviations from (b) and (d). Which model has lower variance? By what factor? Explain why the ensemble's variance is lower despite being composed of the same base model.
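The bootstrap loop for parts (a) through (d) can be factored so the same code serves both models (a sketch; the helper name is illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

def bootstrap_accuracies(make_model, n_runs=20):
    """Test accuracy of a fresh model fit on each of n_runs bootstrap samples."""
    scores = []
    for seed in range(n_runs):
        Xb, yb = resample(X_train, y_train, random_state=seed)
        scores.append(make_model().fit(Xb, yb).score(X_test, y_test))
    return np.array(scores)

tree_scores = bootstrap_accuracies(
    lambda: DecisionTreeClassifier(max_depth=5, random_state=42)
)
forest_scores = bootstrap_accuracies(
    lambda: RandomForestClassifier(
        n_estimators=200, max_depth=5, random_state=42, n_jobs=-1
    )
)
print(tree_scores.mean(), tree_scores.std())
print(forest_scores.mean(), forest_scores.std())
```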
Exercise 8: Threshold Tuning for Imbalanced Churn (Code + Analysis)
The StreamFlow churn dataset has ~8.4% churn rate. The Random Forest's default threshold (0.5) may not be optimal.
a) Using the StreamFlow Random Forest from the chapter, compute predicted probabilities on the test set. Plot a histogram of predicted probabilities, coloring by true label. What do you notice about the distribution for churned vs. retained customers?
b) Vary the classification threshold from 0.1 to 0.9 in steps of 0.05. For each threshold, compute precision, recall, and F1. Plot all three on the same chart.
c) At what threshold is F1 maximized? How does it compare to the F1 at the default 0.5 threshold?
d) Suppose the business values recall more than precision (it is worse to miss a churning customer than to falsely flag a retained one). What threshold would you recommend? Justify your choice with a cost argument.
e) Compute the AUC-ROC. Does the AUC change when you change the threshold? Why or why not?
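The StreamFlow data is specific to the chapter, so the sketch below substitutes a synthetic ~8% positive-class problem from make_classification purely to show the threshold sweep for parts (b) and (c); swap in the chapter's test labels and predicted probabilities:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the StreamFlow churn data (~8% positives)
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.92], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
proba = rf.predict_proba(X_test)[:, 1]  # probability of the positive class

rows = []
for t in np.arange(0.1, 0.91, 0.05):
    pred = (proba >= t).astype(int)
    rows.append((
        round(float(t), 2),
        precision_score(y_test, pred, zero_division=0),
        recall_score(y_test, pred),
        f1_score(y_test, pred, zero_division=0),
    ))

best = max(rows, key=lambda r: r[3])  # threshold with the highest F1
print(best)
```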
Exercise 9: Building Intuition with a Toy Dataset (Conceptual + Code)
Create a 2D dataset where a decision tree outperforms logistic regression.
import numpy as np
np.random.seed(42)
n = 500
X = np.random.randn(n, 2)
# XOR-like pattern: class 1 if both features have the same sign
y = ((X[:, 0] > 0) == (X[:, 1] > 0)).astype(int)
a) Train a LogisticRegression and a DecisionTreeClassifier(max_depth=4). Report test accuracies.
b) Why does logistic regression fail on this dataset? (Hint: draw the decision boundary.)
c) Train a RandomForestClassifier(n_estimators=200, max_depth=4) on the same data. Is it better than the single tree?
d) Add 10 noise features (np.random.randn(n, 10)) to the dataset. Retrain all three models. How does each model handle the irrelevant features?
e) Explain, in your own words, why tree-based models naturally capture interaction effects like XOR without explicit feature engineering.
Exercise 10: Practical Pipeline (Code + Analysis)
Build a complete Random Forest pipeline from scratch using the wine quality dataset.
from sklearn.datasets import load_wine
X, y = load_wine(return_X_y=True)
a) Perform a 70/30 train-test split with stratify=y and random_state=42. Report class distributions.
b) Train a RandomForestClassifier with default hyperparameters. Report test accuracy and the classification report.
c) Use GridSearchCV with 5-fold cross-validation to tune n_estimators, max_depth, and min_samples_leaf. Report the best parameters and the best CV score.
d) Evaluate the tuned model on the test set. How much did tuning improve over the defaults?
e) Compute both impurity-based and permutation-based feature importance. Identify the top 3 features. Do they agree with domain knowledge about what distinguishes wine varieties?
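A skeleton for parts (a) through (d); the grid below is deliberately small and meant to be widened (the exact values are illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1
)
search.fit(X_train, y_train)

print(search.best_params_, round(search.best_score_, 3))
print(round(search.best_estimator_.score(X_test, y_test), 3))
```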
These exercises support Chapter 13: Tree-Based Methods. Return to the chapter for full context.