Exercises: Chapter 11

Linear Models Revisited


Exercise 1: Why OLS Fails (Conceptual)

A colleague shows you the following results from an unregularized linear regression on a dataset with 500 observations and 400 features:

Metric                         Value
Training R-squared             0.997
Test R-squared                 0.142
Max coefficient magnitude      847.3
Min coefficient magnitude      0.002
Condition number of X^T X      4.2 x 10^14

a) List three specific problems indicated by these results.

b) Which single number in the table is the clearest indicator that this model should not be deployed? Explain why.

c) Your colleague argues: "The training R-squared is nearly perfect, so the model has learned the relationships in the data." Write a two-sentence rebuttal.

d) If you added 100 more pure-noise features (random values with no relationship to the target), what would happen to the training R-squared? What would happen to the test R-squared?
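Each number in the table can be computed directly. Here is a minimal sketch on synthetic data of the same shape (500 observations, 400 features; the data itself is a stand-in, not the colleague's):

```python
import numpy as np
from numpy.linalg import cond
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in dataset: 500 observations, 400 features, only 10 carrying signal
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 400))
beta = np.zeros(400)
beta[:10] = rng.normal(size=10)
y = X @ beta + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

ols = LinearRegression().fit(X_train, y_train)
train_r2 = ols.score(X_train, y_train)
test_r2 = ols.score(X_test, y_test)
coef_abs = np.abs(ols.coef_)
condition_number = cond(X_train.T @ X_train)

print(f"Training R^2: {train_r2:.3f}")
print(f"Test R^2: {test_r2:.3f}")
print(f"Max |coef|: {coef_abs.max():.3f}, Min |coef|: {coef_abs.min():.6f}")
print(f"Condition number of X^T X: {condition_number:.2e}")
```

With 400 training rows and 400 features, OLS can interpolate the training data, so the diagnostics land in the same pathological regime as the table.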


Exercise 2: Ridge vs. Lasso (Conceptual)

For each scenario, recommend Ridge, Lasso, or Elastic Net. Justify your choice.

a) You have 50 features, and domain knowledge tells you that almost all of them are relevant. Several features are highly correlated (e.g., monthly_revenue, annual_revenue, and lifetime_revenue).

b) You have 500 features generated by one-hot encoding, interaction terms, and polynomial features. You suspect that fewer than 30 are truly predictive.

c) You have 100 features in groups of 5. Within each group, the features are strongly correlated (e.g., five different measures of "engagement"). You want to select important groups while keeping stability within groups.

d) You are building a model for a regulated industry where you must document every feature used. You need the final model to use fewer than 20 features.


Exercise 3: Feature Scaling Experiment (Coding)

Run the following code to create a dataset with features on different scales. Then complete the exercises below.

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score

np.random.seed(42)
n = 3000

df = pd.DataFrame({
    'salary': np.random.normal(75000, 25000, n),           # Range: ~0 - 150000
    'years_experience': np.random.uniform(0, 30, n),        # Range: 0 - 30
    'satisfaction_score': np.random.uniform(1, 5, n),        # Range: 1 - 5
    'commute_minutes': np.random.exponential(25, n),         # Range: 0 - ~150
    'has_stock_options': np.random.choice([0, 1], n),        # Range: 0 - 1
})

logit = (
    -1.5
    + 0.00003 * df['salary']
    + 0.08 * df['years_experience']
    + 0.4 * df['satisfaction_score']
    - 0.015 * df['commute_minutes']
    - 0.5 * df['has_stock_options']
    + np.random.normal(0, 0.8, n)
)
df['left_company'] = (1 / (1 + np.exp(-logit)) > 0.5).astype(int)

X = df.drop('left_company', axis=1)
y = df['left_company']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

a) Fit a LogisticRegression(C=1.0, penalty='l1', solver='saga', max_iter=5000, random_state=42) without scaling (you may see convergence warnings; that is part of the lesson). Print the coefficients. Which feature has the largest absolute coefficient? Which has the smallest?

b) Now fit the same model with StandardScaler. Print the coefficients. How did the ranking change?

c) Fit with MinMaxScaler instead. Print the coefficients. Are they the same as StandardScaler? Explain why or why not.

d) Compare the cross-validated AUC for all three approaches (no scaling, StandardScaler, MinMaxScaler). Which performs best?

e) Add an outlier: set the salary of the first observation to $10,000,000. Re-run all three approaches. Which scaler is most affected by the outlier?
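One way to structure parts (a)-(d) is to wrap the same L1 model in pipelines that differ only in their scaling step. A sketch (not the only possible layout; the cv=3 setting is a speed concession, use cv=5 for your actual answer):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import cross_val_score

# Regenerate the dataset from the setup above so the sketch is self-contained
np.random.seed(42)
n = 3000
df = pd.DataFrame({
    'salary': np.random.normal(75000, 25000, n),
    'years_experience': np.random.uniform(0, 30, n),
    'satisfaction_score': np.random.uniform(1, 5, n),
    'commute_minutes': np.random.exponential(25, n),
    'has_stock_options': np.random.choice([0, 1], n),
})
logit = (-1.5 + 0.00003 * df['salary'] + 0.08 * df['years_experience']
         + 0.4 * df['satisfaction_score'] - 0.015 * df['commute_minutes']
         - 0.5 * df['has_stock_options'] + np.random.normal(0, 0.8, n))
df['left_company'] = (1 / (1 + np.exp(-logit)) > 0.5).astype(int)
X, y = df.drop('left_company', axis=1), df['left_company']

results = {}
for name, scaler in [('none', None), ('standard', StandardScaler()),
                     ('minmax', MinMaxScaler())]:
    clf = LogisticRegression(C=1.0, penalty='l1', solver='saga',
                             max_iter=5000, random_state=42)
    steps = ([('scale', scaler)] if scaler is not None else []) + [('clf', clf)]
    # cv=3 keeps the sketch quick; expect convergence warnings without scaling
    results[name] = cross_val_score(Pipeline(steps), X, y,
                                    cv=3, scoring='roc_auc').mean()
    print(f"{name:>8}: CV AUC = {results[name]:.3f}")
```

For part (e), inject the outlier into `df` before building `X` and re-run the same loop.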


Exercise 4: Coefficient Path Interpretation (Coding + Conceptual)

Generate the following dataset and plot the Lasso coefficient path.

import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 1000

# 3 signal features, 7 noise features
x1 = np.random.randn(n)
x2 = np.random.randn(n)
x3 = np.random.randn(n)
noise = np.random.randn(n, 7)

y = 3 * x1 - 2 * x2 + 0.5 * x3 + np.random.randn(n) * 0.5
X = np.column_stack([x1, x2, x3, noise])

feature_names = ['x1_signal', 'x2_signal', 'x3_signal'] + [f'noise_{i}' for i in range(7)]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

alphas, coefs, _ = lasso_path(X_scaled, y, alphas=np.logspace(-3, 1, 100))

a) Plot the coefficient path (coefficients vs. log10(alpha)). Label each line with the feature name.

b) As alpha increases, at approximately what value have all of the noise-feature coefficients reached exactly zero?

c) At what alpha value do signal features start getting zeroed out?

d) Identify the alpha range where only the signal features have non-zero coefficients. This is the "sweet spot" for Lasso on this dataset.

e) The true coefficients are 3, -2, and 0.5. Do the Lasso coefficients at the optimal alpha recover these values exactly? If not, why not? (Hint: think about bias.)
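Parts (a)-(d) can be approached with a sketch like the following, which replays the setup above, saves the path plot, and tabulates the largest alpha at which each coefficient is still non-zero (matplotlib and the 'lasso_path.png' filename are choices made here, not requirements of the exercise):

```python
import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for saving the figure
import matplotlib.pyplot as plt

np.random.seed(42)
n = 1000
x1, x2, x3 = np.random.randn(n), np.random.randn(n), np.random.randn(n)
noise = np.random.randn(n, 7)
y = 3 * x1 - 2 * x2 + 0.5 * x3 + np.random.randn(n) * 0.5
X = np.column_stack([x1, x2, x3, noise])
feature_names = (['x1_signal', 'x2_signal', 'x3_signal']
                 + [f'noise_{i}' for i in range(7)])

X_scaled = StandardScaler().fit_transform(X)
alphas, coefs, _ = lasso_path(X_scaled, y, alphas=np.logspace(-3, 1, 100))

# Part (a): one line per feature, coefficients vs. log10(alpha)
fig, ax = plt.subplots(figsize=(8, 5))
for coef_row, name in zip(coefs, feature_names):
    ax.plot(np.log10(alphas), coef_row, label=name)
ax.set_xlabel('log10(alpha)')
ax.set_ylabel('coefficient')
ax.legend(fontsize=7)
fig.savefig('lasso_path.png')

# Parts (b)-(d): largest alpha at which each coefficient is still non-zero
last_alive = {name: (alphas[coef_row != 0].max()
                     if (coef_row != 0).any() else None)
              for coef_row, name in zip(coefs, feature_names)}
for name, a in last_alive.items():
    print(f"{name}: non-zero up to alpha = {a}")
```

The gap between the largest "survival" alpha among the noise features and the smallest among the signal features is the sweet spot asked for in part (d).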


Exercise 5: The C Parameter (Coding)

Using the StreamFlow-like dataset from the chapter, complete the following:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

# Use the dataset from the chapter or generate your own
# X_train, X_test, y_train, y_test should be defined

C_values = np.logspace(-4, 4, 50)

a) For each C value, fit a logistic regression with L1 penalty. Record: (1) cross-validated AUC, (2) number of non-zero coefficients, (3) training time.

b) Plot three subplots on one figure: AUC vs. log10(C), non-zero features vs. log10(C), and training time vs. log10(C).

c) Identify the C value that maximizes cross-validated AUC. How many features does the model use at that C?

d) Find the smallest C (strongest regularization) that achieves AUC within 0.01 of the maximum. How many features does this "parsimonious" model use?

e) A stakeholder says: "We want the most accurate model possible." Another says: "We want a model with fewer than 20 features." Can both be satisfied? Show the tradeoff with your plots.
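A scaffold for parts (a)-(d), with `make_classification` standing in for the chapter's dataset (an assumption; substitute your own X_train, y_train) and a coarser 10-point C grid for speed:

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in dataset -- replace with the chapter's StreamFlow data
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

C_values = np.logspace(-4, 4, 10)  # coarser grid than the exercise, for speed
records = []
for C in C_values:
    pipe = Pipeline([
        ('scale', StandardScaler()),
        ('clf', LogisticRegression(C=C, penalty='l1', solver='saga',
                                   max_iter=5000, random_state=42)),
    ])
    start = time.perf_counter()
    auc = cross_val_score(pipe, X_train, y_train, cv=5,
                          scoring='roc_auc').mean()
    elapsed = time.perf_counter() - start
    pipe.fit(X_train, y_train)
    n_nonzero = int(np.sum(pipe.named_steps['clf'].coef_ != 0))
    records.append({'C': C, 'auc': auc, 'n_nonzero': n_nonzero,
                    'time': elapsed})

# Part (c): the C that maximizes cross-validated AUC
best = max(records, key=lambda r: r['auc'])
print(f"Best C = {best['C']:.4g}, AUC = {best['auc']:.3f}, "
      f"features = {best['n_nonzero']}")

# Part (d): smallest C within 0.01 AUC of the maximum
parsimonious = min((r for r in records if r['auc'] >= best['auc'] - 0.01),
                   key=lambda r: r['C'])
print(f"Parsimonious C = {parsimonious['C']:.4g}, "
      f"features = {parsimonious['n_nonzero']}")
```

The `records` list then feeds directly into the three subplots of part (b).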


Exercise 6: Elastic Net Mixing (Coding)

Build three Elastic Net models with alpha=0.01 and l1_ratio values of 0.1, 0.5, and 0.9. Use the dataset from Exercise 3 or the chapter.

a) For each model, report: cross-validated R-squared, number of non-zero coefficients, and the coefficient for each feature.

b) Create a bar chart comparing the coefficient values across the three models. Put features on the x-axis and group the bars by l1_ratio.

c) Which l1_ratio produces the sparsest model? Which produces the most stable coefficients (smallest standard deviation across cross-validation folds)?

d) Explain in plain language: why does l1_ratio=0.1 keep more features than l1_ratio=0.9?
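A sketch for parts (a) and (c); `make_regression` stands in for your chosen dataset (an assumption), with `effective_rank` below the feature count to induce the correlated columns that make the l1_ratio differences visible:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Stand-in regression dataset with correlated features
X, y = make_regression(n_samples=1000, n_features=30, n_informative=8,
                       effective_rank=10, noise=5.0, random_state=42)
X = StandardScaler().fit_transform(X)

r2s, sparsity = {}, {}
for l1_ratio in (0.1, 0.5, 0.9):
    model = ElasticNet(alpha=0.01, l1_ratio=l1_ratio, max_iter=10000)
    r2s[l1_ratio] = cross_val_score(model, X, y, cv=5, scoring='r2').mean()
    model.fit(X, y)
    sparsity[l1_ratio] = int(np.sum(model.coef_ != 0))
    print(f"l1_ratio={l1_ratio}: CV R^2={r2s[l1_ratio]:.3f}, "
          f"non-zero coefficients={sparsity[l1_ratio]}")
```

Storing `model.coef_` for each l1_ratio gives you the per-feature values needed for the bar chart in part (b).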


Exercise 7: Logistic Regression for Metro General (Coding)

Using the Metro General readmission dataset from Case Study 2 (or a similar simulated dataset), build a logistic regression pipeline.

a) Train models with L1, L2, and Elastic Net penalties. Compare their AUC on a held-out test set.

b) For the L1 model, which features are zeroed out? Do any of the zeroed features surprise you?

c) Use LogisticRegressionCV to find the optimal C for the L2 model. What C does it select?

d) A hospital administrator asks: "Which patients should we prioritize for follow-up calls?" Generate a ranked list of the top 20 highest-risk patients in the test set. For each patient, show the risk score and the top 3 contributing risk factors (using coefficient * feature value).

e) Calculate the model's precision and recall at different thresholds (0.1, 0.2, 0.3, 0.4, 0.5). If the hospital can only make 200 follow-up calls per month (out of ~2,300 discharges), what threshold should they use?
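For part (d), one way to rank patients and surface contributing factors is coefficient times scaled feature value. A sketch on stand-in data (the `factor_i` names and `make_classification` dataset are placeholders for the real Metro General features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder features and data -- swap in the real readmission dataset
feature_names = [f'factor_{i}' for i in range(8)]
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=2000).fit(scaler.transform(X_train), y_train)

X_test_scaled = scaler.transform(X_test)
risk = clf.predict_proba(X_test_scaled)[:, 1]
contributions = X_test_scaled * clf.coef_[0]   # per-patient, per-feature

top20 = np.argsort(risk)[::-1][:20]            # highest-risk patients first
for idx in top20[:3]:                          # show 3 of the 20 for brevity
    top_factors = np.argsort(contributions[idx])[::-1][:3]
    factors = ', '.join(f"{feature_names[j]} ({contributions[idx, j]:+.2f})"
                        for j in top_factors)
    print(f"patient {idx}: risk={risk[idx]:.3f}; top factors: {factors}")
```

The same `risk` array, compared against each candidate threshold, drives the precision/recall table in part (e).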


Exercise 8: Multicollinearity Diagnosis (Coding + Conceptual)

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from numpy.linalg import cond

np.random.seed(42)
n = 500

# Create features with varying degrees of correlation
x1 = np.random.randn(n)
x2 = x1 + np.random.randn(n) * 0.5      # Moderately correlated with x1
x3 = x1 + np.random.randn(n) * 0.01     # Highly correlated with x1
x4 = np.random.randn(n)                   # Independent
x5 = np.random.randn(n)                   # Independent

y = 2 * x1 + 1 * x4 + np.random.randn(n) * 0.5
X = np.column_stack([x1, x2, x3, x4, x5])

a) Compute the correlation matrix of X. Which pairs have correlation above 0.9?

b) Compute the condition number of X^T X. The common rule of thumb flags cond(X) above roughly 30 as concerning; since cond(X^T X) = cond(X)^2, that corresponds to about 900 here. Is this matrix above or below that line?

c) Fit OLS (LinearRegression) 10 times on different random 80% subsets of the data. For each fit, record the coefficients. Compute the standard deviation of each coefficient across the 10 fits. Which coefficients are unstable?

d) Repeat part (c) with Ridge(alpha=1.0). How does Ridge affect the stability of the unstable coefficients?

e) Repeat with Ridge(alpha=100.0). What happens to the stable coefficients (x4, x5)? Is there such a thing as too much regularization?
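Parts (c)-(e) can share a helper that refits a model on random 80% subsets and reports the spread of each coefficient. A sketch:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Recreate the dataset from the setup above
np.random.seed(42)
n = 500
x1 = np.random.randn(n)
x2 = x1 + np.random.randn(n) * 0.5
x3 = x1 + np.random.randn(n) * 0.01
x4 = np.random.randn(n)
x5 = np.random.randn(n)
y = 2 * x1 + 1 * x4 + np.random.randn(n) * 0.5
X = np.column_stack([x1, x2, x3, x4, x5])

def coef_stability(model, X, y, n_fits=10, frac=0.8, seed=0):
    """Std of each coefficient across fits on random subsets of the rows."""
    rng = np.random.default_rng(seed)
    coefs = []
    for _ in range(n_fits):
        idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
        coefs.append(model.fit(X[idx], y[idx]).coef_.copy())
    return np.std(coefs, axis=0)

ols_std = coef_stability(LinearRegression(), X, y)
ridge_std = coef_stability(Ridge(alpha=1.0), X, y)
print("OLS coefficient std:  ", np.round(ols_std, 3))
print("Ridge coefficient std:", np.round(ridge_std, 3))
```

Calling the same helper with `Ridge(alpha=100.0)` answers part (e).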


Exercise 9: The Scaling-Regularization Interaction (Conceptual)

Explain each of the following without code. Each answer should be 3--5 sentences.

a) Why does regularization without scaling create a bias toward features with large numeric ranges?

b) A model has two features: age (range 18--90) and is_premium_member (range 0--1). Without scaling, which feature will regularization penalize more aggressively? Is this desirable?

c) After scaling with StandardScaler, a feature has mean 0 and standard deviation 1. What does a coefficient of 0.5 mean in practical terms?

d) You fit a model with StandardScaler in the pipeline and report coefficients. A colleague fits the same model without scaling and gets different coefficients. Both models have the same AUC. How is this possible?

e) You are using RobustScaler (which uses median and IQR instead of mean and std). Under what data conditions would this produce better results than StandardScaler?


Exercise 10: Production Pipeline (Coding)

Build a complete, production-grade logistic regression pipeline for a dataset of your choice (you may use the StreamFlow dataset, Metro General dataset, or any tabular classification dataset).

Your pipeline must include:

a) A ColumnTransformer that applies StandardScaler to numeric features and OneHotEncoder to categorical features.

b) LogisticRegressionCV with cross-validated C selection.

c) Evaluation on a held-out test set with all five standard metrics (accuracy, precision, recall, F1, AUC).

d) Serialization with joblib.dump() and a demonstration that the loaded model produces identical predictions.

e) A coefficient table showing feature names, coefficients, and odds ratios, sorted by absolute value.

Wrap the entire thing in a function that takes a DataFrame, a target column name, and lists of numeric and categorical columns as arguments. This function should be reusable for any tabular classification task.
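A minimal skeleton of the requested function, covering parts (a)-(c); parts (d) and (e) still need to be added, and the demo DataFrame and its column names are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_classifier(df, target, numeric_cols, categorical_cols):
    """Reusable pipeline skeleton: preprocessing + CV-tuned logistic regression."""
    X = df[numeric_cols + categorical_cols]
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)
    pre = ColumnTransformer([
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ])
    pipe = Pipeline([
        ('prep', pre),
        ('clf', LogisticRegressionCV(Cs=10, cv=5, scoring='roc_auc',
                                     max_iter=2000)),
    ])
    pipe.fit(X_train, y_train)
    auc = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
    return pipe, auc

# Tiny synthetic demo (placeholder column names)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'num_a': rng.normal(size=400),
    'num_b': rng.normal(size=400),
    'cat_a': rng.choice(['x', 'y', 'z'], size=400),
})
demo['target'] = ((demo['num_a'] + rng.normal(scale=0.5, size=400)) > 0).astype(int)
pipe, auc = build_classifier(demo, 'target', ['num_a', 'num_b'], ['cat_a'])
print(f"Held-out AUC: {auc:.3f}")
```

Extend the function with the remaining metrics, `joblib.dump()`, and the coefficient table to complete parts (c)-(e).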


Challenge Exercise: Model Selection Simulation (Synthesis)

Design a simulation to answer this question: Under what conditions does logistic regression with L1 outperform Ridge, and vice versa?

np.random.seed(42)

# Parameters to vary:
# - n: number of observations (100, 500, 2000)
# - p: number of features (10, 50, 200)
# - s: number of truly predictive features (5, 10, 30)
# - rho: correlation between features (0, 0.3, 0.8)
# - noise: noise level (0.5, 1.0, 2.0)

a) Generate datasets for all combinations of parameters above (3 x 3 x 3 x 3 x 3 = 243 scenarios). For each scenario, fit both Ridge and Lasso with cross-validated alpha selection. Record the test AUC for each.

b) Create a heatmap or summary table showing when Lasso wins vs. when Ridge wins.

c) Identify the parameter(s) that most strongly predict which method is better. Write a one-paragraph "rule of thumb" for choosing between Ridge and Lasso.

d) At what sparsity level (s/p ratio) does Lasso consistently outperform Ridge? At what correlation level (rho) does Ridge consistently outperform Lasso?
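One scenario cell of the simulation might look like this sketch; the equicorrelated-features construction and the LogisticRegressionCV settings are assumptions made here, and the full exercise loops this function over the parameter grid:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score

def run_scenario(n, p, s, rho, noise, seed=0):
    """Test AUC for L1 vs. L2 logistic regression on one synthetic dataset."""
    rng = np.random.default_rng(seed)
    # Equicorrelated features: a shared factor mixed with independent noise,
    # so every pair of columns has correlation rho
    shared = rng.normal(size=(n, 1))
    X = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:s] = rng.normal(size=s)          # s truly predictive features
    y = (X @ beta + rng.normal(scale=noise, size=n) > 0).astype(int)
    split = int(0.7 * n)
    aucs = {}
    for penalty, solver in (('l1', 'saga'), ('l2', 'lbfgs')):
        clf = LogisticRegressionCV(Cs=5, cv=3, penalty=penalty, solver=solver,
                                   scoring='roc_auc', max_iter=5000)
        clf.fit(X[:split], y[:split])
        aucs[penalty] = roc_auc_score(y[split:],
                                      clf.predict_proba(X[split:])[:, 1])
    return aucs

aucs = run_scenario(n=500, p=50, s=5, rho=0.3, noise=1.0, seed=42)
print(aucs)
```

Aggregating `run_scenario` results across the 243 parameter combinations (ideally with several seeds per cell) produces the table needed for parts (b)-(d).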


These exercises support Chapter 11: Linear Models Revisited. Return to the chapter for reference.