Answers to Selected Exercises and Quiz Questions

This appendix provides worked answers for selected exercises and quiz questions from each chapter. Not every question is answered — the intent is to give you enough footholds to check your reasoning without removing the productive struggle of working through problems yourself.


Chapter 1: From Analysis to Prediction

Exercise 1.2: The hospital readmission problem is a binary classification task. The target variable is readmitted_within_30_days (1 = yes, 0 = no). The observation unit is a patient discharge event — not the patient themselves, since a patient may have multiple discharges.

Exercise 1.4: The key insight is that a descriptive analysis ("What drove churn last quarter?") focuses on coefficient interpretation and statistical significance, while a predictive model ("Who will churn next month?") focuses on generalization accuracy. The same dataset can serve both purposes, but the workflow, evaluation criteria, and acceptable tradeoffs differ fundamentally.

Quiz 1.1: B — Overfitting. A model with 99% training accuracy and 62% test accuracy has memorized the training data rather than learning generalizable patterns.

Quiz 1.3: C — Bias-variance tradeoff. Adding more features without regularization increases model complexity, reducing bias but increasing variance.

Quiz 1.5: A — Supervised learning. Predicting a continuous target variable (house price) from labeled training data is a regression task, which falls under supervised learning.


Chapter 2: The ML Workflow

Exercise 2.1: The "stupid baseline" for the StreamFlow churn problem predicts the majority class (not churned) for every subscriber. With an 8.2% churn rate, this baseline achieves 91.8% accuracy — but it identifies zero churners. Any useful model must beat this baseline on recall or precision for the minority class.

Exercise 2.3: Target leakage is present because cancellation_reason is only populated after a customer churns. Including it as a feature gives the model direct access to the target variable. The fix: remove any feature that is unavailable at prediction time.

Quiz 2.2: D — Data leakage. If the feature average_monthly_revenue_next_quarter uses future data, it creates temporal leakage. The model appears to perform well but will fail in production where future data doesn't exist.

Quiz 2.4: B — An iterative cycle. The ML lifecycle is not a linear pipeline. Models are retrained, features are revised, evaluation criteria shift as business needs evolve.

Quiz 2.5: A — Problem framing. Defining the wrong target variable, wrong observation unit, or wrong success metric wastes everything downstream.


Chapter 3: Experimental Design and A/B Testing

Exercise 3.1: Using statsmodels.stats.power.TTestIndPower().solve_power(effect_size=0.05, alpha=0.05, power=0.8), the required sample size per group is approximately 6,280. With 14M monthly users, ShopSmart can easily run this test in a few days — but should run for at least 2 full weeks to capture weekly seasonality.
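The power calculation described above can be reproduced directly; the parameter values are those stated in the exercise:

```python
# Required sample size per group for the ShopSmart test (statsmodels)
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.05,  # Cohen's d for the minimum effect worth detecting
    alpha=0.05,        # significance level
    power=0.8,         # probability of detecting a real effect
)
print(round(n_per_group))  # roughly 6,280 per group
```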

Exercise 3.3: The PM made two errors: (1) peeking at results daily without adjusting for multiple comparisons, and (2) stopping the test early when a favorable result appeared. The correct approach: set the test duration in advance based on power analysis, and only evaluate at the pre-specified end date.

Quiz 3.1: C — Statistical power. Power is the probability of detecting a real effect. Low power means you might conclude "no effect" when the treatment actually works.

Quiz 3.3: B — Bonferroni correction. With 5 simultaneous tests, divide alpha by 5: the new threshold is 0.01 per test.

Quiz 3.5: D — Practical significance differs from statistical significance. A 0.02% conversion lift might be statistically significant with millions of users but not worth the engineering cost to implement.


Chapter 4: The Math Behind ML

Exercise 4.2: Gradient descent from scratch for linear regression:

import numpy as np

def gradient_descent(X, y, lr=0.01, epochs=1000):
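    # Note on the update rule used below: each epoch computes the full-batch
    # MSE gradient (2/m) * X^T (X @ theta - y) and steps theta against it.
    # A too-large lr makes the loss diverge (see Quiz 4.2); a too-small lr
    # converges slowly.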
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        predictions = X @ theta
        errors = predictions - y
        gradient = (2 / m) * (X.T @ errors)  # gradient of MSE w.r.t. theta
        theta -= lr * gradient               # step opposite the gradient
    return theta

The gradient of MSE with respect to theta is (2/m) * X^T(X*theta - y). Each iteration moves theta in the direction that reduces the loss.

Exercise 4.4: Log-loss for the three predictions: For y=1, p=0.9: -log(0.9) = 0.105. For y=0, p=0.3: -log(0.7) = 0.357. For y=1, p=0.6: -log(0.6) = 0.511. Average log-loss = (0.105 + 0.357 + 0.511) / 3 = 0.324.
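The arithmetic above can be checked with a few lines of NumPy, applying the log-loss formula -[y log p + (1 - y) log(1 - p)] to the three predictions from the exercise:

```python
import numpy as np

# The three (label, predicted probability) pairs from Exercise 4.4
y_true = np.array([1, 0, 1])
p_pred = np.array([0.9, 0.3, 0.6])

# Per-observation log-loss, then the average
losses = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
avg_log_loss = losses.mean()
print(round(avg_log_loss, 3))  # 0.324
```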

Quiz 4.2: A — The learning rate is too high. Diverging loss (increasing each epoch) is the classic symptom. Reduce the learning rate by a factor of 10.

Quiz 4.4: C — L1 regularization. Adding the sum of absolute values of coefficients to the loss function drives some coefficients to exactly zero, producing a sparse model.


Chapter 5: SQL for Data Scientists

Exercise 5.1:

WITH monthly_usage AS (
    SELECT
        subscriber_id,
        DATE_TRUNC('month', event_date) AS month,
        SUM(hours_watched) AS total_hours
    FROM usage_events
    GROUP BY subscriber_id, DATE_TRUNC('month', event_date)
),
usage_with_lag AS (
    SELECT
        subscriber_id,
        month,
        total_hours,
        LAG(total_hours, 1) OVER (
            PARTITION BY subscriber_id ORDER BY month
        ) AS prev_month_hours,
        LAG(total_hours, 2) OVER (
            PARTITION BY subscriber_id ORDER BY month
        ) AS two_months_ago_hours
    FROM monthly_usage
)
SELECT
    subscriber_id,
    month,
    total_hours,
    total_hours - prev_month_hours AS usage_change_1m,
    total_hours - two_months_ago_hours AS usage_change_2m
FROM usage_with_lag
WHERE month = '2024-03-01';

Exercise 5.3: The anti-join finds subscribers with zero support tickets:

SELECT s.subscriber_id
FROM subscribers s
LEFT JOIN support_tickets t
    ON s.subscriber_id = t.subscriber_id
WHERE t.ticket_id IS NULL;

Quiz 5.1: B — Window functions. ROW_NUMBER() OVER (PARTITION BY subscriber_id ORDER BY event_date DESC) assigns 1 to the most recent event per subscriber. Filter to row_num = 1 to deduplicate.

Quiz 5.4: C — Create an index on the filtered column. If the query filters on event_date but no index exists, the database performs a full sequential scan.


Chapter 6: Feature Engineering

Exercise 6.2: The genre_diversity_score can be computed as the entropy of the subscriber's viewing distribution across genres. A subscriber who watches only action movies has entropy 0; a subscriber spread equally across 10 genres has high entropy. In Python: scipy.stats.entropy(genre_proportions).
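A minimal sketch of the two extremes described above, using illustrative genre proportions (not real StreamFlow data):

```python
import numpy as np
from scipy.stats import entropy

# Genre diversity as entropy of the viewing distribution
action_only = np.array([1.0] + [0.0] * 9)  # watches a single genre
uniform = np.full(10, 0.1)                 # spread evenly across 10 genres

print(entropy(action_only))  # 0.0
print(entropy(uniform))      # ln(10), about 2.303
```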

Quiz 6.1: A — Domain knowledge. Feature engineering transforms domain expertise into computable variables. Without understanding the business, you cannot create the features that matter most.

Quiz 6.3: D — All feature engineering statistics (means, standard deviations, encodings) must be computed on training data only, then applied to test data. Computing on the full dataset leaks test information into training.


Chapter 7: Handling Categorical Data

Exercise 7.1: For subscription_plan (Free, Basic, Standard, Premium): ordinal encoding (0, 1, 2, 3) is appropriate because the tiers have a natural ordering. For primary_genre (20 genres): target encoding with smoothing, since one-hot would add 20 sparse columns and the genres have varying churn rates. For country (100+ countries): frequency encoding or target encoding, since one-hot creates too many columns.

Quiz 7.2: B — Target encoding leaks the target variable into the features. To prevent overfitting, use leave-one-out encoding or compute target encoding within cross-validation folds.

Quiz 7.4: A — One-hot encoding. With only 4 categories, OHE adds minimal dimensionality and makes no assumptions about ordering.


Chapter 8: Missing Data Strategies

Exercise 8.2: The missing usage data is likely MAR (missing at random) — users who didn't watch anything have null viewing records, and their probability of being missing correlates with engagement level (an observed variable). The key insight: creating a binary no_activity_last_7_days indicator is more informative than imputing the missing usage with the mean. The missingness itself is a strong churn signal.

Quiz 8.1: C — MNAR (missing not at random). Patients with the most severe symptoms are the least likely to complete the survey. The missingness depends on the value that would have been recorded.

Quiz 8.3: B — KNN imputation. It uses the local neighborhood structure to fill missing values, capturing relationships between features that mean/median imputation ignores.


Chapter 9: Feature Selection

Exercise 9.1: Permutation importance ranks features by how much model performance drops when the feature is randomly shuffled. The top 5 features for churn are likely: days_since_last_login, usage_change_1m, support_tickets_last_90d, tenure_months, and avg_hours_last_30d. Features with near-zero permutation importance can be safely removed.
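A sketch of the permutation-importance workflow on synthetic data (the real exercise uses the churn features; here make_classification stands in, so only a few columns are genuinely informative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 3 informative features out of 8
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the performance drop
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=42)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```

Features whose mean importance is near zero under shuffling are candidates for removal.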

Quiz 9.2: A — Filter methods. Correlation matrices and mutual information scores evaluate features independently of any model. They are fast but may miss feature interactions.

Quiz 9.5: D — Feature selection must occur inside cross-validation. Selecting features on the full dataset, then cross-validating, causes information leakage — the selected features "saw" the test data.


Chapter 10: Building Reproducible Data Pipelines

Exercise 10.1: The ColumnTransformer assembles different processing for each feature type:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from category_encoders import TargetEncoder

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', TargetEncoder(smoothing=10))
])

preprocessor = ColumnTransformer([
    ('numeric', numeric_pipeline, numeric_features),
    ('categorical', categorical_pipeline, categorical_features),
    ('ordinal', OrdinalEncoder(categories=[plan_order]), ['subscription_plan'])
])

Quiz 10.3: B — joblib.dump(pipeline, 'model.joblib'). Joblib handles numpy arrays more efficiently than pickle and is the recommended serialization method for scikit-learn objects.


Chapter 11: Linear Models Revisited

Exercise 11.2: As alpha increases in Lasso, coefficients shrink toward zero and eventually reach zero. The coefficient path plot shows which features are eliminated first (least important) and which survive the longest (most important). This is effectively automated feature selection.
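The shrink-to-zero behavior can be seen directly on synthetic data (for the full coefficient-path plot, sklearn.linear_model.lasso_path evaluates a whole grid of alphas):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic regression with only 3 truly informative features out of 10
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Count exact-zero coefficients as alpha grows
n_zero = {}
for alpha in [0.01, 10.0, 1000.0]:
    coef = Lasso(alpha=alpha, max_iter=10_000).fit(X, y).coef_
    n_zero[alpha] = int(np.sum(coef == 0))
```

At a large enough alpha every coefficient is eliminated; the order in which features drop out is the automated feature selection described above.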

Quiz 11.1: C — Feature scaling. Regularization penalizes coefficient magnitude, so unscaled features (one in [0,1], another in [0,100000]) are penalized unequally. StandardScaler before regularization is mandatory.

Quiz 11.4: A — Logistic regression with regularization. It's fast, interpretable, and provides calibrated probabilities. It serves as the baseline that all other models must beat.


Chapter 12: Support Vector Machines

Exercise 12.1: With C=0.01, the margin is wide and some points are misclassified (high bias, low variance). With C=1000, the margin is narrow and nearly every training point is classified correctly (low bias, high variance). The optimal C balances these extremes.

Quiz 12.3: B — The kernel trick. Computing dot products in the transformed space is equivalent to mapping data to higher dimensions without explicitly computing the transformation. This makes non-linear classification tractable.


Chapter 13: Tree-Based Methods

Exercise 13.2: The single decision tree achieves 98% training accuracy but only 71% test accuracy — classic overfitting. The Random Forest with 500 trees achieves 85% training accuracy and 82% test accuracy. Averaging across trees reduces variance while maintaining low bias.

Quiz 13.1: D — Gini impurity. scikit-learn's DecisionTreeClassifier uses Gini impurity by default. Gini = 1 - sum(p_i^2) for each class proportion p_i.
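The formula is small enough to implement directly:

```python
import numpy as np

def gini(proportions):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions."""
    p = np.asarray(proportions)
    return 1.0 - np.sum(p ** 2)

print(gini([1.0, 0.0]))  # 0.0, a pure node
print(gini([0.5, 0.5]))  # 0.5, the maximum for a binary split
```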

Quiz 13.4: A — Permutation importance is more reliable than impurity-based importance because it is not biased toward high-cardinality features. Use permutation_importance() from scikit-learn.


Chapter 14: Gradient Boosting

Exercise 14.1: With early stopping, set n_estimators=10000 (a high ceiling) and early_stopping_rounds=50. Training stops when validation performance hasn't improved for 50 consecutive rounds. The actual number of trees used is typically far fewer than 10,000.

Exercise 14.3: LightGBM trains fastest due to histogram-based splitting. CatBoost handles the categorical genre feature natively without manual encoding. XGBoost is the most battle-tested with the widest API. Performance differences are usually within 1-2%.

Quiz 14.2: C — Learning rate. Lower learning rates (0.01-0.1) with more trees generally outperform higher learning rates with fewer trees, which makes learning rate the first hyperparameter to tune.

Quiz 14.5: B — Early stopping prevents overfitting by monitoring validation performance and stopping when it plateaus, without requiring you to manually set the exact number of iterations.


Chapter 15: Naive Bayes and Nearest Neighbors

Exercise 15.1: Gaussian Naive Bayes achieves 78% accuracy on the churn data — surprisingly competitive for a model with such a strong assumption. The independence assumption is clearly violated (usage features are correlated), yet the classifier's ranking ability (AUC) is reasonable because the marginal distributions are informative.

Quiz 15.3: D — The curse of dimensionality. As features increase, distances between points converge, and "nearest" becomes meaningless. KNN works best with fewer than ~20 informative features.


Chapter 16: Model Evaluation Deep Dive

Exercise 16.2: Stratified k-fold ensures each fold preserves the 8.2% churn rate. Without stratification, some folds might have 5% churn and others 12%, leading to highly variable performance estimates. Use StratifiedKFold(n_splits=5, shuffle=True, random_state=42).
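A quick check of the stratification claim on synthetic labels drawn at roughly the 8.2% rate from the text (illustrative data, not the StreamFlow table):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with approximately an 8.2% positive rate
rng = np.random.default_rng(42)
y = (rng.random(10_000) < 0.082).astype(int)
X = rng.normal(size=(10_000, 3))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_rates = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
# Each fold's positive rate matches the overall rate almost exactly
```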

Exercise 16.4: The leakage is in the account_status feature, which is set to "cancelled" for churned customers. This feature perfectly predicts the target because it IS the target in another form. Remove it and rerun — the model's AUC drops from 0.99 to 0.83.

Quiz 16.1: B — AUC-PR. For imbalanced datasets, AUC-PR focuses on the minority class (churners) and is more informative than AUC-ROC, which can be misleadingly optimistic.

Quiz 16.3: C — Group k-fold. When a subscriber has multiple monthly observations, regular k-fold may split one subscriber's data across train and test, causing leakage. Group k-fold keeps all of a subscriber's data in the same fold.


Chapter 17: Class Imbalance

Exercise 17.2: Threshold tuning on the precision-recall curve yields the best results. At threshold 0.35 (instead of default 0.5), recall increases from 0.61 to 0.78 while precision drops only from 0.72 to 0.64. This is acceptable because the cost of missing a churner ($2,400 annual revenue) far exceeds the cost of a wasted retention offer ($50).
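The mechanics of threshold tuning can be sketched on a synthetic imbalanced problem (the specific numbers in the exercise come from the churn data, so only the precision-recall tradeoff direction carries over):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: about 92% negative class
X, y = make_classification(n_samples=5000, weights=[0.92], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, proba)

def recall_at(threshold):
    # Recall of the positive class at a given probability cutoff
    preds = proba >= threshold
    return (preds & (y_te == 1)).sum() / (y_te == 1).sum()
```

Lowering the threshold can only keep or increase recall, at the cost of precision; the right operating point depends on the cost asymmetry described above.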

Quiz 17.1: A — Accuracy is misleading. A model predicting "no churn" for everyone achieves 91.8% accuracy but catches zero churners.

Quiz 17.4: C — SMOTE must be applied only inside cross-validation, on the training fold only. Applying it before splitting contaminates the validation fold with synthetic data derived from it.


Chapter 18: Hyperparameter Tuning

Exercise 18.1: The Optuna study with 100 trials improves AUC from 0.836 (defaults) to 0.851 (tuned). The first 20 trials capture 80% of the improvement. The most important hyperparameters: learning_rate and max_depth. subsample and colsample_bytree have minimal impact.

Quiz 18.2: A — Random search. For search spaces with more than 3-4 hyperparameters, random search is more efficient than grid search because it explores the space more evenly.

Quiz 18.5: D — Better features. Improving feature engineering yields 5-20% gains; hyperparameter tuning yields 0.5-5%. Always invest in features first.


Chapter 19: Model Interpretation

Exercise 19.1: The SHAP summary plot shows days_since_last_login as the most important feature globally, with high values (many days since login) pushing predictions toward churn. usage_change_1m is second: negative values (declining usage) push toward churn. These align with domain intuition — disengaged users leave.

Quiz 19.3: B — SHAP values sum to the difference between the model's prediction for that observation and the average prediction. This is the Shapley additivity property.


Chapter 20: Clustering

Exercise 20.2: The silhouette plot reveals that k=4 produces the most balanced, well-separated clusters. The four segments: "Power Users" (high usage, long tenure), "Casual Browsers" (low usage, medium tenure), "New Enthusiasts" (high usage, short tenure), and "At-Risk" (declining usage, medium tenure). The "At-Risk" segment has a 23% churn rate vs. 5% for "Power Users."
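A sketch of the k-selection loop on synthetic blobs (the segment names above come from the churn data; this stand-in only demonstrates the silhouette comparison):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Well-separated synthetic clusters
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.8, random_state=7)

# Compare average silhouette across candidate k
scores = {}
for k in [2, 3, 4, 5, 6]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```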

Quiz 20.1: C — K-Means assumes spherical, equal-variance clusters. It fails on non-convex shapes and clusters of different densities.


Chapter 21: Dimensionality Reduction

Exercise 21.1: The first 5 principal components explain 72% of the total variance. The scree plot shows a clear elbow at component 5. In the 2D PCA projection, churners and retained subscribers overlap substantially, confirming that churn is not linearly separable — justifying non-linear models.
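The explained-variance bookkeeping behind the scree plot, sketched on synthetic data (the 72%-at-5-components figure is specific to the churn features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_scaled = StandardScaler().fit_transform(X)  # PCA needs scaled features

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components reaching 90% of the variance
n_components_90 = int(np.argmax(cumulative >= 0.90)) + 1
```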

Quiz 21.3: A — t-SNE cluster distances are not meaningful. Two clusters that appear far apart in a t-SNE plot might be close in the original space. Never interpret inter-cluster distances.


Chapter 22: Anomaly Detection

Exercise 22.1: Isolation Forest with contamination=0.02 identifies 847 anomalous vibration readings from the TurbineTech data. Cross-referencing with maintenance logs: 73% of flagged readings occurred within 14 days of an actual bearing failure. The precision-at-k for k=100 is 0.81.

Quiz 22.2: B — Reconstruction error. An autoencoder trained on normal data produces low reconstruction error for normal inputs and high error for anomalies (patterns it hasn't learned to reconstruct).


Chapter 23: Association Rules

Exercise 23.1: The rule {Action, Sci-Fi} -> {Thriller} has support=0.08, confidence=0.62, lift=2.3. Lift > 1 confirms a positive association. Interpretation: subscribers who watch both Action and Sci-Fi are 2.3 times more likely to watch Thriller than a random subscriber. This informs content recommendation.
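The three metrics fall out of simple counting. The toy transactions below are made up for illustration, so the resulting numbers differ from the exercise's 0.08 / 0.62 / 2.3:

```python
# Support, confidence, and lift computed directly from transactions
transactions = [
    {"Action", "Sci-Fi", "Thriller"}, {"Action", "Sci-Fi"},
    {"Action", "Sci-Fi", "Thriller"}, {"Thriller"},
    {"Action"}, {"Sci-Fi", "Thriller"}, {"Action", "Thriller"},
    {"Sci-Fi"}, {"Action", "Sci-Fi", "Thriller"}, {"Drama"},
]
n = len(transactions)

antecedent, consequent = {"Action", "Sci-Fi"}, {"Thriller"}
both = sum(1 for t in transactions if antecedent | consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)
cons = sum(1 for t in transactions if consequent <= t)

support = both / n              # P(A and C)
confidence = both / ante        # P(C | A)
lift = confidence / (cons / n)  # P(C | A) / P(C)
```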

Quiz 23.3: C — Lift. Support and confidence alone can be misleading if the consequent is already very common. Lift corrects for baseline popularity.


Chapter 24: Recommender Systems

Exercise 24.2: Item-based collaborative filtering outperforms user-based CF on this dataset (NDCG@10: 0.42 vs. 0.37) because item-item similarities are more stable than user-user similarities — items don't change, but user preferences drift over time.

Quiz 24.1: A — The cold start problem. Collaborative filtering cannot recommend for new users (no history) or new items (no ratings). Content-based methods or hybrid approaches address this.


Chapter 25: Time Series

Exercise 25.2: The ADF test on the raw monthly churn rate gives a p-value of 0.23 (non-stationary). After first-order differencing, the p-value drops to 0.001 (stationary). The ACF shows a significant spike at lag 12 (annual seasonality), suggesting SARIMA.

Quiz 25.3: B — Walk-forward validation. Random k-fold splits for time series allow future data to leak into training. TimeSeriesSplit respects temporal ordering.
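TimeSeriesSplit's ordering guarantee is easy to verify:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered observations

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index: no future leakage
    assert train_idx.max() < test_idx.min()
```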


Chapter 26: NLP Fundamentals

Exercise 26.1: The TF-IDF + logistic regression classifier achieves 0.79 F1 on support ticket urgency classification. The most informative features (largest logistic regression coefficients for the "urgent" class): "cancel", "billing", "charge", "refund", "outage". These make domain sense.

Quiz 26.2: C — TF-IDF. It down-weights common words (like stop words) and up-weights distinctive words automatically, capturing which words are most informative for each document.


Chapter 27: Geospatial Data

Exercise 27.1: The choropleth map reveals that churn rates are highest in the Northeast (12-15%) and lowest in the Southeast (4-6%). Possible explanations: stronger competition in the Northeast, different content preferences, network quality differences.

Quiz 27.3: A — CRS mismatch. If one dataset uses EPSG:4326 and another uses EPSG:3857, the spatial join will produce incorrect results. Always transform to a common CRS first.


Chapter 28: Large Datasets

Exercise 28.1: Processing 50M rows of StreamFlow event logs with Polars takes 12 seconds. The equivalent pandas operation takes 3 minutes and 45 seconds. The Polars lazy API with expression syntax avoids materializing intermediate DataFrames.

Quiz 28.2: B — Lazy evaluation. Both Dask and Polars defer execution until .compute() or .collect() is called, allowing them to optimize the execution plan.


Chapter 29: Software Engineering

Exercise 29.2: A minimal pytest test for the feature engineering pipeline:

import numpy as np

def test_pipeline_output_shape(sample_data, fitted_pipeline):
    result = fitted_pipeline.transform(sample_data)
    assert result.shape[0] == sample_data.shape[0]
    assert not np.any(np.isnan(result))

Quiz 29.1: C — Refactor notebooks into importable modules. Functions defined in notebooks cannot be tested, versioned, or reused. Move transformation logic to src/features/ and import it.


Chapter 30: Experiment Tracking

Exercise 30.1: Using MLflow autologging:

import mlflow
mlflow.sklearn.autolog()

with mlflow.start_run(run_name="xgb_tuned_v2"):
    model.fit(X_train, y_train)
    # Parameters, metrics, and model artifact are logged automatically

Quiz 30.3: A — Model Registry. It provides versioning (v1, v2, v3), stage management (Staging -> Production -> Archived), and lineage tracking (which experiment produced this model).


Chapter 31: Model Deployment

Exercise 31.1: The FastAPI /predict endpoint core logic:

@app.post("/predict", response_model=ChurnPrediction)
def predict(features: CustomerFeatures):
    X = pd.DataFrame([features.dict()])
    proba = model.predict_proba(X)[0, 1]
    return ChurnPrediction(
        churn_probability=round(proba, 4),
        risk_level="high" if proba > 0.7 else "medium" if proba > 0.3 else "low"
    )

Quiz 31.2: B — Pydantic validation. FastAPI uses Pydantic models to validate incoming request data automatically, returning clear 422 errors for malformed inputs.


Chapter 32: Monitoring in Production

Exercise 32.1: PSI for avg_hours_last_30d between training (Jan-Jun) and production (Jul-Dec) is 0.31 — above the 0.25 threshold, indicating significant drift. Usage patterns shifted after a major content library change in August. The model should be retrained on more recent data.
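PSI has no single canonical library implementation, so here is one common formulation from scratch (decile bins defined on the training sample), demonstrated on synthetic distributions rather than the StreamFlow feature:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one feature."""
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf  # catch values outside training range
    e = np.histogram(expected, cuts)[0] / len(expected)
    a = np.histogram(actual, cuts)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 50_000)
same = rng.normal(0, 1, 50_000)       # no drift: PSI near 0
shifted = rng.normal(0.5, 1, 50_000)  # mean shift: PSI above 0.25 threshold region

psi_same, psi_shifted = psi(train, same), psi(train, shifted)
```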

Quiz 32.1: C — Concept drift. The relationship between features and churn changed (e.g., new competitor entered the market, making previously loyal users start churning). This is harder to detect than data drift because the features may look the same while the target relationship changes.


Chapter 33: Fairness and Responsible ML

Exercise 33.2: The churn model's true positive rate (recall) is 0.82 for subscribers aged 18-30 and 0.64 for subscribers aged 50+. This violates equalized odds. The model is less effective at identifying at-risk older subscribers, potentially because there is less training data for that demographic.

Quiz 33.1: D — The impossibility theorem. When base rates differ between groups, you cannot simultaneously achieve demographic parity, equalized odds, and calibration. You must choose which fairness criterion to optimize.


Chapter 34: The Business of Data Science

Exercise 34.1: ROI calculation: The model identifies 500 at-risk subscribers per month. Retention offer cost: $50 each = $25,000/month. At 65% precision, 325 are true churners. If the offer retains 40% of them, 130 subscribers are saved. Each is worth $200/month in revenue. Monthly revenue saved: $26,000. Net monthly benefit: $1,000. Annual net benefit: $12,000 plus the compounding effect of retained subscribers staying longer. The break-even precision is approximately 62.5%.
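The arithmetic above, made explicit (all figures are the ones stated in the exercise):

```python
# ROI calculation for the retention campaign in Exercise 34.1
flagged = 500        # subscribers targeted per month
offer_cost = 50      # dollars per retention offer
precision = 0.65     # fraction of flagged who are true churners
save_rate = 0.40     # fraction of true churners the offer retains
monthly_value = 200  # revenue per retained subscriber per month

cost = flagged * offer_cost                       # 25,000
saved = flagged * precision * save_rate           # 130 subscribers
revenue_saved = saved * monthly_value             # 26,000
net_monthly = revenue_saved - cost                # 1,000

# Break-even precision: revenue saved equals campaign cost
break_even = cost / (flagged * save_rate * monthly_value)  # 0.625
```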

Quiz 34.3: B — Expected value framework. Multiply the probability and cost/benefit of each outcome (TP, FP, TN, FN) to compute the economic value per prediction.


Chapter 35: Capstone

Exercise 35.1: This is an integration exercise with no single answer. The capstone project should demonstrate: (1) problem framing with clear business metrics, (2) reproducible pipeline from SQL to model, (3) proper evaluation with stratified CV, (4) SHAP-based interpretation, (5) fairness audit, (6) deployed API with monitoring, and (7) ROI justification. The portfolio write-up should tell the story — not just show the code.


Chapter 36: The Road to Advanced

Exercise 36.1: The gap analysis should identify 3-5 areas where the reader's current skills end and the next level begins. Common gaps: deep learning (neural networks, transformers), causal inference (beyond A/B tests), distributed computing (Spark), advanced NLP (BERT, GPT fine-tuning), and MLOps (CI/CD for ML, infrastructure as code). Prioritize based on your career goals and current role.

Quiz 36.3: A — Deep learning is not always better than gradient boosting for tabular data. For structured/tabular problems, XGBoost and LightGBM remain competitive or superior. Deep learning excels on unstructured data: images, text, audio.