Chapter 27 Exercises

Section 27.2: scikit-learn Pipelines

Exercise 1: Basic Pipeline Construction

Build a scikit-learn Pipeline that chains a SimpleImputer (median strategy), StandardScaler, and LogisticRegression. Fit it on synthetic prediction market data with features [poll_average, market_price, volume_24h, days_to_resolution] and binary outcome labels. Print the pipeline's parameters and the model's coefficients after fitting.
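A starting-point sketch, assuming the synthetic data is generated with NumPy (the feature distributions below are arbitrary placeholders, not the chapter's data):

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(42)
    n = 1_000
    X = pd.DataFrame({
        "poll_average": rng.uniform(0, 1, n),
        "market_price": rng.uniform(0, 1, n),
        "volume_24h": rng.exponential(1_000, n),
        "days_to_resolution": rng.integers(0, 365, n).astype(float),
    })
    y = (X["poll_average"] + rng.normal(0, 0.2, n) > 0.5).astype(int)

    pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ])
    pipe.fit(X, y)
    print(pipe.get_params())                 # all pipeline parameters
    print(pipe.named_steps["clf"].coef_)     # fitted coefficients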

Exercise 2: ColumnTransformer with Mixed Types

Create a ColumnTransformer that applies StandardScaler to numeric features [poll_average, market_price, volume_24h] and OneHotEncoder to categorical features [event_type, market_platform]. Combine it with a GradientBoostingClassifier in a full pipeline. Generate synthetic data with at least 500 rows and fit the pipeline. Report the number of features after transformation.
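One way to wire the transformers together, assuming X is a DataFrame containing the five columns named above and y holds the binary outcomes:

    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    numeric = ["poll_average", "market_price", "volume_24h"]
    categorical = ["event_type", "market_platform"]

    preprocessor = ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])
    pipe = Pipeline([
        ("preprocess", preprocessor),
        ("classifier", GradientBoostingClassifier()),
    ])
    pipe.fit(X, y)
    n_features = pipe.named_steps["preprocess"].transform(X).shape[1]
    print(f"Features after transformation: {n_features}")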

Exercise 3: Custom Transformer

Implement a custom transformer MomentumTransformer that computes the percentage change between the current market price and a lagged market price column (market_price_lag1). The transformer should add a new column price_momentum. Verify that it works inside a scikit-learn Pipeline and that fit returns self.
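A minimal skeleton, assuming the transformer receives a pandas DataFrame that already contains market_price and market_price_lag1:

    from sklearn.base import BaseEstimator, TransformerMixin

    class MomentumTransformer(BaseEstimator, TransformerMixin):
        """Adds price_momentum, the percentage change from the lagged price."""

        def fit(self, X, y=None):
            # Stateless transformer: nothing to learn, but fit must return self
            # so it composes correctly inside a Pipeline.
            return self

        def transform(self, X):
            X = X.copy()
            X["price_momentum"] = (
                X["market_price"] - X["market_price_lag1"]
            ) / X["market_price_lag1"]
            return X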

Exercise 4: Pipeline Serialization and Verification

Train a prediction pipeline (using any combination of transformers and a classifier) on synthetic data. Serialize it using joblib.dump. Load the serialized pipeline and verify that predictions on a test set are numerically identical (within floating-point tolerance) to the original pipeline's predictions. Report the file size of the serialized pipeline.
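The round-trip check can look roughly like this, where pipe and X_test come from your own training code:

    import os

    import joblib
    import numpy as np

    joblib.dump(pipe, "pipeline.joblib")
    loaded = joblib.load("pipeline.joblib")

    original = pipe.predict_proba(X_test)[:, 1]
    restored = loaded.predict_proba(X_test)[:, 1]
    assert np.allclose(original, restored)   # identical within float tolerance
    print(f"Serialized size: {os.path.getsize('pipeline.joblib') / 1024:.1f} KiB")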

Exercise 5: Pipeline with Cross-Validation

Create a pipeline with a ColumnTransformer and CalibratedClassifierCV wrapping a GradientBoostingClassifier. Use cross_val_score with scoring='neg_brier_score' to evaluate the pipeline on 5-fold cross-validation. Report the mean and standard deviation of the Brier score across folds.
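A sketch of the evaluation step, reusing the preprocessor from Exercise 2; note that scikit-learn negates loss-style scores, so the Brier score is recovered by flipping the sign:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    pipe = Pipeline([
        ("preprocess", preprocessor),   # ColumnTransformer from Exercise 2
        ("classifier", CalibratedClassifierCV(GradientBoostingClassifier(), cv=3)),
    ])
    brier = -cross_val_score(pipe, X, y, cv=5, scoring="neg_brier_score")
    print(f"Brier score: {brier.mean():.4f} +/- {brier.std():.4f}")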

Exercise 6: Grid Search over Pipeline Parameters

Set up a GridSearchCV over a pipeline that includes StandardScaler and GradientBoostingClassifier. Search over the following parameter grid: classifier__n_estimators: [100, 200, 300], classifier__learning_rate: [0.01, 0.05, 0.1], classifier__max_depth: [3, 4, 5]. Use the Brier score as the scoring metric (pass scoring='neg_brier_score', since GridSearchCV maximizes its score). Report the best parameters and best score.
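The grid keys must match the pipeline's step names, so a classifier step registered as "classifier" yields the double-underscore keys below:

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("classifier", GradientBoostingClassifier()),
    ])
    param_grid = {
        "classifier__n_estimators": [100, 200, 300],
        "classifier__learning_rate": [0.01, 0.05, 0.1],
        "classifier__max_depth": [3, 4, 5],
    }
    search = GridSearchCV(pipe, param_grid, scoring="neg_brier_score", cv=5, n_jobs=-1)
    search.fit(X, y)
    print(search.best_params_)
    print(-search.best_score_)   # negate to report the Brier score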


Section 27.3: Feature Stores

Exercise 7: Feature Store Implementation

Implement a SimpleFeatureStore class using a Python dictionary (no database required). It should support: (a) registering features with names and descriptions, (b) ingesting feature values with entity IDs and timestamps, (c) retrieving the latest feature values for a given entity, and (d) retrieving point-in-time feature values. Write tests for each method.

Exercise 8: Point-in-Time Join

Given two DataFrames, entity_df with columns [market_id, observation_date, outcome] and feature_df with columns [market_id, feature_date, poll_average, market_price], implement a point-in-time join using pd.merge_asof. Verify that no future features leak into the training data by checking that feature_date <= observation_date for every row.
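merge_asof requires both frames to be sorted on their time keys; a backward join then guarantees that only features at or before each observation are used. A sketch:

    import pandas as pd

    entity_df = entity_df.sort_values("observation_date")
    feature_df = feature_df.sort_values("feature_date")

    training_df = pd.merge_asof(
        entity_df,
        feature_df,
        left_on="observation_date",
        right_on="feature_date",
        by="market_id",
        direction="backward",
    )

    # Leakage check (rows with no prior feature are joined as NaT and skipped here)
    matched = training_df.dropna(subset=["feature_date"])
    assert (matched["feature_date"] <= matched["observation_date"]).all()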

Exercise 9: Feature Freshness Monitoring

Extend the PredictionMarketFeatureStore from the chapter to include a check_freshness method that returns a report of all features whose most recent update is older than a specified threshold (e.g., 24 hours). Generate test data with some stale and some fresh features and verify the method correctly identifies stale features.

Exercise 10: Feature Store with Parquet Backend

Implement a feature store that stores features as Parquet files (one file per feature group). Implement ingest, get_training_features, and get_online_features methods. Compare the read performance against the SQLite-backed implementation for 100,000 feature values.


Section 27.4: Experiment Tracking with MLflow

Exercise 11: Basic MLflow Logging

Write a script that trains three different models (Logistic Regression, Random Forest, Gradient Boosting) on the same dataset and logs each to MLflow with parameters, Brier score, log loss, and AUC-ROC. Use MlflowClient to programmatically retrieve and compare the three runs, printing the best model by Brier score.
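A sketch of the logging loop; the experiment name and the models dictionary are placeholders, and a local MLflow tracking store is assumed:

    import mlflow
    from mlflow.tracking import MlflowClient
    from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score

    mlflow.set_experiment("chapter27-model-comparison")

    for name, model in models.items():   # e.g. logistic regression, RF, GBM
        with mlflow.start_run(run_name=name):
            model.fit(X_train, y_train)
            proba = model.predict_proba(X_test)[:, 1]
            mlflow.log_params(model.get_params())
            mlflow.log_metric("brier_score", brier_score_loss(y_test, proba))
            mlflow.log_metric("log_loss", log_loss(y_test, proba))
            mlflow.log_metric("auc_roc", roc_auc_score(y_test, proba))

    client = MlflowClient()
    exp = client.get_experiment_by_name("chapter27-model-comparison")
    runs = client.search_runs([exp.experiment_id], order_by=["metrics.brier_score ASC"])
    best = runs[0]
    print("Best by Brier score:", best.data.tags.get("mlflow.runName"), best.data.metrics)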

Exercise 12: Custom Metric Logging

Train a model and log step-based metrics to MLflow: for each of 10 increasing training set sizes (100, 200, ..., 1000 samples), log the validation Brier score. After the run, retrieve the metric history and plot the learning curve using the logged values.

Exercise 13: Artifact Logging

Train a prediction market model and log the following artifacts to MLflow: (a) the serialized model (joblib), (b) a calibration plot (PNG), (c) a feature importance bar chart (PNG), (d) a JSON file containing the feature names and their importance scores. Verify that all artifacts can be retrieved from the MLflow tracking server.

Exercise 14: Experiment Comparison

Create an MLflow experiment with at least 10 runs using different hyperparameter combinations. Write a function that retrieves all runs, sorts them by Brier score, and generates a summary table showing the top 5 runs with their parameters and metrics.


Section 27.5: Model Versioning and Registry

Exercise 15: Model Lifecycle Management

Using MLflow's Model Registry, register a model, transition it from "None" to "Staging", validate it (compute Brier score on a test set), and promote it to "Production". Then train a new model, register it, and if it is better than the current production model, promote it and archive the old one. Write the complete workflow as a script.
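A skeleton of the registration and promotion steps using MLflow's stage-based registry API; the model name is a placeholder and the validation step is left as a comment:

    import mlflow
    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    model_name = "prediction-market-model"

    with mlflow.start_run() as run:
        mlflow.sklearn.log_model(pipe, artifact_path="model")
    version = mlflow.register_model(f"runs:/{run.info.run_id}/model", model_name)

    client.transition_model_version_stage(model_name, version.version, stage="Staging")
    # ... compute the Brier score on the held-out test set here ...
    client.transition_model_version_stage(
        model_name, version.version,
        stage="Production", archive_existing_versions=True,
    )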

Exercise 16: Model Rollback Procedure

Implement a rollback_model function that: (a) loads the current production model, (b) loads the previously archived model, (c) promotes the archived model back to production, and (d) archives the current production model. Test this function by simulating a scenario where a newly deployed model performs worse.

Exercise 17: A/B Testing Framework

Implement a complete A/B testing framework that: (a) routes 90% of traffic to control (current model) and 10% to treatment (new model), (b) records predictions and outcomes for both groups, (c) computes the Brier score for each group after 200 predictions, and (d) determines whether the treatment model is statistically significantly better using a permutation test. Run the framework on synthetic data and report the results.
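For part (d), one possible permutation test compares mean squared errors (the per-prediction Brier components) between the two groups; the function and variable names below are illustrative:

    import numpy as np

    def permutation_p_value(control_sq_err, treatment_sq_err,
                            n_permutations=10_000, seed=0):
        """One-sided test: is the treatment Brier score lower than the control's?"""
        rng = np.random.default_rng(seed)
        observed = control_sq_err.mean() - treatment_sq_err.mean()
        pooled = np.concatenate([control_sq_err, treatment_sq_err])
        n_control = len(control_sq_err)
        count = 0
        for _ in range(n_permutations):
            rng.shuffle(pooled)
            diff = pooled[:n_control].mean() - pooled[n_control:].mean()
            count += diff >= observed
        return count / n_permutations

    # control_sq_err = (control_preds - control_outcomes) ** 2, and likewise for
    # the treatment group; a small p-value favors the treatment model.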


Section 27.6: Automated Training Pipelines

Exercise 18: Scheduled Retraining Simulation

Simulate a scheduled retraining scenario: generate 365 days of synthetic prediction market data. Every 30 days, retrain the model on the most recent 180 days of data, validate on the most recent 30 days, and log the validation Brier score. Plot the Brier score over time and identify any periods where the model degraded.

Exercise 19: Trigger-Based Retraining

Implement a RetrainTrigger class that monitors the rolling Brier score of a deployed model and triggers retraining when the 30-day rolling Brier score exceeds 1.5 times the baseline Brier score. Test it on synthetic data where the data distribution shifts at day 180.
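A compact skeleton, assuming daily Brier scores arrive as a pandas Series indexed by date:

    import pandas as pd

    class RetrainTrigger:
        """Fires when the rolling Brier score exceeds 1.5x the baseline."""

        def __init__(self, baseline_brier, window_days=30, threshold_ratio=1.5):
            self.baseline_brier = baseline_brier
            self.window_days = window_days
            self.threshold_ratio = threshold_ratio

        def should_retrain(self, daily_brier: pd.Series) -> bool:
            # Time-based rolling window over a DatetimeIndex, e.g. "30D"
            rolling = daily_brier.rolling(f"{self.window_days}D").mean()
            return bool(rolling.iloc[-1] > self.threshold_ratio * self.baseline_brier)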

Exercise 20: Validation Gates

Implement a ValidationGate class with configurable quality thresholds: max_brier_score, min_auc, max_calibration_error, and must_beat_production (boolean). The gate should accept or reject a candidate model based on all criteria. Write tests for each threshold independently and for the combination.


Section 27.7: Model Monitoring and Drift Detection

Exercise 21: PSI Computation

Implement the Population Stability Index (PSI) from scratch. Test it on the following scenarios: (a) identical distributions (PSI should be near 0), (b) slightly shifted distribution (shift mean by 0.5 std, PSI should be moderate), (c) dramatically shifted distribution (shift mean by 2 std, PSI should be high). Verify your implementation against the threshold values in the chapter.
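As a reminder, the PSI over B bins is the sum over i of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are the expected (reference) and actual bin proportions. One from-scratch version, with quantile bins taken from the reference distribution:

    import numpy as np

    def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
        """PSI = sum((a_i - e_i) * ln(a_i / e_i)) over quantile bins of `expected`."""
        edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf          # catch out-of-range values
        e = np.histogram(expected, bins=edges)[0] / len(expected)
        a = np.histogram(actual, bins=edges)[0] / len(actual)
        e, a = np.clip(e, eps, None), np.clip(a, eps, None)   # avoid log(0)
        return float(np.sum((a - e) * np.log(a / e)))

    rng = np.random.default_rng(0)
    ref = rng.normal(0, 1, 10_000)
    print(population_stability_index(ref, rng.normal(0.0, 1, 10_000)))  # near 0
    print(population_stability_index(ref, rng.normal(0.5, 1, 10_000)))  # moderate
    print(population_stability_index(ref, rng.normal(2.0, 1, 10_000)))  # high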

Exercise 22: KS Test for Feature Drift

Implement a function that takes a reference feature distribution and a current feature distribution, runs the Kolmogorov-Smirnov test, and returns a drift report including the statistic, p-value, and a yes/no drift determination. Test it on 10 features where 3 have been artificially shifted.
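The core of the function is a single call to scipy.stats.ks_2samp; the alpha threshold below is an assumption:

    from scipy import stats

    def ks_drift_report(reference, current, alpha=0.05):
        """Two-sample Kolmogorov-Smirnov test between reference and current values."""
        statistic, p_value = stats.ks_2samp(reference, current)
        return {
            "ks_statistic": float(statistic),
            "p_value": float(p_value),
            "drift_detected": bool(p_value < alpha),
        }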

Exercise 23: Concept Drift Simulation

Simulate concept drift in a prediction market model: (a) train a model on data where poll_average > 0.5 predicts outcome = 1, (b) generate new data where the relationship has reversed (poll_average > 0.5 now predicts outcome = 0), (c) run the drift monitor on the new data and verify that concept drift is detected. Plot the rolling Brier score to visualize the drift.

Exercise 24: Multi-Metric Drift Dashboard

Build a DriftDashboard class that monitors: PSI for predictions, KS test for each feature, Wasserstein distance for predictions, and rolling Brier score (when labels are available). The dashboard should produce a summary report with severity levels (green/yellow/red) for each metric. Test it on synthetic data with gradual drift.


Section 27.8: Model Serving

Exercise 25: FastAPI Prediction Server

Implement a FastAPI prediction server with the following endpoints: /predict (single prediction), /predict/batch (batch predictions), /health (health check), and /model-info (returns model version, feature names, and training date). Write integration tests that call each endpoint and verify the response schema.
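A sketch of the endpoint layout, assuming model, MODEL_VERSION, FEATURE_NAMES, and TRAINING_DATE are loaded at startup (the request and response schemas below are illustrative):

    from typing import List

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class MarketFeatures(BaseModel):
        poll_average: float
        market_price: float
        volume_24h: float
        days_to_resolution: float

    class Prediction(BaseModel):
        probability: float
        model_version: str

    @app.post("/predict", response_model=Prediction)
    def predict(features: MarketFeatures):
        row = [[features.poll_average, features.market_price,
                features.volume_24h, features.days_to_resolution]]
        proba = model.predict_proba(row)[0, 1]
        return Prediction(probability=float(proba), model_version=MODEL_VERSION)

    @app.post("/predict/batch", response_model=List[Prediction])
    def predict_batch(batch: List[MarketFeatures]):
        return [predict(f) for f in batch]

    @app.get("/health")
    def health():
        return {"status": "ok"}

    @app.get("/model-info")
    def model_info():
        return {"model_version": MODEL_VERSION, "feature_names": FEATURE_NAMES,
                "training_date": TRAINING_DATE}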

Exercise 26: Latency Benchmarking

Using the FastAPI server from Exercise 25, benchmark the prediction latency by sending 1000 sequential requests and 100 concurrent requests (using asyncio or concurrent.futures). Report the mean, median, P95, and P99 latency for both sequential and concurrent scenarios.

Exercise 27: Graceful Degradation

Implement a prediction server with a primary model and a fallback model. The primary model should intentionally raise an exception 10% of the time (simulating intermittent failures). Verify that the server: (a) uses the primary model when it succeeds, (b) falls back to the secondary model when the primary fails, (c) logs all fallback events, and (d) reports the fallback rate in the /health endpoint.


Section 27.9-27.10: CI/CD and Governance

Exercise 28: Data Validation with Pandera

Define a Pandera schema for prediction market training data with the following constraints: poll_average in [0, 1] with < 5% nulls, market_price in [0, 1], volume_24h >= 0, days_to_resolution in [0, 3650], outcome in {0, 1}. Test the schema against valid and invalid DataFrames and verify that appropriate errors are raised.
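A schema sketch, assuming pandera's DataFrameSchema/Column/Check API; ignore_na=False keeps nulls visible to the series-level null-fraction check:

    import pandera as pa

    schema = pa.DataFrameSchema({
        "poll_average": pa.Column(
            float,
            checks=[
                pa.Check.in_range(0, 1),
                pa.Check(lambda s: s.isna().mean() < 0.05, ignore_na=False),
            ],
            nullable=True,
        ),
        "market_price": pa.Column(float, pa.Check.in_range(0, 1)),
        "volume_24h": pa.Column(float, pa.Check.ge(0)),
        "days_to_resolution": pa.Column(int, pa.Check.in_range(0, 3650)),
        "outcome": pa.Column(int, pa.Check.isin([0, 1])),
    })

    validated = schema.validate(df)   # raises a SchemaError on invalid data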

Exercise 29: Reproducibility Checklist

Write a ReproducibilityChecker class that verifies: (a) all random seeds are set, (b) all dependencies are pinned (read from requirements.txt), (c) data hash matches a reference hash, (d) model produces identical predictions on a reference input before and after serialization. Run the checker on a training pipeline and report pass/fail for each criterion.

Exercise 30: Complete MLOps Pipeline

Build a complete end-to-end MLOps pipeline that: (a) ingests synthetic data into a feature store, (b) fetches training data with point-in-time correctness, (c) trains a model using a scikit-learn pipeline, (d) logs the experiment to MLflow, (e) validates against quality gates, (f) registers the model in the MLflow Model Registry, (g) serves predictions via a simple prediction function, and (h) monitors for drift using PSI. Run the pipeline twice with different data and verify that model versions are managed correctly. This exercise integrates all concepts from the chapter.