Chapter 27 Key Takeaways

Core Principle

A model in a notebook generates zero profit. The distance from "it works on my machine" to "it runs reliably in production, 24/7" is vast -- and crossing that distance is what separates hobbyist data science from professional prediction market trading.

The Big Ideas

1. Training-Serving Skew Is the Silent Killer

The most common production ML failure is a mismatch between the features used during training and the features used during inference. If training uses a 7-day rolling average of polls and serving uses a 5-day average (because someone copied the wrong code), the model will silently produce degraded predictions.

Prevention: Encapsulate all preprocessing in a scikit-learn Pipeline and serve predictions with the same fitted Pipeline object that was used during training.

2. scikit-learn Pipelines Are the Foundation

A Pipeline chains preprocessing steps and a model into a single serializable object:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
import joblib

pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([...])),
    ('classifier', GradientBoostingClassifier())
])
pipeline.fit(X_train, y_train)        # Fit everything
predictions = pipeline.predict(X_new) # Apply everything
joblib.dump(pipeline, 'model.pkl')    # Serialize everything

Key capabilities:

  • ColumnTransformer applies different transformations to numeric vs. categorical features.
  • Custom transformers (via BaseEstimator + TransformerMixin) add domain-specific logic (see the sketch after this list).
  • GridSearchCV and cross_val_score work directly on pipelines.
  • Serialized pipelines guarantee that training and serving use identical preprocessing.
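
A minimal sketch of such a custom transformer, assuming the input X is a pandas DataFrame sorted by date and containing a poll_share column (both are illustrative assumptions, not definitions from the chapter):

from sklearn.base import BaseEstimator, TransformerMixin

class RollingPollAverage(BaseEstimator, TransformerMixin):
    """Adds a rolling mean of a poll column (illustrative, not from the chapter)."""

    def __init__(self, column="poll_share", window=7):
        self.column = column
        self.window = window

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the training data

    def transform(self, X):
        X = X.copy()
        X[f"{self.column}_avg_{self.window}d"] = (
            X[self.column].rolling(self.window, min_periods=1).mean()
        )
        return X

Because it follows the fit/transform contract, it drops into a Pipeline step like any built-in transformer.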

3. Feature Stores Solve the Feature Management Problem

A feature store is a centralized repository for feature definitions, computations, and values. It solves three problems:

  • Feature inconsistency: a single source of truth for each feature.
  • Training-serving skew: the same store serves training and inference.
  • Point-in-time leakage: temporal joins prevent future data leaking into training.

The two interfaces: offline (batch retrieval for training with point-in-time correctness) and online (low-latency lookup for real-time inference).
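
For illustration, here is roughly how the two interfaces look in Feast, one open-source feature store; the repository layout, the feature view name (candidate_stats), and the entity key (market_id) are assumptions, not definitions from the chapter:

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Offline: point-in-time correct feature values joined onto training rows
training_df = store.get_historical_features(
    entity_df=entity_df,  # DataFrame of market_id + event_timestamp per row
    features=["candidate_stats:poll_avg_7d", "candidate_stats:poll_trend"],
).to_df()

# Online: low-latency lookup of the latest values at inference time
online_features = store.get_online_features(
    features=["candidate_stats:poll_avg_7d", "candidate_stats:poll_trend"],
    entity_rows=[{"market_id": "PRES-2028-D"}],  # hypothetical market key
).to_dict()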

4. MLflow Tracks Experiments Systematically

MLflow's tracking API logs every training run with:

  • Parameters: Hyperparameters and configuration.
  • Metrics: Brier score, log-loss, AUC, calibration error.
  • Artifacts: Serialized models, plots, feature importance files.
  • Tags: Metadata for search and organization.

This creates an auditable history of every model ever trained, enabling comparison, reproduction, and rollback.
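
A brief sketch of that history in practice, complementing the logging pattern under Key Code Patterns; the experiment name, tag keys, and metric values are illustrative placeholders:

import mlflow

mlflow.set_experiment("election-winner-model")
with mlflow.start_run():
    mlflow.log_params({"n_estimators": 300, "learning_rate": 0.05})
    mlflow.log_metrics({"brier_score": 0.21, "auc": 0.68})
    mlflow.set_tags({"data_version": "2024-10-01", "trigger": "drift"})
    mlflow.log_artifact("calibration_curve.png")  # assumes the plot file exists

# Compare every run ever logged to this experiment, best Brier score first
runs = mlflow.search_runs(experiment_names=["election-winner-model"],
                          order_by=["metrics.brier_score ASC"])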

5. The Model Registry Manages the Lifecycle

A model registry stores versioned models and manages their lifecycle through stages:

$$\text{None} \rightarrow \text{Staging} \rightarrow \text{Production} \rightarrow \text{Archived}$$

Key rules:

  • New models enter Staging and must pass validation gates before Production.
  • Only one model version is in Production at a time.
  • Archived models are retained for rollback.
  • Every transition is logged with timestamps and reasons (see the sketch after this list).
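
With MLflow's Model Registry, registration and a stage transition look roughly like this; the model name and version number are assumptions (newer MLflow releases also support alias-based promotion):

import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged in a training run under a fixed name
mlflow.register_model("runs:/<run_id>/model", "election-winner-model")

# Promote a validated version, archiving whatever currently serves Production
client = MlflowClient()
client.transition_model_version_stage(
    name="election-winner-model",
    version="4",
    stage="Production",
    archive_existing_versions=True,
)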

6. Validation Gates Prevent Bad Models from Deploying

Automated quality checks that a candidate model must pass before promotion to Production:

  • Brier score: must be below a threshold (e.g., 0.25).
  • AUC-ROC: must exceed a minimum (e.g., 0.65).
  • Calibration error: ECE must be below 0.05.
  • Must-beat-production: must outperform the current production model.

If any gate fails, the candidate is rejected and the current production model continues serving.
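
A minimal sketch of such a gate check in plain Python, assuming candidate and production metrics arrive as dicts (the dict layout is an assumption; the thresholds are the ones listed above):

def passes_validation_gates(candidate, production):
    """candidate / production: dicts like {'brier': 0.21, 'auc': 0.68, 'ece': 0.03}."""
    gates = [
        candidate["brier"] < 0.25,                  # Brier score gate
        candidate["auc"] > 0.65,                    # AUC-ROC gate
        candidate["ece"] < 0.05,                    # calibration gate
        candidate["brier"] < production["brier"],   # must beat production
    ]
    return all(gates)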

7. Drift Detection Is Non-Negotiable

Prediction market models face two types of drift:

  • Data drift: Input feature distributions change (e.g., a new candidate enters the race, changing polling distributions).
  • Concept drift: The relationship between features and outcomes changes (e.g., polls become less predictive as the election approaches).

Detection methods:

  • PSI (Population Stability Index): distribution shift in predictions; < 0.1 stable, 0.1-0.25 moderate, > 0.25 high.
  • KS test: feature distribution changes; p < 0.05 indicates drift (scipy sketch after this list).
  • Rolling Brier score: performance degradation; > 1.5x baseline triggers retraining.
  • Wasserstein distance: overall distribution shift; threshold is task-specific.
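
The KS test and Wasserstein distance both come straight from scipy; a sketch assuming reference_feature and current_feature are 1-D numpy arrays of one feature's values:

from scipy.stats import ks_2samp, wasserstein_distance

# Two-sample KS test: has this feature's distribution shifted?
statistic, p_value = ks_2samp(reference_feature, current_feature)
if p_value < 0.05:
    print(f"Drift detected (KS statistic = {statistic:.3f})")

# Wasserstein distance: magnitude of the overall shift (threshold is task-specific)
shift = wasserstein_distance(reference_feature, current_feature)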

8. Retraining Should Be Triggered, Not Scheduled

Scheduled retraining (e.g., "retrain every 30 days") is wasteful when the world is stable and insufficient when the world changes suddenly. Trigger-based retraining fires when monitoring detects evidence of degradation:

if rolling_brier_30d > 1.5 * baseline_brier:
    trigger_retrain()

This adapts to the pace of change in the underlying data.
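
A sketch of how the trigger's inputs are computed, assuming arrays of outcomes and predicted probabilities for markets resolved in the last 30 days, and the validation Brier score recorded at deployment as the baseline:

from sklearn.metrics import brier_score_loss

def retraining_triggered(outcomes_30d, probs_30d, baseline_brier):
    """True when the 30-day rolling Brier score degrades 1.5x past the baseline."""
    rolling_brier_30d = brier_score_loss(outcomes_30d, probs_30d)
    return rolling_brier_30d > 1.5 * baseline_brier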

9. CI/CD for ML Automates Quality Assurance

A CI/CD pipeline for ML ensures that every code change is tested, every model is validated, and every deployment is reversible:

  1. Code push triggers automated unit and integration tests.
  2. Training pipeline runs on fresh data.
  3. Validation gates check model quality.
  4. Staging deployment allows canary testing.
  5. Production deployment with monitoring and rollback.
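
Expressed as code, the same five steps can be driven by a thin orchestration script that the CI job runs on every push; every function below is a hypothetical stand-in for a pipeline stage, not an API from the chapter:

def run_ml_cicd_pipeline():
    """Hypothetical CI entry point: test, train, validate, then promote or abort."""
    run_unit_and_integration_tests()       # 1. fail fast on code errors
    candidate = train_on_fresh_data()      # 2. retrain with current features
    if not passes_validation_gates(candidate, current_production_metrics()):
        raise SystemExit("Candidate rejected; production model keeps serving")  # 3.
    deploy_to_staging(candidate)           # 4. canary testing
    promote_to_production(candidate)       # 5. monitored, reversible rollout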

10. Governance and Reproducibility Are Not Optional

Every prediction must be traceable to a specific model version, trained on specific data, with specific hyperparameters. This requires:

  • Pinned dependency versions (requirements.txt).
  • Data versioning (hashes or DVC).
  • Deterministic training (fixed random seeds).
  • Audit logs for all model transitions.
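
A minimal sketch of seeding and data hashing, logged as run tags so every prediction stays traceable; the file path and tag names are assumptions:

import hashlib
import random

import mlflow
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)  # deterministic training for numpy-backed estimators

# Hash the exact training data so the run is traceable to it
with open("data/training_set.parquet", "rb") as f:
    data_hash = hashlib.sha256(f.read()).hexdigest()

# Record alongside the MLflow run
mlflow.set_tags({"data_sha256": data_hash, "random_seed": str(SEED)})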

Key Code Patterns

# scikit-learn Pipeline with ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier

# numeric_features / categorical_features: lists of column names
pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features),
    ])),
    ('classifier', GradientBoostingClassifier())
])

# MLflow experiment tracking
import mlflow
with mlflow.start_run():
    mlflow.log_params(params)
    mlflow.log_metric("brier_score", brier)
    mlflow.sklearn.log_model(pipeline, "model")

# PSI drift detection
import numpy as np

def compute_psi(reference, current, n_bins=10):
    bins = np.linspace(min(reference.min(), current.min()),
                       max(reference.max(), current.max()), n_bins + 1)
    ref_pct = np.histogram(reference, bins)[0] / len(reference)
    cur_pct = np.histogram(current, bins)[0] / len(current)
    # Clip empty bins to avoid division by zero and log(0)
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct))

Key Formulas

  • $\text{PSI} = \sum_i (p_i^{cur} - p_i^{ref}) \ln(p_i^{cur} / p_i^{ref})$ : measures prediction distribution shift.
  • $\text{KS} = \sup_x \lvert F_{ref}(x) - F_{cur}(x) \rvert$ : detects feature distribution changes.
  • $\text{ECE} = \sum_b \frac{n_b}{N} \lvert \bar{p}_b - \bar{y}_b \rvert$ : measures calibration error (numpy sketch after this list).
  • $\text{Brier} = \frac{1}{N}\sum_i (p_i - y_i)^2$ : overall forecast accuracy.
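
The ECE formula translates directly into a few lines of numpy; a sketch assuming y_true and y_prob are numpy arrays of binary outcomes and predicted probabilities:

import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin (clip so p = 1.0 lands in the last bin)
    idx = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            # n_b / N  *  |mean predicted prob - observed frequency| in this bin
            ece += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return ece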

Decision Framework

  • How to prevent training-serving skew? scikit-learn Pipeline + feature store.
  • How to track experiments? MLflow with parameters, metrics, and artifacts.
  • How to version models? MLflow Model Registry with staging/production stages.
  • When to retrain? Triggered by drift detection, not by schedule.
  • What drift metric? PSI for predictions, KS for features, rolling Brier for performance.
  • How to deploy safely? Validation gates + canary/shadow deployment + rollback.
  • What MLOps level to target? Level 2 (CI/CD) minimum; Level 3 (full automation) ideal.

The One-Sentence Summary

Encapsulate preprocessing in pipelines, store features centrally, track experiments systematically, monitor for drift continuously, and automate the entire train-validate-deploy-monitor cycle so that your prediction market models run reliably without constant human intervention.