Chapter 27 Key Takeaways
Core Principle
A model in a notebook generates zero profit. The distance from "it works on my machine" to "it runs reliably in production, 24/7" is vast -- and it is the distance that separates hobbyist data science from professional prediction market trading.
The Big Ideas
1. Training-Serving Skew Is the Silent Killer
The most common production ML failure is a mismatch between the features used during training and the features used during inference. If training uses a 7-day rolling average of polls and serving uses a 5-day average (because someone copied the wrong code), the model will silently produce degraded predictions.
Prevention: Encapsulate all preprocessing into a scikit-learn Pipeline and serve with the same Pipeline object used during training.
2. scikit-learn Pipelines Are the Foundation
A Pipeline chains preprocessing steps and a model into a single serializable object:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
import joblib

pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([...])),
    ('classifier', GradientBoostingClassifier())
])
pipeline.fit(X_train, y_train)            # Fit everything
predictions = pipeline.predict(X_new)     # Apply everything
joblib.dump(pipeline, 'model.pkl')        # Serialize everything
Key capabilities:
- ColumnTransformer applies different transformations to numeric vs. categorical features.
- Custom transformers (via BaseEstimator + TransformerMixin) add domain-specific logic; see the sketch after this list.
- GridSearchCV and cross_val_score work directly on pipelines.
- Serialized pipelines guarantee that training and serving use identical preprocessing.
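As an illustration of the custom-transformer bullet, here is a minimal sketch that follows the fit/transform contract; the column name and the 7-day window are hypothetical, not taken from the chapter's case study:

# Hypothetical custom transformer: adds a rolling poll average as a feature.
# Assumes X is a pandas DataFrame with an illustrative 'poll_share' column.
from sklearn.base import BaseEstimator, TransformerMixin

class RollingPollAverage(BaseEstimator, TransformerMixin):
    def __init__(self, window=7):
        self.window = window              # window length in days (illustrative)

    def fit(self, X, y=None):
        return self                       # stateless: nothing learned from training data

    def transform(self, X):
        X = X.copy()
        X['poll_share_rolling'] = X['poll_share'].rolling(self.window, min_periods=1).mean()
        return X

Because it implements fit and transform, it can be added as a named step in the Pipeline above and is serialized together with the model.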
3. Feature Stores Solve the Feature Management Problem
A feature store is a centralized repository for feature definitions, computations, and values. It solves three problems:
| Problem | Solution |
|---|---|
| Feature inconsistency | Single source of truth for each feature |
| Training-serving skew | Same store serves training and inference |
| Point-in-time leakage | Temporal joins prevent future data leaking into training |
The store exposes two interfaces: offline (batch retrieval for training with point-in-time correctness) and online (low-latency lookup for real-time inference).
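The takeaway is tool-agnostic, but as a concrete sketch, here is roughly how the two interfaces look with Feast, an open-source feature store; the repository path, feature names, and entity key are assumptions for illustration:

# Sketch of the offline and online interfaces, assuming Feast as the feature store.
from feast import FeatureStore

store = FeatureStore(repo_path="feature_repo/")           # illustrative repo path

# Offline interface: point-in-time-correct joins for building a training set.
training_df = store.get_historical_features(
    entity_df=entity_df,                                  # entity keys + event timestamps
    features=["polls:rolling_avg_7d", "market:volume_24h"],
).to_df()

# Online interface: low-latency lookup for real-time inference.
feature_vector = store.get_online_features(
    features=["polls:rolling_avg_7d", "market:volume_24h"],
    entity_rows=[{"market_id": "some-market-id"}],
).to_dict()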
4. MLflow Tracks Experiments Systematically
MLflow's tracking API logs every training run with:
- Parameters: Hyperparameters and configuration.
- Metrics: Brier score, log-loss, AUC, calibration error.
- Artifacts: Serialized models, plots, feature importance files.
- Tags: Metadata for search and organization.
This creates an auditable history of every model ever trained, enabling comparison, reproduction, and rollback.
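For example, logged runs can be pulled back as a DataFrame and ranked by a metric; the metric column below assumes the Brier score was logged as in the tracking pattern at the end of this chapter:

import mlflow

# Search the active experiment's runs and rank them by Brier score (lower is better).
runs = mlflow.search_runs(order_by=["metrics.brier_score ASC"])
print(runs[["run_id", "start_time", "metrics.brier_score"]].head())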
5. The Model Registry Manages the Lifecycle
A model registry stores versioned models and manages their lifecycle through stages:
$$\text{None} \rightarrow \text{Staging} \rightarrow \text{Production} \rightarrow \text{Archived}$$
Key rules:
- New models enter Staging and must pass validation gates before Production.
- Only one model version is in Production at a time.
- Archived models are retained for rollback.
- Every transition is logged with timestamps and reasons.
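A sketch of these transitions using MLflow's registry client; the registered model name and version number are assumptions, and recent MLflow releases also offer an alias-based workflow:

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Promote a candidate into Staging for validation.
client.transition_model_version_stage(
    name="election-forecaster", version=4, stage="Staging"
)

# After the validation gates pass, promote to Production and archive the incumbent,
# which enforces the "only one version in Production" rule.
client.transition_model_version_stage(
    name="election-forecaster", version=4, stage="Production",
    archive_existing_versions=True,
)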
6. Validation Gates Prevent Bad Models from Deploying
Automated quality checks that a candidate model must pass before promotion to Production:
| Gate | Criterion |
|---|---|
| Brier score | Must be below a threshold (e.g., 0.25) |
| AUC-ROC | Must exceed a minimum (e.g., 0.65) |
| Calibration error | ECE must be below 0.05 |
| Must-beat-production | Must outperform the current production model |
If any gate fails, the candidate is rejected and the current production model continues serving.
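A minimal sketch of how these gates might be coded, using the thresholds from the table; the metric dictionary keys are assumptions about how evaluation results are stored:

# Candidate model is promoted only if every gate in the table passes.
def passes_validation_gates(candidate, production):
    gates = [
        candidate["brier_score"] < 0.25,                       # Brier score gate
        candidate["auc_roc"] > 0.65,                           # AUC-ROC gate
        candidate["ece"] < 0.05,                               # calibration gate
        candidate["brier_score"] < production["brier_score"],  # must beat production
    ]
    return all(gates)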
7. Drift Detection Is Non-Negotiable
Prediction market models face two types of drift:
- Data drift: Input feature distributions change (e.g., a new candidate enters the race, changing polling distributions).
- Concept drift: The relationship between features and outcomes changes (e.g., polls become less predictive as the election approaches).
Detection methods:
| Method | What It Detects | Threshold |
|---|---|---|
| PSI (Population Stability Index) | Distribution shift in predictions | < 0.1 stable, 0.1-0.25 moderate, > 0.25 high |
| KS test | Feature distribution changes | p < 0.05 indicates drift |
| Rolling Brier score | Performance degradation | > 1.5x baseline triggers retraining |
| Wasserstein distance | Overall distribution shift | Task-specific threshold |
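The PSI pattern appears in the code section at the end of this chapter; the KS test is nearly a one-liner with SciPy, sketched here using the table's p-value threshold:

from scipy.stats import ks_2samp

# Two-sample Kolmogorov-Smirnov test: has this feature's distribution shifted?
def feature_drifted(reference_values, current_values, alpha=0.05):
    statistic, p_value = ks_2samp(reference_values, current_values)
    return p_value < alpha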
8. Retraining Should Be Triggered, Not Scheduled
Scheduled retraining (e.g., "retrain every 30 days") is wasteful when the world is stable and insufficient when the world changes suddenly. Trigger-based retraining fires when monitoring detects evidence of degradation:
if rolling_brier_30d > 1.5 * baseline_brier:
    trigger_retrain()
This adapts to the pace of change in the underlying data.
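In practice the performance trigger is usually combined with the drift signals from the previous point; a sketch with illustrative names, reusing the 1.5x Brier rule and the PSI > 0.25 "high drift" band:

# Trigger retraining when either monitoring signal crosses its threshold.
def should_retrain(rolling_brier_30d, baseline_brier, prediction_psi):
    performance_degraded = rolling_brier_30d > 1.5 * baseline_brier
    predictions_shifted = prediction_psi > 0.25      # "high" PSI band from point 7
    return performance_degraded or predictions_shifted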
9. CI/CD for ML Automates Quality Assurance
A CI/CD pipeline for ML ensures that every code change is tested, every model is validated, and every deployment is reversible:
- Code push triggers automated unit and integration tests.
- Training pipeline runs on fresh data.
- Validation gates check model quality.
- Staging deployment allows canary testing.
- Production deployment with monitoring and rollback.
10. Governance and Reproducibility Are Not Optional
Every prediction must be traceable to a specific model version, trained on specific data, with specific hyperparameters. This requires:
- Pinned dependency versions (requirements.txt).
- Data versioning (hashes or DVC).
- Deterministic training (fixed random seeds).
- Audit logs for all model transitions.
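Two of these items reduce to a few lines of code; a sketch with an illustrative seed value and file path (the data hash would be logged alongside the run, for example as an MLflow tag):

import hashlib
import random

import numpy as np

# Deterministic training: fix every source of randomness used by the pipeline.
SEED = 42                                   # illustrative value
random.seed(SEED)
np.random.seed(SEED)

# Data versioning: fingerprint the exact training file used for this run.
def file_sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

training_data_hash = file_sha256("data/training_set.parquet")   # illustrative path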
Key Code Patterns
# scikit-learn Pipeline with ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier

pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features),
    ])),
    ('classifier', GradientBoostingClassifier())
])
# MLflow experiment tracking
import mlflow
import mlflow.sklearn

with mlflow.start_run():
    mlflow.log_params(params)
    mlflow.log_metric("brier_score", brier)
    mlflow.sklearn.log_model(pipeline, "model")
# PSI drift detection
import numpy as np

def compute_psi(reference, current, n_bins=10, eps=1e-6):
    bins = np.linspace(min(reference.min(), current.min()),
                       max(reference.max(), current.max()), n_bins + 1)
    # eps keeps empty bins from producing log(0) or division by zero
    ref_pct = np.histogram(reference, bins)[0] / len(reference) + eps
    cur_pct = np.histogram(current, bins)[0] / len(current) + eps
    return np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct))
Key Formulas
| Formula | Purpose |
|---|---|
| $\text{PSI} = \sum_i (p_i^{cur} - p_i^{ref}) \ln(p_i^{cur} / p_i^{ref})$ | Measure prediction distribution shift |
| $\text{KS} = \sup_x \lvert F_{ref}(x) - F_{cur}(x) \rvert$ | Detect feature distribution changes |
| $\text{ECE} = \sum_b \frac{n_b}{N} \lvert \bar{p}_b - \bar{y}_b \rvert$ | Measure calibration error |
| $\text{Brier} = \frac{1}{N}\sum_i (p_i - y_i)^2$ | Overall forecast accuracy |
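The ECE formula translates directly into NumPy; a sketch assuming ten equal-width probability bins:

import numpy as np

# Expected Calibration Error: weighted gap between confidence and observed frequency.
def expected_calibration_error(probs, outcomes, n_bins=10):
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)   # bin index per prediction
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            # (n_b / N) * | mean predicted probability - observed outcome rate |
            ece += in_bin.mean() * abs(probs[in_bin].mean() - outcomes[in_bin].mean())
    return ece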
Decision Framework
| Question | Recommendation |
|---|---|
| How to prevent training-serving skew? | scikit-learn Pipeline + feature store |
| How to track experiments? | MLflow with parameters, metrics, and artifacts |
| How to version models? | MLflow Model Registry with staging/production stages |
| When to retrain? | Triggered by drift detection, not by schedule |
| What drift metric? | PSI for predictions, KS for features, rolling Brier for performance |
| How to deploy safely? | Validation gates + canary/shadow deployment + rollback |
| What MLOps level to target? | Level 2 (CI/CD) minimum; Level 3 (full automation) ideal |
The One-Sentence Summary
Encapsulate preprocessing in pipelines, store features centrally, track experiments systematically, monitor for drift continuously, and automate the entire train-validate-deploy-monitor cycle so that your prediction market models run reliably without constant human intervention.