Appendix G: Templates and Checklists
Practical templates for every phase of a data science project. Print these, pin them to your wall, paste them into your project documentation. They encode the hard-won lessons from this book into repeatable processes.
These templates assume you have read the relevant chapters. They are not substitutes for understanding — they are guardrails against forgetting.
1. ML Project Checklist
Use this for every project from problem framing to production monitoring. Cross-reference: Chapters 1, 2, 34, 35.
Phase 1: Problem Framing
- [ ] Business question defined in one sentence
- [ ] Success metric agreed upon with stakeholders (not just model metric — business metric)
- [ ] ML is the right tool (have you ruled out simpler approaches: rules, heuristics, SQL queries?)
- [ ] Target variable defined precisely (what are you predicting, at what grain, over what time window?)
- [ ] Observation unit defined (one row = one ___?)
- [ ] Prediction horizon defined (how far in advance does the prediction need to be made?)
- [ ] Action plan: what will the business DO with the prediction?
- [ ] Cost of errors articulated: what happens when the model is wrong (false positive vs. false negative)?
- [ ] Baseline established: what is the current decision process, and how good is it?
- [ ] Ethical review: could this model cause harm? Who is affected? Are protected groups involved?
Phase 2: Data Collection and Understanding
- [ ] Data sources identified and access confirmed
- [ ] Data dictionary obtained or created
- [ ] Data freshness verified (when was this data collected? Is it representative of the current state?)
- [ ] Row count and feature count documented
- [ ] Target variable distribution examined (class balance for classification, distribution shape for regression)
- [ ] Missing data patterns cataloged (MCAR, MAR, or MNAR?)
- [ ] Data leakage risks identified (features that would not be available at prediction time)
- [ ] Join logic validated (if multiple tables — are you introducing duplication?)
- [ ] Sample bias assessed (does the data represent the population you will predict on?)
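The join-validation item above can be enforced mechanically rather than eyeballed. As a sketch (the table and column names are illustrative), pandas' `validate=` argument raises an error when a join does not have the cardinality you expect, and `indicator=True` exposes rows that failed to match:

```python
import pandas as pd

# Hypothetical tables: one row per order, one row per customer
customers = pd.DataFrame({"customer_id": [1, 2, 3], "segment": ["a", "b", "a"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 5.0, 7.0]})

# validate= raises MergeError if the join cardinality is not what you expect;
# indicator= adds a _merge column showing which side each row came from
joined = orders.merge(customers, on="customer_id", how="left",
                      validate="many_to_one", indicator=True)

unmatched = int((joined["_merge"] == "left_only").sum())  # orders with no customer
row_growth = len(joined) - len(orders)                    # > 0 means duplication
```

A many-to-one join must never add rows; if `row_growth` is positive, the "unique" side of the join has duplicate keys and you are silently inflating your data.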
Phase 3: Exploratory Data Analysis
- [ ] Univariate distributions for all features (histograms, value counts)
- [ ] Target variable distribution and class balance
- [ ] Bivariate relationships: features vs. target (correlation, box plots, scatter plots)
- [ ] Feature-feature correlations (multicollinearity check)
- [ ] Outliers identified and decision made (keep, cap, remove, investigate)
- [ ] Temporal patterns checked (if time is involved — is there drift?)
- [ ] Domain expert consulted on surprising patterns
- [ ] EDA findings documented (not just run in a notebook and forgotten)
Phase 4: Feature Engineering and Data Preparation
- [ ] Feature engineering plan based on EDA and domain knowledge
- [ ] Categorical encoding strategy chosen per feature (one-hot, ordinal, target encoding)
- [ ] Missing data strategy chosen per feature (imputation method, indicator flags)
- [ ] Numerical scaling applied if required by algorithm (standardization, normalization)
- [ ] Feature interactions created where domain knowledge suggests them
- [ ] Temporal features engineered correctly (no future leakage)
- [ ] Pipeline built (all transformations reproducible, not manual)
- [ ] Train/validation/test split created BEFORE any data-dependent transformations
- [ ] Transformations fit ONLY on training data, applied to validation and test
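The last three items above are exactly what a scikit-learn `Pipeline` enforces. A minimal sketch on synthetic data (the features and steps are illustrative): the split happens first, and every data-dependent transformation learns its statistics from the training fold only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

# Split FIRST, before any data-dependent transformation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)               # statistics learned from training data only
test_score = pipe.score(X_test, y_test)  # the same statistics are reused here
```

Because the imputer and scaler live inside the pipeline, cross-validation refits them on each training fold automatically, which is the whole point.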
Phase 5: Model Training and Selection
- [ ] Baseline model trained (e.g., logistic regression for classification, mean prediction for regression)
- [ ] Multiple model families tried (linear, tree-based, gradient boosting at minimum)
- [ ] Cross-validation used for model comparison (not a single train/test split)
- [ ] Appropriate metric selected based on business context
- [ ] Overfitting checked (training score vs. validation score gap)
- [ ] Learning curves plotted (does the model need more data or more features?)
- [ ] Feature importance examined (do the important features make domain sense?)
- [ ] Hyperparameter tuning performed on top 1-2 candidates
- [ ] Final model evaluated on held-out test set ONCE
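The learning-curve item above answers the "more data or more features?" question directly. A sketch using scikit-learn's `learning_curve` on synthetic data (dataset and model are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=600, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
# Curves converging at a low score suggest better features are needed;
# a validation curve still rising at the right edge suggests more data may help
final_gap = train_mean[-1] - val_mean[-1]
```

Plot `train_mean` and `val_mean` against `sizes`; the shape of the two curves, not any single number, is what tells you where to invest next.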
Phase 6: Model Interpretation and Validation
- [ ] SHAP values computed for global and local interpretation
- [ ] Top features align with domain expectations (if not — investigate, do not ignore)
- [ ] Edge cases tested (what does the model predict for extreme inputs?)
- [ ] Subgroup analysis performed (does the model work equally well across segments?)
- [ ] Fairness metrics computed for protected groups (if applicable)
- [ ] Model card drafted (what the model does, does not do, and for whom it may fail)
- [ ] Results presented to domain expert for sanity check
Phase 7: Deployment and Monitoring
- [ ] Serving infrastructure chosen (batch vs. real-time)
- [ ] API endpoint tested with sample inputs
- [ ] Input validation implemented (what happens with missing fields, wrong types, out-of-range values?)
- [ ] Prediction logging enabled
- [ ] Monitoring dashboard deployed (prediction distribution, feature drift, latency)
- [ ] Alerting configured (drift beyond threshold, error rate spike, latency degradation)
- [ ] Retraining schedule defined (or trigger-based retraining criteria)
- [ ] Rollback plan documented (how to revert to the previous model)
- [ ] A/B test plan created for comparing new model to current system
- [ ] Documentation completed (model card, data lineage, feature definitions)
2. EDA Template
What to check for every new dataset. Cross-reference: Chapter 1 (initial exploration), Chapter 6 (feature engineering context).
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


def eda_report(df, target_col=None):
    """Generate a standard EDA report for any DataFrame."""
    print("=" * 60)
    print("DATASET OVERVIEW")
    print("=" * 60)
    print(f"Shape: {df.shape[0]:,} rows x {df.shape[1]} columns")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
    print(f"Duplicate rows: {df.duplicated().sum():,}")
    print()

    # --- Data types ---
    print("DATA TYPES")
    print(df.dtypes.value_counts())
    print()

    # --- Missing data ---
    print("MISSING DATA")
    missing = df.isnull().sum()
    missing_pct = (missing / len(df) * 100).round(1)
    missing_df = pd.DataFrame({
        "count": missing,
        "pct": missing_pct
    })
    print(missing_df[missing_df["count"] > 0].sort_values(
        "pct", ascending=False
    ))
    print()

    # --- Numeric summary ---
    print("NUMERIC FEATURES")
    print(df.describe().T[["count", "mean", "std", "min", "max"]])
    print()

    # --- Categorical summary ---
    cat_cols = df.select_dtypes(include=["object", "category"]).columns
    if len(cat_cols) > 0:
        print("CATEGORICAL FEATURES")
        for col in cat_cols:
            n_unique = df[col].nunique()
            top_val = df[col].mode().iloc[0] if not df[col].mode().empty else "N/A"
            print(f"  {col}: {n_unique} unique, top = '{top_val}'")
        print()

    # --- Target variable ---
    if target_col and target_col in df.columns:
        print(f"TARGET: {target_col}")
        if df[target_col].nunique() <= 20:
            print(df[target_col].value_counts(normalize=True).round(3))
        else:
            print(df[target_col].describe())
        print()

    return missing_df
```
EDA Checklist (Non-Code)
For each dataset, answer these questions before modeling:
- What does one row represent? (customer, transaction, day, sensor reading)
- How many rows and columns? (rough order of magnitude)
- What is the target variable, and what is its distribution?
- Which features are numeric, categorical, datetime, text, or ID columns?
- How much data is missing, and is the missingness random or informative?
- Are there obvious outliers, and are they errors or real extreme values?
- Which features correlate with the target? Which correlate with each other?
- Is there a time component? If so, is there temporal leakage risk?
- Does the data contain protected attributes (race, gender, age)?
- What would a domain expert say about this data?
3. Model Comparison Template
Standardized table for comparing model performance. Cross-reference: Chapters 13, 14, 16, 18.
Comparison Table
| Model | CV Metric (mean +/- std) | Test Metric | Train Metric | Overfit? | Training Time | Inference Time | Notes |
|---|---|---|---|---|---|---|---|
| Logistic Regression (baseline) | | | | | | | |
| Random Forest (default) | | | | | | | |
| XGBoost (default) | | | | | | | |
| LightGBM (default) | | | | | | | |
| XGBoost (tuned) | | | | | | | |
| LightGBM (tuned) | | | | | | | |
Comparison Code
```python
import time

import pandas as pd
from sklearn.model_selection import cross_validate


def compare_models(models, X_train, y_train, scoring, cv=5):
    """Compare models with cross-validation, timing, and overfitting check."""
    results = []
    for name, model in models.items():
        start = time.time()
        cv_results = cross_validate(
            model, X_train, y_train,
            scoring=scoring,
            cv=cv,
            return_train_score=True,
            n_jobs=-1
        )
        elapsed = time.time() - start
        results.append({
            "model": name,
            "cv_mean": cv_results["test_score"].mean(),
            "cv_std": cv_results["test_score"].std(),
            "train_mean": cv_results["train_score"].mean(),
            "overfit_gap": (
                cv_results["train_score"].mean()
                - cv_results["test_score"].mean()
            ),
            "fit_time": elapsed
        })
    return pd.DataFrame(results).sort_values("cv_mean", ascending=False)
```
Decision Criteria
After filling in the table, ask:
- Does any model significantly outperform the baseline?
- Is the best model overfitting (large gap between train and CV scores)?
- Is the performance difference between top models within CV standard deviation? (If so, pick the simpler model.)
- Does inference time meet production requirements?
- Is the model interpretable enough for stakeholders?
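The "within CV standard deviation" question can be answered with a quick numerical check. As a sketch on synthetic data (the dataset, models, and decision rule are illustrative, and a paired statistical test would be more rigorous):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

simple_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
complex_scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)

gap = complex_scores.mean() - simple_scores.mean()
noise = max(simple_scores.std(), complex_scores.std())
# If the improvement is within CV noise, default to the simpler model
prefer_simple = gap <= noise
```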
4. A/B Test Design Template
Before running any experiment. Cross-reference: Chapter 3.
Pre-Experiment Checklist
- [ ] Hypothesis: "Changing [X] will [increase/decrease] [metric] by at least [minimum detectable effect]."
- [ ] Primary metric: One metric that decides success or failure. Define it precisely.
- [ ] Guardrail metrics: Metrics that must NOT degrade (e.g., revenue, page load time, error rate).
- [ ] Minimum detectable effect (MDE): The smallest improvement that would be worth implementing.
- [ ] Statistical significance level: alpha = ___ (typically 0.05)
- [ ] Statistical power: 1 - beta = ___ (typically 0.80)
- [ ] Required sample size: Calculated from MDE, alpha, power, and baseline metric.
- [ ] Test duration: Based on required sample size and daily traffic. Minimum one full business cycle (typically one week).
- [ ] Randomization unit: User, session, device, or page view? (User is almost always correct.)
- [ ] Randomization method: How are users assigned to treatment/control?
- [ ] Novelty/primacy effects: Will you exclude the first N days from analysis?
Sample Size Calculation
```python
import numpy as np
from scipy import stats


def ab_sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
    """Calculate required sample size per group for a two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + mde
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * np.sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))**2
         / (p2 - p1)**2)
    return int(np.ceil(n))


# Example: baseline conversion 5%, want to detect 0.5% absolute lift
n = ab_sample_size(0.05, 0.005)
print(f"Required per group: {n:,}")
```
Post-Experiment Checklist
- [ ] Sample ratio mismatch check (are groups equal in size, within expected variance?)
- [ ] Primary metric: point estimate and confidence interval
- [ ] Statistical significance assessed (p-value or Bayesian posterior)
- [ ] Practical significance assessed (is the effect large enough to matter?)
- [ ] Guardrail metrics checked (no degradation?)
- [ ] Subgroup analysis (does the effect vary across segments?)
- [ ] Decision documented with rationale
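The sample ratio mismatch check from the list above is a simple chi-square goodness-of-fit test against the intended split. A sketch with hypothetical assignment counts:

```python
from scipy import stats

# Hypothetical assignment counts for an intended 50/50 split
control_n, treatment_n = 50_210, 49_790
observed = [control_n, treatment_n]
expected = [sum(observed) / 2] * 2

chi2, p_value = stats.chisquare(observed, f_exp=expected)
# A very small p-value (commonly < 0.001) suggests the randomization is broken
srm_suspected = p_value < 0.001
```

An SRM finding invalidates the experiment: fix the assignment mechanism and rerun rather than analyzing the biased sample.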
5. Deployment Readiness Checklist
Before promoting a model to production. Cross-reference: Chapters 29, 30, 31, 32.
Code Quality
- [ ] Code passes linting and formatting checks (e.g., ruff or flake8 for linting, black for formatting)
- [ ] Unit tests for feature engineering functions
- [ ] Integration tests for the full prediction pipeline
- [ ] No hardcoded paths, credentials, or magic numbers
- [ ] Dependencies pinned in `requirements.txt` or `pyproject.toml`
- [ ] Code reviewed by at least one other person
Model Validation
- [ ] Model performance meets agreed threshold on held-out test data
- [ ] Performance validated on recent data (not just historical)
- [ ] Subgroup analysis shows acceptable performance across segments
- [ ] Fairness audit completed (if protected groups are involved)
- [ ] Edge case testing completed (null inputs, extreme values, unseen categories)
- [ ] Model card written and reviewed
Infrastructure
- [ ] Serving endpoint deployed and responding (health check passes)
- [ ] Input schema validated (what happens with malformed requests?)
- [ ] Latency meets SLA (p50, p95, p99 measured)
- [ ] Throughput tested under expected load
- [ ] Fallback behavior defined (what happens when the model is unavailable?)
- [ ] Logging captures: input features, predictions, timestamps, model version
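The input-validation item above can be as simple as a schema check that fails loudly before the model ever sees the request. A hypothetical sketch (field names, ranges, and categories are illustrative, not from any real service; libraries like pydantic do this more thoroughly):

```python
# Hypothetical request schema for a single prediction
EXPECTED_TYPES = {"age": (int, float), "income": (int, float), "region": str}
VALID_REGIONS = {"north", "south", "east", "west"}


def validate_request(payload: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the payload is usable."""
    errors = []
    for field, types in EXPECTED_TYPES.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], types):
            errors.append(f"wrong type for {field}: {type(payload[field]).__name__}")
    if isinstance(payload.get("age"), (int, float)) and not 0 <= payload["age"] <= 120:
        errors.append("age out of range")
    if isinstance(payload.get("region"), str) and payload["region"] not in VALID_REGIONS:
        errors.append("unseen category for region")
    return errors
```

Rejecting a malformed request with a clear error is almost always better than letting an imputer silently turn garbage into a confident prediction.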
Monitoring
- [ ] Prediction distribution monitoring active
- [ ] Feature drift monitoring active
- [ ] Performance monitoring active (if ground truth is available with delay)
- [ ] Alert thresholds configured
- [ ] Dashboard accessible to the team
- [ ] Retraining trigger defined (scheduled or drift-based)
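One common implementation of the feature-drift item above is the population stability index (PSI). A self-contained sketch (the thresholds are a widely used rule of thumb, not a law):

```python
import numpy as np


def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time sample (expected) and a production sample (actual)."""
    # Bin edges come from quantiles of the training distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    # Clip production values into the training range so every row lands in a bin
    a_pct = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) on empty bins
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


rng = np.random.default_rng(0)
train_sample = rng.normal(0.0, 1.0, 10_000)
psi_stable = population_stability_index(train_sample, rng.normal(0.0, 1.0, 10_000))
psi_shifted = population_stability_index(train_sample, rng.normal(0.5, 1.0, 10_000))
# Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift
```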
Documentation and Governance
- [ ] Model card completed (purpose, limitations, intended use, fairness audit results)
- [ ] Data lineage documented (where does each feature come from?)
- [ ] Experiment tracking entry linked (MLflow run ID or equivalent)
- [ ] Approval from model owner / business stakeholder
- [ ] Rollback procedure documented and tested
6. Model Card Template
Document what your model does, how well it does it, and where it fails. Cross-reference: Chapter 33.
MODEL CARD: [Model Name]
========================
1. MODEL DETAILS
- Model type: [e.g., XGBoost classifier]
- Version: [e.g., 2.1.0]
- Date trained: [YYYY-MM-DD]
- Training data date range: [start] to [end]
- Owner: [team or individual]
- Contact: [email or Slack channel]
2. INTENDED USE
- Primary use case: [one sentence]
- Intended users: [who consumes the predictions]
- Out-of-scope uses: [what this model should NOT be used for]
3. TRAINING DATA
- Source: [database, table, or file]
- Size: [rows, features]
- Target variable: [name and definition]
- Known limitations: [sampling bias, temporal coverage, missing populations]
4. EVALUATION RESULTS
- Primary metric: [name] = [value] on [test set description]
- Secondary metrics:
- [metric]: [value]
- [metric]: [value]
- Performance by subgroup:
| Subgroup | Metric | Value |
|---------------|--------|-------|
| [group A] | | |
| [group B] | | |
5. FAIRNESS ANALYSIS
- Protected attributes examined: [list]
- Fairness metrics:
| Metric | Group A | Group B | Threshold | Pass? |
|---------------------|---------|---------|-----------|-------|
| Demographic parity | | | | |
| Equalized odds | | | | |
- Mitigation applied: [none / reweighting / threshold adjustment / etc.]
6. LIMITATIONS AND RISKS
- [Known failure modes]
- [Populations where performance degrades]
- [Temporal assumptions — will the model degrade as the world changes?]
7. MONITORING
- Drift detection: [method and threshold]
- Retraining schedule: [frequency or trigger]
- Alerting: [who gets paged and when]
7. Fairness Audit Checklist
For any model that affects people. Cross-reference: Chapter 33.
Pre-Modeling
- [ ] Identify protected attributes in the data (race, gender, age, disability, religion, national origin)
- [ ] Identify proxy variables that correlate with protected attributes (zip code, name, school)
- [ ] Document the potential harms of model errors for different groups
- [ ] Consult affected communities or domain experts on fairness criteria
- [ ] Choose appropriate fairness definition(s) based on context — and document WHY
During Modeling
- [ ] Examine training data for historical bias (are outcomes biased by past discrimination?)
- [ ] Check representation: are all groups adequately represented in training data?
- [ ] Monitor performance metrics by subgroup during cross-validation
- [ ] If subgroup performance differs significantly, investigate root cause before applying mitigation
Post-Modeling
- [ ] Compute fairness metrics across all protected groups:
- [ ] Demographic parity (positive prediction rate by group)
- [ ] Equalized odds (TPR and FPR by group)
- [ ] Predictive parity (precision by group)
- [ ] Disparate impact ratio (four-fifths rule if applicable)
- [ ] Document the fairness-accuracy tradeoff: what accuracy would you lose to achieve fairness?
- [ ] Apply mitigation if needed:
- [ ] Pre-processing: reweighting, resampling
- [ ] In-processing: fairness constraints during training
- [ ] Post-processing: group-specific threshold adjustment
- [ ] Re-evaluate all metrics after mitigation
- [ ] Document findings in the model card
- [ ] Get sign-off from ethics review board or responsible AI team (if your organization has one)
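The first three metrics in the list above reduce to per-group rates. A minimal sketch on toy data (the labels, predictions, and groups are illustrative; libraries such as fairlearn provide audited implementations):

```python
import numpy as np


def group_rates(y_true, y_pred, group):
    """Per-group positive prediction rate (demographic parity), TPR, and FPR."""
    out = {}
    for g in np.unique(group):
        m = group == g
        yt, yp = y_true[m], y_pred[m]
        out[str(g)] = {
            "positive_rate": yp.mean(),
            "tpr": yp[yt == 1].mean() if (yt == 1).any() else np.nan,
            "fpr": yp[yt == 0].mean() if (yt == 0).any() else np.nan,
        }
    return out


# Toy labels and predictions for two groups
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

rates = group_rates(y_true, y_pred, group)
# Disparate impact ratio (four-fifths rule): min positive rate / max positive rate
pos_rates = [r["positive_rate"] for r in rates.values()]
di_ratio = min(pos_rates) / max(pos_rates)
```

Comparing `positive_rate` across groups checks demographic parity; comparing `tpr` and `fpr` checks equalized odds; a `di_ratio` below 0.8 fails the four-fifths rule.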
Ongoing
- [ ] Monitor fairness metrics in production (they can drift just like performance)
- [ ] Re-audit after retraining
- [ ] Update the model card with each version
Using These Templates
These templates are starting points. Adapt them to your organization, your domain, and your risk tolerance. A model that recommends movies needs less scrutiny than a model that decides who gets a loan.
The point is not to check every box mechanically. The point is to ask the right questions before you need to, because the cost of asking after deployment is always higher.