# Key Takeaways: The Machine Learning Workflow
This is your reference card for Chapter 30, the capstone of Part V. The core lesson: the workflow is as important as the algorithm. A great model inside a leaky workflow is worse than a mediocre model inside a clean one.
## Key Concepts

- **Data leakage is the silent killer of ML projects.** It occurs when information from outside the training set influences model training. The most common form: fitting a scaler or encoder on the entire dataset before splitting. The result: overoptimistic performance estimates that don't hold up in production.
- **Pipelines are the cure.** A scikit-learn `Pipeline` chains preprocessing and modeling into a single object that guarantees preprocessing steps are fit only on training data. Use pipelines for everything, always.
- **`ColumnTransformer` handles real-world data.** Real datasets have numeric, categorical, and binary features that need different preprocessing. `ColumnTransformer` applies the right transformation to the right columns, all inside the pipeline.
- **`GridSearchCV` systematizes hyperparameter tuning.** Instead of manually trying different settings, `GridSearchCV` tries every combination in a grid and evaluates each with cross-validation. `RandomizedSearchCV` is faster for large grids.
- **The test set is a sealed envelope.** Hold it out at the beginning. Make all decisions using cross-validation on the training set. Touch the test set exactly once, at the very end. If you violate this, your performance estimate is unreliable.
- **Reproducibility requires deliberate effort.** Set random seeds everywhere. Record library versions. Save the fitted pipeline. Document your decisions. Future-you will thank present-you.
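The first two concepts can be seen side by side in a minimal sketch (synthetic data and a `LogisticRegression` stand-in for illustration): the wrong version fits the scaler on all rows before splitting; the pipeline version fits it on training rows only.

```python
# Minimal sketch: keep the scaler inside the Pipeline so it is fit only
# on the training rows. Synthetic data for illustration.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# WRONG (leakage): StandardScaler().fit(X) sees the test rows' statistics.
# RIGHT: the pipeline fits the scaler on X_train only during fit().
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

Inspecting `pipe.named_steps['scaler'].mean_` after fitting confirms the scaler's statistics come from `X_train` alone, not the full dataset.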
## The Complete ML Workflow

```text
1. DEFINE THE PROBLEM
   - What question? What metric? What's the baseline?
        |
        v
2. PREPARE THE DATA
   - Load, inspect, identify numeric/categorical features
   - Hold out final test set (20-30%)
        |
        v
3. BUILD THE PIPELINE
   - ColumnTransformer for preprocessing
   - Model as the final step
   - Everything inside the pipeline
        |
        v
4. TUNE HYPERPARAMETERS
   - GridSearchCV or RandomizedSearchCV
   - Cross-validation on TRAINING data only
   - Scoring metric matches your problem
        |
        v
5. EVALUATE ON TEST SET
   - Touch it ONCE
   - Report multiple metrics + confusion matrix
        |
        v
6. INTERPRET & COMMUNICATE
   - Feature importance, business language
   - Document limitations
        |
        v
7. SAVE & DEPLOY
   - joblib.dump(pipeline, 'model.joblib')
   - Document expected input format
```
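Step 2's "seal the test set away" can be sketched in two lines (synthetic `X` and `y` here stand in for your loaded dataset):

```python
# Step 2 sketch: hold out the final test set once, up front.
# Synthetic data for illustration.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# stratify=y keeps the class balance in both halves;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```

Everything after this point (pipelines, tuning, cross-validation) uses only `X_train` and `y_train`; `X_test` waits until step 5.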
## Pipeline Quick Reference

```python
# Simple pipeline (numeric features only)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42))
])

# Mixed features pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

numeric_features = [...]      # your numeric column names
categorical_features = [...]  # your categorical column names

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'),
     categorical_features)
])
full_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(random_state=42))
])

# Fit, predict, score — all leak-free
full_pipe.fit(X_train, y_train)
full_pipe.predict(X_test)
full_pipe.score(X_test, y_test)

# Cross-validate on the TRAINING data — leak-free inside each fold
from sklearn.model_selection import cross_val_score
scores = cross_val_score(full_pipe, X_train, y_train, cv=5, scoring='f1')
```
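To make the mixed-features pattern concrete, here is a runnable toy version with hypothetical column names (`age`, `income`, `city`) standing in for a real dataset:

```python
# Toy demo of the mixed-features pipeline pattern.
# Column names and data are invented for illustration.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({
    'age': [25, 32, 47, 51, 38, 29],
    'income': [40_000, 52_000, 88_000, 95_000, 61_000, 45_000],
    'city': ['NY', 'SF', 'NY', 'LA', 'SF', 'LA'],
})
y = [0, 0, 1, 1, 1, 0]

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(random_state=42)),
])
pipe.fit(X, y)
print(pipe.predict(X[:2]))
```

The scaler touches only `age` and `income`, the encoder only `city`, and both are fit inside `pipe.fit`, so the pattern stays leak-free when you swap in a real train/test split.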
## GridSearchCV Quick Reference

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [5, 8, 12],
    'model__min_samples_leaf': [1, 5, 10]
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
gs = GridSearchCV(
    full_pipe, param_grid,
    cv=cv, scoring='f1', n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)

print(gs.best_params_)           # Best hyperparameters
print(gs.best_score_)            # Best CV score
best_model = gs.best_estimator_  # Trained pipeline, ready to use
```
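For large grids, `RandomizedSearchCV` samples a fixed number of combinations instead of trying them all. A self-contained sketch (synthetic data and illustrative distributions, not a recommendation for any particular problem):

```python
# RandomizedSearchCV sketch: n_iter random draws instead of the full grid.
# Data and parameter ranges are illustrative only.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=42)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42)),
])
param_distributions = {
    'model__n_estimators': randint(50, 150),
    'model__max_depth': randint(3, 15),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rs = RandomizedSearchCV(
    pipe, param_distributions,
    n_iter=5,  # 5 random draws, not the full grid
    cv=cv, scoring='f1', random_state=42, n_jobs=-1,
)
rs.fit(X, y)
print(rs.best_params_)
```

The same `step__param` naming and `best_estimator_` interface carry over from `GridSearchCV` unchanged.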
## Saving and Loading Models

```python
import joblib

# Save the entire pipeline
joblib.dump(best_model, 'my_pipeline.joblib')

# Load and use later
loaded = joblib.load('my_pipeline.joblib')
predictions = loaded.predict(new_data)
```
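The "record library versions" advice pairs naturally with saving: one sketch is to write a small metadata file next to the model artifact (the filename here is illustrative):

```python
# Sketch: record the environment alongside the saved pipeline,
# so future-you can reload it under matching library versions.
import json
import sys

import joblib
import sklearn

metadata = {
    'python': sys.version,
    'scikit-learn': sklearn.__version__,
    'joblib': joblib.__version__,
}
with open('model_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
```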
## The Three Types of Data Leakage

| Type | What Leaks | How to Prevent |
|---|---|---|
| Preprocessing leakage | Test set statistics (mean, std, categories) influence training | Put all preprocessing inside a `Pipeline` |
| Target leakage | Features contain information caused by or derived from the target | Audit every feature: "Would I know this before making the prediction?" |
| Group leakage | Related observations (same patient, same company) appear in both train and test | Split by group using `GroupKFold` |
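The group-leakage fix can be verified directly: `GroupKFold` guarantees no group id straddles the train and test sides of any fold. A tiny sketch with made-up group ids:

```python
# GroupKFold sketch: rows sharing a group id (e.g. the same patient)
# never appear on both sides of a split. Data is invented.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])  # e.g. patient ids

gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # No group appears in both halves of any fold
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```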
## Common Mistakes to Avoid

| Mistake | Why It's Wrong | What to Do Instead |
|---|---|---|
| Preprocessing outside the pipeline | Causes leakage — test data influences training | Put everything inside the `Pipeline` |
| Using test set during development | Contaminates the final evaluation | Cross-validate on training set; test set touched once |
| Forgetting `handle_unknown='ignore'` | `OneHotEncoder` crashes on unseen categories | Always set this when categorical features might have unseen values |
| Missing `random_state` | Results are not reproducible | Set it in `train_test_split`, models, CV splitters |
| Wrong parameter naming in GridSearchCV | `ValueError` because parameter isn't found | Use `stepname__paramname` with double underscores |
| Using `scoring='accuracy'` by default | May be the wrong metric for your problem | Explicitly set `scoring` to match your problem's cost structure |
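One way to sidestep the parameter-naming mistake is to ask the pipeline itself for its valid `step__param` names before writing a grid (the step names here mirror the quick-reference examples):

```python
# List valid GridSearchCV parameter names straight from the pipeline,
# instead of guessing the step__param spelling.
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier()),
])
valid = sorted(pipe.get_params().keys())
print([p for p in valid if p.startswith('model__')][:5])
```

Anything not in `pipe.get_params()` will raise a `ValueError` when the search runs, so checking first is cheaper than debugging a failed fit.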
## Part V Complete: What You Can Do Now
After Chapters 25-30, you can:
- [ ] Build three types of models: linear regression, logistic regression, and tree-based (decision tree + random forest)
- [ ] Evaluate models with the right metric: accuracy, precision, recall, F1, AUC, MAE, RMSE, R²
- [ ] Compare models fairly using cross-validation
- [ ] Prevent data leakage by putting all preprocessing inside pipelines
- [ ] Handle mixed feature types with `ColumnTransformer`
- [ ] Tune hyperparameters systematically with `GridSearchCV`
- [ ] Save and load models with `joblib`
- [ ] Reproduce your results with random seeds and documented environments
- [ ] Communicate your results in business language, not just technical metrics
If you checked every box, congratulations — you have a professional-grade foundation in machine learning. Part VI will teach you to communicate findings, think about ethics, collaborate with others, and build the kind of portfolio that opens doors. The hardest technical work is behind you. Now let's make it matter.