Index

Alphabetical listing of key topics with chapter references. Bold entries indicate primary coverage; regular entries indicate secondary or brief mention.

A/B testing: 3, 16, 32, 34

Accuracy: 16, 11, 13, 14, 17, 33

Adjusted Rand Index (ARI): 20

Aggregation (SQL): 5, 6, 10

ARIMA: 25, 36

Association rules: 23, 24

AUC: see ROC AUC, PR AUC

Backpropagation: 36, 4

Bagging: 13, 14

Balanced accuracy: 16, 17

Base rate: 16, 17, 33

Baseline model: 2, 11, 16, 35

Batch prediction: 31, 32

Bayesian optimization: 18

Bias (algorithmic): 33, 19, 34, 35

Bias (statistical): 4, 11, 16

Bias-variance tradeoff: 1, 4, 11, 13, 14, 18

BigQuery: 5, 28

Binary classification: 1, 11, 13, 14, 16, 17

Binning: 6, 7

Boosting: 14, 13, 18

Box plot: 1, 6, 8

Business metric: 34, 2, 35

Calibration (model): 16, 33

Cardinality (high): 7, 6, 9

CatBoost: 14, 7, 18

Categorical encoding: 7, 6, 9, 10

Causal inference: 36, 3

Chi-squared test: 9, 3

Class imbalance: 17, 16, 22, 33

Classification report: 16, 17

Clustering: 20, 21, 22

Cohen's kappa: 16

Collaborative filtering: 24, 23

Collinearity: see Multicollinearity

Confidence interval: 3, 16

Confusion matrix: 16, 17, 33, 34

Content-based filtering: 24

Correlation: 6, 9, 11, 21

Cost-benefit analysis: 34, 16, 17

Cost matrix: 17, 16, 34

Cross-validation: 16, 11, 13, 14, 18, 19

CTEs (common table expressions): 5, 10

Curse of dimensionality: 21, 9, 15, 20

Dashboard: 32, 34

Data drift: 32, 31, 35

Data leakage: 2, 5, 6, 10, 16, 25

Data pipeline: 10, 2, 5, 29, 31

Data types (Python): 6, 7, 10, 28

Davies-Bouldin index: 20

DBSCAN: 20, 22

Decision boundary: 12, 13, 15

Decision tree: 13, 14, 19

Deep learning: 36, 26, 28

Demographic parity: 33

Deployment: 31, 29, 30, 32, 35

Difference-in-differences: 36, 3

Dimensionality reduction: 21, 9, 20

Disparate impact: 33

Docker: 31, 29

Drift detection: 32, 31, 35

Dummy variable: see One-hot encoding

Early stopping: 14, 18, 36

EDA (exploratory data analysis): 1, 2, 6, 8

Elastic net: 11, 9, 18

Elbow method: 20

Embedding: 7, 24, 26, 36

Ensemble methods: 13, 14, 18

Entropy: 13, 9, 26

Equalized odds: 33

Equal opportunity: 33

Evaluation metrics: 16, 17, 33, 34

Experiment tracking: 30, 18, 29, 31, 35

Exponential smoothing: 25

F1 score: 16, 17, 18

F-beta score: 16, 17

Fairness: 33, 19, 34, 35

Fairness-accuracy tradeoff: 33, 34

False negative: 16, 17, 33, 34

False positive: 16, 17, 22, 33, 34

FastAPI: 31, 29, 35

Feature engineering: 6, 7, 8, 9, 10, 25, 26, 27, 35

Feature importance: 13, 9, 14, 19

Feature selection: 9, 6, 11, 21

Feature store: 2, 31, 10

Flask: 31

Focal loss: 17

Fraud detection: 17, 22, 16

Gaussian Naive Bayes: 15, 26

Geospatial data: 27, 6

Gini impurity: 13, 9

Git: 29, 10, 30

Gradient boosting: 14, 13, 17, 18, 19, 35

Gradient descent: 4, 11, 14, 36

Grid search: 18, 14, 16

Groupby (pandas): 5, 6, 10

Guardrail metrics: 3, 32

Hierarchical clustering: 20

Histogram-based gradient boosting: 14, 28

Hospital readmission (Metro General anchor): 1, 4, 7, 11, 16, 17, 19, 33, 34, 35

Hyperparameter tuning: 18, 14, 16

Hypothesis testing: 3, 16

Imbalanced data: see Class imbalance

Impossibility theorem (fairness): 33

Imputation: 8, 7, 10

Inference time: 14, 28, 31

Information gain: 13, 9

Interaction features: 6, 11, 14

Interpretation: see Model interpretation

Isolation Forest: 22, 20

JSON: 5, 31, 29

Jupyter notebook: 1, 2, 29, 30

K-fold cross-validation: see Cross-validation

K-means clustering: 20, 21, 22

K-nearest neighbors (KNN): 15, 8, 20

Label encoding: 7, 6

Lasso regression: 11, 9, 18

Latency: 31, 28, 32

Learning curve: 16, 4, 18

Leakage: see Data leakage

LightGBM: 14, 7, 17, 18, 28, 35

LIME: 19, 33

Linear regression: 11, 4, 6, 9

Log loss: 16, 4, 14

Log transformation: 6, 11

Logistic regression: 11, 4, 12, 16, 17

MAE (mean absolute error): 16, 11, 25

Manufacturing predictive maintenance (TurbineTech anchor): 1, 8, 17, 22, 25, 28, 32, 35

MAPE (mean absolute percentage error): 16, 25

Market basket analysis: 23

MASE (mean absolute scaled error): 25

Matthews correlation coefficient: 16

MCAR, MAR, MNAR: 8

Mean encoding: see Target encoding

Merge (pandas): 5, 6, 10

Minimum detectable effect (MDE): 3

Missing data: 8, 6, 7, 10

MLflow: 30, 18, 29, 31, 35

MLOps: 29, 30, 31, 32, 36

Model card: 33, 31, 34

Model comparison: 16, 13, 14, 18

Model decay: 32, 16, 30, 31, 35

Model deployment: see Deployment

Model governance: 33, 34

Model interpretation: 19, 13, 14, 33, 34

Model monitoring: 32, 31, 35

Model registry: 2, 30, 31

Model selection: 16, 13, 14, 18

Multiclass classification: 15, 16, 13, 14

Multicollinearity: 9, 6, 11, 21

Mutual information: 9, 6

Naive Bayes: 15, 26

NDCG (normalized discounted cumulative gain): 24

Neural network: 36, 4, 26

NLP (natural language processing): 26, 21, 36

Normalization: 4, 6, 12, 15

Novelty effect: 3, 32

Null hypothesis: 3, 16

NumPy: 4, 6, 10

Observation unit: 2, 1, 5

One-hot encoding: 7, 6, 9, 13

Ordinal encoding: 7, 6

Outlier detection: see Outliers

Outliers: 6, 8, 22

Overfitting: 1, 4, 13, 14, 16, 18

P-value: 3, 16

Pandas: 1, 5, 6, 7, 8, 10, 28, 29

Partial dependence plot (PDP): 19, 14

PCA (principal component analysis): 21, 9, 20

Permutation importance: 9, 13, 19

Pipeline (scikit-learn): 10, 7, 8, 11, 18, 29, 31

Polars: 28, 10

Polynomial features: 6, 11

PostgreSQL: 5, 10, 28

Power (statistical): 3

PR AUC (precision-recall AUC): 16, 17

Precision: 16, 17, 33, 34

Precision@k: 24

Prediction interval: 25, 16

Predictive maintenance: see Manufacturing predictive maintenance

Predictive parity: 33

Problem framing: 1, 2, 34, 35

Production (ML in): 31, 32, 29, 30, 35

Progressive project (StreamFlow churn): 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 16, 17, 18, 19, 29, 30, 31, 32, 33, 34, 35

Prophet (Facebook/Meta): 25

Protected attribute: 33, 34

Pruning (tree): 13, 18

PyTorch: 36

R-squared: 16, 1, 11

Random Forest: 13, 9, 14, 17, 19

Random search: 18, 14

Ranking metrics: 24, 16

Real-time prediction: 31, 28, 32

Recall: 16, 17, 33, 34

Recall@k: 24

Recommender systems: 24, 23

Regularization: 11, 4, 12, 14, 18, 36

Reproducibility: 10, 2, 5, 18, 29, 30, 31

REST API: 31, 29

Retraining: 32, 30, 31

Ridge regression: 11, 4, 9, 18

RMSE (root mean squared error): 16, 11, 25

ROC AUC: 16, 14, 17, 18

ROC curve: 16, 17

ROI of ML: 34, 2, 35

Rollback: 31, 32

Rolling features: 6, 25

SaaS churn (StreamFlow anchor): 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 16, 17, 18, 19, 29, 30, 31, 32, 33, 34, 35

Sample size: 3, 16, 17

Scaling (feature): 4, 6, 10, 12, 15, 21

Scikit-learn: 4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21

Seasonality: 25, 6

SHAP values: 19, 14, 33, 34, 35

ShopSmart (e-commerce anchor): 3, 4, 5, 16, 20, 23, 24, 26, 34

Silhouette score: 20, 21

SMOTE: 17, 16

Software engineering: 29, 10, 30, 31

Spark (PySpark): 28, 5, 36

Specificity: 16, 33

SQL: 5, 10, 28

Stacking: 14, 18

Stakeholder communication: 34, 2, 19, 35

Standardization: see Scaling (feature)

Stationarity: 25

Statistical significance: 3, 16

Stratified sampling: 16, 17

StreamFlow: see SaaS churn

Subgroup analysis: 33, 16, 3

Support vector machine (SVM): 12, 4, 16, 21

Support vectors: 12

Survival analysis: 25, 17

Target encoding: 7, 6, 9

Target variable: 1, 2, 5, 6

Technical debt (ML): 29, 2, 32

TF-IDF: 26, 9, 21

Threshold (classification): 16, 17, 33, 34

Time series: 25, 6, 32, 36

Tokenization: 26

Train-test split: 1, 2, 16

Transformer (architecture): 36, 26

Tree-based methods: 13, 14, 19

True positive rate: see Recall

t-SNE: 21, 20

t-test: 3

TurbineTech: see Manufacturing predictive maintenance

UMAP: 21, 20

Underfitting: 1, 4, 16, 18

Unit testing: 29, 10, 31

Validation set: 16, 2, 18

Variance (bias-variance): see Bias-variance tradeoff

Variance inflation factor (VIF): 9, 11

Version control: see Git

Voting classifier: 13, 18

WAPE (weighted absolute percentage error): 25

Window functions (SQL): 5, 6, 10, 25

Word embedding: 26, 21, 36

Word2Vec: 26

XGBoost: 14, 7, 13, 17, 18, 19, 28, 35