Glossary: Intermediate Data Science

#

"Did the retention offer work?": Three independent causal analyses say YES. - The offer reduces churn by approximately 6 percentage points for treated subscribers. - This translates to approximately 1,530 subscribers retained per year who would have otherwise canceled. - Net annual value after accounting for discount costs: approxi → Case Study 2: Causal Inference at StreamFlow --- Did the Retention Offer Work?
$537,600/month in retained revenue: Minus: 12,000 x $8 intervention cost = **$96,000/month** - Net value: **~$441,600/month**, or **$5.3M/year** → Case Study 1: StreamFlow Churn --- Building the Logistic Regression Baseline
1. Preprocessing: tokenize, lowercase, remove punctuation and stop words, stem or lemmatize. This is the unglamorous plumbing that determines whether your model sees "Running," "running," "runs," and "ran" as four different words or as one concept. → Chapter 26: NLP Fundamentals
1. Suspiciously high performance: AUC-ROC > 0.95 on a real-world problem demands investigation - Compare to published benchmarks for similar problems → Case Study 2: The Data Leakage Detective
14.2%: nearly three times the nominal rate. - The probability that at least one peek shows significance is **22.1%** --- meaning more than one in five experiments will produce a false alarm at some point during the test, even when there is no real effect. → Case Study 2: The Peeking Problem
2,735 False Alarms and That Is Fine: A maintenance engineer reviewing these results will initially balk at 2,735 unnecessary inspections. But each inspection costs $5,000, and each prevented failure saves $500,000. The model needs to be correct only once in every 100 alarms to break even. It is correct once in every 18 alarms (5.4% pre → Case Study 2: TurbineTech --- Cost-Asymmetric Failure Prediction
2. Characterize your data.: Explicit or implicit feedback? - How sparse is the user-item matrix? - How severe is the cold start problem (what fraction of users have < 5 interactions)? - Do items have content features (text, categories, images)? → Chapter 24: Recommender Systems
2. Feature importance dominance: If one feature accounts for > 25% of total importance, audit it - Ask: "Is this feature truly independent of the target?" → Case Study 2: The Data Leakage Detective
2. Primary Metric and Guardrail Metrics: Primary: 30-day churn rate among high-risk subscribers. - Guardrails: Revenue per subscriber (the discount costs money), support ticket volume, downstream renewal rate. → Chapter 3: Experimental Design and A/B Testing
2. Vectorization: convert the cleaned tokens into a numerical matrix. Bag-of-Words counts how many times each word appears. TF-IDF refines those counts by penalizing words that appear everywhere (and thus carry little information). The result is a document-term matrix where each row is a document and each column is a → Chapter 26: NLP Fundamentals
3. Modeling: feed that matrix into a classifier (logistic regression, Naive Bayes), a topic model (LDA), or a sentiment analyzer (VADER, or your own trained classifier). → Chapter 26: NLP Fundamentals
3. Randomization Design: Who is eligible? Only subscribers flagged as high-risk (churn probability > 0.7) by the model. - How do you randomize? By subscriber ID, 50/50 split. - What does control receive? No offer (standard experience). - What does treatment receive? 20% discount offer for 3 months, delivered via email and i → Chapter 3: Experimental Design and A/B Testing
3. Temporal audit: For every feature, ask: "Is this value known at the exact moment I need to make a prediction?" - Trace the feature back to its source system and verify when the value is finalized → Case Study 2: The Data Leakage Detective
312K subscribers have NULL month-over-month change: these are subscribers with less than 2 months of data. The NULL is informative: it means "new subscriber, insufficient trend data." 2. **1.85M subscribers have NULL days_since_last_ticket** — they have never filed a ticket. This is the majority of subscribers, which means "no ticket history" is the → Case Study 1: StreamFlow Feature Extraction Pipeline — From Schema to Model-Ready Table
4. Evaluation: measure whether the model actually works. For classification, the metrics from Chapter 16 apply directly. For topic modeling, coherence scores. For sentiment, both accuracy and qualitative review of misclassified examples. → Chapter 26: NLP Fundamentals
4. Train/test gap: Compare random split performance to temporal split performance - A large gap (> 0.05 AUC) suggests either temporal leakage or concept drift → Case Study 2: The Data Leakage Detective
5. Production simulation: Before deployment, run the model on the most recent data as if it were production - Compare to test set performance - If production simulation performance is much worse, investigate → Case Study 2: The Data Leakage Detective
99.66% Accuracy, $68.6 Million in Damage: This is the most dramatic demonstration of the accuracy trap in this textbook. The model is "right" 99.66% of the time, but it is wrong about the only thing that matters. In a domain where the minority class represents catastrophic outcomes, a high-accuracy model can be the most expensive model poss → Case Study 2: TurbineTech --- Cost-Asymmetric Failure Prediction

A

A domain expert at StreamFlow would tell you:: Subscribers who downgrade their plan usually cancel within 90 days - A spike in support tickets (especially "billing" category) precedes churn - Subscribers who use the API are power users and rarely churn - Failed payments that are not resolved within 7 days lead to involuntary churn - Usage declin → Chapter 5: SQL for Data Scientists — Window Functions, CTEs, and Query Optimization
Administrative Data:: Patient demographics (age, sex, insurance type) - Admission source (emergency department, transfer, scheduled) - Discharge disposition (home, home health, skilled nursing facility, etc.) - Prior admissions in the last 12 months → Case Study 2: Metro General Hospital --- When Prediction and Explanation Collide
Advantages:: Preserves local structure. Imputed values reflect the feature patterns of similar observations. - Preserves correlations between features better than simple imputation. - Works well when missingness is MAR and the feature relationships are smooth. → Chapter 8: Missing Data Strategies
Apriori Principle: If an itemset is infrequent (below min_support), then all of its supersets are also infrequent. → Chapter 23: Association Rules and Market Basket Analysis
AUC-ROC: How well does the model rank churners above non-churners? This is the primary offline metric because the business cares about ranking (who should get the retention offer first). - **Precision@K** — Of the top K subscribers the model flags, how many actually churned? With K = 15,000 (the team's capac → Chapter 2: The Machine Learning Workflow

B

Barocas, Hardt, and Narayanan (2023): the textbook. Chapters 2 and 3 give you the mathematical and legal foundations. Free online. → Further Reading: Chapter 33
batch gradient descent: it computes the gradient using *every* data point before updating the weights. With 4 customers, this is instant. With 2.4 million StreamFlow subscribers, each iteration requires a matrix multiplication with a $(2400000, 47)$ matrix. That is slow. → Chapter 4: The Math Behind ML — Probability, Linear Algebra, Calculus, and Loss Functions
Before deployment:: [ ] Save reference data distributions (training data summary statistics) - [ ] Save reference prediction distribution (predictions on the validation set) - [ ] Define PSI thresholds per feature (default: 0.10 warning, 0.25 critical) - [ ] Define performance thresholds (minimum acceptable AUC, F1, pr → Chapter 32: Monitoring Models in Production
Best Practice: Always compute permutation importance on the test set, not the training set. Training-set permutation importance conflates feature importance with overfitting. A feature the model memorized will look important on training data but contribute nothing on test data. → Chapter 9: Feature Selection
bias: the regularized estimates are no longer unbiased in the statistical sense. But it dramatically reduces **variance**. The coefficients are more stable, the predictions generalize better, and the model becomes robust to multicollinearity and noise features. This is the bias-variance tradeoff from Chap → Chapter 11: Linear Models Revisited
BigQuery Note: BigQuery uses the same LAG/LEAD syntax. The main difference: `DATE_TRUNC(event_date, MONTH)` instead of `DATE_TRUNC('month', event_date)`, and `DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH)` instead of `CURRENT_DATE - INTERVAL '1 month'`. → Chapter 5: SQL for Data Scientists — Window Functions, CTEs, and Query Optimization
Billing engineer:: "Failed payment retries. If the card declines and they do not update it within 48 hours, they are mentally gone." - "Users who signed up during a promotion churn at 2x the rate once the promotional price expires." → Case Study 1: StreamFlow Feature Engineering Workshop
Bootstrap sampling: each tree trains on different data 2. **Feature randomization** --- each split considers different features → Chapter 13: Tree-Based Methods
Business Context: The VP of Marketing's concern is valid. If younger subscribers receive disproportionately more retention discounts, the company is spending more on a demographic that churns more (which may be appropriate) but also potentially signaling to older subscribers that they are less valued (which is not). → Case Study 2: StreamFlow Churn Model Fairness Tradeoff

C

C = 1/alpha: Large C = weak regularization (closer to unregularized) - Small C = strong regularization (more shrinkage) → Chapter 11: Linear Models Revisited
Calibration awareness: Is a predicted 70% actually right 70% of the time? (Chapter 16 goes deep on this.) - **Decision threshold intuition** — Why the default 0.5 threshold is almost never optimal. - **Loss function understanding** — Cross-entropy loss directly comes from probability theory, as we will see in Section 4.4. → Chapter 4: The Math Behind ML — Probability, Linear Algebra, Calculus, and Loss Functions
Caution: While UMAP's `transform` method is useful, the embeddings of new points are approximate and depend on the training data. If the new data is substantially different from the training data (distribution shift), the embeddings may not be meaningful. Use this for exploration, not for production feature → Chapter 21: Dimensionality Reduction
Caveat: The paired t-test on cross-validation scores has a known problem: the folds are not independent (they share training data), which violates the t-test's independence assumption. This makes the test slightly anti-conservative. A correction called the "corrected resampled t-test" (Nadeau and Bengio, 20 → Chapter 16: Model Evaluation Deep Dive
changepoints: moments where the trend changes direction or slope. This is valuable for business time series, where external events (product launches, price changes, market shifts) can alter the trajectory. → Chapter 25: Time Series Analysis and Forecasting
Checking the Assumption: In the pre-treatment period (Oct--Dec), both groups show a slight downward trend in churn, and the trends are approximately parallel. This supports the parallel trends assumption. If the Business plan had been trending downward faster than the Professional plan before the intervention, the DiD estim → Case Study 2: Causal Inference at StreamFlow --- Did the Retention Offer Work?
Churn rate reduction: The primary business metric. Does the churn rate decrease for subscribers who receive model-guided interventions compared to a control group? - **Net revenue impact** — Revenue retained from prevented churn minus the cost of retention offers. This is what the CFO cares about. - **Intervention effici → Chapter 2: The Machine Learning Workflow
class_weight vs. sample_weight: `class_weight` adjusts the loss for all samples in a class uniformly. `sample_weight` allows per-sample control. For most imbalanced problems, `class_weight='balanced'` is the right starting point. Use `sample_weight` when different samples within the same class have different importance (e.g., high → Chapter 17: Class Imbalance and Cost-Sensitive Learning
Clinical Implication: Dr. Nwosu summarizes the problem in one sentence: "We built a system to reduce readmissions, and it reduces readmissions less for Black patients than white patients. That is not acceptable." → Case Study 1: Metro General Readmission Fairness Audit
Clinical observations:: Strong social support (wife is a retired nurse, lives at home) - Simple medications (3 medications, no changes) - Clear understanding (demonstrated exercises and medication schedule) → Case Study 2: Bayes at the Hospital — Combining Model Predictions with Clinical Judgment
Colab limitations:: Sessions timeout after inactivity (free tier: ~90 minutes) - No persistent file storage (use Google Drive mount) - Limited RAM on free tier (12 GB) - No local server (cannot run FastAPI exercises natively) → Appendix D: Environment Setup Guide
Common Mistake: The error that every junior data scientist makes at least once. We name it so you can avoid it. → How to Use This Book
Completed: 36 / 36: ## Recurring Themes Tracker → Intermediate Data Science — Continuity Tracker
Concept Check: Why is ordinal encoding dangerous for nominal features in linear models? Because a linear model learns a single coefficient for the feature. If you encode `device_type` as mobile=1, desktop=2, tablet=3, smart_tv=4, the model learns that smart_tv has 4x the "effect" of mobile. That is meaningless for → Chapter 7: Handling Categorical Data
Configure the competition:: Title: "[Course Name] ML Competition - [Semester]" - Type: Private (InClass) - Team size: 1 (individual) or 2--3 (team). Individual is recommended for courses under 40 students; teams for larger classes. - Merge deadline: 1 week before the competition closes (if teams are allowed). 4. **Upload the d → In-Class Kaggle Competition Guide
Connection to What You Know: This is the same workflow as scikit-learn: load data, split, scale, fit, predict, evaluate. The syntax is different, but the logic is identical. The new element is the training loop, which you wrote by hand because neural networks require explicit gradient computation. In scikit-learn, `.fit()` hide → Chapter 36: The Road to Advanced
Constant mean: the average value does not change over time 2. **Constant variance** --- the spread of values does not change over time 3. **Constant autocovariance** --- the correlation between Y(t) and Y(t-k) depends only on the lag *k*, not on the time *t* → Chapter 25: Time Series Analysis and Forecasting
Convolutional Neural Networks (CNNs): **What:** Networks with convolutional layers that slide small filters across the input, detecting spatial patterns. - **When:** Image classification, object detection, image segmentation, medical imaging, manufacturing quality inspection. - **Key models:** ResNet, EfficientNet, YOLO (object detectio → Chapter 36: The Road to Advanced
Core Lesson: The model was always capable of reducing readmissions. It was clinician trust --- not model accuracy --- that was the bottleneck. SHAP explanations did not make the model better. They made it usable. → Case Study 2: Metro General --- SHAP for Clinicians
Core Principle: If you only learn one thing from this book, learn this: how you evaluate your model is more important than which model you choose. A mediocre model with honest evaluation will serve you better than a brilliant model with broken evaluation. Every bad model I have seen deployed in production got there → Chapter 16: Model Evaluation Deep Dive
Correction methods:: **Bonferroni correction:** Divide alpha by the number of tests. Simple and conservative. If you are testing 6 metrics at alpha = 0.05, each test uses alpha = 0.0083. This controls the family-wise error rate (FWER) --- the probability of any false positive. - **Benjamini-Hochberg (FDR):** Controls th → Chapter 3: Experimental Design and A/B Testing
covering index: it includes all the columns our queries need, so PostgreSQL can satisfy the query entirely from the index without touching the table at all (an "Index Only Scan"). This is the fastest possible read path. → Chapter 5: SQL for Data Scientists — Window Functions, CTEs, and Query Optimization
Critical Distinction: A model with higher AUC does not necessarily produce better business outcomes. A retrained churn model with AUC = 0.89 (up from 0.85) might identify the same high-risk customers but also flag too many false positives, overwhelming the customer success team. Always evaluate on the metric that matters → Chapter 32: Monitoring Models in Production
Critical Insight: Group K-fold is essential for subscription data, medical data (multiple visits per patient), sensor data (multiple readings per device), and any dataset where a single entity generates multiple rows. Failing to use group splitting will inflate your cross-validation scores and give you a model that u → Chapter 16: Model Evaluation Deep Dive
Critical Point: If you fit any preprocessing step on the full training set before passing data to `cross_val_score`, the cross-validation estimates are optimistic. The preprocessing step has already seen the validation fold. This is the most common source of inflated CV scores in practice. Pipelines prevent it by c → Chapter 10: Building Reproducible Data Pipelines
Critical Step: Always scale before PCA. PCA maximizes variance, so if one feature has a range of 0-60 and another has a range of 0-1, PCA will be dominated by the high-variance feature regardless of its importance. StandardScaler puts all features on equal footing. → Chapter 21: Dimensionality Reduction
Critical Validation: We clustered the UMAP embedding for visual labeling, but the cluster profiles use the *original* features. This is the correct workflow. Never describe clusters using UMAP coordinates --- they have no inherent meaning. Always go back to the original features to characterize what makes each cluster d → Case Study 1: StreamFlow PCA + UMAP Visualization
Critical Warning: Naive target encoding (computing means on the full training set and applying them back to the same training set) causes data leakage. Your training metrics will be inflated, and your model will underperform on new data. This is the single most common mistake in categorical encoding. → Chapter 7: Handling Categorical Data
Cross-validation variance dropped 32%: the model is more stable - **Every feature has a business interpretation** --- the retention team can understand and act on the predictions - **Monitoring burden dropped dramatically** --- 14 features to track instead of 127 → Case Study 1: StreamFlow --- When 127 Features Became 14
Customer success manager:: "They stop logging in. The silence is deafening." - "They file a support ticket and we do not resolve it quickly --- or they file multiple tickets in a short window." - "Their usage drops off a cliff. Not a gradual decline --- a sudden stop." - "They were on the annual plan and switched to monthly. → Case Study 1: StreamFlow Feature Engineering Workshop

D

Data drift: detecting when input distributions change 2. **Concept drift** --- detecting when the relationship between inputs and outputs changes 3. **Performance monitoring** --- tracking model metrics on live data 4. **Retraining strategies** --- knowing when and how to rebuild → Chapter 32: Monitoring Models in Production
Debug Challenge: Intentionally broken code. Find and fix the bug. → How to Use This Book
Debugging Insight: If you see a product from "Electronics" sitting in the middle of the "Clothing" cluster, check that product's metadata. It might have incorrect category labels in the database, or the recommendation model might be grouping it with clothing products based on purchase co-occurrence (users who buy runn → Chapter 21: Dimensionality Reduction
Decision trees are intuitive: they split data into regions using yes/no questions about features, choosing splits that maximize information gain (or minimize Gini impurity). → Chapter 13: Tree-Based Methods
Deliverable: A Jupyter notebook showing: (1) the encoding decision for each categorical feature with justification, (2) a side-by-side AUC comparison of OHE vs. target encoding for `primary_genre`, and (3) a demonstration of target encoding leakage vs. correct cross-validated target encoding. → Chapter 7: Handling Categorical Data
Deliverables:: A Jupyter notebook with the full pipeline (data to model) - A FastAPI app with `/predict` and `/health` endpoints - A one-page summary of the model's business value → Chapter 35: Capstone --- End-to-End ML System
dendrogram: a tree that shows the complete merging history. → Chapter 20: Clustering
Design Principle: Manufacturing monitoring requires asymmetric cost awareness. A false alarm (unnecessary inspection at $12,000) is annoying but recoverable. A missed failure ($340,000) is catastrophic. Set tighter drift thresholds on the features that most directly predict failure, and accept more false alarms to ca → Case Study 2: TurbineTech Seasonal Drift and Sensor Calibration
Diagnosis: The disparity is not caused by a single factor. It is the combined effect of (1) proxy variables carrying racial information, (2) representation imbalance giving the model more information about majority groups, and (3) different base rates making equal error rates mathematically impossible with a s → Case Study 1: Metro General Readmission Fairness Audit
Diagnostic plots: learning curves, validation curves, and calibration curves --- tell you what raw metrics cannot: whether more data would help, where overfitting begins, and whether predicted probabilities are trustworthy. - **Statistical tests** prevent you from chasing noise. A 0.005 AUC difference is not meaningf → Chapter 16: Model Evaluation Deep Dive
Diagnostic Question: "If I could see one more piece of information about each example, what would it be?" → Case Study 2: When Tuning Does Not Help --- The Features-First Lesson
Disadvantages:: Computationally expensive. For each missing value, the algorithm must compute distances to all complete rows. With 50,000 rows and 25 features, this can take minutes rather than milliseconds. - Sensitive to the distance metric and the value of K. - All features must be numeric (or pre-encoded). The → Chapter 8: Missing Data Strategies
Domain Knowledge: TurbineTech's maintenance engineers know that vibration above 3.5 mm/s indicates bearing wear, and above 4.5 mm/s requires immediate shutdown. The time series forecast does not replace this domain knowledge --- it augments it. The forecast says "something changed from the expected pattern." The engi → Chapter 25: Time Series Analysis and Forecasting
Domain Knowledge Alert: This result is counterintuitive. How can 12 columns outperform 2,296 columns? Because the vast majority of those 2,296 columns had too few observations to learn reliable patterns. The model was overfitting to noise in the rare codes. By grouping into 12 chapters with thousands of observations each, → Case Study 2: Metro General Hospital --- Encoding 14,000 Diagnosis Codes

E

Electronic Health Records (EHR):: Primary and secondary diagnoses (ICD-10 codes) - Procedures performed during the admission - Lab results (complete blood count, metabolic panel, hemoglobin A1c, etc.) - Vital signs at admission and discharge - Length of stay - Medications prescribed at discharge → Case Study 2: Metro General Hospital --- When Prediction and Explanation Collide
Evaluation criteria:: `make clean && make all` reproduces the model from raw data - `make test` passes all tests with zero failures - `make lint` returns zero errors - The model's AUC is within 0.01 of the original notebook's AUC → Exercises: Chapter 29
Expected drift (load):: `load_pct`: Winter heating demand increases turbine load. Expected. → Case Study 2: TurbineTech Seasonal Drift and Sensor Calibration
Expected drift (seasonal):: `ambient_temp_c`: Of course it is different --- it is winter. This is not a bug; it is physics. - `bearing_temp_c`, `oil_temp_c`, `oil_pressure_bar`: These follow ambient temperature. Expected. → Case Study 2: TurbineTech Seasonal Drift and Sensor Calibration
Expected Finding: In most cases, threshold tuning on a well-trained default model produces higher profit than resampling, because it directly optimizes for the business cost structure rather than trying to "balance" the data. SMOTE and class_weight improve recall but do so by sacrificing precision in ways that may no → Chapter 17: Class Imbalance and Cost-Sensitive Learning

F

Fairlearn documentation: the tool. It integrates with scikit-learn, computes disaggregated metrics with a single function call, and provides threshold optimization out of the box. You can run a fairness audit on your production model this afternoon. → Further Reading: Chapter 33
feature importance: a ranking of which features contribute most to predictions. But there are two methods, and they do not always agree. → Chapter 13: Tree-Based Methods
Feature importance reveals what the forest learned: but use permutation importance for reliable rankings. Impurity-based importance is fast but biased toward continuous features. → Chapter 13: Tree-Based Methods
Feature-level analysis:: PDP + ICE plots for the top 3 features - SHAP dependence plots for the top 3 features - Written description of the relationships (linear? threshold? saturating?) → Exercises: Chapter 19
Finding: The optimal threshold is typically between 0.10 and 0.20 for this cost structure because missing a churner ($220+) is much more expensive than a false alarm ($35). The current threshold of 0.20 is close to optimal but may be slightly conservative. Lowering it to 0.15 would catch additional churners → Case Study 1: StreamFlow ROI and Stakeholder Presentation

G

Global interpretation:: SHAP summary plot (dot version) - Permutation importance bar chart - A written comparison of the two methods' rankings → Exercises: Chapter 19

H

How Many Folds?: The standard choice is 5 or 10. Five folds trains on 80% of the data each time (less variance from training, more variance from smaller test set). Ten folds trains on 90% (more variance from training, less variance from larger test set). In practice, the difference is small. Use 5 folds for large da → Chapter 16: Model Evaluation Deep Dive
Hyperparameters:: `degree`: the polynomial degree (default 3) - `gamma`: scaling factor for the dot product - `coef0`: independent term (default 0) → Chapter 12: Support Vector Machines

I

Imbalance and Fairness: When your imbalance ratio differs across protected groups, the same threshold produces different recall rates for different groups. A hospital that catches 85% of Medicare readmissions but only 70% of Medicaid readmissions is providing unequal care --- and likely violating anti-discrimination requir → Chapter 17: Class Imbalance and Cost-Sensitive Learning
imblearn Pipeline vs. sklearn Pipeline: scikit-learn's `Pipeline` does not support resamplers (objects that change the number of training samples). Use `imblearn.pipeline.Pipeline` instead, which extends sklearn's pipeline to handle `fit_resample()` calls. The imblearn Pipeline ensures that resampling happens only during `fit()` (training → Chapter 17: Class Imbalance and Cost-Sensitive Learning
Important: In healthcare, the costs are not purely financial. A false negative means a patient suffers a preventable readmission. A false positive means a patient receives extra follow-up care they did not need --- which is annoying but not harmful. The cost asymmetry is even more extreme than in churn predict → Chapter 34: The Business of Data Science
impurity: the degree to which a node contains mixed classes. → Chapter 13: Tree-Based Methods
index.md: The main content. Read this first. Typically 8,000–12,000 words with embedded code, math, and visualizations. - **exercises.md** — Hands-on practice. Ranges from "apply this technique to the StreamFlow data" to "debug this intentionally broken pipeline." - **quiz.md** — Self-assessment. 10–15 multip → How to Use This Book
inertia: the sum of squared distances from each point to its assigned centroid: → Chapter 20: Clustering
Interpreting calibration: If the predicted probability roughly matches the actual frequency (the two columns are close), the model is well calibrated. If predicted probabilities are systematically lower than actual frequencies, the model is underconfident. If higher, overconfident. Gradient boosting models are generally well → Chapter 16: Model Evaluation Deep Dive
Is the effect real?: Statistical significance. Is the observed difference larger than what we would expect from random chance? 2. **How big is the effect?** --- Effect size. Even if the effect is real, is it large enough to matter? 3. **Are we confident in the direction?** --- Confidence interval. What is the plausible → Chapter 3: Experimental Design and A/B Testing

K

Kaggle limitations:: 30 GB RAM, 20 hours per week of GPU - No persistent terminal (notebook-only) - Internet access must be enabled per notebook → Appendix D: Environment Setup Guide
Key Advantage: Halving search evaluated 128 candidates but only trained the final few on the full dataset. The total computation is roughly equivalent to training 30--40 full models, compared to 128 for standard random search. On large datasets, this speedup is substantial. → Chapter 18: Hyperparameter Tuning
Key Concept: Every geospatial file carries its CRS metadata. When you load a shapefile or GeoJSON, geopandas reads the CRS automatically. When you create a GeoDataFrame from lat/lon columns, you must specify the CRS yourself. If you forget, geopandas assumes no CRS, and spatial operations will produce garbage. → Chapter 27: Working with Geospatial Data
Key concepts:: **Layer:** A transformation that takes a vector of numbers and produces another vector. A fully connected (dense) layer computes `output = activation(W @ input + b)` where W is a weight matrix and b is a bias vector. - **Activation function:** A nonlinear function applied element-wise. Without it, s → Chapter 36: The Road to Advanced
Key decisions:: `train_v1.py` archived to a git tag, then deleted. It was the original version and is never needed. - `train_v2_new_FIXED.py` was diff'd against `train_v2_new.py`: the fix was a 3-line change to handle null values. The fix was applied to the canonical `src/models/train.py`. - `predict_fast.py` was d → Case Study 2: The Technical Debt Crisis --- An ML System Nobody Can Maintain
Key Filter: In practice, lift > 1 is the minimum bar. Most practitioners filter to lift > 1.2 or higher depending on the dataset. A rule with high confidence but lift near 1.0 is misleading: the consequent is just popular, and the antecedent is not really driving the co-occurrence. Lift corrects for this base r → Chapter 23: Association Rules and Market Basket Analysis
Key Finding: The "Billing Friction" segment has the best ratio of churn rate to intervention difficulty. These subscribers want to stay. Fixing their payment issue has a high success rate and low cost. Allocating outreach budget to this segment first, before spending on harder-to-retain "Fading Away" subscribers → Case Study 2: StreamFlow Subscriber Segments and Churn Rate Differences
Key Insight: Scaling and learning rate are deeply connected. Before scaling, a learning rate of 0.01 was too large (NaN). After scaling, 0.01 was too small (slow convergence). Scaling normalizes the loss landscape so that a single learning rate works reasonably for all parameters. This is why every ML library sc → Case Study 1: Gradient Descent Debugging — When the Model Won't Converge
Key Lesson: Not all drift requires the same response. Seasonal drift is predictable and can be addressed with feature engineering (temperature normalization). Sensor calibration drift is a data quality issue that should be fixed at the source. The monitoring system detected both, but the response is different f → Case Study 2: TurbineTech Seasonal Drift and Sensor Calibration
Key Observation: Churn rate drops as genre breadth increases. But this is the obvious result --- more engagement correlates with lower churn. The interesting question is whether *specific* genre combinations predict retention *beyond* what breadth alone explains. → Case Study 2: StreamFlow Sticky Content Combinations
Key Principle: Triage is not about fixing everything. It is about fixing the things that are actively losing the company money (P0), then building the safety nets that prevent future breakage (P1), then cleaning up the rest incrementally. → Case Study 2: The Technical Debt Crisis --- An ML System Nobody Can Maintain
Key properties:: (A @ B)^T = B^T @ A^T - (A^T)^T = A → Appendix C: Math Reference for Machine Learning
Key Result: The association rule recommender typically achieves broader coverage than manual rules because it discovers cross-sell pairs that category managers missed. The hit rate depends on the quality of the rules and the test data, but the coverage advantage alone --- being able to make recommendations for → Case Study 1: ShopSmart Market Basket Analysis for Product Recommendations

L

Large Language Models (LLMs): **What:** Transformer models trained on massive text corpora (billions of parameters, trillions of tokens). - **When:** Text generation, summarization, translation, question answering, code generation, reasoning. - **The practical reality:** You will likely use LLMs through APIs (OpenAI, Anthropic, → Chapter 36: The Road to Advanced
lazy learner: it defers all computation to prediction time. It is also **instance-based** --- it learns by memorizing examples, not by building an explicit model. → Chapter 15: Naive Bayes and Nearest Neighbors
learning rate: how big a step you take each time. → Chapter 4: The Math Behind ML — Probability, Linear Algebra, Calculus, and Loss Functions
Level 0: Manual Process: Data scientists work in notebooks. - Models are trained manually, evaluated manually, deployed manually. - No automation, no monitoring, no versioning. - Retraining happens when someone remembers to do it. - *Where most individual data scientists start. Where many small teams stay.* → Chapter 36: The Road to Advanced
Level 1: ML Pipeline Automation: Data pipelines are automated (Airflow, Prefect, Dagster). - Training is triggered by schedule or data arrival. - Experiment tracking is in place (MLflow, Weights & Biases). - Model deployment is scripted but not fully automated. - Monitoring exists but may not trigger automated responses. - *This is → Chapter 36: The Road to Advanced
Level 2: CI/CD for ML: Code and data changes trigger automated retraining pipelines. - Models are tested automatically (unit tests, integration tests, data validation, model performance gates). - Deployment is automated with canary releases or shadow deployment. - Monitoring triggers automated retraining when drift is det → Chapter 36: The Road to Advanced
Level 3: Full Automation with Governance: Everything in Level 2, plus: - A/B testing of model versions is automated. - Feature engineering is partially automated (feature platforms). - Model governance (approval workflows, bias audits, documentation) is integrated into the pipeline. - Hundreds of models are managed simultaneously. - *This i → Chapter 36: The Road to Advanced
Limitation: The elbow is often ambiguous. Real-world data rarely produces a sharp bend. When three people look at the same elbow plot and pick k=3, k=4, and k=5, all three are arguably correct. Use the elbow method as a starting point, not a final answer. → Chapter 20: Clustering
Limited Social Determinants:: Zip code (proxy for socioeconomic status) - Lives alone (self-reported, available for ~60% of patients) - Primary language - Has a primary care physician (yes/no) → Case Study 2: Metro General Hospital --- When Prediction and Explanation Collide
LinearSVC vs. SVC(kernel='linear'): They solve the same problem differently. `LinearSVC` uses liblinear (optimized for linear case, scales to millions of samples). `SVC(kernel='linear')` uses libsvm (general purpose, computes the kernel matrix, O(n^2) memory). For linear problems, always prefer `LinearSVC`. It is not just faster --- i → Chapter 12: Support Vector Machines
Listwise deletion is worst: not because the imputed values are bad (they do not exist), but because you have lost 74% of your data. Less data means less signal, worse generalization, and higher variance. 2. **Simple imputation is better than dropping** --- even crude mean/median imputation recovers most of the performance lost → Chapter 8: Missing Data Strategies
Local explanations:: SHAP waterfall plots for 3 representative observations (high-risk, medium-risk, low-risk) - A plain-English table translating each waterfall into "top 3 reasons" → Exercises: Chapter 19

M

Machine Learning: The difference between supervised and unsupervised learning - Linear regression (fit a line, minimize error) - Logistic regression (predict a probability, use a threshold) - Train/test split (why you need one) - Overfitting (what it is, why it is bad) → Prerequisites
Math Panic? Read This: If the sight of Greek letters triggers anxiety, you are not alone. Here is the survival strategy: read the intuition first. Read the numpy code second. If those two make sense, the notation in between is just a compact way of writing what you already understand. You do not need to memorize any formu → Chapter 4: The Math Behind ML — Probability, Linear Algebra, Calculus, and Loss Functions
Math Sidebar: Deeper mathematical treatment for readers who want the formal details. Skippable without losing the main thread. → How to Use This Book
Mathematical Foundation: For a model $f$, feature $j$, and feature set $S$ that does not include $j$, the Shapley value is: > > $\phi_j = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!(|N|-|S|-1)!}{|N|!} [f(S \cup \{j\}) - f(S)]$ > > where $N$ is the set of all features. This is computationally intractable for large featur → Chapter 19: Model Interpretation
Maximum iterations reached: A safety valve. Stop after 1,000 or 10,000 steps regardless. 2. **Loss change below threshold** — If $|L_{t} - L_{t-1}| < \epsilon$ (say, $10^{-6}$), the loss has effectively stopped improving. 3. **Gradient norm below threshold** — If $\|\nabla L\| < \epsilon$, you are on flat ground. You might be → Chapter 4: The Math Behind ML — Probability, Linear Algebra, Calculus, and Loss Functions
Metrics that mislead:: **Accuracy:** 99.2% accuracy sounds great, but a model that predicts "normal" for everything achieves 99.2% accuracy when 0.8% of observations are anomalous. - **F1 at a single threshold:** F1 depends on the threshold, which is a business decision. Report F1 at the threshold you actually use, not th → Chapter 22: Anomaly Detection
Milestone 3, Part 3: Apply feature selection to the 25+ features you engineered in Chapters 6-8. → Chapter 9: Feature Selection
Milestone 3, Part 4: Build the complete StreamFlow preprocessing Pipeline with ColumnTransformer. Save it with joblib. This pipeline will be reused in every subsequent chapter. → Chapter 10: Building Reproducible Data Pipelines
MLflow Tracking: Logs parameters, metrics, and artifacts for each run 2. **MLflow Projects** --- Packages ML code in a reusable, reproducible format 3. **MLflow Models** --- Provides a standard format for packaging models for deployment 4. **MLflow Model Registry** --- Manages model versions and deployment stages → Chapter 30: ML Experiment Tracking
Model Decay: The model you deploy today will degrade. When you retrain in three months, experiment tracking lets you compare the new model against the old one on the same metrics. Without it, you are starting from scratch every time. → Chapter 35: Capstone --- End-to-End ML System
Model serving: wrapping the model in a REST API with FastAPI 2. **Containerization** --- packaging the API and all its dependencies in a Docker container 3. **Cloud deployment** --- pushing the container to a cloud platform where it runs without your laptop → Chapter 31: Model Deployment

N

Non-Negotiable Rule: Feature selection must be part of the pipeline. This is not a suggestion. This is not a best practice for advanced users. This is a correctness requirement. If your feature selection step sees the test data, your performance estimate is wrong. → Chapter 9: Feature Selection
Note: In MLflow 2.9+, the `transition_model_version_stage` API is deprecated in favor of the new *aliases* system. Aliases are more flexible: instead of fixed stages, you assign arbitrary aliases like `"champion"` and `"challenger"` to model versions. The pattern below shows the modern approach: → Chapter 30: ML Experiment Tracking
Number of prior admissions in the past year: the single strongest predictor. Patients who have been admitted multiple times are at the highest risk. > 2. **Creatinine at discharge** --- elevated creatinine signals kidney impairment, which the model treats as a strong risk factor. > 3. **Ejection fraction** --- lower EF means higher readmission → Case Study 2: Metro General --- SHAP for Clinicians

O

Obermeyer et al. (2019): the healthcare algorithm study. It makes the abstract concrete: bias is not hypothetical, it is measured, and it affects millions of patients. → Further Reading: Chapter 33
Observation: The difference here is small (0.002 AUC) because scaling leakage on this dataset is mild. But on datasets with time-dependent features, target-encoded categoricals, or imputation based on global statistics, the difference can be enormous. The Pipeline approach costs nothing and prevents an entire ca → Chapter 16: Model Evaluation Deep Dive
Offer actionable next steps:: **Increase power.** Run a longer experiment or increase traffic allocation. If the true effect is 0.8%, we need a much larger sample to detect it. - **Reduce metric variance.** Use CUPED (Controlled-experiment Using Pre-Experiment Data) to reduce variance by adjusting for pre-experiment behavior. - → Chapter 3: Experimental Design and A/B Testing
Offline (model performance):: Primary: AUC-ROC (ranking quality) - Secondary: Precision@15000 (operational relevance — of the top 15,000 flagged subscribers, how many actually churn?) - Reporting: Calibration plot, F1, confusion matrix at chosen threshold → Case Study 1: The StreamFlow Workflow in Practice
one model, multiple serving paths.: ## Track 1: Real-Time API for Discharge Planning → Case Study 2: Batch vs. Real-Time --- Choosing the Right Deployment Pattern
Online (business impact):: Primary: Monthly churn rate reduction in treatment group vs. control (A/B test) - Secondary: Net revenue impact (revenue retained minus discount cost) - Guardrail: Customer satisfaction score (do not annoy loyal subscribers with unnecessary offers) → Case Study 1: The StreamFlow Workflow in Practice
Only 19 columns: 100x fewer than OHE. 3. **Training time of 38 seconds** --- 8x faster than OHE. 4. **Handles new ICD-10 codes** at all three levels: new chapters are extremely rare (ICD-10 chapters have not changed since the standard was adopted), new 3-character categories are handled by target encoding's global m → Case Study 2: Metro General Hospital --- Encoding 14,000 Diagnosis Codes
Operational Note: Segment membership should be refreshed monthly. The retention team's CRM should store both the current segment and the previous segment, enabling detection of segment transitions. A subscriber moving from "Casual Viewer" to "Fading Away" is a high-priority intervention target --- they are still reac → Case Study 2: StreamFlow Subscriber Segments and Churn Rate Differences
Operational Reality: A model without a model card is like a drug without a label. It might be effective, but nobody knows the dosage, the side effects, or the contraindications. Model cards are not bureaucratic overhead. They are the documentation that prevents your model from being used in contexts it was never designe → Chapter 33: Fairness, Bias, and Responsible ML
out-of-bag (OOB) samples: can serve as a built-in validation set. For each training sample, collect the predictions from only the trees that did NOT include that sample in their bootstrap, and compute the error. This is the **OOB error**, and it is approximately equivalent to cross-validation --- for free. → Chapter 13: Tree-Based Methods

P

pandas: Creating and manipulating DataFrames - Filtering rows and selecting columns - `groupby()`, `merge()`, `pivot_table()` - Reading CSV, Excel, and JSON files - Basic data cleaning (renaming columns, changing dtypes, handling duplicates) → Prerequisites
Performance Note: The batch endpoint is not just a convenience wrapper. Scikit-learn's `predict_proba` is vectorized: scoring 1000 customers in one call is dramatically faster than scoring 1000 customers in 1000 separate calls. If your downstream system can collect requests and send them in batches, use the batch end → Chapter 31: Model Deployment
portfolio milestone: the model and its analysis are showable in an interview. → Syllabus: Self-Paced Format
posterior: the probability of class C given the observed features X - **P(X | C)** is the **likelihood** --- the probability of observing features X in class C - **P(C)** is the **prior** --- the probability of class C before seeing any features - **P(X)** is the **evidence** --- the probability of observing f → Chapter 15: Naive Bayes and Nearest Neighbors
Practical Advice: Start with batch. Most ML use cases do not need real-time predictions. Nightly churn scores, weekly demand forecasts, daily anomaly reports --- all batch. Move to real-time only when the business requires sub-second response times. → Chapter 36: The Road to Advanced
Practical Guidance: In most tabular datasets, if the first 2-3 components do not capture at least 40-50% of the variance, the data has no dominant low-dimensional structure, and PCA is unlikely to produce useful 2D visualizations. It may still be useful for preprocessing (reducing 100 features to 20), but do not expect → Chapter 21: Dimensionality Reduction
Practical Note: `SequentialFeatureSelector` is slow. With 13 features, 5-fold CV, and a gradient boosted model, forward selection trains 13 + 12 + 11 + ... + 6 = 57 model-CV combinations = 285 model fits. With 50 features, that number explodes. Use forward selection sparingly, and only after filter methods have nar → Chapter 9: Feature Selection
Practical Recommendation: If month-over-month ARI drops below 0.80, the segments are unstable and the A/B test targeting may be stale. Rebuild the segmentation and re-evaluate whether the underlying customer behavior has shifted or whether the features need updating. → Case Study 1: ShopSmart Customer Segmentation for Targeted A/B Tests
Practical Rule: Use Pearson correlation for a quick first look. Use mutual information when you suspect nonlinear relationships. Neither one captures feature interactions --- a feature that is useless alone but powerful in combination with another feature will score low on both measures. → Chapter 9: Feature Selection
Practical Takeaway: Association rules are traditionally a retail technique. But any domain with "basket-like" data --- streaming catalogs, insurance product bundles, SaaS feature usage, course enrollment patterns --- can be analyzed with the same framework. The trick is linking the co-occurrence patterns to a business → Case Study 2: StreamFlow Sticky Content Combinations
Practical Tip: When presenting a choropleth to stakeholders, always include a color legend with actual values, use no more than 5-7 color bins, and choose a color scale that is colorblind-safe. The `'YlOrRd'` scale in folium is a good default for "low-to-high risk" visualizations. → Case Study 1: StreamFlow Regional Churn Choropleth
Practical Warning: The KS test is extremely sensitive with large sample sizes. With 50,000 reference samples and 10,000 production samples, even a trivially small distributional difference will produce a statistically significant p-value. This is a feature of all statistical tests at scale: with enough data, everythin → Chapter 32: Monitoring Models in Production
Practitioner Guidance: The right answer depends on the context, the values of the organization, and the consequences of the decision. What is non-negotiable is this: you must *know* whether your model treats groups differently. The audit is required even if the mitigation is debated. You cannot make an informed decision a → Case Study 2: StreamFlow Churn Model Fairness Tradeoff
Practitioner Note: Use `Pipeline` with explicit names in production code and shared projects. Use `make_pipeline` in exploratory analysis and prototyping. The explicit names make debugging, logging, and hyperparameter tuning significantly easier. → Chapter 10: Building Reproducible Data Pipelines
Practitioner's Rule of Thumb: Start with C=1.0 and search over [0.001, 0.01, 0.1, 1, 10, 100]. The number of support vectors is a useful diagnostic: if the majority of your training points are support vectors, C is probably too small; if you have very few, C might be too large. But always let cross-validation decide. → Chapter 12: Support Vector Machines
Pragmatic Advice: Do not try to add type hints to your entire codebase overnight. Start with function signatures for `src/` modules. Skip notebooks entirely. Use `# type: ignore` for complex pandas operations where mypy's understanding of DataFrame types is limited. The goal is gradual improvement, not perfection. → Chapter 29: Software Engineering for Data Scientists
prior: the best estimate before the clinician adds their expertise. Clinical observations become the **evidence** that updates the prior into a **posterior** — the final, combined estimate. → Case Study 2: Bayes at the Hospital — Combining Model Predictions with Clinical Judgment
Probability: Probability as a number between 0 and 1 - Independent vs. dependent events - Conditional probability (the general idea) → Prerequisites
Problem Framing: What are we actually predicting? Why? For whom? 2. **Success Metric Definition** — How will we know the model is working? Both offline and in production. 3. **Data Collection and Validation** — Getting the data, verifying it, understanding its limitations. 4. **Baseline Establishment** — The simples → Chapter 2: The Machine Learning Workflow
Product manager:: "They only use one feature of the platform. The ones who explore multiple genres and tools stick around." - "New users who do not engage in the first two weeks are gone by month two." - "Multi-device users are stickier. If they have it on their phone and their laptop, they are invested." → Case Study 1: StreamFlow Feature Engineering Workshop
Production Practice: Model cards are living documents. Update them when you retrain the model, when you discover new failure modes, and when fairness metrics change in production. Version the model card alongside the model artifact. A model card that describes a previous version of the model is worse than useless --- it → Chapter 33: Fairness, Bias, and Responsible ML
Production Tip: Advice that matters in production but not in a homework assignment. → How to Use This Book
Project structure: standardized directory layouts that any data scientist can navigate 2. **Version control** --- git branching strategies for ML projects 3. **Testing** --- unit tests and integration tests for data pipelines 4. **Code quality** --- automated formatting, linting, and type checking 5. **Technical debt* → Chapter 29: Software Engineering for Data Scientists
Proposed A/B test design:: **Population:** Subscribers identified as at-risk by the churn model (top 25% of churn probability) - **Control:** Standard SVD recommender (engagement-optimized) - **Treatment:** Retention-aware hybrid recommender - **Primary metric:** 30-day churn rate - **Secondary metrics:** Weekly active hours, → Case Study 2: StreamFlow Content Recommendations for Churn Reduction
pruning: early stopping of trials that are performing poorly partway through training. Instead of training 2,000 boosting rounds and then discovering the configuration is bad, Optuna can stop after 200 rounds if the intermediate scores are unpromising. → Chapter 18: Hyperparameter Tuning
PSI thresholds: < 0.10 stable, 0.10--0.25 investigate, > 0.25 retrain --- are a practical starting point. - **Retraining strategies** range from scheduled (simple, predictable) to triggered (responsive, requires monitoring infrastructure) to hybrid (recommended for most systems). - **Safe deployment** --- shadow de → Chapter 32: Monitoring Models in Production
Python Programming: Writing functions with parameters and return values - Loops, list comprehensions, and dictionary comprehensions - Basic object-oriented programming (classes, methods, `__init__`) - Importing and using third-party libraries - Reading error messages and debugging with print statements or a debugger → Prerequisites

R

Random Forests fix this with double randomization: bootstrap sampling gives each tree different data, and feature randomization forces each tree to explore different features. The ensemble average is more stable and accurate than any individual tree. → Chapter 13: Tree-Based Methods
Random Oversampling and Overfitting: Random oversampling creates exact copies of minority-class examples. The model can memorize these duplicates, leading to overfitting on the minority class. This is especially problematic for models with high capacity (deep trees, neural networks). For simpler models like logistic regression, the ris → Chapter 17: Class Imbalance and Cost-Sensitive Learning
RBF SVM won: by a slim margin. 87.0% test accuracy vs. 85.2% for the others. On 54 test samples, that difference is one sample. Not statistically significant. But the SVM did not lose. → Case Study 2: SVM vs. Gradient Boosting --- The Showdown on Tabular Data
Real World vs. Kaggle: In a Kaggle competition, someone hands you a CSV. In production, the extraction query *is* part of the model. If the query changes, the model's input distribution changes. Version your SQL alongside your model code. → Chapter 35: Capstone --- End-to-End ML System
Recommended dataset characteristics:: 20,000--50,000 rows in training data (large enough to reward feature engineering, small enough to iterate quickly) - 10,000--20,000 rows in test data - 15--25 features (mix of numeric, categorical, and temporal) - Binary classification target with 10--20% positive class (moderate imbalance) - 2--3 f → In-Class Kaggle Competition Guide
Recurrent Neural Networks (RNNs) and LSTMs: **What:** Networks with loops that maintain hidden state, processing sequences one element at a time. - **When:** Time series, sequential data, historical use for NLP (now largely replaced by transformers). - **Key variants:** LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit). - **Manufactur → Chapter 36: The Road to Advanced
Recurring Theme: Real World =/= Kaggle: In a Kaggle competition, you have one dataset, one target, and one leaderboard. Testing is irrelevant because the competition ends and the code is discarded. In production, your pipeline runs on new data every day, your features evolve, your team grows, and your code must survive contact with realit → Chapter 29: Software Engineering for Data Scientists
Recurring Theme: Reproducibility: The refactored project satisfies the reproducibility test: clone the repo, install dependencies, run `make all`, and get the same results. No notebooks to run in the right order. No manual steps. No "you need to ask Sarah for the data." The code is the documentation, and the Makefile is the entry po → Chapter 29: Software Engineering for Data Scientists
Reflection:: Did any interpretation result surprise you? - Did the interpretation reveal any potential problems with the model (e.g., a feature that should not be used, an unexpected interaction)? - What would you change about the model based on what you learned? → Exercises: Chapter 19
Reproducibility: Always set `random_state` in both the model and the cross-validation splitter. Set `seed` in the Optuna sampler. Record the library versions. Tuning is stochastic; without fixed seeds, you cannot reproduce your results. This is discussed further in Chapter 10 (Reproducible Data Pipelines) and Chapte → Chapter 18: Hyperparameter Tuning
Robustness of the Optimal Threshold: The plateau between 0.02 and 0.07 is good news. It means the model is not sensitive to the exact threshold --- any value in this range produces strong results. In production, the team set the threshold to 0.04 (slightly conservative of the optimum) to provide a buffer against probability calibration → Case Study 1: StreamFlow Four-Strategy Comparison
Rule: For classification problems, always use `StratifiedKFold`. There is no reason not to. Scikit-learn's `cross_val_score` uses stratification by default when you pass a classifier, but be explicit: pass a `StratifiedKFold` object to the `cv` parameter. Explicit is better than implicit. → Chapter 16: Model Evaluation Deep Dive

S

SaaS Churn: StreamFlow, a B2C subscription streaming analytics platform ($180M ARR, 2.4M subscribers, 8.2% monthly churn). Progressive project spine. 2. **Hospital Readmission** — Metro General Hospital, 450-bed urban teaching hospital. 30-day readmission prediction with fairness constraints across patient demo → Intermediate Data Science — Master Outline
Security Warning: Never load a joblib or pickle file from an untrusted source. Both formats execute arbitrary code during deserialization. A malicious `.joblib` file can run any Python code on your machine. Only load serialized pipelines that you created or that come from a trusted, verified source. → Chapter 10: Building Reproducible Data Pipelines
Slide 3: The Action Items: "At-Risk Disengaged" customers (about 15% of the base) have 2x the churn rate. Recommended intervention: re-engagement campaign with personalized content recommendations. - "Frustrated Active" customers (about 18%) are using the platform but filing support tickets. Recommended intervention: proactiv → Case Study 1: StreamFlow PCA + UMAP Visualization
Social support: Weak.: Margaret is being discharged on 9 medications with 3 changes from her admission regimen. **Medication complexity: Complex.** - During discharge education, Margaret repeated her medication schedule correctly and asked about sodium restriction. **Patient understanding: Clear.** → Case Study 2: Bayes at the Hospital — Combining Model Predictions with Clinical Judgment
Sources of bias: where bias enters the ML pipeline 2. **Fairness metrics** --- how to quantify whether a model is fair (and for whom) 3. **The impossibility theorem** --- why you cannot satisfy all fairness definitions simultaneously 4. **Mitigation strategies** --- pre-processing, in-processing, and post-processing → Chapter 33: Fairness, Bias, and Responsible ML
SQL: `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY` - `JOIN` (inner, left, right) - Aggregate functions (`COUNT`, `SUM`, `AVG`, `MIN`, `MAX`) - Subqueries (basic) → Prerequisites
Stakeholder deliverable:: A one-page "How to Read This" guide written for a non-technical user of the model → Exercises: Chapter 19
Statistics: Mean, median, mode, standard deviation - Correlation (positive, negative, none) - Normal distribution (bell curve, 68-95-99.7 rule) - Hypothesis testing (null hypothesis, p-value — the general idea) - Confidence intervals (the general idea) → Prerequisites
StreamFlow: a subscription streaming analytics platform with 2.4 million subscribers and an 8.2% monthly churn rate. You will frame the churn prediction problem, define the target variable, and design the A/B test that will validate your model's business impact. No code yet. Just thinking — the kind of thinking → Part I: The ML Mindset
Stretch goals:: Compute a usage trend slope using `REGR_SLOPE` - Create a sessionization query - Build the query as a materialized view with a refresh strategy → Chapter 5: SQL for Data Scientists — Window Functions, CTEs, and Query Optimization
support vectors: and they are the key to everything. → Chapter 12: Support Vector Machines
surrogate model: a probabilistic model of the relationship between hyperparameters and performance. After each trial, the surrogate model updates its beliefs about which regions of the search space are promising. An **acquisition function** then decides where to sample next, balancing **exploration** (trying under-e → Chapter 18: Hyperparameter Tuning

T

Takeaway: When cardinality is moderate (10-100 categories) and the categorical feature has a meaningful relationship with the target, target encoding is typically the best choice for any model type. It compresses dimensionality, preserves signal, and handles new categories gracefully. But it demands disciplin → Case Study 1: StreamFlow Genre Encoding Showdown
Target leakage: Features that are derived from or correlated with the target variable in ways that would not exist at prediction time. → Chapter 2: The Machine Learning Workflow
task graph: a directed acyclic graph (DAG) of operations. The computation only happens when you call `.compute()`. This allows Dask to optimize the execution plan: it can fuse operations, avoid unnecessary intermediate DataFrames, and parallelize independent tasks. → Chapter 28: Working with Large Datasets
technical debt: the ongoing cost of maintaining the system. Sculley et al. (2015) famously argued that the ML code in a production system is a tiny fraction of the total system. The rest is: → Chapter 2: The Machine Learning Workflow
Temporal leakage: Using future data to predict past events. → Chapter 2: The Machine Learning Workflow
Text preprocessing: tokenization, lowercasing, stop word removal, stemming, lemmatization --- is the plumbing that determines whether your NLP pipeline works. Negation handling (preserving "not") is critical for sentiment analysis. → Chapter 26: NLP Fundamentals
The "Model Loses Money" Shock: The team is stunned. Their carefully built model, with an AUC-PR of 0.451 and 93.8% accuracy, loses money. This is the moment where the distinction between ranking quality and decision quality becomes real. The model *ranks* churners higher than non-churners (AUC-PR above baseline). But the default → Case Study 1: StreamFlow Four-Strategy Comparison
The 80/20 Rule for Deep Learning: For most data scientists working on business problems with tabular data, deep learning is a tool you should understand conceptually but will rarely implement from scratch. The exceptions: if you move into NLP, computer vision, or recommendation systems at scale, deep learning becomes your primary to → Chapter 36: The Road to Advanced
The Accuracy Trap: In Chapter 16, you learned that accuracy is misleading for imbalanced problems. Here is the proof. A model with 93.8% accuracy sounds excellent. But it is only catching 36% of the churners you are trying to find. For StreamFlow, that means the retention team contacts 118 subscribers out of the 328 w → Chapter 17: Class Imbalance and Cost-Sensitive Learning
The Conversation That Changed Everything: The VP of Product initially resisted the threshold-tuned model. "18% precision means we are sending retention offers to people who were never going to leave. That seems wasteful." The data science lead reframed: "Think of it as a marketing campaign with an 18% conversion rate and a $5 cost per conta → Case Study 1: StreamFlow Four-Strategy Comparison
The Core Finding: The optimal threshold is 0.10, not the default 0.50. At threshold 0.10, the model catches 89% of future readmissions (recall = 0.89) at a precision of only 25%. Three out of four flagged patients will not actually be readmitted --- but the cost of the unnecessary interventions ($850 each) is dwarfed → Case Study 2: Hospital Readmission --- When the Model's Precision Kills Patients
The Cost Matrix Framework: For any binary classification problem, define the cost matrix: > > | | Predicted Positive | Predicted Negative | > |---|---|---| > | **Actual Positive** | TP: benefit (or 0) | FN: cost_fn | > | **Actual Negative** | FP: cost_fp | TN: benefit (or 0) | > > The optimal strategy minimizes total cost = ( → Chapter 17: Class Imbalance and Cost-Sensitive Learning
The CRISP-DM Connection: These five questions map to the first two phases of CRISP-DM: Business Understanding and Data Understanding. Most failed data science projects skip these phases entirely. They jump from "we need AI" to "let's build a model" without establishing what problem they are solving, whether the data exists → Chapter 34: The Business of Data Science
The data:: Subscription events (signups, upgrades, downgrades, cancellations) - Usage logs (hours watched, features used, devices, genres, time of day) - Support ticket history (count, category, resolution time, satisfaction rating) - Billing history (payment method, failed payments, plan changes) - Demographi → Chapter 1: From Analysis to Prediction
The economics of ML: how to calculate the ROI of a model in dollars, not AUC points 2. **Stakeholder communication** --- how to present model results to people who do not know what a confusion matrix is 3. **Data storytelling** --- how to build presentations and dashboards that drive decisions 4. **The "we need AI" conv → Chapter 34: The Business of Data Science
The intuition: a plain English explanation, often with a physical analogy 2. **The math** — formal notation, so you can read papers and documentation 3. **The numpy code** — a computational demonstration you can run, modify, and break → Chapter 4: The Math Behind ML — Probability, Linear Algebra, Calculus, and Loss Functions
The Kernel Trick: The SVM optimization problem can be reformulated so that it depends only on dot products between data points, not on the data points themselves. If we can compute the dot product between two points in the transformed space *without actually transforming them*, we get the benefits of a high-dimension → Chapter 12: Support Vector Machines
The Key Lesson: F1 and profit can disagree. F1 treats precision and recall as equally important. The business rarely does. Always compute the business metric. If someone tells you "the model has an F1 of 0.45," your response should be "what is the cost of a false negative vs. a false positive?" → Chapter 17: Class Imbalance and Cost-Sensitive Learning
The Leak Explained: The engagement velocity score was computed by the data engineering team as a rolling metric that included activity data up to the end of the billing cycle. For subscribers who churned, this window extended past the point where they had already decided to cancel and stopped using the service. The sco → Case Study 1: StreamFlow Metric Selection and the Leakage Detective
The Math: Why 63.2%? The probability that a specific sample is NOT chosen in any single draw is (1 - 1/n). Over n draws with replacement, the probability it is never chosen is (1 - 1/n)^n, which converges to 1/e = 0.368 as n grows large. So roughly 36.8% of samples are left out of each bootstrap, and 63.2% ar → Chapter 13: Tree-Based Methods
The numbers:: Annual recurring revenue: $180M - Monthly churn rate: 8.2% (industry benchmark: 5-7%) - Customer acquisition cost: $62 per subscriber - Average revenue per user: $18.40/month - Customer lifetime value at current churn: $224 → Chapter 1: From Analysis to Prediction
The Parallel Trends Assumption: DiD requires that the treatment and control groups would have followed parallel trends in the absence of treatment. If high-risk subscribers were already churning at a declining rate before the offer, the DiD estimate is biased. Always plot the pre-treatment trends to check this assumption visually. → Chapter 36: The Road to Advanced
The Pattern: Random search captures 90%+ of the available tuning improvement. Bayesian optimization polishes the remaining few percent. For StreamFlow, the practical message is clear: run 80 random search trials, take the best configuration, and move on to feature engineering or deployment. Come back for Bayesia → Case Study 1: StreamFlow Optuna Tuning --- From Defaults to Diminishing Returns
The Rule: Never compare distances *between* clusters in a t-SNE plot. If you need to compare inter-cluster distances, compute them in the original feature space using centroid distances, Mahalanobis distance, or pairwise distance distributions. → Case Study 2: The t-SNE Lies --- Common Misinterpretations
The SQL query: saved as `src/features/extract_features.sql`, version-controlled in Git 2. **The materialized view** — `mv_churn_features`, refreshed nightly at 2 AM via an Airflow DAG 3. **A data dictionary** — documenting each feature's name, type, computation logic, and expected null pattern → Case Study 1: StreamFlow Feature Extraction Pipeline — From Schema to Model-Ready Table
The Team's Recommendation: The data science team recommends the combined approach (reweighting + group-specific thresholds). The reasoning: reweighting improves the model's ability to learn patterns for underrepresented groups, and threshold adjustment equalizes the TPR so that every racial group has the same chance of receiv → Case Study 1: Metro General Readmission Fairness Audit
Theme: Real World =/= Kaggle: On Kaggle, you submit a CSV and get a score. In the real world, you submit a model and get the question: "Can you reproduce this? Can you explain why this model is better than the one we deployed last quarter? Can you trace the lineage from training data to production prediction?" Experiment trackin → Chapter 30: ML Experiment Tracking
Theme: Reproducibility: Notice that we logged the `random_state`, the `data_version` tag, the exact feature count, and the target rate. Six months from now, if someone asks "What produced the 0.8862 AUC model?", you can pull up run `xgb-search-05` and see every input and output. That is the difference between a tracked exp → Chapter 30: ML Experiment Tracking
Theme: Wrong Problem: The most expensive mistake in data science is solving the wrong problem. Chapters 1 and 34 covered this in detail. Here, we see it in the context of the full system. → Chapter 35: Capstone --- End-to-End ML System
Threshold Tuning vs. Resampling: Notice what just happened. Without any resampling, without changing the training data, without SMOTE or class weights --- just by moving the threshold from 0.50 to 0.031 --- the expected profit tripled. The model was already producing good probability estimates. The problem was never the model. The → Chapter 17: Class Imbalance and Cost-Sensitive Learning
Time Series Warning: If your data has any temporal component --- event timestamps, monthly snapshots, daily transactions --- consider whether random splitting creates temporal leakage. A model that can "see the future" during training will look great in cross-validation and fail in production. When in doubt, use `TimeSe → Chapter 16: Model Evaluation Deep Dive
Total Error = Bias + Variance + Irreducible Noise: **Bias** is the error from overly simplistic assumptions. A linear model fit to nonlinear data has high bias. It *underfits* --- it misses real patterns because the model is not flexible enough to capture them. → Chapter 1: From Analysis to Prediction
Tradeoffs: Model complexity is not free. Every additional parameter is an opportunity to overfit. On small datasets, the simplest model that captures the signal wins. On large datasets with complex patterns, the most expressive model that can be regularized effectively wins. Knowing when to deploy a simple mod → Chapter 15: Naive Bayes and Nearest Neighbors
Train/test contamination: Information from the test set leaking into the training process. → Chapter 2: The Machine Learning Workflow
Transformers: **What:** Networks that use self-attention to process entire sequences in parallel, allowing every element to attend to every other element. - **When:** NLP (BERT, GPT), increasingly time series, vision (ViT), and multi-modal tasks. - **Why they matter:** Transformers are the architecture behind GPT → Chapter 36: The Road to Advanced
Try It: Interactive prompts to modify and experiment with the code. → How to Use This Book

U

Unexpected drift (sensor calibration):: `vibration_amplitude_mm_s`: The mean dropped more than the seasonal effect alone explains. After correlating with maintenance records, the team discovers the sensor replacement. → Case Study 2: TurbineTech Seasonal Drift and Sensor Calibration
Unrestricted trees overfit catastrophically: 100% training accuracy, terrible test performance. The tree memorizes noise instead of learning patterns. → Chapter 13: Tree-Based Methods
Use both if:: You use W&B for experiment exploration and visualization during development - You use MLflow for the Model Registry and production deployment pipeline - This is more common than you might expect → Chapter 30: ML Experiment Tracking
Use Dask when:: Data does not fit on one machine (distributed computing needed) - You want to parallelize existing pandas code with minimal changes - You need lazy evaluation for a complex multi-step pipeline - Your team already knows pandas and needs to move fast → Chapter 28: Working with Large Datasets
Use MLflow if:: You are in a regulated industry (healthcare, finance, government) - Your organization requires data to stay on-premises - You need a mature Model Registry integrated with your deployment pipeline - Cost matters (MLflow is free; W&B is not for teams) → Chapter 30: ML Experiment Tracking
Use pandas when:: The data fits in memory and you need iterative, exploratory analysis. - You need complex feature engineering with conditional logic. - You need integration with scikit-learn pipelines. - You need visualization (pandas integrates with matplotlib and seaborn). → Appendix H: Frequently Asked Questions
Use Polars when:: Data fits on one machine but pandas is too slow - You need maximum single-machine performance - You are building a new pipeline (no legacy pandas code to maintain) - Your query involves complex transformations that benefit from the optimizer → Chapter 28: Working with Large Datasets
Use something else when:: **You have image, text, or sequence data.** Neural networks (CNNs, transformers) are the right tool. - **Your dataset is tiny (<200 rows).** Gradient boosting can overfit even with regularization. Logistic regression or a small Random Forest may be more stable. - **Interpretability is paramount.** A → Chapter 14: Gradient Boosting
Use SQL when:: The data lives in a database and you need to extract/filter/aggregate before bringing it into Python. - The dataset is too large to fit in memory. Let the database engine do the heavy lifting. - You need to join multiple tables. SQL joins are more readable and often faster than pandas merges. - You → Appendix H: Frequently Asked Questions
Use W&B if:: You are a small team that wants to get started in five minutes - The UI and collaboration features justify the cost - You run many hyperparameter sweeps and want built-in coordination - You value real-time team visibility over infrastructure control → Chapter 30: ML Experiment Tracking

V

Validation Rule: If a pattern (e.g., a region of concentrated churners) appears across multiple UMAP configurations and corresponds to a meaningful feature profile in the original data, it is likely real. If a pattern appears at one setting but vanishes at another, it is an artifact. Never present a UMAP finding wit → Case Study 1: StreamFlow PCA + UMAP Visualization
VIF Thresholds: The rules of thumb are well-established: > - **VIF < 2.5:** No concern. > - **VIF 2.5-5:** Moderate multicollinearity. Worth monitoring but usually not actionable. > - **VIF 5-10:** Problem. Feature is substantially explained by other features. Consider dropping or combining. > - **VIF > 10:** Big p → Chapter 9: Feature Selection
Visualization: Creating line plots, bar charts, scatter plots, and histograms with matplotlib - Using seaborn for statistical plots (heatmaps, pair plots, box plots) - Customizing labels, titles, legends, and color palettes → Prerequisites

W

War Story: Real-world anecdotes from production data science. These are the lessons that textbooks usually skip. → How to Use This Book
Warning: Grid search is mostly obsolete for high-dimensional hyperparameter spaces. If you are tuning 5 or more hyperparameters, skip ahead to random search or Bayesian optimization. Grid search will either take too long or force you to use a coarse grid that misses the good regions. → Chapter 18: Hyperparameter Tuning
weekly high-risk list "feels off": subscribers they expected to see are missing, and subscribers who seem perfectly happy are flagged. 3. The VP of Engineering asks whether the churn model should be **integrated into the mobile app** to trigger in-app retention offers in real-time. → Case Study 1: StreamFlow Complete System Architecture Walkthrough
Weekly monitoring:: [ ] Compute PSI for all features; flag any above 0.10 - [ ] Compute prediction distribution PSI - [ ] Compute performance metrics on any newly labeled data - [ ] Review the monitoring dashboard - [ ] Log results to the experiment tracker (MLflow) → Chapter 32: Monitoring Models in Production
What are we predicting?: The target variable. This must be precise. 2. **What is the observation unit?** — What does one row in the training data represent? 3. **When do we make the prediction?** — The prediction point in time. 4. **What information is available at prediction time?** — This is where data leakage lives. 5. * → Chapter 2: The Machine Learning Workflow
What Changed: The hardcoded path is gone. File paths are parameters, not constants. The reference date is not embedded in data loading --- it is a feature engineering concern. Every function has type hints, a docstring, and explicit inputs and outputs. The `main()` function provides a command-line entry point. Lo → Case Study 1: Refactoring the StreamFlow Churn Notebook
What is NOT available:: Whether the patient fills their prescriptions after discharge (pharmacy data is siloed) - Whether the patient attends their follow-up appointment (data arrives 2-4 weeks later) - Home environment assessment (only available for patients receiving home health) - Caregiver support quality - Food securi → Case Study 2: Metro General Hospital --- When Prediction and Explanation Collide
What to Look For: Good clusters produce silhouette plots where every cluster's "blade" is roughly the same width (balanced cluster sizes) and roughly the same height (similar silhouette values). If one cluster has many points below zero, those points are probably in the wrong cluster. If one blade is much thinner tha → Chapter 20: Clustering
What We Did Not Show: We did not show the PCA scree plot. We did not show perplexity experiments. We did not explain UMAP's topological foundations. The VP needed a map and action items. The technical validation (robustness checks, feature profiling, cluster stability) stays in our notebook for the technical review. → Case Study 1: StreamFlow PCA + UMAP Visualization
What would a domain expert say about this data?: ## 3. Model Comparison Template → Appendix G: Templates and Checklists
What you already know (from this book):: Text preprocessing, TF-IDF, and bag-of-words (Chapter 26) - Classification and evaluation (Chapters 11--19) - Deployment and monitoring (Chapters 31--32) → Chapter 36: The Road to Advanced
What you need to learn:: Word embeddings (Word2Vec, GloVe) --- the bridge from sparse vectors to dense representations - Transformer architecture --- self-attention, positional encoding, encoder-decoder - BERT and fine-tuning for classification, NER, and question answering - LLM prompting, retrieval-augmented generation (RA → Chapter 36: The Road to Advanced
When drift is detected:: [ ] Identify *which* features drifted and *why* (product change? data pipeline issue? seasonal effect?) - [ ] Determine if the drift is temporary or permanent - [ ] If permanent: retrain with data that includes the new distribution - [ ] If temporary (seasonal): consider retraining with a wider trai → Chapter 32: Monitoring Models in Production
When PCA Helps: PCA is most valuable as preprocessing when you have many correlated features (100+ sensor readings, genomics, NLP embeddings) and a downstream model that struggles with high dimensionality (logistic regression, SVM, kNN). Tree-based models like XGBoost handle high-dimensional data natively and rarel → Chapter 21: Dimensionality Reduction
When retraining:: [ ] Retrain on the most recent labeled data - [ ] Evaluate on a holdout set; apply the deployment gate - [ ] Deploy as shadow or canary first - [ ] Monitor the retrained model for at least one week - [ ] Promote to production only after validation passes - [ ] Update reference distributions to refle → Chapter 32: Monitoring Models in Production
When the Break-Even Precision Is 1%: The manufacturing break-even precision is 0.010. If the model's precision is above 1%, every alert saves money on average. This means you can tolerate a massive number of false alarms. In extreme cost-asymmetry domains (failure detection, fraud, security), the optimal threshold is often absurdly low → Chapter 17: Class Imbalance and Cost-Sensitive Learning
When to Use Hash Encoding: Use it when cardinality exceeds 500+, target encoding is not appropriate (e.g., unsupervised learning), and you need a fixed-dimensionality representation. Hash encoding is also useful for features that grow over time (new ICD-10 codes are added annually), because new categories are automatically ma → Chapter 7: Handling Categorical Data
When to use it:: High-dimensional data where d >> n (text classification, genomics) - When you suspect the decision boundary is approximately linear - When you need speed (linear kernels are much faster than RBF) - As a baseline before trying non-linear kernels → Chapter 12: Support Vector Machines
When to use RFE: RFE is most useful when you have 20-100 features and care about finding the precise optimal subset. For 500+ features, start with filter methods to reduce the set before running RFE. → Chapter 9: Feature Selection
When Undersampling Works: Random undersampling is most useful when you have an enormous dataset and the majority class is highly redundant. If you have 10 million negative examples and 50,000 positive examples, undersampling to 200,000 negatives (4:1 ratio) still gives the model plenty of negative examples to learn from whil → Chapter 17: Class Imbalance and Cost-Sensitive Learning
Why AUC-PR for churn?: AUC-ROC for this model is 0.684, which sounds mediocre but not terrible. AUC-PR is 0.165, which sounds much worse. But AUC-PR tells the truth: the model's ability to precisely identify churners is limited. A random baseline would have AUC-PR equal to the positive rate (0.076). The model is 2.17x bet → Chapter 16: Model Evaluation Deep Dive
Why RD Works Here: Subscribers with a model score of 0.21 and subscribers with a score of 0.19 are essentially the same --- the 0.02 difference in score reflects noise in the model, not a real difference in risk. But only the 0.21 subscriber received the offer. This creates a clean comparison. The key assumption: ther → Case Study 2: Causal Inference at StreamFlow --- Did the Retention Offer Work?
Why This Is Better: Propensity score matching creates a control group that looks like the treatment group on observed characteristics. The matched comparison removes selection bias from observed confounders. The estimate is closer to the true effect. → Case Study 2: Causal Inference at StreamFlow --- Did the Retention Offer Work?
Why This Is Not Perfect: PSM can only control for observed variables. If there is an unobserved confounder (say, subscribers who received the offer also happened to see a new feature release), PSM cannot account for it. This is why randomized experiments are the gold standard. → Case Study 2: Causal Inference at StreamFlow --- Did the Retention Offer Work?