Glossary

271 terms from Advanced Data Science

# A B C D E F G H I J K L M N O P R S T U V W Z

#

(1) Alert description
what triggered the alert and what it means in plain language. **(2) Impact assessment** — who is affected, how severely, and what business impact is expected. **(3) Diagnostic steps** — step-by-step commands, queries, and dashboard links to identify the root cause. **(4) Mitigation options** — immed → Chapter 30: Quiz
(1) Clean data
benchmark datasets are curated; production data has missing values, label noise, and inconsistencies. **(2) Static evaluation** — benchmarks are fixed; production data undergoes distribution shift over time. **(3) Unlimited tuning** — papers tune hyperparameters extensively on the benchmark; product → Chapter 37: Quiz
(1) Community and ecosystem
size and activity of contributors, tutorials, Stack Overflow answers, and third-party integrations. **(2) Hiring signal** — whether you can hire people with this technology on their resume. **(3) Migration path** — how difficult it will be to move away when the platform is eventually replaced; open → Chapter 38: Quiz
(1) Concept drift
the relationship between inputs and outcomes has changed (e.g., a pandemic changes user behavior), but the system infrastructure is functioning correctly. In traditional software, the relationship between inputs and outputs is determined by code, not learned from data, so it does not drift. **(2) Fe → Chapter 30: Quiz
(1) Credibility
a track record of technical decisions that led to good outcomes, earned slowly over years and lost quickly through a single catastrophic recommendation. Protected by honesty about uncertainty, early admission of mistakes, and never overpromising. **(2) Reciprocity** — helping other teams (reviewing → Chapter 38: Quiz
(1) Data gap
papers use clean, curated datasets; production data has missing values, label noise, and distribution shift. Example: a model trained on curated movie ratings underperforms on production data where 15% of ratings are missing. **(2) Scale gap** — papers evaluate at a single scale; production systems → Chapter 37: Quiz
(1) Data quality
feature null rates, freshness, distribution drift, training-serving skew, schema compliance. This has no software analogue because software inputs are deterministic; ML inputs are statistical. **(2) Model performance** — prediction distribution, accuracy proxies, calibration, business metric alignme → Chapter 30: Quiz
(1) Detection
the incident is identified through automated alerts, business metric anomalies, or user reports. **(2) Triage** — the incident is classified by severity (SEV-1 through SEV-4) and type (system, data, model, integration), determining the response urgency and the team that responds. **(3) Mitigation** → Chapter 30: Quiz
(1) Judgment
the ability to make good technical decisions under uncertainty, demonstrated through design reviews, RFCs, and architectural decisions that stand the test of time. Not about being right every time, but about being right more often than not, admitting errors, and learning from outcomes. **(2) Scope** → Chapter 38: Quiz
(1) Tech Lead
steers a specific team or project; in data science, owns the modeling approach for a product area (e.g., defining evaluation metrics, selecting architectures, reviewing experiment designs). **(2) Architect** — sets technical direction across teams; in data science, defines organizational standards f → Chapter 38: Quiz
(1) Technical Vision
a 1-2 page narrative describing the desired end state in terms of outcomes (not technologies), providing direction for 2-3 years. **(2) Key Bets** — the 3-5 major technical investments that move toward the vision, each justified with a business case and a technical case, sequenced by dependencies, r → Chapter 38: Quiz
(1) Temporal leakage
training on data that is chronologically after the test data. This is most common in time-series forecasting, recommendation, and fraud detection, where the model can effectively memorize future events. **(2) Preprocessing leakage** — fitting preprocessing steps (normalization, imputation, feature s → Chapter 37: Quiz
(1) Tuned
the baseline must be tuned with the same care as the proposed method, not run with default hyperparameters. **(2) Current** — the baseline must represent the current state of the art, not a historical method that has been superseded. **(3) Equivalent** — the baseline must use the same data, features → Chapter 37: Quiz
(b) Loss computation
reductions (sums/means) over large tensors accumulate rounding errors in low precision. → Chapter 7: Quiz
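A minimal numpy sketch (illustrative values, not any framework's actual reduction) of how a sequential low-precision sum stalls once the increment drops below half the floating-point spacing:

```python
import numpy as np

# Sequentially accumulate 10,000 copies of 0.1 in fp16 vs fp32.
# The true sum is ~1000, but the fp16 accumulator stops growing
# once 0.1 is smaller than half the gap between adjacent fp16 values.
acc16, acc32 = np.float16(0.0), np.float32(0.0)
for _ in range(10_000):
    acc16 = np.float16(acc16 + np.float16(0.1))
    acc32 = np.float32(acc32 + np.float32(0.1))

print(float(acc16))  # stalls at 256.0
print(float(acc32))  # ~1000
```

This is why mixed-precision training keeps reductions (losses, norms, statistics) in fp32 even when activations are fp16.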
(d) Softmax
involves exponentials that can overflow in fp16 (max fp16 value is 65,504; $e^{11} \approx 60,000$ already approaches the limit). → Chapter 7: Quiz
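The overflow, and the standard fix of subtracting the maximum logit before exponentiating, can be shown directly in fp16 (illustrative logits):

```python
import numpy as np

logits = np.array([9.0, 10.0, 12.0], dtype=np.float16)

naive = np.exp(logits)            # e^12 ~ 162,755 > 65,504 -> inf in fp16
shifted = logits - logits.max()   # largest exponent becomes e^0 = 1
stable = np.exp(shifted) / np.exp(shifted).sum()

print(naive)    # last entry overflows to inf
print(stable)   # finite, sums to ~1
```

Subtracting the max leaves the softmax output mathematically unchanged while keeping every exponent at or below zero.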
(e) Batch normalization variance computation
the variance formula involves squared differences and averaging, which is numerically sensitive in low precision. → Chapter 7: Quiz
22-26%
roughly 5 times the intended rate. The more frequently the analyst checks, the higher the inflation. The root cause is that the fixed-horizon p-value is valid only at the pre-specified sample size; using it at multiple sample sizes violates the assumptions of the test. → Chapter 33: Quiz
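A small simulation (hypothetical look schedule: a check every 25 observations up to n = 500) reproduces the inflation from repeated peeking:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_max, check_every, z_crit = 2000, 500, 25, 1.96

false_positives = 0
for _ in range(n_sims):
    x = rng.standard_normal(n_max)            # data generated under the null
    for n in range(check_every, n_max + 1, check_every):
        z = x[:n].mean() * np.sqrt(n)         # z-statistic at interim look n
        if abs(z) > z_crit:                   # "significant": stop and declare a win
            false_positives += 1
            break

print(false_positives / n_sims)   # roughly 0.2, far above the nominal 0.05
```

Each individual look is a valid 5% test; it is the stop-at-first-significance policy across 20 looks that inflates the error rate.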
75x better than random
it has learned meaningful sequential patterns in user browsing behavior (e.g., within-category browsing, transition patterns between categories). This is meaningful because it demonstrates that the order of item interactions carries predictive information that a non-sequential model would miss. The → Chapter 9: Quiz

A

Acceptable reproduction criteria:
Within 1-2% of reported performance on the same dataset and evaluation protocol. - Consistent across at least three random seeds. - Achievable within the reported compute budget (within 2x). → Chapter 37: Reading Research Papers — How to Stay Current, Evaluate Claims, and Separate Signal from Hype
Acceptance criteria:
Suite contains $\geq 15$ expectations covering schema, completeness, value ranges, volume, and freshness - Checkpoint runs successfully on 30 days of historical data with $\leq 2$ false alarms - Pipeline halts if validation fails, with Slack notification → Chapter 28: ML Testing and Validation Infrastructure — Data Contracts, Behavioral Testing, and Great Expectations
ACID transactions
concurrent reads and writes are safe 2. **Time-travel** — query data as it existed at any point in the past 3. **Schema evolution** — add, rename, or reorder columns without rewriting data 4. **Partition evolution** — change partitioning schemes without rewriting data (Iceberg) → Chapter 25: Data Infrastructure — Feature Stores, Data Warehouses, Lakehouses, and the Plumbing Nobody Teaches
Additions beyond Standard Track:
All "Advanced Sidebar" sections read and annotated - Challenge problems (marked in exercises as "Challenge") attempted for every chapter - Two case study write-ups per part (14 total) - 5 paper reviews using the Ch. 37 framework, spaced across the 30 weeks - All 13 appendices read (especially Append → Syllabus: Self-Paced Independent Study
ADR: Model Explainability in the Serving Path
**Context:** ECOA requires adverse action reasons for every denial. SHAP computation for 500-tree XGBoost takes ~200ms per applicant on CPU, ~15ms on GPU. - **Decision:** Pre-compute SHAP values for all feature combinations at each score decile. Serve pre-computed explanations via lookup table. Fall → Case Study 2: Cross-Domain Comparison — How Integration Principles Transfer to Credit Scoring, Pharma, and Climate
Advantages over GP-based BO:
TPE handles categorical, conditional, and mixed-type hyperparameters naturally. - TPE scales to higher dimensions (50+ hyperparameters) where GPs struggle. - TPE is faster per iteration (no matrix inversion). → Chapter 22: Bayesian Optimization and Sequential Decision-Making — From Hyperparameter Tuning to Bandits
adverse action notices
specific reasons why an applicant was denied credit or offered unfavorable terms. This requirement constrains the serving architecture: → Chapter 24: ML System Design — Architecture Patterns for Real-World Machine Learning
AIR for race_hispanic = 0.83
below Meridian's internal threshold of 0.85 but above the regulatory threshold of 0.80. The 6 new states had a higher proportion of Hispanic applicants, and the model's treatment of this population was less favorable. → Case Study 2: Meridian Financial — Regulatory Monitoring for Fair Lending Compliance
Alignment scores
how relevant is each encoder position $j$ to the current decoder state? → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
always-valid confidence sequence
a sequence of confidence intervals $\text{CI}_t$ that simultaneously covers the true parameter for all time points $t$ with probability $1 - \alpha$: → Chapter 33: Rigorous Experimentation at Scale — Multi-Armed Bandits, Interference Effects, and Experimentation Platforms
arbitrary length
**Shares parameters** across time steps (the same function is applied at each step) - Maintains a **memory** that can, in principle, carry information from early positions to late positions → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
ask for
what kind of information this position is looking for. - $\mathbf{W}^K$ learns what to **advertise** — what kind of information this position has to offer. - $\mathbf{W}^V$ learns what to **provide** — the actual information transmitted when this position is attended to. → Chapter 10: The Transformer Architecture — Attention Is All You Need (and Why It Changed Everything)
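A minimal single-head sketch of how the three projections combine in scaled dot-product attention (random illustrative shapes, not the book's implementation):

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # ask / advertise / provide
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key match quality
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # attended mixture of values

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))                     # 5 positions, model dim 16
Wq, Wk, Wv = (rng.standard_normal((16, 8)) for _ in range(3))
out = scaled_dot_product_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

Each output row is a weighted average of the value vectors, with weights set by how well that position's query matches every position's key.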
ATE
The average effect across all patients (both those prescribed Drug X and those who were not): $$\text{ATE} = \mathbb{E}[Y(1) - Y(0)] = P(\text{readmitted under Drug X}) - P(\text{readmitted under standard})$$ → Case Study 1: MediCore Treatment Effect — Defining Potential Outcomes for Drug Efficacy
ATT
The average effect among patients who actually received Drug X: $$\text{ATT} = \mathbb{E}[Y(1) - Y(0) \mid D = 1]$$ → Case Study 1: MediCore Treatment Effect — Defining Potential Outcomes for Drug Efficacy
Attention weights
normalize the alignment scores into a probability distribution: → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
automatic differentiation
not numerical differentiation (finite differences) and not symbolic differentiation (formula manipulation). It is the algorithmic application of the chain rule to a recorded sequence of operations. → Chapter 6: Neural Networks from Scratch
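The idea can be sketched with a toy scalar reverse-mode tape (illustrative only; real frameworks process nodes in reverse topological order rather than recursing):

```python
class Var:
    """Scalar that records local derivatives for reverse-mode autodiff."""
    def __init__(self, value):
        self.value, self.parents, self.grad = value, (), 0.0

    def __add__(self, other):
        out = Var(self.value + other.value)
        out.parents = ((self, 1.0), (other, 1.0))          # local derivatives
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value)
        out.parents = ((self, other.value), (other, self.value))
        return out

    def backward(self, seed=1.0):
        # Chain rule applied to the recorded sequence of operations
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

x, y = Var(3.0), Var(4.0)
z = x * y + x            # dz/dx = y + 1, dz/dy = x
z.backward()
print(x.grad, y.grad)    # 5.0 3.0
```

No finite differences and no formula manipulation: the derivative is assembled exactly from the recorded operations and their local gradients.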

B

backfill safety checklist
a procedure that the team follows before executing any backfill. The checklist should include at least 8 items covering: scope verification, resource planning, output isolation, comparison testing, rollback planning, communication, and monitoring. → Chapter 27: Exercises
backpropagation through time (BPTT)
it is simply standard backpropagation applied to the unrolled computation graph. → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
Batch inference
Scoring predictions for a large set of inputs. Each input is processed independently; there are no dependencies between predictions. 2. **Hyperparameter search** — Evaluating different hyperparameter configurations. Each trial trains an independent model; no information flows between trials (in grid → Chapter 5: Quiz
batch validation
it checks a complete dataset after it has been produced, typically at a pipeline checkpoint. Pandera performs **inline validation** — it checks DataFrames at Python function boundaries, raising an exception the moment invalid data enters a function. Great Expectations integrates with orchestration s → Chapter 28: Quiz
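The inline-validation pattern can be sketched without the library. `validate_inline` below is a hypothetical helper illustrating the idea, not Pandera's actual API:

```python
import pandas as pd

def validate_inline(**column_checks):
    """Raise the moment an invalid DataFrame enters the decorated function."""
    def decorator(fn):
        def wrapper(df, *args, **kwargs):
            for column, check in column_checks.items():
                if column not in df.columns:
                    raise ValueError(f"missing column: {column}")
                if not check(df[column]).all():
                    raise ValueError(f"check failed for column: {column}")
            return fn(df, *args, **kwargs)
        return wrapper
    return decorator

@validate_inline(age=lambda s: s >= 0, score=lambda s: s.between(0, 1))
def train(df):
    return df["score"].mean()

ok = pd.DataFrame({"age": [25, 31], "score": [0.4, 0.9]})
print(train(ok))   # 0.65; a frame with age = -1 would raise ValueError
```

The contrast with batch validation is the failure point: the exception fires at the function boundary, not at a later pipeline checkpoint.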
Better approaches:
**Contextual bandits** with product embeddings as the context, so that information about one product generalizes to similar products. - **Collaborative filtering bandits** that model the reward as a function of user and product embeddings, exploiting the low-rank structure of the user-product intera → Chapter 22: Quiz
Biomarker Level
mediator, would block the causal pathway. - **Insurance Status** — collider, could open spurious paths. - **Prescribing Physician** — reserved as a potential instrument for sensitivity analysis (Chapter 18). → Case Study 1: MediCore Causal DAG — Identifying Confounders and Valid Adjustment Sets
Brain Floating Point (bf16):
1 sign bit, 8 exponent bits, 7 mantissa bits - Range: $\pm 3.39 \times 10^{38}$ (same as fp32) - Precision: ~2.1 decimal digits → Chapter 7: Training Deep Networks — Initialization, Batch Normalization, Dropout, Learning Rate Schedules, and the Dark Art of Making It Converge
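Because bf16 is just the top 16 bits of an fp32 encoding, its behavior can be approximated by bit masking (a truncation sketch; real hardware rounds):

```python
import struct

def to_bf16(x: float) -> float:
    """Keep only the top 16 fp32 bits: sign, 8-bit exponent, 7-bit mantissa."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(to_bf16(1e38))   # still ~1e38: fp32-sized range survives
print(to_bf16(257.0))  # 256.0: only ~2 decimal digits of precision
```

This is the bf16 trade: it keeps fp32's full exponent range (no overflow surprises) by giving up mantissa precision.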

C

calibration
do the intervals actually achieve their nominal coverage? → Chapter 23: Advanced Time Series and Temporal Models — State-Space Models, Temporal Fusion Transformers, and Probabilistic Forecasting
calibration by group
equal PPV and equal FDR across groups; (2) **equal false positive rates** — $\text{FPR}_0 = \text{FPR}_1$; and (3) **equal false negative rates** — $\text{FNR}_0 = \text{FNR}_1$. Conditions 2 and 3 together constitute equalized odds. The theorem proves that calibration and equalized odds are mathema → Chapter 31: Quiz
Candidate cell state
the new information to potentially add: → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
Candidate hidden state
uses the reset gate to selectively read the previous state: → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
causal model
a directed acyclic graph specifying the causal relationships between the protected attribute, all features, and the outcome. This is a much stronger requirement, because the causal graph determines which features are causally downstream of the protected attribute (and must be adjusted in the counter → Chapter 31: Quiz
Cell state update
combine forget and input: → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
champion
the model currently serving traffic. A newly trained model is a **challenger**. The validation gate compares the challenger against the champion on held-out data, behavioral tests, and operational criteria before permitting the challenger to serve any traffic. → Chapter 28: ML Testing and Validation Infrastructure — Data Contracts, Behavioral Testing, and Great Expectations
client drift
local models diverge from each other because they optimize on different data distributions. → Chapter 32: Privacy-Preserving Data Science — Differential Privacy, Federated Learning, and Synthetic Data
climate sensitivity
how much does Earth's temperature change per unit of radiative forcing? → Case Study 2: Climate Natural Experiments — Estimating Causal Effects When You Cannot Randomize
co-versioned artifacts
the FAISS index and the model share a version tag. The deployment pipeline rebuilds the FAISS index whenever a new model is promoted, and both artifacts are deployed atomically. The model registry (MLflow, Chapter 29) tracks the correspondence. → Case Study 1: StreamRec Track B Implementation — Standard Integration
coincidental
a statistical artifact of testing thousands of pairwise correlations across many time series. With enough variables, some will be correlated by chance. However, it could also reflect **confounding** by a shared time trend (both variables increased over the same period due to unrelated underlying cau → Chapter 15: Quiz
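A quick simulation (arbitrary seed and sizes) shows how easily independent trending series produce large spurious correlations:

```python
import numpy as np

rng = np.random.default_rng(0)
walks = rng.standard_normal((200, 50)).cumsum(axis=0)  # 50 independent random walks

corr = np.corrcoef(walks.T)
np.fill_diagonal(corr, 0.0)
print(np.abs(corr).max())   # often > 0.9 despite zero true relationship
```

With 50 series there are 1,225 pairs, and random walks share trends by construction, so some pairs will look strongly "related" purely by chance.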
Colliders:
User History: if both User Preference and past Recommendations cause User History, conditioning on User History opens a spurious path. → Chapter 17: Graphical Causal Models — DAGs, d-Separation, and Structural Causal Models
column pruning
only the columns used in the query are read from disk, reducing I/O by the fraction of unused columns (e.g., reading 5 of 100 columns reduces I/O by ~95%); (2) **predicate pushdown** — min/max statistics per row group allow entire row groups to be skipped when they cannot match a filter condition; a → Chapter 25: Quiz
combining domain depth with consistency
spokes provide domain context while the hub ensures standards, shared infrastructure, and career development. → Chapter 39: Quiz
Common diagnostic patterns:
*Sigmoid-shaped*: over-confident at extremes, reasonable in the middle — common for uncalibrated neural networks - *Flat*: model outputs cluster in a narrow probability range regardless of true label frequency — the model has not learned to distinguish confidence levels - *Shifted*: systematically a → Appendix G: Evaluation Metrics Reference
Common forms:
**Temporal leakage.** Training on data that is chronologically after the test data. In time-series forecasting, recommendation systems, and fraud detection, this is the most common form of leakage. A model that sees tomorrow's user behavior while predicting today's engagement is not forecasting — it → Chapter 37: Reading Research Papers — How to Stay Current, Evaluate Claims, and Separate Signal from Hype
Common pitfalls (shared with Qini):
Both Qini and AUUC require randomized data (or credible causal estimation) to compute. You cannot evaluate an uplift model using observational data without addressing confounding. - Confidence intervals for Qini/AUUC require bootstrap or permutation methods. They are wider than for predictive metric → Appendix G: Evaluation Metrics Reference
Common pitfalls:
The *accuracy paradox*: In a dataset with 99% negatives, a model that always predicts "negative" achieves 99% accuracy while being completely useless. This is the single most common metric mistake in applied ML. - Accuracy treats all errors as equally costly. In credit scoring (Chapter 31), denying → Appendix G: Evaluation Metrics Reference
complete data lineage
the ability to trace a single decision backward through every layer of the data infrastructure, from the denial letter to the model score to the feature values to the raw data sources. → Case Study 2: Meridian Financial — Feature Lineage for Regulatory Compliance
Completeness
the attributions sum to $f(x) - f(x')$. Like Shapley's efficiency axiom, every unit of prediction difference is accounted for. (2) **Sensitivity** — if changing feature $j$ from the baseline value to the input value changes the output, feature $j$ receives nonzero attribution. A feature that matters → Chapter 35: Quiz
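A midpoint-rule sketch of the integrated-gradients path integral, using a hypothetical two-feature function, makes the completeness property checkable numerically:

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=200):
    """Midpoint-rule approximation of the IG path integral from baseline to x."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

f = lambda v: v[0] ** 2 + 3 * v[1]                 # illustrative model
grad_f = lambda v: np.array([2 * v[0], 3.0])       # its gradient

x, baseline = np.array([1.0, 2.0]), np.zeros(2)
attr = integrated_gradients(grad_f, x, baseline)
print(attr)                             # ~ [1. 6.]
print(attr.sum(), f(x) - f(baseline))   # completeness: both 7.0
```

Every unit of the prediction difference f(x) - f(baseline) is distributed across the feature attributions, exactly as the axiom states.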
compliance theater
going through the motions of validation and documentation without actually thinking critically about the model's behavior. The DS Leadership team combats this by requiring that every MRM validation report include a section titled "What could go wrong that our current tests would not detect?" — a que → Case Study 2: Cross-Domain Comparison — Data Science Organizational Design in Pharma, Finance, Climate, and Tech
compute-bound
they are limited by the GPU's arithmetic throughput. Operations below the ridge point are **memory-bound** — they are limited by the speed of reading/writing data from HBM. For an A100 (312 TFLOP/s, 2.0 TB/s), the ridge point is 156 FLOP/byte. Large matrix multiplications (arithmetic intensity ~4096 → Chapter 26: Quiz
conditional edges
dependencies that should be followed only if a specific condition is met (e.g., "run `register_models` only if at least one model passed evaluation"). → Chapter 27: Exercises
Conditional independence
$X \perp\!\!\!\perp Y \mid Z$ — means $p(x, y \mid z) = p(x \mid z) \cdot p(y \mid z)$. Conditional independence is the structural assumption behind graphical models, naive Bayes classifiers, and the causal DAGs we will study in Part III. → Chapter 3: Probability Theory and Statistical Inference
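A small numeric check of the factorization, using made-up binary distributions in a Z → X, Z → Y structure:

```python
import numpy as np

pz = np.array([0.6, 0.4])                      # p(z)
px_z = np.array([[0.9, 0.1], [0.2, 0.8]])      # p(x | z), rows indexed by z
py_z = np.array([[0.7, 0.3], [0.5, 0.5]])      # p(y | z)

joint = np.einsum('z,zx,zy->xyz', pz, px_z, py_z)   # p(x, y, z)

# Conditional independence: p(x, y | z) = p(x | z) p(y | z)
p_xy_given_z = joint / joint.sum(axis=(0, 1), keepdims=True)
factorized = np.einsum('zx,zy->xyz', px_z, py_z)
print(np.allclose(p_xy_given_z, factorized))   # True

# ...yet X and Y are marginally dependent: p(x, y) != p(x) p(y)
p_xy = joint.sum(axis=2)
print(np.allclose(p_xy, np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0))))  # False
```

The contrast between the two checks is the point: conditioning on the common cause Z removes the dependence between X and Y.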
Configuration after optimization:
Parallelism: FSDP (ZeRO-2 equivalent), NCCL backend - Local batch size per GPU: 12 - Gradient accumulation: 4 - Effective global batch size: $12 \times 4 \times 64 = 3{,}072$ - Optimizer: LAMB with warmup cosine schedule - Learning rate: $3 \times 10^{-4} \times (3072/8) = 0.1152$ (capped at 0.01 wi → Case Study 1: Climate DL — Distributed Training for a Global Weather Prediction Model
Configuration:
Parallelism: DDP (NCCL backend, NVLink) - Local batch size per GPU: 8 - Global batch size: 64 - Learning rate: $3 \times 10^{-4} \times (64/8) = 2.4 \times 10^{-3}$ (linear scaling, 8x base) - Warmup: 2,000 steps (approximately 5% of total steps) - Optimizer: LAMB (required for global batch size > 2 → Case Study 1: Climate DL — Distributed Training for a Global Weather Prediction Model
Confounder
good control. - **Item popularity** ($P$): Causes recommendation (popular items are recommended more) and engagement (popular items get more engagement). **Confounder** — good control. - **Content quality** ($Q$): Causes engagement but not recommendation (the algorithm does not observe quality direc → Chapter 17: Graphical Causal Models — DAGs, d-Separation, and Structural Causal Models
Confounders:
User Preference: causes both Recommendation (through the algorithm's prediction) and Engagement (organic behavior). This is the primary confounder. - Item Features: may cause both Recommendation (algorithm input) and Engagement (if features like genre directly affect engagement). → Chapter 17: Graphical Causal Models — DAGs, d-Separation, and Structural Causal Models
Consequences:
Smaller community than Airflow; fewer provider integrations available. The team will need to build custom resources for some integrations. - Team members with Airflow experience will require 1-2 weeks of ramp-up on Dagster concepts. - Dagster Cloud provides managed hosting; self-hosted deployment on → Chapter 27: ML Pipeline Orchestration — Airflow, Dagster, Prefect, and Designing Robust Data Workflows
Content Platform Recommender
The progressive project. A mid-size content streaming platform (StreamRec, ~5M users, ~200K items, $400M revenue). Collaborative filtering → deep learning → causal evaluation → production deployment → fairness audit. Spans all seven parts. 2. **Pharma Causal Inference** — MediCore Pharmaceuticals an → Advanced Data Science — Master Outline
context vector
the final hidden state. 2. A **decoder** RNN generates the output sequence one token at a time, conditioned on the context vector. → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
continuous training
scheduled or triggered retraining that keeps the model current without human intervention. → Chapter 29: Continuous Training and Deployment — CI/CD for ML, Canary Deployments, Shadow Mode, and Progressive Rollout
Correctness
predictions are in the expected range and format. (2) **Latency** — the new model meets the latency budget. (3) **Consistency** — predictions are correlated with the production model's predictions (large discrepancies suggest a bug, not a model improvement). (4) **Error handling** — the model handle → Chapter 24: Quiz
Cost per training run:
4 A100 GPUs × 0.9 hours × $3.50/GPU-hour = **$12.60 on-demand** - With spot instances: $12.60 × 0.35 = **$4.41** → Case Study 2: StreamRec — Scaling Two-Tower Training to 1.2 Billion Interactions
counterfactual
what would have happened under the alternative treatment assignment. → Chapter 16: Quiz
Cross-attention
the decoder attends to the encoder's output representations. Here, the queries come from the decoder, while the keys and values come from the encoder. This is how the decoder "reads" the input. → Chapter 10: The Transformer Architecture — Attention Is All You Need (and Why It Changed Everything)

D

Data does not fit in memory
$n$ is so large that $O(n)$ space is infeasible (e.g., billions of events per day). 2. **Data arrives continuously** — There is no "end" to the dataset; it is an unbounded stream. 3. **Single-pass requirement** — Data can only be read once (e.g., network packets, sensor readings). 4. **Approximate a → Chapter 5: Quiz
Decoder update
the context vector is concatenated with the decoder input: → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
deep domain context
the DS sits with the domain team, understands the business problem intimately, and has a fast feedback loop with the decision-maker. → Chapter 39: Quiz
design effect
the multiplicative inflation in variance relative to individual randomization. For StreamRec, if friend clusters average $m = 10$ users and the ICC is $\rho = 0.05$, the design effect is $1 + 9 \times 0.05 = 1.45$. The experiment needs 45% more users to achieve the same power. → Chapter 33: Rigorous Experimentation at Scale — Multi-Armed Bandits, Interference Effects, and Experimentation Platforms
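The arithmetic as a one-liner (the 100,000-user baseline is an illustrative figure, not from the text):

```python
def design_effect(cluster_size, icc):
    """Variance inflation under cluster randomization: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

de = design_effect(10, 0.05)
print(round(de, 2))          # 1.45
print(round(100_000 * de))   # 145,000 users where 100,000 sufficed individually
```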
Design principles embodied:
**Dual-store feature architecture** (H.6.1): online Redis for serving, offline Parquet for training - **Hybrid serving** (H.1.1 + H.1.2): batch candidate generation + real-time re-ranking - **Graceful degradation** (H.5.2): four levels from full model to popularity baseline - **Circuit breaker** (H. → Appendix H: ML System Design Patterns
Difficulty ratings:
**Introductory** — Accessible to a reader who has completed the relevant chapters. Good first papers in a subfield. - **Intermediate** — Requires comfort with the mathematical foundations (Part I) and familiarity with the subfield. The bulk of the papers listed here. - **Advanced** — Assumes deep fl → Appendix I: Key Papers and Reading Lists
Directional (6 tests):
Higher income → lower predicted default probability - Higher FICO score → lower predicted default probability - Higher debt-to-income ratio → higher predicted default probability - Longer employment tenure → lower predicted default probability - More delinquencies → higher predicted default probabil → Case Study 2: Meridian Financial — Regulatory Model Validation for Credit Scoring
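Such directional tests reduce to perturb-and-compare checks; `predict` below is a hypothetical linear stand-in for the real scorer:

```python
def check_direction(predict, row, feature, delta, expected_sign):
    """Perturb one feature; the prediction must move with expected_sign (+1 or -1)."""
    perturbed = dict(row)
    perturbed[feature] = row[feature] + delta
    moved = predict(perturbed) - predict(row)
    return moved * expected_sign > 0

# Hypothetical stand-in for a default-probability model
predict = lambda r: 0.3 - 1e-5 * r["fico"] + 0.5 * r["dti"]
applicant = {"fico": 700, "dti": 0.3}

print(check_direction(predict, applicant, "fico", +50, -1))  # True: higher FICO, lower risk
print(check_direction(predict, applicant, "dti", +0.1, +1))  # True: higher DTI, higher risk
```

In a real validation suite, each of the six directional tests becomes one such assertion run against representative applicant profiles.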
Disadvantages:
TPE does not directly model correlations between hyperparameters. - The theory is less developed than GP-UCB or Thompson sampling. → Chapter 22: Bayesian Optimization and Sequential Decision-Making — From Hyperparameter Tuning to Bandits
distribution-free guarantee
it does not require the forecaster to be Gaussian, linear, or even particularly good. It requires only that the adaptation rate $\gamma$ is appropriate for the rate of distributional change. → Chapter 23: Advanced Time Series and Temporal Models — State-Space Models, Temporal Fusion Transformers, and Probabilistic Forecasting
diverse-workload environments
organizations running ETL, ML, analytics, dbt, and data engineering pipelines on a shared platform. Its 80+ provider packages, 10+ years of production track record, and massive community mean that operators exist for almost every external system. Teams with existing Airflow expertise and a mix of wo → Chapter 27: Quiz
domain ownership
the team that produces data owns it as a product; (2) **data as a product** — each data asset has a discoverable interface, documented schema, quality guarantees, and an owner; (3) **self-serve data platform** — shared infrastructure enables domain teams to publish data products without building the → Chapter 25: Quiz
Domain-specific red flags:
**No beyond-accuracy metrics.** Recommendation systems must balance accuracy with diversity, novelty, coverage, and fairness. A paper that reports only Hit@K and NDCG@K without any beyond-accuracy metric is optimizing for a narrow objective. - **Implicit feedback confusion.** Many papers train on im → Chapter 37: Reading Research Papers — How to Stay Current, Evaluate Claims, and Separate Signal from Hype

E

easily fits on a single GPU
it consumes less than 1% of an A100's 80 GB memory. DDP is simpler than FSDP (no sharding complexity, no all-gather overhead per layer), and the model is small enough that replicating it on every GPU wastes negligible memory. FSDP would become necessary if the model grew to the point where **paramet → Chapter 26: Quiz
effect modifiers
covariates that may cause the treatment effect to vary. The CATE $\tau(x)$ is estimated as a function of $X$. Including a variable in $X$ means "I believe the treatment effect may differ at different values of this variable." → Chapter 19: Quiz
effective rank
the number of singular values significantly larger than zero. We will make this precise in Section 1.4 when we develop SVD. → Chapter 1: Linear Algebra for Machine Learning
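A numerical sketch with a synthetic rank-3 matrix (the tolerance of 1e-4 times the top singular value is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
# A 100 x 50 matrix with rank-3 structure plus tiny noise
A = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 50))
A += 1e-8 * rng.standard_normal(A.shape)

s = np.linalg.svd(A, compute_uv=False)           # singular values, descending
effective_rank = int((s > s[0] * 1e-4).sum())
print(effective_rank)   # 3
```

Formally A has full rank 50 because of the noise, but only three singular values are significantly larger than zero.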
Eigendecomposition
the X-ray of a square matrix, revealing its fundamental modes of action 2. **Singular Value Decomposition** — the fundamental theorem of data science, applicable to *any* matrix 3. **Matrix calculus** — the language that backpropagation speaks, and the mathematical apparatus behind every gradient yo → Chapter 1: Linear Algebra for Machine Learning
embedding
mapping inputs to dense vector representations that can be used for retrieval, clustering, and similarity search. → Chapter 13: Transfer Learning, Foundation Models, and the Modern Deep Learning Workflow
equivariant function approximation
applied to different symmetry groups. Weight sharing in CNNs (translation equivariance) and permutation-invariant aggregation in GNNs (permutation equivariance) are both consequences of imposing the data's symmetry on the architecture. The transformer, which has no built-in symmetry and uses positio → Chapter 14: Quiz
Evaluation criteria:
DDP training reproduces single-GPU quality (Recall@20 within 1%). - AMP provides measurable speedup (>1.5x expected). - Profiling identifies the primary bottleneck and the optimization addresses it. - Cost estimate is realistic and documented. → Chapter 26: Training at Scale — Distributed Training, GPU Optimization, and Managing Compute Costs
Examples of platform bets in data science:
Choosing PyTorch vs. TensorFlow vs. JAX as the primary training framework - Choosing AWS SageMaker vs. GCP Vertex AI vs. a self-managed Kubernetes cluster - Choosing Spark vs. Dask vs. Ray for distributed computation - Choosing Delta Lake vs. Iceberg vs. Hudi as the lakehouse format → Chapter 38: The Staff Data Scientist — Technical Leadership, Mentoring, Strategy, and Shaping the Roadmap
Exercise-specific adjustments:
**Mathematical derivation exercises (Parts I, III, IV):** Award full credit only when the student shows every application of the chain rule, every conditional expectation expansion, and every assumption invoked. A correct final answer without derivation earns at most 5/10. A correct setup with a sig → Grading Rubrics
Exercises
implementation problems designed to take 1-4 hours each - **Quiz** — self-assessment questions - **Two case studies** — applied scenarios from industry and research - **Key takeaways** — summary card - **Further reading** — annotated paper recommendations → Advanced Data Science
expectation
a declarative assertion about a dataset that can be evaluated against any batch of data. An expectation suite is a collection of expectations that together define the "contract" a dataset must satisfy. → Chapter 28: ML Testing and Validation Infrastructure — Data Contracts, Behavioral Testing, and Great Expectations

F

familywise error rate
the probability that at least one null hypothesis is falsely rejected — inflates rapidly: → Chapter 33: Rigorous Experimentation at Scale — Multi-Armed Bandits, Interference Effects, and Experimentation Platforms
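Under independence the inflation is 1 - (1 - alpha)^m, which a few lines make concrete (Bonferroni shown for contrast):

```python
def fwer(alpha, m):
    """Familywise error rate for m independent tests at per-test level alpha."""
    return 1 - (1 - alpha) ** m

print(round(fwer(0.05, 20), 3))       # 0.642: 20 uncorrected tests
print(round(fwer(0.05 / 20, 20), 3))  # 0.049: Bonferroni restores ~0.05 control
```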
Feast
it is open source, framework-agnostic, and its architecture exposes the core concepts without vendor abstraction. → Chapter 25: Data Infrastructure — Feature Stores, Data Warehouses, Lakehouses, and the Plumbing Nobody Teaches
Feature importances
which covariates drive treatment effect heterogeneity. Unlike standard feature importance (which features predict $Y$), causal forest feature importance tells you which features predict *treatment effect variation*. → Chapter 19: Causal Machine Learning — Heterogeneous Treatment Effects, Uplift Modeling, and Double Machine Learning
feature views
logical groupings of features that share an entity, a data source, and a materialization schedule. → Chapter 25: Data Infrastructure — Feature Stores, Data Warehouses, Lakehouses, and the Plumbing Nobody Teaches
Forget gate
decides what to remove from the cell state: → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
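The gate equations referenced by this entry, and by the Hidden state, Input gate, and Output gate entries, can be sketched as a single numpy LSTM step (a minimal illustration; the stacked weight layout is an assumption of this sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev; x] to the four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x]) + b
    H = h_prev.shape[0]
    f = sigmoid(z[0:H])        # forget gate: what to remove from the cell state
    i = sigmoid(z[H:2*H])      # input gate: what new information to add
    g = np.tanh(z[2*H:3*H])    # candidate cell update
    o = sigmoid(z[3*H:4*H])    # output gate: what to expose from the cell state
    c = f * c_prev + i * g     # cell state update: the additive "gradient highway"
    h = o * np.tanh(c)         # hidden state: the output of the LSTM cell
    return h, c

rng = np.random.default_rng(0)
H, D = 4, 3
W = rng.normal(scale=0.1, size=(4 * H, H + D))
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, np.zeros(4 * H))
```

Note that the cell-state update is additive, not a repeated matrix multiplication, which is why gradients can flow through it without attenuation.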
Full system healthy
multi-source retrieval, deep ranking, re-ranking; (2) **Real-time features unavailable** — fall back to batch features (stale but personalized); (3) **Ranking model unavailable** — use retrieval scores directly (lower quality); (4) **Feature store unavailable** — return globally popular items (not p → Chapter 24: Quiz
fundamental
it is a mathematical consequence, not an engineering limitation. Information-theoretically, if a mechanism reveals zero information about any individual (perfect privacy), it also reveals zero information about the population (zero utility). Useful analysis requires extracting some information from → Chapter 32: Quiz

G

global modeling
training across many related series to leverage cross-series patterns — which is how TFT outperforms per-series classical methods on StreamRec's multi-category engagement data. The key risk is overfitting on short series. → Chapter 23: Key Takeaways
global models
trained across many related series — often outperform per-series classical models, because they leverage cross-series structure. → Chapter 23: Advanced Time Series and Temporal Models — State-Space Models, Temporal Fusion Transformers, and Probabilistic Forecasting
Greedy forward selection
Start with no features; iteratively add the feature that improves the objective most. $O(d^2)$ model evaluations. 2. **Greedy backward elimination** — Start with all features; iteratively remove the least important. $O(d^2)$ model evaluations. 3. **L1 regularization (Lasso)** — Adds an $\ell_1$ pena → Chapter 5: Quiz

H

hallucinate
they generate text that is fluent, confident, and wrong. This is not a bug to be patched; it is a fundamental consequence of the training objective. → Chapter 11: Large Language Models — Architecture, Training, Fine-Tuning, RAG, and Practical Applications
Hard negative sampling
sampling non-edges between nodes that are close in the graph (e.g., at distance 2) — produces more informative gradients. → Chapter 14: Graph Neural Networks and Geometric Deep Learning — When Your Data Has Structure Beyond Grids and Sequences
heterophilous graphs
where connected nodes tend to have *different* labels (e.g., in dating networks, bipartite buyer-seller networks, or amino acid interaction networks) — aggregation mixes features from dissimilar nodes, which can destroy the discriminative signal. In extreme cases, a GCN on a heterophilous graph perf → Chapter 14: Quiz
Hidden state
the output of the LSTM cell: → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
Hidden state update
interpolates between old and new: → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
High
partial pooling is the killer app | Mixed-effects models (close, but no full posterior) | | Few groups, abundant data per group | **Low** — posterior ≈ MLE | MLE or regularized regression | | Need full uncertainty for decisions | **High** — direct posterior probability statements | Bootstrap (close, → Chapter 21: Bayesian Modeling in Practice — PyMC, Hierarchical Models, and When Bayesian Methods Earn Their Complexity
high $P(Y=1 \mid X)$
the baseline probability of the adverse outcome, regardless of whether the treatment changes it. The quantity that optimal targeting should use is the **Conditional Average Treatment Effect (CATE)**: $E[Y(1) - Y(0) \mid X = x]$, which measures how much the treatment changes the outcome for individua → Chapter 15: Quiz
homophily
the assumption that connected nodes tend to have similar labels or features (e.g., friends share political views, papers in the same field cite each other). Aggregation effectively "borrows" useful information from similar nodes. → Chapter 14: Quiz
How many neurons are needed
the required width can be exponentially large in the input dimension. (2) **That gradient descent can find the right weights** — the theorem is an existence result and says nothing about optimization. (3) **Sample efficiency** — it says nothing about how much training data is needed to learn the app → Chapter 6: Quiz

I

Idempotency
the property that rerunning a task produces the same output — enables safe retries and backfills. Partition-based overwrites, atomic writes, and deterministic artifact paths are the primary implementation techniques. → Chapter 27: ML Pipeline Orchestration — Airflow, Dagster, Prefect, and Designing Robust Data Workflows
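A minimal Python sketch of two of the techniques named here, deterministic artifact paths and atomic writes (illustrative, not tied to any orchestrator):

```python
import json
import os
import tempfile

def write_partition(output_dir, partition, records):
    """Idempotent partition write: deterministic path + atomic replace."""
    os.makedirs(output_dir, exist_ok=True)
    final_path = os.path.join(output_dir, f"dt={partition}.json")  # deterministic
    fd, tmp_path = tempfile.mkstemp(dir=output_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(records, f)
    # Atomic on POSIX (same filesystem): readers never observe a partial file,
    # and a retry simply overwrites the partition with identical content.
    os.replace(tmp_path, final_path)
    return final_path

out = tempfile.mkdtemp()
p1 = write_partition(out, "2024-01-01", [{"id": 1}])
p2 = write_partition(out, "2024-01-01", [{"id": 1}])  # safe retry: same output
```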
identity matrix
no weight matrix $\mathbf{W}_{hh}$ appears in this gradient path. The gradient can flow through the cell state for hundreds of time steps without multiplicative attenuation, creating a "gradient highway." → Chapter 9: Quiz
IEEE 754 Half Precision (fp16):
1 sign bit, 5 exponent bits, 10 mantissa bits - Range: $\pm 6.55 \times 10^4$ - Smallest positive normal: $6.10 \times 10^{-5}$ - Precision: ~3.3 decimal digits → Chapter 7: Training Deep Networks — Initialization, Batch Normalization, Dropout, Learning Rate Schedules, and the Dark Art of Making It Converge
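These constants can be checked directly with numpy, which also shows why the narrow range matters in practice:

```python
import numpy as np

fp16 = np.finfo(np.float16)
print(fp16.max)    # largest finite value, 65504 (~6.55e4)
print(fp16.tiny)   # smallest positive normal, ~6.10e-5
print(fp16.eps)    # machine epsilon 2**-10, ~9.77e-4 (~3.3 decimal digits)

# The narrow range bites quickly: squaring a moderate activation overflows,
# since 9e4 exceeds the fp16 maximum of ~6.55e4.
x = np.float16(300.0)
print(x * x)       # inf
```

This overflow behavior is why mixed-precision training keeps a master copy of weights in fp32 and applies loss scaling.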
Immutability
some features cannot be changed (age, race, sex). The optimization must hold these fixed. (2) **Actionability** — the changes should represent actions the person can actually take. "Increase your age by 10 years" is mathematically valid but useless. (3) **Plausibility** — the counterfactual should l → Chapter 35: Quiz
Implementation considerations:
Shadow predictions must not affect user experience (asynchronous evaluation) - Log both champion and challenger predictions with identical features to ensure a fair comparison - Monitor challenger latency independently — a slow challenger should not affect champion performance - Run shadow mode for → Appendix H: ML System Design Patterns
inconsistent predictions
the same input can produce different outputs depending on what other inputs are in the batch. This is one of the most common production bugs in deep learning. → Chapter 7: Quiz
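A numpy illustration of the mechanism, using batch normalization computed with batch statistics as in training mode:

```python
import numpy as np

def batchnorm_train(x):
    """Normalize with *batch* statistics, as batch norm does in training mode."""
    mu, var = x.mean(axis=0), x.var(axis=0)
    return (x - mu) / np.sqrt(var + 1e-5)

sample = np.array([[1.0, 2.0]])
batch_a = np.vstack([sample, [[0.0, 0.0]]])
batch_b = np.vstack([sample, [[10.0, 10.0]]])

out_a = batchnorm_train(batch_a)[0]
out_b = batchnorm_train(batch_b)[0]
# Same input row, different outputs -- normalization depends on its batchmates.
print(out_a, out_b)
```

The standard fix in PyTorch is calling `model.eval()` before inference, which switches batch norm to fixed running statistics.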
Increase dropout
start with 0.3 and increase to 0.5 if needed. 3. **Increase weight decay** — try $10^{-3}$ to $10^{-1}$ with AdamW. 4. **Data augmentation** — if applicable to the domain. 5. **Reduce model size** — fewer layers or smaller hidden dimensions. 6. **Get more training data** — the most reliable fix but → Chapter 7: Quiz
Increase local batch size
processes more data per GPU before each all-reduce, amortizing the fixed communication cost. Trade-off: more GPU memory consumed by activations; may require gradient checkpointing. (2) **Gradient accumulation** — accumulate gradients over $K$ micro-batches before the all-reduce, reducing communicati → Chapter 26: Quiz
Inference speed
1D CNNs are much faster than transformers for short sequences because they avoid the $O(n^2)$ self-attention computation and are highly parallelizable. In a recommendation system serving millions of requests per second, latency matters. (2) **Model size** — 1D CNNs have far fewer parameters than tra → Chapter 8: Quiz
Input embedding
converts token IDs to dense vectors $\in \mathbb{R}^{d_{\text{model}}}$ 2. **Positional encoding** — adds position information 3. **$N$ transformer blocks** — each containing multi-head self-attention + FFN → Chapter 10: The Transformer Architecture — Attention Is All You Need (and Why It Changed Everything)
Input gate
decides what new information to add: → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
integrated gradients
a method for attributing a prediction to input features. → Chapter 6: Exercises
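A minimal numpy sketch on a toy function with an analytic gradient (illustrative; real usage differentiates the model with autograd):

```python
import numpy as np

def integrated_gradients(f_grad, x, baseline, steps=200):
    """IG_i = (x_i - x'_i) * integral over alpha of df/dx_i along the path."""
    alphas = (np.arange(steps) + 0.5) / steps            # midpoint rule on [0, 1]
    path = baseline + alphas[:, None] * (x - baseline)   # straight-line path
    avg_grad = f_grad(path).mean(axis=0)
    return (x - baseline) * avg_grad

# Toy model f(x) = sum(x^2), with analytic gradient 2x.
f = lambda x: (x ** 2).sum(axis=-1)
f_grad = lambda x: 2 * x

x = np.array([1.0, -2.0, 0.5])
baseline = np.zeros(3)
attr = integrated_gradients(f_grad, x, baseline)
# Completeness axiom: attributions sum to f(x) - f(baseline).
print(attr.sum(), f(x) - f(baseline))
```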
inter-token computation
it routes information between positions, allowing each position to aggregate information from other positions. The FFN performs **intra-token computation** — it applies an independent nonlinear transformation at each position, processing the information that attention has gathered. Research on trans → Chapter 10: Quiz
interaction effect
the amount by which the treatment effect of A differs when B is active vs. inactive. Test $H_0: \beta_{AB} = 0$. A significant interaction means the combined effect of A and B is not the sum of their individual effects. In practice, the interaction is compared to the main effect of A as a relative m → Chapter 33: Quiz
internal covariate shift
the change in the distribution of layer inputs caused by updates to the preceding layers. The intuition is compelling: if each layer's inputs are normalized, then the layer can learn without its inputs shifting under it. → Chapter 7: Training Deep Networks — Initialization, Batch Normalization, Dropout, Learning Rate Schedules, and the Dark Art of Making It Converge
Interpretation:
**Reliability** measures calibration error. Lower is better. A perfectly calibrated model has reliability = 0. - **Resolution** measures discrimination — how much the model's predictions vary with the true outcome. Higher is better. - **Uncertainty** is a property of the data, not the model. It is m → Appendix G: Evaluation Metrics Reference
Intrinsic dimensionality
the fine-tuning loss landscape has a much lower effective dimension than the full parameter space (Aghajanyan et al., 2021). (2) **The weight update $\Delta W$ is empirically low-rank** — most of the model's knowledge is preserved from pretraining, and only a small task-specific adjustment is needed → Chapter 11: Quiz
Introductory Statistics
Making Sense of Data in the Age of AI 2. **Intermediate Data Science** — Machine Learning, Experimentation, and the Craft of Data-Driven Decisions 3. **Advanced Data Science** — Deep Learning, Causal Inference, and Production Systems at Scale *(this book)* → Advanced Data Science
Intuition
a plain-language explanation of what the math means 2. **Formal notation** — the mathematical expression 3. **Code** — a numpy or PyTorch implementation you can run → How to Use This Book
Invalid adjustment sets:
$\{$Biomarker$\}$: Biomarker is a descendant of Drug X (Drug X $\to$ Biomarker). Violates condition 1. Conditioning on it blocks the causal pathway. - $\{$Insurance Status$\}$: Insurance Status is a collider (Age $\to$ Insurance Status $\leftarrow$ Comorbidities). Conditioning on it opens a spurious → Chapter 17: Graphical Causal Models — DAGs, d-Separation, and Structural Causal Models
invariance tolerances are much higher
credit scoring invariance tests require 0.98-0.99 overlap (near-perfect invariance to protected attributes like gender and zip code), while recommendation invariance tests tolerate 0.70-0.95 because some platform-specific effects are legitimate. Second, **directional tests encode economic relationsh → Chapter 28: Quiz
Invariance — Fair Lending (6 tests):
Gender invariance: changing gender must not change the score by more than 0.01 - Marital status invariance: same constraint - Race/ethnicity invariance: same constraint (tested via constructed matched pairs) - National origin invariance: applicant born in the U.S. vs. not must not change the score b → Case Study 2: Meridian Financial — Regulatory Model Validation for Credit Scoring
investigate the SRM root cause
check for platform-specific differences, redirect chains, performance discrepancies, and bot filtering; (3) **fix the root cause** and re-run the experiment; (4) only trust the engagement result if the re-run passes SRM. An SRM-contaminated experiment cannot be salvaged by statistical adjustment — t → Chapter 33: Quiz
Isotonic regression strengths and limitations:
**Strengths:** Non-parametric — can correct arbitrary monotonic miscalibration patterns. No functional form assumption. - **Limitations:** Requires more calibration data than Platt scaling (1,000+ samples recommended). Produces a step function, which can be coarse with small calibration sets. May ov → Chapter 34: Uncertainty Quantification — Calibration, Conformal Prediction, and Knowing What Your Model Doesn't Know
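A minimal pool-adjacent-violators sketch in plain Python (illustrative, not the scikit-learn implementation) makes the step-function output concrete:

```python
def pav(y):
    """Pool Adjacent Violators: isotonic (non-decreasing) fit to y, unit weights."""
    blocks = []  # each block: [mean, count]
    for v in y:
        blocks.append([float(v), 1])
        # Merge backwards while monotonicity is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks[-1]
            blocks[-1] = [(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2]
    fit = []
    for m, c in blocks:
        fit.extend([m] * c)   # constant within each block -> a step function
    return fit

# Labels sorted by model score; the violating pair (1, 0) is pooled to 0.5.
print(pav([0, 1, 0, 1, 1]))   # -> [0.0, 0.5, 0.5, 1.0, 1.0]
```

The pooled blocks are exactly the "steps" mentioned above: with a small calibration set there are few blocks, so the fitted function is coarse.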

J

Justification for $\varepsilon = 3$ (target):
5.1% accuracy degradation and 14.2% Recall@20 degradation is borderline acceptable - Casual and Japanese market segments require quality mitigation before deployment - $\varepsilon = 3$ is within the range used by Apple (1-8 per data type per day) and represents a defensible privacy standard → Case Study 2: StreamRec DP Training — Privacy-Utility Tradeoff at Scale
Justification for $\varepsilon = 8$ (initial):
Only 1.9% accuracy degradation and 7.0% Recall@20 degradation from baseline - All user segments except new_users (already underperforming) meet the 0.18 threshold - Formal DP guarantee satisfies the EDPB's "quantifiable privacy protection" guidance - 3.5x training time overhead is manageable within → Case Study 2: StreamRec DP Training — Privacy-Utility Tradeoff at Scale

K

KD-tree
Exact (in low dimensions); partitions space with axis-aligned hyperplanes. Degrades to $O(n)$ in high dimensions. 2. **Locality-Sensitive Hashing (LSH)** — Approximate; hashes similar points to the same bucket with high probability. Query time is $O(n^\rho)$ where $\rho < 1$. 3. **HNSW (Hierarchical → Chapter 5: Quiz
Key ablation findings:
Reducing the number of heads hurts performance — multi-head is better than single-head. - Reducing $d_k$ hurts performance — attention dimension matters. - Bigger models are better (up to the tested range). - Dropout is important — without it, performance degrades. - Learned positional embeddings pe → Case Study 1: Paper Walkthrough — "Attention Is All You Need" Using the Three-Pass Method
Key design decisions:
**Multi-stage ranking:** The funnel architecture (1000 → 100 → 10) allows using progressively more expensive models at each stage, keeping total latency within budget. - **Hybrid retrieval:** Combine lexical (BM25) and semantic (dense retrieval with bi-encoders) to capture both exact-match and mea → Appendix H: ML System Design Patterns
Key design principles:
**Two-stage architecture:** Rules handle obvious cases (known fraud patterns, velocity checks); ML handles the ambiguous middle. Rules provide explainability and fast response; ML provides generalization. - **Latency constraint:** The entire pipeline must complete within 100-200ms to avoid degrading → Appendix H: ML System Design Patterns
Key properties:
$\mathbf{H}$ is symmetric (by Schwarz's theorem, when second partials are continuous) - If $\mathbf{H}$ is positive definite at $\mathbf{x}$, then $\mathbf{x}$ is a local minimum - If $\mathbf{H}$ is negative definite, then $\mathbf{x}$ is a local maximum - If $\mathbf{H}$ has both positive and nega → Chapter 2: Multivariate Calculus and Optimization
known unknowns
failure modes you have seen before or can imagine. You instrument the system, define thresholds, and receive alerts when those thresholds are violated. → Chapter 30: Monitoring, Observability, and Incident Response — Keeping ML Systems Healthy in Production

L

largely unobserved
we observe proxies (user history, demographics, past behavior) but not the latent preference itself. (2) The algorithm is specifically designed to exploit preference signals, creating strong confounding. (3) Because the confounder is unobserved, the backdoor criterion cannot be straightforwardly sat → Chapter 17: Quiz
Learning objectives
what you will be able to do after completing the chapter - **Motivating example** — why this topic matters, grounded in the anchor examples - **Main content** — rigorous exposition with mathematical derivations paired with code implementations - **Progressive project milestone** — the next step in b → How to Use This Book
Limitations:
**Overpowered at large sample sizes.** With millions of serving predictions, the KS test will detect trivially small shifts that have no practical impact. A PSI of 0.01 and a KS p-value of $10^{-15}$ are not unusual — the shift is statistically significant but operationally meaningless. - **Single-po → Chapter 30: Monitoring, Observability, and Incident Response — Keeping ML Systems Healthy in Production
live production traffic
with its real distribution of users, items, contexts, and edge cases — rather than on a static, retrospective holdout set. This catches issues that holdout data misses: traffic pattern changes, seasonal effects, new user segments, and the interaction between model predictions and the serving infrast → Chapter 28: Quiz
local average treatment effect (LATE)
the average treatment effect among **compliers** (Imbens and Angrist, 1994). → Chapter 18: Causal Estimation Methods — Matching, Propensity Scores, Instrumental Variables, Difference-in-Differences, and Regression Discontinuity
Lower memory
$O(Bs)$ instead of $O(Bs^2)$, because the attention matrix is never written to HBM; and (2) **Higher throughput** — by fusing multiple HBM read/write operations into a single kernel, FlashAttention eliminates the round trips that make standard attention memory-bound. The speedup comes not from reduc → Chapter 26: Quiz

M

malicious or curious server
a server that honestly computes the aggregate but also inspects individual client updates to infer information about their local data. Differential privacy addresses a different threat: an adversary who sees the *output* (the trained model) and tries to infer information about the training data. DP → Chapter 32: Quiz
marginal likelihood
is often intractable. → Chapter 4: Information Theory for Data Science — Entropy, KL Divergence, and Why Your Loss Function Works
Masked multi-head self-attention
the decoder attends to previous output positions only. A causal mask prevents position $i$ from attending to positions $j > i$, preserving the autoregressive property (each prediction depends only on past predictions, not future ones). → Chapter 10: The Transformer Architecture — Attention Is All You Need (and Why It Changed Everything)
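A numpy sketch of the causal mask (illustrative):

```python
import numpy as np

def causal_softmax(scores):
    """Mask attention so position i attends only to positions j <= i."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above diagonal
    masked = np.where(mask, -np.inf, scores)           # future positions -> -inf
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)           # softmax gives them weight 0

w = causal_softmax(np.zeros((4, 4)))   # uniform scores for illustration
print(np.round(w, 2))
# Row i spreads attention uniformly over positions 0..i; future positions get 0.
```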
methodological consistency
one team sets standards for experimentation, validation, fairness, and deployment, ensuring that all DS work meets the same quality bar. Also enables knowledge sharing and career development. → Chapter 39: Quiz
Milestone-specific notes:
**M0-M1 (mathematical foundations):** Emphasis on correctness of SVD implementation and profiling methodology. Do not penalize for naive implementations — the point is understanding, not optimization. - **M2-M6 (deep learning):** Emphasis on training curves, proper evaluation, and comparison across → Grading Rubrics
Minimum Functionality (6 tests):
AUC $\geq$ 0.78 overall - AUC $\geq$ 0.72 for each income quartile (4 tests) - Hosmer-Lemeshow calibration $p$-value $> 0.05$ → Case Study 2: Meridian Financial — Regulatory Model Validation for Credit Scoring
ML connections:
**Regression:** Assuming Gaussian noise on the target variable leads to the mean squared error loss. Specifically, if $y = f(x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, then the MLE for $f$ minimizes $\sum_i (y_i - f(x_i))^2$. - **Latent spaces:** Variational autoencoders enforce $ → Chapter 3: Probability Theory and Statistical Inference
ML Test Score
a rubric for evaluating the maturity of an ML system's testing infrastructure. The rubric covers four categories: → Chapter 28: ML Testing and Validation Infrastructure — Data Contracts, Behavioral Testing, and Great Expectations
ML-specific considerations:
The fallback for a recommendation model might be a popularity-based ranker (no personalization, but always available and fast) - The fallback for a fraud model might be a rule-based system (higher false positive rate, but never misses known patterns) - Monitor fallback activation rate as a system he → Appendix H: ML System Design Patterns
More stable
small input perturbations cause proportionally small output changes. 2. **More interpretable** — regulators can read the weights without seeing artifacts of correlated-feature cancellation. 3. **More generalizable** — the model relies on the shared signal across correlated features, not on idiosyncr → Chapter 7: Training Deep Networks — Initialization, Batch Normalization, Dropout, Learning Rate Schedules, and the Dark Art of Making It Converge
Multi-head self-attention
lets each position attend to all other positions. 2. **Position-wise feed-forward network (FFN)** — applies an independent nonlinear transformation at each position. → Chapter 10: The Transformer Architecture — Attention Is All You Need (and Why It Changed Everything)
multi-scale readout
concatenating sum, mean, and max — often outperforms any single readout because different properties depend on different aggregation semantics. Sum captures total activity (useful for molecular weight prediction), mean captures average behavior (useful for density), and max captures extreme features → Chapter 14: Graph Neural Networks and Geometric Deep Learning — When Your Data Has Structure Beyond Grids and Sequences

N

natural experiments
instrumental variables, interrupted time series, or synthetic control methods that exploit the policy's timing or geographic variation. → Chapter 15: Quiz
natural parameters
$\mathbf{T}(x)$ is the **sufficient statistic** — the function of the data that contains all information about $\boldsymbol{\eta}$ - $A(\boldsymbol{\eta})$ is the **log-partition function** (ensures normalization) - $h(x)$ is the **base measure** → Chapter 3: Probability Theory and Statistical Inference
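Assembled, the exponential-family density these terms describe is:

$$p(x \mid \boldsymbol{\eta}) = h(x) \exp\!\left(\boldsymbol{\eta}^\top \mathbf{T}(x) - A(\boldsymbol{\eta})\right)$$

For example, the Bernoulli distribution fits this form with $\eta = \log\frac{p}{1-p}$ (the logit), $T(x) = x$, $A(\eta) = \log(1 + e^{\eta})$, and $h(x) = 1$.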
Neither model should be deployed
the validation gate should block both. The first model passes behavioral tests but does not meet the absolute performance floor, meaning it provides worse recommendations than the minimum acceptable standard. The second model has good aggregate performance but fails behavioral tests, meaning it has → Chapter 28: Quiz
next-token prediction
the single training objective behind GPT, LLaMA, and every decoder-only LLM. The model learns to predict the next token given all preceding tokens, and this deceptively simple objective turns out to be sufficient for learning grammar, facts, reasoning patterns, and even rudimentary world models. → Chapter 11: Large Language Models — Architecture, Training, Fine-Tuning, RAG, and Practical Applications
Neyman orthogonality
the same principle underlying DML (Section 19.4). → Chapter 19: Causal Machine Learning — Heterogeneous Treatment Effects, Uplift Modeling, and Double Machine Learning
no significant shift
the distribution is stable. A PSI between 0.1 and 0.25 indicates **moderate shift** — the distribution has changed enough to warrant investigation but may not require immediate action. A PSI above 0.25 indicates **significant shift** — the distribution has changed substantially and action is require → Chapter 28: Quiz
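The PSI computation and the three thresholds can be sketched in numpy (bin edges taken from reference quantiles; the clipping constant is an implementation choice of this sketch):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index of `actual` against reference `expected`."""
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1)[1:-1])  # interior edges
    e = np.bincount(np.searchsorted(cuts, expected), minlength=bins) / len(expected)
    a = np.bincount(np.searchsorted(cuts, actual), minlength=bins) / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)   # guard empty bins
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
ref = rng.normal(0, 1, 50_000)
psi_stable = psi(ref, rng.normal(0, 1, 50_000))    # same distribution
psi_shifted = psi(ref, rng.normal(1, 1, 50_000))   # mean shifted by one sd
print(psi_stable)    # < 0.1: no significant shift
print(psi_shifted)   # > 0.25: significant shift, action required
```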
non-numerical
selecting the best feature, choosing a hyperparameter, picking the most popular category — where adding continuous noise to the output is not meaningful. The Laplace and Gaussian mechanisms add noise to numerical answers; the exponential mechanism randomizes the *selection* of a discrete answer, bia → Chapter 32: Quiz
Non-obvious dependencies
an ML decision like choosing real-time serving implies downstream requirements (online feature store, streaming pipeline, GPU infrastructure) that should be documented explicitly. (2) **Evolution through experimentation** — decisions based on A/B test results (e.g., "+12% engagement for session-awar → Chapter 24: Quiz

O

Observations:
The paper is from a university research group you have not previously encountered. - The method requires constructing a hypergraph from user sessions, applying hypergraph convolution, and training with a contrastive loss. - The baselines include LightGCN, SASRec, and SR-GNN — all published between 2 → Chapter 37: Reading Research Papers — How to Stay Current, Evaluate Claims, and Separate Signal from Hype
Offline evaluation
holdout metrics and sliced metrics, taking minutes; (2) **Behavioral tests** — MFT, INV, DIR suites, taking minutes; (3) **Shadow evaluation** — live traffic comparison, taking 3-7 days; (4) **Canary deployment** — 5-10% real traffic, taking 3 days. The ordering is important because each stage is pr → Chapter 28: Quiz
Operation ordering
numpy and PyTorch may compute matrix multiplications using different BLAS implementations with different accumulation orders. Floating-point addition is not associative, so $(a + b) + c \neq a + (b + c)$ in general. (2) **Fused operations** — PyTorch may use fused multiply-add (FMA) instructions tha → Chapter 6: Quiz
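Non-associativity is easy to demonstrate in pure Python:

```python
# Floating-point addition is not associative, so different accumulation orders
# (different BLAS kernels, FMA instructions, reduction trees) give different results.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c    # the large terms cancel first -> 1.0
right = a + (b + c)   # 1.0 is absorbed into -1e16 before cancelling -> 0.0
print(left, right, left == right)
```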
Output gate
decides what to expose from the cell state: → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
overfitting
the model memorizes the training data without generalizing. → Chapter 7: Quiz

P

Pattern 4: Both losses plateau at a high value.
Underfitting. The model lacks capacity or the optimization is stuck. - Fix: Increase model size, increase learning rate, try a different optimizer (SGD → Adam), check for dead neurons (ReLU collapse). → Chapter 7: Training Deep Networks — Initialization, Batch Normalization, Dropout, Learning Rate Schedules, and the Dark Art of Making It Converge
Pattern 5: Loss oscillates without converging.
Learning rate is too high for the current phase, or batch size is too small. - Fix: Reduce learning rate, increase batch size, add learning rate warmup. → Chapter 7: Training Deep Networks — Initialization, Batch Normalization, Dropout, Learning Rate Schedules, and the Dark Art of Making It Converge
Pipeline code version
git SHA pinning the exact code that ran 2. **Configuration version** — hyperparameters, feature lists, and thresholds used 3. **Data versions** — Delta Lake table versions or partition identifiers for each input data source 4. **Feature store version** — Feast feature set version and the specific fe → Chapter 27: Quiz
placebo tests
the key inferential tool for synthetic control. By applying the method to each control state as if it were treated, the authors construct a distribution of "placebo effects" and show that California's effect is an extreme outlier, providing a permutation-based p-value. For the extension to multiple → Chapter 33: Further Reading
Platt scaling strengths and limitations:
**Strengths:** Simple, fast, works well when the uncalibrated scores are already monotonically related to true probabilities. Two parameters make overfitting unlikely even on small calibration sets. - **Limitations:** Assumes the logistic model is the right recalibration function. Fails when the mis → Chapter 34: Uncertainty Quantification — Calibration, Conformal Prediction, and Knowing What Your Model Doesn't Know
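A minimal numpy sketch that fits the two Platt parameters by gradient descent on the log loss (illustrative; production code would use a held-out calibration set and a proper optimizer):

```python
import numpy as np

def platt_fit(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * s + b) by gradient descent on the mean log loss."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        grad = p - labels                    # d(log loss)/d(logit)
        a -= lr * (grad * scores).mean()
        b -= lr * grad.mean()
    return a, b

# Synthetic over-confident model: its logits are 2x too steep, so true
# probabilities follow sigmoid(0.5 * score).
rng = np.random.default_rng(0)
scores = rng.uniform(-3, 3, 5000)
true_p = 1 / (1 + np.exp(-0.5 * scores))
labels = (rng.uniform(size=5000) < true_p).astype(float)
a, b = platt_fit(scores, labels)
print(a, b)   # a recovers ~0.5, b ~0.0
```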
Position-wise FFN
same as in the encoder. → Chapter 10: The Transformer Architecture — Attention Is All You Need (and Why It Changed Everything)
post-LN
layer normalization applied after the residual addition: → Chapter 10: The Transformer Architecture — Attention Is All You Need (and Why It Changed Everything)
Potential outcomes:
$Y_i(1)$: Would patient $i$ be readmitted within 30 days if prescribed Drug X? - $Y_i(0)$: Would patient $i$ be readmitted within 30 days if prescribed standard therapy? → Case Study 1: MediCore Treatment Effect — Defining Potential Outcomes for Drug Efficacy
Practical guidance for practitioners:
**Peer-reviewed papers at top venues:** Lower your guard slightly. Focus your evaluation on methodology and production relevance, not whether the results are fabricated. - **Preprints from established research labs (Google, Meta, DeepMind, Microsoft Research, university groups with strong track reco → Chapter 37: Reading Research Papers — How to Stay Current, Evaluate Claims, and Separate Signal from Hype
pre-LN
layer normalization applied before the sub-layer: → Chapter 10: The Transformer Architecture — Attention Is All You Need (and Why It Changed Everything)
Primary sources:
**arXiv.** Subscribe to cs.LG, cs.CL, cs.CV, stat.ML, cs.IR (information retrieval), and stat.ME (methodology) as relevant to your work. Use arXiv Sanity Lite or Semantic Scholar alerts for personalized filtering. - **Conference proceedings.** Read the best paper awards and oral presentations from t → Chapter 37: Reading Research Papers — How to Stay Current, Evaluate Claims, and Separate Signal from Hype
prior
beliefs about $\theta$ before observing data - $p(D \mid \theta)$: **likelihood** — probability of the data given parameter value $\theta$ - $p(D) = \int p(D \mid \theta) p(\theta) d\theta$: **marginal likelihood** (evidence) — normalizing constant - $p(\theta \mid D)$: **posterior** — updated belie → Chapter 20: Quiz
Problem formulation
translating a vague business question into a well-defined data science problem — is the highest-value activity a data scientist performs. A candidate who selects the perfect algorithm but frames the wrong problem produces negative value. **Communication** — explaining results to non-technical stakeh → Chapter 39: Quiz
progressive rollout
a staged increase in traffic percentage with monitoring at each stage. → Chapter 29: Continuous Training and Deployment — CI/CD for ML, Canary Deployments, Shadow Mode, and Progressive Rollout
propensity score
and achieve the same bias reduction. → Chapter 18: Causal Estimation Methods — Matching, Propensity Scores, Instrumental Variables, Difference-in-Differences, and Regression Discontinuity
Propensity Score Matching (PSM):
Estimate $e(x) = P(W=1 \mid X=x)$ using logistic regression, gradient boosting, or a neural network - Match each treated unit to the nearest control unit(s) in propensity score space - Estimate ATE from the matched sample → Appendix J: Causal Inference Identification Guide
pseudo-counts
equivalent to having already observed $\alpha - 1$ successes and $\beta - 1$ failures before the experiment began. → Chapter 20: Bayesian Thinking — Priors, Posteriors, and Why Frequentist vs. Bayesian Is the Wrong Debate
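A worked example of the pseudo-count arithmetic (pure Python):

```python
# A Beta(alpha, beta) prior acts like alpha - 1 prior successes and
# beta - 1 prior failures. After observing s successes and f failures, the
# posterior is Beta(alpha + s, beta + f): observed counts add to pseudo-counts.
alpha, beta = 3, 7      # prior: as if 2 successes and 6 failures already seen
s, f = 10, 20           # observed data
post_a, post_b = alpha + s, beta + f

prior_mean = alpha / (alpha + beta)       # 0.30
mle = s / (s + f)                         # ~0.333
post_mean = post_a / (post_a + post_b)    # 0.325
print(prior_mean, mle, post_mean)
# The posterior mean sits between the prior mean and the MLE,
# pulled toward the data as the observed counts dominate the pseudo-counts.
```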

R

Reading the diagram:
Points above the diagonal: the model is *under-confident* (it predicts 0.6 but the actual rate is 0.8) - Points below the diagonal: the model is *over-confident* (it predicts 0.8 but the actual rate is 0.6) - Horizontal gap between the point and the diagonal is the calibration error for that bin → Appendix G: Evaluation Metrics Reference
Regression adjustment
the most general form — regresses $Y$ on treatment, pre-experiment covariates, and their interactions. Lin (2013) showed that the fully interacted regression estimator is at least as efficient as the simple difference-in-means and is consistent even if the regression model is misspecified. Modern ex → Chapter 33: Rigorous Experimentation at Scale — Multi-Armed Bandits, Interference Effects, and Experimentation Platforms
Regularization
add dropout, weight decay (L2 regularization), or both. (2) **More data** — if available, more training data reduces overfitting. (3) **Simpler model** — reduce the number of layers or neurons. (4) **Early stopping** — stop training when validation loss starts increasing. (5) **Data augmentation** — → Chapter 6: Quiz
relevance
how strongly the instrument predicts treatment. → Chapter 18: Quiz
Reset gate
controls how much of the previous state to use when computing the candidate: → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
Residual (skip) connections
provide a direct gradient highway from the loss to earlier layers, bypassing the vanishing gradient path. 2. **Normalization layers** (batch norm or layer norm) — keep activations and gradients in a healthy range at every layer. 3. **Proper initialization** — He initialization for ReLU networks ensu → Chapter 7: Quiz
Residual connection
**Pre-layer normalization** (again) - **Feed-forward network** — typically a gated variant: SwiGLU (Shazeer, 2020), computed as $\text{FFN}(x) = (W_1 x \odot \sigma(W_g x)) W_2$, where $\sigma$ is the SiLU activation and $\odot$ is element-wise multiplication - **Residual connection** → Chapter 11: Large Language Models — Architecture, Training, Fine-Tuning, RAG, and Practical Applications
root cause
that message passing is inherently a low-pass filtering operation. They delay over-smoothing but do not eliminate it. **Graph transformers**, which replace local message passing with global attention, fundamentally avoid the depth-smoothing tradeoff but sacrifice the sparsity advantage and local ind → Chapter 14: Quiz
Rules of thumb for weight diagnostics:
Effective sample size ratio below 0.5 (ESS is less than half the actual sample size) suggests extreme weights. - If the top 5% of weights carry more than 50% of the total weight, the estimate is driven by a handful of observations. - Propensity scores below 0.05 or above 0.95 signal positivity viola → Chapter 18: Causal Estimation Methods — Matching, Propensity Scores, Instrumental Variables, Difference-in-Differences, and Regression Discontinuity
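The first two rules of thumb can be checked mechanically. A numpy sketch using the Kish effective sample size (function name ours):

```python
import numpy as np

def weight_diagnostics(w):
    """Return (ESS ratio, top-5% weight share) for a weight vector.

    ESS ratio below 0.5, or top-5% share above 0.5, triggers the
    rules of thumb above.
    """
    w = np.asarray(w, float)
    n = len(w)
    ess = w.sum() ** 2 / (w ** 2).sum()    # Kish effective sample size
    k = max(1, int(np.ceil(0.05 * n)))     # size of the top 5%
    top_share = np.sort(w)[-k:].sum() / w.sum()
    return ess / n, top_share
```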
Rules of thumb:
$\hat{R} < 1.01$: excellent convergence - $1.01 \leq \hat{R} < 1.05$: acceptable for most purposes - $\hat{R} \geq 1.05$: **do not trust these samples** — investigate and rerun → Chapter 21: Bayesian Modeling in Practice — PyMC, Hierarchical Models, and When Bayesian Methods Earn Their Complexity
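A minimal split-$\hat{R}$ computation, for intuition only; production diagnostics should use ArviZ, which additionally rank-normalizes:

```python
import numpy as np

def split_rhat(chains):
    """Split-R-hat sketch. `chains` is (n_chains, n_draws).

    Each chain is split in half so that within-chain trends also
    inflate R-hat, not just between-chain disagreement.
    """
    chains = np.asarray(chains, float)
    n = chains.shape[1] // 2
    halves = np.vstack([chains[:, :n], chains[:, n:2 * n]])
    within = halves.var(axis=1, ddof=1).mean()        # W
    between = n * halves.mean(axis=1).var(ddof=1)     # B
    var_plus = (n - 1) / n * within + between / n
    return float(np.sqrt(var_plus / within))
```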
Rung 1: Association (Seeing)
$P(Y \mid X)$. Questions about observed joint distributions. Example: "What is the click-through rate for users who were shown item X?" → Chapter 15: Quiz
Rung 2: Intervention (Doing)
$P(Y \mid \text{do}(X))$. Questions about the effects of active interventions. Example: "If we recommend item X to a random user, what is the expected click probability?" → Chapter 15: Quiz
Rung 3: Counterfactual (Imagining)
$P(Y_{x'} \mid X=x, Y=y)$. Questions about specific individuals under alternative conditions. Example: "This user clicked on item X after we recommended it. Would they have clicked if we had NOT recommended it?" → Chapter 15: Quiz

S

sample space
the set of all possible outcomes - $\mathcal{F}$ is a **$\sigma$-algebra** — a collection of subsets of $\Omega$ (the "events" we can assign probabilities to) - $P: \mathcal{F} \to [0, 1]$ is a **probability measure** satisfying the Kolmogorov axioms → Chapter 3: Probability Theory and Statistical Inference
sampling variability
they do not account for **systematic bias** from confounding, model misspecification, or SUTVA violations. → Chapter 16: Quiz
Scale
foundation models are trained on orders of magnitude more data (billions of tokens/images vs. millions), (2) **Generality** — they transfer to a wide range of tasks without architectural modification, not just tasks similar to the pretraining task, and (3) **Emergence** — they exhibit capabilities t → Chapter 13: Quiz
Scope
transformer attention is global (all positions attend to all), while GAT attention is local (only neighbors attend). (2) **Score function** — GAT uses an additive scoring function with a single attention vector $\mathbf{a}$, while transformers use separate query/key projections with dot-product scor → Chapter 14: Quiz
Secondary sources:
**Research newsletters.** The Batch (Andrew Ng), Import AI (Jack Clark), NLP Newsletter (Sebastian Ruder). These provide curated summaries but should be treated as pointers, not substitutes for reading the actual papers. - **Team reading groups.** If your team does not have a paper reading group, st → Chapter 37: Reading Research Papers — How to Stay Current, Evaluate Claims, and Separate Signal from Hype
self-supervised learning
a training paradigm where the supervision signal comes from the data itself, not from human labels. → Chapter 13: Transfer Learning, Foundation Models, and the Modern Deep Learning Workflow
sequences
ordered collections of variable length where position matters. → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
sequential patterns
the order of items, not just their categories. → Case Study 2: StreamRec Session Modeling — Predicting Next Item from Click Sequences
served to real users
the canary model's output determines what users see. This makes canary evaluation more realistic (it measures actual user behavior, not predicted behavior) but also more risky (a bad canary model affects real users). → Chapter 29: Continuous Training and Deployment — CI/CD for ML, Canary Deployments, Shadow Mode, and Progressive Rollout
Serving patterns
batch, real-time, and near-real-time — differ in latency, freshness, cost, and infrastructure complexity. The right choice depends on the product requirements, not the model architecture. Hybrid patterns (real-time with batch fallback) are common and effective. → Chapter 24: ML System Design — Architecture Patterns for Real-World Machine Learning
set operation
it computes pairwise similarities and weighted averages without any notion of position or order. If the input sequence is permuted, the outputs are the same permutation of the original outputs (permutation equivariance). Without positional encoding, the transformer treats "the dog bit the man" ident → Chapter 10: Quiz
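The permutation-equivariance claim is easy to verify numerically. A sketch with $Q = K = V = X$ and no learned projections (adding shared projections does not change the property):

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention without positional encoding."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)              # pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)          # row-wise softmax
    return A @ X                               # weighted averages

rng = np.random.default_rng(42)
X = rng.normal(size=(5, 4))                    # 5 tokens, dimension 4
perm = rng.permutation(5)

# Permuting the input permutes the output identically: without
# positional encoding, token order carries no information.
out_perm = self_attention(X[perm])
perm_out = self_attention(X)[perm]
```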
Single feature computation path
the same code (or pipeline) writes to both the online and offline stores, eliminating implementation mismatches. (2) **Backfill** — when a new feature is added, it is retroactively computed for historical data in the offline store. (3) **Point-in-time joins** — the training pipeline retrieves featur → Chapter 24: Quiz
small fraction
often cited as roughly 5% — of the total system. The rest consists of data collection and verification, feature extraction and management, serving infrastructure, monitoring and alerting, configuration management, process management, machine resource management, and analysis tools. The surrounding i → Chapter 24: Quiz
Smooth everywhere
GELU is differentiable at $z = 0$, avoiding the sharp corner of ReLU. This smoothness can benefit optimization. (2) **No dead neurons** — GELU has a non-zero gradient for slightly negative inputs, so neurons are not permanently killed. (3) **Probabilistic interpretation** — GELU can be viewed as a s → Chapter 6: Quiz
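The exact GELU is $z \cdot \Phi(z)$, where $\Phi$ is the standard-normal CDF; a stdlib sketch that makes properties (1) and (2) checkable:

```python
import math

def gelu(z):
    """Exact GELU: z * Phi(z), with Phi computed via erf."""
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

A finite-difference derivative at $z = -0.5$ comes out to roughly 0.13: nonzero for a negative input, exactly where ReLU's gradient would be zero.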
smooths the loss landscape
it makes the loss function more Lipschitz continuous and makes the gradients more predictive of the actual loss change. → Chapter 7: Training Deep Networks — Initialization, Batch Normalization, Dropout, Learning Rate Schedules, and the Dark Art of Making It Converge
state-of-the-art (SOTA) claims
papers that claim to achieve the best performance on a specific benchmark. This culture has both benefits (clear progress tracking) and costs (incentivizes overfitting to benchmarks rather than solving problems). → Chapter 37: Reading Research Papers — How to Stay Current, Evaluate Claims, and Separate Signal from Hype
Statistical behavior:
$X$ and $Y$ are marginally dependent: $X \not\!\perp\!\!\!\perp Y$. - $X$ and $Y$ are conditionally independent given $Z$: $X \perp\!\!\!\perp Y \mid Z$. → Chapter 17: Graphical Causal Models — DAGs, d-Separation, and Structural Causal Models
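This fork behavior shows up directly in simulation (all coefficients illustrative; for linear-Gaussian data, "conditioning on $Z$" amounts to correlating the residuals after regressing out $Z$):

```python
import numpy as np

# Fork Z -> X, Z -> Y: no arrow between X and Y.
rng = np.random.default_rng(0)
n = 100_000
Z = rng.normal(size=n)
X = 2.0 * Z + rng.normal(size=n)
Y = -1.5 * Z + rng.normal(size=n)

marginal_corr = np.corrcoef(X, Y)[0, 1]   # far from zero

# Residualize on Z (least-squares projection), then correlate.
rX = X - Z * (X @ Z) / (Z @ Z)
rY = Y - Z * (Y @ Z) / (Z @ Z)
partial_corr = np.corrcoef(rX, rY)[0, 1]  # approximately zero
```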
Statistical validation
specifically, PSI-based distribution shift detection or a statistical test on the feature's mean. Schema validation (Great Expectations or Pandera) would not catch this because the schema is unchanged: the column exists, has the correct type, is non-null, and values may still be within the valid ran → Chapter 28: Quiz
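A minimal PSI computation (equal-width bins; the binning scheme and the common 0.1 / 0.25 decision thresholds are conventions, not standards):

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a reference window
    (`expected`) and a current window (`actual`).

    PSI = sum over bins of (a - e) * ln(a / e) on bin proportions.
    """
    expected = np.asarray(expected, float)
    actual = np.asarray(actual, float)
    lo = min(expected.min(), actual.min())
    hi = max(expected.max(), actual.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e = np.clip(e, 1e-6, None)   # avoid log(0) in empty bins
    a = np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))
```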
Strengths:
Massive ecosystem: 80+ provider packages (AWS, GCP, Azure, Kubernetes, Spark, dbt) - Battle-tested at enormous scale (Airbnb, Google, Netflix, Uber) - Rich web UI for monitoring, debugging, and manual intervention - Mature community with extensive documentation → Chapter 27: ML Pipeline Orchestration — Airflow, Dagster, Prefect, and Designing Robust Data Workflows
surprises are diagnostic
they reveal gaps in the team's mental model of the system, gaps that may cause future surprises if not addressed. In the StreamRec retrospective, the "surprised_by" item (cold-start bias is fairness bias) revealed a gap in the team's mental model: they understood cold-start as a model quality proble → Chapter 36: Quiz
synthetic data
artificial data that preserves the statistical properties of the original dataset but does not correspond to any real individual. Analysts and model developers work with the synthetic data, never touching the original. → Chapter 32: Privacy-Preserving Data Science — Differential Privacy, Federated Learning, and Synthetic Data

T

temporal data leakage
the inclusion of future information in the training data. Without a point-in-time join, a naive join retrieves the latest feature value, which may include data generated after the training example's event. The model then learns from signals that are not available at serving time, producing inflated → Chapter 25: Quiz
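The contrast between a naive "latest value" join and a point-in-time lookup can be sketched with the standard library (timestamps and values illustrative):

```python
import bisect

def point_in_time_value(history, event_time):
    """Latest feature value at or before event_time.

    `history` is a list of (timestamp, value) sorted by timestamp.
    A naive join takes history[-1][1], the latest value overall,
    which may postdate the training example and leak the future.
    """
    times = [t for t, _ in history]
    i = bisect.bisect_right(times, event_time)
    if i == 0:
        return None        # no feature value existed yet
    return history[i - 1][1]

# Feature recomputed at t = 1, 5, 9; the training event happened at t = 6.
history = [(1, 0.2), (5, 0.7), (9, 0.9)]
leaky = history[-1][1]                       # 0.9: uses the future
correct = point_in_time_value(history, 6)    # 0.7: as of event time
```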
Tests for features and data
schema validation, feature importance correlation, data pipeline unit tests, feature coverage, and data monitoring; (2) **Tests for model development** — training reproducibility, holdout quality, sliced quality, staleness detection, and proxy metrics; (3) **Tests for ML infrastructure** — training → Chapter 28: Quiz
The difference
$H(\bar{p}) - \frac{1}{M}\sum_m H(p_m)$ — measures the *epistemic* uncertainty from model disagreement. This is closely related to mutual information between the model index and the prediction. → Chapter 4: Information Theory for Data Science — Entropy, KL Divergence, and Why Your Loss Function Works
The fundamental problem of causal inference
we can never observe both potential outcomes for the same individual — makes causal inference a missing data problem that cannot be solved by more data or better models alone. It requires assumptions. → Chapter 15: Beyond Prediction — Why Correlation Isn't Enough and What Causal Inference Offers
The Ladder of Causation
association, intervention, counterfactual — formalizes the hierarchy of causal reasoning and establishes that higher rungs require more than data; they require causal models. → Chapter 15: Beyond Prediction — Why Correlation Isn't Enough and What Causal Inference Offers
The learning rate schedule
particularly warmup + cosine decay — can matter as much as the choice of optimizer. → Chapter 2: Multivariate Calculus and Optimization
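A common concrete form, sketched framework-free (the linear warmup and parameter names are assumptions; implementations vary):

```python
import math

def lr_schedule(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps   # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```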
There is a clear overfitting pattern
training RMSE decreases monotonically with rank, but held-out RMSE reaches a minimum around the true latent dimension and increases thereafter. 4. **Simple SVD on the imputed matrix is a reasonable baseline** but has fundamental limitations: the imputation step biases the decomposition. In Chapter 2 → Chapter 1: Linear Algebra for Machine Learning
They are equal when there is no confounding
specifically, when $X$ has no common causes with $Y$, meaning all paths between $X$ and $Y$ are directed from $X$ to $Y$ (no backdoor paths). In this case, the observational conditional distribution already reflects the causal effect. Formally: if $\{Y(0), Y(1)\} \perp\!\!\!\perp X$ (unconditional i → Chapter 17: Quiz
Three assumptions
SUTVA, ignorability, positivity — are required for identification. SUTVA requires domain knowledge. Ignorability is untestable. Positivity is partially testable. → Chapter 16: The Potential Outcomes Framework — Counterfactuals, ATEs, and the Fundamental Problem of Causal Inference
Three estimands
ATE, ATT, ATU — answer different questions and may have different values when treatment effects are heterogeneous. Choosing the right estimand is as important as estimating it accurately. → Chapter 16: The Potential Outcomes Framework — Counterfactuals, ATEs, and the Fundamental Problem of Causal Inference
Three junction types
fork, chain, collider — determine how information flows through a graph. Forks and chains are open by default and blocked by conditioning; colliders are blocked by default and opened by conditioning. This asymmetry is the fundamental rule of graphical causal reasoning. → Chapter 17: Graphical Causal Models — DAGs, d-Separation, and Structural Causal Models
Training loop
forward pass, loss computation, backward pass, optimizer step, gradient accumulation, (2) **Evaluation** — periodic evaluation on a validation set with custom metrics, (3) **Logging** — integration with TensorBoard, Weights & Biases, and other logging frameworks, (4) **Checkpointing** — automatic mo → Chapter 13: Quiz
training-serving skew
will produce predictions that are silently, catastrophically wrong. → Chapter 24: ML System Design — Architecture Patterns for Real-World Machine Learning

U

uncertainty region
where the model's score is close to the decision boundary. For instances far from the boundary (very high or very low scores), the model's prediction is accepted as-is. For instances near the boundary, the prediction is flipped in favor of the disadvantaged group. → Chapter 31: Fairness in Machine Learning — Definitions, Impossibility Results, Mitigation Strategies, and Organizational Practice
underdispersed
it does not capture the full variability in the data. Common causes: (1) the model assumes a distribution with too little variance (e.g., Poisson when the data are overdispersed — a negative binomial would be more appropriate), (2) important sources of variation are missing from the model (e.g., a h → Chapter 21: Quiz
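Cause (1) can be spotted before fitting anything: for Poisson-distributed counts the index of dispersion (variance/mean) is near 1, while overdispersed counts push it well above 1. A numpy simulation (parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Both samples have mean ~4; the negative binomial has much more spread.
poisson_like = rng.poisson(lam=4.0, size=50_000)
overdispersed = rng.negative_binomial(n=2, p=2 / (2 + 4.0), size=50_000)

dispersion_p = poisson_like.var() / poisson_like.mean()     # ~ 1
dispersion_nb = overdispersed.var() / overdispersed.mean()  # ~ 3
```

A Poisson model fit to the second sample would be underdispersed relative to the data; a negative binomial absorbs the extra variance.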
Understanding Why
Understanding WHY an algorithm works (not just how to call it) is what separates a senior data scientist from someone who copies code from Stack Overflow. The math serves understanding, not credentials. 2. **Prediction ≠ Causation** — The most important and most commonly confused distinction in applied data s → Advanced Data Science — Master Outline
unknown unknowns
failure modes you have never seen and could not have predicted. The March 3 incident above was an unknown unknown: no one anticipated that a feature pipeline would silently convert integers to timestamps, so no monitor existed for it. → Chapter 30: Monitoring, Observability, and Incident Response — Keeping ML Systems Healthy in Production
Update gate
controls how much of the previous state to keep: → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
User Preference
the latent variable representing a user's underlying interest in specific content. User Preference is a common cause (fork) of both the Recommendation (the algorithm recommends items it predicts the user will like, which is driven by preferences) and Engagement (users engage more with content that m → Chapter 17: Quiz

V

Valid adjustment sets:
$\{$Severity, Age$\}$: Blocks all three backdoor paths. Does not include descendants of Drug X. Valid. - $\{$Severity$\}$: Blocks paths 1 and 2. Path 3 (through Age) remains open. Valid only if there is no path Drug X $\leftarrow$ Age $\to$ Hospitalization that is unblocked — not valid in this graph → Chapter 17: Graphical Causal Models — DAGs, d-Separation, and Structural Causal Models
validation instrument
its purpose is to confirm or reject a hypothesis. → Chapter 39: Quiz
vanishing gradient problem
gradients shrink exponentially as they propagate backward through the network, causing early layers to learn extremely slowly. → Chapter 7: Quiz
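The exponential shrinkage is visible in a few lines: each backward step through a sigmoid layer multiplies the gradient by $W^\top \mathrm{diag}(\sigma'(z))$, and $\sigma' \le 0.25$. A numpy sketch (random weights and pre-activations; depth and width are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, depth = 32, 30
grad = np.ones(dim)          # gradient arriving at the top layer
norms = []
for _ in range(depth):
    W = rng.normal(scale=1.0 / np.sqrt(dim), size=(dim, dim))
    z = rng.normal(size=dim)                 # pre-activations
    s = 1.0 / (1.0 + np.exp(-z))             # sigmoid
    grad = W.T @ (grad * s * (1.0 - s))       # one backward step
    norms.append(np.linalg.norm(grad))
```

By the bottom of the stack the gradient norm has collapsed by many orders of magnitude, which is why the early layers barely move.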
Variable selection networks
learned attention over input features, allowing the model to discover which variables matter at each time step. 2. **Static covariate encoders** — time-invariant features (e.g., item category, geographic region) influence the network's behavior through context vectors. 3. **Temporal self-attention** → Chapter 23: Advanced Time Series and Temporal Models — State-Space Models, Temporal Fusion Transformers, and Probabilistic Forecasting

W

washout periods
inserting untreated buffer periods between alternations; (2) **longer periods** — reducing the fraction of time contaminated by carryover; and (3) **regression adjustment** — explicitly modeling the carryover as a function of the previous period's assignment, e.g., $Y_t = \mu + \tau D_t + \gamma D_{ → Chapter 33: Quiz
Weaknesses:
Task-centric model makes data lineage opaque: you know that `train_retrieval` depends on `compute_features`, but Airflow does not know *what data* flows between them - XCom serialization limits: large data must be stored externally (S3, GCS) with only a reference passed through XCom - DAG parsing ov → Chapter 27: ML Pipeline Orchestration — Airflow, Dagster, Prefect, and Designing Robust Data Workflows
Weisfeiler-Leman (WL) graph isomorphism test
a classical algorithm from graph theory. → Chapter 14: Graph Neural Networks and Geometric Deep Learning — When Your Data Has Structure Beyond Grids and Sequences
What is missing (by 2024 standards):
No confidence intervals or multi-run statistics. Single-run numbers only. - No evaluation beyond translation (the English constituency parsing result in Table 4 is a limited generalization test). - No analysis of failure modes — what kinds of sentences does the Transformer get wrong? - No scaling an → Case Study 1: Paper Walkthrough — "Attention Is All You Need" Using the Three-Pass Method
What to look for:
**Gradient norm decreasing with depth:** Vanishing gradients. Early (lower) layers learn slowly while later layers dominate. Fix: residual connections, better initialization, normalization. - **Gradient norm increasing with depth:** Exploding gradients. Fix: gradient clipping, lower learning rate. - → Chapter 7: Training Deep Networks — Initialization, Batch Normalization, Dropout, Learning Rate Schedules, and the Dark Art of Making It Converge
When consistency fails:
**Multiple treatment versions.** "Exercise" is not a single treatment. Running 30 minutes daily, swimming 45 minutes three times per week, and weight training every other day are all "exercise," but they may have different effects. If $D_i = 1$ means "exercises," then $Y_i(1)$ is not well-defined be → Chapter 16: The Potential Outcomes Framework — Counterfactuals, ATEs, and the Fundamental Problem of Causal Inference
When no-interference fails:
**Vaccination programs.** If enough people around me are vaccinated, my risk of infection decreases even if I am not vaccinated (herd immunity). My $Y_i(0)$ depends on $D_j$ for other individuals. - **Marketplace experiments.** If StreamRec recommends item X to user A but not user B, and both users → Chapter 16: The Potential Outcomes Framework — Counterfactuals, ATEs, and the Fundamental Problem of Causal Inference
When to avoid:
The query space is too large to enumerate (e.g., arbitrary text queries in search) - Real-time context matters (e.g., current session behavior, time-of-day) - The application requires sub-second freshness → Appendix H: ML System Design Patterns
When to use bidirectional RNNs:
**Encoding tasks** where the entire sequence is available at once: classification, tagging, encoding for seq2seq models. - **Never for autoregressive generation** where future tokens are not available at prediction time. → Chapter 9: Recurrent Networks and Sequence Modeling — RNNs, LSTMs, GRUs, and Their Limitations
When to use which framework:
Use the **potential outcomes framework** when the treatment is well-defined, the estimand (ATE, ATT, LATE) is clear, and you are primarily concerned with estimation and inference. - Use the **graphical framework** when the causal structure is complex (many variables, potential mediators, colliders), → Appendix J: Causal Inference Identification Guide
When to use which parameterization:
**Non-centered:** When groups have small samples (the data provide little information about individual $\theta_j$, so the posterior resembles the prior, and the funnel geometry dominates). This is the common case. - **Centered:** When groups have large samples (the data dominate the prior for each $ → Chapter 21: Bayesian Modeling in Practice — PyMC, Hierarchical Models, and When Bayesian Methods Earn Their Complexity
When to use:
The entity space is finite and enumerable (all users, all products) - Prediction freshness of hours is acceptable - Real-time features do not significantly improve quality - The team lacks real-time serving infrastructure - StreamRec example: pre-compute the top-100 recommendations for each user dai → Appendix H: ML System Design Patterns

Z

zero inference overhead
the merged model has identical architecture and latency to the original, and (4) multiple LoRA adapters can be swapped in and out of a single base model for different tasks, enabling multi-task serving with a single GPU. → Chapter 13: Quiz
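The zero-overhead claim follows from merging the low-rank update into the base weight: $W' = W + \frac{\alpha}{r} BA$ has exactly the original shape and latency. A numpy sketch (shapes, names, and the $\alpha/r$ scaling follow the usual convention but are our choices here):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 16, 2, 4
W = rng.normal(size=(d_out, d_in))   # frozen base weight
A = rng.normal(size=(r, d_in))       # trainable down-projection
B = rng.normal(size=(d_out, r))      # trainable up-projection
x = rng.normal(size=d_in)

# Training-time path: base output plus the low-rank correction.
adapter_out = W @ x + (alpha / r) * (B @ (A @ x))

# One-time merge for serving: same shape and latency as W alone.
merged = W + (alpha / r) * B @ A
merged_out = merged @ x
```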