Glossary

This glossary defines every key term introduced in Intermediate Data Science. Terms are listed alphabetically. The parenthetical chapter reference indicates where the term is first introduced or receives its primary treatment.


Accuracy (Chapter 16): The number of correct predictions (true positives + true negatives) divided by the total number of predictions. Misleading for imbalanced datasets — a model that always predicts the majority class achieves high accuracy while being useless.
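
A toy sketch in plain Python showing why accuracy misleads on imbalanced data (hypothetical 90/10 labels):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# A degenerate model on a 90/10 imbalanced dataset:
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100               # always predicts the majority class
print(accuracy(y_true, y_pred))  # 0.9 — despite never catching the minority class
```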

Acquisition function (Chapter 18): In Bayesian optimization, the function that decides which hyperparameter configuration to try next. Balances exploration (trying unexplored regions) with exploitation (refining known good regions). Common choices: Expected Improvement (EI), Upper Confidence Bound (UCB).

ADASYN (Chapter 17): Adaptive Synthetic Sampling. An oversampling method similar to SMOTE that generates more synthetic examples in regions where the minority class is harder to learn. Focuses synthesis on boundary regions.

Agglomerative clustering (Chapter 20): A bottom-up hierarchical clustering approach. Each observation starts as its own cluster; clusters are iteratively merged based on a linkage criterion until a single cluster remains. Visualized with dendrograms.

Algorithmic bias (Chapter 33): Systematic errors in ML outputs that produce unfair outcomes for specific groups. Can originate from biased training data, biased labels, inappropriate model choices, or feedback loops in deployed systems.

Alternative hypothesis (Chapter 3): In A/B testing, the claim that there is a real difference between control and treatment groups. Denoted H_1. The null hypothesis is rejected in its favor when the p-value falls below the significance threshold.

Anomaly (Chapter 22): A data point that deviates significantly from expected behavior. Also called an outlier. In production systems, anomalies can indicate fraud, equipment failure, or data quality issues.

Anomaly score (Chapter 22): A continuous value assigned by an anomaly detection algorithm indicating how "abnormal" a data point is. Higher scores mean greater deviation from expected patterns.

Anti-join (Chapter 5): A SQL join that returns rows from the left table that have no match in the right table. Commonly implemented with LEFT JOIN ... WHERE right.key IS NULL. Useful for finding missing records.
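
A runnable sketch of the LEFT JOIN ... IS NULL pattern, using Python's built-in sqlite3 with made-up customers/orders tables:

```python
import sqlite3

# Toy data: find customers who have placed no orders (an anti-join).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cal');
    INSERT INTO orders VALUES (1, 9.99), (1, 4.50), (3, 20.0);
""")
rows = conn.execute("""
    SELECT c.name
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.customer_id IS NULL
""").fetchall()
print(rows)  # [('Ben',)] — Ben has no matching order rows
```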

Apache Arrow (Chapter 28): A cross-language columnar memory format for flat and hierarchical data. Used by Polars internally for fast in-memory analytics. Enables zero-copy reads between libraries.

Apriori algorithm (Chapter 23): An algorithm for mining frequent itemsets from transaction data. Exploits the anti-monotone property: if an itemset is infrequent, all its supersets are also infrequent. Enables efficient pruning of the search space.

ARIMA (Chapter 25): AutoRegressive Integrated Moving Average. A classical time series model with three components: AR (autoregressive — past values predict future), I (differencing for stationarity), MA (moving average of past errors). Parameterized as ARIMA(p, d, q).

Association rules (Chapter 23): Patterns of the form "if A, then B" discovered in transaction data. Characterized by support, confidence, and lift. Used in market basket analysis and cross-selling.

AUC-PR (Chapter 16): Area Under the Precision-Recall Curve. A threshold-independent metric that summarizes precision-recall tradeoffs. More informative than AUC-ROC for imbalanced datasets because it focuses on the minority class.

AUC-ROC (Chapter 16): Area Under the Receiver Operating Characteristic Curve. Measures a classifier's ability to distinguish between classes across all possible thresholds. Ranges from 0 to 1: a value of 0.5 corresponds to random guessing and 1.0 to perfect separation. Can be misleadingly optimistic for heavily imbalanced data.

Augmented Dickey-Fuller test (Chapter 25): A statistical test for stationarity in time series data. The null hypothesis is that the series has a unit root (is non-stationary). A low p-value indicates stationarity.

Autocorrelation (ACF) (Chapter 25): The correlation of a time series with a lagged version of itself. The ACF plot shows correlation at different lag values and is used to identify the MA order (q) in ARIMA models.

Autoencoder (Chapter 22): A neural network trained to reconstruct its input through a compressed representation (bottleneck). For anomaly detection, trained on normal data only; anomalies produce high reconstruction error.

Bagging (Chapter 13): Bootstrap Aggregating. An ensemble method that trains multiple models on different bootstrap samples of the training data and averages their predictions. Reduces variance. The foundation of Random Forests.

Bag of Words (Chapter 26): A text representation that counts word occurrences in a document, ignoring word order and grammar. Simple and fast but loses sequential and contextual information.
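
A minimal sketch of the idea using the stdlib Counter (lowercasing and whitespace splitting stand in for real tokenization):

```python
from collections import Counter

def bag_of_words(document):
    """Lowercase, split on whitespace, count occurrences; word order is discarded."""
    return Counter(document.lower().split())

bow = bag_of_words("The cat sat on the mat")
print(bow)  # counts only: {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
```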

BaseEstimator (Chapter 10): A scikit-learn base class for custom estimators. Provides get_params() and set_params() methods. Combined with TransformerMixin to create custom transformers that integrate into scikit-learn Pipelines.

Baseline model (Chapter 2): A simple model (predict majority class, predict mean, simple heuristic) that every more complex model must beat. Establishes the minimum bar for usefulness.

Batch prediction (Chapter 31): Generating predictions for a large dataset at once, typically on a schedule (e.g., nightly). Contrasted with real-time prediction. Simpler infrastructure, higher throughput, but stale predictions.

Bayesian optimization (Chapter 18): A hyperparameter tuning strategy that uses a surrogate model (typically a Gaussian process or tree-based model) to predict which configurations are likely to perform well, then selects the most promising one to try next. More efficient than random search.

Bayes' theorem (Chapter 4): A formula for updating the probability of a hypothesis given new evidence: P(A|B) = P(B|A) * P(A) / P(B). Foundation of Naive Bayes classifiers and Bayesian reasoning.
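
A worked example with hypothetical diagnostic-test numbers (1% prevalence, 95% sensitivity, 90% specificity):

```python
# P(disease | positive test) via Bayes' theorem.
p_disease = 0.01
p_pos_given_disease = 0.95   # sensitivity
p_pos_given_healthy = 0.10   # 1 - specificity (false positive rate)

# P(positive) by the law of total probability:
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # 0.088 — still unlikely despite a positive test
```

The counterintuitive result (under 9% despite a "95% accurate" test) is why the prior P(A) matters.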

Bias-variance tradeoff (Chapter 1): The tension between a model's ability to fit training data (low bias) and its ability to generalize to new data (low variance). Complex models have low bias but high variance; simple models have high bias but low variance.

Binary encoding (Chapter 7): An encoding method that represents each category as a binary number. Produces fewer columns than one-hot encoding for high-cardinality features. Each column represents one bit position.

Binning (Chapter 6): The process of converting a continuous variable into discrete intervals (bins). Can capture non-linear relationships in linear models. Risk: information loss if bin boundaries are poorly chosen.

Black (Chapter 29): An opinionated Python code formatter. Enforces a consistent style with minimal configuration. The "uncompromising" formatter: one way to format code, no arguments.

Blue-green deployment (Chapter 31): A deployment strategy that maintains two identical production environments. Traffic is switched from the current (blue) environment to the new (green) environment. Enables instant rollback.

Boosting (Chapter 14): An ensemble method that trains models sequentially, with each new model focusing on the errors of the previous ones. Reduces bias. The foundation of XGBoost, LightGBM, and CatBoost.

Box-Cox transformation (Chapter 6): A family of power transformations that stabilizes variance and makes data more normally distributed. Parameterized by lambda; includes log transformation (lambda=0) and square root (lambda=0.5) as special cases.

Brier score (Chapter 16): A proper scoring rule that measures the accuracy of probabilistic predictions. Calculated as the mean squared error between predicted probabilities and actual binary outcomes. Lower is better. Range: 0 (perfect) to 1.
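
A plain-Python sketch of the definition, with toy labels and probabilities:

```python
def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and binary outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)

y_true = [1, 0, 1, 0]
print(round(brier_score(y_true, [0.9, 0.1, 0.8, 0.3]), 4))  # 0.0375 — sharp and calibrated
print(brier_score(y_true, [0.5, 0.5, 0.5, 0.5]))            # 0.25 — uninformative baseline
```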

Buffer (Chapter 27): In geospatial analysis, a zone of a specified distance around a geometry. Used for proximity analysis: "count all stores within 5 km of each customer."

Calibration curve (Chapter 16): A plot comparing predicted probabilities to observed frequencies. A well-calibrated model's curve follows the diagonal: when it says "70% probability," the event occurs 70% of the time.

Canary deployment (Chapter 31): A deployment strategy that routes a small percentage of traffic to the new model version while monitoring for issues. Gradually increases the traffic share if no problems are detected.

Cardinality (Chapter 7): The number of unique values in a categorical variable. Low cardinality (2-10 values) and high cardinality (100+ values) require different encoding strategies.

CatBoost (Chapter 14): A gradient boosting library by Yandex. Key differentiator: ordered boosting to reduce target leakage and native categorical feature handling without manual encoding. Strong default performance.

Categorical variable (Chapter 7): A variable that takes on a limited number of discrete values representing categories or groups. Nominal categoricals have no ordering (color, country); ordinal categoricals have a natural order (education level, satisfaction rating).

Centroid (Chapter 20): The center point of a cluster, calculated as the mean of all points assigned to that cluster. K-Means iteratively updates centroids until convergence.

Changepoint (Chapter 25): A point in time where a time series exhibits a significant change in its statistical properties (trend, level, variance). Prophet detects changepoints automatically.

Chi-squared test (Chapter 9): A statistical test for the association between categorical variables. In feature selection, used to test whether a categorical feature and the target variable are independent. Also used for drift detection on categorical features.

Choropleth map (Chapter 27): A map in which areas (countries, states, counties) are shaded or patterned according to a data variable. Created with folium or geopandas for visualizing regional patterns.

Class imbalance (Chapter 17): A dataset condition where one class has significantly more observations than another. Common in real-world problems: fraud (0.1%), equipment failure (1%), churn (5-10%). Requires specialized handling.

Class weight (Chapter 17): A parameter in many scikit-learn estimators (class_weight='balanced') that assigns higher weight to minority class samples during training. Adjusts the loss function to penalize minority class errors more heavily.

Clustering (Chapter 20): An unsupervised learning technique that groups similar data points together without pre-defined labels. Exploratory by nature — there is no single "correct" clustering, only useful ones.

Coefficient path (Chapter 11): A plot showing how model coefficients change as regularization strength varies. For Lasso, coefficients shrink to exactly zero at different alpha values, effectively performing feature selection.

Coherence score (Chapter 26): A metric for evaluating topic model quality. Measures how semantically similar the top words in each topic are. Higher coherence generally indicates more interpretable topics.

Collaborative filtering (Chapter 24): A recommendation approach based on the assumption that users who agreed in the past will agree in the future. User-based CF finds similar users; item-based CF finds similar items.

Cold start problem (Chapter 24): The inability of collaborative filtering to make recommendations for new users (no interaction history) or new items (no user ratings). Addressed with content-based methods or hybrid approaches.

ColumnTransformer (Chapter 10): A scikit-learn class that applies different transformations to different subsets of columns. Essential for mixed-type datasets: scale numeric features, encode categorical features, all in one step.

Concept drift (Chapter 32): A change in the relationship between input features and the target variable over time. The rules of the problem change: what predicted churn six months ago no longer predicts churn today. Harder to detect than data drift.

Confidence (Chapter 23): In association rules, the probability of the consequent given the antecedent: P(B|A). A confidence of 0.8 for {bread} -> {butter} means 80% of transactions containing bread also contain butter.
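
Support, confidence, and lift for a made-up rule {bread} -> {butter}, computed directly from toy transactions:

```python
transactions = [
    {"bread", "butter"}, {"bread", "butter", "jam"}, {"bread"},
    {"butter"}, {"bread", "butter"}, {"milk"},
]
n = len(transactions)
support_a  = sum("bread" in t for t in transactions) / n              # P(A)
support_b  = sum("butter" in t for t in transactions) / n             # P(B)
support_ab = sum({"bread", "butter"} <= t for t in transactions) / n  # P(A and B)

confidence = support_ab / support_a  # P(B | A) = 0.75
lift = confidence / support_b        # 1.125 — > 1 means positive association
print(confidence, lift)
```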

Contamination parameter (Chapter 22): In Isolation Forest, the expected proportion of anomalies in the dataset. Sets the decision threshold. If you expect 1% of data points to be anomalous, set contamination=0.01.

Content-based filtering (Chapter 24): A recommendation approach that suggests items similar to what a user has previously liked, based on item features (genre, description, attributes). Does not require other users' data, avoiding the cold start problem for new items (though new users still need some interaction history).

Convergence (Chapter 4): The point at which an iterative optimization algorithm (like gradient descent) stops making meaningful progress. The loss function changes by less than a specified tolerance between iterations.

Conviction (Chapter 23): An association rule metric that measures the degree to which the antecedent and consequent are associated, compared to independence. Conviction of 1 means independence; higher values indicate stronger association.

Cookiecutter-data-science (Chapter 29): A standardized project structure template for data science projects. Organizes code into src/data/, src/features/, src/models/, and separates notebooks, reports, and raw/processed data.

Coordinate Reference System (CRS) (Chapter 27): A system that defines how geographic coordinates map to locations on Earth's surface. Mismatched CRS between datasets causes points to appear in wrong locations. EPSG:4326 (WGS84) is the most common.

Correlation matrix (Chapter 9): A table showing pairwise Pearson correlations between all numeric features. Used to identify multicollinearity. Visualized as a heatmap.

Cosine similarity (Chapter 24): A measure of similarity between two vectors based on the angle between them, ignoring magnitude. Ranges from -1 (opposite) to 1 (identical direction). Widely used in text similarity and recommendation systems.
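
A stdlib-only sketch of the formula (dot product over the product of norms):

```python
import math

def cosine_similarity(a, b):
    """Angle-based similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity([1, 1, 0], [2, 2, 0]), 6))  # 1.0 — same direction, magnitude ignored
print(round(cosine_similarity([1, 0], [0, 1]), 6))        # 0.0 — orthogonal vectors
```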

Cost-sensitive learning (Chapter 17): An approach that explicitly accounts for the different costs of different types of errors. Instead of minimizing total errors, minimizes total cost. Implemented via class weights, sample weights, or custom loss functions.

Covariate shift (Chapter 32): A type of data drift where the input feature distribution changes but the conditional distribution of the target given features remains the same. New types of customers appear, but the churn rules haven't changed.

CRISP-DM (Chapter 34): Cross-Industry Standard Process for Data Mining. A six-phase methodology: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment. The most widely referenced (if not followed) framework for DS projects.

Cross-entropy (Chapter 4): A loss function for classification that measures the difference between predicted probabilities and actual labels. Also called log-loss. For binary classification: -[y log(p) + (1-y) log(1-p)].
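
A direct translation of the binary formula, with eps clipping (a common implementation detail) to avoid log(0):

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Binary cross-entropy: -[y*log(p) + (1-y)*log(1-p)], averaged over samples."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip away from 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident correct predictions are cheap; confident wrong ones are very costly:
print(round(log_loss([1], [0.9]), 3))  # 0.105
print(round(log_loss([1], [0.1]), 3))  # 2.303
```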

Cross-validation (Chapter 16): A resampling method that partitions data into k folds, trains on k-1 folds, and evaluates on the held-out fold, rotating through all folds. Provides a more robust estimate of model performance than a single train-test split.

CTE (Common Table Expression) (Chapter 5): A temporary named result set in SQL defined with the WITH clause. Improves query readability and allows step-by-step logic. Recursive CTEs enable hierarchical queries.

Curse of dimensionality (Chapter 9): The phenomenon where data becomes increasingly sparse as the number of features grows. In high-dimensional spaces, distances between points converge, making similarity-based algorithms like KNN ineffective.

Dashboard (Chapter 34): A visual display of key metrics and model outputs designed for non-technical stakeholders. Effective dashboards show the "what" and "so what" — not the "how."

Dask (Chapter 28): A Python library for parallel and out-of-core computation. Provides a pandas-like API but operates on partitioned DataFrames that can exceed available RAM. Uses lazy evaluation and a task graph.

Data drift (Chapter 32): A change in the input feature distribution over time. The model receives data that looks different from its training data. Detected with PSI, KS tests, and feature distribution monitoring.

Data leakage (Chapter 2): When information from outside the training dataset is used to create the model. Causes overly optimistic evaluation metrics that don't hold in production. Common sources: future information in features, test data contaminating training.

Data maturity model (Chapter 34): A framework for assessing an organization's data capabilities across levels: ad-hoc reporting, descriptive analytics, predictive analytics, prescriptive analytics, autonomous systems.

Data pipeline (Chapter 2): The sequence of steps that transforms raw data into model-ready features. Includes data extraction, cleaning, transformation, feature engineering, and validation.

Data storytelling (Chapter 34): The practice of communicating data insights through narrative structure: context, conflict (the problem), resolution (the insight), and call to action. More persuasive than tables and charts alone.

DBSCAN (Chapter 20): Density-Based Spatial Clustering of Applications with Noise. Groups together points that are closely packed (high density) and marks points in low-density regions as noise. Does not require specifying the number of clusters. Parameterized by eps (radius) and min_samples.

Decision boundary (Chapter 11): The surface in feature space that separates different predicted classes. Linear models produce linear boundaries; kernel SVMs and tree ensembles produce non-linear boundaries.

Decision tree (Chapter 13): A model that makes predictions by recursively splitting the feature space into regions based on feature value thresholds. Interpretable but prone to overfitting without pruning or ensemble methods.

Demographic parity (Chapter 33): A fairness criterion requiring that the positive prediction rate is equal across protected groups. Also called statistical parity. A hiring model has demographic parity if it selects the same proportion of candidates from each demographic group.

Dendrogram (Chapter 20): A tree diagram showing the hierarchical arrangement of clusters produced by agglomerative clustering. The height of each merge indicates the distance between merged clusters.

Differencing (Chapter 25): A technique to make a non-stationary time series stationary by computing the difference between consecutive observations. First-order differencing: y'_t = y_t - y_{t-1}. The "I" in ARIMA.
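
A toy illustration: a series with a linear trend becomes constant (stationary) after one round of differencing.

```python
def difference(series, lag=1):
    """First-order differencing by default: y'_t = y_t - y_{t-lag}."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

trend = [3, 5, 7, 9, 11]   # linear trend, non-stationary
print(difference(trend))    # [2, 2, 2, 2] — constant after differencing
```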

Dimensionality reduction (Chapter 21): Techniques that reduce the number of features while preserving important structure. PCA for linear reduction; t-SNE and UMAP for non-linear visualization.

Disparate impact (Chapter 33): A legal and statistical concept: a facially neutral policy that disproportionately affects a protected group. In ML, a model that is not explicitly race-aware can still produce disparate impact through correlated features.

Docker (Chapter 31): A platform for packaging applications into containers — lightweight, portable environments that include all dependencies. A Dockerfile specifies the image; docker run starts the container.

Document-term matrix (Chapter 26): A matrix where rows are documents, columns are terms, and values are counts or TF-IDF scores. Typically very sparse. The standard input format for classical NLP models.

Dummy variable trap (Chapter 7): The collinearity problem caused by including all one-hot encoded columns for a categorical feature. For a feature with k categories, use k-1 dummy variables (or use regularization to handle it).

Early stopping (Chapter 14): A regularization technique for iterative algorithms (gradient boosting, neural networks) that stops training when performance on a validation set stops improving. Prevents overfitting without needing to set the exact number of iterations.

Effect size (Chapter 3): A quantitative measure of the magnitude of the difference between groups in an A/B test. Unlike p-values, effect size indicates practical significance. Common measures: Cohen's d, relative lift.

Eigenvalue (Chapter 21): In PCA, the eigenvalue of a principal component indicates how much variance it explains. The explained variance ratio is the eigenvalue divided by the sum of all eigenvalues.

Elastic Net (Chapter 11): A regularization method that combines L1 (Lasso) and L2 (Ridge) penalties. Controlled by the l1_ratio parameter: 0 = pure Ridge, 1 = pure Lasso. Useful when many features are correlated.

Elbow method (Chapter 20): A heuristic for choosing the number of clusters k. Plot inertia (within-cluster sum of squares) vs. k and look for the "elbow" where the rate of decrease sharply changes. Subjective but widely used.

Embedding (Chapter 7): A learned dense vector representation of a discrete variable (word, category, entity). Captures semantic relationships. Previewed in categorical encoding; covered more fully in Chapter 36.

Ensemble (Chapter 13): A model that combines predictions from multiple base models. Reduces variance (bagging), bias (boosting), or both. Random Forests and gradient boosting are ensemble methods.

Epsilon (eps) (Chapter 20): In DBSCAN, the maximum distance between two points for them to be considered neighbors. Smaller eps creates tighter, more numerous clusters; larger eps merges more points.

Equalized odds (Chapter 33): A fairness criterion requiring that the true positive rate and false positive rate are equal across protected groups. Stronger than demographic parity. If satisfied, prediction errors are distributed equally across groups.

Executive summary (Chapter 34): A one-page document summarizing a data science project for senior leadership. Covers: business question, approach (one sentence), key findings, recommendation, expected ROI. No jargon.

Expected value framework (Chapter 34): A method for quantifying the economic value of a model by multiplying the probability and cost/benefit of each prediction outcome (TP, FP, TN, FN) and summing them.

EXPLAIN / EXPLAIN ANALYZE (Chapter 5): PostgreSQL commands that show the query execution plan. EXPLAIN shows the planned operations; EXPLAIN ANALYZE actually runs the query and shows actual execution times. Essential for SQL optimization.

Explicit feedback (Chapter 24): User preferences expressed through deliberate actions: ratings, reviews, likes. Clearer signal but sparse — most users don't rate most items.

F1 score (Chapter 16): The harmonic mean of precision and recall: 2 * (precision * recall) / (precision + recall). Ranges from 0 to 1. Useful when you need a single metric that balances precision and recall.
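
The formula computed from hypothetical confusion-matrix counts:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 80 true positives, 20 false positives, 40 false negatives:
# precision = 0.8, recall ~= 0.667
print(round(f1_score(80, 20, 40), 3))  # 0.727
```

Note the harmonic mean punishes imbalance: a model with precision 1.0 and recall 0.1 scores only about 0.18, not 0.55.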

Fairness audit (Chapter 33): A systematic evaluation of an ML model's performance across protected groups. Computes fairness metrics, identifies disparities, and documents findings. Should be conducted before deployment and monitored in production.

FastAPI (Chapter 31): A modern Python web framework for building REST APIs. Features automatic OpenAPI documentation, type validation via Pydantic, and async support. The standard choice for model serving in Python.

Feature (Chapter 1): An input variable used by a model to make predictions. Also called a predictor, independent variable, or column. Feature quality is the single biggest driver of model performance.

Feature engineering (Chapter 6): The process of creating new features from raw data using domain knowledge and creativity. Temporal features, ratio features, aggregation features, and interaction terms. More impactful than model selection for most problems.

Feature importance (Chapter 13): A measure of how much each feature contributes to a model's predictions. Impurity-based importance is fast but biased toward high-cardinality features; permutation importance is slower but more reliable.

Feature randomization (Chapter 13): In Random Forests, the practice of considering only a random subset of features at each split. Combined with bagging, this decorrelates the trees and reduces variance. Controlled by max_features.

Feature selection (Chapter 9): The process of choosing a subset of relevant features for modeling. Filter methods (statistical tests), wrapper methods (RFE), and embedded methods (L1 regularization, tree importance). Must happen inside cross-validation.

Feature store (Chapter 2): A centralized repository for storing, sharing, and versioning features across ML projects. Ensures consistency between training and serving. Examples: Feast, Tecton.

Filter methods (Chapter 9): Feature selection techniques that evaluate features independently of any model. Examples: correlation with target, mutual information, chi-squared test, variance threshold. Fast but may miss feature interactions.

Folium (Chapter 27): A Python library for creating interactive Leaflet.js maps. Supports choropleth maps, marker clusters, heatmaps, and custom popups. Outputs HTML for easy sharing.

Forecast horizon (Chapter 25): The number of time steps into the future that a model predicts. Forecast accuracy degrades as the horizon increases. A critical parameter in business forecasting.

Forward selection (Chapter 9): A wrapper feature selection method that starts with zero features and iteratively adds the feature that most improves model performance. Computationally expensive for many features.

FP-Growth (Chapter 23): Frequent Pattern Growth. An algorithm for mining frequent itemsets that is faster than Apriori for large datasets. Uses a compressed FP-tree data structure and avoids candidate generation.

Frequency encoding (Chapter 7): An encoding method that replaces each category with its frequency (count or proportion) in the training data. Simple and effective, especially for high-cardinality features. Preserves ordinality if frequency correlates with the target.
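
A minimal sketch (mapping unseen categories to 0.0 is one common convention, assumed here):

```python
from collections import Counter

def frequency_encode(train_values, new_values):
    """Replace each category with its proportion in the training data; unseen -> 0.0."""
    counts = Counter(train_values)
    n = len(train_values)
    freqs = {cat: c / n for cat, c in counts.items()}
    return [freqs.get(v, 0.0) for v in new_values]

train = ["US", "US", "DE", "US", "FR"]
print(frequency_encode(train, ["US", "FR", "JP"]))  # [0.6, 0.2, 0.0]
```

Note the frequencies come from the training split only, so the encoding doesn't leak test-set information.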

FunctionTransformer (Chapter 10): A scikit-learn wrapper that turns any Python function into a transformer compatible with Pipelines. Useful for simple transformations: FunctionTransformer(np.log1p).

Gamma parameter (Chapter 12): In RBF kernel SVMs, gamma controls how far the influence of a single training example reaches. High gamma means each point influences only nearby points (complex boundary, risk of overfitting); low gamma means broader influence (smoother boundary).

Gap statistic (Chapter 20): A method for choosing the number of clusters that compares within-cluster dispersion to a null reference distribution (uniform random data). Selects k where the gap is largest.

Gaussian Naive Bayes (Chapter 15): A Naive Bayes variant that assumes continuous features follow Gaussian (normal) distributions within each class. Simple, fast, and often surprisingly effective despite the normality assumption.

Generalization (Chapter 1): A model's ability to perform well on unseen data — data it was not trained on. The central goal of machine learning. A model that performs well only on training data has failed to generalize.

Geocoding (Chapter 27): The process of converting addresses or place names to geographic coordinates (latitude/longitude). Used to prepare text location data for geospatial analysis.

GeoDataFrame (Chapter 27): A geopandas extension of a pandas DataFrame that includes a geometry column. Supports spatial operations: intersection, union, buffer, spatial joins.

GeoJSON (Chapter 27): A JSON-based format for encoding geographic data structures. More web-friendly than shapefiles. Supports point, line, polygon, and multi-geometry types.

Gini impurity (Chapter 13): A measure of how often a randomly chosen element would be misclassified. Used by decision trees (including scikit-learn's default) to evaluate the quality of splits. Ranges from 0 (pure node) to 0.5 (binary, 50/50 split).
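
The definition (1 minus the sum of squared class proportions) in plain Python:

```python
def gini_impurity(labels):
    """1 - sum of squared class proportions over the node's labels."""
    n = len(labels)
    proportions = [labels.count(c) / n for c in set(labels)]
    return 1 - sum(p * p for p in proportions)

print(gini_impurity([1, 1, 1, 1]))  # 0.0 — pure node
print(gini_impurity([0, 0, 1, 1]))  # 0.5 — maximally mixed binary node
```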

Gradient (Chapter 4): A vector of partial derivatives indicating the direction and rate of steepest increase of a function. In ML, the gradient of the loss function tells the optimizer which direction to adjust parameters.

Gradient boosting (Chapter 14): An ensemble technique that builds models sequentially, with each new model trained to predict the residual errors of the ensemble so far. Reduces bias progressively. The basis of XGBoost, LightGBM, and CatBoost.

Gradient descent (Chapter 4): An iterative optimization algorithm that updates model parameters in the direction of the negative gradient of the loss function. Step size controlled by the learning rate. Variants: batch, stochastic (SGD), mini-batch.
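
The update rule on a one-parameter toy problem, minimizing f(w) = (w - 3)^2 whose gradient is f'(w) = 2(w - 3):

```python
# Batch gradient descent on a convex 1-D loss.
w = 0.0
learning_rate = 0.1
for _ in range(100):
    gradient = 2 * (w - 3)        # derivative of (w - 3)^2
    w -= learning_rate * gradient # step against the gradient
print(round(w, 4))                # 3.0 — converges to the minimum
```

Too large a learning rate (here, anything above 1.0) would make the iterates diverge instead of converge.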

GridSearchCV (Chapter 18): A scikit-learn class that exhaustively tries every combination of hyperparameters from a specified grid, evaluating each with cross-validation. Simple but computationally expensive for large search spaces.

Group k-fold (Chapter 16): A cross-validation strategy that ensures all observations from the same group (e.g., the same customer, the same patient) appear in the same fold. Prevents data leakage when observations are not independent.

Guardrail metrics (Chapter 3): Metrics monitored during an A/B test to ensure no unintended harm. If a guardrail metric degrades beyond a threshold (e.g., page load time, error rate), the test is stopped regardless of the primary metric.

Halving search (Chapter 18): A resource-efficient hyperparameter tuning strategy (HalvingGridSearchCV, HalvingRandomSearchCV) that evaluates many configurations with a small resource budget, then progressively allocates more resources to the most promising candidates.

Hash encoding (Chapter 7): An encoding that applies a hash function to map categories to a fixed number of columns. Handles unseen categories and controls dimensionality. Drawback: hash collisions map different categories to the same column.

Haversine distance (Chapter 27): The great-circle distance between two points on a sphere, given their latitude and longitude. More accurate than Euclidean distance for geographic coordinates. Accounts for Earth's curvature.
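
A stdlib implementation of the formula, using a mean Earth radius of 6371 km (city coordinates are approximate):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Berlin (52.52, 13.405) to Paris (48.8566, 2.3522): roughly 880 km great-circle.
print(round(haversine_km(52.52, 13.405, 48.8566, 2.3522)))
```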

Health check endpoint (Chapter 31): An API endpoint (typically GET /health) that returns the service status. Used by load balancers and monitoring systems to verify the model server is operational.

Hierarchical clustering (Chapter 20): A clustering method that produces a nested hierarchy of clusters. Agglomerative (bottom-up) is most common. Does not require specifying k in advance. Visualized with dendrograms. Scales poorly to large datasets.

Hinge loss (Chapter 4): The loss function used by Support Vector Machines: max(0, 1 - y * f(x)). Penalizes predictions that are correct but not confident enough (within the margin). Produces sparse solutions (support vectors).

Histogram-based splitting (Chapter 14): A technique used by LightGBM that bins continuous features into discrete histograms before finding optimal splits. Dramatically faster than exact splitting for large datasets. Trades a small amount of accuracy for major speed gains.

Hyperparameter (Chapter 18): A model configuration set before training that controls the learning process itself. Examples: learning rate, number of trees, regularization strength, number of clusters. Tuned via search algorithms, not learned from data.

ICE (Individual Conditional Expectation) (Chapter 19): A plot showing how the prediction for a single observation changes as one feature varies, holding all others constant. A disaggregated version of the PDP — reveals heterogeneous effects.

Imbalance ratio (Chapter 17): The ratio of majority class size to minority class size. An imbalance ratio of 12:1 means the majority class is 12 times larger. Ratios above 10:1 typically require specialized handling.

imblearn (Chapter 17): The imbalanced-learn library. Provides resampling techniques (SMOTE, ADASYN, random under/oversampling, Tomek links) and pipeline-compatible transformers for handling class imbalance.

Implicit feedback (Chapter 24): User preferences inferred from behavior: clicks, views, purchases, time spent. Abundant but noisy — a click doesn't always mean interest, and absence of a click doesn't mean disinterest.

Impurity-based feature importance (Chapter 13): Feature importance measured by the total reduction in impurity (Gini or entropy) across all splits on that feature. Fast to compute but biased: prefers high-cardinality and continuous features.

Index scan (Chapter 5): A database operation that uses an index to locate rows efficiently. Much faster than a sequential scan for selective queries. Shown in EXPLAIN output.

Inertia (Chapter 20): In K-Means, the sum of squared distances between each point and its assigned centroid. K-Means minimizes inertia. Always decreases as k increases, making it unsuitable as a sole metric for choosing k.

Information gain (Chapter 13): The reduction in entropy (or Gini impurity) achieved by splitting on a particular feature. Decision trees greedily choose the split that maximizes information gain at each node.

Interaction term (Chapter 6): A feature created by multiplying two or more features together. Captures relationships that neither feature captures alone. Example: tenure_months * avg_hours might predict churn better than either feature individually.

IQR method (Chapter 22): An anomaly detection baseline using the Interquartile Range. Points below Q1 - 1.5IQR or above Q3 + 1.5IQR are flagged as outliers. Simple and assumption-free.
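
A minimal NumPy sketch of the rule (the multiplier 1.5 is the conventional default; the data is illustrative):

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    # Flag points outside [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

x = np.array([10, 12, 11, 13, 12, 11, 95])
flags = iqr_outliers(x)  # only the 95 is flagged
```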

Isolation Forest (Chapter 22): An anomaly detection algorithm based on the principle that anomalies are "few and different" and therefore require fewer random splits to isolate. Efficient for high-dimensional data. Key parameter: contamination.
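
A short scikit-learn sketch on synthetic data (the injected anomaly and the contamination value are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 normal points plus one obvious anomaly at (8, 8)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0]]])

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = iso.predict(X)  # -1 = anomaly, 1 = normal
```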

Iterative imputation (MICE) (Chapter 8): Multiple Imputation by Chained Equations. Models each feature with missing values as a function of other features, iterating through all features multiple times. More accurate than simple imputation for MAR data.

Joblib (Chapter 10): A Python library for serializing (saving) and deserializing (loading) Python objects efficiently. The recommended way to save scikit-learn pipelines and models to disk.

K-Means (Chapter 20): A clustering algorithm that partitions n observations into k clusters by minimizing within-cluster variance. Assumes spherical, equal-variance clusters. Fast and scalable but sensitive to initialization and outliers. K-Means++ improves initialization.

K-Nearest Neighbors (KNN) (Chapter 15): A non-parametric algorithm that predicts based on the k closest training examples in feature space. No training phase (lazy learner). Performance degrades in high dimensions (curse of dimensionality).

KNN imputation (Chapter 8): An imputation method that fills missing values using the mean (or median/mode) of the k nearest neighbors in feature space. Captures local structure better than global statistics. Implemented by KNNImputer.
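
A minimal KNNImputer sketch on a toy matrix (values chosen so the result is easy to verify by hand):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])
# The NaN is filled with the mean of its 2 nearest rows,
# [1, 2] and [3, 6], giving (2 + 6) / 2 = 4.0
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```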

Kolmogorov-Smirnov (KS) test (Chapter 32): A non-parametric test that compares two distributions. In model monitoring, used to detect whether a feature's distribution has shifted between training and production data. Produces a p-value.
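
A minimal drift-check sketch with scipy (the simulated mean shift is illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(0.0, 1.0, 1000)
prod_feature = rng.normal(0.5, 1.0, 1000)  # production mean has drifted

stat, p_value = ks_2samp(train_feature, prod_feature)
drift_detected = p_value < 0.05
```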

L1 regularization (Lasso) (Chapter 11): A penalty term that adds the sum of absolute values of coefficients to the loss function. Drives some coefficients to exactly zero, performing automatic feature selection. Produces sparse models.

L2 regularization (Ridge) (Chapter 11): A penalty term that adds the sum of squared coefficients to the loss function. Shrinks coefficients toward zero but never exactly to zero. Handles multicollinearity by distributing weight across correlated features.

LAG / LEAD (Chapter 5): SQL window functions that access values from previous (LAG) or subsequent (LEAD) rows without a self-join. Essential for computing temporal features: "How does this month's usage compare to last month's?"

Laplace smoothing (Chapter 15): A technique that adds a small count to every category to prevent zero probabilities in Naive Bayes. Without smoothing, a single unseen word would make the entire document probability zero.

Lasso regression (Chapter 11): See L1 regularization.

Latent factors (Chapter 24): Hidden variables in matrix factorization that represent underlying dimensions of user preferences or item characteristics. In a movie recommender, latent factors might correspond to genre preference, pacing, or tone.

Lazy evaluation (Chapter 28): A computation strategy where operations are recorded but not executed until the result is explicitly requested. Used by Dask and Polars to optimize execution plans before running them.

LDA (Latent Dirichlet Allocation) (Chapter 26): A generative probabilistic model for topic modeling. Assumes each document is a mixture of topics and each topic is a mixture of words. Used to discover the main themes in a collection of documents.

Learning curve (Chapter 16): A plot of model performance (e.g., accuracy) vs. training set size. Diagnoses bias vs. variance: if training and validation scores converge at a low value, the model has high bias; if they remain far apart, it has high variance.

Learning rate (Chapter 4): A hyperparameter that controls the step size in gradient descent. Too large: the optimizer overshoots the minimum and diverges. Too small: training takes forever and may get stuck in local minima. In boosting, controls how much each new tree contributes.

Leave-one-out encoding (Chapter 7): A variant of target encoding that uses the target mean of all rows except the current one. Reduces overfitting compared to standard target encoding, especially for small datasets.

Lemmatization (Chapter 26): The process of reducing a word to its dictionary form (lemma). "running" -> "run", "better" -> "good". More linguistically informed than stemming; uses vocabulary and morphological analysis.

Lift (Chapter 23): In association rules, the ratio of the observed co-occurrence to the expected co-occurrence under independence: P(A and B) / [P(A) * P(B)]. Lift > 1 indicates positive association; lift = 1 indicates independence; lift < 1 indicates negative association.
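
A tiny worked example on toy transactions (item names and counts are illustrative):

```python
transactions = [
    {"bread", "butter"}, {"bread", "butter", "milk"},
    {"bread"}, {"milk"}, {"butter", "milk"}, {"bread", "butter"},
]
n = len(transactions)
p_a = sum("bread" in t for t in transactions) / n               # P(A) = 4/6
p_b = sum("butter" in t for t in transactions) / n              # P(B) = 4/6
p_ab = sum({"bread", "butter"} <= t for t in transactions) / n  # P(A and B) = 3/6
lift = p_ab / (p_a * p_b)  # 1.125 > 1: positive association
```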

LightGBM (Chapter 14): Light Gradient Boosting Machine. A fast gradient boosting framework by Microsoft that uses histogram-based splitting and leaf-wise tree growth. Faster training than XGBoost on large datasets. Native categorical feature support.

LIME (Chapter 19): Local Interpretable Model-agnostic Explanations. Explains individual predictions by fitting a simple interpretable model (e.g., linear regression) in the neighborhood of the prediction. Model-agnostic: works with any classifier.

Linkage (Chapter 20): In hierarchical clustering, the method for computing distance between clusters. Ward (minimize variance increase), complete (maximum pairwise distance), average (mean pairwise distance), single (minimum pairwise distance).

Listwise deletion (Chapter 8): Removing any row with at least one missing value. Simple but wastes data, especially when missingness is spread across many columns. Valid only when data is MCAR and few values are missing.

Local Outlier Factor (LOF) (Chapter 22): An anomaly detection algorithm that compares the local density of a point to the local density of its neighbors. Points with substantially lower density than their neighbors are outliers. Effective for datasets with varying cluster densities.

Log-loss (Chapter 4): See Cross-entropy.

Log transformation (Chapter 6): Applying the natural logarithm to a feature. Reduces right skew, compresses the range, and makes multiplicative relationships additive. Use np.log1p() to handle zero values.
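
A one-line NumPy sketch (the skewed values are illustrative):

```python
import numpy as np

x = np.array([0.0, 9.0, 99.0, 9999.0])  # right-skewed, includes a zero
compressed = np.log1p(x)  # log(1 + x): roughly [0.0, 2.30, 4.61, 9.21]
```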

MAE (Mean Absolute Error) (Chapter 16): The average absolute difference between predictions and actual values. Less sensitive to outliers than RMSE. Interpretable in the original units of the target variable.

Mahalanobis distance (Chapter 22): A distance metric that accounts for correlations between variables. Unlike Euclidean distance, considers the covariance structure of the data. Points far in Mahalanobis distance are anomalous given the data's shape.

MAPE (Mean Absolute Percentage Error) (Chapter 25): A forecast accuracy metric expressed as a percentage: mean of |actual - predicted| / |actual|. Intuitive but undefined when actual values are zero. Asymmetric: for the same absolute error, it penalizes over-predictions more heavily than under-predictions, because the error is scaled by the actual value.
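
A minimal implementation with toy values showing the asymmetry:

```python
import numpy as np

def mape(actual, predicted):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs(actual - predicted) / np.abs(actual))

# Same absolute error of 50, different MAPE:
over = mape([100], [150])   # 0.50  (prediction above the actual)
under = mape([150], [100])  # ~0.33 (prediction below the actual)
```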

MAR (Missing at Random) (Chapter 8): A missing data mechanism where the probability of missingness depends on observed variables but not on the missing value itself. Example: younger users are less likely to fill in their age. MICE handles MAR data well.

Market basket analysis (Chapter 23): The application of association rule mining to transaction data to discover products frequently purchased together. Drives cross-selling recommendations and store layout decisions.

Materialized view (Chapter 5): A database object that stores the result of a query physically. Unlike a regular view, it doesn't re-execute the query on each access. Must be refreshed periodically. Useful for expensive feature extraction queries.

Matrix factorization (Chapter 24): A technique that decomposes a matrix into the product of two (or more) lower-rank matrices. In recommender systems, decomposes the user-item matrix into user factors and item factors, revealing latent dimensions.

MCAR (Missing Completely at Random) (Chapter 8): A missing data mechanism where the probability of missingness is the same for all observations, unrelated to any data. Example: a sensor fails due to a power outage. The only mechanism where listwise deletion is unbiased.

McNemar's test (Chapter 16): A statistical test for comparing the performance of two classifiers on the same dataset. Tests whether the classifiers make different types of errors. More appropriate than a paired t-test for classification.

Mean imputation (Chapter 8): Replacing missing values with the column mean. Fast and simple but distorts the distribution (reduces variance) and ignores relationships between features. Never compute the mean on the test set — use the training mean.

Missing indicator (Chapter 8): A binary feature (0/1) that records whether a value was originally missing. Added alongside the imputed value. Captures the signal in missingness itself: "the fact that this value is missing is informative."

ML lifecycle (Chapter 2): The complete process from problem framing through deployment and monitoring. Not a linear sequence but an iterative cycle: models are retrained, features are revised, and business objectives evolve.

MLflow (Chapter 30): An open-source platform for the ML lifecycle. Components: Tracking (log experiments), Projects (package code), Models (model format), Model Registry (version and stage models). The industry standard for experiment management.

MNAR (Missing Not at Random) (Chapter 8): A missing data mechanism where the probability of missingness depends on the unobserved value itself. Example: patients with severe symptoms are too ill to answer a survey. The hardest mechanism to handle; no imputation method fully solves it.

Model card (Chapter 33): A documentation framework for ML models. Records: intended use, limitations, performance across demographic groups, ethical considerations, and training data description. Proposed by Mitchell et al. (2019) at Google.

Model decay (Chapter 32): The gradual degradation of a deployed model's performance over time due to data drift, concept drift, or changes in the business environment. All models decay — the question is how fast and whether you're monitoring.

Model governance (Chapter 34): Policies and processes for managing ML models throughout their lifecycle. Covers approval workflows, risk assessment, documentation requirements, and audit trails. Increasingly mandated by regulation.

Model Registry (Chapter 30): A centralized hub for model versioning and lifecycle management. Tracks model versions, transitions between stages (Staging, Production, Archived), and associates models with the experiments that produced them.

Model serving (Chapter 31): The infrastructure and process for making model predictions available to other systems. Real-time serving (REST API) for individual predictions; batch serving for periodic large-scale scoring.

Multicollinearity (Chapter 9): A condition where two or more features are highly correlated. Inflates coefficient variance in linear models, making interpretation unreliable. Detected with VIF. Regularization (Ridge) mitigates the problem.

Multinomial Naive Bayes (Chapter 15): A Naive Bayes variant for discrete count features. Standard for text classification with bag-of-words or TF-IDF features. Assumes features follow a multinomial distribution.

Multiple testing correction (Chapter 3): Adjustment to significance thresholds when performing multiple statistical tests simultaneously. Without correction, the probability of at least one false positive increases dramatically. Bonferroni (divide alpha by number of tests) is conservative; FDR (Benjamini-Hochberg) is more powerful.
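
The Bonferroni and Benjamini-Hochberg procedures from the Multiple testing correction entry can be sketched with illustrative p-values:

```python
p_values = [0.001, 0.02, 0.04, 0.30]  # illustrative
alpha, m = 0.05, len(p_values)

# Bonferroni: compare every p-value to alpha / m (= 0.0125 here)
bonferroni = [p < alpha / m for p in p_values]  # only 0.001 survives

# Benjamini-Hochberg: reject the k smallest p-values, where k is the
# largest rank i with sorted_p[i-1] <= (i / m) * alpha
sorted_p = sorted(p_values)
k = max((i + 1 for i, p in enumerate(sorted_p) if p <= (i + 1) / m * alpha),
        default=0)  # k = 2: BH rejects more tests than Bonferroni's 1
```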

Mutual information (Chapter 9): A measure of the dependence between two variables. Captures both linear and non-linear relationships. Used as a filter method for feature selection. Zero means independence.

NDCG (Normalized Discounted Cumulative Gain) (Chapter 24): A ranking metric that measures the quality of a ranked list. Discounts the relevance of items based on their position: items ranked higher contribute more to the score. The standard metric for recommendation evaluation.

N-gram (Chapter 26): A contiguous sequence of n words from text. Unigrams (single words), bigrams (two-word sequences), trigrams (three-word sequences). Captures local word order that bag-of-words misses.

Nominal (Chapter 7): A categorical variable with no natural ordering. Examples: color, country, device type. One-hot encoding preserves the lack of ordering; ordinal encoding would impose a false hierarchy.

Novelty effect (Chapter 3): In A/B testing, the temporary increase in engagement with a new feature simply because it is new. If the test duration is too short, the novelty effect inflates the treatment effect. Mitigated by running tests long enough for the effect to stabilize.

NTILE (Chapter 5): A SQL window function that distributes rows into a specified number of approximately equal groups. NTILE(10) creates deciles. Used for bucketed analysis and percentile-based feature engineering.

Null hypothesis (Chapter 3): In A/B testing, the assumption that there is no real difference between control and treatment groups. Denoted H_0. We reject H_0 only when the p-value is sufficiently small.

Observation unit (Chapter 2): The entity that each row in your dataset represents. Defining this precisely is critical: is it a customer? A customer-month? A transaction? Getting it wrong causes subtle data leakage.

One-class SVM (Chapter 22): An SVM variant trained only on "normal" data. Learns a decision boundary around normal observations; new points outside the boundary are flagged as anomalies. Useful when anomaly examples are unavailable.

One-hot encoding (OHE) (Chapter 7): An encoding that creates a binary column for each category value. A feature with k categories produces k columns (or k-1 to avoid the dummy variable trap). Safe and interpretable for low-cardinality features.
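
A minimal pandas sketch (the device column is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"device": ["mobile", "desktop", "tablet", "mobile"]})
full = pd.get_dummies(df, columns=["device"])                     # k = 3 binary columns
dropped = pd.get_dummies(df, columns=["device"], drop_first=True) # k - 1 = 2 columns
```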

Optuna (Chapter 18): A Python framework for Bayesian hyperparameter optimization. Features: define-by-run search spaces, pruning of unpromising trials, built-in visualization. Integrates with scikit-learn, XGBoost, LightGBM.

Ordinal encoding (Chapter 7): An encoding that maps categories to integers (0, 1, 2, ...). Appropriate only when the categories have a natural order. Tree-based models handle ordinal encoding well; linear models do not.

Out-of-bag (OOB) error (Chapter 13): In Random Forests, the prediction error estimated using samples not included in a tree's bootstrap sample. A free estimate of generalization performance without needing a separate validation set.

Overfitting (Chapter 1): When a model learns noise and idiosyncrasies of the training data rather than the underlying pattern. Symptoms: excellent training performance, poor test performance. Addressed by regularization, cross-validation, and simpler models.

P-value (Chapter 3): The probability of observing a result at least as extreme as the actual result, assuming the null hypothesis is true. Not the probability that the null hypothesis is true. A p-value below the significance threshold (typically 0.05) leads to rejecting H_0.

Partial autocorrelation (PACF) (Chapter 25): The correlation of a time series with its lagged values after removing the effects of shorter lags. Used to identify the AR order (p) in ARIMA models.

Partial dependence plot (PDP) (Chapter 19): A plot showing the marginal effect of one or two features on the model's predictions, averaging over all other features. Reveals the shape of the relationship: linear, non-linear, threshold, interaction.

Partition pruning (Chapter 5): A database optimization that skips irrelevant partitions when scanning partitioned tables. Only reads partitions matching the query's WHERE clause. Dramatic speedup for large time-partitioned tables.

Permutation importance (Chapter 19): A model-agnostic feature importance method that measures how much model performance degrades when a feature's values are randomly shuffled. More reliable than impurity-based importance, especially for correlated features.
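
A minimal scikit-learn sketch on synthetic data where only one feature carries signal (data and model choice are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only feature 0 matters

model = RandomForestRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
top_feature = int(result.importances_mean.argmax())  # shuffling feature 0 hurts most
```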

Pipeline (Chapter 10): A scikit-learn object that chains multiple processing steps (transformers and a final estimator) into a single object. Ensures consistent fit/transform behavior and prevents data leakage between steps.
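
A minimal two-step sketch (synthetic data; the step names are arbitrary labels):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),   # the scaler is fit on training data only
    ("clf", LogisticRegression()),
])
pipe.fit(X_tr, y_tr)               # fits scaler, transforms, then fits classifier
accuracy = pipe.score(X_te, y_te)  # applies the fitted scaler before predicting
```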

Polars (Chapter 28): A high-performance DataFrame library written in Rust. Features: lazy evaluation, expressive expression API, no index, multi-threaded execution. Significantly faster than pandas for data wrangling tasks.

Polynomial features (Chapter 6): Features created by raising existing features to higher powers or combining them multiplicatively. Captures non-linear relationships in linear models. Degree 2 (quadratic) is common; degree 3+ risks overfitting.

Population Stability Index (PSI) (Chapter 32): A metric that quantifies the shift in a variable's distribution between two time periods. PSI < 0.1: stable; 0.1-0.25: moderate shift (investigate); > 0.25: significant shift (likely need to retrain). Widely used in financial model validation.
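
A NumPy sketch of one common way to compute PSI, binning by deciles of the baseline sample (the decile binning and the 1e-6 floor are implementation choices, not part of the definition):

```python
import numpy as np

def psi(expected, actual, bins=10):
    # Bin edges come from deciles of the expected (baseline) distribution
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf           # catch out-of-range values
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
base = rng.normal(0, 1, 5000)
stable_psi = psi(base, rng.normal(0, 1, 5000))      # near 0: stable
shifted_psi = psi(base, rng.normal(1, 1, 5000))     # large: significant shift
```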

Posterior probability (Chapter 4): The updated probability of a hypothesis after observing evidence. In Bayes' theorem: P(hypothesis|evidence). The result of combining the prior with the likelihood.

Practical significance (Chapter 3): Whether a statistically significant result matters in practice. A 0.1% improvement in conversion rate might be statistically significant with a large sample but not worth the engineering cost to implement.

Pre-commit hooks (Chapter 29): Scripts that run automatically before each git commit. Used to enforce code quality: run black, ruff, mypy, and tests before code enters the repository. Prevents bad code from being committed.

Precision (Chapter 16): Of all positive predictions, the proportion that are actually positive: TP / (TP + FP). High precision means few false alarms. Critical when false positives are expensive.

Precision at k (Chapter 22): In anomaly detection, the proportion of true anomalies among the top k highest-scored items. A practical metric when you can only investigate a fixed number of cases.

Prior probability (Chapter 4): The probability of a hypothesis before observing evidence. In Bayes' theorem: P(hypothesis). Represents existing knowledge or belief.

Problem framing (Chapter 2): The process of translating a business question into a precise ML task. Defines: what is the target variable, what is the observation unit, what metric defines success, and what data is available.

Prophet (Chapter 25): A time series forecasting library by Meta that handles trend, multiple seasonalities, holidays, and changepoints. Designed for business forecasting by practitioners, not statisticians. Robust to missing data and outliers.

Protected attribute (Chapter 33): A demographic characteristic (race, gender, age) that should not unfairly influence model predictions. Even a model that excludes protected attributes can still discriminate through correlated features that act as proxies.

Pruning (Chapter 13): Limiting the complexity of a decision tree by restricting its growth (pre-pruning: max_depth, min_samples_leaf) or removing branches after growing (post-pruning). Essential to prevent overfitting.

Pydantic (Chapter 31): A Python library for data validation using Python type annotations. In FastAPI, defines request and response schemas. Automatically validates incoming data and generates clear error messages.

Pytest (Chapter 29): The standard Python testing framework. Features: fixtures for setup/teardown, parametrize for data-driven tests, assert introspection for clear failure messages. The foundation of testing data science code.

Query plan (Chapter 5): The sequence of operations a database will execute to fulfill a query. Includes scan types (sequential, index), join strategies, sort operations, and estimated costs. The primary tool for SQL optimization.

Random Forest (Chapter 13): An ensemble of decision trees trained with bagging and feature randomization. Each tree is trained on a bootstrap sample with a random subset of features at each split. Reduces variance compared to a single tree.

RandomizedSearchCV (Chapter 18): A scikit-learn class that samples a fixed number of hyperparameter combinations from specified distributions. More efficient than grid search for high-dimensional search spaces. Controlled by n_iter.

Real-time prediction (Chapter 31): Generating predictions on-demand in response to individual requests, typically via a REST API. Requires low latency (milliseconds to seconds). More complex infrastructure than batch prediction.

Recall (Chapter 16): Of all actual positives, the proportion that are correctly identified: TP / (TP + FN). Also called sensitivity or true positive rate. Critical when false negatives are costly (disease diagnosis, fraud detection).

Reconstruction error (Chapter 22): In autoencoders, the difference between the input and the reconstructed output. Normal data produces low reconstruction error; anomalies produce high error because the autoencoder hasn't learned to reconstruct them.

Recursive CTE (Chapter 5): A CTE that references itself, enabling hierarchical or iterative queries. Used for tree traversal (org charts, category hierarchies) and graph problems.

Recursive Feature Elimination (RFE) (Chapter 9): A wrapper method that trains a model, ranks features by importance, removes the least important feature, and repeats. Continues until the desired number of features is reached.

Regularization (Chapter 11): A technique that adds a penalty to the loss function to constrain model complexity. Reduces overfitting by shrinking coefficients (L2) or eliminating features (L1). Controlled by a strength parameter (alpha or C).

Residual (Chapter 14): The difference between the actual value and the model's prediction. In gradient boosting, each new tree is trained to predict the residuals (errors) of the current ensemble.

RFM features (Chapter 6): Recency, Frequency, Monetary — a classic feature engineering pattern for customer data. Recency: time since last activity. Frequency: count of activities. Monetary: total spending. Powerful predictors of customer behavior.

Ridge regression (Chapter 11): See L2 regularization.

RMSE (Root Mean Squared Error) (Chapter 16): The square root of the mean squared error. Penalizes large errors more heavily than MAE. In the same units as the target variable. The most common regression metric.

ROW_NUMBER (Chapter 5): A SQL window function that assigns a unique sequential integer to each row within a partition. Used for deduplication (keep the most recent record), pagination, and ranking.

Ruff (Chapter 29): A fast Python linter written in Rust. Replaces flake8, isort, and many other tools. Integrates with pre-commit hooks. Catches style violations, import issues, and potential bugs.

Sample weight (Chapter 17): A per-observation weight that tells the model how much to value each training example. Higher weights increase the cost of misclassifying that observation. Directly implements cost-sensitive learning.

SARIMA (Chapter 25): Seasonal ARIMA. Extends ARIMA to handle seasonal patterns. Parameterized as ARIMA(p,d,q)(P,D,Q,m) where the second set of parameters controls the seasonal component and m is the seasonal period.

Scree plot (Chapter 21): A plot of eigenvalues (or explained variance ratios) in descending order. Used to choose the number of PCA components. Look for the "elbow" where the curve levels off.

Self-join (Chapter 5): A SQL join of a table with itself. Used for computing relationships between rows: comparing each customer to their previous visit, finding duplicate records, or computing pairwise distances.

Sentiment analysis (Chapter 26): The task of identifying the emotional tone of text: positive, negative, or neutral. Lexicon-based approaches (VADER) use word-level scores; ML approaches use trained classifiers.

Sequential scan (Chapter 5): A database operation that reads every row in a table. The default when no suitable index exists. Shown in EXPLAIN output. Usually indicates a need for indexing when the query is selective.

Shadow deployment (Chapter 32): Running a new model in parallel with the production model without serving its predictions to users. Allows comparison of new model behavior against the current model without risk.

SHAP (SHapley Additive exPlanations) (Chapter 19): A framework for interpreting model predictions based on Shapley values from game theory. Assigns each feature a contribution to the prediction for each observation. Provides both global and local explanations. TreeSHAP is fast for tree-based models.

Shapley values (Chapter 19): A concept from cooperative game theory that fairly distributes a total payout among players based on their marginal contributions. In ML, the "players" are features and the "payout" is the prediction.

Silhouette score (Chapter 20): A cluster quality metric that measures how similar a point is to its own cluster compared to the nearest neighboring cluster. Ranges from -1 (misclassified) to 1 (well-clustered). Average silhouette score summarizes overall clustering quality.
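
A minimal sketch using the silhouette score to compare candidate values of k (the blob data and candidate range are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
scores = {
    k: silhouette_score(
        X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    )
    for k in (2, 3, 4, 5)
}
# the k with the highest average silhouette is a good candidate
```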

SimpleImputer (Chapter 8): A scikit-learn transformer for basic imputation: fill missing values with mean, median, most frequent value, or a constant. Always fit on training data only.

Simpson's paradox (Chapter 3): A phenomenon where a trend that appears in subgroups reverses or disappears when the groups are combined. In A/B testing, overall results can show treatment winning while every subgroup shows control winning.

Slack variables (Chapter 12): In soft-margin SVMs, variables that measure how much each point is allowed to violate the margin. Controlled by the C parameter: high C allows few violations (complex boundary); low C allows more violations (simpler boundary).

SMOTE (Chapter 17): Synthetic Minority Over-sampling Technique. Creates synthetic minority class examples by interpolating between existing minority samples and their nearest neighbors. Must be applied only to training data, inside cross-validation.

Sparsity (Chapter 11): Having many zero-valued coefficients. L1 regularization produces sparse models, effectively selecting a subset of features. Sparse models are easier to interpret and cheaper to serve.

Spatial index (Chapter 27): A data structure that accelerates spatial queries (point-in-polygon, nearest neighbor). R-trees are the most common. Without spatial indexing, spatial joins require brute-force pairwise comparison.

Spatial join (Chapter 27): A join between two GeoDataFrames based on spatial relationships (within, intersects, contains). Example: join customer points with census tract polygons to get demographic features.

StandardScaler (Chapter 11): A scikit-learn transformer that standardizes features by removing the mean and scaling to unit variance. Essential for regularized linear models and SVMs. Always fit on training data only.

Stationarity (Chapter 25): A property of a time series whose statistical properties (mean, variance, autocorrelation) do not change over time. Most time series models assume stationarity. Non-stationary series must be differenced or detrended.

Statistical power (Chapter 3): The probability of detecting a real effect when one exists. Power = 1 - P(Type II error). Typically targeted at 80%. Determined by sample size, effect size, and significance threshold.

Stemming (Chapter 26): The process of reducing a word to its root form by removing suffixes. "running" -> "run", "happiness" -> "happi". Faster but cruder than lemmatization; produces non-words.

Stop words (Chapter 26): Common words (the, is, at, which) that carry little semantic meaning and are often removed during text preprocessing. Removal improves model efficiency but is context-dependent — "to be or not to be" loses meaning without stop words.

Stratified k-fold (Chapter 16): A cross-validation variant that ensures each fold has approximately the same class distribution as the full dataset. Essential for imbalanced classification problems.
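
A minimal sketch showing that every fold preserves the class distribution (the 90/10 split is illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 90 + [1] * 10)  # 10% minority class
X = np.zeros((100, 1))             # features are irrelevant to the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_rates = [y[test].mean() for _, test in skf.split(X, y)]
# every test fold of 20 rows contains exactly 2 positives: 10%
```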

Supervised learning (Chapter 1): A learning paradigm where the model is trained on labeled data (input-output pairs). The model learns to map inputs to outputs. Includes classification and regression tasks.

Support (Chapter 23): In association rules, the proportion of transactions that contain an itemset. A support of 0.05 means the itemset appears in 5% of all transactions. Used as a minimum threshold to filter rare combinations.

Support vector (Chapter 12): The training examples closest to the decision boundary that define the margin. These are the only points that influence the SVM's decision boundary — all other points could be removed without changing the model.

Surrogate model (Chapter 18): In Bayesian optimization, a probabilistic model (Gaussian process, Tree-structured Parzen Estimator) that approximates the objective function. Cheaper to evaluate than the actual model, enabling efficient search.

t-SNE (Chapter 21): t-distributed Stochastic Neighbor Embedding. A non-linear dimensionality reduction technique for visualization. Preserves local structure well but distorts global structure. Cluster distances and sizes in t-SNE plots are not meaningful.

Target encoding (Chapter 7): An encoding that replaces each category with the mean of the target variable for that category. Effective for high-cardinality features but prone to overfitting. Requires regularization (smoothing) and must be computed within cross-validation to prevent leakage.
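
A toy sketch of one common smoothing scheme (blending with the global mean; the column names and smoothing strength m are illustrative). In practice this must be computed within cross-validation folds, as the entry notes:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b", "c"],
    "churned": [1, 0, 1, 1, 0, 0],
})

global_mean = df["churned"].mean()  # 0.5
stats = df.groupby("city")["churned"].agg(["mean", "count"])
m = 2  # smoothing strength: pseudo-observations of the global mean
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["city_encoded"] = df["city"].map(smoothed)
# rare category "c" is pulled from its raw mean 0.0 toward the global mean
```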

Target leakage (Chapter 2): A specific form of data leakage where the target variable (or a proxy for it) is included in the features. Produces suspiciously high accuracy that collapses in production.

Target variable (Chapter 1): The variable a model is trying to predict. Also called the label, dependent variable, or response variable. In classification, the target is categorical; in regression, it is continuous.

Task graph (Chapter 28): In Dask, a directed acyclic graph (DAG) of all operations to be performed. Dask constructs the graph lazily and executes it when .compute() is called, enabling optimization of the execution plan.

Technical debt (Chapter 29): The accumulated cost of shortcuts and suboptimal decisions in code. In ML systems, includes: undocumented feature engineering, hardcoded paths, untested pipelines, and duplicated preprocessing logic. Compounds over time.

TF-IDF (Chapter 26): Term Frequency-Inverse Document Frequency. A text representation that weights words by how frequent they are in a document relative to how common they are across all documents. Highlights distinctive words while down-weighting common ones.
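
A minimal scikit-learn sketch (the documents are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the model predicts churn",
    "the model predicts revenue",
    "the customer churn rate",
]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # sparse matrix: 3 documents x 7 vocabulary terms
# "the" appears in every document, so its IDF is the lowest in the vocabulary
```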

Threshold tuning (Chapter 17): Adjusting the classification decision threshold (default 0.5) to optimize for a specific metric or cost function. For imbalanced problems, lowering the threshold increases recall at the cost of precision. Use the precision-recall curve to find the optimal point.
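
A minimal sketch of moving the threshold on predicted probabilities (synthetic imbalanced data; the 0.2 threshold is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# ~90/10 imbalanced synthetic data
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
proba = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

at_default = (proba >= 0.5).sum()  # positives at the default threshold
at_lowered = (proba >= 0.2).sum()  # lower threshold flags more positives
```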

Time series split (Chapter 25): A cross-validation strategy for temporal data that always trains on past data and evaluates on future data. Unlike random k-fold, respects the temporal ordering. Also called walk-forward validation.
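The fold layout can be sketched in a few lines. This is a simplified version of what scikit-learn's `TimeSeriesSplit` produces (equal-size test windows, expanding training window); the real class offers more options.

```python
def walk_forward_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs where each fold trains on
    all data before its test window -- past predicts future, never reversed."""
    fold = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = list(range(0, fold * i))
        test = list(range(fold * i, min(fold * (i + 1), n_samples)))
        yield train, test

for train, test in walk_forward_splits(12, 3):
    print(train, "->", test)
# [0, 1, 2] -> [3, 4, 5]
# [0, 1, 2, 3, 4, 5] -> [6, 7, 8]
# [0, 1, 2, 3, 4, 5, 6, 7, 8] -> [9, 10, 11]
```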

Tokenization (Chapter 26): The process of splitting text into individual tokens (words, subwords, or characters). The first step in any NLP pipeline. Word tokenization splits on whitespace and punctuation.
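A naive word tokenizer is a one-liner with a regular expression. This sketch only handles ASCII text and keeps apostrophes inside words; production pipelines use library tokenizers (NLTK, spaCy, or subword tokenizers) that handle far more cases.

```python
import re

def word_tokenize(text):
    """Naive word tokenizer: lowercase, then split on any run of characters
    that is not a letter, digit, or apostrophe."""
    return [t for t in re.split(r"[^a-z0-9']+", text.lower()) if t]

word_tokenize("Don't panic -- it's only 42 tokens!")
# -> ["don't", 'panic', "it's", 'only', '42', 'tokens']
```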

Tomek links (Chapter 17): Pairs of observations from different classes that are each other's nearest neighbors. Removing the majority-class member of each Tomek link cleans the decision boundary. An undersampling technique.
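The mutual-nearest-neighbor condition can be demonstrated on one-dimensional data. This is a brute-force O(n²) sketch for intuition; `imbalanced-learn` provides an efficient implementation for real datasets.

```python
def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek links: points that
    are each other's nearest neighbors but carry different class labels."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: abs(points[j] - points[i]))
    links = []
    for i in range(len(points)):
        j = nearest(i)
        if nearest(j) == i and labels[i] != labels[j] and i < j:
            links.append((i, j))
    return links

points = [0.0, 1.0, 2.1, 2.2, 5.0]
labels = [0,   0,   0,   1,   1]
tomek_links(points, labels)  # the pair at 2.1 / 2.2 straddles the boundary
```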

Topic modeling (Chapter 26): An unsupervised NLP technique that discovers abstract topics in a document collection. LDA is the most common approach. Each document is a mixture of topics; each topic is a distribution over words.

TransformerMixin (Chapter 10): A scikit-learn mixin class that provides a default fit_transform() method. Combined with BaseEstimator to create custom transformers that integrate seamlessly with scikit-learn Pipelines.

Trend (Chapter 25): The long-term direction of change in a time series. Can be linear (steady increase) or non-linear (acceleration, leveling off). One component of time series decomposition alongside seasonality and residuals.

Type I error (Chapter 3): Rejecting the null hypothesis when it is actually true. A false positive: concluding there is a difference when there isn't one. Controlled by the significance level (alpha).

Type II error (Chapter 3): Failing to reject the null hypothesis when it is actually false. A false negative: concluding there is no difference when there is one. Controlled by statistical power.

UMAP (Chapter 21): Uniform Manifold Approximation and Projection. A non-linear dimensionality reduction technique that is generally faster than t-SNE and better preserves global structure. Parameterized by n_neighbors and min_dist.

Underfitting (Chapter 1): When a model is too simple to capture the underlying patterns in the data. Symptoms: poor performance on both training and test data. Addressed by adding features, using more complex models, or reducing regularization.

Unsupervised learning (Chapter 1): A learning paradigm where the model discovers patterns in unlabeled data. Includes clustering, dimensionality reduction, anomaly detection, and association rules.

VADER (Chapter 26): Valence Aware Dictionary and sEntiment Reasoner. A rule-based sentiment analysis tool tuned for social media text. Handles punctuation, capitalization, slang, and emoticons. Returns compound, positive, negative, and neutral scores.

Validation curve (Chapter 16): A plot of training and validation performance vs. a hyperparameter value. Reveals the sweet spot: too low a value causes underfitting; too high causes overfitting.

Variance inflation factor (VIF) (Chapter 9): A measure of multicollinearity. The VIF for a feature is 1 / (1 - R-squared), where R-squared comes from regressing that feature on all the other features. VIF > 5 suggests moderate multicollinearity; VIF > 10 indicates severe multicollinearity.

Variance threshold (Chapter 9): A filter feature selection method that removes features with variance below a specified threshold. The simplest form of feature selection: a feature with zero variance contains no information.
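The filter is simple enough to write by hand. This sketch works on a dict of column lists; scikit-learn's `VarianceThreshold` does the same thing on arrays.

```python
import statistics

def variance_filter(columns, threshold=0.0):
    """Keep only columns whose population variance exceeds the threshold."""
    return {name: values for name, values in columns.items()
            if statistics.pvariance(values) > threshold}

data = {
    "constant": [3, 3, 3, 3],   # zero variance: carries no information
    "useful":   [1, 5, 2, 8],
}
variance_filter(data)  # drops "constant", keeps "useful"
```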

Walk-forward validation (Chapter 25): See Time series split.

Weights and Biases (W&B) (Chapter 30): A cloud-based experiment tracking platform. Offers rich visualization, team collaboration, model and dataset versioning, and hyperparameter sweep management. SaaS model with a free tier.

Window function (Chapter 5): A SQL function that performs a calculation across a set of rows related to the current row, without collapsing the result into a single row like GROUP BY. Includes ROW_NUMBER, RANK, LAG, LEAD, and running aggregates.
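The contrast with GROUP BY can be shown with SQLite, which ships with Python's standard library and supports window functions in SQLite 3.25+ (bundled with all recent Python builds). The table and column names here are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 100), ('east', 300), ('west', 200), ('west', 50);
""")
# RANK each sale within its region; every input row survives in the output,
# unlike GROUP BY, which would collapse each region to a single row.
rows = conn.execute("""
    SELECT region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
""").fetchall()
# 4 rows out for 4 rows in, e.g. ('east', 300, 1) and ('east', 100, 2)
```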

Wrapper methods (Chapter 9): Feature selection techniques that evaluate feature subsets by training a model on each subset. More computationally expensive than filter methods, but they capture feature interactions. RFE is the most common wrapper method.

XGBoost (Chapter 14): eXtreme Gradient Boosting. A highly optimized gradient boosting library. Features: regularized objective, built-in handling of missing values, parallel tree construction, GPU training. The standard benchmark for tabular ML.

Z-score (Chapter 22): The number of standard deviations a data point is from the mean: z = (x - mu) / sigma. Points with |z| > 3 are commonly flagged as anomalies. Assumes approximately normal distribution.
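The rule fits in a few lines of standard-library Python. This sketch uses the sample standard deviation; with a known population sigma you would substitute `statistics.pstdev`.

```python
import statistics

def z_scores(values):
    """Standardize each value: z = (x - mean) / stdev."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(x - mu) / sigma for x in values]

def flag_anomalies(values, cutoff=3.0):
    """Indices of points more than `cutoff` standard deviations from the mean."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > cutoff]

flag_anomalies([10] * 30 + [100])  # flags the single extreme point
```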