Glossary
A comprehensive reference of key terms, concepts, and acronyms used throughout AI & Machine Learning for Business. Entries are arranged alphabetically. Cross-references in [Ch. X] notation point to the chapter where a term is discussed in greatest depth; many terms appear in additional chapters as well.
A/B testing. A controlled experiment in which two variants (A and B) of a product, feature, or model are shown to different user segments to determine which performs better on a defined metric. Widely used to validate ML-driven recommendations and personalization strategies. [Ch. 17]
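The statistical comparison behind an A/B test can be sketched with a two-proportion z-test; the conversion counts below are invented for illustration:

```python
import math

def ab_z_test(conv_a, n_a, conv_b, n_b):
    # Two-proportion z-test: how many standard errors apart
    # are the two observed conversion rates?
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Variant B converted 260 of 10,000 users vs. 200 of 10,000 for A.
z = ab_z_test(200, 10_000, 260, 10_000)
print(round(z, 2))  # 2.83 -- above ~1.96, significant at the 5% level
```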
Ablation study. An evaluation technique in which components of a model or pipeline are systematically removed to measure each component's contribution to overall performance. Useful for justifying architectural decisions in model development. [Ch. 12]
Accuracy. The proportion of correct predictions out of all predictions made by a classification model. While intuitive, accuracy can be misleading on imbalanced datasets where precision, recall, or F1-score may be more informative. [Ch. 10]
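A toy calculation shows why accuracy misleads on imbalanced data (the dataset here is fabricated):

```python
# 95 negatives, 5 positives -- a heavily imbalanced dataset.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a model that always predicts "negative"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95 -- looks strong, yet the model never finds a positive
```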
Activation function. A mathematical function applied to a neuron's weighted sum of inputs to introduce nonlinearity into a neural network. Common examples include ReLU, sigmoid, and tanh. [Ch. 13]
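The common activation functions named above are one-liners in Python:

```python
import math

def relu(x):
    # Rectified Linear Unit: zero for negative inputs, identity otherwise.
    return max(0.0, x)

def sigmoid(x):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

print(relu(-2.0), relu(3.0))    # 0.0 3.0
print(sigmoid(0.0))             # 0.5
print(math.tanh(0.0))           # 0.0 -- tanh squashes into (-1, 1)
```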
Active learning. A machine learning strategy in which the model identifies the most informative unlabeled data points and requests labels for those specifically, reducing annotation costs. Particularly valuable when labeled data is expensive or scarce. [Ch. 11]
Adversarial attack. A deliberate attempt to deceive a machine learning model by providing specially crafted input designed to cause incorrect predictions. Examples include imperceptible pixel perturbations that cause image classifiers to misclassify objects. [Ch. 36]
Adversarial robustness. The degree to which a model maintains correct predictions when subjected to adversarial attacks or deliberately manipulated inputs. A critical consideration for safety-critical deployments. [Ch. 36]
Agent (AI agent). An autonomous or semi-autonomous software system that uses AI to perceive its environment, make decisions, and take actions to achieve defined goals. Increasingly deployed in customer service, workflow automation, and complex reasoning tasks. [Ch. 30]
Agile methodology. An iterative project management framework that emphasizes incremental delivery, cross-functional collaboration, and rapid adaptation to change. Frequently adapted for AI/ML projects, though with modifications to accommodate data dependencies and experimentation cycles. [Ch. 22]
Algorithm. A finite sequence of well-defined instructions for solving a problem or performing a computation. In machine learning, algorithms define the procedure by which a model learns patterns from data. [Ch. 2]
Algorithmic auditing. The systematic examination of an algorithm's inputs, logic, and outputs to assess compliance with fairness standards, legal requirements, or organizational policies. May be conducted internally or by independent third parties. [Ch. 35]
Algorithmic bias. Systematic and repeatable errors in a computer system's output that create unfair outcomes, often reflecting historical prejudices present in training data or design choices. Distinguished from statistical bias, which refers to estimation error. [Ch. 34]
Amazon SageMaker. A fully managed cloud service from Amazon Web Services (AWS) that provides tools for building, training, and deploying machine learning models at scale. Includes built-in algorithms, notebook environments, and MLOps capabilities. [Ch. 26]
Annotation. The process of labeling raw data (images, text, audio) with metadata that a supervised learning model can use as ground truth during training. Quality and consistency of annotations directly affect model performance. [Ch. 8]
Anomaly detection. The identification of data points, events, or observations that deviate significantly from expected patterns. Applications include fraud detection, network intrusion detection, and predictive maintenance. [Ch. 11]
API (Application Programming Interface). A set of protocols and tools that allows different software applications to communicate with each other. ML models are commonly served via REST or gRPC APIs for real-time inference. [Ch. 25]
Attention mechanism. A neural network component that allows the model to dynamically focus on different parts of the input when producing each part of the output. The foundation of the Transformer architecture that powers modern large language models. [Ch. 14]
AUC-ROC (Area Under the Receiver Operating Characteristic Curve). A performance metric for binary classification models that measures the model's ability to distinguish between positive and negative classes across all possible decision thresholds. A value of 1.0 indicates perfect discrimination; 0.5 indicates random chance. [Ch. 10]
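AUC-ROC has an intuitive equivalent formulation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. A minimal sketch with made-up scores:

```python
def auc_roc(y_true, scores):
    # AUC = probability that a random positive outranks a random negative.
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```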
AutoML (Automated Machine Learning). A suite of techniques and tools that automate parts of the machine learning pipeline, including feature engineering, model selection, and hyperparameter tuning. Designed to make ML accessible to non-specialists while accelerating expert workflows. [Ch. 26]
Autonomous vehicle. A vehicle capable of sensing its environment and navigating without human input, relying on a combination of computer vision, sensor fusion, and reinforcement learning. Represents one of the most complex AI engineering challenges. [Ch. 31]
Azure Machine Learning. Microsoft's cloud-based platform for building, training, and deploying machine learning models, offering integration with the broader Azure ecosystem and enterprise tooling. [Ch. 26]
Backpropagation. The algorithm used to compute gradients of the loss function with respect to each weight in a neural network, enabling iterative weight updates during training. Works by applying the chain rule of calculus from the output layer backward through the network. [Ch. 13]
Bag of words. A text representation method that models a document as an unordered collection of word frequencies, discarding grammar and word order. Simple but often effective as a baseline for text classification tasks. [Ch. 14]
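A bag-of-words representation can be built with nothing but the standard library; the two documents below are invented:

```python
from collections import Counter

docs = ["the cat sat", "the cat ate the fish"]

# Vocabulary = every distinct word, in a fixed (sorted) order.
vocab = sorted({w for d in docs for w in d.split()})
# Each document becomes a vector of word counts over that vocabulary.
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)    # ['ate', 'cat', 'fish', 'sat', 'the']
print(vectors)  # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```

Note that word order is lost: "the cat ate the fish" and "the fish ate the cat" map to the same vector.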
Bagging (bootstrap aggregating). An ensemble learning technique that trains multiple models on different random subsets of the training data (sampled with replacement) and combines their predictions, typically by averaging or majority voting. Random Forest is the most prominent bagging method. [Ch. 11]
Batch inference. The process of generating predictions for a large set of inputs at once, typically on a scheduled basis, rather than responding to individual requests in real time. Common in scenarios like nightly recommendation updates or monthly risk scoring. [Ch. 25]
Batch normalization. A technique that normalizes the inputs to each layer of a neural network within a mini-batch, stabilizing and accelerating training. Reduces sensitivity to weight initialization and learning rate selection. [Ch. 13]
Batch size. The number of training examples processed together in one forward and backward pass during model training. Larger batch sizes increase computational efficiency but may affect generalization; smaller batches introduce more noise but can help escape local minima. [Ch. 12]
Bayesian optimization. A sequential strategy for optimizing expensive black-box functions that builds a probabilistic surrogate model of the objective and uses an acquisition function to select the next evaluation point. Widely used for hyperparameter tuning. [Ch. 12]
BERT (Bidirectional Encoder Representations from Transformers). A pre-trained language model developed by Google that reads text bidirectionally to build deep contextual word representations. Foundational to modern NLP tasks including question answering, sentiment analysis, and named entity recognition. [Ch. 14]
Bias (statistical). The systematic error introduced when a model's assumptions cause it to consistently miss the true relationship in the data. High bias typically results in underfitting. See also Algorithmic bias. [Ch. 10]
Bias-variance tradeoff. The fundamental tension in machine learning between a model's ability to fit training data closely (low bias) and its ability to generalize to unseen data (low variance). Optimal model complexity balances these two sources of error. [Ch. 10]
Big data. A term describing datasets that are too large, fast-moving, or complex for traditional data processing tools to handle effectively. Commonly characterized by the "three Vs": volume, velocity, and variety. [Ch. 7]
Binary classification. A supervised learning task in which the model assigns each input to one of exactly two classes (e.g., spam vs. not-spam, fraudulent vs. legitimate). [Ch. 9]
Black-box model. A model whose internal decision-making process is opaque or difficult for humans to interpret, such as deep neural networks or large ensembles. Contrasted with white-box or interpretable models. [Ch. 34]
Boosting. An ensemble method that sequentially trains weak learners, with each new model focusing on the errors made by its predecessors. Produces strong predictive performance; prominent implementations include AdaBoost, Gradient Boosting, XGBoost, and LightGBM. [Ch. 11]
Business case. A structured document or analysis that justifies an AI/ML investment by articulating the problem, proposed solution, expected benefits, costs, risks, and success criteria. Essential for securing executive sponsorship and funding. [Ch. 4]
Business intelligence (BI). The use of data analysis tools and techniques to transform raw data into actionable insights for business decision-making. AI/ML extends traditional BI by enabling predictive and prescriptive analytics. [Ch. 3]
Canary deployment. A deployment strategy in which a new model version is released to a small subset of users or traffic before being rolled out more broadly, allowing teams to detect issues early with minimal impact. [Ch. 25]
CCPA (California Consumer Privacy Act). A data privacy law enacted in California in 2018 granting consumers rights over their personal data, including the right to know what data is collected, request deletion, and opt out of data sales. Amended and expanded by the CPRA, which took effect in 2023. [Ch. 37]
Center of Excellence (CoE). A dedicated organizational unit that provides leadership, best practices, research, and support for AI/ML initiatives across an enterprise. Serves as a hub for talent development, standards, and knowledge sharing. [Ch. 22]
Chatbot. A software application that uses NLP and often generative AI to simulate human conversation through text or voice interfaces. Modern chatbots powered by large language models can handle complex, multi-turn dialogues. [Ch. 30]
Churn prediction. The use of machine learning to identify customers who are likely to stop using a product or service, enabling proactive retention interventions. A common and high-ROI application of classification models. [Ch. 17]
CI/CD (Continuous Integration / Continuous Delivery). A software engineering practice in which code changes are automatically built, tested, and prepared for release. In ML contexts, CI/CD pipelines extend to include data validation, model training, evaluation, and deployment steps. [Ch. 25]
Classification. A supervised learning task in which the model assigns input data to one of a set of predefined categories or classes. Examples include email spam detection, medical diagnosis, and sentiment analysis. [Ch. 9]
Class imbalance. A condition in which the classes in a classification dataset are not represented equally, with one class significantly outnumbering others. Requires specialized techniques such as oversampling, undersampling, SMOTE, or cost-sensitive learning to address. [Ch. 10]
Cloud computing. The delivery of computing resources (servers, storage, databases, networking, AI services) over the internet on a pay-as-you-go basis. The dominant infrastructure paradigm for training and deploying ML models at scale. [Ch. 26]
Clustering. An unsupervised learning technique that groups data points into clusters based on similarity, without predefined labels. Common algorithms include k-means, DBSCAN, and hierarchical clustering. [Ch. 11]
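As an illustration, a bare-bones one-dimensional k-means (real implementations handle many dimensions and choose initial centroids carefully; the data points here are invented):

```python
def kmeans(points, centroids, iters=10):
    # Repeat: assign each point to its nearest centroid,
    # then move each centroid to the mean of its assigned points.
    for _ in range(iters):
        clusters = {i: [] for i in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in clusters.items()]
    return centroids

result = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 5.0])
print(result)  # centroids settle near 1.0 and 9.0
```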
CNTK (Microsoft Cognitive Toolkit). A deprecated open-source deep learning framework developed by Microsoft. Largely superseded by PyTorch and TensorFlow. [Ch. 26]
Co-pilot (AI co-pilot). An AI assistant embedded in a workflow tool that augments human work by generating suggestions, drafting content, writing code, or surfacing relevant information in real time. Examples include GitHub Copilot and Microsoft 365 Copilot. [Ch. 30]
Cold start problem. The difficulty of making accurate recommendations or predictions for new users, items, or entities about which the system has no historical data. Common in recommendation systems and addressed through content-based methods or hybrid approaches. [Ch. 17]
Collaborative filtering. A recommendation technique that predicts a user's preferences based on the preferences of similar users or similar items, without requiring explicit feature engineering. [Ch. 17]
Computer vision. The field of AI concerned with enabling machines to interpret and understand visual information from images and video. Applications include object detection, facial recognition, medical imaging, and quality inspection. [Ch. 15]
Concept drift. A change in the statistical relationship between input features and the target variable over time, causing model performance to degrade. Distinct from data drift, which refers to changes in input distributions alone. [Ch. 25]
Confusion matrix. A table that summarizes the performance of a classification model by showing the counts of true positives, true negatives, false positives, and false negatives. The basis for computing accuracy, precision, recall, and F1-score. [Ch. 10]
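The four cells, and the metrics derived from them, can be computed directly; the labels below are fabricated:

```python
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, tn, fp, fn)        # 3 3 1 1
print(precision, recall)     # 0.75 0.75
```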
Constitutional AI. An alignment approach developed by Anthropic in which an AI model is trained to follow a set of principles (a "constitution") through self-critique and revision, aiming to produce helpful, harmless, and honest outputs. [Ch. 36]
Content-based filtering. A recommendation approach that suggests items similar to those a user has previously liked, based on item features rather than the behavior of other users. Often combined with collaborative filtering in hybrid systems. [Ch. 17]
Contrastive learning. A self-supervised learning technique that trains models to distinguish similar (positive) pairs of data points from dissimilar (negative) pairs, learning useful representations without labeled data. [Ch. 13]
Convolutional Neural Network (CNN). A class of deep neural networks designed primarily for processing structured grid data such as images, using convolutional layers that apply learned filters to detect local patterns like edges, textures, and shapes. [Ch. 15]
Copyrighted training data. Content protected by copyright law that is used to train AI models, raising legal and ethical questions about fair use, attribution, and compensation for creators. A rapidly evolving area of AI law. [Ch. 37]
Cost function. See Loss function.
Cost-sensitive learning. A machine learning approach that assigns different misclassification costs to different classes, particularly useful when errors on the minority class are more consequential than errors on the majority class. [Ch. 10]
Counterfactual explanation. An explainability method that describes the smallest change to the input features that would result in a different model prediction. Provides intuitive, actionable explanations (e.g., "If your income were $5,000 higher, the loan would have been approved"). [Ch. 34]
Cross-entropy loss. A loss function commonly used for classification tasks that measures the divergence between predicted probability distributions and actual class labels. Lower cross-entropy indicates better-calibrated predictions. [Ch. 13]
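A sketch of the binary case (the small eps term is a common numerical guard against log(0), not part of the mathematical definition):

```python
import math

def cross_entropy(y_true, p_pred, eps=1e-12):
    # Binary cross-entropy averaged over examples.
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for y, p in zip(y_true, p_pred)
    ) / len(y_true)

confident = cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
hesitant = cross_entropy([1, 0, 1], [0.6, 0.4, 0.5])
print(confident < hesitant)  # True -- sharper correct predictions score lower
```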
Cross-validation. A model evaluation technique that partitions the data into multiple folds, training and testing the model on different splits to produce a more robust estimate of performance. k-fold cross-validation (typically k = 5 or 10) is the most common variant. [Ch. 10]
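A minimal sketch of how k-fold index splits are constructed (contiguous folds for simplicity; production libraries typically shuffle first):

```python
def kfold(n, k):
    # Partition indices 0..n-1 into k folds; each fold serves once
    # as the test set while the remaining indices form the training set.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        splits.append((train, test))
        start += size
    return splits

for train, test in kfold(6, 3):
    print(test)  # [0, 1] then [2, 3] then [4, 5]
```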
CUDA (Compute Unified Device Architecture). A parallel computing platform and API developed by NVIDIA that enables developers to use GPUs for general-purpose computing, including deep learning training and inference. [Ch. 26]
Customer lifetime value (CLV/LTV). The predicted total revenue a business expects to earn from a customer over the entire duration of their relationship. ML models can estimate CLV to prioritize acquisition, retention, and personalization efforts. [Ch. 17]
Cybersecurity (AI for). The application of machine learning to detect, prevent, and respond to cyber threats, including intrusion detection, malware classification, phishing detection, and automated incident response. [Ch. 19]
DAG (Directed Acyclic Graph). A graph structure with directed edges and no cycles, used to represent dependencies in ML pipelines, workflow orchestration (e.g., Apache Airflow), and certain neural network architectures. [Ch. 25]
Dashboard. A visual display of key metrics and performance indicators, often powered by BI tools, that provides stakeholders with at-a-glance insight into model performance, business KPIs, or operational health. [Ch. 3]
Data augmentation. Techniques that artificially expand a training dataset by applying transformations (rotation, flipping, cropping, noise injection, synonym replacement) to existing data points, improving model robustness and reducing overfitting. [Ch. 15]
Data catalog. A centralized metadata repository that helps organizations discover, understand, and govern their data assets. Essential infrastructure for scaling AI/ML initiatives across the enterprise. [Ch. 7]
Data cleaning (data cleansing). The process of detecting and correcting (or removing) errors, inconsistencies, and inaccuracies in a dataset. Typically the most time-consuming phase of any ML project. [Ch. 8]
Data drift. A change in the statistical distribution of input features over time, which may degrade model performance even when the underlying relationships remain stable. Distinguished from concept drift. [Ch. 25]
Data engineering. The discipline of designing, building, and maintaining the infrastructure and pipelines that collect, store, transform, and deliver data for analytics and machine learning. [Ch. 7]
Data ethics. The branch of ethics that evaluates data practices and the moral implications of collecting, using, and sharing data, particularly concerning privacy, consent, fairness, and societal impact. [Ch. 34]
Data flywheel. A self-reinforcing cycle in which a product generates user data, that data improves the AI model, the improved model enhances the product, and the enhanced product attracts more users and data. A powerful competitive moat for AI-driven businesses. [Ch. 5]
Data governance. The organizational framework of policies, roles, processes, and standards that ensure data is managed as a strategic asset with appropriate quality, security, privacy, and compliance. [Ch. 7]
Data ingestion. The process of importing and loading data from various sources into a storage system or processing pipeline for downstream analysis and model training. [Ch. 7]
Data labeling. See Annotation.
Data lake. A centralized repository that stores structured, semi-structured, and unstructured data at any scale in its raw format. Provides flexibility for diverse analytical workloads but requires governance to avoid becoming a "data swamp." [Ch. 7]
Data lakehouse. A hybrid data architecture that combines the flexibility and scale of data lakes with the data management and ACID transaction capabilities of data warehouses. Examples include the Databricks Lakehouse Platform and architectures built on open table formats such as Apache Iceberg. [Ch. 7]
Data leakage. The inadvertent inclusion of information from the test set or future data in the training process, resulting in overly optimistic performance estimates that do not generalize to production. One of the most common and costly mistakes in ML projects. [Ch. 10]
Data lineage. The ability to trace data from its origin through all transformations and movements to its final use, enabling auditability, debugging, and regulatory compliance. [Ch. 7]
Data mesh. A decentralized data architecture paradigm that treats data as a product owned by domain teams rather than a centralized data team. Emphasizes domain ownership, data as a product, self-serve infrastructure, and federated governance. [Ch. 7]
Data pipeline. An automated sequence of steps that moves data from source systems through transformations to a destination (data warehouse, feature store, model training job). [Ch. 7]
Data privacy. The right of individuals to control how their personal information is collected, used, stored, and shared. A central concern in AI systems that process personal data. See also GDPR, CCPA. [Ch. 37]
Data product. A reusable data asset (dataset, API, dashboard, ML model) designed, documented, and maintained to serve downstream consumers reliably and at scale. [Ch. 7]
Data quality. The degree to which data is accurate, complete, consistent, timely, and fit for its intended purpose. Poor data quality is the most common cause of ML project failure. [Ch. 8]
Data science. An interdisciplinary field that uses statistical methods, algorithms, and domain expertise to extract knowledge and insights from structured and unstructured data. [Ch. 2]
Data strategy. An organization's comprehensive plan for how it will collect, manage, analyze, and leverage data to achieve its business objectives. A prerequisite for successful AI adoption. [Ch. 5]
Data warehouse. A centralized repository of integrated, structured data from multiple sources, optimized for analytical queries and reporting. Traditional backbone of business intelligence. [Ch. 7]
Data wrangling. See Data cleaning.
Databricks. A unified analytics platform built on Apache Spark that provides collaborative workspaces for data engineering, data science, and machine learning. [Ch. 26]
Dataset. A structured collection of data used for training, validating, or testing a machine learning model. The quality, size, and representativeness of the dataset are among the most important determinants of model performance. [Ch. 8]
Decision boundary. The surface or line in feature space that separates different classes as determined by a classification model. The shape and complexity of decision boundaries vary by algorithm. [Ch. 9]
Decision tree. A supervised learning algorithm that makes predictions by learning a series of if-then-else decision rules from the features in the data, forming a tree-like structure. Highly interpretable but prone to overfitting when grown deep. [Ch. 9]
Deep learning. A subfield of machine learning that uses neural networks with many layers (hence "deep") to learn hierarchical representations of data. Excels at tasks involving images, text, speech, and other unstructured data. [Ch. 13]
Deepfake. Synthetic media generated by deep learning in which a person's likeness, voice, or actions are convincingly fabricated or altered. Raises concerns about misinformation, fraud, and consent. [Ch. 36]
Demand forecasting. The use of statistical and machine learning methods to predict future customer demand for products or services, enabling better inventory management, pricing, and supply chain planning. [Ch. 18]
Deployment (model deployment). The process of integrating a trained ML model into a production environment where it can receive input data and return predictions to end users or downstream systems. [Ch. 25]
Descriptive analytics. Analysis that summarizes historical data to describe what has happened. The foundational level of the analytics maturity spectrum, preceding diagnostic, predictive, and prescriptive analytics. [Ch. 3]
DevOps. A set of practices that combines software development (Dev) and IT operations (Ops) to shorten development cycles and deliver software reliably. MLOps extends DevOps principles to machine learning. [Ch. 25]
Differential privacy. A mathematical framework that provides a formal guarantee that the output of a data analysis or ML model does not reveal whether any individual's data was included in the input dataset. Achieved by adding calibrated noise to data or model outputs. [Ch. 37]
Diffusion model. A class of generative models that learn to create data by iteratively denoising a random noise signal, producing high-quality images, audio, or video. The architecture behind Stable Diffusion, DALL-E 2, and similar systems. [Ch. 16]
Dimensionality reduction. Techniques that reduce the number of features in a dataset while preserving as much information as possible. Common methods include PCA, t-SNE, and UMAP. Useful for visualization, noise reduction, and computational efficiency. [Ch. 11]
Disparate impact. A legal doctrine in which a facially neutral policy or algorithm disproportionately disadvantages a protected group, even without discriminatory intent. The "four-fifths rule" is a common threshold for assessing disparate impact. [Ch. 35]
Distributed computing. The use of multiple interconnected computers to process data or train models in parallel, enabling work at scales that exceed the capacity of a single machine. Frameworks include Apache Spark, Horovod, and Ray. [Ch. 26]
Docker. A platform for building, shipping, and running applications in lightweight, isolated containers. Widely used to package ML models and their dependencies for reproducible deployment. [Ch. 25]
Domain adaptation. A transfer learning technique that adapts a model trained on one domain (source) to perform well on a different but related domain (target) where labeled data may be limited. [Ch. 13]
Drift detection. Automated monitoring techniques that identify when data drift or concept drift is occurring in production, triggering alerts or model retraining workflows. [Ch. 25]
Dropout. A regularization technique for neural networks in which randomly selected neurons are ignored (dropped out) during each training iteration, preventing over-reliance on any single neuron and reducing overfitting. [Ch. 13]
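A sketch of "inverted" dropout, the variant most frameworks implement, which rescales surviving activations during training so that inference needs no adjustment:

```python
import random

def dropout(activations, p=0.5, training=True, seed=None):
    # Zero each unit with probability p; scale survivors by 1/(1-p)
    # so the expected sum of activations is unchanged.
    if not training:
        return list(activations)
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else a / (1 - p) for a in activations]

# At inference time, dropout is disabled and inputs pass through untouched.
print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5, training=False))
```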
Early stopping. A regularization technique that halts model training when performance on a validation set stops improving, preventing overfitting to the training data. [Ch. 12]
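The patience-based loop most frameworks implement can be sketched as follows (the validation losses are invented):

```python
def best_epoch_with_early_stopping(val_losses, patience=2):
    # Stop once validation loss has failed to improve for
    # `patience` consecutive epochs; return the best epoch seen.
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

print(best_epoch_with_early_stopping([0.9, 0.7, 0.6, 0.65, 0.66, 0.5]))  # 2
```

Training halts after two non-improving epochs and never sees the late improvement at epoch 5 -- the tradeoff that the patience setting controls.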
Edge computing (edge AI). The deployment of AI models on devices at or near the data source (smartphones, IoT devices, cameras) rather than in centralized cloud data centers. Reduces latency, bandwidth costs, and privacy risks. [Ch. 26]
Embedding. A dense, low-dimensional vector representation of a discrete entity (word, sentence, image, user, product) that captures semantic meaning or similarity in a continuous space. Words with similar meanings have embeddings that are close together. [Ch. 14]
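Similarity between embeddings is typically measured with cosine similarity; the 3-dimensional vectors below are toy stand-ins for real embeddings, which usually have hundreds of dimensions:

```python
import math

def cosine(u, v):
    # Cosine of the angle between two vectors: 1 = same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}
# Semantically related words sit closer together in embedding space.
print(cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["apple"]))  # True
```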
Encoder-decoder architecture. A neural network design in which an encoder compresses input data into a latent representation and a decoder generates output from that representation. Used in machine translation, summarization, and image segmentation. [Ch. 14]
Ensemble learning. A meta-approach that combines predictions from multiple models to produce a result that is more accurate and robust than any individual model. Includes bagging, boosting, and stacking methods. [Ch. 11]
Entity extraction. See Named Entity Recognition (NER).
Epoch. One complete pass through the entire training dataset during the training of a machine learning model. Models are typically trained for multiple epochs until convergence. [Ch. 12]
Ethical AI. The practice of designing, developing, and deploying AI systems in accordance with ethical principles such as fairness, transparency, accountability, privacy, and societal benefit. [Ch. 34]
ETL (Extract, Transform, Load). A data integration process that extracts data from source systems, transforms it into a suitable format, and loads it into a target system such as a data warehouse. Increasingly replaced or supplemented by ELT in modern cloud architectures. [Ch. 7]
EU AI Act. A comprehensive regulatory framework adopted by the European Union that classifies AI systems by risk level (unacceptable, high, limited, minimal) and imposes corresponding requirements for transparency, safety, human oversight, and documentation. The world's first major AI-specific legislation. [Ch. 37]
Evaluation metric. A quantitative measure used to assess the performance of a machine learning model. The choice of metric should align with the business objective. See also Accuracy, Precision, Recall, F1-score, AUC-ROC, RMSE. [Ch. 10]
Executive sponsor. A senior leader who champions an AI initiative, provides strategic direction, secures resources, and removes organizational barriers. Critical for AI project success. [Ch. 22]
Experiment tracking. The systematic logging and comparison of model training runs, including hyperparameters, code versions, datasets, and performance metrics. Tools include MLflow, Weights & Biases, and Neptune. [Ch. 25]
Explainability. The degree to which a human can understand the reasoning behind a model's predictions or decisions. Encompasses both global explanations (how the model works overall) and local explanations (why a specific prediction was made). See also Interpretability. [Ch. 34]
Exploratory Data Analysis (EDA). The initial investigation of a dataset using summary statistics, visualizations, and data profiling to discover patterns, anomalies, and relationships before formal modeling. [Ch. 8]
Extrapolation. Making predictions for data points outside the range of the training data distribution. Most ML models are unreliable when extrapolating beyond their training domain. [Ch. 10]
F1-score. The harmonic mean of precision and recall, providing a single metric that balances both. Particularly useful when class distributions are uneven and both false positives and false negatives carry cost. [Ch. 10]
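The harmonic mean is a one-line computation, and it shows how heavily the metric punishes an imbalance between precision and recall:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.9, 0.9), 2))   # 0.9  -- balanced inputs, F1 matches them
print(round(f1(1.0, 0.1), 3))   # 0.182 -- far below the arithmetic mean of 0.55
```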
Facial recognition. A computer vision technology that identifies or verifies a person's identity by analyzing facial features in images or video. Subject to significant regulatory scrutiny due to accuracy disparities across demographics and privacy concerns. [Ch. 35]
Feature. An individual measurable property or characteristic of a data point used as input to a machine learning model. Also called a variable, attribute, or predictor. Feature quality is often more important than model complexity. [Ch. 8]
Feature engineering. The process of using domain knowledge to create, transform, or select input features that improve model performance. Traditionally the most impactful activity in applied ML, though increasingly automated by deep learning and AutoML. [Ch. 8]
Feature importance. A measure of how much each input feature contributes to a model's predictions. Methods include permutation importance, Gini importance (for tree-based models), and SHAP values. [Ch. 34]
Feature selection. The process of identifying and retaining only the most relevant features from a dataset, removing redundant or uninformative variables to improve model performance, reduce overfitting, and decrease training time. [Ch. 8]
Feature store. A centralized repository for storing, managing, and serving machine learning features, ensuring consistency between training and inference and enabling feature reuse across models and teams. Examples include Feast and Tecton. [Ch. 25]
Federated learning. A distributed machine learning approach in which models are trained across multiple decentralized devices or institutions without exchanging raw data, preserving privacy. Each participant trains on local data and shares only model updates. [Ch. 37]
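The core server-side step, federated averaging (FedAvg), can be sketched in a few lines; the client weights and dataset sizes below are invented:

```python
def federated_average(client_weights, client_sizes):
    # The server combines client model parameters weighted by how much
    # local data each client trained on; raw data never leaves the clients.
    total = sum(client_sizes)
    dims = len(client_weights[0])
    return [
        sum(w[d] * n for w, n in zip(client_weights, client_sizes)) / total
        for d in range(dims)
    ]

# Client B trained on 3x more data, so its parameters count 3x as much.
print(federated_average([[1.0, 2.0], [3.0, 4.0]], [100, 300]))  # [2.5, 3.5]
```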
Few-shot learning. The ability of a model to learn a new task from only a small number of labeled examples, often by leveraging pre-trained knowledge. In the context of LLMs, few-shot learning refers to providing a handful of examples in the prompt. [Ch. 16]
Fine-tuning. The process of taking a pre-trained model and continuing its training on a smaller, task-specific dataset to adapt it for a particular domain or application. Requires significantly less data and compute than training from scratch. [Ch. 16]
Foundation model. A large AI model trained on broad, diverse data at scale that can be adapted (via fine-tuning, prompting, or other techniques) to a wide range of downstream tasks. Examples include GPT-4, Claude, Gemini, and Llama. [Ch. 16]
Foundational model. See Foundation model.
Fraud detection. The application of machine learning to identify potentially fraudulent transactions, claims, or activities by learning patterns that distinguish legitimate behavior from anomalous or deceptive behavior. [Ch. 19]
Full-stack AI. An approach to AI development in which a team or platform covers the entire lifecycle from data ingestion and model training through deployment, monitoring, and business integration. [Ch. 22]
GAN (Generative Adversarial Network). A generative model architecture consisting of two neural networks — a generator that creates synthetic data and a discriminator that evaluates its authenticity — trained in competition with each other. Used for image generation, data augmentation, and style transfer. [Ch. 16]
GDPR (General Data Protection Regulation). A comprehensive data privacy regulation enacted by the European Union in 2018 that governs the collection, processing, and storage of personal data of EU residents. Grants individuals rights including access, rectification, erasure, and data portability, and requires lawful bases for processing. [Ch. 37]
Generalization. A model's ability to perform well on new, unseen data that was not part of the training set. The ultimate goal of machine learning; poor generalization indicates overfitting. [Ch. 10]
Generative AI. AI systems capable of creating new content — text, images, code, music, video — rather than merely classifying or predicting based on existing data. Powered by foundation models and architectures such as Transformers, diffusion models, and GANs. [Ch. 16]
GPU (Graphics Processing Unit). A specialized processor originally designed for rendering graphics but now widely used for the parallel matrix computations required by deep learning training and inference. NVIDIA GPUs with CUDA support dominate the ML hardware market. [Ch. 26]
Gradient. The vector of partial derivatives of the loss function with respect to each model parameter, indicating the direction and magnitude of change needed to reduce the loss. The basis of gradient-based optimization methods. [Ch. 13]
Gradient boosting. An ensemble technique that builds models sequentially, with each new model trained to correct the residual errors of the previous models using gradient descent on the loss function. XGBoost, LightGBM, and CatBoost are widely used implementations. [Ch. 11]
Gradient descent. An iterative optimization algorithm that adjusts model parameters in the direction that reduces the loss function, using the computed gradient to determine the step direction and learning rate to determine the step size. [Ch. 12]
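To make the gradient-descent update rule concrete, here is a minimal sketch (our own illustrative Python, not code from the book) on the one-variable loss f(w) = (w - 3)^2, whose gradient is 2(w - 3):

```python
# Minimal gradient descent on the toy loss f(w) = (w - 3)^2,
# whose minimum is at w = 3 and whose gradient is f'(w) = 2 * (w - 3).
def gradient_descent(start, learning_rate=0.1, steps=100):
    w = start
    for _ in range(steps):
        grad = 2 * (w - 3)          # gradient of the loss at the current w
        w -= learning_rate * grad   # step against the gradient
    return w

print(round(gradient_descent(start=0.0), 4))  # converges to 3.0
```

Each step shrinks the distance to the minimum by a constant factor here; with a learning rate above 1.0 the same loop would diverge.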
Graph neural network (GNN). A class of neural networks designed to operate on graph-structured data, learning representations of nodes, edges, and entire graphs. Applications include social network analysis, drug discovery, and knowledge graph completion. [Ch. 13]
Green AI. A research direction and practice focused on reducing the environmental and computational costs of AI, including energy consumption and carbon emissions associated with training and running large models. [Ch. 38]
Grid search. A hyperparameter tuning method that exhaustively evaluates all combinations of specified hyperparameter values. Simple but computationally expensive; often replaced by random search or Bayesian optimization for large search spaces. [Ch. 12]
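A sketch of the exhaustive search just described (our own toy example; `score()` is a hypothetical stand-in for training and validating a model with the given settings):

```python
import itertools

# Toy grid search: score every combination of the listed hyperparameter
# values and keep the best. score() stands in for a full train/evaluate run.
def score(learning_rate, batch_size):
    return -(learning_rate - 0.1) ** 2 - (batch_size - 32) ** 2 / 1000.0

grid = {"learning_rate": [0.01, 0.1, 1.0], "batch_size": [16, 32, 64]}
combos = itertools.product(grid["learning_rate"], grid["batch_size"])
best = max(combos, key=lambda c: score(*c))
print(best)  # → (0.1, 32)
```

Note the cost: 3 x 3 = 9 evaluations here, but the combination count grows multiplicatively with each added hyperparameter, which is why random search or Bayesian optimization is preferred for large spaces.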
Grounding. Techniques that connect a language model's outputs to verifiable external sources of information (databases, documents, knowledge graphs), reducing hallucinations and improving factual accuracy. See also Retrieval-Augmented Generation. [Ch. 30]
Guardrails. Safety mechanisms, rules, or filters applied to AI systems to prevent harmful, biased, off-topic, or otherwise undesirable outputs. May be implemented through prompt engineering, output filtering, or specialized safety models. [Ch. 36]
Hallucination. A phenomenon in which a generative AI model produces output that is factually incorrect, fabricated, or nonsensical, presented with apparent confidence. A significant challenge for deploying LLMs in business-critical applications. [Ch. 30]
Heuristic. A practical rule of thumb or simplified strategy used to make decisions or solve problems when an optimal solution is computationally impractical. Often used as baselines against which ML models are compared. [Ch. 9]
Hidden layer. Any layer in a neural network that is not the input or output layer. Hidden layers learn increasingly abstract representations of the data as depth increases. [Ch. 13]
Holdout set. A portion of the dataset set aside and not used during training, reserved for evaluating model performance. Typically divided into validation and test sets. [Ch. 10]
Hugging Face. An open-source platform and community that provides pre-trained models, datasets, and tools for NLP, computer vision, and generative AI. The Transformers library is its most prominent contribution. [Ch. 26]
Human-in-the-loop (HITL). A design pattern in which human judgment is integrated into an AI workflow, typically to review, correct, or approve model predictions before they are acted upon. Common in high-stakes domains such as healthcare, finance, and criminal justice. [Ch. 34]
Hybrid AI. An approach that combines multiple AI techniques (e.g., neural networks with symbolic reasoning, or ML models with rule-based systems) to leverage the strengths of each. [Ch. 40]
Hyperparameter. A configuration parameter set before model training begins (as opposed to model parameters, which are learned during training). Examples include learning rate, number of layers, batch size, and regularization strength. [Ch. 12]
Hyperparameter tuning (hyperparameter optimization). The process of systematically searching for the hyperparameter values that yield the best model performance. Methods include grid search, random search, Bayesian optimization, and Hyperband. [Ch. 12]
Hypothesis testing. A statistical method for making inferences about a population based on sample data, used in data analysis to determine whether observed patterns are statistically significant or could be due to chance. [Ch. 3]
Image classification. A computer vision task in which a model assigns an image to one or more predefined categories. One of the earliest and most successful applications of deep learning. [Ch. 15]
Image segmentation. A computer vision task that partitions an image into meaningful regions, assigning a class label to each pixel (semantic segmentation) or distinguishing individual object instances (instance segmentation). [Ch. 15]
Imbalanced data. See Class imbalance.
Imputation. The process of replacing missing values in a dataset with estimated values, using methods such as mean, median, mode substitution, or more sophisticated techniques like KNN imputation or model-based imputation. [Ch. 8]
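The simplest of these methods, mean imputation, can be written in a few lines (our own illustrative sketch):

```python
# Mean imputation by hand: replace missing values (None) with the mean
# of the observed values in the same column.
def impute_mean(column):
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

print(impute_mean([10.0, None, 14.0, None, 12.0]))
# → [10.0, 12.0, 14.0, 12.0, 12.0]
```

Mean imputation preserves the column average but shrinks its variance, which is one reason model-based methods are often preferred when missingness is extensive.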
Inference. The process of using a trained model to generate predictions on new, unseen data. In production systems, inference latency, throughput, and cost are key operational considerations. [Ch. 25]
Inference optimization. Techniques that reduce the computational cost and latency of model inference, including quantization, pruning, knowledge distillation, and hardware acceleration. Critical for deploying models at scale or on edge devices. [Ch. 26]
Information retrieval. The science of searching for and retrieving relevant information from large collections of documents or data. Underpins search engines and the retrieval component of RAG systems. [Ch. 14]
Infrastructure as Code (IaC). The practice of managing and provisioning computing infrastructure through machine-readable configuration files rather than manual processes. Tools like Terraform and CloudFormation enable reproducible ML infrastructure. [Ch. 25]
Innovation lab. A dedicated organizational unit or physical space where teams experiment with emerging technologies, including AI/ML, in a lower-risk environment before scaling successful initiatives. [Ch. 22]
Intelligent document processing (IDP). The use of AI techniques including OCR, NLP, and computer vision to extract, classify, and validate information from unstructured and semi-structured documents such as invoices, contracts, and forms. [Ch. 19]
Interpretability. The degree to which a human can understand the cause of a model's decision. Often used interchangeably with explainability, though some authors distinguish between inherently interpretable models and post-hoc explanation methods. [Ch. 34]
IoT (Internet of Things). A network of physical devices embedded with sensors, software, and connectivity that enables them to collect and exchange data. IoT devices generate massive data streams that can be analyzed by ML models for predictive maintenance, environmental monitoring, and smart automation. [Ch. 31]
Jupyter notebook. An open-source, web-based interactive computing environment that allows users to create documents containing live code, equations, visualizations, and narrative text. The standard tool for data exploration and model prototyping. [Ch. 8]
K-means clustering. An unsupervised learning algorithm that partitions data into k clusters by iteratively assigning each point to the nearest cluster centroid and updating centroids based on cluster membership. Simple, fast, but requires specifying k in advance. [Ch. 11]
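One assign-and-update round of the algorithm, sketched on one-dimensional data (our own illustration; real implementations repeat this until the centroids stop moving):

```python
# One round of k-means on 1-D data: assign each point to its nearest
# centroid, then move each centroid to the mean of its assigned points.
def kmeans_step(points, centroids):
    clusters = {c: [] for c in centroids}
    for p in points:
        nearest = min(centroids, key=lambda c: abs(p - c))
        clusters[nearest].append(p)
    return [sum(members) / len(members) for members in clusters.values()]

print(kmeans_step([1.0, 2.0, 9.0, 10.0], centroids=[0.0, 5.0]))  # → [1.5, 9.5]
```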
K-nearest neighbors (KNN). A simple, non-parametric algorithm that classifies a data point based on the majority class among its k nearest neighbors in feature space. Easy to understand but computationally expensive at scale. [Ch. 9]
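A minimal KNN classifier looks like this (our own sketch, with made-up 2-D training points):

```python
import math
from collections import Counter

# Predict the majority label among the k training points closest in
# Euclidean distance. train is a list of ((x, y), label) pairs.
def knn_predict(train, point, k=3):
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, (2, 2)))  # → A
```

Note there is no training step at all: every prediction scans the full training set, which is exactly why KNN becomes expensive at scale.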
Kaggle. An online platform for data science competitions, datasets, and collaborative notebooks. Frequently used for benchmarking ML techniques and developing talent. [Ch. 2]
Keras. A high-level deep learning API that runs on top of TensorFlow, providing a user-friendly interface for building and training neural networks. Known for its simplicity and ease of prototyping. [Ch. 13]
Kernel (SVM). A function that maps input data into a higher-dimensional space where it becomes linearly separable. Common kernels include linear, polynomial, and radial basis function (RBF). [Ch. 9]
Knowledge distillation. A model compression technique in which a smaller "student" model is trained to replicate the behavior of a larger "teacher" model, achieving much of the teacher's performance at a fraction of the computational cost. [Ch. 26]
Knowledge graph. A structured representation of real-world entities and their relationships, typically stored as a graph of nodes (entities) and edges (relationships). Used to enhance search, recommendation systems, and LLM grounding. [Ch. 14]
KPI (Key Performance Indicator). A quantifiable measure used to evaluate the success of an organization, project, or individual in meeting objectives. AI project KPIs should bridge model performance metrics and business outcomes. [Ch. 4]
Kubernetes. An open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications, including ML model serving infrastructure. [Ch. 25]
Label. The known outcome or target variable associated with a training example in supervised learning. The model learns to predict labels for new, unlabeled data. [Ch. 9]
Label noise. Errors or inconsistencies in the labels assigned to training data, which can degrade model performance. Can result from annotator mistakes, ambiguous guidelines, or adversarial labeling. [Ch. 8]
Labeled data. Data for which the target variable (outcome) is known, enabling supervised learning. Obtaining sufficient high-quality labeled data is often the primary bottleneck in ML projects. [Ch. 8]
Large Language Model (LLM). A neural network with billions of parameters trained on vast text corpora to understand and generate human language. LLMs power chatbots, code assistants, content generation, and a wide range of language understanding tasks. Examples include GPT-4, Claude, Gemini, and Llama. [Ch. 16]
Latency. The time delay between submitting an input to a system and receiving a response. In ML serving, inference latency is a critical performance metric, often measured in milliseconds. [Ch. 25]
Latent space. A compressed, abstract representation of data learned by a model, where similar inputs map to nearby points. The "hidden" space in autoencoders, VAEs, and diffusion models where generation and interpolation occur. [Ch. 16]
Learning rate. A hyperparameter that controls the step size during gradient descent optimization. Too large a learning rate can cause training to diverge; too small a rate can cause training to converge slowly or get stuck in local minima. [Ch. 12]
LIME (Local Interpretable Model-agnostic Explanations). An explainability technique that approximates a complex model's behavior around a specific prediction with a simpler, interpretable model. Provides local explanations that show which features most influenced a particular decision. [Ch. 34]
Linear regression. A supervised learning algorithm that models the relationship between a dependent variable and one or more independent variables as a linear equation. One of the simplest and most interpretable ML algorithms. [Ch. 9]
LLMOps. The practice of operationalizing large language models in production, including prompt management, fine-tuning pipelines, evaluation frameworks, cost optimization, and safety monitoring. An emerging specialization within MLOps. [Ch. 30]
Logistic regression. A supervised learning algorithm for binary classification that models the probability of class membership using a logistic (sigmoid) function applied to a linear combination of features. Despite its name, it is a classification algorithm, not a regression algorithm. [Ch. 9]
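The sigmoid at the heart of logistic regression is a one-liner (our own illustration; z would be the model's weighted sum of features):

```python
import math

# The logistic (sigmoid) function squashes any linear score z into a
# probability between 0 and 1.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))            # → 0.5  (the decision boundary)
print(round(sigmoid(4.0), 3))  # a strongly positive score maps near 1
```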
Long Short-Term Memory (LSTM). A type of recurrent neural network designed to learn long-range dependencies in sequential data by using gating mechanisms that control information flow. Widely used for time series, text, and speech before the Transformer architecture. [Ch. 14]
LoRA (Low-Rank Adaptation). A parameter-efficient fine-tuning technique that adds small, trainable low-rank matrices to a frozen pre-trained model, dramatically reducing the memory and compute required for fine-tuning LLMs. [Ch. 16]
Loss function (cost function, objective function). A mathematical function that quantifies the difference between a model's predictions and the actual target values. Training seeks to minimize this function. Examples include mean squared error (regression) and cross-entropy (classification). [Ch. 12]
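The two example losses written out directly (our own sketch):

```python
import math

# Mean squared error for regression; binary cross-entropy for classification.
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred):
    # y_true holds 0/1 labels; p_pred holds predicted probabilities of class 1
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / len(y_true)

print(mean_squared_error([3.0, 5.0], [2.5, 5.5]))  # → 0.25
```

Cross-entropy rewards confident correct probabilities and punishes confident wrong ones heavily, which is why it, rather than accuracy, is what classification training actually minimizes.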
MAE (Mean Absolute Error). A regression metric that measures the average absolute difference between predicted and actual values. Less sensitive to outliers than RMSE. [Ch. 10]
Machine learning (ML). A subset of artificial intelligence in which algorithms learn patterns and relationships from data without being explicitly programmed, improving their performance through experience. [Ch. 2]
Machine learning engineer. A professional who bridges data science and software engineering, responsible for building, deploying, and maintaining ML systems in production. [Ch. 22]
MAPE (Mean Absolute Percentage Error). A regression metric that expresses prediction error as a percentage of the actual value, making it scale-independent. Useful for business communication but undefined when actual values are zero. [Ch. 10]
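MAE and MAPE follow directly from their definitions (our own sketch, with made-up forecast numbers):

```python
# Average absolute error, and the same error expressed as a percentage
# of the actual value.
def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    # undefined when any actual value is zero
    return 100.0 * sum(abs(a - p) / abs(a)
                       for a, p in zip(actual, predicted)) / len(actual)

actual, predicted = [100.0, 200.0], [110.0, 190.0]
print(mae(actual, predicted))            # → 10.0
print(round(mape(actual, predicted), 2)) # → 7.5
```

The example shows why the two can disagree: both forecasts miss by 10 units, but the miss on the larger actual value counts for half as much in percentage terms.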
Markov chain. A stochastic model describing a sequence of events in which the probability of each event depends only on the state of the previous event (memoryless property). Foundational concept in reinforcement learning and MCMC sampling. [Ch. 11]
Markov Decision Process (MDP). A mathematical framework for modeling sequential decision-making in environments where outcomes are partly random and partly under the agent's control. The formal basis for reinforcement learning. [Ch. 13]
Matrix factorization. A mathematical technique that decomposes a large matrix into lower-dimensional matrices, used in recommendation systems to discover latent factors that explain user-item interactions. [Ch. 17]
Metadata. Data that describes other data, such as schema information, data types, creation dates, and lineage. Proper metadata management is essential for data governance and ML reproducibility. [Ch. 7]
Metric learning. A class of ML techniques that learn a distance function or embedding space in which similar items are close together and dissimilar items are far apart. Used in face verification, image retrieval, and few-shot learning. [Ch. 15]
Microservices. An architectural pattern in which a software application is composed of small, independently deployable services, each running its own process and communicating via APIs. ML models are often deployed as individual microservices. [Ch. 25]
Mini-batch. A subset of the training data used in one iteration of gradient descent. Mini-batch gradient descent is the standard practice in deep learning, balancing the stability of full-batch methods with the speed of stochastic methods. [Ch. 12]
Minimum Viable Product (MVP). The simplest version of a product or model that delivers enough value to validate the core hypothesis and gather user feedback. In AI projects, an MVP might use a simple model or rule-based approach before investing in complex ML. [Ch. 4]
MLflow. An open-source platform for managing the end-to-end ML lifecycle, including experiment tracking, model registry, and deployment. [Ch. 25]
MLOps (Machine Learning Operations). The set of practices, tools, and cultural norms that apply DevOps principles to the ML lifecycle, enabling reliable, automated, and reproducible model development, deployment, and monitoring. [Ch. 25]
Model card. A standardized document that accompanies a trained model, providing details about its intended use, performance metrics, training data, ethical considerations, and limitations. Promotes transparency and responsible use. [Ch. 34]
Model compression. Techniques that reduce the size and computational requirements of a model while preserving as much performance as possible. Includes quantization, pruning, knowledge distillation, and low-rank approximation. [Ch. 26]
Model drift. See Concept drift, Data drift.
Model registry. A centralized repository for managing model versions, metadata, and lifecycle stages (staging, production, archived). Facilitates governance, reproducibility, and collaboration. [Ch. 25]
Model serving. The infrastructure and processes that host trained models and handle incoming prediction requests, managing concerns like load balancing, autoscaling, batching, and latency. [Ch. 25]
Model validation. The process of assessing whether a trained model meets performance, fairness, and robustness requirements before it is deployed to production. May include statistical tests, bias audits, and stress testing. [Ch. 10]
Monitoring (model monitoring). The continuous tracking of a deployed model's performance, data quality, and operational health to detect degradation, drift, or anomalies that warrant intervention. [Ch. 25]
Monte Carlo simulation. A computational technique that uses repeated random sampling to estimate the probability distribution of uncertain outcomes. Used in risk assessment, scenario planning, and reinforcement learning. [Ch. 3]
Multi-armed bandit. A reinforcement learning framework for balancing exploration (trying new options) and exploitation (selecting the best-known option) when making sequential decisions under uncertainty. Used in dynamic A/B testing and ad placement. [Ch. 17]
Multi-class classification. A supervised learning task in which the model assigns each input to one of three or more classes (e.g., categorizing customer support tickets by topic). [Ch. 9]
Multi-label classification. A supervised learning task in which each input can be assigned to multiple classes simultaneously (e.g., tagging an article with multiple relevant topics). [Ch. 9]
Multi-modal AI. AI systems that can process and reason across multiple types of data (text, images, audio, video) simultaneously. GPT-4V and Gemini are examples of multi-modal foundation models. [Ch. 40]
Multi-task learning. A training approach in which a single model is trained to perform multiple related tasks simultaneously, sharing representations and potentially improving performance on all tasks through shared learning. [Ch. 13]
MVP. See Minimum Viable Product.
Naive Bayes. A family of probabilistic classifiers based on Bayes' theorem that assume independence among features. Despite this strong simplifying assumption, Naive Bayes often performs surprisingly well on text classification and other tasks. [Ch. 9]
Named Entity Recognition (NER). An NLP task that identifies and classifies named entities in text into predefined categories such as person, organization, location, date, and monetary value. [Ch. 14]
Natural Language Generation (NLG). The process of producing human-readable text from structured data or internal representations. A core capability of large language models. [Ch. 14]
Natural Language Processing (NLP). A subfield of AI focused on enabling computers to understand, interpret, and generate human language. Encompasses tasks such as translation, summarization, sentiment analysis, and question answering. [Ch. 14]
Natural Language Understanding (NLU). The subset of NLP concerned with machine reading comprehension — extracting meaning, intent, and context from human language input. [Ch. 14]
Network effect. A phenomenon in which the value of a product or service increases as more people use it. AI platforms exhibit network effects through data flywheels: more users generate more data, which improves models, which attracts more users. [Ch. 5]
Neural Architecture Search (NAS). An AutoML technique that uses optimization methods (reinforcement learning, evolutionary algorithms) to automatically discover optimal neural network architectures for a given task. [Ch. 26]
Neural network. A computing system inspired by biological neural networks, consisting of interconnected nodes (neurons) organized in layers that learn to transform inputs into outputs through training. The foundation of deep learning. [Ch. 13]
Neuron (node). The fundamental unit of a neural network that receives inputs, applies weights and a bias, passes the result through an activation function, and produces an output. [Ch. 13]
No-code / low-code ML. Platforms and tools that enable users to build, train, and deploy machine learning models with minimal or no programming, using visual interfaces and pre-built components. Democratizes ML but may limit flexibility. [Ch. 26]
Noise. Random variation or errors in data that do not represent the underlying pattern the model is trying to learn. Models that fit noise are said to be overfitting. [Ch. 10]
Normalization. The process of scaling numeric features to a standard range (e.g., 0 to 1) or standard distribution (mean 0, standard deviation 1). Improves convergence and performance for many ML algorithms. [Ch. 8]
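Both scalings mentioned above fit in a few lines (our own sketch):

```python
import statistics

# Min-max scaling to [0, 1], and z-score standardization (mean 0, std 1).
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    mean, std = statistics.mean(values), statistics.pstdev(values)
    return [(v - mean) / std for v in values]

print(min_max([10, 20, 30]))  # → [0.0, 0.5, 1.0]
```

In practice the scaling parameters (min/max or mean/std) must be computed on the training set only and reused at inference time, or the model will see subtly different feature distributions in production.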
Object detection. A computer vision task that identifies and locates objects within an image by drawing bounding boxes around them and assigning class labels. YOLO, Faster R-CNN, and DETR are prominent architectures. [Ch. 15]
Objective function. See Loss function.
OCR (Optical Character Recognition). The technology that converts images of text (handwritten or printed) into machine-readable text. A key component of intelligent document processing pipelines. [Ch. 19]
One-hot encoding. A representation method that converts categorical variables into binary vectors, where each category is represented by a vector with a single 1 and all other values 0. Simple but can create high-dimensional, sparse representations for categories with many levels. [Ch. 8]
Online learning. A training paradigm in which a model is updated incrementally as new data arrives, rather than being retrained from scratch on the entire dataset. Suitable for streaming data and rapidly changing environments. [Ch. 25]
Open-source AI. AI models, tools, and frameworks whose source code and/or model weights are publicly available for use, modification, and distribution. Examples include PyTorch, Hugging Face Transformers, and Meta's Llama models. [Ch. 26]
Operationalize. To transition an AI/ML model from a research or development environment into a production system where it delivers business value reliably and at scale. See also MLOps. [Ch. 25]
Optimization. The mathematical process of finding the parameters that minimize (or maximize) an objective function. In ML, optimization typically refers to minimizing a loss function during training. [Ch. 12]
Outlier. A data point that is significantly different from other observations in a dataset. Outliers may represent errors, rare events, or genuinely unusual cases, and their treatment can significantly affect model performance. [Ch. 8]
Overfitting. A condition in which a model learns the training data too well — including noise and idiosyncrasies — resulting in excellent training performance but poor generalization to new data. Addressed through regularization, cross-validation, early stopping, and more data. [Ch. 10]
PaaS (Platform as a Service). A cloud computing model that provides a platform for developing, running, and managing applications without managing the underlying infrastructure. AI/ML PaaS offerings include managed notebook environments, training clusters, and model hosting. [Ch. 26]
Parameter. A value internal to the model that is learned from training data. In a neural network, weights and biases are parameters. Distinguished from hyperparameters, which are set before training. [Ch. 12]
Parameter-efficient fine-tuning (PEFT). A family of techniques for adapting large pre-trained models using a small number of trainable parameters, reducing computational cost. Includes LoRA, prefix tuning, and adapter layers. [Ch. 16]
PCA (Principal Component Analysis). A linear dimensionality reduction technique that transforms features into a new set of orthogonal variables (principal components) ordered by the amount of variance they explain. [Ch. 11]
Perceptron. The simplest form of a neural network, consisting of a single neuron that computes a weighted sum of inputs and applies a threshold activation function. The building block from which more complex architectures evolved. [Ch. 13]
Personally Identifiable Information (PII). Any data that can be used to identify a specific individual, including name, email, Social Security number, biometric data, and IP address. PII handling is central to data privacy regulations. [Ch. 37]
Pipeline. A sequence of automated processing steps in ML that chains together data preprocessing, feature engineering, model training, evaluation, and deployment. Ensures reproducibility and reduces manual intervention. [Ch. 25]
Platform engineering. The practice of designing and building self-service toolchains and workflows for software development teams. In AI/ML, platform engineering provides standardized infrastructure for data scientists and ML engineers. [Ch. 26]
POC (Proof of Concept). A small-scale, time-limited project designed to demonstrate the feasibility and potential value of an AI/ML approach before committing to full-scale development. [Ch. 4]
Pooling. A downsampling operation in convolutional neural networks that reduces the spatial dimensions of feature maps, providing translation invariance and reducing computational cost. Max pooling and average pooling are the most common types. [Ch. 15]
Precision. The proportion of positive predictions that are actually correct (true positives divided by true positives plus false positives). High precision means few false alarms. [Ch. 10]
Predictive analytics. The use of statistical and machine learning techniques to forecast future outcomes based on historical data. Includes classification, regression, and time-series forecasting. [Ch. 3]
Predictive maintenance. The use of ML models to predict when equipment or machinery is likely to fail, enabling maintenance to be scheduled proactively, reducing downtime and costs. [Ch. 18]
Pre-processing. See Data cleaning, Feature engineering.
Pre-trained model. A model that has been trained on a large, general-purpose dataset and can be adapted to specific tasks through fine-tuning or prompting. Leverages transfer learning to reduce data and compute requirements. [Ch. 16]
Prescriptive analytics. The most advanced level of the analytics maturity spectrum, using optimization, simulation, and ML to recommend specific actions or decisions, not just predict outcomes. [Ch. 3]
Principal-agent problem (in AI). The challenge of ensuring an AI system (agent) acts in alignment with the goals and values of its human operators or principals, particularly as the system becomes more autonomous. [Ch. 36]
Privacy-preserving machine learning. A collection of techniques that enable ML training and inference while protecting the privacy of individual data subjects. Includes differential privacy, federated learning, homomorphic encryption, and secure multi-party computation. [Ch. 37]
Prompt. The input text or instruction given to a large language model to elicit a desired response. The quality and structure of the prompt significantly affect output quality. [Ch. 30]
Prompt engineering. The practice of designing and refining prompts to optimize the quality, relevance, accuracy, and safety of a language model's output. Techniques include few-shot prompting, chain-of-thought reasoning, and system prompts. [Ch. 30]
Pruning (model pruning). A model compression technique that removes weights, neurons, or entire layers from a trained neural network that contribute minimally to performance, reducing model size and inference cost. [Ch. 26]
PyTorch. An open-source deep learning framework developed by Meta AI (formerly Facebook AI Research) that provides dynamic computational graphs and a Pythonic interface. The dominant framework in ML research and increasingly in production. [Ch. 13]
QLoRA (Quantized Low-Rank Adaptation). A parameter-efficient fine-tuning method that combines quantization with LoRA to enable fine-tuning of very large language models on consumer-grade hardware. [Ch. 16]
Quantization. A model compression technique that reduces the precision of a model's weights (e.g., from 32-bit floating point to 8-bit integer), decreasing memory footprint and accelerating inference with minimal performance loss. [Ch. 26]
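A toy sketch of one common scheme, symmetric 8-bit quantization with a single shared scale factor (our own illustration; production systems use per-channel scales and calibration data):

```python
# Map float weights onto integers in [-127, 127], then map them back.
def quantize(weights, bits=8):
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

q, scale = quantize([0.5, -1.27, 0.03])
print(q)  # small integers, a quarter of the memory of 32-bit floats
restored = dequantize(q, scale)  # close to the originals, tiny rounding error
```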
Question answering (QA). An NLP task in which a model provides answers to questions posed in natural language, either extracting answers from a given context passage or generating them from learned knowledge. [Ch. 14]
R-squared (R²). A statistical measure that represents the proportion of variance in the dependent variable that is explained by the independent variables in a regression model. Values typically range from 0 to 1, with higher values indicating better fit; R² can be negative when a model fits worse than simply predicting the mean. [Ch. 10]
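R² computed from its definition, 1 minus the ratio of residual to total sum of squares (our own sketch with made-up numbers):

```python
# R^2 = 1 - (residual sum of squares / total sum of squares).
def r_squared(actual, predicted):
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

print(round(r_squared([1.0, 2.0, 3.0], [1.1, 1.9, 3.0]), 2))  # → 0.99
```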
RAG (Retrieval-Augmented Generation). An architecture that enhances a language model's responses by first retrieving relevant information from an external knowledge base and then using that information as context for generation. Reduces hallucinations and keeps responses current without retraining. [Ch. 30]
Random Forest. An ensemble learning method that constructs multiple decision trees during training and outputs the average prediction (regression) or majority vote (classification) of the individual trees. Robust, accurate, and resistant to overfitting. [Ch. 11]
Random search. A hyperparameter tuning method that evaluates randomly sampled combinations of hyperparameter values. Often more efficient than grid search because it explores the search space more broadly. [Ch. 12]
Recall (sensitivity, true positive rate). The proportion of actual positive cases that the model correctly identifies (true positives divided by true positives plus false negatives). High recall means few missed positives. [Ch. 10]
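Precision and recall drop straight out of the raw confusion-matrix counts (our own sketch with made-up counts):

```python
# Precision and recall from true positives, false positives, false negatives.
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)  # of the flagged cases, how many were real?
recall = tp / (tp + fn)     # of the real cases, how many were flagged?

print(precision)         # → 0.8
print(round(recall, 3))  # → 0.667
```

The same counts illustrate the usual trade-off: flagging more cases tends to raise recall (fewer misses) while lowering precision (more false alarms).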
Recommendation system (recommender system). An information filtering system that predicts a user's preferences and suggests relevant items (products, content, connections) based on user behavior, item attributes, or both. Powers key business functions at Netflix, Amazon, Spotify, and LinkedIn. [Ch. 17]
Recurrent Neural Network (RNN). A class of neural networks designed for sequential data, in which connections between nodes form directed cycles, allowing the network to maintain a form of memory across time steps. Largely superseded by Transformers for many tasks. [Ch. 14]
Regression. A supervised learning task in which the model predicts a continuous numerical value (e.g., house price, revenue, temperature) rather than a discrete class label. [Ch. 9]
Regularization. Techniques that constrain or penalize a model's complexity to prevent overfitting. Common methods include L1 regularization (Lasso), L2 regularization (Ridge), dropout, and early stopping. [Ch. 12]
Reinforcement Learning (RL). A type of machine learning in which an agent learns to make decisions by taking actions in an environment and receiving rewards or penalties, seeking to maximize cumulative reward over time. Used in robotics, game playing, and recommendation optimization. [Ch. 13]
Reinforcement Learning from Human Feedback (RLHF). A training technique used to align language models with human preferences, in which the model is fine-tuned using feedback from human evaluators as a reward signal. A key method behind the instruction-following capability of ChatGPT and similar systems. [Ch. 16]
Reproducibility. The ability to independently replicate the results of an ML experiment using the same data, code, and configuration. A cornerstone of scientific rigor and a practical requirement for production ML systems. [Ch. 25]
Residual connection (skip connection). A neural network design pattern in which the input to a layer is added directly to the layer's output, allowing gradients to flow more easily during training and enabling the construction of very deep networks. [Ch. 13]
Responsible AI. A framework and practice for developing AI systems that are fair, transparent, accountable, safe, and aligned with human values and societal well-being. Encompasses technical, organizational, and governance dimensions. [Ch. 34]
REST API. A software architectural style for building web services that allows systems to communicate over HTTP using standard methods (GET, POST, PUT, DELETE). The most common interface for serving ML model predictions. [Ch. 25]
RMSE (Root Mean Squared Error). A regression metric that measures the square root of the average squared difference between predicted and actual values. Penalizes large errors more heavily than MAE. [Ch. 10]
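A small example (toy values) makes the contrast with MAE concrete — a single large error dominates RMSE but not MAE:

```python
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Three perfect predictions and one error of 4.
y_true = [10, 10, 10, 10]
y_pred = [10, 10, 10, 14]
print(mae(y_true, y_pred))   # 1.0
print(rmse(y_true, y_pred))  # 2.0 — the squared term amplifies the outlier
```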
Robotic Process Automation (RPA). Software technology that automates repetitive, rule-based tasks typically performed by humans, such as data entry, form filling, and transaction processing. Increasingly augmented with AI for handling unstructured data and decision-making. [Ch. 19]
ROI (Return on Investment). A financial metric that measures the profitability of an investment relative to its cost. AI project ROI calculations should account for both direct revenue/cost impacts and indirect benefits such as improved decision speed and reduced risk. [Ch. 4]
SaaS (Software as a Service). A cloud delivery model in which software applications are hosted by a provider and accessed by users over the internet on a subscription basis. Many AI/ML tools are delivered as SaaS, including analytics platforms, annotation tools, and model APIs. [Ch. 26]
Sampling. The process of selecting a representative subset of data from a larger population for analysis or model training. Proper sampling techniques are critical for avoiding bias and ensuring model generalizability. [Ch. 8]
Scalability. The ability of a system, model, or process to handle growing amounts of work or data efficiently. In ML, scalability concerns span data processing, model training, and inference serving. [Ch. 26]
Scikit-learn. A widely used open-source Python library for traditional machine learning that provides implementations of classification, regression, clustering, dimensionality reduction, and preprocessing algorithms with a consistent API. [Ch. 9]
Self-supervised learning. A learning paradigm in which a model trains on unlabeled data by generating supervisory signals from the data itself (e.g., predicting masked words in text or reconstructing masked image patches). The pre-training approach behind most foundation models. [Ch. 16]
Semantic search. A search technique that understands the intent and contextual meaning of a query, returning results based on semantic similarity rather than keyword matching. Powered by text embeddings and vector databases. [Ch. 14]
Semi-supervised learning. A training approach that combines a small amount of labeled data with a large amount of unlabeled data, leveraging the structure of the unlabeled data to improve learning. [Ch. 11]
Sentiment analysis. An NLP task that determines the emotional tone or opinion expressed in text, typically classifying it as positive, negative, or neutral. Widely used for social media monitoring, product review analysis, and brand perception tracking. [Ch. 14]
Shadow deployment (shadow mode). A deployment strategy in which a new model runs in production alongside the existing model, receiving real traffic and generating predictions, but its outputs are not served to users. Allows comparison of the new model's behavior against the incumbent without risk. [Ch. 25]
SHAP (SHapley Additive exPlanations). An explainability framework based on Shapley values from cooperative game theory that assigns each feature a contribution to a specific prediction. Provides consistent, mathematically grounded local and global explanations. [Ch. 34]
Sigmoid function. An activation function that maps input values to the range (0, 1), producing outputs interpretable as probabilities. Used in the output layer of binary classifiers and in gating mechanisms within LSTMs. [Ch. 13]
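The function itself is a one-liner:

```python
import math

def sigmoid(x):
    """Map any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))   # 0.5
print(sigmoid(4))   # ~0.982 — large positive inputs approach 1
print(sigmoid(-4))  # ~0.018 — large negative inputs approach 0
```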
Simulation. The use of computational models to replicate real-world processes or systems, enabling experimentation and scenario analysis without real-world risk. AI can both power and benefit from simulations (e.g., for training autonomous vehicles or testing policies). [Ch. 18]
SMOTE (Synthetic Minority Over-sampling Technique). A data augmentation method for imbalanced classification that generates synthetic examples for the minority class by interpolating between existing minority samples. [Ch. 10]
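The interpolation idea can be sketched in a few lines (a simplification: real SMOTE interpolates toward each sample's k nearest minority neighbors, not arbitrary pairs):

```python
import random

def smote_like(minority, n_new, seed=0):
    """Generate synthetic minority samples by interpolating between
    random pairs of existing ones (simplified sketch of SMOTE's core idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # random point on the segment between a and b
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]]
print(smote_like(minority, 2))  # two new points inside the minority region
```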
Snowflake. A cloud-based data warehousing platform that provides scalable data storage, processing, and analytics with a separation of storage and compute. Increasingly integrated with ML workflows through Snowpark and partner tools. [Ch. 7]
Softmax function. An activation function applied to the output layer of multi-class classification models that converts a vector of raw scores into a probability distribution over classes, where all probabilities sum to 1. [Ch. 13]
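In code (subtracting the maximum score first is the standard trick for numerical stability):

```python
import math

def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # highest score -> highest probability
print(sum(probs))  # ≈ 1.0 — probabilities sum to one
```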
Spark (Apache Spark). An open-source distributed computing framework for large-scale data processing and analytics, supporting batch processing, streaming, SQL, and machine learning (via MLlib). [Ch. 7]
Speech recognition (automatic speech recognition, ASR). The technology that converts spoken language into text. Powered by deep learning models and used in virtual assistants, transcription services, and accessibility tools. [Ch. 14]
Speech synthesis (text-to-speech, TTS). The artificial production of human speech from text input. Modern neural TTS systems produce remarkably natural-sounding voice output. [Ch. 14]
SQL (Structured Query Language). A standard language for managing and querying relational databases. Essential for data extraction and manipulation in ML workflows. [Ch. 7]
Stacking (stacked generalization). An ensemble method in which predictions from multiple base models are used as input features for a meta-model that learns how to best combine them. Often produces superior results but adds complexity. [Ch. 11]
Stakeholder alignment. The process of ensuring that all parties affected by or involved in an AI project share a common understanding of goals, expectations, risks, and success criteria. A critical success factor in AI project management. [Ch. 22]
Stanford AI Index. An annual report from Stanford University's Human-Centered AI Institute that tracks, collates, and visualizes data related to AI progress, adoption, and societal impact. A valuable reference for AI strategy discussions. [Ch. 1]
Statistical significance. The degree to which an observed result is unlikely to have occurred by chance alone, typically assessed using p-values or confidence intervals. Important for validating A/B test results and model comparisons. [Ch. 3]
Stochastic Gradient Descent (SGD). A variant of gradient descent that updates model parameters using the gradient computed on a single training example (or mini-batch) rather than the entire dataset, introducing noise that can help escape local minima. [Ch. 12]
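A minimal sketch, fitting a one-parameter model y = w·x with updates from one randomly chosen example at a time (toy data with true w = 2):

```python
import random

# Noise-free toy data generated from y = 2x.
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
w, lr = 0.0, 0.05
random.seed(0)
for _ in range(200):
    x, y = random.choice(data)    # a single example, not the full dataset
    grad = 2 * (w * x - y) * x    # gradient of squared error w.r.t. w
    w -= lr * grad                # parameter update step
print(round(w, 3))  # close to the true value 2.0
```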
Stopwords. Common words (e.g., "the," "is," "and") that are often removed during text preprocessing because they carry little semantic meaning. The decision to remove stopwords depends on the task and model. [Ch. 14]
Stratified sampling. A sampling technique that ensures each class or subgroup in the data is represented proportionally in the sample. Commonly used to split datasets for training, validation, and testing when classes are imbalanced. [Ch. 10]
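A hand-rolled sketch (scikit-learn's `train_test_split` offers this via its `stratify` argument) — splitting each class separately preserves the class ratio:

```python
import random

def stratified_split(items, labels, test_frac=0.25, seed=0):
    """Split so each class appears in the test set in proportion."""
    rng = random.Random(seed)
    by_class = {}
    for item, label in zip(items, labels):
        by_class.setdefault(label, []).append(item)
    train, test = [], []
    for label, group in by_class.items():
        rng.shuffle(group)
        k = round(len(group) * test_frac)
        test.extend((g, label) for g in group[:k])
        train.extend((g, label) for g in group[k:])
    return train, test

# 8 negatives, 4 positives: a 25% stratified split keeps the 2:1 ratio.
items, labels = list(range(12)), [0] * 8 + [1] * 4
train, test = stratified_split(items, labels)
print(sum(1 for _, y in test if y == 0), sum(1 for _, y in test if y == 1))  # 2 1
```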
Structured data. Data organized in a predefined format with a consistent schema, typically stored in rows and columns in relational databases or spreadsheets. Contrasted with unstructured data (text, images, audio). [Ch. 7]
Summarization. An NLP task that condenses a longer text into a shorter version while preserving key information and meaning. Can be extractive (selecting key sentences) or abstractive (generating new summary text). [Ch. 14]
Supervised learning. A machine learning paradigm in which the model learns from labeled training data — input-output pairs — to predict outcomes for new, unseen inputs. The most widely used ML paradigm in business applications. [Ch. 9]
Supply chain optimization. The application of analytics, simulation, and ML to improve supply chain efficiency, including demand forecasting, inventory optimization, route planning, and supplier risk management. [Ch. 18]
Support Vector Machine (SVM). A supervised learning algorithm that finds the hyperplane that maximally separates classes in feature space, optionally using kernel functions to handle non-linear relationships. Effective for high-dimensional, small-to-medium datasets. [Ch. 9]
Synthetic data. Artificially generated data that mimics the statistical properties of real data. Used when real data is scarce, sensitive, or expensive to obtain. Generated through methods including GANs, simulation, and rule-based systems. [Ch. 8]
t-SNE (t-distributed Stochastic Neighbor Embedding). A dimensionality reduction technique primarily used for visualizing high-dimensional data in two or three dimensions. Preserves local structure, making clusters visible, but is non-deterministic and computationally expensive. [Ch. 11]
Tabular data. Data organized in rows (observations) and columns (features), the most common data format in business applications. Gradient boosting methods (XGBoost, LightGBM) remain the leading approaches for tabular data. [Ch. 9]
Target leakage. See Data leakage.
Target variable. The outcome or dependent variable that a supervised learning model is trained to predict. Also called the response variable, label, or dependent variable. [Ch. 9]
TCO (Total Cost of Ownership). A financial estimate that includes all direct and indirect costs of owning and operating an AI system over its lifecycle, including infrastructure, data acquisition, talent, maintenance, and retraining costs. [Ch. 4]
Technical debt (ML technical debt). The accumulated cost of shortcuts, workarounds, and suboptimal decisions in ML systems that make future development slower and more expensive. ML-specific technical debt includes entangled features, undeclared consumers, and pipeline jungles. [Ch. 25]
TensorFlow. An open-source deep learning framework developed by Google Brain that provides tools for building and deploying ML models at scale. Supports both research prototyping and production deployment through TensorFlow Serving and TensorFlow Lite. [Ch. 13]
Test set. A portion of the dataset held out and used only for the final, unbiased evaluation of a model's performance after all training and hyperparameter tuning is complete. Must never be used for model selection decisions. [Ch. 10]
Text classification. An NLP task that assigns predefined categories to a piece of text. Applications include spam filtering, topic categorization, intent detection, and content moderation. [Ch. 14]
Text mining. The process of deriving meaningful patterns and insights from unstructured text data using NLP, statistical, and ML techniques. [Ch. 14]
Time series. A sequence of data points recorded at successive, equally spaced time intervals. Time series analysis and forecasting are used in demand prediction, financial modeling, and IoT monitoring. [Ch. 18]
Time series forecasting. The use of historical sequential data to predict future values. Methods range from classical statistical approaches (ARIMA, exponential smoothing) to deep learning architectures (LSTMs, Temporal Fusion Transformers). [Ch. 18]
Token. A unit of text that an NLP model processes — may be a word, subword, character, or special symbol depending on the tokenization method. LLMs are typically priced per token for API usage. [Ch. 14]
Tokenization. The process of breaking text into individual tokens (words, subwords, or characters) that serve as the input units for an NLP model. Byte Pair Encoding (BPE) and SentencePiece are common subword tokenization methods used by modern LLMs. [Ch. 14]
Topic modeling. An unsupervised NLP technique that discovers abstract "topics" in a collection of documents. Latent Dirichlet Allocation (LDA) is the most well-known method; more recent approaches use embeddings and clustering. [Ch. 14]
TPU (Tensor Processing Unit). A custom-designed application-specific integrated circuit (ASIC) developed by Google specifically for accelerating machine learning workloads. Available through Google Cloud Platform. [Ch. 26]
Training data. The portion of a dataset used to fit a machine learning model by adjusting its parameters to minimize the loss function. The quality and representativeness of training data are the single most important determinant of model performance. [Ch. 8]
Training set. See Training data.
Transfer learning. A technique in which a model trained on one task is reused as the starting point for a model on a different but related task. Dramatically reduces the data and computation required for new tasks by leveraging pre-trained representations. [Ch. 13]
Transformer. A neural network architecture introduced in the 2017 paper "Attention Is All You Need" that relies entirely on self-attention mechanisms to process sequential data in parallel, eliminating the need for recurrence. The foundation of all modern large language models and many computer vision models. [Ch. 14]
Transparency. The principle that AI systems and their decision-making processes should be open, understandable, and accessible to scrutiny by stakeholders. A pillar of responsible AI and a requirement under many regulatory frameworks. [Ch. 34]
Turing test. A test proposed by Alan Turing in which a human evaluator converses with both a human and a machine; if the evaluator cannot reliably distinguish the machine from the human, the machine is said to have passed. More a philosophical benchmark than a practical AI metric. [Ch. 1]
UMAP (Uniform Manifold Approximation and Projection). A dimensionality reduction technique that preserves both local and global data structure, offering advantages over t-SNE in speed and scalability. Used for data visualization and as a preprocessing step for clustering. [Ch. 11]
Underfitting. A condition in which a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and test data. Indicates high bias. [Ch. 10]
Unlabeled data. Data for which the target variable or class label is not known. Used in unsupervised learning and self-supervised pre-training. [Ch. 11]
Unstructured data. Data that does not conform to a predefined schema or format, including text, images, audio, video, and social media posts. Estimated to constitute 80-90% of enterprise data; AI excels at extracting value from unstructured data. [Ch. 7]
Unsupervised learning. A machine learning paradigm in which the model discovers patterns, structures, or groupings in data without labeled examples. Includes clustering, dimensionality reduction, and anomaly detection. [Ch. 11]
Upskilling. The process of teaching employees new skills, particularly in data literacy, AI, and ML, to prepare the workforce for AI-augmented roles. A critical enabler of organizational AI adoption. [Ch. 23]
Use case. A specific business problem or opportunity that AI/ML can address, defined with clear objectives, data requirements, success criteria, and expected business impact. The starting point for any AI initiative. [Ch. 4]
Validation set. A subset of the training data used to tune hyperparameters and evaluate model performance during development, providing feedback for model selection decisions without contaminating the test set. [Ch. 10]
Value alignment. The challenge of ensuring that an AI system's objectives and behaviors are consistent with human values and intentions, particularly as systems become more capable and autonomous. [Ch. 36]
Vanishing gradient problem. A difficulty in training deep neural networks in which gradients become extremely small as they are propagated backward through many layers, effectively preventing early layers from learning. Addressed by architectures using residual connections, batch normalization, and ReLU activations. [Ch. 13]
Variational Autoencoder (VAE). A generative model that learns to encode data into a structured latent space and decode samples from that space back into data. Useful for generating new data, anomaly detection, and learning disentangled representations. [Ch. 16]
Vector database. A specialized database designed to store and efficiently retrieve high-dimensional vector embeddings, enabling similarity search at scale. Key infrastructure for RAG systems, recommendation engines, and semantic search. Examples include Pinecone, Weaviate, Milvus, and Chroma. [Ch. 30]
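The core operation — ranking stored embeddings by similarity to a query embedding — can be sketched with a brute-force in-memory "index" (hypothetical 3-dimensional vectors; real systems use learned embeddings with hundreds of dimensions and approximate nearest-neighbor indexes for scale):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Tiny in-memory "index": document -> embedding (made-up toy vectors).
index = {
    "refund policy":          [0.9, 0.1, 0.0],
    "shipping times":         [0.1, 0.9, 0.1],
    "how to return an item":  [0.7, 0.3, 0.2],
}

query = [0.85, 0.15, 0.05]  # embedding of the user's question
ranked = sorted(index, key=lambda doc: cosine(query, index[doc]), reverse=True)
print(ranked[0])  # most semantically similar document
```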
Vendor lock-in. The risk that an organization becomes dependent on a single cloud provider's or vendor's proprietary tools and APIs, making it difficult and costly to switch providers. A strategic consideration in AI infrastructure decisions. [Ch. 26]
Version control. The practice of tracking and managing changes to code, data, models, and configurations over time. Git is the standard for code; DVC and LakeFS extend version control to data and models. [Ch. 25]
Vertex AI. Google Cloud's managed ML platform that provides tools for building, deploying, and scaling machine learning models, including AutoML, custom training, model registry, and feature store. [Ch. 26]
Vision Transformer (ViT). A Transformer-based architecture applied to image classification by treating an image as a sequence of fixed-size patches. Demonstrates that Transformer architectures can compete with or surpass CNNs on vision tasks when trained on sufficient data. [Ch. 15]
Warm start. A technique in which a model training process is initialized with parameters from a previously trained model rather than random values, accelerating convergence. Related to but distinct from transfer learning. [Ch. 12]
Waterfall methodology. A sequential project management approach in which each phase must be completed before the next begins. Generally less suitable for AI/ML projects than agile or iterative approaches due to the experimental nature of ML development. [Ch. 22]
Weight. A learnable parameter in a neural network that determines the strength of the connection between neurons. During training, weights are adjusted through backpropagation to minimize the loss function. [Ch. 13]
Weights & Biases (W&B). A commercial platform for experiment tracking, model visualization, dataset versioning, and collaborative ML development. [Ch. 25]
White-box model. A model whose internal decision-making process is transparent and interpretable to humans. Examples include linear regression, logistic regression, and shallow decision trees. Contrasted with black-box models. [Ch. 34]
Word embedding. See Embedding.
Word2Vec. A neural network-based technique developed at Google that learns word embeddings by training on word co-occurrence patterns in large text corpora. Introduced the concepts of continuous bag-of-words (CBOW) and skip-gram models. Foundational to modern NLP. [Ch. 14]
Workflow automation. The use of technology to automate sequences of tasks in a business process, reducing manual intervention and improving efficiency. AI enhances workflow automation by handling unstructured data, making decisions, and learning from outcomes. [Ch. 19]
XAI (Explainable AI). The field of AI research and practice focused on developing methods and tools that make AI systems' decisions understandable to humans. Encompasses techniques such as SHAP, LIME, counterfactual explanations, and attention visualization. [Ch. 34]
XGBoost (eXtreme Gradient Boosting). A highly optimized gradient boosting library that provides fast, scalable, and accurate implementations of gradient boosted decision trees. Consistently among the top-performing algorithms for structured/tabular data in competitions and industry applications. [Ch. 11]
YOLO (You Only Look Once). A family of real-time object detection models that frame detection as a single regression problem, predicting bounding boxes and class probabilities directly from full images in one evaluation. Known for speed and deployment efficiency. [Ch. 15]
Zero-shot learning. The ability of a model to perform a task it was not explicitly trained on, by leveraging generalized knowledge learned during pre-training. In the context of LLMs, zero-shot means providing only a task description in the prompt without any examples. [Ch. 16]
Zero-shot classification. A text classification approach in which a model classifies text into categories it has never seen during training, using natural language descriptions of the categories as guidance. Enabled by large pre-trained language models. [Ch. 14]
This glossary covers over 300 terms referenced throughout the textbook. For a term's full treatment, consult the chapter(s) indicated in brackets. Terms that appear extensively across multiple chapters are cross-referenced to the chapter that provides the most comprehensive discussion.