Glossary
How to use this glossary: Terms are listed alphabetically. Each entry includes the chapter where the term is primarily introduced. Cross-references to related terms are marked with see also. If a term appears in multiple chapters, only the chapter of first substantive introduction is listed.
Accuracy (Ch. 29) : The proportion of correct predictions out of all predictions made by a classification model. Accuracy can be misleading when classes are imbalanced. See also precision, recall, F1 score.
Aesthetic mapping (Ch. 14) : In the grammar of graphics, the assignment of data variables to visual properties such as position, color, size, or shape. For example, mapping a "region" column to the color of points on a scatter plot.
Aggregation (Ch. 7) : The process of combining multiple values into a single summary value, such as computing a mean, sum, count, or maximum. In pandas, aggregation is typically performed with .agg(), .mean(), .sum(), etc.
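A minimal pandas sketch of the aggregation methods named above, using a small hypothetical sales column:

```python
import pandas as pd

# Hypothetical toy data for illustration
df = pd.DataFrame({"sales": [10, 20, 30, 40]})

total = df["sales"].sum()                   # single summary value: 100
average = df["sales"].mean()                # single summary value: 25.0
summary = df["sales"].agg(["mean", "max"])  # several summaries at once
```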
Algorithm (Ch. 25) : A step-by-step procedure for solving a problem or performing a computation. In machine learning, an algorithm is the method used to learn patterns from data (e.g., linear regression, decision tree).
API (Application Programming Interface) (Ch. 13) : A set of rules that allows one software system to communicate with another. In data science, APIs are commonly used to retrieve data from web services programmatically using HTTP requests.
Assignment (Ch. 3) : The act of giving a variable a value using the = operator. For example, x = 42 assigns the value 42 to the variable x. See also comparison operator.
Bar chart (Ch. 15) : A visualization that uses rectangular bars to represent values, with the length or height of each bar proportional to the value it represents. Used for comparing quantities across categories.
Bias (in models) (Ch. 25) : Systematic error introduced when a model makes simplifying assumptions that cause it to consistently miss certain patterns in the data. High bias leads to underfitting. See also variance, bias-variance tradeoff.
Bias (in sampling) (Ch. 22) : A systematic tendency for a sample to differ from the population it is supposed to represent. Common forms include selection bias, response bias, and survivorship bias.
Bias-variance tradeoff (Ch. 25) : The fundamental tension in modeling: simple models have high bias (they miss patterns) but low variance (they are stable); complex models have low bias but high variance (they are sensitive to the specific training data). The goal is to find a balance that minimizes total prediction error.
Boolean (Ch. 3) : A data type with exactly two possible values: True or False. Named after mathematician George Boole. Used for logical conditions, filtering, and control flow.
Boolean indexing (Ch. 7) : A method of selecting rows from a DataFrame by applying a condition that produces a Series of True/False values (a boolean mask). For example, df[df["age"] > 30] selects rows where age exceeds 30.
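The age example above can be sketched end to end with hypothetical data:

```python
import pandas as pd

# Hypothetical ages; the column name mirrors the example in the entry
df = pd.DataFrame({"age": [25, 31, 47, 19]})

mask = df["age"] > 30   # boolean Series: False, True, True, False
over_30 = df[mask]      # keeps only the rows where the mask is True
```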
Box plot (Ch. 16) : A visualization showing the distribution of a numerical variable through its five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. In the common convention, the whiskers extend only to the most extreme observations within 1.5 × IQR of the quartiles, and points beyond the whiskers are plotted individually as outliers.
Categorical variable (Ch. 7) : A variable that takes on a limited number of distinct categories or groups, such as "male" / "female" or "low" / "medium" / "high." See also numerical variable, ordinal variable.
Causal question (Ch. 1) : A question that asks whether one thing directly produces or prevents another. For example, "Does the vaccine reduce infection rates?" Answering causal questions requires experimental or quasi-experimental designs. See also descriptive question, predictive question.
Central limit theorem (Ch. 21) : A foundational result in statistics stating that the sampling distribution of the mean approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution. This theorem justifies many statistical inference procedures.
Classification (Ch. 27) : A supervised learning task where the goal is to predict which category (class) an observation belongs to. For example, predicting whether an email is spam or not spam. See also regression.
Cleaning (Ch. 8) : The process of detecting and correcting errors, inconsistencies, and missing values in a dataset to make it suitable for analysis. Common cleaning tasks include removing duplicates, fixing data types, handling missing values, and standardizing formats.
Coefficient (Ch. 26) : In a regression model, the numerical value that represents the strength and direction of the relationship between a predictor variable and the outcome variable. A coefficient of 2.5 means the outcome is expected to increase by 2.5 units for each one-unit increase in the predictor, holding any other predictors constant.
Comparison operator (Ch. 3) : An operator that compares two values and returns a boolean. Python comparison operators include ==, !=, <, >, <=, and >=.
Concatenation (Ch. 9) : Combining two or more DataFrames by stacking them vertically (adding rows) or horizontally (adding columns). In pandas, performed with pd.concat().
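A minimal sketch of vertical concatenation with two hypothetical DataFrames:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

# Stack vertically (add rows); ignore_index renumbers the rows 0..3
stacked = pd.concat([a, b], ignore_index=True)
```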
Conditional probability (Ch. 20) : The probability of an event occurring given that another event has already occurred, written as $P(A|B)$. See also probability.
Confidence interval (Ch. 22) : A range of values, computed from sample data, that is likely to contain the true population parameter. A 95% confidence interval means that if we repeated the sampling process many times, approximately 95% of the resulting intervals would contain the true parameter.
Confounding variable (Ch. 24) : A variable that influences both the predictor and the outcome, creating a spurious association between them. For example, ice cream sales and drowning rates are correlated because both are influenced by hot weather (the confounder).
Coordinate system (Ch. 14) : In the grammar of graphics, the system that defines how data positions are mapped to the plane of a chart. The most common is the Cartesian coordinate system (x-y axes). Polar coordinates are used for pie charts and radar charts.
Correlation (Ch. 24) : A statistical measure of the strength and direction of the linear relationship between two variables. Pearson's correlation coefficient $r$ ranges from $-1$ (perfect negative) to $+1$ (perfect positive), with 0 indicating no linear relationship. See also causation, confounding variable.
Cross-validation (Ch. 30) : A model evaluation technique that splits the data into multiple folds, training on some folds and testing on others, then averaging the results. In k-fold cross-validation, the data are divided into $k$ parts and each part serves once as the test fold. Provides a more reliable estimate of model performance than a single train-test split.
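A short scikit-learn sketch of 5-fold cross-validation, using synthetic data generated purely for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (not from any real dataset)
X, y = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=0)

# cv=5 performs five train/test splits and returns five R^2 scores
scores = cross_val_score(LinearRegression(), X, y, cv=5)
estimate = scores.mean()  # averaged estimate of model performance
```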
CSV (Comma-Separated Values) (Ch. 6) : A plain-text file format where each row is a line of text and columns are separated by commas. The most common format for exchanging tabular data. Read in Python with pd.read_csv().
Data dictionary (Ch. 7) : A document describing every column (variable) in a dataset: its name, data type, description, valid values, and source. Essential for reproducibility and collaboration.
Data engineering (Ch. 1) : The discipline focused on building and maintaining the infrastructure (pipelines, databases, storage systems) that makes data available for analysis. Distinct from but complementary to data science.
DataFrame (Ch. 7) : The primary two-dimensional data structure in pandas, representing a table with labeled rows and columns. Each column is a Series. Created with pd.DataFrame().
Data science lifecycle (Ch. 1) : The iterative process of data science work, consisting of six stages: question formulation, data collection, data cleaning, exploratory analysis, modeling, and communication.
Data type (dtype) (Ch. 3) : The classification of a value that determines what operations can be performed on it. In Python: int, float, str, bool. In pandas: int64, float64, object, datetime64, category, etc.
Decision tree (Ch. 28) : A supervised learning model that makes predictions by recursively splitting data based on feature values, creating a tree-like structure of if-then rules. Interpretable but prone to overfitting.
Dependent variable (Ch. 25) : The variable a model is trying to predict or explain. Also called the outcome variable, response variable, or target variable. See also independent variable.
Descriptive question (Ch. 1) : A question that asks about the current or past state of the world. For example, "How many customers did we have last quarter?" See also predictive question, causal question.
Descriptive statistics (Ch. 19) : Numerical summaries that characterize a dataset, including measures of center (mean, median, mode), spread (variance, standard deviation, IQR), and shape (skewness, kurtosis).
Dictionary (Ch. 5) : A Python data structure that stores key-value pairs, allowing fast lookup by key. Created with curly braces: {"name": "Elena", "age": 28}.
Distribution (Ch. 21) : The pattern of values that a variable takes, including which values are common and which are rare. Described by its shape (symmetric, skewed), center, and spread. Common distributions include normal, binomial, and Poisson.
Domain knowledge (Ch. 1) : Expertise in the subject area being analyzed (e.g., medicine, finance, sports). Essential for formulating good questions, choosing appropriate methods, and interpreting results correctly.
Encoding (Ch. 12) : A system for representing text as bytes. Common encodings include UTF-8, ASCII, and Latin-1. Encoding mismatches cause garbled text (mojibake). In pandas, specify with pd.read_csv("file.csv", encoding="utf-8").
Ethical data science (Ch. 32) : The practice of considering the societal impacts of data science work, including fairness, privacy, transparency, consent, and potential for harm. Ethical considerations should inform every stage of the data science lifecycle.
Exploratory data analysis (EDA) (Ch. 6) : The process of summarizing, visualizing, and investigating a dataset to understand its structure, identify patterns, and generate hypotheses. Typically performed before formal modeling.
F1 score (Ch. 29) : The harmonic mean of precision and recall, providing a single metric that balances both. Ranges from 0 to 1. Useful when classes are imbalanced and accuracy alone is misleading. See also precision, recall.
Faceting (Ch. 14) : In the grammar of graphics, the technique of creating multiple small charts (panels), each showing a subset of the data based on a categorical variable. Also called small multiples.
False negative (Ch. 29) : An observation that is actually positive but was incorrectly classified as negative by a model. For example, a sick patient classified as healthy. See also false positive.
False positive (Ch. 29) : An observation that is actually negative but was incorrectly classified as positive by a model. For example, a legitimate email classified as spam. See also false negative.
Feature (Ch. 25) : An input variable used by a model to make predictions. Also called a predictor, independent variable, or covariate. See also feature engineering.
Feature engineering (Ch. 30) : The process of creating new features from existing data to improve model performance. For example, extracting day-of-week from a date column, or computing the ratio of two variables.
f-string (Ch. 3) : A Python string literal prefixed with f that allows embedding expressions inside curly braces. For example, f"Hello, {name}!" inserts the value of the variable name.
Function (Ch. 4) : A reusable block of code that performs a specific task. Defined with the def keyword in Python. Functions take parameters (inputs) and can return values (outputs).
Geometric object (geom) (Ch. 14) : In the grammar of graphics, the visual element used to represent data, such as points, lines, bars, or areas.
Git (Ch. 33) : A version control system that tracks changes to files over time. Allows multiple people to collaborate on code and provides a history of all changes.
GitHub (Ch. 33) : A cloud-based platform for hosting git repositories. Widely used for sharing code, collaborating on projects, and building a professional portfolio.
Groupby (Ch. 9) : An operation that splits a DataFrame into groups based on one or more columns, applies a function to each group, and combines the results. Implements the split-apply-combine pattern. In pandas: df.groupby("column").agg(func).
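The split-apply-combine pattern can be sketched with a small hypothetical table of sales by region:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "sales": [10, 20, 30, 40],
})

# Split by region, apply the mean to each group, combine into one Series
by_region = df.groupby("region")["sales"].mean()
```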
Histogram (Ch. 15) : A visualization showing the distribution of a numerical variable by dividing the range into bins and displaying the count (or proportion) of observations in each bin.
Hypothesis testing (Ch. 23) : A formal statistical procedure for deciding whether observed data provide sufficient evidence to reject a claim (the null hypothesis). Key concepts include p-values, significance levels, and test statistics.
Immutable (Ch. 3) : An object whose value cannot be changed after creation. Strings, tuples, and numbers are immutable in Python. See also mutable.
Imputation (Ch. 8) : The process of replacing missing values with estimated values, such as the column mean, median, or a value predicted by a model.
Independent variable (Ch. 25) : A variable used to predict or explain the dependent variable. Also called a predictor, feature, or explanatory variable.
Index (Ch. 7) : In pandas, the labeled axis of a DataFrame or Series. Provides a way to identify and align rows. Can be integer-based (default) or label-based (e.g., dates, country names).
Inner join (Ch. 9) : A merge operation that returns only rows with matching keys in both tables. Rows without a match in the other table are discarded. See also left join, outer join.
Interquartile range (IQR) (Ch. 19) : The difference between the 75th percentile (Q3) and the 25th percentile (Q1) of a distribution. A robust measure of spread that is not affected by outliers. $\text{IQR} = Q3 - Q1$.
Iteration (Ch. 4) : The process of repeatedly executing a block of code, typically using for or while loops. In pandas, vectorized operations are preferred over explicit iteration for performance.
JSON (JavaScript Object Notation) (Ch. 12) : A lightweight data interchange format that uses key-value pairs and arrays. Common in web APIs. Read in Python with json.load() or pd.read_json().
Jupyter notebook (Ch. 2) : An interactive computing environment that combines code cells, text cells (Markdown), and outputs (including visualizations) in a single document with the .ipynb extension.
Kernel (Ch. 2) : In Jupyter, the computational engine that executes code. A Python kernel runs Python code. Restarting the kernel clears all variables from memory.
Left join (Ch. 9) : A merge operation that returns all rows from the left table and matching rows from the right table. Unmatched rows from the left table appear with NaN values for the right table's columns. See also inner join, outer join.
Linear regression (Ch. 26) : A supervised learning model that predicts a continuous outcome variable as a linear combination of predictor variables. The model finds the line (or hyperplane) that minimizes the sum of squared residuals.
List (Ch. 5) : A mutable, ordered sequence of elements in Python, created with square brackets: [1, 2, 3]. Elements can be of any type and can be added, removed, or modified.
List comprehension (Ch. 4) : A concise Python syntax for creating lists by applying an expression to each item in an iterable, optionally with a filter condition. Example: [x**2 for x in range(10) if x % 2 == 0].
Logistic regression (Ch. 27) : A supervised learning model for binary classification that predicts the probability of an observation belonging to one of two classes. Despite the name, it is a classification method, not a regression method.
Long format (Ch. 9) : A data layout where each row represents a single observation and a variable column identifies what is being measured. Also called tidy format or tall format. See also wide format.
Markdown (Ch. 2) : A lightweight text formatting language used in Jupyter text cells, README files, and documentation. Supports headings, bold, italics, links, images, and code blocks.
matplotlib (Ch. 15) : The foundational Python library for creating static, animated, and interactive visualizations. Provides low-level control over every aspect of a figure.
Mean (Ch. 19) : The arithmetic average of a set of values: the sum of all values divided by the count. Sensitive to outliers. See also median, mode.
Median (Ch. 19) : The middle value when observations are sorted from smallest to largest. If there is an even number of observations, the median is the average of the two middle values. Resistant to outliers. See also mean, mode.
Melt (Ch. 9) : A pandas operation that converts a DataFrame from wide format to long format by "unpivoting" columns into rows. Performed with pd.melt() or df.melt(). See also pivot.
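A minimal melt sketch, assuming a hypothetical wide-format table with one column per month:

```python
import pandas as pd

# Hypothetical wide-format data: monthly sales spread across columns
wide = pd.DataFrame({
    "store": ["A", "B"],
    "jan": [100, 150],
    "feb": [110, 160],
})

# Unpivot the month columns into (month, sales) rows: 2 stores x 2 months
long = wide.melt(id_vars="store", var_name="month", value_name="sales")
```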
Merge (Ch. 9) : The operation of combining two DataFrames based on a shared column (key). In pandas: pd.merge(left, right, on="key"). Equivalent to a SQL JOIN. See also inner join, left join, outer join.
Missing value (NaN) (Ch. 7) : A placeholder indicating that a data value is absent. In pandas, missing values are represented as NaN (Not a Number) for numeric data and NaT for datetime data. Handled with .isnull(), .dropna(), and .fillna().
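The three handling methods named above, sketched on a tiny Series with one missing value:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

n_missing = s.isnull().sum()   # count the missing values: 1
dropped = s.dropna()           # remove rows with missing values
filled = s.fillna(s.mean())    # impute with the mean of the non-missing values
```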
Mode (Ch. 19) : The most frequently occurring value in a dataset. A distribution can be unimodal (one peak), bimodal (two peaks), or multimodal. See also mean, median.
Model (Ch. 25) : A simplified representation of a real-world process that captures the essential patterns and relationships in data. Models can be used for prediction, explanation, or both.
Multicollinearity (Ch. 26) : A condition in which two or more predictor variables in a regression model are highly correlated with each other, making it difficult to isolate the individual effect of each predictor.
Mutable (Ch. 3) : An object whose value can be changed after creation. Lists, dictionaries, and sets are mutable in Python. See also immutable.
NaN (Not a Number) (Ch. 7) : See missing value.
Normal distribution (Ch. 21) : A symmetric, bell-shaped probability distribution defined by its mean ($\mu$) and standard deviation ($\sigma$). Many natural phenomena approximately follow a normal distribution. Also called a Gaussian distribution.
Null hypothesis (Ch. 23) : In hypothesis testing, the default assumption that there is no effect, no difference, or no relationship. Denoted $H_0$. The goal of a hypothesis test is to determine whether the data provide sufficient evidence to reject the null hypothesis.
NumPy (Ch. 7) : A Python library for numerical computing, providing efficient array operations, mathematical functions, and random number generation. The foundation on which pandas is built.
Numerical variable (Ch. 7) : A variable that represents a measurable quantity, taking on numerical values that support arithmetic operations. Can be continuous (any value in a range) or discrete (countable values). See also categorical variable.
Observation (Ch. 7) : A single unit of data, represented as a row in a DataFrame. Depending on context, an observation might be a person, a country, a transaction, or a measurement at a specific time.
Ordinal variable (Ch. 19) : A categorical variable whose categories have a natural ordering (e.g., "low," "medium," "high"), but the distances between categories are not necessarily equal.
Outer join (Ch. 9) : A merge operation that returns all rows from both tables, filling in NaN where a key exists in one table but not the other. See also inner join, left join.
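The three join types can be compared on a hypothetical pair of tables that share only one key:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "y": [3, 4]})

inner = pd.merge(left, right, on="key", how="inner")  # only the shared key "b"
left_j = pd.merge(left, right, on="key", how="left")  # all left keys: "a", "b"
outer = pd.merge(left, right, on="key", how="outer")  # all keys: "a", "b", "c"
```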
Outlier (Ch. 19) : A data point that is significantly different from other observations. Outliers can result from measurement errors, data entry mistakes, or genuine extreme values. They can strongly influence the mean and regression results.
Overfitting (Ch. 25) : When a model learns the noise in the training data rather than the underlying pattern, resulting in excellent performance on training data but poor performance on new data. See also underfitting, bias-variance tradeoff.
p-value (Ch. 23) : The probability of observing data as extreme as (or more extreme than) what was actually observed, assuming the null hypothesis is true. A small p-value (typically < 0.05) suggests the observed result is unlikely under the null hypothesis.
pandas (Ch. 7) : The primary Python library for data manipulation and analysis, providing DataFrame and Series data structures. Named after "panel data," a term from econometrics.
Parameter (Ch. 22) : A numerical characteristic of a population, such as the population mean or population proportion. Parameters are typically unknown and estimated from sample statistics.
Percentile (Ch. 19) : The value below which a given percentage of observations fall. The 75th percentile means 75% of observations are below that value. The 50th percentile is the median.
Pivot (Ch. 9) : A pandas operation that converts a DataFrame from long format to wide format by spreading unique values of one column into multiple columns. Performed with df.pivot() or df.pivot_table(). See also melt.
Plotly (Ch. 17) : A Python library for creating interactive visualizations, including charts, maps, and dashboards. Charts can be zoomed, panned, and hovered for detail.
Population (Ch. 22) : The complete set of individuals or objects about which you want to draw conclusions. Often too large to observe entirely, requiring a sample. See also sample.
Precision (Ch. 29) : Of all observations a model predicted as positive, the proportion that are actually positive. High precision means few false positives. See also recall, F1 score.
Predictive question (Ch. 1) : A question that asks about a future or unknown outcome. For example, "Which customers are likely to churn next month?" See also descriptive question, causal question.
Probability (Ch. 20) : A number between 0 and 1 that measures the likelihood of an event occurring. A probability of 0 means impossible; a probability of 1 means certain.
Proxy variable (Ch. 24) : A variable used as a stand-in for a variable that cannot be directly measured. Using a poor proxy can introduce bias. For example, using ZIP code as a proxy for income.
Q-Q plot (Quantile-Quantile plot) (Ch. 21) : A graphical tool for comparing a data distribution to a theoretical distribution (typically normal). If the data follow the theoretical distribution, the points fall along a diagonal line.
Random forest (Ch. 28) : An ensemble learning method that builds many decision trees on random subsets of the data and features, then averages their predictions. Reduces overfitting compared to a single decision tree.
Recall (Ch. 29) : Of all observations that are actually positive, the proportion that the model correctly identifies. High recall means few false negatives. Also called sensitivity or true positive rate. See also precision, F1 score.
Regression (Ch. 26) : A supervised learning task where the goal is to predict a continuous numerical outcome. See also classification.
Regular expression (regex) (Ch. 10) : A pattern-matching language for searching, matching, and manipulating text. Used in pandas with .str.contains(), .str.extract(), and .str.replace().
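A short sketch of the pandas string methods named above, on hypothetical order IDs:

```python
import pandas as pd

s = pd.Series(["order-123", "order-456", "refund-789"])

# Which entries match the pattern "order-" followed by digits?
is_order = s.str.contains(r"^order-\d+$")

# Extract just the digits; .str.extract returns a DataFrame, column 0 here
ids = s.str.extract(r"(\d+)")[0]
```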
Resampling (Ch. 11) : In time-series analysis, the process of changing the frequency of data (e.g., from daily to monthly) by aggregating values. In pandas: df.resample("M").mean().
Residual (Ch. 26) : The difference between an observed value and the value predicted by a model. Residual = observed - predicted. Analyzing residuals helps assess model quality.
Robust statistic (Ch. 19) : A summary statistic that is not heavily influenced by outliers or extreme values. The median and IQR are robust; the mean and standard deviation are not.
Rolling window (Ch. 11) : A fixed-size window that moves across time-series data, computing a summary statistic (e.g., mean, sum) at each position. Used for smoothing data and identifying trends.
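A minimal rolling-mean sketch; the first positions are NaN because the window is not yet full:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# 3-point rolling mean for smoothing; positions 0 and 1 lack enough values
smoothed = s.rolling(window=3).mean()
```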
R-squared ($R^2$) (Ch. 26) : A measure of how well a regression model fits the data, representing the proportion of variance in the outcome variable that is explained by the predictors. Ranges from 0 to 1, with higher values indicating better fit.
Sample (Ch. 22) : A subset of a population, selected for observation and analysis. The goal is for the sample to be representative of the population. See also population, sampling bias.
Sampling distribution (Ch. 22) : The probability distribution of a statistic (such as the sample mean) computed from repeated random samples from the same population. The spread of the sampling distribution decreases as sample size increases.
Scatter plot (Ch. 15) : A visualization that uses individual points to display the relationship between two numerical variables, with one on the x-axis and the other on the y-axis.
scikit-learn (sklearn) (Ch. 25) : The primary Python library for machine learning, providing implementations of classification, regression, clustering, and model evaluation algorithms.
seaborn (Ch. 16) : A Python visualization library built on top of matplotlib that provides a high-level interface for creating statistical graphics. Integrates closely with pandas DataFrames.
Selection bias (Ch. 22) : A bias that occurs when the sample is not representative of the population because of the way participants were selected. See also bias (in sampling).
Series (Ch. 7) : A one-dimensional labeled array in pandas, representing a single column of data. The building block of a DataFrame.
Significance level ($\alpha$) (Ch. 23) : The threshold probability below which the p-value leads to rejection of the null hypothesis. Commonly set at 0.05 (5%). See also p-value.
Skewness (Ch. 19) : A measure of the asymmetry of a distribution. Right-skewed distributions have a long tail to the right (mean > median); left-skewed distributions have a long tail to the left (mean < median).
Slice (Ch. 3) : A syntax for extracting a portion of a sequence using start:stop:step. For example, "Hello"[1:4] returns "ell".
Split-apply-combine (Ch. 9) : A data analysis pattern where data is split into groups, a function is applied to each group, and the results are combined into a single output. Implemented in pandas with .groupby().
SQL (Structured Query Language) (Ch. 12) : A language for managing and querying relational databases. While not covered in depth in this book, SQL is a critical skill for data science careers.
Standard deviation (Ch. 19) : A measure of the spread of a distribution, computed as the square root of the variance. Expressed in the same units as the original data. See also variance.
Standard error (Ch. 22) : The standard deviation of a sampling distribution. Measures the variability of a sample statistic from sample to sample. For the sample mean: $SE = s / \sqrt{n}$.
Statistic (Ch. 22) : A numerical summary computed from sample data, used to estimate a population parameter. For example, the sample mean $\bar{x}$ estimates the population mean $\mu$.
Structured data (Ch. 1) : Data organized into a predefined format with rows and columns, such as a spreadsheet or database table. See also unstructured data.
Supervised learning (Ch. 25) : A machine learning approach where the model is trained on data that includes both input features and known output labels (the "answers"). See also unsupervised learning.
Test set (Ch. 25) : A portion of the data held out from model training, used exclusively to evaluate model performance on unseen data. See also training set, cross-validation.
Tidy data (Ch. 9) : A data organization principle where: (1) each variable has its own column, (2) each observation has its own row, and (3) each value has its own cell. Equivalent to long format.
Time series (Ch. 11) : A sequence of data points ordered by time. Examples include daily stock prices, monthly sales figures, and hourly temperature readings.
Train-test split (Ch. 25) : The practice of dividing a dataset into a training set (used to build the model) and a test set (used to evaluate it). Prevents evaluation on the same data used for training, which would give misleadingly good results.
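A scikit-learn sketch of the split, using synthetic arrays purely for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 observations, 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 30% of rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
```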
Training set (Ch. 25) : The portion of the data used to train (fit) a model. The model learns patterns from the training set and is then evaluated on the test set. See also test set.
Tuple (Ch. 5) : An immutable, ordered sequence of elements in Python, created with parentheses: (1, 2, 3). Used for fixed collections and as dictionary keys.
Type I error (Ch. 23) : Rejecting the null hypothesis when it is actually true (a "false positive" in hypothesis testing). The probability of a Type I error is equal to the significance level $\alpha$. See also Type II error.
Type II error (Ch. 23) : Failing to reject the null hypothesis when it is actually false (a "false negative" in hypothesis testing). The probability depends on the sample size and the true effect size. See also Type I error.
Underfitting (Ch. 25) : When a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and test data. See also overfitting.
Unstructured data (Ch. 1) : Data that does not follow a predefined format, such as text documents, images, audio, and video. Requires specialized processing before it can be analyzed quantitatively. See also structured data.
Unsupervised learning (Ch. 25) : A machine learning approach where the model is trained on data without known output labels, seeking to discover hidden patterns or structures (e.g., clustering, dimensionality reduction). See also supervised learning.
Variable (programming) (Ch. 3) : A name that refers to a value stored in memory. In Python, variables are labels pointing to objects, not boxes containing values.
Variable (statistical) (Ch. 1) : A characteristic or attribute that can vary among observations in a dataset. Variables can be numerical (quantitative) or categorical (qualitative).
Variance (Ch. 19) : A measure of the spread of a distribution, computed as the average of the squared deviations from the mean. The square root of the variance is the standard deviation.
Variance (in models) (Ch. 25) : The variability in a model's predictions when trained on different datasets. High variance means the model is sensitive to the specific training data and may overfit. See also bias, bias-variance tradeoff.
Vectorized operation (Ch. 7) : An operation applied to an entire array or column at once, without explicit looping. Vectorized operations in pandas and NumPy are much faster than equivalent Python loops.
Visualization (Ch. 14) : The representation of data in graphical form to reveal patterns, trends, outliers, and relationships. Visualization is both an exploratory tool and a communication tool.
Web scraping (Ch. 13) : The automated extraction of data from web pages. Requires parsing HTML and should be done ethically, respecting terms of service and robots.txt directives.
Wide format (Ch. 9) : A data layout where each variable has its own column, often with repeated measurements spread across columns (e.g., Jan_Sales, Feb_Sales, Mar_Sales). See also long format, tidy data.
z-score (Ch. 21) : The number of standard deviations a data point is from the mean: $z = (x - \mu) / \sigma$. A z-score of 2 means the value is 2 standard deviations above the mean. Used for standardization and for computing probabilities under the normal distribution.
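The standardization formula above can be checked numerically: after the transformation, the values have mean 0 and standard deviation 1. A sketch with hypothetical values:

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0, 8.0])
mu = values.mean()            # 5.0
sigma = values.std()          # population standard deviation
z = (values - mu) / sigma     # standardized values
```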
Total terms: approximately 150. For additional Python-specific terminology, see Appendix B. For mathematical notation, see Appendix A.