Glossary
All key terms from Chapters 1-28, alphabetized. The format is: Term (Ch.N, section N.X): Definition. Cross-references to related terms are listed with "See also."
A/B testing (Ch.4, section 4.7): A randomized experiment in which users or participants are randomly assigned to one of two versions (A or B) to determine which performs better on a specified metric. Common in technology and marketing. See also: experiment, randomization, two-sample t-test.
accuracy (Ch.24, section 24.8): The proportion of all predictions that are correct: (TP + TN) / (TP + TN + FP + FN). Can be misleading when classes are imbalanced. See also: accuracy paradox, confusion matrix, sensitivity, specificity.
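As a quick illustration of the formula (all counts invented):

```python
# Invented confusion-matrix counts, for illustration only
TP, TN, FP, FN = 90, 5, 3, 2

# accuracy = proportion of all predictions that are correct
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.95
```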
accuracy paradox (Ch.24, section 24.8): The phenomenon where a model that always predicts the majority class achieves high accuracy despite being useless. Occurs when one class is much more common than the other. See also: accuracy, confusion matrix, F1 score.
addition rule (Ch.8, section 8.5): P(A or B) = P(A) + P(B) - P(A and B). For mutually exclusive events, simplifies to P(A or B) = P(A) + P(B). See also: mutually exclusive, probability.
adjusted R-squared (Ch.23, section 23.5): A modified version of R-squared that penalizes for the number of predictors in a multiple regression model. Unlike R-squared, adjusted R-squared can decrease when an unhelpful predictor is added. See also: R-squared, multiple regression.
algorithmic bias (Ch.26, section 26.5): Systematic unfairness in the outputs of an algorithm, typically arising from biased training data, biased labels, proxy variables, or biased feature selection. See also: proxy variable, training data, fairness impossibility theorem.
alternative hypothesis (Ch.13, section 13.3): The claim the researcher is trying to find evidence for. Denoted Ha or H1. Contains an inequality: not-equal-to, less-than, or greater-than. See also: null hypothesis, hypothesis testing, p-value.
Anaconda (Ch.3, section 3.11): A popular distribution of Python that includes Jupyter, pandas, numpy, scipy, and hundreds of other packages for data science. See Appendix C for installation instructions.
ANOVA (Ch.20, section 20.2): Analysis of Variance. A statistical method for comparing the means of three or more groups simultaneously by decomposing total variability into between-group and within-group components. See also: F-statistic, one-way ANOVA, eta-squared, Tukey's HSD.
ANOVA table (Ch.20, section 20.7): A structured summary of ANOVA results showing Source (Between, Within, Total), Sum of Squares, degrees of freedom, Mean Square, F-statistic, and p-value.
Anscombe's Quartet (Ch.22, section 22.1): Four datasets with nearly identical summary statistics but very different visual patterns, demonstrating that you must always visualize data before analyzing it. See also: scatterplot, correlation coefficient.
ASA statement on p-values (Ch.13, section 13.6): The 2016 statement by the American Statistical Association providing six principles for the proper use and interpretation of p-values. Emphasized that p-values do not measure the probability of a hypothesis being true.
AUC (Ch.24, section 24.9): Area Under the ROC Curve. Measures a classifier's ability to discriminate between classes across all possible thresholds. AUC = 1.0 is perfect; AUC = 0.5 is no better than random guessing. See also: ROC curve, logistic regression.
bar chart (Ch.5, section 5.2): A graph that displays the frequency or relative frequency of each category using rectangular bars with heights proportional to counts. Bars do not touch (unlike histograms). Used for categorical variables. See also: histogram, pie chart.
base rate fallacy (Ch.9, section 9.14): The error of ignoring the prior probability (prevalence or base rate) when interpreting conditional probabilities. Common in medical testing and forensic contexts. See also: Bayes' theorem, prior probability, positive predictive value.
Bayes' theorem (Ch.9, section 9.6): P(A|B) = P(B|A) * P(A) / P(B). A formula for updating the probability of an event based on new evidence. Connects prior probability to posterior probability. See also: conditional probability, prior probability, posterior probability.
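A small Python sketch of the update, with invented numbers for prevalence, sensitivity, and false positive rate:

```python
# Bayes' theorem with invented numbers: 1% prevalence,
# 90% sensitivity, 5% false positive rate
p_A = 0.01                                   # prior P(A): has the disease
p_B_given_A = 0.90                           # P(B|A): positive given disease
p_B_given_notA = 0.05                        # P(B|not A): false positive rate

# P(B) by the law of total probability
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

posterior = p_B_given_A * p_A / p_B          # P(A|B)
print(round(posterior, 3))  # 0.154
```

Note how far the posterior (about 15%) is from the 90% sensitivity — ignoring the 1% prior is exactly the base rate fallacy.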
Belmont Report (Ch.27, section 27.6): A 1979 U.S. government document establishing three core principles for ethical research involving human subjects: Respect for Persons, Beneficence, and Justice. The foundation of modern IRB requirements.
between-group variability (Ch.20, section 20.3): In ANOVA, the variation in group means around the grand mean (SS_Between). Measures the "signal" — how much the groups differ from each other. See also: within-group variability, F-statistic.
bias (Ch.4, section 4.3): A systematic tendency to produce results that are wrong in a particular direction. Can arise from sampling method, measurement, or analysis decisions. See also: selection bias, response bias, nonresponse bias, confounding variable.
bias-variance tradeoff (Ch.26, section 26.4): The principle that simple models have high bias but low variance (consistently wrong), while complex models have low bias but high variance (inconsistently right). Finding the right balance minimizes total prediction error. See also: overfitting.
big data (Ch.26, section 26.6): Datasets with millions or billions of observations. Size alone does not guarantee quality or representativeness; with enough data, even trivially small effects become statistically significant.
bimodal (Ch.5, section 5.7): A distribution with two distinct peaks (modes). May indicate the data comes from two different populations. See also: unimodal, distribution shape.
binary outcome (Ch.24, section 24.1): A response variable that takes only two values (yes/no, 0/1, success/failure). Logistic regression is designed for binary outcomes. See also: logistic regression.
binomial distribution (Ch.10, section 10.3): The distribution of the count of successes in n independent trials, each with the same probability p of success. Requires BINS conditions: Binary outcomes, Independent trials, Number of trials fixed, Same probability each trial. See also: normal distribution, probability distribution.
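The probability of each count can be computed directly from the binomial formula; a short sketch using only the standard library (n and p are illustrative):

```python
from math import comb

# Binomial pmf from the formula: P(X = k) = C(n, k) * p^k * (1-p)^(n-k)
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5   # e.g. 10 fair coin flips
print(binom_pmf(7, n, p))                          # P(exactly 7 successes)
print(sum(binom_pmf(k, n, p) for k in range(8)))   # P(7 or fewer successes)
```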
binning (Ch.7, section 7.7): Converting a continuous numerical variable into discrete categories (e.g., ages into age groups). See also: recoding, feature engineering.
blinding (Ch.4, section 4.7): Keeping participants (single-blind) or both participants and researchers (double-blind) unaware of group assignments in an experiment, to prevent bias in measurement or behavior. See also: double-blind, placebo.
Bonferroni correction (Ch.20, section 20.8): A method for controlling the family-wise error rate when performing multiple comparisons: divide alpha by the number of comparisons. Simple but conservative. See also: Tukey's HSD, multiple comparisons problem.
bootstrap (Ch.18, section 18.2): A resampling method that estimates the sampling distribution of a statistic by repeatedly drawing samples with replacement from the observed data. Invented by Bradley Efron in 1979. See also: bootstrap confidence interval, resampling, with replacement.
bootstrap confidence interval (Ch.18, section 18.4): A confidence interval constructed from the percentiles of the bootstrap distribution. For a 95% CI, take the 2.5th and 97.5th percentiles of the bootstrap distribution. See also: bootstrap, confidence interval.
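A minimal sketch of the percentile method using numpy (the sample data and resample count are invented):

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded for reproducibility
data = np.array([12, 15, 9, 14, 11, 16, 13, 10, 15, 12])  # invented sample

# 5000 bootstrap resamples (with replacement); record the mean of each
boot_means = np.array([rng.choice(data, size=len(data), replace=True).mean()
                       for _ in range(5000)])

# 95% percentile CI: 2.5th and 97.5th percentiles of the bootstrap distribution
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(lo, hi)
```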
bootstrap distribution (Ch.18, section 18.3): The distribution of a statistic computed across many bootstrap resamples. It approximates the sampling distribution.
box plot (Ch.6, section 6.6): A graph displaying the five-number summary: minimum, Q1, median, Q3, and maximum, with a box from Q1 to Q3 and whiskers extending to the most extreme non-outlier values. Outliers plotted as individual points. Also called box-and-whisker plot. See also: five-number summary, IQR, outlier.
BRFSS (Ch.3, section 3.5): CDC Behavioral Risk Factor Surveillance System. The largest continuously conducted health survey in the world. See Appendix D.
care ethics (Ch.27, section 27.8): An ethical framework that centers the most vulnerable populations and prioritizes maintaining trust and relationships. See also: utilitarian ethics, rights-based ethics.
categorical variable (Ch.2, section 2.2): A variable whose values represent categories or groups rather than quantities. Can be nominal (no natural order) or ordinal (ordered categories). See also: nominal, ordinal, numerical variable.
CCPA (Ch.27, section 27.6): California Consumer Privacy Act (2020). A data privacy regulation based on an opt-out model, with penalties up to $7,500 per violation. See also: GDPR, data privacy.
cell (Ch.3, section 3.2): A block in a Jupyter notebook that contains either code or text (Markdown).
Central Limit Theorem (Ch.11, section 11.4): The sampling distribution of the sample mean is approximately normal for sufficiently large n, regardless of the shape of the population distribution. The mean of the sampling distribution equals the population mean, and the standard deviation equals sigma/sqrt(n). The most important theorem in introductory statistics. See also: sampling distribution, standard error.
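A quick simulation illustrating the theorem with a deliberately skewed population (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# 10,000 sample means, each from n = 50 draws of a right-skewed
# exponential population (population mean 2.0, population sigma 2.0)
means = rng.exponential(scale=2.0, size=(10_000, n)).mean(axis=1)

print(means.mean())   # close to the population mean, 2.0
print(means.std())    # close to sigma/sqrt(n) = 2/sqrt(50), about 0.283
```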
chartjunk (Ch.25, section 25.2): Visual elements in a graph that do not convey data — decorative gridlines, 3D effects, unnecessary colors, background images. Term coined by Edward Tufte. See also: data-ink ratio, Tufte's principles.
cherry-picking (Ch.27, section 27.4): Selecting data, time ranges, or subgroups that support a preferred conclusion while ignoring contradictory evidence. A form of intellectual dishonesty. See also: p-hacking, questionable research practices.
chi-square distribution (Ch.19, section 19.2): A right-skewed distribution that is always non-negative. Used for goodness-of-fit and independence tests. Indexed by degrees of freedom. See also: chi-square test, degrees of freedom.
chi-square test (Ch.19, section 19.2): A family of hypothesis tests for categorical data. The test statistic is chi-squared = sum of (O-E)^2 / E. Includes goodness-of-fit and independence tests. See also: goodness-of-fit test, test of independence, Cramer's V.
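A worked sketch of the statistic for an invented 2x2 table, using the expected-frequency formula from this chapter:

```python
# Invented 2x2 contingency table of observed counts
observed = [[30, 20],
            [10, 40]]

grand = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Expected counts under independence: E = (row total * column total) / grand total
expected = [[r * c / grand for c in col_totals] for r in row_totals]

# chi-squared = sum of (O - E)^2 / E over all cells
chi2 = sum((o - e) ** 2 / e
           for orow, erow in zip(observed, expected)
           for o, e in zip(orow, erow))
print(chi2)  # about 16.67
```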
classification (Ch.24, section 24.13): Predicting a categorical outcome from one or more predictor variables. Logistic regression is the simplest classification method. See also: logistic regression, confusion matrix.
cleaning log (Ch.7, section 7.10): A documented record of every data cleaning decision — what was changed, why, and how. Essential for reproducibility. See also: data cleaning, reproducibility.
cluster sampling (Ch.4, section 4.2): A sampling method in which entire naturally occurring groups (clusters) are randomly selected, and all members within selected clusters are included. See also: stratified sampling, random sample.
coefficient of determination (Ch.22, section 22.8): See R-squared.
Cohen's d (Ch.17, section 17.4): A standardized measure of the difference between two group means: d = (mean1 - mean2) / pooled SD. Benchmarks: 0.2 = small, 0.5 = medium, 0.8 = large. See also: effect size, practical significance.
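A short computation with invented group data (the pooled-SD line here is the equal-sample-size version of the formula):

```python
import numpy as np

g1 = np.array([5.1, 4.8, 6.0, 5.5, 5.2])   # invented data, group 1
g2 = np.array([4.2, 4.5, 4.0, 4.8, 4.1])   # invented data, group 2

# Pooled SD for equal n: square root of the average of the sample variances
pooled_sd = np.sqrt((g1.var(ddof=1) + g2.var(ddof=1)) / 2)
d = (g1.mean() - g2.mean()) / pooled_sd
print(round(d, 2))  # well past 0.8, so "large" by Cohen's benchmarks
```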
Cohen's h (Ch.17, section 17.4): An effect size for comparing two proportions: h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2)). Same benchmarks as Cohen's d. See also: effect size.
collaborative filtering (Ch.26, section 26.2): A recommendation technique that predicts a user's preferences based on the preferences of similar users. At its core, a nearest-neighbor prediction method.
Common Rule (Ch.27, section 27.6): Federal regulations (45 CFR 46) requiring IRB review of research involving human subjects. See also: IRB, Belmont Report.
complement (Ch.8, section 8.4): The event that A does not occur. P(not A) = 1 - P(A). See also: probability.
COMPAS (Ch.26, section 26.5): Correctional Offender Management Profiling for Alternative Sanctions. A recidivism prediction algorithm examined by ProPublica for racial disparities. See also: algorithmic bias, fairness impossibility theorem.
conditional probability (Ch.9, section 9.2): P(A|B), the probability of A given that B is known to be true. Calculated as P(A and B) / P(B). See also: Bayes' theorem, joint probability.
confidence band (Ch.25, section 25.9): A shaded region around a regression line showing the uncertainty in the predicted values at each point. Wider at the extremes. See also: confidence interval, error bars.
confidence interval (Ch.12, section 12.2): A range of plausible values for a population parameter, computed as point estimate plus or minus margin of error. A 95% CI means that if sampling were repeated many times, approximately 95% of such intervals would contain the true parameter. See also: confidence level, margin of error, point estimate.
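A sketch of the point-estimate-plus-or-minus-margin construction using only the standard library. It uses the z critical value for simplicity; with a sample this small, a t critical value would be more appropriate (data invented):

```python
from statistics import NormalDist, mean, stdev

sample = [12, 15, 9, 14, 11, 16, 13, 10, 15, 12]   # invented data
z = NormalDist().inv_cdf(0.975)                    # about 1.960 for 95%

se = stdev(sample) / len(sample) ** 0.5            # estimated standard error
margin = z * se
ci = (mean(sample) - margin, mean(sample) + margin)
print(ci)
```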
confidence level (Ch.12, section 12.3): The proportion of confidence intervals that would contain the true parameter if the sampling process were repeated many times. Common levels: 90%, 95%, 99%. Describes the method's long-run reliability, not any single interval. See also: confidence interval.
confounding variable (Ch.4, section 4.6): A variable associated with both the explanatory and response variables, creating a spurious association or masking a real one. The reason correlation does not imply causation in observational studies. Also called a lurking variable. See also: lurking variable, correlation does not imply causation.
confusion matrix (Ch.24, section 24.8): A 2x2 table showing the four possible outcomes of a binary classifier: true positives, false positives, true negatives, and false negatives. See also: sensitivity, specificity, accuracy, precision.
contingency table (Ch.8, section 8.7): A two-way table showing frequencies for combinations of two categorical variables. Also called a cross-tabulation. See also: chi-square test of independence, joint probability, marginal probability.
continuity correction (Ch.10, section 10.8): An adjustment of plus or minus 0.5 when approximating a discrete distribution with a continuous one. Each discrete value "occupies" a half-unit on each side.
continuous variable (Ch.2, section 2.3): A numerical variable that can take any value within a range, including fractions and decimals (e.g., height, temperature, time). See also: discrete variable, numerical variable.
control chart (Ch.6, case-study-02): A graph that plots a process measure over time with control limits (typically plus or minus 3 standard deviations from the center line). Used in quality control to distinguish common cause from special cause variation. Invented by Walter Shewhart.
control group (Ch.4, section 4.7): The group in an experiment that does not receive the treatment. Provides a baseline for comparison. See also: treatment group, experiment.
convenience sample (Ch.4, section 4.2): A sample selected based on ease of access rather than randomness. Highly susceptible to bias. See also: random sample, bias.
correlation coefficient (r) (Ch.22, section 22.3): See Pearson's r.
Cramer's V (Ch.19, section 19.6): An effect size for chi-square tests: V = sqrt(chi-squared / (n * (k-1))), where k = min(rows, columns). Ranges from 0 to 1. Benchmarks: 0.10 = small, 0.30 = medium, 0.50 = large. See also: chi-square test, effect size.
critical value (Ch.12, section 12.4): The value from a reference distribution (z or t) that determines the width of a confidence interval for a given confidence level. For 95% CI: z = 1.960, t depends on degrees of freedom. See also: confidence interval, t-distribution.
cross-sectional study (Ch.2, section 2.7): A study that collects data at a single point in time. Provides a snapshot. See also: longitudinal study.
CSV (Ch.3, section 3.5): Comma-Separated Values. A plain text file format for tabular data where values in each row are separated by commas.
data (Ch.1, section 1.1): Information collected for analysis. Can be numerical or categorical, structured or unstructured.
data cleaning (Ch.7, section 7.1): The process of detecting and correcting errors, inconsistencies, and quality issues in a dataset. See also: data wrangling, cleaning log.
data dictionary (Ch.2, section 2.5): A document describing each variable in a dataset: name, type, possible values, units, and how it was collected. Also called a codebook.
data-ink ratio (Ch.25, section 25.2): The proportion of a graph's ink that represents actual data, as opposed to decorative elements. Tufte advocated maximizing this ratio. See also: chartjunk, Tufte's principles.
data literacy (Ch.1, section 1.4): The ability to read, interpret, and critically evaluate data-based claims.
data mining (Ch.26, section 26.6): Searching large datasets for patterns without pre-specified hypotheses. High risk of spurious findings due to multiple comparisons. See also: big data, p-hacking.
data privacy (Ch.27, section 27.6): The rights of individuals to control their personal information. See also: GDPR, CCPA, re-identification, informed consent.
data quality (Ch.7, section 7.1): The degree to which data is accurate, complete, consistent, and suitable for its intended purpose. See also: data cleaning, missing data.
data science pipeline (Ch.28, section 28.4): The full workflow from question formulation through data collection, cleaning, exploration, analysis, interpretation, and communication.
data storytelling (Ch.25, section 25.7): The practice of communicating data insights through narrative, combining visualizations with context and interpretation.
data visualization principles (Ch.25, section 25.2): Guidelines for creating effective, honest, and accessible graphs. Includes Tufte's principles, accessibility considerations, and avoiding misleading techniques.
data wrangling (Ch.7, section 7.1): The broader process of transforming raw data into a format suitable for analysis. Includes cleaning, reshaping, and creating new variables. See also: data cleaning, feature engineering.
DataFrame (Ch.3, section 3.4): The core two-dimensional data structure in pandas, with rows (observations) and columns (variables). See also: pandas.
deepfake (Ch.26, section 26.9): AI-generated realistic media that mimics real people. Relevant to statistical literacy as a source of misinformation.
degrees of freedom (Ch.6, section 6.5; Ch.12, section 12.4): The number of independent pieces of information in a calculation. For a one-sample t-test: df = n - 1. For chi-square independence: df = (r-1)(c-1). For ANOVA: df_between = k - 1, df_within = N - k. See also: t-distribution, chi-square distribution, F-distribution.
dependent samples (Ch.16, section 16.4): Paired or matched observations where each data point in one group has a natural partner in the other. See also: paired t-test, independent samples.
descriptive statistics (Ch.1, section 1.1): Methods for summarizing and presenting the data you already have (means, graphs, tables). Does not involve generalizing beyond the sample. See also: inferential statistics.
discrete variable (Ch.2, section 2.3): A numerical variable that can only take specific, countable values (e.g., number of children, number of errors). See also: continuous variable, numerical variable.
distribution-free (Ch.21, section 21.3): A synonym for nonparametric. Tests that do not require assumptions about the shape of the population distribution. See also: nonparametric test.
distribution shape (Ch.5, section 5.7): The overall pattern of a distribution when graphed: symmetric, skewed left, skewed right, unimodal, bimodal. Affects which summary statistics and tests are appropriate.
double-blind (Ch.4, section 4.7): An experimental design in which neither participants nor researchers know who is in the treatment versus control group. The gold standard for reducing bias. See also: blinding, placebo.
ecological fallacy (Ch.27, section 27.3): The error of drawing conclusions about individuals from group-level (aggregate) data. See also: Simpson's paradox.
effect size (Ch.17, section 17.4): A standardized, sample-size-independent measure of the magnitude of an effect. Tells you how big an effect is, not just whether it exists. See also: Cohen's d, R-squared, Cramer's V, eta-squared, practical significance.
Empirical Rule (Ch.6, section 6.7): For bell-shaped (approximately normal) distributions: approximately 68% of data falls within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3. Also called the 68-95-99.7 Rule. See also: normal distribution, standard deviation.
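The three percentages can be checked against the standard normal CDF:

```python
from statistics import NormalDist

nd = NormalDist()   # standard normal
for k in (1, 2, 3):
    # proportion of a normal distribution within k SDs of the mean
    print(round(nd.cdf(k) - nd.cdf(-k), 4))   # 0.6827, 0.9545, 0.9973
```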
error bars (Ch.25, section 25.9): Visual indicators of uncertainty displayed on bar charts, typically showing standard errors or confidence intervals. See also: confidence interval, standard error.
eta-squared (Ch.20, section 20.10): An effect size for ANOVA: eta-squared = SS_Between / SS_Total. The proportion of total variability explained by the grouping factor. See also: ANOVA, effect size.
event (Ch.8, section 8.2): A collection of one or more outcomes of interest from a random process. See also: probability, sample space, outcome.
executive summary (Ch.25, section 25.11): A brief, non-technical overview of an analysis written for decision-makers. Focuses on key findings, implications, and recommendations.
expected frequency (Ch.19, section 19.2): The count in each cell of a contingency table that would be expected under the null hypothesis. For independence tests: E = (row total * column total) / grand total. See also: observed frequency, chi-square test.
expected value (Ch.10, section 10.2): The long-run average of a random variable: E(X) = sum of x * P(X = x). The "center" of a probability distribution. See also: probability distribution, random variable.
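A one-line computation of the formula for a fair six-sided die:

```python
# E(X) = sum of x * P(X = x); for a fair die, each outcome has P = 1/6
outcomes = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6
ev = sum(x * p for x, p in zip(outcomes, probs))
print(ev)   # the long-run average roll, 3.5
```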
experiment (Ch.4, section 4.1): A study in which the researcher deliberately imposes a treatment on subjects and observes the response. Random assignment to groups is the key feature that allows causal claims. See also: observational study, randomization, control group.
exploratory data analysis (EDA) (Ch.5, section 5.1): The process of using graphs and summary statistics to understand patterns, outliers, and relationships in data before formal testing.
extrapolation (Ch.22, section 22.10): Using a regression model to predict values outside the range of the observed data. Dangerous because the linear relationship may not hold beyond the observed range. See also: regression line.
F-distribution (Ch.20, section 20.6): A right-skewed distribution with two degrees of freedom parameters (df1 and df2). Used for ANOVA. See also: F-statistic, ANOVA.
F-statistic (Ch.20, section 20.5): F = MS_Between / MS_Within. Measures how much the group means vary relative to the variation within groups. Large F suggests group means are genuinely different. See also: ANOVA, between-group variability, within-group variability.
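A worked sketch of F from the sums of squares, with invented data chosen so the arithmetic comes out cleanly:

```python
groups = [[4, 5, 6], [7, 8, 9], [10, 11, 12]]   # invented data, k = 3 groups

all_vals = [x for g in groups for x in g]
grand_mean = sum(all_vals) / len(all_vals)
k, N = len(groups), len(all_vals)

# SS_Between: group size times squared deviation of group mean from grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# SS_Within: squared deviations of each value from its own group mean
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

# F = MS_Between / MS_Within, with df_between = k - 1 and df_within = N - k
F = (ss_between / (k - 1)) / (ss_within / (N - k))
print(F)  # 27.0
```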
F1 score (Ch.24, section 24.8): The harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall). Balances both types of classification error. See also: precision, recall, confusion matrix.
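A quick computation with invented classifier counts:

```python
# Invented confusion-matrix counts
TP, FP, FN = 40, 10, 20

precision = TP / (TP + FP)   # 0.8
recall = TP / (TP + FN)      # 2/3
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.727
```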
fail to reject (Ch.13, section 13.7): The conclusion when p-value > alpha. Not enough evidence to overturn H0. Does NOT mean H0 is true. See also: null hypothesis, p-value.
fairness impossibility theorem (Ch.24, case-study-02): The mathematical proof (Chouldechova, 2017) that when base rates differ between groups, it is impossible to simultaneously equalize false positive rates, false negative rates, and predictive values. See also: algorithmic bias, COMPAS.
false discovery rate (Ch.13, case-study-01): The proportion of "significant" findings that are actually false positives. In a field where many hypotheses are tested, the false discovery rate can be much higher than alpha.
false negative (Ch.9, section 9.8): A negative test result when the condition is actually present. In hypothesis testing, a Type II error. See also: Type II error, sensitivity.
false positive (Ch.9, section 9.8): A positive test result when the condition is actually absent. In hypothesis testing, a Type I error. See also: Type I error, specificity.
family-wise error rate (Ch.20, section 20.2): The probability of making at least one Type I error when conducting multiple comparisons. P(at least 1 false positive) = 1 - (1 - alpha)^m, where m = number of tests. See also: Bonferroni correction, multiple comparisons problem.
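The formula in this entry, computed for ten tests:

```python
alpha, m = 0.05, 10   # 10 comparisons, each at alpha = 0.05

# P(at least one false positive) = 1 - (1 - alpha)^m
fwer = 1 - (1 - alpha) ** m
print(round(fwer, 3))   # 0.401 — a 40% chance of at least one Type I error

# Bonferroni correction: run each individual test at alpha / m
print(alpha / m)        # 0.005
```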
feature engineering (Ch.7, section 7.8): Creating new variables from existing ones to better capture patterns in the data (e.g., creating BMI from height and weight). See also: data wrangling, recoding.
file drawer problem (Ch.17, section 17.9): The tendency for null results to remain unpublished, biasing the scientific literature toward significant findings. Term coined by Rosenthal (1979). See also: publication bias.
five-number summary (Ch.6, section 6.6): Minimum, Q1, Median, Q3, Maximum. The basis for a box plot. See also: box plot, quartile.
frequency distribution (Ch.5, section 5.4): A table or graph showing how often each value (or range of values) occurs in a dataset. See also: relative frequency, histogram.
gambler's fallacy (Ch.8, section 8.3): The mistaken belief that past independent random events influence future ones (e.g., "heads is due" after several tails). See also: independent events, law of large numbers.
garden of forking paths (Ch.13, section 13.12): Andrew Gelman's metaphor for the many analysis decisions a researcher makes (which variables to include, how to handle outliers, which tests to run) that inflate false positive rates even without intentional p-hacking.
GDPR (Ch.27, section 27.6): General Data Protection Regulation (EU, 2018). Requires opt-in consent for data collection, with penalties up to 4% of global revenue. See also: CCPA, data privacy.
general multiplication rule (Ch.9, section 9.11): P(A and B) = P(A) * P(B|A). Applies to all events, including dependent ones. Generalizes the multiplication rule from Ch.8. See also: multiplication rule, conditional probability.
goodness-of-fit test (Ch.19, section 19.3): A chi-square test that compares observed frequencies in a single categorical variable to expected frequencies under a hypothesized distribution. df = k - 1. See also: chi-square test, test of independence.
Google Colab (Ch.3, section 3.2): A free browser-based Jupyter notebook environment provided by Google. Requires no installation. See Appendix C.
grand mean (Ch.20, section 20.4): The mean of all N observations combined, ignoring group membership. Used in ANOVA calculations.
hallucination (AI) (Ch.26, section 26.8): When a large language model generates false information that sounds convincing. Occurs because the model produces statistically likely text, which is not the same as true text.
HARKing (Ch.27, section 27.5): Hypothesizing After Results are Known. Presenting post-hoc findings as though they were predicted in advance. A questionable research practice. See also: p-hacking, pre-registration.
hedging language (Ch.25, section 25.9): Appropriately cautious language used when communicating statistical results (e.g., "the data suggest" rather than "the data prove"). Essential for honest reporting.
histogram (Ch.5, section 5.4): A graph that displays the distribution of a numerical variable by dividing the range into bins and showing the frequency or relative frequency of observations in each bin. Bars touch (unlike bar charts). See also: bar chart, distribution shape.
hypothesis testing (Ch.13, section 13.2): A formal framework for using sample data to decide between two competing claims about a population parameter. Uses indirect reasoning: assume H0, calculate how unlikely the data would be, and decide whether the evidence warrants rejecting H0. See also: null hypothesis, alternative hypothesis, p-value, test statistic.
IDE (Ch.3, section 3.11): Integrated Development Environment. Software for writing and running code, such as JupyterLab, VS Code, or PyCharm.
IMRaD (Ch.25, section 25.11): Introduction, Methods, Results, and Discussion. The standard structure for scientific papers and analysis reports.
imputation (Ch.7, section 7.3): Replacing missing values with estimated values (mean, median, mode, or model-based estimates). See also: missing data, listwise deletion.
independent events (Ch.8, section 8.6): Events where knowing that one occurred does not change the probability of the other. Formally: P(A|B) = P(A). See also: multiplication rule, conditional probability.
independent samples (Ch.16, section 16.3): Two samples where individuals in one group are completely unrelated to individuals in the other. See also: dependent samples, two-sample t-test.
indicator variable (Ch.23, section 23.6): A variable coded as 0 or 1 to represent membership in a category. Used to include categorical predictors in regression. Also called a dummy variable. See also: multiple regression.
inferential statistics (Ch.1, section 1.1): Methods for drawing conclusions about a population based on information from a sample. Includes confidence intervals and hypothesis tests. See also: descriptive statistics, population, sample.
informed consent (Ch.4, section 4.7; Ch.27, section 27.6): The ethical requirement that research participants understand the purpose, procedures, risks, and benefits of a study before agreeing to participate. See also: IRB, Belmont Report.
interaction term (Ch.23, section 23.7): A product of two predictor variables in a regression model that allows the effect of one predictor to depend on the level of another.
intercept (Ch.22, section 22.6): The predicted value of Y when X = 0 in a regression equation. Often has no meaningful interpretation if X = 0 is outside the observed range. See also: slope, regression line.
interquartile range (IQR) (Ch.6, section 6.4): Q3 minus Q1. The spread of the middle 50% of the data. Used for outlier detection: observations below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are potential outliers. See also: quartile, box plot, outlier.
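A sketch of the fence rule using the standard library (data invented; note that quartile conventions differ slightly across software, so other tools may give slightly different cut points):

```python
import statistics

data = [10, 12, 13, 14, 15, 16, 18, 45]   # invented data with one outlier

q1, _, q3 = statistics.quantiles(data, n=4)   # Q1, median, Q3
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # outlier fences
outliers = [x for x in data if x < low or x > high]
print(outliers)  # [45]
```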
interval estimate (Ch.12, section 12.2): A range of values used to estimate a population parameter. Synonym for confidence interval. See also: confidence interval, point estimate.
IRB (Ch.4, section 4.7; Ch.27, section 27.6): Institutional Review Board. A committee that reviews research proposals involving human subjects to ensure ethical standards are met. Required by the Common Rule. See also: informed consent, Belmont Report.
joint probability (Ch.8, section 8.7): The probability that two events occur simultaneously. In a contingency table: cell count divided by grand total. See also: marginal probability, conditional probability.
Jupyter notebook (Ch.3, section 3.2): An interactive document combining code, text, and output. The primary tool for data analysis in this textbook. See also: cell, kernel.
kernel (Ch.3, section 3.2): The running Python process that executes code in a Jupyter notebook. Restarting the kernel clears all variables.
Kruskal-Wallis test (Ch.21, section 21.8): A nonparametric alternative to one-way ANOVA that compares mean ranks across three or more groups. H statistic approximately follows a chi-square distribution. See also: ANOVA, Mann-Whitney U test, nonparametric test.
large language model (LLM) (Ch.26, section 26.8): A statistical model of language that predicts likely next words based on training data. The foundation of AI tools like ChatGPT. See also: hallucination (AI), machine learning.
law of large numbers (Ch.8, section 8.3): As the number of trials increases, the relative frequency of an event approaches its true probability. Distinct from the CLT: LLN says the sample mean converges to the population mean; CLT says the distribution of the sample mean is normal. See also: Central Limit Theorem.
least squares (Ch.22, section 22.6): The method for fitting a regression line by minimizing the sum of squared residuals. The line passes through the point (mean of X, mean of Y). See also: regression line, residual.
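The closed-form least-squares estimates, computed on invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # invented, roughly y = 2x

# slope = Sxy / Sxx; intercept chosen so the line passes through (x-bar, y-bar)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)

# Check: the fitted line passes through (mean of X, mean of Y)
print(bool(np.isclose(intercept + slope * x.mean(), y.mean())))  # True
```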
level of measurement (Ch.2, section 2.6): The classification system for variables: nominal, ordinal, interval, ratio. Determines which mathematical operations and statistical tests are appropriate.
Levene's test (Ch.20, section 20.9): A test for equality of variances across groups. Used to check the equal-variances assumption for ANOVA. If significant, consider Welch's ANOVA. See also: ANOVA.
library (Ch.3, section 3.4): A collection of pre-written Python code that adds capabilities. Key libraries: pandas, numpy, matplotlib, seaborn, scipy, statsmodels.
likelihood ratio (Ch.9, section 9.13): P(B|A) / P(B|not A). Measures how strongly evidence supports one hypothesis over another. See also: Bayes' theorem.
linear relationship (Ch.22, section 22.3): A relationship between two variables that can be described by a straight line. See also: correlation coefficient, regression line.
listwise deletion (Ch.7, section 7.3): Removing all rows that contain any missing values. Simple but can substantially reduce sample size and introduce bias if data is not MCAR. See also: imputation, missing data.
log-odds (logit) (Ch.24, section 24.3): The natural logarithm of the odds: logit(p) = ln(p / (1-p)). The scale on which logistic regression coefficients are linear. See also: odds, odds ratio, logistic regression.
logistic regression (Ch.24, section 24.4): A regression method for binary outcomes that models the probability of the outcome using the sigmoid function. Coefficients are interpreted as log-odds changes. See also: binary outcome, odds ratio, sigmoid function.
longitudinal study (Ch.2, section 2.7): A study that follows the same subjects over time, collecting data at multiple points. See also: cross-sectional study.
lurking variable (Ch.22, section 22.5): A variable not included in the analysis that is associated with both the explanatory and response variables. Synonym for confounding variable. See also: confounding variable.
machine learning (Ch.26, section 26.2): A branch of AI in which systems learn patterns from data rather than being explicitly programmed. Includes supervised learning (regression, classification), unsupervised learning (clustering), and reinforcement learning.
Mann-Whitney U test (Ch.21, section 21.6): A nonparametric test for comparing two independent groups, based on ranks. Same as the Wilcoxon rank-sum test. See also: Wilcoxon rank-sum test, two-sample t-test.
margin of error (Ch.12, section 12.2): The maximum likely distance between the point estimate and the true parameter. Equals critical value times standard error. What the "plus or minus 3 points" in polls refers to. See also: confidence interval, standard error.
marginal probability (Ch.8, section 8.7): The probability of a single event, calculated from a contingency table as the row or column total divided by the grand total. See also: joint probability, conditional probability.
MAR (Ch.7, section 7.2): Missing At Random. Data is missing for reasons related to observed variables but not to the missing values themselves.
matched pairs (Ch.16, section 16.4): A study design in which each subject in one group is paired with a similar subject in the other group (or the same subject is measured twice). See also: paired t-test, dependent samples.
maximum likelihood estimation (Ch.24, section 24.7): The method for fitting logistic regression models. Finds the parameter values that make the observed data most likely. See also: logistic regression.
MCAR (Ch.7, section 7.2): Missing Completely At Random. Data is missing for reasons unrelated to any variable in the dataset. Listwise deletion is unbiased under MCAR.
mean (Ch.6, section 6.1): The sum of all values divided by the count. The balance point of the data. Sensitive to outliers and skewness. See also: median, mode, resistant measure.
mean square (Ch.20, section 20.5): In ANOVA, a sum of squares divided by its degrees of freedom. MS_Between = SS_Between / (k-1); MS_Within = SS_Within / (N-k). See also: ANOVA, sum of squares.
median (Ch.6, section 6.1): The middle value when data is sorted. The 50th percentile. Resistant to outliers. Preferred over the mean for skewed distributions. See also: mean, resistant measure.
misinformation (Ch.26, section 26.9): False or misleading information. Statistical literacy is a defense against data-driven misinformation.
misleading graphs (Ch.25, section 25.3): Visualizations that distort the data through truncated axes, cherry-picked time windows, dual y-axes, 3D effects, or area/volume distortion.
missing data (NA/NaN) (Ch.7, section 7.2): Values that are absent from the dataset. Can be MCAR, MAR, or MNAR. How you handle missing data affects your conclusions. See also: imputation, listwise deletion, MCAR, MAR, MNAR.
MNAR (Ch.7, section 7.2): Missing Not At Random. Data is missing for reasons related to the missing values themselves (e.g., people with high incomes are less likely to report income).
mode (Ch.6, section 6.1): The most frequently occurring value. A dataset can have multiple modes (bimodal, multimodal) or no mode.
Monte Carlo simulation (Ch.18, section 18.6): Using repeated random sampling to estimate the properties of a statistic or process. Named after the Monte Carlo casino. See also: permutation test, bootstrap.
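As a concrete sketch (the dice example is invented here, not taken from the chapter), a few lines of Python estimate a probability whose exact value is known:

```python
# Monte Carlo sketch: estimate P(two fair dice sum to 7). The exact answer
# is 6/36 = 1/6, so the estimate can be checked against the truth.
import random

random.seed(0)

def estimate_p_seven(trials=100_000):
    hits = sum(random.randint(1, 6) + random.randint(1, 6) == 7
               for _ in range(trials))
    return hits / trials

print(estimate_p_seven())  # close to 1/6 = 0.1667
```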
multicollinearity (Ch.23, section 23.8): When two or more predictor variables in a regression model are highly correlated with each other. Inflates standard errors and makes individual coefficient estimates unreliable. Diagnosed with VIF. See also: VIF, multiple regression.
multiple comparisons problem (Ch.20, section 20.2): The inflation of the overall Type I error rate when conducting many hypothesis tests simultaneously. Running m tests at alpha = 0.05 gives P(at least 1 false positive) = 1 - 0.95^m. See also: Bonferroni correction, Tukey's HSD, family-wise error rate.
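The inflation is easy to verify with the formula from the definition (alpha = 0.05 chosen for illustration):

```python
# Family-wise error rate for m independent tests at alpha = 0.05:
# P(at least one false positive) = 1 - (1 - alpha)^m.
alpha = 0.05
for m in (1, 5, 20):
    fwer = 1 - (1 - alpha) ** m
    print(m, round(fwer, 3))  # rises from 0.05 toward 0.642 at m = 20
```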
multiple regression (Ch.23, section 23.2): A regression model with two or more predictor variables: Y = b0 + b1X1 + b2X2 + ... + error. Allows controlling for confounders statistically. See also: simple linear regression, adjusted R-squared, multicollinearity.
multiplication rule (Ch.8, section 8.6): For independent events: P(A and B) = P(A) * P(B). For dependent events, use the general multiplication rule. See also: independent events, general multiplication rule.
mutually exclusive (Ch.8, section 8.5): Events that cannot both occur simultaneously. Also called disjoint. If A and B are mutually exclusive, P(A and B) = 0. See also: addition rule.
Naive Bayes classifier (Ch.9, section 9.10): A classification algorithm that applies Bayes' theorem with the simplifying assumption that features are independent. Used in spam filters and text classification. See also: Bayes' theorem, classification.
negative predictive value (NPV) (Ch.9, section 9.8): P(healthy | negative test). The probability that a person with a negative test result truly does not have the condition. See also: positive predictive value, sensitivity, specificity.
nominal (Ch.2, section 2.3): A level of measurement for categorical variables where categories have no natural ordering (e.g., blood type, eye color). See also: ordinal, categorical variable.
nonparametric test (Ch.21, section 21.3): A statistical test that does not assume a specific probability distribution for the population. Based on ranks rather than raw values. Less powerful than parametric tests when assumptions are met, but more robust when they aren't. See also: Mann-Whitney U test, Wilcoxon signed-rank test, Kruskal-Wallis test.
nonresponse bias (Ch.4, section 4.3): Bias that occurs when people who choose not to respond differ systematically from those who do. See also: bias, selection bias.
normal distribution (Ch.10, section 10.5): A symmetric, bell-shaped continuous distribution defined by mean mu and standard deviation sigma. The most important distribution in statistics due to the CLT. See also: standard normal distribution, Empirical Rule, Central Limit Theorem.
null hypothesis (Ch.13, section 13.3): The default assumption of no effect, no difference, or status quo. Denoted H0. Always contains an equality. Assumed true until evidence says otherwise. See also: alternative hypothesis, p-value, hypothesis testing.
numerical variable (Ch.2, section 2.2): A variable whose values represent measurable quantities. Can be discrete (countable) or continuous (any value in a range). See also: categorical variable, discrete, continuous.
observation (Ch.1, section 1.1): A single data point or individual in a dataset. One row in a data table. See also: observational unit, variable.
observational study (Ch.4, section 4.1): A study that observes and measures without intervening. Cannot establish causation due to potential confounding. See also: experiment, confounding variable.
observational unit (Ch.2, section 2.1): The entity being measured or observed in a study (e.g., a person, a city, a blood sample). Determines what constitutes one row in the dataset.
observed frequency (Ch.19, section 19.2): The actual count in each cell of a contingency table, as found in the data. See also: expected frequency, chi-square test.
odds (Ch.24, section 24.3): The ratio of the probability of an event occurring to the probability of it not occurring: odds = p / (1-p). See also: odds ratio, log-odds.
odds ratio (Ch.24, section 24.6): The factor by which the odds of the outcome change for a one-unit increase in a predictor. In logistic regression: OR = e^(coefficient). OR > 1 means increased odds; OR < 1 means decreased odds. See also: logistic regression, log-odds.
one-sample t-test (Ch.15, section 15.2): A hypothesis test comparing a sample mean to a hypothesized population mean when sigma is unknown: t = (x-bar - mu0) / (s / sqrt(n)) with df = n - 1. See also: t-distribution, degrees of freedom.
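In the textbook's Python toolkit, the test is a single scipy call (sample values below are invented for illustration):

```python
# One-sample t-test sketch: does the mean of a small sample differ from
# a hypothesized population mean mu0 = 5? Data is made up.
from scipy import stats

sample = [4.8, 5.1, 5.3, 4.9, 5.6, 5.0, 5.2, 4.7]
t_stat, p_value = stats.ttest_1samp(sample, popmean=5)
print(t_stat, p_value)  # small t and large p: no evidence the mean differs from 5
```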
one-sample z-test for proportions (Ch.14, section 14.3): A hypothesis test for a population proportion: z = (p-hat - p0) / sqrt(p0(1-p0)/n). Requires the success-failure condition. See also: sample proportion, success-failure condition.
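The statistic is simple enough to compute by hand (the counts here are invented):

```python
# z-test for a proportion, straight from the formula in the definition.
import math

def prop_z(successes, n, p0):
    """z = (p_hat - p0) / sqrt(p0 (1 - p0) / n); success-failure condition assumed checked."""
    p_hat = successes / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

print(prop_z(successes=58, n=100, p0=0.5))  # 1.6 standard errors above p0
```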
one-tailed test (Ch.13, section 13.8): A hypothesis test where Ha specifies a direction (greater than or less than). The p-value uses only one tail of the distribution. See also: two-tailed test.
one-way ANOVA (Ch.20, section 20.2): ANOVA with a single grouping factor. Compares means across k groups. See also: ANOVA, F-statistic.
open data (Ch.27, section 27.5): Making research data publicly available for verification and replication. A key component of scientific reform.
optional stopping (Ch.27, section 27.5): The questionable practice of repeatedly checking for statistical significance during data collection and stopping when p < 0.05. Inflates false positive rates. See also: p-hacking, questionable research practices.
ordinal (Ch.2, section 2.3): A level of measurement for categorical variables where categories have a natural order but the distances between them are not necessarily equal (e.g., education level, pain rating). See also: nominal, categorical variable.
outcome (Ch.8, section 8.2): A single result of a random process. See also: event, sample space.
outlier (Ch.5, section 5.7): An observation that falls far from the bulk of the data. May be identified using the 1.5 * IQR rule, z-scores, or visual inspection. Should be investigated, not automatically removed. See also: box plot, IQR.
overfitting (Ch.26, section 26.4): When a model captures noise in the training data rather than the true underlying pattern, resulting in excellent training performance but poor performance on new data. See also: bias-variance tradeoff.
p-hacking (Ch.17, section 17.9): Manipulating data analysis — through selective reporting, flexible outlier removal, optional stopping, or trying many statistical tests — to achieve a statistically significant result. See also: garden of forking paths, pre-registration, questionable research practices.
p-value (Ch.13, section 13.5): The probability of observing data as extreme as or more extreme than what was observed, assuming H0 is true. P(data | H0), NOT P(H0 | data). Small p-values provide evidence against H0. See also: null hypothesis, significance level, hypothesis testing.
paired t-test (Ch.16, section 16.4): A hypothesis test for paired (dependent) data. Computes within-pair differences and tests whether the mean difference is zero: t = d-bar / (s_d / sqrt(n)). Equivalent to a one-sample t-test on the differences. See also: dependent samples, matched pairs.
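The equivalence noted in the definition can be checked directly (before/after scores invented for illustration):

```python
# Paired t-test sketch: ttest_rel on the pairs matches ttest_1samp on the
# within-pair differences. Scores are made up.
from scipy import stats

before = [72, 68, 75, 80, 65, 70]
after = [75, 70, 78, 82, 66, 74]

t_paired, p_paired = stats.ttest_rel(after, before)
diffs = [a - b for a, b in zip(after, before)]
t_diff, p_diff = stats.ttest_1samp(diffs, popmean=0)
print(t_paired, t_diff)  # identical statistics
```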
pandas (Ch.3, section 3.4): A Python library for data loading, manipulation, and analysis. Its core data structure is the DataFrame.
parameter (Ch.2, section 2.4): A numerical summary of a population (e.g., population mean mu, population proportion p). Usually unknown and estimated by a statistic. See also: statistic, population.
Pearson's r (Ch.22, section 22.3): The Pearson correlation coefficient. A measure of the strength and direction of the linear relationship between two numerical variables. Ranges from -1 to +1. r = 0 means no linear relationship. See also: correlation coefficient, scatterplot, R-squared.
percentile (Ch.6, section 6.4): The value below which a given percentage of data falls. The 90th percentile is the value below which 90% of observations fall. See also: quartile.
permutation test (Ch.18, section 18.6): A nonparametric hypothesis test that compares the observed test statistic to the distribution of test statistics obtained by randomly permuting group labels. Also called a randomization test. See also: simulation-based inference, Monte Carlo simulation.
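A bare-bones version (group values invented here) makes the procedure concrete:

```python
# Permutation test sketch: shuffle the group labels many times and count
# how often the shuffled mean difference is as extreme as the observed one.
import random

random.seed(1)

group_a = [12.1, 9.8, 11.3, 10.5, 12.7]
group_b = [9.2, 8.7, 10.1, 9.5, 8.9]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(group_a) - mean(group_b)
pooled = group_a + group_b
n_a, reps, extreme = len(group_a), 10_000, 0

for _ in range(reps):
    random.shuffle(pooled)  # permute the group labels
    diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
    if abs(diff) >= abs(observed):
        extreme += 1

p_value = extreme / reps
print(observed, p_value)  # a small p-value: shuffled labels rarely match the observed gap
```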
pie chart (Ch.5, section 5.3): A circular graph divided into slices representing category proportions. Generally less effective than bar charts for comparing categories. Acceptable only when showing parts of a whole with few categories.
placebo (Ch.4, section 4.7): An inactive treatment designed to look identical to the real treatment. Controls for the placebo effect. See also: blinding, control group.
plus-four method (Ch.14, section 14.6): An improved confidence interval for proportions that adds 2 successes and 2 failures before computing: p-tilde = (X + 2) / (n + 4). Better coverage than the Wald interval for small samples.
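For a 95% interval (z = 1.96; the counts below are invented), the computation is:

```python
# Plus-four confidence interval sketch: add 2 successes and 2 failures,
# then apply the usual Wald formula to the adjusted proportion.
import math

def plus_four_ci(successes, n, z=1.96):
    p_tilde = (successes + 2) / (n + 4)
    se = math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    return p_tilde - z * se, p_tilde + z * se

lo, hi = plus_four_ci(successes=3, n=20)
print(round(lo, 3), round(hi, 3))
```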
point estimate (Ch.12, section 12.2): A single value used to estimate a population parameter. x-bar estimates mu; p-hat estimates p. See also: interval estimate, confidence interval.
pooled standard error (Ch.16, section 16.3): The standard error for the difference between two statistics, combining variability from both groups. For independent means, the Welch (unpooled) version is SE = sqrt(s1^2/n1 + s2^2/n2); the pooled version instead combines both samples into a single variance estimate under the equal-variances assumption.
population (Ch.1, section 1.1): The entire group of individuals or objects you want to study. See also: sample, parameter.
population proportion (Ch.14, section 14.3): The true proportion of successes in the entire population. The parameter being estimated or tested. See also: sample proportion.
positive predictive value (PPV) (Ch.9, section 9.8): P(disease | positive test). The probability that a person with a positive test result truly has the condition. Depends heavily on prevalence. See also: negative predictive value, Bayes' theorem, base rate fallacy.
posterior probability (Ch.9, section 9.13): The updated probability of an event after incorporating new evidence, via Bayes' theorem. See also: prior probability, Bayes' theorem.
power (Ch.17, section 17.6): See statistical power.
power analysis (Ch.17, section 17.7): A calculation to determine the sample size needed to detect a specified effect size with a desired level of power. Should be done BEFORE data collection. See also: statistical power, effect size.
practical significance (Ch.17, section 17.11): Whether an effect is large enough to matter in the real world. A statistically significant result is not necessarily practically significant. See also: statistical significance, effect size.
precision (classification) (Ch.24, section 24.8): The proportion of positive predictions that are correct: TP / (TP + FP). Equivalent to positive predictive value. See also: recall, confusion matrix.
prediction vs. inference (Ch.26, section 26.7): Prediction asks "what will happen?" and optimizes for accuracy. Inference asks "why?" and optimizes for understanding causal mechanisms.
pre-registration (Ch.13, section 13.12; Ch.27, section 27.5): Publicly committing to hypotheses and analysis plans before data collection. Prevents p-hacking and HARKing. Platforms include OSF and AsPredicted. See also: registered reports, p-hacking.
prior probability (Ch.9, section 9.13): The probability of an event before considering new evidence. Also called the base rate or prevalence. See also: posterior probability, Bayes' theorem.
probability (Ch.8, section 8.2): A number between 0 and 1 that measures how likely an event is to occur. Can be defined via classical (equally likely outcomes), relative frequency (long-run proportion), or subjective (degree of belief) approaches. See also: event, sample space.
probability density function (PDF) (Ch.10, section 10.4): A curve for continuous distributions where the area under the curve equals probability. Height is density, not probability. Total area under the curve equals 1.
probability distribution (Ch.10, section 10.2): A mathematical description of all possible values a random variable can take and how likely each value (or range of values) is. See also: random variable, binomial distribution, normal distribution.
prosecutor's fallacy (Ch.9, section 9.3): Confusing P(evidence | innocent) with P(innocent | evidence). A dangerous error in legal contexts. See also: conditional probability, Bayes' theorem.
proxy variable (Ch.26, section 26.5): A variable that stands in for another variable of interest. Can introduce bias when the proxy carries embedded historical discrimination (e.g., healthcare spending as a proxy for health needs, zip code as a proxy for race). See also: algorithmic bias, confounding variable.
publication bias (Ch.17, section 17.9): The tendency for journals to publish significant results and reject null results, creating a biased scientific literature. See also: file drawer problem, replication crisis.
QQ-plot (Ch.10, section 10.9): Quantile-Quantile plot. Compares data quantiles to theoretical normal quantiles. If points follow a straight line, the data is approximately normal. The gold standard for visual normality assessment. See also: Shapiro-Wilk test, normal distribution.
quartile (Ch.6, section 6.4): Values that divide sorted data into four equal parts. Q1 = 25th percentile, Q2 = median = 50th percentile, Q3 = 75th percentile. See also: percentile, IQR, five-number summary.
questionable research practices (QRPs) (Ch.27, section 27.5): Practices that inflate false positive rates: p-hacking, HARKing, optional stopping, selective reporting, flexible outlier removal. See also: p-hacking, HARKing.
random sample (Ch.4, section 4.2): A sample in which every member of the population has an equal (or known) chance of being selected. The foundation of valid statistical inference. See also: stratified sampling, cluster sampling, convenience sample.
random variable (Ch.10, section 10.2): A numerical outcome of a random process. Can be discrete (countable values) or continuous (any value in a range). See also: probability distribution, expected value.
randomization (Ch.4, section 4.5): Using chance to select samples or assign treatments. Protects against both known and unknown biases. The key to causal inference. See also: random sample, experiment.
range (Ch.6, section 6.4): Maximum minus minimum. The total spread of the data. Sensitive to outliers. See also: IQR, standard deviation.
rank (Ch.21, section 21.4): The position of an observation when all values are sorted from smallest to largest. Tied values receive the average of their positions (midrank). See also: nonparametric test, rank-based methods.
recall (Ch.24, section 24.8): The proportion of actual positive cases correctly identified by the model: TP / (TP + FN). Synonym for sensitivity. See also: precision, sensitivity, confusion matrix.
recoding (Ch.7, section 7.6): Creating a new version of a variable with different values or categories (e.g., combining "Agree" and "Strongly Agree" into "Agree"). See also: binning, feature engineering.
recommendation algorithm (Ch.26, section 26.2): A system that predicts user preferences from past behavior. At its core, a prediction model. See also: collaborative filtering, machine learning.
registered reports (Ch.17, section 17.10; Ch.27, section 27.5): A publication model in which journals commit to publishing a study based on its design and methods, regardless of results. Eliminates publication bias. See also: pre-registration, publication bias.
regression line (Ch.22, section 22.6): The line of best fit through a scatterplot, found by minimizing the sum of squared residuals. Equation: y-hat = b0 + b1x. See also: slope, intercept, least squares, residual.
regression to the mean (Ch.22, section 22.9): The statistical phenomenon whereby extreme observations tend to be followed by less extreme ones. Not a causal process — it's a mathematical consequence of imperfect correlation. Coined by Galton (1886). See also: Pearson's r.
re-identification (Ch.27, section 27.6): Linking supposedly anonymous data back to specific individuals using external information. Sweeney showed 87% of Americans can be uniquely identified by date of birth, zip code, and gender. See also: data privacy.
rejection region (Ch.13, section 13.7): The set of test statistic values that lead to rejecting H0. Determined by alpha and the test direction. See also: significance level, critical value.
relative frequency (Ch.5, section 5.4): The proportion of observations in a category: count / total. Useful for comparing datasets of different sizes. In probability (Ch.8): the long-run proportion of times an event occurs.
replication crisis (Ch.17, section 17.10): The finding that many published scientific results cannot be reproduced when the study is repeated. Caused by a combination of underpowered studies, publication bias, p-hacking, and binary threshold thinking. See also: Open Science Collaboration, p-hacking, publication bias.
reproducibility (Ch.7, section 7.10; Ch.25, section 25.15): The ability for someone else to follow your documented steps and arrive at the same results. Requires code documentation, cleaning logs, and transparent methods. See also: cleaning log.
resampling (Ch.18, section 18.3): Drawing new samples from existing data. Includes bootstrap (sampling with replacement) and permutation tests (shuffling labels). See also: bootstrap, permutation test.
residual (Ch.22, section 22.6): The difference between an observed value and the value predicted by the regression model: residual = y - y-hat. Positive residuals mean the model underpredicted; negative mean it overpredicted. See also: regression line, least squares.
resistant measure (Ch.6, section 6.1): A statistic not heavily influenced by extreme values or outliers. The median and IQR are resistant; the mean, range, and standard deviation are not. See also: median, IQR, outlier.
response bias (Ch.4, section 4.3): Systematic inaccuracy in survey responses due to question wording, social desirability, or recall errors. See also: bias.
rights-based ethics (Ch.27, section 27.8): A deontological ethical framework that emphasizes respecting fundamental rights of every individual, regardless of consequences for the majority. See also: utilitarian ethics, care ethics.
robustness (Ch.15, section 15.7): A statistical procedure's ability to give approximately correct results even when its assumptions are not perfectly satisfied. The t-test is robust to non-normality for n >= 30. See also: normality assumption.
ROC curve (Ch.24, section 24.9): Receiver Operating Characteristic curve. A graph plotting the true positive rate (sensitivity) against the false positive rate (1 - specificity) across all classification thresholds. See also: AUC, sensitivity, specificity.
R-squared (Ch.22, section 22.8): The coefficient of determination. The proportion of variation in Y explained by the regression model. R-squared = r-squared for simple regression. Ranges from 0 to 1. See also: adjusted R-squared, Pearson's r, residual.
sample (Ch.1, section 1.1): A subset of the population actually observed or measured. Used to make inferences about the population. See also: population, random sample, statistic.
sample proportion (Ch.14, section 14.3): The proportion of successes in the sample: p-hat = X / n. The point estimate for the population proportion p. See also: population proportion.
sample size determination (Ch.12, section 12.9): Calculating the minimum sample size needed for a desired margin of error. For means: n = (z * sigma / E)^2. For proportions: n = (z / E)^2 * p-hat * (1 - p-hat). See also: margin of error, power analysis.
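Both formulas round up, since a fraction of a subject is impossible (the z, sigma, and E values below are illustrative):

```python
# Sample size sketch: n = (z * sigma / E)^2 for a mean, and
# n = (z / E)^2 * p_hat * (1 - p_hat) for a proportion, rounded up.
import math

def n_for_mean(z, sigma, E):
    return math.ceil((z * sigma / E) ** 2)

def n_for_proportion(z, E, p_hat=0.5):  # p_hat = 0.5 is the conservative choice
    return math.ceil((z / E) ** 2 * p_hat * (1 - p_hat))

print(n_for_mean(z=1.96, sigma=15, E=2))    # mean to within +/- 2 units
print(n_for_proportion(z=1.96, E=0.03))     # proportion to within +/- 3 points
```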
sample space (Ch.8, section 8.2): The set of all possible outcomes of a random process. See also: event, outcome, probability.
sampling distribution (Ch.11, section 11.2): The distribution of a statistic (like x-bar or p-hat) computed from all possible random samples of the same size from the same population. See also: Central Limit Theorem, standard error.
sampling variability (Ch.11, section 11.2): The natural variation in a statistic that occurs because different random samples contain different individuals. Quantified by the standard error. See also: standard error.
scatterplot (Ch.5, section 5.9; Ch.22, section 22.2): A graph displaying the relationship between two numerical variables by plotting each observation as a point. Essential before computing correlation or fitting regression. See also: correlation coefficient, regression line.
selection bias (Ch.4, section 4.3): Bias arising from the method of selecting participants, such that the sample is not representative of the population. See also: bias, convenience sample.
sensitivity (Ch.9, section 9.8): P(positive test | disease present). The test's ability to detect true positives. Also called the true positive rate. In classification: equivalent to recall. See also: specificity, confusion matrix, recall.
Shapiro-Wilk test (Ch.10, section 10.9): A formal hypothesis test for normality. H0: data is normally distributed. A small p-value suggests non-normality. Sensitive to sample size. See also: QQ-plot, normal distribution.
sigmoid function (Ch.24, section 24.2): The S-shaped function that maps any real number to a probability between 0 and 1: sigma(x) = 1 / (1 + e^(-x)). Also called the logistic function. The link function in logistic regression.
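The function is a one-liner:

```python
# Sigmoid (logistic) function sketch: maps any real number into (0, 1).
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

for x in (-5, 0, 5):
    print(x, round(sigmoid(x), 4))  # 0.0067, 0.5, 0.9933
```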
sign test (Ch.21, section 21.5): The simplest nonparametric test for paired data. Counts the number of positive and negative differences and tests whether they're equally likely using the binomial distribution. See also: Wilcoxon signed-rank test.
significance level (alpha) (Ch.13, section 13.7): The pre-set threshold for rejecting H0. Equals the maximum acceptable Type I error rate. Typically 0.05. Must be chosen BEFORE examining the data. See also: p-value, Type I error, rejection region.
simple linear regression (Ch.22, section 22.6): A regression model with one predictor: Y = b0 + b1X + error. See also: multiple regression, slope, intercept.
Simpson's paradox (Ch.27, section 27.2): A trend that appears in aggregated data but reverses or disappears when the data is broken into subgroups. The most famous example is the UC Berkeley admissions study (Bickel et al., 1975). See also: ecological fallacy, confounding variable.
simulation-based inference (Ch.18, section 18.2): Statistical inference conducted through computer simulation (bootstrap, permutation tests) rather than mathematical formulas. See also: bootstrap, permutation test, Monte Carlo simulation.
skewed left (Ch.5, section 5.7): A distribution with a long tail extending to the left. The mean is pulled below the median. See also: skewed right, symmetric.
skewed right (Ch.5, section 5.7): A distribution with a long tail extending to the right. The mean is pulled above the median. Common for income, housing prices, and wait times. See also: skewed left, symmetric.
slope (Ch.22, section 22.6): In a regression equation y-hat = b0 + b1x, the coefficient b1 represents the predicted change in Y for a one-unit increase in X. See also: intercept, regression line.
specificity (Ch.9, section 9.8): P(negative test | disease absent). The test's ability to correctly clear people who don't have the condition. Also called the true negative rate. See also: sensitivity, confusion matrix.
spurious correlation (Ch.22, section 22.5): A meaningless correlation produced by coincidence or confounding. The fact that two variables are correlated does not mean one causes the other. See also: lurking variable, confounding variable.
standard deviation (Ch.6, section 6.5): The square root of the variance. Measures the typical distance of values from the mean in the original units of measurement. For a sample: s = sqrt(sum of (xi - x-bar)^2 / (n-1)). See also: variance, Empirical Rule, z-score.
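The formula can be verified against the standard library (data values invented for illustration):

```python
# Sample standard deviation sketch: the n - 1 divisor, computed by hand
# and checked against statistics.stdev.
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
x_bar = sum(data) / len(data)
s = math.sqrt(sum((x - x_bar) ** 2 for x in data) / (len(data) - 1))
print(s, statistics.stdev(data))  # both give the same value
```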
standard error (Ch.11, section 11.6): The standard deviation of the sampling distribution of a statistic. For means: SE = sigma / sqrt(n) or s / sqrt(n). For proportions: SE = sqrt(p(1-p) / n). Measures how much a statistic varies from sample to sample. See also: sampling distribution, Central Limit Theorem, margin of error.
standard normal distribution (Ch.10, section 10.6): The normal distribution with mean 0 and standard deviation 1. Denoted Z ~ N(0, 1). Used as the reference distribution for z-tests and z-scores. See also: normal distribution, z-score, z-table.
standardized residual (Ch.19, section 19.8): In chi-square tests: (Observed - Expected) / sqrt(Expected). Identifies which cells contribute most to the chi-square statistic.
statistic (Ch.2, section 2.4): A numerical summary computed from a sample (e.g., sample mean x-bar, sample proportion p-hat). Used to estimate the corresponding population parameter. See also: parameter, sample.
statistical power (Ch.17, section 17.6): The probability of correctly rejecting H0 when it is actually false: Power = 1 - beta. Typically aim for 80% or higher. Depends on alpha, effect size, and sample size. See also: Type II error, power analysis, effect size.
statistical thinking (Ch.1, section 1.4): Seeing variation, uncertainty, and randomness not as obstacles to understanding but as the raw material of understanding.
statistically significant (Ch.13, section 13.7): A result where p-value is less than or equal to alpha. Does NOT mean the result is important, large, or practically meaningful. See also: practical significance, effect size.
statistics (Ch.1, section 1.1): The science of collecting, organizing, analyzing, and interpreting data in order to make decisions under uncertainty.
STATS checklist (Ch.26, section 26.9): A five-point framework for evaluating AI claims: Source, Training data, Accuracy metrics, Testing, Significance and Size.
stem-and-leaf plot (Ch.5, section 5.5): A display that shows the distribution of numerical data by separating each value into a "stem" (leading digits) and "leaf" (trailing digit). Preserves individual data values while showing the shape of the distribution.
stratified sampling (Ch.4, section 4.2): A sampling method that divides the population into subgroups (strata) and randomly samples within each stratum. Ensures all subgroups are represented. See also: cluster sampling, random sample.
success-failure condition (Ch.14, section 14.3): The requirement that np0 >= 10 and n(1-p0) >= 10 (for tests) or np-hat >= 10 and n(1-p-hat) >= 10 (for CIs). Ensures the normal approximation for proportions is valid.
sum of squares (Ch.20, section 20.4): In ANOVA: SS_Total (total variation), SS_Between (variation between group means), SS_Within (variation within groups). SS_Total = SS_Between + SS_Within.
supervised learning (Ch.26, section 26.2): Machine learning where the algorithm is given input data with correct answers. Regression when the output is numerical; classification when categorical. See also: unsupervised learning, machine learning.
survivorship bias (Ch.4, section 4.3): Bias that occurs when only the "survivors" (successes, remaining members) are studied, ignoring those who dropped out or failed. Example: studying only existing companies to understand business success.
symmetric (Ch.5, section 5.7): A distribution where the left and right sides are mirror images. The mean approximately equals the median. The normal distribution is symmetric. See also: skewed left, skewed right.
systematic sampling (Ch.4, section 4.2): Selecting every kth member from a list after a random start. See also: random sample.
t-distribution (Ch.12, section 12.4): A symmetric, bell-shaped distribution with heavier tails than the normal. Used when sigma is estimated by s. Indexed by degrees of freedom. Converges to the standard normal as df approaches infinity. Discovered by William Gosset (1908). See also: degrees of freedom, one-sample t-test.
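The heavier tails and convergence to the standard normal are easy to see by comparing 97.5th-percentile critical values across degrees of freedom. A quick sketch in Python using scipy:

```python
from scipy import stats

# standard normal critical value for a 95% two-sided procedure (~1.96)
z_crit = stats.norm.ppf(0.975)

# the t critical value shrinks toward z as df grows
t_crit = {df: stats.t.ppf(0.975, df) for df in (5, 30, 1000)}
```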
technical report (Ch.25, section 25.11): A detailed account of a statistical analysis written for a knowledgeable audience. Includes full methodology, conditions checks, and diagnostic plots.
test of independence (Ch.19, section 19.5): A chi-square test that examines whether two categorical variables are associated. Uses a contingency table with df = (r-1)(c-1). See also: goodness-of-fit test, chi-square test, Cramer's V.
test statistic (Ch.13, section 13.4): A standardized measure of how far the sample data are from the null hypothesis value, expressed in standard errors. Larger values indicate stronger evidence against H0. See also: p-value, hypothesis testing.
threshold (classification) (Ch.24, section 24.8): The probability cutoff used to convert a predicted probability into a binary prediction (yes/no). Default is 0.5, but can be adjusted based on the relative costs of false positives and false negatives.
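Adjusting the threshold trades false positives against false negatives. A minimal sketch in Python (the probabilities are illustrative):

```python
import numpy as np

probs = np.array([0.15, 0.40, 0.55, 0.90])  # predicted probabilities

preds_default = (probs >= 0.5).astype(int)   # default threshold 0.5
preds_cautious = (probs >= 0.3).astype(int)  # lower threshold: more positives
```

Lowering the threshold flags more cases as positive, which helps when a false negative is costlier than a false positive.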
tidy data (Ch.7, section 7.9): Data organized so that each variable has its own column, each observation has its own row, and each value has its own cell. Concept formalized by Hadley Wickham.
training data (Ch.26, section 26.3): The dataset used to train a machine learning model. Functionally equivalent to a sample — if biased, the model inherits the bias. See also: algorithmic bias, overfitting.
treatment group (Ch.4, section 4.7): The group in an experiment that receives the treatment or intervention. See also: control group, experiment.
tree diagram (Ch.9, section 9.5): A branching visual showing all possible paths through a multi-step probability problem. Useful for calculating conditional and joint probabilities. See also: conditional probability, Bayes' theorem.
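The arithmetic behind a tree diagram is just multiplication along branches and addition across them. A small worked sketch in Python (the rates are illustrative, not from the text):

```python
# Two-step problem: a rare condition and an imperfect test.
p_disease = 0.01
p_pos_given_disease = 0.95   # sensitivity
p_pos_given_healthy = 0.08   # false-positive rate

# joint probabilities: multiply along each branch of the tree
p_disease_and_pos = p_disease * p_pos_given_disease
p_healthy_and_pos = (1 - p_disease) * p_pos_given_healthy

# total probability of a positive test: add across branches ending in "+"
p_pos = p_disease_and_pos + p_healthy_and_pos

# Bayes' theorem falls out of the tree
p_disease_given_pos = p_disease_and_pos / p_pos
```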
Tufte, Edward (Ch.25, section 25.2): Yale professor and author of The Visual Display of Quantitative Information. Established the principles of effective data visualization.
Tufte's principles (Ch.25, section 25.2): Data visualization guidelines developed by Edward Tufte: maximize data-ink ratio, use small multiples, show the data, avoid chartjunk. See also: data-ink ratio, chartjunk.
Tukey's HSD (Ch.20, section 20.8): Honestly Significant Difference. A post-hoc test for pairwise comparisons after a significant ANOVA result. Controls the family-wise error rate. See also: ANOVA, post-hoc test, multiple comparisons problem.
two-proportion z-test (Ch.16, section 16.6): A hypothesis test comparing proportions from two independent groups. Uses a pooled proportion under H0. See also: one-sample z-test for proportions.
two-sample t-test (Ch.16, section 16.3): A hypothesis test comparing the means of two independent groups. Welch's version (default, recommended) does not assume equal variances: t = (x-bar1 - x-bar2) / sqrt(s1^2/n1 + s2^2/n2). See also: paired t-test, Welch's t-test.
two-tailed test (Ch.13, section 13.8): A hypothesis test where Ha is non-directional (not equal to). The p-value uses both tails of the distribution. See also: one-tailed test.
Type I error (Ch.13, section 13.9): Rejecting H0 when it is actually true. A false positive or false alarm. The probability of a Type I error equals alpha. See also: Type II error, significance level.
Type II error (Ch.13, section 13.9): Failing to reject H0 when it is actually false. A missed detection or false negative. The probability of a Type II error is beta. Power = 1 - beta. See also: Type I error, statistical power.
unimodal (Ch.5, section 5.7): A distribution with a single peak. See also: bimodal, distribution shape.
unsupervised learning (Ch.26, section 26.2): Machine learning that finds structure in data without labeled outcomes. Clustering is the most common technique. See also: supervised learning, machine learning.
utilitarian ethics (Ch.27, section 27.8): An ethical framework focused on producing the greatest good for the greatest number. Consequentialist: evaluates actions by their outcomes. See also: rights-based ethics, care ethics.
variable (Ch.1, section 1.1): A characteristic that can take different values across observations. See also: categorical variable, numerical variable, observation.
variance (Ch.6, section 6.5): The average of squared deviations from the mean. For a sample: s^2 = sum of (xi - x-bar)^2 / (n-1). Divides by n-1 (Bessel's correction) to produce an unbiased estimate. See also: standard deviation.
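The n-1 divisor is easy to confirm in code. A quick sketch in Python (the data values are illustrative):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
n = len(x)

# the formula by hand: sum of squared deviations divided by n-1
manual_var = ((x - x.mean()) ** 2).sum() / (n - 1)

sample_var = np.var(x, ddof=1)   # divides by n-1 (Bessel's correction)
population_var = np.var(x)       # ddof=0 is numpy's default: divides by n
```

Note that numpy divides by n unless you pass ddof=1, a common source of off-by-a-little errors.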
VIF (Ch.23, section 23.8): Variance Inflation Factor. Measures how much a predictor's variance is inflated due to correlation with other predictors. VIF > 10 suggests problematic multicollinearity. See also: multicollinearity, multiple regression.
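VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. A hand-rolled sketch in Python (the helper name and the simulated data are illustrative):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return out

# illustrative data: x2 is nearly a copy of x1, so both get large VIFs
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x3 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
```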
Welch's ANOVA (Ch.20, section 20.9): A version of one-way ANOVA that does not assume equal variances across groups. Preferred when Levene's test is significant.
Welch's t-test (Ch.16, section 16.3): The default two-sample t-test that does not assume equal variances. Uses the Welch-Satterthwaite approximation for degrees of freedom. Should be the default choice for comparing two independent means. See also: two-sample t-test.
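In scipy, Welch's version is requested with equal_var=False. A quick sketch in Python (the data values are illustrative):

```python
import numpy as np
from scipy import stats

a = np.array([12.1, 14.3, 13.8, 15.0, 12.9, 14.4])
b = np.array([10.2, 11.9, 12.5, 10.8, 11.4, 12.0])

# equal_var=False selects Welch's t-test (the recommended default);
# equal_var=True would give the pooled-variance version instead
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
```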
Wilcoxon rank-sum test (Ch.21, section 21.6): A nonparametric test for comparing two independent groups based on ranks. Equivalent to the Mann-Whitney U test. See also: Mann-Whitney U test, two-sample t-test.
Wilcoxon signed-rank test (Ch.21, section 21.7): A nonparametric test for paired data that considers both the signs and magnitudes of the differences (via ranks). More powerful than the sign test. See also: sign test, paired t-test.
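Both rank-based tests are available in scipy. A quick sketch in Python (the before/after scores are illustrative):

```python
import numpy as np
from scipy import stats

before = np.array([85.0, 90.0, 78.0, 92.0, 88.0, 76.0, 81.0, 89.0])
after  = np.array([87.0, 93.0, 82.0, 97.0, 94.0, 83.0, 89.0, 98.0])

# paired data: Wilcoxon signed-rank test on the differences
stat_paired, p_paired = stats.wilcoxon(after, before)

# if the groups were independent: rank-sum / Mann-Whitney U test
stat_indep, p_indep = stats.mannwhitneyu(after, before)
```

Because every "after" score here exceeds its paired "before" score, the paired test yields a much smaller p-value than treating the groups as independent, which illustrates why pairing matters.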
Wilson interval (Ch.14, section 14.6): An improved confidence interval for proportions with better coverage properties than the Wald interval, especially for small samples and extreme proportions.
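The Wilson interval can be coded directly from its formula. A minimal sketch in Python (the function name and the 8-of-10 example are illustrative):

```python
import math
from scipy import stats

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a proportion, hand-coded from the formula."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    # the interval is centered at a value pulled toward 0.5, not at p-hat
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_interval(8, 10)   # 8 successes in a small sample
```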
winner's curse (Ch.17, section 17.6): The tendency for effect sizes in underpowered studies that happen to achieve significance to be inflated — the only studies that "win" (reach significance) are those that, by chance, overestimate the true effect. See also: effect size inflation, publication bias.
with replacement (Ch.18, section 18.3): Sampling such that each selected item is returned to the pool before the next draw. Each bootstrap resample includes approximately 63.2% of the original observations. See also: bootstrap, resampling.
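The 63.2% figure (about 1 - 1/e) can be checked with a single simulated resample. A quick sketch in Python:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
data = np.arange(n)

# one bootstrap resample: draw n values with replacement
resample = rng.choice(data, size=n, replace=True)

# fraction of distinct original observations that appear at least once
unique_fraction = len(np.unique(resample)) / n   # ≈ 1 - 1/e ≈ 0.632
```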
within-group variability (Ch.20, section 20.3): In ANOVA, the variation of individual observations around their group means (SS_Within). Measures "noise" — the natural variation that exists even without a group effect. See also: between-group variability, F-statistic.
z-score (Ch.6, section 6.8; Ch.10, section 10.6): The number of standard deviations a value is from the mean: z = (x - mean) / SD. A z-score of 2 means the value is 2 standard deviations above the mean. Used to standardize data and compute probabilities from the normal distribution. See also: standard normal distribution, Empirical Rule.
z-table (Ch.10, section 10.6): A table giving P(Z <= z) for the standard normal distribution. Also called the standard normal table. See Appendix A.
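In code, scipy's normal CDF replaces the printed z-table. A quick sketch in Python (the values 130, 100, and 15 are illustrative):

```python
from scipy import stats

x, mean, sd = 130.0, 100.0, 15.0
z = (x - mean) / sd            # z-score: 2 SDs above the mean

p_below = stats.norm.cdf(z)    # what a z-table row gives: P(Z <= 2.0)
```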