Appendix G: Key Studies Summary
This appendix summarizes landmark studies referenced throughout the textbook. For each study, we provide the citation, key findings, why the study matters for statistics, and where it appears in the text.
G.1 Replication and Research Methods
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407-425.
What they found: Nine experiments appeared to show that people could perceive future events before they happened — "precognition." Eight of nine studies produced statistically significant results (p < 0.05).
Why it matters: Bem's study became a lightning rod for criticism of standard hypothesis testing practices. If standard methods could "prove" precognition, something was wrong with the methods. The study helped spark the replication crisis by demonstrating how flexible data analysis (many tests, selective reporting) could produce significant results for an implausible hypothesis. Multiple replication attempts failed to reproduce the findings.
Chapter references: Ch.1 case-study-01, Ch.13 (p-hacking discussion), Ch.17 (replication crisis).
Citation tier: Tier 1 (verified, peer-reviewed).
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
What they found: A team of 270 researchers attempted to replicate 100 published psychology studies. Only 36% of replications produced statistically significant results (compared to 97% of the originals). Mean effect sizes in replications were half the magnitude of the originals.
Why it matters: This is the most comprehensive systematic replication effort in any scientific field. It quantified the replication crisis: roughly two-thirds of published findings could not be reproduced. The study catalyzed widespread reforms including pre-registration, registered reports, and increased emphasis on effect sizes and power analysis.
Chapter references: Ch.1 case-study-01, Ch.13 (false discovery rate), Ch.17 (replication crisis, four-factor explanation), Ch.27 (ethical reforms).
Citation tier: Tier 1.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.
What they found: Using a mathematical model incorporating study power, bias, and the number of hypotheses tested in a field, Ioannidis argued that more than half of published research findings are likely false. The probability increases when studies are underpowered, when many teams test many hypotheses, and when there is financial or career pressure to find significant results.
Why it matters: This paper is one of the most cited in the history of medical and scientific methodology. It framed the replication crisis as a systemic problem rooted in how science is incentivized, not just how statistics are misused. It remains a foundational reference for thinking about the rate of false discoveries in the published literature.
Chapter references: Ch.13 case-study-01, Ch.17 (publication bias).
Citation tier: Tier 1.
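The heart of Ioannidis's argument fits in a few lines. The sketch below uses the paper's no-bias relationship between power, the significance level, and the pre-study odds R that a tested relationship is real; the example values of power and R are illustrative choices, not figures from the paper.

```python
# Post-study probability that a "significant" finding is true, given power
# (1 - beta), significance level alpha, and pre-study odds R (no-bias case).
def ppv(power: float, alpha: float, prior_odds: float) -> float:
    """Probability that a statistically significant finding is true."""
    true_positives = power * prior_odds
    false_positives = alpha
    return true_positives / (true_positives + false_positives)

# Well-powered study in a field where 1 in 2 tested hypotheses is real:
print(f"power=0.80, R=1.00: PPV = {ppv(0.80, 0.05, 1.00):.2f}")   # ~0.94
# Underpowered, exploratory field where only 1 in 20 hypotheses is real:
print(f"power=0.20, R=0.05: PPV = {ppv(0.20, 0.05, 0.05):.2f}")   # ~0.17
```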
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.
What they found: Through simulation and a real experiment, the authors demonstrated that common "researcher degrees of freedom" — choosing when to stop collecting data, which variables to analyze, which outliers to exclude, and which comparisons to report — could produce false-positive rates as high as 61% (compared to the nominal 5%).
Why it matters: This paper coined the term "researcher degrees of freedom" and showed concretely how even well-intentioned researchers could produce false positives through flexible analysis. It led directly to calls for pre-registration.
Chapter references: Ch.13 (garden of forking paths), Ch.17 (p-hacking simulation), Ch.27 (questionable research practices).
Citation tier: Tier 1.
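The inflation Simmons and colleagues describe is easy to reproduce by simulation. The sketch below combines two researcher degrees of freedom, an extra outcome variable and one round of optional stopping, under a true null; the sample sizes and the correlation between the outcomes are illustrative choices, not the paper's exact design.

```python
# Even with no true effect, flexible analysis pushes the false-positive rate
# well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def one_flexible_study(n_initial=20, n_added=10, rho=0.5):
    """Return True if ANY of the flexible analyses reaches p < .05 (null is true)."""
    def draw(n):
        # Two correlated outcome measures, identically distributed in both groups.
        cov = [[1, rho], [rho, 1]]
        return rng.multivariate_normal([0, 0], cov, n), rng.multivariate_normal([0, 0], cov, n)

    a, b = draw(n_initial)
    for dv in range(2):
        if stats.ttest_ind(a[:, dv], b[:, dv]).pvalue < 0.05:
            return True
    # Not significant yet? Collect more data and test both outcomes again.
    a2, b2 = draw(n_added)
    a, b = np.vstack([a, a2]), np.vstack([b, b2])
    return any(stats.ttest_ind(a[:, dv], b[:, dv]).pvalue < 0.05 for dv in range(2))

false_positive_rate = np.mean([one_flexible_study() for _ in range(2000)])
print(f"False-positive rate with flexible analysis: {false_positive_rate:.1%}")  # well above 5%
```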
G.2 Probability and Decision-Making
Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47(2), 263-291.
What they found: People do not evaluate probabilities rationally. They overweight small probabilities (explaining lottery ticket purchases), underweight large probabilities, and are more sensitive to losses than to equivalent gains (loss aversion). Decision-making systematically deviates from the predictions of expected utility theory.
Why it matters: This paper launched behavioral economics and earned Kahneman the 2002 Nobel Prize in Economics. It demonstrates that human brains are not built for probabilistic reasoning — which is precisely why formal statistical training matters. Understanding cognitive biases helps explain why people misinterpret p-values, confidence intervals, and risk assessments.
Chapter references: Ch.8 (probability and human intuition), Ch.9 (base rate fallacy context), Ch.26 (AI and human decision-making).
Citation tier: Tier 1.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185(4157), 1124-1131.
What they found: People use mental shortcuts (heuristics) that are usually helpful but can lead to systematic errors. Three key heuristics: representativeness (judging probability by similarity), availability (judging frequency by how easily examples come to mind), and anchoring (being influenced by irrelevant starting points).
Why it matters: These heuristics directly explain common statistical mistakes: the gambler's fallacy, base rate neglect, and overconfidence in small samples. Understanding them helps students recognize when their intuitions about probability are likely to mislead them.
Chapter references: Ch.8 (gambler's fallacy), Ch.9 (base rate fallacy, prosecutor's fallacy), Ch.11 (intuition about sample size).
Citation tier: Tier 1.
Gigerenzer, G. (2002). Calculated risks: How to know when numbers deceive you. Simon & Schuster.
What they found: Most physicians, lawyers, and the general public cannot correctly calculate the positive predictive value of a medical test from sensitivity, specificity, and prevalence. However, when the same information is presented as natural frequencies instead of conditional probabilities, accuracy jumps from about 15% to over 75%.
Why it matters: This work demonstrates that the format of statistical information dramatically affects understanding. Natural frequencies (e.g., "Out of 1,000 women, 8 have breast cancer, and of those 8, the mammogram detects 7") are more intuitive than conditional probabilities (sensitivity = 87.5%). This finding shaped how medical testing is presented in Chapters 9 and 26.
Chapter references: Ch.9 (natural frequencies for Bayes' theorem), Ch.26 (evaluating AI diagnostic systems).
Citation tier: Tier 1 (book; underlying research peer-reviewed).
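The natural-frequency arithmetic from the example above can be written out explicitly. The text does not state a false-positive rate, so the 7% figure in the sketch below is an assumed value for illustration only.

```python
# Natural frequencies for the mammography example: 8 of 1,000 women have
# breast cancer, and 7 of those 8 test positive.
women           = 1_000
with_cancer     = 8
true_positives  = 7                       # sensitivity = 7/8 = 87.5%
healthy         = women - with_cancer     # 992
false_positives = round(0.07 * healthy)   # ASSUMED 7% false-positive rate -> ~69

positive_tests = true_positives + false_positives
ppv = true_positives / positive_tests
print(f"Of {positive_tests} women who test positive, only {true_positives} have cancer.")
print(f"P(cancer | positive test) = {ppv:.1%}")   # roughly 9%, far below the test's sensitivity
```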
G.3 Algorithmic Bias and Fairness
Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447-453.
What they found: A widely used healthcare algorithm systematically underestimated the health needs of Black patients. At a given risk score, Black patients were significantly sicker than White patients. The bias arose because the algorithm used healthcare SPENDING as a proxy for health NEEDS — but Black patients historically had less access to healthcare and therefore spent less, even when equally or more ill. Fixing the bias would increase the percentage of Black patients receiving additional care from 17.7% to 46.5%.
Why it matters: This study, published in Science, demonstrated that algorithmic bias can arise from apparently neutral variables (cost) that carry embedded historical discrimination. It is the premier example of proxy variable bias in AI and illustrates why statistical assumptions must be examined for fairness.
Chapter references: Ch.1 (introduction to algorithmic bias), Ch.4 (confounding), Ch.26 (AI bias), Ch.27 (ethical data practice).
Citation tier: Tier 1.
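A toy simulation can illustrate the proxy mechanism: two groups with identical underlying health needs, one of which spends less per unit of need because of access barriers. All numbers below are hypothetical and are not drawn from Obermeyer et al.'s data.

```python
# An algorithm that ranks patients by SPENDING under-selects the group with
# reduced access, even though the two groups' health needs are identical.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

need_a = rng.gamma(shape=2.0, scale=1.0, size=n)   # group A underlying need
need_b = rng.gamma(shape=2.0, scale=1.0, size=n)   # group B: same need distribution
spend_a = need_a * 1.00                            # full access: spending tracks need
spend_b = need_b * 0.70                            # reduced access: less spending per unit of need

# "Algorithm": flag the top 10% of all patients by spending for extra care.
threshold = np.quantile(np.concatenate([spend_a, spend_b]), 0.90)
print(f"Group A flagged for extra care: {np.mean(spend_a > threshold):.1%}")
print(f"Group B flagged for extra care: {np.mean(spend_b > threshold):.1%}")  # far fewer, despite equal need
```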
Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). Machine bias. ProPublica, May 23, 2016.
What they found: The COMPAS recidivism prediction algorithm used in U.S. courts was nearly twice as likely to falsely label Black defendants as future criminals (false positive rate: 44.9% for Black defendants vs. 23.5% for White defendants). White defendants were more likely to be falsely labeled as low risk (false negative rate: 47.7% vs. 28.0%).
Why it matters: This investigation brought algorithmic fairness into public consciousness. It demonstrates how confusion matrices reveal disparities invisible in aggregate accuracy metrics. The subsequent debate — including Northpointe's rebuttal that calibration was equal across races — led to the Chouldechova impossibility theorem showing that multiple fairness criteria cannot be simultaneously satisfied when base rates differ.
Chapter references: Ch.16 case-study-02 (two-proportion z-test on FPR), Ch.24 (logistic regression and fairness), Ch.26 case-study-01 (full COMPAS analysis), Ch.27 (ethical dimensions).
Citation tier: Tier 1 (investigative journalism with publicly available data and methodology).
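The disparity ProPublica reported is a direct consequence of confusion-matrix arithmetic. The sketch below shows how false positive and false negative rates are computed from raw counts and then echoes the rates quoted above; the underlying COMPAS counts are not reproduced here, and the usage example uses made-up numbers.

```python
def error_rates(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float]:
    """Return (false positive rate, false negative rate) from confusion-matrix counts."""
    fpr = fp / (fp + tn)   # of people who did NOT reoffend, the share labeled high risk
    fnr = fn / (fn + tp)   # of people who DID reoffend, the share labeled low risk
    return fpr, fnr

# Usage with made-up counts (NOT the COMPAS data):
fpr, fnr = error_rates(tp=300, fp=200, tn=400, fn=100)
print(f"Hypothetical group: FPR = {fpr:.1%}, FNR = {fnr:.1%}")

# Rates ProPublica reported, as quoted in this appendix:
reported = {"Black defendants": (0.449, 0.280), "White defendants": (0.235, 0.477)}
for group, (fpr, fnr) in reported.items():
    print(f"{group}: FPR = {fpr:.1%}, FNR = {fnr:.1%}")
```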
Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2), 153-163.
What they found: Proved mathematically that when base rates (actual recidivism rates) differ between groups, it is impossible to simultaneously achieve three intuitively desirable fairness criteria: equal false positive rates, equal false negative rates, and equal predictive values across groups. Any algorithm must sacrifice at least one.
Why it matters: This impossibility result transformed the algorithmic fairness debate from "just remove the bias" to "choose which type of unfairness you're willing to accept." It demonstrates that fairness is a values decision, not a purely technical one — and that statistics provides the framework for making that decision explicit.
Chapter references: Ch.16 case-study-02, Ch.24 case-study-02, Ch.26 (fairness impossibility), Ch.27 case-study-02.
Citation tier: Tier 1.
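The impossibility follows from a simple identity relating the false positive rate to the base rate, the positive predictive value, and the false negative rate. The sketch below evaluates that identity for two groups with equal PPV and FNR but different base rates; the numerical values are illustrative, not estimates from any recidivism instrument.

```python
# If PPV and FNR are held equal across groups, the identity
#   FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR)
# forces groups with different base rates p to have different false positive rates.
def implied_fpr(base_rate: float, ppv: float, fnr: float) -> float:
    p = base_rate
    return (p / (1 - p)) * ((1 - ppv) / ppv) * (1 - fnr)

ppv, fnr = 0.60, 0.30          # held equal across groups (illustrative values)
for group, base_rate in [("Group 1", 0.50), ("Group 2", 0.30)]:
    print(f"{group}: base rate {base_rate:.0%} -> implied FPR = {implied_fpr(base_rate, ppv, fnr):.1%}")
# Group 1: ~46.7%; Group 2: ~20.0% -- equal error rates are impossible here.
```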
G.4 Classic Experiments and Methodology
Milgram, S. (1963). Behavioral study of obedience. Journal of Abnormal and Social Psychology, 67(4), 371-378.
What they found: 65% of participants administered the maximum 450-volt shock, which they had been led to believe was dangerous and possibly lethal, to another person when instructed by an authority figure, even when the "victim" screamed in pain and begged to stop. Substantial obedience persisted across many variations of the experiment.
Why it matters for statistics: Milgram's study is a foundational example for discussions of research ethics and informed consent. Participants were deceived about the study's purpose and experienced significant psychological distress. The study contributed to the development of IRB requirements and the Belmont Report's principles. It also illustrates the gap between what people predict about human behavior and what data shows.
Chapter references: Ch.4 (experimental ethics), Ch.27 (informed consent, IRB).
Citation tier: Tier 1.
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638-641.
What they found: Proposed that for every published study showing a significant result, there may be many unpublished studies showing null results — tucked away in researchers' "file drawers." This selective publication biases the scientific literature toward false positives.
Why it matters: This paper introduced the concept of publication bias and the "file drawer problem," which became a cornerstone of the replication crisis discussion. It explains why meta-analyses can be misleading if they only include published studies, and why registered reports and pre-registration are needed.
Chapter references: Ch.17 (file drawer problem, publication bias), Ch.27 (ethical reforms).
Citation tier: Tier 1.
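The file drawer mechanism is easy to see in simulation: if only studies reaching p < 0.05 in the predicted direction are "published," the published literature overstates the true effect. The parameters below are illustrative and are not taken from Rosenthal's paper.

```python
# Many underpowered studies of a small true effect; only "significant" ones are published.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, n_per_group, n_studies = 0.10, 30, 5000

published = []
for _ in range(n_studies):
    treatment = rng.normal(true_effect, 1.0, n_per_group)
    control   = rng.normal(0.0,         1.0, n_per_group)
    diff = treatment.mean() - control.mean()
    if stats.ttest_ind(treatment, control).pvalue < 0.05 and diff > 0:
        published.append(diff)   # only significant results in the predicted direction see print

print(f"True effect: {true_effect}")
print(f"Share of studies published: {len(published) / n_studies:.1%}")
print(f"Mean published effect: {np.mean(published):.2f}")   # several times the true effect
```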
Fisher, R. A. (1935). The Design of Experiments. Oliver and Boyd.
What they found: Established the foundations of experimental design including randomization, replication, and blocking. Introduced the concept of the null hypothesis and significance testing. Famously used a "lady tasting tea" experiment to illustrate exact probability calculations.
Why it matters: Fisher created the statistical framework used throughout this textbook — from p-values to ANOVA to regression. Understanding the historical context helps explain why we use alpha = 0.05 (Fisher's pragmatic convention), why the F-distribution is named after him, and why the philosophy of significance testing continues to be debated.
Chapter references: Ch.12 (confidence intervals, Fisher vs. Neyman debate), Ch.13 (hypothesis testing framework), Ch.20 (ANOVA invention).
Citation tier: Tier 1.
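The lady-tasting-tea calculation is short enough to spell out: with four of eight cups prepared milk-first and the taster required to pick which four, the probability of a perfect identification by guessing alone is one in seventy.

```python
# Exact probability of identifying all four milk-first cups by pure guessing.
from math import comb

p_all_correct = 1 / comb(8, 4)
print(f"P(all 4 correct by chance) = 1/{comb(8, 4)} = {p_all_correct:.3f}")  # 1/70 ~ 0.014
```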
G.5 Data Visualization and Communication
Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17-21.
What they found: Constructed four datasets ("Anscombe's Quartet") with nearly identical summary statistics (same mean, variance, correlation, and regression line) but dramatically different visual patterns. One is linear, one is curved, one has an outlier, and one has a single influential point.
Why it matters: This elegant demonstration proves that summary statistics alone are insufficient — you MUST visualize your data. It's the single most powerful argument for exploratory data analysis and the reason "always plot first" is a cardinal rule of statistics.
Chapter references: Ch.5 (always plot first), Ch.22 (correlation and regression).
Citation tier: Tier 1.
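The quartet is small enough to verify directly. The sketch below assumes the copy bundled with seaborn's example datasets (downloading it requires an internet connection).

```python
# Nearly identical summary statistics across all four datasets -- the differences
# only appear when you plot them.
import seaborn as sns

df = sns.load_dataset("anscombe")          # columns: dataset, x, y
for name, group in df.groupby("dataset"):
    corr = group["x"].corr(group["y"])
    print(f"Dataset {name}: mean x = {group['x'].mean():.2f}, "
          f"mean y = {group['y'].mean():.2f}, r = {corr:.2f}")

# Plotting is what reveals the differences, e.g.:
# sns.lmplot(data=df, x="x", y="y", col="dataset", ci=None)
```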
Tufte, E. R. (1983). The Visual Display of Quantitative Information. Graphics Press.
What they found: Established principles for effective data visualization: maximize the data-ink ratio (remove chartjunk), use small multiples for comparison, show the data at several levels of detail, and never deceive the viewer.
Why it matters: Tufte's principles remain the gold standard for data visualization 40+ years later. His concepts (data-ink ratio, chartjunk, small multiples) provide the vocabulary for evaluating and improving statistical graphics.
Chapter references: Ch.25 (data visualization principles, Tufte's principles).
Citation tier: Tier 1 (book; foundational reference in the field).
G.6 Medical Testing and Bayes' Theorem
Casscells, W., Schoenberger, A., & Graboys, T. B. (1978). Interpretation by physicians of clinical laboratory results. New England Journal of Medicine, 299(18), 999-1001.
What they found: Presented 60 physicians and medical students with a simple Bayesian reasoning problem: a disease has a prevalence of 1 in 1,000, and a test has a 5% false positive rate and 100% sensitivity. "If a patient tests positive, what is the probability they have the disease?" Only 11 out of 60 (18%) gave the correct answer (approximately 2%). Most answered 95%.
Why it matters: This study dramatically demonstrates how difficult probabilistic reasoning is — even for highly educated professionals. The doctors were confusing the sensitivity P(positive | disease) with the positive predictive value P(disease | positive), which is exactly the prosecutor's fallacy. It motivates the need for formal training in Bayes' theorem and conditional probability.
Chapter references: Ch.9 (medical testing, base rate fallacy, natural frequencies).
Citation tier: Tier 1.
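The calculation the respondents were asked to perform takes only a few lines using the numbers quoted above (with sensitivity taken as 100%).

```python
# Bayes' theorem with the Casscells problem's numbers.
prevalence = 1 / 1000
sensitivity = 1.00
false_positive_rate = 0.05

p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_disease_given_positive = (sensitivity * prevalence) / p_positive
print(f"P(disease | positive test) = {p_disease_given_positive:.3f}")  # ~0.02, not 0.95
```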
G.7 Sampling and Survey Methodology
Literary Digest Presidential Poll (1936)
What happened: The Literary Digest magazine mailed postcards to 10 million people (based on telephone directories and car registrations) to predict the 1936 presidential election. 2.4 million responded. The poll predicted Alf Landon would win in a landslide over Franklin Roosevelt. Roosevelt won with 61% of the popular vote.
Why it matters: The largest poll in history produced one of the worst predictions — because the sample was biased, not random. People who owned telephones and cars in 1936 were disproportionately wealthy and Republican. This disaster established that sample quality matters more than sample size, and it launched George Gallup's career in scientific polling.
Chapter references: Ch.4 (sampling bias), Ch.11 case-study-01, Ch.12 case-study-02 (polling margins of error), Ch.14 (polling context).
Citation tier: Tier 2 (historical event, widely documented).
G.8 Ethics and Research Integrity
Tuskegee Syphilis Study (1932-1972)
What happened: The U.S. Public Health Service studied the progression of untreated syphilis in 399 Black men in Tuskegee, Alabama. Researchers told participants they were receiving free treatment but in fact withheld treatment — including penicillin, which became available in the 1940s. The study continued for 40 years. Participants suffered, went blind, went insane, and died from a treatable disease.
Why it matters: The Tuskegee study is the most cited example of research ethics violations in American history. It led directly to the National Research Act (1974), the Belmont Report (1979), and the modern IRB system. It also explains the justified distrust of medical research in Black communities — a distrust that persists today and affects public health.
Chapter references: Ch.4 (experimental ethics), Ch.27 (Belmont Report, informed consent, IRB).
Citation tier: Tier 2 (historical event, extensively documented).
Facebook Emotional Contagion Study — Kramer, A. D. I., Guillory, J. E., & Hancock, J. T. (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111(24), 8788-8790.
What they found: Facebook manipulated the News Feeds of 689,003 users for one week, showing some users more positive content and others more negative content. Users exposed to more positive content posted more positive updates, and vice versa — demonstrating "emotional contagion" at scale.
Why it matters: The study generated a firestorm because users were experimented on without their knowledge or explicit consent. Facebook argued its Terms of Service constituted consent, but ethicists disagreed. It raised fundamental questions about informed consent in the digital age and whether corporate A/B testing crosses ethical lines.
Chapter references: Ch.4 (experimental ethics, informed consent), Ch.27 (digital consent, corporate research ethics).
Citation tier: Tier 1.
G.9 Statistical History
Galton, F. (1886). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute, 15, 246-263.
What they found: Tall parents tend to have children shorter than themselves, and short parents tend to have children taller than themselves. The children's heights "regress" toward the population average.
Why it matters: This paper coined the term "regression" and identified regression to the mean as a fundamental statistical phenomenon. Regression to the mean explains why the Sports Illustrated cover jinx is not actually a jinx, why extreme medical test results tend to be less extreme on retest, and why second-semester performance after a great first semester is often disappointing. It's one of the most commonly misunderstood phenomena in everyday life.
Chapter references: Ch.22 (regression to the mean, threshold concept).
Citation tier: Tier 1.
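Regression to the mean falls straight out of a correlation less than one. The simulation below models parent and child heights as correlated normal variables; the mean, standard deviation, and correlation of 0.5 are illustrative values, not Galton's estimates.

```python
# Children of unusually tall parents are, on average, closer to the population mean.
import numpy as np

rng = np.random.default_rng(7)
n, mean, sd, rho = 100_000, 170.0, 7.0, 0.5

parent = rng.normal(mean, sd, n)
# Child height inherits a fraction rho of the parent's deviation from the mean.
child = mean + rho * (parent - mean) + rng.normal(0, sd * np.sqrt(1 - rho**2), n)

tall_parents = parent > mean + sd        # parents at least 1 SD above average
print(f"Tall parents' mean height:    {parent[tall_parents].mean():.1f} cm")
print(f"Their children's mean height: {child[tall_parents].mean():.1f} cm")  # closer to 170
```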
Gosset, W. S. [Student]. (1908). The probable error of a mean. Biometrika, 6(1), 1-25.
What they found: Derived the t-distribution for inference about means when the population standard deviation is unknown and must be estimated from a small sample. Published under the pseudonym "Student" because his employer (Guinness Brewery) prohibited employees from publishing.
Why it matters: The t-distribution is one of the most widely used tools in statistics. Every time you compute a confidence interval for a mean or run a t-test, you're using Gosset's work. The story of the Guinness brewery statistician who changed science is also a memorable illustration of how practical problems drive theoretical innovation.
Chapter references: Ch.12 (t-distribution introduction, Gosset biography), Ch.15 (t-test).
Citation tier: Tier 1.
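Gosset's correction is easy to see by comparing critical values: the t-distribution's 95% cutoffs are noticeably wider than the normal's for small samples and converge as n grows, as the sketch below shows.

```python
# Two-sided 95% critical values: t vs. normal.
from scipy import stats

z_crit = stats.norm.ppf(0.975)
for n in (5, 10, 30, 100):
    t_crit = stats.t.ppf(0.975, df=n - 1)
    print(f"n = {n:>3}: t critical = {t_crit:.3f}  (normal critical = {z_crit:.3f})")
```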
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1-26.
What they found: Introduced the bootstrap — a resampling method that estimates the sampling distribution of any statistic by repeatedly sampling with replacement from the observed data. This simple idea allows inference in situations where traditional formulas don't exist.
Why it matters: The bootstrap revolutionized statistics by making inference possible without distributional assumptions. It's the foundation of simulation-based inference and is used extensively in modern data science and machine learning. The key insight — that the sample can serve as a stand-in for the population — is both profound and intuitive.
Chapter references: Ch.18 (bootstrap methods, Efron biography).
Citation tier: Tier 1.
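A minimal bootstrap sketch, assuming made-up sample data: estimate a 95% percentile interval for the median by resampling the observed values with replacement.

```python
# Percentile bootstrap confidence interval for the median.
import numpy as np

rng = np.random.default_rng(3)
sample = rng.exponential(scale=10.0, size=50)   # a skewed sample, n = 50 (made up)

boot_medians = np.array([
    np.median(rng.choice(sample, size=sample.size, replace=True))
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(f"Observed median: {np.median(sample):.2f}")
print(f"95% bootstrap CI for the median: ({lo:.2f}, {hi:.2f})")
```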
Bickel, P. J., Hammel, E. A., & O'Connell, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187(4175), 398-404.
What they found: Aggregate admissions data at UC Berkeley appeared to show discrimination against women (men admitted at 44%, women at 35%). But when broken down by department, most departments actually slightly favored women. The apparent bias disappeared because women applied disproportionately to more competitive departments with lower overall admission rates.
Why it matters: This is the most famous real-world example of Simpson's paradox — a trend that appears in aggregated data but reverses when the data is broken into subgroups. It powerfully illustrates why looking at the data at the right level of aggregation is essential for correct conclusions.
Chapter references: Ch.27 (Simpson's paradox, threshold concept).
Citation tier: Tier 1.
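The reversal is easy to reproduce with a toy two-department example. The counts below are hypothetical, chosen only to mimic the Berkeley pattern; they are not the actual admissions data.

```python
# Within each department women are admitted at a rate at least as high as men,
# yet the aggregate rate for women is lower, because most women apply to the
# more selective department.
import pandas as pd

applications = pd.DataFrame({
    "dept":     ["A", "A", "B", "B"],
    "gender":   ["men", "women", "men", "women"],
    "applied":  [800, 100, 200, 900],
    "admitted": [480, 65, 20, 95],
})
applications["rate"] = applications["admitted"] / applications["applied"]
print(applications)                                   # women's rate >= men's in each department

overall = applications.groupby("gender")[["applied", "admitted"]].sum()
overall["rate"] = overall["admitted"] / overall["applied"]
print(overall)                                        # yet men's overall rate is higher
```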
G.10 Summary Table
| Study | Year | Key Topic | Chapter(s) |
|---|---|---|---|
| Bem (precognition) | 2011 | Replication crisis, p-hacking | 1, 13, 17 |
| Open Science Collaboration (replication) | 2015 | Replication rates | 1, 13, 17, 27 |
| Ioannidis (false findings) | 2005 | False discovery rate | 13, 17 |
| Simmons et al. (false-positive psychology) | 2011 | Researcher degrees of freedom | 13, 17, 27 |
| Kahneman & Tversky (prospect theory) | 1979 | Probabilistic reasoning | 8, 9, 26 |
| Tversky & Kahneman (heuristics) | 1974 | Cognitive biases in probability | 8, 9, 11 |
| Gigerenzer (calculated risks) | 2002 | Natural frequencies, Bayes | 9, 26 |
| Obermeyer et al. (healthcare algorithm bias) | 2019 | Proxy variable bias in AI | 1, 4, 26, 27 |
| Angwin et al. / ProPublica (COMPAS) | 2016 | Algorithmic fairness | 16, 24, 26, 27 |
| Chouldechova (impossibility theorem) | 2017 | Fairness tradeoffs | 16, 24, 26, 27 |
| Milgram (obedience) | 1963 | Research ethics | 4, 27 |
| Rosenthal (file drawer) | 1979 | Publication bias | 17, 27 |
| Fisher (experimental design) | 1935 | Foundations of hypothesis testing | 12, 13, 20 |
| Anscombe (quartet) | 1973 | Always visualize data | 5, 22 |
| Tufte (visual display) | 1983 | Data visualization principles | 25 |
| Casscells et al. (physician Bayes) | 1978 | Base rate fallacy | 9 |
| Literary Digest poll | 1936 | Sampling bias | 4, 11, 12, 14 |
| Tuskegee syphilis study | 1932-72 | Research ethics | 4, 27 |
| Facebook emotional contagion | 2014 | Digital consent | 4, 27 |
| Galton (regression to mean) | 1886 | Regression | 22 |
| Gosset / Student (t-distribution) | 1908 | t-test | 12, 15 |
| Efron (bootstrap) | 1979 | Resampling methods | 18 |
| Bickel et al. (Berkeley admissions) | 1975 | Simpson's paradox | 27 |