Further Reading: Inference for Means

Books

For Deeper Understanding

Charles Wheelan, Naked Statistics: Stripping the Dread from the Data (2013) Wheelan's treatment of the t-test is refreshingly practical. He explains why the distinction between $\sigma$ and $s$ matters with examples from polling, medical research, and business. If the z-vs.-t distinction feels abstract, Wheelan's real-world framing will make it concrete. Previously recommended for confidence intervals (Chapter 12) and hypothesis testing (Chapter 13) — the t-test coverage is a natural continuation.

David Salsburg, The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century (2001) Chapter 3 tells the story of William Sealy Gosset (Student) at the Guinness brewery in Dublin. Salsburg describes how Gosset, working with small samples of barley, realized that the normal approximation was inadequate and developed the t-distribution. The human story behind the mathematics — a brewery employee sneaking papers into scientific journals under a pseudonym — is one of the most charming in all of statistics. Essential reading for understanding why the t-distribution exists.

David Freedman, Robert Pisani, and Roger Purves, Statistics, 4th edition (2007) Chapters 26-27 provide an exceptionally careful treatment of the one-sample t-test and its conditions. Freedman was famously meticulous about when statistical procedures are and aren't appropriate, and his discussion of the normality condition — with specific examples of how non-normality affects the t-test at different sample sizes — directly complements Section 15.6. This is the reference for understanding the fine print.

Larry Gonick and Woollcott Smith, The Cartoon Guide to Statistics (1993) Don't let the title fool you — this book covers the t-test with surprising rigor while being genuinely fun to read. The visual explanation of degrees of freedom (using physical degrees of freedom as an analogy) is one of the clearest available. A good option for students who want an accessible review.

For the Mathematically Curious

George Casella and Roger Berger, Statistical Inference, 2nd edition (2002) Chapter 5 derives the t-distribution mathematically, starting from the ratio of a standard normal random variable to the square root of a chi-squared random variable divided by its degrees of freedom. If you want to understand why the t-distribution has the shape it does (not just that it does), this is the definitive treatment. Warning: requires comfort with calculus and probability theory.

Larry Wasserman, All of Statistics: A Concise Course in Statistical Inference (2004) Chapter 10 covers one-sample inference with mathematical precision. Wasserman's treatment of robustness is particularly good — he discusses influence functions and breakdown points, which formalize the intuitive robustness discussion from Section 15.7. Connects naturally to the nonparametric methods in Chapter 21.

Articles and Papers

Student [W. S. Gosset] (1908). "The Probable Error of a Mean." Biometrika, 6(1), 1-25. The original paper that introduced the t-distribution. Remarkably readable for a paper more than a century old. Gosset's writing is clear and practical — he was solving a real problem (how to do inference with small brewery samples), and the paper reflects that applied orientation. Available free through many university libraries and at https://doi.org/10.2307/2331554. Reading this paper after learning the t-test gives you an appreciation for how elegant the original insight was.

Lumley, T., Diehr, P., Emerson, S., & Chen, L. (2002). "The Importance of the Normality Assumption in Large Public Health Data Sets." Annual Review of Public Health, 23, 151-169. An excellent discussion of when the normality assumption matters and when it doesn't. The authors demonstrate (using simulation and real public health datasets) that for large samples, the t-test is remarkably robust — but that for skewed data, the mean itself may not be the right summary, regardless of what the t-test says. This paper directly supports the robustness discussion in Section 15.7 and connects to Maya's public health context.

Boneau, C. A. (1960). "The Effects of Violations of Assumptions Underlying the t-Test." Psychological Bulletin, 57(1), 49-64. The classic study on t-test robustness. Boneau systematically varied population shape, sample size, and variance equality to determine when the t-test breaks down. His finding — that the t-test is surprisingly robust to non-normality but sensitive to unequal variances in two-sample settings — has been confirmed by dozens of subsequent studies. The simulation table in Section 15.6 is inspired by this tradition of robustness research.

de Winter, J. C. F. (2013). "Using the Student's t-Test with Extremely Small Sample Sizes." Practical Assessment, Research & Evaluation, 18(10). What happens when you use a t-test with $n = 2$, 3, or 5? This paper explores the limits of the t-test for very small samples. The key finding: for normal populations, the t-test works correctly even with tiny samples (as it should — it was designed for this). For non-normal populations, very small samples lead to severely distorted p-values. Directly relevant to the $n < 15$ guideline in Section 15.6.

Delacre, M., Lakens, D., & Leys, C. (2017). "Why Psychologists Should by Default Use Welch's t-test Instead of Student's t-test." International Review of Social Psychology, 30(1), 92-101. While focused on the two-sample case (Chapter 16), this paper includes a thorough discussion of the one-sample t-test and why it's preferred over the z-test. The authors argue that the default should always be the t-test, even when sample sizes are large — exactly the advice in Section 15.10.

Online Resources

Interactive Tools

Seeing Theory — Frequentist Inference https://seeing-theory.brown.edu/frequentist-inference/ Brown University's interactive visualization suite includes the t-test: you can set up a one-sample t-test, watch the t-distribution change as you adjust degrees of freedom, and see the p-value as an area under the curve. Particularly powerful for building intuition about how degrees of freedom affect the shape of the distribution.

Interactive t-Distribution Explorer https://rpsychologist.com/d3/tdist/ Kristoffer Magnusson's beautiful interactive tool lets you adjust degrees of freedom and watch the t-distribution morph between heavy-tailed (small df) and nearly normal (large df). You can also set up hypothesis tests and see how the rejection region, power, and p-value change with different parameters. The best single resource for building intuition about the t-distribution.

StatKey: One-Sample t-Test http://www.lock5stat.com/StatKey/ StatKey provides both formula-based and randomization-based one-sample tests. Try running both on the same dataset to see how they compare. The randomization approach (Chapter 18) provides an intuitive check on the formula-based t-test.

GeoGebra: t-Distribution vs. Normal Distribution https://www.geogebra.org/m/mf7qkqg2 An interactive applet that overlays the t-distribution on the standard normal, with a slider for degrees of freedom. Watching the t-distribution gradually converge to the normal as df increases from 1 to 100 is one of the most effective ways to internalize the relationship between the two distributions.

Video Resources

StatQuest with Josh Starmer: "The t-test: Clearly Explained" (YouTube) Josh Starmer provides a characteristically energetic and clear walkthrough of the one-sample t-test. He emphasizes the practical distinction between z and t, explains degrees of freedom with helpful visuals, and works through a complete example. His separate video "t-Distribution: Clearly Explained" is excellent companion viewing.

3Blue1Brown: "Why is the Normal Distribution So Normal?" (YouTube) While not specifically about the t-test, Grant Sanderson's explanation of why the normal distribution appears everywhere (via the CLT) is essential background for understanding why the t-test is so robust. The visual intuition he builds about averaging and normal convergence directly supports the robustness discussion in Section 15.7.

Khan Academy: "One-Sample t-Test" (YouTube/khanacademy.org) Sal Khan walks through multiple one-sample t-test examples with careful attention to each step. Particularly good for students who want more practice with the computational aspects. His coverage of the t-table is more thorough than most online resources.

Crash Course Statistics: "t-Tests: A Matched Pair Made in Heaven" (YouTube) A fast-paced overview of t-tests, including a preview of the paired t-test. Good for a quick review or as preparation for Chapter 16. The animations showing the t-distribution's convergence to the normal are particularly effective.

jbstatistics: "Introduction to the t-Distribution" (YouTube) Jeremy Balka's videos on the t-distribution and one-sample t-test are among the most mathematically precise available on YouTube while remaining accessible. He carefully distinguishes between what the t-distribution assumes and what it's robust to — exactly the distinction emphasized in this chapter.

Software Documentation

SciPy: scipy.stats.ttest_1samp https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html Full documentation for the one-sample t-test function used throughout this chapter. Key note: the alternative parameter (added in SciPy 1.7) supports 'two-sided' (default), 'less', and 'greater'. For older SciPy versions, divide the two-tailed p-value by 2 for one-tailed tests (when the test statistic is in the expected direction).
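A minimal sketch of the API described above, using made-up illustration data (the sample values and null mean are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=10.5, scale=2.0, size=25)  # hypothetical measurements

# Two-sided test of H0: mu = 10 ('two-sided' is the default alternative)
two_sided = stats.ttest_1samp(sample, popmean=10.0)

# One-sided test of H1: mu > 10 (requires SciPy >= 1.7)
greater = stats.ttest_1samp(sample, popmean=10.0, alternative='greater')

# When the t statistic is in the hypothesized direction, the one-sided
# p-value is half the two-sided one, matching the manual workaround
# described above for older SciPy versions
print(two_sided.statistic, two_sided.pvalue, greater.pvalue)
```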

SciPy: scipy.stats.t https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html The t-distribution object with methods for PDF (.pdf), CDF (.cdf), survival function (.sf), and percent point function / inverse CDF (.ppf). Essential for computing p-values from summary statistics and for finding critical values.
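For instance, a p-value and a critical value can be computed from summary statistics alone; the numbers below are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical summary statistics: sample mean, sample SD, n, null mean
xbar, s, n, mu0 = 10.8, 2.1, 25, 10.0

t_stat = (xbar - mu0) / (s / np.sqrt(n))
df = n - 1

# .sf is the survival function (1 - CDF); doubling gives a two-sided p-value
p_value = 2 * stats.t.sf(abs(t_stat), df)

# .ppf is the inverse CDF; this is the two-sided critical value at alpha = 0.05
t_crit = stats.t.ppf(0.975, df)

print(t_stat, p_value, t_crit)
```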

SciPy: scipy.stats.shapiro https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html The Shapiro-Wilk normality test, used for condition checking in this chapter. Key caveat: for large samples ($n > 100$), this test will almost always reject normality, even when the t-test is perfectly robust. Use it as one tool among many (alongside histograms and QQ-plots), not as a sole gatekeeper.
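The large-sample caveat is easy to see with simulated data (a sketch, using made-up samples):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Modest normal sample: Shapiro-Wilk usually (though not always) retains normality
stat_small, p_small = stats.shapiro(rng.normal(size=30))

# Large, skewed sample: Shapiro-Wilk rejects normality decisively,
# even though the CLT makes the t-test quite robust at n = 5000
stat_large, p_large = stats.shapiro(rng.exponential(size=5000))

print(p_small, p_large)
```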

pingouin: One-Sample t-Test https://pingouin-stats.org/build/html/generated/pingouin.ttest.html The pingouin library's ttest() function returns not just the t-statistic and p-value but also Cohen's d (effect size), the confidence interval, and the Bayes factor — all in a single, clean output. Excellent for following the recommendation to report multiple measures of evidence. Preview for Chapter 17's discussion of effect sizes.

statsmodels: DescrStatsW.ttest_mean https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.DescrStatsW.ttest_mean.html An alternative to SciPy for one-sample t-tests. The DescrStatsW class wraps the sample data and exposes the test through .ttest_mean(), along with confidence intervals through the .tconfint_mean() method, so the test and its interval come from a single object.
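A short sketch, assuming statsmodels is installed (the data values are hypothetical):

```python
import numpy as np
from statsmodels.stats.weightstats import DescrStatsW

rng = np.random.default_rng(7)
data = rng.normal(loc=10.5, scale=2.0, size=25)  # hypothetical measurements

d = DescrStatsW(data)
t_stat, p_value, df = d.ttest_mean(value=10.0)  # H0: mu = 10
low, high = d.tconfint_mean(alpha=0.05)         # 95% CI for the mean

print(t_stat, p_value, df, (low, high))
```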

Historical Notes

William Sealy Gosset and the Birth of Small-Sample Statistics

The t-distribution was born from a practical problem at the Guinness brewery in Dublin, Ireland.

In the early 1900s, Guinness hired William Sealy Gosset (1876-1937), an Oxford-trained chemist, to apply scientific methods to the brewing process. Gosset needed to evaluate the quality of barley and hops using small samples — often just 3 to 10 measurements. He quickly realized that the standard normal-based methods of the day (developed by Karl Pearson) didn't work well for small samples. The critical values were too small, leading to overconfident conclusions.

Gosset spent years working out the exact distribution of the ratio $(\bar{x} - \mu)/(s/\sqrt{n})$ and published his results in 1908 under the pseudonym "Student" (Guinness prohibited employees from publishing under their own names to protect trade secrets). The paper, "The Probable Error of a Mean," introduced what we now call the t-distribution.

The pseudonym stuck. Even after Gosset's identity became widely known, the distribution continued to be called "Student's t-distribution" — a gentle irony, since Gosset was neither a student nor a professor but a brewery employee.

Ronald Fisher later provided the mathematical theory underlying Gosset's empirical results and popularized the t-test as a standard tool. Fisher's 1925 textbook, Statistical Methods for Research Workers, introduced the t-test to a wide scientific audience and established many of the conventions we still use (including the $\alpha = 0.05$ threshold that Gosset himself never endorsed as a rigid rule).

The Legacy

Gosset's contribution was profound: he showed that small samples need different treatment than large samples, and he provided the exact tool needed. Before Gosset, researchers with small datasets had to either make unwarranted assumptions about $\sigma$ or give up on formal inference entirely.

Today, the one-sample t-test is used millions of times per day — in hospitals, factories, labs, tech companies, and classrooms. Every time someone tests whether a mean is different from a standard, they're using Gosset's 1908 insight. It's arguably the single most widely applied statistical procedure in history.

And it all started because a brewery needed to test its barley.

What's Coming Next

Chapter 16 will extend the t-test to comparing two groups. You'll learn:

  • The independent-samples t-test for comparing means from two different groups (e.g., treatment vs. control)
  • The paired t-test for before-and-after designs (which is just a one-sample t-test on the differences — you already know the engine!)
  • The two-proportion z-test for comparing proportions between groups
  • How to choose between paired and independent-samples designs
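The paired-test point above can already be verified with this chapter's tools: a paired t-test gives exactly the same result as a one-sample t-test on the differences. A sketch with made-up before/after data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
before = rng.normal(50, 5, size=12)             # hypothetical baseline scores
after = before + rng.normal(1.0, 2.0, size=12)  # hypothetical follow-up scores

paired = stats.ttest_rel(after, before)                     # paired t-test (Chapter 16)
via_diffs = stats.ttest_1samp(after - before, popmean=0.0)  # the engine you already know

print(paired.statistic, via_diffs.statistic)  # identical by construction
```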

This will let you answer Alex's big question: "Did the new algorithm increase watch time compared to the old one?" And Professor Washington's: "Is the algorithm's false positive rate different for Black defendants vs. white defendants?"

Resources to preview:

  • StatQuest: "Two-Sample t-Test" (YouTube) — clear walkthrough of the independent-samples t-test
  • Khan Academy: "Paired t-Test" (khanacademy.org) — multiple worked examples of before-and-after designs
  • Seeing Theory: Frequentist Inference module — interactive visualization of two-sample comparisons