> "The statistician who supposes that his main job is to test whether effects are significant has a point of view that seems almost frivolous."
Learning Objectives
- Conduct a one-sample t-test for a population mean
- Construct and interpret confidence intervals for means using the t-distribution
- Verify conditions for t-procedures (randomness, normality, independence)
- Understand when to use z vs. t procedures
- Handle violations of assumptions and decide when t-procedures are robust
In This Chapter
- Chapter Overview
- 15.1 A Puzzle Before We Start (Productive Struggle)
- 15.2 The One-Sample t-Test: The Formula
- 15.3 The Five-Step Procedure
- 15.4 Complete Worked Example 1: Maya's Hospital Wait Times
- 15.5 Confidence Intervals for Means: A Deeper Look
- 15.6 Checking Conditions: The Normality Requirement
- 15.7 Robustness: How Sturdy Is the t-Test?
- 15.8 Complete Worked Example 2: Alex's Watch Time Analysis
- 15.9 Complete Worked Example 3: Sam's Points Per Game
- 15.10 z-Test vs. t-Test: The Complete Comparison
- 15.11 The t-Distribution Acknowledges Our Uncertainty
- 15.12 A Preview: Paired Data and the Paired t-Test
- 15.13 Python: Complete t-Test Toolkit
- 15.14 Excel: t-Test and Confidence Interval Reference
- 15.15 Putting It All Together: A Decision Flowchart
- 15.16 Progressive Project: Conduct t-Tests on Numerical Variables
- 15.17 Common Mistakes and How to Avoid Them
- 15.18 Chapter Summary
Chapter 15: Inference for Means
"The statistician who supposes that his main job is to test whether effects are significant has a point of view that seems almost frivolous." — W. Edwards Deming
Chapter Overview
Here's a question that comes up every day in hospitals, factories, classrooms, and tech companies: is this average different from what we expected?
Dr. Maya Chen looks at her sample of emergency department wait times and wonders: is the average wait in her county really exceeding the 4-hour national benchmark? Alex Rivera pulls a sample of viewing sessions and asks: is the average watch time on the new StreamVibe interface actually different from the 45-minute industry standard? Sam Okafor crunches the Riverside Raptors' scoring data and wants to know: did average points per game actually change after the coaching switch?
These are all questions about a population mean — and answering them requires the one-sample t-test, arguably the most commonly used statistical test in all of applied science.
You've actually built almost all the pieces for this already. In Chapter 12, you learned to construct confidence intervals using the t-distribution. In Chapter 13, you learned the logic of hypothesis testing — null and alternative hypotheses, p-values, significance levels, and decision-making. Now we're going to combine these two ideas into a single, powerful procedure specifically designed for testing claims about population means.
Here's the key theme of this chapter: the t-distribution is what honesty looks like in statistics. When we use $s$ instead of $\sigma$ (which is almost always), the t-distribution acknowledges that we're less certain than we'd be if we magically knew $\sigma$. Its heavier tails say: "You've estimated the spread from a finite sample, so we're going to give you wider margins to compensate." That's not a weakness — that's intellectual integrity. The t-distribution doesn't pretend to know more than it does.
In this chapter, you will learn to:
- Conduct a one-sample t-test for a population mean
- Construct and interpret confidence intervals for means using the t-distribution
- Verify conditions for t-procedures (randomness, normality, independence)
- Understand when to use z vs. t procedures
- Handle violations of assumptions and decide when t-procedures are robust
Fast Track: If you've conducted t-tests before and can state the conditions from memory, skim Sections 15.1–15.3 and jump to Section 15.7 (robustness). Complete quiz questions 1, 8, and 15 to verify your understanding.
Deep Dive: After this chapter, read Case Study 1 (Maya's hospital wait times) for a complete public health application, then Case Study 2 (Alex's A/B testing pipeline) for a look at how tech companies use t-tests at scale. Both case studies include full worked solutions.
15.1 A Puzzle Before We Start (Productive Struggle)
Before we dive into formulas, try this thought experiment.
The Mystery Samples
A quality control engineer at a cereal factory tests whether boxes are being filled to the advertised weight of 500 grams. She pulls a random sample and measures each box.
Sample A: $n = 8$ boxes. Mean = 496 g, standard deviation = 3 g.
Sample B: $n = 50$ boxes. Mean = 498 g, standard deviation = 12 g.
(a) Which sample gives stronger evidence that the factory is underfilling? Trust your gut before doing any calculations.
(b) Sample A has a mean that's 4 grams below the target, while Sample B is only 2 grams below. Does that change your answer?
(c) Sample A has a smaller standard deviation ($s = 3$) and a bigger shortfall (4 g below target). Why might Sample B still give stronger evidence of underfilling?
(d) Here's the twist: Sample A has only 8 boxes. How confident can you really be that $s = 3$ is close to the true population standard deviation? What if the true $\sigma$ is really 10?
Take 3 minutes. Pay special attention to part (d).
Here's what I hope you noticed:
For part (a), your instinct probably said Sample A — it's farther below the target. But if you thought about it a bit more, you might have hesitated. Sample B has more data.
Part (b) highlights the tension: Sample A has a bigger shortfall but fewer observations.
Part (c) gets at the heart of it. The standard deviation matters enormously. Sample A's small standard deviation of 3 makes that 4-gram shortfall look huge in relative terms (more than one SD below the target), while Sample B's large standard deviation of 12 makes the 2-gram shortfall look tiny (just one-sixth of an SD). In fact, if you compute each shortfall relative to its standard error, Sample A's test statistic comes out larger in magnitude (about $-3.8$ versus $-1.2$), so the raw calculation seems to favor Sample A.
But part (d) is the crucial insight: with only 8 boxes, you can't trust that $s = 3$ is a good estimate of $\sigma$. Maybe the true population SD is 5 or 8 or 12. You just happened to get 8 boxes that were unusually consistent. With $n = 50$, you can trust $s = 12$ much more.
This is exactly the problem the t-distribution solves. It says: "You have to use $s$ instead of $\sigma$, and with small samples, $s$ might be way off. So I'm going to give you wider margins — heavier tails — to compensate for that extra uncertainty."
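To see how shaky a small-sample $s$ really is, here's a quick simulation (my own illustration, not part of the puzzle): draw many samples of $n = 8$ from a population whose true $\sigma$ is 10 and watch how widely the sample standard deviation swings.

```python
import numpy as np

# How much does s bounce around when n is only 8?
# Illustration: a normal population of fill weights with true sigma = 10
rng = np.random.default_rng(42)
true_sigma = 10
sample_sds = np.array([rng.normal(500, true_sigma, size=8).std(ddof=1)
                       for _ in range(10_000)])

print(f"True sigma: {true_sigma}")
print(f"Smallest sample SD observed: {sample_sds.min():.1f}")
print(f"Largest sample SD observed:  {sample_sds.max():.1f}")
```

Across repeated samples of 8 boxes, the sample SD routinely lands anywhere from roughly half the true $\sigma$ to well above it. Getting $s = 3$ from 8 boxes is no guarantee that $\sigma$ is anywhere near 3.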
Let's make this precise.
15.2 The One-Sample t-Test: The Formula
🔄 Spaced Review 1 (from Ch.12): Confidence Intervals and the t-Distribution
In Chapter 12, you learned that when $\sigma$ is unknown (the usual case), you replace the z-distribution with the t-distribution. The confidence interval formula became:
$$\bar{x} \pm t^* \cdot \frac{s}{\sqrt{n}}$$
where $t^*$ comes from the t-distribution with $df = n - 1$, and the heavier tails of the t-distribution produce wider intervals than the z-distribution would. This width isn't a bug — it's honesty about the extra uncertainty from estimating $\sigma$ with $s$.
Now we're going to take that same idea and apply it to hypothesis testing. The t-distribution's role in confidence intervals (widening the interval) becomes its role in hypothesis tests (making the p-value larger, requiring stronger evidence to reject $H_0$).
In Chapter 13, you learned the general form of a test statistic:
$$\text{test statistic} = \frac{\text{sample statistic} - \text{null hypothesis value}}{\text{standard error of the statistic}}$$
For testing a claim about a population mean $\mu$, the sample statistic is $\bar{x}$, the null hypothesis value is $\mu_0$, and the standard error is $s/\sqrt{n}$. Putting it together:
$$\boxed{t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}}$$
This is the one-sample t-test statistic. Let's break it down:
| Component | Symbol | What It Represents |
|---|---|---|
| Sample mean | $\bar{x}$ | The average of your sample data |
| Hypothesized mean | $\mu_0$ | The value you're testing against (from $H_0$) |
| Sample standard deviation | $s$ | Your estimate of the population spread |
| Sample size | $n$ | How many observations you have |
| Standard error | $s/\sqrt{n}$ | How much $\bar{x}$ typically varies from $\mu$ |
The t-statistic tells you: how many standard errors is the sample mean from the hypothesized value? If it's many standard errors away, the data are surprising under $H_0$. If it's close to zero, the data are perfectly consistent with $H_0$.
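The formula drops straight into code. Here's a minimal helper (the function name is my own) that computes the t-statistic from summary statistics:

```python
import math

def t_statistic(x_bar, mu_0, s, n):
    """One-sample t-statistic: how many standard errors x_bar lies from mu_0."""
    standard_error = s / math.sqrt(n)
    return (x_bar - mu_0) / standard_error

# Example: a sample mean 2 standard errors above the hypothesized value
print(t_statistic(105, 100, 10, 16))  # SE = 10/4 = 2.5, so t = 5/2.5 = 2.0
```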
Concept 1: The One-Sample t-Test
The one-sample t-test tests whether a population mean $\mu$ equals a specific value $\mu_0$. It uses the t-distribution (rather than the normal distribution) because we estimate $\sigma$ with $s$. The test statistic $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$ follows a t-distribution with $df = n - 1$ when the conditions are met and $H_0$ is true.
How the t-Test Differs from the z-Test
In Chapter 13, you saw the z-test statistic for a mean:
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
The only difference between the z-test and the t-test is what goes in the denominator:
| Test | Denominator | Distribution | When to Use |
|---|---|---|---|
| z-test | $\sigma / \sqrt{n}$ (known $\sigma$) | Standard normal ($z$) | When you know $\sigma$ (almost never) |
| t-test | $s / \sqrt{n}$ (estimated $\sigma$) | t-distribution with $df = n - 1$ | When you estimate $\sigma$ from data (almost always) |
That's it. That's the whole difference. One little letter changes ($\sigma$ to $s$), and we switch from the normal to the t-distribution. But that small change has a big impact on p-values, especially for small samples.
When would you actually know $\sigma$? Almost never. You might know $\sigma$ if you have a long history of measurements of the same quantity under the same conditions — for example, a manufacturing process that's been running for decades, or a standardized test with published population parameters. But even in these cases, it's usually safer to use $s$ and the t-distribution. The t-test is always valid (when conditions are met), while the z-test is only valid when you truly know $\sigma$.
Rule of thumb: When in doubt, use the t-test. You'll be right 99% of the time.
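To see how much that one-letter change matters at small $n$, compare the two-tailed p-values the normal and t-distributions assign to the same test statistic (a quick illustration using SciPy):

```python
from scipy import stats

t_obs, n = 2.0, 10
df = n - 1

p_z = 2 * stats.norm.sf(t_obs)   # if sigma were somehow known
p_t = 2 * stats.t.sf(t_obs, df)  # sigma estimated by s (the honest version)

print(f"z-based p-value: {p_z:.4f}")
print(f"t-based p-value: {p_t:.4f}")
```

At $\alpha = 0.05$, the z-based calculation would reject while the t-based one would not. The heavier tails demand stronger evidence from a small sample.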
15.3 The Five-Step Procedure
Let's organize the one-sample t-test into a clean, repeatable procedure. This mirrors the five-step framework from Chapter 13, specialized for testing a mean with $\sigma$ unknown.
Step 1: State the Hypotheses
Write the null and alternative hypotheses in terms of the population mean $\mu$:
| Test Type | $H_0$ | $H_a$ |
|---|---|---|
| Two-tailed | $\mu = \mu_0$ | $\mu \neq \mu_0$ |
| Right-tailed | $\mu = \mu_0$ | $\mu > \mu_0$ |
| Left-tailed | $\mu = \mu_0$ | $\mu < \mu_0$ |
Remember: Hypotheses are about the population parameter $\mu$, not the sample statistic $\bar{x}$.
Step 2: Check the Conditions
Before computing anything, verify that the t-test is appropriate. We need three conditions:
- Randomness: The data come from a random sample (or random assignment in an experiment).
- Independence: Individual observations are independent. When sampling without replacement, the 10% condition must hold: $n \leq 0.10 \times N$.
- Normality: The sampling distribution of $\bar{x}$ is approximately normal. This is satisfied if:
  - The population is approximately normal, OR
  - The sample size is large enough for the CLT to kick in
We'll discuss the normality condition in much more detail in Section 15.6. For now, here's the quick guide:
| Sample Size | What to Check |
|---|---|
| $n < 15$ | The data must show no strong skewness, no outliers. The population should be approximately normal. |
| $15 \leq n < 30$ | The t-test can tolerate some skewness, but check for serious outliers and extreme skewness. |
| $n \geq 30$ | The CLT makes the t-test valid for most population shapes. Only severe outliers are a concern. |
Step 3: Compute the Test Statistic
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
Step 4: Find the P-Value
Use the t-distribution with $df = n - 1$:
| Alternative | P-Value |
|---|---|
| $H_a: \mu > \mu_0$ | $P(T \geq t)$ — area in the right tail |
| $H_a: \mu < \mu_0$ | $P(T \leq t)$ — area in the left tail |
| $H_a: \mu \neq \mu_0$ | $2 \times P(T \geq |t|)$ — area in both tails |
In Python: scipy.stats.t.sf(t, df) gives the right-tail area, scipy.stats.t.cdf(t, df) gives the left-tail area, and 2 * scipy.stats.t.sf(abs(t), df) gives the two-tailed p-value.
In Excel: =T.DIST.RT(t, df) gives the right-tail area; =T.DIST.2T(ABS(t), df) gives the two-tailed p-value directly.
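The three cases can be computed side by side. This sketch uses an arbitrary illustrative value of $t = 2.10$ with $df = 14$:

```python
from scipy import stats

t, df = 2.10, 14  # illustrative values

p_right = stats.t.sf(t, df)          # H_a: mu > mu_0 (right tail)
p_left = stats.t.cdf(t, df)          # H_a: mu < mu_0 (left tail)
p_two = 2 * stats.t.sf(abs(t), df)   # H_a: mu != mu_0 (both tails)

print(f"right-tailed: {p_right:.4f}")
print(f"left-tailed:  {p_left:.4f}")
print(f"two-tailed:   {p_two:.4f}")
```

Note that the right-tail and left-tail areas always sum to 1, and (since $t > 0$ here) the two-tailed p-value is exactly double the right-tail area.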
Step 5: State the Conclusion in Context
Compare the p-value to $\alpha$:
- If $p \leq \alpha$: Reject $H_0$. There is sufficient evidence at the $\alpha$ level that...
- If $p > \alpha$: Fail to reject $H_0$. There is not sufficient evidence at the $\alpha$ level that...
Always state the conclusion in terms of the original question, not just "reject" or "fail to reject."
15.4 Complete Worked Example 1: Maya's Hospital Wait Times
Let's walk through a complete t-test with Dr. Maya Chen.
The Scenario
Maya is investigating emergency department wait times in her county. The national benchmark for average ED wait time is 240 minutes (4 hours). She suspects her county's wait times are higher than the national benchmark. She collects a random sample of 36 ED visits from the past month.
Here are the summary statistics:
- $n = 36$
- $\bar{x} = 258$ minutes
- $s = 45$ minutes
Is there statistically significant evidence that the average ED wait time in her county exceeds the national benchmark?
Step 1: State the Hypotheses
Maya wants to know if wait times exceed the benchmark, so this is a right-tailed test:
$$H_0: \mu = 240 \text{ minutes}$$ $$H_a: \mu > 240 \text{ minutes}$$
where $\mu$ is the true mean ED wait time in Maya's county.
Step 2: Check the Conditions
Randomness: Maya collected a random sample of ED visits. ✓
Independence: 36 visits is almost certainly less than 10% of all ED visits in a month for a county-level hospital system. ✓
Normality: With $n = 36 \geq 30$, the CLT ensures the sampling distribution of $\bar{x}$ is approximately normal, even if individual wait times are right-skewed (which they often are — a few patients wait extremely long). ✓
All conditions met. Proceed.
Step 3: Compute the Test Statistic
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{258 - 240}{45 / \sqrt{36}} = \frac{18}{45 / 6} = \frac{18}{7.5} = 2.40$$
The sample mean is 2.40 standard errors above the hypothesized mean.
Step 4: Find the P-Value
Using the t-distribution with $df = 36 - 1 = 35$:
$$p\text{-value} = P(T_{35} \geq 2.40)$$
From a t-table or software: $p\text{-value} \approx 0.011$.
In Python:
from scipy import stats
t_stat = 2.40
df = 35
p_value = stats.t.sf(t_stat, df) # right-tail area
print(f"t = {t_stat:.2f}, df = {df}, p-value = {p_value:.4f}")
Output:
t = 2.40, df = 35, p-value = 0.0109
In Excel:
=T.DIST.RT(2.40, 35)
Returns approximately 0.0109.
Step 5: State the Conclusion in Context
At $\alpha = 0.05$: Since $p = 0.011 < 0.05$, we reject $H_0$.
Conclusion: There is statistically significant evidence at the 0.05 level that the average emergency department wait time in Maya's county exceeds the national benchmark of 240 minutes. The sample data suggest the true mean wait time is approximately 258 minutes, though we should construct a confidence interval for a more complete picture (we'll do this in Section 15.5).
Connecting the Test to the Confidence Interval
Let's also build a 95% confidence interval for the true mean wait time:
$$\bar{x} \pm t^* \cdot \frac{s}{\sqrt{n}} = 258 \pm 2.030 \cdot \frac{45}{\sqrt{36}} = 258 \pm 2.030 \cdot 7.5 = 258 \pm 15.2$$
$$\text{95\% CI: } (242.8, \; 273.2) \text{ minutes}$$
Notice that 240 is not in this confidence interval. This is consistent with our rejection of $H_0: \mu = 240$ — the CI-test duality from Chapter 13 in action. The interval also tells us something the hypothesis test alone doesn't: the true mean wait time is plausibly between 243 and 273 minutes. That's useful information for policy decisions.
Maya's Takeaway: "We have strong evidence that wait times exceed the national benchmark. But the confidence interval tells me how much they exceed it — somewhere between 3 and 33 minutes longer. That range matters for deciding whether to request more staff or redesign the triage process."
15.5 Confidence Intervals for Means: A Deeper Look
🔄 Spaced Review 2 (from Ch.10): Normal Distribution Properties and Conditions
In Chapter 10, you learned that the normal distribution is a model — never perfectly correct, but sometimes useful enough. The threshold concept was: "The question is never 'Is my data normal?' (the answer is always no). The question is: 'Is my data close enough to normal that the normal model gives useful, approximately correct answers?'"
That same philosophy applies to t-procedures. The t-test and t-confidence interval assume the sampling distribution of $\bar{x}$ is approximately normal. This assumption can be satisfied either by the population being approximately normal (for small samples) or by the CLT (for larger samples). The tools you learned in Chapter 10 — histograms, QQ-plots, the Shapiro-Wilk test — are exactly the tools you'll use to check this condition.
You already built confidence intervals for means in Chapter 12. Let's deepen that understanding by connecting CIs more tightly to hypothesis tests and introducing some practical considerations.
The CI Formula (Revisited)
$$\boxed{\bar{x} \pm t^*_{df} \cdot \frac{s}{\sqrt{n}}}$$
where $df = n - 1$ and $t^*_{df}$ is the critical value for the desired confidence level.
Common Critical Values
| Confidence Level | $t^*$ (df = 10) | $t^*$ (df = 25) | $t^*$ (df = 50) | $t^*$ (df = 100) | $z^*$ (normal) |
|---|---|---|---|---|---|
| 90% | 1.812 | 1.708 | 1.676 | 1.660 | 1.645 |
| 95% | 2.228 | 2.060 | 2.009 | 1.984 | 1.960 |
| 99% | 3.169 | 2.787 | 2.678 | 2.626 | 2.576 |
Notice the pattern: $t^*$ values are always larger than $z^*$ values (wider intervals), and they approach $z^*$ as $df$ increases. By $df = 100$, the difference is negligible.
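You can reproduce the table's pattern directly with SciPy (a quick check):

```python
from scipy import stats

# 95% critical values: two-tailed, so we want the 0.975 quantile
z_star = stats.norm.ppf(0.975)
print(f"z* = {z_star:.3f}")
for df in [10, 25, 50, 100, 1000]:
    t_star = stats.t.ppf(0.975, df)
    print(f"df = {df:>4}: t* = {t_star:.3f} (excess over z*: {t_star - z_star:.3f})")
```

The excess shrinks toward zero as $df$ grows, which is why large-sample t-intervals are nearly indistinguishable from z-intervals.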
The CI-Test Duality (Deepened)
There's an elegant connection between confidence intervals and hypothesis tests that we introduced in Chapter 13. Here it is again, specifically for means:
$$\boxed{\text{Reject } H_0: \mu = \mu_0 \text{ at } \alpha = 0.05 \iff \mu_0 \text{ is NOT in the 95\% CI for } \mu}$$
This means you can always read a hypothesis test result directly from a confidence interval:
- If $\mu_0$ falls outside the CI → reject $H_0$
- If $\mu_0$ falls inside the CI → fail to reject $H_0$
Example: Maya's 95% CI was (242.8, 273.2). Since 240 is below the lower bound, we reject $H_0: \mu = 240$. If the benchmark had been 250 (which falls inside the CI), we would fail to reject.
Python: Building Confidence Intervals
import numpy as np
from scipy import stats
# Maya's wait time data (summary statistics)
x_bar = 258
s = 45
n = 36
df = n - 1
confidence = 0.95
# Critical value
t_star = stats.t.ppf((1 + confidence) / 2, df)
# Margin of error
margin = t_star * (s / np.sqrt(n))
# Confidence interval
ci_lower = x_bar - margin
ci_upper = x_bar + margin
print(f"Sample mean: {x_bar}")
print(f"t* (df={df}): {t_star:.3f}")
print(f"Margin of error: {margin:.1f}")
print(f"{confidence*100:.0f}% CI: ({ci_lower:.1f}, {ci_upper:.1f})")
Output:
Sample mean: 258
t* (df=35): 2.030
Margin of error: 15.2
95% CI: (242.8, 273.2)
Excel: Building Confidence Intervals
# Assuming data in cells A2:A37
Mean: =AVERAGE(A2:A37)
Std Dev: =STDEV.S(A2:A37)
n: =COUNT(A2:A37)
df: =COUNT(A2:A37)-1
t* (95%): =T.INV.2T(0.05, df)
Margin: =T.INV.2T(0.05, df) * STDEV.S(A2:A37) / SQRT(COUNT(A2:A37))
Lower bound: =AVERAGE(A2:A37) - Margin
Upper bound: =AVERAGE(A2:A37) + Margin
# Or use the built-in function:
=CONFIDENCE.T(0.05, STDEV.S(A2:A37), COUNT(A2:A37))
Note on CONFIDENCE.T vs. CONFIDENCE.NORM: Excel offers both functions. Use CONFIDENCE.T (which uses the t-distribution) for means when $\sigma$ is unknown. The older CONFIDENCE.NORM uses the z-distribution and is only appropriate when $\sigma$ is known — which, as we've discussed, is almost never.
15.6 Checking Conditions: The Normality Requirement
The one-sample t-test requires three conditions: randomness, independence, and normality (of the sampling distribution). The first two are about study design — either you have a random sample or you don't. But normality requires judgment, and that judgment depends on your sample size.
The Normality Condition in Detail
The t-test assumes the sampling distribution of $\bar{x}$ is approximately normal. This is guaranteed if:
- The population itself is approximately normal, OR
- The sample size is large enough for the CLT to produce an approximately normal sampling distribution
Here's the practical guide:
Guidelines for the Normality Condition
| Sample Size | What to Do |
|---|---|
| $n < 15$ | The population must be approximately normal. Check with a histogram, QQ-plot, or Shapiro-Wilk test. If the data show clear skewness or outliers, the t-test is unreliable. Consider a nonparametric alternative (Chapter 21). |
| $15 \leq n < 30$ | The t-test can tolerate moderate skewness. However, check for serious outliers or extreme skewness. A single extreme outlier can distort results. Create a histogram and/or boxplot to check. |
| $n \geq 30$ | The CLT makes the t-test reliable for most population shapes. Only extreme outliers or very severe skewness are a concern. You're generally safe. |
Why These Numbers?
These guidelines aren't arbitrary — they come from simulation studies. Researchers have repeatedly simulated t-tests from populations with various shapes (skewed, bimodal, heavy-tailed) and checked how the actual Type I error rate compares to the nominal $\alpha = 0.05$:
| Population Shape | $n = 5$ | $n = 15$ | $n = 30$ | $n = 100$ |
|---|---|---|---|---|
| Normal | 0.050 | 0.050 | 0.050 | 0.050 |
| Mildly skewed | 0.061 | 0.053 | 0.051 | 0.050 |
| Strongly skewed | 0.092 | 0.063 | 0.054 | 0.051 |
| Very heavy-tailed | 0.085 | 0.060 | 0.053 | 0.050 |
| Bimodal | 0.055 | 0.051 | 0.050 | 0.050 |
(Values are approximate Type I error rates from simulation studies)
Look at the pattern. For normal populations, the t-test nails the 5% error rate at every sample size — no surprise there. For mildly skewed or bimodal populations, even $n = 15$ gives nearly correct results. For strongly skewed populations, $n = 30$ gets you close to 5%. But for small samples from skewed populations, the actual error rate can be nearly double the nominal rate (9.2% instead of 5%). That's why we're so cautious about normality with small samples.
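Here's a minimal version of the kind of simulation behind those numbers (my own sketch, using a strongly right-skewed exponential population with $n = 10$). Because the samples come from a population whose mean really is $\mu_0$, every rejection is a Type I error:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps, alpha = 10, 20_000, 0.05

# Strongly right-skewed population: exponential with true mean 1.0,
# so H0: mu = 1.0 is TRUE and every rejection is a Type I error
samples = rng.exponential(scale=1.0, size=(reps, n))
x_bar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)
t = (x_bar - 1.0) / (s / np.sqrt(n))
p = 2 * stats.t.sf(np.abs(t), df=n - 1)

type1_rate = np.mean(p <= alpha)
print(f"Nominal alpha: {alpha}, observed Type I error rate: {type1_rate:.3f}")
```

The observed rejection rate comes out noticeably above the nominal 5%, in line with the "strongly skewed" row of the table.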
Practical Checks
Histogram: Does the data look roughly symmetric? A little skewness is OK for $n \geq 15$; a lot of skewness is not OK for $n < 30$.
Boxplot: Are there extreme outliers? A single extreme outlier in a small sample can dominate the mean and standard deviation.
QQ-plot: Do the points roughly follow the diagonal line? Systematic curvature suggests the data aren't normally distributed.
Shapiro-Wilk test: From Chapter 10, this is a formal test of normality. A small p-value suggests non-normality. But remember: for large samples, the Shapiro-Wilk test will often reject normality even when the t-test is perfectly robust. It tests whether the data are exactly normal, which is a stricter question than whether the t-test gives reliable results.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Generate some example data (simulating ED wait times)
np.random.seed(42)
wait_times = np.random.exponential(scale=50, size=36) + 200
# Check 1: Histogram
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(wait_times, bins=10, edgecolor='black', alpha=0.7)
plt.xlabel('Wait Time (minutes)')
plt.ylabel('Frequency')
plt.title('Histogram of ED Wait Times')
# Check 2: QQ-plot
plt.subplot(1, 2, 2)
stats.probplot(wait_times, dist="norm", plot=plt)
plt.title('QQ-Plot of ED Wait Times')
plt.tight_layout()
plt.show()
# Check 3: Shapiro-Wilk test
stat, p_value = stats.shapiro(wait_times)
print(f"Shapiro-Wilk test: W = {stat:.4f}, p = {p_value:.4f}")
if p_value > 0.05:
    print("No strong evidence against normality (p > 0.05)")
else:
    print("Evidence of non-normality (p < 0.05)")
    print("But with n = 36, t-test is robust to moderate non-normality")
What If the Conditions Aren't Met?
If the normality condition is violated and your sample is too small for the CLT to help, you have several options:
- Transform the data. Taking the logarithm or square root of right-skewed data can often produce a more symmetric distribution. (You'll need to interpret results on the transformed scale.)
- Use a nonparametric test. The Wilcoxon signed-rank test (Chapter 21) doesn't assume normality. It tests the median rather than the mean, which is often more appropriate for skewed data anyway.
- Use bootstrap methods. The bootstrap (Chapter 18) constructs confidence intervals and tests without assuming any particular population shape. It's a powerful modern alternative.
- Collect more data. If possible, increasing $n$ lets the CLT bail you out.
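The first two options can be sketched in a few lines (the data here are made up for illustration). A log transform often tames right skew, and scipy.stats.wilcoxon runs the signed-rank test on the differences from the hypothesized value:

```python
import numpy as np
from scipy import stats

# Small, clearly right-skewed sample (made up for illustration)
data = np.array([2.1, 2.4, 2.9, 3.1, 3.3, 3.8, 4.5, 5.2, 7.9, 14.6])
mu_0 = 3.0

# Option 1: log-transform, then run the t-test on the log scale
# (the hypothesis is now about location on the log scale)
t_stat, p_log = stats.ttest_1samp(np.log(data), popmean=np.log(mu_0))

# Option 2: Wilcoxon signed-rank test on differences from mu_0
# (tests the median rather than the mean; no normality assumption)
w_stat, p_wilcoxon = stats.wilcoxon(data - mu_0)

print(f"t-test on log scale:  p = {p_log:.3f}")
print(f"Wilcoxon signed-rank: p = {p_wilcoxon:.3f}")
```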
15.7 Robustness: How Sturdy Is the t-Test?
Concept 2: Robustness
A statistical procedure is robust if it gives approximately correct results even when its assumptions are not perfectly met. The one-sample t-test is remarkably robust to violations of normality, especially for larger sample sizes.
Robustness is one of the reasons the t-test is so widely used. It doesn't just work when conditions are perfect — it works surprisingly well when they're approximately met. But "robust" doesn't mean "bulletproof." Let's be precise about when the t-test holds up and when it breaks down.
What the t-Test Is Robust To
Moderate skewness (especially with $n \geq 30$): The CLT smooths out moderate skewness in the sampling distribution. Even if the population is moderately right-skewed (like income or wait times), the distribution of $\bar{x}$ will be close enough to normal for the t-test to work well.
Light-tailed distributions: If the population has lighter tails than the normal (like the uniform distribution), the t-test is actually conservative — the true Type I error rate is less than $\alpha$. You're less likely to make a false positive error than you think.
Bimodal distributions (with moderate $n$): Symmetric bimodal distributions don't cause much trouble for the t-test because the CLT still applies. The sampling distribution of $\bar{x}$ converges to normal regardless of bimodality.
What the t-Test Is NOT Robust To
Extreme outliers: A single extreme outlier can dramatically shift $\bar{x}$ and inflate $s$, distorting the test statistic. This is a problem at any sample size, though it's worse for small samples. Always check for outliers.
Heavy-tailed distributions with small $n$: If the population has much heavier tails than the normal (e.g., Cauchy-like distributions), the t-test can be unreliable even for moderate sample sizes because extreme values pull the mean and inflate the variance.
Strong skewness with small $n$: As the simulation table in Section 15.6 showed, strong skewness with $n < 15$ can push the actual Type I error rate well above $\alpha$.
A Practical Robustness Checklist
Before relying on a t-test result, ask:
| Question | If Yes... | If No... |
|---|---|---|
| Is $n \geq 30$? | You're probably fine for most population shapes | Check the data for skewness and outliers |
| Is the histogram roughly symmetric? | Good — normality isn't a concern | Consider how severe the skewness is |
| Are there extreme outliers? | Investigate them — they can wreck the t-test at any $n$ | Good — outliers aren't a concern |
| Is the test's conclusion borderline ($p$ near $\alpha$)? | Be cautious — results might change with different methods | A clear result is likely trustworthy |
The Bottom Line on Robustness: The t-test is like a sturdy pickup truck. It can handle rough roads, muddy conditions, and heavy loads — but if you drive it into a lake, it's going to have problems. For most real-world data with $n \geq 30$ and no extreme outliers, the t-test performs beautifully. For small samples from clearly non-normal populations, consider alternatives.
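To make the outlier warning concrete, here's a small demonstration (my own illustration) of a single extreme value flipping a test's conclusion:

```python
import numpy as np
from scipy import stats

# A tidy sample testing H0: mu = 50 (data made up for illustration)
clean = np.array([52, 54, 55, 56, 57, 58, 59, 60, 61, 63])
t_clean, p_clean = stats.ttest_1samp(clean, popmean=50)

# The same sample plus one extreme outlier
with_outlier = np.append(clean, 150)
t_out, p_out = stats.ttest_1samp(with_outlier, popmean=50)

print(f"Without outlier: t = {t_clean:.2f}, p = {p_clean:.4f}")
print(f"With outlier:    t = {t_out:.2f}, p = {p_out:.4f}")
```

Even though the outlier pushes the sample mean further above 50, it inflates $s$ so much that the t-statistic collapses and the result is no longer significant at the 0.05 level.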
15.8 Complete Worked Example 2: Alex's Watch Time Analysis
Let's work through another complete example, this time with raw data and a two-tailed test.
The Scenario
Alex Rivera is analyzing watch times on StreamVibe. The industry benchmark for average watch time per session on streaming platforms is 45 minutes. Alex's team wants to know whether their platform's average watch time is different from the industry standard — either higher or lower.
Alex pulls a random sample of 25 viewing sessions from the past week:
38, 52, 41, 47, 55, 39, 48, 44, 50, 42,
61, 35, 46, 53, 40, 49, 43, 57, 37, 51,
45, 48, 54, 36, 44
Step 1: State the Hypotheses
Alex wants to detect a difference in either direction, so this is two-tailed:
$$H_0: \mu = 45 \text{ minutes}$$ $$H_a: \mu \neq 45 \text{ minutes}$$
Step 2: Check the Conditions
Randomness: Alex drew a random sample of sessions. ✓
Independence: 25 sessions is far less than 10% of all streaming sessions. ✓
Normality: With $n = 25$, we're in the "moderate" zone ($15 \leq n < 30$). We need to check for strong skewness and outliers. Let's compute summary statistics and look at the data.
import numpy as np
from scipy import stats
data = np.array([38, 52, 41, 47, 55, 39, 48, 44, 50, 42,
61, 35, 46, 53, 40, 49, 43, 57, 37, 51,
45, 48, 54, 36, 44])
print(f"n = {len(data)}")
print(f"Mean = {np.mean(data):.2f}")
print(f"Median = {np.median(data)}")
print(f"Std Dev = {np.std(data, ddof=1):.2f}")
print(f"Min = {np.min(data)}, Max = {np.max(data)}")
print(f"Skewness = {stats.skew(data):.3f}")
Output:
n = 25
Mean = 45.60
Median = 45.0
Std Dev = 6.87
Min = 35, Max = 61
Skewness = 0.189
The mean and median are very close (45.60 vs. 45.0), and the skewness is small (0.189). The data range from 35 to 61, with no extreme outliers. The Shapiro-Wilk test:
stat, p_shapiro = stats.shapiro(data)
print(f"Shapiro-Wilk: W = {stat:.4f}, p = {p_shapiro:.4f}")
Output:
Shapiro-Wilk: W = 0.9746, p = 0.7617
The Shapiro-Wilk test shows no evidence of non-normality ($p = 0.76$). Combined with the near-zero skewness and no outliers, the normality condition is satisfied. ✓
Step 3: Compute the Test Statistic
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{45.60 - 45}{6.87 / \sqrt{25}} = \frac{0.60}{6.87 / 5} = \frac{0.60}{1.374} = 0.437$$
The sample mean is only 0.437 standard errors from the hypothesized value. That doesn't seem very far.
Step 4: Find the P-Value
Using the t-distribution with $df = 24$:
$$p\text{-value} = 2 \times P(T_{24} \geq |0.437|) = 2 \times P(T_{24} \geq 0.437)$$
t_stat = 0.437
df = 24
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(f"t = {t_stat:.3f}, df = {df}, p-value = {p_value:.4f}")
Output:
t = 0.437, df = 24, p-value = 0.6663
Step 5: State the Conclusion in Context
At $\alpha = 0.05$: Since $p = 0.666 > 0.05$, we fail to reject $H_0$.
Conclusion: There is not sufficient evidence at the 0.05 level to conclude that StreamVibe's average watch time differs from the industry benchmark of 45 minutes. The sample mean of 45.6 minutes is well within the range of normal sampling variation.
The Full Python Solution (Using scipy.stats.ttest_1samp)
Here's how to do the entire analysis using SciPy's built-in function:
import numpy as np
from scipy import stats
# Raw data
data = np.array([38, 52, 41, 47, 55, 39, 48, 44, 50, 42,
61, 35, 46, 53, 40, 49, 43, 57, 37, 51,
45, 48, 54, 36, 44])
# One-sample t-test
t_stat, p_value = stats.ttest_1samp(data, popmean=45)
print(f"One-sample t-test:")
print(f" t-statistic: {t_stat:.4f}")
print(f" p-value (two-tailed): {p_value:.4f}")
print(f" df: {len(data) - 1}")
print(f" Sample mean: {np.mean(data):.2f}")
print(f" Sample std: {np.std(data, ddof=1):.2f}")
# Confidence interval
confidence = 0.95
t_star = stats.t.ppf((1 + confidence) / 2, df=len(data) - 1)
margin = t_star * np.std(data, ddof=1) / np.sqrt(len(data))
ci = (np.mean(data) - margin, np.mean(data) + margin)
print(f"\n 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f" Does CI contain 45? {'Yes' if ci[0] <= 45 <= ci[1] else 'No'}")
# Decision
alpha = 0.05
if p_value <= alpha:
    print(f"\n Decision: Reject H₀ (p = {p_value:.4f} ≤ {alpha})")
else:
    print(f"\n Decision: Fail to reject H₀ (p = {p_value:.4f} > {alpha})")
Output:
One-sample t-test:
t-statistic: 0.4367
p-value (two-tailed): 0.6663
df: 24
Sample mean: 45.60
Sample std: 6.87
95% CI: (42.76, 48.44)
Does CI contain 45? Yes
Decision: Fail to reject H₀ (p = 0.6663 > 0.05)
Notice that the 95% CI (42.76, 48.44) contains the hypothesized value of 45, exactly as the failure to reject $H_0$ predicts. This is the CI-test duality in action: a two-sided test at $\alpha = 0.05$ rejects $H_0$ precisely when $\mu_0$ falls outside the 95% confidence interval.
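As a quick numerical check of this duality, here is a sketch (reusing Alex's session data from the listing above) that builds the 95% CI once and then tests several hypothesized means against it:

```python
import numpy as np
from scipy import stats

# Alex's 25 watch-time sessions (raw data from the worked example above)
data = np.array([38, 52, 41, 47, 55, 39, 48, 44, 50, 42,
                 61, 35, 46, 53, 40, 49, 43, 57, 37, 51,
                 45, 48, 54, 36, 44])

# Build the 95% confidence interval once
n, x_bar, s = len(data), data.mean(), data.std(ddof=1)
t_star = stats.t.ppf(0.975, df=n - 1)
margin = t_star * s / np.sqrt(n)
ci = (x_bar - margin, x_bar + margin)
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")

# A two-sided test at alpha = 0.05 rejects mu_0 exactly when
# mu_0 falls outside that interval -- try values on both sides
for mu_0 in [ci[0] - 2, ci[0] + 2, ci[1] + 2]:
    p = stats.ttest_1samp(data, popmean=mu_0).pvalue
    inside = ci[0] <= mu_0 <= ci[1]
    print(f"mu_0 = {mu_0:5.2f} ({'inside ' if inside else 'outside'} CI): "
          f"p = {p:.4f} -> {'fail to reject' if p > 0.05 else 'reject'}")
```

Every value inside the interval fails to reject, and every value outside rejects; the two procedures are two views of the same calculation.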
Alex's Takeaway: "StreamVibe's watch time is right in line with the industry average. That's actually good news — it means our platform isn't underperforming. Now let's see if the new recommendation algorithm changes things." (That comparison will come in Chapter 16.)
15.9 Complete Worked Example 3: Sam's Points Per Game
The Scenario
Sam Okafor is analyzing the Riverside Raptors' performance. The team averaged 105.2 points per game last season. After a midseason coaching change, Sam wants to know if the team's scoring has changed. He collects data from $n = 18$ games played under the new coach:
$$\bar{x} = 110.8 \text{ points}, \quad s = 11.3 \text{ points}$$
Sam has also checked the histogram and QQ-plot of the 18 games: the data are roughly symmetric with no extreme outliers.
Step 1: State the Hypotheses
Sam wants to detect any change (up or down), so this is two-tailed:
$$H_0: \mu = 105.2$$ $$H_a: \mu \neq 105.2$$
Step 2: Check the Conditions
Randomness: The 18 games are the entire set of games under the new coach, not a random sample. However, we can think of them as a random sample from the hypothetical population of all games the team could play under this coaching system. This is a common and reasonable interpretation in sports analytics. ✓ (with caveat)
Independence: Game scores might not be perfectly independent (hot streaks, scheduling, fatigue), but for most practical purposes in sports analytics, this assumption is reasonable. ✓ (approximately)
Normality: With $n = 18$ (in the 15–30 range), we need the data to be free of extreme skewness and outliers. Sam reports the data are roughly symmetric. ✓
Step 3: Compute the Test Statistic
$$t = \frac{110.8 - 105.2}{11.3 / \sqrt{18}} = \frac{5.6}{11.3 / 4.243} = \frac{5.6}{2.664} = 2.103$$
Step 4: Find the P-Value
Using the t-distribution with $df = 17$:
$$p\text{-value} = 2 \times P(T_{17} \geq 2.103)$$
from scipy import stats
p_value = 2 * stats.t.sf(2.103, df=17)
print(f"p-value = {p_value:.4f}")
Output:
p-value = 0.0506
Step 5: State the Conclusion in Context
At $\alpha = 0.05$: Since $p = 0.051 > 0.05$, we fail to reject $H_0$. Just barely.
Conclusion: At the 0.05 significance level, there is not quite sufficient evidence to conclude that the Raptors' average points per game has changed under the new coach. The p-value of 0.051 is tantalizingly close to the threshold — the kind of result that makes you want to collect more data.
Sam's Reflection: "So the data suggest scoring might have increased, but we can't be confident at the 0.05 level. With only 18 games, we don't have a lot of statistical power. I wonder what the result would look like if we had a full season of data..."
This is exactly the right instinct. In Chapter 17, you'll learn about statistical power — the probability of detecting a real effect when one exists. With only 18 games and a moderate effect size, Sam's test doesn't have much power. The borderline result might reflect a real improvement that the test simply can't confirm with the available data.
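To make that preview concrete, here is a hedged sketch of the power calculation Chapter 17 will formalize, using SciPy's noncentral t-distribution (`stats.nct`). It assumes, purely for illustration, that the true mean and SD equal Sam's sample estimates (110.8 and 11.3), and it compares the 18 games Sam has with a hypothetical 82-game full season:

```python
import numpy as np
from scipy import stats

# Illustrative assumptions: true mean = 110.8, true SD = 11.3
# (taking the sample estimates at face value)
mu_0, mu_true, sigma, alpha = 105.2, 110.8, 11.3, 0.05

def power_one_sample_t(n):
    """Power of a two-sided one-sample t-test via the noncentral t."""
    df = n - 1
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    ncp = (mu_true - mu_0) / (sigma / np.sqrt(n))  # noncentrality parameter
    # Probability the test statistic lands in either rejection tail
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

power = power_one_sample_t(18)    # the 18 games Sam actually has
power2 = power_one_sample_t(82)   # a hypothetical full 82-game season
print(f"Approximate power with n = 18: {power:.2f}")
print(f"Approximate power with n = 82: {power2:.2f}")
```

With 18 games the test has roughly a coin flip's chance of detecting an improvement of this size, which is exactly why Sam's borderline p-value is unsurprising.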
The Role of Context in Borderline Results
Sam's $p = 0.051$ is a perfect example of why the $\alpha = 0.05$ threshold isn't sacred. A p-value of 0.051 and a p-value of 0.049 represent essentially the same amount of evidence — yet one crosses the threshold and the other doesn't.
The responsible interpretation: "The evidence is suggestive but not conclusive. The sample mean of 110.8 points (compared to 105.2 previously) is in the direction we'd expect if scoring improved, and the p-value is close to our threshold. More games under the new coach would help clarify whether the improvement is real."
🔄 Spaced Review 3 (from Ch.6): Sample Mean and Standard Deviation — The Inputs to t-Tests
Everything in this chapter comes down to three numbers: $\bar{x}$, $s$, and $n$ — the sample mean, the sample standard deviation, and the sample size. You learned to compute these in Chapter 6:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \quad s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
The $n - 1$ in the denominator of $s$ — which you might have wondered about in Chapter 6 — is the degrees of freedom, and it's the same $n - 1$ that determines which t-distribution to use. The connection isn't a coincidence: we divide by $n - 1$ when computing $s$ precisely because we've "used up" one degree of freedom by estimating $\bar{x}$ from the same data. This is the deep reason why the t-distribution has $n - 1$ degrees of freedom rather than $n$.
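A short simulation can make the $n - 1$ claim concrete. The sketch below (parameters chosen arbitrarily for illustration) checks that dividing by $n - 1$ gives an unbiased estimate of the variance, while dividing by $n$ comes up short by the factor $(n-1)/n$:

```python
import numpy as np

# Simulate many samples from N(10, sd = 2), so the true variance is 4
rng = np.random.default_rng(42)
true_var = 4.0
n, reps = 10, 200_000

samples = rng.normal(10, 2, size=(reps, n))
var_n1 = samples.var(axis=1, ddof=1)   # divide by n - 1 (the sample variance)
var_n = samples.var(axis=1, ddof=0)    # divide by n

print(f"True variance:            {true_var:.3f}")
print(f"Average of s^2 (ddof=1):  {var_n1.mean():.3f}")
print(f"Average dividing by n:    {var_n.mean():.3f}  (biased low)")
```

The $n$-denominator version is systematically too small because the deviations are measured from $\bar{x}$, which was itself fit to the data; the lost degree of freedom is exactly what $n - 1$ compensates for.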
15.10 z-Test vs. t-Test: The Complete Comparison
Let's put the z-test and t-test side by side so you can see exactly when to use each.
The Comparison Table
| Feature | z-Test for a Mean | t-Test for a Mean |
|---|---|---|
| Formula | $z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$ | $t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}$ |
| Distribution | Standard normal ($Z$) | Student's t with $df = n - 1$ |
| Requires | Known $\sigma$ | Only sample data ($\bar{x}$, $s$, $n$) |
| When to use | Almost never (σ rarely known) | Almost always |
| P-value source | Normal (z) table or `norm.sf()` | t-table or `t.sf()` |
| CI formula | $\bar{x} \pm z^* \cdot \sigma/\sqrt{n}$ | $\bar{x} \pm t^* \cdot s/\sqrt{n}$ |
| Effect of small $n$ | No extra penalty | Heavier tails → wider CIs, larger p-values |
| As $n \to \infty$ | N/A | Converges to z-test |
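The last two rows of the table can be verified directly. Here is a small sketch comparing 95% two-sided critical values as the degrees of freedom grow:

```python
from scipy import stats

# t critical values shrink toward the z critical value as df grows
z_star = stats.norm.ppf(0.975)
print(f"z* (normal) = {z_star:.4f}")
for df in [2, 5, 10, 30, 100, 1000]:
    t_star = stats.t.ppf(0.975, df)
    print(f"df = {df:>4}: t* = {t_star:.4f}  (excess over z*: +{t_star - z_star:.4f})")
```

The "heavier tails" penalty is dramatic at df = 2, modest by df = 30, and essentially gone by df = 1000.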
When to Use the z-Test
Honestly? Almost never. Here's the short list of situations where the z-test might be appropriate:
- **Standardized test scores.** The SAT and ACT have published population standard deviations. If you're testing whether a specific group's mean differs from the national average, you could use $\sigma$ and the z-test. But even here, the t-test gives nearly identical results.
- **Long-running industrial processes.** If a manufacturing process has been measured thousands of times and $\sigma$ is well-established, you might use the z-test. Quality control engineers sometimes do this.
- **Very large samples.** When $n > 100$, the t-distribution is virtually identical to the normal distribution, so it doesn't matter which you use.
The practical rule: Default to the t-test. If someone asks "why didn't you use a z-test?" the answer is: "Because I don't know $\sigma$, and pretending I do gives a false sense of precision."
Why do textbooks teach the z-test at all?
Two reasons. First, it's simpler: there are no degrees of freedom to worry about, and the z-table is the same for every sample size, which makes it a natural entry point (exactly how we used it in Chapter 13). Second, it lets you learn the logic of hypothesis testing before adding the extra complication of the t-distribution.
But in your career as a data analyst, you'll use the t-test far more often than the z-test. Think of the z-test as training wheels — useful for learning, but eventually you take them off.
15.11 The t-Distribution Acknowledges Our Uncertainty
Theme 4 Connection: Uncertainty Is Not Failure
The t-distribution is one of the most beautiful ideas in statistics because it quantifies our uncertainty about our uncertainty.
Think about what happens when you estimate $\sigma$ with $s$. You've already got one source of uncertainty: you don't know $\mu$, so you estimate it with $\bar{x}$. Now you've added a second source: you don't know $\sigma$ either, so you estimate it with $s$. The t-distribution is the honest acknowledgment of this double uncertainty.
The heavier tails of the t-distribution say: "Because you're estimating the spread from a finite sample, extreme t-values are more likely than extreme z-values. So we need to be more cautious — wider confidence intervals, larger p-values — until we have enough data to trust our estimate of $\sigma$."
As your sample size grows, $s$ gets closer to $\sigma$, the second source of uncertainty shrinks, and the t-distribution gradually becomes the normal distribution. It's as if the t-distribution is saying: "Show me more data, and I'll give you narrower intervals. Until then, I'm going to be honest about what we don't know."
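A small simulation (arbitrary $\sigma = 10$, chosen for illustration) shows this second source of uncertainty shrinking as $n$ grows:

```python
import numpy as np

# How far can s wander from the true sigma = 10 at different sample sizes?
rng = np.random.default_rng(0)
sigma = 10.0
widths = []

for n in [5, 25, 100, 1000]:
    # 20,000 simulated samples of size n; compute s for each
    s_values = rng.normal(0, sigma, size=(20_000, n)).std(axis=1, ddof=1)
    lo, hi = np.percentile(s_values, [2.5, 97.5])
    widths.append(hi - lo)
    print(f"n = {n:>4}: middle 95% of s values runs from {lo:.2f} to {hi:.2f}")
```

At $n = 5$, $s$ can easily be half or double the true $\sigma$; by $n = 1000$ it barely moves. The t-distribution's heavy tails are sized to match exactly this wobble.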
This is uncertainty handled with integrity. It's not failure. It's science.
15.12 A Preview: Paired Data and the Paired t-Test
Before we leave this chapter, let's preview a concept that will be central in Chapter 16.
Sometimes the data come in natural pairs. For example:
- Before/after measurements: Blood pressure measured before and after a medication
- Matched subjects: Test scores from students paired by ability level, one in each treatment group
- Repeated measurements: Reaction time of the same person under two different conditions
When data are paired, the observations within each pair are not independent — and that violates one of our conditions for the standard t-test. But there's a clever fix.
Concept 3: Paired Data (Preview)
When observations come in natural pairs, we don't analyze the raw measurements. Instead, we compute the difference within each pair and then perform a one-sample t-test on the differences. The null hypothesis becomes $H_0: \mu_d = 0$ (the mean difference is zero), and the test statistic is:
$$t = \frac{\bar{d} - 0}{s_d / \sqrt{n}}$$
where $\bar{d}$ is the mean of the differences, $s_d$ is the standard deviation of the differences, and $n$ is the number of pairs.
Quick example: Suppose Maya measures systolic blood pressure for 20 patients before and after a new treatment. She doesn't compare the two groups of 20 readings (that would ignore the pairing). Instead, she computes the change for each patient: $d_i = \text{after}_i - \text{before}_i$. Then she runs a one-sample t-test on these 20 differences to see if the mean change is significantly different from zero.
This is a powerful technique because the pairing eliminates person-to-person variability, often dramatically reducing the standard error and increasing the power of the test.
We'll develop this fully in Chapter 16, where you'll learn to choose between the paired t-test and the independent-samples t-test. For now, just recognize that the one-sample t-test you learned in this chapter is the engine that drives the paired t-test — you're just applying it to differences.
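As a preview of those mechanics, here is a sketch with invented before/after numbers (simulated, purely to show the computation) confirming that SciPy's dedicated paired test, `stats.ttest_rel`, agrees exactly with a one-sample t-test on the differences:

```python
import numpy as np
from scipy import stats

# Hypothetical before/after systolic blood pressure for 20 patients
# (simulated numbers, invented purely to illustrate the mechanics)
rng = np.random.default_rng(7)
before = rng.normal(140, 12, size=20)
after = before - rng.normal(5, 8, size=20)   # treatment drops BP ~5 on average

# Paired approach: one-sample t-test on the within-pair differences
d = after - before
t_diff, p_diff = stats.ttest_1samp(d, popmean=0)

# SciPy's dedicated paired test is the same computation under the hood
t_rel, p_rel = stats.ttest_rel(after, before)

print(f"ttest_1samp on differences: t = {t_diff:.4f}, p = {p_diff:.4f}")
print(f"ttest_rel (paired test):    t = {t_rel:.4f}, p = {p_rel:.4f}")
```

The two calls print identical statistics: the paired t-test really is just the one-sample t-test applied to differences.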
15.13 Python: Complete t-Test Toolkit
Here's a comprehensive Python reference for one-sample t-tests and confidence intervals.
One-Sample t-Test with scipy.stats.ttest_1samp()
import numpy as np
from scipy import stats
# ------ Example 1: Maya's wait times (raw data) ------
wait_times = np.array([
285, 230, 310, 245, 270, 220, 255, 290, 265, 240,
300, 250, 275, 235, 260, 295, 225, 280, 250, 310,
245, 270, 255, 230, 285, 240, 295, 260, 275, 250,
305, 265, 280, 235, 290, 255
])
# Test: Is the mean wait time greater than 240 minutes?
t_stat, p_two = stats.ttest_1samp(wait_times, popmean=240)
p_right = p_two / 2 if t_stat > 0 else 1 - p_two / 2 # one-tailed (right)
print("=== Maya's Wait Time Analysis ===")
print(f"n = {len(wait_times)}")
print(f"x̄ = {np.mean(wait_times):.1f} minutes")
print(f"s = {np.std(wait_times, ddof=1):.1f} minutes")
print(f"t = {t_stat:.4f}")
print(f"p-value (two-tailed) = {p_two:.4f}")
print(f"p-value (one-tailed, H_a: μ > 240) = {p_right:.4f}")
print(f"Decision at α = 0.05: {'Reject H₀' if p_right < 0.05 else 'Fail to reject H₀'}")
# --- Using the 'alternative' parameter (SciPy ≥ 1.7) ---
# This is the cleaner approach:
result = stats.ttest_1samp(wait_times, popmean=240, alternative='greater')
print(f"\nUsing alternative='greater':")
print(f"t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
Computing t-Tests from Summary Statistics
from scipy import stats
import numpy as np
def t_test_from_summary(x_bar, s, n, mu_0, alternative='two-sided'):
    """
    One-sample t-test from summary statistics.

    Parameters
    ----------
    x_bar : float - sample mean
    s : float - sample standard deviation
    n : int - sample size
    mu_0 : float - hypothesized population mean
    alternative : str - 'two-sided', 'greater', or 'less'

    Returns
    -------
    dict with t-statistic, p-value, df, and 95% CI
    """
    df = n - 1
    se = s / np.sqrt(n)
    t_stat = (x_bar - mu_0) / se

    if alternative == 'two-sided':
        p_value = 2 * stats.t.sf(abs(t_stat), df)
    elif alternative == 'greater':
        p_value = stats.t.sf(t_stat, df)
    elif alternative == 'less':
        p_value = stats.t.cdf(t_stat, df)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

    # 95% confidence interval (always two-sided)
    t_star = stats.t.ppf(0.975, df)
    ci_lower = x_bar - t_star * se
    ci_upper = x_bar + t_star * se

    return {
        't_statistic': t_stat,
        'p_value': p_value,
        'df': df,
        'se': se,
        'ci_95': (ci_lower, ci_upper),
        'alternative': alternative
    }
# Example: Sam's scoring data
result = t_test_from_summary(
x_bar=110.8, s=11.3, n=18,
mu_0=105.2, alternative='two-sided'
)
print("=== Sam's Scoring Analysis ===")
print(f"t = {result['t_statistic']:.4f}")
print(f"df = {result['df']}")
print(f"p-value ({result['alternative']}) = {result['p_value']:.4f}")
print(f"95% CI: ({result['ci_95'][0]:.2f}, {result['ci_95'][1]:.2f})")
print(f"Decision at α = 0.05: {'Reject H₀' if result['p_value'] < 0.05 else 'Fail to reject H₀'}")
Output:
=== Sam's Scoring Analysis ===
t = 2.1027
df = 17
p-value (two-sided) = 0.0506
95% CI: (105.18, 116.42)
Decision at α = 0.05: Fail to reject H₀
Visualizing the t-Test
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
def plot_t_test(t_stat, df, alternative='two-sided', alpha=0.05):
    """Visualize a t-test with shaded rejection region and p-value."""
    x = np.linspace(-4, 4, 1000)
    y = stats.t.pdf(x, df)

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(x, y, 'b-', linewidth=2, label=f't-distribution (df={df})')

    # Shade p-value region
    if alternative == 'greater':
        x_fill = x[x >= t_stat]
        ax.fill_between(x_fill, stats.t.pdf(x_fill, df),
                        alpha=0.3, color='red', label='p-value region')
    elif alternative == 'less':
        x_fill = x[x <= t_stat]
        ax.fill_between(x_fill, stats.t.pdf(x_fill, df),
                        alpha=0.3, color='red', label='p-value region')
    else:  # two-sided
        x_right = x[x >= abs(t_stat)]
        x_left = x[x <= -abs(t_stat)]
        ax.fill_between(x_right, stats.t.pdf(x_right, df),
                        alpha=0.3, color='red', label='p-value region')
        ax.fill_between(x_left, stats.t.pdf(x_left, df),
                        alpha=0.3, color='red')

    # Mark test statistic
    ax.axvline(t_stat, color='red', linestyle='--', linewidth=1.5,
               label=f't = {t_stat:.3f}')
    ax.set_xlabel('t-value')
    ax.set_ylabel('Density')
    ax.set_title(f'One-Sample t-Test ({alternative})')
    ax.legend()
    plt.tight_layout()
    plt.show()
# Visualize Maya's test (right-tailed)
plot_t_test(t_stat=2.40, df=35, alternative='greater')
# Visualize Alex's test (two-tailed)
plot_t_test(t_stat=0.437, df=24, alternative='two-sided')
15.14 Excel: t-Test and Confidence Interval Reference
One-Sample t-Test in Excel
Excel doesn't have a built-in one-sample t-test function, but you can compute everything with these formulas:
| What You Need | Excel Formula | Notes |
|---|---|---|
| Mean | `=AVERAGE(A2:A37)` | |
| Std Dev ($s$) | `=STDEV.S(A2:A37)` | Use `.S` for sample SD (not `.P`) |
| Count ($n$) | `=COUNT(A2:A37)` | |
| SE | `=STDEV.S(A2:A37)/SQRT(COUNT(A2:A37))` | $s/\sqrt{n}$ |
| $t$-statistic | `=(AVERAGE(A2:A37)-mu_0)/SE` | Replace `mu_0` with your value |
| $p$-value (two-tailed) | `=T.DIST.2T(ABS(t), df)` | Always use `ABS(t)` |
| $p$-value (right-tailed) | `=T.DIST.RT(t, df)` | For $H_a: \mu > \mu_0$ |
| $p$-value (left-tailed) | `=T.DIST(t, df, TRUE)` | For $H_a: \mu < \mu_0$ |
| $t^*$ (95% CI) | `=T.INV.2T(0.05, df)` | Two-tailed critical value |
| Margin of error | `=T.INV.2T(0.05, df)*STDEV.S(A2:A37)/SQRT(COUNT(A2:A37))` | |
| CI lower | `=AVERAGE(A2:A37) - Margin` | |
| CI upper | `=AVERAGE(A2:A37) + Margin` | |
Key Excel Functions
| Function | What It Does | Example |
|---|---|---|
| `T.DIST(t, df, TRUE)` | Left-tail area: $P(T \leq t)$ | `=T.DIST(2.40, 35, TRUE)` → 0.9891 |
| `T.DIST.RT(t, df)` | Right-tail area: $P(T \geq t)$ | `=T.DIST.RT(2.40, 35)` → 0.0109 |
| `T.DIST.2T(t, df)` | Two-tailed area: $2 \times P(T \geq \vert t \vert)$ | `=T.DIST.2T(2.40, 35)` → 0.0218 |
| `T.INV(prob, df)` | Inverse t: left-tail critical value | `=T.INV(0.95, 35)` → 1.690 |
| `T.INV.2T(alpha, df)` | Inverse t: two-tailed critical value | `=T.INV.2T(0.05, 35)` → 2.030 |
| `CONFIDENCE.T(alpha, s, n)` | Margin of error for CI | `=CONFIDENCE.T(0.05, 45, 36)` → 15.23 |
> **Excel Tip:** The `T.TEST` function (formerly `TTEST`) is designed for two-sample tests. For a one-sample t-test, compute the t-statistic manually and use `T.DIST.2T` or `T.DIST.RT` for the p-value.
15.15 Putting It All Together: A Decision Flowchart
Here's a flowchart for deciding which test to use when testing a claim about a population mean:
            Testing a claim about μ?
                       │
                       ▼
                  Is σ known?
                   ╱       ╲
                 Yes        No
                  │          │
                  ▼          ▼
            Use z-test   Use t-test
             (rare!)   (the usual case)
                  │          │
                  ▼          ▼
            z = (x̄-μ₀)  t = (x̄-μ₀)
              /(σ/√n)     /(s/√n)
                  │          │
                  ▼          ▼
            Compare to   Compare to
         z-distribution  t-distribution
                         (df = n-1)
                              │
                              ▼
                     Check conditions:
                     1. Random sample?  ✓/✗
                     2. Independent?    ✓/✗
                     3. Normality?
                        ╱     │     ╲
                     n<15   15-30   n≥30
                       │      │       │
                     Need   Check    CLT
                     near-  for      handles
                  normality outliers  most shapes
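For practice, the normality branch of the flowchart can be written as a small helper. The function below is hypothetical (name and wording are ours), and no function replaces actually looking at the histogram and QQ-plot:

```python
def normality_guidance(n, skewed=False, outliers=False):
    """Map the flowchart's normality branch to rough advice.

    A hypothetical helper: judgment and plots still matter more than rules.
    """
    if n >= 30:
        if outliers:
            return "n >= 30: CLT helps, but investigate extreme outliers first"
        return "n >= 30: CLT handles most shapes"
    if n >= 15:
        if skewed or outliers:
            return "15 <= n < 30: caution -- strong skew or outliers undermine the t-test"
        return "15 <= n < 30: OK, no strong skew or outliers reported"
    if skewed or outliers:
        return "n < 15: t-test not advisable; go nonparametric or collect more data"
    return "n < 15: need data that look close to normal"

print(normality_guidance(40))
print(normality_guidance(18))
print(normality_guidance(10, skewed=True))
```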
15.16 Progressive Project: Conduct t-Tests on Numerical Variables
It's time to apply the one-sample t-test to your own dataset.
Your Task
1. **Identify a numerical variable** in your dataset that you can test against a meaningful reference value. Some ideas:
   - Average income compared to a regional or national figure
   - Average life expectancy compared to a WHO benchmark
   - Average BMI compared to the "normal" threshold
   - Average temperature compared to a historical average
   - Average tuition compared to a national average
2. **State your hypotheses.** Decide whether a one-tailed or two-tailed test is appropriate based on your research question.
3. **Check the conditions.** Create a histogram and QQ-plot. Compute the Shapiro-Wilk test. Assess whether the t-test is appropriate for your data.
4. **Conduct the t-test.** Compute the test statistic, p-value, and a 95% confidence interval. Use `scipy.stats.ttest_1samp()`.
5. **Interpret your results.** Write a 2-3 sentence conclusion in context. Does the confidence interval tell you anything the hypothesis test doesn't?
Template Code
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
# Load your data
df = pd.read_csv('your_dataset.csv')
variable = df['your_numerical_variable'].dropna()
# Set your reference value
mu_0 = ___ # Fill in the benchmark/reference value
# ----- Condition Checks -----
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Histogram
axes[0].hist(variable, bins=15, edgecolor='black', alpha=0.7)
axes[0].axvline(mu_0, color='red', linestyle='--', label=f'μ₀ = {mu_0}')
axes[0].set_title('Distribution of Your Variable')
axes[0].legend()
# Boxplot
axes[1].boxplot(variable, vert=True)
axes[1].set_title('Boxplot (check for outliers)')
# QQ-plot
stats.probplot(variable, dist="norm", plot=axes[2])
axes[2].set_title('QQ-Plot (check for normality)')
plt.tight_layout()
plt.show()
# Shapiro-Wilk
w_stat, p_shapiro = stats.shapiro(variable)
print(f"Shapiro-Wilk: W = {w_stat:.4f}, p = {p_shapiro:.4f}")
# ----- t-Test -----
n = len(variable)
x_bar = np.mean(variable)
s = np.std(variable, ddof=1)
print(f"\nSummary Statistics:")
print(f" n = {n}")
print(f" x̄ = {x_bar:.4f}")
print(f" s = {s:.4f}")
# Conduct the test
t_stat, p_value = stats.ttest_1samp(variable, popmean=mu_0)
# For one-tailed, use: alternative='greater' or 'less'
print(f"\nOne-Sample t-Test:")
print(f" H₀: μ = {mu_0}")
print(f" t = {t_stat:.4f}")
print(f" df = {n - 1}")
print(f" p-value = {p_value:.4f}")
# Confidence interval
t_star = stats.t.ppf(0.975, df=n-1)
margin = t_star * s / np.sqrt(n)
print(f"\n95% CI: ({x_bar - margin:.4f}, {x_bar + margin:.4f})")
print(f"Does CI contain μ₀ = {mu_0}? {x_bar - margin <= mu_0 <= x_bar + margin}")
What to Write in Your Notebook
Add a new section titled "Chapter 15: One-Sample t-Tests" to your Data Detective Portfolio. Include:
- Your research question and hypotheses
- Condition check plots with brief assessment
- t-test results with full interpretation
- Confidence interval with interpretation
- A 2-3 sentence reflection: What did you learn about your variable?
15.17 Common Mistakes and How to Avoid Them
| Mistake | Why It's Wrong | What to Do Instead |
|---|---|---|
| Using the z-test when $\sigma$ is unknown | Understates uncertainty → p-values too small → too many false rejections | Use the t-test (default choice) |
| Ignoring outliers | Outliers distort $\bar{x}$ and $s$, making the t-test unreliable | Always check histograms/boxplots first |
| Applying the t-test to tiny samples from skewed populations | With $n < 15$, the t-test requires approximate normality | Use nonparametric methods or collect more data |
| Confusing one-tailed and two-tailed p-values | A one-tailed p-value is half the two-tailed p-value (when the test statistic is in the right direction) | Decide the direction before seeing the data |
| Reporting "$p < 0.05$, therefore the effect is large" | p-values measure evidence, not effect size | Report the confidence interval — it shows both direction and magnitude |
| Using `STDEV.P` in Excel | `.P` divides by $n$ (population SD); the t-test needs $s$, which divides by $n-1$ | Use `STDEV.S` for sample data |
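The first row of the table can be quantified with a simulation. The sketch below generates many small samples where $H_0$ is true and counts how often the z cutoff (1.96, which pretends $s$ is $\sigma$) rejects, compared with the honest t cutoff:

```python
import numpy as np
from scipy import stats

# Simulate samples where H0 (mu = 0) is TRUE and count false rejections
rng = np.random.default_rng(1)
n, reps, alpha = 5, 100_000, 0.05

samples = rng.normal(0, 1, size=(reps, n))
t_stats = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

z_cut = stats.norm.ppf(0.975)        # 1.96: pretends s is sigma
t_cut = stats.t.ppf(0.975, n - 1)    # 2.776: honest about estimating sigma

rate_z = np.mean(np.abs(t_stats) > z_cut)
rate_t = np.mean(np.abs(t_stats) > t_cut)
print(f"False rejection rate using z cutoff: {rate_z:.3f}")
print(f"False rejection rate using t cutoff: {rate_t:.3f}")
```

With $n = 5$, the z cutoff rejects true null hypotheses at well over twice the advertised 5% rate, while the t cutoff stays close to 5%. This is exactly the "false sense of precision" the table warns about.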
15.18 Chapter Summary
Let's step back and see what we've accomplished.
You now know how to test claims about a population mean using the one-sample t-test — the most commonly used statistical test in practice. You know when to use it (when $\sigma$ is unknown — almost always), how to check its conditions (randomness, independence, normality), and how robust it is to violations of those conditions (very robust for $n \geq 30$, less so for small samples from skewed populations).
Here's the key insight one more time: the t-distribution is the honest version of the normal distribution. When you don't know $\sigma$ and have to estimate it with $s$, the t-distribution acknowledges that extra uncertainty with heavier tails. As your sample grows and $s$ converges to $\sigma$, the t-distribution relaxes into the normal distribution. It's uncertainty management at its finest.
The tools you've learned in this chapter — the one-sample t-test, confidence intervals for means, condition checking, robustness assessment — form the foundation for everything in the next several chapters. In Chapter 16, you'll extend these ideas to comparing two groups, including the paired t-test (which is just a one-sample t-test applied to differences) and the independent-samples t-test. In Chapter 17, you'll learn about power and effect sizes, which will help you answer Sam's lingering question: "If the scoring really did improve, how much data would I need to detect it?"
What's Next: Chapter 16 will introduce comparing two groups — the two-sample t-test, the paired t-test, and the two-proportion z-test. You'll finally be able to answer Alex's big question: "Did the new recommendation algorithm actually increase watch time compared to the old one?" And Professor Washington's: "Is the algorithm's false positive rate different for different racial groups?" The one-sample t-test you just learned is the engine that powers these more complex comparisons.
"The best thing about being a statistician is that you get to play in everyone's backyard." — John Tukey