Learning Objectives
- Explain why comparing multiple groups requires ANOVA instead of multiple t-tests
- Conduct and interpret a one-way ANOVA
- Verify ANOVA assumptions (normality, equal variances, independence)
- Perform and interpret post-hoc pairwise comparisons (Tukey's HSD)
- Report ANOVA results with effect sizes (eta-squared)
In This Chapter
- Chapter Overview
- 20.1 A Puzzle Before We Start (Productive Struggle)
- 20.2 Why Not Multiple t-Tests? The Multiple Comparisons Problem
- 20.3 ANOVA: Comparing Variability, Not Means
- 20.4 The Mathematics of ANOVA: Sums of Squares
- 20.5 Mean Squares and the F-Statistic
- 20.6 The F-Distribution
- 20.7 Putting It All Together: The ANOVA Table
- 20.8 Post-Hoc Tests: Which Groups Differ?
- 20.9 Checking ANOVA Assumptions
- 20.10 Effect Size: Eta-Squared ($\eta^2$)
- 20.11 Alex's Analysis: Watch Time Across Subscription Tiers
- 20.12 Sam's Analysis: Scoring Across Opponents
- 20.13 Common Mistakes and Misconceptions
- 20.14 Reporting ANOVA Results
- 20.15 The Complete ANOVA Decision Flowchart
- 20.16 Progressive Project: ANOVA in Your Dataset
- 20.17 Chapter Summary
Chapter 20: Analysis of Variance (ANOVA)
"The analysis of variance is not a mathematical theorem, but rather a convenient method of arranging the arithmetic." — Ronald A. Fisher
Chapter Overview
You know how to compare two groups. You learned that in Chapter 16 — the two-sample t-test, the paired t-test, the two-proportion z-test. And those tools are powerful.
But the real world doesn't always cooperate with two groups.
Dr. Maya Chen isn't comparing two communities anymore. She's evaluating four different public health intervention programs — a vaccination-focused campaign, a nutrition education program, a community fitness initiative, and a standard-care control — and she wants to know: do the programs produce different health outcomes? That's not a two-group question. It's a four-group question.
Alex Rivera has three subscription tiers at StreamVibe — Free, Standard, and Premium — and wants to know whether average watch time differs across them. Not Free vs. Standard. Not Standard vs. Premium. All three, simultaneously.
Sam Okafor is studying the Riverside Raptors' performance against their five conference opponents, and he wants to know: does Daria's scoring average depend on who she's playing against? That's a five-group comparison.
Your first instinct might be: "I'll just do a bunch of t-tests." Compare Program A to Program B, then A to C, then A to D, then B to C, and so on. If any of those t-tests comes back significant, you've found a difference. Right?
Wrong. And understanding why it's wrong is one of the most important lessons in this entire course. It connects directly to the multiple comparisons problem you first encountered in Chapter 17, and it's the reason we need an entirely new tool: analysis of variance, or ANOVA.
ANOVA asks a brilliantly simple question: is there more variability between the groups than you'd expect from the variability within the groups? If the answer is yes, then at least one group is different from the others. And the way ANOVA answers that question — by decomposing the total variability in your data into explained and unexplained components — introduces one of the most powerful ideas in all of statistics.
In this chapter, you will learn to:
- Explain why comparing multiple groups requires ANOVA instead of multiple t-tests
- Conduct and interpret a one-way ANOVA
- Verify ANOVA assumptions (normality, equal variances, independence)
- Perform and interpret post-hoc pairwise comparisons (Tukey's HSD)
- Report ANOVA results with effect sizes (eta-squared)
Fast Track: If you've encountered ANOVA before, skim Sections 20.1–20.3, then jump to Section 20.9 (assumptions and diagnostics). Complete quiz questions 1, 10, and 18 to verify.
Deep Dive: After this chapter, read Case Study 1 (Maya's public health intervention comparison) for a complete worked application with post-hoc analysis, then Case Study 2 (Sam's opponent-by-opponent scoring analysis) for a sports analytics perspective. Both include full Python code.
20.1 A Puzzle Before We Start (Productive Struggle)
Before we jump into formulas, try this thought experiment.
The Coffee Shop Experiment
A coffee chain wants to test whether the type of background music affects customer spending. They randomly assign 60 customers to three groups:
- Group 1 (Classical): 20 customers shop while classical music plays
- Group 2 (Pop): 20 customers shop while pop music plays
- Group 3 (No Music): 20 customers shop in silence
They record how much each customer spends. Here are the group means:
- Classical: \$5.40
- Pop: \$4.80
- No Music: \$4.90
(a) How many two-sample t-tests would you need to compare every pair of groups? Write them all out.
(b) If each t-test uses $\alpha = 0.05$, and there are actually NO differences between the groups (the music doesn't matter), what's the probability that at least one t-test comes back significant by pure chance? (Hint: it's not 5%.)
(c) Imagine the within-group variability is very large — customers in the Classical group spent anywhere from \$1.00 to \$10.00. Now imagine the within-group variability is very small — all Classical customers spent between \$5.20 and \$5.60. In which scenario are the group mean differences more convincing? Why?
(d) In your own words, what's the relationship between within-group variability and your ability to detect between-group differences?
Take 3 minutes. Parts (c) and (d) are the key insight of the entire chapter.
Here's what I hope you noticed:
For part (a), you need three t-tests: Classical vs. Pop, Classical vs. No Music, and Pop vs. No Music. That's $\binom{3}{2} = 3$ pairwise comparisons. With four groups, you'd need $\binom{4}{2} = 6$. With five groups, $\binom{5}{2} = 10$. It scales up fast.
Part (b) is the critical insight. If each test uses $\alpha = 0.05$, the probability of a Type I error on any single test is 5%. But with three independent tests, the probability that at least one gives a false positive is:
$$P(\text{at least one false positive}) = 1 - (1 - 0.05)^3 = 1 - 0.95^3 = 1 - 0.857 = 0.143$$
That's about 14.3% — nearly three times the 5% you intended. Run 10 tests, and the probability jumps to $1 - 0.95^{10} = 0.40$. Run 20 tests, and it's $1 - 0.95^{20} = 0.64$. The more tests you run, the more likely you are to "find" a difference that isn't really there.
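The part (b) calculation can be checked by simulation. One caveat: the three t-tests share groups, so they are not fully independent, and the simulated rate typically lands slightly below the 14.3% independence approximation, though still far above the intended 5%. A minimal sketch (the population mean, SD, and simulation settings are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group, alpha = 5000, 20, 0.05
false_positives = 0

for _ in range(n_sims):
    # Three groups drawn from the SAME population, so any "significant"
    # t-test is a false positive by construction
    g1, g2, g3 = rng.normal(5.0, 2.0, (3, n_per_group))
    pvals = [stats.ttest_ind(a, b).pvalue
             for a, b in [(g1, g2), (g1, g3), (g2, g3)]]
    if min(pvals) < alpha:
        false_positives += 1

print(f"Family-wise false positive rate: {false_positives / n_sims:.3f}")
```

Run this and the family-wise rate comes out well above the 5% you intended for any single test.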
Parts (c) and (d) capture the core idea behind ANOVA. When within-group variability is small (customers in each group spend similar amounts), even modest between-group differences stand out. When within-group variability is large (customers in each group spend wildly different amounts), the same between-group differences are drowned in noise. The signal (between-group differences) only matters relative to the noise (within-group variability). That ratio — signal to noise — is exactly what the ANOVA F-statistic measures.
You've just discovered two of the three big ideas in this chapter: (1) multiple t-tests inflate false positives, and (2) detecting group differences requires comparing between-group variability to within-group variability. The third big idea — that total variability can be decomposed into these two components — is coming in Section 20.4.
20.2 Why Not Multiple t-Tests? The Multiple Comparisons Problem
Let's make the problem from Section 20.1 concrete.
🔄 Spaced Review 1 (from Ch.16): The Two-Sample t-Test
In Chapter 16, you learned to compare two groups using the two-sample t-test:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$
The test asks whether the observed difference between two group means is larger than what random sampling variability alone could produce. At $\alpha = 0.05$, you accept a 5% chance of declaring a difference when none exists (Type I error).
That 5% guarantee is perfect — for one comparison. But what happens when you need to make several comparisons at once?
The Math of Inflated Error Rates
Suppose you have $k$ groups. The number of pairwise comparisons is:
$$\binom{k}{2} = \frac{k(k-1)}{2}$$
Here's how quickly it grows:
| Number of Groups ($k$) | Pairwise Comparisons | $P(\text{at least one false positive})$ |
|---|---|---|
| 2 | 1 | 0.050 |
| 3 | 3 | 0.143 |
| 4 | 6 | 0.265 |
| 5 | 10 | 0.401 |
| 6 | 15 | 0.537 |
| 10 | 45 | 0.901 |
Look at that last row. With 10 groups and 45 pairwise t-tests at $\alpha = 0.05$, there's a 90.1% chance of at least one false positive — even if all 10 groups are identical. You'd almost certainly "discover" a difference that doesn't exist.
This is the multiple comparisons problem: the more tests you run, the more likely you are to get a significant result by pure chance.
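The table above can be reproduced directly from the two formulas: $\binom{k}{2}$ pairwise comparisons, and $1 - (1 - \alpha)^m$ under the independence approximation.

```python
from math import comb

alpha = 0.05
print(f"{'k':>3} {'pairs':>6} {'P(>=1 false positive)':>22}")
for k in [2, 3, 4, 5, 6, 10]:
    m = comb(k, 2)                   # number of pairwise comparisons
    fwer = 1 - (1 - alpha) ** m      # independence approximation
    print(f"{k:>3} {m:>6} {fwer:>22.3f}")
```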
🔄 Spaced Review 2 (from Ch.17): Multiple Comparisons and the Family-Wise Error Rate
In Chapter 17, you first encountered this idea when we discussed p-hacking. Running 20 hypothesis tests on the same data at $\alpha = 0.05$ produces at least one false positive about 64% of the time — even if there are no real effects anywhere. The simulation in Section 17.9 demonstrated this vividly.
The probability of at least one false positive across a family of tests is called the family-wise error rate (FWER). One quick fix you learned was the Bonferroni correction: divide $\alpha$ by the number of tests. With three tests, use $\alpha = 0.05/3 = 0.0167$ for each. This controls the FWER at 0.05, but it's conservative — it makes each individual test harder to pass, which reduces power.
ANOVA offers a more elegant solution: test all groups at once in a single test with a single p-value.
The Elegant Solution: Test All Groups at Once
Instead of asking "Is Group A different from Group B?" and "Is Group B different from Group C?" and so on, ANOVA asks a single, omnibus question:
ANOVA question: Is there any difference among the group means? Or could all the observed differences be explained by random variation?
This single test controls the Type I error rate at exactly $\alpha$, no matter how many groups you have. One test. One p-value. One $\alpha$.
The null and alternative hypotheses for a one-way ANOVA with $k$ groups are:
$$H_0: \mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k$$
$$H_a: \text{Not all } \mu_i \text{ are equal (at least one group mean differs)}$$
Notice what $H_a$ says — and, just as importantly, what it doesn't say. It says that at least one group is different. It does not tell you which group, or how many groups differ, or in which direction. Finding that out requires additional analysis (post-hoc tests), which we'll get to in Section 20.8.
20.3 ANOVA: Comparing Variability, Not Means
Here's the part that surprises most students: ANOVA compares means by analyzing variability.
The name itself is a clue. ANOVA stands for Analysis of Variance. Not "analysis of means." Variance.
🔄 Spaced Review 3 (from Ch.6): Variance
In Chapter 6, you learned that the variance measures how spread out data values are from their mean:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$$
You calculated it as the average of the squared deviations from the mean. The standard deviation $s$ — the square root of variance — tells you the "typical distance" of values from the center.
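As a quick refresher, here are those Chapter 6 formulas in code (the five data values are illustrative):

```python
import numpy as np

x = np.array([4, 7, 6, 3, 5])                  # illustrative data
mean = x.mean()                                 # 5.0
s2 = ((x - mean) ** 2).sum() / (len(x) - 1)     # sample variance, n - 1 denominator
print(s2, np.var(x, ddof=1))                    # hand formula matches numpy
print(np.sqrt(s2), np.std(x, ddof=1))           # standard deviation
```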
Now we're going to take that same idea and split it in two.
The Key Insight: Two Kinds of Variability
Imagine you're looking at data from three groups. Every data point is different. But why is it different? There are two possible reasons:
1. It's in a different group. If the groups truly have different means, then some of the variability you see is because data points belong to different groups. This is between-group variability — the variability explained by group membership.
2. It just naturally varies. Even within a single group, not every data point is the same. People are different. Measurements fluctuate. This is within-group variability — the natural, unexplained noise within each group.
Total variability = Between-group variability + Within-group variability
Or, in language that will become even more important in Chapters 22–23 when we study regression:
Total variability = Explained variability + Unexplained variability
This decomposition is the threshold concept of this chapter. Let me say it again because it's that important:
🎯 Threshold Concept: Decomposing Variability
The total variation in your data can be split into two pieces: variation between the groups (explained by group membership) and variation within the groups (unexplained noise). ANOVA compares the size of these two pieces. If the between-group piece is large relative to the within-group piece, at least one group mean is genuinely different.
This isn't just an ANOVA idea. It's the fundamental principle behind regression ($R^2$ measures the proportion of variance "explained" by predictors), and it's the foundation of virtually all statistical modeling. Getting it here — really getting it — will make everything that follows easier.
Visual Intuition: When Groups Are Different vs. When They're Not
Let me paint two pictures.
Scenario 1: No real difference. Imagine three groups of 10 students each, all from the same population. Their test scores look like this:
- Group A: scattered between 60 and 90, mean = 75
- Group B: scattered between 58 and 88, mean = 74
- Group C: scattered between 62 and 92, mean = 77
The group means (74, 75, 77) are slightly different, but each group is internally spread across a 30-point range. The between-group differences (a few points) are tiny compared to the within-group spread (30 points). ANOVA would say: "The between-group variability is not much larger than what you'd expect from the within-group noise alone."
Scenario 2: Real difference. Same three groups, but now:
- Group A: clustered between 68 and 82, mean = 75
- Group B: clustered between 83 and 97, mean = 90
- Group C: clustered between 53 and 67, mean = 60
The group means (60, 75, 90) are very different, and each group is internally clustered within a 14-point range. The between-group differences (15–30 points) are enormous compared to the within-group spread (14 points). ANOVA would say: "The between-group variability is far larger than what the within-group noise can explain. Something real is going on."
The F-statistic is essentially the ratio of these two types of variability. A large F means the between-group signal is strong relative to the within-group noise. A small F means the groups look about the same.
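The two scenarios can be simulated to watch the F-ratio respond. The means and SDs below mirror the scenarios; the random draws themselves are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(20)

# Scenario 1: nearly identical means, wide within-group spread
a1, b1, c1 = (rng.normal(m, 8, 10) for m in (75, 74, 77))
# Scenario 2: well-separated means, tight within-group spread
a2, b2, c2 = (rng.normal(m, 3, 10) for m in (75, 90, 60))

F1, p1 = stats.f_oneway(a1, b1, c1)
F2, p2 = stats.f_oneway(a2, b2, c2)
print(f"Scenario 1 (noisy, similar means):   F = {F1:.2f}, p = {p1:.4f}")
print(f"Scenario 2 (tight, different means): F = {F2:.2f}, p = {p2:.6f}")
```

Scenario 2's F is far larger: the same kind of between-group signal stands out when the within-group noise shrinks.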
20.4 The Mathematics of ANOVA: Sums of Squares
Now let's formalize what we just described intuitively. This is where the arithmetic gets a bit involved, but every piece has a clear meaning. I'll walk you through it step by step.
Setting Up the Notation
We have:
- $k$ groups
- $n_i$ observations in group $i$ (for $i = 1, 2, \ldots, k$)
- $N = n_1 + n_2 + \cdots + n_k$ = total number of observations
- $x_{ij}$ = the $j$-th observation in group $i$
- $\bar{x}_i$ = the mean of group $i$
- $\bar{x}$ = the grand mean (the mean of all $N$ observations combined)
Three Sums of Squares
1. Total Sum of Squares (SST): How much total variability is there in the data?
$$SS_{\text{Total}} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2$$
This is just the familiar numerator of the variance formula from Chapter 6, applied to all $N$ observations together. Every data point's squared distance from the grand mean, added up.
2. Between-Group Sum of Squares (SSB): How much variability is due to differences among the group means?
$$SS_{\text{Between}} = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2$$
This measures how far each group mean is from the grand mean, weighted by the group size. If all group means equal the grand mean, $SS_{\text{Between}} = 0$. The more the group means differ, the larger $SS_{\text{Between}}$ gets.
3. Within-Group Sum of Squares (SSW): How much variability is just noise within each group?
$$SS_{\text{Within}} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2$$
This adds up the squared deviations within each group — how far each data point is from its own group mean. This is the variability that group membership can't explain.
The Fundamental Decomposition
Here's the beautiful part:
$$SS_{\text{Total}} = SS_{\text{Between}} + SS_{\text{Within}}$$
Total variability = Explained variability + Unexplained variability.
This isn't just a convenient approximation. It's an exact mathematical identity. Every drop of variability in your data is accounted for: some of it is explained by group membership, and the rest is leftover noise within the groups.
Plain English: Think of it like a pie chart of variability. The whole pie is $SS_{\text{Total}}$. One slice is $SS_{\text{Between}}$ (the part explained by groups), and the other slice is $SS_{\text{Within}}$ (the part that's just noise). The bigger the "explained" slice, the stronger the evidence that the groups are genuinely different.
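The identity can be verified numerically on a small made-up dataset — any numbers work, because the decomposition is exact:

```python
import numpy as np

# Three small illustrative groups
groups = [np.array([3.0, 5.0, 4.0]),
          np.array([8.0, 7.0, 9.0]),
          np.array([2.0, 1.0, 3.0])]

all_data = np.concatenate(groups)
grand_mean = all_data.mean()

ss_total = ((all_data - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

print(f"SS_Total    = {ss_total:.2f}")
print(f"SS_B + SS_W = {ss_between + ss_within:.2f}")   # always equal
```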
20.5 Mean Squares and the F-Statistic
We can't just compare $SS_{\text{Between}}$ to $SS_{\text{Within}}$ directly, because they have different numbers of terms. More groups means more terms in $SS_{\text{Between}}$. More observations means more terms in $SS_{\text{Within}}$. We need to adjust for that by dividing each sum of squares by its degrees of freedom.
Degrees of Freedom
| Component | Degrees of Freedom | Why |
|---|---|---|
| Between | $df_{\text{Between}} = k - 1$ | There are $k$ group means, but they're constrained by the grand mean; only $k - 1$ are free to vary |
| Within | $df_{\text{Within}} = N - k$ | There are $N$ observations, and we've estimated $k$ group means, leaving $N - k$ degrees of freedom |
| Total | $df_{\text{Total}} = N - 1$ | There are $N$ observations, constrained by one grand mean |
Notice: $df_{\text{Total}} = df_{\text{Between}} + df_{\text{Within}}$, just like the sums of squares decompose.
Mean Squares
Dividing each sum of squares by its degrees of freedom gives us the mean squares:
$$MS_{\text{Between}} = \frac{SS_{\text{Between}}}{k - 1}$$
$$MS_{\text{Within}} = \frac{SS_{\text{Within}}}{N - k}$$
$MS_{\text{Between}}$ is the average between-group variability per degree of freedom. $MS_{\text{Within}}$ is the average within-group variability per degree of freedom — essentially a pooled estimate of the common variance within each group.
The F-Statistic
Now we take their ratio:
$$F = \frac{MS_{\text{Between}}}{MS_{\text{Within}}}$$
In plain English:
$$F = \frac{\text{How much the group means vary around the grand mean}}{\text{How much individual observations vary within their groups}}$$
Or even simpler:
$$F = \frac{\text{Signal (between-group)}}{\text{Noise (within-group)}}$$
If the null hypothesis is true (all group means are equal), then $MS_{\text{Between}}$ and $MS_{\text{Within}}$ are both estimating the same thing — the common population variance $\sigma^2$. Their ratio should be approximately 1. Not exactly 1, because of random sampling variability, but close.
If the null hypothesis is false (at least one group mean differs), then $MS_{\text{Between}}$ will be inflated by the real differences among group means, while $MS_{\text{Within}}$ still estimates only the noise. The ratio $F$ will be larger than 1.
This is why ANOVA always produces a one-sided test, even though we're testing for any difference. The F-statistic is always positive, and we always look at the right tail: large values of $F$ provide evidence against $H_0$.
20.6 The F-Distribution
The F-statistic follows the F-distribution when $H_0$ is true. Like the $t$-distribution, the F-distribution is actually a family of distributions, indexed by two degrees of freedom:
- $df_1 = k - 1$ (numerator, from $MS_{\text{Between}}$)
- $df_2 = N - k$ (denominator, from $MS_{\text{Within}}$)
The F-distribution has several distinctive features:
- It's right-skewed — you can't have a negative F (it's a ratio of positive quantities)
- It starts at 0 and has a long right tail
- As $df_2$ gets large, the distribution becomes more concentrated around $F = 1$
- It was named in honor of Ronald A. Fisher, the statistician who developed ANOVA in the 1920s
The p-value for an ANOVA test is the probability of getting an F-statistic as large or larger than what you observed, assuming $H_0$ is true. Since $F$ must always be positive, the test is always right-tailed.
In Python:
```python
from scipy import stats

# Example values: F_observed from your ANOVA, k groups, N total observations
F_observed, k, N = 4.2, 3, 30

# p-value from F-statistic (right-tail area)
p_value = 1 - stats.f.cdf(F_observed, dfn=k-1, dfd=N-k)

# Or equivalently (sf = survival function; more numerically stable):
p_value = stats.f.sf(F_observed, dfn=k-1, dfd=N-k)
print(f"p-value: {p_value:.4f}")
```
20.7 Putting It All Together: The ANOVA Table
All of these calculations are traditionally organized into an ANOVA table. This table is the standard way to report ANOVA results, and you'll see it in journal articles, software output, and textbooks everywhere.
| Source | SS | df | MS | F | p |
|---|---|---|---|---|---|
| Between Groups | $SS_B$ | $k-1$ | $MS_B = SS_B/(k-1)$ | $F = MS_B/MS_W$ | $P(F \geq F_{\text{obs}})$ |
| Within Groups | $SS_W$ | $N-k$ | $MS_W = SS_W/(N-k)$ | | |
| Total | $SS_T$ | $N-1$ | | | |
Worked Example: Maya's Health Interventions
Let's work through a complete example with real numbers.
Dr. Maya Chen randomly assigns 40 patients to four intervention programs and measures their health improvement score (on a 0–100 scale) after 12 weeks:
| Vaccination (Group 1) | Nutrition (Group 2) | Fitness (Group 3) | Control (Group 4) |
|---|---|---|---|
| 72 | 68 | 81 | 55 |
| 78 | 75 | 85 | 62 |
| 65 | 71 | 79 | 58 |
| 74 | 62 | 88 | 51 |
| 80 | 69 | 83 | 60 |
| 71 | 74 | 86 | 57 |
| 76 | 66 | 80 | 63 |
| 69 | 72 | 90 | 54 |
| 77 | 67 | 84 | 59 |
| 73 | 70 | 82 | 56 |
Step 1: Compute the group means and grand mean.
- $\bar{x}_1 = (72 + 78 + 65 + 74 + 80 + 71 + 76 + 69 + 77 + 73)/10 = 73.5$
- $\bar{x}_2 = (68 + 75 + 71 + 62 + 69 + 74 + 66 + 72 + 67 + 70)/10 = 69.4$
- $\bar{x}_3 = (81 + 85 + 79 + 88 + 83 + 86 + 80 + 90 + 84 + 82)/10 = 83.8$
- $\bar{x}_4 = (55 + 62 + 58 + 51 + 60 + 57 + 63 + 54 + 59 + 56)/10 = 57.5$
- Grand mean: $\bar{x} = (73.5 + 69.4 + 83.8 + 57.5)/4 = 71.05$
(Since all groups have the same size, the grand mean is just the average of the group means.)
Step 2: Compute Sums of Squares.
$SS_{\text{Between}}$: How far are the group means from the grand mean?
$$SS_B = 10(73.5 - 71.05)^2 + 10(69.4 - 71.05)^2 + 10(83.8 - 71.05)^2 + 10(57.5 - 71.05)^2$$
$$SS_B = 10(6.0025) + 10(2.7225) + 10(162.5625) + 10(183.6025)$$
$$SS_B = 60.025 + 27.225 + 1625.625 + 1836.025 = 3548.9$$
$SS_{\text{Within}}$: How far are individual observations from their group means?
For Group 1: $(72-73.5)^2 + (78-73.5)^2 + \cdots + (73-73.5)^2 = 182.5$
For Group 2: $(68-69.4)^2 + (75-69.4)^2 + \cdots + (70-69.4)^2 = 136.4$
For Group 3: $(81-83.8)^2 + (85-83.8)^2 + \cdots + (82-83.8)^2 = 111.6$
For Group 4: $(55-57.5)^2 + (62-57.5)^2 + \cdots + (56-57.5)^2 = 122.5$
$$SS_W = 182.5 + 136.4 + 111.6 + 122.5 = 553.0$$
$SS_{\text{Total}} = SS_B + SS_W = 3548.9 + 553.0 = 4101.9$
Step 3: Compute Mean Squares.
$$MS_B = \frac{3548.9}{4 - 1} = \frac{3548.9}{3} = 1182.97$$
$$MS_W = \frac{553.0}{40 - 4} = \frac{553.0}{36} = 15.36$$
Step 4: Compute the F-statistic.
$$F = \frac{MS_B}{MS_W} = \frac{1182.97}{15.36} = 77.01$$
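The hand arithmetic above can be double-checked in a few lines, using the same data as the worked example:

```python
import numpy as np

groups = [
    np.array([72, 78, 65, 74, 80, 71, 76, 69, 77, 73], float),  # Vaccination
    np.array([68, 75, 71, 62, 69, 74, 66, 72, 67, 70], float),  # Nutrition
    np.array([81, 85, 79, 88, 83, 86, 80, 90, 84, 82], float),  # Fitness
    np.array([55, 62, 58, 51, 60, 57, 63, 54, 59, 56], float),  # Control
]
k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

ss_b = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_w = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_b, ms_w = ss_b / (k - 1), ss_w / (N - k)

print(f"SS_B = {ss_b:.1f}, SS_W = {ss_w:.1f}")
print(f"MS_B = {ms_b:.2f}, MS_W = {ms_w:.2f}, F = {ms_b / ms_w:.2f}")
```

This prints $SS_B = 3548.9$, $SS_W = 553.0$, and $F \approx 77.01$, matching the computation step by step.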
Step 5: The ANOVA table.
| Source | SS | df | MS | F | p |
|---|---|---|---|---|---|
| Between | 3548.9 | 3 | 1182.97 | 77.01 | < 0.001 |
| Within | 553.0 | 36 | 15.36 | | |
| Total | 4101.9 | 39 | | | |
Step 6: Interpret.
With $F(3, 36) = 77.01$ and $p < 0.001$, we reject $H_0$ at any conventional significance level. There is overwhelming evidence that the four intervention programs do not all produce the same average health improvement score.
But wait — which programs are different? ANOVA tells us that at least one group differs, but not which one. For that, we need post-hoc tests (Section 20.8).
Plain English interpretation: "The differences in average health improvement scores across the four intervention programs are far too large to be explained by random variation alone ($F(3, 36) = 77.01$, $p < 0.001$). At least one program produces a different level of health improvement compared to the others."
Python Implementation
```python
import numpy as np
from scipy import stats

# Maya's health intervention data
vaccination = [72, 78, 65, 74, 80, 71, 76, 69, 77, 73]
nutrition = [68, 75, 71, 62, 69, 74, 66, 72, 67, 70]
fitness = [81, 85, 79, 88, 83, 86, 80, 90, 84, 82]
control = [55, 62, 58, 51, 60, 57, 63, 54, 59, 56]

# One-way ANOVA
F_stat, p_value = stats.f_oneway(vaccination, nutrition, fitness, control)

print("=" * 50)
print("ONE-WAY ANOVA: Health Intervention Programs")
print("=" * 50)
print(f" F-statistic: {F_stat:.2f}")
print(f" p-value: {p_value:.6f}")
print(f" Conclusion: {'Reject H₀' if p_value < 0.05 else 'Fail to reject H₀'}")
print(f" at α = 0.05")

# Group summaries
groups = {'Vaccination': vaccination, 'Nutrition': nutrition,
          'Fitness': fitness, 'Control': control}
print(f"\n{'Group':<15} {'n':>5} {'Mean':>8} {'SD':>8}")
print("-" * 40)
for name, data in groups.items():
    print(f"{name:<15} {len(data):>5} {np.mean(data):>8.1f} {np.std(data, ddof=1):>8.2f}")

grand_mean = np.mean(vaccination + nutrition + fitness + control)
print(f"\n Grand Mean: {grand_mean:.2f}")
```
Output:
```
==================================================
ONE-WAY ANOVA: Health Intervention Programs
==================================================
 F-statistic: 77.01
 p-value: 0.000000
 Conclusion: Reject H₀
 at α = 0.05

Group               n     Mean       SD
----------------------------------------
Vaccination        10     73.5     4.50
Nutrition          10     69.4     3.89
Fitness            10     83.8     3.52
Control            10     57.5     3.69

 Grand Mean: 71.05
```
Excel: Single-Factor ANOVA in Data Analysis ToolPak
If you're using Excel, here's how to run the same analysis:
- Enter your data in columns (one column per group)
- Go to Data tab → Data Analysis (if you don't see it, enable the Analysis ToolPak add-in under File → Options → Add-ins)
- Select Anova: Single Factor
- Set the Input Range to cover all your data columns
- Choose Columns for "Grouped By"
- Set $\alpha = 0.05$
- Click OK
Excel produces the same ANOVA table we computed by hand: SS, df, MS, F, p-value, and $F_{\text{critical}}$. If $F > F_{\text{critical}}$ (or equivalently, if $p < \alpha$), reject $H_0$.
20.8 Post-Hoc Tests: Which Groups Differ?
Suppose your ANOVA comes back significant — great, you know that not all group means are equal. But that raises an immediate question: which groups differ from which?
ANOVA is an omnibus test. It tells you something is going on, but not what specifically. Think of it as a smoke detector that tells you there's a fire somewhere in the building, but not which room.
To find the specific rooms on fire, you need post-hoc tests (from the Latin "post hoc," meaning "after this"). These are pairwise comparison procedures designed to control the family-wise error rate while making all possible comparisons.
Tukey's HSD (Honestly Significant Difference)
The most commonly used post-hoc test is Tukey's Honestly Significant Difference (HSD) test, developed by John Tukey (the same statistician who invented the box plot in Chapter 6!). Tukey's HSD compares every pair of group means while controlling the overall Type I error rate at $\alpha$.
The idea: for each pair of groups $(i, j)$, compute:
$$q = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{MS_W / n}}$$
(for equal group sizes $n$; a modified formula handles unequal sizes)
This $q$-statistic is compared to the studentized range distribution to get a p-value. Tukey's HSD finds the minimum difference between group means that would be considered statistically significant.
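For equal group sizes, that minimum significant difference (the "HSD") is easy to compute with scipy's studentized range distribution. A sketch using Maya's data from Section 20.7:

```python
import numpy as np
from scipy.stats import studentized_range

groups = [
    np.array([72, 78, 65, 74, 80, 71, 76, 69, 77, 73], float),  # Vaccination
    np.array([68, 75, 71, 62, 69, 74, 66, 72, 67, 70], float),  # Nutrition
    np.array([81, 85, 79, 88, 83, 86, 80, 90, 84, 82], float),  # Fitness
    np.array([55, 62, 58, 51, 60, 57, 63, 54, 59, 56], float),  # Control
]
k, n = len(groups), len(groups[0])
N = k * n
ms_w = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - k)

q_crit = studentized_range.ppf(0.95, k, N - k)   # critical q at alpha = 0.05
hsd = q_crit * np.sqrt(ms_w / n)                 # minimum significant difference
print(f"q critical = {q_crit:.3f}, HSD = {hsd:.2f}")
```

Any pair of group means differing by more than the HSD (about 4.7 points here) is significantly different at the 0.05 family-wise level.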
Key Point: Only run post-hoc tests after a significant ANOVA result. If ANOVA fails to reject $H_0$, there's no evidence of any differences, and post-hoc tests shouldn't be conducted. This two-stage approach — omnibus test first, pairwise comparisons second — is what controls the family-wise error rate.
Python: Tukey's HSD
```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Prepare data in long format (pairwise_tukeyhsd needs one column of
# values and one column of group labels)
data = vaccination + nutrition + fitness + control
groups_list = (['Vaccination'] * 10 + ['Nutrition'] * 10 +
               ['Fitness'] * 10 + ['Control'] * 10)

# Run Tukey's HSD
tukey = pairwise_tukeyhsd(endog=data, groups=groups_list, alpha=0.05)
print(tukey)
```
Output (abridged; the adjusted-p column is omitted and interval bounds are rounded — the common Tukey half-width here is $q_{0.05,\,4,\,36}\sqrt{MS_W/n} \approx 4.72$):
```
Multiple Comparison of Means - Tukey HSD, FWER=0.05
=======================================================
  group1      group2    meandiff   lower   upper reject
-------------------------------------------------------
 Control     Fitness        26.3   21.58   31.02   True
 Control   Nutrition        11.9    7.18   16.62   True
 Control Vaccination        16.0   11.28   20.72   True
 Fitness   Nutrition       -14.4  -19.12   -9.68   True
 Fitness Vaccination       -10.3  -15.02   -5.58   True
Nutrition Vaccination        4.1   -0.62    8.82  False
-------------------------------------------------------
```
Five of the six pairwise comparisons are significant: every program differs from Control, and Fitness differs from every other program. The one exception is Nutrition vs. Vaccination. Their 4.1-point mean difference is smaller than the Tukey honest significant difference (about 4.7 points here), so the adjusted confidence interval includes 0.
Maya can now rank the programs: Fitness clearly outperforms the rest, and both Vaccination and Nutrition clearly outperform Control. But after controlling for multiple comparisons, the data cannot reliably separate Vaccination from Nutrition.
Bonferroni as an Alternative
You could also use the Bonferroni correction from Chapter 17: run pairwise t-tests but divide $\alpha$ by the number of comparisons. With $k = 4$ groups and $\binom{4}{2} = 6$ comparisons, each test uses $\alpha = 0.05/6 = 0.0083$.
Bonferroni is simpler to explain, but it's more conservative than Tukey's HSD — it's harder to find significance. For ANOVA follow-up, Tukey's HSD is generally preferred because:
- It was designed specifically for pairwise comparisons after ANOVA
- It's less conservative than Bonferroni when the number of comparisons is moderate
- It uses $MS_W$ from the ANOVA, pooling information across all groups for a more stable variance estimate
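As a sketch of the Bonferroni route, reusing Maya's four groups (the adjusted per-test threshold is $0.05/6 \approx 0.0083$):

```python
from itertools import combinations
from scipy import stats

groups = {
    'Vaccination': [72, 78, 65, 74, 80, 71, 76, 69, 77, 73],
    'Nutrition':   [68, 75, 71, 62, 69, 74, 66, 72, 67, 70],
    'Fitness':     [81, 85, 79, 88, 83, 86, 80, 90, 84, 82],
    'Control':     [55, 62, 58, 51, 60, 57, 63, 54, 59, 56],
}
pairs = list(combinations(groups, 2))
alpha_adj = 0.05 / len(pairs)            # Bonferroni-adjusted per-test alpha

n_significant = 0
for name1, name2 in pairs:
    t, p = stats.ttest_ind(groups[name1], groups[name2])
    sig = p < alpha_adj
    n_significant += sig
    print(f"{name1:>11} vs {name2:<11} p = {p:.4f}  {'significant' if sig else 'ns'}")
```

Note that Vaccination vs. Nutrition comes out around $p \approx 0.04$ unadjusted — significant at a naive 0.05 threshold but not at the Bonferroni-corrected one, which agrees with the Tukey conclusion.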
20.9 Checking ANOVA Assumptions
Like all statistical tests, ANOVA makes assumptions. Violating them doesn't automatically invalidate your results — ANOVA is reasonably robust — but knowing the assumptions helps you decide when to trust the results and when to be cautious.
The Three Conditions
1. Independence. Observations are independent within and across groups. This is primarily about study design, not something you can test statistically.
- Satisfied by: Random sampling from each population, or random assignment to treatment groups
- Violated by: Repeated measures on the same subjects, clustered data (students within classrooms), time-series data
- Not fixable with larger samples. If independence is violated, ANOVA is not appropriate (you'd need repeated-measures ANOVA or mixed models)
2. Normality. The data within each group are approximately normally distributed. This is the same condition you've been checking since Chapter 15.
- Check with: Histograms, QQ-plots, or the Shapiro-Wilk test for each group
- ANOVA is robust to this when group sizes are roughly equal and each group has $n \geq 15$–$20$. With very small groups ($n < 10$) and skewed data, consider the Kruskal-Wallis test (Chapter 21)
- Most important when group sizes are small and unequal
3. Equal variances (homogeneity of variance). All $k$ groups have approximately the same population variance: $\sigma_1^2 \approx \sigma_2^2 \approx \cdots \approx \sigma_k^2$.
- Check with: Levene's test, or the rule of thumb that the largest group SD should be no more than twice the smallest group SD
- ANOVA is robust to this when group sizes are equal. When group sizes are unequal and variances are unequal, ANOVA can give misleading results
- Alternative if violated: Welch's ANOVA (doesn't assume equal variances)
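A quick diagnostic sketch for conditions 2 and 3 — Shapiro-Wilk within each group plus the SD rule of thumb — reusing Maya's data:

```python
import numpy as np
from scipy import stats

groups = {
    'Vaccination': [72, 78, 65, 74, 80, 71, 76, 69, 77, 73],
    'Nutrition':   [68, 75, 71, 62, 69, 74, 66, 72, 67, 70],
    'Fitness':     [81, 85, 79, 88, 83, 86, 80, 90, 84, 82],
    'Control':     [55, 62, 58, 51, 60, 57, 63, 54, 59, 56],
}

# Condition 2: Shapiro-Wilk normality test within each group
pvals = {name: stats.shapiro(data).pvalue for name, data in groups.items()}
for name, p in pvals.items():
    print(f"{name:<12} Shapiro-Wilk p = {p:.3f}")

# Condition 3: rule-of-thumb ratio of largest to smallest group SD
sds = [np.std(d, ddof=1) for d in groups.values()]
print(f"SD ratio (largest/smallest): {max(sds)/min(sds):.2f} (want < 2)")
```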
Levene's Test for Equal Variances
Levene's test formally tests whether the population variances are equal:
$$H_0: \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$$
$$H_a: \text{Not all variances are equal}$$
A significant result ($p < 0.05$) suggests the equal-variance assumption is violated.
```python
import numpy as np
from scipy import stats

# Levene's test for Maya's data
stat, p_levene = stats.levene(vaccination, nutrition, fitness, control)
print(f"Levene's test: statistic = {stat:.3f}, p-value = {p_levene:.4f}")

# Check: largest SD / smallest SD
sds = [np.std(g, ddof=1) for g in [vaccination, nutrition, fitness, control]]
print(f"SD ratio (largest/smallest): {max(sds)/min(sds):.2f}")
print(f"Rule of thumb: ratio < 2 is OK → {'OK' if max(sds)/min(sds) < 2 else 'Concern'}")
```
Output:
Levene's test: statistic = 0.221, p-value = 0.8811
SD ratio (largest/smallest): 1.26
Rule of thumb: ratio < 2 is OK → OK
For Maya's data, Levene's test is not significant ($p = 0.88$), and the ratio of the largest to smallest SD is 1.26 — well under 2. The equal-variance assumption is satisfied.
Normal distribution as a model (from Ch.10)
Remember the threshold concept from Chapter 10: "The question is never 'Is my data normal?' (the answer is always no). The question is: 'Is my data close enough to normal that the model gives useful, approximately correct answers?'"
The same philosophy applies to ANOVA assumptions. Your data will never be perfectly normal or have perfectly equal variances. The question is whether the violations are severe enough to distort your conclusions. With balanced designs (equal group sizes) and moderate sample sizes, ANOVA is remarkably robust to moderate violations of both normality and equal variance.
Welch's ANOVA: When Variances Are Unequal
If the equal-variance assumption fails, you can use Welch's ANOVA, which adjusts the degrees of freedom (similar to how Welch's t-test from Chapter 16 handled unequal variances for two groups):
from scipy import stats
# Welch's ANOVA (does not assume equal variances)
# scipy doesn't have a built-in Welch's ANOVA, but pingouin does:
# import pingouin as pg
# pg.welch_anova(dv='score', between='group', data=df)
# Alternative: use the standard f_oneway and compare with Levene's test
# If Levene's test is not significant, f_oneway is fine
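If you'd rather not add a dependency, Welch's ANOVA is short enough to compute directly from the standard Welch formulas. The sketch below is our own helper (`welch_anova` is not a scipy function), written for the one-way case:

```python
import numpy as np
from scipy import stats

def welch_anova(*groups):
    """Welch's one-way ANOVA: does not assume equal group variances."""
    k = len(groups)
    n = np.array([len(g) for g in groups])
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])
    w = n / variances                        # precision weights
    w_sum = w.sum()
    grand_mean = (w * means).sum() / w_sum   # variance-weighted grand mean
    # Weighted between-group variability
    num = (w * (means - grand_mean) ** 2).sum() / (k - 1)
    # Correction term used in both the denominator and df2
    corr = ((1 - w / w_sum) ** 2 / (n - 1)).sum() / (k ** 2 - 1)
    F = num / (1 + 2 * (k - 2) * corr)
    df1, df2 = k - 1, 1 / (3 * corr)
    p = stats.f.sf(F, df1, df2)
    return F, df1, df2, p
```

With near-equal variances this closely tracks `stats.f_oneway`; when variances differ, its p-value is the more trustworthy of the two.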
20.10 Effect Size: Eta-Squared ($\eta^2$)
Just like we learned in Chapter 17 that a significant p-value doesn't tell you whether the effect is important, a significant ANOVA F-test doesn't tell you how much of the variation in your data is explained by group membership.
For that, we need an effect size measure. The standard one for ANOVA is eta-squared ($\eta^2$):
$$\eta^2 = \frac{SS_{\text{Between}}}{SS_{\text{Total}}}$$
Eta-squared tells you the proportion of total variability that is explained by group membership. It ranges from 0 (group membership explains nothing) to 1 (group membership explains everything).
Interpretation Benchmarks
Jacob Cohen (1988) proposed the following rough guidelines:
| $\eta^2$ | Interpretation |
|---|---|
| 0.01 | Small effect |
| 0.06 | Medium effect |
| 0.14 | Large effect |
These are guidelines, not laws. In some fields (like psychology), $\eta^2 = 0.06$ might be important. In others (like physics), it might be trivial. Always interpret effect sizes in context.
Maya's Effect Size
For Maya's intervention data:
$$\eta^2 = \frac{SS_B}{SS_T} = \frac{3548.9}{4110.9} = 0.863$$
This is an extremely large effect. Group membership (which intervention program a patient received) explains 86.3% of the variability in health improvement scores. Only 13.7% is due to individual differences within groups.
# Calculate eta-squared
all_data = np.concatenate([vaccination, nutrition, fitness, control])
grand_mean = np.mean(all_data)
# SS_Between
ss_between = sum(len(g) * (np.mean(g) - grand_mean)**2
for g in [vaccination, nutrition, fitness, control])
# SS_Total
ss_total = np.sum((all_data - grand_mean)**2)
eta_squared = ss_between / ss_total
print(f"\nEffect Size:")
print(f" η² = {eta_squared:.3f}")
print(f" Interpretation: {'Small' if eta_squared < 0.06 else 'Medium' if eta_squared < 0.14 else 'Large'} effect")
print(f" {eta_squared*100:.1f}% of variability in health scores")
print(f" is explained by intervention group")
A note on omega-squared and partial eta-squared:
Eta-squared tends to slightly overestimate the effect in the population (it's biased upward). An adjusted version called omega-squared ($\omega^2$) provides a less biased estimate:
$$\omega^2 = \frac{SS_B - (k-1) \cdot MS_W}{SS_T + MS_W}$$
In published research, you may also encounter partial eta-squared ($\eta_p^2$), which is the same as $\eta^2$ for one-way ANOVA but differs for more complex designs. For this course, $\eta^2$ is sufficient.
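Plugging in the sums of squares from Maya's analysis (with $N = 40$ and $k = 4$, consistent with the $F(3, 36)$ reported in this chapter) shows how the adjustment plays out:

```python
# Omega-squared for Maya's data, using the SS values reported in this chapter
ss_b, ss_t = 3548.9, 4110.9
k, N = 4, 40                            # four groups, df_within = N - k = 36
ms_w = (ss_t - ss_b) / (N - k)          # MS_W = SS_W / df_W
omega_sq = (ss_b - (k - 1) * ms_w) / (ss_t + ms_w)

print(f"eta-squared   = {ss_b / ss_t:.3f}")   # 0.863
print(f"omega-squared = {omega_sq:.3f}")      # slightly smaller, as expected
```

The downward adjustment is tiny here because the effect is so large; with small samples and modest effects, the gap between $\eta^2$ and $\omega^2$ is more noticeable.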
20.11 Alex's Analysis: Watch Time Across Subscription Tiers
Alex Rivera wants to compare average daily watch time (in minutes) across StreamVibe's three subscription tiers: Free, Standard, and Premium.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Simulated watch time data (minutes per day)
free = [22, 18, 25, 15, 30, 20, 12, 28, 19, 24,
17, 23, 14, 26, 21]
standard = [35, 42, 38, 31, 45, 40, 33, 37, 44, 36,
39, 41, 34, 43, 32]
premium = [48, 55, 52, 43, 58, 50, 45, 53, 56, 49,
51, 54, 46, 57, 47]
print("=" * 55)
print("ALEX'S ANALYSIS: Watch Time by Subscription Tier")
print("=" * 55)
# Group summaries
tiers = {'Free': free, 'Standard': standard, 'Premium': premium}
print(f"\n{'Tier':<12} {'n':>5} {'Mean':>8} {'SD':>8}")
print("-" * 35)
for name, data in tiers.items():
print(f"{name:<12} {len(data):>5} {np.mean(data):>8.1f} {np.std(data, ddof=1):>8.2f}")
# ANOVA
F_stat, p_value = stats.f_oneway(free, standard, premium)
print(f"\nOne-Way ANOVA:")
print(f" F(2, 42) = {F_stat:.2f}")
print(f" p-value = {p_value:.8f}")
# Effect size
all_data = np.array(free + standard + premium)
grand_mean = np.mean(all_data)
ss_between = sum(len(g) * (np.mean(g) - grand_mean)**2
for g in [free, standard, premium])
ss_total = np.sum((all_data - grand_mean)**2)
eta_sq = ss_between / ss_total
print(f" η² = {eta_sq:.3f} ({'Small' if eta_sq < 0.06 else 'Medium' if eta_sq < 0.14 else 'Large'} effect)")
# Assumptions check
stat, p_levene = stats.levene(free, standard, premium)
print(f"\nAssumptions Check:")
print(f" Levene's test: F = {stat:.3f}, p = {p_levene:.4f}")
# Post-hoc: Tukey's HSD
data_all = free + standard + premium
groups_all = ['Free']*15 + ['Standard']*15 + ['Premium']*15
tukey = pairwise_tukeyhsd(endog=data_all, groups=groups_all, alpha=0.05)
print(f"\nTukey's HSD Post-Hoc Comparisons:")
print(tukey)
Alex's interpretation for the StreamVibe analytics team:
"Average daily watch time differs significantly across subscription tiers ($F(2, 42) = 149.2$, $p < 0.001$, $\eta^2 = 0.88$). Post-hoc comparisons show that all three tiers differ from each other: Premium users watch significantly more than Standard users, who watch significantly more than Free users. Subscription tier explains approximately 88% of the variation in daily watch time."
Alex's Caveat (Theme 5 — Correlation vs. Causation):
Alex is careful to note that this is an observational comparison, not an experiment. Customers who pay for Premium might be inherently different from Free users — they may be more engaged, have more leisure time, or have different content preferences. The subscription tier didn't cause the watch time difference; it's associated with it. As you learned in Chapter 4, only a randomized experiment can establish causation.
20.12 Sam's Analysis: Scoring Across Opponents
Sam Okafor has been tracking Daria's scoring averages against the Raptors' five conference opponents throughout the season:
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Daria's points per game against each opponent (6 games each)
hawks = [22, 18, 25, 20, 24, 19] # Defensive team
wolves = [28, 32, 25, 30, 27, 34] # Fast-paced
bears = [15, 18, 12, 20, 16, 17] # Strong interior defense
eagles = [26, 22, 28, 24, 30, 26] # Average defense
tigers = [20, 23, 19, 22, 18, 24] # Physical play
print("=" * 55)
print("SAM'S ANALYSIS: Daria's Scoring by Opponent")
print("=" * 55)
opponents = {'Hawks': hawks, 'Wolves': wolves, 'Bears': bears,
'Eagles': eagles, 'Tigers': tigers}
print(f"\n{'Opponent':<12} {'Games':>6} {'Mean':>8} {'SD':>8}")
print("-" * 38)
for name, data in opponents.items():
print(f"{name:<12} {len(data):>6} {np.mean(data):>8.1f} {np.std(data, ddof=1):>8.2f}")
# ANOVA
F_stat, p_value = stats.f_oneway(hawks, wolves, bears, eagles, tigers)
print(f"\nOne-Way ANOVA:")
print(f" F(4, 25) = {F_stat:.2f}")
print(f" p-value = {p_value:.6f}")
# Effect size
all_data = np.array(hawks + wolves + bears + eagles + tigers)
grand_mean = np.mean(all_data)
ss_between = sum(len(g) * (np.mean(g) - grand_mean)**2
for g in [hawks, wolves, bears, eagles, tigers])
ss_total = np.sum((all_data - grand_mean)**2)
eta_sq = ss_between / ss_total
print(f" η² = {eta_sq:.3f} ({'Small' if eta_sq < 0.06 else 'Medium' if eta_sq < 0.14 else 'Large'} effect)")
print(f" {eta_sq*100:.1f}% of scoring variability explained by opponent")
# Levene's test
stat, p_lev = stats.levene(hawks, wolves, bears, eagles, tigers)
print(f"\nLevene's test: F = {stat:.3f}, p = {p_lev:.4f}")
# Tukey's HSD
data_all = hawks + wolves + bears + eagles + tigers
groups_all = (['Hawks']*6 + ['Wolves']*6 + ['Bears']*6 +
['Eagles']*6 + ['Tigers']*6)
tukey = pairwise_tukeyhsd(endog=data_all, groups=groups_all, alpha=0.05)
print(f"\nTukey's HSD:")
print(tukey)
Sam's report to the coaching staff: "Daria's scoring average varies significantly by opponent ($F(4, 25) = 18.78$, $p < 0.001$, $\eta^2 = 0.75$). She scores highest against the Wolves (29.3 ppg) and lowest against the Bears (16.3 ppg). Opponent identity explains 75% of the game-to-game variability in her scoring. This information is valuable for game planning — when facing the Bears' strong interior defense, the team may want to design more plays for Daria's perimeter game."
20.13 Common Mistakes and Misconceptions
🚫 Common Pitfall: Running ANOVA When You Should Use a Different Test
ANOVA is for comparing means of a numerical variable across categorical groups. Here's a quick decision guide:
| Situation | Correct Test |
|---|---|
| Compare means of 2 independent groups | Two-sample t-test (Ch.16) |
| Compare means of 3+ independent groups | One-way ANOVA (this chapter) |
| Compare proportions across groups | Chi-square test (Ch.19) |
| Compare ranks when normality fails | Kruskal-Wallis test (Ch.21) |
| Compare means with 2+ grouping factors | Two-way ANOVA (beyond this course) |
Mistake 1: "ANOVA Was Significant, So All Groups Are Different"
No! A significant ANOVA tells you that at least one group mean differs from the others. It's entirely possible that three of four groups are identical and only one is different. Always follow up with post-hoc tests to identify which specific comparisons are significant.
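A quick simulation makes this concrete. Below, three groups come from the same population and only a fourth is shifted, yet the overall ANOVA is decisively significant (simulated data, illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Three groups drawn from the same population, plus one shifted group
a, b, c = (rng.normal(50, 5, 20) for _ in range(3))
d = rng.normal(60, 5, 20)   # only this group differs

F, p = stats.f_oneway(a, b, c, d)
print(f"Overall ANOVA: F = {F:.1f}, p = {p:.2g}")  # significant

# A Tukey's HSD follow-up would likely show a-b, a-c, and b-c as
# non-significant, with only the comparisons involving d significant.
```

The significant F alone cannot tell you that the first three groups are interchangeable; only the post-hoc comparisons reveal that structure.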
Mistake 2: "ANOVA Was Not Significant, So All Groups Are the Same"
This is the same mistake as saying "fail to reject" means "accept $H_0$." A non-significant ANOVA means you don't have enough evidence to conclude that the groups differ. The groups might be different, but your data aren't powerful enough to detect it. Consider effect sizes and power analysis (Chapter 17) before concluding that the groups are truly identical.
Mistake 3: "I Can Skip the Post-Hoc Test Because the Means Tell Me Which Groups Differ"
Looking at group means and guessing which differences are "real" is exactly the kind of informal multiple comparison that ANOVA was designed to replace. A post-hoc test accounts for the multiple comparisons problem; your eyeballs don't.
Mistake 4: Ignoring Assumptions
ANOVA is robust, but not invincible. If your data are heavily skewed with small, unequal group sizes, or if variances differ by a factor of 4 or more, your results could be misleading. Always check — at minimum, use the SD ratio rule of thumb and eyeball your distributions.
20.14 Reporting ANOVA Results
Good statistical writing follows conventions. Here's how to report ANOVA results:
In-Text Format (APA Style)
"A one-way ANOVA revealed a statistically significant difference in health improvement scores across the four intervention programs, $F(3, 36) = 75.78$, $p < .001$, $\eta^2 = .86$."
The format is: $F(df_{\text{between}}, df_{\text{within}}) = F\text{-value}$, $p = \text{value}$, $\eta^2 = \text{value}$.
Full Reporting Checklist
When reporting ANOVA results, include:
- [ ] The research question or hypothesis in plain language
- [ ] Group descriptive statistics: $n$, mean, and SD for each group
- [ ] Assumption checks: Levene's test result; note on normality assessment
- [ ] ANOVA results: $F$-statistic with both degrees of freedom, exact $p$-value (or "$p < .001$" for very small values)
- [ ] Effect size: $\eta^2$ with interpretation
- [ ] Post-hoc results (if ANOVA was significant): which pairs differ, with adjusted $p$-values and/or confidence intervals
- [ ] Interpretation in context: what the result means for the research question, not just "reject $H_0$"
20.15 The Complete ANOVA Decision Flowchart
Here's a step-by-step guide for conducting a one-way ANOVA:
START: Do group means differ?
│
├─ Step 1: Check conditions
│ ├─ Independence? (study design)
│ ├─ Normality? (histograms, QQ-plots, Shapiro-Wilk per group)
│ └─ Equal variances? (Levene's test, SD ratio < 2)
│ ├─ All conditions met → Proceed to standard ANOVA
│ ├─ Normality violated, large n → ANOVA is robust; proceed
│ ├─ Equal variance violated → Use Welch's ANOVA
│ └─ Serious violations, small n → Kruskal-Wallis (Ch.21)
│
├─ Step 2: Run one-way ANOVA
│ ├─ Compute SS_B, SS_W, SS_T
│ ├─ Compute MS_B, MS_W
│ ├─ Compute F = MS_B / MS_W
│ └─ Find p-value from F-distribution
│
├─ Step 3: Compute effect size (η²)
│
├─ Step 4: Interpret
│ ├─ p > α → Fail to reject H₀ (no significant difference)
│ └─ p ≤ α → Reject H₀ (at least one group differs)
│ │
│ └─ Step 5: Post-hoc tests (Tukey's HSD)
│ └─ Identify which specific pairs of groups differ
│
└─ Step 6: Report results with descriptive stats,
F-statistic, p-value, η², and post-hoc details
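Much of the flowchart's computation can be wrapped in one small helper. This is an illustrative sketch (the function name is ours, not a library API): it runs Levene's test, the standard ANOVA, and $\eta^2$, and flags the decision, leaving the Welch and Kruskal-Wallis branches to your judgment:

```python
import numpy as np
from scipy import stats

def one_way_anova_workflow(*groups, alpha=0.05):
    """Variance check, one-way ANOVA, effect size, and decision."""
    _, levene_p = stats.levene(*groups)        # Step 1 (partial): equal variances
    F, p = stats.f_oneway(*groups)             # Step 2: standard one-way ANOVA
    # Step 3: eta-squared from the sums-of-squares decomposition
    all_data = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    grand_mean = all_data.mean()
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = ((all_data - grand_mean) ** 2).sum()
    return {
        'levene_p': levene_p,                  # if small, consider Welch's ANOVA
        'F': F, 'p': p,
        'eta_sq': ss_between / ss_total,
        'reject_H0': bool(p <= alpha),         # Step 4; if True, run Tukey's HSD
    }
```

If `levene_p` is small, switch to Welch's ANOVA before trusting `p`; if `reject_H0` is True, follow up with `pairwise_tukeyhsd` as shown earlier in the chapter.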
20.16 Progressive Project: ANOVA in Your Dataset
It's time to apply ANOVA to your own Data Detective Portfolio.
Your Task
1. Identify a grouping variable in your dataset that has three or more categories (e.g., region, education level, income bracket, or any categorical variable with 3+ groups)
2. Identify a numerical outcome variable whose mean you want to compare across those groups
3. Check the assumptions: Are observations independent? Create histograms or QQ-plots for each group to check normality, and run Levene's test for equal variances
4. Conduct a one-way ANOVA using `scipy.stats.f_oneway()`
5. Calculate $\eta^2$ and interpret the effect size
6. If significant, run Tukey's HSD to identify which groups differ
7. Write a one-paragraph interpretation of your results in context
Starter Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Load your dataset
df = pd.read_csv('your_dataset.csv')
# Define your grouping variable and outcome variable
group_var = 'your_grouping_variable' # e.g., 'region', 'education'
outcome_var = 'your_outcome_variable' # e.g., 'income', 'score'
# Group summaries
print("Group Summaries:")
print(df.groupby(group_var)[outcome_var].describe())
# Visual check: box plots
df.boxplot(column=outcome_var, by=group_var, figsize=(10, 6))
plt.title(f'{outcome_var} by {group_var}')
plt.suptitle('') # Remove automatic title
plt.ylabel(outcome_var)
plt.show()
# Levene's test
groups = [group[outcome_var].values
for name, group in df.groupby(group_var)]
stat, p_lev = stats.levene(*groups)
print(f"\nLevene's test: F = {stat:.3f}, p = {p_lev:.4f}")
# ANOVA
F, p = stats.f_oneway(*groups)
print(f"\nOne-Way ANOVA: F = {F:.2f}, p = {p:.6f}")
# Effect size (eta-squared)
all_vals = df[outcome_var].values
grand_mean = np.mean(all_vals)
ss_between = sum(len(g) * (np.mean(g) - grand_mean)**2 for g in groups)
ss_total = np.sum((all_vals - grand_mean)**2)
eta_sq = ss_between / ss_total
print(f"η² = {eta_sq:.3f}")
# Post-hoc (if significant)
if p < 0.05:
tukey = pairwise_tukeyhsd(
endog=df[outcome_var],
groups=df[group_var],
alpha=0.05
)
print("\nTukey's HSD:")
print(tukey)
Add this analysis to your Jupyter notebook under a new heading: "Comparing Multiple Groups: ANOVA."
20.17 Chapter Summary
What We Learned
This chapter introduced ANOVA — the tool for comparing means across three or more groups. Here's the arc:
- Why not multiple t-tests? Because the multiple comparisons problem inflates the family-wise error rate. With $k$ groups and $\binom{k}{2}$ pairwise tests, the probability of at least one false positive far exceeds $\alpha$.
- ANOVA's insight: Instead of comparing pairs of means, compare two types of variability — between-group and within-group. The F-statistic is their ratio: $F = MS_B / MS_W$.
- The threshold concept — decomposing variability: Total variability = Between-group + Within-group. This decomposition is exact, and it is the foundation for regression ($R^2$) and virtually all statistical modeling.
- The ANOVA table organizes SS, df, MS, F, and p in a standardized format.
- If ANOVA is significant, follow up with Tukey's HSD to find which specific groups differ — while controlling the family-wise error rate.
- Effect sizes matter. Eta-squared ($\eta^2 = SS_B / SS_T$) tells you how much of the total variability is explained by group membership.
- Assumptions: Independence, normality within groups, equal variances. ANOVA is robust to moderate violations with balanced designs.
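The first point above is easy to see numerically. Under the simplifying assumption of independent tests at $\alpha = 0.05$:

```python
from math import comb

alpha = 0.05
for k in [3, 4, 5, 6]:
    m = comb(k, 2)                       # number of pairwise comparisons
    fwer = 1 - (1 - alpha) ** m          # P(at least one false positive)
    print(f"k = {k}: {m:2d} tests, family-wise error rate ≈ {fwer:.2f}")
```

Even at $k = 4$ the chance of a spurious "discovery" is already above one in four, which is why ANOVA asks one overall question first.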
Key Formulas
| Formula | Description |
|---|---|
| $SS_T = \sum\sum(x_{ij} - \bar{x})^2$ | Total sum of squares |
| $SS_B = \sum n_i(\bar{x}_i - \bar{x})^2$ | Between-group sum of squares |
| $SS_W = \sum\sum(x_{ij} - \bar{x}_i)^2$ | Within-group sum of squares |
| $SS_T = SS_B + SS_W$ | Fundamental decomposition |
| $MS_B = SS_B / (k-1)$ | Mean square between |
| $MS_W = SS_W / (N-k)$ | Mean square within |
| $F = MS_B / MS_W$ | F-statistic |
| $\eta^2 = SS_B / SS_T$ | Eta-squared (effect size) |
Connections to What's Next
In Chapter 21, you'll learn what to do when ANOVA assumptions fail. The Kruskal-Wallis test is a nonparametric alternative to one-way ANOVA that doesn't require normality — it works with ranks instead of raw values. The decomposition idea from this chapter (total = between + within) also reappears in Chapter 22, where regression decomposes variability into the part explained by a predictor and the part left over as residual noise. The $R^2$ you'll meet there is the regression analogue of $\eta^2$.
What's Next: Chapter 21 tackles nonparametric methods — your toolbox for when the assumptions that make t-tests, z-tests, and ANOVA work are badly violated. You'll learn the Kruskal-Wallis test (the nonparametric ANOVA), the Wilcoxon rank-sum test, and more. These methods sacrifice some power in exchange for making fewer assumptions about your data.
"The analysis of variance is not a mathematical theorem, but rather a convenient method of arranging the arithmetic. The arithmetic does not tell us the answer to our questions — but it may help us ask better ones." — Adapted from Ronald Fisher