Learning Objectives
- Explain when and why nonparametric methods are needed
- Conduct a Wilcoxon rank-sum (Mann-Whitney U) test
- Conduct a Wilcoxon signed-rank test for paired data
- Conduct a Kruskal-Wallis test as a nonparametric alternative to ANOVA
- Compare parametric and nonparametric approaches and choose appropriately
In This Chapter
- Chapter Overview
- 21.1 A Puzzle Before We Start (Productive Struggle)
- 21.2 When Parametric Methods Fail
- 21.3 What "Nonparametric" Actually Means
- 21.4 The Ranking Procedure: Step by Step
- 21.5 The Sign Test: The Simplest Nonparametric Test
- 21.6 The Wilcoxon Rank-Sum Test (Mann-Whitney U Test)
- 21.7 The Wilcoxon Signed-Rank Test
- 21.8 The Kruskal-Wallis Test: Nonparametric ANOVA
- 21.9 Parametric vs. Nonparametric: The Decision Framework
- 21.10 Maya's Full Analysis: Patient Satisfaction Scores
- 21.11 Alex's Analysis: Engagement Metrics with Skewed Data
- 21.12 Sam's Analysis: Performance Ratings Across Positions
- 21.13 Excel: Nonparametric Methods (Limited Support)
- 21.14 Common Mistakes and Misconceptions
- 21.15 When Both Approaches Converge — and When They Don't
- 21.16 Theme 4: Robust Methods for Messy Reality
- 21.17 Progressive Project: Apply a Nonparametric Test
- 21.18 Chapter Summary
Chapter 21: Nonparametric Methods: When Assumptions Fail
"Not everything that counts can be counted, and not everything that can be counted counts." — Often attributed to Albert Einstein
Chapter Overview
Here's a situation every practicing statistician runs into eventually.
You've done everything right. You've collected your data carefully, checked the sample sizes, stated your hypotheses. You pull up the QQ-plot to check normality before running a t-test, and... it looks like a roller coaster. The Shapiro-Wilk test screams non-normality with $p < 0.001$. You've got a sample of 12 observations, so the Central Limit Theorem can't bail you out. And there are two massive outliers sitting at the right tail like uninvited guests who refuse to leave.
What do you do?
If the only tools in your toolbox are t-tests and ANOVA, the answer is: not much. Those procedures rest on assumptions — normality, interval/ratio data, sufficient sample size — and when those assumptions break badly enough, the results can't be trusted.
But here's the good news. There's an entire family of statistical methods designed for exactly this situation. They're called nonparametric methods, and they're your safety net for messy reality.
Dr. Maya Chen encounters this constantly. Patient satisfaction surveys produce ordinal data — ratings from 1 to 5 — that can't be meaningfully averaged. You can't say the difference between a 4 and a 5 is the same as the difference between a 1 and a 2. The numbers are labels with an order, not measurements on a ruler. A t-test on those ratings would be technically wrong.
Alex Rivera sees it with engagement metrics. Watch time data on StreamVibe is heavily right-skewed: most users watch 10-30 minutes, but a handful of power users binge for 6+ hours. Those extreme values drag the mean far from the typical experience, and with small samples, they can wreck a t-test.
Sam Okafor runs into it when comparing player performance ratings. Scouts rate players on a 1-10 scale, and those ratings are ordinal — a player rated 8 isn't necessarily "twice as good" as one rated 4. Plus, with only 8-10 games per opponent, the sample sizes are tiny.
Nonparametric methods handle all of this. They work by converting your data to ranks — first, second, third, and so on — and analyzing the ranks instead of the raw values. This simple trick sidesteps the normality assumption entirely. It also makes the methods naturally resistant to outliers, because whether that extreme value is 100 or 1,000,000, it gets the same rank: the highest one.
In this chapter, you will learn to:
- Explain when and why nonparametric methods are needed
- Conduct a Wilcoxon rank-sum (Mann-Whitney U) test
- Conduct a Wilcoxon signed-rank test for paired data
- Conduct a Kruskal-Wallis test as a nonparametric alternative to ANOVA
- Compare parametric and nonparametric approaches and choose appropriately
Fast Track: If you've encountered nonparametric tests before, skim Sections 21.1–21.3, then jump to Section 21.9 (the decision framework). Complete quiz questions 1, 10, and 18 to verify.
Deep Dive: After this chapter, read Case Study 1 (Maya's patient satisfaction analysis) for a complete worked application with ordinal data, then Case Study 2 (Sam's performance ratings across positions) for a sports analytics perspective. Both include full Python code.
21.1 A Puzzle Before We Start (Productive Struggle)
Before we dive into the methods, try this thought experiment.
The Pain Relief Study
A hospital tests two pain medications on patients recovering from knee surgery. Each patient rates their pain reduction on a scale from 1 (no relief) to 10 (complete relief). Here are the results:
- Medication A (n = 8): 3, 5, 4, 7, 6, 4, 5, 8
- Medication B (n = 8): 6, 8, 7, 9, 10, 7, 8, 9
(a) Compute the mean for each group. Does Medication B appear more effective?
(b) These are ordinal ratings, not measurements on a physical scale. Why might computing a mean be inappropriate? What does it mean to say "the average pain relief was 5.25"?
(c) If you ranked all 16 observations from lowest to highest (ignoring which group they came from), would the Medication B values tend to have higher or lower ranks than the Medication A values? How could you use ranks to test whether the groups differ?
(d) Now suppose one patient in Medication B reported a 10 but actually had a score of 100 due to a data entry error. How would this affect a t-test? How would it affect a rank-based test?
Take a few minutes to think through these questions before reading on. Your answers will give you the intuition for everything that follows.
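Once you've wrestled with the questions yourself, you can check parts (a) and (d) numerically. Here is a quick sketch using scipy's `rankdata` to compute ranks (a function we'll meet properly later in this chapter):

```python
import numpy as np
from scipy import stats

# Pain-relief ratings from the puzzle
med_a = [3, 5, 4, 7, 6, 4, 5, 8]
med_b = [6, 8, 7, 9, 10, 7, 8, 9]

# (a) Group means
print(np.mean(med_a), np.mean(med_b))              # 5.25 8.0

# (d) A data-entry error turns B's 10 into 100
med_b_corrupt = [6, 8, 7, 9, 100, 7, 8, 9]
print(np.mean(med_b_corrupt))                      # 19.25 -- the mean explodes

# ...but the combined ranks are completely unchanged, because 100
# occupies the same position in the ordering that 10 did
ranks_before = stats.rankdata(med_a + med_b)
ranks_after = stats.rankdata(med_a + med_b_corrupt)
print(np.array_equal(ranks_before, ranks_after))   # True
```

The corrupted value wrecks the mean but leaves every rank exactly where it was — the intuition behind everything in this chapter.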
21.2 When Parametric Methods Fail
Spaced Review (SR.1) — The Two-Sample t-Test (Ch.16)
In Chapter 16, you learned the workhorse of two-group comparisons: the two-sample t-test. Remember the conditions?
- Independence between and within groups
- Nearly normal populations (or large samples, so CLT applies)
- Interval or ratio data (so means are meaningful)
The test statistic was $t = (\bar{x}_1 - \bar{x}_2) / \sqrt{s_1^2/n_1 + s_2^2/n_2}$, and it follows a t-distribution under $H_0$. The t-test works beautifully when these conditions hold.
But what happens when they don't?
In Chapter 15 (Section 15.6), we discussed robustness — the t-test's ability to give approximately correct results even when assumptions aren't perfectly met. The guidelines were: for $n \geq 30$, the CLT handles most non-normality. For $15 \leq n < 30$, check for outliers and strong skew. For $n < 15$, you really need approximate normality.
Now we fill the gap: what do you do when $n$ is small and the data are non-normal, ordinal, or loaded with outliers?
Let's be concrete about when parametric tests (t-tests, ANOVA) run into trouble.
Situation 1: Small Samples with Non-Normal Data
You have 10 measurements per group. You create histograms and QQ-plots, and the data are clearly right-skewed. The CLT needs $n \geq 30$ to reliably normalize sampling distributions from skewed populations, and you're nowhere near that. A t-test here would produce p-values you can't trust.
Situation 2: Ordinal Data
Patient satisfaction ratings (1-5 stars), movie ratings, Likert scale responses ("Strongly Disagree" to "Strongly Agree"), pain scales — these are all ordinal. The numbers indicate order, but the intervals between them aren't necessarily equal. Is the difference between "Satisfied" (4) and "Very Satisfied" (5) the same as between "Neutral" (3) and "Satisfied" (4)? There's no guarantee. Computing a mean of ordinal data is technically questionable, and basing a t-test on it is risky.
Situation 3: Heavy Outliers
Your data contain extreme values that dramatically affect means and standard deviations. Remember from Chapter 6 (Section 6.1) that the mean is not a resistant measure — it gets pulled toward outliers. If one person in your sample of 15 donated \$50,000 to charity while everyone else donated \$20-\$200, the mean is wildly unrepresentative. A t-test, which compares means, would be comparing the wrong thing.
Situation 4: Unknown Distribution Shape
Sometimes you simply don't know what the population distribution looks like, and you don't have enough data to find out. Maybe you're studying a rare disease with only 8 patients in each treatment group. You can't assess normality with 8 data points, and you can't rely on CLT with such small samples.
Key Insight: Nonparametric methods are your "Plan B" — and sometimes your "Plan A" — for all four of these situations.
21.3 What "Nonparametric" Actually Means
Let's nail down the terminology.
A parametric test is one that assumes the data come from a population with a specific probability distribution — usually the normal distribution — defined by parameters like $\mu$ and $\sigma$. The t-test, z-test, and ANOVA are all parametric: they assume normality (at least approximately), and they estimate parameters (population means, variances).
A nonparametric test makes fewer assumptions about the population distribution. It doesn't assume normality, doesn't require interval/ratio data, and typically works by analyzing the ranks or signs of the data rather than the raw values. That's why nonparametric methods are also called distribution-free methods — they work regardless of the underlying distribution's shape.
Myth vs. Reality
Myth: "Nonparametric means assumption-free."
Reality: Nonparametric tests still have assumptions — they require independence, and most rank-based tests assume the distributions being compared have similar shapes (they differ only in location, not shape). What they don't assume is that those distributions are normal. They have fewer assumptions, not zero assumptions.
The Big Idea: Ranks
The key insight behind most nonparametric methods is simple: replace your data values with their ranks.
If you have 8 observations — say, 14, 22, 9, 31, 5, 18, 27, 12 — you sort them and assign ranks:
| Value | 5 | 9 | 12 | 14 | 18 | 22 | 27 | 31 |
|---|---|---|---|---|---|---|---|---|
| Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
Why is this powerful? Three reasons:
1. Outlier resistance. If that 31 were actually 31,000 (a data error or genuinely extreme value), it would still get rank 8. The ranks don't change, so the test results don't change. Compare this to a t-test, where changing 31 to 31,000 would dramatically shift the mean and the test statistic.
2. No normality required. Ranks are uniformly distributed by construction — each rank from 1 to $n$ appears exactly once. You don't need to worry about whether the original data are normal, skewed, or shaped like a camel.
3. Works for ordinal data. If your data are pain ratings from 1 to 10, computing ranks is perfectly appropriate. You're not pretending that the intervals between rating levels are equal — you're just asking which values are higher than which.
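You can see the outlier resistance directly. A minimal sketch using `scipy.stats.rankdata`:

```python
from scipy import stats

data = [14, 22, 9, 31, 5, 18, 27, 12]
print(stats.rankdata(data))            # [4. 6. 2. 8. 1. 5. 7. 3.]

# Replace 31 with an absurdly extreme value -- the ranks are identical
data_outlier = [14, 22, 9, 31000, 5, 18, 27, 12]
print(stats.rankdata(data_outlier))    # [4. 6. 2. 8. 1. 5. 7. 3.]
```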
Handling Tied Values
What happens when two observations have the same value? You assign each the average of the ranks they would have occupied.
Example: Data = {5, 8, 8, 12, 15}
Without ties, the ranks would be 1, 2, 3, 4, 5. But the two 8s occupy positions 2 and 3. So each gets the average rank:
| Value | 5 | 8 | 8 | 12 | 15 |
|---|---|---|---|---|---|
| Rank | 1 | 2.5 | 2.5 | 4 | 5 |
This is called midrank assignment, and it's the standard approach for handling ties in nonparametric tests.
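scipy implements this rule directly: `rankdata`'s default tie-breaking method, `'average'`, is exactly midrank assignment.

```python
from scipy import stats

# The two 8s occupy positions 2 and 3, so each gets the midrank 2.5
print(stats.rankdata([5, 8, 8, 12, 15]))   # [1.  2.5 2.5 4.  5. ]
```

Other tie-breaking methods exist (`'min'`, `'max'`, `'ordinal'`), but `'average'` is the one the nonparametric tests in this chapter rely on.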
21.4 The Ranking Procedure: Step by Step
Before we get to specific tests, let's practice the ranking procedure that they all share. This is the foundation.
Step 1: Combine all observations from all groups into a single list.
Step 2: Sort the combined list from smallest to largest.
Step 3: Assign ranks 1, 2, 3, ... from smallest to largest.
Step 4: For tied values, assign the average of the ranks those tied values would have occupied.
Step 5: Return each rank to its original group.
Worked Example: The Ranking Procedure
Suppose Maya is comparing recovery times (in days) for patients treated with two different rehabilitation protocols:
- Protocol A: 12, 15, 18, 22, 45
- Protocol B: 8, 11, 14, 16, 20
Step 1: Combine: 12, 15, 18, 22, 45, 8, 11, 14, 16, 20
Step 2: Sort: 8, 11, 12, 14, 15, 16, 18, 20, 22, 45
Step 3: Assign ranks:
| Sorted Value | 8 | 11 | 12 | 14 | 15 | 16 | 18 | 20 | 22 | 45 |
|---|---|---|---|---|---|---|---|---|---|---|
| Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| Group | B | B | A | B | A | B | A | B | A | A |
Step 5: Return ranks to groups:
- Protocol A ranks: 3, 5, 7, 9, 10 (sum = 34)
- Protocol B ranks: 1, 2, 4, 6, 8 (sum = 21)
Notice that Protocol A tends to have higher ranks (longer recovery times). Notice also that the 45-day recovery time in Protocol A — clearly an outlier — gets rank 10. If that value had been 450 instead, it would still get rank 10, and the analysis would produce exactly the same result. That's the magic of ranks.
Quick Math Check: The sum of all ranks from 1 to $N$ always equals $N(N+1)/2$. Here, $N = 10$, so the total is $10 \times 11 / 2 = 55$. And indeed, $34 + 21 = 55$. This is a useful check that you haven't made an error.
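The five steps take only a few lines of code. A sketch using `scipy.stats.rankdata` on the protocol data above:

```python
import numpy as np
from scipy import stats

protocol_a = [12, 15, 18, 22, 45]
protocol_b = [8, 11, 14, 16, 20]

combined = np.concatenate([protocol_a, protocol_b])   # Step 1
ranks = stats.rankdata(combined)                      # Steps 2-4
ranks_a, ranks_b = ranks[:5], ranks[5:]               # Step 5

print(ranks_a.sum(), ranks_b.sum())                   # 34.0 21.0

# Quick math check: all ranks together sum to N(N+1)/2
N = len(combined)
print(ranks.sum() == N * (N + 1) / 2)                 # True
```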
21.5 The Sign Test: The Simplest Nonparametric Test
Before we tackle the Wilcoxon tests, let's meet the simplest nonparametric test of all: the sign test.
The sign test is used for paired data — the same kind of data you'd use a paired t-test for. But instead of computing differences and testing whether their mean is zero, the sign test simply counts how many differences are positive and how many are negative. If the treatment has no effect, you'd expect roughly half the differences to be positive and half negative — like flipping a fair coin.
When to Use the Sign Test
- Paired or matched data
- You can determine whether each difference is positive or negative (but you don't need to measure the size of the difference)
- Very small samples where even nonparametric rank-based tests are questionable
- Ordinal data where computing differences is awkward
How It Works
- For each pair, compute the difference $d_i = x_{\text{after}} - x_{\text{before}}$ (or $x_A - x_B$)
- Ignore any pairs where $d_i = 0$ (ties)
- Count the number of positive differences ($n^+$) and negative differences ($n^-$)
- Under $H_0$: the median difference is zero, so $n^+ \sim \text{Binomial}(n, 0.5)$
- The p-value is the probability of observing $n^+$ (or more extreme) under this binomial distribution
Worked Example: Maya's Before/After Pain Study
Maya wants to know whether a new pain management protocol reduces reported pain levels. She measures pain (0-10 scale) before and after the protocol for 10 patients:
| Patient | Before | After | Difference (After - Before) | Sign |
|---|---|---|---|---|
| 1 | 7 | 5 | -2 | $-$ |
| 2 | 8 | 6 | -2 | $-$ |
| 3 | 5 | 5 | 0 | (drop) |
| 4 | 9 | 4 | -5 | $-$ |
| 5 | 6 | 3 | -3 | $-$ |
| 6 | 7 | 7 | 0 | (drop) |
| 7 | 8 | 5 | -3 | $-$ |
| 8 | 4 | 2 | -2 | $-$ |
| 9 | 6 | 6 | 0 | (drop) |
| 10 | 9 | 7 | -2 | $-$ |
After dropping the three ties: $n = 7$, $n^+ = 0$, $n^- = 7$.
Under $H_0$, $n^+ \sim \text{Binomial}(7, 0.5)$.
$P(n^+ \leq 0) = P(n^+ = 0) = \binom{7}{0}(0.5)^7 = 0.0078$
For a two-tailed test: $p = 2 \times 0.0078 = 0.016$.
Since $p = 0.016 < 0.05$, we reject $H_0$. The evidence suggests the pain protocol significantly reduces pain levels.
The sign test's superpower: It used almost no information about the data — just the direction of each change, not the magnitude. This makes it very broadly applicable but also relatively low in power. We'll return to the power tradeoff in Section 21.9.
from scipy import stats
# Maya's sign test
before = [7, 8, 5, 9, 6, 7, 8, 4, 6, 9]
after = [5, 6, 5, 4, 3, 7, 5, 2, 6, 7]
# The sign test in scipy is done via the binomial test
differences = [a - b for a, b in zip(after, before)]
non_zero = [d for d in differences if d != 0]
n_positive = sum(1 for d in non_zero if d > 0)
n_total = len(non_zero)
# Two-tailed binomial test
p_value = stats.binomtest(n_positive, n_total, 0.5,
alternative='two-sided').pvalue
print(f"Non-zero differences: {n_total}")
print(f"Positive: {n_positive}, Negative: {n_total - n_positive}")
print(f"Sign test p-value: {p_value:.4f}")
# Output:
# Non-zero differences: 7
# Positive: 0, Negative: 7
# Sign test p-value: 0.0156
21.6 The Wilcoxon Rank-Sum Test (Mann-Whitney U Test)
Now we're ready for the most commonly used nonparametric test for comparing two independent groups: the Wilcoxon rank-sum test, also known as the Mann-Whitney U test.
A note on names: These are the same test, just derived independently by different statisticians. Frank Wilcoxon published the rank-sum approach in 1945; Henry Mann and Donald Whitney extended it in 1947 with the "U" formulation. Most software (including Python's scipy) uses "Mann-Whitney U" as the function name, but the underlying logic is identical. In this book, I'll use both names interchangeably, just as you'll encounter in the wild.
What It Tests
The Wilcoxon rank-sum test is the nonparametric alternative to the two-sample t-test. Where the t-test asks "Do the two groups have the same mean?", the Wilcoxon rank-sum test asks "Do the two groups come from the same distribution?" — more specifically, it tests whether observations from one group tend to be larger than observations from the other.
Hypotheses:
- $H_0$: The two populations have the same distribution (the values from one group are equally likely to be larger or smaller than values from the other)
- $H_a$: The values from one group tend to be systematically larger (or smaller) than the other
The Conditions
- Independence between and within groups (same as t-test)
- Random sampling from each population
- Similar distribution shapes — the two populations should have roughly the same shape, just shifted horizontally (if you want to interpret the test as comparing medians; otherwise, it's a general "stochastic dominance" test)
- At least ordinal data — the observations need to be rankable
Notice what's not on that list: normality.
The Procedure
Step 1: State hypotheses.
Step 2: Combine all observations and assign ranks (Section 21.4).
Step 3: Compute the rank sum for each group. Call them $W_1$ and $W_2$.
Step 4: Compute the U statistic for each group:
$$U_1 = W_1 - \frac{n_1(n_1 + 1)}{2}$$
$$U_2 = W_2 - \frac{n_2(n_2 + 1)}{2}$$
Traditional tables use $U = \min(U_1, U_2)$; software typically computes the p-value directly (scipy, for example, reports the U for its first argument).
Quick math check: $U_1 + U_2 = n_1 \times n_2$ always. This is a useful verification.
Step 5: For large samples ($n_1, n_2 \geq 20$), the U statistic is approximately normal:
$$z = \frac{U - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}$$
For smaller samples, use exact tables or software.
Step 6: Find the p-value and make a decision.
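Here is the whole procedure carried out by hand on the recovery-time data from Section 21.4, checked against scipy (whose `mannwhitneyu` reports the U statistic for its first argument):

```python
import numpy as np
from scipy import stats

protocol_a = [12, 15, 18, 22, 45]
protocol_b = [8, 11, 14, 16, 20]
n1, n2 = len(protocol_a), len(protocol_b)

# Steps 2-3: rank the combined data and compute rank sums
ranks = stats.rankdata(np.concatenate([protocol_a, protocol_b]))
W1, W2 = ranks[:n1].sum(), ranks[n1:].sum()           # 34.0, 21.0

# Step 4: U statistics
U1 = W1 - n1 * (n1 + 1) / 2                           # 19.0
U2 = W2 - n2 * (n2 + 1) / 2                           # 6.0
print(U1 + U2 == n1 * n2)                             # True, always

# Step 5: the normal approximation (samples this small really need
# the exact distribution; shown here only to illustrate the formula)
z = (U1 - n1 * n2 / 2) / np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
print(round(z, 3))                                    # 1.358

# scipy's statistic matches the hand-computed U1
U_scipy, p = stats.mannwhitneyu(protocol_a, protocol_b)
print(U_scipy == U1)                                  # True
```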
Worked Example: Alex's Engagement Data
Alex wants to compare daily watch time (in minutes) for users who received a new recommendation algorithm (Group A) versus those who received the old one (Group B). The data are right-skewed — most users watch modest amounts, but some users binge heavily. With only 8 users per group, the CLT won't save a t-test.
Data:
- New Algorithm (A): 15, 22, 45, 30, 18, 120, 25, 35
- Old Algorithm (B): 12, 8, 20, 14, 28, 10, 16, 19
Notice the 120-minute value in Group A — a heavy outlier. In a t-test, this would dramatically inflate Group A's mean and standard deviation. Let's see how ranks handle it.
Step 1: Hypotheses.
- $H_0$: Watch times are distributed the same under both algorithms
- $H_a$: Watch times tend to be higher with the new algorithm (one-tailed)
Step 2: Combine and rank all 16 observations.
| Sorted Value | 8 | 10 | 12 | 14 | 15 | 16 | 18 | 19 | 20 | 22 | 25 | 28 | 30 | 35 | 45 | 120 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| Group | B | B | B | B | A | B | A | B | B | A | A | B | A | A | A | A |
Step 3: Rank sums.
- Group A ranks: 5, 7, 10, 11, 13, 14, 15, 16 $\Rightarrow W_A = 91$
- Group B ranks: 1, 2, 3, 4, 6, 8, 9, 12 $\Rightarrow W_B = 45$
Check: $91 + 45 = 136 = 16 \times 17 / 2$ ✓
Step 4: U statistics.
$$U_A = 91 - \frac{8 \times 9}{2} = 91 - 36 = 55$$
$$U_B = 45 - \frac{8 \times 9}{2} = 45 - 36 = 9$$
Check: $55 + 9 = 64 = 8 \times 8 = n_1 \times n_2$ ✓
Step 5 & 6: With $n_1 = n_2 = 8$, we use exact tables or software.
from scipy import stats
import numpy as np
# Alex's engagement data
new_algo = [15, 22, 45, 30, 18, 120, 25, 35]
old_algo = [12, 8, 20, 14, 28, 10, 16, 19]
# Mann-Whitney U test (one-tailed: new > old)
stat, p_value = stats.mannwhitneyu(new_algo, old_algo,
alternative='greater')
print(f"Mann-Whitney U statistic: {stat:.1f}")
print(f"p-value (one-tailed): {p_value:.4f}")
# For comparison, the two-sample t-test (Welch version, since the
# group standard deviations are wildly unequal)
t_stat, t_p = stats.ttest_ind(new_algo, old_algo,
                              equal_var=False,
                              alternative='greater')
print(f"\nTwo-sample t-test: t = {t_stat:.3f}, p = {t_p:.4f}")
# Descriptive stats showing why the t-test is unreliable here
print(f"\nNew algorithm: mean = {np.mean(new_algo):.1f}, "
      f"median = {np.median(new_algo):.1f}, "
      f"SD = {np.std(new_algo, ddof=1):.1f}")
print(f"Old algorithm: mean = {np.mean(old_algo):.1f}, "
      f"median = {np.median(old_algo):.1f}, "
      f"SD = {np.std(old_algo, ddof=1):.1f}")
Output:
Mann-Whitney U statistic: 55.0
p-value (one-tailed): 0.0074
Two-sample t-test: t = 1.858, p = 0.0520
New algorithm: mean = 38.8, median = 27.5, SD = 34.2
Old algorithm: mean = 15.9, median = 15.0, SD = 6.4
Alex's interpretation: "The Mann-Whitney U test shows a significant difference ($U = 55$, $p = 0.007$). Users with the new algorithm tend to have higher watch times than those with the old algorithm."
But notice something fascinating: the t-test gives $p = 0.052$ — not significant at $\alpha = 0.05$! The 120-minute outlier inflated Group A's standard deviation to 34.2 (compared to 6.4 for Group B), which inflated the standard error and pushed the t-test's p-value just above the threshold. The rank-based test, immune to this outlier inflation, correctly detected the real difference.
Key Insight: This isn't a case where one test is "right" and the other is "wrong." They're answering slightly different questions. The t-test compares means, and the mean of Group A is heavily influenced by the 120-minute outlier. The Mann-Whitney compares the general tendency of one group to produce larger values than the other — and that tendency is clear. When your data have heavy outliers and small samples, the nonparametric test can be more powerful than the parametric one, because it's not distracted by the outlier's effect on the mean and SD.
21.7 The Wilcoxon Signed-Rank Test
The Wilcoxon signed-rank test is the nonparametric alternative to the paired t-test from Chapter 16 (Section 16.4). Use it when you have paired data — before/after measurements, matched subjects, or repeated measures — but the differences aren't normally distributed or the data are ordinal.
Think of it as an improvement over the sign test: while the sign test only uses the direction of each difference (positive or negative), the signed-rank test also uses the magnitude — not the actual size of the difference, but its rank among all the differences.
The Procedure
Step 1: For each pair, compute the difference $d_i$.
Step 2: Drop any pairs where $d_i = 0$.
Step 3: Rank the absolute values of the remaining differences from smallest to largest.
Step 4: Assign each rank a positive (+) or negative (-) sign, matching the sign of the original difference.
Step 5: Compute $W^+ =$ sum of ranks with positive signs and $W^- =$ sum of ranks with negative signs.
Step 6: The test statistic is $W = \min(W^+, W^-)$ (or your software may report $W^+$ and compute the p-value directly).
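The six steps translate directly to code. A minimal sketch on a small set of hypothetical differences (the values here are made up purely for illustration):

```python
import numpy as np
from scipy import stats

d = np.array([2, -3, 5, 0, 5])        # Step 1: hypothetical differences

d = d[d != 0]                         # Step 2: drop zero differences
ranks = stats.rankdata(np.abs(d))     # Step 3: rank |d| (midranks for ties)
signed_ranks = np.sign(d) * ranks     # Step 4: reattach the signs

W_plus = ranks[d > 0].sum()           # Step 5: 1 + 3.5 + 3.5 = 8.0
W_minus = ranks[d < 0].sum()          #         2.0
W = min(W_plus, W_minus)              # Step 6: test statistic = 2.0
print(W_plus, W_minus, W)
```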
Worked Example: Maya's Pre/Post Treatment
Maya is testing whether a new community health intervention improves health assessment scores. She has pre-intervention and post-intervention scores for 12 patients:
| Patient | Pre | Post | Difference ($d_i$) | $|d_i|$ | Rank of $|d_i|$ | Signed Rank |
|---|---|---|---|---|---|---|
| 1 | 62 | 68 | +6 | 6 | 6 | +6 |
| 2 | 71 | 75 | +4 | 4 | 4 | +4 |
| 3 | 55 | 58 | +3 | 3 | 2 | +2 |
| 4 | 68 | 70 | +2 | 2 | 1 | +1 |
| 5 | 74 | 78 | +4 | 4 | 4 | +4 |
| 6 | 60 | 71 | +11 | 11 | 11 | +11 |
| 7 | 65 | 72 | +7 | 7 | 7.5 | +7.5 |
| 8 | 58 | 50 | -8 | 8 | 9 | -9 |
| 9 | 72 | 81 | +9 | 9 | 10 | +10 |
| 10 | 66 | 73 | +7 | 7 | 7.5 | +7.5 |
| 11 | 63 | 67 | +4 | 4 | 4 | +4 |
| 12 | 59 | 71 | +12 | 12 | 12 | +12 |
Note the tied absolute differences. Sorting the $|d_i|$ values makes the midranks clear:
| $|d_i|$ sorted | 2 | 3 | 4 | 4 | 4 | 6 | 7 | 7 | 8 | 9 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| Rank | 1 | 2 | 4 | 4 | 4 | 6 | 7.5 | 7.5 | 9 | 10 | 11 | 12 |
The three tied values of 4 occupy positions 3, 4, 5, so each gets rank $(3+4+5)/3 = 4$. The two tied values of 7 occupy positions 7 and 8, so each gets rank $(7+8)/2 = 7.5$.
Summing the ranks by sign:
$W^+ = 1 + 2 + 4 + 4 + 4 + 6 + 7.5 + 10 + 7.5 + 11 + 12 = 69$
$W^- = 9$
Check: $W^+ + W^- = 69 + 9 = 78 = \frac{12 \times 13}{2}$ ✓
from scipy import stats
# Maya's pre/post intervention data
pre = [62, 71, 55, 68, 74, 60, 65, 58, 72, 66, 63, 59]
post = [68, 75, 58, 70, 78, 71, 72, 50, 81, 73, 67, 71]
# Wilcoxon signed-rank test (two-tailed)
stat, p_value = stats.wilcoxon(post, pre, alternative='two-sided')
print(f"Wilcoxon signed-rank test:")
print(f" Test statistic: {stat:.1f}")
print(f" p-value: {p_value:.4f}")
# For comparison, the paired t-test
t_stat, t_p = stats.ttest_rel(post, pre)
print(f"\nPaired t-test:")
print(f" t = {t_stat:.3f}, p = {t_p:.4f}")
# Show the differences
import numpy as np
diffs = [p - pr for p, pr in zip(post, pre)]
print(f"\nDifferences: {diffs}")
print(f"Mean difference: {np.mean(diffs):.2f}")
print(f"Median difference: {np.median(diffs):.2f}")
print(f"Positive: {sum(1 for d in diffs if d > 0)}, "
f"Negative: {sum(1 for d in diffs if d < 0)}")
Output:
Wilcoxon signed-rank test:
Test statistic: 9.0
p-value: 0.0184
Paired t-test:
t = 3.401, p = 0.0059
Differences: [6, 4, 3, 2, 4, 11, 7, -8, 9, 7, 4, 12]
Mean difference: 5.08
Median difference: 5.00
Positive: 11, Negative: 1
Both tests agree: the intervention significantly improved health scores. But notice that 11 of 12 patients improved, and only one got worse. Even the sign test would give strong evidence here ($p = 2 \times P(\text{Binomial}(12, 0.5) \leq 1) = 2 \times 0.0032 = 0.0064$). When the evidence is this clear, all the tests agree. The differences emerge in borderline cases — and that's where your choice of method matters most.
Connection to Ch.16: The paired t-test (Section 16.4) reduces a paired-data problem to a one-sample t-test on the differences. The Wilcoxon signed-rank test does the same thing nonparametrically: it reduces the paired problem to a one-sample test on the signed ranks of the differences.
21.8 The Kruskal-Wallis Test: Nonparametric ANOVA
Spaced Review (SR.2) — ANOVA (Ch.20)
In Chapter 20, you learned ANOVA — the method for comparing means across three or more groups. ANOVA decomposes total variability into between-group and within-group components, and the F-statistic tests whether between-group variability exceeds what you'd expect from noise alone.
ANOVA's assumptions:
1. Independence
2. Normality within each group
3. Equal variances (homoscedasticity)
When these assumptions hold, ANOVA is the most powerful method for multi-group comparisons. But what happens when normality fails — especially with small samples where the CLT can't rescue you?
Enter the Kruskal-Wallis test.
The Kruskal-Wallis test is the nonparametric alternative to one-way ANOVA. Named after William Kruskal and W. Allen Wallis (1952), it extends the Wilcoxon rank-sum test from two groups to three or more.
How It Works
The logic mirrors the rank-sum test:
- Combine all observations from all groups into one list
- Rank the combined list from 1 to $N$
- Compute the average rank for each group
- Test whether the average ranks differ more than expected by chance
The test statistic is:
$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} n_i (\bar{R}_i - \bar{R})^2$$
where:
- $N$ = total number of observations
- $k$ = number of groups
- $n_i$ = size of group $i$
- $\bar{R}_i$ = mean rank in group $i$
- $\bar{R} = (N+1)/2$ = overall mean rank
Under $H_0$, $H$ approximately follows a $\chi^2$ distribution with $df = k - 1$ for moderate-to-large samples.
Notice the parallel to ANOVA: ANOVA compares group means of the raw data. Kruskal-Wallis compares group means of the ranks. ANOVA uses the F-distribution. Kruskal-Wallis uses the chi-square distribution. The underlying logic — are the groups more different than random noise would produce? — is the same.
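To see that the formula really is "ANOVA on ranks," here is $H$ computed by hand on three small hypothetical groups with no tied values (so the plain formula matches `scipy.stats.kruskal`, which additionally applies a tie correction when ties exist):

```python
import numpy as np
from scipy import stats

a, b, c = [1, 4, 9], [2, 6, 8], [3, 5, 7]      # hypothetical data, no ties
groups = [a, b, c]

ranks = stats.rankdata(np.concatenate(groups))
N = len(ranks)
R_bar = (N + 1) / 2                            # overall mean rank

# H = 12/(N(N+1)) * sum_i n_i * (mean rank of group i - R_bar)^2
H = 0.0
start = 0
for g in groups:
    r = ranks[start:start + len(g)]
    H += len(g) * (r.mean() - R_bar) ** 2
    start += len(g)
H *= 12 / (N * (N + 1))

H_scipy, p = stats.kruskal(a, b, c)
print(round(H, 4), round(H_scipy, 4))          # the two agree
```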
Hypotheses
- $H_0$: All groups come from the same distribution (the population distributions are identical)
- $H_a$: At least one group comes from a different distribution
Conditions
- Independence between and within groups
- Random samples from each population
- At least ordinal data (observations must be rankable)
- Similar distribution shapes (if you want to interpret the test as comparing medians)
- Each group should have $n_i \geq 5$ for the chi-square approximation to work well
Worked Example: Sam's Performance Ratings
Sam is comparing performance ratings (1-10 scale, given by scouts) for basketball players across three positions: Guards, Forwards, and Centers. These ratings are ordinal — a player rated 8 isn't necessarily "twice as good" as one rated 4 — so ANOVA on the raw ratings would be questionable.
Data:
| Guards | Forwards | Centers |
|---|---|---|
| 7 | 6 | 5 |
| 8 | 7 | 6 |
| 6 | 8 | 4 |
| 9 | 5 | 7 |
| 7 | 7 | 3 |
| 8 | 6 | 5 |
$n_1 = n_2 = n_3 = 6$, $N = 18$.
Step 1: Combine and rank all 18 observations.
| Sorted Value | 3 | 4 | 5 | 5 | 5 | 6 | 6 | 6 | 6 | 7 | 7 | 7 | 7 | 7 | 8 | 8 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rank | 1 | 2 | 4 | 4 | 4 | 7.5 | 7.5 | 7.5 | 7.5 | 12 | 12 | 12 | 12 | 12 | 16 | 16 | 16 | 18 |
Ties: the three 5s occupy positions 3, 4, 5, so each gets midrank 4. The four 6s occupy positions 6-9 (midrank 7.5), the five 7s occupy positions 10-14 (midrank 12), and the three 8s occupy positions 15-17 (midrank 16). In summary:
| Value | Positions | Midrank |
|---|---|---|
| 3 | 1 | 1 |
| 4 | 2 | 2 |
| 5 | 3, 4, 5 | 4 |
| 6 | 6, 7, 8, 9 | 7.5 |
| 7 | 10, 11, 12, 13, 14 | 12 |
| 8 | 15, 16, 17 | 16 |
| 9 | 18 | 18 |
Step 2: Assign ranks back to groups.
| Guards | Rank | Forwards | Rank | Centers | Rank |
|---|---|---|---|---|---|
| 7 | 12 | 6 | 7.5 | 5 | 4 |
| 8 | 16 | 7 | 12 | 6 | 7.5 |
| 6 | 7.5 | 8 | 16 | 4 | 2 |
| 9 | 18 | 5 | 4 | 7 | 12 |
| 7 | 12 | 7 | 12 | 3 | 1 |
| 8 | 16 | 6 | 7.5 | 5 | 4 |
Rank sums:
- Guards: $12 + 16 + 7.5 + 18 + 12 + 16 = 81.5$
- Forwards: $7.5 + 12 + 16 + 4 + 12 + 7.5 = 59$
- Centers: $4 + 7.5 + 2 + 12 + 1 + 4 = 30.5$
Check: $81.5 + 59 + 30.5 = 171 = \frac{18 \times 19}{2}$ ✓
Mean ranks:
- Guards: $81.5 / 6 = 13.58$
- Forwards: $59 / 6 = 9.83$
- Centers: $30.5 / 6 = 5.08$
Overall mean rank: $\bar{R} = (18 + 1)/2 = 9.5$
from scipy import stats
import numpy as np
# Sam's performance ratings
guards = [7, 8, 6, 9, 7, 8]
forwards = [6, 7, 8, 5, 7, 6]
centers = [5, 6, 4, 7, 3, 5]
# Kruskal-Wallis test
H_stat, p_value = stats.kruskal(guards, forwards, centers)
print(f"Kruskal-Wallis H = {H_stat:.3f}")
print(f"p-value = {p_value:.4f}")
print(f"df = {3 - 1}")
# For comparison, one-way ANOVA
F_stat, p_anova = stats.f_oneway(guards, forwards, centers)
print(f"\nOne-way ANOVA: F = {F_stat:.3f}, p = {p_anova:.4f}")
# Descriptive statistics
for name, data in [("Guards", guards),
("Forwards", forwards),
("Centers", centers)]:
d = np.array(data)
print(f"\n{name}: median = {np.median(d):.1f}, "
f"mean = {d.mean():.2f}, SD = {d.std(ddof=1):.2f}")
Output:
Kruskal-Wallis H = 7.952
p-value = 0.0188
df = 2
One-way ANOVA: F = 6.786, p = 0.0080
Guards: median = 7.5, mean = 7.50, SD = 1.05
Forwards: median = 6.5, mean = 6.50, SD = 1.05
Centers: median = 5.0, mean = 5.00, SD = 1.41
Sam's interpretation: "The Kruskal-Wallis test shows a significant difference in performance ratings across positions ($H = 7.95$, $df = 2$, $p = 0.019$). Guards tend to receive the highest ratings, followed by Forwards, then Centers."
Both the Kruskal-Wallis test and the ANOVA reject the null hypothesis here. This makes sense — the data aren't dramatically non-normal. When assumptions are reasonably met, the tests tend to agree. The Kruskal-Wallis earns its keep when they don't.
Post-Hoc Comparisons After Kruskal-Wallis
Just as a significant ANOVA is followed by Tukey's HSD (Chapter 20, Section 20.8), a significant Kruskal-Wallis can be followed by pairwise comparisons. The most common approach is to run pairwise Mann-Whitney U tests with a Bonferroni correction.
from scipy import stats
from itertools import combinations
# Post-hoc: pairwise Mann-Whitney U tests with Bonferroni correction
groups = {'Guards': guards, 'Forwards': forwards, 'Centers': centers}
pairs = list(combinations(groups.keys(), 2))
n_comparisons = len(pairs)
alpha_bonferroni = 0.05 / n_comparisons
print(f"\nPost-hoc Pairwise Mann-Whitney U Tests")
print(f"Bonferroni-corrected alpha: {alpha_bonferroni:.4f}")
print(f"{'Comparison':<25} {'U':>8} {'p (adj)':>10} {'Significant?':>14}")
print("-" * 60)
for g1, g2 in pairs:
stat, p = stats.mannwhitneyu(groups[g1], groups[g2],
alternative='two-sided')
adj_p = min(p * n_comparisons, 1.0) # Bonferroni-adjusted
sig = "Yes" if adj_p < 0.05 else "No"
print(f"{g1} vs. {g2:<12} {stat:>8.1f} {adj_p:>10.4f} {sig:>14}")
21.9 Parametric vs. Nonparametric: The Decision Framework
Spaced Review (SR.3) — Normality Assessment (Ch.10)
In Chapter 10 (Section 10.9), you learned tools for assessing normality:
- Histograms: Quick visual check for symmetry and bell shape
- QQ-plots: Compare your data quantiles to theoretical normal quantiles. Points on the line = normal. S-curves = heavy tails. Curved patterns = skewness.
- Shapiro-Wilk test: $H_0$: the data are normally distributed. Small $p$-value suggests non-normality.
The threshold concept from Chapter 10 was: "The question is never 'Is my data normal?' The question is: 'Is my data close enough to normal that the normal model gives useful answers?'"
Now that you have nonparametric alternatives, you have a concrete action plan for when the answer is "no, it's not close enough."
Here's the decision framework you've been building toward:
The Comparison Table
| Scenario | Parametric Test | Nonparametric Alternative | When to Choose Nonparametric |
|---|---|---|---|
| Two independent groups | Two-sample t-test (Ch.16) | Wilcoxon rank-sum / Mann-Whitney U | Small $n$ + non-normal; ordinal data; heavy outliers |
| Paired / matched data | Paired t-test (Ch.16) | Wilcoxon signed-rank test | Small $n$ + non-normal differences; ordinal data |
| Paired / matched data (simplest) | Paired t-test (Ch.16) | Sign test | Very small $n$; can only determine direction, not magnitude |
| Three or more groups | One-way ANOVA (Ch.20) | Kruskal-Wallis test | Small $n$ per group + non-normal; ordinal data |
| One sample, test median | One-sample t-test (Ch.15) | Wilcoxon signed-rank test (on deviations from hypothesized median) | Small $n$ + non-normal; ordinal data |
The Power Tradeoff
Here's the catch. Nonparametric tests are more robust — they work under weaker assumptions. But they're less powerful when the parametric assumptions actually hold.
What does "less powerful" mean? It means that if the data truly are normally distributed, the t-test has a higher probability of detecting a real difference than the Wilcoxon test does. The t-test uses more information from the data (exact values) while the Wilcoxon test discards some information by converting values to ranks.
How much power do you lose? For normally distributed data:
| Test Comparison | Asymptotic Relative Efficiency (ARE) |
|---|---|
| Mann-Whitney U vs. two-sample t-test | 0.955 (95.5% as efficient) |
| Wilcoxon signed-rank vs. paired t-test | 0.955 |
| Kruskal-Wallis vs. one-way ANOVA | 0.955 |
An ARE of 0.955 means the nonparametric test needs about $1/0.955 \approx 1.047$ times as many observations to achieve the same power as the parametric test. That's a remarkably small penalty — about 5% more data.
And here's the kicker: for heavy-tailed or skewed distributions, the nonparametric tests can actually be more powerful than their parametric counterparts. The ARE can exceed 1.0 — sometimes dramatically. The 120-minute outlier in Alex's data (Section 21.6) showed exactly this: the Mann-Whitney detected a significant difference ($p = 0.010$) that the t-test missed ($p = 0.056$).
Key Insight: The power tradeoff is not as dramatic as many textbooks suggest. The nonparametric tests lose only about 5% efficiency when the data are truly normal, and they can gain efficiency when the data are non-normal. This asymmetry makes nonparametric tests an excellent default choice for small samples where you can't assess normality reliably.
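You can see the efficiency claim in action with a small Monte Carlo sketch. The group size of 20, the shift of 0.8 SD, and the 2,000 replications below are illustrative choices, not figures from the chapter:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, reps, alpha, shift = 20, 2000, 0.05, 0.8  # shift in SD units

# Count rejections for each test across simulated normal datasets
reject_t = reject_mw = 0
for _ in range(reps):
    a = rng.normal(0.0, 1.0, n)     # control group
    b = rng.normal(shift, 1.0, n)   # shifted group
    if stats.ttest_ind(a, b).pvalue < alpha:
        reject_t += 1
    if stats.mannwhitneyu(a, b, alternative='two-sided').pvalue < alpha:
        reject_mw += 1

print(f"Estimated power, t-test:       {reject_t / reps:.3f}")
print(f"Estimated power, Mann-Whitney: {reject_mw / reps:.3f}")
```

On normal data the two power estimates should land within a few percentage points of each other, consistent with the ARE of 0.955. Swapping the normal draws for a heavy-tailed distribution (e.g., `rng.standard_t(2, n)`) typically tips the balance toward the Mann-Whitney.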
The Decision Flowchart
START: Comparing groups?
│
├── How many groups?
│ ├── 1 group (test against hypothesized value)
│ │ ├── n ≥ 30? → One-sample t-test (CLT applies)
│ │ ├── n < 30, data approx. normal? → One-sample t-test
│ │ └── n < 30, data non-normal or ordinal? → Wilcoxon signed-rank
│ │
│ ├── 2 groups
│ │ ├── Independent samples?
│ │ │ ├── Both n ≥ 30? → Two-sample t-test (CLT applies)
│ │ │ ├── Both n < 30, data approx. normal? → Two-sample t-test
│ │ │ └── Small n, non-normal, or ordinal? → Mann-Whitney U
│ │ │
│ │ └── Paired / matched data?
│ │ ├── n ≥ 30? → Paired t-test (CLT applies)
│ │ ├── n < 30, differences approx. normal? → Paired t-test
│ │ └── Small n, non-normal differences, or ordinal?
│ │ ├── Can measure differences? → Wilcoxon signed-rank
│ │ └── Can only determine direction? → Sign test
│ │
│ └── 3+ groups
│ ├── All n_i ≥ 15-20 or data approx. normal? → One-way ANOVA
│ └── Small groups, non-normal, or ordinal? → Kruskal-Wallis
│ └── If significant → Pairwise Mann-Whitney U with Bonferroni
│
└── Still unsure?
→ When in doubt, run BOTH parametric and nonparametric.
If they agree, report whichever matches your data better.
If they disagree, trust the nonparametric result and
investigate why they differ.
Practical Advice: Many experienced statisticians run both the parametric and nonparametric test as a robustness check. If they agree, great — report whichever is most appropriate for your data type. If they disagree, that disagreement itself is informative. It usually means the parametric test's assumptions are violated in a way that matters.
21.10 Maya's Full Analysis: Patient Satisfaction Scores
Let's bring it all together with a complete analysis.
Maya is comparing patient satisfaction scores (1-5 scale: Very Dissatisfied, Dissatisfied, Neutral, Satisfied, Very Satisfied) across three hospital departments: Emergency, Outpatient, and Inpatient.
This is a textbook case for nonparametric methods:
- Ordinal data. The satisfaction scale has a meaningful order, but the intervals aren't equal. Is the jump from "Neutral" to "Satisfied" the same as from "Satisfied" to "Very Satisfied"? Not necessarily.
- Moderate samples. Fifteen patients per department — too small to rely heavily on the CLT.
- Non-normal. Satisfaction data tends to be left-skewed (most people are satisfied, creating a ceiling effect).
import numpy as np
from scipy import stats
# ============================================================
# MAYA'S PATIENT SATISFACTION ANALYSIS
# ============================================================
# Satisfaction scores (1 = Very Dissatisfied, 5 = Very Satisfied)
emergency = [3, 2, 4, 3, 1, 4, 3, 2, 5, 3, 4, 2, 3, 4, 3]
outpatient = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 4]
inpatient = [3, 4, 3, 4, 5, 3, 2, 4, 3, 4, 3, 3, 4, 3, 4]
# Descriptive statistics
print("=" * 60)
print("MAYA'S ANALYSIS: Patient Satisfaction by Department")
print("=" * 60)
departments = {
'Emergency': emergency,
'Outpatient': outpatient,
'Inpatient': inpatient
}
print(f"\n{'Department':<15} {'n':>5} {'Median':>8} {'Mean':>8} "
f"{'SD':>8} {'Min':>6} {'Max':>6}")
print("-" * 58)
for name, data in departments.items():
d = np.array(data)
print(f"{name:<15} {len(d):>5} {np.median(d):>8.1f} "
f"{d.mean():>8.2f} {d.std(ddof=1):>8.2f} "
f"{d.min():>6} {d.max():>6}")
# Why NOT use ANOVA here?
print("\n--- Why Nonparametric? ---")
print("1. Ordinal data (1-5 scale): intervals not necessarily equal")
print("2. Data is discrete with only 5 possible values")
print("3. Distributions are skewed (ceiling effect in Outpatient)")
# Kruskal-Wallis test
H_stat, p_value = stats.kruskal(emergency, outpatient, inpatient)
print(f"\n--- Kruskal-Wallis Test ---")
print(f"H = {H_stat:.3f}")
print(f"df = 2")
print(f"p-value = {p_value:.4f}")
if p_value < 0.05:
print(f"\nResult: Significant at α = 0.05.")
print("At least one department differs in satisfaction scores.")
# Post-hoc: pairwise Mann-Whitney U tests
print(f"\n--- Post-Hoc: Pairwise Mann-Whitney U Tests ---")
print(f"Bonferroni-corrected α = {0.05/3:.4f}")
from itertools import combinations
pairs = list(combinations(departments.keys(), 2))
print(f"\n{'Comparison':<30} {'U':>8} {'p (adj)':>10} {'Sig?':>6}")
print("-" * 58)
for g1, g2 in pairs:
stat, p = stats.mannwhitneyu(
departments[g1], departments[g2],
alternative='two-sided'
)
adj_p = min(p * len(pairs), 1.0)
sig = "Yes" if adj_p < 0.05 else "No"
print(f"{g1} vs. {g2:<15} {stat:>8.1f} {adj_p:>10.4f} {sig:>6}")
# For comparison: ANOVA (technically inappropriate here)
print(f"\n--- For Comparison: ANOVA (not recommended) ---")
F_stat, p_anova = stats.f_oneway(emergency, outpatient, inpatient)
print(f"F(2, 42) = {F_stat:.3f}, p = {p_anova:.4f}")
print("(Note: ANOVA on ordinal data is technically inappropriate)")
Maya's interpretation for the hospital quality improvement committee:
"Patient satisfaction differs significantly across departments (Kruskal-Wallis $H = 11.33$, $df = 2$, $p = 0.003$). Post-hoc pairwise comparisons (Bonferroni-corrected) show that the Outpatient department has significantly higher satisfaction scores (median = 4) compared to both Emergency (median = 3, adjusted $p = 0.008$) and Inpatient (median = 3, adjusted $p = 0.035$). The difference between Emergency and Inpatient is not significant (adjusted $p = 0.81$)."
"I used the Kruskal-Wallis test rather than ANOVA because satisfaction scores are ordinal — the numerical labels indicate order but the intervals between them are not necessarily equal. A score of 4 ('Satisfied') is not necessarily twice as good as a score of 2 ('Dissatisfied')."
Maya's lesson for the committee: "The Outpatient department is doing something right — their satisfaction scores are consistently higher. I'd recommend studying their practices to see what could be adapted for Emergency and Inpatient care. Note that this analysis identifies the pattern, but understanding the causes requires digging into the qualitative feedback."
21.11 Alex's Analysis: Engagement Metrics with Skewed Data
Alex has a different problem. StreamVibe's data team wants to compare engagement metrics — specifically, session duration in minutes — between users who saw a personalized homepage versus those who saw a generic homepage. The data is continuous (not ordinal), but it's heavily right-skewed. Most users spend 5-20 minutes, but a small number of power users spend 60-180 minutes.
import numpy as np
from scipy import stats
# ============================================================
# ALEX'S ANALYSIS: Session Duration by Homepage Type
# ============================================================
# Session durations (minutes) — right-skewed data
# Most sessions are short, with occasional long sessions
personalized = [8, 12, 15, 22, 10, 45, 18, 7, 25, 14,
120, 11, 16, 9, 30]
generic = [5, 8, 10, 6, 12, 7, 15, 4, 9, 11,
6, 8, 14, 7, 10]
print("=" * 60)
print("ALEX'S ANALYSIS: Session Duration by Homepage Type")
print("=" * 60)
# Descriptive statistics
for name, data in [("Personalized", personalized),
("Generic", generic)]:
d = np.array(data, dtype=float)
print(f"\n{name}:")
print(f" n = {len(d)}, Mean = {d.mean():.1f}, "
f"Median = {np.median(d):.1f}")
print(f" SD = {d.std(ddof=1):.1f}, "
f"Min = {d.min():.0f}, Max = {d.max():.0f}")
# Check normality
print("\n--- Normality Check ---")
for name, data in [("Personalized", personalized),
("Generic", generic)]:
stat, p = stats.shapiro(data)
print(f"{name}: Shapiro-Wilk W = {stat:.4f}, "
f"p = {p:.4f} {'(non-normal)' if p < 0.05 else '(consistent with normal)'}")
# Mann-Whitney U test (nonparametric)
U_stat, p_mw = stats.mannwhitneyu(personalized, generic,
alternative='two-sided')
print(f"\n--- Mann-Whitney U Test ---")
print(f"U = {U_stat:.1f}")
print(f"p-value = {p_mw:.4f}")
# Two-sample t-test (parametric — for comparison)
t_stat, p_t = stats.ttest_ind(personalized, generic)
print(f"\n--- Two-Sample t-Test (for comparison) ---")
print(f"t = {t_stat:.3f}")
print(f"p-value = {p_t:.4f}")
# The key insight
print(f"\n--- Key Insight ---")
print(f"The 120-minute outlier in 'Personalized' inflates the")
print(f"mean ({np.mean(personalized):.1f} min) far above the "
f"median ({np.median(personalized):.1f} min).")
print(f"This inflates the standard deviation "
f"({np.std(personalized, ddof=1):.1f}) and makes")
print(f"the t-test less reliable with n = 15.")
print(f"\nThe Mann-Whitney U test, working with ranks,")
print(f"is unaffected by the outlier's exact value.")
Alex's takeaway for the StreamVibe team:
"When comparing session durations, use the Mann-Whitney U test rather than a t-test. Our engagement data is right-skewed — a few power users create extreme values that distort means and standard deviations. The Mann-Whitney test is robust to these outliers and provides more reliable comparisons. For this analysis, the nonparametric test gives much stronger evidence of a difference than the t-test, whose standard error is inflated by the power user's 120-minute session."
21.12 Sam's Analysis: Performance Ratings Across Positions
Sam is analyzing scout performance ratings for players across five positions on the Riverside Raptors. The ratings are on a 1-10 ordinal scale, and he has only six ratings per position.
import numpy as np
from scipy import stats
from itertools import combinations
# ============================================================
# SAM'S ANALYSIS: Performance Ratings by Position
# ============================================================
# Scout ratings (1-10 ordinal scale)
point_guards = [8, 7, 9, 8, 7, 8]
shooting_guards = [7, 6, 8, 7, 7, 6]
small_forwards = [6, 7, 5, 6, 7, 6]
power_forwards = [5, 6, 7, 5, 4, 6]
centers = [6, 5, 4, 5, 6, 5]
positions = {
'PG': point_guards,
'SG': shooting_guards,
'SF': small_forwards,
'PF': power_forwards,
'C': centers
}
print("=" * 60)
print("SAM'S ANALYSIS: Performance Ratings by Position")
print("=" * 60)
print(f"\n{'Position':<6} {'n':>4} {'Median':>8} {'Mean':>8} "
f"{'SD':>8} {'Range':>8}")
print("-" * 45)
for name, data in positions.items():
d = np.array(data)
print(f"{name:<6} {len(d):>4} {np.median(d):>8.1f} "
f"{d.mean():>8.2f} {d.std(ddof=1):>8.2f} "
f"{f'{d.min()}-{d.max()}':>8}")
# Kruskal-Wallis test
H_stat, p_value = stats.kruskal(*positions.values())
print(f"\n--- Kruskal-Wallis Test ---")
print(f"H = {H_stat:.3f}")
print(f"df = {len(positions) - 1}")
print(f"p-value = {p_value:.4f}")
if p_value < 0.05:
print("\nSignificant — at least one position differs.")
# Post-hoc pairwise comparisons
pairs = list(combinations(positions.keys(), 2))
n_comp = len(pairs)
print(f"\n--- Post-Hoc: Pairwise Mann-Whitney U ---")
print(f"Number of comparisons: {n_comp}")
print(f"Bonferroni-corrected α = {0.05/n_comp:.4f}")
print(f"\n{'Comparison':<10} {'U':>8} {'p (adj)':>10} {'Sig?':>6}")
print("-" * 38)
for g1, g2 in pairs:
stat, p = stats.mannwhitneyu(
positions[g1], positions[g2],
alternative='two-sided'
)
adj_p = min(p * n_comp, 1.0)
sig = "*" if adj_p < 0.05 else ""
print(f"{g1} vs {g2:<4} {stat:>8.1f} {adj_p:>10.4f} {sig:>6}")
print(f"\n--- Why Nonparametric? ---")
print("1. Ordinal scale (1-10 scout ratings)")
print("2. Small samples (6 per group)")
print("3. Can't meaningfully assess normality with n = 6")
Sam's scouting report:
"Performance ratings differ significantly across positions (Kruskal-Wallis $H = 18.33$, $df = 4$, $p = 0.001$). Point Guards received the highest ratings (median = 8.0), while Power Forwards (median = 5.5) and Centers (median = 5.0) received the lowest. Post-hoc comparisons with Bonferroni correction across all 10 pairs show a significant difference only between PG and C; PG vs. PF falls just short of the corrected threshold. This could reflect genuine position-specific performance differences or rating bias by scouts — further investigation is needed."
21.13 Excel: Nonparametric Methods (Limited Support)
Here's an honest assessment: Excel's built-in tools have limited support for nonparametric tests.
The Data Analysis ToolPak does not include the Mann-Whitney U, Wilcoxon signed-rank, or Kruskal-Wallis tests. If you need nonparametric analysis in Excel, your options are:
Option 1: Manual Calculation with RANK Functions
You can use Excel's RANK.AVG() function to assign ranks (handling ties with average ranks), then manually compute test statistics. This works but is tedious and error-prone for large datasets.
Quick example for a Mann-Whitney U test:
1. Enter Group A data in column A and Group B data in column B
2. Combine all data in column C
3. In column D, use =RANK.AVG(C1, $C$1:$C$16, 1) (ascending ranks)
4. Sum the ranks for each group using SUMPRODUCT() or conditional sums
5. Compute U statistics with the formulas from Section 21.6
6. For p-values, you'll need to look up critical values in a table or use the normal approximation
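If you want to sanity-check the spreadsheet arithmetic, the same steps can be sketched in Python. The two groups here are hypothetical stand-ins for columns A and B, and the normal approximation in step 6 omits the continuity and tie corrections for simplicity:

```python
import numpy as np
from scipy.stats import rankdata, norm

# Hypothetical data standing in for columns A and B
group_a = [12, 15, 11, 18, 14, 16, 13, 17]
group_b = [10, 9, 14, 8, 11, 12, 7, 13]
n1, n2 = len(group_a), len(group_b)

# Steps 2-3: combine and rank (the RANK.AVG equivalent: midranks for ties)
ranks = rankdata(np.concatenate([group_a, group_b]))
R1 = ranks[:n1].sum()

# Steps 4-5: U statistics from the rank sum
U1 = R1 - n1 * (n1 + 1) / 2
U2 = n1 * n2 - U1

# Step 6: normal approximation for the two-sided p-value
mu = n1 * n2 / 2
sigma = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (min(U1, U2) - mu) / sigma
p_two_sided = 2 * norm.cdf(z)
print(f"U = {min(U1, U2):.0f}, p ≈ {p_two_sided:.4f}")
```

Note that `scipy.stats.mannwhitneyu` reports the U statistic of its first argument, which may be $n_1 n_2$ minus the smaller U computed here; the p-values agree either way.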
Option 2: Add-in Packages
Several free or commercial add-ins provide nonparametric tests:
- Real Statistics Resource Pack (free): Includes Mann-Whitney, Wilcoxon signed-rank, Kruskal-Wallis, and many more
- XLSTAT (commercial): Comprehensive nonparametric suite
- Analyse-it (commercial): Medical and scientific statistics add-in
Option 3: Use Python for Nonparametric Tests
This is our recommendation. Python's scipy.stats module provides clean, reliable implementations of all the nonparametric tests in this chapter with a single function call. Given that you've been building Python skills since Chapter 3, this is the most practical approach.
Honest Assessment: Excel is perfectly adequate for descriptive statistics, t-tests, and ANOVA through the Data Analysis ToolPak. But for nonparametric methods, Python (or R) is the better tool. This is one area where the investment in learning to code really pays off.
21.14 Common Mistakes and Misconceptions
| Mistake | Correction |
|---|---|
| "Nonparametric tests have no assumptions" | They do — independence, similar shapes, and rankable data. They just don't require normality. |
| "Always use nonparametric tests to be safe" | When assumptions hold, parametric tests are more powerful. Nonparametric tests discard information (exact values → ranks). Use the right tool for your data. |
| "The Mann-Whitney U test compares medians" | Not exactly. It tests whether observations from one group tend to be larger than the other (stochastic dominance). It compares medians only when the distributions have similar shapes. |
| "Ordinal data requires nonparametric tests" | This is a strong recommendation, not an absolute rule. Many researchers compute means of Likert-scale data when there are many response categories (e.g., 1-7 or 1-10). With 1-5 scales, nonparametric is strongly preferred. |
| "If both tests agree, it doesn't matter which I report" | Report the one appropriate for your data type. If data are ordinal, report the nonparametric test. If data are continuous and normal, report the parametric test. |
| "A non-significant Kruskal-Wallis result means no pairwise comparisons" | Correct — just as with ANOVA, don't run post-hoc tests after a non-significant omnibus test. |
21.15 When Both Approaches Converge — and When They Don't
Let's be precise about when parametric and nonparametric tests give similar results, and when they diverge.
They tend to agree when:
- Sample sizes are moderate to large ($n \geq 20$ per group)
- The data are approximately normal or at least symmetric
- There are no extreme outliers
- The data are on an interval or ratio scale
They tend to disagree when:
- Sample sizes are small ($n < 15$) and the data are skewed
- Heavy outliers are present (these inflate the parametric test's standard error)
- The data are ordinal (means may not be meaningful)
- The distributions have very different shapes across groups
When they disagree, it's usually because the parametric test's assumptions are violated. In such cases, the nonparametric result is generally more trustworthy.
Research Perspective: Some statisticians advocate a "pre-test" approach: test for normality first, then choose the test. This is controversial because (a) normality tests have low power with small samples (when you need them most), and (b) this two-step procedure complicates the overall Type I error rate. A better approach: decide in advance based on your data type, expected distribution shape, and sample size. If you have ordinal data, choose nonparametric. If you expect outliers, choose nonparametric. If you have large samples of continuous, approximately normal data, choose parametric.
21.16 Theme 4: Robust Methods for Messy Reality
Throughout this textbook, we've been building a theme: uncertainty is not failure — it's information. Nonparametric methods embody a related principle: real data is messy, and good statistics adapts to the mess.
Parametric methods work beautifully when their assumptions hold. They're elegant, efficient, and well-understood. But the real world doesn't always produce bell-shaped data with no outliers. Patients rate their pain on ordinal scales. Income distributions have billionaire outliers. Small clinical trials don't produce enough data for the CLT to kick in.
Nonparametric methods are the statistical equivalent of a tool that works in rough conditions. They sacrifice a small amount of efficiency (about 5% for normal data) in exchange for reliability across a much wider range of situations. That's a trade that's almost always worth making when you can't verify the parametric assumptions.
The lesson isn't "always use nonparametric tests." It's "have both tools in your toolbox, and choose based on your data."
Key Takeaway: The best statisticians don't have a single favorite method. They have a toolkit and the judgment to choose the right tool for each situation. After this chapter, your toolkit is significantly larger.
21.17 Progressive Project: Apply a Nonparametric Test
It's time to apply nonparametric methods to your Data Detective Portfolio.
Your Task
1. Review your dataset. Identify at least one comparison where nonparametric methods might be more appropriate than the parametric tests you've used so far. Look for:
   - Ordinal variables (Likert scales, ratings, rankings)
   - Numerical variables with heavy skew or outliers
   - Comparisons with small sample sizes per group
2. Assess normality for the relevant variable(s):
   - Create QQ-plots (Chapter 10)
   - Run the Shapiro-Wilk test per group
   - Check for outliers using box plots
3. Run both the parametric and nonparametric test:
   - Two independent groups → t-test AND Mann-Whitney U
   - Paired data → paired t-test AND Wilcoxon signed-rank
   - Three or more groups → ANOVA AND Kruskal-Wallis
4. Compare the results: Do they agree? If they disagree, explain why.
5. Report the appropriate test based on your data type and the assumption checks.
Starter Code
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
# Load your dataset
df = pd.read_csv('your_dataset.csv')
# Choose your grouping variable and outcome variable
group_var = 'your_group_variable'
outcome_var = 'your_outcome_variable'
# ---- Step 1: Assess normality per group ----
groups = df.groupby(group_var)[outcome_var]
fig, axes = plt.subplots(1, len(groups), figsize=(15, 4))
for ax, (name, group) in zip(axes, groups):
stats.probplot(group.dropna(), plot=ax)
ax.set_title(f'QQ-Plot: {name}')
w, p = stats.shapiro(group.dropna())
ax.text(0.05, 0.95, f'Shapiro p = {p:.4f}',
transform=ax.transAxes, va='top')
plt.tight_layout()
plt.savefig('normality_check.png', dpi=150)
plt.show()
# ---- Step 2: Run both parametric and nonparametric ----
group_data = [group.dropna().values for name, group in groups]
if len(group_data) == 2:
# Parametric
t_stat, p_t = stats.ttest_ind(group_data[0], group_data[1])
print(f"Two-sample t-test: t = {t_stat:.3f}, p = {p_t:.4f}")
# Nonparametric
u_stat, p_u = stats.mannwhitneyu(group_data[0], group_data[1],
alternative='two-sided')
print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {p_u:.4f}")
else:
# Parametric
F_stat, p_F = stats.f_oneway(*group_data)
print(f"ANOVA: F = {F_stat:.3f}, p = {p_F:.4f}")
# Nonparametric
H_stat, p_H = stats.kruskal(*group_data)
print(f"Kruskal-Wallis: H = {H_stat:.3f}, p = {p_H:.4f}")
# ---- Step 3: Compare and interpret ----
print("\nDo the tests agree? If not, why?")
Add this analysis to your Jupyter notebook under a new heading: "Nonparametric Alternative: When Assumptions Fail."
21.18 Chapter Summary
What We Learned
This chapter introduced nonparametric methods — your toolkit for when the assumptions behind t-tests and ANOVA break down. Here's the arc:
1. When to go nonparametric: Small samples with non-normal data, ordinal data, heavy outliers, or unknown distribution shapes.
2. The ranking procedure: Replace data values with their ranks. This eliminates sensitivity to outliers, removes the normality requirement, and works naturally with ordinal data.
3. The sign test: The simplest nonparametric test for paired data — just count how many differences are positive and negative. Uses binomial probabilities under $H_0$.
4. The Wilcoxon rank-sum (Mann-Whitney U) test: The nonparametric two-sample t-test. Combine observations, rank them, compare the rank sums. Immune to outliers.
5. The Wilcoxon signed-rank test: The nonparametric paired t-test. Rank the absolute differences and retain their signs. More powerful than the sign test because it uses magnitude information.
6. The Kruskal-Wallis test: The nonparametric one-way ANOVA. Extends the rank-sum approach to three or more groups. Follow up with pairwise Mann-Whitney tests (Bonferroni-corrected) if significant.
7. The power tradeoff: Nonparametric tests lose about 5% efficiency when data are truly normal, but can gain efficiency with non-normal data. A remarkably fair deal.
Key Python Functions
| Test | Python Function |
|---|---|
| Sign test | stats.binomtest(n_positive, n_total, 0.5) |
| Mann-Whitney U | stats.mannwhitneyu(group1, group2) |
| Wilcoxon signed-rank | stats.wilcoxon(x, y) |
| Kruskal-Wallis | stats.kruskal(group1, group2, group3, ...) |
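For example, the sign test from the first row is just a binomial test on the counts of positive differences. The counts here are hypothetical, purely to show the call:

```python
from scipy import stats

# Hypothetical paired study: 10 of 12 non-zero differences are positive
n_positive, n_total = 10, 12

# Sign test: under H0, positives ~ Binomial(n_total, 0.5)
result = stats.binomtest(n_positive, n_total, 0.5)
print(f"Sign test p-value = {result.pvalue:.4f}")  # ≈ 0.0386
```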
Connections to What's Next
In Chapter 22, you'll begin the final major statistical technique: regression. Regression is fundamentally about predicting a numerical outcome from one or more predictors. And just as the methods in this chapter provided nonparametric alternatives to t-tests and ANOVA, there are robust regression techniques for when the usual regression assumptions fail. The theme of "robust methods for messy reality" will continue.
The $\eta^2$ effect size from ANOVA (Chapter 20) will reappear as $R^2$ in regression — the proportion of variance explained by the predictor(s). The decomposition idea ($SS_T = SS_B + SS_W$) becomes $SS_T = SS_{\text{Regression}} + SS_{\text{Residual}}$. The conceptual thread connecting these ideas is the same one from this chapter: understanding what your data can and cannot tell you, and choosing methods that match your data's reality.
What's Next: Chapter 22 introduces correlation and simple linear regression — the tools for understanding relationships between two numerical variables. You'll learn to fit a line to data, interpret slopes and intercepts, and assess whether the relationship is statistically significant. If ANOVA asked "do group means differ?", regression asks "does this variable predict that one?"
"The great thing about nonparametric tests is that they meet the data where it is, rather than demanding the data meet them where they'd like it to be." — Adapted from the spirit of Frank Wilcoxon