Exercises: Nonparametric Methods
These exercises progress from conceptual understanding through hand calculations, Python implementation, and choosing between parametric and nonparametric approaches. Estimated completion time: 3 hours.
Difficulty Guide:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)
Part A: Conceptual Understanding ⭐
A.1. In your own words, explain the difference between a parametric test and a nonparametric test. What do parametric tests assume that nonparametric tests don't?
A.2. Explain why converting data to ranks makes a test "outlier-resistant." Use a specific numerical example: suppose you have the values {5, 8, 12, 15, 200}. What are the ranks? What happens to the ranks if 200 is changed to 2,000?
A.3. True or false (explain each):
(a) Nonparametric tests have no assumptions at all.
(b) The Mann-Whitney U test is specifically a test about medians.
(c) You can use the Wilcoxon signed-rank test for independent (unpaired) samples.
(d) The Kruskal-Wallis test is the nonparametric alternative to the paired t-test.
(e) If you have a sample of 500 observations, there's almost never a reason to use a nonparametric test.
(f) The sign test uses less information from the data than the Wilcoxon signed-rank test.
A.4. A researcher has satisfaction ratings (1 = Very Dissatisfied to 5 = Very Satisfied) for customers of two different stores. She runs a two-sample t-test and finds $p = 0.032$. Should she trust this result? Why or why not? What test would you recommend instead?
A.5. Explain the "power tradeoff" between parametric and nonparametric tests. When is the tradeoff favorable for nonparametric methods? When is it unfavorable?
A.6. Why are nonparametric tests sometimes called "distribution-free" methods? In what sense are they free of distributional assumptions, and in what sense are they not?
Part B: The Ranking Procedure ⭐
B.1. Rank the following data from smallest to largest. Handle ties using average (midrank) assignment.
Data: 14, 8, 22, 8, 17, 31, 14, 22, 5, 22
Verify that the sum of your ranks equals $N(N+1)/2$.
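Hint: the midrank procedure and the rank-sum check can be sketched in a few lines. This is a minimal illustration on made-up values (deliberately not the B.1 data), assuming `scipy.stats.rankdata` with `method="average"` matches the tie-handling rule used in this chapter.

```python
from scipy.stats import rankdata

# Hypothetical illustration data (not the exercise data)
values = [3, 7, 7, 10, 12, 12, 12]

# method="average" gives tied values the mean of the ranks they occupy
ranks = rankdata(values, method="average")

n = len(values)
expected_total = n * (n + 1) / 2  # the rank sum must equal N(N+1)/2

print("ranks:", ranks)
print("sum of ranks:", ranks.sum(), "expected:", expected_total)
```

The two 7s share ranks 2 and 3, so each gets 2.5; the three 12s share ranks 5, 6, and 7, so each gets 6. Use this only to check your hand-ranked answer.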
B.2. Two groups of students received the following scores on a practical exam:
- Group A: 72, 85, 68, 91, 77
- Group B: 64, 78, 82, 60, 88
(a) Combine the data and assign ranks. (b) Compute the rank sum for each group. (c) Verify: $W_A + W_B = N(N+1)/2$.
B.3. The following paired data represents "before" and "after" measurements:
| Pair | Before | After | Difference |
|---|---|---|---|
| 1 | 45 | 52 | |
| 2 | 38 | 42 | |
| 3 | 51 | 48 | |
| 4 | 44 | 44 | |
| 5 | 40 | 47 | |
| 6 | 55 | 60 | |
(a) Compute the differences (After - Before). (b) Drop any zero differences. (c) Rank the absolute differences. (d) Assign positive and negative signs to the ranks. (e) Compute $W^+$ and $W^-$.
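Hint: steps (a) through (e) map directly onto array operations. The sketch below uses made-up before/after values (not the B.3 data) and assumes NumPy and `scipy.stats.rankdata` are available; the variable names are illustrative only.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical paired measurements for illustration only
before = np.array([10, 12, 9, 14, 11])
after = np.array([13, 12, 8, 19, 13])

diff = after - before              # (a) differences: [3, 0, -1, 5, 2]
diff = diff[diff != 0]             # (b) drop zeros:  [3, -1, 5, 2]
ranks = rankdata(np.abs(diff))     # (c) rank |d|, midranks for ties
signed_ranks = np.sign(diff) * ranks  # (d) attach the sign of each difference

w_plus = ranks[diff > 0].sum()     # (e) sum of ranks of positive differences
w_minus = ranks[diff < 0].sum()    #     sum of ranks of negative differences

print("signed ranks:", signed_ranks)
print("W+ =", w_plus, "W- =", w_minus)
```

A quick sanity check: $W^+ + W^-$ should equal $n(n+1)/2$, where $n$ is the number of nonzero differences (here 4, giving 10).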
Part C: The Sign Test ⭐⭐
C.1. A new tutoring method is tested on 9 students. Their test scores before and after tutoring are:
| Student | Before | After |
|---|---|---|
| 1 | 72 | 78 |
| 2 | 65 | 70 |
| 3 | 80 | 78 |
| 4 | 58 | 65 |
| 5 | 74 | 80 |
| 6 | 69 | 69 |
| 7 | 82 | 85 |
| 8 | 71 | 76 |
| 9 | 67 | 72 |
(a) Compute the sign of each difference (After - Before). Drop zeros. (b) Count $n^+$ and $n^-$. (c) Under $H_0$ (the tutoring has no effect), what distribution does $n^+$ follow? (d) Compute the two-tailed p-value. (e) At $\alpha = 0.05$, is the tutoring method significantly effective?
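Hint: once you have $n^+$ and $n^-$, the two-tailed p-value is an exact binomial tail calculation. The sketch below uses made-up counts (not the C.1 answer) and only the standard library; it assumes the usual convention of doubling the tail probability of the larger count.

```python
from math import comb

# Hypothetical counts for illustration (not the exercise data)
n_plus, n_minus = 6, 1
n = n_plus + n_minus  # zero differences have already been dropped

# Under H0 each nonzero difference is positive with probability 0.5,
# so sum the binomial upper tail starting at the larger count.
k = max(n_plus, n_minus)
upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# Double the tail for a two-tailed test, capping at 1
p_two_tailed = min(1.0, 2 * upper_tail)
print(f"n+ = {n_plus}, n- = {n_minus}, two-tailed p = {p_two_tailed:.4f}")
```

With these illustrative counts the tail is $(7 + 1)/128$, so the two-tailed p-value is 0.125. Recent SciPy versions also provide `scipy.stats.binomtest`, which should give the same value for this symmetric case.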
C.2. Ten patients rate their pain level (0-10) before and after a new treatment. All ten show improvement (all differences are negative). Using the sign test with $H_a$: treatment reduces pain (one-tailed):
(a) What is the p-value? (b) Is this significant at $\alpha = 0.05$? (c) A colleague says, "We don't need a statistical test — obviously the treatment works if all ten improved." How would you respond?
C.3. Explain why the sign test is less powerful than the Wilcoxon signed-rank test. What information does the sign test discard that the signed-rank test retains? Give a specific example where this matters.
Part D: Mann-Whitney U / Wilcoxon Rank-Sum ⭐⭐
D.1. A hospital compares recovery times (days) for patients who received physical therapy (Group A) versus those who didn't (Group B):
- Group A (PT): 5, 8, 12, 7, 15
- Group B (No PT): 10, 18, 25, 14, 20
(a) State the null and alternative hypotheses (two-tailed). (b) Combine the data and assign ranks. (c) Compute $W_A$ and $W_B$. (d) Compute $U_A$ and $U_B$. (e) Verify that $U_A + U_B = n_1 \times n_2$. (f) What do the rank sums suggest about the effectiveness of physical therapy?
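Hint: the relationship between rank sums and U statistics can be checked numerically. This sketch uses made-up groups (not the D.1 data) and assumes the convention $U = W - n(n+1)/2$ for each group; some textbooks use the complementary form, which swaps the two U values but leaves their sum unchanged.

```python
from scipy.stats import rankdata

# Hypothetical groups for illustration only
x = [3, 6, 9]
y = [4, 10, 12, 15]
n1, n2 = len(x), len(y)

# Rank the combined sample, then split the ranks back by group
ranks = rankdata(x + y)
w_x = ranks[:n1].sum()
w_y = ranks[n1:].sum()

# Assumed convention: subtract the smallest possible rank sum for each group
u_x = w_x - n1 * (n1 + 1) / 2
u_y = w_y - n2 * (n2 + 1) / 2

print(f"W_x = {w_x}, W_y = {w_y}")
print(f"U_x = {u_x}, U_y = {u_y}, U_x + U_y = {u_x + u_y}, n1*n2 = {n1 * n2}")
```

Under this convention $U_x$ counts the (x, y) pairs in which the x value is larger, which is why the two U values must add up to $n_1 \times n_2$.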
D.2. Using the data from D.1, compute the test using Python:
from scipy import stats
group_a = [5, 8, 12, 7, 15]
group_b = [10, 18, 25, 14, 20]
# Run Mann-Whitney U test (two-tailed)
stat, p = stats.mannwhitneyu(group_a, group_b,
                             alternative='two-sided')
print(f"U = {stat}, p = {p:.4f}")
# Also run the two-sample t-test for comparison
t_stat, t_p = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {t_p:.4f}")
(a) Do the two tests agree on significance at $\alpha = 0.05$? (b) Which test is more appropriate here and why?
D.3. Alex compares average daily streaming minutes for users in two age groups:
- 18-24: 45, 62, 38, 150, 55, 41, 70, 48, 200, 52
- 25-34: 30, 42, 35, 28, 48, 33, 40, 37, 31, 44
(a) Without running any test, explain why a nonparametric test might be more appropriate here. (Hint: look at the data distributions.) (b) Use Python to run both the Mann-Whitney U and the two-sample t-test. (c) Do the tests agree? If not, explain why. (d) Which result would you report and why?
D.4. Sam compares free-throw percentages for two players over their last 8 games each:
- Daria: 85, 90, 75, 88, 92, 78, 85, 80
- Other player: 70, 75, 82, 68, 78, 72, 76, 74
The data are approximately normal (check with Shapiro-Wilk). Run both tests. In this case, which test is preferable and why?
Part E: Wilcoxon Signed-Rank Test ⭐⭐
E.1. A weight-loss program is tested on 8 participants. Here are their weights (kg) before and after:
| Participant | Before | After | Difference |
|---|---|---|---|
| 1 | 95 | 90 | |
| 2 | 82 | 80 | |
| 3 | 110 | 105 | |
| 4 | 78 | 79 | |
| 5 | 88 | 82 | |
| 6 | 92 | 87 | |
| 7 | 105 | 100 | |
| 8 | 85 | 84 | |
(a) Compute the differences (After - Before). (b) Rank the absolute differences, handling ties. (c) Compute $W^+$ and $W^-$. (d) What is the test statistic? (e) Use Python to find the p-value. Is the weight loss significant at $\alpha = 0.05$?
E.2. Maya measures patient anxiety levels (1-10 scale) before and after a relaxation protocol for 10 patients:
- Before: 8, 7, 9, 6, 8, 7, 9, 5, 8, 7
- After: 6, 5, 7, 5, 6, 6, 7, 4, 5, 6
Use Python to run both the Wilcoxon signed-rank test and the paired t-test. Which is more appropriate for this data and why?
E.3. Explain why the Wilcoxon signed-rank test is more powerful than the sign test for the same data. What additional information does it use?
Part F: Kruskal-Wallis Test ⭐⭐
F.1. A company tests three different training methods and measures employee performance ratings (1-10 ordinal scale):
- Method A: 7, 8, 6, 9, 7
- Method B: 5, 6, 7, 5, 6
- Method C: 8, 9, 7, 8, 9
(a) Explain why Kruskal-Wallis is more appropriate than ANOVA here. (b) Use Python to run both the Kruskal-Wallis and ANOVA tests. (c) If the Kruskal-Wallis test is significant, run post-hoc pairwise Mann-Whitney tests with Bonferroni correction.
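Hint: one possible structure for the post-hoc step in part (c) is a loop over all pairs with each p-value multiplied by the number of comparisons. The sketch below runs on made-up groups (labels `X`, `Y`, `Z` and their values are hypothetical, not the F.1 data) and assumes `scipy.stats.mannwhitneyu` with a two-sided alternative.

```python
from itertools import combinations
from scipy import stats

# Hypothetical groups for illustration only
groups = {
    "X": [1, 2, 3, 4, 5],
    "Y": [6, 7, 8, 9, 10],
    "Z": [2, 4, 6, 8, 11],
}

pairs = list(combinations(groups, 2))
m = len(pairs)  # number of pairwise comparisons (3 for three groups)

adjusted = {}
for a, b in pairs:
    _, p = stats.mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    # Bonferroni: multiply each raw p-value by m, capped at 1
    adjusted[(a, b)] = min(1.0, p * m)

for pair, p_adj in adjusted.items():
    print(pair, f"adjusted p = {p_adj:.4f}")
```

Comparing each adjusted p-value to $\alpha$ is equivalent to comparing each raw p-value to $\alpha / m$. Run the pairwise step only after a significant overall test.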
F.2. Sam analyzes player shooting accuracy (percentage) across four game situations: Home, Away, Playoffs, and Practice.
home = [52, 48, 55, 50, 53, 47]
away = [45, 42, 48, 40, 44, 46]
playoffs = [38, 42, 35, 40, 37, 41]
practice = [60, 65, 58, 62, 64, 59]
(a) Run the Kruskal-Wallis test. (b) If significant, which pairs differ (Bonferroni-corrected)? (c) Why might nonparametric methods be preferred here even though the data are continuous?
F.3. A medical study compares pain scores (0-10 scale, ordinal) after three different surgical techniques:
- Technique 1 (n = 12): 3, 5, 4, 6, 3, 4, 5, 7, 4, 3, 5, 4
- Technique 2 (n = 10): 2, 3, 2, 4, 3, 2, 3, 4, 2, 3
- Technique 3 (n = 11): 5, 6, 7, 5, 8, 6, 7, 5, 6, 7, 6
(a) Why is the Kruskal-Wallis test appropriate here? (Identify multiple reasons.) (b) Run the test in Python and interpret the results. (c) If significant, which techniques differ? (d) Report the results as Maya would present them to a clinical team.
Part G: Choosing Between Parametric and Nonparametric ⭐⭐
G.1. For each of the following scenarios, recommend either a parametric or nonparametric test. Specify which test and explain your reasoning.
(a) Comparing systolic blood pressure (mmHg) in two groups, $n_1 = 50$, $n_2 = 48$, both approximately normal.
(b) Comparing customer satisfaction ratings (1-5 stars) between two stores, $n_1 = 15$, $n_2 = 18$.
(c) Comparing income across four education levels, $n = 200$ per group but data is extremely right-skewed.
(d) Comparing reaction time before and after caffeine for 12 participants, differences are approximately normal.
(e) Comparing pain scores (0-10) across three treatment groups, $n = 8$ per group, ordinal data.
(f) Comparing SAT scores across five high schools, $n = 100$ per school, approximately normal.
G.2. A researcher presents these Shapiro-Wilk test results:
- Group 1: $W = 0.98$, $p = 0.42$
- Group 2: $W = 0.71$, $p = 0.002$
Both groups have $n = 12$. She wants to compare the group means. What test should she use? Explain the logic.
G.3. A colleague argues: "I always run nonparametric tests because they have fewer assumptions. That way I'm always safe." Write a brief response explaining why this approach is not optimal.
Part H: Integration and Application ⭐⭐⭐
H.1. Alex has the following data comparing click-through rates (proportions) for three different email subject lines. Each row represents a campaign sent to 1,000 users:
import numpy as np
subject_a_rates = [0.052, 0.048, 0.061, 0.055, 0.050,
0.058, 0.045, 0.053, 0.049, 0.056]
subject_b_rates = [0.035, 0.042, 0.038, 0.040, 0.037,
0.041, 0.036, 0.039, 0.043, 0.034]
subject_c_rates = [0.048, 0.052, 0.055, 0.050, 0.047,
0.051, 0.053, 0.049, 0.046, 0.054]
(a) Run both ANOVA and Kruskal-Wallis. Do they agree? (b) Check normality assumptions within each group. (c) Which result would you report to the marketing team and why? (d) If significant, which subject lines differ?
H.2. Maya has the following data on patient wait times (minutes) at two clinics. The data is heavily right-skewed with outliers.
clinic_a = [15, 22, 18, 45, 12, 30, 8, 25, 180, 20, 14, 35]
clinic_b = [10, 15, 12, 20, 8, 18, 25, 11, 14, 16, 22, 13]
(a) Compute the mean and median for each clinic. Which summary statistic better represents a "typical" wait time? (b) Run both the two-sample t-test and the Mann-Whitney U test. (c) Explain why the tests might give different results. (d) Which test is more appropriate and why? (e) The 180-minute value in Clinic A might be a data error. Re-run both tests with this value removed. Do the conclusions change?
H.3. Write a complete analysis report (300-500 words) for the following scenario:
A pharmaceutical company compares the effectiveness of three pain medications using a 7-point pain relief scale (1 = No relief to 7 = Complete relief). The study enrolled 45 patients, randomly assigned 15 to each medication.
import numpy as np
from scipy import stats
med_a = [3, 4, 5, 3, 4, 4, 5, 3, 4, 5, 3, 4, 4, 3, 5]
med_b = [5, 6, 5, 7, 6, 5, 6, 7, 5, 6, 6, 5, 7, 6, 5]
med_c = [4, 5, 4, 3, 5, 4, 5, 4, 3, 4, 5, 4, 4, 5, 3]
Your report should:
- Justify the choice of nonparametric over parametric methods
- Present the Kruskal-Wallis test results
- Include post-hoc comparisons (if warranted)
- Interpret the findings in clinical terms
- Note any limitations
Part I: Research and Critical Thinking ⭐⭐⭐⭐
I.1. (Simulation Study) Write Python code to compare the power of the Mann-Whitney U test and the two-sample t-test under different conditions:
(a) Both populations are normal ($\mu_1 = 0$, $\mu_2 = 0.5$, $\sigma = 1$, $n = 20$). Run 10,000 simulations. What proportion of the time does each test reject $H_0$?
(b) Both populations follow an exponential distribution with means 1.0 and 1.5. Repeat the simulation. How do the rejection rates compare now?
(c) Summarize your findings. Under what conditions does each test have higher power?
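Starting scaffold for I.1: a possible way to organize the simulation is a helper that takes two sampling functions and returns the rejection rate of each test. This is a sketch, not a reference solution. The helper name `rejection_rates`, the seed, and the reduced default of 300 simulations are assumptions made to keep the example fast; the exercise asks for 10,000 runs and for the exponential case as well.

```python
import numpy as np
from scipy import stats

def rejection_rates(draw_a, draw_b, n_sims=300, alpha=0.05):
    """Return (t-test rate, Mann-Whitney rate) of rejecting H0 over n_sims draws."""
    t_rejects = 0
    mw_rejects = 0
    for _ in range(n_sims):
        a, b = draw_a(), draw_b()  # fresh samples on every iteration
        t_rejects += int(stats.ttest_ind(a, b).pvalue < alpha)
        mw_p = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
        mw_rejects += int(mw_p < alpha)
    return t_rejects / n_sims, mw_rejects / n_sims

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible
n = 20

# Scenario (a): two normal populations with means 0 and 0.5, sigma = 1
t_rate, mw_rate = rejection_rates(
    lambda: rng.normal(0.0, 1.0, n),
    lambda: rng.normal(0.5, 1.0, n),
)
print(f"t-test rejection rate:       {t_rate:.3f}")
print(f"Mann-Whitney rejection rate: {mw_rate:.3f}")
```

For scenario (b), swap the two lambdas for draws from `rng.exponential` with the stated means, and raise `n_sims` to 10,000 before drawing conclusions; with only a few hundred runs the estimated rates are noisy.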
I.2. (Effect Sizes for Nonparametric Tests) Research and explain the rank-biserial correlation as an effect size measure for the Mann-Whitney U test. How is it computed from the U statistic? What are its benchmarks? Use Alex's streaming data from Section 21.6 as an example.
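Hint: one common formulation of the rank-biserial correlation is the "simple difference" form, the proportion of cross-group pairs favoring the first group minus the proportion favoring the second. The sketch below computes it by direct pair counting on made-up data; the function name `rank_biserial` is hypothetical, and other sources use the opposite sign or a formula stated in terms of U, so check which convention your reference follows.

```python
import numpy as np

def rank_biserial(x, y):
    """Proportion of (x, y) pairs with x > y minus proportion with x < y."""
    x = np.asarray(x, dtype=float)[:, None]  # column vector
    y = np.asarray(y, dtype=float)[None, :]  # row vector
    favorable = np.sum(x > y)     # pairs where the x value is larger
    unfavorable = np.sum(x < y)   # pairs where the y value is larger
    return (favorable - unfavorable) / (x.size * y.size)

# Hypothetical groups for illustration only
x_demo = [3, 6, 9]
y_demo = [4, 10, 12, 15]
r = rank_biserial(x_demo, y_demo)
print(f"rank-biserial r = {r:.3f}")
```

In this example 2 of the 12 pairs favor `x_demo` and 10 favor `y_demo`, giving $r = (2 - 10)/12 \approx -0.667$. Tied pairs count toward neither side, and the value ranges from $-1$ (complete separation one way) to $+1$ (complete separation the other way).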
I.3. (Friedman Test — Beyond this Chapter) The Friedman test is the nonparametric alternative to repeated-measures ANOVA. Research how it works and explain in 200-300 words how it extends the Kruskal-Wallis test for within-subjects designs. When would Sam use a Friedman test instead of a Kruskal-Wallis test?
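Hint: to experiment while researching I.3, SciPy provides `scipy.stats.friedmanchisquare`, which takes one sequence per condition with subjects in the same order in each. The sketch below uses made-up scores for five subjects under three conditions (all names and values are hypothetical).

```python
from scipy import stats

# Hypothetical repeated measures: position i in each list is the same subject
cond_a = [10, 12, 9, 11, 13]
cond_b = [14, 15, 12, 13, 16]
cond_c = [18, 17, 15, 16, 20]

# Ranks are assigned within each subject (across conditions), not across
# the pooled data as in Kruskal-Wallis
stat, p = stats.friedmanchisquare(cond_a, cond_b, cond_c)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
```

Every subject here scores lowest under `cond_a` and highest under `cond_c`, so the within-subject ranks are identical across subjects and the statistic reaches its maximum for this design (10, with $p \approx 0.0067$). With so few subjects the chi-square approximation is rough, so treat the p-value as illustrative.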
Part J: Mixed Practice ⭐⭐
J.1. Match each scenario to the most appropriate test:
| Scenario | Test Options |
|---|---|
| (a) Compare median income in two cities, $n = 200$ per city, right-skewed | A. Paired t-test |
| (b) Compare blood pressure before and after medication, $n = 10$, normal | B. Mann-Whitney U |
| (c) Compare satisfaction ratings (1-5) across 4 stores | C. ANOVA |
| (d) Compare test scores of three classes, $n = 100$ each, normal | D. Kruskal-Wallis |
| (e) Compare weight before and after diet, $n = 8$, skewed differences | E. Wilcoxon signed-rank |
J.2. For each of the following p-value pairs, explain what the agreement or disagreement tells you about the data:
(a) t-test: $p = 0.003$; Mann-Whitney: $p = 0.005$ (b) t-test: $p = 0.048$; Mann-Whitney: $p = 0.082$ (c) Paired t-test: $p = 0.15$; Wilcoxon signed-rank: $p = 0.14$ (d) ANOVA: $p = 0.001$; Kruskal-Wallis: $p = 0.003$ (e) t-test: $p = 0.22$; Mann-Whitney: $p = 0.03$
J.3. A researcher reports: "The Kruskal-Wallis test was not significant ($H = 3.12$, $df = 3$, $p = 0.37$), so I ran pairwise Mann-Whitney tests and found that Groups A and D differ significantly ($p = 0.02$)." What is wrong with this analysis?