Learning Objectives
- Conduct and interpret a chi-square goodness-of-fit test
- Conduct and interpret a chi-square test of independence
- Calculate expected frequencies and understand their role
- Verify conditions for chi-square tests
- Interpret chi-square results in context with appropriate effect sizes (Cramer's V)
In This Chapter
- Chapter Overview
- 19.1 A Puzzle Before We Start (Productive Struggle)
- 19.2 The Chi-Square Statistic
- 19.3 The Goodness-of-Fit Test
- 19.4 A Conceptual Shift: From Means to Counts
- 19.5 The Chi-Square Test of Independence
- 19.6 Effect Size: Cramer's V
- 19.7 Conditions and Assumptions
- 19.8 Where Is the Association? Standardized Residuals
- 19.9 Alex's Analysis: Subscription Tier and Genre Preference
- 19.10 Connecting It All: The Chi-Square Test Landscape
- 19.11 When Things Go Wrong: Common Mistakes
- 19.12 The Big Picture: Theme 2 — The Human Stories Behind the Data
- 19.13 Computing Corner: Full Python Workflow
- 19.14 Visualizing Chi-Square Results
- 19.15 Progressive Project Checkpoint
- 19.16 What's Next
- Chapter Summary
Chapter 19: Chi-Square Tests: Categorical Data Analysis
"Not everything that counts can be counted, and not everything that can be counted counts." — Attributed to William Bruce Cameron (often misattributed to Einstein)
Chapter Overview
Everything you've learned about hypothesis testing so far has focused on numbers. Means. Proportions. Differences between means. Differences between proportions. The tests themselves — $z$-tests, $t$-tests, even permutation tests — all work with numerical quantities.
But some of the most important questions in the world aren't about numbers at all. They're about categories.
Does a disease affect different racial groups at different rates? Does a person's neighborhood predict whether they receive bail or pretrial detention? Does your music preference depend on your age group? Does the distribution of blood types in a hospital's patient population match the expected distribution for the general population?
These questions involve categorical variables — the kind you first met in Chapter 2 and built contingency tables for in Chapter 8. Until now, we've had limited tools for formally testing claims about categorical data. The one-proportion $z$-test from Chapter 14 and the two-proportion $z$-test from Chapter 16 handled the special cases of two categories (success/failure) or two groups. But what happens when you have three categories? Or five? Or ten? What about testing the relationship between two categorical variables simultaneously?
That's where the chi-square test comes in.
The chi-square test (pronounced "kye-square," rhymes with "pie-square") is a fundamentally different kind of hypothesis test. Instead of comparing means or proportions to reference values, it compares observed frequencies (what you actually see in your data) to expected frequencies (what you'd expect to see if nothing interesting were happening). The bigger the gap between observed and expected, the stronger the evidence against the null hypothesis.
Dr. Maya Chen needs this tool right now. She's been tracking disease cases across four geographic regions and wants to know whether the distribution of cases matches what population size alone would predict — or whether some regions are disproportionately affected. A single proportion test won't do it; she needs to compare observed vs. expected across all four regions simultaneously.
Professor James Washington faces a related but different question. He has data on bail decisions (granted vs. denied) cross-tabulated by the defendant's race. He doesn't just want to know if the proportions differ between two groups — he wants to know if bail decisions are independent of race, period. The two-proportion $z$-test from Chapter 16 could compare two racial groups, but James has data on four groups and wants a single, comprehensive test.
And Alex Rivera is exploring whether StreamVibe's subscription tiers (Free, Basic, Premium) are associated with content genre preferences. Are Premium subscribers more likely to watch documentaries? Do Free users gravitate toward comedy? Alex needs a test that handles two categorical variables with multiple levels each.
All three need chi-square tests.
In this chapter, you will learn to:
- Conduct and interpret a chi-square goodness-of-fit test
- Conduct and interpret a chi-square test of independence
- Calculate expected frequencies and understand their role
- Verify conditions for chi-square tests
- Interpret chi-square results in context with appropriate effect sizes (Cramer's V)
Fast Track: If you've seen chi-square tests before, skim Sections 19.1–19.3, then jump to Section 19.7 (conditions and assumptions). Complete quiz questions 1, 10, and 17 to verify.
Deep Dive: After this chapter, read Case Study 1 (Maya's regional disease distribution analysis) for a complete public health application, then Case Study 2 (James's analysis of bail decisions by race) for a criminal justice application with important ethical dimensions. Both include full Python code.
19.1 A Puzzle Before We Start (Productive Struggle)
Before we dive into the mechanics, try this scenario.
The Candy Jar Problem
A candy company claims that its bags of assorted fruit candies contain equal proportions of five flavors: cherry, grape, lemon, lime, and orange (20% each).
You open a bag and count:
| Flavor | Count |
|---|---|
| Cherry | 28 |
| Grape | 14 |
| Lemon | 22 |
| Lime | 18 |
| Orange | 18 |
| Total | 100 |

(a) If the company's claim is correct, how many of each flavor would you expect in a bag of 100 candies?
(b) The observed counts differ from the expected counts. But counts always vary a little due to random chance. How would you decide whether these differences are too big to be explained by chance alone?
(c) Could you use the one-proportion $z$-test from Chapter 14 here? What would be the problem with testing each flavor separately?
(d) What if you computed something like the "total distance" between the observed and expected counts — adding up how far off each flavor is? What formula might capture the overall discrepancy?
Take 3 minutes. Part (d) is the key insight.
Here's what I hope you noticed:
For part (a), if all flavors are equally represented, each should make up 20% of the bag. With 100 candies, you'd expect 20 of each flavor. These are the expected frequencies — the counts you'd predict if the null hypothesis (equal distribution) were true.
For part (b), this is the central question of the chapter. Cherry has 28 instead of 20 — that's 8 more than expected. Grape has only 14 — that's 6 fewer than expected. But random variation could produce some imbalance even if the process is perfectly fair. We need a formal test.
Part (c) reveals an important limitation. You could use the $z$-test from Chapter 14 to test whether cherry's proportion differs from 20%. But you'd have to run five separate tests — one for each flavor. And as you'll learn in Chapter 20 (and caught a glimpse of in Chapter 17's discussion of multiple testing), running multiple tests inflates your overall Type I error rate. If you test at $\alpha = 0.05$ five times, the probability of at least one false positive is $1 - (1 - 0.05)^5 \approx 0.23$ — nearly one in four! We need a single test that evaluates all five flavors simultaneously.
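That familywise error figure is quick to verify (a minimal sketch of the arithmetic):

```python
# Probability of at least one false positive across five
# independent tests, each run at alpha = 0.05
alpha = 0.05
k = 5
familywise = 1 - (1 - alpha) ** k
print(f"{familywise:.4f}")  # 0.2262
```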
And part (d) is exactly where the chi-square statistic comes from. You want a single number that captures the total discrepancy between what you observed and what you expected. A natural approach: for each category, compute how far off the observed count is from the expected count, then add these up. But we need to be careful about how we measure "how far off" — and the chi-square statistic handles this with an elegant formula you'll see in Section 19.2.
19.2 The Chi-Square Statistic
From Idea to Formula
Let's build the chi-square statistic from the ground up. We want a single number that measures the total discrepancy between observed and expected frequencies across all categories.
First attempt: just add up the differences.
$$\sum (O_i - E_i)$$
Problem: some differences are positive (cherry: $28 - 20 = +8$) and some are negative (grape: $14 - 20 = -6$). They'd cancel out. Sound familiar? This is the same problem we faced when trying to measure spread in Chapter 6 — differences from the mean cancel out, which is why we squared them to get the variance.
Second attempt: square the differences.
$$\sum (O_i - E_i)^2$$
Better — now all terms are positive. But there's a subtlety. A difference of 8 means something very different if you expected 20 versus if you expected 200. Being off by 8 out of 20 is a 40% error. Being off by 8 out of 200 is a 4% error. We need to standardize each squared difference by what we expected.
Third attempt (this is it):
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
This is the chi-square statistic. The Greek letter $\chi$ (chi, pronounced "kye") gives the test its name.
Concept 1: The Chi-Square Statistic
The chi-square statistic measures the overall discrepancy between observed frequencies ($O_i$) and expected frequencies ($E_i$) across all categories:
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
where $k$ is the number of categories. Each term $(O_i - E_i)^2 / E_i$ measures how much one category deviates from expectations, relative to what was expected. The chi-square statistic is the sum of these standardized squared deviations.
Let's compute it for our candy data:
| Flavor | Observed ($O$) | Expected ($E$) | $O - E$ | $(O-E)^2$ | $(O-E)^2 / E$ |
|---|---|---|---|---|---|
| Cherry | 28 | 20 | 8 | 64 | 3.20 |
| Grape | 14 | 20 | -6 | 36 | 1.80 |
| Lemon | 22 | 20 | 2 | 4 | 0.20 |
| Lime | 18 | 20 | -2 | 4 | 0.20 |
| Orange | 18 | 20 | -2 | 4 | 0.20 |
$$\chi^2 = 3.20 + 1.80 + 0.20 + 0.20 + 0.20 = 5.60$$
Is 5.60 big enough to reject the null hypothesis? We need a reference distribution to find out.
Understanding the Chi-Square Distribution
The chi-square distribution is the reference distribution for our test statistic, just as the normal and $t$-distributions are reference distributions for $z$-tests and $t$-tests. Here's what you need to know:
- It's always right-skewed (unlike the symmetric normal and $t$-distributions). This makes sense: chi-square values are sums of squared terms, so they can't be negative.
- Its shape depends on the degrees of freedom (df).
- As df increases, the distribution shifts right and becomes more symmetric (approaching a normal shape for large df).
- The test is always one-tailed on the right. Large values of $\chi^2$ indicate poor fit between observed and expected; small values indicate good fit. There's no "two-tailed" chi-square test.
Key Term: Chi-Square Distribution
The chi-square distribution is a right-skewed probability distribution that serves as the reference distribution for chi-square tests. Its shape is determined by the degrees of freedom (df). For a goodness-of-fit test, $df = k - 1$, where $k$ is the number of categories.
For the candy example: $k = 5$ flavors, so $df = 5 - 1 = 4$.
Using a chi-square table or Python: the p-value for $\chi^2 = 5.60$ with $df = 4$ is approximately 0.231.
Since $p = 0.231 > 0.05$, we fail to reject $H_0$. The observed distribution of flavors is consistent with the company's claim of equal proportions. The imbalance we noticed — especially the surplus of cherry and shortage of grape — could plausibly be due to random variation.
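The entire candy test can be reproduced in scipy (a sketch; when the `f_exp` argument is omitted, `stats.chisquare` assumes equal expected frequencies, which is exactly the company's claim):

```python
from scipy import stats

# Observed flavor counts: cherry, grape, lemon, lime, orange
observed = [28, 14, 22, 18, 18]

# No f_exp argument: equal expected counts (20 each) are assumed
chi2_stat, p_value = stats.chisquare(observed)
print(f"chi2 = {chi2_stat:.2f}, p = {p_value:.3f}")  # chi2 = 5.60, p = 0.231
```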
Why Dividing by $E$ Matters
Let me emphasize why we divide by $E_i$ in each term. Consider two scenarios:
Scenario A: You expect 20 and observe 28. Difference = 8. Contribution: $8^2/20 = 3.2$.
Scenario B: You expect 200 and observe 208. Difference = 8. Contribution: $8^2/200 = 0.32$.
Both are "off by 8," but the first is much more noteworthy. Being off by 40% is very different from being off by 4%. Dividing by $E$ ensures that each category's contribution to the test statistic reflects the relative size of the discrepancy, not just the absolute difference. This is analogous to how we divide by the standard error in $z$-tests and $t$-tests — we're standardizing the deviation.
19.3 The Goodness-of-Fit Test
Now let's formalize what we just did. The candy example illustrates the chi-square goodness-of-fit test — a test of whether a single categorical variable follows a specified distribution.
Concept 2: Chi-Square Goodness-of-Fit Test
The goodness-of-fit test determines whether the observed distribution of a single categorical variable matches a hypothesized distribution. The "fit" refers to how well the observed frequencies "fit" the expected pattern.
- $H_0$: The variable follows the specified distribution (observed = expected)
- $H_a$: The variable does not follow the specified distribution (observed $\neq$ expected)
- Test statistic: $\chi^2 = \sum (O_i - E_i)^2 / E_i$
- Degrees of freedom: $df = k - 1$ (where $k$ = number of categories)
Why $df = k - 1$?
If you know the counts in 4 of 5 categories and the total, you can figure out the fifth. There are $k$ categories, but only $k - 1$ of them are free to vary. This is the same logic behind degrees of freedom for the sample variance in Chapter 6 — one piece of information is "used up" by the constraint that the counts must add to $n$.
🔄 Spaced Review 1 (from Ch.8): Contingency Tables
In Chapter 8, you learned to organize two categorical variables into a contingency table — a grid showing the frequency of each combination of categories. You calculated joint probabilities (cell count / grand total), marginal probabilities (row or column total / grand total), and conditional probabilities (cell count / row or column total).
Back then, contingency tables were tools for describing relationships. In this chapter, they become tools for testing relationships. The chi-square test of independence (Section 19.5) asks: "Is the pattern in this contingency table too extreme to have occurred by chance?" Everything you learned about reading contingency tables in Chapter 8 is about to become even more powerful.
Complete Worked Example: Maya's Regional Disease Distribution
Dr. Maya Chen has been tracking reported cases of a respiratory illness across four geographic regions in her county. The regions have different population sizes, so if the disease struck randomly (proportional to population), the case distribution should mirror the population distribution.
Here's the data:
| Region | Population Share | Expected % | Observed Cases | Expected Cases |
|---|---|---|---|---|
| North | 88,000 (22%) | 22% | 47 | 44 |
| South | 120,000 (30%) | 30% | 52 | 60 |
| East | 112,000 (28%) | 28% | 68 | 56 |
| West | 80,000 (20%) | 20% | 33 | 40 |
| Total | 400,000 (100%) | 100% | 200 | 200 |
Step 1: State the hypotheses.
$H_0$: The disease cases are distributed proportionally to population size (i.e., each region's share of cases equals its share of the population).
$H_a$: The disease cases are not distributed proportionally to population size.
Step 2: Compute expected frequencies.
Expected cases for each region = Total cases $\times$ Population share:
- North: $200 \times 0.22 = 44$
- South: $200 \times 0.30 = 60$
- East: $200 \times 0.28 = 56$
- West: $200 \times 0.20 = 40$
Step 3: Check conditions. (We'll cover conditions formally in Section 19.7, but let's preview: all expected counts must be at least 5. Here, the smallest expected count is 40. Condition met.)
Step 4: Compute the chi-square statistic.
| Region | $O$ | $E$ | $O - E$ | $(O-E)^2$ | $(O-E)^2/E$ |
|---|---|---|---|---|---|
| North | 47 | 44 | 3 | 9 | 0.205 |
| South | 52 | 60 | -8 | 64 | 1.067 |
| East | 68 | 56 | 12 | 144 | 2.571 |
| West | 33 | 40 | -7 | 49 | 1.225 |
$$\chi^2 = 0.205 + 1.067 + 2.571 + 1.225 = 5.068$$
Step 5: Find the p-value.
With $df = 4 - 1 = 3$:
```python
from scipy import stats

chi2_stat = 5.068
df = 3
p_value = 1 - stats.chi2.cdf(chi2_stat, df)
print(f"Chi-square statistic: {chi2_stat:.3f}")
print(f"Degrees of freedom: {df}")
print(f"P-value: {p_value:.4f}")
```
Output: $p \approx 0.167$
Step 6: Make a decision.
Since $p = 0.167 > 0.05$, we fail to reject $H_0$. There is not sufficient evidence to conclude that the disease distribution differs from what population sizes alone would predict.
Step 7: Interpret in context.
Although the East region had more cases than expected (68 vs. 56) and the South and West had fewer, these deviations are within the range that random variation could produce. Maya notes this result but plans to continue monitoring — a borderline result with a larger sample might reach significance, and the East region's overrepresentation warrants further investigation.
Python: The Complete Goodness-of-Fit Test
```python
from scipy import stats
import numpy as np

# Maya's disease data
observed = np.array([47, 52, 68, 33])

# Expected proportions based on population shares
expected_proportions = np.array([0.22, 0.30, 0.28, 0.20])

# scipy.stats.chisquare() does all the work
chi2_stat, p_value = stats.chisquare(observed, f_exp=expected_proportions * sum(observed))

print("=" * 50)
print("Chi-Square Goodness-of-Fit Test")
print("=" * 50)
print(f"Observed: {observed}")
print(f"Expected: {expected_proportions * sum(observed)}")
print(f"Chi-square statistic: {chi2_stat:.3f}")
print(f"Degrees of freedom: {len(observed) - 1}")
print(f"P-value: {p_value:.4f}")
print()

if p_value < 0.05:
    print("Result: Reject H0 at alpha = 0.05")
    print("The observed distribution differs significantly from expected.")
else:
    print("Result: Fail to reject H0 at alpha = 0.05")
    print("No significant evidence of departure from expected distribution.")
```
The scipy.stats.chisquare() function takes the observed counts and optionally the expected counts (f_exp). If you omit f_exp, it assumes equal expected frequencies — which is exactly what you'd want for the candy problem.
Excel: CHISQ.TEST
In Excel, the chi-square goodness-of-fit test uses the CHISQ.TEST function:
- Enter observed counts in one column (say A1:A4): 47, 52, 68, 33
- Enter expected counts in another column (say B1:B4): 44, 60, 56, 40
- In a results cell:
=CHISQ.TEST(A1:A4, B1:B4)
This returns the p-value directly. For Maya's data: $p \approx 0.167$.
Note: CHISQ.TEST returns only the p-value. To get the chi-square statistic itself, you'll need to compute it manually using a helper column with the formula =(A1-B1)^2/B1 and then sum those values.
19.4 A Conceptual Shift: From Means to Counts
Before we move to the test of independence, let's pause and appreciate what just happened. This is a different kind of hypothesis test than anything you've seen before.
In Chapters 13-18, every test followed the same basic structure:

1. Compute a numerical statistic from the data ($\bar{x}$, $\hat{p}$, $\bar{x}_1 - \bar{x}_2$)
2. Compare it to a null value using a standardized test statistic ($z$ or $t$)
3. Use the normal or $t$-distribution to find the p-value

The chi-square test works differently:

1. Compute counts for each category
2. Compare observed counts to expected counts using the $\chi^2$ formula
3. Use the chi-square distribution (not the normal or $t$) to find the p-value
This shift — from numerical data to categorical data, from means to counts, from the normal/$t$ world to the chi-square world — is one of the most important conceptual transitions in introductory statistics. Many students find chi-square tests surprisingly intuitive once they get over the initial strangeness, precisely because counting is simpler than computing means and standard deviations.
🔄 Spaced Review 2 (from Ch.14): Inference for Proportions
In Chapter 14, you tested whether a single proportion differed from a hypothesized value ($z$-test for proportions) or whether two proportions differed from each other (Chapter 16). Those tests handled the special case of a categorical variable with exactly two outcomes (success/failure).
The chi-square goodness-of-fit test generalizes the one-proportion $z$-test to categorical variables with any number of categories. In fact, for a variable with only two categories, the chi-square test gives exactly the same p-value as the two-tailed $z$-test. (The chi-square statistic equals the square of the $z$-statistic: $\chi^2 = z^2$.) So the $z$-test was a special case all along — and now you have the general version.
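You can check this equivalence numerically. A minimal sketch with hypothetical counts (55 successes in 100 trials, testing $H_0: p = 0.5$):

```python
import numpy as np
from scipy import stats

n, successes, p0 = 100, 55, 0.5

# Two-tailed one-proportion z-test
z = (successes / n - p0) / np.sqrt(p0 * (1 - p0) / n)
p_z = 2 * stats.norm.sf(abs(z))

# Chi-square goodness-of-fit on the same two categories
chi2_stat, p_chi = stats.chisquare([successes, n - successes])

print(z**2, chi2_stat)  # both equal 1 (up to floating-point rounding)
print(p_z, p_chi)       # matching p-values, about 0.317
```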
19.5 The Chi-Square Test of Independence
The goodness-of-fit test examined whether a single categorical variable follows a specified distribution. But often we want to know whether two categorical variables are related. That's the chi-square test of independence.
Concept 3: Chi-Square Test of Independence
The test of independence determines whether two categorical variables are associated (related) or independent (unrelated). It uses the same chi-square formula but applies it to a two-way contingency table.
- $H_0$: The two variables are independent (no association)
- $H_a$: The two variables are not independent (there is an association)
- Test statistic: $\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ (summed over all cells)
- Degrees of freedom: $df = (r-1)(c-1)$ where $r$ = number of rows and $c$ = number of columns
How to Calculate Expected Frequencies
This is where it gets clever. In the goodness-of-fit test, the expected frequencies came from the hypothesized distribution (like population shares). In the test of independence, the expected frequencies come from the data itself — specifically, from the row and column totals.
If two variables are truly independent, then knowing the row a case falls into tells you nothing about which column it's in. The expected frequency for any cell is:
$$E_{ij} = \frac{(\text{Row } i \text{ total}) \times (\text{Column } j \text{ total})}{\text{Grand total}}$$
Key Term: Expected Frequency (Test of Independence)
The expected frequency for a cell in a contingency table, under the assumption of independence, is:
$$E_{ij} = \frac{(\text{Row } i \text{ total}) \times (\text{Column } j \text{ total})}{\text{Grand total}}$$
This formula distributes the grand total across cells in proportion to both the row and column marginals. If the two variables are independent, observed frequencies should be close to these expected frequencies.
Why does this formula work? Think of it this way. If 40% of all people are in Row 1 and 30% of all people are in Column 2, and the two variables are independent, then 40% $\times$ 30% = 12% of all people should be in the cell where Row 1 meets Column 2. Multiply that 12% by the grand total, and you get the expected count.
This is the same multiplication rule for independent events you learned in Chapter 8: $P(A \text{ and } B) = P(A) \times P(B)$ when $A$ and $B$ are independent. We're applying that logic to a contingency table.
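The multiplication logic maps directly onto a one-line NumPy computation. A sketch with a small hypothetical $2 \times 3$ table (not data from the chapter):

```python
import numpy as np

# Hypothetical observed counts for a 2 x 3 contingency table
observed = np.array([[30, 20, 10],
                     [20, 10, 10]])

row_totals = observed.sum(axis=1)   # [60, 40]
col_totals = observed.sum(axis=0)   # [50, 30, 20]
grand_total = observed.sum()        # 100

# E_ij = (row i total) * (column j total) / grand total, for every cell at once
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)  # rows: [30, 18, 12] and [20, 12, 8]
```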
Why $df = (r-1)(c-1)$?
The degrees of freedom for the test of independence are $(r-1)(c-1)$, where $r$ is the number of rows and $c$ is the number of columns.
The intuition: in a contingency table, once you fix the row totals and column totals (which are determined by the data), how many cell counts are free to vary? In a $3 \times 4$ table, you can fill in any $2 \times 3 = 6$ cells freely; the rest are determined by the constraint that rows and columns must sum to their totals. That's $(3-1)(4-1) = 6$ degrees of freedom.
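You can confirm the degrees-of-freedom rule with scipy, which reports df alongside the statistic. A sketch with an arbitrary hypothetical $3 \times 4$ table:

```python
import numpy as np
from scipy import stats

# A hypothetical 3 x 4 table of counts (values don't affect df)
table = np.random.default_rng(0).integers(10, 40, size=(3, 4))

chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)
print(dof)  # 6, i.e., (3 - 1) * (4 - 1)
```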
Complete Worked Example: James's Bail Decision Analysis
Professor James Washington has collected data on bail decisions from a metropolitan courthouse. He wants to test whether bail decisions are independent of the defendant's race. Here is the data:
| | Bail Granted | Bail Denied | Total |
|---|---|---|---|
| White | 142 | 58 | 200 |
| Black | 108 | 92 | 200 |
| Hispanic | 87 | 63 | 150 |
| Other | 43 | 7 | 50 |
| Total | 380 | 220 | 600 |
Step 1: State the hypotheses.
$H_0$: Bail decisions are independent of race (the proportion granted bail is the same across racial groups).
$H_a$: Bail decisions are not independent of race (the proportion granted bail differs across racial groups).
Step 2: Compute expected frequencies.
Overall, $380/600 = 63.3\%$ of defendants were granted bail and $220/600 = 36.7\%$ were denied. If bail decisions are independent of race, each racial group should have these same proportions.
$$E_{ij} = \frac{\text{Row total} \times \text{Column total}}{\text{Grand total}}$$
| | Bail Granted ($E$) | Bail Denied ($E$) |
|---|---|---|
| White | $200 \times 380 / 600 = 126.67$ | $200 \times 220 / 600 = 73.33$ |
| Black | $200 \times 380 / 600 = 126.67$ | $200 \times 220 / 600 = 73.33$ |
| Hispanic | $150 \times 380 / 600 = 95.00$ | $150 \times 220 / 600 = 55.00$ |
| Other | $50 \times 380 / 600 = 31.67$ | $50 \times 220 / 600 = 18.33$ |
Step 3: Check conditions. All expected counts are at least 5. (The smallest is 18.33.) Condition met.
Step 4: Compute the chi-square statistic.
| Cell | $O$ | $E$ | $(O-E)^2/E$ |
|---|---|---|---|
| White, Granted | 142 | 126.67 | 1.856 |
| White, Denied | 58 | 73.33 | 3.206 |
| Black, Granted | 108 | 126.67 | 2.751 |
| Black, Denied | 92 | 73.33 | 4.752 |
| Hispanic, Granted | 87 | 95.00 | 0.674 |
| Hispanic, Denied | 63 | 55.00 | 1.164 |
| Other, Granted | 43 | 31.67 | 4.056 |
| Other, Denied | 7 | 18.33 | 7.006 |

$$\chi^2 = 1.856 + 3.206 + 2.751 + 4.752 + 0.674 + 1.164 + 4.056 + 7.006 \approx 25.464$$
Step 5: Find the p-value.
$df = (4-1)(2-1) = 3$
```python
from scipy import stats

p_value = 1 - stats.chi2.cdf(25.464, df=3)
print(f"P-value: {p_value:.6f}")
# Output: P-value: 0.000012
```
Step 6: Make a decision.
Since $p < 0.001$, we reject $H_0$ at any conventional significance level. There is very strong evidence that bail decisions are not independent of race.
Step 7: Interpret in context.
The data provide overwhelming evidence ($\chi^2 = 25.46$, $df = 3$, $p < 0.001$) that bail decisions are associated with the defendant's race. Looking at the largest contributions to the chi-square statistic, the "Other, Denied" cell (contribution = 7.006) and "Black, Denied" cell (contribution = 4.752) stand out — Black defendants were denied bail more often than expected under independence, while defendants in the "Other" category were denied bail far less often than expected.
But notice something important: the chi-square test tells us that an association exists. It doesn't tell us where the association is or how strong it is. That's a limitation we'll address with effect sizes (Cramer's V) in Section 19.6 and residual analysis in Section 19.8.
Python: The Complete Test of Independence
```python
import numpy as np
from scipy import stats
import pandas as pd

# James's bail data as a contingency table
observed = np.array([
    [142, 58],   # White
    [108, 92],   # Black
    [87, 63],    # Hispanic
    [43, 7]      # Other
])

# Create a labeled DataFrame for clarity
df = pd.DataFrame(
    observed,
    index=['White', 'Black', 'Hispanic', 'Other'],
    columns=['Bail Granted', 'Bail Denied']
)
print("Observed Frequencies:")
print(df)
print()

# scipy.stats.chi2_contingency() handles everything
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

print("Expected Frequencies (under independence):")
expected_df = pd.DataFrame(
    np.round(expected, 2),
    index=['White', 'Black', 'Hispanic', 'Other'],
    columns=['Bail Granted', 'Bail Denied']
)
print(expected_df)
print()
print(f"Chi-square statistic: {chi2_stat:.3f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p_value:.6f}")
print()

if p_value < 0.05:
    print("Result: Reject H0 — bail decisions are NOT independent of race.")
else:
    print("Result: Fail to reject H0 — no significant association detected.")
```
The scipy.stats.chi2_contingency() function is the workhorse for the test of independence. You pass it a 2D array of observed frequencies, and it returns the chi-square statistic, p-value, degrees of freedom, and the matrix of expected frequencies. Notice that you don't need to calculate expected frequencies yourself — the function does it automatically.
Excel: CHISQ.TEST for Independence
In Excel, the process is similar:
- Enter the observed counts in a block (say A1:B4 for a $4 \times 2$ table)
- Calculate expected counts using the formula $E_{ij} = (\text{row total} \times \text{column total}) / \text{grand total}$ in a separate block (say D1:E4)
- Use =CHISQ.TEST(A1:B4, D1:E4) in a results cell to get the p-value
19.6 Effect Size: Cramer's V
We just found that bail decisions are significantly associated with race ($p < 0.001$). But how strong is this association?
🔄 Spaced Review 3 (from Ch.17): Statistical vs. Practical Significance
In Chapter 17, you learned the crucial distinction: statistical significance (p-value $\leq \alpha$) tells you that an effect exists but says nothing about its size. Practical significance asks whether the effect matters in the real world. Cohen's $d$ measured effect size for comparing means. For chi-square tests, we need a different effect size measure — one designed for categorical associations. That measure is Cramer's V.
The chi-square statistic itself is not a good effect size measure because it depends on sample size. Double the sample (keeping proportions the same) and $\chi^2$ doubles too. We need something that stays between 0 and 1 regardless of sample size.
Concept 4: Cramer's V
Cramer's V measures the strength of association between two categorical variables, scaled from 0 to 1:
$$V = \sqrt{\frac{\chi^2}{n \cdot (k - 1)}}$$
where $\chi^2$ is the chi-square statistic, $n$ is the total sample size, and $k = \min(r, c)$ — the smaller of the number of rows or columns.
- $V = 0$: no association (perfect independence)
- $V = 1$: perfect association (knowing one variable perfectly predicts the other)
Conventional benchmarks (adapted from Cohen):

| $V$ | Interpretation |
|-----|---------------|
| 0.10 | Small association |
| 0.30 | Medium association |
| 0.50 | Large association |
For James's bail data:
$$V = \sqrt{\frac{25.464}{600 \cdot (2 - 1)}} = \sqrt{\frac{25.464}{600}} = \sqrt{0.04244} = 0.206$$
The association between race and bail decisions has a Cramer's V of 0.206 — a small-to-medium effect. The association is statistically significant (we're confident it's real) but moderate in strength. This matters for interpretation: race is associated with bail decisions, but race alone doesn't determine the outcome. Other factors — severity of the charge, prior record, community ties — also play important roles.
```python
# Computing Cramer's V (uses observed and chi2_stat from the previous block)
n = observed.sum()
k = min(observed.shape) - 1  # min(rows, cols) - 1
cramers_v = np.sqrt(chi2_stat / (n * k))
print(f"Cramer's V: {cramers_v:.3f}")
print(f"Interpretation: {'Small' if cramers_v < 0.2 else 'Small-medium' if cramers_v < 0.3 else 'Medium' if cramers_v < 0.4 else 'Medium-large' if cramers_v < 0.5 else 'Large'} association")
```
Why Not Just Use the Chi-Square Statistic?
Consider two studies finding the same proportional pattern in bail decisions:

- Study A: $n = 600$, $\chi^2 = 25.5$, $V = 0.21$
- Study B: $n = 6{,}000$, $\chi^2 = 255$, $V = 0.21$
The chi-square statistic is 10 times larger in Study B, but the strength of association is identical. Both have $V = 0.21$. Cramer's V tells you the same story regardless of sample size. The chi-square statistic tells you whether the association is real; Cramer's V tells you how strong it is. You need both.
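This scale behavior is easy to demonstrate: multiply every cell of a table by 10 and $\chi^2$ grows tenfold while $V$ stays put (a sketch reusing James's observed counts from above):

```python
import numpy as np
from scipy import stats

observed = np.array([[142, 58], [108, 92], [87, 63], [43, 7]])

for scale in (1, 10):
    table = observed * scale
    chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)
    n = table.sum()
    k = min(table.shape) - 1
    v = np.sqrt(chi2_stat / (n * k))
    # chi2 grows tenfold (about 25.46 -> 254.64); V stays 0.206
    print(f"n = {n:5d}: chi2 = {chi2_stat:7.2f}, V = {v:.3f}")
```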
19.7 Conditions and Assumptions
Like every statistical test, chi-square tests have conditions that must be met for the results to be valid. The good news: the conditions are simpler than for $t$-tests (no normality assumption for the raw data!). The key condition involves the expected frequencies.
The Conditions
1. Random sampling or random assignment. The data must come from a random sample or a randomized experiment. This ensures that the sample is representative of the population and that the observations are independent.
2. Independence of observations. Each observation must be independent of every other observation. In particular:
   - Each individual or case contributes to only one cell in the table.
   - The sample size must be less than 10% of the population (the 10% condition, familiar from Chapter 14).
3. Expected frequency condition: All expected counts must be at least 5.
This is the big one. The chi-square distribution is an approximation, and it works well only when the expected counts are large enough. If any expected count is below 5, the approximation becomes unreliable and the p-value may be inaccurate.
Common Mistake Alert
The expected frequency condition refers to expected counts, not observed counts. It's fine if an observed count is 0, 1, or 2 — as long as the expected count for that cell is at least 5. Students sometimes check observed counts instead. Don't make that error.
What to Do When Conditions Aren't Met
If expected counts are too small, you have several options:
- Combine categories. Merge small categories into a larger one (e.g., combine "Other" with another group). This reduces the number of cells and increases expected counts. But be thoughtful — merging should make substantive sense, not just be a statistical convenience.
- Fisher's exact test. For $2 \times 2$ tables, Fisher's exact test computes the p-value exactly without relying on the chi-square approximation. In Python: `stats.fisher_exact(table)`.
- Collect more data. Sometimes small expected counts simply mean your sample is too small for the question you're asking.
- Simulation-based approach. Use a permutation test (as in Chapter 18) to simulate the null distribution instead of relying on the chi-square approximation.
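The simulation-based option can be sketched in a few lines: shuffle one variable's labels to break any association, rebuild the table, and count how often the shuffled chi-square statistic meets or exceeds the observed one. The table, seed, and permutation count below are illustrative choices, not values from the chapter.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(19)

def permutation_chi2_pvalue(table, n_perm=1000):
    """Approximate the chi-square p-value by shuffling one variable's labels."""
    table = np.asarray(table)
    r_idx, c_idx = np.indices(table.shape)
    # Expand the table into one (row, column) label pair per observation
    row_labels = np.repeat(r_idx.ravel(), table.ravel())
    col_labels = np.repeat(c_idx.ravel(), table.ravel())
    observed_chi2 = stats.chi2_contingency(table, correction=False)[0]
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(col_labels)            # break any association
        perm_table = np.zeros_like(table)
        np.add.at(perm_table, (row_labels, shuffled), 1)  # rebuild the table
        perm_chi2 = stats.chi2_contingency(perm_table, correction=False)[0]
        hits += perm_chi2 >= observed_chi2
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0

# Made-up table with a strong association
table = np.array([[30, 10],
                  [10, 30]])
print(f"Permutation p-value: {permutation_chi2_pvalue(table):.4f}")
```

Because the shuffling preserves the row and column totals, this approach needs no minimum expected count; it simply measures how unusual the observed table is among all tables with the same margins.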
Checking Conditions: A Practical Workflow
# After running chi2_contingency, check expected frequencies
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)
print("Expected frequencies:")
print(np.round(expected, 2))
print()
# Check the condition
min_expected = expected.min()
print(f"Minimum expected frequency: {min_expected:.2f}")
if min_expected >= 5:
    print("Condition met: All expected frequencies >= 5")
else:
    print("WARNING: Condition NOT met.")
    print(f"  {np.sum(expected < 5)} cell(s) have expected frequency < 5")
    print("  Consider combining categories, using Fisher's exact test,")
    print("  or using a simulation-based approach.")
19.8 Where Is the Association? Standardized Residuals
Here's a common mistake that deserves its own section.
Common Mistake: Assuming Chi-Square Tells You Where
The chi-square test tells you that an association exists between two categorical variables. It does not tell you where the association is — which specific cells deviate most from independence. For that, you need standardized residuals.
A standardized residual for each cell measures how far that cell's observed count is from its expected count, in standard-deviation units:
$$\text{Standardized residual} = \frac{O_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$
This is essentially the square root of each cell's contribution to the chi-square statistic, with the sign preserved. A standardized residual greater than $+2$ or less than $-2$ indicates a cell that is notably different from what independence would predict.
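A quick numerical check of that relationship, using a small made-up table (not data from the chapter): squaring each standardized residual gives that cell's contribution to the chi-square statistic, so the squares sum back to $\chi^2$.

```python
import numpy as np
from scipy import stats

# Made-up 2x3 table just to verify the relationship numerically
observed = np.array([[20, 30, 50],
                     [30, 30, 40]])
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

residuals = (observed - expected) / np.sqrt(expected)
# Each squared residual is one cell's contribution to chi-square,
# so summing the squares recovers the statistic itself
print(np.allclose((residuals ** 2).sum(), chi2))  # → True
```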
For James's data, the standardized residuals reveal the story:
# Calculate standardized residuals
residuals = (observed - expected) / np.sqrt(expected)
residual_df = pd.DataFrame(
np.round(residuals, 2),
index=['White', 'Black', 'Hispanic', 'Other'],
columns=['Bail Granted', 'Bail Denied']
)
print("Standardized Residuals:")
print(residual_df)
| | Bail Granted | Bail Denied |
|---|---|---|
| White | +1.36 | -1.79 |
| Black | -1.66 | +2.18 |
| Hispanic | -0.82 | +1.08 |
| Other | +2.02 | -2.65 |
The residuals tell us where the association lives:
- Black defendants were denied bail more often than expected (residual = +2.18 for "Denied") and granted bail less often than expected (residual = -1.66 for "Granted").
- "Other" defendants show the opposite pattern: denied bail less often than expected (residual = -2.65) and granted bail more often than expected (residual = +2.02).
- White and Hispanic defendants have residuals closer to zero — their bail outcomes are closer to what independence would predict.
This is crucial information that the overall chi-square statistic ($\chi^2 = 25.48$) alone can't provide. James can now report not just that bail decisions are associated with race, but which specific groups show the most pronounced departures from what fair treatment (independence) would look like.
19.9 Alex's Analysis: Subscription Tier and Genre Preference
Let's work through another complete example — this time from the tech world.
Alex Rivera wants to know whether StreamVibe's subscription tier is associated with content genre preference. A random sample of 500 users was categorized by their subscription tier and their most-watched genre:
| | Comedy | Drama | Documentary | Action | Total |
|---|---|---|---|---|---|
| Free | 65 | 40 | 15 | 30 | 150 |
| Basic | 50 | 55 | 35 | 60 | 200 |
| Premium | 25 | 45 | 40 | 40 | 150 |
| Total | 140 | 140 | 90 | 130 | 500 |
Step 1: State the hypotheses.
$H_0$: Subscription tier and genre preference are independent.
$H_a$: Subscription tier and genre preference are not independent.
Step 2: Compute expected frequencies.
Using $E_{ij} = (\text{Row total} \times \text{Column total}) / \text{Grand total}$:
| | Comedy ($E$) | Drama ($E$) | Documentary ($E$) | Action ($E$) |
|---|---|---|---|---|
| Free | $150 \times 140/500 = 42.0$ | $150 \times 140/500 = 42.0$ | $150 \times 90/500 = 27.0$ | $150 \times 130/500 = 39.0$ |
| Basic | $200 \times 140/500 = 56.0$ | $200 \times 140/500 = 56.0$ | $200 \times 90/500 = 36.0$ | $200 \times 130/500 = 52.0$ |
| Premium | $150 \times 140/500 = 42.0$ | $150 \times 140/500 = 42.0$ | $150 \times 90/500 = 27.0$ | $150 \times 130/500 = 39.0$ |
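Rather than computing each cell by hand, the entire expected table follows from one outer product of the row and column totals; the sketch below reproduces the table above from Alex's observed counts.

```python
import numpy as np

# Alex's observed table (rows: Free, Basic, Premium;
# columns: Comedy, Drama, Documentary, Action)
observed = np.array([[65, 40, 15, 30],
                     [50, 55, 35, 60],
                     [25, 45, 40, 40]])

row_totals = observed.sum(axis=1)   # [150, 200, 150]
col_totals = observed.sum(axis=0)   # [140, 140, 90, 130]
grand_total = observed.sum()        # 500

# E_ij = (row total x column total) / grand total, for every cell at once
expected = np.outer(row_totals, col_totals) / grand_total
print(expected[0])  # Free row: [42. 42. 27. 39.]
```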
Step 3: Check conditions. All expected counts $\geq 5$. Smallest is 27.0. Condition met.
Step 4: Compute $\chi^2$.
| Cell | $O$ | $E$ | $(O-E)^2/E$ |
|---|---|---|---|
| Free, Comedy | 65 | 42.0 | 12.595 |
| Free, Drama | 40 | 42.0 | 0.095 |
| Free, Documentary | 15 | 27.0 | 5.333 |
| Free, Action | 30 | 39.0 | 2.077 |
| Basic, Comedy | 50 | 56.0 | 0.643 |
| Basic, Drama | 55 | 56.0 | 0.018 |
| Basic, Documentary | 35 | 36.0 | 0.028 |
| Basic, Action | 60 | 52.0 | 1.231 |
| Premium, Comedy | 25 | 42.0 | 6.881 |
| Premium, Drama | 45 | 42.0 | 0.214 |
| Premium, Documentary | 40 | 27.0 | 6.259 |
| Premium, Action | 40 | 39.0 | 0.026 |
$$\chi^2 = 12.595 + 0.095 + 5.333 + 2.077 + 0.643 + 0.018 + 0.028 + 1.231 + 6.881 + 0.214 + 6.259 + 0.026 = 35.400$$
Step 5: Find the p-value.
$df = (3-1)(4-1) = 6$
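The p-value for the hand-computed statistic is the right-tail area of the chi-square distribution with 6 degrees of freedom, available directly from scipy's survival function:

```python
from scipy import stats

# Right-tail area beyond the hand-computed statistic
chi2_stat = 35.400
df = 6
p_value = stats.chi2.sf(chi2_stat, df)  # survival function = 1 - CDF
print(f"p = {p_value:.6f}")  # well below 0.001
```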
import numpy as np
from scipy import stats
# Alex's StreamVibe data
observed = np.array([
[65, 40, 15, 30], # Free
[50, 55, 35, 60], # Basic
[25, 45, 40, 40] # Premium
])
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-square statistic: {chi2_stat:.3f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p_value:.6f}")
# Cramer's V
n = observed.sum()
k = min(observed.shape) - 1
cramers_v = np.sqrt(chi2_stat / (n * k))
print(f"Cramer's V: {cramers_v:.3f}")
Output:
- $\chi^2 = 35.40$, $df = 6$, $p < 0.001$
- $V = 0.188$
Step 6: Decision and interpretation.
We reject $H_0$ ($p < 0.001$). There is very strong evidence that subscription tier and genre preference are not independent — they are associated.
Cramer's V = 0.188 indicates a small effect. The association is real but modest. To see where it lives, let's look at standardized residuals:
residuals = (observed - expected) / np.sqrt(expected)
import pandas as pd
res_df = pd.DataFrame(
np.round(residuals, 2),
index=['Free', 'Basic', 'Premium'],
columns=['Comedy', 'Drama', 'Documentary', 'Action']
)
print("\nStandardized Residuals:")
print(res_df)
| | Comedy | Drama | Documentary | Action |
|---|---|---|---|---|
| Free | +3.55 | -0.31 | -2.31 | -1.44 |
| Basic | -0.80 | -0.13 | -0.17 | +1.11 |
| Premium | -2.62 | +0.46 | +2.50 | +0.16 |
The key findings:
- Free users watch much more comedy than expected (residual = +3.55) and much less documentary (-2.31).
- Premium users watch much more documentary (+2.50) and much less comedy (-2.62).
- Basic users are close to expected across all genres — they're in the middle.
Alex can now report to the content team: Premium subscribers disproportionately prefer documentaries, while Free users disproportionately prefer comedy. This has implications for content investment, recommendation algorithms, and conversion strategies. If you want Free users to upgrade, featuring more documentary-style content in their feeds might not be the best strategy — but highlighting the documentary library might attract users already predisposed to pay.
19.10 Connecting It All: The Chi-Square Test Landscape
Let's summarize when to use which test:
| Question | Test | Null Hypothesis | df |
|---|---|---|---|
| Does one categorical variable match a specified distribution? | Goodness-of-fit | Observed matches expected | $k - 1$ |
| Are two categorical variables independent? | Test of independence | Variables are independent | $(r-1)(c-1)$ |
| Do proportions differ across groups? | Test of homogeneity* | Proportions are equal | $(r-1)(c-1)$ |
*The test of homogeneity is computationally identical to the test of independence. The difference is in the study design: the test of independence uses a single random sample and classifies each observation on two variables, while the test of homogeneity takes separate samples from different populations and compares the distribution of a single variable across them. The mechanics and formulas are the same; the interpretation differs slightly.
Connection to Previous Tests
| Situation | Previous Test (Ch.14-16) | Chi-Square Version |
|---|---|---|
| One proportion vs. hypothesized value | One-sample $z$-test for proportions | Goodness-of-fit (2 categories) |
| Comparing two proportions | Two-proportion $z$-test | Test of independence ($2 \times 2$) |
| Comparing $k > 2$ proportions | No single test available! | Test of independence/homogeneity |
The chi-square test generalizes proportion tests to handle any number of categories and any number of groups. That's its superpower.
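The $2 \times 2$ case makes this generalization concrete: the chi-square statistic from a $2 \times 2$ table (without Yates's continuity correction) equals the square of the two-proportion $z$-statistic, and the two tests give identical p-values. A sketch with a made-up table:

```python
import numpy as np
from scipy import stats

# Made-up 2x2 table: rows are two groups, columns are success/failure
table = np.array([[40, 60],
                  [25, 75]])

# Chi-square test of independence, without Yates's continuity correction
chi2, p_chi2, dof, _ = stats.chi2_contingency(table, correction=False)

# Two-proportion z-test by hand, using the pooled proportion
n1, n2 = table.sum(axis=1)
p1, p2 = table[0, 0] / n1, table[1, 0] / n2
p_pool = table[:, 0].sum() / (n1 + n2)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_z = 2 * stats.norm.sf(abs(z))

print(np.isclose(chi2, z ** 2))   # → True: chi-square equals z squared
print(np.isclose(p_chi2, p_z))    # → True: identical p-values
```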
🔄 Spaced Review (from Ch.2): Categorical Variables
Way back in Chapter 2, you learned to classify variables as categorical (nominal or ordinal) vs. numerical (discrete or continuous). We said then that categorical variables would be important throughout the course. Chapter 5 gave you bar charts. Chapter 8 gave you contingency tables. Now Chapter 19 gives you a formal hypothesis test for categorical data. This is the chapter for categorical variables — the chapter where categories get their own inferential toolkit, independent of means and medians. If you've ever felt that categorical data got less attention than numerical data in this course, this chapter is the payoff.
19.11 When Things Go Wrong: Common Mistakes
Let me flag the mistakes I see most often with chi-square tests.
Mistake 1: Using chi-square on numerical data. Chi-square tests are for counts of categorical outcomes. If your data are measurements (heights, weights, test scores), chi-square is the wrong tool. You might need a $t$-test, ANOVA, or regression instead.
Mistake 2: Confusing observed and expected in the condition check. The condition requires all expected frequencies $\geq 5$. An observed count of 0 is fine (it may even be the most interesting finding!) as long as the expected count for that cell is at least 5.
Mistake 3: Interpreting a significant result as identifying WHERE the association is. A significant chi-square test tells you an association exists. It does not tell you which cells are driving the association. You need standardized residuals for that. This is analogous to how a significant ANOVA (Chapter 20) tells you some group means differ, but not which ones — you need post-hoc tests.
Mistake 4: Forgetting that chi-square tests are always right-tailed. The chi-square distribution is not symmetric. All the "evidence against $H_0$" lives in the right tail. There's no "two-tailed" chi-square test. Small $\chi^2$ values indicate good fit (or independence); large values indicate poor fit (or association).
Mistake 5: Using chi-square with percentages instead of counts. Always enter raw counts into the chi-square formula or function, not percentages or proportions. The formula depends on the actual frequencies.
Mistake 6: Assuming chi-square proves causation. As with every statistical test, a significant chi-square result shows an association, not a causal relationship. Finding that bail decisions are associated with race doesn't prove that race causes bail outcomes — there could be confounding variables (charge severity, prior record, representation quality) that explain part of the association. Study design, not the test, determines causal claims (Theme 5 from Chapter 4 and Chapter 16).
19.12 The Big Picture: Theme 2 — The Human Stories Behind the Data
Let me pause to surface something important about this chapter.
Every chi-square test in this chapter involves categories of people. Maya's regions contain communities. James's bail data classifies real defendants by race. Alex's subscription tiers describe customers making choices. This isn't a coincidence — categorical data in the social sciences almost always describes people, and the categories we choose (or the categories that institutions choose for us) carry enormous weight.
Consider James's bail analysis. The categories "White," "Black," "Hispanic," and "Other" are socially constructed labels that flatten the complexity of individual lives into a handful of boxes. The choice of which categories to use — and how to define them — shapes what the data can reveal. If James had used only two categories (White vs. Non-White), the "Other" group's dramatically different outcomes (86% granted bail vs. 63.3% overall) would have been hidden. If he'd used ten categories, some expected counts might have fallen below 5, making the test invalid.
Theme Connection: Categorical Data Often Describes People
When we categorize people — by race, gender, income bracket, diagnosis, neighborhood — we are making choices about what matters enough to measure and what gets erased by aggregation. Chi-square tests are powerful tools for detecting patterns in categorical data, but the categories themselves are never neutral. They carry history, power, and assumptions. Always ask: Who chose these categories? Who benefits from this classification? Whose experiences are compressed into "Other"?
This theme will resurface throughout Part 6 as we encounter ANOVA (Chapter 20) and nonparametric methods (Chapter 21). The tests change; the ethical imperative to think critically about categories does not.
19.13 Computing Corner: Full Python Workflow
Here's a complete Python workflow that you can adapt for your own chi-square analyses:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
# =============================================================
# Chi-Square Goodness-of-Fit Test
# =============================================================
def chi_square_gof(observed, expected_proportions=None, labels=None,
                   alpha=0.05):
    """
    Perform a chi-square goodness-of-fit test.

    Parameters
    ----------
    observed : array-like
        Observed frequency counts.
    expected_proportions : array-like or None
        Expected proportions (must sum to 1). If None, assumes
        equal proportions.
    labels : list or None
        Category labels for display.
    alpha : float
        Significance level.
    """
    observed = np.array(observed)
    n = observed.sum()
    k = len(observed)
    if expected_proportions is None:
        expected = np.full(k, n / k)
    else:
        expected = np.array(expected_proportions) * n
    if labels is None:
        labels = [f"Category {i+1}" for i in range(k)]

    # Compute chi-square statistic
    chi2_stat, p_value = stats.chisquare(observed, f_exp=expected)
    df = k - 1

    # Check conditions
    min_expected = expected.min()
    condition_met = min_expected >= 5

    # Display results
    print("=" * 55)
    print("CHI-SQUARE GOODNESS-OF-FIT TEST")
    print("=" * 55)
    result_df = pd.DataFrame({
        'Category': labels,
        'Observed': observed,
        'Expected': np.round(expected, 2),
        'O - E': observed - expected,
        'Contribution': np.round((observed - expected)**2 / expected, 3)
    })
    print(result_df.to_string(index=False))
    print()
    print(f"Chi-square statistic: {chi2_stat:.3f}")
    print(f"Degrees of freedom: {df}")
    print(f"P-value: {p_value:.4f}")
    print(f"Min expected freq: {min_expected:.2f} "
          f"({'OK' if condition_met else 'WARNING: < 5'})")
    print()
    if p_value < alpha:
        print(f"Decision: REJECT H0 at alpha = {alpha}")
    else:
        print(f"Decision: Fail to reject H0 at alpha = {alpha}")
    return chi2_stat, p_value, df
# =============================================================
# Chi-Square Test of Independence
# =============================================================
def chi_square_independence(observed, row_labels=None, col_labels=None,
                            alpha=0.05):
    """
    Perform a chi-square test of independence with effect size.

    Parameters
    ----------
    observed : 2D array-like
        Contingency table of observed frequencies.
    row_labels, col_labels : list or None
        Labels for rows and columns.
    alpha : float
        Significance level.
    """
    observed = np.array(observed)
    r, c = observed.shape
    if row_labels is None:
        row_labels = [f"Row {i+1}" for i in range(r)]
    if col_labels is None:
        col_labels = [f"Col {j+1}" for j in range(c)]

    # Run the test
    chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

    # Effect size: Cramer's V
    n = observed.sum()
    k = min(r, c) - 1
    cramers_v = np.sqrt(chi2_stat / (n * k)) if k > 0 else 0

    # Standardized residuals
    residuals = (observed - expected) / np.sqrt(expected)

    # Check conditions
    min_expected = expected.min()
    condition_met = min_expected >= 5

    # Display results
    print("=" * 60)
    print("CHI-SQUARE TEST OF INDEPENDENCE")
    print("=" * 60)
    print("\nObserved Frequencies:")
    obs_df = pd.DataFrame(observed, index=row_labels, columns=col_labels)
    print(obs_df)
    print("\nExpected Frequencies:")
    exp_df = pd.DataFrame(np.round(expected, 2), index=row_labels,
                          columns=col_labels)
    print(exp_df)
    print("\nStandardized Residuals:")
    res_df = pd.DataFrame(np.round(residuals, 2), index=row_labels,
                          columns=col_labels)
    print(res_df)
    print(f"\nChi-square statistic: {chi2_stat:.3f}")
    print(f"Degrees of freedom: {dof}")
    print(f"P-value: {p_value:.6f}")
    print(f"Cramer's V: {cramers_v:.3f}", end="")
    if cramers_v < 0.1:
        print(" (negligible)")
    elif cramers_v < 0.2:
        print(" (small)")
    elif cramers_v < 0.3:
        print(" (small-medium)")
    elif cramers_v < 0.4:
        print(" (medium)")
    elif cramers_v < 0.5:
        print(" (medium-large)")
    else:
        print(" (large)")
    print(f"Min expected freq: {min_expected:.2f} "
          f"({'OK' if condition_met else 'WARNING: < 5'})")
    print()
    if p_value < alpha:
        print(f"Decision: REJECT H0 at alpha = {alpha}")
        print("The variables are NOT independent — an association exists.")
    else:
        print(f"Decision: Fail to reject H0 at alpha = {alpha}")
        print("No significant association detected.")
    return chi2_stat, p_value, dof, cramers_v, residuals
# =============================================================
# Example usage: Alex's StreamVibe data
# =============================================================
print("\n" + "=" * 60)
print("EXAMPLE: Alex's StreamVibe Analysis")
print("=" * 60 + "\n")
observed = np.array([
[65, 40, 15, 30], # Free
[50, 55, 35, 60], # Basic
[25, 45, 40, 40] # Premium
])
chi2, p, dof, v, resid = chi_square_independence(
observed,
row_labels=['Free', 'Basic', 'Premium'],
col_labels=['Comedy', 'Drama', 'Documentary', 'Action']
)
19.14 Visualizing Chi-Square Results
A bar chart comparing observed and expected frequencies can make your results much more interpretable:
import matplotlib.pyplot as plt
import numpy as np
# Alex's data — grouped bar chart
categories = ['Comedy', 'Drama', 'Documentary', 'Action']
tiers = ['Free', 'Basic', 'Premium']
observed = np.array([
[65, 40, 15, 30],
[50, 55, 35, 60],
[25, 45, 40, 40]
])
# Calculate proportions within each tier for fair comparison
obs_props = observed / observed.sum(axis=1, keepdims=True) * 100
x = np.arange(len(categories))
width = 0.25
fig, ax = plt.subplots(figsize=(10, 6))
bars1 = ax.bar(x - width, obs_props[0], width, label='Free',
color='#4ECDC4', edgecolor='black')
bars2 = ax.bar(x, obs_props[1], width, label='Basic',
color='#45B7D1', edgecolor='black')
bars3 = ax.bar(x + width, obs_props[2], width, label='Premium',
color='#96CEB4', edgecolor='black')
ax.set_ylabel('Percentage of Users (%)', fontsize=12)
ax.set_xlabel('Genre', fontsize=12)
ax.set_title('Genre Preferences by Subscription Tier\n'
r'($\chi^2$ = 35.40, df = 6, p < 0.001, V = 0.19)',
fontsize=13)
ax.set_xticks(x)
ax.set_xticklabels(categories, fontsize=11)
ax.legend(fontsize=11)
ax.set_ylim(0, 50)
plt.tight_layout()
plt.show()
The grouped bar chart makes the pattern visible at a glance: Free users skew heavily toward comedy, Premium users skew toward documentaries, and Basic users are more evenly distributed. The chi-square test confirms that this visual pattern isn't just noise.
19.15 Progressive Project Checkpoint
Your Turn: Data Detective Portfolio
It's time to apply chi-square tests to your own dataset.
Step 1: Identify categorical variables. Look through your dataset's data dictionary (which you created in Chapter 2 and refined in Chapter 7). Identify at least two categorical variables with 3+ categories each.
Step 2: Goodness-of-fit test. Choose one categorical variable and test whether its distribution matches a hypothesized distribution. For example:
- If using BRFSS: Does the distribution of general health ratings (Excellent/Very Good/Good/Fair/Poor) in your sample match national proportions?
- If using College Scorecard: Does the distribution of institution types (public/private nonprofit/private for-profit) match what you'd expect from national data?
- If using Gapminder: Does the distribution of countries across income groups match a hypothesized distribution?
Step 3: Test of independence. Create a contingency table from two categorical variables and test for independence. For example:
- BRFSS: Is general health rating independent of smoking status?
- College Scorecard: Is institution type independent of region?
- World Happiness: Is happiness category independent of continent?
Step 4: Full analysis. For each test:
- State hypotheses
- Compute and display expected frequencies
- Check conditions (all expected $\geq 5$)
- Report the chi-square statistic, df, and p-value
- Compute Cramer's V for the test of independence
- Compute standardized residuals and identify the cells driving any association
- Interpret results in context
Step 5: Add to your notebook. Create a new section titled "Chi-Square Analysis" in your Data Detective Portfolio notebook. Include the code, output, visualizations, and interpretation.
Estimated time: 45-60 minutes
19.16 What's Next
In this chapter, you learned to test hypotheses about categorical data using chi-square tests. You can now determine whether a categorical variable follows a specified distribution (goodness-of-fit) and whether two categorical variables are related (test of independence). You can measure effect sizes with Cramer's V and pinpoint where associations live using standardized residuals.
In Chapter 20, we'll tackle a complementary problem: comparing means across three or more groups. Just as the chi-square test generalized the two-proportion $z$-test to multiple categories, Analysis of Variance (ANOVA) generalizes the two-sample $t$-test to multiple groups. The conceptual parallels are striking — both involve comparing observed variation to expected variation, and both use the idea that "more variation than expected" constitutes evidence against the null hypothesis. But where chi-square compares counts, ANOVA compares means. And where chi-square uses the chi-square distribution, ANOVA introduces a new distribution: the $F$-distribution.
The threshold concept in Chapter 20 — decomposing variability into explained and unexplained components — will extend the reasoning you used here with observed vs. expected frequencies into the world of continuous data. If you understood the logic of the chi-square test, ANOVA will feel like a natural next step.
Chapter Summary
The Two Chi-Square Tests at a Glance
| Feature | Goodness-of-Fit | Test of Independence |
|---|---|---|
| Purpose | Does one categorical variable follow a specified distribution? | Are two categorical variables related? |
| Data structure | One column of counts | Contingency table (two-way table) |
| Expected frequencies | From hypothesized proportions | From row and column totals: $E = \frac{R \times C}{n}$ |
| Degrees of freedom | $k - 1$ | $(r-1)(c-1)$ |
| Python function | `scipy.stats.chisquare()` | `scipy.stats.chi2_contingency()` |
| Excel function | `CHISQ.TEST()` | `CHISQ.TEST()` |
The Chi-Square Statistic
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
- Measures total discrepancy between observed and expected frequencies
- Always $\geq 0$; larger values indicate more evidence against $H_0$
- Uses the right-skewed chi-square distribution for p-values
Key Conditions
- Random sample or random assignment
- Independent observations
- All expected frequencies $\geq 5$
Effect Size
$$\text{Cramer's V} = \sqrt{\frac{\chi^2}{n \cdot (k-1)}} \quad \text{where } k = \min(r, c)$$
- Ranges from 0 (no association) to 1 (perfect association)
- Benchmarks used in this chapter: negligible $< 0.10$ | small $< 0.20$ | small-medium $< 0.30$ | medium $< 0.40$ | medium-large $< 0.50$ | large $\geq 0.50$
Finding WHERE the Association Is
$$\text{Standardized residual} = \frac{O - E}{\sqrt{E}}$$
- Values beyond $\pm 2$ indicate notable deviations from independence