Case Study 1: Maya's Poverty Rate and ER Visit Analysis

The Setup

Dr. Maya Chen is presenting to the county health board. The question is deceptively simple: Why do some communities have dramatically higher emergency room visit rates than others?

The board has noticed that ER overcrowding has reached crisis levels in several communities, with average wait times exceeding four hours. They want to understand the drivers so they can allocate intervention funding effectively. A board member has already offered a hypothesis: "It's poverty. Fix poverty, fix the ER problem."

Maya knows the data supports a strong association. But she also knows that the relationship between poverty and ER utilization is more complicated than a single regression line suggests. The real story involves lurking variables, confounding, and the critical distinction between correlation and causation that she's been teaching herself since her epidemiology training.

Her goal for today's presentation: show the board what the data says, what it doesn't say, and what additional information they would need before making policy decisions.

The Data

Maya has assembled data from 25 communities in her health district: the poverty rate (percentage of households below the federal poverty line) and the ER visit rate (annual visits per 1,000 residents).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
import seaborn as sns

np.random.seed(2026)

# ============================================================
# MAYA'S FULL ANALYSIS: POVERTY AND ER VISITS
# ============================================================

# Community-level data
communities = pd.DataFrame({
    'community': [f'Community {i+1}' for i in range(25)],
    'poverty_rate': [8.2, 12.5, 15.3, 5.1, 22.6, 18.4, 9.7, 25.1, 11.3,
                     19.8, 6.4, 14.7, 21.3, 7.8, 16.9, 28.4, 10.2, 23.7,
                     13.6, 20.5, 4.3, 17.2, 26.8, 8.9, 15.8],
    'er_rate': [145, 198, 225, 110, 310, 265, 155, 345, 175,
                280, 120, 215, 295, 138, 240, 380, 165, 320,
                200, 290, 95, 250, 365, 148, 230],
    'uninsured_pct': [6.5, 10.2, 14.1, 3.8, 20.5, 16.8, 7.3, 23.2, 9.8,
                      18.1, 5.1, 12.9, 19.7, 6.2, 15.3, 25.8, 8.4, 21.6,
                      11.5, 18.9, 3.2, 15.0, 24.3, 7.1, 13.7],
    'pcp_per_1000': [3.2, 2.5, 2.0, 3.8, 1.3, 1.6, 2.9, 1.0, 2.4,
                     1.5, 3.5, 2.1, 1.4, 3.1, 1.8, 0.8, 2.7, 1.2,
                     2.3, 1.5, 4.0, 1.9, 0.9, 3.0, 2.1]
})

print("=" * 70)
print("MAYA'S ANALYSIS: POVERTY RATE, ER VISITS, AND LURKING VARIABLES")
print("=" * 70)

# ---- PHASE 1: THE SIMPLE STORY ----
print("\n" + "-" * 70)
print("PHASE 1: THE SIMPLE STORY")
print("-" * 70)

# Step 1: Always plot first
fig, ax = plt.subplots(figsize=(10, 7))
sns.regplot(data=communities, x='poverty_rate', y='er_rate',
            ci=95,
            scatter_kws={'s': 80, 'edgecolors': 'navy', 'alpha': 0.8},
            line_kws={'color': 'red', 'linewidth': 2})
ax.set_xlabel('Poverty Rate (% below poverty line)', fontsize=12)
ax.set_ylabel('ER Visit Rate (per 1,000 residents)', fontsize=12)
ax.set_title('Poverty Rate vs. ER Visit Rate (25 Communities)',
             fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Step 2: Correlation
r, p = stats.pearsonr(communities['poverty_rate'],
                      communities['er_rate'])
print(f"\nCorrelation between poverty rate and ER visit rate:")
print(f"  r = {r:.4f}")
print(f"  p = {p:.2e}")
print(f"  R² = {r**2:.4f}")

# Step 3: Simple regression
result = stats.linregress(communities['poverty_rate'],
                          communities['er_rate'])
print(f"\nSimple Regression: ER Rate = {result.intercept:.2f} + "
      f"{result.slope:.2f} × Poverty Rate")
print(f"  Slope: Each 1 percentage-point increase in poverty rate")
print(f"  is associated with {result.slope:.1f} additional ER visits")
print(f"  per 1,000 residents.")
print(f"  Intercept: A community with 0% poverty is predicted")
print(f"  to have {result.intercept:.0f} ER visits per 1,000.")

# Step 4: Residual diagnostics
predicted = result.intercept + result.slope * communities['poverty_rate']
residuals = communities['er_rate'] - predicted

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(predicted, residuals, color='steelblue',
                edgecolors='navy', s=80)
axes[0].axhline(y=0, color='red', linestyle='--', linewidth=1.5)
axes[0].set_xlabel('Predicted ER Visit Rate')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residual Plot')
axes[0].grid(True, alpha=0.3)

stats.probplot(residuals, dist="norm", plot=axes[1])
axes[1].set_title('QQ-Plot of Residuals')
plt.tight_layout()
plt.show()

print(f"\nResidual diagnostics:")
print(f"  Mean of residuals: {np.mean(residuals):.6f} (should be ~0)")
print(f"  SD of residuals:   {np.std(residuals, ddof=2):.2f}")

# ---- PHASE 2: THE LURKING VARIABLES ----
print("\n" + "-" * 70)
print("PHASE 2: INVESTIGATING LURKING VARIABLES")
print("-" * 70)

# Is poverty really the driver, or is it access to primary care?
r_pcp, p_pcp = stats.pearsonr(communities['pcp_per_1000'],
                               communities['er_rate'])
print(f"\nCorrelation: Primary Care Physicians vs. ER Rate")
print(f"  r = {r_pcp:.4f} (p = {p_pcp:.2e})")

# And the connection between poverty and primary care access
r_pov_pcp, p_pov_pcp = stats.pearsonr(communities['poverty_rate'],
                                       communities['pcp_per_1000'])
print(f"\nCorrelation: Poverty Rate vs. Primary Care Physicians")
print(f"  r = {r_pov_pcp:.4f} (p = {p_pov_pcp:.2e})")

# Insurance as lurking variable
r_ins, p_ins = stats.pearsonr(communities['uninsured_pct'],
                               communities['er_rate'])
print(f"\nCorrelation: Uninsured Rate vs. ER Rate")
print(f"  r = {r_ins:.4f} (p = {p_ins:.2e})")

r_pov_ins, p_pov_ins = stats.pearsonr(communities['poverty_rate'],
                                       communities['uninsured_pct'])
print(f"\nCorrelation: Poverty Rate vs. Uninsured Rate")
print(f"  r = {r_pov_ins:.4f} (p = {p_pov_ins:.2e})")

# Visualize the web of correlations
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: Poverty → ER
sns.regplot(data=communities, x='poverty_rate', y='er_rate',
            ax=axes[0], ci=95,
            scatter_kws={'s': 60, 'edgecolors': 'navy'},
            line_kws={'color': 'red'})
axes[0].set_title(f'Poverty → ER Rate\n(r = {r:.3f})')
axes[0].set_xlabel('Poverty Rate (%)')
axes[0].set_ylabel('ER Visits per 1,000')

# Plot 2: Primary Care → ER
sns.regplot(data=communities, x='pcp_per_1000', y='er_rate',
            ax=axes[1], ci=95,
            scatter_kws={'s': 60, 'edgecolors': 'navy'},
            line_kws={'color': 'red'})
axes[1].set_title(f'Primary Care Access → ER Rate\n(r = {r_pcp:.3f})')
axes[1].set_xlabel('PCPs per 1,000 Residents')
axes[1].set_ylabel('ER Visits per 1,000')

# Plot 3: Uninsured → ER
sns.regplot(data=communities, x='uninsured_pct', y='er_rate',
            ax=axes[2], ci=95,
            scatter_kws={'s': 60, 'edgecolors': 'navy'},
            line_kws={'color': 'red'})
axes[2].set_title(f'Uninsured Rate → ER Rate\n(r = {r_ins:.3f})')
axes[2].set_xlabel('Uninsured Rate (%)')
axes[2].set_ylabel('ER Visits per 1,000')

plt.suptitle('The Web of Correlations: Multiple Variables Intertwined',
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Correlation matrix
print("\n" + "-" * 70)
print("CORRELATION MATRIX")
print("-" * 70)
corr_vars = communities[['poverty_rate', 'er_rate',
                          'uninsured_pct', 'pcp_per_1000']]
corr_matrix = corr_vars.corr()
print(corr_matrix.round(3))

# Heatmap
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, fmt='.3f', cmap='RdBu_r',
            center=0, vmin=-1, vmax=1,
            xticklabels=['Poverty', 'ER Rate', 'Uninsured', 'PCPs'],
            yticklabels=['Poverty', 'ER Rate', 'Uninsured', 'PCPs'],
            square=True, linewidths=0.5)
ax.set_title('Correlation Heatmap: All Variables', fontsize=14)
plt.tight_layout()
plt.show()

The Analysis

Phase 1: The Simple Story

Maya's initial regression tells a compelling story. Poverty rate and ER visit rate have a very strong positive correlation ($r = 0.99$, $R^2 = 0.97$). The regression equation is:

$$\hat{y} = 54.8 + 11.5x$$

For each 1 percentage-point increase in a community's poverty rate, the predicted ER visit rate increases by approximately 11.5 visits per 1,000 residents. A community with 0% poverty would have a predicted ER rate of about 55 per 1,000 — a baseline that likely represents ER visits for genuine emergencies (accidents, heart attacks, etc.).

The residual plot shows no concerning patterns — residuals scatter randomly around zero with roughly constant spread. The QQ-plot suggests approximate normality. From a statistical standpoint, the model checks out.

If Maya stopped here, the conclusion would be straightforward: "Poverty drives ER utilization. Reduce poverty, reduce ER crowding."

Phase 2: But Wait — Lurking Variables

Maya doesn't stop there. She examines two additional variables: the percentage of residents without health insurance and the number of primary care physicians per 1,000 residents.

The correlation web:

Poverty Rate ←——— r = 0.99 ———→ Uninsured Rate
     |                                    |
     |                                    |
  r = 0.99                            r = 0.99
     |                                    |
     ↓                                    ↓
  ER Visit Rate ←—— r = -0.98 ——→ PCP Access

Every variable is highly correlated with every other variable. Poverty is associated with being uninsured ($r = 0.99$), which is associated with fewer primary care doctors ($r = -0.97$), which is associated with higher ER rates ($r = -0.98$).

The critical question: Which variable is actually driving ER visits?

  • Is it poverty itself? (People who are poor have more health problems due to stress, poor nutrition, environmental exposure.)
  • Is it lack of insurance? (Without insurance, people skip preventive care and use the ER when problems become acute.)
  • Is it lack of primary care access? (Without a doctor to call, the ER becomes the default provider.)
  • Is it all three? (They're so intertwined that separating their effects is nearly impossible with observational data.)

The Confounding Diagram

               Poverty
              ↗       ↘
   Lack of insurance   Poor living conditions
        ↓                    ↓
   No preventive care    More health problems
        ↓                    ↓
   Conditions worsen    Acute episodes
        ↘                 ↗
         ER as last resort

The path from poverty to ER visits runs through multiple mechanisms. A simple regression of ER rate on poverty rate captures the total association, but it doesn't tell you which mechanism is doing the work. Different mechanisms imply different interventions:

  • If lack of insurance is the main driver → expand Medicaid
  • If lack of primary care is the main driver → recruit doctors to underserved areas
  • If poverty itself causes health problems → address root economic conditions

Key Insight: This is why multiple regression (Chapter 23) exists. By including poverty rate, uninsured rate, AND primary care access in the same model, Maya can estimate the partial effect of each variable, holding the others constant. Simple regression can only show the total association. Multiple regression begins to untangle the web.

Maya's Presentation to the Board

"The data show an undeniable association between poverty and ER utilization. Communities with higher poverty rates have dramatically higher ER visit rates, and the linear relationship is very strong.

"However, I want to be transparent about what this analysis cannot tell us. Poverty doesn't operate in isolation. It's tangled with insurance coverage, primary care access, environmental health hazards, health literacy, and a dozen other factors. All of these are correlated with poverty, and all of them plausibly contribute to ER utilization.

"Reducing poverty would likely reduce ER visits — but we can't say by how much from this analysis alone, because we don't know which pathways are doing the most work. A dollar spent expanding insurance coverage might reduce ER visits more than a dollar spent on general poverty reduction, or vice versa.

"My recommendation: before allocating intervention funding, we should: 1. Conduct a multiple regression analysis to estimate the independent contribution of each factor. 2. Look at natural experiments — communities where insurance expanded but poverty didn't change, or where primary care clinics opened in still-poor neighborhoods — to isolate causal effects. 3. Consider a pilot program with a control group to measure actual impact."

Theme 2 — Human Stories Behind the Data:

Behind every data point in Maya's scatterplot is a community of real people. Community 16 — with a 28.4% poverty rate and 380 ER visits per 1,000 residents — represents thousands of families who use the emergency room because they have nowhere else to go. A child with an ear infection. A construction worker with a sprained wrist. A grandmother whose diabetes medication ran out three weeks ago because she couldn't afford the refill. The scatterplot shows a dot at (28.4, 380). The reality is a waiting room full of people who have been there for six hours.

Good statistics requires never losing sight of the people the data represents.

Ethical Analysis

Ethical Reflection:

Maya faces an ethical tension. The board wants a simple answer: "Here's the cause, here's the solution." Maya's data can't provide that level of certainty. She could present the $R^2 = 0.97$ result without mentioning the lurking variables, and the board would likely fund a poverty-reduction initiative — which would probably help, even if the mechanism is indirect.

But statistical integrity demands honesty about what the data can and cannot tell us. Overstating a causal claim could lead to misallocated resources. If the real driver is primary care access, and the board funds a job training program instead of a community health clinic, the ER problem persists — and real people pay the price.

The ethical statistician presents the full picture, including the uncertainty, even when a simpler story would be more persuasive.

Discussion Questions

  1. Maya's $R^2 = 0.97$ seems to leave little room for other variables. Does this mean poverty is definitely the main cause? Or could multiple variables be so correlated that any one of them would produce a high $R^2$?

  2. If Maya ran separate regressions of ER rate on poverty, ER rate on uninsured rate, and ER rate on PCP access, all three would show very high $R^2$. Does this help us determine which variable is the true driver? Why or why not?

  3. The board member said "Fix poverty, fix the ER problem." Under what conditions would this statement be correct? Under what conditions would it be misleading?

  4. Design a study that could help determine whether lack of insurance causes higher ER utilization, independent of poverty. What challenges would such a study face?

  5. How does this case study illustrate the difference between prediction and explanation? Maya's model predicts ER rates very well from poverty rates. Does this mean poverty explains ER rates?