Case Study 2: Does More Spending Mean Better Health? Elena's Regression Analysis


Tier 1 — Verified Concepts: This case study explores the well-documented relationship between healthcare spending and health outcomes, a subject of extensive study in health economics. The general patterns described — that the U.S. spends far more on healthcare than peer nations without proportionally better outcomes, and that the relationship between spending and health has diminishing returns — are well-established in the literature. The specific data is simulated for pedagogical purposes, but the patterns are based on publicly available data from the World Health Organization and the World Bank.


The Question That Won't Go Away

Elena has a question that won't leave her alone: Does spending more money on healthcare actually produce better health outcomes?

It seems like it should be a simple question. More spending means more hospitals, more doctors, more medicine. More of those things should mean healthier people. Right?

But Elena remembers Chapter 24's warning about confounding variables, and she knows from Chapter 25 that a model's predictions don't prove causal mechanisms. So she approaches this question carefully, using linear regression not as a truth machine but as a tool for exploring patterns.

Setting Up the Analysis

Elena has data on 120 countries: healthcare spending per capita, GDP per capita, and life expectancy.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

np.random.seed(42)
n = 120

countries = pd.DataFrame({
    'country_id': range(n),
    'gdp_per_capita': np.random.lognormal(9.2, 1.1, n)
})

# Health spending correlated with GDP
countries['health_spend_per_capita'] = (
    countries['gdp_per_capita'] *
    np.random.uniform(0.04, 0.14, n)
).clip(50, 12000)

# Life expectancy: logarithmic relationship with GDP
countries['life_expectancy'] = (
    50 + 7 * np.log(countries['gdp_per_capita'] / 1000) +
    np.random.normal(0, 3, n)
).clip(45, 88)

print(countries.describe().round(1))

The First Scatter Plot: A Relationship Appears

plt.figure(figsize=(10, 6))
plt.scatter(countries['health_spend_per_capita'],
            countries['life_expectancy'],
            alpha=0.5, color='steelblue')
plt.xlabel('Healthcare Spending per Capita ($)')
plt.ylabel('Life Expectancy (years)')
plt.title("Healthcare Spending vs. Life Expectancy")
plt.show()

Elena sees a clear positive relationship: countries that spend more on healthcare tend to have longer life expectancies. But the pattern isn't a simple straight-line trend. The relationship appears to curve: at low spending levels, small increases are associated with large gains in life expectancy; at high spending levels, each additional dollar is associated with smaller and smaller gains.

This is the first red flag for linear regression. The relationship doesn't look linear.

Model 1: The Naive Linear Model

Elena fits a straight line anyway, to see what happens:

X = countries[['health_spend_per_capita']]
y = countries['life_expectancy']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model_linear = LinearRegression()
model_linear.fit(X_train, y_train)

print("=== Linear Model (raw spending) ===")
print(f"Slope: {model_linear.coef_[0]:.5f} years per $")
print(f"R² (test): {model_linear.score(X_test, y_test):.3f}")
print(f"MAE: {mean_absolute_error(y_test, model_linear.predict(X_test)):.1f} years")

The R² is decent but not great. And when Elena plots the regression line, the problem is obvious:

plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.5, color='steelblue')
x_line = pd.DataFrame({'health_spend_per_capita': np.linspace(
    X['health_spend_per_capita'].min(),
    X['health_spend_per_capita'].max(), 100)})
plt.plot(x_line, model_linear.predict(x_line),
         color='coral', linewidth=2, label='Linear fit')
plt.xlabel('Healthcare Spending per Capita ($)')
plt.ylabel('Life Expectancy (years)')
plt.title('Linear Fit: Missing the Curve')
plt.legend()
plt.show()

The line cuts through the middle of the data but misses the curvature. It overestimates life expectancy for countries with very low or very high spending, and underestimates for countries in the middle range. This is underfitting — the model is too simple for the actual relationship.
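A residual plot makes this underfitting concrete. The sketch below is self-contained, using synthetic concave data in the spirit of Elena's dataset (the coefficients and noise level are illustrative, not her fitted values):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic concave data standing in for spending vs. life expectancy
rng = np.random.default_rng(0)
spend = rng.uniform(100, 10_000, 200)
life = 50 + 5 * np.log(spend) + rng.normal(0, 1.5, 200)

fit = LinearRegression().fit(spend.reshape(-1, 1), life)
residuals = life - fit.predict(spend.reshape(-1, 1))

# An arch (positive residuals in the middle, negative at the extremes)
# is the signature of a straight line fit to a curved relationship
plt.figure(figsize=(10, 6))
plt.scatter(spend, residuals, alpha=0.5, color='steelblue')
plt.axhline(0, color='coral', linewidth=2)
plt.xlabel('Healthcare Spending per Capita ($)')
plt.ylabel('Residual (years)')
plt.title('Residuals vs. Spending')
plt.show()
```

If the linear fit were adequate, the residuals would hover around zero with no structure; the arch echoes the curve the line missed.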

Model 2: The Log Transformation

Elena remembers from the chapter that log transformations can help with diminishing returns relationships. She tries log(spending):

# Log-transform spending
countries['log_spending'] = np.log(
    countries['health_spend_per_capita'])

X_log = countries[['log_spending']]

X_train_l, X_test_l, y_train, y_test = train_test_split(
    X_log, y, test_size=0.2, random_state=42
)

model_log = LinearRegression()
model_log.fit(X_train_l, y_train)

print("=== Log-Transformed Model ===")
print(f"R² (test): {model_log.score(X_test_l, y_test):.3f}")
print(f"MAE: {mean_absolute_error(y_test, model_log.predict(X_test_l)):.1f} years")

The improvement is substantial. The log transformation captures the diminishing-returns pattern: each additional dollar is associated with a smaller gain. In a log-linear model, raising spending from $100 to $200 predicts the same gain in life expectancy as raising it from $5,000 to $10,000, even though the second jump costs fifty times as much per year gained.
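This property of log-linear models can be checked directly. The sketch below uses made-up coefficients (a and b are illustrative, not Elena's fitted values) to show that every doubling of spending predicts the same gain:

```python
import numpy as np

# Illustrative log-linear model y = a + b*log(x); a and b are made-up
# coefficients, not Elena's fitted values
a, b = 30.0, 5.0

def predicted_life(spend):
    return a + b * np.log(spend)

# Every doubling of spending adds the same b*log(2) years,
# so the *per-dollar* return shrinks as spending grows
gain_low = predicted_life(200) - predicted_life(100)        # costs $100
gain_high = predicted_life(10_000) - predicted_life(5_000)  # costs $5,000
print(f"$100 -> $200:      +{gain_low:.2f} years")
print(f"$5,000 -> $10,000: +{gain_high:.2f} years")
```

Both gains equal b·log(2), about 3.47 years with these coefficients; the second doubling simply costs fifty times more per year gained.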

plt.figure(figsize=(10, 6))
plt.scatter(countries['log_spending'], y,
            alpha=0.5, color='steelblue')
x_log_line = pd.DataFrame({'log_spending': np.linspace(
    countries['log_spending'].min(),
    countries['log_spending'].max(), 100)})
plt.plot(x_log_line, model_log.predict(x_log_line),
         color='coral', linewidth=2)
plt.xlabel('Log(Healthcare Spending per Capita)')
plt.ylabel('Life Expectancy (years)')
plt.title('Log-Transformed: A Much Better Fit')
plt.show()

After the log transformation, the relationship looks linear, and the regression line fits the data well. This is an important lesson: sometimes the relationship is linear — you just need to transform the feature to see it.

The Confounding Question

Elena has a better model now. But she's nagged by a thought: is healthcare spending really driving life expectancy? Or is something else going on?

She checks:

from scipy import stats

r_spend_gdp, _ = stats.pearsonr(
    countries['health_spend_per_capita'],
    countries['gdp_per_capita'])
r_gdp_life, _ = stats.pearsonr(
    countries['gdp_per_capita'],
    countries['life_expectancy'])

print(f"Correlation: Spending vs GDP: r = {r_spend_gdp:.2f}")
print(f"Correlation: GDP vs Life Exp: r = {r_gdp_life:.2f}")

The correlations reveal the problem: healthcare spending is highly correlated with GDP. Rich countries spend more on healthcare and have longer life expectancies. Is it the spending that matters, or the wealth? Or something else entirely — education, clean water, nutrition, political stability?
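One way to quantify this worry is a partial correlation: correlate spending with life expectancy after removing what GDP explains in each. The sketch below is self-contained, regenerating data in the same spirit as Elena's simulation (the variable names and noise levels are assumptions):

```python
import numpy as np

# GDP drives both spending and life expectancy in this simulation
rng = np.random.default_rng(42)
n = 120
log_gdp = rng.normal(9.2, 1.1, n)
log_spend = log_gdp + rng.normal(0, 0.3, n)  # spending tracks wealth
life = 50 + 7 * (log_gdp - np.log(1000)) + rng.normal(0, 3, n)

def residualize(y, x):
    """Return the residuals of a simple regression of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

raw_r = np.corrcoef(log_spend, life)[0, 1]
partial_r = np.corrcoef(residualize(log_spend, log_gdp),
                        residualize(life, log_gdp))[0, 1]
print(f"Raw correlation (spending, life): {raw_r:.2f}")
print(f"Partial correlation, GDP removed: {partial_r:.2f}")
```

Because this simulated life expectancy depends on GDP alone, the partial correlation collapses toward zero while the raw correlation stays high.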

Model 3: Multiple Regression — Disentangling the Effects

Elena adds GDP as a second feature to see whether spending still matters after accounting for wealth:

countries['log_gdp'] = np.log(countries['gdp_per_capita'])

X_both = countries[['log_spending', 'log_gdp']]

X_train_b, X_test_b, y_train, y_test = train_test_split(
    X_both, y, test_size=0.2, random_state=42
)

model_both = LinearRegression()
model_both.fit(X_train_b, y_train)

print("=== Model with Both log(Spending) and log(GDP) ===")
print(f"R² (test): {model_both.score(X_test_b, y_test):.3f}")
print(f"\nCoefficients:")
print(f"  log(spending): {model_both.coef_[0]:.2f}")
print(f"  log(GDP):      {model_both.coef_[1]:.2f}")

Elena finds something striking: once GDP is in the model, the coefficient on healthcare spending shrinks dramatically, landing near zero (and, on some train/test splits, even slightly negative). The spending coefficient in the simple model was largely capturing the effect of wealth, not an independent effect of healthcare spending.

This is confounding and multicollinearity in action, and it has profound implications for interpretation: when two predictors are this highly correlated, the model cannot cleanly attribute the outcome to either one, and individual coefficient estimates become unstable.
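A standard way to measure this overlap is the variance inflation factor (VIF): regress one predictor on the other and compute 1 / (1 - R²). The sketch below is self-contained on synthetic stand-in data (the 0.3 noise level is an assumption, not Elena's exact frame):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic predictors: spending tightly tied to GDP
rng = np.random.default_rng(0)
log_gdp = rng.normal(9.2, 1.1, 120).reshape(-1, 1)
log_spending = log_gdp.ravel() + rng.normal(0, 0.3, 120)

# VIF = 1 / (1 - R^2) from regressing one predictor on the other
r2 = LinearRegression().fit(log_gdp, log_spending).score(log_gdp, log_spending)
vif = 1 / (1 - r2)
print(f"R² of log(spending) on log(GDP): {r2:.3f}")
print(f"VIF: {vif:.1f}")
```

Rules of thumb flag VIFs above roughly 5 to 10; highly collinear predictors destabilize individual coefficients even when overall predictions stay good.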

What Elena Learns

Elena writes up her findings:

Finding 1: There is a strong positive correlation between healthcare spending and life expectancy, but the relationship is heavily confounded by GDP. Wealthier countries both spend more on healthcare and have longer life expectancies.

Finding 2: The relationship has diminishing returns — the first $500 per capita in healthcare spending is associated with much larger gains than the next $5,000. A log transformation captures this pattern well.

Finding 3: After controlling for GDP, the independent contribution of healthcare spending to life expectancy is much smaller than the simple correlation suggests. This doesn't mean healthcare spending is unimportant — it means that the simple correlation overstates the direct effect by conflating spending with national wealth.

Finding 4: The model explains approximately 65-75% of the variation in life expectancy. The remaining 25-35% likely reflects factors not in the data: education, sanitation, diet, inequality, governance quality.

The Comparison Table

Model        Features                    Test R²   Test MAE
Baseline     Mean life expectancy        0.00      ~6 years
Linear       Raw spending                ~0.45     ~4.5 years
Log-linear   log(spending)               ~0.65     ~3.2 years
Multiple     log(spending) + log(GDP)    ~0.70     ~2.8 years

Each model tells a different story:

  • The linear model says spending predicts life expectancy
  • The log model says spending has diminishing returns
  • The multiple model says much of the "spending effect" is actually a "wealth effect"

All three models are useful. But they answer different questions.
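The baseline row deserves a note: it corresponds to always predicting the training mean, which scikit-learn packages as DummyRegressor. A minimal self-contained sketch (the synthetic life expectancies are illustrative):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Always predict the training mean, ignoring all features
rng = np.random.default_rng(1)
life = rng.normal(72, 7, 120)  # synthetic life expectancies
X = np.zeros((120, 1))         # features are ignored by the dummy

baseline = DummyRegressor(strategy='mean').fit(X, life)
print(f"R² (train): {baseline.score(X, life):.2f}")
print(f"MAE (train): {mean_absolute_error(life, baseline.predict(X)):.1f} years")
```

On training data the mean model's R² is exactly 0 by construction, which is what makes it the natural floor for the comparison table.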

The Policy Implications (and Their Limits)

A policymaker might look at Elena's first model and conclude: "Let's increase healthcare spending to improve life expectancy!" But the multiple regression tells a more nuanced story — increasing spending without addressing the underlying factors that make wealthy countries healthier (education, infrastructure, nutrition, governance) might not work as well as the simple correlation suggests.

Elena is careful to note that her model is observational, not experimental. Even the multiple regression can't prove causation. There could be confounders that she hasn't measured. The direction of causality might be reversed — healthier populations might generate more wealth, not the other way around. And the relationship observed across countries might not apply within a single country.

But the model is still useful. It identifies patterns, suggests where to investigate further, and — critically — prevents the simplistic conclusion that "more spending = better health" without qualification.

Connecting to the Chapter

Chapter Concept              Case Study Application
Simple regression            Spending → life expectancy (first model)
Residual analysis            Systematic over/under-prediction revealed nonlinearity
Log transformation           Captured the diminishing-returns relationship
Multiple regression          Disentangled spending from GDP
Multicollinearity            Spending and GDP are highly correlated
Coefficient interpretation   The spending coefficient changed dramatically when GDP was added
R² comparison                Each model improved on the last
Prediction vs. explanation   Primarily an explanation task: the coefficients matter as much as the R²

Discussion Questions

  1. Elena found that the healthcare spending coefficient shrank when GDP was added. Does this mean healthcare spending doesn't matter? How would you interpret this result to a policymaker?

  2. The relationship between spending and life expectancy shows diminishing returns. What does this mean practically? At what point does additional spending stop being effective?

  3. Could the causation run the other way — could longer life expectancy cause higher healthcare spending (older populations need more care)? How would you investigate this?

  4. What additional features would you want to include in the model? List three and explain what confounding they might address.

  5. Elena used R² and MAE to compare models. What metric do you think a health policymaker would care about more, and why?


Key Takeaways from This Case Study

  • Scatter plots before modeling can reveal nonlinear relationships that linear regression will miss
  • Log transformations are a powerful tool for capturing diminishing returns
  • Simple correlations can be misleading when confounders are present — multiple regression helps disentangle effects
  • Multicollinearity makes individual coefficients unstable but doesn't necessarily hurt predictions
  • The progression from simple to multiple regression tells a richer story than any single model
  • Observational regression can identify patterns and suggest hypotheses but cannot prove causation