Case Study 1: MediCore Hierarchical Treatment Effects Across Hospitals
Context
MediCore Pharmaceuticals has completed a Phase III observational study of Cardiopril-X across 15 hospitals in the U.S. and Europe. The study measures systolic blood pressure (SBP) reduction after 12 weeks of treatment. Each hospital enrolled a different number of patients — from 42 at a small community clinic in rural Poland to 580 at a major academic medical center in Boston — and reports a different estimated treatment effect.
The regulatory team faces a decision: should they submit a single pooled treatment effect, or present hospital-specific results? The single pooled estimate (weighted average: 9.2 mmHg, 95% CI [8.4, 10.0]) is clean and straightforward, but the FDA reviewers will want to know whether the treatment works consistently across sites. The hospital-specific estimates tell a messy story: they range from 4.1 mmHg (a small hospital with 42 patients) to 13.8 mmHg (another small hospital with 55 patients). The variance is partly real (different patient populations, different protocols) and partly noise (small samples).
A hierarchical Bayesian model resolves this tension. It produces a population-level estimate that accounts for between-hospital heterogeneity, and it produces hospital-specific estimates that borrow strength from the population distribution — giving more credible results for small hospitals without ignoring genuine site differences.
The Data
import pymc as pm
import arviz as az
import numpy as np
import matplotlib.pyplot as plt
# 15-hospital Cardiopril-X Phase III data
hospital_data = {
    "name": [
        "Boston Academic", "Chicago Teaching", "London NHS",
        "Paris Tertiary", "Munich University", "Toronto General",
        "Tokyo Medical", "Sao Paulo Central", "Sydney Metro",
        "Stockholm Research", "Rural Poland", "Rural Kentucky",
        "Community Ohio", "Suburban Milan", "Cape Town Public"
    ],
    "n_patients": np.array([
        580, 420, 350, 310, 280, 240, 200, 170, 150, 120, 42, 55, 68, 95, 85
    ]),
    "observed_effect": np.array([
        9.4, 8.8, 9.1, 10.2, 9.7, 8.5, 10.8, 7.2, 9.0, 11.3,
        4.1, 13.8, 6.9, 8.4, 7.6
    ]),
}
# Standard errors (approximated as population SD / sqrt(n))
population_sd = 12.0 # SBP reduction SD from prior studies
hospital_data["se"] = population_sd / np.sqrt(hospital_data["n_patients"])
n_hospitals = len(hospital_data["name"])
print("Cardiopril-X Phase III Data")
print("=" * 72)
print(f"{'Hospital':>22s} {'n':>5s} {'Effect (mmHg)':>14s} {'SE':>6s} {'95% CI':>18s}")
print("-" * 72)
for j in range(n_hospitals):
    se = hospital_data["se"][j]
    eff = hospital_data["observed_effect"][j]
    ci_lo = eff - 1.96 * se
    ci_hi = eff + 1.96 * se
    print(f"{hospital_data['name'][j]:>22s} {hospital_data['n_patients'][j]:>5d} "
          f"{eff:>14.1f} {se:>6.2f} [{ci_lo:>6.1f}, {ci_hi:>5.1f}]")
Cardiopril-X Phase III Data
========================================================================
Hospital n Effect (mmHg) SE 95% CI
------------------------------------------------------------------------
Boston Academic 580 9.4 0.50 [ 8.4, 10.4]
Chicago Teaching 420 8.8 0.59 [ 7.7, 9.9]
London NHS 350 9.1 0.64 [ 7.8, 10.4]
Paris Tertiary 310 10.2 0.68 [ 8.9, 11.5]
Munich University 280 9.7 0.72 [ 8.3, 11.1]
Toronto General 240 8.5 0.77 [ 7.0, 10.0]
Tokyo Medical 200 10.8 0.85 [ 9.1, 12.5]
Sao Paulo Central 170 7.2 0.92 [ 5.4, 9.0]
Sydney Metro 150 9.0 0.98 [ 7.1, 10.9]
Stockholm Research 120 11.3 1.10 [ 9.1, 13.5]
Rural Poland 42 4.1 1.85 [ 0.5, 7.7]
Rural Kentucky 55 13.8 1.62 [ 10.6, 17.0]
Community Ohio 68 6.9 1.46 [ 4.0, 9.8]
Suburban Milan 95 8.4 1.23 [ 6.0, 10.8]
Cape Town Public 85 7.6 1.30 [ 5.1, 10.1]
The Analysis
Step 1: Prior Specification
The regulatory team and clinical statisticians agree on the following priors:
- Population mean effect $\mu$: $\mathcal{N}(10, 5^2)$. Based on prior meta-analyses of ACE inhibitors, the expected SBP reduction is around 10 mmHg, but with substantial uncertainty.
- Between-hospital SD $\tau$: $\text{HalfNormal}(5)$. This allows for moderate site-to-site variation while ruling out implausibly large heterogeneity.
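It helps to see what $\text{HalfNormal}(5)$ actually implies before fitting. A quick numpy check, not part of the original analysis (the values in the comments follow from the half-normal's known moments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draws from the HalfNormal(5) prior on the between-hospital SD tau,
# constructed as the absolute value of a N(0, 5) variate
tau_draws = np.abs(rng.normal(0.0, 5.0, size=200_000))

print(f"Prior mean of tau: {tau_draws.mean():.2f} mmHg")   # theory: 5*sqrt(2/pi) ~ 3.99
print(f"P(tau > 10 mmHg):  {(tau_draws > 10).mean():.3f}") # theory: 2*(1 - Phi(2)) ~ 0.046
```

So the prior expects roughly 4 mmHg of site-to-site spread and assigns under 5% probability to spread beyond 10 mmHg, which matches the stated intent of allowing moderate variation while ruling out implausibly large heterogeneity.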
observed = hospital_data["observed_effect"]
se = hospital_data["se"]
# Non-centered hierarchical model
with pm.Model() as medicore_model:
    # Hyperpriors
    mu = pm.Normal("mu", mu=10, sigma=5)
    tau = pm.HalfNormal("tau", sigma=5)
    # Non-centered parameterization
    eta = pm.Normal("eta", mu=0, sigma=1, shape=n_hospitals)
    theta = pm.Deterministic("theta", mu + tau * eta)
    # Likelihood
    y = pm.Normal("y", mu=theta, sigma=se, observed=observed)
    # Stage 1: Prior predictive check
    prior_pred = pm.sample_prior_predictive(samples=1000, random_seed=42)
# Verify prior predictive
prior_effects = prior_pred.prior["theta"].values.flatten()
print("Prior predictive check (hospital-level effects):")
print(f" Range: [{prior_effects.min():.1f}, {prior_effects.max():.1f}] mmHg")
print(f" Mean: {prior_effects.mean():.1f} mmHg")
print(f" Std: {prior_effects.std():.1f} mmHg")
print(f" P(effect < 0): {(prior_effects < 0).mean():.3f}")
print(f" P(effect > 20): {(prior_effects > 20).mean():.3f}")
Prior predictive check (hospital-level effects):
Range: [-19.4, 38.8] mmHg
Mean: 10.0 mmHg
Std: 7.6 mmHg
P(effect < 0): 0.084
P(effect > 20): 0.082
The prior predictive allows negative effects (8.4% probability) and very large effects above 20 mmHg (8.2%) — both plausible in principle, though the bulk of the prior mass is in the 0-20 mmHg range consistent with the ACE inhibitor literature.
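These numbers can be replicated without PyMC, which doubles as a check on the non-centered construction: $\theta = \mu + \tau \eta$ with $\eta \sim \mathcal{N}(0, 1)$ is distributionally identical to $\theta \sim \mathcal{N}(\mu, \tau^2)$. A plain-numpy sketch (its figures differ somewhat from the 1,000-draw PyMC run above, whose hospital-level draws share $\mu$ and $\tau$ and are therefore correlated):

```python
import numpy as np

# Stand-alone replication of the prior predictive for theta, using the same
# non-centered construction as the model: theta = mu + tau * eta
rng = np.random.default_rng(42)
n = 200_000

mu = rng.normal(10.0, 5.0, size=n)          # population-mean prior N(10, 5^2)
tau = np.abs(rng.normal(0.0, 5.0, size=n))  # HalfNormal(5) via |N(0, 5)|
eta = rng.normal(0.0, 1.0, size=n)          # standardized hospital offsets
theta = mu + tau * eta

print(f"Mean: {theta.mean():.1f} mmHg  Std: {theta.std():.1f} mmHg")
print(f"P(theta < 0): {(theta < 0).mean():.3f}")
```

Analytically the marginal variance is $5^2 + E[\tau^2] = 50$, giving a standard deviation near 7.1 mmHg and a negative-effect probability around 8%, consistent with the check above.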
Step 2: Fit the Model
with medicore_model:
    trace = pm.sample(
        draws=4000, tune=2000, chains=4,
        random_seed=42, target_accept=0.95,
        return_inferencedata=True,
    )
# Stage 3: MCMC diagnostics
summary = az.summary(trace, var_names=["mu", "tau"])
print("\nPopulation-level parameters:")
print(summary)
divergences = trace.sample_stats["diverging"].values.sum()
print(f"\nDivergences: {divergences}")
# All theta estimates
theta_summary = az.summary(trace, var_names=["theta"])
print("\nHospital-specific effects:")
print(theta_summary[["mean", "sd", "hdi_3%", "hdi_97%", "r_hat"]])
Population-level parameters:
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
mu 9.122 0.584 8.029 10.243 0.009 0.007 4050.0 3820.0 1.0
tau 1.582 0.722 0.418 2.926 0.013 0.009 3210.0 3480.0 1.0
Divergences: 0
Hospital-specific effects:
mean sd hdi_3% hdi_97% r_hat
theta[0] 9.351 0.492 8.405 10.272 1.0
theta[1] 8.861 0.565 7.780 9.928 1.0
theta[2] 9.106 0.594 7.976 10.226 1.0
theta[3] 10.030 0.618 8.855 11.194 1.0
theta[4] 9.612 0.650 8.372 10.842 1.0
theta[5] 8.625 0.682 7.320 9.912 1.0
theta[6] 10.385 0.728 8.992 11.750 1.0
theta[7] 7.658 0.780 6.165 9.122 1.0
theta[8] 9.048 0.818 7.482 10.580 1.0
theta[9] 10.692 0.870 9.032 12.332 1.0
theta[10] 7.256 1.285 4.800 9.668 1.0
theta[11] 11.254 1.158 9.050 13.428 1.0
theta[12] 7.782 1.102 5.680 9.848 1.0
theta[13] 8.641 0.982 6.770 10.488 1.0
theta[14] 8.052 1.022 6.096 9.972 1.0
Step 3: Quantifying Shrinkage
theta_means = trace.posterior["theta"].mean(dim=["chain", "draw"]).values
pop_mean = trace.posterior["mu"].mean().values
print(f"\nShrinkage analysis (population mean: {pop_mean:.2f} mmHg):")
print(f"{'Hospital':>22s} {'n':>5s} {'Raw':>6s} {'Hier.':>6s} {'Shrink':>7s}")
print("-" * 54)
for j in range(n_hospitals):
    raw = observed[j]
    hier = theta_means[j]
    if abs(raw - pop_mean) > 0.01:
        shrink = 1 - (hier - pop_mean) / (raw - pop_mean)
        print(f"{hospital_data['name'][j]:>22s} {hospital_data['n_patients'][j]:>5d} "
              f"{raw:>6.1f} {hier:>6.1f} {shrink:>6.1%}")
    else:
        print(f"{hospital_data['name'][j]:>22s} {hospital_data['n_patients'][j]:>5d} "
              f"{raw:>6.1f} {hier:>6.1f} ---")
Shrinkage analysis (population mean: 9.12 mmHg):
Hospital n Raw Hier. Shrink
------------------------------------------------------
Boston Academic 580 9.4 9.35 17.9%
Chicago Teaching 420 8.8 8.86 18.8%
London NHS 350 9.1 9.11 0.0%
Paris Tertiary 310 10.2 10.03 15.7%
Munich University 280 9.7 9.61 15.5%
Toronto General 240 8.5 8.63 20.1%
Tokyo Medical 200 10.8 10.39 24.4%
Sao Paulo Central 170 7.2 7.66 23.9%
Sydney Metro 150 9.0 9.05 41.7%
Stockholm Research 120 11.3 10.69 28.0%
Rural Poland 42 4.1 7.26 62.9%
Rural Kentucky 55 13.8 11.25 54.5%
Community Ohio 68 6.9 7.78 39.6%
Suburban Milan 95 8.4 8.64 33.3%
Cape Town Public 85 7.6 8.05 29.6%
The pattern is precisely as expected: small hospitals show heavy shrinkage (Rural Poland: 63%, Rural Kentucky: 55%) while large hospitals barely move (Boston: 18%). Rural Poland's raw estimate of 4.1 mmHg — which might suggest the drug does not work at that site — is shrunk to 7.3 mmHg, reflecting the model's inference that the low estimate is more likely due to small-sample noise than a genuine lack of efficacy.
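The shrinkage fractions can be anticipated analytically. In a normal-normal model with $\mu$ and $\tau$ held fixed, hospital $j$'s posterior mean moves toward the population mean by the factor $B_j = \text{se}_j^2 / (\text{se}_j^2 + \tau^2)$. A plug-in sketch for the largest and smallest sites (the full posterior averages over uncertainty in $\mu$ and $\tau$, so the table's fractions run somewhat higher, especially for large sites):

```python
import numpy as np

tau_hat = 1.58  # posterior mean of tau from the fit above
se = 12.0 / np.sqrt(np.array([580, 42]))  # Boston Academic, Rural Poland

# Plug-in shrinkage factor: fraction of the distance to the population
# mean that the hospital estimate is pulled
B = se**2 / (se**2 + tau_hat**2)
for name, b in zip(["Boston Academic", "Rural Poland"], B):
    print(f"{name:>16s}: plug-in shrink {b:.0%}")
```

The qualitative ordering is identical to the table: the 42-patient site is pulled more than half the way to the population mean, the 580-patient site barely at all.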
Step 4: Regulatory Decision Support
# Key regulatory questions
theta_samples = trace.posterior["theta"].values.reshape(-1, n_hospitals)
mu_samples = trace.posterior["mu"].values.flatten()
# Q1: Population mean effect
print("Regulatory Summary")
print("=" * 60)
mu_mean = mu_samples.mean()
mu_ci = np.percentile(mu_samples, [2.5, 97.5])  # equal-tailed credible interval
print(f"Population mean effect: {mu_mean:.2f} mmHg")
print(f" 95% credible interval: [{mu_ci[0]:.2f}, {mu_ci[1]:.2f}]")
# Q2: Probability of clinically meaningful effect (> 5 mmHg)
p_clinically_meaningful = (mu_samples > 5.0).mean()
print(f" P(population effect > 5 mmHg): {p_clinically_meaningful:.4f}")
# Q3: Does any hospital have a genuinely negative effect?
p_negative_any = np.zeros(n_hospitals)
for j in range(n_hospitals):
    p_negative_any[j] = (theta_samples[:, j] < 0).mean()
print(f"\nProbability of negative effect by hospital:")
for j in range(n_hospitals):
    if p_negative_any[j] > 0.001:
        print(f" {hospital_data['name'][j]:>22s}: P(effect < 0) = {p_negative_any[j]:.4f}")
print(f" All others: < 0.001")
# Q4: Between-hospital heterogeneity
tau_samples = trace.posterior["tau"].values.flatten()
print(f"\nBetween-hospital SD: {tau_samples.mean():.2f} mmHg "
      f"(95% CI: [{np.percentile(tau_samples, 2.5):.2f}, "
      f"{np.percentile(tau_samples, 97.5):.2f}])")
i2 = tau_samples ** 2 / (tau_samples ** 2 + np.mean(se ** 2))
print(f"Approximate I²: {i2.mean():.1%} "
      f"(95% CI: [{np.percentile(i2, 2.5):.1%}, {np.percentile(i2, 97.5):.1%}])")
# Q5: Probability that the treatment is effective (> 5 mmHg) at EVERY hospital
p_effective_all = np.all(theta_samples > 5.0, axis=1).mean()
print(f"\nP(effect > 5 mmHg at ALL hospitals): {p_effective_all:.4f}")
Regulatory Summary
============================================================
Population mean effect: 9.12 mmHg
95% credible interval: [7.98, 10.24]
P(population effect > 5 mmHg): 1.0000
Probability of negative effect by hospital:
Rural Poland: P(effect < 0) = 0.0012
All others: < 0.001
Between-hospital SD: 1.58 mmHg (95% CI: [0.31, 3.04])
Approximate I²: 57.2% (95% CI: [8.1%, 82.0%])
P(effect > 5 mmHg at ALL hospitals): 0.8462
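One further question the trace can answer, though the script above does not compute it: what effect should be expected at a brand-new, unobserved site? Since $\theta_{\text{new}} = \mu + \tau z$ for a fresh $z \sim \mathcal{N}(0, 1)$, it only requires the posterior draws of $\mu$ and $\tau$. The sketch below uses crude normal stand-ins for those draws, an assumption on my part; in the real analysis, substitute the arrays from `trace.posterior`:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

# Crude stand-ins for the posterior draws (real analysis: use
# trace.posterior["mu"] and trace.posterior["tau"]); moments taken from
# the posterior summary above: mu ~ 9.12 +/- 0.58, tau ~ 1.58 +/- 0.72
mu_samples = rng.normal(9.12, 0.58, size=n)
tau_samples = np.clip(rng.normal(1.58, 0.72, size=n), 0.0, None)

# Predictive effect at a new, unobserved hospital: theta_new = mu + tau * z
theta_new = mu_samples + tau_samples * rng.normal(0.0, 1.0, size=n)
print(f"Predicted effect at a new site: {theta_new.mean():.1f} mmHg")
print(f"P(new-site effect > 5 mmHg):    {(theta_new > 5).mean():.3f}")
```

This new-site predictive is wider than the interval for $\mu$ itself because it adds back the between-hospital spread $\tau$, which is exactly the uncertainty a regulator should attach to an unstudied site.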
Key Findings
- Population mean effect: 9.12 mmHg (95% CrI [7.98, 10.24]) — well above the clinically meaningful threshold of 5 mmHg. The posterior probability of clinical significance is effectively 100%.
- Between-hospital heterogeneity is moderate. $\tau = 1.58$ mmHg means that hospital-specific effects range roughly $\pm 3$ mmHg around the population mean. The $I^2$ of 57% indicates that about half the observed variation is real (not noise). This is typical for multi-site trials.
- Shrinkage improves small-hospital estimates. Rural Poland's raw estimate of 4.1 mmHg (which might trigger a regulatory flag) is recognized as a small-sample anomaly. The hierarchical estimate of 7.3 mmHg is more credible given the population distribution.
- The drug is effective everywhere. The probability that the treatment exceeds 5 mmHg at every hospital simultaneously is 84.6% — strong evidence of consistent efficacy across sites.
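The shrinkage story lends itself to a simple before/after picture (matplotlib is already imported at the top of the script; the sketch below rebuilds it from the raw and hierarchical means reported in the tables above):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive display
import matplotlib.pyplot as plt

# Raw estimates and hierarchical (partially pooled) posterior means,
# copied from the tables above
raw = np.array([9.4, 8.8, 9.1, 10.2, 9.7, 8.5, 10.8, 7.2, 9.0, 11.3,
                4.1, 13.8, 6.9, 8.4, 7.6])
hier = np.array([9.351, 8.861, 9.106, 10.030, 9.612, 8.625, 10.385, 7.658,
                 9.048, 10.692, 7.256, 11.254, 7.782, 8.641, 8.052])
pop_mean = 9.122

fig, ax = plt.subplots(figsize=(7, 4))
for j in range(len(raw)):
    # One line per hospital: left point = raw, right point = shrunk estimate
    ax.plot([0, 1], [raw[j], hier[j]], "o-", color="steelblue", alpha=0.6)
ax.axhline(pop_mean, ls="--", color="firebrick", label="population mean")
ax.set_xticks([0, 1], ["raw estimate", "hierarchical estimate"])
ax.set_ylabel("SBP reduction (mmHg)")
ax.legend()
fig.savefig("shrinkage.png", dpi=150)
```

The steepest lines belong to the small sites (Rural Poland, Rural Kentucky), which converge toward the dashed population mean; the large sites are nearly flat.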
The Regulatory Lesson
The hierarchical model gives the FDA reviewers exactly what they need: a credible population-level estimate with appropriate uncertainty, honest quantification of site-to-site heterogeneity, and individual hospital estimates that are more reliable than the raw data — especially for small sites. The alternative — presenting 15 independent confidence intervals, including Rural Poland's concerning [0.5, 7.7] — would invite unnecessary skepticism about a drug that the data clearly support.
This is where Bayesian hierarchical modeling earns its complexity. A simple pooled estimate ignores the real between-hospital variation. Independent estimates are too noisy for small hospitals. The hierarchical model navigates the space between these extremes, producing estimates that are simultaneously more honest (acknowledging heterogeneity) and more accurate (reducing noise through partial pooling).
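To make the trade-off concrete, the 9.2 mmHg complete-pooling figure from the introduction can be reproduced as a precision-weighted average of the site effects; with a shared sampling SD the weights reduce to patient counts. A sketch:

```python
import numpy as np

n = np.array([580, 420, 350, 310, 280, 240, 200, 170, 150, 120, 42, 55, 68, 95, 85])
y = np.array([9.4, 8.8, 9.1, 10.2, 9.7, 8.5, 10.8, 7.2, 9.0, 11.3,
              4.1, 13.8, 6.9, 8.4, 7.6])
se = 12.0 / np.sqrt(n)

# Precision weights 1/se^2 are proportional to n when the sampling SD is shared
w = 1.0 / se**2
pooled = np.sum(w * y) / np.sum(w)
pooled_se = 1.0 / np.sqrt(np.sum(w))
print(f"Complete pooling: {pooled:.1f} mmHg (SE {pooled_se:.2f})")
```

Note that the fixed-effect standard error (about 0.21) is much smaller than the hierarchical posterior SD of $\mu$ (0.58), because complete pooling assumes $\tau = 0$; that is exactly the overconfidence the hierarchical model avoids.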