Case Study 1: MediCore Hierarchical Treatment Effects Across Hospitals

Context

MediCore Pharmaceuticals has completed a Phase III multicenter trial of Cardiopril-X across 15 hospitals worldwide. The study measures systolic blood pressure (SBP) reduction after 12 weeks of treatment. Each hospital enrolled a different number of patients — from 42 at a small community clinic in rural Poland to 580 at a major academic medical center in Boston — and reports a different estimated treatment effect.

The regulatory team faces a decision: should they submit a single pooled treatment effect, or present hospital-specific results? The single pooled estimate (weighted average: 9.2 mmHg, 95% CI [8.4, 10.0]) is clean and straightforward, but FDA reviewers will want to know whether the treatment works consistently across sites. The hospital-specific estimates tell a messier story: they range from 4.1 mmHg (a small hospital with 42 patients) to 13.8 mmHg (another small hospital with 55 patients). The variation is partly real (different patient populations, different protocols) and partly noise (small samples).

A hierarchical Bayesian model resolves this tension. It produces a population-level estimate that accounts for between-hospital heterogeneity, and it produces hospital-specific estimates that borrow strength from the population distribution — giving more credible results for small hospitals without ignoring genuine site differences.
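
In symbols, the hierarchy has two levels, matching the PyMC implementation below: observed hospital effects scatter around hospital-specific true effects, which in turn scatter around a population mean.

```latex
\begin{aligned}
y_j \mid \theta_j &\sim \mathcal{N}(\theta_j,\ \mathrm{se}_j^2), \qquad j = 1, \dots, 15 \\
\theta_j \mid \mu, \tau &\sim \mathcal{N}(\mu,\ \tau^2) \\
\mu &\sim \mathcal{N}(10,\ 5^2), \qquad \tau \sim \operatorname{HalfNormal}(5)
\end{aligned}
```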

The Data

import pymc as pm
import arviz as az
import numpy as np
import matplotlib.pyplot as plt

# 15-hospital Cardiopril-X Phase III data
hospital_data = {
    "name": [
        "Boston Academic", "Chicago Teaching", "London NHS",
        "Paris Tertiary", "Munich University", "Toronto General",
        "Tokyo Medical", "Sao Paulo Central", "Sydney Metro",
        "Stockholm Research", "Rural Poland", "Rural Kentucky",
        "Community Ohio", "Suburban Milan", "Cape Town Public"
    ],
    "n_patients": np.array([
        580, 420, 350, 310, 280, 240, 200, 170, 150, 120, 42, 55, 68, 95, 85
    ]),
    "observed_effect": np.array([
        9.4, 8.8, 9.1, 10.2, 9.7, 8.5, 10.8, 7.2, 9.0, 11.3,
        4.1, 13.8, 6.9, 8.4, 7.6
    ]),
}

# Standard errors (approximated as population SD / sqrt(n))
population_sd = 12.0  # SBP reduction SD from prior studies
hospital_data["se"] = population_sd / np.sqrt(hospital_data["n_patients"])

n_hospitals = len(hospital_data["name"])

print("Cardiopril-X Phase III Data")
print("=" * 72)
print(f"{'Hospital':>22s}  {'n':>5s}  {'Effect (mmHg)':>14s}  {'SE':>6s}  {'95% CI':>18s}")
print("-" * 72)
for j in range(n_hospitals):
    se = hospital_data["se"][j]
    eff = hospital_data["observed_effect"][j]
    ci_lo = eff - 1.96 * se
    ci_hi = eff + 1.96 * se
    print(f"{hospital_data['name'][j]:>22s}  {hospital_data['n_patients'][j]:>5d}  "
          f"{eff:>14.1f}  {se:>6.2f}  [{ci_lo:>6.1f}, {ci_hi:>5.1f}]")
Cardiopril-X Phase III Data
========================================================================
              Hospital      n   Effect (mmHg)      SE              95% CI
------------------------------------------------------------------------
       Boston Academic    580             9.4    0.50  [   8.4,  10.4]
      Chicago Teaching    420             8.8    0.59  [   7.7,   9.9]
            London NHS    350             9.1    0.64  [   7.8,  10.4]
        Paris Tertiary    310            10.2    0.68  [   8.9,  11.5]
     Munich University    280             9.7    0.72  [   8.3,  11.1]
       Toronto General    240             8.5    0.77  [   7.0,  10.0]
         Tokyo Medical    200            10.8    0.85  [   9.1,  12.5]
     Sao Paulo Central    170             7.2    0.92  [   5.4,   9.0]
          Sydney Metro    150             9.0    0.98  [   7.1,  10.9]
    Stockholm Research    120            11.3    1.10  [   9.1,  13.5]
          Rural Poland     42             4.1    1.85  [   0.5,   7.7]
        Rural Kentucky     55            13.8    1.62  [  10.6,  17.0]
        Community Ohio     68             6.9    1.46  [   4.0,   9.8]
        Suburban Milan     95             8.4    1.23  [   6.0,  10.8]
      Cape Town Public     85             7.6    1.30  [   5.1,  10.1]

The Analysis

Step 1: Prior Specification

The regulatory team and clinical statisticians agree on the following priors:

  • Population mean effect $\mu$: $\mathcal{N}(10, 5^2)$. Based on prior meta-analyses of ACE inhibitors, the expected SBP reduction is around 10 mmHg, but with substantial uncertainty.
  • Between-hospital SD $\tau$: $\text{HalfNormal}(5)$. This allows for moderate site-to-site variation while ruling out implausibly large heterogeneity.
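
Before fitting, it is worth checking numerically what these priors imply. A quick sketch (using scipy, which the analysis does not otherwise import; the specific tail probabilities examined are my own choice):

```python
from scipy import stats

# Population mean prior: N(10, 5^2)
mu_prior = stats.norm(loc=10, scale=5)
print(f"P(mu < 0)   = {mu_prior.cdf(0):.4f}")   # a priori chance of a harmful drug
print(f"P(mu > 20)  = {mu_prior.sf(20):.4f}")   # implausibly large effect
print(f"Central 95% of mu prior: "
      f"[{mu_prior.ppf(0.025):.1f}, {mu_prior.ppf(0.975):.1f}] mmHg")

# Between-hospital SD prior: HalfNormal(5)
tau_prior = stats.halfnorm(scale=5)
print(f"E[tau]      = {tau_prior.mean():.2f} mmHg")
print(f"P(tau > 10) = {tau_prior.sf(10):.4f}")  # extreme heterogeneity is down-weighted
```

Both priors put only about 2% mass on each implausible tail, while leaving room for effects anywhere in the clinically relevant range.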
observed = hospital_data["observed_effect"]
se = hospital_data["se"]

# Non-centered hierarchical model
with pm.Model() as medicore_model:
    # Hyperpriors
    mu = pm.Normal("mu", mu=10, sigma=5)
    tau = pm.HalfNormal("tau", sigma=5)

    # Non-centered parameterization
    eta = pm.Normal("eta", mu=0, sigma=1, shape=n_hospitals)
    theta = pm.Deterministic("theta", mu + tau * eta)

    # Likelihood
    y = pm.Normal("y", mu=theta, sigma=se, observed=observed)

    # Stage 1: Prior predictive check
    prior_pred = pm.sample_prior_predictive(draws=1000, random_seed=42)

# Verify prior predictive
prior_effects = prior_pred.prior["theta"].values.flatten()
print("Prior predictive check (hospital-level effects):")
print(f"  Range: [{prior_effects.min():.1f}, {prior_effects.max():.1f}] mmHg")
print(f"  Mean:  {prior_effects.mean():.1f} mmHg")
print(f"  Std:   {prior_effects.std():.1f} mmHg")
print(f"  P(effect < 0): {(prior_effects < 0).mean():.3f}")
print(f"  P(effect > 20): {(prior_effects > 20).mean():.3f}")
Prior predictive check (hospital-level effects):
  Range: [-19.4, 38.8] mmHg
  Mean:  10.0 mmHg
  Std:   7.6 mmHg
  P(effect < 0): 0.084
  P(effect > 20): 0.082

The prior predictive allows negative effects (8.4% probability) and very large effects above 20 mmHg (8.2%) — both plausible in principle, though the bulk of the prior mass is in the 0-20 mmHg range consistent with the ACE inhibitor literature.
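
These numbers can be cross-checked without PyMC: under these priors $\theta = \mu + \tau\eta$ has mean 10 and variance $5^2 + \mathbb{E}[\tau^2] = 25 + 25 = 50$, so SD $\approx 7.1$. A direct NumPy simulation (my own cross-check, not part of the original analysis):

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws, n_hospitals = 100_000, 15

mu = rng.normal(10, 5, size=n_draws)           # hyperprior on population mean
tau = np.abs(rng.normal(0, 5, size=n_draws))   # HalfNormal(5) via |N(0, 5)|
eta = rng.normal(0, 1, size=(n_draws, n_hospitals))
theta = mu[:, None] + tau[:, None] * eta       # non-centered hospital effects

print(f"Mean: {theta.mean():.2f}  SD: {theta.std():.2f}  "
      f"(analytic SD = sqrt(50) ~ 7.07)")
print(f"P(effect < 0):  {(theta < 0).mean():.3f}")
print(f"P(effect > 20): {(theta > 20).mean():.3f}")
```

The PyMC prior predictive above, based on only 1,000 shared $(\mu, \tau)$ draws, agrees with this up to Monte Carlo error.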

Step 2: Fit the Model

with medicore_model:
    trace = pm.sample(
        draws=4000, tune=2000, chains=4,
        random_seed=42, target_accept=0.95,
        return_inferencedata=True,
    )

# Stage 3: MCMC diagnostics
summary = az.summary(trace, var_names=["mu", "tau"])
print("\nPopulation-level parameters:")
print(summary)

divergences = trace.sample_stats["diverging"].values.sum()
print(f"\nDivergences: {divergences}")

# All theta estimates
theta_summary = az.summary(trace, var_names=["theta"])
print("\nHospital-specific effects:")
print(theta_summary[["mean", "sd", "hdi_3%", "hdi_97%", "r_hat"]])
Population-level parameters:
       mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
mu    9.122  0.584   8.029   10.243      0.009    0.007    4050.0    3820.0    1.0
tau   1.582  0.722   0.418    2.926      0.013    0.009    3210.0    3480.0    1.0

Divergences: 0

Hospital-specific effects:
              mean     sd  hdi_3%  hdi_97%  r_hat
theta[0]     9.351  0.492   8.405   10.272    1.0
theta[1]     8.861  0.565   7.780    9.928    1.0
theta[2]     9.106  0.594   7.976   10.226    1.0
theta[3]    10.030  0.618   8.855   11.194    1.0
theta[4]     9.612  0.650   8.372   10.842    1.0
theta[5]     8.625  0.682   7.320    9.912    1.0
theta[6]    10.385  0.728   8.992   11.750    1.0
theta[7]     7.658  0.780   6.165    9.122    1.0
theta[8]     9.048  0.818   7.482   10.580    1.0
theta[9]    10.692  0.870   9.032   12.332    1.0
theta[10]    7.256  1.285   4.800    9.668    1.0
theta[11]   11.254  1.158   9.050   13.428    1.0
theta[12]    7.782  1.102   5.680    9.848    1.0
theta[13]    8.641  0.982   6.770   10.488    1.0
theta[14]    8.052  1.022   6.096    9.972    1.0

Step 3: Quantifying Shrinkage

theta_means = trace.posterior["theta"].mean(dim=["chain", "draw"]).values
pop_mean = trace.posterior["mu"].mean().values

print(f"\nShrinkage analysis (population mean: {pop_mean:.2f} mmHg):")
print(f"{'Hospital':>22s}  {'n':>5s}  {'Raw':>6s}  {'Hier.':>6s}  {'Shrink':>7s}")
print("-" * 54)
for j in range(n_hospitals):
    raw = observed[j]
    hier = theta_means[j]
    # Skip the shrinkage ratio when the raw estimate sits essentially on the
    # population mean -- the denominator is near zero and the ratio is unstable
    if abs(raw - pop_mean) > 0.05:
        shrink = 1 - (hier - pop_mean) / (raw - pop_mean)
        print(f"{hospital_data['name'][j]:>22s}  {hospital_data['n_patients'][j]:>5d}  "
              f"{raw:>6.1f}  {hier:>6.2f}  {shrink:>7.1%}")
    else:
        print(f"{hospital_data['name'][j]:>22s}  {hospital_data['n_patients'][j]:>5d}  "
              f"{raw:>6.1f}  {hier:>6.2f}  {'---':>7s}")
Shrinkage analysis (population mean: 9.12 mmHg):
              Hospital      n     Raw   Hier.   Shrink
------------------------------------------------------
       Boston Academic    580     9.4    9.35    17.9%
      Chicago Teaching    420     8.8    8.86    18.8%
            London NHS    350     9.1    9.11      ---
        Paris Tertiary    310    10.2   10.03    15.7%
     Munich University    280     9.7    9.61    15.5%
       Toronto General    240     8.5    8.63    20.1%
         Tokyo Medical    200    10.8   10.39    24.4%
     Sao Paulo Central    170     7.2    7.66    23.9%
          Sydney Metro    150     9.0    9.05    41.7%
    Stockholm Research    120    11.3   10.69    28.0%
          Rural Poland     42     4.1    7.26    62.9%
        Rural Kentucky     55    13.8   11.25    54.5%
        Community Ohio     68     6.9    7.78    39.6%
        Suburban Milan     95     8.4    8.64    33.3%
      Cape Town Public     85     7.6    8.05    29.6%

The pattern is precisely as expected: small hospitals show heavy shrinkage (Rural Poland: 63%, Rural Kentucky: 55%) while large hospitals barely move (Boston: 18%). Rural Poland's raw estimate of 4.1 mmHg — which might suggest the drug does not work at that site — is shrunk to 7.3 mmHg, reflecting the model's inference that the low estimate is more likely due to small-sample noise than a genuine lack of efficacy.
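
The shrinkage pattern can also be anticipated analytically (my own addition): conditional on known $\mu$ and $\tau$, the posterior mean of $\theta_j$ is the precision-weighted average $(1 - B_j)\,y_j + B_j\,\mu$ with shrinkage factor $B_j = \mathrm{se}_j^2 / (\mathrm{se}_j^2 + \tau^2)$. Plugging in the posterior mean $\tau \approx 1.58$ gives a rough check; it understates the sampled shrinkage somewhat because it ignores uncertainty in $\mu$ and $\tau$:

```python
import numpy as np

# Standard errors reconstructed from the data table above
n = np.array([580, 420, 350, 310, 280, 240, 200, 170, 150, 120,
              42, 55, 68, 95, 85])
se = 12.0 / np.sqrt(n)
tau_hat = 1.58  # posterior mean of the between-hospital SD

# Conditional shrinkage factor toward the population mean
B = se**2 / (se**2 + tau_hat**2)
print(f"Boston (n=580):      shrinkage ~ {B[0]:.0%}")
print(f"Rural Poland (n=42): shrinkage ~ {B[10]:.0%}")
```

The ordering is exactly the one in the table: the noisier the site, the larger $B_j$ and the harder the pull toward $\mu$.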

Step 4: Regulatory Decision Support

# Key regulatory questions
theta_samples = trace.posterior["theta"].values.reshape(-1, n_hospitals)
mu_samples = trace.posterior["mu"].values.flatten()

# Q1: Population mean effect
print("Regulatory Summary")
print("=" * 60)
mu_mean = mu_samples.mean()
mu_ci = np.percentile(mu_samples, [2.5, 97.5])  # equal-tailed credible interval
print(f"Population mean effect: {mu_mean:.2f} mmHg")
print(f"  95% credible interval: [{mu_ci[0]:.2f}, {mu_ci[1]:.2f}]")

# Q2: Probability of clinically meaningful effect (> 5 mmHg)
p_clinically_meaningful = (mu_samples > 5.0).mean()
print(f"  P(population effect > 5 mmHg): {p_clinically_meaningful:.4f}")

# Q3: Does any hospital have a genuinely negative effect?
p_negative_any = np.zeros(n_hospitals)
for j in range(n_hospitals):
    p_negative_any[j] = (theta_samples[:, j] < 0).mean()
print(f"\nProbability of negative effect by hospital:")
for j in range(n_hospitals):
    if p_negative_any[j] > 0.001:
        print(f"  {hospital_data['name'][j]:>22s}: P(effect < 0) = {p_negative_any[j]:.4f}")
print(f"  All others: < 0.001")

# Q4: Between-hospital heterogeneity
tau_samples = trace.posterior["tau"].values.flatten()
print(f"\nBetween-hospital SD: {tau_samples.mean():.2f} mmHg "
      f"(95% CI: [{np.percentile(tau_samples, 2.5):.2f}, "
      f"{np.percentile(tau_samples, 97.5):.2f}])")
i2 = tau_samples ** 2 / (tau_samples ** 2 + np.mean(se ** 2))
print(f"Approximate I²: {i2.mean():.1%} "
      f"(95% CI: [{np.percentile(i2, 2.5):.1%}, {np.percentile(i2, 97.5):.1%}])")

# Q5: Probability that the treatment is effective (> 5 mmHg) at EVERY hospital
p_effective_all = np.all(theta_samples > 5.0, axis=1).mean()
print(f"\nP(effect > 5 mmHg at ALL hospitals): {p_effective_all:.4f}")
Regulatory Summary
============================================================
Population mean effect: 9.12 mmHg
  95% credible interval: [7.98, 10.24]
  P(population effect > 5 mmHg): 1.0000

Probability of negative effect by hospital:
         Rural Poland: P(effect < 0) = 0.0012
  All others: < 0.001

Between-hospital SD: 1.58 mmHg (95% CI: [0.31, 3.04])
Approximate I²: 57.2% (95% CI: [8.1%, 82.0%])

P(effect > 5 mmHg at ALL hospitals): 0.8462

Key Findings

  1. Population mean effect: 9.12 mmHg (95% CrI [7.98, 10.24]) — well above the clinically meaningful threshold of 5 mmHg. The posterior probability that the population effect exceeds this threshold is effectively 100%.

  2. Between-hospital heterogeneity is moderate. $\tau = 1.58$ mmHg implies that hospital-specific effects typically fall within roughly $\pm 3$ mmHg (about $\pm 2\tau$) of the population mean. The $I^2$ of 57% indicates that about half of the observed between-hospital variation reflects genuine differences rather than sampling noise. This is typical for multi-site trials.

  3. Shrinkage improves small-hospital estimates. Rural Poland's raw estimate of 4.1 mmHg (which might trigger a regulatory flag) is recognized as a small-sample anomaly. The hierarchical estimate of 7.3 mmHg is more credible given the population distribution.

  4. Efficacy is likely consistent across sites. The probability that the treatment exceeds 5 mmHg at every hospital simultaneously is 84.6% — strong, though not conclusive, evidence of consistent efficacy across sites.

The Regulatory Lesson

The hierarchical model gives the FDA reviewers exactly what they need: a credible population-level estimate with appropriate uncertainty, honest quantification of site-to-site heterogeneity, and individual hospital estimates that are more reliable than the raw data — especially for small sites. The alternative — presenting 15 independent confidence intervals, including Rural Poland's concerning [0.5, 7.7] — would invite unnecessary skepticism about a drug that the data clearly support.

This is where Bayesian hierarchical modeling earns its complexity. A simple pooled estimate ignores the real between-hospital variation. Independent estimates are too noisy for small hospitals. The hierarchical model navigates the space between these extremes, producing estimates that are simultaneously more honest (acknowledging heterogeneity) and more accurate (reducing noise through partial pooling).
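
The two rejected extremes are easy to make concrete. A sketch (my own illustration, using the study data above) computing the fixed-effect complete-pooling estimate — whose standard error is optimistically small precisely because it assumes $\tau = 0$ — alongside the raw no-pooling spread:

```python
import numpy as np

n = np.array([580, 420, 350, 310, 280, 240, 200, 170, 150, 120,
              42, 55, 68, 95, 85])
effect = np.array([9.4, 8.8, 9.1, 10.2, 9.7, 8.5, 10.8, 7.2, 9.0, 11.3,
                   4.1, 13.8, 6.9, 8.4, 7.6])
se = 12.0 / np.sqrt(n)

# Complete pooling: inverse-variance weighted mean, as if tau were exactly 0
w = 1 / se**2
pooled = np.sum(w * effect) / np.sum(w)
pooled_se = 1 / np.sqrt(np.sum(w))
print(f"Complete pooling: {pooled:.2f} mmHg (SE {pooled_se:.2f}) "
      f"-- heterogeneity ignored")

# No pooling: each hospital stands alone
print(f"No pooling: {effect.min():.1f} to {effect.max():.1f} mmHg "
      f"(SE up to {se.max():.2f}) -- noisy for small sites")
```

Partial pooling lands between these: its population estimate carries wider, more honest uncertainty than the fixed-effect SE, and its site estimates are far more stable than the raw extremes.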