Case Study 1: Does This Drug Work? A Clinical Trial Analysis


Tier 2 — Attributed Findings: This case study describes the general structure and methodology of randomized controlled clinical trials as used in pharmaceutical development. The trial described ("CLARITY-BP") is fictional and created for pedagogical purposes. Statistical methods, regulatory processes, and general findings about antihypertensive medications are based on widely documented practices in clinical pharmacology and biostatistics. No specific drug, company, or trial is represented.


The Stakes

There is perhaps no context where hypothesis testing matters more than in clinical medicine. When a pharmaceutical company claims a new drug works, that claim will be evaluated by regulatory agencies (the FDA in the United States, the EMA in Europe, and their counterparts around the world), and the evaluation rests fundamentally on hypothesis testing.

Get it right, and a drug that helps millions of people reaches the market. Get it wrong — by approving an ineffective drug or rejecting an effective one — and the consequences range from wasted healthcare dollars to unnecessary suffering to loss of life.

This case study walks through a fictional but realistic clinical trial to show hypothesis testing in its highest-stakes application. Every concept from Chapter 23 will appear: null and alternative hypotheses, p-values, Type I and Type II errors, effect sizes, power, and the critical distinction between statistical and practical significance.

The Trial: CLARITY-BP

Imagine a pharmaceutical company has developed a new blood pressure medication called Veridex. They believe it's more effective than the current standard treatment (a well-established drug called lisinopril) while having fewer side effects. To test this, they design a randomized controlled trial called CLARITY-BP.

Study Design

  • Participants: 800 adults with mild-to-moderate hypertension (systolic blood pressure between 140 and 170 mmHg)
  • Randomization: Participants are randomly assigned to one of two groups:
      • Treatment group (n = 400): Takes Veridex daily for 12 weeks
      • Control group (n = 400): Takes lisinopril (standard of care) daily for 12 weeks
  • Primary outcome: Change in systolic blood pressure (SBP) from baseline to 12 weeks
  • Double-blinded: Neither participants nor doctors know which drug each participant receives

Why Randomization Matters

Randomization is the gold standard because it addresses confounding variables: factors other than the drug that might affect blood pressure (age, weight, diet, exercise, stress, genetics). By randomly assigning participants, these factors are balanced between groups on average. Any systematic difference in outcomes can then be attributed to the drug rather than to pre-existing differences between patients.

Without randomization, you couldn't do a clean hypothesis test. If all the healthier patients ended up in the treatment group by choice, a lower blood pressure in that group might reflect their better health, not the drug's effectiveness.
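The balancing effect of randomization is easy to see in a quick sketch. The code below simulates a hypothetical confounder (age) for 800 prospective participants and randomly splits them into two groups; the parameters (mean age 58, SD 12) and the seed are illustrative assumptions, not values from the trial:

```python
import numpy as np

np.random.seed(0)

# Hypothetical confounder: ages (in years) for 800 prospective participants.
# Age influences blood pressure, so it must not differ systematically
# between the treatment and control groups.
ages = np.random.normal(58, 12, 800)

# Random assignment: shuffle the participants, then split in half.
shuffled = np.random.permutation(ages)
treatment_ages, control_ages = shuffled[:400], shuffled[400:]

print(f"Treatment group mean age: {treatment_ages.mean():.1f}")
print(f"Control group mean age:   {control_ages.mean():.1f}")
print(f"Difference: {abs(treatment_ages.mean() - control_ages.mean()):.2f} years")
```

With 400 per group, the two mean ages land within a fraction of a year of each other; the same logic applies to every confounder at once, measured or not.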

The Hypotheses

Before looking at any data, the research team specifies:

  • H₀: Veridex and lisinopril produce the same average reduction in systolic blood pressure. (μ_veridex = μ_lisinopril)
  • H₁: Veridex and lisinopril produce different average reductions. (μ_veridex ≠ μ_lisinopril)
  • Significance level: α = 0.05 (pre-registered before unblinding)
  • Primary analysis: Two-sample t-test comparing mean SBP change

They use a two-tailed test because, going in, they want to detect a difference in either direction — Veridex could be better OR worse than lisinopril.
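Pre-specifying the design also means choosing the sample size before the trial starts. A rough sketch of that calculation, using the standard normal-approximation formula for a two-sample, two-tailed test (the 3 mmHg expected difference, 8 mmHg SD, and 90% power target are illustrative assumptions consistent with this chapter's numbers):

```python
import numpy as np
from scipy import stats

# Assumed design inputs (illustrative): expected difference of 3 mmHg,
# within-group SD of 8 mmHg, alpha = 0.05 (two-tailed), 90% power.
delta, sigma = 3.0, 8.0
alpha, power = 0.05, 0.90

# Normal-approximation sample size per group:
# n = 2 * (z_{alpha/2} + z_{power})^2 * (sigma / delta)^2
z_alpha = stats.norm.ppf(1 - alpha / 2)
z_power = stats.norm.ppf(power)
n_per_group = int(np.ceil(2 * (z_alpha + z_power) ** 2 * (sigma / delta) ** 2))

print(f"Required n per group for {power:.0%} power: {n_per_group}")
# → Required n per group for 90% power: 150
```

Under these assumptions about 150 patients per group would suffice, so the trial's 400 per group leaves margin for dropout and for a true effect smaller than hoped.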

The Data

Let's simulate the trial data and analyze it:

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(42)

n_per_group = 400

# Baseline SBP: similar in both groups (thanks to randomization)
baseline_treatment = np.random.normal(155, 10, n_per_group)
baseline_control = np.random.normal(155, 10, n_per_group)

# After 12 weeks:
# Veridex reduces SBP by ~18 mmHg on average (with individual variation)
# Lisinopril reduces SBP by ~15 mmHg on average
change_treatment = np.random.normal(-18, 8, n_per_group)  # Veridex
change_control = np.random.normal(-15, 8, n_per_group)    # Lisinopril

# Final SBP
final_treatment = baseline_treatment + change_treatment
final_control = baseline_control + change_control

# Create DataFrame
trial_data = pd.DataFrame({
    'group': ['Veridex'] * n_per_group + ['Lisinopril'] * n_per_group,
    'baseline_sbp': np.concatenate([baseline_treatment, baseline_control]),
    'final_sbp': np.concatenate([final_treatment, final_control]),
    'change': np.concatenate([change_treatment, change_control])
})

# Summary statistics
summary = trial_data.groupby('group')['change'].agg(['count', 'mean', 'std'])
print("Summary of SBP Change (mmHg):")
print(summary.round(2))
Summary of SBP Change (mmHg):
           count   mean   std
group
Lisinopril   400 -14.98  7.96
Veridex      400 -17.92  8.04

Both drugs reduce blood pressure (negative changes), but Veridex shows about a 3 mmHg larger reduction. Is this difference real, or could it be chance?

The Analysis

Step 1: Visual Exploration

Before any tests, look at the data:

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Distribution of changes
for group, color, label in [('Veridex', '#2ecc71', 'Veridex'),
                             ('Lisinopril', '#3498db', 'Lisinopril')]:
    data = trial_data[trial_data['group'] == group]['change']
    axes[0].hist(data, bins=30, alpha=0.6, color=color, label=label,
                 edgecolor='white')
    axes[0].axvline(data.mean(), color=color, linewidth=2, linestyle='--')

axes[0].set_xlabel('Change in SBP (mmHg)', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Distribution of SBP Changes', fontsize=13)
axes[0].legend(fontsize=11)

# Box plot comparison
bp = axes[1].boxplot([trial_data[trial_data['group'] == 'Veridex']['change'],
                       trial_data[trial_data['group'] == 'Lisinopril']['change']],
                      labels=['Veridex', 'Lisinopril'],
                      patch_artist=True, widths=0.5)
bp['boxes'][0].set_facecolor('#2ecc71')
bp['boxes'][1].set_facecolor('#3498db')
for box in bp['boxes']:
    box.set_alpha(0.7)
axes[1].set_ylabel('Change in SBP (mmHg)', fontsize=12)
axes[1].set_title('SBP Change by Treatment Group', fontsize=13)

plt.tight_layout()
plt.savefig('clarity_bp_distributions.png', dpi=150, bbox_inches='tight')
plt.show()

Step 2: The Hypothesis Test

# Two-sample t-test
veridex_changes = trial_data[trial_data['group'] == 'Veridex']['change']
lisinopril_changes = trial_data[trial_data['group'] == 'Lisinopril']['change']

t_stat, p_value = stats.ttest_ind(veridex_changes, lisinopril_changes)

print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.6f}")
print(f"Significant at α = 0.05? {'Yes' if p_value < 0.05 else 'No'}")
t-statistic: -5.194
p-value: 0.000000
Significant at α = 0.05? Yes
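One caveat worth knowing: `stats.ttest_ind` pools the two group variances by default. A common robustness check is Welch's t-test (`equal_var=False`), which drops the equal-variance assumption. The sketch below redraws illustrative change scores with the same distributional assumptions as the simulation above (means of -18 and -15 mmHg, SD 8, n = 400 per group); because the data are regenerated, the exact numbers will differ slightly from the trial output:

```python
import numpy as np
from scipy import stats

# Fresh illustrative data drawn with the trial's assumed parameters.
np.random.seed(42)
veridex = np.random.normal(-18, 8, 400)
lisinopril = np.random.normal(-15, 8, 400)

# Pooled-variance t-test (scipy's default) vs. Welch's t-test, which
# does not assume the two groups share a common variance.
t_pooled, p_pooled = stats.ttest_ind(veridex, lisinopril)
t_welch, p_welch = stats.ttest_ind(veridex, lisinopril, equal_var=False)

print(f"Pooled t-test: t = {t_pooled:.3f}, p = {p_pooled:.2e}")
print(f"Welch t-test:  t = {t_welch:.3f}, p = {p_welch:.2e}")
```

With equal group sizes and similar variances the two tests agree almost exactly; Welch's version simply costs nothing and protects against the case where they don't.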

Step 3: Effect Size and Confidence Interval

# Cohen's d
diff = veridex_changes.mean() - lisinopril_changes.mean()
pooled_std = np.sqrt(((len(veridex_changes)-1)*veridex_changes.var(ddof=1) +
                       (len(lisinopril_changes)-1)*lisinopril_changes.var(ddof=1)) /
                      (len(veridex_changes) + len(lisinopril_changes) - 2))
d = diff / pooled_std

# 95% CI for the difference
se_diff = pooled_std * np.sqrt(1/len(veridex_changes) + 1/len(lisinopril_changes))
t_crit = stats.t.ppf(0.975, df=len(veridex_changes)+len(lisinopril_changes)-2)
ci = (diff - t_crit * se_diff, diff + t_crit * se_diff)

print(f"\nMean difference: {diff:.2f} mmHg")
print(f"95% CI for difference: ({ci[0]:.2f}, {ci[1]:.2f}) mmHg")
print(f"Cohen's d: {d:.3f}")
# Cohen's rough benchmarks: |d| ≈ 0.2 small, 0.5 medium, 0.8 large
print("Effect size interpretation: ", end="")
print("Small" if abs(d) < 0.5 else "Medium" if abs(d) < 0.8 else "Large")
Mean difference: -2.94 mmHg
95% CI for difference: (-4.05, -1.83) mmHg
Cohen's d: -0.367
Effect size interpretation: Small
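The t-based confidence interval above leans on a normality assumption. A percentile bootstrap gives a distribution-free check: resample each group with replacement, recompute the difference in means, and read the CI off the percentiles. The data here are redrawn with the trial's assumed parameters, so the interval will match the t-based one only approximately:

```python
import numpy as np

# Fresh illustrative data with the trial's assumed parameters.
np.random.seed(42)
veridex = np.random.normal(-18, 8, 400)
lisinopril = np.random.normal(-15, 8, 400)

# Percentile bootstrap: resample each group with replacement, recompute
# the difference in means, and take the 2.5th and 97.5th percentiles.
n_boot = 5000
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    v = np.random.choice(veridex, size=veridex.size, replace=True)
    l = np.random.choice(lisinopril, size=lisinopril.size, replace=True)
    boot_diffs[i] = v.mean() - l.mean()

lo, hi = np.percentile(boot_diffs, [2.5, 97.5])
print(f"Bootstrap 95% CI for the difference: ({lo:.2f}, {hi:.2f}) mmHg")
```

When the bootstrap and t-based intervals agree closely, as they do here, the normality assumption is doing no harm.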

Step 4: Interpreting the Results

Here's where things get interesting. The results show:

  1. Statistical significance: p < 0.001. The difference is highly statistically significant. We can confidently reject the null hypothesis that the two drugs are equivalent.

  2. Effect size: Cohen's d = 0.37. This is a small-to-medium effect. In terms of raw numbers, Veridex produces about 3 mmHg more reduction than lisinopril.

  3. Confidence interval: The true advantage of Veridex over lisinopril is somewhere between about 1.8 and 4.1 mmHg.

But is 3 mmHg clinically meaningful? This is where domain knowledge becomes crucial.

The Clinical Significance Question

A cardiologist reviewing these results would consider:

  • Population-level impact: A 3 mmHg reduction in average blood pressure across millions of patients can translate to a meaningful reduction in heart attacks and strokes at the population level. Epidemiological studies have estimated that a 2 mmHg lower usual systolic blood pressure is associated with roughly a 7% lower risk of death from ischemic heart disease and a 10% lower risk of death from stroke.

  • Individual-level impact: For any individual patient, 3 mmHg is barely detectable — blood pressure fluctuates by more than that throughout the day. No individual patient would "feel" the difference between Veridex and lisinopril.

  • Side effect profile: If Veridex has fewer side effects than lisinopril, the 3 mmHg advantage is gravy — the real benefit might be better tolerability and adherence.

  • Cost: If Veridex costs 10 times more than lisinopril (which is generic and cheap), a 3 mmHg advantage might not justify the cost increase.

This illustrates a critical principle: the hypothesis test tells you whether the difference is real, but it doesn't tell you whether the difference matters. That judgment requires domain expertise, context, and consideration of factors beyond the statistical analysis.
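The flip side of this principle is easy to demonstrate: with a large enough sample, even a clinically meaningless difference becomes statistically significant. The sketch below uses made-up numbers (100,000 patients per group, a true advantage of only 0.2 mmHg) to make the point:

```python
import numpy as np
from scipy import stats

# Illustrative extreme: 100,000 patients per group and a true
# difference of only 0.2 mmHg (clinically meaningless).
np.random.seed(1)
n = 100_000
drug_a = np.random.normal(-15.2, 8, n)   # 0.2 mmHg "better" on average
drug_b = np.random.normal(-15.0, 8, n)

t_stat, p_value = stats.ttest_ind(drug_a, drug_b)
d = (drug_a.mean() - drug_b.mean()) / 8  # Cohen's d using the known SD

print(f"p = {p_value:.2e}  (statistically significant)")
print(f"Cohen's d = {d:.3f}  (practically negligible)")
```

A tiny p-value here reflects the enormous sample, not a meaningful effect; only the effect size and clinical context reveal that nothing of consequence is happening.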

What Could Go Wrong: The Errors in Practice

Scenario A: Type I Error (False Positive)

Imagine a different reality where Veridex and lisinopril are actually equally effective. The trial, by chance, shows a 3 mmHg difference. The company publishes the result, the drug gets approved, patients switch from cheap lisinopril to expensive Veridex, and healthcare costs increase — all based on a random fluctuation.

At α = 0.05, this happens 5% of the time. With thousands of drugs being tested worldwide, a meaningful number of approved drugs may have benefits no larger than placebo. This is one reason why regulators often want to see results replicated in multiple trials.
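That 5% figure can be verified directly by simulation. The sketch below (parameters and seed are illustrative) runs many trials in which the null hypothesis is true by construction and counts how often the t-test rejects anyway:

```python
import numpy as np
from scipy import stats

# Simulate 2,000 trials in which the null hypothesis is TRUE: both
# "drugs" produce the same average change. Count p < 0.05 anyway.
np.random.seed(7)
n_trials, n = 2000, 100
false_positives = 0

for _ in range(n_trials):
    a = np.random.normal(-15, 8, n)   # identical true means
    b = np.random.normal(-15, 8, n)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"False positive rate: {false_positives / n_trials:.1%} (nominal: 5.0%)")
```

The observed rate lands near 5%, exactly as α promises: one null trial in twenty will look "significant" by chance alone.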

Scenario B: Type II Error (False Negative)

Imagine Veridex actually produces a 2 mmHg advantage, but the trial only has 100 participants per group instead of 400. With the smaller sample, the t-test might produce p = 0.18 — not significant. The company concludes Veridex doesn't work and shelves it. Patients are denied a drug that could, at the population level, save thousands of lives.

# Simulate the underpowered trial
np.random.seed(42)
small_treatment = np.random.normal(-17, 8, 100)   # true effect: 2 mmHg advantage
small_control = np.random.normal(-15, 8, 100)
_, p_small = stats.ttest_ind(small_treatment, small_control)
print(f"Underpowered trial (n=100 per group): p = {p_small:.3f}")
if p_small < 0.05:
    print("Significant -- this particular random draw happened to detect the effect.")
else:
    print("Not significant -- the real effect exists but was missed due to insufficient power.")

This is why power analysis is done before a trial begins. You need to know how many patients you need to have a good chance of detecting a clinically meaningful effect.

The Role of Pre-Registration

One of the most important safeguards in clinical trials is pre-registration — publicly recording the study design, hypotheses, sample size, and analysis plan before collecting any data.

Pre-registration prevents:

  • P-hacking: Trying multiple analyses until one produces p < 0.05
  • Outcome switching: Changing the primary outcome after seeing which one looks significant
  • HARKing (Hypothesizing After Results are Known): Presenting exploratory findings as if they were predicted
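How much damage can p-hacking do? One form, "optional stopping," can be quantified by simulation: test the data repeatedly as they accumulate and stop as soon as p drops below 0.05. In the sketch below (all parameters and the seed are illustrative), the null hypothesis is true in every experiment, yet the false positive rate climbs well above the nominal 5%:

```python
import numpy as np
from scipy import stats

# Optional stopping: peek at the data every 20 observations per group
# and declare victory at the first p < 0.05. The null is TRUE in
# every simulated experiment, so every rejection is a false positive.
np.random.seed(3)
n_experiments, max_n, step = 1000, 100, 20
rejections = 0

for _ in range(n_experiments):
    a = np.random.normal(0, 1, max_n)
    b = np.random.normal(0, 1, max_n)   # identical distributions
    for n in range(step, max_n + 1, step):
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < 0.05:                    # peek, and stop at first "success"
            rejections += 1
            break

print(f"False positive rate with optional stopping: {rejections / n_experiments:.1%}")
```

Five peeks at the data roughly double or triple the error rate relative to the single pre-registered test, which is exactly why the analysis plan must be fixed in advance.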

In clinical trials, registration is required by law in many jurisdictions (for example, on ClinicalTrials.gov in the United States). In academic research, pre-registration is increasingly expected but not always practiced, which is one factor behind the replication crisis (see Case Study 2).

Lessons for Data Science

Lesson 1: The Question Before the Test

The CLARITY-BP team didn't start with data and go looking for significant results. They started with a specific question ("Is Veridex better than lisinopril for blood pressure reduction?"), designed a study to answer it, pre-specified the analysis, and then collected data. This is the right order. In data science, always define your hypothesis before you explore the data.

Lesson 2: Effect Size Completes the Story

The p-value (< 0.001) sounds impressive. But the effect size (d = 0.37, or about 3 mmHg) is what actually matters for clinical decisions. In your data science work, always ask: "Is this difference big enough to care about?" Not just "Is it statistically significant?"

Lesson 3: Power Determines Whether Your Study Can Succeed

The CLARITY-BP trial was designed with 400 per group specifically to ensure adequate power. If they'd used only 50 per group, the trial would have had well under a 50% chance of detecting the 3 mmHg difference: a waste of time, money, and patients' participation. Always do a power analysis before investing in data collection.
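Power itself is easy to estimate by simulation, without any formulas: generate many trials under the assumed true effect and count how often the test detects it. The sketch below uses the chapter's assumed parameters (3 mmHg true difference, SD 8 mmHg) with only 50 patients per group:

```python
import numpy as np
from scipy import stats

# Estimate power by simulation: with only 50 patients per group, how
# often does a trial detect a true 3 mmHg difference (SD 8 mmHg)?
np.random.seed(11)
n_sims, n = 2000, 50
hits = 0

for _ in range(n_sims):
    treatment = np.random.normal(-18, 8, n)   # true effect: 3 mmHg advantage
    control = np.random.normal(-15, 8, n)
    _, p = stats.ttest_ind(treatment, control)
    if p < 0.05:
        hits += 1

print(f"Estimated power with n = {n} per group: {hits / n_sims:.0%}")
```

The estimate comes out below 50%: worse than a coin flip. Rerunning the same simulation with n = 400 per group pushes the detection rate close to 100%, which is the headroom the actual design bought.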

Lesson 4: Context Determines Interpretation

A 3 mmHg difference means different things depending on the context: the cost of the drug, its side effects, the availability of alternatives, and the population-level vs. individual-level impact. Statistics provides the evidence; domain knowledge provides the meaning.


Discussion Questions

  1. The trade-off: Regulatory agencies like the FDA must balance Type I errors (approving ineffective drugs) against Type II errors (rejecting effective drugs). Which type of error do you think is more dangerous in the pharmaceutical context? Should the significance level for drug approval be more or less strict than 0.05?

  2. Ethics of placebo controls: In the CLARITY-BP trial, the control group received lisinopril (an active treatment), not a placebo. Why? Would it be ethical to give patients with hypertension a placebo when effective treatments exist?

  3. Publication bias: If the CLARITY-BP trial had found p = 0.22, the company might never publish the results. How does selective publication affect our collective understanding of which drugs work?

  4. Your project: How does this case study change how you think about hypothesis testing in the progressive project? What are the practical implications of the differences you found between income groups?


Key Takeaway: In clinical trials — and in all data science — hypothesis testing is a tool for decision-making under uncertainty. The p-value tells you how surprising the data would be if nothing were happening. The effect size tells you how much is happening. And domain knowledge tells you whether it matters. You need all three.