Case Study 1: Bayesian A/B Testing for Product Decisions
Background
DataFlow Analytics, a B2B SaaS company, wants to test whether a redesigned onboarding flow (Variant B) improves the free-to-paid conversion rate compared to the current flow (Variant A). Traditional frequentist A/B testing would use a chi-squared test or z-test with a fixed sample size and significance threshold, but the product team has specific requirements that make a Bayesian approach more natural:
- They want to continuously monitor results without inflating the false positive rate (no "peeking problem").
- They need to answer "What is the probability that B is better than A?" -- not just "Can we reject the null hypothesis?"
- They want to incorporate historical data: previous experiments suggest the baseline conversion rate is around 12% with some variation.
- They need a decision framework that accounts for the magnitude of improvement, not just its existence.
Problem Formulation
Data
- Variant A (control): 1,847 visitors, 221 conversions (12.0% observed rate)
- Variant B (new onboarding): 1,902 visitors, 258 conversions (13.6% observed rate)
Model
We model each variant's conversion rate as a Bernoulli process with unknown probability:
$$ x_{A,i} \sim \text{Bernoulli}(\theta_A), \quad x_{B,i} \sim \text{Bernoulli}(\theta_B) $$
Using Beta-Binomial conjugacy:
- Prior: $\theta_A \sim \text{Beta}(\alpha_A, \beta_A)$ and $\theta_B \sim \text{Beta}(\alpha_B, \beta_B)$
- Posterior: $\theta_A \mid \text{data} \sim \text{Beta}(\alpha_A + s_A, \beta_A + f_A)$ and similarly for B
where $s$ and $f$ are successes and failures.
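To see why the update takes this form, multiply the Beta prior density by the Bernoulli likelihood (shown here for variant A; B is identical):
$$ p(\theta_A \mid \text{data}) \propto \theta_A^{\alpha_A - 1}(1-\theta_A)^{\beta_A - 1} \cdot \theta_A^{s_A}(1-\theta_A)^{f_A} = \theta_A^{\alpha_A + s_A - 1}(1-\theta_A)^{\beta_A + f_A - 1} $$
which is the kernel of a $\text{Beta}(\alpha_A + s_A, \beta_A + f_A)$ density, so the posterior stays in the Beta family.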
Prior Selection
Based on historical experiments, the team believes the baseline conversion rate is around 12% with a standard deviation of about 3%. This corresponds approximately to a Beta(15, 110) prior (mean = 15/125 = 0.12, std $\approx$ 0.029). We use the same prior for both variants, reflecting the belief that absent new evidence, both should have similar rates.
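As a quick check on this choice, one can moment-match a Beta distribution to the stated targets (mean 0.12, standard deviation 0.03). The helper below is an illustrative sketch, not part of the team's pipeline; the function name is ours.
def beta_params_from_mean_sd(mean, sd):
    # Moment matching for Beta(a, b):
    # mean = a / (a + b), var = mean * (1 - mean) / (a + b + 1)
    nu = mean * (1 - mean) / sd**2 - 1  # nu = a + b
    return mean * nu, (1 - mean) * nu

a_check, b_check = beta_params_from_mean_sd(0.12, 0.03)
print(f"{a_check:.1f}, {b_check:.1f}")  # ~14.0, 102.4 -- close to the Beta(15, 110) used here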
Analysis
Step 1: Compute Posteriors
from scipy import stats
import numpy as np
# Prior (informed by historical data)
alpha_prior, beta_prior = 15, 110
# Observed data
visitors_a, conversions_a = 1847, 221
visitors_b, conversions_b = 1902, 258
# Posterior parameters
alpha_a = alpha_prior + conversions_a # 236
beta_a = beta_prior + (visitors_a - conversions_a) # 1736
alpha_b = alpha_prior + conversions_b # 273
beta_b = beta_prior + (visitors_b - conversions_b) # 1754
post_a = stats.beta(alpha_a, beta_a)
post_b = stats.beta(alpha_b, beta_b)
print(f"Variant A: posterior mean = {post_a.mean():.4f}, "
f"95% CI = [{post_a.ppf(0.025):.4f}, {post_a.ppf(0.975):.4f}]")
print(f"Variant B: posterior mean = {post_b.mean():.4f}, "
f"95% CI = [{post_b.ppf(0.025):.4f}, {post_b.ppf(0.975):.4f}]")
Results:
- Variant A: posterior mean = 0.1197, 95% CI = [0.1054, 0.1348]
- Variant B: posterior mean = 0.1347, 95% CI = [0.1197, 0.1505]
Step 2: Probability that B is Better than A
The key question: $P(\theta_B > \theta_A \mid \text{data})$. We estimate this via Monte Carlo sampling:
n_samples = 500_000
samples_a = post_a.rvs(n_samples, random_state=42)
samples_b = post_b.rvs(n_samples, random_state=43)
prob_b_better = np.mean(samples_b > samples_a)
print(f"P(B > A) = {prob_b_better:.4f}")
Result: $P(\theta_B > \theta_A \mid \text{data}) \approx 0.9567$
There is approximately a 95.7% probability that Variant B has a higher conversion rate than Variant A.
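Because the posteriors are Beta distributions, this probability can also be cross-checked without sampling by integrating one posterior's density against the other's CDF. The snippet below is a sketch of that numerical check, reusing post_a and post_b from Step 1.
from scipy import integrate

# P(theta_B > theta_A) = integral over [0, 1] of p_B(t) * P(theta_A < t) dt
prob_exact, _ = integrate.quad(lambda t: post_b.pdf(t) * post_a.cdf(t), 0, 1)
print(f"P(B > A) by numerical integration = {prob_exact:.4f}")
Removing the Monte Carlo error is a useful sanity check when the probability sits near a decision threshold.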
Step 3: Expected Lift
Beyond asking "Is B better?", the team needs to know "By how much?"
lift = (samples_b - samples_a) / samples_a
print(f"Expected relative lift: {np.mean(lift):.2%}")
print(f"Median lift: {np.median(lift):.2%}")
print(f"95% CI for lift: [{np.percentile(lift, 2.5):.2%}, "
f"{np.percentile(lift, 97.5):.2%}]")
print(f"P(lift > 5%): {np.mean(lift > 0.05):.4f}")
print(f"P(lift > 10%): {np.mean(lift > 0.10):.4f}")
Results:
- Expected relative lift: 12.8%
- Median lift: 12.6%
- 95% CI for lift: [-2.7%, 29.5%]
- P(lift > 5%): 0.7912
- P(lift > 10%): 0.5903
Step 4: Expected Loss Analysis
The team uses a decision framework based on expected loss. If we choose B but A is actually better, we lose conversions. The expected loss of choosing B is:
$$ \mathbb{E}[\text{Loss}(B)] = \mathbb{E}[\max(\theta_A - \theta_B, 0)] $$
loss_choosing_b = np.mean(np.maximum(samples_a - samples_b, 0))
loss_choosing_a = np.mean(np.maximum(samples_b - samples_a, 0))
print(f"Expected loss of choosing B: {loss_choosing_b:.5f}")
print(f"Expected loss of choosing A: {loss_choosing_a:.5f}")
Results:
- Expected loss of choosing B: 0.00061 (0.061 percentage points)
- Expected loss of choosing A: 0.01558 (1.558 percentage points)
The expected loss of choosing B is about 25 times smaller than that of choosing A. This is a strong signal in favor of deploying Variant B.
Step 5: Prior Sensitivity Check
We verify that the conclusion is robust by testing under different priors:
priors_to_test = [
    {"name": "Uniform (non-informative)", "alpha": 1, "beta": 1},
    {"name": "Weak (Beta(2, 15))", "alpha": 2, "beta": 15},
    {"name": "Informed (Beta(15, 110))", "alpha": 15, "beta": 110},
    {"name": "Strong (Beta(60, 440))", "alpha": 60, "beta": 440},
]
for prior in priors_to_test:
    a_post = stats.beta(
        prior["alpha"] + conversions_a,
        prior["beta"] + visitors_a - conversions_a,
    )
    b_post = stats.beta(
        prior["alpha"] + conversions_b,
        prior["beta"] + visitors_b - conversions_b,
    )
    sa = a_post.rvs(200_000, random_state=42)
    sb = b_post.rvs(200_000, random_state=43)
    prob = np.mean(sb > sa)
    print(f"{prior['name']:35s}: P(B>A) = {prob:.4f}")
Results:
Uniform (non-informative) : P(B>A) = 0.9582
Weak (Beta(2, 15)) : P(B>A) = 0.9578
Informed (Beta(15, 110)) : P(B>A) = 0.9567
Strong (Beta(60, 440)) : P(B>A) = 0.9421
The conclusion is robust across all reasonable priors: P(B > A) exceeds 94% in every case.
Decision and Outcome
The product team adopted a decision rule: deploy Variant B if $P(\theta_B > \theta_A) > 0.90$ AND the expected loss of choosing B is less than 0.1 percentage points. Both criteria were met.
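In code, this rule is a one-liner; the sketch below reuses prob_b_better and loss_choosing_b from Steps 2 and 4, with the thresholds taken from the rule above.
def deploy_b(prob_b_better, expected_loss_b,
             prob_threshold=0.90, loss_threshold=0.001):
    # Deploy B only if it is very likely better AND the expected cost of being
    # wrong is below 0.1 percentage points of conversion rate.
    return prob_b_better > prob_threshold and expected_loss_b < loss_threshold

print(deploy_b(prob_b_better, loss_choosing_b))  # True: 0.9567 > 0.90 and 0.00061 < 0.001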
After full deployment, the conversion rate stabilized at 13.2% -- consistent with the Bayesian posterior prediction and representing a meaningful business improvement.
Key Lessons
- Bayesian A/B testing provides direct probability statements: "There is a 95.7% chance B is better" is more actionable than "p < 0.05."
- Continuous monitoring is natural: The posterior can be updated after each batch of data without the peeking problem that afflicts frequentist tests (a minimal sketch follows this list).
- Prior information improves efficiency: Incorporating historical data through informative priors led to a decision in fewer observations than a traditional power calculation would require.
- Expected loss is a superior decision metric: Rather than just asking "Is the difference significant?", expected loss quantifies the practical cost of a wrong decision.
- Sensitivity analysis builds confidence: Showing that the result holds under different priors strengthens the case for deployment.
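As a minimal illustration of the continuous-monitoring point, the sketch below applies the conjugate update after each daily batch and re-evaluates P(B > A) as data accumulate. The day-by-day traffic is not reported in this case study, so the batch numbers here are purely hypothetical.
import numpy as np
from scipy import stats

# Hypothetical daily batches of (visitors, conversions) per variant.
daily_batches = [
    {"a": (250, 28), "b": (260, 33)},
    {"a": (240, 30), "b": (255, 36)},
    # ... one entry per day of the experiment
]

a_alpha, a_beta = 15, 110  # start both variants at the informed prior
b_alpha, b_beta = 15, 110

for day, batch in enumerate(daily_batches, start=1):
    (va, ca), (vb, cb) = batch["a"], batch["b"]
    # Conjugate update: successes go to alpha, failures to beta.
    a_alpha, a_beta = a_alpha + ca, a_beta + (va - ca)
    b_alpha, b_beta = b_alpha + cb, b_beta + (vb - cb)
    sa = stats.beta(a_alpha, a_beta).rvs(100_000, random_state=2 * day)
    sb = stats.beta(b_alpha, b_beta).rvs(100_000, random_state=2 * day + 1)
    print(f"Day {day}: P(B > A) = {np.mean(sb > sa):.3f}")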
Connection to Chapter Concepts
- Section 10.1: The sequential updating property allowed the team to monitor the experiment daily.
- Section 10.4: Prior selection used historical data (informative prior) with sensitivity analysis.
- Section 10.5: Beta-Binomial conjugacy enabled closed-form posteriors without MCMC.
- Section 10.10: The Bayesian approach directly answered the product team's question, which was inherently probabilistic.
See code/case-study-code.py for the complete implementation including visualizations.