Case Study 1: Bayesian A/B Testing for Product Decisions
Background
DataFlow Analytics, a B2B SaaS company, wants to test whether a redesigned onboarding flow (Variant B) improves the free-to-paid conversion rate compared to the current flow (Variant A). Traditional frequentist A/B testing would use a chi-squared test or z-test with a fixed sample size and significance threshold, but the product team has specific requirements that make a Bayesian approach more natural:
- They want to continuously monitor results without inflating the false positive rate (no "peeking problem").
- They need to answer "What is the probability that B is better than A?" -- not just "Can we reject the null hypothesis?"
- They want to incorporate historical data: previous experiments suggest the baseline conversion rate is around 12% with some variation.
- They need a decision framework that accounts for the magnitude of improvement, not just its existence.
Problem Formulation
Data
- Variant A (control): 1,847 visitors, 221 conversions (12.0% observed rate)
- Variant B (new onboarding): 1,902 visitors, 258 conversions (13.6% observed rate)
Model
We model each variant's conversion rate as a Bernoulli process with unknown probability:
$$ x_{A,i} \sim \text{Bernoulli}(\theta_A), \quad x_{B,i} \sim \text{Bernoulli}(\theta_B) $$
Using Beta-Binomial conjugacy:
- Prior: $\theta_A \sim \text{Beta}(\alpha_A, \beta_A)$ and $\theta_B \sim \text{Beta}(\alpha_B, \beta_B)$
- Posterior: $\theta_A \mid \text{data} \sim \text{Beta}(\alpha_A + s_A, \beta_A + f_A)$ and similarly for B
where $s$ and $f$ are successes and failures.
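To see why the update takes this form, multiply the Beta prior density by the Bernoulli likelihood (shown here for variant A; B is identical):
$$ p(\theta_A \mid \text{data}) \propto \theta_A^{\alpha_A - 1}(1-\theta_A)^{\beta_A - 1} \cdot \theta_A^{s_A}(1-\theta_A)^{f_A} = \theta_A^{\alpha_A + s_A - 1}(1-\theta_A)^{\beta_A + f_A - 1} $$
which is the kernel of a $\text{Beta}(\alpha_A + s_A, \beta_A + f_A)$ density, so the posterior stays in the Beta family.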
Prior Selection
Based on historical experiments, the team believes the baseline conversion rate is around 12% with a standard deviation of about 3%. This corresponds approximately to a Beta(15, 110) prior (mean = 15/125 = 0.12, std $\approx$ 0.029). We use the same prior for both variants, reflecting the belief that absent new evidence, both should have similar rates.
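As a quick check on this choice, one can moment-match a Beta distribution to the stated targets (mean 0.12, standard deviation 0.03). The helper below is an illustrative sketch, not part of the team's pipeline; the function name is ours.
def beta_params_from_mean_sd(mean, sd):
    # Moment matching for Beta(a, b):
    # mean = a / (a + b), var = mean * (1 - mean) / (a + b + 1)
    nu = mean * (1 - mean) / sd**2 - 1  # nu = a + b
    return mean * nu, (1 - mean) * nu

a_check, b_check = beta_params_from_mean_sd(0.12, 0.03)
print(f"{a_check:.1f}, {b_check:.1f}")  # ~14.0, 102.4 -- close to the Beta(15, 110) used here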
Analysis
Step 1: Compute Posteriors
from scipy import stats
import numpy as np
# Prior (informed by historical data)
alpha_prior, beta_prior = 15, 110
# Observed data
visitors_a, conversions_a = 1847, 221
visitors_b, conversions_b = 1902, 258
# Posterior parameters
alpha_a = alpha_prior + conversions_a # 236
beta_a = beta_prior + (visitors_a - conversions_a) # 1736
alpha_b = alpha_prior + conversions_b # 273
beta_b = beta_prior + (visitors_b - conversions_b) # 1754
post_a = stats.beta(alpha_a, beta_a)
post_b = stats.beta(alpha_b, beta_b)
print(f"Variant A: posterior mean = {post_a.mean():.4f}, "
f"95% CI = [{post_a.ppf(0.025):.4f}, {post_a.ppf(0.975):.4f}]")
print(f"Variant B: posterior mean = {post_b.mean():.4f}, "
f"95% CI = [{post_b.ppf(0.025):.4f}, {post_b.ppf(0.975):.4f}]")
Results:
- Variant A: posterior mean = 0.1197, 95% CI = [0.1054, 0.1348]
- Variant B: posterior mean = 0.1347, 95% CI = [0.1197, 0.1505]
Step 2: Probability that B is Better than A
The key question: $P(\theta_B > \theta_A \mid \text{data})$. We estimate this via Monte Carlo sampling:
n_samples = 500_000
samples_a = post_a.rvs(n_samples, random_state=42)
samples_b = post_b.rvs(n_samples, random_state=43)
prob_b_better = np.mean(samples_b > samples_a)
print(f"P(B > A) = {prob_b_better:.4f}")
Result: $P(\theta_B > \theta_A \mid \text{data}) \approx 0.9567$
There is approximately a 95.7% probability that Variant B has a higher conversion rate than Variant A.
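Because the posteriors are Beta distributions, this probability can also be cross-checked without sampling by integrating one posterior's density against the other's CDF. The snippet below is a sketch of that numerical check, reusing post_a and post_b from Step 1.
from scipy import integrate

# P(theta_B > theta_A) = integral over [0, 1] of p_B(t) * P(theta_A < t) dt
prob_exact, _ = integrate.quad(lambda t: post_b.pdf(t) * post_a.cdf(t), 0, 1)
print(f"P(B > A) by numerical integration = {prob_exact:.4f}")
Removing the Monte Carlo error is a useful sanity check when the probability sits near a decision threshold.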
Step 3: Expected Lift
Beyond asking "Is B better?", the team needs to know "By how much?"
lift = (samples_b - samples_a) / samples_a
print(f"Expected relative lift: {np.mean(lift):.2%}")
print(f"Median lift: {np.median(lift):.2%}")
print(f"95% CI for lift: [{np.percentile(lift, 2.5):.2%}, "
f"{np.percentile(lift, 97.5):.2%}]")
print(f"P(lift > 5%): {np.mean(lift > 0.05):.4f}")
print(f"P(lift > 10%): {np.mean(lift > 0.10):.4f}")
Results:
- Expected relative lift: 12.8%
- Median lift: 12.6%
- 95% CI for lift: [-2.7%, 29.5%]
- P(lift > 5%): 0.7912
- P(lift > 10%): 0.5903
Step 4: Expected Loss Analysis
The team uses a decision framework based on expected loss. If we choose B but A is actually better, we lose conversions. The expected loss of choosing B is:
$$ \mathbb{E}[\text{Loss}(B)] = \mathbb{E}[\max(\theta_A - \theta_B, 0)] $$
loss_choosing_b = np.mean(np.maximum(samples_a - samples_b, 0))
loss_choosing_a = np.mean(np.maximum(samples_b - samples_a, 0))
print(f"Expected loss of choosing B: {loss_choosing_b:.5f}")
print(f"Expected loss of choosing A: {loss_choosing_a:.5f}")
Results:
- Expected loss of choosing B: 0.00061 (0.061 percentage points)
- Expected loss of choosing A: 0.01558 (1.558 percentage points)
The expected loss of choosing B is about 25 times smaller than that of choosing A. This is a strong signal in favor of deploying Variant B.
Step 5: Prior Sensitivity Check
We verify that the conclusion is robust by testing under different priors:
priors_to_test = [
    {"name": "Uniform (non-informative)", "alpha": 1, "beta": 1},
    {"name": "Weak (Beta(2, 15))", "alpha": 2, "beta": 15},
    {"name": "Informed (Beta(15, 110))", "alpha": 15, "beta": 110},
    {"name": "Strong (Beta(60, 440))", "alpha": 60, "beta": 440},
]
for prior in priors_to_test:
    a_post = stats.beta(
        prior["alpha"] + conversions_a,
        prior["beta"] + visitors_a - conversions_a,
    )
    b_post = stats.beta(
        prior["alpha"] + conversions_b,
        prior["beta"] + visitors_b - conversions_b,
    )
    sa = a_post.rvs(200_000, random_state=42)
    sb = b_post.rvs(200_000, random_state=43)
    prob = np.mean(sb > sa)
    print(f"{prior['name']:35s}: P(B>A) = {prob:.4f}")
Results:
Uniform (non-informative) : P(B>A) = 0.9582
Weak (Beta(2, 15)) : P(B>A) = 0.9578
Informed (Beta(15, 110)) : P(B>A) = 0.9567
Strong (Beta(60, 440)) : P(B>A) = 0.9421
The conclusion is robust across all reasonable priors: P(B > A) exceeds 94% in every case.
Decision and Outcome
The product team adopted a decision rule: deploy Variant B if $P(\theta_B > \theta_A) > 0.90$ AND the expected loss of choosing B is less than 0.1 percentage points. Both criteria were met.
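In code, this rule is a one-liner; the sketch below reuses prob_b_better and loss_choosing_b from Steps 2 and 4, with the thresholds taken from the rule above.
def deploy_b(prob_b_better, expected_loss_b,
             prob_threshold=0.90, loss_threshold=0.001):
    # Deploy B only if it is very likely better AND the expected cost of being
    # wrong is below 0.1 percentage points of conversion rate.
    return prob_b_better > prob_threshold and expected_loss_b < loss_threshold

print(deploy_b(prob_b_better, loss_choosing_b))  # True: 0.9567 > 0.90 and 0.00061 < 0.001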
After full deployment, the conversion rate stabilized at 13.2% -- consistent with the Bayesian posterior prediction and representing a meaningful business improvement.
Key Lessons
- Bayesian A/B testing provides direct probability statements: "There is a 95.7% chance B is better" is more actionable than "p < 0.05."
- Continuous monitoring is natural: The posterior can be updated after each batch of data without the peeking problem that afflicts frequentist tests (a minimal sketch follows this list).
- Prior information improves efficiency: Incorporating historical data through informative priors led to a decision in fewer observations than a traditional power calculation would require.
- Expected loss is a superior decision metric: Rather than just asking "Is the difference significant?", expected loss quantifies the practical cost of a wrong decision.
- Sensitivity analysis builds confidence: Showing that the result holds under different priors strengthens the case for deployment.
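As a minimal illustration of the continuous-monitoring point, the sketch below applies the conjugate update after each daily batch and re-evaluates P(B > A) as data accumulate. The day-by-day traffic is not reported in this case study, so the batch numbers here are purely hypothetical.
import numpy as np
from scipy import stats

# Hypothetical daily batches of (visitors, conversions) per variant.
daily_batches = [
    {"a": (250, 28), "b": (260, 33)},
    {"a": (240, 30), "b": (255, 36)},
    # ... one entry per day of the experiment
]

a_alpha, a_beta = 15, 110  # start both variants at the informed prior
b_alpha, b_beta = 15, 110

for day, batch in enumerate(daily_batches, start=1):
    (va, ca), (vb, cb) = batch["a"], batch["b"]
    # Conjugate update: successes go to alpha, failures to beta.
    a_alpha, a_beta = a_alpha + ca, a_beta + (va - ca)
    b_alpha, b_beta = b_alpha + cb, b_beta + (vb - cb)
    sa = stats.beta(a_alpha, a_beta).rvs(100_000, random_state=2 * day)
    sb = stats.beta(b_alpha, b_beta).rvs(100_000, random_state=2 * day + 1)
    print(f"Day {day}: P(B > A) = {np.mean(sb > sa):.3f}")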
Connection to Chapter Concepts
- Section 10.1: The sequential updating property allowed the team to monitor the experiment daily.
- Section 10.4: Prior selection used historical data (informative prior) with sensitivity analysis.
- Section 10.5: Beta-Binomial conjugacy enabled closed-form posteriors without MCMC.
- Section 10.10: The Bayesian approach directly answered the product team's question, which was inherently probabilistic.
See code/case-study-code.py for the complete implementation including visualizations.