Case Study 2: The Hard Conversation --- When Data Disagrees with Leadership
Background
NovaMart is a mid-size e-commerce company ($120M annual revenue, 8 million monthly active users). The product team has spent four months redesigning the checkout flow. The new design reduces the number of steps from five to three, adds a progress indicator, and replaces the "Create Account" wall with a guest checkout option. The VP of Product, Derek, is personally invested in the project. He presented the redesign to the board as a key initiative for Q2.
The data science team ran an A/B test. The test ran for three weeks with 50,000 users per group (treatment = new checkout, control = existing checkout). The primary metric is checkout completion rate (the percentage of users who start checkout and complete the purchase).
The results are in. They are not what Derek wants to hear.
Phase 1: The Results
import numpy as np
from scipy import stats
np.random.seed(42)
# A/B test parameters
n_control = 50000
n_treatment = 50000
# True checkout completion rates
# The new design is slightly better, but the effect is small.
control_rate = 0.324 # 32.4% checkout completion
treatment_rate = 0.328 # 32.8% checkout completion (true lift: +0.4pp)
# Simulate outcomes
control_completions = np.random.binomial(n_control, control_rate)
treatment_completions = np.random.binomial(n_treatment, treatment_rate)
# Observed rates
obs_control_rate = control_completions / n_control
obs_treatment_rate = treatment_completions / n_treatment
obs_lift = obs_treatment_rate - obs_control_rate
print("A/B Test Results: Checkout Redesign")
print("=" * 50)
print(f"Control (old checkout):")
print(f" Users: {n_control:>10,}")
print(f" Completions: {control_completions:>10,}")
print(f" Rate: {obs_control_rate:>10.3%}")
print(f"\nTreatment (new checkout):")
print(f" Users: {n_treatment:>10,}")
print(f" Completions: {treatment_completions:>10,}")
print(f" Rate: {obs_treatment_rate:>10.3%}")
print(f"\nObserved lift: {obs_lift:>+.3%}")
Statistical Test
# Two-proportion z-test
pooled_rate = (control_completions + treatment_completions) / (n_control + n_treatment)
se = np.sqrt(pooled_rate * (1 - pooled_rate) * (1 / n_control + 1 / n_treatment))
z_stat = obs_lift / se
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
# Confidence interval
z_crit = stats.norm.ppf(0.975)
ci_lower = obs_lift - z_crit * se
ci_upper = obs_lift + z_crit * se
print(f"\nStatistical Test")
print(f"-" * 50)
print(f"Z-statistic: {z_stat:>10.3f}")
print(f"p-value: {p_value:>10.3f}")
print(f"95% CI: [{ci_lower:>+.3%}, {ci_upper:>+.3%}]")
print(f"Significant at alpha=0.05: {'Yes' if p_value < 0.05 else 'No'}")
What the Numbers Mean
# Power analysis: what effect could this test reliably detect?
alpha = 0.05
power = 0.80
z_alpha = stats.norm.ppf(1 - alpha / 2)
z_beta = stats.norm.ppf(power)
mde = (z_alpha + z_beta) * np.sqrt(
    2 * control_rate * (1 - control_rate) / n_control
)
print(f"\nMinimum Detectable Effect (MDE)")
print(f"-" * 50)
print(f"At alpha={alpha}, power={power}, n={n_control:,} per group:")
print(f"MDE: {mde:.4f} ({mde:.2%})")
print(f"Observed effect: {abs(obs_lift):.4f} ({abs(obs_lift):.2%})")
print(f"Observed effect is {'at or above' if abs(obs_lift) >= mde else 'below'} the MDE")
if abs(obs_lift) < mde:
    print("\nThe observed effect is smaller than the MDE.")
    print("The test was not adequately powered to detect an effect this small.")
    print("This does NOT mean there is no effect --- only that we cannot")
    print("distinguish it from noise with this sample size.")
Phase 2: The Meeting
Derek calls a meeting with the data science lead (you), the head of engineering, and the head of design. Here is how the conversation unfolds.
Derek's Position
"The results show a positive trend. The new checkout has a higher completion rate. I know the p-value is not below 0.05, but that is a statistical technicality. We have been working on this for four months. The design is objectively better --- fewer steps, guest checkout, progress indicator. I want to launch."
Your Preparation
Before the meeting, you prepare three things:
# 1. Revenue impact range
monthly_checkouts = 800000 # 8M MAU * 10% reach checkout
avg_order_value = 47.50
# Scenario A: the true effect is at the lower bound of the CI
revenue_impact_low = monthly_checkouts * ci_lower * avg_order_value
# Scenario B: the true effect is at the point estimate
revenue_impact_mid = monthly_checkouts * obs_lift * avg_order_value
# Scenario C: the true effect is at the upper bound of the CI
revenue_impact_high = monthly_checkouts * ci_upper * avg_order_value
print("Revenue Impact Scenarios (monthly)")
print("=" * 50)
print(f"Pessimistic (CI lower): ${revenue_impact_low:>12,.0f}")
print(f"Point estimate: ${revenue_impact_mid:>12,.0f}")
print(f"Optimistic (CI upper): ${revenue_impact_high:>12,.0f}")
print(f"\nNote: The pessimistic scenario includes the possibility")
print(f"of a NEGATIVE effect on revenue.")
# 2. Sample size needed to detect a 0.4pp effect with 80% power
required_n_per_group = (
    ((z_alpha + z_beta) ** 2)
    * 2
    * control_rate
    * (1 - control_rate)
    / (0.004 ** 2)
)
additional_weeks = (required_n_per_group - n_control) / (n_control / 3)
print(f"\nSample Size for Definitive Answer")
print(f"-" * 50)
print(f"To detect 0.4pp effect with 80% power:")
print(f" Required per group: {required_n_per_group:>12,.0f}")
print(f" Currently have: {n_control:>12,}")
print(f" Need additional: {required_n_per_group - n_control:>12,.0f}")
print(f" At current rate: ~{additional_weeks:.0f} more weeks")
# 3. Staged rollout plan
staged_plan = {
    "option_a": {
        "name": "Extend the test",
        "description": (
            "Run for ~10 more weeks with the same 50/50 split. "
            "Gives us the power to detect a 0.4pp effect."
        ),
        "risk": "Low. No change to user experience for most users.",
        "timeline": "~10 weeks",
        "cost": "Engineering time to maintain both checkout flows.",
    },
    "option_b": {
        "name": "Staged rollout (10%)",
        "description": (
            "Launch to 10% of users. Monitor checkout completion, "
            "revenue, and support tickets for 2 weeks."
        ),
        "risk": "Low-medium. If the new checkout has issues, only 10% "
                "of users are affected.",
        "timeline": "2 weeks for monitoring, then decision to expand.",
        "cost": "Same engineering cost as the test.",
    },
    "option_c": {
        "name": "Full launch with monitoring",
        "description": (
            "Launch to 100% of users. Monitor closely for 4 weeks. "
            "Prepare rollback plan."
        ),
        "risk": "Medium. If the true effect is negative, all users are "
                "affected during the monitoring period.",
        "timeline": "Immediate launch, 4-week monitoring window.",
        "cost": "Minimal incremental cost.",
    },
}
print("\nProposed Options")
print("=" * 60)
for key, option in staged_plan.items():
    print(f"\n{option['name'].upper()}")
    print(f" Description: {option['description']}")
    print(f" Risk: {option['risk']}")
    print(f" Timeline: {option['timeline']}")
    print(f" Cost: {option['cost']}")
Phase 3: The Conversation
Here is the script. This is not hypothetical --- this is the kind of conversation that data scientists have every month.
Your Opening
"Derek, I want to make sure we interpret the results correctly. The new checkout does show a slightly higher completion rate --- about 0.3 to 0.4 percentage points higher. That is a real observation. The issue is that with the sample size we had, we cannot distinguish that small an effect from random noise. The 95% confidence interval includes zero, which means we cannot rule out the possibility that the true effect is zero or even slightly negative."
Derek's Response
"But the trend is positive. And the design improvements are real --- fewer steps, guest checkout. Those are objectively better."
Your Counter
"I agree that the design improvements are real, and there are strong usability reasons to prefer the new checkout. But the A/B test is measuring the net effect on revenue, and that net effect is too small for us to measure with confidence given three weeks of data. Here is what I recommend."
The Recommendation
"We have three options."
"Option A: Extend the test for roughly ten more weeks. This gives us enough data to detect a 0.4 percentage point effect with 80% power. If the effect is real, the extended test will likely confirm it."
"Option B: Launch to 10% of users and monitor for two weeks. This gets the new design into production faster while limiting risk. If checkout completion holds steady or improves in the 10% group, we expand to everyone."
"Option C: Launch to 100% with close monitoring and a rollback plan. This is the fastest path but carries the most risk. If the true effect is slightly negative, all users are affected."
"My recommendation is Option B. It balances speed with confidence."
Derek's Decision
Derek chooses Option C --- full launch. He cites the board presentation, the four months of design work, and the qualitative user research that supports the new design.
Your Documentation
from datetime import date
def document_decision(
    decision_date: date,
    decision: str,
    data_summary: str,
    rationale: str,
    dissent: str,
    monitoring_plan: str,
    rollback_trigger: str,
) -> str:
    """
    Create a decision log entry for the data science records.
    """
    lines = [
        "DECISION LOG ENTRY",
        "=" * 60,
        f"Date: {decision_date.isoformat()}",
        f"Decision: {decision}",
        "",
        "Data Summary:",
        f" {data_summary}",
        "",
        "Business Rationale:",
        f" {rationale}",
        "",
        "Data Science Assessment:",
        f" {dissent}",
        "",
        "Monitoring Plan:",
        f" {monitoring_plan}",
        "",
        "Rollback Trigger:",
        f" {rollback_trigger}",
        "=" * 60,
    ]
    return "\n".join(lines)
log_entry = document_decision(
    decision_date=date(2026, 3, 15),
    decision="Launch new checkout flow to 100% of users.",
    data_summary=(
        "A/B test (3 weeks, 50K per group) showed +0.3-0.4pp lift in "
        "checkout completion. p=0.23 (not statistically significant at "
        "alpha=0.05). 95% CI: [-0.1%, +0.9%]. Test was underpowered "
        "for the observed effect size."
    ),
    rationale=(
        "VP of Product decision based on: (1) positive trend direction, "
        "(2) qualitative design improvements (fewer steps, guest checkout), "
        "(3) user research supporting the redesign, (4) board commitment "
        "to Q2 launch."
    ),
    dissent=(
        "Data science recommended Option B (10% staged rollout) to "
        "gather additional evidence. The confidence interval includes "
        "zero and a small negative effect. Risk: if the true effect is "
        "negative, all users are affected during the monitoring period."
    ),
    monitoring_plan=(
        "Daily monitoring of: checkout completion rate, revenue per "
        "session, cart abandonment rate, support ticket volume. "
        "Comparison: pre-launch 4-week baseline vs. post-launch. "
        "Statistical process control charts for all four metrics."
    ),
    rollback_trigger=(
        "Automatic rollback if checkout completion drops more than "
        "1.0pp below the 4-week baseline average for 3 consecutive "
        "days, or if support ticket volume exceeds 2x the baseline."
    ),
)
print(log_entry)
Phase 4: What Happened Next
Four weeks after the full launch:
# Post-launch monitoring results (simulated)
np.random.seed(42)
weeks = ["Week 1", "Week 2", "Week 3", "Week 4"]
baseline_rate = 0.324
# The true effect is +0.3pp, with some weekly noise
weekly_rates = [
    baseline_rate + np.random.normal(0.003, 0.002),
    baseline_rate + np.random.normal(0.003, 0.002),
    baseline_rate + np.random.normal(0.004, 0.002),
    baseline_rate + np.random.normal(0.003, 0.002),
]
print("Post-Launch Monitoring: Checkout Completion Rate")
print("=" * 55)
print(f"{'Week':<10} {'Rate':>10} {'vs Baseline':>15} {'Status':>12}")
print("-" * 55)
for week, rate in zip(weeks, weekly_rates):
    diff = rate - baseline_rate
    status = "OK" if diff > -0.005 else "INVESTIGATE"
    print(f"{week:<10} {rate:>10.3%} {diff:>+15.3%} {status:>12}")
print(f"\nBaseline: {baseline_rate:.3%}")
print(f"Average post-launch: {np.mean(weekly_rates):.3%}")
print(f"Average lift: {np.mean(weekly_rates) - baseline_rate:+.3%}")
print(f"\nConclusion: No rollback triggered. Effect consistent with")
print(f"a small positive lift, as the A/B test suggested.")
The outcome: the new checkout performed slightly better than the old one, consistent with the original A/B test's point estimate. No rollback was needed. The effect was real --- just too small for the original test to detect with statistical confidence.
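The decision log's monitoring plan also calls for statistical process control charts on these metrics. A minimal p-chart sketch for the completion-rate chart, with the daily checkout volume an assumed figure for illustration (the case study does not state it):

```python
import numpy as np

# 3-sigma p-chart limits for the daily checkout completion rate.
# baseline_p comes from the 4-week pre-launch baseline; daily_checkouts
# is an assumed volume, not a number from the case study.
baseline_p = 0.324
daily_checkouts = 27_000

sigma = np.sqrt(baseline_p * (1 - baseline_p) / daily_checkouts)
lcl = baseline_p - 3 * sigma  # lower control limit
ucl = baseline_p + 3 * sigma  # upper control limit

print(f"Control limits: [{lcl:.3%}, {ucl:.3%}]")
# A daily rate outside these limits signals a change worth investigating.
```

Unlike the fixed 1.0pp rollback threshold, control limits scale with daily volume, so the chart stays calibrated if traffic changes.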
Key Insight --- Derek's decision turned out to be correct, but that does not mean it was the right decision at the time. The data scientist's job was to ensure the decision was made with full awareness of the uncertainty. If the true effect had been -0.5pp instead of +0.3pp, the data scientist's documentation and monitoring plan would have caught it. The process matters more than any single outcome.
Phase 5: Lessons Learned
Lesson 1: "Not significant" is not "not real"
A non-significant p-value does not prove the null hypothesis. It may mean only that the test lacked the power to detect an effect of the observed size. This distinction is critical when communicating with stakeholders. Saying "the test showed no effect" is wrong. Saying "the test could not distinguish the observed effect from noise" is accurate.
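One way to make this lesson concrete is to compute how much power the original test actually had against the effect size the team cared about. A minimal sketch, assuming the true lift really is +0.4pp:

```python
import numpy as np
from scipy import stats

# Approximate power of a two-proportion z-test against an assumed true lift.
n = 50_000          # users per group, as in the original test
p_control = 0.324   # baseline checkout completion rate
true_lift = 0.004   # assumed true effect: +0.4pp
alpha = 0.05

se = np.sqrt(2 * p_control * (1 - p_control) / n)  # SE of the rate difference
z_alpha = stats.norm.ppf(1 - alpha / 2)
power = stats.norm.cdf(true_lift / se - z_alpha)

print(f"Power to detect +0.4pp with n={n:,} per group: {power:.0%}")
```

Under these assumptions the test had well under 50% power, so a non-significant result was the most likely outcome even if the redesign worked exactly as hoped.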
Lesson 2: The confidence interval is more useful than the p-value
The p-value collapses the evidence into a binary gate: significant or not. The confidence interval tells the full story: "the true effect is plausibly between -0.1% and +0.9%." This range lets the business make a judgment call with full information.
Lesson 3: Document everything
When the business overrides a data recommendation, the data scientist's job is to document the decision, the rationale, the dissent, the monitoring plan, and the rollback trigger. This is not political cover --- it is good practice. If the decision goes wrong, the documentation enables a fast response. If the decision goes right, the documentation shows that the organization made an informed choice.
Lesson 4: Provide options, not ultimatums
"You should not launch" is an ultimatum. "Here are three options with different risk profiles" is a collaboration. The data scientist's role is to inform the decision, not to make it. The business may have context the data does not capture.
Lesson 5: The monitoring plan is the bridge
When the data is ambiguous and the business wants to act, the monitoring plan is the compromise that makes action safe. "Launch, but with daily monitoring and an automatic rollback trigger" turns an uncertain decision into a manageable risk.
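The rollback trigger described above can be made mechanical rather than a judgment call. A minimal sketch (the function name and example rates are hypothetical; the thresholds mirror the documented trigger of 1.0pp below baseline for 3 consecutive days):

```python
def rollback_triggered(daily_rates, baseline, threshold_pp=0.010, streak_days=3):
    """True if the rate sits more than threshold_pp below baseline
    for streak_days consecutive days."""
    streak = 0
    for rate in daily_rates:
        streak = streak + 1 if rate < baseline - threshold_pp else 0
        if streak >= streak_days:
            return True
    return False

baseline = 0.324
healthy = [0.325, 0.323, 0.326, 0.324]    # stays within tolerance
degraded = [0.320, 0.312, 0.311, 0.310]   # three days below 31.4%

print(rollback_triggered(healthy, baseline))
print(rollback_triggered(degraded, baseline))
```

Automating the trigger removes the temptation to renegotiate the threshold after launch, which is exactly when the incentive to rationalize a decline is strongest.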
Discussion Questions
- Was Derek right to launch? Under what circumstances would you have pushed back harder?
- Suppose the post-launch monitoring showed a -0.8pp drop in checkout completion. How would you handle the rollback conversation with Derek, who just presented the redesign to the board?
- The data science team's recommendation (Option B: 10% staged rollout) was overridden. Should the data scientist escalate to Derek's manager? Under what conditions is escalation appropriate?
- How would you redesign the original A/B test to ensure adequate power for the expected effect size? What sample size would you need, and how long would the test take to run?
- NovaMart's CEO reads about the checkout redesign's inconclusive test results and says: "Why are we spending money on A/B testing if we are going to launch regardless of the results?" How do you respond?
This case study supports Chapter 34: The Business of Data Science. Return to the chapter for full context.