Case Study 1: The ShopSmart Recommendation Experiment

From Design to Decision --- and the Conversation Nobody Wanted to Have


Background

ShopSmart is a mid-size e-commerce marketplace with 14 million monthly active users, $620M in annual GMV, and a product catalog of 8.4 million items. Their recommendation engine, RecV1, powers "You Might Also Like" carousels on product pages, the homepage feed, and post-purchase email campaigns. RecV1 was built three years ago using collaborative filtering with item-item similarity. It performs reasonably but has known weaknesses: it over-indexes on popular products, under-serves new arrivals, and produces homogeneous recommendations for users with sparse browsing histories.

The ML team spent four months building RecV2, a hybrid system combining collaborative filtering with a deep learning sequence model trained on browsing session data. Offline evaluation was promising:

Metric                                               RecV1    RecV2    Improvement
Mean Reciprocal Rank (MRR)                           0.142    0.159    +12.0%
Hit Rate @ 10                                        0.231    0.258    +11.7%
Coverage (% of catalog shown)                        14.2%    31.8%    +124.0%
Long-tail exposure (items below median popularity)    8.1%    22.4%    +176.5%

These numbers looked excellent. But the team had learned from past mistakes: four months earlier, a similar model had been launched without a controlled experiment, and the resulting "3.2% revenue lift" could not be attributed to the algorithm. This time, they would do it right.


The Experiment Design

Pre-Registration Document

Before writing any experiment code, the team wrote and circulated this document:

EXPERIMENT DESIGN: RecV2 Recommendation Algorithm
===================================================

Experiment ID: SHOP-2025-Q1-REC-002
Owner: ML Team (Maya Santos, lead)
PM: Derek Kwon
Start Date: February 3, 2025
Planned End Date: February 23, 2025 (21 days, 3 full weeks)
Analysis Date: February 24, 2025 (no peeking before this date)

HYPOTHESIS
  H0: RecV2 does not change revenue per user compared to RecV1.
  H1: RecV2 changes revenue per user compared to RecV1.
  (Two-sided test.)

PRIMARY METRIC
  Revenue per user (RPU) over the 21-day experimental period.
  Calculated as total revenue attributed to recommendation clicks
  divided by number of unique users in the group.

GUARDRAIL METRICS
  1. Page load time (p95): must not increase by more than 100ms
  2. Return rate: must not increase by more than 1 percentage point
  3. Search fallback rate: must not increase by more than 0.5pp
  4. Customer support tickets mentioning recommendations

RANDOMIZATION
  Unit: User ID (deterministic hash-based assignment)
  Split: 50/50
  Stratification: By user activity tier (high/medium/low, based on
    prior 30-day session count)

SAMPLE SIZE
  Baseline RPU: $4.82/week
  Standard deviation: $8.14
  MDE: 2% relative ($0.096 absolute)
  Alpha: 0.05 (two-sided)
  Power: 0.80
  Required per group: 88,764
  Available per week: ~1,750,000 per group
  Duration: 3 weeks (including buffer for day-of-week effects)

LAUNCH CRITERIA
  Ship if: Primary metric shows statistically significant positive
  lift (p < 0.05) AND no guardrail metric is significantly degraded.
  Do not ship if: Primary metric is not significant OR any guardrail
  is significantly degraded.
  If ambiguous: Extend experiment or investigate.
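The deterministic hash-based assignment named in the RANDOMIZATION section is commonly implemented along these lines. This is an illustrative sketch, not ShopSmart's actual code; the experiment ID is used as a salt so that the same user can land in different groups across different experiments, and stratification bookkeeping is omitted:

```python
import hashlib

def assign(user_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Deterministic assignment: the same user always gets the same group,
    no matter when or on which server the check runs."""
    key = f"{experiment_id}:{user_id}".encode()
    # Map the hash to one of 10,000 buckets, then split by treatment share.
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000
    return "treatment" if bucket < treatment_share * 10_000 else "control"

print(assign("user-12345", "SHOP-2025-Q1-REC-002"))
```

Because assignment is a pure function of the user ID, no assignment table needs to be stored, and a user who clears cookies or switches devices (but stays logged in) keeps the same experience.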

Production Tip --- The pre-registration document is not bureaucratic overhead. It is a contract. It prevents post-hoc rationalization, protects the analyst from stakeholder pressure, and creates a paper trail. If the PM later says "can we look at conversion rate instead?" the answer is: "That was not our pre-registered primary metric."


Week 1: The Peeking Incident

Despite the pre-registration document explicitly stating "no peeking before February 24," the PM, Derek Kwon, had access to the experiment dashboard. On February 6, with three days of data collected, he checked it.

The dashboard showed:

Group       Users     Mean RPU   p-value
Control     247,812   $4.71      ---
Treatment   248,440   $4.93      0.028

Derek sent a Slack message to Maya: "Looks like RecV2 is winning. p = 0.028. Can we call it early?"

Maya's response was a masterclass in diplomatic firmness:

"Thanks for the enthusiasm, Derek. A few things to keep in mind:

  1. We committed to analyzing on Feb 24. The day 3 p-value is not reliable --- our false positive rate for peeking at 6 checkpoints would be approximately 15%, not 5%.

  2. Day 3 covers only early-week data (Mon-Wed). E-commerce behavior is very different on weekends. We need full-week cycles.

  3. The sample size at day 3 is about 250K per group. Our power analysis required at least 89K, but the MDE calculation assumed the full 21-day period of observation. With only 3 days of revenue accumulation per user, the metric variance is much higher.

I am going to revoke dashboard access for everyone except myself until the analysis date. I know that sounds heavy-handed, but it is how mature experimentation orgs operate. I will send a report on Feb 24."

She revoked dashboard access that afternoon. Derek was annoyed. The VP was confused. But the pre-registration document was on her side.
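Maya's "approximately 15%" figure is easy to reproduce by simulation: run many null experiments (no true effect), test at several interim checkpoints, and count how often any checkpoint crosses p < 0.05. The sample sizes below are arbitrary illustration, not ShopSmart's:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims = 2_000      # simulated null experiments (no true effect)
n_per_look = 500    # users added per group between checkpoints
n_looks = 6

false_positives = 0
for _ in range(n_sims):
    a = rng.normal(0, 1, n_per_look * n_looks)  # control outcomes
    b = rng.normal(0, 1, n_per_look * n_looks)  # treatment outcomes
    for k in range(1, n_looks + 1):
        n = k * n_per_look
        # Two-sample z-statistic on the data accumulated so far
        z = (a[:n].mean() - b[:n].mean()) / np.sqrt(2 / n)
        if abs(z) > 1.96:  # "significant" at this peek
            false_positives += 1
            break

print(false_positives / n_sims)  # typically around 0.15, not 0.05
```

Each individual look has a 5% false positive rate, but taking the first significant result across six looks roughly triples it. This is exactly why the pre-registration document fixes a single analysis date.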


Week 2: The Infrastructure Scare

On February 12, the engineering team discovered that a caching bug had caused approximately 3% of treatment users to occasionally see RecV1 recommendations instead of RecV2. The bug was present from February 3-10 and was fixed on February 11.

This is a contamination problem. If treatment users sometimes saw the control experience, the treatment effect is diluted. The experiment is measuring a mix of RecV2 and RecV1, not pure RecV2.

The team had two options:

  1. Restart the experiment. Start fresh on February 11. This adds 21 days to the timeline.
  2. Continue and analyze as intent-to-treat. Keep all users in their assigned groups regardless of the contamination. The measured effect will be a lower bound on the true effect.

Maya chose option 2, with documentation:

CONTAMINATION NOTE (Feb 12):
  ~3% of treatment users received RecV1 recommendations between Feb 3-10
  due to caching bug (JIRA: SHOP-4412, fixed Feb 11).

  Decision: Continue experiment. Analyze as intent-to-treat.
  Impact: Treatment effect estimate will be conservative (diluted by ~3%).
  If the experiment is significant, the true RecV2 effect is slightly
  larger than measured. If not significant, the contamination may have
  hidden a real effect; we will consider extending the test.
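The size of the dilution can be bounded with back-of-envelope arithmetic. If a fraction of treatment exposure effectively delivered the control experience, the intent-to-treat estimate is attenuated by roughly that fraction. The numbers below are hypothetical, used only to show the magnitude; contamination affected ~3% of users for 8 of the 21 days:

```python
def approx_true_lift(measured_lift_pct: float, contaminated_share: float,
                     contaminated_days: int, total_days: int) -> float:
    """Rough de-attenuation of an intent-to-treat lift estimate."""
    # Fraction of total treatment exposure that was actually control
    dilution = contaminated_share * (contaminated_days / total_days)
    return measured_lift_pct / (1 - dilution)

# Hypothetical: a measured 2.0% lift, 3% of users contaminated for 8 of 21 days
print(round(approx_true_lift(2.0, 0.03, 8, 21), 3))
```

The correction is on the order of 1% of the estimate itself, which supports Maya's judgment that restarting was not worth a three-week delay.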

Day 21: The Results

On February 24, Maya pulled the final data. Here are the results:

import pandas as pd

# Final experiment results (summarized from production data)
results = {
    'metric': ['Revenue per User (primary)', 'Click-Through Rate',
               'Conversion Rate', 'Average Order Value',
               'Return Rate (guardrail)', 'P95 Load Time ms (guardrail)',
               'Search Fallback Rate (guardrail)'],
    'control': [14.46, 0.0342, 0.0218, 67.42, 0.048, 412, 0.156],
    'treatment': [14.81, 0.0378, 0.0226, 68.11, 0.051, 438, 0.149],
    'lift_pct': [2.42, 10.53, 3.67, 1.02, 6.25, 6.31, -4.49],
    'p_value': [0.031, 0.0001, 0.087, 0.312, 0.182, 0.003, 0.041],
    'significant': ['Yes', 'Yes', 'No', 'No', 'No', 'Yes (ALERT)', 'Yes (improved)'],
}

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
                              metric  control  treatment  lift_pct  p_value     significant
          Revenue per User (primary)    14.46      14.81      2.42    0.031             Yes
                  Click-Through Rate   0.0342     0.0378     10.53   0.0001             Yes
                     Conversion Rate   0.0218     0.0226      3.67    0.087              No
                 Average Order Value    67.42      68.11      1.02    0.312              No
             Return Rate (guardrail)    0.048      0.051      6.25    0.182              No
        P95 Load Time ms (guardrail)      412        438      6.31    0.003     Yes (ALERT)
    Search Fallback Rate (guardrail)    0.156      0.149     -4.49    0.041  Yes (improved)

The Good News

  • Primary metric wins. Revenue per user increased 2.42% (p = 0.031). Statistically significant. The 95% CI for the lift was [0.22%, 4.62%]. The entire interval is positive.
  • CTR substantially improved. Users are clicking on recommendations more often.
  • Search fallback rate decreased. Users are finding what they want through recommendations and relying less on search. This suggests recommendation quality genuinely improved.
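The reported p-value and confidence interval for the primary metric are mutually consistent, which can be verified from the summary statistics alone (a sanity check under a normal approximation, not part of the team's analysis):

```python
from scipy import stats

lift = 2.42                    # point estimate of the lift, in %
ci_low, ci_high = 0.22, 4.62   # reported 95% CI, in %

# Recover the standard error from the CI half-width, then the z and p
se = (ci_high - ci_low) / (2 * 1.96)
z = lift / se
p = 2 * (1 - stats.norm.cdf(z))

print(round(z, 3), round(p, 3))  # z ~ 2.156, p ~ 0.031 (matches the table)
```

This kind of check catches transcription errors between the statistical output and the write-up.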

The Problem

  • P95 load time increased by 26ms (412ms to 438ms), and the increase is statistically significant (p = 0.003). The pre-registered guardrail threshold was 100ms, so 26ms is within the guardrail. However, any statistically significant increase in latency warrants investigation.

The Analysis

Maya's report to Derek and the VP:

Recommendation: Conditional launch.

RecV2 demonstrates a statistically significant 2.42% increase in revenue per user. The confidence interval [0.22%, 4.62%] excludes zero. Accounting for the ~3% treatment contamination in week 1, the true effect may be slightly larger.

However, P95 load time increased by 26ms (6.3%). While this is within our 100ms guardrail threshold, the increase is statistically significant and should be investigated. The likely cause is the additional inference latency from the sequence model.

Recommended next steps:

  1. Engineering team investigates the latency increase. If it can be reduced to < 10ms with caching or model optimization, proceed with full launch.
  2. If latency cannot be reduced, run a follow-up experiment with the optimized model to confirm the revenue lift persists.
  3. Monitor return rate for 30 days post-launch. The 6.25% increase was not statistically significant but warrants watching.


The Stakeholder Conversation

The VP, Sarah Chen, read the report and called a meeting.

"The numbers look good. 2.42% revenue lift, significant. But I want to understand the confidence interval. You said the true lift could be as low as 0.22%. That does not sound very impressive."

Maya's response: "You are right that 0.22% is the lower bound. But the confidence interval is a range of plausible values for the true effect. Our best estimate is 2.42%, which would translate to roughly $15M in annual revenue if it generalizes to the full user base. Even the lower bound of 0.22% represents about $1.4M. And the latency fix could push the true effect higher."

"What about conversion rate? It went up 3.67% but was not significant?"

"Correct. The p-value was 0.087. With correction for multiple testing, it would be even higher. We cannot claim a conversion rate improvement from this experiment. It is suggestive but not conclusive."

"I noticed you did not include conversion rate in the pre-registered metrics."

"It was a secondary metric. Our launch decision rests on revenue per user, which was the pre-registered primary."

Sarah approved the conditional launch, contingent on the engineering team resolving the latency issue.
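The dollar figures Maya quoted follow directly from the background numbers, assuming (as a simplification) that the per-user lift scales uniformly to the full $620M annual GMV:

```python
annual_gmv = 620_000_000  # ShopSmart's annual GMV, from the Background section

for label, lift_pct in [("point estimate", 2.42),
                        ("CI lower bound", 0.22),
                        ("CI upper bound", 4.62)]:
    annual_impact = annual_gmv * lift_pct / 100
    print(f"{label}: {lift_pct:.2f}% -> ${annual_impact / 1e6:.1f}M / year")
```

Translating relative lifts into dollars like this is often what makes a "small" lower bound legible to executives: even 0.22% clears seven figures at this scale.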


Epilogue

The engineering team optimized RecV2's inference pipeline, reducing P95 latency from 438ms to 408ms --- actually lower than the control baseline. RecV2 launched to 100% of users on March 15, 2025.

Six months later, the annualized revenue impact was approximately $12.8M --- consistent with the lower half of the confidence interval but below the point estimate of $15M. This is normal. Offline and short-term online results often overestimate long-term impact due to novelty effects and market dynamics.

The team also discovered that RecV2 performed 40% better for users with more than 10 prior purchases (the deep learning model had more behavioral signal to work with) but showed no improvement for new users. This segmentation insight, which emerged from post-hoc analysis of the experiment data, informed the roadmap for RecV3: a separate model pathway for new-user cold start.


Discussion Questions

  1. Maya revoked dashboard access to prevent peeking. Was this the right call? What alternatives exist that allow monitoring without inflating false positive rates?

  2. The experiment suffered from 3% treatment contamination in week 1. Maya chose intent-to-treat analysis. Under what circumstances would restarting the experiment have been the better choice?

  3. The confidence interval for revenue lift was [0.22%, 4.62%]. If the VP had required at least a 2% lift to justify the engineering maintenance costs of RecV2, would the experiment support launching? Why or why not?

  4. The post-hoc finding that RecV2 worked better for high-purchase-count users was not pre-registered. How should this finding be treated? Should it influence the launch decision?

  5. The six-month revenue impact ($12.8M) was below the initial point estimate ($15M). Does this mean the experiment was wrong? How should organizations think about the relationship between experimental results and long-term impact?


This case study accompanies Chapter 3: Experimental Design and A/B Testing. Return to the chapter for full context.