Case Study 2: StreamFlow Churn Model Fairness Tradeoff


Background

StreamFlow's churn prediction model has been the Progressive Project's running example since Chapter 1. It predicts 30-day churn for StreamFlow's 50,000 active subscribers. The model --- an XGBoost classifier with an AUC of 0.938 --- drives the Customer Success team's weekly high-risk list. Subscribers with a predicted churn probability above 0.40 receive a retention intervention: a personalized content recommendation, a loyalty discount, or an outreach call.

The system has been running for four months. Churn among contacted subscribers has dropped by an estimated 15%. The VP of Customer Success calls it the best tool the team has.

Then the VP of Marketing asks a question: "Are we offering retention discounts evenly across our subscriber base? I looked at the numbers, and it seems like our younger subscribers are getting way more discount offers than older ones. Are we accidentally discriminating by age?"

This case study walks through a fairness audit of the StreamFlow churn model, focusing on age group and geographic region. This is also the Progressive Project milestone M12: you will audit your own StreamFlow model using the same workflow.


Phase 1: The Data

import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    roc_auc_score, accuracy_score, confusion_matrix, roc_curve
)

np.random.seed(42)
n = 50000

# Core engagement features
sessions_last_30d = np.random.poisson(14, n)
avg_session_minutes = np.random.exponential(28, n).round(1)
unique_titles_watched = np.random.poisson(8, n)
content_completion_rate = np.random.beta(3, 2, n).round(3)
binge_sessions_30d = np.random.poisson(2, n)
hours_change_pct = np.random.normal(0, 30, n).round(1)
sessions_change_pct = np.random.normal(0, 25, n).round(1)
months_active = np.random.randint(1, 60, n)
plan_price = np.random.choice(
    [9.99, 14.99, 19.99, 24.99], n, p=[0.35, 0.35, 0.20, 0.10]
)
devices_used = np.random.randint(1, 6, n)
payment_failures_6m = np.random.poisson(0.3, n)
support_tickets_90d = np.random.poisson(0.8, n)
days_since_last_session = np.random.exponential(5, n).round(0).clip(0, 60)

# Demographics (protected attributes for fairness audit)
age = np.random.choice(
    ['18-25', '26-40', '41-60', '60+'],
    size=n,
    p=[0.22, 0.38, 0.28, 0.12]
)

region = np.random.choice(
    ['Urban', 'Suburban', 'Rural'],
    size=n,
    p=[0.45, 0.35, 0.20]
)

# Age group affects churn through behavior patterns
age_churn_offset = {
    '18-25': 0.08,    # highest churn: price-sensitive, many alternatives
    '26-40': 0.02,    # moderate churn
    '41-60': -0.03,   # lower churn: habitual, less price-sensitive
    '60+': -0.01,     # slightly below average
}
age_offset = np.array([age_churn_offset[a] for a in age])

# Region affects churn through content availability and alternatives
region_churn_offset = {
    'Urban': 0.02,     # more streaming alternatives
    'Suburban': -0.01,  # moderate
    'Rural': -0.03,    # fewer alternatives, but also less content relevance
}
region_offset = np.array([region_churn_offset[r] for r in region])

# Generate churn target
logit = (
    -2.5
    - 0.04 * sessions_last_30d
    - 0.01 * avg_session_minutes
    - 0.03 * unique_titles_watched
    - 0.8 * content_completion_rate
    - 0.10 * binge_sessions_30d
    - 0.005 * hours_change_pct
    - 0.005 * sessions_change_pct
    - 0.01 * months_active
    + 0.02 * plan_price
    - 0.05 * devices_used
    + 0.15 * payment_failures_6m
    + 0.04 * support_tickets_90d
    + 0.03 * days_since_last_session
    + age_offset
    + region_offset
)
prob_churn = 1 / (1 + np.exp(-logit))
churned = np.random.binomial(1, prob_churn)

df = pd.DataFrame({
    'sessions_last_30d': sessions_last_30d,
    'avg_session_minutes': avg_session_minutes,
    'unique_titles_watched': unique_titles_watched,
    'content_completion_rate': content_completion_rate,
    'binge_sessions_30d': binge_sessions_30d,
    'hours_change_pct': hours_change_pct,
    'sessions_change_pct': sessions_change_pct,
    'months_active': months_active,
    'plan_price': plan_price,
    'devices_used': devices_used,
    'payment_failures_6m': payment_failures_6m,
    'support_tickets_90d': support_tickets_90d,
    'days_since_last_session': days_since_last_session,
    'age_group': age,
    'region': region,
    'churned': churned,
})

print("Churn rates by age group:")
print(df.groupby('age_group')['churned'].mean().round(3))
print("\nChurn rates by region:")
print(df.groupby('region')['churned'].mean().round(3))
print(f"\nOverall churn rate: {df['churned'].mean():.3f}")

The numbers confirm the VP's intuition: churn rates differ across age groups. The 18--25 group has the highest churn rate, and the 41--60 group has the lowest. The question is whether the model's errors are distributed equitably --- or whether certain groups are systematically over-flagged or under-flagged.
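Before treating the gap as real, it is worth confirming it is not sampling noise. A quick chi-square test of independence does that. The sketch below runs on a small synthetic stand-in so it is self-contained; in the script above you would build the table directly as pd.crosstab(df['age_group'], df['churned']). It assumes scipy is available (it ships as a scikit-learn dependency).

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-in with churn rates shaped like the case-study data;
# in the script above, use: table = pd.crosstab(df['age_group'], df['churned'])
rng = np.random.default_rng(0)
group = rng.choice(['18-25', '26-40', '41-60', '60+'], 5000)
rate = pd.Series({'18-25': 0.12, '26-40': 0.09, '41-60': 0.06, '60+': 0.07})
churn = rng.binomial(1, rate[group].values)

table = pd.crosstab(group, churn)
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.2e}")  # p well below 0.05
```

With 50,000 rows and gaps of several percentage points, the test rejects independence decisively; the interesting question is therefore not whether the groups differ, but how the model's errors distribute across them.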


Phase 2: Training and Baseline Audit

feature_cols = [
    'sessions_last_30d', 'avg_session_minutes', 'unique_titles_watched',
    'content_completion_rate', 'binge_sessions_30d', 'hours_change_pct',
    'sessions_change_pct', 'months_active', 'plan_price', 'devices_used',
    'payment_failures_6m', 'support_tickets_90d', 'days_since_last_session',
]

X = df[feature_cols]
y = df['churned']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

age_test = df.loc[X_test.index, 'age_group'].values
region_test = df.loc[X_test.index, 'region'].values

model = XGBClassifier(
    n_estimators=300, max_depth=5, learning_rate=0.1,
    random_state=42, eval_metric='logloss',
    # note: use_label_encoder was deprecated and later removed from XGBoost
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(f"Overall AUC:      {roc_auc_score(y_test, y_prob):.3f}")
print(f"Overall Accuracy: {accuracy_score(y_test, y_pred):.3f}")

Fairness Audit Function

def fairness_audit(y_true, y_pred, y_prob, groups, group_name, threshold=0.5):
    """Run a complete fairness audit for a protected attribute."""
    print(f"\n{'='*65}")
    print(f"FAIRNESS AUDIT — {group_name}")
    print(f"{'='*65}")

    rows = []
    for group in sorted(np.unique(groups)):
        mask = groups == group
        y_t = y_true[mask]
        y_p = y_pred[mask]
        y_pr = y_prob[mask]
        n_g = mask.sum()

        if len(np.unique(y_t)) < 2:
            continue

        tn, fp, fn, tp = confusion_matrix(y_t, y_p).ravel()
        tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
        ppv = tp / (tp + fp) if (tp + fp) > 0 else 0
        pos_rate = y_p.mean()
        base_rate = y_t.mean()
        auc = roc_auc_score(y_t, y_pr)

        rows.append({
            'Group': group, 'N': n_g,
            'Base Rate': round(base_rate, 3),
            'Pos Rate': round(pos_rate, 3),
            'TPR': round(tpr, 3),
            'FPR': round(fpr, 3),
            'Precision': round(ppv, 3),
            'AUC': round(auc, 3),
        })

    audit_df = pd.DataFrame(rows).set_index('Group')
    print(audit_df.to_string())

    # Disparities summary
    print(f"\n  TPR range:       {audit_df['TPR'].max() - audit_df['TPR'].min():.3f}")
    print(f"  FPR range:       {audit_df['FPR'].max() - audit_df['FPR'].min():.3f}")
    print(f"  Precision range: {audit_df['Precision'].max() - audit_df['Precision'].min():.3f}")
    print(f"  AUC range:       {audit_df['AUC'].max() - audit_df['AUC'].min():.3f}")

    # Disparate impact: per the four-fifths rule, compare each group's
    # positive-prediction rate to the most-flagged group's rate
    ref_group = audit_df['Pos Rate'].idxmax()
    ref_rate = audit_df.loc[ref_group, 'Pos Rate']
    print(f"\n  Disparate Impact (reference: {ref_group}):")
    for group in audit_df.index:
        ratio = audit_df.loc[group, 'Pos Rate'] / ref_rate if ref_rate > 0 else np.nan
        flag = "  ** below 0.80 **" if ratio < 0.80 else ""
        print(f"    {group:12s}  DI={ratio:.3f}{flag}")

    print(f"{'='*65}")
    return audit_df

# Audit by age group
audit_age = fairness_audit(
    y_test.values, y_pred, y_prob, age_test, "Age Group"
)

# Audit by region
audit_region = fairness_audit(
    y_test.values, y_pred, y_prob, region_test, "Region"
)

Phase 3: Analyzing the Disparities

Age Group Analysis

The audit reveals two patterns:

  1. Different base rates: The 18--25 group churns at a higher rate than the 41--60 group. This is a real behavioral difference --- younger subscribers are more price-sensitive, have more streaming alternatives, and are less habitual.

  2. Different error rates: The model's TPR and FPR vary across age groups. If the model has a higher FPR for younger subscribers, it is over-flagging them --- leading to more retention offers for subscribers who would not have churned anyway. If it has a lower TPR for older subscribers, it is missing at-risk older subscribers who would have benefited from outreach.

# Business impact: how many retention offers go to each age group?
print("Retention offers by age group (threshold=0.40):")
y_pred_40 = (y_prob >= 0.40).astype(int)
for group in ['18-25', '26-40', '41-60', '60+']:
    mask = age_test == group
    n_flagged = y_pred_40[mask].sum()
    n_group = mask.sum()
    pct = n_flagged / n_group * 100
    print(f"  {group:8s}  {n_flagged:5d} flagged out of {n_group} ({pct:.1f}%)")
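Flag counts show where the offers go; they do not show whether the flags are accurate. The same 0.40 threshold can be decomposed into per-group TPR and FPR with a small helper. The sketch below runs on synthetic arrays so it is self-contained; in the case-study script you would call group_error_rates(y_test.values, y_prob, age_test).

```python
import numpy as np

def group_error_rates(y_true, y_prob, groups, threshold=0.40):
    """Per-group TPR and FPR at a fixed decision threshold."""
    y_hat = (y_prob >= threshold).astype(int)
    out = {}
    for g in np.unique(groups):
        m = groups == g
        t, p = y_true[m], y_hat[m]
        tp = ((t == 1) & (p == 1)).sum()
        fn = ((t == 1) & (p == 0)).sum()
        fp = ((t == 0) & (p == 1)).sum()
        tn = ((t == 0) & (p == 0)).sum()
        out[g] = {
            'TPR': tp / (tp + fn) if (tp + fn) else float('nan'),
            'FPR': fp / (fp + tn) if (fp + tn) else float('nan'),
        }
    return out

# Synthetic demo: scores are informative, so TPR is high and FPR low
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.1, 1000)
y_prob = np.clip(0.1 + 0.5 * y_true + rng.normal(0, 0.2, 1000), 0, 1)
groups = rng.choice(['A', 'B'], 1000)
for g, r in group_error_rates(y_true, y_prob, groups).items():
    print(f"{g}: TPR={r['TPR']:.3f}  FPR={r['FPR']:.3f}")
```

A group can receive many offers for either reason: a high base rate (the model is doing its job) or a high FPR (the model is over-flagging). Separating the two is what makes the VP's question answerable.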

Business Context --- The VP of Marketing's concern is valid. If younger subscribers receive disproportionately more retention discounts, the company is spending more on a demographic that churns more (which may be appropriate) but also potentially signaling to older subscribers that they are less valued (which is not). The question is whether the differential treatment is justified by the different base rates or whether it reflects model error.

Region Analysis

# Regional patterns
print("Feature means by region (test set):")
for feat in ['sessions_last_30d', 'content_completion_rate', 'plan_price']:
    print(f"\n  {feat}:")
    for reg in ['Urban', 'Suburban', 'Rural']:  # `reg`, not `region`: avoid shadowing the global array
        mask = region_test == reg
        val = X_test.loc[mask, feat].mean()
        print(f"    {reg:10s}  {val:.2f}")

Rural subscribers may have different engagement patterns (fewer sessions but longer sessions, different content preferences). If the model was trained primarily on urban and suburban patterns, it may underperform for rural subscribers --- a form of representation bias based on geography rather than demographics.
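A complementary check for representation bias is group-wise calibration: if the model was fit mostly to urban and suburban behavior, its probabilities may be systematically too high or too low for rural subscribers even when its ranking quality (AUC) looks fine. The helper below is generic; in the case-study script you would call calibration_by_group(y_test.values, y_prob, region_test). The demo data is synthetic, with group 'B' deliberately over-predicted to show what a miscalibrated group looks like.

```python
import numpy as np
import pandas as pd

def calibration_by_group(y_true, y_prob, groups):
    """Mean predicted probability vs. observed event rate per group."""
    rows = []
    for g in np.unique(groups):
        m = groups == g
        rows.append({
            'Group': g,
            'N': int(m.sum()),
            'Mean Predicted': round(float(y_prob[m].mean()), 3),
            'Observed Rate': round(float(y_true[m].mean()), 3),
        })
    return pd.DataFrame(rows).set_index('Group')

# Synthetic demo: both groups churn at ~8%, but 'B' is scored at ~15%
rng = np.random.default_rng(1)
groups = rng.choice(['A', 'B'], 2000)
y_true = rng.binomial(1, 0.08, 2000)
y_prob = np.where(groups == 'B', 0.15, 0.08) + rng.normal(0, 0.01, 2000)
print(calibration_by_group(y_true, y_prob, groups))
```

A group whose mean predicted probability sits well above its observed rate is being over-flagged at any fixed threshold, regardless of how good the overall AUC is.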


Phase 4: The Fairness-Accuracy Tradeoff

Threshold Adjustment for Age Groups

def find_equalized_thresholds(y_true, y_prob, groups, target_tpr=0.65):
    """Find per-group thresholds for approximately equal TPR."""
    thresholds = {}
    for group in np.unique(groups):
        mask = groups == group
        if len(np.unique(y_true[mask])) < 2:
            thresholds[group] = 0.5
            continue
        fpr_arr, tpr_arr, thresh_arr = roc_curve(
            y_true[mask], y_prob[mask]
        )
        idx = np.argmin(np.abs(tpr_arr - target_tpr))
        # roc_curve's first threshold can be inf; clip to a valid probability
        thresholds[group] = float(np.clip(thresh_arr[idx], 0.0, 1.0))
    return thresholds

# Equalize TPR across age groups at ~65%
eq_thresh_age = find_equalized_thresholds(
    y_test.values, y_prob, age_test, target_tpr=0.65
)

print("Group-specific thresholds for ~65% TPR (age groups):")
for group, thresh in sorted(eq_thresh_age.items()):
    print(f"  {group:8s}  threshold={thresh:.3f}")

# Apply equalized thresholds
y_pred_eq_age = np.zeros(len(y_test), dtype=int)
for group, thresh in eq_thresh_age.items():
    mask = age_test == group
    y_pred_eq_age[mask] = (y_prob[mask] >= thresh).astype(int)

# Re-audit
print("\nAfter threshold adjustment:")
audit_age_eq = fairness_audit(
    y_test.values, y_pred_eq_age, y_prob, age_test, "Age Group (Equalized TPR)"
)

Measuring the Cost

acc_before = accuracy_score(y_test, y_pred)
acc_after = accuracy_score(y_test, y_pred_eq_age)

print(f"\nAccuracy with uniform threshold:     {acc_before:.3f}")
print(f"Accuracy with equalized thresholds:  {acc_after:.3f}")
print(f"Accuracy cost:                       {acc_before - acc_after:.3f}")

# Business metric: retention offer count
print(f"\nTotal retention offers (uniform):    {y_pred.sum()}")
print(f"Total retention offers (equalized):  {y_pred_eq_age.sum()}")

Phase 5: Should StreamFlow Care About Fairness?

This is the harder question. Metro General's fairness case is clear: different error rates across racial groups lead to inequitable healthcare. The moral imperative is obvious. StreamFlow's case is murkier.

Arguments for Fairness

  1. Legal risk. Age is a legally protected characteristic in many jurisdictions, though which statutes reach consumer pricing and retention offers varies. If retention offers constitute a benefit, and older subscribers systematically receive fewer of them, StreamFlow could face regulatory scrutiny.

  2. Business equity. A subscriber who pays $24.99/month and has been active for 3 years should not receive worse churn detection than a subscriber paying $9.99/month who signed up 2 months ago, simply because they are in a different age group.

  3. Customer trust. If subscribers learn that the company's retention efforts are unevenly distributed, it erodes trust.

Arguments Against (or for Nuance)

  1. Different base rates may justify different treatment. If younger subscribers genuinely churn at higher rates, it is rational to allocate more retention resources to them. Demographic parity would require ignoring real risk differences.

  2. The "protected attribute" question. In healthcare, race-based disparities in error rates have clear ethical weight. In subscription entertainment, age-based differences in churn prediction may be a legitimate business pattern, not a fairness violation.

  3. Cost allocation. Every retention offer costs money. If fairness constraints cause the company to send offers to lower-risk subscribers (to equalize rates), the ROI of the retention program decreases.
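The cost-allocation argument can be made concrete with a back-of-the-envelope ROI model. Every number below (offer cost, save rate, subscriber value) is a hypothetical placeholder, not a StreamFlow figure; the point is the structure: only true positives can generate retained revenue, while every flagged subscriber costs an offer.

```python
# Hypothetical economics: illustrative placeholders, not case-study figures
OFFER_COST = 5.00        # discount + outreach cost per flagged subscriber
SAVE_RATE = 0.15         # fraction of flagged true churners actually retained
SUBSCRIBER_VALUE = 90.0  # expected remaining revenue from a saved subscriber

def expected_offer_roi(n_flagged, precision):
    """Expected net value of flagging a group, given flag precision.

    Only true churners can be saved; offers sent to false positives
    are pure cost.
    """
    true_churners = n_flagged * precision
    revenue_saved = true_churners * SAVE_RATE * SUBSCRIBER_VALUE
    cost = n_flagged * OFFER_COST
    return revenue_saved - cost

# High-precision flagging pays for itself; low-precision does not
print(expected_offer_roi(n_flagged=1000, precision=0.60))  # positive
print(expected_offer_roi(n_flagged=1000, precision=0.20))  # negative
```

Plugging in each group's flag count and precision from the audit table shows whether equalizing offer rates would move spend toward groups where the program loses money.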

Practitioner Guidance --- The right answer depends on the context, the values of the organization, and the consequences of the decision. What is non-negotiable is this: you must know whether your model treats groups differently. The audit is required even if the mitigation is debated. You cannot make an informed decision about fairness without first measuring the disparity.


Phase 6: The Model Card

import json

streamflow_card = {
    "model_details": {
        "name": "StreamFlow 30-Day Churn Predictor",
        "version": "4.0 (fairness-audited)",
        "type": "XGBClassifier",
        "date": "2025-08-01",
    },
    "intended_use": {
        "primary": (
            "Weekly risk scoring of active subscribers. Subscribers above "
            "the churn threshold receive retention interventions."
        ),
        "out_of_scope": [
            "Trial subscribers (separate model)",
            "Corporate/enterprise accounts",
            "Predicting churn beyond 30 days",
        ],
    },
    "training_data": {
        "size": "50,000 active subscribers",
        "features": "13 engagement and account features",
        "target": "30-day churn (overall rate ~8.2%)",
        "demographics_available": "Age group, region (not used as features)",
    },
    "performance": {
        "overall_auc": 0.938,
        "overall_accuracy": 0.921,
    },
    "fairness_audit": {
        "age_group": {
            "base_rates": {"18-25": "~12%", "26-40": "~9%", "41-60": "~6%", "60+": "~7%"},
            "tpr_range": "varies by ~0.08 with uniform threshold",
            "fpr_range": "varies by ~0.03 with uniform threshold",
            "mitigation_applied": "None (under review by leadership)",
        },
        "region": {
            "base_rates": {"Urban": "~9%", "Suburban": "~8%", "Rural": "~7%"},
            "tpr_range": "varies by ~0.04 with uniform threshold",
            "mitigation_applied": "None (disparities below action threshold)",
        },
    },
    "limitations": [
        "Churn base rates differ by age group; a single threshold produces unequal error rates",
        "Model has not been audited for intersectional fairness (age x region)",
        "Retention offer effectiveness may differ by demographic (not measured)",
        "Fairness audit conducted on held-out data; production monitoring required",
    ],
    "ethical_considerations": [
        "Age is a legally protected class; differential treatment requires justification",
        "Fairness audit shared with Legal and Marketing leadership on 2025-07-20",
        "Decision on mitigation strategy pending (threshold adjustment vs. no change)",
        "If threshold adjustment applied, expected accuracy cost is ~1-2 percentage points",
    ],
    "next_steps": [
        "Leadership decision on age-group threshold adjustment",
        "Add fairness metrics to production monitoring dashboard (Chapter 32)",
        "Audit retention offer effectiveness by age group and region",
        "Consider intersectional audit (age x region x plan tier)",
    ],
}

print(json.dumps(streamflow_card, indent=2))
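The intersectional audit listed under next_steps needs no new machinery: combine the two attributes into one composite label per subscriber and reuse the fairness_audit function from Phase 2. A minimal sketch (the demo call uses literal arrays; in the script you would combine age_test and region_test):

```python
import numpy as np

def composite_groups(a, b, sep=' / '):
    """Combine two protected-attribute arrays into one label per row."""
    return np.array([f"{x}{sep}{y}" for x, y in zip(a, b)])

# In the case-study script:
#   age_region_test = composite_groups(age_test, region_test)
#   fairness_audit(y_test.values, y_pred, y_prob, age_region_test, "Age x Region")
# Watch the per-cell N: sparse cells (e.g. '60+ / Rural') give noisy
# estimates, and single-class cells are skipped by the audit's guard.
demo = composite_groups(np.array(['18-25', '60+']), np.array(['Urban', 'Rural']))
print(list(demo))  # ['18-25 / Urban', '60+ / Rural']
```

With 4 age groups and 3 regions, the composite audit has 12 cells; the smallest can be an order of magnitude smaller than the marginal groups, which is why checking cell counts comes before interpreting cell metrics.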

Progressive Project M12: Your Fairness Audit

Using your StreamFlow churn model from the Progressive Project, complete the following:

Required Deliverables

  1. Compute base rates for your churn target across at least two protected attributes (age group and region, as simulated above, or use your own demographic variables).

  2. Run a fairness audit on both attributes. For each group, report: base rate, positive prediction rate, TPR, FPR, precision, AUC.

  3. Compute demographic parity for both attributes. Is the positive prediction rate within 80% (disparate impact ratio) for all groups?

  4. Compute equalized odds for both attributes. Are TPR and FPR within acceptable ranges across groups?

  5. Discover the fairness-accuracy tradeoff. Apply group-specific thresholds to equalize TPR across age groups. Report the accuracy cost.

  6. Write a model card for your StreamFlow model that includes the fairness audit results, the tradeoff you found, and your recommendation.

Guiding Questions

  • Does the churn model have different base rates across age groups? If so, which group has the highest churn rate, and why might that be?
  • Which fairness metric matters most for a churn retention system? Demographic parity (equal offers) or equalized odds (equal detection quality)?
  • If you were the VP of Marketing, would you accept the accuracy cost of equalized thresholds? What additional information would you need?
  • How would you monitor fairness in production? What metrics would you track, and how often?

Lessons from StreamFlow

  1. Fairness applies to business models, not just high-stakes decisions. A churn model that allocates retention resources unequally across demographics is a fairness issue, even if the stakes are lower than healthcare or criminal justice.

  2. Different base rates create fairness tension everywhere. If younger subscribers churn more, a model that flags them more often is doing its job --- but the downstream business treatment may still be inequitable.

  3. The audit is non-negotiable, even if the mitigation is debatable. You must measure the disparity before you can decide whether to act on it. "We did not know" is not an acceptable answer when the audit takes an afternoon.

  4. Fairness decisions require business context, not just technical analysis. Whether to apply equalized thresholds depends on legal requirements, business values, customer expectations, and cost tolerance. The data scientist computes the metrics; the organization makes the decision.

  5. Model cards are not just for healthcare. Every model that makes decisions about people --- including which customers get a discount offer --- should have a model card that documents its fairness properties.


This case study accompanies Chapter 33: Fairness, Bias, and Responsible ML. Return to the chapter for full context.