Case Study 1: StreamFlow Regional Churn Choropleth


Background

StreamFlow, the SaaS company whose churn data has been the progressive example since Chapter 1, has a problem that spreadsheets cannot solve. The Customer Success team reports that churn is "high in the South and Southwest" --- but this is based on anecdote, not analysis. The VP of Customer Success wants answers to three questions:

  1. Which states have the highest churn rates, and is the pattern statistically meaningful or just noise?
  2. Does the geographic pattern persist after controlling for plan mix, tenure, and revenue?
  3. What actionable hypotheses explain the geographic variation?

The task: Build a state-level churn analysis with choropleth visualization, test whether geography adds predictive power to the churn model, and deliver three testable hypotheses for the regional pattern.

This case study applies the full geospatial toolkit from this chapter: spatial aggregation, choropleth mapping, spatial feature engineering, and model comparison.


The Data

import numpy as np
import pandas as pd
import geopandas as gpd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score, classification_report
import matplotlib.pyplot as plt

np.random.seed(42)

# --- StreamFlow customer base: 8,000 customers across 30 states ---
states_info = {
    'California':     {'abbr': 'CA', 'region': 'West',      'base_churn': 0.11, 'pop_weight': 0.14},
    'Texas':          {'abbr': 'TX', 'region': 'South',     'base_churn': 0.19, 'pop_weight': 0.11},
    'New York':       {'abbr': 'NY', 'region': 'Northeast', 'base_churn': 0.10, 'pop_weight': 0.09},
    'Florida':        {'abbr': 'FL', 'region': 'South',     'base_churn': 0.23, 'pop_weight': 0.08},
    'Illinois':       {'abbr': 'IL', 'region': 'Midwest',   'base_churn': 0.15, 'pop_weight': 0.06},
    'Pennsylvania':   {'abbr': 'PA', 'region': 'Northeast', 'base_churn': 0.13, 'pop_weight': 0.05},
    'Ohio':           {'abbr': 'OH', 'region': 'Midwest',   'base_churn': 0.20, 'pop_weight': 0.04},
    'Georgia':        {'abbr': 'GA', 'region': 'South',     'base_churn': 0.21, 'pop_weight': 0.04},
    'North Carolina': {'abbr': 'NC', 'region': 'South',     'base_churn': 0.18, 'pop_weight': 0.04},
    'Michigan':       {'abbr': 'MI', 'region': 'Midwest',   'base_churn': 0.19, 'pop_weight': 0.03},
    'New Jersey':     {'abbr': 'NJ', 'region': 'Northeast', 'base_churn': 0.11, 'pop_weight': 0.03},
    'Virginia':       {'abbr': 'VA', 'region': 'South',     'base_churn': 0.14, 'pop_weight': 0.03},
    'Washington':     {'abbr': 'WA', 'region': 'West',      'base_churn': 0.09, 'pop_weight': 0.03},
    'Arizona':        {'abbr': 'AZ', 'region': 'West',      'base_churn': 0.24, 'pop_weight': 0.03},
    'Massachusetts':  {'abbr': 'MA', 'region': 'Northeast', 'base_churn': 0.09, 'pop_weight': 0.03},
    'Tennessee':      {'abbr': 'TN', 'region': 'South',     'base_churn': 0.20, 'pop_weight': 0.02},
    'Indiana':        {'abbr': 'IN', 'region': 'Midwest',   'base_churn': 0.19, 'pop_weight': 0.02},
    'Missouri':       {'abbr': 'MO', 'region': 'Midwest',   'base_churn': 0.17, 'pop_weight': 0.02},
    'Maryland':       {'abbr': 'MD', 'region': 'Northeast', 'base_churn': 0.12, 'pop_weight': 0.02},
    'Colorado':       {'abbr': 'CO', 'region': 'West',      'base_churn': 0.10, 'pop_weight': 0.02},
    'Wisconsin':      {'abbr': 'WI', 'region': 'Midwest',   'base_churn': 0.16, 'pop_weight': 0.01},
    'Minnesota':      {'abbr': 'MN', 'region': 'Midwest',   'base_churn': 0.14, 'pop_weight': 0.01},
    'South Carolina': {'abbr': 'SC', 'region': 'South',     'base_churn': 0.22, 'pop_weight': 0.01},
    'Alabama':        {'abbr': 'AL', 'region': 'South',     'base_churn': 0.23, 'pop_weight': 0.01},
    'Louisiana':      {'abbr': 'LA', 'region': 'South',     'base_churn': 0.21, 'pop_weight': 0.01},
    'Kentucky':       {'abbr': 'KY', 'region': 'South',     'base_churn': 0.20, 'pop_weight': 0.01},
    'Oregon':         {'abbr': 'OR', 'region': 'West',      'base_churn': 0.10, 'pop_weight': 0.01},
    'Oklahoma':       {'abbr': 'OK', 'region': 'South',     'base_churn': 0.19, 'pop_weight': 0.01},
    'Nevada':         {'abbr': 'NV', 'region': 'West',      'base_churn': 0.16, 'pop_weight': 0.01},
    'Iowa':           {'abbr': 'IA', 'region': 'Midwest',   'base_churn': 0.15, 'pop_weight': 0.01},
}

n_customers = 8000
state_names = list(states_info.keys())
weights = [states_info[s]['pop_weight'] for s in state_names]
weights = np.array(weights) / sum(weights)  # Normalize

customer_states = np.random.choice(state_names, n_customers, p=weights)

# Generate customer features with state-level variation
records = []
for i in range(n_customers):
    state = customer_states[i]
    info = states_info[state]

    plan = np.random.choice(
        ['basic', 'pro', 'enterprise'],
        p=[0.50, 0.35, 0.15]
    )

    # Tenure: slightly lower in high-churn states
    tenure = int(np.random.exponential(18 - info['base_churn'] * 20))
    tenure = max(1, min(tenure, 72))

    # Revenue: correlated with plan
    plan_multiplier = {'basic': 1.0, 'pro': 2.2, 'enterprise': 5.0}[plan]
    revenue = round(np.random.lognormal(3.5, 0.5) * plan_multiplier, 2)

    # Support tickets: higher in high-churn states
    tickets = np.random.poisson(1.5 + info['base_churn'] * 10)

    # Churn: base rate modified by individual factors
    churn_prob = info['base_churn']
    if plan == 'enterprise':
        churn_prob *= 0.5  # Enterprise customers churn less
    if tenure > 24:
        churn_prob *= 0.7  # Long-tenured customers churn less
    if tickets > 5:
        churn_prob *= 1.5  # High-ticket customers churn more
    churn_prob = min(churn_prob, 0.6)

    churned = np.random.binomial(1, churn_prob)

    records.append({
        'customer_id': i,
        'state': state,
        'region': info['region'],
        'plan': plan,
        'tenure_months': tenure,
        'monthly_revenue': revenue,
        'support_tickets': tickets,
        'churned': churned,
    })

df = pd.DataFrame(records)
print(f"Dataset: {len(df)} customers across {df['state'].nunique()} states")
print(f"Overall churn rate: {df['churned'].mean():.3f}")
print(f"\nPlan distribution:\n{df['plan'].value_counts(normalize=True).round(3)}")

Step 1: State-Level Churn Analysis

# Aggregate churn by state
state_churn = (
    df.groupby('state')
    .agg(
        n_customers=('customer_id', 'count'),
        churn_rate=('churned', 'mean'),
        avg_revenue=('monthly_revenue', 'mean'),
        avg_tenure=('tenure_months', 'mean'),
        avg_tickets=('support_tickets', 'mean'),
        pct_enterprise=('plan', lambda x: (x == 'enterprise').mean()),
    )
    .round(3)
    .sort_values('churn_rate', ascending=False)
)

print("Churn by state (sorted):")
print(state_churn)

# Regional aggregation
region_churn = (
    df.groupby('region')
    .agg(
        n_customers=('customer_id', 'count'),
        churn_rate=('churned', 'mean'),
        avg_revenue=('monthly_revenue', 'mean'),
        avg_tenure=('tenure_months', 'mean'),
    )
    .round(3)
    .sort_values('churn_rate', ascending=False)
)

print("\nChurn by region:")
print(region_churn)

The pattern is clear: the South has the highest churn rate, followed by the Midwest. The West and Northeast have substantially lower churn. But is this just plan mix?

# Control for plan: churn by region and plan
region_plan_churn = (
    df.groupby(['region', 'plan'])
    .agg(
        n=('customer_id', 'count'),
        churn_rate=('churned', 'mean'),
    )
    .round(3)
)

print("\nChurn by region and plan:")
print(region_plan_churn.unstack(level='plan'))

The geographic pattern persists within each plan tier. Basic plan customers in the South churn at a higher rate than basic plan customers in the Northeast. The same is true for pro and enterprise tiers. This rules out plan mix as the sole explanation.
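The VP's first question asks whether the pattern is statistically meaningful or just noise. A quick screen is a chi-squared test of independence on the region-by-churn contingency table. The sketch below uses illustrative counts; in the case study, build the real table with `pd.crosstab(df['region'], df['churned'])`:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative region-by-churn contingency table
# (rows: regions, columns: [retained, churned]).
# In the case study: table = pd.crosstab(df['region'], df['churned'])
table = np.array([
    [1400, 190],   # Northeast
    [1300, 240],   # Midwest
    [2100, 520],   # South
    [1800, 200],   # West
])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.2e}")
# A small p-value says region and churn are not independent; it does
# not say *why* -- that is what the remaining steps investigate.
```

With 8,000 customers, even modest regional gaps will register as significant, so treat this as a screen against pure noise rather than a measure of effect size.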


Step 2: The Choropleth Map

import folium

# Prepare data for choropleth
state_churn_map = state_churn.reset_index()[['state', 'churn_rate']]

# Create the map
m = folium.Map(location=[39.8, -98.5], zoom_start=4, tiles='cartodbpositron')

# US states GeoJSON
us_states_url = (
    'https://raw.githubusercontent.com/python-visualization/'
    'folium/main/tests/us-states.json'
)

# Choropleth layer
choropleth = folium.Choropleth(
    geo_data=us_states_url,
    data=state_churn_map,
    columns=['state', 'churn_rate'],
    key_on='feature.properties.name',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.3,
    legend_name='Churn Rate',
    nan_fill_color='#f0f0f0',
    nan_fill_opacity=0.4,
).add_to(m)

# Add tooltips with state name and churn rate
choropleth.geojson.add_child(
    folium.features.GeoJsonTooltip(
        fields=['name'],
        aliases=['State:'],
        style='font-size: 12px;',
    )
)

# Add StreamFlow data center locations
datacenters = [
    {'name': 'DC-East', 'lat': 39.0, 'lon': -77.5},
    {'name': 'DC-West', 'lat': 37.4, 'lon': -122.1},
    {'name': 'DC-Central', 'lat': 41.9, 'lon': -87.6},
]

for dc in datacenters:
    folium.Marker(
        location=[dc['lat'], dc['lon']],
        popup=f"StreamFlow {dc['name']}",
        tooltip=dc['name'],
        icon=folium.Icon(color='blue', icon='server', prefix='fa'),
    ).add_to(m)

m.save('streamflow_churn_choropleth.html')
print("Choropleth saved to streamflow_churn_choropleth.html")

The map reveals what the table obscured: a clear geographic gradient. The South and Southwest form a contiguous high-churn zone. The Northeast and Pacific Northwest form a low-churn zone. The data centers are clustered in the low-churn regions. This is a visual hypothesis --- distance to data center may explain part of the churn pattern.

Practical Tip --- When presenting a choropleth to stakeholders, always include a color legend with actual values, use no more than 5-7 color bins, and choose a color scale that is colorblind-safe. The 'YlOrRd' scale in folium is a good default for "low-to-high risk" visualizations.
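folium's Choropleth accepts a bins argument, so the tip above can be made concrete by computing explicit quantile-based bin edges, which put roughly the same number of states in each color class. A sketch with illustrative churn rates (in the case study, use state_churn['churn_rate'].values):

```python
import numpy as np

# Illustrative state-level churn rates; in the case study, use
# state_churn['churn_rate'].values instead.
churn_rates = np.array([0.09, 0.10, 0.11, 0.13, 0.15, 0.17,
                        0.19, 0.20, 0.21, 0.23, 0.24])

# Six bins -> seven edges at evenly spaced quantiles, so each color
# class holds roughly the same number of states.
edges = np.quantile(churn_rates, np.linspace(0, 1, 7)).round(3)
print("Choropleth bin edges:", list(edges))
# Pass as: folium.Choropleth(..., bins=list(edges))
```

Quantile bins guard against one outlier state washing out the color contrast among the rest; equal-width bins are the better choice when the absolute churn values themselves carry meaning for stakeholders.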


Step 3: Spatial Feature Engineering

# Feature 1: Distance to nearest data center
# Using approximate state centroids
state_centroids = {
    'California': (-119.4, 36.8), 'Texas': (-99.9, 31.9),
    'New York': (-75.5, 43.0), 'Florida': (-81.5, 27.6),
    'Illinois': (-89.4, 40.6), 'Pennsylvania': (-77.2, 41.2),
    'Ohio': (-82.9, 40.4), 'Georgia': (-83.5, 32.9),
    'North Carolina': (-79.0, 35.8), 'Michigan': (-84.5, 44.3),
    'New Jersey': (-74.4, 40.1), 'Virginia': (-78.2, 37.8),
    'Washington': (-120.7, 47.8), 'Arizona': (-111.1, 34.0),
    'Massachusetts': (-71.4, 42.4), 'Tennessee': (-86.6, 35.5),
    'Indiana': (-86.1, 40.3), 'Missouri': (-91.8, 38.6),
    'Maryland': (-76.6, 39.0), 'Colorado': (-105.3, 39.6),
    'Wisconsin': (-89.6, 43.8), 'Minnesota': (-94.7, 46.7),
    'South Carolina': (-81.2, 34.0), 'Alabama': (-86.9, 32.3),
    'Louisiana': (-91.9, 30.9), 'Kentucky': (-84.3, 37.8),
    'Oregon': (-120.6, 43.8), 'Oklahoma': (-97.1, 35.0),
    'Nevada': (-116.4, 38.8), 'Iowa': (-93.1, 42.0),
}

dc_coords = [(-77.5, 39.0), (-122.1, 37.4), (-87.6, 41.9)]

from math import radians, sin, cos, sqrt, atan2

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km."""
    R = 6371
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    return R * 2 * atan2(sqrt(a), sqrt(1 - a))

state_dc_distances = {}
for state, (slon, slat) in state_centroids.items():
    min_dist = min(
        haversine(slon, slat, dclon, dclat)
        for dclon, dclat in dc_coords
    )
    state_dc_distances[state] = round(min_dist, 0)

df['dist_nearest_dc_km'] = df['state'].map(state_dc_distances)

# Feature 2: Regional churn rate (leave-one-out to avoid leakage)
# For each customer, compute the churn rate of their state excluding themselves
state_counts = df.groupby('state')['churned'].agg(['sum', 'count'])

df = df.merge(state_counts, left_on='state', right_index=True)
df['state_churn_loo'] = (
    (df['sum'] - df['churned']) / (df['count'] - 1)
).round(4)
df = df.drop(columns=['sum', 'count'])

# Feature 3: Region encoded as ordinal by base churn rate
region_order = {'West': 0, 'Northeast': 1, 'Midwest': 2, 'South': 3}
df['region_encoded'] = df['region'].map(region_order)

print("Spatial features added:")
print(df[['state', 'dist_nearest_dc_km', 'state_churn_loo', 'region_encoded']].head(10))

# Quick validation: correlation between distance and churn
state_dist_churn = df.groupby('state').agg(
    churn_rate=('churned', 'mean'),
    dist_dc=('dist_nearest_dc_km', 'first'),
).round(3)

correlation = state_dist_churn['churn_rate'].corr(state_dist_churn['dist_dc'])
print(f"\nCorrelation between state churn rate and distance to DC: {correlation:.3f}")
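Great-circle code is easy to get subtly wrong (degrees vs. radians, swapped lon/lat argument order), so a cheap sanity check against a known distance is worth keeping next to it. A standalone sketch that restates the haversine function above and checks it against one degree of longitude at the equator, which is about 111.2 km:

```python
from math import radians, sin, cos, sqrt, atan2

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km (same formula as in the case study)."""
    R = 6371
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    return R * 2 * atan2(sqrt(a), sqrt(1 - a))

# One degree of longitude at the equator: 2*pi*6371/360, about 111.19 km
d = haversine(0.0, 0.0, 1.0, 0.0)
assert abs(d - 111.19) < 0.1, d

# Same point -> zero distance; argument order (lon, lat) matters.
assert haversine(-77.5, 39.0, -77.5, 39.0) == 0.0
print(f"1 deg longitude at equator: {d:.2f} km")
```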

Step 4: Model Comparison

# Encode plan (LabelEncoder assigns codes alphabetically: basic=0,
# enterprise=1, pro=2 -- harmless for tree models, which split on
# category values rather than their order)
le = LabelEncoder()
df['plan_encoded'] = le.fit_transform(df['plan'])

# Define feature sets
base_features = ['plan_encoded', 'tenure_months', 'monthly_revenue', 'support_tickets']
geo_features = base_features + ['dist_nearest_dc_km', 'state_churn_loo', 'region_encoded']

X_base = df[base_features]
X_geo = df[geo_features]
y = df['churned']

# Train-test split (stratified)
X_base_train, X_base_test, y_train, y_test = train_test_split(
    X_base, y, test_size=0.2, random_state=42, stratify=y
)
X_geo_train = df.loc[X_base_train.index, geo_features]
X_geo_test = df.loc[X_base_test.index, geo_features]

# Model 1: No spatial features
model_base = GradientBoostingClassifier(
    n_estimators=200, max_depth=4, learning_rate=0.1,
    subsample=0.8, random_state=42
)
scores_base = cross_val_score(
    model_base, X_base_train, y_train, cv=5, scoring='roc_auc'
)

# Model 2: With spatial features
model_geo = GradientBoostingClassifier(
    n_estimators=200, max_depth=4, learning_rate=0.1,
    subsample=0.8, random_state=42
)
scores_geo = cross_val_score(
    model_geo, X_geo_train, y_train, cv=5, scoring='roc_auc'
)

print("Cross-Validated AUC (5-fold):")
print(f"  Base model (no geo):   {scores_base.mean():.4f} +/- {scores_base.std():.4f}")
print(f"  Geo model (with geo):  {scores_geo.mean():.4f} +/- {scores_geo.std():.4f}")
print(f"  Improvement: {(scores_geo.mean() - scores_base.mean()) * 100:.2f} AUC points")

# Fit final models and evaluate on test set
model_base.fit(X_base_train, y_train)
model_geo.fit(X_geo_train, y_train)

auc_base = roc_auc_score(y_test, model_base.predict_proba(X_base_test)[:, 1])
auc_geo = roc_auc_score(y_test, model_geo.predict_proba(X_geo_test)[:, 1])

print(f"\nTest Set AUC:")
print(f"  Base: {auc_base:.4f}")
print(f"  Geo:  {auc_geo:.4f}")

# Feature importance
importances = pd.Series(
    model_geo.feature_importances_,
    index=geo_features
).sort_values(ascending=False)

print("\nFeature Importances (geo model):")
for feat, imp in importances.items():
    print(f"  {feat:25s} {imp:.4f}")
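With only five folds, the mean AUC gap should be read against fold-to-fold noise before calling it real. A minimal sketch of a paired t-test on fold scores, using illustrative values (substitute the scores_base and scores_geo arrays from the comparison above; with n=5 folds, treat the p-value as a rough screen, not a verdict):

```python
import numpy as np
from scipy.stats import ttest_rel

# Illustrative 5-fold AUCs; in the case study, use the actual
# scores_base and scores_geo arrays from cross_val_score.
scores_base = np.array([0.681, 0.695, 0.672, 0.688, 0.679])
scores_geo  = np.array([0.703, 0.714, 0.691, 0.709, 0.698])

# Paired test: each fold sees the same rows under both models,
# so we compare per-fold differences rather than pooled means.
diffs = scores_geo - scores_base
t_stat, p_value = ttest_rel(scores_geo, scores_base)
print(f"mean AUC gain = {diffs.mean():.4f}, t = {t_stat:.2f}, p = {p_value:.3f}")
```

A consistent per-fold gain (all differences positive, small spread) is more persuasive than the pooled means alone, because it shows the spatial features help on every resample rather than on one lucky split.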

Step 5: Hypothesis Generation

The analysis reveals a clear geographic churn pattern that persists after controlling for plan, tenure, revenue, and support tickets. Three testable hypotheses:

Hypothesis 1: Network latency. StreamFlow's three data centers are in the DC area, Bay Area, and Chicago. Customers in the South and Southwest are farthest from all three. If StreamFlow is a performance-sensitive application (video conferencing, real-time collaboration), latency could degrade the user experience and drive churn. Test: Correlate per-customer average latency (from application logs) with churn. If latency explains the geographic pattern, adding a data center in Dallas or Atlanta should reduce Southern churn.

Hypothesis 2: Competitive landscape. A regional competitor may have stronger presence in the South. Customers in those states have more alternatives and lower switching costs. Test: Survey churned customers about their reason for leaving. Cross-reference with competitor presence data (sales team intelligence, job postings, press releases). If competitor presence explains the pattern, the response is sales/retention, not infrastructure.

Hypothesis 3: Customer acquisition channel. StreamFlow may acquire Southern customers through different channels (paid ads vs. organic, different reseller partners) that attract lower-intent users. These customers were never a good fit. Test: Segment churn by acquisition channel within each region. If channel explains the gap, the fix is marketing targeting, not product or infrastructure.
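Hypothesis 1 can be pre-screened with data StreamFlow already logs. The sketch below assumes a hypothetical avg_latency_ms column joined from application logs onto the customer table; the column name and values are simulated purely for illustration:

```python
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(0)

# Hypothetical per-customer data: in practice, join avg_latency_ms
# from application logs onto the customer table by customer_id.
n = 500
avg_latency_ms = rng.normal(80, 25, n).clip(10, None)
# Simulated churn mildly increasing with latency, for illustration only.
churned = rng.binomial(1, np.clip(0.05 + avg_latency_ms / 1000, 0, 1))

# Point-biserial correlation: binary churn vs. continuous latency.
r, p_value = pointbiserialr(churned, avg_latency_ms)
print(f"point-biserial r = {r:.3f}, p = {p_value:.3f}")
# A positive r consistent with the geographic gradient supports
# Hypothesis 1; the decisive test remains the new-data-center experiment.
```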

Key Insight --- The choropleth does not explain why churn varies by geography. It reveals that it does. The map is the starting point for investigation, not the conclusion. The three hypotheses above are testable, actionable, and not mutually exclusive --- all three could contribute to the pattern.


Discussion Questions

  1. The leave-one-out state churn rate is the most important spatial feature. Is this feature truly "spatial," or is it a proxy for something else entirely? What would happen if you replaced state names with arbitrary labels?

  2. The choropleth uses state boundaries as the unit of analysis. What information is lost by aggregating to the state level? How would a ZIP-code-level or county-level analysis differ?

  3. If StreamFlow opens a new data center in Atlanta, how would you design an experiment to measure its impact on Southern churn? What confounds would you need to control for?

  4. The model improvement from spatial features (a few AUC points) may seem small. Under what business conditions would this improvement be worth acting on? How would you translate AUC improvement into revenue impact?

  5. A colleague argues that the geographic pattern is entirely explained by broadband infrastructure quality, which varies by state. How would you test this hypothesis? What public data sources would you use?

