In This Chapter
- Learning Objectives
- 27.1 Not All Customers Are Equal — And That's the Point
- 27.2 Customer Lifetime Value: The Foundational Metric
- 27.3 RFM Analysis: Scoring Customers on Three Dimensions
- 27.4 Cohort Analysis: Tracking Customer Groups Over Time
- 27.5 Customer Segmentation with K-Means Clustering
- 27.6 Churn Indicators: Early Warning Signs
- 27.7 Building a Customer Health Score
- 27.8 Bringing It Together: A Complete Customer Analytics Workflow
- Chapter Summary
Chapter 27: Customer Analytics and Segmentation
Learning Objectives
By the end of this chapter, you will be able to:
- Explain why customer analytics is foundational to sustainable revenue growth
- Calculate Customer Lifetime Value (CLV) using both simple and cohort-adjusted models
- Build a complete RFM (Recency, Frequency, Monetary) scoring system in pandas
- Assign customers to named segments (Champions, Loyal, At Risk, Lost) and prescribe business actions for each
- Construct an acquisition cohort table and visualize retention rates as a heatmap
- Apply K-means clustering to customer data for behavioral segmentation
- Identify early churn indicators from behavioral patterns
- Build a composite customer health score that flags accounts needing attention
27.1 Not All Customers Are Equal — And That's the Point
Sandra Chen has been VP of Sales at Acme Corp for four years. In that time, she has watched her team pour equal energy into every account: answering support calls from clients who spend $800 a year and taking clients who spend $80,000 a year for granted because "they always renew." Last quarter, one of those large accounts — a facilities management company that had been ordering office supplies from Acme for six years — quietly shifted most of their business to a competitor. Nobody noticed until the renewal conversation failed.
"We didn't see it coming," Sandra told Priya Okonkwo, Acme's junior analyst, after the post-mortem. "They just… went quiet. And we were busy chasing new logos."
This is the core problem that customer analytics is designed to solve. Not all customers contribute equally to your business. A small fraction of your customers typically generate the majority of your revenue. Some customers have been loyal for years; others placed one order eighteen months ago and have never returned. Some are growing their spend; others are quietly winding down. If you treat all of them the same, you will systematically under-invest in your most valuable relationships and fail to catch the ones that are slipping away.
Customer analytics gives you the tools to see this clearly. It answers questions like:
- Which customers are most valuable, and what do they have in common?
- Which customers used to be valuable but are showing signs of disengagement?
- How long, on average, does it take a customer to make their second purchase?
- If I acquired 200 customers in January, how many are still buying in month six, month twelve?
- Are there distinct behavioral groups in my customer base that should receive different offers, messaging, or service levels?
These are not academic questions. They drive decisions about where to focus sales attention, how to allocate marketing budgets, which accounts to flag for executive relationship-building, and which offers to make to whom. Done well, customer analytics transforms reactive account management into proactive customer strategy.
27.2 Customer Lifetime Value: The Foundational Metric
Customer Lifetime Value (CLV) is the total revenue — or more precisely, the total profit — you can expect to receive from a customer over the entire duration of your relationship. It is the single most important number in customer analytics because it tells you what a customer is actually worth, which in turn tells you how much you can afford to spend acquiring a similar customer and how hard you should work to retain an existing one.
27.2.1 The Simple CLV Model
The simplest version of CLV looks like this:
CLV = Average Purchase Value × Purchase Frequency × Customer Lifespan
In Python:
# Simple CLV calculation
average_purchase_value = 450.00 # average order value in dollars
purchase_frequency = 8.5 # average orders per year
customer_lifespan_years = 3.2 # average years before churning
simple_clv = average_purchase_value * purchase_frequency * customer_lifespan_years
print(f"Simple CLV: ${simple_clv:,.2f}")
# Simple CLV: $12,240.00
This model is easy to explain and useful for a first approximation. If your simple CLV is $12,240 and you are spending $3,000 to acquire a new customer, your payback period is reasonable. If you are spending $15,000, you have a problem.
The limitation is that simple CLV treats all customers as identical. Your actual customer base has a distribution — some customers buy once and leave, others buy monthly for a decade. Averaging across them can mask important patterns.
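The acquisition comparison above can be made explicit as a CLV-to-CAC ratio. A minimal sketch, using the chapter's $12,240 CLV and the hypothetical $3,000 acquisition cost from the text; the 3:1 threshold is a common rule of thumb, not a law:

```python
# CLV-to-CAC ratio: a common heuristic is to aim for roughly 3:1 or better
simple_clv = 12_240.00        # simple CLV from the calculation above
acquisition_cost = 3_000.00   # hypothetical customer acquisition cost (CAC)

clv_to_cac = simple_clv / acquisition_cost
print(f"CLV-to-CAC ratio: {clv_to_cac:.1f}:1")

if clv_to_cac < 1:
    print("You lose money on every customer you acquire.")
elif clv_to_cac < 3:
    print("Acquisition is profitable, but margins are thin.")
else:
    print("Healthy acquisition economics.")
```

At $15,000 CAC the same ratio falls to roughly 0.8:1, which is the "you have a problem" case described above.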
27.2.2 Customer-Level CLV in pandas
A more useful approach computes CLV at the individual customer level using your transaction history:
import pandas as pd
import numpy as np
# Assume we have a transaction DataFrame with these columns:
# customer_id, order_date, order_value
# For this example, we generate synthetic data
np.random.seed(42)
n_customers = 500
n_transactions = 3000
transaction_data = pd.DataFrame({
"customer_id": np.random.choice(range(1, n_customers + 1), n_transactions),
"order_date": pd.date_range("2022-01-01", periods=n_transactions, freq="4h"),
"order_value": np.random.exponential(scale=480, size=n_transactions).round(2),
})
# Compute per-customer summary metrics
customer_summary = (
transaction_data
.groupby("customer_id")
.agg(
total_revenue=("order_value", "sum"),
order_count=("order_value", "count"),
first_purchase=("order_date", "min"),
last_purchase=("order_date", "max"),
avg_order_value=("order_value", "mean"),
)
.reset_index()
)
# Calculate customer tenure in years
reference_date = pd.Timestamp("2024-01-01")
customer_summary["tenure_years"] = (
(reference_date - customer_summary["first_purchase"]).dt.days / 365.25
)
# Annualized revenue (revenue per year of tenure)
customer_summary["annual_revenue"] = (
customer_summary["total_revenue"] / customer_summary["tenure_years"]
)
# Simple projected CLV: annualized revenue × expected additional lifespan
# We assume a 3-year expected remaining lifespan for this example
expected_remaining_years = 3.0
customer_summary["projected_clv"] = (
customer_summary["annual_revenue"] * expected_remaining_years
)
print(customer_summary.sort_values("projected_clv", ascending=False).head(10))
This gives you a ranked list of customers by projected value — your VIP list. The customers at the top deserve your most attentive service. The customers near the bottom may be candidates for lower-cost service models.
27.2.3 Margin-Adjusted CLV
Revenue is not profit. A customer who buys high-margin products is more valuable than a customer who buys the same dollar amount of low-margin products. If you have margin data by product or order, the most accurate CLV uses gross margin rather than revenue:
# If you have a margin column (as a decimal, e.g., 0.35 = 35% gross margin)
customer_summary["gross_margin_rate"] = 0.38 # simplified; use actual per-order margins
customer_summary["profit_clv"] = (
customer_summary["projected_clv"] * customer_summary["gross_margin_rate"]
)
For this chapter, we will work with revenue-based CLV for simplicity. The principle is the same.
27.3 RFM Analysis: Scoring Customers on Three Dimensions
RFM analysis is a proven customer scoring framework used by direct marketers, e-commerce companies, and B2B sales teams alike. It scores every customer on three dimensions:
- Recency (R): How recently did they last buy? A customer who bought yesterday is more engaged than one who bought two years ago.
- Frequency (F): How often do they buy? A customer who orders every month is more loyal than one who orders once a year.
- Monetary (M): How much do they spend? A customer who spends $50,000 annually matters more than one who spends $500.
The power of RFM is not in any single score — it is in the combination. A customer with high frequency and high monetary value but low recency (they used to buy a lot, but they've gone quiet) is a very different business problem than a customer with high recency but low frequency and monetary value (they're new and enthusiastic but haven't fully committed yet). RFM lets you see these distinctions and act on them.
27.3.1 Calculating R, F, and M in pandas
import pandas as pd
import numpy as np
# Reference date: the day we run the analysis
# In production, use pd.Timestamp.today()
analysis_date = pd.Timestamp("2024-01-15")
# Compute raw RFM metrics
rfm_raw = (
transaction_data
.groupby("customer_id")
.agg(
last_purchase_date=("order_date", "max"),
frequency=("order_value", "count"),
monetary=("order_value", "sum"),
)
.reset_index()
)
# Recency: days since last purchase (lower = more recent = better)
rfm_raw["recency_days"] = (
analysis_date - rfm_raw["last_purchase_date"]
).dt.days
print(rfm_raw.head())
print(f"\nMedian recency: {rfm_raw['recency_days'].median():.0f} days")
print(f"Median frequency: {rfm_raw['frequency'].median():.1f} orders")
print(f"Median monetary: ${rfm_raw['monetary'].median():,.2f}")
27.3.2 Scoring and Binning: Turning Raw Numbers into Scores
Raw RFM values are not directly comparable — frequency counts and dollar amounts live on completely different scales. We need to convert them into comparable scores, typically on a 1–5 scale.
The standard approach uses quintile scoring: we divide the customer base into five equal groups based on each metric and assign a score of 1–5.
# Score each dimension on a 1-5 scale using quintiles
# For recency: lower days = more recent = higher score (so we invert)
# For frequency and monetary: higher = better = higher score
# Note: pd.qcut raises an error if duplicate bin edges are dropped while an
# explicit label list is supplied (the label count no longer matches the bins).
# Ranking first with method="first" breaks ties and guarantees five clean
# quintile edges, which is the standard workaround.
rfm_raw["r_score"] = pd.qcut(
    rfm_raw["recency_days"].rank(method="first"),
    q=5,
    labels=[5, 4, 3, 2, 1],  # inverted: fewer days = score 5
).astype(int)
rfm_raw["f_score"] = pd.qcut(
    rfm_raw["frequency"].rank(method="first"),
    q=5,
    labels=[1, 2, 3, 4, 5],
).astype(int)
rfm_raw["m_score"] = pd.qcut(
    rfm_raw["monetary"].rank(method="first"),
    q=5,
    labels=[1, 2, 3, 4, 5],
).astype(int)
# Composite RFM score: simple average (or weighted)
rfm_raw["rfm_score"] = (
rfm_raw["r_score"] + rfm_raw["f_score"] + rfm_raw["m_score"]
) / 3.0
# Concatenated score string: useful for segment lookup
rfm_raw["rfm_cell"] = (
rfm_raw["r_score"].astype(str)
+ rfm_raw["f_score"].astype(str)
+ rfm_raw["m_score"].astype(str)
)
print(rfm_raw[["customer_id", "recency_days", "frequency", "monetary",
"r_score", "f_score", "m_score", "rfm_score"]].head(10))
27.3.3 Defining RFM Segments
With scores in hand, we assign customers to named segments. These names carry business meaning — they tell a story about the customer's relationship with your company and immediately suggest what action to take.
def assign_rfm_segment(row):
    """
    Assign a business segment label based on RFM scores.
    Rules are directional guidelines, not rigid algorithms.
    Adjust thresholds to fit your specific business context.
    Note: order matters in an if/elif chain -- more specific rules
    must come before broader ones, or they will never be reached.
    """
    r = row["r_score"]
    f = row["f_score"]
    m = row["m_score"]
    if r >= 4 and f >= 4 and m >= 4:
        return "Champions"
    elif r <= 2 and f >= 4 and m >= 4:
        return "Cannot Lose Them"
    elif r <= 2 and f >= 3 and m >= 3:
        return "At Risk"
    elif r >= 3 and f >= 3 and m >= 3:
        return "Loyal Customers"
    elif r >= 4 and f == 1 and m == 1:
        return "New Customers"
    elif r >= 4 and f <= 2:
        return "Recent Customers"
    elif r >= 3 and f >= 2 and m >= 2:
        return "Potential Loyalists"
    elif r == 3 and f <= 2:
        return "Promising"
    elif r == 2 and f <= 2:
        return "Hibernating"
    elif r == 1 and f == 1:
        return "Lost"
    else:
        return "Needs Attention"
rfm_raw["segment"] = rfm_raw.apply(assign_rfm_segment, axis=1)
# Segment summary
segment_summary = (
rfm_raw
.groupby("segment")
.agg(
customer_count=("customer_id", "count"),
avg_recency_days=("recency_days", "mean"),
avg_frequency=("frequency", "mean"),
avg_monetary=("monetary", "mean"),
total_revenue=("monetary", "sum"),
)
.round(2)
.sort_values("total_revenue", ascending=False)
.reset_index()
)
print(segment_summary.to_string(index=False))
27.3.4 Business Actions for Each Segment
The real value of RFM segmentation is not the math — it is the clarity it provides about what to do next. Here is how to think about each segment:
Champions (R=4-5, F=4-5, M=4-5): These customers buy often, recently, and spend heavily. They are your ambassadors. Actions: Reward them with early access to new products. Ask for referrals and testimonials. Give them your best service tier. Do not over-contact them — they are already engaged.
Loyal Customers (R=3-5, F=3-5, M=3-5): Solid, consistent buyers who are not quite at Champion level. Actions: Offer a loyalty program or tiered discount structure. Upsell to adjacent product categories. Provide proactive account reviews.
At Risk (R=1-2, F=3-5, M=3-5): This is your most urgent segment. These customers used to buy frequently and spend well — they have high lifetime value — but they have gone quiet. Something changed. Actions: Personal outreach from a senior account manager or the account's salesperson. A "we've missed you" offer. A direct conversation to understand if there is a problem.
Cannot Lose Them (R=1-2, F=4-5, M=4-5): Like "At Risk" but even more urgent — these are your highest-value customers who are disengaging. Actions: Escalate to executive level. Offer to visit in person. Understand the root cause immediately.
New Customers (R=4-5, F=1, M=1): Bought recently but only once. Actions: Onboarding sequence. Second-purchase incentive. Educational content about your product range.
Potential Loyalists (R=3-5, F=2-3, M=2-3): Showing positive signals but haven't fully committed. Actions: A loyalty program invitation. Category expansion offers. Proactive check-in.
Hibernating (R=2, F=1-2, M=1-2): Low engagement, haven't bought recently. Low value historically. Actions: Low-cost re-engagement (email campaign). If no response, move resources elsewhere.
Lost (R=1, F=1, M=1): Bought once (or very infrequently), a long time ago. Actions: One last win-back attempt via email. If no response, archive. Do not spend significant resources here.
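The segment-to-action mapping above can be operationalized as a lookup table joined onto the scored customer list, producing an outreach queue for the sales team. A sketch assuming output shaped like the `rfm_raw` frame from section 27.3.3; a tiny toy frame stands in here, and the action strings are illustrative placeholders, not prescriptions:

```python
import pandas as pd

# Toy stand-in for the scored rfm_raw frame from section 27.3.3
rfm_scored = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "segment": ["Champions", "At Risk", "Cannot Lose Them", "Lost"],
    "monetary": [52_000.0, 18_500.0, 74_000.0, 600.0],
})

# Hypothetical segment -> next-action lookup (wording is illustrative)
segment_actions = {
    "Champions": "Invite to referral program; early product access",
    "Loyal Customers": "Loyalty tier upsell; proactive account review",
    "At Risk": "Personal outreach from account manager this week",
    "Cannot Lose Them": "Escalate to executive sponsor immediately",
    "New Customers": "Onboarding sequence + second-purchase incentive",
    "Recent Customers": "Onboarding sequence + second-purchase incentive",
    "Potential Loyalists": "Loyalty program invitation",
    "Promising": "Category expansion offer",
    "Hibernating": "Low-cost email re-engagement",
    "Needs Attention": "Check-in call from sales rep",
    "Lost": "One final win-back email, then archive",
}

rfm_scored["next_action"] = (
    rfm_scored["segment"].map(segment_actions).fillna("Review manually")
)

# Outreach queue: urgent segments only, biggest spenders first
urgent = rfm_scored[
    rfm_scored["segment"].isin(["Cannot Lose Them", "At Risk"])
].sort_values("monetary", ascending=False)
print(urgent[["customer_id", "segment", "monetary", "next_action"]])
```

The `fillna("Review manually")` guard catches any segment label that falls outside the lookup, so a renamed segment fails soft rather than silently dropping customers.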
27.4 Cohort Analysis: Tracking Customer Groups Over Time
Cohort analysis tracks groups of customers who share a common characteristic — usually the time period when they first purchased — and follows them over time. The primary question it answers is: Of the customers who joined in month X, what percentage are still buying in month Y?
This is different from an overall retention rate. An overall retention rate averages across all customers and can mask the fact that your newer cohorts are retaining much worse than your older ones (a sign that your product or service has degraded) or much better (a sign that a recent improvement is paying off).
27.4.1 Building an Acquisition Cohort Table
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Add acquisition month and order month to transaction data
transaction_data["order_month"] = transaction_data["order_date"].dt.to_period("M")
# Find each customer's first purchase month (acquisition cohort)
customer_first_purchase = (
transaction_data
.groupby("customer_id")["order_date"]
.min()
.dt.to_period("M")
.rename("cohort_month")
.reset_index()
)
# Join cohort month back to transaction data
transaction_cohort = transaction_data.merge(
customer_first_purchase, on="customer_id", how="left"
)
# Calculate months since acquisition
transaction_cohort["period_number"] = (
transaction_cohort["order_month"] - transaction_cohort["cohort_month"]
).apply(lambda x: x.n)
# Build cohort table: unique active customers per cohort-period
cohort_data = (
transaction_cohort
.groupby(["cohort_month", "period_number"])["customer_id"]
.nunique()
.reset_index()
.rename(columns={"customer_id": "active_customers"})
)
# Pivot to wide format
cohort_pivot = cohort_data.pivot_table(
index="cohort_month",
columns="period_number",
values="active_customers",
)
# Calculate retention rates: divide each period by period 0 (cohort size)
cohort_sizes = cohort_pivot[0]
retention_matrix = cohort_pivot.divide(cohort_sizes, axis=0).round(3)
print("Cohort Retention Rate Matrix:")
print(retention_matrix.iloc[:6, :7].to_string())
27.4.2 Visualizing the Cohort Retention Heatmap
A cohort retention heatmap is one of the most powerful visualizations in customer analytics. Each row is a cohort (acquisition month). Each column is a period (months since acquisition). The color intensity shows retention — darker means more customers still buying.
fig, ax = plt.subplots(figsize=(14, 8))
# Plot the heatmap
sns.heatmap(
retention_matrix.iloc[:12, :12], # first 12 cohorts, 12 periods
annot=True,
fmt=".0%",
cmap="YlOrRd", # darker = higher retention
vmin=0,
vmax=1,
linewidths=0.5,
ax=ax,
cbar_kws={"label": "Retention Rate"},
)
ax.set_title("Customer Cohort Retention Analysis\nAcme Corp, 2022–2023",
fontsize=14, fontweight="bold", pad=20)
ax.set_xlabel("Months Since First Purchase", fontsize=12)
ax.set_ylabel("Acquisition Cohort (Month)", fontsize=12)
ax.set_yticklabels(
[str(p) for p in retention_matrix.index[:12]],
rotation=0,
fontsize=9,
)
plt.tight_layout()
plt.savefig("cohort_retention_heatmap.png", dpi=150, bbox_inches="tight")
plt.show()
print("Heatmap saved.")
Reading the heatmap: Read down a single column to compare cohorts at the same "age" — the period-1 column, for example, shows what fraction of each cohort came back in its second month. Consistent color down a column means retention is stable over time. Darkening in the more recent cohort rows means newer customers are retaining better; lightening means something has changed for the worse. The diagonal edge of the triangle simply marks the most recent calendar month, beyond which the younger cohorts have no data yet.
27.4.3 What Cohort Analysis Tells You
The heatmap answers questions that aggregate metrics cannot:
- Is our retention improving or getting worse? Compare the brightness of recent cohort rows to older ones.
- When do we lose most customers? Look at which period column shows the steepest drop. If most churn happens between period 0 and period 1, you have an onboarding or first-repeat-purchase problem.
- Are seasonal cohorts different? Customers acquired in November (Black Friday) may have systematically lower retention than customers acquired in March.
- Did a product change help or hurt? If you launched a new feature in month 8, did cohorts acquired after that month retain better?
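The first question in that list — is retention improving or getting worse — can be answered directly from the retention matrix built in 27.4.1 by reading down one period column. A sketch; a small hand-made matrix stands in for `retention_matrix` so the example is self-contained:

```python
import pandas as pd

# Toy stand-in for the retention_matrix from section 27.4.1:
# rows = acquisition cohorts, columns = periods since acquisition
retention_matrix = pd.DataFrame(
    {0: [1.0, 1.0, 1.0, 1.0],
     1: [0.42, 0.45, 0.51, 0.55],
     2: [0.30, 0.33, 0.38, 0.41]},
    index=pd.period_range("2023-01", periods=4, freq="M"),
)

# Month-1 retention for each successive cohort: an upward trend means
# newer cohorts are retaining better at the same age
month1 = retention_matrix[1].dropna()
print(month1)

trend = month1.iloc[-1] - month1.iloc[0]
print(f"Month-1 retention moved {trend:+.0%} from oldest to newest cohort")
```

With your real matrix, plotting this column as a line chart (one point per cohort) is often the fastest way to show an executive whether retention is trending up or down.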
27.5 Customer Segmentation with K-Means Clustering
RFM gives you a rule-based segmentation: you define the rules (R >= 4, F >= 4 = Champion), and customers fall into boxes. This is transparent and interpretable. But there is another approach: let the data define its own natural groupings.
K-means clustering is a machine learning algorithm that finds groups of similar data points without you specifying the rules. You give it your customer data (RFM scores, or other features) and tell it how many groups (K) to find. It figures out the rest.
27.5.1 The Intuition: No Heavy Math Required
Imagine plotting all your customers on a scatter plot, with frequency on the X-axis and monetary value on the Y-axis. You would likely see some natural clusters — a dense cloud of low-frequency, low-spend customers in the bottom-left; a sparse group of high-frequency, high-spend customers in the top-right; perhaps a cluster of moderate-frequency, low-spend customers in the middle-left.
K-means finds those clusters algorithmically. It:
1. Places K "center points" (called centroids) randomly in the data space.
2. Assigns every customer to the nearest centroid.
3. Moves each centroid to the center of its assigned customers.
4. Repeats steps 2–3 until the assignments stop changing.
The result is K groups, each with a centroid that describes the "typical" customer in that group. You then interpret the centroids to give each cluster a business name.
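The four steps above can be sketched directly in NumPy. This is a minimal, unoptimized implementation to make the loop concrete — for real work, use `sklearn.cluster.KMeans` as in the sections that follow:

```python
import numpy as np

def simple_kmeans(points, k, n_iter=100, seed=42):
    """Minimal k-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K random data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids (and hence assignments) settle
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two obvious groups: low-spend and high-spend customers (toy 2-D features)
pts = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                [8.0, 9.0], [8.2, 8.8], [7.9, 9.2]])
centroids, labels = simple_kmeans(pts, k=2)
print(centroids.round(2))
```

Production implementations add smarter initialization (k-means++) and multiple restarts, which is what scikit-learn's `n_init` parameter controls.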
27.5.2 Preprocessing: Feature Scaling
K-means uses distances between data points. This means features on larger scales (like monetary value, which might range from $100 to $100,000) will dominate features on smaller scales (like frequency, which might range from 1 to 50). We must scale the features before clustering.
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
# Use the RFM scores as clustering features
# (the 1-5 scores already share a scale, so scaling changes little here;
# it is essential, though, if you cluster on raw recency/frequency/monetary
# values, whose ranges differ by orders of magnitude, as the print below shows)
features = rfm_raw[["r_score", "f_score", "m_score"]].copy()
# Scale to zero mean, unit variance
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
print("Before scaling:")
print(f" r_score range: {features['r_score'].min()} to {features['r_score'].max()}")
print(f" monetary range: ${rfm_raw['monetary'].min():.0f} to "
f"${rfm_raw['monetary'].max():.0f}")
print("\nAfter scaling (first 3 rows):")
print(features_scaled[:3])
27.5.3 Choosing K: The Elbow Method
The key decision in K-means is choosing how many clusters to create. The elbow method runs K-means for several values of K (typically 2–10), plots the "inertia" (sum of squared distances from each point to its centroid), and looks for the "elbow" — the point where adding another cluster stops reducing inertia significantly.
# Elbow method
inertia_values = []
k_range = range(2, 11)
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(features_scaled)
inertia_values.append(kmeans.inertia_)
# Plot the elbow curve
fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(list(k_range), inertia_values, marker="o", linewidth=2,
color="#2c7bb6", markersize=8)
ax.set_xlabel("Number of Clusters (K)", fontsize=12)
ax.set_ylabel("Inertia (Within-Cluster Sum of Squares)", fontsize=12)
ax.set_title("Elbow Method for Optimal K\nCustomer Segmentation", fontsize=13)
ax.axvline(x=4, color="red", linestyle="--", alpha=0.7, label="Suggested K=4")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("elbow_curve.png", dpi=150, bbox_inches="tight")
plt.show()
Look for the value of K where the curve bends sharply. In most customer datasets, the elbow appears at K=3, K=4, or K=5. If the curve is very smooth (no clear elbow), you may need to either add more features or accept that your customer base does not have strong natural clusters.
27.5.4 Fitting K-Means and Interpreting Clusters
# Fit with the chosen K (let's say K=4 based on elbow)
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
rfm_raw["cluster"] = kmeans.fit_predict(features_scaled)
# Characterize each cluster by its mean RFM values
cluster_profiles = (
rfm_raw
.groupby("cluster")
.agg(
customer_count=("customer_id", "count"),
avg_r_score=("r_score", "mean"),
avg_f_score=("f_score", "mean"),
avg_m_score=("m_score", "mean"),
avg_recency_days=("recency_days", "mean"),
avg_frequency=("frequency", "mean"),
avg_monetary=("monetary", "mean"),
total_revenue=("monetary", "sum"),
)
.round(2)
)
print("Cluster Profiles:")
print(cluster_profiles.to_string())
Now you interpret. Suppose the output shows:
- Cluster 0: High R, High F, High M — this is your VIP / Champion cluster
- Cluster 1: Low R, High F, High M — this is your At Risk / Lapsing High-Value cluster
- Cluster 2: High R, Low F, Low M — this is your New / Occasional Buyers cluster
- Cluster 3: Low R, Low F, Low M — this is your Lost / Dormant cluster
The cluster labels are assigned by the algorithm, not by you, but the business interpretation is entirely yours.
# Assign meaningful business labels to clusters
cluster_labels = {
0: "VIP Customers",
1: "At Risk High-Value",
2: "New and Growing",
3: "Dormant",
}
rfm_raw["cluster_label"] = rfm_raw["cluster"].map(cluster_labels)
# Visualize in 2D (R vs M, colored by cluster)
fig, ax = plt.subplots(figsize=(10, 7))
colors = ["#e41a1c", "#377eb8", "#4daf4a", "#ff7f00"]
for cluster_id, label in cluster_labels.items():
mask = rfm_raw["cluster"] == cluster_id
ax.scatter(
rfm_raw.loc[mask, "r_score"],
rfm_raw.loc[mask, "m_score"],
c=colors[cluster_id],
label=label,
alpha=0.6,
s=60,
edgecolors="white",
linewidths=0.5,
)
ax.set_xlabel("Recency Score (5 = Most Recent)", fontsize=12)
ax.set_ylabel("Monetary Score (5 = Highest Spend)", fontsize=12)
ax.set_title("Customer Segmentation: K-Means Clusters\nRecency vs. Monetary Value",
fontsize=13)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("customer_clusters.png", dpi=150, bbox_inches="tight")
plt.show()
27.5.5 K-Means vs. RFM Rule-Based Segmentation: Which to Use?
Both approaches have their place. Here is how to choose:
Use rule-based RFM when:
- You need to explain the segmentation to non-technical stakeholders ("Champions are customers with R, F, and M scores of 4 or higher")
- Your team will take manual action on specific segments
- You need stable, consistent segment definitions over time
Use K-means when:
- You want to discover natural groupings in the data without imposing your assumptions
- You are adding many features beyond R, F, M (product categories bought, support tickets raised, geography, etc.)
- You are building a model that will run automatically on new data
In practice, many companies use both: rule-based RFM for operational actions (sales team outreach lists), K-means for strategic analysis and persona development.
27.6 Churn Indicators: Early Warning Signs
Churn is when a customer stops doing business with you. For subscription businesses, this is explicit — they cancel their subscription. For transactional businesses like Acme Corp, churn is implicit — they simply stop ordering, and you have to detect it from behavioral signals.
27.6.1 Common Behavioral Churn Indicators
| Signal | Description | Threshold Example |
|---|---|---|
| Days since last purchase | The most direct signal | > 90 days for a customer who normally buys every 30 |
| Declining order frequency | Order rate is decreasing over time | 50%+ drop in orders, last 6 months vs. prior 6 |
| Declining order value | Spending less per order | 30%+ drop in average order value |
| Shrinking product breadth | Buying fewer product categories | From 5 categories to 1-2 categories |
| Support ticket spike | More complaints recently | 2+ tickets in 30 days with no prior history |
| No response to outreach | Sales calls/emails not returned | 3+ unreturned contacts |
27.6.2 Computing Churn Indicators in pandas
# Compute churn risk signals for each customer
# Reference: last 6 months vs. prior 6 months
recent_cutoff = analysis_date - pd.DateOffset(months=6)
prior_cutoff = analysis_date - pd.DateOffset(months=12)
# Recent period orders
recent_orders = (
transaction_data[transaction_data["order_date"] >= recent_cutoff]
.groupby("customer_id")
.agg(
recent_order_count=("order_value", "count"),
recent_total_spend=("order_value", "sum"),
)
.reset_index()
)
# Prior period orders
prior_orders = (
transaction_data[
(transaction_data["order_date"] >= prior_cutoff)
& (transaction_data["order_date"] < recent_cutoff)
]
.groupby("customer_id")
.agg(
prior_order_count=("order_value", "count"),
prior_total_spend=("order_value", "sum"),
)
.reset_index()
)
# Merge and compute changes
churn_signals = recent_orders.merge(
prior_orders, on="customer_id", how="outer"
).fillna(0)
churn_signals["order_count_change_pct"] = (
(churn_signals["recent_order_count"] - churn_signals["prior_order_count"])
/ churn_signals["prior_order_count"].replace(0, np.nan)
* 100
)
churn_signals["spend_change_pct"] = (
(churn_signals["recent_total_spend"] - churn_signals["prior_total_spend"])
/ churn_signals["prior_total_spend"].replace(0, np.nan)
* 100
)
# Flag customers showing significant decline
churn_signals["declining_frequency"] = (
churn_signals["order_count_change_pct"] < -40
)
churn_signals["declining_spend"] = (
churn_signals["spend_change_pct"] < -30
)
churn_signals["churn_risk_signals"] = (
churn_signals["declining_frequency"].astype(int)
+ churn_signals["declining_spend"].astype(int)
)
at_risk_customers = churn_signals[churn_signals["churn_risk_signals"] >= 1]
print(f"Customers showing churn signals: {len(at_risk_customers)}")
print(f" Declining frequency: {churn_signals['declining_frequency'].sum()}")
print(f" Declining spend: {churn_signals['declining_spend'].sum()}")
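The first signal in the table in 27.6.1 — a customer overdue relative to their own cadence — deserves a per-customer threshold rather than a fixed 90-day cutoff, since a monthly buyer and an annual buyer have very different "normal" gaps. A sketch, with a small toy frame shaped like `transaction_data` standing in; the 3x-median-gap multiple is an assumption to tune for your business:

```python
import pandas as pd

# Toy stand-in for transaction_data (customer_id, order_date)
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2, 2],
    "order_date": pd.to_datetime([
        "2023-06-01", "2023-07-01", "2023-08-01", "2023-09-01",  # ~monthly buyer
        "2023-01-01", "2023-12-01",                              # sporadic buyer
    ]),
})
analysis_date = pd.Timestamp("2024-01-15")

def overdue_flags(tx, analysis_date, multiple=3.0):
    """Flag customers whose gap since their last order exceeds
    `multiple` times their own median gap between orders."""
    g = tx.sort_values("order_date").groupby("customer_id")["order_date"]
    median_gap = g.apply(lambda s: s.diff().dt.days.median()).rename("median_gap_days")
    last_order = g.max().rename("last_order")
    out = pd.concat([median_gap, last_order], axis=1).reset_index()
    out["days_since_last"] = (analysis_date - out["last_order"]).dt.days
    out["overdue"] = out["days_since_last"] > multiple * out["median_gap_days"]
    return out

flags = overdue_flags(tx, analysis_date)
print(flags[["customer_id", "median_gap_days", "days_since_last", "overdue"]])
```

Here the monthly buyer (136 days silent against a ~31-day cadence) is flagged, while the sporadic buyer is not — exactly the distinction a flat 90-day rule would miss.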
27.7 Building a Customer Health Score
A customer health score is a single composite number — typically 0–100 — that summarizes how engaged and valuable a customer is right now. It combines multiple signals into one actionable metric that your sales team can use at a glance.
def calculate_health_score(row):
"""
Calculate a 0-100 customer health score.
Higher is healthier.
Components:
- Recency score (0-25): based on days since last purchase
- Frequency score (0-25): normalized order frequency
- Monetary score (0-25): normalized spend
- Trend score (0-25): recent vs. prior period trend
"""
# Recency component: 25 points max
# Full points for < 30 days, zero for > 180 days
recency_days = row.get("recency_days", 180)
recency_component = max(0, 25 * (1 - recency_days / 180))
# Frequency component: 25 points max (normalized to 1-5 score)
f_score = row.get("f_score", 1)
frequency_component = (f_score / 5) * 25
# Monetary component: 25 points max (normalized to 1-5 score)
m_score = row.get("m_score", 1)
monetary_component = (m_score / 5) * 25
# Trend component: 25 points max
# Gaining or stable = 25, declining = scaled down
spend_change = row.get("spend_change_pct", 0)
if pd.isna(spend_change):
trend_component = 12.5 # neutral if no prior period data
elif spend_change >= 0:
trend_component = 25.0
else:
trend_component = max(0, 25 * (1 + spend_change / 100))
total_score = (
recency_component
+ frequency_component
+ monetary_component
+ trend_component
)
return round(total_score, 1)
# Merge RFM and churn signals for health score input
health_input = rfm_raw.merge(
churn_signals[["customer_id", "spend_change_pct"]],
on="customer_id",
how="left",
)
health_input["health_score"] = health_input.apply(calculate_health_score, axis=1)
# Assign health tiers
health_input["health_tier"] = pd.cut(
health_input["health_score"],
bins=[0, 25, 50, 75, 100],
labels=["Critical", "At Risk", "Healthy", "Thriving"],
include_lowest=True,
)
tier_summary = (
health_input
.groupby("health_tier", observed=True)
.agg(
customer_count=("customer_id", "count"),
avg_health_score=("health_score", "mean"),
total_revenue=("monetary", "sum"),
)
.round(2)
)
print("\nCustomer Health Score Distribution:")
print(tier_summary.to_string())
The health score gives your sales team an instantly scannable signal. Sort by health score and look at the bottom 10%: these are the accounts that deserve immediate attention, regardless of how they appear on the surface.
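Pulling that bottom 10% is a short query against the scored frame. A sketch assuming a frame shaped like `health_input` from this section; a toy frame stands in so the example runs on its own:

```python
import pandas as pd

# Toy stand-in for the scored health_input frame
scores = pd.DataFrame({
    "customer_id": range(1, 11),
    "health_score": [88, 72, 15, 64, 9, 55, 91, 33, 47, 21],
})

# Bottom decile by health score: the accounts to triage first
cutoff = scores["health_score"].quantile(0.10)
watch_list = scores[scores["health_score"] <= cutoff].sort_values("health_score")
print(f"Bottom-decile cutoff: {cutoff}")
print(watch_list)
```

In practice you would also join in segment, owner, and revenue columns so the watch list is immediately actionable for whoever owns the account.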
27.8 Bringing It Together: A Complete Customer Analytics Workflow
Here is the recommended sequence for a first customer analytics project:
1. Start with data inventory. What transaction data do you have? How far back? Do you have customer demographic or firmographic data? Does every transaction link to a customer ID?
2. Compute CLV and identify your top 20% of customers. These accounts fund your business. Know them well.
3. Run RFM analysis. Generate scores and segment the full customer base. The "At Risk" and "Cannot Lose Them" segments are your most urgent operational priorities.
4. Build cohort retention tables. This tells you whether your business is getting better or worse at keeping customers. A declining retention trend needs strategic attention before anything else.
5. Add K-means clustering if you have rich feature data. This is most valuable when you have data beyond just transaction history — product categories, geographies, industry verticals, support history.
6. Build the health score. This is the operational output that your sales team actually uses day to day.
7. Set up monitoring. Customer analytics is not a one-time exercise. Run it monthly. Alert on customers who cross from "Healthy" to "At Risk" this month. Flag any "At Risk" customer who was a top-20% customer six months ago.
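The tier-crossing alert in the monitoring step can be sketched by joining two monthly snapshots of the health tiers. The snapshot frames here are hypothetical; in practice you would persist one from each monthly health-scoring run:

```python
import pandas as pd

# Hypothetical tier snapshots from two consecutive monthly runs
last_month = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "health_tier": ["Healthy", "Thriving", "Healthy", "At Risk"],
})
this_month = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "health_tier": ["At Risk", "Thriving", "Healthy", "Critical"],
})

# Customers whose tier dropped from Healthy/Thriving into At Risk/Critical
transitions = last_month.merge(
    this_month, on="customer_id", suffixes=("_prev", "_now")
)
alerts = transitions[
    transitions["health_tier_prev"].isin(["Healthy", "Thriving"])
    & transitions["health_tier_now"].isin(["At Risk", "Critical"])
]
print(alerts)
```

Note that customer 4 (already "At Risk" last month, now "Critical") is deliberately excluded — this alert catches new deterioration, while already-flagged accounts belong to the existing follow-up workflow.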
Chapter Summary
Customer analytics is the discipline of understanding who your customers are, how they behave, and which relationships deserve the most attention. The core tools in this chapter — CLV, RFM analysis, cohort analysis, K-means clustering, and health scoring — each illuminate a different facet of your customer relationships:
- CLV tells you what each customer is worth over their lifetime
- RFM tells you where each customer stands right now on three behavioral dimensions
- Cohort analysis tells you whether your customer relationships are getting stronger or weaker over time
- K-means clustering discovers natural behavioral groupings without imposing your assumptions
- Churn indicators give you early warning before a valuable relationship ends
- Health scoring translates all of the above into a single actionable number
The most important thing to understand is that these tools are only valuable when they connect to business action. An "At Risk" segment that nobody follows up on is just a label. A cohort retention chart that nobody reads is just a pretty picture. Build the analysis, build the workflow, and make sure someone owns the action.