Chapter 27 Exercises: Customer Analytics and Segmentation
These exercises progress from direct application of chapter concepts (Tier 1) through independent analysis and design challenges (Tier 5). Work through each tier before moving to the next.
Tier 1: Recall and Fundamentals
These exercises confirm you understood the core concepts. They require minimal coding.
Exercise 1.1 — RFM Definitions
Without looking at the chapter, write a one-sentence definition of each of the three RFM dimensions. Then explain why each one, on its own, gives an incomplete picture of customer value.
Exercise 1.2 — Segment Identification
For each customer profile below, identify the most likely RFM segment and explain your reasoning:
| Customer | Last Purchase | Orders (12 mo) | Annual Spend |
|---|---|---|---|
| A | 5 days ago | 24 | $85,000 |
| B | 210 days ago | 18 | $72,000 |
| C | 8 days ago | 1 | $340 |
| D | 190 days ago | 2 | $1,200 |
| E | 45 days ago | 8 | $14,500 |
Exercise 1.3 — Cohort Analysis Interpretation
The cohort retention table below shows retention rates for three acquisition cohorts:
| Cohort | M+0 | M+1 | M+2 | M+3 | M+6 | M+12 |
|---|---|---|---|---|---|---|
| Jan-2022 | 100% | 42% | 31% | 28% | 19% | 14% |
| Jul-2022 | 100% | 45% | 33% | 30% | 21% | 15% |
| Jan-2023 | 100% | 53% | 41% | 37% | — | — |
Answer these questions:
1. Is month-1 retention improving or declining over time?
2. What does the blank in the Jan-2023 M+12 column mean?
3. Based on the trend, what would you estimate M+6 retention to be for the Jan-2023 cohort?
Exercise 1.4 — CLV Calculation
A subscription software company has the following metrics:
- Average monthly revenue per customer: $299
- Average customer lifespan: 26 months
- Gross margin: 72%
Calculate: (a) simple CLV based on revenue, (b) margin-adjusted CLV.
If their customer acquisition cost (CAC) is $1,800, are their unit economics healthy? Explain.
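The arithmetic behind the two CLV figures can be sketched as below. The function name `clv_summary` and the input numbers are hypothetical (deliberately not the exercise's figures), and the sketch assumes the simple model of monthly revenue multiplied by lifespan, with gross margin applied afterwards.

```python
def clv_summary(monthly_revenue, lifespan_months, gross_margin, cac):
    """Return simple CLV, margin-adjusted CLV, and the CLV-to-CAC ratio."""
    simple_clv = monthly_revenue * lifespan_months
    margin_clv = simple_clv * gross_margin
    return {
        "simple_clv": simple_clv,
        "margin_clv": margin_clv,
        "clv_to_cac": margin_clv / cac,
    }

# Hypothetical inputs, not the exercise's figures.
result = clv_summary(monthly_revenue=100.0, lifespan_months=20, gross_margin=0.5, cac=400.0)
print(result)
```

A ratio of roughly 3:1 or better is a commonly quoted rule of thumb for healthy unit economics; treat that threshold as a convention rather than something this exercise prescribes.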
Exercise 1.5 — K-Means Concepts
Answer these conceptual questions about K-means clustering:
1. Why must you scale features before running K-means?
2. What does "inertia" measure in the context of the elbow method?
3. K-means requires you to specify K in advance. Name two approaches for deciding what K to use.
4. You run K-means on customer data and get four clusters. Cluster 3 has very high monetary value and very low recency. What business label would you assign to this cluster, and what action would you take?
Tier 2: Direct Application
Apply the chapter's code patterns to new data scenarios.
Exercise 2.1 — RFM Scoring from Scratch
Create a synthetic transaction dataset with these properties:
- 300 unique customers
- 2,500 total transactions
- Transaction dates spanning 2022-01-01 to 2023-12-31
- Order values drawn from a log-normal distribution (mean $500, wide spread)
Then:
1. Compute raw R, F, M metrics for each customer
2. Apply quintile scoring (1–5)
3. Print the distribution of R, F, and M scores
4. Verify that each score has approximately equal numbers of customers (because quintiles split evenly by design)
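One possible starting point is sketched below. It is not the chapter's code: the random seed, the `sigma` value, the analysis date, and the helper name `quintile_score` are all assumptions made for illustration. Ranking before calling `pd.qcut` is one way to avoid duplicate bin edges when many customers share the same frequency.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_customers, n_tx = 300, 2500

# Guarantee every customer appears at least once, then assign the rest at random.
customer_ids = np.concatenate([
    np.arange(1, n_customers + 1),
    rng.integers(1, n_customers + 1, size=n_tx - n_customers),
])
# 2022-01-01 plus 0..729 days covers 2022-01-01 through 2023-12-31.
day_offsets = rng.integers(0, 730, size=n_tx)
purchase_dates = pd.Timestamp("2022-01-01") + pd.to_timedelta(day_offsets, unit="D")

# A log-normal has mean exp(mu + sigma^2 / 2), so solve for mu to target a $500 mean.
sigma = 1.0  # assumed spread
mu = np.log(500) - sigma**2 / 2
amounts = rng.lognormal(mean=mu, sigma=sigma, size=n_tx).round(2)

tx = pd.DataFrame(
    {"customer_id": customer_ids, "purchase_date": purchase_dates, "amount": amounts}
)

analysis_date = pd.Timestamp("2024-01-01")  # assumed: the day after the last possible order
rfm = tx.groupby("customer_id").agg(
    last_purchase=("purchase_date", "max"),
    frequency=("purchase_date", "count"),
    monetary=("amount", "sum"),
)
rfm["recency"] = (analysis_date - rfm["last_purchase"]).dt.days


def quintile_score(series, higher_is_better=True):
    """Score 1-5 by quintile; 5 is always the best quintile."""
    # Ranking with method="first" breaks ties, so qcut never sees duplicate bin edges.
    ranked = series.rank(method="first", ascending=higher_is_better)
    return pd.qcut(ranked, 5, labels=[1, 2, 3, 4, 5]).astype(int)


rfm["r_score"] = quintile_score(rfm["recency"], higher_is_better=False)
rfm["f_score"] = quintile_score(rfm["frequency"])
rfm["m_score"] = quintile_score(rfm["monetary"])

for col in ["r_score", "f_score", "m_score"]:
    print(col, rfm[col].value_counts().sort_index().to_dict())
```

Because ties are broken by rank, each score bucket holds exactly 60 of the 300 customers here. Scoring the raw values directly would give only approximately equal buckets, which is the behavior step 4 asks you to check.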
Exercise 2.2 — At-Risk Customer Report
Using the data from Exercise 2.1, or the data generated by rfm_analysis.py:
- Filter to only "At Risk" and "Cannot Lose Them" segments
- Sort by monetary value descending
- Add a column called days_until_one_year_inactive that calculates how many days until the customer will have been inactive for 365 days
- Export this as urgent_outreach_list.csv
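The filtering and the days_until_one_year_inactive calculation might look like the sketch below. The three-row table, its column names (`segment`, `last_purchase`, `monetary`), and the analysis date are hypothetical stand-ins for whatever your RFM output contains.

```python
import pandas as pd

analysis_date = pd.Timestamp("2024-01-01")  # assumed analysis date

# Hypothetical RFM output; replace with your own table.
rfm = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "segment": ["At Risk", "Champions", "Cannot Lose Them"],
    "last_purchase": pd.to_datetime(["2023-06-01", "2023-12-20", "2023-02-15"]),
    "monetary": [12000.0, 30000.0, 45000.0],
})

urgent = rfm[rfm["segment"].isin(["At Risk", "Cannot Lose Them"])].copy()
days_inactive = (analysis_date - urgent["last_purchase"]).dt.days
# Clip at zero so customers already past one year do not show negative days.
urgent["days_until_one_year_inactive"] = (365 - days_inactive).clip(lower=0)
urgent = urgent.sort_values("monetary", ascending=False)

# to_csv() with no path returns the CSV text; pass "urgent_outreach_list.csv" to write the file.
print(urgent.to_csv(index=False))
```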
Exercise 2.3 — Cohort Table Construction
Given a DataFrame transactions with columns customer_id, purchase_date, and amount, write a function called build_cohort_table(transactions) that returns a retention rate matrix (as in Section 27.4.1) without using any code from the chapter directly. Write it from memory, then compare to the chapter code.
Exercise 2.4 — Health Score Customization
The health score formula in Section 27.7 weights recency, frequency, monetary, and trend equally (25 points each).
Redesign the health score for a subscription business where:
- Recency is less important (customers pay monthly automatically)
- Trend (whether they are expanding or contracting their subscription) is most important
- Product breadth (number of features actively used) should be a factor
Write the revised calculate_health_score() function with your new weightings. Justify each weighting in a comment.
Exercise 2.5 — Elbow Method Practice
Generate a dataset of 500 customers with three features: r_score, f_score, and m_score (all 1–5 integers). Run the K-means elbow method for K=2 through K=10. Plot the elbow curve. Based on the plot, what K would you choose? Does the elbow method give a clear answer for this particular dataset? Why or why not?
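If you want to see what the elbow method computes under the hood, the sketch below runs a basic Lloyd's-algorithm K-means in NumPy and records inertia for each K. It is a stand-in written for illustration, not scikit-learn's `KMeans` (which the chapter's code may use) and not a production implementation; the function name `kmeans_inertia` and the seeds are assumptions.

```python
import numpy as np


def kmeans_inertia(X, k, n_iter=50, seed=0):
    """Run a basic Lloyd's K-means and return the final inertia
    (the sum of squared distances from each point to its nearest center)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (n_points, k).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster empties out
                centers[j] = members.mean(axis=0)
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).sum())


rng = np.random.default_rng(1)
# 500 customers with r_score, f_score, m_score drawn uniformly from 1..5.
X = rng.integers(1, 6, size=(500, 3)).astype(float)

inertias = {k: kmeans_inertia(X, k) for k in range(2, 11)}
for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against K gives the elbow curve. The scores here are independent and uniform, so there is no built-in cluster structure; keep that in mind when you judge how sharp the elbow looks.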
Tier 3: Analysis and Interpretation
These exercises require you to think like a business analyst, not just run code.
Exercise 3.1 — Segment Action Planning
You have run RFM analysis on a retail company and found the following segment distribution:
| Segment | Customers | % of Revenue |
|---|---|---|
| Champions | 89 | 38% |
| Loyal Customers | 213 | 27% |
| At Risk | 156 | 18% |
| Potential Loyalists | 298 | 9% |
| New Customers | 412 | 5% |
| Lost | 534 | 3% |
- Write a one-paragraph recommendation for where to focus marketing and sales effort, with a specific rationale for each segment you prioritize or deprioritize.
- If you only had budget for two outreach campaigns, which two segments would you target and what would each campaign look like?
- The "Lost" segment has 534 customers (the most of any segment) but only 3% of revenue. Should you invest in trying to win them back? Under what conditions would this be worth it?
Exercise 3.2 — Cohort Heatmap Analysis
Run cohort_analysis.py and examine the resulting heatmap. Then write a 200-word business memo (as if you were Priya, writing to Sandra) that:
1. Explains what a cohort retention chart shows (in plain English, without jargon)
2. States the key findings from the chart
3. Makes one specific recommendation based on those findings
Exercise 3.3 — Churn Signal Investigation
Using the churn signal framework from Section 27.6, write a query (using pandas) against a transaction dataset that identifies customers who meet ALL of the following criteria:
1. Were in the top 25% of spenders in the prior 12 months
2. Have not ordered in the last 45 days
3. Showed declining order frequency (fewer orders in the last 6 months than in the 6 months before that)
For each customer found, calculate a "days until one-year inactive" field. Export the results sorted by historical spend descending.
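One way to combine the three criteria is sketched below. It assumes the columns customer_id, purchase_date, and amount, reads "prior 12 months" as the 365 days before the analysis date, and splits that window at 182 days; the function name `churn_candidates` and the sample rows are hypothetical.

```python
import pandas as pd


def churn_candidates(tx, analysis_date):
    """Return high-spend customers who have gone quiet and are ordering less often."""
    analysis_date = pd.Timestamp(analysis_date)
    window_start = analysis_date - pd.Timedelta(days=365)
    half_point = analysis_date - pd.Timedelta(days=182)

    in_window = (tx["purchase_date"] > window_start) & (tx["purchase_date"] <= analysis_date)
    recent = tx[in_window].copy()
    recent["in_last_6m"] = recent["purchase_date"] > half_point

    stats = recent.groupby("customer_id").agg(
        spend_12m=("amount", "sum"),
        last_purchase=("purchase_date", "max"),
        orders_total=("amount", "count"),
        orders_last_6m=("in_last_6m", "sum"),
    )
    stats["orders_prior_6m"] = stats["orders_total"] - stats["orders_last_6m"]
    stats["days_since_last"] = (analysis_date - stats["last_purchase"]).dt.days
    stats["days_until_one_year_inactive"] = 365 - stats["days_since_last"]

    is_top_spender = stats["spend_12m"] >= stats["spend_12m"].quantile(0.75)
    is_quiet = stats["days_since_last"] > 45
    is_declining = stats["orders_last_6m"] < stats["orders_prior_6m"]
    flagged = stats[is_top_spender & is_quiet & is_declining]
    return flagged.sort_values("spend_12m", ascending=False)


# Hypothetical transactions for illustration.
tx = pd.DataFrame(
    [
        (1, "2023-02-01", 10000.0), (1, "2023-03-01", 10000.0),
        (1, "2023-04-01", 10000.0), (1, "2023-09-01", 10000.0),
        (2, "2023-12-20", 30000.0),
        (3, "2023-03-01", 100.0),
        (4, "2023-05-01", 200.0), (4, "2023-12-01", 50.0),
    ],
    columns=["customer_id", "purchase_date", "amount"],
)
tx["purchase_date"] = pd.to_datetime(tx["purchase_date"])

at_risk = churn_candidates(tx, "2024-01-01")
print(at_risk[["spend_12m", "days_since_last", "days_until_one_year_inactive"]])
```

Only customer 1 is flagged in this sample: customer 2 ordered recently, and customers 3 and 4 fall below the spend threshold.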
Exercise 3.4 — Comparing RFM and K-Means Segments
Run both the rule-based RFM segmentation (from Section 27.3) and K-means clustering (K=4) on the same dataset. Create a cross-tabulation (use pd.crosstab()) showing how the K-means clusters align with (or differ from) the RFM segments.
Write a paragraph explaining: where do the two methods agree? Where do they disagree? What does that tell you about your customer base?
Exercise 3.5 — CLV Distribution Analysis
Using per-customer CLV calculations from Section 27.2.2:
- Plot a histogram of projected CLV values
- Calculate what percentage of customers accounts for 80% of total projected CLV
- Define a "CLV tier" column: Top 10%, Top 11–30%, Bottom 70%
- For each tier, calculate average recency, frequency, and monetary values
- What does this tell you about the relationship between current behavioral signals and projected lifetime value?
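The concentration calculation in the second bullet can be sketched as follows. The helper name `share_of_customers_for` and the five CLV values are hypothetical; the idea is to sort customers from most to least valuable and find how many are needed before the cumulative share reaches the target.

```python
import numpy as np


def share_of_customers_for(clv, target=0.80):
    """Smallest fraction of customers (taken from the top) whose CLV sums to `target` of the total."""
    values = np.sort(np.asarray(clv, dtype=float))[::-1]  # largest first
    cum_share = np.cumsum(values) / values.sum()
    # First position where the cumulative share reaches the target, converted to a count.
    n_needed = int(np.searchsorted(cum_share, target)) + 1
    return n_needed / len(values)


# Hypothetical projected CLV values.
clv = [600, 250, 100, 30, 20]
print(share_of_customers_for(clv))
```

Here the top two customers hold 85% of projected CLV, so 40% of customers are enough to cover the 80% target.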
Tier 4: Extended Projects
These exercises require building something new, not just adapting chapter code.
Exercise 4.1 — Automated Monthly RFM Report
Build a Python script called monthly_rfm_report.py that:
1. Accepts a CSV filename as a command-line argument (python monthly_rfm_report.py transactions.csv)
2. Runs the complete RFM pipeline
3. Saves a PNG heatmap showing the RFM score distribution (R vs M, colored by segment)
4. Saves a CSV with the at-risk list, filtered to customers with monetary score >= 3
5. Prints a one-page executive summary to the console
The script should be fully self-contained and runnable without modification after the initial setup.
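A skeleton for the command-line handling might look like this, using the standard library's `argparse`. The option names `--min-monetary-score` and `--output-dir` are assumptions added for illustration; the exercise only requires the positional CSV argument.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options for the report script."""
    parser = argparse.ArgumentParser(description="Monthly RFM report")
    parser.add_argument("csv_path", help="Path to the transactions CSV")
    parser.add_argument(
        "--min-monetary-score", type=int, default=3,
        help="Keep at-risk customers with a monetary score at or above this value",
    )
    parser.add_argument("--output-dir", default=".", help="Where to save the PNG and CSV")
    return parser.parse_args(argv)


# Demo with an explicit argument list. In the real script, call parse_args()
# with no arguments so argparse reads sys.argv.
args = parse_args(["transactions.csv"])
print(args.csv_path, args.min_monetary_score, args.output_dir)
```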
Exercise 4.2 — RFM Score Stability Analysis
Run RFM analysis on your transaction dataset, then simulate running it again "one month later" by shifting the analysis date forward 30 days.
- Identify which customers changed segments between the two runs
- Build a "migration matrix" showing how many customers moved from each segment to each other segment
- Calculate the net movement: are more customers moving up (improving) or down (deteriorating)?
- Visualize the migration with a Sankey-style flow chart (use matplotlib or the plotly library's Sankey diagram)
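The migration matrix and net movement can be sketched with `pd.crosstab`, as below. The five customers, their segments, and the worst-to-best ordering in `SEGMENT_RANK` are hypothetical; you will need to decide on an ordering for your own segment names before "up" and "down" mean anything.

```python
import pandas as pd

# Hypothetical ordering from worst to best; adjust to the segment names you use.
SEGMENT_RANK = {"Lost": 0, "At Risk": 1, "Loyal Customers": 2, "Champions": 3}

# Hypothetical segment assignments from the two runs, keyed by customer_id.
before = pd.Series(
    {1: "Champions", 2: "Loyal Customers", 3: "At Risk", 4: "At Risk", 5: "Champions"},
    name="before",
)
after = pd.Series(
    {1: "Champions", 2: "At Risk", 3: "Lost", 4: "Loyal Customers", 5: "Loyal Customers"},
    name="after",
)

moves = pd.concat([before, after], axis=1)
# Rows are the starting segment, columns the ending segment.
migration = pd.crosstab(moves["before"], moves["after"])

delta = moves["after"].map(SEGMENT_RANK) - moves["before"].map(SEGMENT_RANK)
summary = {
    "improved": int((delta > 0).sum()),
    "deteriorated": int((delta < 0).sum()),
    "unchanged": int((delta == 0).sum()),
}
print(migration)
print(summary)
```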
Exercise 4.3 — Industry-Specific RFM Customization
The standard RFM framework was designed for retail transaction data. Adapt it for one of the following contexts, justifying your choices:
Option A: B2B Software Subscriptions
- Replace frequency with "feature adoption score" (number of distinct product features used per month)
- Replace monetary with "contract value" (annual contract value)
- Add a fourth dimension: "expansion" (whether the account has grown, stayed flat, or shrunk)
Option B: Professional Services / Consulting
- Replace raw transaction counts with "project count" and "average project duration"
- Add a "referral score" (has this client referred other clients?)
- Define segment names that make sense in a services context
Write the scoring functions, segment assignment logic, and segment action recommendations for your chosen option.
Exercise 4.4 — Cohort Retention with Revenue
Standard cohort analysis counts customers. Build a parallel analysis that tracks average revenue per user (ARPU) by cohort, not just customer counts.
For each cohort and period:
1. Calculate total revenue from active customers
2. Divide by the original cohort size (not just active customers) to get "revenue retention"
3. Compare customer retention to revenue retention. In healthy businesses, revenue retention often exceeds customer retention (because surviving customers tend to spend more over time)
4. Plot both metrics on the same chart for comparison
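The calculation steps above can be sketched as follows, assuming columns customer_id, purchase_date, and amount. The function name and sample rows are hypothetical. The extra `revenue_retention` column divides each period's revenue by the cohort's M+0 revenue, which is an added assumption of this sketch: it puts revenue on the same 0-to-1 scale as customer retention so the two can share a chart.

```python
import pandas as pd


def cohort_retention_with_revenue(tx):
    """Customer retention and revenue metrics per acquisition cohort and month offset."""
    tx = tx.copy()
    first = tx.groupby("customer_id")["purchase_date"].transform("min")
    tx["cohort"] = first.dt.strftime("%Y-%m")
    # Whole months between the order and the customer's first order.
    tx["period"] = (tx["purchase_date"].dt.year - first.dt.year) * 12 + (
        tx["purchase_date"].dt.month - first.dt.month
    )

    table = tx.groupby(["cohort", "period"]).agg(
        active_customers=("customer_id", "nunique"),
        revenue=("amount", "sum"),
    )
    cohort_size = tx.groupby("cohort")["customer_id"].nunique()
    sizes = table.index.get_level_values("cohort").map(cohort_size).to_numpy()

    table["customer_retention"] = table["active_customers"] / sizes
    table["revenue_per_original_customer"] = table["revenue"] / sizes
    # Rows are sorted by (cohort, period), so "first" picks each cohort's M+0 revenue.
    m0_revenue = table.groupby(level="cohort")["revenue"].transform("first")
    table["revenue_retention"] = table["revenue"] / m0_revenue
    return table


# Hypothetical transactions for illustration.
tx = pd.DataFrame(
    [
        (1, "2023-01-10", 100.0), (1, "2023-02-10", 150.0),
        (2, "2023-01-20", 100.0),
        (3, "2023-02-05", 50.0), (3, "2023-03-05", 50.0),
    ],
    columns=["customer_id", "purchase_date", "amount"],
)
tx["purchase_date"] = pd.to_datetime(tx["purchase_date"])

table = cohort_retention_with_revenue(tx)
print(table)
```

In the sample, the 2023-01 cohort keeps half its customers at M+1 but 75% of its M+0 revenue, which is the pattern step 3 describes.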
Tier 5: Advanced and Open-Ended
These exercises have no single correct answer. They require judgment, creativity, and business thinking.
Exercise 5.1 — Designing a Customer Analytics Program
You have been asked by a regional bank to design a customer analytics program for their retail banking customers. The bank has:
- 85,000 retail checking/savings customers
- Transaction data going back 5 years
- Product holding data (which products each customer has)
- Customer service interaction records
- No existing customer segmentation
Design (in writing and pseudocode, not necessarily working code) a full customer analytics program that includes:
1. Data inventory and quality assessment plan
2. Segmentation approach (justify your choice of RFM, K-means, or a hybrid)
3. Key metrics to track and their refresh cadence
4. Recommended actions for each segment
5. How you would measure the ROI of the analytics program itself
Exercise 5.2 — Retention Curve Fitting
Cohort retention curves follow a roughly exponential decay. Using scipy.optimize.curve_fit or a manual least-squares approach:
- Fit an exponential decay model to the retention data from a cohort: retention(t) = a * e^(-b*t)
- Extract the parameters a and b for each cohort
- Use the fitted parameters to predict what month-12 retention will be for your most recent cohort (which may not yet have 12 months of data)
- Discuss the limitations of this prediction approach
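A manual least-squares version can be sketched by fitting a straight line to the logarithm of retention, as below. It uses the Jan-2022 row from the table in Exercise 1.3. Note the assumption: fitting in log space minimizes error in ln(retention), so the parameters will generally differ from what `scipy.optimize.curve_fit` returns when fitting the curve directly.

```python
import numpy as np

# Jan-2022 cohort from the table in Exercise 1.3: months since acquisition, retention rate.
months = np.array([0, 1, 2, 3, 6, 12], dtype=float)
retention = np.array([1.00, 0.42, 0.31, 0.28, 0.19, 0.14])

# Taking logs turns retention = a * exp(-b * t) into a straight line:
#   ln(retention) = ln(a) - b * t
slope, intercept = np.polyfit(months, np.log(retention), 1)
a, b = float(np.exp(intercept)), float(-slope)


def predict(t):
    """Retention predicted by the fitted exponential at month t."""
    return a * np.exp(-b * t)


print(f"a = {a:.3f}, b = {b:.3f}, predicted M+12 = {predict(12):.3f}")
```

For this cohort the fitted curve lands below the observed 14% at M+12 and well below 100% at M+0, a hint that a single exponential struggles with the steep first-month drop followed by a long flat tail. That mismatch is a useful starting point for the limitations discussion.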
Exercise 5.3 — Customer Value Segmentation vs. Behavioral Segmentation
There is a philosophical debate in customer analytics between:
- Value-based segmentation: Group customers by how much they are worth (CLV tiers, revenue tiers)
- Behavioral segmentation: Group customers by what they do (RFM, K-means on behaviors)
Write a 400-word essay arguing for one approach over the other for a specific business context of your choice. Address: What decisions does each approach support? What does each approach miss? When would you use both?
Exercise 5.4 — Build a Customer Analytics Dashboard
Using matplotlib subplots (or plotly if you prefer interactive charts), build a single-page Customer Analytics Dashboard that fits on a standard widescreen display. The dashboard should show:
- Segment distribution (pie or donut chart)
- Revenue by segment (horizontal bar chart)
- Cohort retention heatmap (or simplified version showing only M+1 and M+3)
- Health score distribution (histogram)
- Top 10 at-risk customers by revenue (simple table or annotated chart)
The dashboard should be generated from a single function call and take no more than 5 seconds to render on a standard laptop.
Exercise 5.5 — Critique and Improve the Chapter's Health Score
Review the calculate_health_score() function from Section 27.7. Identify at least three specific limitations or weaknesses:
1. A case where the formula would give a misleadingly high score to a customer who is actually unhealthy
2. A case where it would give a misleadingly low score to a customer who is actually thriving
3. A business scenario where the formula would be entirely inappropriate
For each limitation, propose a specific code change that would address it. Implement your improved version and compare the score distribution before and after your changes on the same dataset.