Case Study 2: Marcus's Sales Dashboard — From CSV to Insights in 20 Lines
Tier 3 — Illustrative/Composite Example: This case study follows Marcus, the small business owner introduced in Chapter 1, as he uses pandas to analyze his coffee shop's sales data. The coffee shop ("The Grind"), product names, sales figures, and all numerical values are fictional, created for pedagogical purposes to illustrate realistic small business analytics challenges.
The Setting
Marcus owns The Grind, a coffee shop near a university campus. Business has been good — or at least, it feels like business has been good. The line is long every morning. The tables are full during lunch. But Marcus has a nagging question that "feeling busy" can't answer: Which products are actually making money, and which are just taking up menu space?
His point-of-sale system exports a CSV file at the end of each month. He's been stacking these files in a folder for a year and doing nothing with them. Last semester, he took the first six chapters of this data science course — he knows Python basics, functions, and data structures. He loaded one month's file in Chapter 6 using csv.DictReader and computed a few averages with loops. It took him an entire evening.
This week he learned pandas, and he wants to try again. His goal: load the data, understand his sales patterns, and decide whether to cut any underperforming products. He's giving himself one hour.
The Data
Marcus's exported CSV, grind_sales_2024.csv, has 3,650 rows — one row per transaction — covering a full calendar year. Here are the columns:
| Column | Description | Example |
|---|---|---|
| `date` | Sale date (YYYY-MM-DD) | `"2024-03-15"` |
| `time_of_day` | Morning, Afternoon, or Evening | `"Morning"` |
| `product` | Product name | `"Oat Latte"` |
| `category` | Product category | `"Coffee"`, `"Tea"`, `"Food"`, `"Other"` |
| `quantity` | Number sold in this transaction | `2` |
| `unit_price` | Price per item | `5.50` |
Step 1: Load and Inspect (2 minutes)
Marcus opens a Jupyter notebook and starts typing:
import pandas as pd
df = pd.read_csv("grind_sales_2024.csv")
print(f"Shape: {df.shape}")
df.head()
Shape: (3650, 6)
| | date | time_of_day | product | category | quantity | unit_price |
|---|---|---|---|---|---|---|
| 0 | 2024-01-02 | Morning | Drip Coffee | Coffee | 1 | 3.00 |
| 1 | 2024-01-02 | Morning | Oat Latte | Coffee | 1 | 5.50 |
| 2 | 2024-01-02 | Morning | Blueberry Muffin | Food | 2 | 3.75 |
| 3 | 2024-01-02 | Morning | Chai Latte | Tea | 1 | 5.00 |
| 4 | 2024-01-02 | Afternoon | Iced Americano | Coffee | 1 | 4.50 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3650 entries, 0 to 3649
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 3650 non-null object
1 time_of_day 3650 non-null object
2 product 3650 non-null object
3 category 3650 non-null object
4 quantity 3650 non-null int64
5 unit_price 3650 non-null float64
dtypes: float64(1), int64(1), object(4)
memory usage: 171.2+ KB
No missing values. Types look right (quantity is integer, price is float). The date column loaded as object (string), not a date — he'll fix that in Chapter 11 when he learns about datetime handling. For now, it's fine as a string.
Marcus checks the unique values:
print("Products:", df["product"].nunique())
print("Categories:", df["category"].unique())
print("Time periods:", df["time_of_day"].unique())
Products: 15
Categories: ['Coffee' 'Tea' 'Food' 'Other']
Time periods: ['Morning' 'Afternoon' 'Evening']
Fifteen products, four categories, three time periods. A manageable universe.
Step 2: Create the Revenue Column (30 seconds)
The dataset has quantity and unit_price, but not revenue. In Chapter 6, Marcus would need a loop:
# The old way (Chapter 6)
for row in data:
    row["revenue"] = int(row["quantity"]) * float(row["unit_price"])
In pandas, it's one vectorized line:
df["revenue"] = df["quantity"] * df["unit_price"]
print(df[["product", "quantity", "unit_price", "revenue"]].head())
product quantity unit_price revenue
0 Drip Coffee 1 3.00 3.00
1 Oat Latte 1 5.50 5.50
2 Blueberry Muffin 2 3.75 7.50
3 Chai Latte 1 5.00 5.00
4 Iced Americano 1 4.50 4.50
No loops. No type conversion. pandas multiplied the entire quantity column by the entire unit_price column, element by element, in a single operation. Vectorized thinking in action.
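To see that the vectorized expression really does match the loop, here's a minimal, self-contained sketch (the three rows are invented for illustration, not Marcus's data):

```python
import pandas as pd

# A tiny stand-in for Marcus's data (values invented for illustration)
df = pd.DataFrame({
    "product": ["Drip Coffee", "Oat Latte", "Blueberry Muffin"],
    "quantity": [1, 1, 2],
    "unit_price": [3.00, 5.50, 3.75],
})

# Vectorized: one expression over entire columns
df["revenue"] = df["quantity"] * df["unit_price"]

# Loop version for comparison (what the Chapter 6 code boils down to)
looped = [q * p for q, p in zip(df["quantity"], df["unit_price"])]

print(df["revenue"].tolist())            # [3.0, 5.5, 7.5]
print(df["revenue"].tolist() == looped)  # True
```

Same numbers either way; the difference is that pandas does the element-by-element work in optimized compiled code.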
Step 3: The Big Picture (2 minutes)
Marcus starts with the overview numbers:
print(f"Total revenue: ${df['revenue'].sum():,.2f}")
print(f"Total items sold: {df['quantity'].sum():,}")
print(f"Average transaction revenue: ${df['revenue'].mean():.2f}")
print(f"Average transaction quantity: {df['quantity'].mean():.1f}")
Total revenue: $52,847.50
Total items sold: 11,245
Average transaction revenue: $14.48
Average transaction quantity: 3.1
About $53,000 in annual revenue from this location. Not bad for a small shop. But Marcus isn't here for the totals — he wants to understand the composition.
Step 4: Revenue by Category (1 minute)
category_summary = (df.groupby("category")["revenue"]
.agg(["sum", "mean", "count"])
.sort_values("sum", ascending=False))
category_summary["pct_of_total"] = (category_summary["sum"] /
category_summary["sum"].sum() * 100)
print(category_summary.round(2))
sum mean count pct_of_total
category
Coffee 28540.50 16.85 1694 54.0
Food 13450.25 11.72 1147 25.4
Tea 7256.75 12.95 560 13.7
Other 3600.00 9.47 380 6.8
Coffee dominates at 54% of revenue. Food is a solid 25%. Tea contributes 14%, and "Other" (bottled water, juice, merchandise) is the smallest at 7%.
Marcus writes a note: "Coffee is the engine. But food has 1,147 transactions — almost as many as coffee. Worth investigating whether food items are underpriced."
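One quick first pass at the "underpriced food" question is the average sticker price per category. A sketch on invented mini-data (these prices are placeholders, not Marcus's figures):

```python
import pandas as pd

# Hypothetical mini-dataset in the shape of Marcus's CSV
df = pd.DataFrame({
    "category":   ["Coffee", "Coffee", "Food", "Food", "Tea"],
    "unit_price": [5.50, 3.00, 3.75, 7.75, 5.00],
})

# Average unit price per category: a starting point for
# deciding whether one category is priced too low
avg_price = df.groupby("category")["unit_price"].mean()
print(avg_price)
```

On the real data, comparing this against per-category cost data (which Marcus doesn't have yet) would settle the question.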
Step 5: Revenue by Product (2 minutes)
Now for the question Marcus really cares about: which products earn their spot on the menu?
product_summary = (df.groupby("product")["revenue"]
.sum()
.sort_values(ascending=False))
print(product_summary)
product
Oat Latte             7562.50
Drip Coffee           6480.00
Iced Americano        5418.00
Blueberry Muffin      4237.50
Chai Latte            3850.00
Breakfast Sandwich    3510.00
Cappuccino            3240.00
Croissant             2887.50
Earl Grey             1950.00
Matcha Latte          1456.75
Green Smoothie        1350.00
Scone                 1275.25
Bottled Water         1200.00
Hot Chocolate         1080.00
Sparkling Water        900.00
Name: revenue, dtype: float64
The top three products — Oat Latte, Drip Coffee, and Iced Americano — account for nearly $19,500, or about 37% of total revenue. At the bottom, Sparkling Water and Hot Chocolate each contribute less than $1,100 for the entire year.
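The top-three figure is easy to verify with `nlargest`. This sketch reuses the revenue figures printed above (only the top four products shown) and the $52,847.50 total from Step 3:

```python
import pandas as pd

# Per-product revenues copied from the output above (top four only)
rev = pd.Series({
    "Oat Latte": 7562.50,
    "Drip Coffee": 6480.00,
    "Iced Americano": 5418.00,
    "Blueberry Muffin": 4237.50,
})

top3 = rev.nlargest(3).sum()
share = top3 / 52847.50 * 100   # total revenue from Step 3
print(f"Top 3: ${top3:,.2f} ({share:.1f}% of revenue)")
# Top 3: $19,460.50 (36.8% of revenue)
```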
But raw revenue doesn't tell the whole story. A product might have low revenue because it's sold rarely, or because it's cheap. Marcus digs deeper:
product_detail = df.groupby("product").agg(
total_revenue=("revenue", "sum"),
total_quantity=("quantity", "sum"),
avg_price=("unit_price", "mean"),
num_transactions=("revenue", "count")
).sort_values("total_revenue", ascending=False)
print(product_detail.round(2))
total_revenue total_quantity avg_price num_transactions
product
Oat Latte 7562.50 1375 5.50 520
Drip Coffee 6480.00 2160 3.00 685
Iced Americano 5418.00 1204 4.50 412
Blueberry Muffin 4237.50 1130 3.75 395
Chai Latte 3850.00 770 5.00 280
Breakfast Sandwich 3510.00 450 7.80 185
Cappuccino 3240.00 648 5.00 245
Croissant 2887.50 825 3.50 310
Earl Grey 1950.00 390 5.00 155
Matcha Latte 1456.75 265 5.50 98
Green Smoothie 1350.00 225 6.00 90
Scone 1275.25 365 3.50 120
Bottled Water 1200.00 600 2.00 215
Hot Chocolate 1080.00 216 5.00 85
Sparkling Water 900.00 450 2.00 165
Now the picture gets interesting. Drip Coffee has the most transactions (685) and the highest quantity sold (2,160 items) but ranks second in revenue because it's the cheapest at $3.00. The Breakfast Sandwich is high-priced ($7.80) but has relatively few transactions (185) — it's a niche item with good margins.
At the bottom: Hot Chocolate has only 85 transactions all year. That's less than two per week. Matcha Latte has 98. These are the products Marcus should scrutinize.
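The "fewer than two sales a week" observation can be turned into a reusable filter with boolean indexing: two per week over a 52-week year is 104 transactions. A sketch using the transaction counts from the table above (only four products shown):

```python
import pandas as pd

# Transaction counts copied from the product_detail table above
counts = pd.Series(
    {"Drip Coffee": 685, "Oat Latte": 520,
     "Matcha Latte": 98, "Hot Chocolate": 85},
    name="num_transactions",
)

# Fewer than 2 sales/week over a 52-week year = under 104 transactions
slow_movers = counts[counts < 104].sort_values()
print(slow_movers)
```

On the full dataset, the same one-liner would flag any product that falls below the weekly-pace threshold.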
Step 6: Time-of-Day Patterns (2 minutes)
time_summary = df.groupby("time_of_day")["revenue"].agg(["sum", "count"]).round(2)
time_summary["avg_per_transaction"] = (time_summary["sum"] / time_summary["count"]).round(2)
print(time_summary)
sum count avg_per_transaction
time_of_day
Afternoon 15425.50 1150 13.41
Evening 6847.00 510 13.43
Morning 30575.00 1990 15.36
Mornings account for 58% of revenue. Afternoons are solid at 29%. Evenings are the weakest — only 13% of revenue with 510 transactions. Marcus already knew mornings were busy, but now he has the numbers.
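Those percentage shares fall out of a one-line normalization. This sketch reuses the revenue figures from the output above:

```python
import pandas as pd

# Revenue by time of day, copied from the time_summary output above
time_revenue = pd.Series(
    {"Morning": 30575.00, "Afternoon": 15425.50, "Evening": 6847.00}
)

# Each period's share of total revenue, as a percentage
share = (time_revenue / time_revenue.sum() * 100).round(1)
print(share)
```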
He combines time and category to see what sells when:
time_category = (df.groupby(["time_of_day", "category"])["revenue"]
.sum()
.round(2))
print(time_category)
time_of_day category
Afternoon    Coffee      6850.00
             Food        4250.25
             Other       1575.00
             Tea         2750.25
Evening      Coffee      3240.50
             Food        1950.00
             Other        825.00
             Tea          831.50
Morning      Coffee     18450.00
             Food        7250.00
             Other       1200.00
             Tea         3675.00
Name: revenue, dtype: float64
Morning coffee ($18,450) is the backbone of the business. Evening tea ($831.50) is almost negligible. This kind of cross-tabulation would have required deeply nested loops in Chapter 6.
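The stacked Series above can also be laid out as a time-of-day by category grid with `unstack()`, which pivots the inner index level into columns. A sketch on invented mini-transactions (not Marcus's numbers):

```python
import pandas as pd

# Invented mini-transactions for illustration
df = pd.DataFrame({
    "time_of_day": ["Morning", "Morning", "Afternoon", "Evening"],
    "category":    ["Coffee", "Food", "Coffee", "Tea"],
    "revenue":     [5.50, 3.75, 4.50, 5.00],
})

# Same two-level groupby as above; unstack() pivots the inner
# index level (category) into columns, giving a readable grid
grid = (df.groupby(["time_of_day", "category"])["revenue"]
          .sum()
          .unstack(fill_value=0))
print(grid)
```

The `fill_value=0` fills combinations with no sales (e.g. evening coffee in this toy data) instead of leaving `NaN`.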
Step 7: The Decision (5 minutes of thinking, 5 lines of code)
Marcus creates a final summary to guide his decision. He wants to flag products that contribute less than 3% of total revenue:
total_rev = df["revenue"].sum()
product_pct = (df.groupby("product")["revenue"].sum() / total_rev * 100).round(1)
low_performers = product_pct[product_pct < 3].sort_values()
print("Products below 3% of total revenue:")
print(low_performers)
Products below 3% of total revenue:
product
Sparkling Water 1.7
Hot Chocolate 2.0
Scone 2.4
Green Smoothie 2.6
Matcha Latte 2.8
Name: revenue, dtype: float64
Five products contribute less than 3% each. Together, they account for about 11.5% of revenue. Cutting any single one wouldn't be catastrophic — but Marcus isn't ready to cut blindly. He notices that the Scone and Green Smoothie might serve strategic purposes (variety, health-conscious customers). Hot Chocolate is seasonal — it's probably stronger in winter. Sparkling Water has low revenue but requires no prep labor.
Marcus adds a final note to his notebook:
Recommendation: Don't cut any products yet. But monitor Hot Chocolate and Matcha Latte over the next quarter — if they stay below 2% of revenue during their peak season (winter for HC, spring for ML), they're candidates for removal. Consider a price increase for Breakfast Sandwich ($7.80 is already the highest, but 185 transactions with likely high margins suggests demand is inelastic).
Next steps: Need to analyze seasonal patterns (Chapter 11 for date handling) and profit margins (need cost data, not just revenue).
The Final Count
Marcus checks the clock. Forty-five minutes. He reviews his notebook: it has a complete revenue analysis, category breakdowns, time-of-day patterns, product rankings, and a preliminary business recommendation. The entire analysis is 20 lines of pandas code plus Markdown narration.
In Chapter 6, using pure Python, he managed to compute mean prices by category in an evening of work. In Chapter 7, he went from raw CSV to business decision in under an hour. The code was shorter, yes — but more importantly, his mental energy went to thinking about his business instead of wrestling with Python loops.
That's the real gift of pandas: not fewer keystrokes, but more brain space for the questions that matter.
What Marcus Used from Chapter 7
Let's map Marcus's analysis back to the concepts from the chapter:
| Concept | How Marcus Used It |
|---|---|
| `pd.read_csv()` | Loaded the CSV file in one line |
| `df.info()` | Checked data types and missing values |
| Vectorized operation | Created the revenue column: `df["quantity"] * df["unit_price"]` |
| Boolean indexing | Filtered for products below the 3% threshold |
| `sort_values()` | Ranked products by revenue |
| `groupby` + aggregation | Revenue by category, by product, by time of day |
| Method chaining | Combined grouping, aggregation, and sorting in flowing expressions |
| Markdown narration | Documented findings and business reasoning alongside code |
Every one of these tools came from Chapter 7. And every one of them will get even more powerful in the chapters ahead.
Try It Yourself
Think of a small dataset from your own life or work — gym attendance, monthly expenses, hours studied per course, playlist listening history. Export it as a CSV (or create one by hand) and run through the same workflow Marcus did:
- Load and inspect with `info()`, `head()`, `describe()`
- Create a computed column if applicable
- Group by a category and compute summary statistics
- Sort to find the top and bottom performers
- Write a one-paragraph interpretation of what you found
You'll be surprised how much insight 20 lines can produce.
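If you don't have a CSV handy, you can make one by hand. A minimal end-to-end skeleton (the column names and values here are placeholders — swap in your own data):

```python
import io
import pandas as pd

# A hand-made CSV; io.StringIO lets read_csv treat a string as a file
csv_text = """category,hours
gym,1.0
gym,1.5
study,2.0
"""

df = pd.read_csv(io.StringIO(csv_text))                         # 1. load
print(df.head())                                                # 2. inspect
summary = df.groupby("category")["hours"].agg(["sum", "mean"])  # 3. group
ranked = summary.sort_values("sum", ascending=False)            # 4. rank
print(ranked)
```

Replace the string with `pd.read_csv("your_file.csv")` once you have real data, and the rest of the workflow stays the same.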