Case Study 2: Marcus's Sales Dashboard — From CSV to Insights in 20 Lines
Tier 3 — Illustrative/Composite Example: This case study follows Marcus, the small business owner introduced in Chapter 1, as he uses pandas to analyze his coffee shop's sales data. The coffee shop ("The Grind"), product names, sales figures, and all numerical values are fictional, created for pedagogical purposes to illustrate realistic small business analytics challenges.
The Setting
Marcus owns The Grind, a coffee shop near a university campus. Business has been good — or at least, it feels like business has been good. The line is long every morning. The tables are full during lunch. But Marcus has a nagging question that "feeling busy" can't answer: Which products are actually making money, and which are just taking up menu space?
His point-of-sale system exports a CSV file at the end of each month. He's been stacking these files in a folder for a year and doing nothing with them. Last semester, he took the first six chapters of this data science course — he knows Python basics, functions, and data structures. He loaded one month's file in Chapter 6 using csv.DictReader and computed a few averages with loops. It took him an entire evening.
This week he learned pandas, and he wants to try again. His goal: load the data, understand his sales patterns, and decide whether to cut any underperforming products. He's giving himself one hour.
The Data
Marcus's exported CSV, grind_sales_2024.csv, has 3,650 rows — one row per transaction — covering a full calendar year. Here are the columns:
| Column | Description | Example |
|---|---|---|
| `date` | Sale date (YYYY-MM-DD) | `"2024-03-15"` |
| `time_of_day` | Morning, Afternoon, or Evening | `"Morning"` |
| `product` | Product name | `"Oat Latte"` |
| `category` | Product category | `"Coffee"`, `"Tea"`, `"Food"`, `"Other"` |
| `quantity` | Number sold in this transaction | `2` |
| `unit_price` | Price per item | `5.50` |
Step 1: Load and Inspect (2 minutes)
Marcus opens a Jupyter notebook and starts typing:
import pandas as pd
df = pd.read_csv("grind_sales_2024.csv")
print(f"Shape: {df.shape}")
df.head()
Shape: (3650, 6)
| | date | time_of_day | product | category | quantity | unit_price |
|---|---|---|---|---|---|---|
| 0 | 2024-01-02 | Morning | Drip Coffee | Coffee | 1 | 3.00 |
| 1 | 2024-01-02 | Morning | Oat Latte | Coffee | 1 | 5.50 |
| 2 | 2024-01-02 | Morning | Blueberry Muffin | Food | 2 | 3.75 |
| 3 | 2024-01-02 | Morning | Chai Latte | Tea | 1 | 5.00 |
| 4 | 2024-01-02 | Afternoon | Iced Americano | Coffee | 1 | 4.50 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3650 entries, 0 to 3649
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 3650 non-null object
1 time_of_day 3650 non-null object
2 product 3650 non-null object
3 category 3650 non-null object
4 quantity 3650 non-null int64
5 unit_price 3650 non-null float64
dtypes: float64(1), int64(1), object(4)
memory usage: 171.2+ KB
No missing values. Types look right (quantity is integer, price is float). The date column loaded as object (string), not a date — he'll fix that in Chapter 11 when he learns about datetime handling. For now, it's fine as a string.
Marcus checks the unique values:
print("Products:", df["product"].nunique())
print("Categories:", df["category"].unique())
print("Time periods:", df["time_of_day"].unique())
Products: 15
Categories: ['Coffee' 'Tea' 'Food' 'Other']
Time periods: ['Morning' 'Afternoon' 'Evening']
Fifteen products, four categories, three time periods. A manageable universe.
Step 2: Create the Revenue Column (30 seconds)
The dataset has quantity and unit_price, but not revenue. In Chapter 6, Marcus would need a loop:
# The old way (Chapter 6)
for row in data:
    row["revenue"] = int(row["quantity"]) * float(row["unit_price"])
In pandas, it's one vectorized line:
df["revenue"] = df["quantity"] * df["unit_price"]
print(df[["product", "quantity", "unit_price", "revenue"]].head())
product quantity unit_price revenue
0 Drip Coffee 1 3.00 3.00
1 Oat Latte 1 5.50 5.50
2 Blueberry Muffin 2 3.75 7.50
3 Chai Latte 1 5.00 5.00
4 Iced Americano 1 4.50 4.50
No loops. No type conversion. pandas multiplied the entire quantity column by the entire unit_price column, element by element, in a single operation. Vectorized thinking in action.
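To see that the vectorized expression really does match the loop, here's a minimal, self-contained sketch (the three rows are invented for illustration, not Marcus's data):

```python
import pandas as pd

# A tiny stand-in for Marcus's data (values invented for illustration)
df = pd.DataFrame({
    "product": ["Drip Coffee", "Oat Latte", "Blueberry Muffin"],
    "quantity": [1, 1, 2],
    "unit_price": [3.00, 5.50, 3.75],
})

# Vectorized: one expression over entire columns
df["revenue"] = df["quantity"] * df["unit_price"]

# Loop version for comparison (what the Chapter 6 code boils down to)
looped = [q * p for q, p in zip(df["quantity"], df["unit_price"])]

print(df["revenue"].tolist())            # [3.0, 5.5, 7.5]
print(df["revenue"].tolist() == looped)  # True
```

Same numbers either way; the difference is that pandas does the element-by-element work in optimized compiled code.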
Step 3: The Big Picture (2 minutes)
Marcus starts with the overview numbers:
print(f"Total revenue: ${df['revenue'].sum():,.2f}")
print(f"Total items sold: {df['quantity'].sum():,}")
print(f"Average transaction revenue: ${df['revenue'].mean():.2f}")
print(f"Average transaction quantity: {df['quantity'].mean():.1f}")
Total revenue: $52,847.50
Total items sold: 11,245
Average transaction revenue: $14.48
Average transaction quantity: 3.1
About $53,000 in annual revenue from this location. Not bad for a small shop. But Marcus isn't here for the totals — he wants to understand the composition.
Step 4: Revenue by Category (1 minute)
category_summary = (df.groupby("category")["revenue"]
.agg(["sum", "mean", "count"])
.sort_values("sum", ascending=False))
category_summary["pct_of_total"] = (category_summary["sum"] /
category_summary["sum"].sum() * 100)
print(category_summary.round(2))
sum mean count pct_of_total
category
Coffee 28540.50 16.85 1694 54.0
Food 13450.25 11.72 1147 25.4
Tea 7256.75 12.95 560 13.7
Other 3600.00 9.47 380 6.8
Coffee dominates at 54% of revenue. Food is a solid 25%. Tea contributes 14%, and "Other" (bottled water, juice, merchandise) is the smallest at 7%.
Marcus writes a note: "Coffee is the engine. But food has 1,147 transactions — almost as many as coffee. Worth investigating whether food items are underpriced."
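One quick first pass at the "underpriced food" question is the average sticker price per category. A sketch on invented mini-data (these prices are placeholders, not Marcus's figures):

```python
import pandas as pd

# Hypothetical mini-dataset in the shape of Marcus's CSV
df = pd.DataFrame({
    "category":   ["Coffee", "Coffee", "Food", "Food", "Tea"],
    "unit_price": [5.50, 3.00, 3.75, 7.75, 5.00],
})

# Average unit price per category: a starting point for
# deciding whether one category is priced too low
avg_price = df.groupby("category")["unit_price"].mean()
print(avg_price)
```

On the real data, comparing this against per-category cost data (which Marcus doesn't have yet) would settle the question.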
Step 5: Revenue by Product (2 minutes)
Now for the question Marcus really cares about: which products earn their spot on the menu?
product_summary = (df.groupby("product")["revenue"]
.sum()
.sort_values(ascending=False))
print(product_summary)
product
Oat Latte             7562.50
Drip Coffee           6480.00
Iced Americano        5418.00
Blueberry Muffin      4237.50
Chai Latte            3850.00
Breakfast Sandwich    3510.00
Cappuccino            3240.00
Croissant             2887.50
Earl Grey             1950.00
Matcha Latte          1456.75
Green Smoothie        1350.00
Scone                 1275.25
Bottled Water         1200.00
Hot Chocolate         1080.00
Sparkling Water        900.00
Name: revenue, dtype: float64
The top three products — Oat Latte, Drip Coffee, and Iced Americano — account for nearly $19,500, or about 37% of total revenue. At the bottom, Sparkling Water and Hot Chocolate each contribute less than $1,100 for the entire year.
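The top-three figure is easy to verify with `nlargest`. This sketch reuses the revenue figures printed above (only the top four products shown) and the $52,847.50 total from Step 3:

```python
import pandas as pd

# Per-product revenues copied from the output above (top four only)
rev = pd.Series({
    "Oat Latte": 7562.50,
    "Drip Coffee": 6480.00,
    "Iced Americano": 5418.00,
    "Blueberry Muffin": 4237.50,
})

top3 = rev.nlargest(3).sum()
share = top3 / 52847.50 * 100   # total revenue from Step 3
print(f"Top 3: ${top3:,.2f} ({share:.1f}% of revenue)")
# Top 3: $19,460.50 (36.8% of revenue)
```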
But raw revenue doesn't tell the whole story. A product might have low revenue because it's sold rarely, or because it's cheap. Marcus digs deeper:
product_detail = df.groupby("product").agg(
total_revenue=("revenue", "sum"),
total_quantity=("quantity", "sum"),
avg_price=("unit_price", "mean"),
num_transactions=("revenue", "count")
).sort_values("total_revenue", ascending=False)
print(product_detail.round(2))
total_revenue total_quantity avg_price num_transactions
product
Oat Latte 7562.50 1375 5.50 520
Drip Coffee 6480.00 2160 3.00 685
Iced Americano 5418.00 1204 4.50 412
Blueberry Muffin 4237.50 1130 3.75 395
Chai Latte 3850.00 770 5.00 280
Breakfast Sandwich 3510.00 450 7.80 185
Cappuccino 3240.00 648 5.00 245
Croissant 2887.50 825 3.50 310
Earl Grey 1950.00 390 5.00 155
Matcha Latte 1456.75 265 5.50 98
Green Smoothie 1350.00 225 6.00 90
Scone 1275.25 365 3.50 120
Bottled Water 1200.00 600 2.00 215
Hot Chocolate 1080.00 216 5.00 85
Sparkling Water 900.00 450 2.00 165
Now the picture gets interesting. Drip Coffee has the most transactions (685) and the highest quantity sold (2,160 items) but ranks second in revenue because it's the cheapest at $3.00. The Breakfast Sandwich is high-priced ($7.80) but has relatively few transactions (185) — it's a niche item with good margins.
At the bottom: Hot Chocolate has only 85 transactions all year. That's less than two per week. Matcha Latte has 98. These are the products Marcus should scrutinize.
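The "fewer than two sales a week" observation can be turned into a reusable filter with boolean indexing: two per week over a 52-week year is 104 transactions. A sketch using the transaction counts from the table above (only four products shown):

```python
import pandas as pd

# Transaction counts copied from the product_detail table above
counts = pd.Series(
    {"Drip Coffee": 685, "Oat Latte": 520,
     "Matcha Latte": 98, "Hot Chocolate": 85},
    name="num_transactions",
)

# Fewer than 2 sales/week over a 52-week year = under 104 transactions
slow_movers = counts[counts < 104].sort_values()
print(slow_movers)
```

On the full dataset, the same one-liner would flag any product that falls below the weekly-pace threshold.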
Step 6: Time-of-Day Patterns (2 minutes)
time_summary = df.groupby("time_of_day")["revenue"].agg(["sum", "count"]).round(2)
time_summary["avg_per_transaction"] = (time_summary["sum"] / time_summary["count"]).round(2)
print(time_summary)
sum count avg_per_transaction
time_of_day
Afternoon 15425.50 1150 13.41
Evening 6847.00 510 13.43
Morning 30575.00 1990 15.36
Mornings account for 58% of revenue. Afternoons are solid at 29%. Evenings are the weakest — only 13% of revenue with 510 transactions. Marcus already knew mornings were busy, but now he has the numbers.
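Those percentage shares fall out of a one-line normalization. This sketch reuses the revenue figures from the output above:

```python
import pandas as pd

# Revenue by time of day, copied from the time_summary output above
time_revenue = pd.Series(
    {"Morning": 30575.00, "Afternoon": 15425.50, "Evening": 6847.00}
)

# Each period's share of total revenue, as a percentage
share = (time_revenue / time_revenue.sum() * 100).round(1)
print(share)
```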
He combines time and category to see what sells when:
time_category = (df.groupby(["time_of_day", "category"])["revenue"]
.sum()
.round(2))
print(time_category)
time_of_day category
Afternoon    Coffee      6850.00
             Food        4250.25
             Other       1575.00
             Tea         2750.25
Evening      Coffee      3240.50
             Food        1950.00
             Other        825.00
             Tea          831.50
Morning      Coffee     18450.00
             Food        7250.00
             Other       1200.00
             Tea         3675.00
Name: revenue, dtype: float64
Morning coffee ($18,450) is the backbone of the business. Evening tea ($831.50) is almost negligible. This kind of cross-tabulation would have required deeply nested loops in Chapter 6.
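The stacked Series above can also be laid out as a time-of-day by category grid with `unstack()`, which pivots the inner index level into columns. A sketch on invented mini-transactions (not Marcus's numbers):

```python
import pandas as pd

# Invented mini-transactions for illustration
df = pd.DataFrame({
    "time_of_day": ["Morning", "Morning", "Afternoon", "Evening"],
    "category":    ["Coffee", "Food", "Coffee", "Tea"],
    "revenue":     [5.50, 3.75, 4.50, 5.00],
})

# Same two-level groupby as above; unstack() pivots the inner
# index level (category) into columns, giving a readable grid
grid = (df.groupby(["time_of_day", "category"])["revenue"]
          .sum()
          .unstack(fill_value=0))
print(grid)
```

The `fill_value=0` fills combinations with no sales (e.g. evening coffee in this toy data) instead of leaving `NaN`.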
Step 7: The Decision (5 minutes of thinking, 5 lines of code)
Marcus creates a final summary to guide his decision. He wants to flag products that contribute less than 3% of total revenue:
total_rev = df["revenue"].sum()
product_pct = (df.groupby("product")["revenue"].sum() / total_rev * 100).round(1)
low_performers = product_pct[product_pct < 3].sort_values()
print("Products below 3% of total revenue:")
print(low_performers)
Products below 3% of total revenue:
product
Sparkling Water 1.7
Hot Chocolate 2.0
Scone 2.4
Green Smoothie 2.6
Matcha Latte 2.8
Name: revenue, dtype: float64
Five products contribute less than 3% each. Together, they account for about 11.5% of revenue. Cutting any single one wouldn't be catastrophic — but Marcus isn't ready to cut blindly. He notices that the Scone and Green Smoothie might serve strategic purposes (variety, health-conscious customers). Hot Chocolate is seasonal — it's probably stronger in winter. Sparkling Water has low revenue but requires no prep labor.
Marcus adds a final note to his notebook:
Recommendation: Don't cut any products yet. But monitor Hot Chocolate and Matcha Latte over the next quarter — if they stay below 2% of revenue during their peak season (winter for HC, spring for ML), they're candidates for removal. Consider a price increase for Breakfast Sandwich ($7.80 is already the highest, but 185 transactions with likely high margins suggests demand is inelastic).
Next steps: Need to analyze seasonal patterns (Chapter 11 for date handling) and profit margins (need cost data, not just revenue).
The Final Count
Marcus checks the clock. Forty-five minutes. He reviews his notebook: it has a complete revenue analysis, category breakdowns, time-of-day patterns, product rankings, and a preliminary business recommendation. The entire analysis is 20 lines of pandas code plus Markdown narration.
In Chapter 6, using pure Python, he managed to compute mean prices by category in an evening of work. In Chapter 7, he went from raw CSV to business decision in under an hour. The code was shorter, yes — but more importantly, his mental energy went to thinking about his business instead of wrestling with Python loops.
That's the real gift of pandas: not fewer keystrokes, but more brain space for the questions that matter.
What Marcus Used from Chapter 7
Let's map Marcus's analysis back to the concepts from the chapter:
| Concept | How Marcus Used It |
|---|---|
| `pd.read_csv()` | Loaded the CSV file in one line |
| `df.info()` | Checked data types and missing values |
| Vectorized operation | Created the revenue column: `df["quantity"] * df["unit_price"]` |
| Boolean indexing | Filtered for products below the 3% threshold |
| `sort_values()` | Ranked products by revenue |
| `groupby` + aggregation | Revenue by category, by product, by time of day |
| Method chaining | Combined grouping, aggregation, and sorting in flowing expressions |
| Markdown narration | Documented findings and business reasoning alongside code |
Every one of these tools came from Chapter 7. And every one of them will get even more powerful in the chapters ahead.
Try It Yourself
Think of a small dataset from your own life or work — gym attendance, monthly expenses, hours studied per course, playlist listening history. Export it as a CSV (or create one by hand) and run through the same workflow Marcus did:
- Load and inspect with `info()`, `head()`, `describe()`
- Create a computed column if applicable
- Group by a category and compute summary statistics
- Sort to find the top and bottom performers
- Write a one-paragraph interpretation of what you found
You'll be surprised how much insight 20 lines can produce.
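If you don't have a CSV handy, you can make one by hand. A minimal end-to-end skeleton (the column names and values here are placeholders — swap in your own data):

```python
import io
import pandas as pd

# A hand-made CSV; io.StringIO lets read_csv treat a string as a file
csv_text = """category,hours
gym,1.0
gym,1.5
study,2.0
"""

df = pd.read_csv(io.StringIO(csv_text))                         # 1. load
print(df.head())                                                # 2. inspect
summary = df.groupby("category")["hours"].agg(["sum", "mean"])  # 3. group
ranked = summary.sort_values("sum", ascending=False)            # 4. rank
print(ranked)
```

Replace the string with `pd.read_csv("your_file.csv")` once you have real data, and the rest of the workflow stays the same.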