Case Study 1: From Loops to Vectors — Rebuilding the Nutrition Analysis in pandas
Tier 3 — Illustrative/Composite Example: This case study revisits the BurgerBarn nutrition dataset from Chapter 6's Case Study 1. Amara, the nutrition science student, repeats her analysis using pandas and discovers how much faster and more expressive her work becomes. The restaurant, menu items, and all numerical values are fictional, created for pedagogical purposes.
The Setting
Amara stares at her Chapter 6 notebook. It's good work. She's proud of it. Her analysis of the BurgerBarn nutrition data — 120 menu items, 8 columns, three research questions — earned full marks from her professor.
But now, three weeks into Part II of her data science course, she's learned pandas. And she can't stop thinking about all those loops.
Her original analysis was 85 lines of Python. She had a get_numeric_values() function. She had a compute_mean() function. She had a compute_median() function. She had nested loops for computing statistics by category. She had manual string-to-float conversions with try/except blocks scattered everywhere.
It worked. But it was heavy.
Her assignment this week is simple: redo the analysis in pandas. Same dataset, same questions, new tools. Let's watch what happens.
Loading the Data: Then and Now
Chapter 6 version (8 lines + type issues):
import csv
data = []
with open("fastfood_nutrition.csv", "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        data.append(row)
print(f"Loaded {len(data)} menu items")
print(f"Columns: {list(data[0].keys())}")
Everything was a string. Amara remembers the frustration of writing float(row["calories"]) dozens of times and catching ValueError when a cell was empty.
Chapter 7 version (3 lines + instant insight):
import pandas as pd
df = pd.read_csv("fastfood_nutrition.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   item_name       120 non-null    object
 1   category        120 non-null    object
 2   calories        118 non-null    float64
 3   total_fat_g     117 non-null    float64
 4   sodium_mg       119 non-null    float64
 5   protein_g       120 non-null    float64
 6   sugar_g         115 non-null    float64
 7   serving_size_g  120 non-null    float64
dtypes: float64(6), object(2)
memory usage: 7.6+ KB
Two things jump out immediately. First, pandas auto-detected that six columns are numeric and two are text — no manual conversion needed. Second, the non-null counts reveal missing data at a glance: sugar_g has only 115 non-null values (5 missing), total_fat_g has 117 (3 missing), and calories has 118 (2 missing).
In Chapter 6, Amara had to write a function that looped through every column and counted empty strings to discover this information. Now it's one method call.
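A minimal sketch of what that auto-detection buys (using a hypothetical three-row stand-in, since the BurgerBarn CSV is fictional): read_csv turns blank cells into NaN and keeps the column numeric, so none of Chapter 6's try/except machinery is needed.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the fictional BurgerBarn CSV
csv_text = """item_name,category,calories
Classic Burger,Burgers,540
Side Salad,Salads,
Small Soda,Drinks,150
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df["calories"].dtype)           # float64, despite the blank cell
print(df["calories"].isnull().sum())  # the blank became NaN: 1 missing
```

The blank cell silently becomes NaN, which every pandas statistic (mean, median, describe) skips by default.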
Question 1: Calorie Distribution Across the Menu
Amara's first question was: What does the calorie distribution look like across the menu?
Chapter 6 approach (15+ lines):
def get_numeric_values(data, column):
    values = []
    skipped = 0
    for row in data:
        raw = row[column].strip()
        if raw == "":
            skipped += 1
            continue
        try:
            values.append(float(raw))
        except ValueError:
            skipped += 1
    return values
cal_values = get_numeric_values(data, "calories")
mean_cal = sum(cal_values) / len(cal_values)
sorted_cal = sorted(cal_values)
n = len(sorted_cal)
# For an even count, the median is the average of the two middle values
if n % 2:
    median_cal = sorted_cal[n // 2]
else:
    median_cal = (sorted_cal[n // 2 - 1] + sorted_cal[n // 2]) / 2
print(f"Count: {len(cal_values)}")
print(f"Mean: {mean_cal:.0f} calories")
print(f"Min: {min(cal_values):.0f}")
print(f"Max: {max(cal_values):.0f}")
print(f"Median: {median_cal:.0f}")
Chapter 7 approach (1 line):
print(df["calories"].describe())
count     118.000000
mean      487.330508
std       263.152847
min        45.000000
25%       280.000000
50%       445.000000
75%       660.000000
max      1240.000000
Name: calories, dtype: float64
One line. And it gives her more information than her Chapter 6 version — she now has standard deviation and quartiles, which she'd have needed additional functions to compute.
Amara writes her interpretation in a Markdown cell:
The mean calorie count is 487, but the median is 445 — the right skew tells me that a few high-calorie items (up to 1,240!) are pulling the average up. The 25th percentile is 280, suggesting that about a quarter of the menu has relatively modest calorie counts.
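Her skew reading can be checked directly rather than eyeballed from mean and median: pandas exposes median() and skew() as one-liners. A sketch on an invented right-skewed sample (not the actual dataset):

```python
import pandas as pd

# Hypothetical right-skewed calorie sample; values invented for illustration
cal = pd.Series([150, 280, 310, 445, 460, 520, 660, 1240])

print(cal.mean())    # pulled upward by the 1240-calorie item
print(cal.median())  # resistant to that outlier
print(cal.skew())    # positive value confirms the right skew
```

When the mean sits well above the median and skew() is positive, the "a few big items pull the average up" story holds.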
Question 2: Category Comparison
How do different menu categories compare nutritionally?
Chapter 6 approach (18+ lines):
category_calories = {}
for row in data:
    cat = row["category"]
    raw = row["calories"].strip()
    if raw == "":
        continue
    try:
        cal = float(raw)
    except ValueError:
        continue
    if cat not in category_calories:
        category_calories[cat] = []
    category_calories[cat].append(cal)

for cat in sorted(category_calories):
    values = category_calories[cat]
    mean_val = sum(values) / len(values)
    print(f"  {cat}: mean = {mean_val:.0f} cal (n={len(values)})")
This was one of the most tedious parts of her Chapter 6 analysis. The nested logic — loop through rows, check the category, convert the value, handle errors, append to the right list — was functional but exhausting.
Chapter 7 approach (1 line):
print(df.groupby("category")["calories"].describe().round(0))
           count   mean    std    min    25%    50%    75%     max
category
Breakfast   19.0  425.0  168.0  180.0  310.0  395.0  530.0   780.0
Burgers     18.0  720.0  195.0  430.0  570.0  695.0  850.0  1240.0
Chicken     15.0  535.0  145.0  290.0  420.0  510.0  650.0   820.0
Desserts    14.0  390.0  185.0   95.0  250.0  365.0  510.0   750.0
Drinks      22.0  195.0  155.0   45.0   80.0  150.0  280.0   620.0
Salads      12.0  345.0  120.0  165.0  260.0  330.0  415.0   580.0
Sides       18.0  420.0  175.0  120.0  290.0  400.0  545.0   780.0
Every category, every statistic, one line. Amara can now see at a glance that burgers are the highest-calorie category (mean 720, max 1,240) while drinks are the lowest (mean 195). Salads average 345 calories — less than half the burger mean, yet only modestly below desserts at 390 and sides at 420.
She adds this insight:
The "salads are healthy" narrative deserves scrutiny. At a mean of 345 calories, salads are lower than burgers (720) but not far from sides (420). Some salads hit 580 calories — more than the average chicken item. Dressings and toppings likely drive this.
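When the full describe() table is more detail than a question needs, the same groupby pipeline collapses into a ranking. A sketch on an invented mini-menu (categories and values are hypothetical, not the case-study data):

```python
import pandas as pd

# Hypothetical mini-menu; values invented for illustration
df = pd.DataFrame({
    "category": ["Burgers", "Burgers", "Drinks", "Drinks", "Salads", "Salads"],
    "calories": [720, 850, 80, 310, 330, 360],
})

# Group, average, and sort in one expression: highest-calorie category first
ranking = df.groupby("category")["calories"].mean().sort_values(ascending=False)
print(ranking)
```

Chaining mean() with sort_values() is the vectorized equivalent of the Chapter 6 pattern "build dict of lists, loop, compute, sort keys" in a single expression.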
Question 3: Spotting Data Quality Issues
Chapter 6 approach: Required multiple loops to count missing values per column, range checks, and consistency checks.
Chapter 7 approach:
# Missing values — one line
print(df.isnull().sum())
item_name         0
category          0
calories          2
total_fat_g       3
sodium_mg         1
protein_g         0
sugar_g           5
serving_size_g    0
dtype: int64
# Range checks — are any values suspicious?
print(df.describe().loc[["min", "max"]])
     calories  total_fat_g  sodium_mg  protein_g  sugar_g  serving_size_g
min      45.0          0.5       35.0        1.0      0.0            30.0
max    1240.0         72.0     2850.0       58.0     85.0           550.0
# Sodium outlier?
high_sodium = df[df["sodium_mg"] > 2000]
print(high_sodium[["item_name", "category", "sodium_mg"]])
        item_name category  sodium_mg
47  Loaded Nachos    Sides     2850.0
Amara spots it in seconds: one item with 2,850 mg of sodium. In Chapter 6, finding this required extracting numeric values, sorting them, and examining the tails manually. In pandas, a boolean filter pinpoints it immediately.
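The 2,000 mg cutoff above was chosen by eye. A distribution-based alternative is Tukey's 1.5×IQR fence, which needs no hand-picked threshold; here is a sketch on invented sodium values (not the case-study data) with one planted outlier:

```python
import pandas as pd

# Hypothetical sodium values (mg) with one planted outlier
sodium = pd.Series([350, 480, 620, 700, 810, 905, 1020, 2850])

q1, q3 = sodium.quantile(0.25), sodium.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)  # Tukey's rule for high outliers

print(sodium[sodium > upper_fence])  # only the 2850 mg item clears the fence
```

The same boolean-filter idiom applies; only the threshold is now computed from the data itself.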
The Scoreboard
Amara tallies the comparison:
| Operation | Chapter 6 Lines | Chapter 7 Lines |
|---|---|---|
| Load data + first look | 12 | 3 |
| Summary stats (one column) | 15 | 1 |
| Stats by category | 18 | 1 |
| Missing value audit | 10 | 1 |
| Outlier detection | 8 | 2 |
| Total | ~63 | ~8 |
The pandas version isn't just shorter — it's more informative. Every describe() call gives her standard deviation and quartiles that she didn't have in Chapter 6. Every groupby gives her counts alongside the statistics. Every isnull().sum() gives her a complete missing data picture in one shot.
The Lesson
Amara writes a final reflection cell in her notebook:
What I learned from redoing this analysis in pandas:
The Chapter 6 version taught me what data analysis is — the questions, the workflow, the importance of checking data quality. I don't regret doing it manually. Those loops taught me what "computing a mean" actually means, step by step.
But the pandas version taught me what data analysis can feel like — fluid, expressive, focused on the questions rather than the plumbing. I spent my Chapter 6 time wrestling with type conversion and loop mechanics. I spent my Chapter 7 time actually thinking about nutrition.
The biggest shift isn't the code — it's my brain. I'm starting to think "what do I want to compute?" instead of "how do I loop through this?" That's what the textbook calls "thinking in vectors," and I think I'm starting to get it.
Her professor writes in the margin: "This is exactly the transition Part II is about. Now imagine doing this with a million rows instead of 120. That's where pandas really shines."
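The loop-to-vector shift in that reflection fits in a few lines: the same sum written both ways, on invented values.

```python
import pandas as pd

calories = pd.Series([540, 150, 330, 720])  # hypothetical values

# Loop thinking: visit each element and accumulate by hand
total = 0
for c in calories:
    total += c

# Vector thinking: state the result you want, not the iteration
assert total == calories.sum()
print(calories.sum())  # 1740
```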
Try It Yourself
Revisit any analysis you did in Chapter 6 — whether it was the exercises, the case studies, or the project checkpoint. Redo it in pandas. Time yourself both ways. The difference won't just be in the line count. It'll be in how much mental energy you have left for the interesting part: the questions.