Case Study 1: From Loops to Vectors — Rebuilding the Nutrition Analysis in pandas


Tier 3 — Illustrative/Composite Example: This case study revisits the BurgerBarn nutrition dataset from Chapter 6's Case Study 1. Amara, the nutrition science student, repeats her analysis using pandas and discovers how much faster and more expressive her work becomes. The restaurant, menu items, and all numerical values are fictional, created for pedagogical purposes.


The Setting

Amara stares at her Chapter 6 notebook. It's good work. She's proud of it. Her analysis of the BurgerBarn nutrition data — 120 menu items, 8 columns, three research questions — earned full marks from her professor.

But now, three weeks into Part II of her data science course, she's learned pandas. And she can't stop thinking about all those loops.

Her original analysis was 85 lines of Python. She had a get_numeric_values() function. She had a compute_mean() function. She had a compute_median() function. She had nested loops for computing statistics by category. She had manual string-to-float conversions with try/except blocks scattered everywhere.

It worked. But it was heavy.

Her assignment this week is simple: redo the analysis in pandas. Same dataset, same questions, new tools. Let's watch what happens.

Loading the Data: Then and Now

Chapter 6 version (8 lines + type issues):

import csv

data = []
with open("fastfood_nutrition.csv", "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        data.append(row)

print(f"Loaded {len(data)} menu items")
print(f"Columns: {list(data[0].keys())}")

Everything was a string. Amara remembers the frustration of writing float(row["calories"]) dozens of times and catching ValueError when a cell was empty.

Chapter 7 version (3 lines + instant insight):

import pandas as pd

df = pd.read_csv("fastfood_nutrition.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   item_name       120 non-null    object
 1   category        120 non-null    object
 2   calories        118 non-null    float64
 3   total_fat_g     117 non-null    float64
 4   sodium_mg       119 non-null    float64
 5   protein_g       120 non-null    float64
 6   sugar_g         115 non-null    float64
 7   serving_size_g  120 non-null    float64
dtypes: float64(6), object(2)
memory usage: 7.6+ KB

Two things jump out immediately. First, pandas auto-detected that six columns are numeric and two are text — no manual conversion needed. Second, the non-null counts reveal missing data at a glance: sugar_g has only 115 non-null values (5 missing), total_fat_g has 117 (3 missing), and calories has 118 (2 missing).

In Chapter 6, Amara had to write a function that looped through every column and counted empty strings to discover this information. Now it's one method call.
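The type inference that info() reveals can be sketched with a tiny stand-in table (these three rows and their values are invented for illustration; the real BurgerBarn CSV isn't reproduced here):

```python
import pandas as pd

# Tiny stand-in for the BurgerBarn data, with one missing calorie
# value to mimic the gaps that info() reported.
df = pd.DataFrame({
    "item_name": ["Classic Burger", "Side Salad", "Cola"],
    "category": ["Burgers", "Salads", "Drinks"],
    "calories": [720.0, None, 150.0],
})

print(df.dtypes)               # pandas inferred float64 for calories, object for text
print(df["calories"].count())  # non-null count: 2
```

No float() calls, no try/except: the numeric column arrives already typed, and the missing cell becomes NaN rather than an empty string.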

Question 1: Calorie Distribution Across the Menu

Amara's first question was: What does the calorie distribution look like across the menu?

Chapter 6 approach (15+ lines):

def get_numeric_values(data, column):
    values = []
    skipped = 0
    for row in data:
        raw = row[column].strip()
        if raw == "":
            skipped += 1
            continue
        try:
            values.append(float(raw))
        except ValueError:
            skipped += 1
    return values

cal_values = get_numeric_values(data, "calories")
mean_cal = sum(cal_values) / len(cal_values)
sorted_cal = sorted(cal_values)
mid = len(sorted_cal) // 2
if len(sorted_cal) % 2 == 0:
    # even count: the median is the average of the two middle values
    median_cal = (sorted_cal[mid - 1] + sorted_cal[mid]) / 2
else:
    median_cal = sorted_cal[mid]
print(f"Count: {len(cal_values)}")
print(f"Mean:  {mean_cal:.0f} calories")
print(f"Min:   {min(cal_values):.0f}")
print(f"Max:   {max(cal_values):.0f}")
print(f"Median: {median_cal:.0f}")

Chapter 7 approach (1 line):

print(df["calories"].describe())
count    118.000000
mean     487.330508
std      263.152847
min       45.000000
25%      280.000000
50%      445.000000
75%      660.000000
max     1240.000000
Name: calories, dtype: float64

One line. And it gives her more information than her Chapter 6 version — she now has standard deviation and quartiles, which she'd have needed additional functions to compute.

Amara writes her interpretation in a Markdown cell:

The mean calorie count is 487, but the median is 445 — the right skew tells me that a few high-calorie items (up to 1,240!) are pulling the average up. The 25th percentile is 280, suggesting that about a quarter of the menu has relatively modest calorie counts.
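Her mean-versus-median reasoning can also be checked in code, and pandas quantifies the asymmetry directly with Series.skew(). A sketch using a handful of invented calorie values (not the real data), where one big item pulls the mean up:

```python
import pandas as pd

# Invented stand-in values: one 1,240-calorie item drags the mean rightward
cal = pd.Series([280, 445, 445, 500, 1240], dtype=float)

print(cal.mean() > cal.median())  # True: mean sits right of the median
print(cal.skew())                 # positive value confirms right skew
```

A positive skew statistic says the same thing her mean/median comparison does, in a single number.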

Question 2: Category Comparison

How do different menu categories compare nutritionally?

Chapter 6 approach (18+ lines):

category_calories = {}
for row in data:
    cat = row["category"]
    raw = row["calories"].strip()
    if raw == "":
        continue
    try:
        cal = float(raw)
    except ValueError:
        continue
    if cat not in category_calories:
        category_calories[cat] = []
    category_calories[cat].append(cal)

for cat in sorted(category_calories):
    values = category_calories[cat]
    mean_val = sum(values) / len(values)
    print(f"  {cat}: mean = {mean_val:.0f} cal (n={len(values)})")

This was one of the most tedious parts of her Chapter 6 analysis. The nested logic — loop through rows, check the category, convert the value, handle errors, append to the right list — was functional but exhausting.

Chapter 7 approach (1 line):

print(df.groupby("category")["calories"].describe().round(0))
           count    mean    std    min    25%    50%    75%     max
category
Breakfast   19.0   425.0  168.0  180.0  310.0  395.0  530.0   780.0
Burgers     18.0   720.0  195.0  430.0  570.0  695.0  850.0  1240.0
Chicken     15.0   535.0  145.0  290.0  420.0  510.0  650.0   820.0
Desserts    14.0   390.0  185.0   95.0  250.0  365.0  510.0   750.0
Drinks      22.0   195.0  155.0   45.0   80.0  150.0  280.0   620.0
Salads      12.0   345.0  120.0  165.0  260.0  330.0  415.0   580.0
Sides       18.0   420.0  175.0  120.0  290.0  400.0  545.0   780.0

Every category, every statistic, one line. Amara can now see at a glance that burgers are the highest-calorie category (mean 720, max 1,240) while drinks are the lowest (mean 195). Salads average 345 calories — lower than burgers but not dramatically so, especially when compared to desserts at 390.

She adds this insight:

The "salads are healthy" narrative deserves scrutiny. At a mean of 345 calories, salads are lower than burgers (720) but not far from sides (420). Some salads hit 580 calories — more than the average chicken item. Dressings and toppings likely drive this.
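To rank categories directly rather than eyeballing the describe table, the groupby result can be sorted in the same expression. A sketch with a few invented rows standing in for the full dataset:

```python
import pandas as pd

# Made-up rows for illustration; the real data has 120 items
df = pd.DataFrame({
    "category": ["Burgers", "Burgers", "Drinks", "Salads"],
    "calories": [720.0, 850.0, 150.0, 345.0],
})

# Mean calories per category, highest first
ranked = df.groupby("category")["calories"].mean().sort_values(ascending=False)
print(ranked)  # Burgers at the top, Drinks at the bottom
```

Chaining groupby, an aggregation, and sort_values is a pattern that recurs constantly in pandas work.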

Question 3: Spotting Data Quality Issues

Chapter 6 approach: Required multiple loops to count missing values per column, range checks, and consistency checks.

Chapter 7 approach:

# Missing values — one line
print(df.isnull().sum())
item_name         0
category          0
calories          2
total_fat_g       3
sodium_mg         1
protein_g         0
sugar_g           5
serving_size_g    0
dtype: int64
# Range checks — are any values suspicious?
print(df.describe().loc[["min", "max"]])
     calories  total_fat_g  sodium_mg  protein_g  sugar_g  serving_size_g
min      45.0          0.5       35.0        1.0      0.0            30.0
max    1240.0         72.0     2850.0       58.0     85.0           550.0
# Sodium outlier?
high_sodium = df[df["sodium_mg"] > 2000]
print(high_sodium[["item_name", "category", "sodium_mg"]])
         item_name category  sodium_mg
47  Loaded Nachos    Sides     2850.0

Amara spots it in seconds: one item with 2,850 mg of sodium. In Chapter 6, finding this required extracting numeric values, sorting them, and examining the tails manually. In pandas, a boolean filter pinpoints it immediately.
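The same boolean-filter idea extends to the missing-data audit: one expression pulls out every row with at least one gap. A sketch with made-up rows (the item names and values here are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "item_name": ["Cola", "Brownie", "Side Salad"],
    "sugar_g": [39.0, None, 4.0],
    "sodium_mg": [30.0, 180.0, None],
})

# Every row that has at least one missing field, in one expression
incomplete = df[df.isnull().any(axis=1)]
print(incomplete["item_name"].tolist())  # ['Brownie', 'Side Salad']
```

isnull() gives a DataFrame of booleans, any(axis=1) collapses it to one flag per row, and that flag indexes the original frame.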

The Scoreboard

Amara tallies the comparison:

Operation                     Chapter 6 Lines   Chapter 7 Lines
Load data + first look              12                 3
Summary stats (one column)          15                 1
Stats by category                   18                 1
Missing value audit                 10                 1
Outlier detection                    8                 2
Total                              ~63                ~8

The pandas version isn't just shorter — it's more informative. Every describe() call gives her standard deviation and quartiles that she didn't have in Chapter 6. Every groupby gives her counts alongside the statistics. Every isnull().sum() gives her a complete missing data picture in one shot.
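The same economy applies when deriving new quantities: a single vectorized expression replaces a whole Chapter 6 loop. A minimal sketch with invented rows (cal_per_g is a hypothetical derived column, not part of Amara's original analysis):

```python
import pandas as pd

df = pd.DataFrame({
    "item_name": ["Classic Burger", "Cola"],
    "calories": [720.0, 150.0],
    "serving_size_g": [240.0, 500.0],
})

# One vectorized division computes the ratio for every row at once
df["cal_per_g"] = df["calories"] / df["serving_size_g"]
print(df[["item_name", "cal_per_g"]])
```

No loop, no per-row conversion: the division broadcasts across both columns element by element.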

The Lesson

Amara writes a final reflection cell in her notebook:

What I learned from redoing this analysis in pandas:

The Chapter 6 version taught me what data analysis is — the questions, the workflow, the importance of checking data quality. I don't regret doing it manually. Those loops taught me what "computing a mean" actually means, step by step.

But the pandas version taught me what data analysis can feel like — fluid, expressive, focused on the questions rather than the plumbing. I spent my Chapter 6 time wrestling with type conversion and loop mechanics. I spent my Chapter 7 time actually thinking about nutrition.

The biggest shift isn't the code — it's my brain. I'm starting to think "what do I want to compute?" instead of "how do I loop through this?" That's what the textbook calls "thinking in vectors," and I think I'm starting to get it.

Her professor writes in the margin: "This is exactly the transition Part II is about. Now imagine doing this with a million rows instead of 120. That's where pandas really shines."

Try It Yourself

Revisit any analysis you did in Chapter 6 — whether it was the exercises, the case studies, or the project checkpoint. Redo it in pandas. Time yourself both ways. The difference won't just be in the line count. It'll be in how much mental energy you have left for the interesting part: the questions.
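If you do time yourself, time.perf_counter() makes a quick harness. This sketch compares a loop-style mean against the vectorized one on a million synthetic values (the numbers here are generated, not from any dataset):

```python
import time
import pandas as pd

values = list(range(1_000_000))
s = pd.Series(values, dtype=float)

t0 = time.perf_counter()
loop_mean = sum(values) / len(values)   # the Chapter 6 way
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
vec_mean = s.mean()                     # the Chapter 7 way
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.4f}s  pandas: {t_vec:.4f}s")
print(loop_mean == vec_mean)  # same answer either way
```

Absolute timings vary by machine, so treat the printed numbers as a rough comparison rather than a benchmark.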