Case Study: Exploring Public Health Data with pandas — Dr. Chen's Flu Surveillance

The Setup

It's mid-January, and Dr. Maya Chen's phone won't stop buzzing. Three hospitals in the county have reported a spike in flu-related ER visits over the past two weeks. The county health director wants answers by Friday: Is this a real surge or just normal seasonal fluctuation? Which communities are hardest hit? Are we seeing anything unusual compared to previous years?

Maya has access to the county's flu surveillance dataset — a simplified version of the kind of data collected by the CDC's Influenza Surveillance System. The dataset contains records for 800 flu cases reported over the past flu season, with each row representing one confirmed flu case.

Her first step, before any fancy analysis, is to look at the data. Let's follow along.

Loading and Exploring the Data

import pandas as pd

url = "https://raw.githubusercontent.com/intro-stats-data/datasets/main/flu_surveillance_800.csv"
flu = pd.read_csv(url)
print(flu.shape)
(800, 10)

800 cases, 10 variables. Let's see what we're working with:

flu.head()
   case_id  age sex  zip_code  strain  hospitalized  days_to_er  vaccinated  recovery_days  report_week
0    F0001   45   F     90210       A             0         3.0           1            7.0           42
1    F0002   72   M     90045       B             1         1.0           0           18.0           42
2    F0003    8   M     90210       A             0         2.0           1            5.0           43
3    F0004   67   F     90301       B             1         2.0           0           21.0           44
4    F0005   34   F     90045       A             0         4.0           1            8.0           43

flu.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   case_id        800 non-null    object
 1   age            800 non-null    int64
 2   sex            800 non-null    object
 3   zip_code       800 non-null    int64
 4   strain         800 non-null    object
 5   hospitalized   800 non-null    int64
 6   days_to_er     782 non-null    float64
 7   vaccinated     800 non-null    int64
 8   recovery_days  768 non-null    float64
 9   report_week    800 non-null    int64
dtypes: float64(2), int64(5), object(3)
memory usage: 62.6+ KB

Maya immediately notices two things:

  1. zip_code is stored as int64 — pandas thinks it's a number, but it's actually a nominal categorical variable. You can't calculate the "average zip code" (as we saw in Chapter 2's case study on electronic health records). Maya makes a mental note not to include it in any numerical summaries.

  2. Missing values. days_to_er has 782 non-null values (18 missing) and recovery_days has 768 (32 missing). Missing recovery data might mean patients were lost to follow-up — they never reported back. Are the missing cases random, or are they disproportionately severe cases (hospitalized patients who had longer, more complicated recoveries)? This is a question Maya will need to investigate, but for now, she flags it.
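
One way Maya might start on the missingness question is to compare how often recovery_days is missing among hospitalized and non-hospitalized cases. The sketch below is only an illustration: `toy` is a made-up eight-row stand-in for the flu data, not the real dataset, and only the column names are taken from the case study.

```python
import numpy as np
import pandas as pd

# Made-up miniature stand-in for the flu data (values are illustrative only).
toy = pd.DataFrame({
    "hospitalized":  [0, 0, 0, 0, 1, 1, 1, 1],
    "recovery_days": [7, 5, 8, np.nan, 18, np.nan, np.nan, 21],
})

# Number and share of missing values in each column.
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()

# Share of missing recovery_days within each hospitalization group.
# The mean of a True/False column is the proportion of True values.
missing_by_group = toy["recovery_days"].isna().groupby(toy["hospitalized"]).mean()
print(missing_by_group)
```

If the missing share were clearly higher in one group, that would suggest the values are not missing at random, and any summary of recovery_days would need to be read with that in mind.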

Asking the Key Questions

Question 1: Is the surge real?

The health director wants to know if recent weeks are unusually busy. Maya looks at case counts by report week:

flu['report_week'].value_counts().sort_index()
40     28
41     35
42     52
43     78
44    105
45    132
46    148
47    112
48     65
49     32
50     13
Name: report_week, dtype: int64

The pattern is clear: cases rose steadily from 28 in week 40 to a peak of 148 in week 46, then declined. Weeks 44-47, each with more than 100 cases, are the high-activity period. If the current week is 46 or 47, the surge is real, but it may already be peaking.
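
The peak and the week-over-week changes can also be read off programmatically rather than by eye. This is a minimal sketch that retypes the weekly counts printed above into a Series; the name `weekly` is ours, not part of the dataset.

```python
import pandas as pd

# Weekly case counts copied from the value_counts() output above.
weekly = pd.Series(
    [28, 35, 52, 78, 105, 132, 148, 112, 65, 32, 13],
    index=range(40, 51),
    name="cases",
)

peak_week = weekly.idxmax()   # report week with the most cases
peak_cases = weekly.max()     # number of cases in that week
change = weekly.diff()        # change from the previous week (NaN for week 40)

print(f"Peak: week {peak_week} with {peak_cases} cases")
print(change)
```

Positive values in `change` mark weeks where cases were still climbing; the first negative value (week 47) marks the turn.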

Question 2: Which communities are hardest hit?

flu['zip_code'].value_counts().head(5)
90045    186
90301    172
90210    158
90401    142
90502    103
Name: zip_code, dtype: int64

Zip code 90045 has the most cases (186), followed by 90301 (172). But raw counts can be misleading — Maya knows she needs to compare these to population sizes. A zip code with 100,000 residents and 186 cases has a very different story than one with 20,000 residents and 186 cases. The rate matters more than the count.
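
To make the count-versus-rate point concrete, here is the arithmetic for the two hypothetical zip codes just described. The populations are the illustrative figures from the paragraph above, not real census numbers, and `rate_per_10k` is a helper name we made up for this sketch.

```python
# Two hypothetical zip codes with the same case count but different populations.
cases = 186
population_large = 100_000   # assumed population, for illustration only
population_small = 20_000    # assumed population, for illustration only

def rate_per_10k(case_count, population):
    """Return cases per 10,000 residents."""
    return case_count / population * 10_000

rate_large = rate_per_10k(cases, population_large)
rate_small = rate_per_10k(cases, population_small)

print(round(rate_large, 1), round(rate_small, 1))
```

The same 186 cases work out to about 18.6 per 10,000 residents in the larger community and about 93 per 10,000 in the smaller one, a fivefold difference that the raw counts hide.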

For now, she can at least check whether the hardest-hit zip codes also have higher hospitalization rates:

flu.groupby('zip_code')['hospitalized'].mean().round(3).sort_values(ascending=False)
zip_code
90301    0.244
90045    0.220
90502    0.204
90401    0.183
90210    0.152
Name: hospitalized, dtype: float64

Zip code 90301 has the highest hospitalization rate (24.4%), while 90210 has the lowest (15.2%), a gap of about 9 percentage points. Could this reflect differences in age distribution, access to healthcare, or vaccination rates across these communities? Maya doesn't have enough information to answer that yet, but these are the questions that will guide her deeper analysis.

Question 3: Who's getting hospitalized?

flu.groupby('hospitalized')['age'].describe().round(1)
              count   mean    std    min    25%    50%    75%    max
hospitalized
0             638.0   38.2   18.7    2.0   23.0   36.0   51.0   88.0
1             162.0   61.4   17.3    5.0   49.0   64.0   74.0   92.0

This is striking. Hospitalized patients have a mean age of 61.4 years, compared to 38.2 for non-hospitalized patients. The median ages (64 vs. 36) tell an even clearer story. Older patients are dramatically more likely to be hospitalized — consistent with what we know about flu severity in elderly populations.
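
The table above describes age within each hospitalization group. A complementary view flips the question and asks what share of each age band was hospitalized. The sketch below shows one way to do that with pd.cut; `toy` is a made-up twelve-row stand-in for the flu data, and the age cut points are an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up miniature stand-in for the flu data (illustrative values only).
toy = pd.DataFrame({
    "age":          [8, 15, 23, 34, 36, 45, 51, 64, 67, 72, 74, 88],
    "hospitalized": [0,  0,  0,  0,  0,  0,  1,  0,  1,  1,  1,  1],
})

# Bin ages into bands. Intervals are closed on the right, so 17 falls in "0-17".
bands = pd.cut(
    toy["age"],
    bins=[0, 17, 49, 64, 120],
    labels=["0-17", "18-49", "50-64", "65+"],
)

# The mean of a 0/1 column within each band is that band's hospitalization rate.
rate_by_band = toy.groupby(bands, observed=True)["hospitalized"].mean()
print(rate_by_band)
```

On the real data, a rate that climbs from band to band would support the reading that older patients are more likely to be hospitalized.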

Question 4: Does vaccination make a difference?

flu.groupby('vaccinated')['hospitalized'].mean().round(3)
vaccinated
0    0.268
1    0.133
Name: hospitalized, dtype: float64

Among unvaccinated patients, 26.8% were hospitalized. Among vaccinated patients, only 13.3% were. That's roughly half the rate.

But Maya is careful here. This is observational data — people weren't randomly assigned to be vaccinated or not. Maybe vaccinated people tend to be younger, healthier, or have better access to healthcare in general. The association between vaccination and lower hospitalization is real in this data, but calling it causal would require a controlled study (the kind we'll learn about in Chapter 4).
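
A first, informal check on confounding by age is to repeat the comparison within age groups. The sketch below is an illustration under stated assumptions: `toy` is a made-up twelve-row stand-in deliberately built so that vaccinated patients skew younger, the split at age 65 is an arbitrary choice, and nothing here says what the real data would show.

```python
import pandas as pd

# Made-up miniature stand-in for the flu data (illustrative values only).
toy = pd.DataFrame({
    "age":          [25, 30, 35, 40, 28, 33, 70, 75, 80, 68, 72, 78],
    "vaccinated":   [1,  1,  1,  1,  0,  0,  1,  1,  0,  0,  0,  0],
    "hospitalized": [0,  0,  0,  0,  0,  0,  1,  0,  1,  1,  0,  1],
})

# Crude comparison: hospitalization rate by vaccination status, ignoring age.
crude = toy.groupby("vaccinated")["hospitalized"].mean()

# Stratified comparison: the same calculation within each age group.
toy["age_group"] = pd.cut(toy["age"], bins=[0, 64, 120], labels=["under 65", "65+"])
stratified = (
    toy.groupby(["age_group", "vaccinated"], observed=True)["hospitalized"]
       .mean()
       .unstack("vaccinated")   # rows: age group, columns: vaccinated 0/1
)

print(crude)
print(stratified)
```

In this toy example the crude gap between unvaccinated and vaccinated patients is larger than the gap within either age group, which is the signature of age acting as a confounder. Stratifying narrows the question but does not turn observational data into an experiment.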

What Maya Reports

By Thursday evening, Maya has assembled her initial findings for the health director:

  1. The surge is real but likely past its peak. Cases rose from 28 per week in week 40 (early October) to a peak of 148 in week 46 (mid-November). Counts fell to 112 in week 47 and kept declining through week 50.

  2. Geographic concentration. Five zip codes account for 761 of the 800 cases (about 95%), with 90045 and 90301 having both high case counts and high hospitalization rates.

  3. Age is strongly associated with severity. Hospitalized patients average 61.4 years old vs. 38.2 for non-hospitalized patients.

  4. Vaccination is associated with lower hospitalization. Vaccinated patients had roughly half the hospitalization rate of unvaccinated patients (13.3% vs. 26.8%), though this is an observational association, not a proven causal effect.

  5. Data quality note. About 4% of recovery_days values (32 of 800) and about 2% of days_to_er values (18 of 800) are missing. Further investigation is needed to determine whether this missingness is related to case severity.

What You Should Notice

This case study demonstrates several key ideas from the chapter:

Tools accelerated the analysis. Maya's initial exploration — loading, summarizing, filtering, grouping — took perhaps 30 minutes of coding. Doing this by hand with 800 records would take days.

Chapter 2 knowledge was essential. Recognizing that zip_code is categorical (not numerical), that vaccinated is a 0/1 binary variable whose mean equals a proportion, and that hospitalized is also binary — these classifications guided every analysis decision.
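
Both classification points can be checked in a few lines. This is a minimal sketch on a made-up five-row stand-in (`toy`), not the real flu data: it converts zip_code to text so pandas treats it as a label, and confirms that the mean of a 0/1 column matches the proportion computed by hand.

```python
import pandas as pd

# Made-up miniature stand-in for the flu data (illustrative values only).
toy = pd.DataFrame({
    "zip_code":     [90210, 90045, 90210, 90301, 90045],
    "hospitalized": [0, 1, 0, 1, 0],
})

# Treat zip codes as labels, not quantities, so they stay out of numeric summaries.
toy["zip_code"] = toy["zip_code"].astype(str)
numeric_columns = list(toy.select_dtypes("number").columns)

# For a 0/1 column, the mean is the proportion of 1s.
proportion = toy["hospitalized"].mean()
by_hand = (toy["hospitalized"] == 1).sum() / len(toy)

print(numeric_columns)
print(proportion, by_hand)
```

After the conversion, zip_code no longer appears among the numeric columns, so a call such as describe() will not report a meaningless "average zip code".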

Statistical thinking shaped the interpretation. Maya didn't just report numbers. She flagged the difference between counts and rates, noted that vaccination's effect could be confounded by age, and qualified her conclusions appropriately. That's the statistical thinking from Chapter 1 in action.

The analysis raised more questions than it answered. Are the missing recovery values random? Are hospitalization rates different because of age or because of zip code characteristics? Does vaccination cause lower hospitalization, or is it confounded? Each finding is a starting point, not an endpoint. This is how real data analysis works.

Your Turn

Using the health dataset from the chapter (or your Data Detective Portfolio dataset), try the following:

  1. Pick one categorical variable and one numerical variable
  2. Group by the categorical variable and compute the mean of the numerical variable
  3. Write a one-paragraph interpretation: What did you find? What additional information would you need to draw a stronger conclusion?

This mirrors exactly what Maya did — and it's the kind of analysis you'll do throughout this course, building more sophisticated tools on top of this foundation.