Chapter 7 Exercises: Introduction to pandas
How to use these exercises: Work through the sections in order. Parts A-D focus on Chapter 7 material, building from recall to original analysis. Part E applies your skills to a new dataset. Part M mixes in concepts from Chapters 1-6 to reinforce earlier learning. You'll need Python with pandas installed and access to the WHO vaccination CSV for several problems.
Difficulty key: 1-star: Foundational | 2-star: Intermediate | 3-star: Advanced | 4-star: Extension
Part A: Conceptual Understanding (1-star)
These questions check whether you absorbed the core ideas from the chapter. Write clear, concise answers.
Exercise 7.1 — The threshold concept
Explain the threshold concept from this chapter — "thinking in vectors rather than loops" — in your own words. Give a concrete example showing the same operation written as a loop and as a vectorized pandas expression. Why is the vectorized version preferred?
Guidance
A strong answer explains that vectorized operations apply a transformation to an entire column at once, rather than processing one row at a time. Example: converting coverage percentages to decimals. The loop version iterates and appends; the pandas version is `df["decimal"] = df["pct"] / 100`. The vectorized version is preferred for three reasons: speed (optimized C code), readability (intent is clear), and safety (NaN handling is automatic).
Exercise 7.2 — Series vs. DataFrame
A classmate says, "A Series and a DataFrame are basically the same thing." Explain why this statement is wrong. Include at least two concrete differences. Then explain how they are related.
Guidance
Key differences: (1) A Series is one-dimensional (one column), while a DataFrame is two-dimensional (a table of rows and columns). (2) A Series has a single `dtype`, while a DataFrame can have a different `dtype` for each column. (3) Selecting a single column from a DataFrame returns a Series, not a mini-DataFrame. They are related because a DataFrame is essentially a collection of Series that share the same index. Each column of a DataFrame is a Series.
Exercise 7.3 — loc vs. iloc
Without running any code, predict the output of each expression given this DataFrame:
df = pd.DataFrame({
"name": ["Alice", "Bob", "Carol", "Dave"],
"score": [88, 92, 75, 95]
}, index=[10, 20, 30, 40])
1. `df.iloc[0]`
2. `df.loc[10]`
3. `df.iloc[1:3]`
4. `df.loc[10:30]`
5. `df.iloc[4]` — what happens?
Guidance
1. Returns the row with *position* 0 (Alice, 88) — as a Series.
2. Returns the row with *label* 10 (Alice, 88) — same result, but accessed by label.
3. Returns rows at positions 1 and 2 (Bob and Carol) — `iloc` uses exclusive end.
4. Returns rows with labels 10, 20, and 30 (Alice, Bob, Carol) — `loc` uses inclusive end, so this returns three rows.
5. Raises `IndexError` because there is no position 4 (only positions 0-3 exist).
Exercise 7.4 — Why NaN matters
In Chapter 6, missing values in CSV data appeared as empty strings (""). In pandas, they appear as NaN. Explain two practical advantages of NaN over empty strings for data analysis.
Guidance
(1) `NaN` propagates safely in arithmetic: `NaN + 5` gives `NaN`, while `"" + 5` raises a `TypeError`. This means you don't need try/except blocks around every calculation. (2) Statistical methods automatically exclude `NaN`: `df["col"].mean()` computes the mean of non-missing values, while with empty strings you'd need to manually filter them out before computing anything.
Exercise 7.5 — Boolean indexing mechanics
Explain step by step what happens when pandas evaluates df[df["score"] > 80]. Your answer should mention boolean masks and how they act as filters.
Guidance
Step 1: `df["score"]` extracts the "score" column as a Series.
Step 2: `> 80` compares every value in the Series to 80, producing a boolean mask (a Series of True/False values).
Step 3: Passing this mask into `df[...]` selects only the rows where the mask is `True`, returning a new DataFrame containing just those rows.
Exercise 7.6 — Method chaining readability
Rewrite the following step-by-step code as a single method chain:
step1 = df[df["year"] == 2022]
step2 = step1.sort_values("coverage_pct", ascending=False)
step3 = step2[["country", "coverage_pct"]]
step4 = step3.head(10)
Then explain whether the chained version or the step-by-step version is better. Is there a clear winner?
Guidance
Chained version:
result = (df[df["year"] == 2022]
.sort_values("coverage_pct", ascending=False)
[["country", "coverage_pct"]]
.head(10))
There's no universal winner. The chained version is more concise and avoids cluttering the namespace with intermediate variables. The step-by-step version is easier to debug (you can inspect each step) and may be clearer for beginners. A good guideline: chain when the pipeline is 3-5 steps and each step is self-explanatory; use intermediate variables when steps are complex or when you need to debug.
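A practical middle ground, sketched below with toy data standing in for the WHO file: wrapping a chain in parentheses lets you comment out individual steps while debugging, keeping most of the chained style's readability.

```python
import pandas as pd

# Toy stand-in for the WHO DataFrame (values are illustrative)
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "year": [2022, 2022, 2022, 2021],
    "coverage_pct": [95, 80, 99, 70],
})

result = (
    df[df["year"] == 2022]
    .sort_values("coverage_pct", ascending=False)
    # [["country", "coverage_pct"]]   # any step can be toggled off while debugging
    .head(10)
)
print(result)
```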
Part B: Code Implementation (2-star)
Write pandas code for each problem. Use import pandas as pd and assume data is available as described.
Exercise 7.7 — Build a DataFrame from scratch
Create a DataFrame called cities with the following data:
| city | country | population_m | continent |
|---|---|---|---|
| Tokyo | Japan | 13.96 | Asia |
| Delhi | India | 11.03 | Asia |
| Shanghai | China | 24.87 | Asia |
| Sao Paulo | Brazil | 12.33 | South America |
| Mumbai | India | 12.48 | Asia |
| Cairo | Egypt | 10.23 | Africa |
Then write code to: (a) display the shape, (b) display only the city and population columns, (c) find the city with the highest population.
Guidance
Create using a dictionary of lists: `pd.DataFrame({"city": [...], ...})`.
(a) `cities.shape` gives `(6, 4)`.
(b) `cities[["city", "population_m"]]`.
(c) `cities.sort_values("population_m", ascending=False).iloc[0]["city"]` or `cities.loc[cities["population_m"].idxmax(), "city"]`.
Exercise 7.8 — Selecting and filtering
Using the WHO vaccination DataFrame (df = pd.read_csv("who_vaccination_data.csv")), write pandas code to answer each question:
- How many records are in the dataset?
- What are the unique vaccine types?
- How many records are from the year 2022?
- Which countries have MCV1 coverage below 50% in any year?
- What is the mean coverage for DTP3 in the SEARO region?
Guidance
1. `len(df)` or `df.shape[0]`
2. `df["vaccine"].unique()`
3. `len(df[df["year"] == 2022])` or `(df["year"] == 2022).sum()`
4. `df[(df["vaccine"] == "MCV1") & (df["coverage_pct"] < 50)]["country"].unique()`
5. `df[(df["vaccine"] == "DTP3") & (df["region"] == "SEARO")]["coverage_pct"].mean()`
Exercise 7.9 — Sorting practice
Given a DataFrame of student exam scores:
students = pd.DataFrame({
"name": ["Aisha", "Ben", "Carla", "Devon", "Elena", "Finn"],
"subject": ["Math", "Science", "Math", "Science", "Math", "Science"],
"score": [92, 87, 78, 95, 88, 91]
})
- Sort by score, highest first.
- Sort by subject (alphabetically), then by score (highest first) within each subject.
- What is the index of the first row after sorting by score? Why isn't it 0?
Guidance
1. `students.sort_values("score", ascending=False)`
2. `students.sort_values(["subject", "score"], ascending=[True, False])`
3. The index is 3 (Devon, score 95) because sorting preserves the original index. Use `.reset_index(drop=True)` if you want a clean 0-based index.
Exercise 7.10 — Creating computed columns
Start with this DataFrame:
orders = pd.DataFrame({
"product": ["Widget", "Gadget", "Widget", "Gizmo", "Gadget"],
"quantity": [10, 5, 8, 3, 12],
"unit_price": [2.50, 15.00, 2.50, 45.00, 15.00]
})
- Add a column `total_price` that multiplies quantity by unit price.
- Add a column `is_bulk` that is `True` if quantity >= 10 and `False` otherwise.
- Add a column `discount_price` that applies a 10% discount to the total price for bulk orders only (others keep the original total).
Guidance
1. `orders["total_price"] = orders["quantity"] * orders["unit_price"]`
2. `orders["is_bulk"] = orders["quantity"] >= 10`
3. Use `apply` or `np.where`. With apply: define a function that checks quantity and returns discounted or regular price, then `orders["discount_price"] = orders.apply(lambda row: row["total_price"] * 0.9 if row["quantity"] >= 10 else row["total_price"], axis=1)`. With `np.where` (assuming `import numpy as np`): `orders["discount_price"] = np.where(orders["is_bulk"], orders["total_price"] * 0.9, orders["total_price"])`.
Exercise 7.11 — The full workflow
Load the WHO vaccination data, then write a single method chain that answers: "What are the 5 countries with the lowest DTP3 coverage in 2022?" Show the country name and coverage percentage.
Guidance
result = (df[(df["vaccine"] == "DTP3") & (df["year"] == 2022)]
.sort_values("coverage_pct")
[["country", "coverage_pct"]]
.head(5))
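An equivalent sketch using `nsmallest`, which folds the sort-then-head steps into a single call (toy data stands in for the WHO file):

```python
import pandas as pd

# Toy stand-in for the WHO DataFrame (values are illustrative)
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "vaccine": ["DTP3"] * 6,
    "year": [2022] * 6,
    "coverage_pct": [95, 40, 88, 60, 30, 75],
})

result = (df[(df["vaccine"] == "DTP3") & (df["year"] == 2022)]
          .nsmallest(5, "coverage_pct")[["country", "coverage_pct"]])
print(result)
```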
Exercise 7.12 — apply() with a custom function
Write a function called classify_region_income that takes a region code and returns an income classification:
- "EURO" and "WPRO" return "High income regions"
- "AMRO" returns "Mixed income"
- "AFRO", "EMRO", "SEARO" return "Lower income regions"
Then use apply() to create a new column income_group in the WHO DataFrame and count how many records fall into each group.
Guidance
def classify_region_income(region):
if region in ["EURO", "WPRO"]:
return "High income regions"
elif region == "AMRO":
return "Mixed income"
else:
return "Lower income regions"
df["income_group"] = df["region"].apply(classify_region_income)
print(df["income_group"].value_counts())
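An alternative sketch using a plain dict with `.map` instead of `apply`: unmapped regions become `NaN`, so `fillna` plays the role of the function's `else` branch (region codes as in the exercise, toy data otherwise):

```python
import pandas as pd

# Small stand-in for the WHO DataFrame's region column
df = pd.DataFrame({"region": ["EURO", "AMRO", "AFRO", "WPRO", "SEARO"]})

income_map = {
    "EURO": "High income regions",
    "WPRO": "High income regions",
    "AMRO": "Mixed income",
}
df["income_group"] = df["region"].map(income_map).fillna("Lower income regions")
print(df["income_group"].value_counts())
```

`.map` with a dict is often faster than `apply` for simple lookups, at the cost of needing the `fillna` fallback.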
Exercise 7.13 — describe() interpretation
Run df.describe() on the WHO vaccination DataFrame. Then answer:
- What does the `count` row tell you, and why might it differ from the total number of rows?
- The `25%` row for `coverage_pct` is around 73. What does this mean in plain English?
- If the mean is lower than the `50%` value (median), what does that tell you about the distribution?
Guidance
1. `count` shows the number of non-null values. It differs from total rows when there are missing values (`NaN`).
2. 25% of all coverage values are at or below 73%. In other words, a quarter of all country-vaccine-year records have coverage of 73% or less.
3. The distribution is left-skewed — pulled down by a tail of low values. Most records have higher coverage, but some very low values bring the mean down below the median.
Exercise 7.14 — Boolean indexing with multiple conditions
Using the WHO data, write boolean indexing expressions for each:
- Records where coverage is between 50% and 80% (inclusive).
- Records from either 2019 or 2022 for the BCG vaccine.
- Records that are NOT from the EURO region.
- Records from AFRO region where the coverage is below the overall mean coverage.
Guidance
1. `df[(df["coverage_pct"] >= 50) & (df["coverage_pct"] <= 80)]`
2. `df[(df["year"].isin([2019, 2022])) & (df["vaccine"] == "BCG")]`
3. `df[df["region"] != "EURO"]` or `df[~(df["region"] == "EURO")]`
4. `mean_cov = df["coverage_pct"].mean()` then `df[(df["region"] == "AFRO") & (df["coverage_pct"] < mean_cov)]`
Part C: Real-World Application (2-star to 3-star)
These exercises use realistic data scenarios. Create the DataFrames as described, then answer the questions.
Exercise 7.15 — Weather station data
Create a DataFrame from this weather data:
weather = pd.DataFrame({
"date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"],
"station": ["Downtown", "Downtown", "Airport", "Airport", "Downtown"],
"temp_high_f": [45, 42, 48, 51, 39],
"temp_low_f": [32, 28, 35, 38, 25],
"precipitation_in": [0.0, 0.5, 0.0, 0.2, 1.1]
})
- Add a column for the temperature range (high minus low).
- Add a column converting the high temperature to Celsius: `(F - 32) * 5/9`.
- Filter for days where precipitation was above zero.
- What was the average high temperature at each station?
Guidance
1. `weather["temp_range"] = weather["temp_high_f"] - weather["temp_low_f"]`
2. `weather["temp_high_c"] = (weather["temp_high_f"] - 32) * 5 / 9`
3. `weather[weather["precipitation_in"] > 0]`
4. `weather.groupby("station")["temp_high_f"].mean()`
Exercise 7.16 — Book sales analysis
A bookstore has this sales data:
books = pd.DataFrame({
"title": ["Python Basics", "Data Science 101", "ML Handbook",
"Python Basics", "Stats Guide", "Data Science 101",
"ML Handbook", "Python Basics", "Stats Guide"],
"month": ["Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Mar", "Mar", "Mar"],
"copies_sold": [150, 89, 42, 175, 63, 112, 55, 201, 78],
"price": [29.99, 45.00, 59.99, 29.99, 35.00, 45.00, 59.99, 29.99, 35.00]
})
- Add a `revenue` column (copies_sold * price).
- Which book generated the most total revenue across all months?
- In which month were the most total copies sold?
- Sort by revenue descending and show the top 3 rows.
Guidance
1. `books["revenue"] = books["copies_sold"] * books["price"]`
2. `books.groupby("title")["revenue"].sum().sort_values(ascending=False).head(1)` — "Python Basics" likely leads.
3. `books.groupby("month")["copies_sold"].sum().sort_values(ascending=False).head(1)`
4. `books.sort_values("revenue", ascending=False).head(3)`
Exercise 7.17 — Vaccination data exploration (3-star)
Using the WHO vaccination DataFrame, answer these questions. For each, write the pandas code and interpret the result in one sentence.
- What is the overall mean and median coverage across all records?
- Which vaccine has the highest mean coverage?
- Has mean MCV1 coverage gone up or down from 2019 to 2022?
- How many countries have at least one record where coverage_pct is below 30%?
- What percentage of all records have missing coverage data?
Guidance
1. `df["coverage_pct"].mean()`, `df["coverage_pct"].median()`
2. `df.groupby("vaccine")["coverage_pct"].mean().sort_values(ascending=False)`
3. `df[df["vaccine"] == "MCV1"].groupby("year")["coverage_pct"].mean()`
4. `df[df["coverage_pct"] < 30]["country"].nunique()`
5. `df["coverage_pct"].isnull().mean() * 100` — multiply by 100 for percentage.
Exercise 7.18 — Comparing pure Python and pandas
Write code to answer this question in both pure Python (Chapter 6 style) and pandas: "What is the maximum DTP3 coverage in the AMRO region?"
Then compare: how many lines does each approach take? Which handles missing values more gracefully?
Guidance
Pure Python requires: loading with csv, looping through rows, checking region == "AMRO" and vaccine == "DTP3", converting strings to floats, handling empty strings, tracking the maximum. Roughly 10-12 lines. pandas: `df[(df["vaccine"] == "DTP3") & (df["region"] == "AMRO")]["coverage_pct"].max()` — 1 line. pandas handles NaN automatically; pure Python requires explicit try/except or if-checks.
Exercise 7.19 — Debugging challenge
Each code snippet has a bug. Identify the error, explain why it occurs, and provide the corrected code.
1. `df["Country"]  # Column is actually "country"`
2. `high = df[df["coverage_pct"] > 90 and df["year"] == 2022]`
3. `df["country", "region"]`
4. `subset = df[df["region"] == "AFRO"]` followed by `subset["new_col"] = 1  # SettingWithCopyWarning`
Guidance
1. **KeyError** — column names are case-sensitive. Fix: `df["country"]`.
2. **ValueError** — use `&` instead of `and`, and wrap conditions in parentheses. Fix: `df[(df["coverage_pct"] > 90) & (df["year"] == 2022)]`.
3. **KeyError** — multiple columns need double brackets. Fix: `df[["country", "region"]]`.
4. **SettingWithCopyWarning** — the subset might be a view. Fix: `subset = df[df["region"] == "AFRO"].copy()` then `subset["new_col"] = 1`.
Exercise 7.20 — Custom summary function
Write a function quick_look(df) that takes any DataFrame and prints:
- Number of rows and columns
- Column names and their data types
- Number of missing values per column (only for columns that have any)
- The first 3 rows
Test it on at least two different DataFrames.
Guidance
def quick_look(df):
print(f"Shape: {df.shape[0]} rows x {df.shape[1]} columns\n")
print("Columns and types:")
for col in df.columns:
print(f" {col}: {df[col].dtype}")
missing = df.isnull().sum()
missing = missing[missing > 0]
if len(missing) > 0:
print("\nMissing values:")
for col, count in missing.items():
print(f" {col}: {count}")
else:
print("\nNo missing values.")
print(f"\nFirst 3 rows:")
print(df.head(3))
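The missing-value step in `quick_look` uses a filtering idiom worth seeing on its own; a minimal sketch with a hand-built DataFrame (column names are illustrative):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "country": ["Kenya", "Peru", "Nepal"],
    "coverage_pct": [85.0, np.nan, 91.0],
    "year": [2022, 2022, 2022],
})

missing = demo.isnull().sum()    # NaN count per column
missing = missing[missing > 0]   # keep only columns with at least one NaN
print(missing)                   # only coverage_pct appears, with count 1
```

Note the pattern: a Series is filtered by a boolean mask built from itself, the same boolean indexing idea from Part A applied to a Series instead of a DataFrame.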
Part D: Synthesis and Analysis (3-star to 4-star)
These exercises require combining multiple concepts. Think before you code.
Exercise 7.21 — Before-and-after comparison essay
Write a 200-word reflection comparing your experience in Chapter 6 (pure Python data analysis) with Chapter 7 (pandas). Address these questions:
- What specific operations became dramatically easier?
- What concepts from Chapters 3-5 helped you understand pandas more quickly?
- Is there anything you think is harder or less clear in pandas than in pure Python?
Guidance
Strong answers mention: type conversion (automatic in pandas), grouped statistics (one line vs. many), filtering (boolean indexing vs. loop-with-if). Understanding from earlier chapters: dictionaries helped understand DataFrames, functions helped understand apply(), list comprehensions helped understand vectorized thinking. Something harder: the SettingWithCopyWarning is confusing, and the difference between loc and iloc requires careful attention.
Exercise 7.22 — Build a mini-analysis pipeline (4-star)
Using the WHO vaccination data, write a complete mini-analysis that answers: "Which 10 countries showed the largest improvement in MCV1 coverage from 2019 to 2022?"
Steps:
1. Filter for MCV1 vaccine, years 2019 and 2022 only.
2. Create a way to compare each country's 2019 and 2022 values.
3. Compute the change (2022 minus 2019).
4. Sort and display the top 10 improvers.
Hint: You may need to reshape or merge the data. Try filtering into two DataFrames (one for 2019, one for 2022) and merging them on country.
Guidance
mcv1 = df[df["vaccine"] == "MCV1"]
y2019 = mcv1[mcv1["year"] == 2019][["country", "coverage_pct"]].rename(
columns={"coverage_pct": "cov_2019"})
y2022 = mcv1[mcv1["year"] == 2022][["country", "coverage_pct"]].rename(
columns={"coverage_pct": "cov_2022"})
merged = y2019.merge(y2022, on="country")
merged["improvement"] = merged["cov_2022"] - merged["cov_2019"]
top10 = merged.sort_values("improvement", ascending=False).head(10)
print(top10[["country", "cov_2019", "cov_2022", "improvement"]])
Note: `merge` is formally introduced in Chapter 9, but this is a natural preview.
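An alternative sketch that uses `pivot` instead of `merge` (another preview of later reshaping tools; it assumes at most one MCV1 record per country per year, with toy data standing in for the WHO file):

```python
import pandas as pd

# Toy stand-in for the WHO DataFrame (values are illustrative)
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "vaccine": ["MCV1"] * 4,
    "year": [2019, 2022, 2019, 2022],
    "coverage_pct": [70, 85, 90, 88],
})

mcv1 = df[df["vaccine"] == "MCV1"]
# One row per country, one column per year
wide = mcv1.pivot(index="country", columns="year", values="coverage_pct")
wide["improvement"] = wide[2022] - wide[2019]
print(wide.sort_values("improvement", ascending=False).head(10))
```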
Exercise 7.23 — Data grammar translation (3-star)
Translate each English sentence into a pandas expression. Assume df is the WHO vaccination DataFrame.
- "Show me only the African countries."
- "What's the typical coverage for each vaccine?"
- "List countries sorted from worst to best coverage in 2022."
- "Create a label that says 'Meeting target' if coverage is 90% or above, otherwise 'Below target'."
- "How many countries are in each region?"
Guidance
1. `df[df["region"] == "AFRO"]["country"].unique()`
2. `df.groupby("vaccine")["coverage_pct"].median()` (median = "typical")
3. `df[df["year"] == 2022].sort_values("coverage_pct")[["country", "coverage_pct"]]`
4. `df["status"] = df["coverage_pct"].apply(lambda x: "Meeting target" if x >= 90 else "Below target")`
5. `df.groupby("region")["country"].nunique()`
Exercise 7.24 — Investigating a pattern (4-star)
A colleague claims: "Vaccination coverage has been declining globally since 2019 because of COVID-19 disruptions." Using the WHO vaccination DataFrame, write pandas code to test whether this claim is supported by the data. Your analysis should:
- Compute mean coverage by year (across all vaccines and regions).
- Compute mean coverage by year for each vaccine separately.
- Check whether the decline (if any) is consistent across regions.
- Write 3-4 sentences interpreting your findings. Be careful about what the data can and cannot tell you.
Guidance
# Overall trend
print(df.groupby("year")["coverage_pct"].mean())
# By vaccine
print(df.groupby(["year", "vaccine"])["coverage_pct"].mean())
# By region
print(df.groupby(["year", "region"])["coverage_pct"].mean())
Interpretation should note: the data may show a dip around 2020-2021 and partial recovery by 2022, but the data alone cannot prove COVID caused it (correlation vs. causation — a theme from Chapter 1). The pattern may differ by region and vaccine type.
Part E: New Dataset Challenge (3-star)
Apply your pandas skills to a dataset you haven't seen before.
Exercise 7.25 — Build and analyze a classroom dataset
Create the following DataFrame representing student performance:
classroom = pd.DataFrame({
"student": ["Amir", "Beth", "Carlos", "Diana", "Eva",
"Frank", "Grace", "Hiro", "Inez", "Jack",
"Kira", "Leo"],
"major": ["CS", "Bio", "CS", "Math", "Bio",
"CS", "Math", "Bio", "CS", "Math",
"Bio", "CS"],
"midterm": [85, 78, 92, 88, 71, 95, 82, 69, 90, 76, 83, 88],
"final": [82, 85, 89, 91, 75, 93, 88, 74, 87, 80, 86, 91],
"homework_avg": [90, 88, 85, 92, 80, 98, 95, 72, 88, 85, 90, 82]
})
Answer these questions:
- Add a `course_grade` column computed as: 30% midterm + 40% final + 30% homework_avg.
- Add a `letter_grade` column using: A >= 90, B >= 80, C >= 70, otherwise D.
- Which major has the highest average course grade?
- Who improved the most from midterm to final (largest increase)?
- How many students got an A?
Guidance
1. `classroom["course_grade"] = classroom["midterm"] * 0.3 + classroom["final"] * 0.4 + classroom["homework_avg"] * 0.3`
2. Use `apply` with a function that checks thresholds.
3. `classroom.groupby("major")["course_grade"].mean().sort_values(ascending=False)`
4. `classroom["improvement"] = classroom["final"] - classroom["midterm"]` then sort descending.
5. `(classroom["letter_grade"] == "A").sum()`
Exercise 7.26 — Exploring unfamiliar data (3-star)
Find any CSV dataset online (Kaggle, data.gov, your city's open data portal) that has at least 100 rows and 5 columns. Load it with pd.read_csv(), then perform the Chapter 7 workflow:
- Check `shape`, `dtypes`, `head()`, `describe()`, and `info()`.
- Identify any missing values.
- Select a subset of interesting columns.
- Filter for a specific condition.
- Sort by a numeric column.
- Create one computed column.
Write up your findings in a Jupyter notebook with Markdown explanations between code cells.
Guidance
This is an open-ended exercise. Good datasets for beginners include: Gapminder country data, NYC taxi trip samples, Iris flower measurements, or any city's 311 service request data. The key assessment is whether you follow the workflow and explain what you find, not which dataset you choose.
Part M: Mixed Review (Chapters 1-6)
These exercises deliberately revisit earlier material. Retrieval practice strengthens long-term learning.
Exercise 7.M1 — Vocabulary bridge (Chapter 3 + 7)
In Chapter 3, you learned that Python has several core data types: int, float, str, bool. In pandas, df.dtypes shows types like int64, float64, object, bool. For each pandas type, identify the corresponding Python type and explain the relationship.
Guidance
- `int64` corresponds to `int` — 64-bit integer for larger capacity.
- `float64` corresponds to `float` — 64-bit floating-point number.
- `object` usually corresponds to `str` — pandas uses "object" as its label for text/string data.
- `bool` corresponds to `bool` — same concept, used in boolean masks.
The pandas types are NumPy types under the hood, optimized for array-based operations. They're compatible with Python's types but stored more efficiently in memory.
Exercise 7.M2 — Functions in two worlds (Chapter 4 + 7)
In Chapter 4, you wrote a function classify_rate(rate) that returned "Low," "Medium," or "High" based on a vaccination rate. In Chapter 7, you used apply() to call this function on every value in a Series.
Write a function summarize_column(series) that takes a pandas Series and returns a dictionary containing the count, mean, median, min, and max. Test it on df["coverage_pct"].
Guidance
def summarize_column(series):
return {
"count": series.count(),
"mean": series.mean(),
"median": series.median(),
"min": series.min(),
"max": series.max()
}
print(summarize_column(df["coverage_pct"]))
This bridges Chapter 4 (writing functions with return values) and Chapter 7 (Series methods).
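A self-contained check of the same idea with hypothetical values, showing that `count` skips missing entries; the function is repeated so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

def summarize_column(series):
    # Same function as in the guidance above, repeated so this snippet is standalone
    return {
        "count": series.count(),
        "mean": series.mean(),
        "median": series.median(),
        "min": series.min(),
        "max": series.max(),
    }

s = pd.Series([80.0, np.nan, 90.0, 100.0])
result = summarize_column(s)
print(result)   # count is 3, not 4, because NaN is excluded
```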
Exercise 7.M3 — Data structure evolution (Chapter 5 + 7)
In Chapter 5, you represented the WHO data as a list of dictionaries. Convert this list of dictionaries into a pandas DataFrame and verify the shapes match:
import csv
data = []
with open("who_vaccination_data.csv", "r", encoding="utf-8") as f:
for row in csv.DictReader(f):
data.append(row)
# Convert to DataFrame
df_from_list = pd.DataFrame(data)
- What are the `dtypes` of `df_from_list`? Why are they all `object`?
- How does this differ from `pd.read_csv()` directly?
- What would you need to do to make `df_from_list` usable for numeric operations?
Guidance
1. All columns are `object` (strings) because `csv.DictReader` reads everything as strings, and `pd.DataFrame()` preserves the types it receives.
2. `pd.read_csv()` performs type inference — it detects numeric columns and converts them automatically.
3. You'd need to manually convert: `df_from_list["coverage_pct"] = pd.to_numeric(df_from_list["coverage_pct"], errors="coerce")` and similarly for other numeric columns.
Exercise 7.M4 — EDA revisited (Chapter 6 + 7)
Chapter 6's threshold concept was "EDA as a conversation with data." Conduct a mini-EDA on the WHO data using pandas, writing your analysis as a conversation. Start with a question, show the code and result, write an observation, and let the observation lead to the next question. Do at least 4 question-answer-observation rounds.
Guidance
Example flow:
- Q1: "What's the overall mean coverage?" → Code → Observation: "82.7% — but I wonder if it varies by region."
- Q2: "How does coverage vary by region?" → Code → Observation: "AFRO is lowest at 72.3%. Why?"
- Q3: "Which AFRO countries have the lowest coverage?" → Code → Observation: "Several countries below 50%. Is this consistent across vaccines?"
- Q4: "Do these low-coverage countries struggle with all vaccines or just some?" → Code → Observation.
The key is demonstrating the iterative, question-driven workflow.
Exercise 7.M5 — The big picture (Chapter 1 + 7)
In Chapter 1, you learned about the four anchor examples: Elena (public health), Marcus (small business), Priya (sports journalism), and Jordan (university grading). For each person, write one specific pandas expression they might use in their work. Be as concrete as possible — use realistic column names and operations.