Chapter 6 Exercises: Your First Data Analysis

How to use these exercises: Work through the sections in order. Parts A-D focus on Chapter 6 material, building from recall to original analysis. Part E applies your skills to a new dataset. Part M mixes in concepts from Chapters 1-5 to reinforce earlier learning. You'll need Python and access to a CSV dataset for most problems.

Difficulty key: 1-star: Foundational | 2-star: Intermediate | 3-star: Advanced | 4-star: Extension


Part A: Conceptual Understanding (1-star)

These questions check whether you absorbed the core ideas from the chapter. Write clear, concise answers.


Exercise 6.1: EDA as conversation

Explain the threshold concept from this chapter — "exploratory data analysis as a conversation with data" — in your own words. Give a concrete example of how one question's answer leads to the next question, using the WHO vaccination dataset as context.

Guidance: A strong answer describes the iterative cycle: you ask a question ("What's the average coverage?"), get an answer (82.7%), and that answer provokes a new question ("Is that average similar across regions, or does it vary?"). The regional breakdown reveals AFRO at 72.3% vs. EURO at 93.1%, which prompts yet another question ("Which specific countries in AFRO are driving that low average?"). Each answer is a stepping stone, not an endpoint.

Exercise 6.2: Question quality

Rate each of the following questions as "good for EDA," "too vague," or "unanswerable with this data" — assuming you have the WHO vaccination dataset described in the chapter. Justify each rating.

  1. "What's interesting about this data?"
  2. "What is the mean DTP3 coverage in the SEARO region for 2022?"
  3. "Why does South Sudan have low vaccination rates?"
  4. "Which vaccine has the most consistent coverage across countries?"
  5. "How does vaccination coverage correlate with smartphone ownership?"
Guidance:
  1. **Too vague** — "interesting" isn't specific enough to guide analysis.
  2. **Good for EDA** — specific column, specific filter, specific statistic.
  3. **Unanswerable with this data** — "why" requires causal information (conflict, infrastructure) not in the dataset.
  4. **Good for EDA** — you could compute the standard deviation of coverage for each vaccine and compare.
  5. **Unanswerable with this data** — the dataset doesn't contain smartphone ownership information.

Exercise 6.3: Missing value categories

Define the three categories of missing data (MCAR, MAR, MNAR) in your own words. For each, create a hypothetical example involving student exam scores.

Guidance:
  - **MCAR:** A student's score is missing because the grading system had a random glitch. The missingness has nothing to do with the score itself or any other variable.
  - **MAR:** Students who missed class frequently (an observable variable) are more likely to have missing scores. The missingness is related to attendance but not to the score value itself.
  - **MNAR:** Students who performed poorly chose not to submit the exam, so low scores are missing precisely *because* they're low. The missingness is related to the missing value itself.

Exercise 6.4: Data provenance

Explain what data provenance means and why it matters. Then write a five-line provenance note for a hypothetical dataset of restaurant health inspection scores in your city.

Guidance: Data provenance documents where data came from, who collected it, when, how, and for what purpose. It matters because data quality, bias, and limitations can't be assessed without knowing the collection process. A sample provenance note might include: source agency, date range, how inspections were conducted, how scores were assigned, and any known limitations (e.g., not all restaurants are inspected every year).

Exercise 6.5: Notebook narrative vs. code dump

You're reviewing a classmate's Jupyter notebook. It has 15 code cells with no Markdown text between them. The output shows various numbers and lists, but there's no explanation of what any of it means. Write a paragraph of constructive feedback explaining what's missing and why it matters.

Guidance: Your feedback should mention: (1) the notebook lacks context — a reader can't tell what questions are being investigated, (2) there's no interpretation of results — numbers without explanation are meaningless, (3) reproducibility requires documenting *why* each step was taken, not just *what* was done, and (4) adding Markdown headers, question statements, and observation notes would transform it from a personal scratch pad into a professional document.

Exercise 6.6: Reproducibility checklist

List five things that make a Jupyter notebook reproducible. For each, explain what would go wrong if that element were missing.

Guidance:
  1. **Data is accessible** — without the data file, the notebook can't run.
  2. **Cells run in order** — out-of-order execution creates hidden state that can't be replicated.
  3. **Dependencies are documented** — missing libraries cause import errors.
  4. **Random seeds are set** — without seeds, random processes produce different results each run.
  5. **Absolute paths are avoided** — hardcoded paths like `C:\Users\myname\...` break on other computers.

Part B: Code Implementation (2-star)

Write Python code for each problem. Use only built-in Python and the standard-library csv module (no pandas or other external libraries).


Exercise 6.7: Flexible data loader

Write a function load_csv(filepath) that takes a file path as input and returns a list of dictionaries (one per row). The function should also print the number of rows loaded and the column names. Include error handling for the case where the file doesn't exist.

def load_csv(filepath):
    """Load a CSV file and return a list of dictionaries."""
    # Your code here
Guidance: Use `csv.DictReader` inside a `with open(...)` block. Wrap the entire thing in a `try/except FileNotFoundError` block. After loading, print `len(data)` for row count and `list(data[0].keys())` for column names (checking first that `data` is not empty).
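Putting those pieces together, one possible shape for the solution (a sketch — the exact messages you print are up to you):

```python
import csv

def load_csv(filepath):
    """Load a CSV file and return a list of dictionaries (one per row)."""
    try:
        with open(filepath, newline="") as f:
            data = list(csv.DictReader(f))
    except FileNotFoundError:
        print(f"Error: file not found: {filepath}")
        return []
    print(f"Loaded {len(data)} rows")
    if data:  # an empty file has no first row to take keys from
        print(f"Columns: {list(data[0].keys())}")
    return data
```

Returning an empty list on a missing file (rather than letting the exception propagate) is a design choice; raising a clearer error is equally defensible.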

Exercise 6.8: Count unique values

Write a function count_unique(data, column) that returns a dictionary mapping each unique value in the specified column to its count. Test it with a small example:

sample = [
    {"color": "red", "size": "L"},
    {"color": "blue", "size": "M"},
    {"color": "red", "size": "S"},
    {"color": "green", "size": "L"},
    {"color": "red", "size": "M"},
]
print(count_unique(sample, "color"))
# Expected: {'red': 3, 'blue': 1, 'green': 1}
Guidance: Loop through `data`, using `dict.get(value, 0) + 1` to count occurrences. This is the same pattern used in the chapter's `count_by()` function.
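A minimal version of this pattern, checked against the exercise's own test data:

```python
def count_unique(data, column):
    """Map each unique value in `column` to how many times it appears."""
    counts = {}
    for row in data:
        value = row[column]
        counts[value] = counts.get(value, 0) + 1
    return counts

sample = [
    {"color": "red", "size": "L"},
    {"color": "blue", "size": "M"},
    {"color": "red", "size": "S"},
    {"color": "green", "size": "L"},
    {"color": "red", "size": "M"},
]
print(count_unique(sample, "color"))  # {'red': 3, 'blue': 1, 'green': 1}
```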

Exercise 6.9: Safe numeric extraction

Write a function safe_floats(data, column) that extracts numeric values from the specified column, skipping empty strings and non-numeric values. Return a tuple of (values, num_skipped) where values is a list of floats and num_skipped is the count of rows that couldn't be converted.

sample = [
    {"score": "85"}, {"score": ""}, {"score": "92"},
    {"score": "N/A"}, {"score": "78"}, {"score": "95.5"}
]
values, skipped = safe_floats(sample, "score")
# Expected: values = [85.0, 92.0, 78.0, 95.5], skipped = 2
Guidance: Use a `try/except ValueError` block around `float(row[column])`. Check for empty strings with `row[column].strip() == ""` before attempting conversion. Keep a counter for skipped values.
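One possible sketch — this version folds the empty-string case into the same `except`, since `float("")` also raises `ValueError`:

```python
def safe_floats(data, column):
    """Extract floats from `column`; return (values, num_skipped)."""
    values = []
    skipped = 0
    for row in data:
        try:
            values.append(float(row[column]))
        except ValueError:
            skipped += 1  # empty strings and text like "N/A" land here
    return values, skipped
```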

Exercise 6.10: Standard deviation

You computed mean and median in the chapter. Now implement standard deviation. The formula is:

std_dev = sqrt(sum((x - mean)^2 for each x) / n)

Write a function compute_std_dev(values) that computes this. Test it on [2, 4, 4, 4, 5, 5, 7, 9] (expected standard deviation: approximately 2.0).

Hint: You can compute the square root as value ** 0.5, or import the math module and use math.sqrt.

Guidance: First compute the mean. Then compute the sum of squared differences from the mean. Divide by `n` (population standard deviation) or `n-1` (sample standard deviation — either is acceptable for this exercise). Take the square root. For the test data, the population std dev is 2.0 and the sample std dev is approximately 2.14.
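A sketch of the population version (the sample variant would divide by `n - 1` instead):

```python
def compute_std_dev(values):
    """Population standard deviation (divide by n, not n - 1)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return variance ** 0.5

print(compute_std_dev([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```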

Exercise 6.11: Mode (most common value)

Write a function compute_mode(values) that returns the most frequently occurring value in a list. If there's a tie, return any one of the tied values. Test it on [1, 2, 2, 3, 3, 3, 4] (expected: 3).

Guidance: Build a frequency dictionary with `dict.get(value, 0) + 1`, then use `max()` with a key function to find the value with the highest count. Alternatively, loop through the dictionary to find the maximum count.
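The `max()`-with-key approach looks like this:

```python
def compute_mode(values):
    """Most frequent value; on a tie, whichever max() encounters first."""
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    # max over the dict's keys, ranked by their counts
    return max(counts, key=counts.get)

print(compute_mode([1, 2, 2, 3, 3, 3, 4]))  # 3
```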

Exercise 6.12: Percentile calculation

Write a function compute_percentile(values, p) where p is a number between 0 and 100 representing the desired percentile. The 50th percentile is the median. The 25th percentile is the value below which 25% of the data falls.

Use this approach: sort the values, compute the index as (p/100) * (n-1), and if the index isn't a whole number, interpolate between the two nearest values.

Test it: for [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], the 25th percentile should be approximately 3.25 and the 75th percentile should be approximately 7.75.

Guidance: Sort the values. Compute `idx = (p / 100) * (len(values) - 1)`. The lower bound is `int(idx)`, the upper bound is `int(idx) + 1` (clamped to `len(values) - 1`). The fractional part is `idx - int(idx)`. Interpolate: `result = values[lower] + fraction * (values[upper] - values[lower])`.
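Translated directly into code:

```python
def compute_percentile(values, p):
    """p-th percentile (0-100) with linear interpolation between neighbors."""
    ordered = sorted(values)
    idx = (p / 100) * (len(ordered) - 1)
    lower = int(idx)
    upper = min(lower + 1, len(ordered) - 1)  # clamp at the last index
    fraction = idx - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(compute_percentile(data, 25))  # 3.25
print(compute_percentile(data, 75))  # 7.75
```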

Exercise 6.13: Filter and summarize

Write a function filter_and_summarize(data, column, filter_column, filter_value) that:

  1. Filters data to only rows where filter_column equals filter_value
  2. Extracts numeric values from column
  3. Prints count, min, max, mean, and median

Use it to find the summary statistics for coverage_pct where vaccine is "BCG".

Guidance: Combine the filtering pattern from the chapter (list comprehension or loop with `if`) with the `safe_floats` and `summarize` patterns. The function should handle the case where no rows match the filter.

Exercise 6.14: Year-over-year comparison

Write code that computes the mean vaccination coverage for each year in the dataset and prints the year-over-year change. Your output should look something like:

2019: 84.2% (baseline)
2020: 82.8% (change: -1.4 percentage points)
2021: 81.5% (change: -1.3 percentage points)
2022: 83.1% (change: +1.6 percentage points)
Guidance: Build a dictionary mapping each year to a list of coverage values (filter by year, extract numeric values). Compute the mean for each year. Print the first year as "baseline" and subsequent years with the difference from the previous year. This exercise tests your ability to combine filtering, aggregation, and formatting.
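Assuming the chapter's column names (`year`, `coverage_pct`), the grouping and formatting might be sketched like this; `year_over_year` is our name for the sketch, not a chapter function:

```python
def year_over_year(data):
    """Mean coverage per year, printed with the change from the prior year."""
    by_year = {}
    for row in data:
        try:
            value = float(row["coverage_pct"])
        except ValueError:
            continue  # skips empty strings and non-numeric text
        by_year.setdefault(row["year"], []).append(value)
    means = {year: sum(v) / len(v) for year, v in sorted(by_year.items())}
    previous = None
    for year, mean in means.items():
        if previous is None:
            print(f"{year}: {mean:.1f}% (baseline)")
        else:
            print(f"{year}: {mean:.1f}% (change: {mean - previous:+.1f} percentage points)")
        previous = mean
    return means
```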

Part C: Real-World Application (2-3 star)

These problems require you to think beyond the code and consider the broader context of data analysis.


Exercise 6.15: Data quality detective (2-star)

Write code to answer each of the following data quality questions about the WHO dataset:

  1. Are there any country names that appear with inconsistent spacing (leading/trailing spaces)?
  2. Are there any rows where coverage_pct contains a value that isn't a valid number (not just empty — actually non-numeric text)?
  3. How many country-year-vaccine combinations have duplicate entries?
  4. Are there any negative coverage values?
Guidance: For (1), compare `row["country"]` to `row["country"].strip()`. For (2), try `float()` on every non-empty value inside a try/except. For (3), build a set of `(country, year, vaccine)` tuples and compare its length to the total row count. For (4), check if any converted float is less than 0.
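For (3), the set-based pattern can be packaged as a small function (`count_duplicates` is our name for it, assuming the chapter's `country`, `year`, and `vaccine` columns):

```python
def count_duplicates(data):
    """Count rows whose (country, year, vaccine) combination has already appeared."""
    seen = set()
    duplicates = 0
    for row in data:
        key = (row["country"], row["year"], row["vaccine"])
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates
```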

Exercise 6.16: Pandemic impact analysis (2-star)

The years 2020 and 2021 were the peak of the COVID-19 pandemic. Write code to investigate whether vaccination coverage for routine childhood vaccines (DTP3, MCV1) declined during this period. Compute:

  1. Mean coverage for 2019 (pre-pandemic baseline)
  2. Mean coverage for 2020 and 2021
  3. Mean coverage for 2022 (recovery?)
  4. The overall change from 2019 to 2021

Write a 3-4 sentence interpretation of your findings as a Markdown cell.

Guidance: Filter data by vaccine type (DTP3 or MCV1) and year. Compute means for each year. Real WHO data shows that routine immunization coverage did decline during the pandemic — your simulated dataset should show a similar pattern. Your interpretation should note the decline and discuss possible explanations (disrupted health services, lockdowns, supply chain issues).

Exercise 6.17: Regional deep dive (3-star)

Pick one WHO region and perform a thorough analysis:

  1. List all countries in that region
  2. Compute mean coverage per country (across all vaccines and years)
  3. Identify the country with the highest and lowest coverage
  4. Count missing values for that region specifically
  5. Write a 5-sentence summary of what you found
Guidance: This is an open-ended exercise that tests your ability to combine multiple techniques. Use a filter to isolate the region, then apply `count_unique`, `safe_floats`, `compute_mean`, and missing value counting. The summary should mention specific countries, numbers, and any patterns you noticed.

Exercise 6.18: Vaccine comparison (2-star)

Compare the four vaccines in the dataset (BCG, DTP3, MCV1, Pol3) by computing mean and median coverage for each. Which vaccine has the highest average coverage? Which has the most variation (highest range)? Write a 2-3 sentence interpretation.

Guidance: Filter by vaccine, extract coverage values, compute mean, median, and range for each. BCG (tuberculosis) typically has very high global coverage, while MCV1 (measles first dose) may show more variation. Your interpretation should relate the findings to what you know (or can look up) about these vaccination programs.

Exercise 6.19: Top and bottom performers (2-star)

Write code to find:

  1. The 5 countries with the highest mean coverage (across all vaccines and years)
  2. The 5 countries with the lowest mean coverage

For each country, print the country name, region, and mean coverage.

Hint: Build a dictionary mapping country names to lists of coverage values, compute means, then sort.

Guidance:
country_coverages = {}
for row in data:
    country = row["country"]
    raw = row["coverage_pct"].strip()
    if not raw:
        continue
    try:
        value = float(raw)
    except ValueError:
        continue  # skip non-numeric entries like "N/A"
    country_coverages.setdefault(country, []).append(value)

country_means = {c: sum(v)/len(v) for c, v in country_coverages.items()}
sorted_countries = sorted(country_means.items(), key=lambda x: x[1])
Then print the first 5 (lowest) and last 5 (highest).

Exercise 6.20: Build a complete summary report (3-star)

Write a function dataset_report(data) that takes a list of dictionaries (any CSV dataset) and prints a comprehensive summary report including:

  1. Number of rows and columns
  2. Column names
  3. For each column: number of missing values, number of unique values, and 3 sample values
  4. For any column that appears numeric: min, max, and mean

This function should work with any CSV dataset, not just the WHO data.

Guidance: For each column, try converting all non-empty values to float. If more than 50% succeed, treat it as numeric and compute stats. Use `set()` for unique counts. This is a genuinely useful utility function that you could reuse in future projects.
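The "more than 50% parse as floats" heuristic might be factored out like this (`looks_numeric` and its `threshold` parameter are illustrative names, not from the chapter):

```python
def looks_numeric(data, column, threshold=0.5):
    """True if more than `threshold` of the non-empty values parse as floats."""
    non_empty = [row[column].strip() for row in data if row[column].strip()]
    if not non_empty:
        return False
    numeric = 0
    for raw in non_empty:
        try:
            float(raw)
            numeric += 1
        except ValueError:
            pass
    return numeric / len(non_empty) > threshold
```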

Part D: Synthesis and Extension (3-4 star)

These problems push you to think creatively and combine concepts in new ways.


Exercise 6.21: Text-based bar chart (3-star)

Since we don't have matplotlib yet, create a simple text-based horizontal bar chart. Write a function text_bar_chart(labels, values, max_width=40) that prints something like:

AFRO  |████████████████████████████          | 72.3%
AMRO  |██████████████████████████████████    | 86.4%
EMRO  |████████████████████████████████      | 80.5%
EURO  |████████████████████████████████████  | 93.1%
SEARO |███████████████████████████████████   | 87.8%
WPRO  |███████████████████████████████████   | 88.2%

Use it to display mean coverage by region.

Guidance: Scale each value to the range [0, max_width] by dividing by the maximum value and multiplying by max_width. Use string multiplication (`"█" * bar_length`) to create bars. Use f-string formatting with fixed widths for alignment.
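A compact sketch of the scaling and alignment (details like rounding vs. truncating the bar length are a design choice):

```python
def text_bar_chart(labels, values, max_width=40):
    """Horizontal bar chart: the largest value gets a bar of max_width blocks."""
    largest = max(values)
    label_width = max(len(label) for label in labels)
    for label, value in zip(labels, values):
        bar = "█" * round(value / largest * max_width)
        # pad the label and the bar to fixed widths so the columns line up
        print(f"{label:<{label_width}} |{bar:<{max_width}}| {value:.1f}%")

text_bar_chart(["AFRO", "EURO"], [72.3, 93.1])
```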

Exercise 6.22: Correlation by eye (3-star)

Without computing a formal correlation coefficient, investigate whether there's a relationship between target_population and coverage_pct. For records that have both values:

  1. Divide countries into "small population" (below median target population) and "large population" (above median)
  2. Compute mean coverage for each group
  3. Write a 2-3 sentence interpretation. Does population size seem related to coverage?
Guidance: Filter to records where both fields are non-empty and numeric. Compute the median of target_population. Split records into two groups. Compare mean coverage. This gives a rough sense of whether the two variables are related, without formal statistics. You'll learn proper correlation in Chapter 24.
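The split itself can be sketched with plain tuples — `median_split_means` is an illustrative helper, assuming you've already filtered to rows where both values parse as numbers:

```python
def compute_median(values):
    """Median of a list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def median_split_means(pairs):
    """pairs: (target_population, coverage_pct) tuples.
    Returns mean coverage for the below-median and at-or-above-median groups."""
    cutoff = compute_median([population for population, _ in pairs])
    small = [coverage for population, coverage in pairs if population < cutoff]
    large = [coverage for population, coverage in pairs if population >= cutoff]
    return sum(small) / len(small), sum(large) / len(large)
```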

Exercise 6.23: Coverage trajectory (3-star)

For a country of your choice, track its vaccination coverage for all four vaccines across all four years. Print a formatted table showing the trajectory. Then write 2-3 sentences interpreting any trends.

Country: Nigeria
         2019   2020   2021   2022
BCG      64%    57%    61%    65%
DTP3     57%    56%    56%    59%
MCV1     54%    53%    51%    57%
Pol3     57%    56%    55%    60%
Guidance: Filter data to the chosen country. Build a nested dictionary: `{vaccine: {year: coverage}}`. Handle missing values gracefully (print "N/A" if no data exists for a combination). Use f-string formatting for alignment.

Exercise 6.24: Write your own data dictionary (3-star)

Create a small CSV file of your own with at least 5 columns and 20 rows. The data can be about anything — your music library, your exercise log, local weather, sports stats. Include at least one intentional data quality issue (a missing value, an inconsistent format, a suspicious outlier).

Then load your data with csv.DictReader, run a full EDA (shape, unique values, summary statistics, data quality check), and write up your findings in a notebook narrative format.

Guidance: This exercise tests whether you can apply the EDA workflow to a novel dataset, not just follow the chapter's example. The key is creating the data with realistic imperfections and then "discovering" those imperfections through systematic analysis. This simulates the real experience of working with unfamiliar data.

Exercise 6.25: EDA question generator (4-star)

Write a function generate_questions(data) that automatically generates a list of EDA questions based on the structure of the dataset. The function should:

  1. Identify categorical columns (those with fewer than 20 unique values)
  2. Identify numeric columns (those where most values can be converted to float)
  3. Generate questions like "What is the mean [numeric_col] by [categorical_col]?" and "How many missing values are in [column]?"

Test it on the WHO dataset and on your own dataset from Exercise 6.24.

Guidance: This is a creative programming challenge. The function should analyze column characteristics and use string formatting to produce natural-language questions. It won't produce *good* questions (that requires domain knowledge), but it demonstrates how much information can be extracted from data structure alone.

Part E: New Dataset Challenge (2-3 star)

For these exercises, create or use a dataset other than the WHO vaccination data.


Exercise 6.26: Weather data exploration (2-star)

Create a CSV file with 30 rows of weather data containing columns: date, city, high_temp, low_temp, precipitation_mm, condition (sunny, cloudy, rainy, snowy). Include 2-3 missing values.

Load the data with csv.DictReader, compute summary statistics for temperature and precipitation, and identify any data quality issues.


Exercise 6.27: Bookshelf analysis (2-star)

Create a CSV file with 25 rows representing books on a bookshelf: title, author, year_published, pages, genre, rating (1-5). Include at least one row with a missing rating and one with a suspiciously high page count (like 50,000 — a typo for 500).

Load and analyze the data. Can you programmatically detect the typo?


Exercise 6.28: Classroom grades (3-star)

Create a CSV file with 40 rows of student exam scores: student_id, section, midterm_score, final_score, homework_avg. Include some missing values and at least two sections.

Compute and compare:

  1. Mean scores by section
  2. Count of missing values by section
  3. Percentage of students scoring above 90 on the final, by section

Write a 3-4 sentence analysis comparing the two sections.


Part M: Mixed Review — Chapters 1-5 (1-3 star)

These problems integrate material from earlier chapters to keep your foundations strong.


Exercise 6.29: Data science lifecycle in action (1-star, Ch.1 review)

Map the work you did in this chapter to the data science lifecycle from Chapter 1. For each lifecycle stage, identify which section of Chapter 6 corresponds to it and give a one-sentence description of what you did.

Guidance:
  1. **Ask** — Section 6.1 (formulating questions about the WHO data)
  2. **Acquire** — Section 6.2 (loading the CSV file)
  3. **Clean** — Section 6.5 (identifying data quality issues — though we didn't fully clean the data yet)
  4. **Explore** — Sections 6.3 and 6.4 (inspecting structure and computing statistics)
  5. **Model** — Not yet (that comes in Parts IV-V)
  6. **Communicate** — Section 6.6 (notebook narrative)

Exercise 6.30: Functions for reuse (2-star, Ch.4 review)

The chapter defined several utility functions: unique_values(), count_by(), get_numeric_values(), compute_mean(), compute_median(), summarize().

Refactor these into a single Python file called eda_utils.py that you could import into any notebook. Add docstrings and type hints. Then demonstrate importing and using the module:

from eda_utils import load_csv, safe_floats, summarize

data = load_csv("my_data.csv")
values, skipped = safe_floats(data, "score")
summarize(values, "Exam Scores")
Guidance: This exercise reinforces Chapter 4's lesson about functions as reusable tools. Place all functions in one `.py` file with consistent naming conventions. Add type hints like `def compute_mean(values: list[float]) -> float:`. Import with `from eda_utils import ...`. This is how real data scientists build personal utility libraries.