Learning Objectives
- Load a real-world CSV dataset into Python data structures and inspect its basic properties
- Formulate specific, answerable questions before exploring data, distinguishing good questions from vague ones
- Compute basic summary statistics (count, min, max, mean) using pure Python loops and conditionals
- Identify data quality issues (missing values, inconsistent formats, suspicious outliers) through manual inspection
- Summarize initial findings in a Jupyter notebook that combines code, output, and explanatory Markdown
In This Chapter
- Chapter Overview
- 6.1 The Art of Asking Good Questions
- 6.2 Loading Real Data: The WHO Vaccination Dataset
- 6.3 Getting to Know Your Data: First Explorations
- 6.4 Computing Summary Statistics by Hand
- 6.5 Spotting Data Quality Issues
- 6.6 Telling the Story: Notebook as Narrative
- 6.7 The Limits of Pure Python: Why We Need Better Tools
- Project Checkpoint: Your First Real Exploration of WHO Data
- Practical Considerations
- Summary
- Spaced Review: Chapters 1-5
- What's Next: Part II — Data Wrangling
Chapter 6: Your First Data Analysis — Loading, Exploring, and Asking Questions of Real Data
"The greatest value of a picture is when it forces us to notice what we never expected to see." — John Tukey, the father of exploratory data analysis
Chapter Overview
This is it.
Five chapters of preparation — understanding what data science is, setting up your toolkit, learning variables and control flow, writing functions, mastering lists and dictionaries — have all been leading to this moment. In this chapter, you're going to do something that most people never do: you're going to open a real dataset, ask it questions, and listen to what it tells you.
Not a toy dataset. Not a contrived classroom exercise. A real collection of vaccination data from the World Health Organization, covering countries across the globe, spanning multiple years, with all the messiness and surprises that real-world data brings. You'll load it using Python's built-in csv module. You'll count rows, examine columns, compute averages, and find patterns. You'll discover missing values, inconsistent entries, and numbers that don't quite make sense. And you'll write everything down — code, output, and your own observations — in a Jupyter notebook that tells the story of your exploration.
Here's what makes this chapter special: you're going to do all of it with the Python you already know. No new libraries. No magic one-liners. Just for loops, if statements, functions, lists, and dictionaries — the tools from Chapters 3 through 5 — applied to a problem that actually matters.
This is deliberate. By the end of this chapter, you'll have accomplished something genuinely impressive. But you'll also feel some friction. You'll notice that computing a simple average requires more lines of code than it should. You'll get frustrated by how many loops you need to group data by region. And that friction? That's the setup for Part II, where we'll introduce pandas — the library that makes everything you do here faster, cleaner, and more expressive. But the understanding you build here, doing it "the hard way," will make pandas feel like a revelation instead of a mystery.
In this chapter, you will learn to:
- Load a real-world CSV dataset into Python data structures and inspect its basic properties
- Formulate specific, answerable questions before exploring data, distinguishing good questions from vague ones
- Compute basic summary statistics (count, min, max, mean) using pure Python loops and conditionals
- Identify data quality issues (missing values, inconsistent formats, suspicious outliers) through manual inspection
- Summarize initial findings in a Jupyter notebook that combines code, output, and explanatory Markdown
6.1 The Art of Asking Good Questions
Before we touch a single line of code, we need to talk about questions.
In Chapter 1, we learned that data science always starts with a question. Now that you're about to meet your first real dataset, that principle stops being abstract and becomes practical. The quality of your analysis depends — more than anything else — on the quality of the questions you ask before you start.
Why Questions Come First
Imagine I hand you a spreadsheet with 10,000 rows of data about countries, vaccination rates, and health indicators. What do you do with it? If you don't have a question, you'll just... scroll. Maybe you'll sort by one column, glance at the biggest numbers, say "huh, interesting," and close the file. That's not analysis. That's browsing.
But if you walk into that spreadsheet with a question — "Do wealthier countries consistently have higher vaccination rates?" — suddenly everything changes. You know which columns to focus on. You know what to calculate. You know what a meaningful pattern would look like. The question gives you direction, and direction is what turns data browsing into data analysis.
Elena, our public health analyst, learned this the hard way during her vaccination equity project. In her first week, she spent two days "exploring" the data — opening it in a spreadsheet, scrolling through rows, making a few charts. She generated twelve different visualizations and felt productive. But when her director asked, "So what did you find?" Elena realized she couldn't give a coherent answer. She'd been looking at data without asking it anything.
The next week, she started differently. She wrote three specific questions on a sticky note and put it on her monitor:
- Which neighborhoods have the lowest vaccination rates?
- How do vaccination rates compare across income levels?
- Has the gap between the highest and lowest neighborhoods changed over time?
With those questions guiding her, Elena found clear, communicable answers in a single afternoon.
Three Types of Questions
Back in Chapter 1, we introduced three fundamental types of data science questions. Now that we're about to analyze real data, let's revisit them with sharper focus.
Descriptive questions ask what happened or what does the data look like. These are the foundation of every analysis.
- "How many countries are in this dataset?"
- "What's the average vaccination rate across all countries?"
- "Which country has the highest vaccination rate?"
Predictive questions ask what is likely to happen based on patterns in the data.
- "Based on the current trend, which countries are likely to fall below 70% coverage next year?"
- "Can we predict a country's vaccination rate from its GDP?"
Causal questions ask what would happen if we changed something — and they're the hardest to answer.
- "Did the public outreach campaign cause vaccination rates to increase?"
- "Would investing in more clinics reduce the coverage gap?"
For your first data analysis, you'll mostly be asking descriptive questions. That's not a limitation — it's smart strategy. You need to understand what the data contains before you can predict or explain anything. Trying to build a prediction model before you've even looked at the data is like trying to navigate a city you've never visited without first consulting a map.
What Makes a Question Good?
Not all questions are equally useful. Here's a quick framework for evaluating whether a question is ready for analysis:
A good data question is specific. "What's going on with vaccination data?" is too vague. "What is the mean measles vaccination rate for countries in Sub-Saharan Africa in 2022?" is specific enough to produce a concrete answer.
A good data question is answerable with the available data. If your dataset doesn't contain information about healthcare spending, you can't answer questions about healthcare spending — no matter how interesting the question is. Part of being a good analyst is matching your curiosity to your data.
A good data question leads to action or understanding. "How many rows are in the dataset?" is technically answerable, but it's a housekeeping question, not an analytical one. "Do countries with higher healthcare spending consistently have better vaccination coverage?" could inform real policy decisions.
A good data question acknowledges what it can't answer. The best analysts are transparent about the boundaries of their data. If your dataset covers only 2015-2023, you can't make claims about what happened in the 1990s.
🔄 Check Your Understanding
- Why is "exploring" a dataset without a question less productive than analyzing with one?
- Classify each of these as descriptive, predictive, or causal: (a) "What percentage of countries achieved 90% vaccination coverage in 2022?" (b) "Will Country X reach 80% coverage by 2025?" (c) "Did the funding increase cause the improvement in vaccination rates?"
- A friend says, "I want to know everything about this dataset." What advice would you give them?
6.2 Loading Real Data: The WHO Vaccination Dataset
Now for the exciting part. We're going to load actual data from the World Health Organization.
Meet the Dataset
The dataset we'll be working with is a simplified version of WHO immunization coverage estimates. It contains vaccination rates for multiple vaccines across countries and years. Here's what you need to know about it:
- Source: Based on WHO/UNICEF Estimates of National Immunization Coverage (WUENIC)
- Contents: Country name, WHO region, year, vaccine type, and coverage percentage
- Format: CSV (comma-separated values) — a plain text format you met in Chapter 5
- Size: Approximately 5,000 rows — large enough to be interesting, small enough to handle with pure Python
Let's say the file is called who_vaccination_data.csv and it lives in the same directory as your Jupyter notebook. Here's what the first few lines look like if you open it in a text editor:
country,region,year,vaccine,coverage_pct,target_population,doses_administered
Afghanistan,EMRO,2019,MCV1,64,,
Afghanistan,EMRO,2020,MCV1,66,,
Afghanistan,EMRO,2021,MCV1,58,,
Afghanistan,EMRO,2022,MCV1,62,,
Albania,EURO,2019,MCV1,96,28000,26880
Albania,EURO,2020,MCV1,95,27500,26125
Albania,EURO,2021,MCV1,96,27000,25920
A few things to notice right away. The columns are country, region, year, vaccine, coverage_pct, target_population, and doses_administered. Some cells appear to be empty — look at Afghanistan's target_population and doses_administered columns. Those missing values are going to be important later.
📊 Real-World Application: Data Provenance
Data provenance means knowing where your data came from and how it was produced. Before analyzing any dataset, a responsible analyst asks: Who collected this data? When? How? What was the original purpose? What might be missing or biased?
For our WHO dataset, the data was collected through a collaboration between WHO and UNICEF, combining official country reports, survey data, and expert review. Not every country has the same quality of reporting. Some countries have excellent vital statistics systems that track every vaccination. Others rely on survey estimates that may be less precise. This provenance information matters — it affects how confident we should be in the numbers.
Throughout your data science career, make it a habit to document data provenance at the top of every notebook. Future-you (and anyone else who reads your work) will thank you.
Loading the Data
In Chapter 5, you learned to read files using Python's built-in open() function and the csv module. Now we'll put those skills to work with a real dataset.
import csv
# Open the file and read all rows into a list of dictionaries
data = []
with open("who_vaccination_data.csv", "r", encoding="utf-8") as file:
    reader = csv.DictReader(file)
    for row in reader:
        data.append(row)
That's it. Six short lines of code (not counting the comment), and you've loaded your first real dataset into Python. Let's break down what happened:
- `csv.DictReader(file)` reads each row as a dictionary, where the keys are the column names from the header row.
- The `for` loop iterates through every row in the file.
- Each row (a dictionary) gets appended to our `data` list.
After this code runs, data is a list of dictionaries — one dictionary per row. Let's confirm it worked.
First Look: How Many Records?
print(f"Total records loaded: {len(data)}")
Total records loaded: 4892
Nearly 5,000 records. That's a lot more than you can eyeball in a spreadsheet, which is exactly why we need Python.
Examining a Single Row
Let's look at what one row actually contains:
# Print the first row to see its structure
print(data[0])
{'country': 'Afghanistan', 'region': 'EMRO', 'year': '2019',
'vaccine': 'MCV1', 'coverage_pct': '64',
'target_population': '', 'doses_administered': ''}
Notice something important: everything is a string. The year '2019' isn't the integer 2019 — it's the string '2019'. The coverage percentage '64' isn't the number 64 — it's the string '64'. And those empty fields? They're empty strings '', not None or 0.
This is a critical insight. The csv module reads everything as text. If we want to do math — computing averages, finding maximums — we'll need to convert strings to numbers first. And we'll need to handle those empty strings carefully, because trying to convert '' to an integer will crash our program.
⚠️ Common Pitfall
One of the most common beginner mistakes is forgetting that CSV data loads as strings. If you try to compute `max(row['coverage_pct'] for row in data)`, you'll get the lexicographic maximum — the string that comes last alphabetically — not the numeric maximum. The string `'9'` is "greater than" `'89'` in string comparison because `'9' > '8'`. Always convert to numeric types before doing math.
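A quick demonstration of the pitfall, using a few made-up values:

```python
# Comparing strings compares character by character, not numerically
string_values = ["9", "89", "100"]

print(max(string_values))                    # '9' — the lexicographic maximum
print(max(float(v) for v in string_values))  # 100.0 — the numeric maximum
```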
Building a Data Dictionary
A data dictionary is a description of every column in your dataset: what it contains, what type it should be, what its allowed values are, and any notes about quality or meaning. Professional data scientists create data dictionaries as one of their first tasks on any project.
Let's build one for our WHO dataset:
# Print all column names
print("Columns:", list(data[0].keys()))
Columns: ['country', 'region', 'year', 'vaccine', 'coverage_pct',
'target_population', 'doses_administered']
Now let's examine each column more carefully by looking at a few sample values:
# Show the first 5 values for each column
for col in data[0].keys():
values = [row[col] for row in data[:5]]
print(f"{col}: {values}")
country: ['Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Albania']
region: ['EMRO', 'EMRO', 'EMRO', 'EMRO', 'EURO']
year: ['2019', '2020', '2021', '2022', '2019']
vaccine: ['MCV1', 'MCV1', 'MCV1', 'MCV1', 'MCV1']
coverage_pct: ['64', '66', '58', '62', '96']
target_population: ['', '', '', '', '28000']
doses_administered: ['', '', '', '', '26880']
Here's our data dictionary so far:
| Column | Description | Type (should be) | Notes |
|---|---|---|---|
| `country` | Country name | text | Should be unique per country-year-vaccine combination |
| `region` | WHO region code | text (categorical) | e.g., EMRO, EURO, AFRO, SEARO, WPRO, AMRO |
| `vaccine` | Vaccine abbreviation | text (categorical) | e.g., MCV1 (measles first dose), DTP3, Pol3, BCG |
| `year` | Year of observation | integer | Range appears to be 2019-2022 |
| `coverage_pct` | Vaccination coverage as a percentage | float | Range: 0-100 (possibly > 100 in some reporting systems) |
| `target_population` | Target population for vaccination | integer | Many missing values |
| `doses_administered` | Total doses given | integer | Many missing values |
🔄 Check Your Understanding
- Why does `csv.DictReader` return strings for all values, even numeric ones?
- What would go wrong if you tried to compute `sum(row['coverage_pct'] for row in data)` without converting types?
- Why is building a data dictionary a good first step before analysis?
6.3 Getting to Know Your Data: First Explorations
You've loaded the data. You know how many rows you have and what the columns are called. Now it's time for the part that experienced data scientists will tell you is the most fun: getting to know the data through exploration.
🚪 Threshold Concept: EDA as a Conversation
Here's the single most important idea in this chapter — one that will shape how you approach every dataset for the rest of your career:
Exploratory data analysis is a conversation with your data. You ask a question. The data gives an answer. That answer raises a new question. You ask that. And the cycle continues.
This isn't a metaphor — it's a literal description of how good EDA works. You don't sit down and write a 50-line analysis script from scratch. You work interactively, one question at a time, each answer guiding the next question. This is exactly why Jupyter notebooks are so powerful for data exploration — each cell is one question-and-answer exchange in the conversation.
John Tukey, the statistician who coined the term "exploratory data analysis" in his landmark 1977 book, put it this way: the purpose of EDA is to suggest hypotheses, not to confirm them. You're a detective gathering clues, not a lawyer making a case. Stay curious. Stay open to surprise. Let the data talk.
Shape: How Big Is This Dataset?
We already know we have 4,892 rows. Let's figure out how many columns and summarize the "shape":
num_rows = len(data)
num_cols = len(data[0]) if data else 0
print(f"Shape: {num_rows} rows x {num_cols} columns")
Shape: 4892 rows x 7 columns
In pandas (which you'll learn in Chapter 7), this would be a single call to .shape. For now, we built it ourselves.
Column Names
columns = list(data[0].keys())
print("Columns:")
for i, col in enumerate(columns, 1):
    print(f" {i}. {col}")
Columns:
1. country
2. region
3. year
4. vaccine
5. coverage_pct
6. target_population
7. doses_administered
First and Last Rows
Checking the first and last few rows is a data science habit that takes five seconds and catches all kinds of problems: truncated files, corrupted endings, unexpected formats.
# First 3 rows
print("=== First 3 rows ===")
for row in data[:3]:
    print(row)
print()

# Last 3 rows
print("=== Last 3 rows ===")
for row in data[-3:]:
    print(row)
=== First 3 rows ===
{'country': 'Afghanistan', 'region': 'EMRO', 'year': '2019', 'vaccine': 'MCV1', 'coverage_pct': '64', 'target_population': '', 'doses_administered': ''}
{'country': 'Afghanistan', 'region': 'EMRO', 'year': '2020', 'vaccine': 'MCV1', 'coverage_pct': '66', 'target_population': '', 'doses_administered': ''}
{'country': 'Afghanistan', 'region': 'EMRO', 'year': '2021', 'vaccine': 'MCV1', 'coverage_pct': '58', 'target_population': '', 'doses_administered': ''}
=== Last 3 rows ===
{'country': 'Zimbabwe', 'region': 'AFRO', 'year': '2020', 'vaccine': 'Pol3', 'coverage_pct': '90', 'target_population': '490000', 'doses_administered': '441000'}
{'country': 'Zimbabwe', 'region': 'AFRO', 'year': '2021', 'vaccine': 'Pol3', 'coverage_pct': '88', 'target_population': '495000', 'doses_administered': '435600'}
{'country': 'Zimbabwe', 'region': 'AFRO', 'year': '2022', 'vaccine': 'Pol3', 'coverage_pct': '92', 'target_population': '500000', 'doses_administered': '460000'}
Good news: the file starts and ends cleanly. Countries are in alphabetical order (Afghanistan to Zimbabwe). The structure is consistent.
Unique Values: What Categories Exist?
One of the first things you want to know about a dataset is: for each column that seems categorical (text-based), what are the unique values?
def unique_values(data, column):
    """Return the sorted unique values in a column."""
    return sorted(set(row[column] for row in data))
# Unique regions
regions = unique_values(data, "region")
print(f"Regions ({len(regions)}): {regions}")
# Unique vaccines
vaccines = unique_values(data, "vaccine")
print(f"Vaccines ({len(vaccines)}): {vaccines}")
# Unique years
years = unique_values(data, "year")
print(f"Years ({len(years)}): {years}")
# How many unique countries?
countries = unique_values(data, "country")
print(f"Countries: {len(countries)} unique")
Regions (6): ['AFRO', 'AMRO', 'EMRO', 'EURO', 'SEARO', 'WPRO']
Vaccines (4): ['BCG', 'DTP3', 'MCV1', 'Pol3']
Years (4): ['2019', '2020', '2021', '2022']
Countries: 194 unique
Now we're learning something. The dataset covers 194 countries, 6 WHO regions, 4 vaccines, and 4 years (2019-2022). That means, in theory, we'd expect 194 x 4 x 4 = 3,104 rows if every country had exactly one row per vaccine per year. But we have 4,892 rows, which is more than that — so some combinations must appear more than once. Let's check:
# How many rows would we expect with complete data?
expected = len(countries) * len(vaccines) * len(years)
print(f"Expected (complete): {expected}")
print(f"Actual: {len(data)}")
print(f"Difference: {len(data) - expected}")
Expected (complete): 3104
Actual: 4892
Difference: 1788
Interesting — we have nearly 1,800 more rows than expected. Since the dataset contains exactly 194 countries, 4 vaccines, and 4 years, the only way to exceed the complete grid is for some (country, year, vaccine) combinations to appear more than once — perhaps duplicate entries, revised estimates, or multiple reporting sources. That's something worth investigating. This is the EDA conversation in action: one observation leads to a new question.
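One way to follow up (sketched here on a few hypothetical rows rather than the full dataset) is to tally how often each (country, year, vaccine) combination appears; any count above one is a duplicate worth a closer look:

```python
# Hypothetical rows standing in for the loaded `data` list
rows = [
    {"country": "Afghanistan", "year": "2019", "vaccine": "MCV1"},
    {"country": "Afghanistan", "year": "2019", "vaccine": "MCV1"},  # duplicate
    {"country": "Albania", "year": "2019", "vaccine": "MCV1"},
]

# Tally each (country, year, vaccine) combination
combo_counts = {}
for row in rows:
    key = (row["country"], row["year"], row["vaccine"])
    combo_counts[key] = combo_counts.get(key, 0) + 1

# Keep only combinations that appear more than once
duplicates = {key: n for key, n in combo_counts.items() if n > 1}
print(f"Combinations appearing more than once: {len(duplicates)}")  # 1
```

On the real dataset, you would run the same loop over `data` and inspect the combinations it flags.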
Counting Records by Category
Let's count how many records we have per region:
def count_by(data, column):
    """Count the number of rows for each unique value in a column."""
    counts = {}
    for row in data:
        value = row[column]
        counts[value] = counts.get(value, 0) + 1
    return counts

region_counts = count_by(data, "region")
for region, count in sorted(region_counts.items()):
    print(f" {region}: {count} records")
AFRO: 1520 records
AMRO: 820 records
EMRO: 564 records
EURO: 1040 records
SEARO: 308 records
WPRO: 640 records
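As an aside, Python's standard library can already shorten this pattern: `collections.Counter` builds the same tally in one line. We stick to plain dictionaries in this chapter to stay with the tools from Chapters 3 through 5, but it's worth knowing the shortcut exists (shown here on a few made-up rows):

```python
from collections import Counter

# A few made-up rows standing in for the loaded dataset
rows = [{"region": "AFRO"}, {"region": "AFRO"}, {"region": "EURO"}]

# Counter does what count_by() does, in one line
region_counts = Counter(row["region"] for row in rows)
print(region_counts)  # Counter({'AFRO': 2, 'EURO': 1})
```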
AFRO (the African region) has the most records, while SEARO (South-East Asia) has the fewest. This could simply reflect the number of countries in each region. Let's check:
# Count unique countries per region
countries_per_region = {}
for row in data:
    region = row["region"]
    country = row["country"]
    if region not in countries_per_region:
        countries_per_region[region] = set()
    countries_per_region[region].add(country)

for region in sorted(countries_per_region):
    count = len(countries_per_region[region])
    print(f" {region}: {count} countries")
AFRO: 47 countries
AMRO: 35 countries
EMRO: 22 countries
EURO: 53 countries
SEARO: 11 countries
WPRO: 26 countries
That makes sense — Africa has 47 countries in the dataset, and South-East Asia has 11. The record counts roughly correspond to the number of countries.
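The "create the set if it doesn't exist yet" dance above is so common that dictionaries have a shortcut for it: `setdefault` returns the existing value for a key, inserting a default first if the key is absent. A sketch on made-up rows:

```python
# A few made-up rows for illustration
rows = [
    {"region": "AFRO", "country": "Kenya"},
    {"region": "AFRO", "country": "Ghana"},
    {"region": "EURO", "country": "Albania"},
]

countries_per_region = {}
for row in rows:
    # setdefault(key, default) inserts default only if key is absent,
    # then returns the value stored under key
    countries_per_region.setdefault(row["region"], set()).add(row["country"])

print(sorted(countries_per_region["AFRO"]))  # ['Ghana', 'Kenya']
```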
🔄 Check Your Understanding
- Why do experienced data scientists check the first and last rows of a dataset?
- What did we learn by comparing the expected number of rows (3,104) to the actual number (4,892)?
- Write a function call using `count_by()` that would count the number of records per year.
6.4 Computing Summary Statistics by Hand
Now we get to the real work: computing numbers that summarize what the data tells us. In Chapter 19, we'll study summary statistics formally. For now, we'll compute the basics using nothing but the Python you already know.
Step 1: Extract Numeric Values (Safely)
Before we can compute anything, we need to pull out the coverage percentages as actual numbers, handling missing values gracefully:
def get_numeric_values(data, column):
    """Extract non-empty numeric values from a column.

    Returns a list of floats, skipping any rows where the
    value is empty or can't be converted to a number.
    """
    values = []
    skipped = 0
    for row in data:
        raw = row[column].strip()
        if raw == "":
            skipped += 1
            continue
        try:
            values.append(float(raw))
        except ValueError:
            skipped += 1
    print(f" Extracted {len(values)} values, skipped {skipped}")
    return values
print("Coverage percentages:")
coverage_values = get_numeric_values(data, "coverage_pct")
Coverage percentages:
Extracted 4815 values, skipped 77
So 77 out of 4,892 records are missing a coverage percentage. That's about 1.6% — not catastrophic, but worth noting.
Count, Min, Max
The simplest summary statistics are also the most informative for a first look:
print(f"Count: {len(coverage_values)}")
print(f"Min: {min(coverage_values)}")
print(f"Max: {max(coverage_values)}")
Count: 4815
Min: 6.0
Max: 99.0
A minimum of 6% and a maximum of 99%. That's an enormous range. Some country, somewhere, has a vaccination rate of only 6% for one of these vaccines. That's a story — and it's the kind of story that data science exists to uncover.
Mean (Average)
The mean is the sum of all values divided by how many there are. We can compute it with a loop, or more elegantly with the built-in sum() function:
def compute_mean(values):
    """Compute the arithmetic mean of a list of numbers."""
    if not values:
        return None
    return sum(values) / len(values)
mean_coverage = compute_mean(coverage_values)
print(f"Mean coverage: {mean_coverage:.1f}%")
Mean coverage: 82.7%
On average, across all countries, vaccines, and years, coverage is about 82.7%. That's encouraging — but the average hides the enormous variation between that 6% minimum and 99% maximum.
Median (The Middle Value)
The mean can be pulled around by extreme values. The median — the middle value when you sort all the data — gives you a more robust sense of what's "typical." Computing it requires sorting:
def compute_median(values):
    """Compute the median of a list of numbers."""
    if not values:
        return None
    sorted_vals = sorted(values)
    n = len(sorted_vals)
    mid = n // 2
    if n % 2 == 0:
        # Even number of values: average the two middle ones
        return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2
    else:
        # Odd number of values: take the middle one
        return sorted_vals[mid]
median_coverage = compute_median(coverage_values)
print(f"Median coverage: {median_coverage:.1f}%")
Median coverage: 89.0%
The median (89%) is higher than the mean (82.7%). When the median is higher than the mean, it typically means the data is left-skewed — pulled down by a tail of low values. In practical terms, this tells us that most countries have fairly high vaccination coverage, but a smaller group of countries has much lower rates, dragging the average down. That's an important insight.
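You can see the effect with a handful of made-up numbers: four high values and one low one.

```python
# Four high values plus one low outlier: a miniature left-skewed distribution
values = [95, 96, 97, 98, 40]

mean = sum(values) / len(values)           # dragged down by the 40
median = sorted(values)[len(values) // 2]  # the middle value, unmoved by it

print(mean, median)  # 85.2 96
```

The single low value pulls the mean well below the median, exactly the pattern we see in the coverage data.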
Range and Simple Spread
value_range = max(coverage_values) - min(coverage_values)
print(f"Range: {value_range:.1f} percentage points")
Range: 93.0 percentage points
A 93-percentage-point range. The gap between the best-covered and worst-covered country-vaccine-year combinations is staggering.
Putting It All Together: A Summary Function
Let's bundle everything into a reusable function — just like we learned in Chapter 4:
def summarize(values, label="Values"):
    """Print summary statistics for a list of numbers."""
    if not values:
        print(f"{label}: No data available")
        return
    print(f"--- Summary: {label} ---")
    print(f" Count: {len(values)}")
    print(f" Min: {min(values):.1f}")
    print(f" Max: {max(values):.1f}")
    print(f" Range: {max(values) - min(values):.1f}")
    print(f" Mean: {compute_mean(values):.1f}")
    print(f" Median: {compute_median(values):.1f}")
    print()
summarize(coverage_values, "Vaccination Coverage (%)")
--- Summary: Vaccination Coverage (%) ---
Count: 4815
Min: 6.0
Max: 99.0
Range: 93.0
Mean: 82.7
Median: 89.0
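Before leaning on hand-rolled helpers, it's reasonable to sanity-check them against Python's built-in `statistics` module, which provides `mean()` and `median()`. A quick check on made-up values (our `compute_mean` and `compute_median` should agree with the library on any list):

```python
import statistics

# Made-up coverage values for the check
values = [6.0, 62.0, 89.0, 96.0, 99.0]

# statistics.mean matches the sum-over-count definition
assert statistics.mean(values) == sum(values) / len(values)

# statistics.median of five values is the middle one after sorting
assert statistics.median(values) == 89.0

print("statistics module agrees")
```

The chapter avoids `statistics` deliberately, so you build the intuition yourself, but the module is handy for verification.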
Now let's use this function to compare coverage across regions. This is where things get really interesting:
# Compute summary stats by region
for region in sorted(set(row["region"] for row in data)):
    # Filter data for this region
    region_values = []
    for row in data:
        if row["region"] == region and row["coverage_pct"].strip() != "":
            try:
                region_values.append(float(row["coverage_pct"]))
            except ValueError:
                pass
    summarize(region_values, f"Region: {region}")
--- Summary: Region: AFRO ---
Count: 1485
Min: 6.0
Max: 99.0
Range: 93.0
Mean: 72.3
Median: 76.0
--- Summary: Region: AMRO ---
Count: 805
Min: 20.0
Max: 99.0
Range: 79.0
Mean: 86.4
Median: 90.0
--- Summary: Region: EMRO ---
Count: 555
Min: 11.0
Max: 99.0
Range: 88.0
Mean: 80.5
Median: 85.0
--- Summary: Region: EURO ---
Count: 1025
Min: 42.0
Max: 99.0
Range: 57.0
Mean: 93.1
Median: 95.0
--- Summary: Region: SEARO ---
Count: 302
Min: 40.0
Max: 99.0
Range: 59.0
Mean: 87.8
Median: 91.0
--- Summary: Region: WPRO ---
Count: 643
Min: 21.0
Max: 99.0
Range: 78.0
Mean: 88.2
Median: 93.0
Now this is a finding. The African region (AFRO) has a mean vaccination coverage of 72.3% and a median of 76% — substantially lower than the European region (EURO), where the mean is 93.1% and the median is 95%. The gap is over 20 percentage points.
This is the moment. Can you feel it? You just discovered a real pattern in real data. You loaded a CSV file, wrote some loops and functions, and found that vaccination coverage varies dramatically across world regions. This is what data science feels like when it clicks.
📊 Real-World Application
The regional disparities you just computed reflect a well-documented global health challenge. According to WHO reports, the African region consistently has lower vaccination coverage due to a complex mix of factors including healthcare infrastructure, supply chain challenges, political instability, and funding gaps. Your analysis, using nothing more than basic Python, rediscovered a pattern that informs billions of dollars in global health spending. Data science doesn't always require sophisticated algorithms — sometimes a well-computed mean tells a powerful story.
🔄 Check Your Understanding
- Why did we need to handle empty strings before computing summary statistics?
- What does it mean when the median is higher than the mean? What does that tell us about the shape of the distribution?
- Look at the regional summaries. Which region has the smallest range? What might that indicate?
6.5 Spotting Data Quality Issues
If the previous section was the thrill of discovery, this section is the reality check. Real data is never perfectly clean. The WHO dataset we've been exploring has issues — and finding them is just as important as computing statistics. In fact, many experienced data scientists would say it's more important, because if you don't know about problems in your data, every conclusion you draw might be wrong.
Missing Values
We already noticed that some coverage percentages are missing. Let's do a thorough count of missing values across all columns:
print("Missing values by column:")
print("-" * 35)
for col in data[0].keys():
    missing = sum(1 for row in data if row[col].strip() == "")
    pct = (missing / len(data)) * 100
    print(f" {col:25s} {missing:5d} ({pct:.1f}%)")
Missing values by column:
-----------------------------------
country 0 (0.0%)
region 0 (0.0%)
year 0 (0.0%)
vaccine 0 (0.0%)
coverage_pct 77 (1.6%)
target_population 2340 (47.8%)
doses_administered 2340 (47.8%)
Look at those numbers. Country, region, year, and vaccine are complete — no missing values. Coverage percentage has 77 missing values (1.6%), which is manageable. But target_population and doses_administered are missing for nearly half the dataset (47.8%). That's not a minor issue — it means we can't reliably compute doses-per-capita statistics or verify coverage percentages against raw numbers for about half the countries.
This is a data quality finding that would go directly into your notebook narrative: "Nearly half of all records are missing population and dose data, limiting our ability to verify coverage estimates independently."
Types of Missing Data
Not all missing values are the same. Here are the three main categories:
Missing Completely at Random (MCAR): The data is missing for no systematic reason — it's just a random gap. Maybe a data entry clerk skipped a field by accident.
Missing at Random (MAR): The data is missing in a way that's related to something else you can see. For example, smaller countries might not report population figures as consistently as larger ones.
Missing Not at Random (MNAR): The data is missing because of the value itself. For instance, countries with very low vaccination rates might be less likely to report them — which would systematically bias our analysis toward overestimating coverage.
For our dataset, the fact that target_population and doses_administered are missing together (both at 47.8%) suggests these come from the same reporting process. Countries either report both or neither. This is probably MAR — related to the country's reporting infrastructure rather than to the vaccination rate itself.
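That hypothesis is checkable in code: count the rows where exactly one of the two fields is empty. If they always travel together, the count is zero. A sketch on a few made-up rows (on the real data you would loop over `data`):

```python
# Made-up rows: the two fields are either both empty or both present
rows = [
    {"target_population": "", "doses_administered": ""},
    {"target_population": "28000", "doses_administered": "26880"},
    {"target_population": "27500", "doses_administered": "26125"},
]

# != on two booleans is true only when exactly one of them is true
mismatched = sum(
    1 for row in rows
    if (row["target_population"].strip() == "")
    != (row["doses_administered"].strip() == "")
)
print(f"Rows where exactly one field is missing: {mismatched}")  # 0
```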
Suspicious Values: Looking for Outliers
An outlier is a data point that falls far outside the normal range. Sometimes outliers are real (an actual extreme value), and sometimes they're errors (a typo, a unit conversion mistake, a misplaced decimal point). Part of data quality work is identifying them and deciding which they are.
# Find the lowest coverage values
low_coverage = [(row["country"], row["year"], row["vaccine"],
row["coverage_pct"])
for row in data
if row["coverage_pct"].strip() != ""
and float(row["coverage_pct"]) < 20]
print(f"Records with coverage below 20% ({len(low_coverage)} found):")
for country, year, vaccine, cov in sorted(low_coverage, key=lambda x: float(x[3])):
print(f" {country}, {year}, {vaccine}: {cov}%")
Records with coverage below 20% (23 found):
South Sudan, 2021, MCV1: 6%
Somalia, 2020, MCV1: 11%
Papua New Guinea, 2021, MCV1: 13%
South Sudan, 2020, BCG: 14%
...
These very low values are concentrated in countries experiencing conflict, political instability, or severe infrastructure challenges — South Sudan, Somalia, Papua New Guinea. They're almost certainly real values, not errors. But it's important that we checked.
What about suspiciously high values?
# Any coverage above 100%?
above_100 = [row for row in data
if row["coverage_pct"].strip() != ""
and float(row["coverage_pct"]) > 100]
print(f"Records with coverage above 100%: {len(above_100)}")
Records with coverage above 100%: 0
No values above 100%. In some WHO datasets, coverage can exceed 100% when doses administered include catch-up campaigns or when population estimates are outdated. The fact that our dataset caps at 99% suggests it may have already been cleaned.
Inconsistent Formats
Let's check for inconsistencies in the categorical columns:
# Check for extra whitespace or case variations in country names
country_issues = []
for row in data:
name = row["country"]
if name != name.strip():
country_issues.append(f"Leading/trailing whitespace: '{name}'")
# Flag all-lowercase or ALL-CAPS names; a strict title-case test would
# misfire on legitimate names like "Bosnia and Herzegovina"
if name.islower() or name.isupper():
country_issues.append(f"Unusual capitalization: '{name}'")
if country_issues:
print(f"Found {len(country_issues)} country name issues:")
for issue in country_issues[:5]:
print(f" {issue}")
else:
print("No obvious country name inconsistencies found.")
No obvious country name inconsistencies found.
Clean. But let's check something subtler — are there any countries that appear with different names in different rows?
# Check for near-duplicate country names (quick check)
countries_list = sorted(set(row["country"] for row in data))
print(f"First 20 countries: {countries_list[:20]}")
print(f"Last 20 countries: {countries_list[-20:]}")
This kind of manual inspection sometimes catches issues like "United States" vs. "United States of America" or "Cote d'Ivoire" vs. "Ivory Coast." Being thorough about data quality now saves you from drawing incorrect conclusions later.
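If eyeballing a sorted list feels error-prone, the standard library's `difflib` can automate a rough similarity check. The country list below is a small invented sample; on the real data you'd pass in the `countries_list` built in the previous cell:

```python
import difflib

# Invented sample; substitute the real sorted country list here
countries = ["United States", "United States of America",
             "Germany", "France", "Cote d'Ivoire"]

for name in countries:
    others = [c for c in countries if c != name]
    # cutoff=0.6 is a loose threshold; raise it if you get too many hits
    matches = difflib.get_close_matches(name, others, n=1, cutoff=0.6)
    if matches:
        print(f"Possible duplicate: '{name}' ~ '{matches[0]}'")
```

Anything this flags still needs a human decision: "United States" and "United States of America" are the same country, but "Niger" and "Nigeria" are not.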
A Data Quality Report
Let's formalize our findings into a data quality summary — the kind you'd include at the top of a professional analysis:
print("=" * 50)
print("DATA QUALITY REPORT")
print("=" * 50)
print()
print(f"Dataset: WHO Vaccination Coverage Estimates")
print(f"Records: {len(data)}")
print(f"Columns: {len(data[0])}")
print()
print("COMPLETENESS:")
for col in data[0].keys():
missing = sum(1 for row in data if row[col].strip() == "")
complete_pct = ((len(data) - missing) / len(data)) * 100
status = "OK" if complete_pct > 95 else "WARNING" if complete_pct > 50 else "CRITICAL"
print(f" {col:25s} {complete_pct:5.1f}% complete [{status}]")
print()
print("RANGE CHECKS:")
print(f" Coverage values: {min(coverage_values):.0f}% to {max(coverage_values):.0f}%")
print(f" Values above 100%: None found")
print(f" Values below 10%: {sum(1 for v in coverage_values if v < 10)} records")
print()
print("CONSISTENCY:")
print(f" Country names: No duplicates or format issues detected")
print(f" Region codes: {len(regions)} valid codes found")
print(f" Year range: {min(years)} to {max(years)}")
📝 Productive Struggle: Explore Before You're Told What to Find
Here's a challenge that's intentionally open-ended: Before reading any further, take 15 minutes to explore the dataset on your own. Ask your own questions and write code to answer them. Some ideas to get you started (but don't limit yourself to these):
- Did vaccination coverage go up or down between 2019 and 2022 (the period that includes the COVID-19 pandemic)?
- Which vaccine has the highest average coverage? The lowest?
- Are there countries that improved dramatically from one year to the next?
- Is there a relationship between having complete data (no missing values) and having higher coverage?
There are no right or wrong answers here. The point is to practice the habit of curiosity — to look at data and generate questions naturally. Come back when you've found something that surprised you.
🔄 Check Your Understanding
- Why are `target_population` and `doses_administered` missing at exactly the same rate? What does this suggest?
- Give an example of a missing value that would be MNAR (missing not at random) in a vaccination dataset.
- Why is it important to check for values above 100% in a percentage column?
6.6 Telling the Story: Notebook as Narrative
You've loaded data. You've computed statistics. You've found missing values and explored regional differences. Now comes a step that separates good data science from great data science: telling the story.
The Notebook Is Not a Code Dump
If you hand someone a Jupyter notebook that's just cell after cell of code and output, they'll learn nothing. A notebook should read like a document — with an introduction that explains what you're investigating, section headers that guide the reader through your logic, and Markdown text between code cells that explains why you did what you did and what you learned.
This concept is called a notebook narrative. It means treating your Jupyter notebook not as a scratch pad, but as a communication tool.
Here's what a well-structured notebook looks like:
# WHO Vaccination Coverage: Initial Exploration
## Background and Questions
This notebook explores the WHO vaccination coverage dataset (WUENIC estimates).
We're investigating three questions:
1. What does the overall distribution of coverage look like?
2. How does coverage vary by region?
3. What data quality issues should we be aware of?
## Data Loading
We'll use Python's csv module to load the data.
Then a code cell with the loading code. Then:
## First Impressions
We have 4,892 records covering 194 countries, 6 WHO regions, 4 vaccines,
and 4 years (2019-2022). Let's explore the structure.
Then code for exploration. Then your observations. And so on.
The Principle of Reproducibility
Reproducibility means that someone else — or future-you — can run your notebook from top to bottom and get the same results. This is one of the foundational principles of scientific work, and it applies to data science just as much as it applies to laboratory science.
Here's what reproducibility requires in practice:
- The data file is accessible. Either it's included with the notebook, or the notebook documents exactly where to download it.
- The code runs in order. Every cell builds on the cells above it. No hidden state from running cells out of order.
- Dependencies are documented. In our case, we're only using the standard library (`csv`), so there's nothing extra to install. But in real projects, you'd list your requirements.
- Random processes are seeded. If any step involves randomness, set a seed so results are identical on every run. (We're not using randomness here, but you will later.)
A good habit is to periodically restart your Jupyter kernel and run all cells from scratch to confirm that the notebook actually works from top to bottom. You'd be surprised how often notebooks break because the analyst ran cells out of order and built up hidden state.
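Seeding is worth seeing once, even before you need it. With the same seed, "random" draws come out identical on every run, which is exactly what reproducibility demands:

```python
import random

random.seed(42)                                  # fix the starting point
first_run = [random.randint(1, 100) for _ in range(3)]

random.seed(42)                                  # same seed...
second_run = [random.randint(1, 100) for _ in range(3)]

print(first_run == second_run)  # True -- identical draws on both runs
```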
Marcus's Story: From Numbers to Narrative
Marcus, our bakery owner from Chapter 1, recently finished his first data analysis. He'd exported his sales data to a CSV file and used Python to compute average daily sales, find his best-selling items, and check whether Saturday sales were really higher than weekday sales (they were — by 43%).
His first notebook was just code. When he showed it to his business partner Kenji, Kenji's eyes glazed over. So Marcus rewrote it as a narrative:
"I wanted to know whether our Saturday marketing push is actually working. I loaded three months of sales data (2,847 transactions) and compared Saturday revenue to our weekday average. The result: Saturday average revenue is $2,340, compared to the weekday average of $1,637. That's a 43% lift. But here's what surprised me — our margin on Saturdays is actually lower, because we sell more of the lower-margin pastries on weekends. We might want to rethink our Saturday specials to promote higher-margin items."
That's a notebook narrative. It starts with a question, presents evidence, and arrives at a recommendation. The code is still there — Kenji can check it if he wants — but the story is what drives the document.
Your Turn: Document Your Exploration
As you work through the WHO dataset, keep these principles in mind:
Start each section with a question. Instead of a code cell that just appears out of nowhere, write a Markdown cell above it: "Next, I want to know whether coverage changed during the COVID-19 pandemic (2019-2022)."
Annotate surprising results. When your output shows something unexpected — like the AFRO region having 20+ percentage points lower coverage than EURO — write a sentence explaining what you see and what it might mean.
End with a summary. After your exploration, write a "Findings" section that lists 3-5 things you learned. This forces you to synthesize your work and is the most valuable part of the notebook for any reader.
🔄 Check Your Understanding
- What is the difference between a notebook that's a "code dump" and one that's a "narrative"?
- Name three things that make a notebook reproducible.
- Why did Marcus's rewritten notebook work better for his business partner than the code-only version?
6.7 The Limits of Pure Python: Why We Need Better Tools
By now, you should be feeling two things simultaneously: pride at what you've accomplished, and frustration at how much code it took.
That frustration is by design. Let me show you what I mean.
The Verbosity Problem
To compute the mean coverage by region, we wrote something like this:
for region in sorted(set(row["region"] for row in data)):
region_values = []
for row in data:
if row["region"] == region and row["coverage_pct"].strip() != "":
try:
region_values.append(float(row["coverage_pct"]))
except ValueError:
pass
if region_values:
mean_val = sum(region_values) / len(region_values)
print(f"{region}: {mean_val:.1f}%")
That's about 10 lines of code to compute one statistic grouped by one column. In pandas (Chapter 7), the same computation is:
df.groupby("region")["coverage_pct"].mean()
One line. And it's not just shorter — it's clearer. Even someone who doesn't know Python can read that line and understand what it does: group by region, take the coverage percentage column, compute the mean.
The Type Conversion Problem
Every time we wanted to do math, we had to convert strings to numbers and handle empty strings. In pandas, when you load a CSV, it automatically detects numeric columns and converts them for you. Missing values become NaN (Not a Number), which pandas handles gracefully in all calculations.
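For comparison, here's the kind of small helper we kept needing in pure Python, a sketch of the conversion boilerplate that pandas makes unnecessary:

```python
def to_float(value):
    """Convert a CSV string field to a float; return None for blanks or junk."""
    value = value.strip()
    if value == "":
        return None
    try:
        return float(value)
    except ValueError:
        return None

print([to_float(v) for v in ["87", "", "92.5", "n/a"]])
# [87.0, None, 92.5, None]
```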
The Filtering Problem
Suppose you want to find all records for countries in the African region with measles (MCV1) coverage below 50% in 2022. In pure Python:
results = []
for row in data:
if (row["region"] == "AFRO"
and row["vaccine"] == "MCV1"
and row["year"] == "2022"
and row["coverage_pct"].strip() != ""
and float(row["coverage_pct"]) < 50):
results.append(row)
In pandas:
df[(df["region"] == "AFRO") &
(df["vaccine"] == "MCV1") &
(df["year"] == 2022) &
(df["coverage_pct"] < 50)]
Both work. But the pandas version is more concise, more readable, and runs much faster on large datasets.
The Visualization Problem
We haven't even tried to make a chart yet. With pure Python, creating a plot from scratch is possible but painful. With matplotlib (Chapter 15) and pandas plotting, it's a single line. Imagine trying to create a bar chart of mean coverage by region using only print() statements. You can do it — but you shouldn't have to.
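To make that concrete, here's roughly what a print()-only bar chart looks like. The AFRO and EURO means come from our earlier exploration; the other four regions are placeholder numbers for illustration only:

```python
# AFRO and EURO means are from the earlier analysis; the rest are
# placeholders, not real WHO figures
region_means = {"AFRO": 72.3, "AMRO": 84.1, "EMRO": 81.5,
                "EURO": 93.1, "SEARO": 85.7, "WPRO": 90.2}

for region, mean in sorted(region_means.items()):
    bar = "#" * round(mean / 2)   # one character per 2 percentage points
    print(f"{region:6s} {bar} {mean:.1f}%")
```

It works, after a fashion. But it can't be resized, labeled properly, exported, or shown to a stakeholder, which is why Chapter 15 exists.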
Why This Matters
I'm not showing you these comparisons to make you feel bad about the code you just wrote. Quite the opposite — I'm showing you to emphasize that what you did in this chapter is remarkable. You loaded real data, computed real statistics, and found real patterns using nothing but Python fundamentals. You understand what those operations actually do, at the lowest level. When you learn pandas in Chapter 7, you won't just be memorizing magic incantations — you'll understand what's happening under the hood.
This is like learning to cook from scratch before using a food processor. Yes, the food processor is faster. But the person who has hand-chopped onions knows something the person who's only pressed "pulse" will never know. That knowledge makes you a better cook — and in our case, a better data scientist.
📊 Real-World Application
Priya, our sports journalist, recently talked about this exact experience in a podcast interview. She said: "When I first learned pandas, I didn't appreciate it because I didn't know what it was saving me from. It was just, you know, 'type this and stuff appears.' Then I took a workshop where they made us do everything in pure Python first. Loading CSVs, computing averages, handling missing data — all by hand. After that, when they showed us pandas, I literally said 'oh thank God' out loud. I finally understood why the library exists."
A Preview of Part II
In Part II, we'll introduce the tools that make data analysis feel less like building a house brick by brick and more like snapping together LEGO blocks:
- Chapter 7: Introduction to pandas — DataFrames, Series, and the grammar of data manipulation
- Chapter 8: Cleaning Messy Data — Professional techniques for handling the problems you spotted manually
- Chapter 9: Reshaping and Transforming — Merging datasets, pivoting tables, grouping and aggregating
- Chapters 10-13: Working with text, dates, files, and web data — Specialized tools for specialized data types
Every single one of those chapters will build on what you learned here. The questions you asked, the patterns you noticed, the frustrations you felt — they all carry forward.
Project Checkpoint: Your First Real Exploration of WHO Data
This is the major milestone for Part I of the progressive project. Here's what you should have in your project notebook by the end of this chapter:
Deliverable: "First Look" Notebook Section
Your project notebook should now contain a new section with the following elements:
1. Data Loading and Provenance
# Document the data source
print("Dataset: WHO/UNICEF Estimates of National Immunization Coverage")
print("Source: World Health Organization")
print("File: who_vaccination_data.csv")
print("Downloaded: [your date]")
# Load the data
import csv
data = []
with open("who_vaccination_data.csv", "r", encoding="utf-8") as f:
reader = csv.DictReader(f)
for row in reader:
data.append(row)
print(f"Loaded {len(data)} records")
2. Basic Inspection
- Shape (rows x columns)
- Column names and types
- First and last 5 rows
- Unique values for categorical columns
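Each inspection step is only a line or two of pure Python. A compact sketch, using a two-row stand-in for the loaded `data`:

```python
# Two illustrative rows standing in for the loaded dataset
data = [
    {"country": "Chad", "region": "AFRO", "year": "2021",
     "vaccine": "MCV1", "coverage_pct": "55"},
    {"country": "Chad", "region": "AFRO", "year": "2022",
     "vaccine": "MCV1", "coverage_pct": "58"},
]

print(f"Shape: {len(data)} rows x {len(data[0])} columns")
print(f"Columns: {list(data[0].keys())}")
print(f"First rows: {data[:5]}")
print(f"Regions: {sorted(set(row['region'] for row in data))}")
```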
3. Summary Statistics
- Overall coverage: count, min, max, mean, median
- Coverage by region
- Coverage by vaccine type
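The one statistic on that list that takes more than a line is the median (the standard library's `statistics.median` provides it too, but writing it once shows the logic):

```python
def median(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([6, 99, 83]))      # 83
print(median([6, 99, 83, 90]))  # 86.5
```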
4. Data Quality Assessment
- Missing values by column
- Any values outside expected ranges
- Consistency checks on categorical columns
5. Initial Observations (Markdown)
Write at least three observations in plain English. For example:
- "Vaccination coverage varies dramatically by region, with AFRO averaging 72.3% compared to EURO's 93.1%."
- "Nearly half of records are missing population and dose data."
- "The lowest coverage values are concentrated in conflict-affected countries."
6. Questions for Further Investigation
List 2-3 questions that your exploration raised but didn't answer. For example:
- "Did coverage rates decline during COVID-19 (2020-2021) and recover in 2022?"
- "Which specific countries are driving the low averages in the AFRO region?"
✅ Action Checklist: The EDA Workflow
Use this checklist every time you sit down with a new dataset. Tape it to your monitor until it becomes second nature.
- [ ] Define your questions. Write 2-3 specific questions before touching the data.
- [ ] Document provenance. Record where the data came from, when, and any known limitations.
- [ ] Load and verify. Load the data and confirm the row count, column count, and column names.
- [ ] Inspect structure. Check first/last rows, data types, and unique values for categorical columns.
- [ ] Assess completeness. Count missing values for every column. Note any patterns.
- [ ] Compute summary statistics. Count, min, max, mean, median for each numeric column.
- [ ] Check ranges. Look for impossible or suspicious values (negative percentages, ages over 150, etc.).
- [ ] Break it down. Compute statistics by relevant groups (region, year, category).
- [ ] Document findings. Write your observations in plain English between code cells.
- [ ] List next questions. End with questions your exploration raised but didn't answer.
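The "check ranges" step generalizes into a tiny reusable helper, sketched here with invented example numbers:

```python
def out_of_range(values, low, high):
    """Return the values outside [low, high] -- candidates for manual review."""
    return [v for v in values if v < low or v > high]

coverage = [87.0, 92.5, -3.0, 104.0, 55.0]   # invented example values
print(out_of_range(coverage, 0, 100))        # [-3.0, 104.0]
```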
Practical Considerations
Working with Larger Files
The WHO dataset we used has about 5,000 rows, which pure Python handles easily. But what if your file had 5 million rows? You'd start noticing performance issues. Reading the entire file into a list of dictionaries uses a lot of memory, and looping over millions of rows in Python is slow compared to optimized libraries.
For now, the pure Python approach works fine for datasets under about 100,000 rows. Beyond that, you'll want the tools from Part II (pandas uses optimized C code under the hood, so it's dramatically faster).
File Encoding
We used encoding="utf-8" when opening the file. This is a good default, but you'll occasionally encounter files with different encodings — especially files from older systems or files that originated in non-English-speaking countries. If you see garbled characters (like é instead of é), the encoding is likely wrong. Common alternatives include "latin-1" and "cp1252". Chapter 12 will cover this in depth.
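When the encoding is unknown, one pragmatic trick is to try the common candidates in order until one decodes without error. A sketch (the file written at the top is just a stand-in so the snippet runs on its own):

```python
# Write a small latin-1 file so there's something to decode
path = "sample_latin1.csv"
with open(path, "w", encoding="latin-1") as f:
    f.write("country,city\nFrance,Orléans\n")

for encoding in ["utf-8", "latin-1", "cp1252"]:
    try:
        with open(path, "r", encoding=encoding) as f:
            text = f.read()
        print(f"Decoded with {encoding}")
        break
    except UnicodeDecodeError:
        print(f"{encoding} failed, trying the next candidate...")
```

This is a heuristic, not a guarantee: `latin-1` will decode almost any byte stream without error, so always eyeball the result for garbled characters.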
File Paths
We assumed the CSV file is in the same directory as the notebook. In practice, it's good to organize your project with a dedicated data/ folder. You'd then load it as:
with open("data/who_vaccination_data.csv", "r", encoding="utf-8") as f:
...
Or use Python's pathlib module for cross-platform path handling:
from pathlib import Path
data_path = Path("data") / "who_vaccination_data.csv"
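A small extra guard worth adopting: check that the path exists before loading, so a wrong working directory produces a clear message instead of a traceback:

```python
from pathlib import Path

data_path = Path("data") / "who_vaccination_data.csv"
if not data_path.exists():
    print(f"File not found: {data_path} -- am I in the right directory?")
```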
Summary
Let's step back and appreciate what you accomplished in this chapter.
You took a raw CSV file — a flat text file full of commas and strings — and turned it into knowledge. You know that global vaccination coverage averages about 83%, but that average hides a 93-percentage-point range from the lowest to the highest. You know that the African region has significantly lower coverage than Europe. You know that nearly half the dataset is missing population data. And you know that even a simple analysis raises more questions than it answers — which is exactly how data science is supposed to work.
Key concepts from this chapter:
- Exploratory data analysis (EDA) is the process of systematically examining a dataset to discover patterns, spot anomalies, check assumptions, and generate hypotheses. It's a conversation with your data.
- Data loading with Python's
csv.DictReadergives you a list of dictionaries — one per row, with column names as keys and everything as strings. - Data inspection means checking shape, column names, first/last rows, unique values, and data types — the first things you do with any new dataset.
- Summary statistics (count, min, max, mean, median) give you a numerical snapshot of each column's distribution.
- Data quality assessment involves checking for missing values, outliers, inconsistent formats, and suspicious values. Every quality issue you find now prevents a wrong conclusion later.
- A data dictionary documents what each column contains, its expected type, and any notes about quality or meaning.
- Data provenance records where the data came from, who collected it, when, and for what purpose.
- A notebook narrative combines code, output, and explanatory text into a document that tells the story of your analysis.
- Reproducibility means anyone can rerun your notebook and get the same results.
The threshold concept: EDA is a conversation. You ask, the data answers, and each answer leads to the next question. Good data scientists never stop being curious — and the best analyses are the ones where the data surprised the analyst.
Spaced Review: Chapters 1-5
This is the end of Part I. Before moving to Part II, let's make sure the foundational concepts from the first five chapters are solid. Try to answer these from memory before checking back.
From Chapter 1: What Is Data Science?
1. What are the three main skills that overlap in data science?
2. Name the six stages of the data science lifecycle.
3. What's the difference between a descriptive question and a predictive question?
From Chapter 2: Setting Up Your Toolkit
4. What is a Jupyter notebook, and why is it useful for data science?
5. What is the difference between a code cell and a Markdown cell?
From Chapter 3: Python Fundamentals I
6. What is the difference between = and == in Python?
7. What does an if/elif/else block do?
8. What is the output of "hello" + " " + "world"?
From Chapter 4: Python Fundamentals II
9. What is a function, and why do we write them?
10. What is the difference between a parameter and an argument?
11. What does a for loop do?
From Chapter 5: Working with Data Structures
12. What is the difference between a list and a dictionary?
13. How do you access the value associated with the key "name" in a dictionary called person?
14. What does csv.DictReader return for each row of a CSV file?
If any of these felt shaky, go back and review the relevant chapter before moving on. Part II assumes solid command of all this material.
What's Next: Part II — Data Wrangling
Congratulations — you've completed Part I: Welcome to Data Science. You've gone from "what is data science?" to actually doing data science. That's a significant accomplishment, and you should feel proud of it.
But you've also felt the limitations. Computing grouped averages with nested loops. Converting strings to numbers by hand. Filtering with multi-line if statements. These are real pain points, and they're about to go away.
In Part II, we introduce the tools that professional data scientists use every day:
Chapter 7 introduces pandas, the library that turns all of your loop-and-dictionary code into concise, powerful one-liners. The same analysis you did in this chapter — loading data, computing statistics by region, finding missing values — will take about ten lines of pandas code instead of a hundred lines of pure Python. You'll rebuild your WHO analysis in pandas and feel the difference immediately.
Chapters 8-13 teach you to clean messy data, reshape tables, work with text and dates, load data from files and the web, and handle all the real-world complexity that comes with genuine datasets.
The frustration you felt in this chapter is about to transform into relief. And the understanding you built — of what each operation actually does — will make you a more effective pandas user than someone who jumped straight to the library without doing it the hard way first.
Turn the page. Part II is waiting. And it's going to feel like a superpower.