Chapter 5: Working with Data Structures: Dictionaries, Files, and Thinking in Data

Contributors to Introduction to Data Science

Prerequisites

Chapter 4: Control flow, functions, and basic list usage from Python Fundamentals II

In This Chapter

What You'll Learn
Why This Chapter Matters
5.1 Lists Revisited: Beyond Simple Sequences
5.2 Dictionaries: The Data Scientist's Best Friend
5.3 Sets and Tuples: When Order and Uniqueness Matter
5.4 Nesting: Lists of Dictionaries and Beyond
5.5 List and Dictionary Comprehensions: Concise Data Transformation
5.6 Reading Data from Files
5.7 Writing Data to Files
Project Checkpoint: Country-to-Region Mappings and Your First CSV Read
Practical Considerations
Summary: Choosing the Right Data Structure
Spaced Review: Retrieval Practice from Chapters 1-4
What's Next
Chapter 5 Vocabulary Reference

"The question of data is not 'what number?' but 'how is it organized?'" — Adapted from a common observation in data engineering

What You'll Learn

By the end of this chapter, you will be able to:

Create and manipulate lists, dictionaries, sets, and tuples, choosing the right structure for the data at hand
Access nested data structures (list of dictionaries, dictionary of lists) that mirror real-world data shapes
Read data from a text file and a CSV file using Python's built-in open() and csv module
Transform data structures using list comprehensions and dictionary comprehensions
Compare Python data structures (list vs. dict vs. set vs. tuple) and justify which to use in a given scenario

Estimated time: 5 hours

Why This Chapter Matters

In Chapter 3 you learned to store single values in variables: a patient count, a country name, a vaccination rate. In Chapter 4 you learned to write loops and functions that process those values. But here is the truth that every data scientist discovers quickly: real data is never a single value. It is collections of values, organized by structure.

A vaccination record is not just a number. It is a country name linked to a date linked to a count linked to a vaccine type. A dataset is not just a list of numbers. It is rows and columns, keys and values, relationships and hierarchies.

This chapter is where your mental model shifts. Up until now, you have been learning to program. Starting now, you are learning to think in data. The tools are Python's built-in data structures (lists, dictionaries, sets, and tuples) and Python's file-reading capabilities. The data science libraries you will meet later (pandas, NumPy, scikit-learn) accept these structures as input and return results you will unpack into them. Understanding them deeply will make everything that follows in this book easier.

🚪 Threshold Concept: Data as Structured Collections

Here is the shift that separates "I know some Python" from "I can work with data": instead of thinking about isolated values, you start seeing the world as mappings, sequences, and sets.

A patient's vitals? That is a dictionary — a mapping from measurement names to values.

The countries in South America? That is a set — a collection where uniqueness matters and order does not.

A row of CSV data? That is a list — an ordered sequence of fields.

A GPS coordinate? That is a tuple — a fixed pair of values that should never change.

A whole dataset? That is a list of dictionaries — each dictionary is a row, each key is a column name.

Once you start seeing data this way, you will notice it everywhere: in spreadsheets, in JSON files from web APIs, in database tables, in the structure of a news article's data. This chapter builds that vision.

5.1 Lists Revisited: Beyond Simple Sequences

You met lists briefly in Chapter 4 when you iterated over them with for loops. Let's go deeper. Lists are the workhorse of Python data handling, and mastering their methods will serve you in every chapter that follows.

Creating and Inspecting Lists

A list is an ordered, mutable collection of items. "Ordered" means items have positions (indices). "Mutable" means you can change, add, or remove items after creation.

# A list of vaccination rates (percentages) for five countries
vaccination_rates = [72.3, 85.1, 41.7, 93.2, 68.5]

# A list of country names
countries = ["Brazil", "Canada", "Chad", "Denmark", "Ethiopia"]

# Lists can hold different types (though in data science, they usually don't)
mixed = [42, "hello", True, 3.14]

# How many items?
print(len(vaccination_rates))  # 5

# What is the first item? (Indices start at 0)
print(countries[0])  # "Brazil"

# What is the last item?
print(countries[-1])  # "Ethiopia"

Essential List Methods

Here are the methods you will use constantly. Try each one in a Jupyter cell.

temperatures = [22.1, 19.8, 25.4, 18.3, 22.1]

# append — add an item to the end
temperatures.append(27.0)
print(temperatures)  # [22.1, 19.8, 25.4, 18.3, 22.1, 27.0]

# insert — add an item at a specific position
temperatures.insert(0, 15.5)
print(temperatures)  # [15.5, 22.1, 19.8, 25.4, 18.3, 22.1, 27.0]

# remove — remove the first occurrence of a value
temperatures.remove(22.1)
print(temperatures)  # [15.5, 19.8, 25.4, 18.3, 22.1, 27.0]

# pop — remove and return the item at an index (default: last)
last = temperatures.pop()
print(last)           # 27.0
print(temperatures)   # [15.5, 19.8, 25.4, 18.3, 22.1]

# sort — sort in place (modifies the list itself)
temperatures.sort()
print(temperatures)   # [15.5, 18.3, 19.8, 22.1, 25.4]

# sorted() — return a NEW sorted list (original unchanged)
original = [3, 1, 4, 1, 5]
new_sorted = sorted(original)
print(original)       # [3, 1, 4, 1, 5]  (unchanged!)
print(new_sorted)     # [1, 1, 3, 4, 5]

# count — how many times does a value appear?
print([3, 1, 4, 1, 5].count(1))  # 2

# index — where does a value first appear?
print(["a", "b", "c", "b"].index("b"))  # 1

Check Your Understanding: What is the difference between temperatures.sort() and sorted(temperatures)? Why does the distinction matter?

Answer

.sort() modifies the list in place and returns None. sorted() returns a new list and leaves the original untouched. This matters because if you write result = temperatures.sort(), the variable result will be None — a common beginner mistake. Use sorted() when you want to keep the original list intact.
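
The pitfall described in this answer is easy to reproduce. The following is a small sketch with made-up readings; the variable names are only for illustration.

```python
temperatures = [22.1, 19.8, 25.4]

# .sort() reorders the list in place and returns None,
# so assigning its result captures nothing useful.
result = temperatures.sort()
print(result)        # None
print(temperatures)  # [19.8, 22.1, 25.4]

# sorted() builds a new list and leaves its argument alone.
readings = [22.1, 19.8, 25.4]
ordered = sorted(readings, reverse=True)
print(readings)      # [22.1, 19.8, 25.4]
print(ordered)       # [25.4, 22.1, 19.8]
```

If you ever see None where you expected a list, check whether an in-place method such as .sort() ended up on the right-hand side of an assignment.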

Slicing: Extracting Sublists

Slicing lets you grab a portion of a list. The syntax is list[start:stop:step], where start is inclusive and stop is exclusive.

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# First quarter
print(months[0:3])    # ["Jan", "Feb", "Mar"]

# Third quarter
print(months[6:9])    # ["Jul", "Aug", "Sep"]

# Every other month
print(months[::2])    # ["Jan", "Mar", "May", "Jul", "Sep", "Nov"]

# Reverse the list
print(months[::-1])   # ["Dec", "Nov", "Oct", ..., "Jan"]

Lists as Tables: Nested Lists

Here is a preview of "thinking in data." You can represent a simple table as a list of lists, where each inner list is a row:

# Country, Region, Vaccination Rate (%)
health_data = [
    ["Brazil",    "Americas",       72.3],
    ["Canada",    "Americas",       85.1],
    ["Chad",      "Africa",         41.7],
    ["Denmark",   "Europe",         93.2],
    ["Ethiopia",  "Africa",         68.5],
]

# Access Denmark's vaccination rate (row index 3, column index 2)
print(health_data[3][2])  # 93.2

# Print all country names
for row in health_data:
    print(row[0])

This works, but it is fragile. What does health_data[3][2] mean? You have to remember that index 2 is the vaccination rate. If you add a column, all the index numbers shift. There has to be a better way.

There is. It is called a dictionary.

5.2 Dictionaries: The Data Scientist's Best Friend

If you learn one data structure deeply from this chapter, make it the dictionary. Dictionaries are so central to data science in Python that you will encounter them in virtually every chapter that follows.

What Is a Dictionary?

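A dictionary (abbreviated dict) is a collection of key-value pairs. Instead of accessing items by position (like a list), you access them by name. Since Python 3.7, dictionaries also remember the order in which keys were inserted, but you still look values up by key, not by position.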

Think of a real dictionary: you look up a word (the key) and find its definition (the value). A Python dictionary works the same way.

# A patient record as a dictionary
patient = {
    "name": "Maria Santos",
    "age": 34,
    "blood_pressure": "120/80",
    "vaccinated": True,
    "vaccine_type": "mRNA"
}

# Access a value by its key
print(patient["name"])       # "Maria Santos"
print(patient["vaccinated"]) # True

Notice the difference from lists: patient["name"] is far more readable than patient[0]. The key describes what the value represents. Your code documents itself.

Creating Dictionaries

# Method 1: Curly braces with key: value pairs
country_info = {
    "name": "Brazil",
    "region": "Americas",
    "population": 214_000_000,
    "vaccination_rate": 72.3
}

# Method 2: dict() constructor with keyword arguments
country_info = dict(
    name="Brazil",
    region="Americas",
    population=214_000_000,
    vaccination_rate=72.3
)

# Method 3: From a list of tuples (useful when building dicts programmatically)
pairs = [("name", "Brazil"), ("region", "Americas")]
country_info = dict(pairs)

# An empty dictionary
empty = {}
# or
empty = dict()

Accessing, Adding, and Modifying Values

country = {"name": "Chad", "region": "Africa", "vaccination_rate": 41.7}

# Access a value
print(country["name"])  # "Chad"

# Add a new key-value pair
country["population"] = 17_400_000
print(country)
# {"name": "Chad", "region": "Africa", "vaccination_rate": 41.7, "population": 17400000}

# Modify an existing value
country["vaccination_rate"] = 43.2  # Updated!

# Delete a key-value pair
del country["population"]

The .get() Method: Safe Access

What happens when you try to access a key that does not exist?

country = {"name": "Chad", "region": "Africa"}
print(country["vaccination_rate"])  # KeyError!

This KeyError is one of the most common errors in Python data work. The .get() method provides a safe alternative:

# .get() returns None if the key doesn't exist
rate = country.get("vaccination_rate")
print(rate)  # None

# You can provide a default value
rate = country.get("vaccination_rate", 0.0)
print(rate)  # 0.0

# The key "name" does exist, so .get() returns the value
name = country.get("name", "Unknown")
print(name)  # "Chad"
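
One place where the default argument of .get() earns its keep is counting. The sketch below tallies how often each region appears in a list; the list itself is invented for illustration.

```python
regions_seen = ["Africa", "Americas", "Africa", "Europe", "Africa"]

counts = {}
for region in regions_seen:
    # The first time a region appears, .get() supplies 0 instead of raising KeyError.
    counts[region] = counts.get(region, 0) + 1

print(counts)  # {'Africa': 3, 'Americas': 1, 'Europe': 1}
```

Without the default, the first occurrence of each region would need its own if/else branch.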

🐛 Debugging Walkthrough: The KeyError

You will encounter KeyError many times in your data science journey. Here is a typical scenario and how to fix it:

```python
patient = {"name": "Li Wei", "age": 28, "blood_type": "A+"}

# Attempt to access a key that doesn't exist
print(patient["bloodtype"])  # KeyError: 'bloodtype'
```

What happened? The key is "blood_type" (with an underscore), but you typed "bloodtype" (without). Dictionary keys must match exactly, character for character, including case.

How to fix it:

1. Check the exact key name: print(patient.keys())
2. Use .get() for safe access: patient.get("bloodtype", "not found")
3. Use "bloodtype" in patient to test whether a key exists before accessing it

Prevention tip: When working with data from files or APIs, always inspect the keys first. A quick print(data.keys()) or print(list(data.keys())) saves hours of debugging.
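
The three fixes from this walkthrough can be tried side by side. This is a minimal sketch that reuses the patient dictionary from the example; the fallback text "not found" is an arbitrary choice.

```python
patient = {"name": "Li Wei", "age": 28, "blood_type": "A+"}

# 1. Inspect the keys that actually exist.
print(list(patient.keys()))  # ['name', 'age', 'blood_type']

# 2. Test for membership before indexing.
if "bloodtype" in patient:
    print(patient["bloodtype"])
else:
    print("No key named 'bloodtype'")

# 3. Or catch the error where a missing key is a real possibility.
try:
    blood = patient["bloodtype"]
except KeyError as err:
    blood = "not found"
    print(f"Missing key: {err}")  # Missing key: 'bloodtype'

print(blood)  # not found
```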

Dictionary Methods You Should Know

patient = {"name": "Maria", "age": 34, "vaccinated": True}

# .keys() — all the keys
print(list(patient.keys()))    # ["name", "age", "vaccinated"]

# .values() — all the values
print(list(patient.values()))  # ["Maria", 34, True]

# .items() — all key-value pairs as tuples
print(list(patient.items()))   # [("name", "Maria"), ("age", 34), ("vaccinated", True)]

# .update() — merge another dictionary into this one
patient.update({"blood_type": "O+", "age": 35})  # age is updated too!
print(patient)
# {"name": "Maria", "age": 35, "vaccinated": True, "blood_type": "O+"}

# Iterating over a dictionary
for key, value in patient.items():
    print(f"{key}: {value}")

Country-to-Region Mappings: A Data Science Workhorse

One of the most common uses of dictionaries in data science is mapping — translating from one set of values to another. Here is a real pattern you will use in the progressive project:

# WHO region assignments for selected countries
country_to_region = {
    "Brazil": "Americas",
    "Canada": "Americas",
    "Chad": "Africa",
    "China": "Western Pacific",
    "Denmark": "Europe",
    "Ethiopia": "Africa",
    "India": "South-East Asia",
    "Japan": "Western Pacific",
    "Mexico": "Americas",
    "Nigeria": "Africa",
    "United Kingdom": "Europe",
    "United States": "Americas",
}

# Look up a country's region
print(country_to_region["India"])  # "South-East Asia"

# Use it in a loop
countries_of_interest = ["Brazil", "India", "Nigeria", "Denmark"]
for country in countries_of_interest:
    region = country_to_region[country]
    print(f"{country} is in the WHO region: {region}")

This kind of mapping dictionary is everywhere in data science: mapping state abbreviations to full names, mapping product codes to categories, mapping ZIP codes to cities. The dictionary provides fast lookup by name, without needing to search through a list item by item.
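
Real lookups rarely go this smoothly: sooner or later a name shows up that the mapping does not contain. The sketch below uses a shortened mapping and a deliberately unmapped name ("Atlantis", invented for this example) to show one way of collecting the misses instead of crashing on the first one.

```python
country_to_region = {
    "Brazil": "Americas",
    "India": "South-East Asia",
    "Nigeria": "Africa",
}

# "Atlantis" is deliberately missing from the mapping.
countries_of_interest = ["Brazil", "India", "Atlantis"]

unmapped = []
for country in countries_of_interest:
    region = country_to_region.get(country)  # None when the key is absent
    if region is None:
        unmapped.append(country)
    else:
        print(f"{country} -> {region}")

print("Unmapped:", unmapped)  # Unmapped: ['Atlantis']
```

Reviewing the unmapped list usually reveals spelling variants (for example "USA" versus "United States") that need a row of their own in the mapping.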

Patient Records: Dictionaries in the Wild

Here is how a health clinic might represent patient data:

patient_record = {
    "patient_id": "P-20240315",
    "name": "Aisha Ibrahim",
    "date_of_birth": "1990-06-12",
    "allergies": ["penicillin", "sulfa"],
    "vitals": {
        "blood_pressure": "118/76",
        "heart_rate": 72,
        "temperature": 36.8
    },
    "vaccinations": [
        {"vaccine": "COVID-19", "date": "2024-01-15", "dose": 3},
        {"vaccine": "Influenza", "date": "2024-10-01", "dose": 1}
    ]
}

# Access nested data
print(patient_record["vitals"]["heart_rate"])        # 72
print(patient_record["vaccinations"][0]["vaccine"])   # "COVID-19"
print(patient_record["allergies"][1])                 # "sulfa"

Notice how the dictionary naturally mirrors the structure of real-world data. A patient has vitals (another dictionary). A patient has a list of vaccinations (a list of dictionaries). This nesting is not a Python quirk — it reflects how data is actually organized.

Check Your Understanding: Given the patient_record above, write the expression to access the date of the patient's second vaccination.

Answer

patient_record["vaccinations"][1]["date"] returns "2024-10-01".

Work from the outside in: patient_record["vaccinations"] gives you the list. [1] gives you the second item (a dictionary). ["date"] gives you the date value.

5.3 Sets and Tuples: When Order and Uniqueness Matter

Lists and dictionaries handle most data science needs, but two other structures fill important niches: sets (when you care about uniqueness) and tuples (when you need immutability).

Sets: Collections of Unique Items

A set is an unordered collection of unique items. If you add the same item twice, the set keeps only one copy.

# Create a set
regions = {"Americas", "Europe", "Africa", "Europe", "Americas"}
print(regions)  # {"Americas", "Europe", "Africa"} (duplicates removed; display order may vary)

# Create from a list (great for finding unique values)
raw_data = ["USA", "Canada", "USA", "Mexico", "Canada", "USA"]
unique_countries = set(raw_data)
print(unique_countries)  # {"USA", "Canada", "Mexico"} (display order may vary)
print(len(unique_countries))  # 3

# Convert back to a sorted list if you need order
print(sorted(unique_countries))  # ["Canada", "Mexico", "USA"]

Why sets matter in data science: When you are cleaning a dataset, one of the first things you check is "what are the unique values in this column?" Sets answer that instantly.
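
Here is what that check might look like on a small batch of records. The records are invented, and one region is deliberately spelled in lowercase to show the kind of inconsistency a set brings to the surface.

```python
records = [
    {"country": "Brazil", "region": "Americas"},
    {"country": "Canada", "region": "americas"},  # inconsistent capitalization
    {"country": "Chad", "region": "Africa"},
    {"country": "Ethiopia", "region": "Africa"},
]

# Collect the distinct values in the "region" column.
unique_regions = set()
for record in records:
    unique_regions.add(record["region"])
print(sorted(unique_regions))  # ['Africa', 'Americas', 'americas']

# Normalizing the text before collecting collapses the near-duplicates.
cleaned_regions = set()
for record in records:
    cleaned_regions.add(record["region"].title())
print(sorted(cleaned_regions))  # ['Africa', 'Americas']
```

Three "unique" regions where you expected two is a strong hint that the column needs cleaning.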

Set Operations: Union, Intersection, Difference

Sets support mathematical operations that are remarkably useful for data comparisons:

# Countries in dataset A (vaccination data)
dataset_a = {"Brazil", "Canada", "Chad", "Denmark", "Ethiopia"}

# Countries in dataset B (GDP data)
dataset_b = {"Brazil", "Denmark", "Ethiopia", "France", "Germany"}

# Union — countries in EITHER dataset
print(dataset_a | dataset_b)
# {"Brazil", "Canada", "Chad", "Denmark", "Ethiopia", "France", "Germany"}

# Intersection — countries in BOTH datasets
print(dataset_a & dataset_b)
# {"Brazil", "Denmark", "Ethiopia"}

# Difference — countries in A but NOT in B
print(dataset_a - dataset_b)
# {"Canada", "Chad"}

# Symmetric difference — countries in one but not both
print(dataset_a ^ dataset_b)
# {"Canada", "Chad", "France", "Germany"}

This is not abstract mathematics. When you merge two datasets in Chapter 9 (or even by hand in Chapter 6), you will need to know which records exist in both datasets, which exist in only one, and which are missing. Set operations answer those questions in a single line.
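
As a rough sketch of such a pre-merge check, the snippet below reuses the two country sets from this section under more descriptive (and purely illustrative) names and reports what would be lost if only matching records were kept.

```python
vaccination_countries = {"Brazil", "Canada", "Chad", "Denmark", "Ethiopia"}
gdp_countries = {"Brazil", "Denmark", "Ethiopia", "France", "Germany"}

matched = vaccination_countries & gdp_countries
missing_gdp = vaccination_countries - gdp_countries
missing_vax = gdp_countries - vaccination_countries

print(f"{len(matched)} countries appear in both datasets")  # 3 countries appear in both datasets
print("No GDP data for:", sorted(missing_gdp))              # No GDP data for: ['Canada', 'Chad']
print("No vaccination data for:", sorted(missing_vax))      # No vaccination data for: ['France', 'Germany']

# <= tests whether every item of the left set is also in the right set.
print(matched <= vaccination_countries)  # True
```

Wrapping the results in sorted() gives a stable, readable order, since sets themselves make no promise about display order.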

Tuples: Immutable Sequences

A tuple is like a list, but it cannot be changed after creation. It is immutable.

# Create a tuple
coordinates = (40.7128, -74.0060)  # New York City lat/long

# Access items (just like a list)
print(coordinates[0])  # 40.7128

# But you CANNOT modify a tuple
coordinates[0] = 41.0  # TypeError: 'tuple' object does not support item assignment

When do you use tuples?

Fixed data: GPS coordinates, RGB colors, (year, month, day) date components. If the data should never change, a tuple signals that intent.
Dictionary keys: Lists cannot be dictionary keys (because they are mutable), but tuples can:

# Using (latitude, longitude) tuples as dictionary keys
city_populations = {
    (40.7128, -74.0060): 8_336_817,   # New York
    (34.0522, -118.2437): 3_979_576,  # Los Angeles
    (41.8781, -87.6298): 2_693_976,   # Chicago
}

Multiple return values from functions: When a function returns two or more values, Python packs them into a tuple:

def min_max(numbers):
    return min(numbers), max(numbers)  # Returns a tuple

result = min_max([72.3, 85.1, 41.7, 93.2, 68.5])
print(result)     # (41.7, 93.2)
print(result[0])  # 41.7

# Tuple unpacking — assign each value to a separate variable
low, high = min_max([72.3, 85.1, 41.7, 93.2, 68.5])
print(f"Range: {low} to {high}")  # "Range: 41.7 to 93.2"
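
Tuple keys and tuple unpacking combine nicely. In the sketch below, each key is a (country, year) pair; the figures are invented for illustration.

```python
# Keys are (country, year) tuples; values are vaccination rates (made-up numbers).
rates = {
    ("Brazil", 2022): 70.1,
    ("Brazil", 2023): 72.3,
    ("Chad", 2023): 41.7,
}

# Look up one value by its compound key.
print(rates[("Brazil", 2023)])  # 72.3

# Unpack each key right in the loop header.
for (country, year), rate in rates.items():
    print(f"{country} {year}: {rate}%")

# Iterating over a dictionary yields its keys, which unpack the same way.
brazil_years = []
for country, year in rates:
    if country == "Brazil":
        brazil_years.append(year)
print(brazil_years)  # [2022, 2023]
```

A compound key like this is often simpler than nesting one dictionary inside another when every lookup needs both pieces of information.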

Mutable vs. Immutable: Why It Matters

This is a concept that trips up beginners but is essential for understanding Python:

Type	Mutable?	What that means
List	Yes	append, remove, modify
Dictionary	Yes	add, delete, modify values
Set	Yes	add, discard
Tuple	No	fixed after creation
String	No	strings never change in place
Integer/Float	No	numbers are immutable

Why does this matter? When you pass a mutable object (like a list) to a function, the function can accidentally change it:

def add_timestamp(record):
    record.append("2024-03-15")  # Modifies the ORIGINAL list!
    return record

original = ["Brazil", "Americas", 72.3]
result = add_timestamp(original)
print(original)  # ["Brazil", "Americas", 72.3, "2024-03-15"] — changed!

With a tuple, this cannot happen. Immutability is a safety net.
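
Two defensive habits follow from this. The sketch below is one possible rewrite of add_timestamp, not the only one: it returns a new list instead of changing its argument, and then shows what happens when the same kind of change is attempted on a tuple.

```python
def add_timestamp(record):
    # record + [...] builds a NEW list; the caller's list is left alone.
    return record + ["2024-03-15"]

original = ["Brazil", "Americas", 72.3]
result = add_timestamp(original)
print(original)  # ['Brazil', 'Americas', 72.3]
print(result)    # ['Brazil', 'Americas', 72.3, '2024-03-15']

# A tuple has no .append() at all, so accidental mutation fails loudly.
frozen = ("Brazil", "Americas", 72.3)
try:
    frozen.append("2024-03-15")
except AttributeError as err:
    print(err)  # 'tuple' object has no attribute 'append'
```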

Check Your Understanding: Can you use a list as a dictionary key? Why or why not?

Answer

No. Dictionary keys must be hashable, which for Python's built-in types means immutable all the way down. Lists are mutable, so they cannot be keys. Use a tuple instead: {(1, 2): "value"} works, but {[1, 2]: "value"} raises TypeError: unhashable type: 'list'. (A tuple that contains a list is not hashable either.)

5.4 Nesting: Lists of Dictionaries and Beyond

Real data almost always has structure beyond a simple flat list. In this section, we build toward the shape that most closely mirrors a spreadsheet or database table: a list of dictionaries.

A List of Dictionaries: Tabular Data Before pandas

This is the single most important pattern in this chapter. Each dictionary represents one row, and each key represents a column name:

health_data = [
    {"country": "Brazil",    "region": "Americas", "vaccination_rate": 72.3},
    {"country": "Canada",    "region": "Americas", "vaccination_rate": 85.1},
    {"country": "Chad",      "region": "Africa",   "vaccination_rate": 41.7},
    {"country": "Denmark",   "region": "Europe",   "vaccination_rate": 93.2},
    {"country": "Ethiopia",  "region": "Africa",   "vaccination_rate": 68.5},
]

Compare this to the list-of-lists approach from Section 5.1:

# List of lists (from earlier)
health_data_lists = [
    ["Brazil",    "Americas", 72.3],
    ["Canada",    "Americas", 85.1],
    ...
]
print(health_data_lists[0][2])  # 72.3 — what does index 2 mean?

# List of dictionaries
print(health_data[0]["vaccination_rate"])  # 72.3 — crystal clear!

The list-of-dictionaries approach is self-documenting. When you revisit your code a week later, row["vaccination_rate"] tells you exactly what you are looking at.

Working with a List of Dictionaries

Here are the patterns you will use over and over:

# Loop through all records
for record in health_data:
    print(f"{record['country']}: {record['vaccination_rate']}%")

# Find all countries in Africa
african_countries = []
for record in health_data:
    if record["region"] == "Africa":
        african_countries.append(record["country"])
print(african_countries)  # ["Chad", "Ethiopia"]

# Calculate the average vaccination rate
total = 0
for record in health_data:
    total += record["vaccination_rate"]
average = total / len(health_data)
print(f"Average: {average:.1f}%")  # Average: 72.2%

# Find the country with the highest vaccination rate
best = health_data[0]
for record in health_data[1:]:
    if record["vaccination_rate"] > best["vaccination_rate"]:
        best = record
print(f"Highest: {best['country']} at {best['vaccination_rate']}%")
# Highest: Denmark at 93.2%
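
Once the loop version makes sense, it is worth knowing that the built-in max(), min(), and sorted() functions accept a key argument that tells them what to compare. The following sketch repeats the dataset so it runs on its own; get_rate is a helper name chosen for this example.

```python
health_data = [
    {"country": "Brazil",   "region": "Americas", "vaccination_rate": 72.3},
    {"country": "Canada",   "region": "Americas", "vaccination_rate": 85.1},
    {"country": "Chad",     "region": "Africa",   "vaccination_rate": 41.7},
    {"country": "Denmark",  "region": "Europe",   "vaccination_rate": 93.2},
    {"country": "Ethiopia", "region": "Africa",   "vaccination_rate": 68.5},
]

def get_rate(record):
    """Return the value that records should be compared by."""
    return record["vaccination_rate"]

best = max(health_data, key=get_rate)
worst = min(health_data, key=get_rate)
print(best["country"], worst["country"])  # Denmark Chad

# sorted() with the same key ranks whole records, highest rate first.
ranked = sorted(health_data, key=get_rate, reverse=True)
ranked_names = []
for record in ranked:
    ranked_names.append(record["country"])
print(ranked_names)  # ['Denmark', 'Canada', 'Brazil', 'Ethiopia', 'Chad']
```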

A Dictionary of Lists: Column-Oriented Data

Sometimes it is more natural to organize data by column rather than by row:

health_columns = {
    "country":          ["Brazil", "Canada", "Chad", "Denmark", "Ethiopia"],
    "region":           ["Americas", "Americas", "Africa", "Europe", "Africa"],
    "vaccination_rate": [72.3, 85.1, 41.7, 93.2, 68.5],
}

# All vaccination rates in one go
print(health_columns["vaccination_rate"])  # [72.3, 85.1, 41.7, 93.2, 68.5]

# Access a specific row requires knowing the index
row_index = 2  # Chad
for column_name in health_columns:
    print(f"{column_name}: {health_columns[column_name][row_index]}")

Fun fact: this dictionary-of-lists shape is one of the inputs that pandas accepts when creating a DataFrame. In Chapter 7, you will write pd.DataFrame(health_columns) and get a beautiful table. The structure you are learning now is the foundation.
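
The row-oriented and column-oriented shapes hold the same information, so you can convert between them. The sketch below uses a three-country subset and the built-in zip() function, which walks several lists in step; treat it as one possible approach, not the canonical one.

```python
health_columns = {
    "country":          ["Brazil", "Canada", "Chad"],
    "region":           ["Americas", "Americas", "Africa"],
    "vaccination_rate": [72.3, 85.1, 41.7],
}

# Columns -> rows: zip() pairs up the first items, then the second items, and so on.
rows = []
for country, region, rate in zip(health_columns["country"],
                                 health_columns["region"],
                                 health_columns["vaccination_rate"]):
    rows.append({"country": country, "region": region, "vaccination_rate": rate})
print(rows[2])  # {'country': 'Chad', 'region': 'Africa', 'vaccination_rate': 41.7}

# Rows -> columns: start with empty lists and append field by field.
columns = {"country": [], "region": [], "vaccination_rate": []}
for row in rows:
    for name in columns:
        columns[name].append(row[name])
print(columns == health_columns)  # True
```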

Deeper Nesting: When Data Gets Complex

Real-world data can nest many levels deep. Here is a WHO-style country profile:

country_profile = {
    "country": "Ethiopia",
    "iso_code": "ETH",
    "region": "Africa",
    "demographics": {
        "population": 120_000_000,
        "median_age": 19.5,
        "urban_percentage": 21.7
    },
    "health_indicators": {
        "life_expectancy": 66.6,
        "maternal_mortality_ratio": 267,
        "vaccination_rates": {
            "covid_19": 68.5,
            "measles": 57.0,
            "polio": 87.2
        }
    },
    "income_group": "Low income"
}

# Navigate the nesting
print(country_profile["health_indicators"]["vaccination_rates"]["measles"])
# 57.0

# Build the path step by step if nesting gets confusing
health = country_profile["health_indicators"]
vaccines = health["vaccination_rates"]
measles_rate = vaccines["measles"]
print(measles_rate)  # 57.0
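
Deep nesting multiplies the chances of a KeyError, because any level might be missing. One common defensive pattern, sketched here on a trimmed-down profile, is to chain .get() calls with an empty dictionary as the default. The "economy" and "gdp" keys are hypothetical and deliberately absent.

```python
country_profile = {
    "country": "Ethiopia",
    "health_indicators": {
        "vaccination_rates": {"covid_19": 68.5, "measles": 57.0},
    },
}

# Each .get() falls back to {} so the next .get() always has a dictionary to work on.
rates = country_profile.get("health_indicators", {}).get("vaccination_rates", {})
print(rates.get("measles"))  # 57.0
print(rates.get("polio"))    # None

# A whole branch that does not exist produces None instead of an exception.
gdp = country_profile.get("economy", {}).get("gdp")
print(gdp)  # None
```

The trade-off is that a typo in a key name now yields None silently, so reserve this pattern for data where missing fields are genuinely expected.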

Productive Struggle: Design a Data Structure

Before reading further, try this on your own. Spend 5-10 minutes designing a dictionary (or nested structure) to represent one of the following:

A single NBA player's season statistics (name, team, games played, points per game, rebounds, assists, three-point percentage)

A university course (department, course number, title, instructor, meeting times, enrolled students)

A weather forecast for one city over the next three days

There is no single "correct" answer. The goal is to practice the skill of translating real-world information into a Python data structure. Write your answer in a Jupyter cell, then print one piece of nested data from it to make sure your structure works.

When you are done, compare your approach with a classmate's. Did you structure the data differently? Which approach makes it easier to answer specific questions?

5.5 List and Dictionary Comprehensions: Concise Data Transformation

So far, extracting data from a collection has required a loop with multiple lines:

# Extract all vaccination rates (the multi-line way)
rates = []
for record in health_data:
    rates.append(record["vaccination_rate"])
print(rates)  # [72.3, 85.1, 41.7, 93.2, 68.5]

Python offers a more concise alternative: comprehensions.

List Comprehensions

A list comprehension creates a new list by applying an expression to each item in an existing iterable, all in a single line:

# Same result as the loop above
rates = [record["vaccination_rate"] for record in health_data]
print(rates)  # [72.3, 85.1, 41.7, 93.2, 68.5]

The general pattern is:

[expression for item in iterable]

Read it as: "Give me expression for each item in iterable."

More examples:

# Country names in uppercase
upper_names = [record["country"].upper() for record in health_data]
print(upper_names)  # ["BRAZIL", "CANADA", "CHAD", "DENMARK", "ETHIOPIA"]

# Squares of numbers 1 through 10
squares = [n ** 2 for n in range(1, 11)]
print(squares)  # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

# Lengths of words
words = ["data", "science", "is", "fascinating"]
lengths = [len(word) for word in words]
print(lengths)  # [4, 7, 2, 11]

Filtering with Comprehensions

You can add a condition to include only certain items:

# Countries with vaccination rates above 70%
high_vax = [record["country"] for record in health_data
            if record["vaccination_rate"] > 70]
print(high_vax)  # ["Brazil", "Canada", "Denmark"]

# Even numbers from 1 to 20
evens = [n for n in range(1, 21) if n % 2 == 0]
print(evens)  # [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

The pattern with a filter:

[expression for item in iterable if condition]

Read it as: "Give me expression for each item in iterable, but only if condition is true."

Dictionary Comprehensions

The same idea, but for creating dictionaries:

# Country name -> vaccination rate mapping
rate_lookup = {record["country"]: record["vaccination_rate"]
               for record in health_data}
print(rate_lookup)
# {"Brazil": 72.3, "Canada": 85.1, "Chad": 41.7, "Denmark": 93.2, "Ethiopia": 68.5}

# Reverse a mapping (value -> key)
region_to_countries = {}
# (This is trickier — one region maps to MULTIPLE countries, so we need a list)
# We will revisit this pattern below.

# Simple dictionary comprehension: word -> length
word_lengths = {word: len(word) for word in ["data", "science", "python"]}
print(word_lengths)  # {"data": 4, "science": 7, "python": 6}

# With filtering: only countries with rates > 70%
high_rate_lookup = {record["country"]: record["vaccination_rate"]
                    for record in health_data
                    if record["vaccination_rate"] > 70}
print(high_rate_lookup)
# {"Brazil": 72.3, "Canada": 85.1, "Denmark": 93.2}
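
Reversing a mapping where several keys share a value, such as country to region, needs a list per value, which is more than a one-line comprehension handles comfortably. A plain loop with .setdefault() is one reasonable sketch; the mapping below is a shortened version of the one from Section 5.2.

```python
country_to_region = {
    "Brazil": "Americas",
    "Canada": "Americas",
    "Chad": "Africa",
    "Denmark": "Europe",
    "Ethiopia": "Africa",
}

region_to_countries = {}
for country, region in country_to_region.items():
    # .setdefault() returns the list already stored under region,
    # or stores a new empty list first if the region has not been seen yet.
    region_to_countries.setdefault(region, []).append(country)

print(region_to_countries)
# {'Americas': ['Brazil', 'Canada'], 'Africa': ['Chad', 'Ethiopia'], 'Europe': ['Denmark']}
```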

When to Use Comprehensions (and When Not To)

Comprehensions are Pythonic and concise, but do not overuse them:

Use comprehensions when:

- The transformation is simple (one expression, optionally one filter)
- The result is a new list or dictionary
- The code is still readable on one or two lines

Use a regular loop when:

- You need multiple steps per iteration
- You need to handle errors or edge cases
- The logic is complex enough that a comprehension would be hard to read

# Too complex for a comprehension — use a loop instead
results = []
for record in health_data:
    rate = record["vaccination_rate"]
    if rate >= 80:
        category = "high"
    elif rate >= 60:
        category = "medium"
    else:
        category = "low"
    results.append({"country": record["country"], "category": category})
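
There is a middle path worth knowing: move the multi-step logic into a function (Chapter 4), and the comprehension becomes simple again. The sketch below assumes the same thresholds as the loop; categorize is a name chosen for this example, and the dataset is trimmed to three countries.

```python
def categorize(rate):
    """Label a vaccination rate as high, medium, or low."""
    if rate >= 80:
        return "high"
    if rate >= 60:
        return "medium"
    return "low"

health_data = [
    {"country": "Brazil", "vaccination_rate": 72.3},
    {"country": "Canada", "vaccination_rate": 85.1},
    {"country": "Chad",   "vaccination_rate": 41.7},
]

# The comprehension stays readable because the branching lives in categorize().
results = [
    {"country": record["country"], "category": categorize(record["vaccination_rate"])}
    for record in health_data
]
print(results[0])  # {'country': 'Brazil', 'category': 'medium'}
```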

Retrieval Practice: Without looking at the examples above, write a list comprehension that extracts all country names from health_data where the region is "Africa". Then check your answer.

Answer

african = [record["country"] for record in health_data
           if record["region"] == "Africa"]
print(african)  # ["Chad", "Ethiopia"]

5.6 Reading Data from Files

So far, we have typed our data directly into Python code. In real data science, data lives in files — CSV files, JSON files, text files, and many other formats. Learning to read files is your gateway to working with real data.

Opening and Reading a Text File

Python's built-in open() function opens a file. The with statement ensures the file is properly closed when you are done:

# Writing a sample file first (so we have something to read)
with open("sample.txt", "w") as f:
    f.write("Country,Region,VaxRate\n")
    f.write("Brazil,Americas,72.3\n")
    f.write("Canada,Americas,85.1\n")
    f.write("Chad,Africa,41.7\n")

# Reading the file
with open("sample.txt", "r") as f:
    contents = f.read()
print(contents)

Output:

Country,Region,VaxRate
Brazil,Americas,72.3
Canada,Americas,85.1
Chad,Africa,41.7

Reading Line by Line

For large files, reading the entire file into memory at once is wasteful. Instead, read line by line:

with open("sample.txt", "r") as f:
    for line in f:
        print(line.strip())  # .strip() removes the trailing newline

Or read all lines into a list:

with open("sample.txt", "r") as f:
    lines = f.readlines()

print(lines)
# ["Country,Region,VaxRate\n", "Brazil,Americas,72.3\n", ...]

The csv Module: Reading CSV Files Properly

CSV (Comma-Separated Values) is the most common format for tabular data. You could parse it by splitting each line on commas, but the csv module handles edge cases (commas inside quoted fields, different delimiters) that manual splitting misses.

import csv

with open("sample.txt", "r", newline="") as f:  # newline="" is what the csv docs recommend
    reader = csv.reader(f)
    header = next(reader)  # Read the first row (column names)
    print("Columns:", header)

    for row in reader:
        print(row)

Output:

Columns: ['Country', 'Region', 'VaxRate']
['Brazil', 'Americas', '72.3']
['Canada', 'Americas', '85.1']
['Chad', 'Africa', '41.7']

Notice something important: csv.reader returns every value as a string. The vaccination rate 72.3 is the string "72.3", not the float 72.3. You must convert it yourself:

import csv

records = []
with open("sample.txt", "r", newline="") as f:  # newline="" is what the csv docs recommend
    reader = csv.reader(f)
    header = next(reader)

    for row in reader:
        record = {
            "country": row[0],
            "region": row[1],
            "vaccination_rate": float(row[2])
        }
        records.append(record)

print(records)
# [{"country": "Brazil", "region": "Americas", "vaccination_rate": 72.3}, ...]
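
To see why the csv module is worth the import, consider a row whose fields contain commas inside quotes. The data below is invented for illustration, and io.StringIO is used only so the example can treat a string as if it were an open file.

```python
import csv
import io

# One header row and one data row; two fields contain commas inside quotes.
raw = 'Country,Notes,VaxRate\n"Korea, Republic of","High coverage, urban",86.4\n'

# Naive splitting cuts the quoted fields apart.
naive = raw.splitlines()[1].split(",")
print(len(naive))  # 5

# csv.reader understands the quoting rules and returns three fields.
rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])  # ['Korea, Republic of', 'High coverage, urban', '86.4']
```

With manual splitting, float(naive[2]) would fail because the third piece is part of the notes, not the rate.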

csv.DictReader: Automatic Column Names

Even better, csv.DictReader automatically maps each row to a dictionary using the header row as keys:

import csv

with open("sample.txt", "r", newline="") as f:  # newline="" is what the csv docs recommend
    reader = csv.DictReader(f)
    for row in reader:
        print(row)

Output:

{'Country': 'Brazil', 'Region': 'Americas', 'VaxRate': '72.3'}
{'Country': 'Canada', 'Region': 'Americas', 'VaxRate': '85.1'}
{'Country': 'Chad', 'Region': 'Africa', 'VaxRate': '41.7'}

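This is the recommended way to read CSV files in pure Python. Each row is already a dictionary, the same structure we built by hand in Section 5.4. The values are still strings, though, so numeric columns need converting before you do arithmetic with them.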
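
A typical next step is to convert the numeric columns as you collect the rows. The sketch below feeds csv.DictReader from an in-memory string (via io.StringIO) so it runs without any file on disk; with a real file you would pass the open file object instead.

```python
import csv
import io

csv_text = "Country,Region,VaxRate\nBrazil,Americas,72.3\nChad,Africa,41.7\n"

records = []
for row in csv.DictReader(io.StringIO(csv_text)):
    row["VaxRate"] = float(row["VaxRate"])  # replace the string with a number
    records.append(row)

print(type(records[0]["VaxRate"]).__name__)  # float

total = 0.0
for record in records:
    total += record["VaxRate"]
average = total / len(records)
print(f"Average: {average:.1f}%")  # Average: 57.0%
```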

Reading JSON Files

JSON (JavaScript Object Notation) is another format you will encounter often, especially when data comes from web APIs. Python's json module makes reading it straightforward:

import json

# First, let's create a sample JSON file
sample_data = {
    "dataset": "WHO Vaccination Rates",
    "last_updated": "2024-03-15",
    "countries": [
        {"name": "Brazil", "region": "Americas", "rate": 72.3},
        {"name": "Canada", "region": "Americas", "rate": 85.1},
        {"name": "Chad", "region": "Africa", "rate": 41.7}
    ]
}

with open("sample.json", "w") as f:
    json.dump(sample_data, f, indent=2)

# Now read it back
with open("sample.json", "r") as f:
    data = json.load(f)

print(type(data))            # <class 'dict'>
print(data["dataset"])       # "WHO Vaccination Rates"
print(data["countries"][0])  # {"name": "Brazil", "region": "Americas", "rate": 72.3}

# Access nested data
for country in data["countries"]:
    print(f"{country['name']}: {country['rate']}%")

JSON maps directly to Python data structures: JSON objects become dictionaries, JSON arrays become lists, JSON strings become Python strings, JSON numbers become Python ints or floats, true and false become True and False, and null becomes None.
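The mapping can be checked directly. The sketch below parses a small made-up JSON string (the keys and values are illustrative assumptions) with json.loads and prints the Python type of each value.

```python
import json

# A small made-up JSON document covering each JSON value type
text = (
    '{"name": "Brazil", "rate": 72.3, "year": 2024, '
    '"tags": ["a", "b"], "verified": true, "notes": null}'
)

# loads() parses a string; load() reads from an open file
parsed = json.loads(text)

for key, value in parsed.items():
    print(key, "->", type(value).__name__)
```

The printed type names are str, float, int, list, bool, and NoneType, in that order.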

File Paths: A Quick Guide

When specifying file paths, you have several options:

# Relative path (relative to the current working directory, usually your notebook's folder)
with open("data/health_data.csv", "r") as f:
    pass

# Absolute path (full path from the root of the filesystem)
# On Windows:
with open("C:/Users/student/data/health_data.csv", "r") as f:
    pass

# On Mac/Linux:
with open("/home/student/data/health_data.csv", "r") as f:
    pass

# Using raw strings on Windows (to avoid backslash issues)
with open(r"C:\Users\student\data\health_data.csv", "r") as f:
    pass
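As an optional alternative to writing separators by hand, the standard library's pathlib module can assemble a path from its parts. This is a sketch only, reusing the folder and file names from the examples above; the chapter itself does not rely on pathlib.

```python
from pathlib import Path

# Hypothetical folder and file names, matching the examples above
data_path = Path("data") / "health_data.csv"

print(data_path)           # uses the separator of the current operating system
print(data_path.name)      # health_data.csv
print(data_path.exists())  # True only if the file is really there

# A Path object can be passed to open() just like a string:
# with open(data_path, "r") as f: ...
```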

🐛 Debugging Walkthrough: FileNotFoundError

```python
with open("my_data.csv", "r") as f:
    data = f.read()

FileNotFoundError: [Errno 2] No such file or directory: 'my_data.csv'
```

Common causes:

1. The file is in a different directory than your notebook. Run import os; print(os.getcwd()) to see where Python is looking.
2. The filename has a typo: my_data.csv vs. my-data.csv vs. mydata.csv.
3. The file extension is hidden by your operating system: what you see as data.csv might actually be data.csv.txt.

Fix: Use os.listdir(".") to see what files are actually in the current directory. Match the exact filename.
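One way to put these checks together is sketched below. It tests whether the file exists before opening it and, if not, lists what is actually in the current directory. The filename is the one from the walkthrough; the messages are illustrative.

```python
import os

filename = "my_data.csv"  # the file we are trying to open

print("Python is looking in:", os.getcwd())

if os.path.exists(filename):
    with open(filename, "r") as f:
        data = f.read()
    print(f"Read {len(data)} characters")
else:
    data = None
    print(f"{filename!r} not found. Files in this directory:")
    for name in sorted(os.listdir(".")):
        print("  ", name)
```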

5.7 Writing Data to Files

Reading files is essential. Writing files completes the picture — you can now save your processed results for later use or sharing.

Writing Text Files

# Write results to a text file
results = [
    "Country Analysis Report",
    "=======================",
    "",
    "Countries with high vaccination rates (>70%):",
    "- Brazil: 72.3%",
    "- Canada: 85.1%",
    "- Denmark: 93.2%",
]

with open("report.txt", "w") as f:
    for line in results:
        f.write(line + "\n")

print("Report saved!")

Writing CSV Files

import csv

health_data = [
    {"country": "Brazil", "region": "Americas", "vaccination_rate": 72.3},
    {"country": "Canada", "region": "Americas", "vaccination_rate": 85.1},
    {"country": "Chad",   "region": "Africa",   "vaccination_rate": 41.7},
]

with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["country", "region", "vaccination_rate"])
    writer.writeheader()
    for record in health_data:
        writer.writerow(record)

print("CSV saved!")

The newline="" parameter prevents extra blank lines on Windows. The fieldnames argument specifies the column order. writeheader() writes the header row, and writerow() writes each data row.
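A short round trip shows what the written file looks like to a reader. This sketch uses io.StringIO as an in-memory stand-in for a file and writerows(), which writes all records in one call instead of looping; the single record is made up for illustration.

```python
import csv
import io

rows = [{"country": "Brazil", "vaccination_rate": 72.3}]

buffer = io.StringIO()  # in-memory stand-in for a file
writer = csv.DictWriter(buffer, fieldnames=["country", "vaccination_rate"])
writer.writeheader()
writer.writerows(rows)  # writes every record in the list

buffer.seek(0)          # rewind so we can read what was written
read_back = list(csv.DictReader(buffer))

print(read_back)
print(type(read_back[0]["vaccination_rate"]))
```

The float 72.3 comes back as the string "72.3": CSV stores text only, so the conversion step from Section 5.6 is needed again after every read.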

Writing JSON Files

import json

output_data = {
    "analysis": "Vaccination Rate Summary",
    "date": "2024-03-15",
    "results": [
        {"country": "Brazil", "rate": 72.3, "category": "high"},
        {"country": "Chad",   "rate": 41.7, "category": "low"},
    ]
}

with open("results.json", "w") as f:
    json.dump(output_data, f, indent=2)

print("JSON saved!")

The indent=2 parameter makes the output human-readable (pretty-printed) instead of a single compressed line.
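The difference is easy to see with json.dumps, which returns the JSON text as a string instead of writing it to a file. The record below is a small illustrative example.

```python
import json

record = {"country": "Chad", "rate": 41.7}

compact = json.dumps(record)           # everything on one line
pretty = json.dumps(record, indent=2)  # one key per line, indented two spaces

print(compact)
print(pretty)
```

Both strings describe the same data; only the whitespace differs, so either form can be read back with json.loads.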

A Common Pattern: Read, Process, Write

Here is the full pipeline that mirrors real data work:

import csv

# STEP 1: Read
records = []
with open("sample.txt", "r") as f:
    reader = csv.DictReader(f)
    for row in reader:
        records.append({
            "country": row["Country"],
            "region": row["Region"],
            "vaccination_rate": float(row["VaxRate"])
        })

# STEP 2: Process
for record in records:
    rate = record["vaccination_rate"]
    if rate >= 80:
        record["category"] = "high"
    elif rate >= 60:
        record["category"] = "medium"
    else:
        record["category"] = "low"

# STEP 3: Write
with open("categorized.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["country", "region", "vaccination_rate", "category"])
    writer.writeheader()
    for record in records:
        writer.writerow(record)

print("Processing complete!")

This read-process-write pipeline is the skeleton of most data analyses you will do. In Chapter 7, pandas will let you do this in a fraction of the code, but understanding the mechanics of what happens under the hood makes you a stronger data scientist.

Project Checkpoint: Country-to-Region Mappings and Your First CSV Read

It is time to apply what you have learned to the progressive project. You have two tasks:

Task 1: Build the WHO Region Mapping Dictionary

Create a dictionary that maps country names to their WHO region. You will use this throughout the book whenever you need to group countries by region.

# WHO region assignments (subset for our project)
who_regions = {
    "Afghanistan": "Eastern Mediterranean",
    "Argentina": "Americas",
    "Australia": "Western Pacific",
    "Brazil": "Americas",
    "Canada": "Americas",
    "Chad": "Africa",
    "China": "Western Pacific",
    "Denmark": "Europe",
    "Egypt": "Eastern Mediterranean",
    "Ethiopia": "Africa",
    "France": "Europe",
    "Germany": "Europe",
    "India": "South-East Asia",
    "Indonesia": "South-East Asia",
    "Japan": "Western Pacific",
    "Kenya": "Africa",
    "Mexico": "Americas",
    "Nigeria": "Africa",
    "Pakistan": "Eastern Mediterranean",
    "Russia": "Europe",
    "South Africa": "Africa",
    "United Kingdom": "Europe",
    "United States": "Americas",
}

# Quick check: how many countries per region?
region_counts = {}
for country, region in who_regions.items():
    region_counts[region] = region_counts.get(region, 0) + 1

for region, count in sorted(region_counts.items()):
    print(f"{region}: {count} countries")

Task 2: Read a Sample CSV Row

If you have a CSV file from the progressive project (or the sample file we created earlier), read the first data row into a dictionary and inspect it:

import csv

# Use the sample file, or replace with your project CSV
with open("sample.txt", "r") as f:
    reader = csv.DictReader(f)
    first_row = next(reader)

print("First row:", first_row)
print("Type:", type(first_row))
print("Keys:", list(first_row.keys()))

Save this code in your project notebook. These building blocks — the region mapping and the CSV reading pattern — will be the foundation for Chapter 6, where you perform your first real data analysis.
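The two building blocks can also be combined. The sketch below is one possible way to do it, not the project's required solution: it looks up each CSV row's country in a small subset of the mapping and falls back to "Unknown" when the name is missing. The in-memory CSV text and the country "Atlantis" are made up to show the fallback.

```python
import csv
import io

# Small subset of the mapping from Task 1
who_regions = {"Brazil": "Americas", "Chad": "Africa"}

# In-memory stand-in for a CSV file; "Atlantis" is a deliberately unknown name
sample = io.StringIO("Country,VaxRate\nBrazil,72.3\nAtlantis,50.0\n")

matched = []
for row in csv.DictReader(sample):
    # .get() returns the fallback instead of raising KeyError for missing countries
    region = who_regions.get(row["Country"], "Unknown")
    matched.append((row["Country"], region))
    print(f"{row['Country']}: {region}")
```

Using .get() with a default keeps the loop running when a spelling in the file does not match a key in the dictionary, which is common with country names.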

Practical Considerations

Performance: When Data Gets Big

For small datasets (thousands of rows), any approach works fine. But as data grows, structure choices matter:

Dictionary lookups are fast. Looking up country_to_region["Brazil"] takes roughly the same amount of time whether the dictionary has 10 entries or 10 million. This is called O(1), or constant time (on average). Searching a list, by contrast, requires checking each element one by one: O(n), or linear time.
Set membership tests are fast. "Brazil" in my_set is much faster than "Brazil" in my_list for large collections, because sets use the same hashing technique as dictionaries.
Lists are fast for iteration. Looping through all items in a list is very efficient because a list stores references to its items contiguously in memory.

You do not need to worry about Big-O notation at this stage. Just remember: if you need to look things up by name, use a dictionary. If you need to check whether something is "in" a collection, use a set.
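If you would like to measure the difference yourself, the sketch below times membership tests on a list and a set holding the same 100,000 made-up labels, using the standard library's timeit module. Exact timings depend on your machine, so none are shown here.

```python
import timeit

# Made-up collection of 100,000 country-like labels
names_list = [f"country_{i}" for i in range(100_000)]
names_set = set(names_list)

target = "country_99999"  # the last element: the worst case for the list

# Time 100 membership tests against each structure
list_time = timeit.timeit(lambda: target in names_list, number=100)
set_time = timeit.timeit(lambda: target in names_set, number=100)

print(f"list: {list_time:.4f} s for 100 lookups")
print(f"set:  {set_time:.6f} s for 100 lookups")
```

On a typical machine the set should finish far sooner, because the list has to scan up to 100,000 items for each test.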

Common Mistakes with Data Structures

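Forgetting that CSV values are always strings. You write if row["rate"] > 70 and Python raises a TypeError, because it cannot compare the string "72.3" with the integer 70. Comparing two unconverted strings is worse: it runs without error but compares character by character, so "9.5" > "72.3" is True. Always convert first: float(row["rate"]).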
Modifying a list while iterating over it. This causes strange behavior:

# BAD — don't do this
numbers = [1, 2, 3, 4, 5]
for n in numbers:
    if n % 2 == 0:
        numbers.remove(n)  # Dangerous!

Instead, build a new list:

# GOOD
numbers = [1, 2, 3, 4, 5]
odds = [n for n in numbers if n % 2 != 0]

Using = instead of .copy() for lists.

original = [1, 2, 3]
copy = original      # NOT a copy — both names point to the same list!
copy.append(4)
print(original)      # [1, 2, 3, 4] — original changed too!

# Use .copy() or list() to make an actual copy
real_copy = original.copy()
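The first two mistakes above are easy to reproduce. The sketch below uses made-up values. Note that it deliberately uses a list of consecutive even numbers for the second mistake, because with [1, 2, 3, 4, 5] the buggy loop happens to produce the right answer by accident.

```python
# Mistake 1: comparing an unconverted CSV string with a number
row = {"rate": "72.3"}  # made-up row, as csv.DictReader would return it
try:
    row["rate"] > 70
    outcome = "no error"
except TypeError as err:
    outcome = f"TypeError: {err}"
print(outcome)

# Comparing two strings runs, but compares character by character
print("9.5" > "72.3")                # True, because "9" sorts after "7"
print(float("9.5") > float("72.3"))  # False, the numeric answer

# Mistake 2: removing items while looping skips the element after each removal
numbers = [2, 4, 6, 8]
for n in numbers:
    if n % 2 == 0:
        numbers.remove(n)
print(numbers)  # [4, 8] rather than the empty list you might expect
```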

Summary: Choosing the Right Data Structure

This is the comparison table you should bookmark and return to throughout the course:

Structure	Ordered?	Mutable?	Duplicates?	Access By	Best For
List `[]`	Yes	Yes	Yes	Index (position)	Ordered collections, sequences of records, iteration
Dictionary `{}`	Insertion order (3.7+)	Yes (keys must be hashable)	Keys: No, Values: Yes	Key (name)	Named data, lookups by key, mappings, records
Set `set()`	No	Yes	No	N/A (membership test)	Unique values, membership testing, set operations
Tuple `()`	Yes	No	Yes	Index (position)	Fixed data, dict keys, function return values

Decision heuristic:

- Need to look things up by name? Dictionary
- Need an ordered collection you will modify? List
- Need to track unique values or test membership? Set
- Need a fixed, unchangeable group of values? Tuple
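As a quick illustration of the heuristic, the sketch below stores related pieces of made-up data in each of the four structures and uses each one the way the table suggests.

```python
# Made-up survey responses used only for illustration
responses = ["Brazil", "Chad", "Brazil", "Canada"]  # list: ordered, duplicates kept

unique_countries = set(responses)                   # set: duplicates removed
rates = {"Brazil": 72.3, "Chad": 41.7}              # dict: look up by name
brazil_record = ("Brazil", "Americas", 72.3)        # tuple: fixed group of values

print(responses[0])                  # Brazil
print(len(unique_countries))         # 3
print("Canada" in unique_countries)  # True
print(rates["Chad"])                 # 41.7
print(brazil_record[1])              # Americas
```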

Spaced Review: Retrieval Practice from Chapters 1-4

These questions help you strengthen earlier material. Try answering from memory before checking.

From Chapter 1: What are the six stages of the data science lifecycle? (If you are shaky on any, review Chapter 1's key takeaways.)

From Chapter 2: In a Jupyter notebook, what is the difference between a code cell and a Markdown cell? When would you use each?

From Chapter 3: What is the difference between int("42") and str(42)? Which direction of conversion is more likely to raise an error, and why?

From Chapter 4: Write a function called classify_rate that takes a vaccination rate (a number) and returns "high" if it is 80 or above, "medium" if it is at least 60 but below 80, and "low" if it is below 60. (Then compare it to the loop in Section 5.7: same logic, different context.)

What's Next

You now have all the Python building blocks: variables, types, control flow, functions, and data structures. You can store data, organize it, read it from files, and write it back.

In Chapter 6: Your First Data Analysis, you will bring everything together. You will download a real WHO vaccination dataset, load it using the csv module, explore it with the tools from this chapter, and discover something real in the data. Chapter 6 is where the "programming" part of this course transforms into the "data science" part — and the skills you built in this chapter are what make that transformation possible.

You have been building tools. Now you get to use them.

Chapter 5 Vocabulary Reference

Term	Definition
list	An ordered, mutable collection of items, accessed by index. Created with `[]`.
dictionary	A mutable collection of key-value pairs, accessed by key. Preserves insertion order in Python 3.7+. Created with `{}`.
key-value pair	A single entry in a dictionary: the key is the lookup name, the value is the stored data.
set	An unordered collection of unique items. Created with `set()` or with braces around items, e.g. `{1, 2}` (an empty `{}` creates a dictionary, not a set).
tuple	An ordered, immutable collection of items. Created with `()`.
nested data structure	A data structure that contains other data structures (e.g., a list of dictionaries).
list comprehension	A concise syntax for creating lists: `[expr for item in iterable if condition]`.
dictionary comprehension	A concise syntax for creating dictionaries: `{key_expr: val_expr for item in iterable}`.
CSV	Comma-Separated Values — a plain-text format for tabular data where columns are separated by commas.
file I/O	File Input/Output — reading data from files (input) and writing data to files (output).
open()	Python's built-in function for opening files. Use with `with` for automatic closing.
json module	Python's built-in module for reading and writing JSON (JavaScript Object Notation) data.
csv module	Python's built-in module for reading and writing CSV files, handling edge cases like quoted fields.
mutable	An object that can be changed after creation. Lists, dictionaries, and sets are mutable.
immutable	An object that cannot be changed after creation. Tuples, strings, and numbers are immutable.