Chapter 5 Exercises: Working with Data Structures

How to use these exercises: Work through the sections in order. Each section builds on the previous one, moving from recall through application to synthesis. Type every code exercise into a Jupyter cell and run it — reading code is not the same as writing it.

Difficulty key: ⭐ Foundational | ⭐⭐ Intermediate | ⭐⭐⭐ Advanced | ⭐⭐⭐⭐ Extension


Part A: Conceptual Understanding ⭐

These questions check whether you absorbed the core ideas from the chapter. Aim for clear, concise answers.


Exercise 5.1: Choosing the right structure

For each scenario below, state which data structure (list, dictionary, set, or tuple) is the best choice. Explain your reasoning in one sentence.

  1. Storing the names of all countries in a dataset, in the order they appear
  2. Mapping country names to their ISO 3166-1 alpha-3 codes (e.g., "Brazil" to "BRA")
  3. Finding all unique vaccine manufacturers mentioned in a dataset
  4. Representing a fixed pair of latitude and longitude coordinates
  5. Storing a patient's medical record with named fields like "name," "age," and "blood_type"
Guidance
  1. **List** — You need ordered data and may have duplicates if the same country appears in multiple rows.
  2. **Dictionary** — You need a name-to-value mapping with fast lookup.
  3. **Set** — You need uniqueness, and order does not matter.
  4. **Tuple** — Coordinates are fixed data that should not change; tuples also work as dictionary keys.
  5. **Dictionary** — Named fields (keys) make the data self-documenting and easy to access.

Exercise 5.2: Mutable vs. immutable

Without running the code, predict what each snippet will print. Then run it to check.

# Snippet A
a = [1, 2, 3]
b = a
b.append(4)
print(a)

# Snippet B
x = (1, 2, 3)
y = x
# y.append(4)  — What would happen if you uncommented this line?
print(x)

# Snippet C
s = "hello"
s.upper()
print(s)
Guidance
  - **Snippet A:** Prints `[1, 2, 3, 4]`. `b = a` does not copy the list — both `a` and `b` point to the same list object. Modifying through `b` also changes `a`.
  - **Snippet B:** Prints `(1, 2, 3)`. Uncommenting the append line would raise `AttributeError` because tuples are immutable and do not have an `append` method.
  - **Snippet C:** Prints `hello`. `s.upper()` returns a new string `"HELLO"` but does not modify `s`. Strings are immutable. You would need `s = s.upper()` to update the variable.

Exercise 5.3: Dictionary access patterns

Given this dictionary:

student = {
    "name": "Jordan Kim",
    "major": "Data Science",
    "gpa": 3.7,
    "courses": ["Stats 101", "CS 110", "Data Ethics"]
}

Write the Python expression to access each of the following (do not use variables — write the full expression):

  1. Jordan's major
  2. Jordan's third course
  3. The number of courses Jordan is taking
  4. Whether Jordan's GPA is above 3.5 (should evaluate to True or False)
Guidance
  1. `student["major"]` → `"Data Science"`
  2. `student["courses"][2]` → `"Data Ethics"`
  3. `len(student["courses"])` → `3`
  4. `student["gpa"] > 3.5` → `True`

Exercise 5.4: Comprehension anatomy

Rewrite each list comprehension as an equivalent for loop with .append(). Then rewrite each for loop as an equivalent list comprehension.

# Comprehension 1 — rewrite as a loop
doubled = [n * 2 for n in [5, 10, 15, 20]]

# Comprehension 2 — rewrite as a loop
short_words = [w for w in ["data", "is", "powerful", "and", "fun"] if len(w) <= 3]

# Loop 1 — rewrite as a comprehension
result = []
for temp_c in [0, 20, 37, 100]:
    temp_f = temp_c * 9/5 + 32
    result.append(temp_f)

# Loop 2 — rewrite as a comprehension
names = []
for record in [{"name": "A", "score": 90}, {"name": "B", "score": 55}]:
    if record["score"] >= 60:
        names.append(record["name"])
Guidance
# Comprehension 1 as loop
doubled = []
for n in [5, 10, 15, 20]:
    doubled.append(n * 2)

# Comprehension 2 as loop
short_words = []
for w in ["data", "is", "powerful", "and", "fun"]:
    if len(w) <= 3:
        short_words.append(w)

# Loop 1 as comprehension
result = [temp_c * 9/5 + 32 for temp_c in [0, 20, 37, 100]]

# Loop 2 as comprehension
names = [record["name"] for record in [{"name": "A", "score": 90}, {"name": "B", "score": 55}]
         if record["score"] >= 60]

Exercise 5.5: Reading the error message

Each code snippet below produces an error. Without running the code, identify: (a) the error type, and (b) how to fix it. Then run the code to verify.

# Snippet A
patient = {"name": "Elena", "age": 31}
print(patient["Name"])

# Snippet B
data = [10, 20, 30]
print(data[3])

# Snippet C
coordinates = (40.7, -74.0)
coordinates[0] = 41.0

# Snippet D
import csv
with open("nonexistent_file.csv", "r") as f:
    reader = csv.reader(f)
Guidance
  - **A:** `KeyError: 'Name'` — keys are case-sensitive. Fix: `patient["name"]` (lowercase n).
  - **B:** `IndexError: list index out of range` — index 3 does not exist (valid indices are 0, 1, 2). Fix: `data[2]` for the last item, or `data[-1]`.
  - **C:** `TypeError: 'tuple' object does not support item assignment` — tuples are immutable. Fix: create a new tuple, e.g., `coordinates = (41.0, coordinates[1])`.
  - **D:** `FileNotFoundError` — the file does not exist. Fix: use a filename that actually exists, or create the file first.

Exercise 5.6: The file reading pattern

Fill in the blanks in this description of the file-reading pattern (write your answers before checking):

"To read a CSV file in Python, we import the _ module. We open the file using the _ function, typically inside a _ statement to ensure the file is closed automatically. For CSV files, we create a _ object (or a _ for automatic column-name mapping). Each row from csv.reader is returned as a _. Each row from csv.DictReader is returned as a _. All values from CSV files are always _, so numeric values must be converted using float() or int()."

Guidance In order: `csv`; `open()`; `with`; `csv.reader`; `csv.DictReader`; list (of strings); dictionary; strings.
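Assembled into runnable form, the pattern reads as follows (the file name `scores_demo.csv` and its contents are invented for this sketch):

```python
import csv

# Create a tiny input file so the sketch is self-contained (invented data)
with open("scores_demo.csv", "w", newline="") as f:
    f.write("name,score\nAda,91\nGrace,88\n")

# The pattern: open() inside a with statement, then csv.DictReader
with open("scores_demo.csv", "r") as f:
    reader = csv.DictReader(f)
    rows = list(reader)  # each row is a dictionary of strings

# CSV values are always strings, so convert before doing math
scores = [float(row["score"]) for row in rows]
print(scores)  # [91.0, 88.0]
```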

Exercise 5.7: Set operations in plain English

Given:

enrolled_2024 = {"Alice", "Bob", "Carol", "Dave", "Eve"}
enrolled_2025 = {"Bob", "Carol", "Frank", "Grace"}

Describe in plain English what each expression computes, then verify by running the code:

  1. enrolled_2024 & enrolled_2025
  2. enrolled_2024 - enrolled_2025
  3. enrolled_2025 - enrolled_2024
  4. enrolled_2024 | enrolled_2025
  5. enrolled_2024 ^ enrolled_2025
Guidance
  1. Students enrolled in *both* years: `{"Bob", "Carol"}`
  2. Students who left (in 2024 but not 2025): `{"Alice", "Dave", "Eve"}`
  3. New students (in 2025 but not 2024): `{"Frank", "Grace"}`
  4. All students ever enrolled: `{"Alice", "Bob", "Carol", "Dave", "Eve", "Frank", "Grace"}`
  5. Students in exactly one year (not both): `{"Alice", "Dave", "Eve", "Frank", "Grace"}`

Part B: Applied Practice ⭐⭐

These problems require you to write code. Work in Jupyter and test your solutions.


Exercise 5.8: Building a data record

Create a dictionary representing a single NBA player with the following fields: name, team, position, games_played, points_per_game, rebounds_per_game, assists_per_game, and three_point_pct. Use realistic values for any player you choose (or invent one). Then write a function player_summary(player_dict) that takes this dictionary and returns a formatted string like:

"LeBron James (LAL) — 25.7 PPG, 7.3 RPG, 8.3 APG"
Guidance
player = {
    "name": "LeBron James",
    "team": "LAL",
    "position": "SF",
    "games_played": 71,
    "points_per_game": 25.7,
    "rebounds_per_game": 7.3,
    "assists_per_game": 8.3,
    "three_point_pct": 0.410
}

def player_summary(p):
    return (f"{p['name']} ({p['team']}) — "
            f"{p['points_per_game']} PPG, "
            f"{p['rebounds_per_game']} RPG, "
            f"{p['assists_per_game']} APG")

print(player_summary(player))

Exercise 5.9: Filtering a list of dictionaries

Given the following dataset:

students = [
    {"name": "Alice", "major": "CS", "gpa": 3.8},
    {"name": "Bob", "major": "Data Science", "gpa": 3.2},
    {"name": "Carol", "major": "CS", "gpa": 3.9},
    {"name": "Dave", "major": "Statistics", "gpa": 2.7},
    {"name": "Eve", "major": "Data Science", "gpa": 3.5},
    {"name": "Frank", "major": "CS", "gpa": 3.1},
]

Write code to:
  1. Create a list of names of all students with a GPA of 3.5 or higher (use a list comprehension)
  2. Create a list of all Data Science majors (use a list comprehension)
  3. Calculate the average GPA across all students
  4. Find the student with the highest GPA (without using max() — write a loop)

Guidance
# 1
honor_roll = [s["name"] for s in students if s["gpa"] >= 3.5]
# ["Alice", "Carol", "Eve"]

# 2
ds_majors = [s["name"] for s in students if s["major"] == "Data Science"]
# ["Bob", "Eve"]

# 3
avg_gpa = sum(s["gpa"] for s in students) / len(students)
# 3.3666...

# 4
best = students[0]
for s in students[1:]:
    if s["gpa"] > best["gpa"]:
        best = s
print(f"{best['name']}: {best['gpa']}")  # Carol: 3.9

Exercise 5.10: Dictionary from two lists

Given:

countries = ["Brazil", "Canada", "Chad", "Denmark", "Ethiopia"]
rates = [72.3, 85.1, 41.7, 93.2, 68.5]

Use zip() and a dictionary comprehension to create a dictionary mapping each country to its rate. Then use this dictionary to look up Denmark's rate and print it.

Guidance
country_rates = {country: rate for country, rate in zip(countries, rates)}
print(country_rates["Denmark"])  # 93.2
`zip()` pairs up corresponding elements from two sequences. The comprehension `{k: v for k, v in zip(...)}` builds a dictionary from those pairs. You could also write `dict(zip(countries, rates))` for the same result.

Exercise 5.11: Counting with dictionaries

Write a function count_by_region(records) that takes a list of dictionaries (each with a "region" key) and returns a dictionary mapping each region to the number of countries in that region.

Test it with:

data = [
    {"country": "Brazil", "region": "Americas"},
    {"country": "Canada", "region": "Americas"},
    {"country": "Chad", "region": "Africa"},
    {"country": "Denmark", "region": "Europe"},
    {"country": "Ethiopia", "region": "Africa"},
    {"country": "France", "region": "Europe"},
]

Expected output: {"Americas": 2, "Africa": 2, "Europe": 2}

Guidance
def count_by_region(records):
    counts = {}
    for record in records:
        region = record["region"]
        counts[region] = counts.get(region, 0) + 1
    return counts

print(count_by_region(data))
The `.get(region, 0)` pattern is essential: it returns the current count if the region is already in the dictionary, or 0 if it is the first time seeing that region.

Exercise 5.12: Write and read a CSV

  1. Create a list of at least 5 dictionaries representing books (with keys "title", "author", "year", "pages").
  2. Write this data to a file called books.csv using csv.DictWriter.
  3. Read the file back using csv.DictReader and print each book's title and year.
  4. Verify that the year values you read back are strings. Convert them to integers and compute the average publication year.
Guidance
import csv

books = [
    {"title": "Weapons of Math Destruction", "author": "Cathy O'Neil", "year": "2016", "pages": "272"},
    {"title": "The Signal and the Noise", "author": "Nate Silver", "year": "2012", "pages": "544"},
    {"title": "Python for Data Analysis", "author": "Wes McKinney", "year": "2022", "pages": "579"},
    {"title": "The Art of Statistics", "author": "David Spiegelhalter", "year": "2019", "pages": "426"},
    {"title": "Factfulness", "author": "Hans Rosling", "year": "2018", "pages": "352"},
]

# Write
with open("books.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "author", "year", "pages"])
    writer.writeheader()
    for book in books:
        writer.writerow(book)

# Read
with open("books.csv", "r") as f:
    reader = csv.DictReader(f)
    read_books = list(reader)

for book in read_books:
    print(f"{book['title']} ({book['year']})")
    print(f"  year type: {type(book['year'])}")  # <class 'str'>

# Average year
years = [int(book["year"]) for book in read_books]
print(f"Average publication year: {sum(years) / len(years):.0f}")

Exercise 5.13: Nested data access

Given this nested structure:

university = {
    "name": "State University",
    "departments": {
        "Computer Science": {
            "faculty_count": 45,
            "courses": ["CS 101", "CS 201", "CS 301", "Data Science 110"],
            "chair": "Dr. Park"
        },
        "Statistics": {
            "faculty_count": 28,
            "courses": ["Stats 101", "Stats 201", "Bayesian Methods"],
            "chair": "Dr. Ramirez"
        },
        "Mathematics": {
            "faculty_count": 52,
            "courses": ["Calc I", "Calc II", "Linear Algebra", "Probability"],
            "chair": "Dr. Chen"
        }
    }
}

Write expressions to access:
  1. The name of the Statistics department chair
  2. The second course in Computer Science
  3. The total number of faculty across all departments
  4. A list of all department names
  5. All courses across all departments combined into a single list

Guidance
# 1
print(university["departments"]["Statistics"]["chair"])  # "Dr. Ramirez"

# 2
print(university["departments"]["Computer Science"]["courses"][1])  # "CS 201"

# 3
total = sum(dept["faculty_count"] for dept in university["departments"].values())
print(total)  # 125

# 4
print(list(university["departments"].keys()))
# ["Computer Science", "Statistics", "Mathematics"]

# 5
all_courses = []
for dept in university["departments"].values():
    all_courses.extend(dept["courses"])
print(all_courses)
# or as a comprehension:
all_courses = [course for dept in university["departments"].values()
               for course in dept["courses"]]

Exercise 5.14: Building a frequency table

Write a function frequency_table(items) that takes a list and returns a dictionary mapping each unique item to its count. Test with:

colors = ["red", "blue", "red", "green", "blue", "red", "blue", "green", "red"]
print(frequency_table(colors))
# {"red": 4, "blue": 3, "green": 2}

Then sort the result by count (highest first) and print it. Hint: sorted() can take a key argument.

Guidance
def frequency_table(items):
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts

result = frequency_table(colors)
print(result)

# Sort by count, descending
for color, count in sorted(result.items(), key=lambda x: x[1], reverse=True):
    print(f"{color}: {count}")

Exercise 5.15: Set-based data cleaning

You have two lists of country names from different data sources. They should contain the same countries but may not:

source_a = ["Brazil", "Canada", "Chad", "Denmark", "ethiopia", "France"]
source_b = ["brazil", "Canada", "Chad", "Denmark", "Ethiopia", "France", "Germany"]
  1. Normalize both lists to lowercase
  2. Find countries in both sources
  3. Find countries in source A but not source B
  4. Find countries in source B but not source A
  5. Combine all unique countries into a single sorted list
Guidance
set_a = {c.lower() for c in source_a}
set_b = {c.lower() for c in source_b}

print("In both:", set_a & set_b)
print("Only A:", set_a - set_b)
print("Only B:", set_b - set_a)
print("All:", sorted(set_a | set_b))

Part C: Real-World Application ⭐⭐-⭐⭐⭐

These exercises connect chapter concepts to realistic data scenarios.


Exercise 5.16: Weather data processing

Create a list of dictionaries representing 7 days of weather data with keys "day" (e.g., "Monday"), "high_temp" (Fahrenheit), "low_temp", and "precipitation" (inches). Then:

  1. Calculate the average high temperature for the week
  2. Find the day with the largest temperature range (high minus low)
  3. Calculate total precipitation for the week
  4. Create a list of days where it rained (precipitation > 0)
  5. Write the data to a CSV file and read it back to verify
Guidance Design the data yourself with realistic values. The coding patterns are the same as Exercises 5.9 and 5.12. The key challenge is combining multiple techniques in sequence — this is what real data work looks like.
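For part 2, one pattern worth knowing is `max()` with a `key` function. A minimal sketch, using two invented days rather than the full week:

```python
# Two invented days are enough to show the pattern for part 2
weather = [
    {"day": "Monday", "high_temp": 68, "low_temp": 51, "precipitation": 0.0},
    {"day": "Tuesday", "high_temp": 75, "low_temp": 49, "precipitation": 0.3},
]

# max() with a key function picks the record with the widest high-low range
widest = max(weather, key=lambda d: d["high_temp"] - d["low_temp"])
print(widest["day"])  # Tuesday (range of 26 vs. Monday's 17)
```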

Exercise 5.17: Inverting a mapping

The country_to_region dictionary maps country names to WHO regions. Write a function invert_mapping(mapping) that creates the reverse mapping: region names to lists of countries. For example, "Americas" should map to ["Brazil", "Canada", "Mexico", ...].

country_to_region = {
    "Brazil": "Americas",
    "Canada": "Americas",
    "Chad": "Africa",
    "Denmark": "Europe",
    "Ethiopia": "Africa",
    "France": "Europe",
    "India": "South-East Asia",
    "Nigeria": "Africa",
}

region_to_countries = invert_mapping(country_to_region)
print(region_to_countries["Africa"])  # ["Chad", "Ethiopia", "Nigeria"]
Guidance
def invert_mapping(mapping):
    inverted = {}
    for key, value in mapping.items():
        if value not in inverted:
            inverted[value] = []
        inverted[value].append(key)
    return inverted
Note that a simple dictionary comprehension will not work here because multiple keys can map to the same value. You need to build lists.

Exercise 5.18: Grade distribution analysis

Jordan has grade data for a class:

grades = [
    {"student": "S001", "dept": "CS", "grade": "A"},
    {"student": "S002", "dept": "CS", "grade": "B+"},
    {"student": "S003", "dept": "Stats", "grade": "A-"},
    {"student": "S004", "dept": "CS", "grade": "B"},
    {"student": "S005", "dept": "Stats", "grade": "A"},
    {"student": "S006", "dept": "CS", "grade": "C+"},
    {"student": "S007", "dept": "Stats", "grade": "B+"},
    {"student": "S008", "dept": "CS", "grade": "A-"},
]
  1. Create a grade-point mapping dictionary: {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7, "C+": 2.3, "C": 2.0}
  2. Calculate the average GPA for CS students vs. Statistics students
  3. Which department has the higher average? (This is the kind of question Jordan investigates in the anchor example.)
Guidance
gp_map = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7, "C+": 2.3, "C": 2.0}

cs_grades = [gp_map[g["grade"]] for g in grades if g["dept"] == "CS"]
stats_grades = [gp_map[g["grade"]] for g in grades if g["dept"] == "Stats"]

cs_avg = sum(cs_grades) / len(cs_grades)
stats_avg = sum(stats_grades) / len(stats_grades)

print(f"CS average: {cs_avg:.2f}")
print(f"Stats average: {stats_avg:.2f}")

Exercise 5.19: JSON data exploration

Write a Python script that:
  1. Creates a JSON file called team_roster.json containing a dictionary with keys "team_name", "sport", "season", and "players" (a list of dictionaries, each with "name", "number", and "position")
  2. Reads the file back
  3. Prints the team name and the number of players
  4. Prints the name and position of each player

Use at least 5 players with realistic data.

Guidance Follow the JSON writing pattern from Section 5.7 (`json.dump` with `indent=2`) and the reading pattern from Section 5.6 (`json.load`). The nested access follows the same principles as Exercise 5.13.
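A minimal sketch of the write-then-read round trip (the team and player data are invented, and a full solution needs at least 5 players):

```python
import json

# Invented roster; the exercise asks for at least 5 players
roster = {
    "team_name": "River City Hawks",
    "sport": "basketball",
    "season": "2024-25",
    "players": [
        {"name": "T. Alvarez", "number": 7, "position": "PG"},
        {"name": "K. Osei", "number": 23, "position": "C"},
    ],
}

# Write with indent=2 so the file is human-readable
with open("team_roster.json", "w") as f:
    json.dump(roster, f, indent=2)

# Read it back; json.load reconstructs the nested structure
with open("team_roster.json", "r") as f:
    data = json.load(f)

print(data["team_name"], "-", len(data["players"]), "players")
for player in data["players"]:
    print(player["name"], player["position"])
```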

Exercise 5.20: Marcus's sales analysis

Marcus has daily sales records for his store:

sales = [
    {"date": "2024-03-01", "product": "Widget A", "quantity": 15, "unit_price": 9.99},
    {"date": "2024-03-01", "product": "Widget B", "quantity": 8, "unit_price": 14.99},
    {"date": "2024-03-02", "product": "Widget A", "quantity": 22, "unit_price": 9.99},
    {"date": "2024-03-02", "product": "Widget C", "quantity": 5, "unit_price": 24.99},
    {"date": "2024-03-03", "product": "Widget B", "quantity": 12, "unit_price": 14.99},
    {"date": "2024-03-03", "product": "Widget A", "quantity": 18, "unit_price": 9.99},
    {"date": "2024-03-03", "product": "Widget C", "quantity": 3, "unit_price": 24.99},
]
  1. Calculate total revenue (quantity * unit_price) for each record (add a "revenue" key)
  2. Calculate total revenue per product using a dictionary
  3. Find which product generated the most total revenue
  4. Calculate total revenue per day
  5. Write the enriched data (with the revenue column) to a CSV file
Guidance
# 1
for record in sales:
    record["revenue"] = record["quantity"] * record["unit_price"]

# 2
product_revenue = {}
for record in sales:
    product = record["product"]
    product_revenue[product] = product_revenue.get(product, 0) + record["revenue"]
print(product_revenue)

# 3
best_product = max(product_revenue, key=product_revenue.get)
print(f"Top product: {best_product} (${product_revenue[best_product]:.2f})")

# 4 and 5 follow similar patterns
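One way parts 4 and 5 could look, assuming part 1 has already added the `revenue` key (the data here is abbreviated to three records):

```python
import csv

# Assumes each record already has a computed "revenue" key (part 1)
sales = [
    {"date": "2024-03-01", "product": "Widget A", "revenue": 149.85},
    {"date": "2024-03-01", "product": "Widget B", "revenue": 119.92},
    {"date": "2024-03-02", "product": "Widget A", "revenue": 219.78},
]

# Part 4: total revenue per day, using the same .get() accumulation
daily_revenue = {}
for record in sales:
    day = record["date"]
    daily_revenue[day] = daily_revenue.get(day, 0) + record["revenue"]
print(daily_revenue)

# Part 5: write the enriched records out with csv.DictWriter
with open("sales_enriched.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["date", "product", "revenue"])
    writer.writeheader()
    writer.writerows(sales)
```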

Part D: Synthesis & Critical Thinking ⭐⭐⭐

These problems require you to connect ideas, compare approaches, or design solutions.


Exercise 5.21: Data structure trade-offs

You are building a simple contact book that stores people's names and phone numbers. Compare three implementations:

  1. A list of tuples: [("Alice", "555-0101"), ("Bob", "555-0102"), ...]
  2. A dictionary: {"Alice": "555-0101", "Bob": "555-0102", ...}
  3. A list of dictionaries: [{"name": "Alice", "phone": "555-0101"}, ...]

For each operation below, state which implementation makes it easiest and explain why:
  - Look up Alice's phone number
  - Add a new contact
  - Check if "Carol" is in the contact book
  - Add a second phone number for Alice (home and work)
  - Sort contacts by name

Guidance
  - **Lookup:** Dictionary is fastest — O(1) by key. List of tuples requires searching.
  - **Add:** All three are easy (append or assign).
  - **Check membership:** Dictionary (`"Carol" in contacts`) is fastest.
  - **Second number:** List of dictionaries is most flexible — each dict can have a `"phones"` list. The simple dictionary would need to change its value type from string to list.
  - **Sort:** List of tuples and list of dicts are easy to sort. Dictionaries maintain insertion order in Python 3.7+ but are not designed for ordered retrieval by position.

The deeper point: there is no universally "best" structure. The right choice depends on which operations you do most.

Exercise 5.22: Designing a mini-database

Design a data structure to represent a university's course catalog. It should support these queries efficiently:
  - "What courses does the CS department offer?"
  - "Who teaches Stats 201?"
  - "What are the prerequisites for Data Science 110?"
  - "How many total courses are offered?"

Sketch your data structure in Python (you do not need to populate it with a lot of data — 3-4 courses are enough). Then show Python code that answers each of the four queries.

Guidance One effective design uses a dictionary of dictionaries keyed by department, with each course as a nested dictionary:
catalog = {
    "CS": {
        "CS 101": {"title": "Intro to CS", "instructor": "Dr. Park", "prereqs": []},
        "CS 201": {"title": "Data Structures", "instructor": "Dr. Lee", "prereqs": ["CS 101"]},
    },
    "Stats": {
        "Stats 201": {"title": "Statistical Methods", "instructor": "Dr. Ramirez", "prereqs": ["Stats 101"]},
    },
    "Data Science": {
        "DS 110": {"title": "Intro to Data Science", "instructor": "Dr. Park", "prereqs": ["CS 101", "Stats 101"]},
    }
}

# Query 1
print(list(catalog["CS"].keys()))
# Query 2
print(catalog["Stats"]["Stats 201"]["instructor"])
# Query 3
print(catalog["Data Science"]["DS 110"]["prereqs"])
# Query 4
total = sum(len(courses) for courses in catalog.values())
Other designs are valid. The key is explaining *why* your design supports the required queries.

Exercise 5.23: From spreadsheet to code

Here is a small spreadsheet of data:

City         State   Population   Area_sq_mi
New York     NY      8336817      302.6
Los Angeles  CA      3979576      468.7
Chicago      IL      2693976      227.6
Houston      TX      2304580      671.7
  1. Represent this as a list of dictionaries
  2. Write a function that calculates the population density (population / area) for each city and adds it as a new key
  3. Which city has the highest population density? Write code to find out.
  4. Write a dictionary comprehension mapping city names to population density
  5. Write the enriched data to a CSV file
Guidance This exercise integrates nearly every concept from the chapter: list of dicts, adding keys, loops or comprehensions, dict comprehensions, and csv.DictWriter. Work through it step by step. The answer to "highest density" should be New York (about 27,550 people per square mile).

Exercise 5.24: Comparing file formats

Given the same data (a list of 3 countries with name, region, and vaccination rate), write it to both a CSV file and a JSON file. Then:

  1. Compare the file sizes (use os.path.getsize())
  2. Read each file back and verify you get the same data
  3. In 3-4 sentences, discuss: when would you choose CSV over JSON, and vice versa?
Guidance CSV is more compact for simple tabular data and widely supported by spreadsheet software. JSON handles nested/hierarchical data naturally and is the standard for web APIs. CSV is better for flat tables; JSON is better for complex structures. Both are human-readable text formats.
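A sketch of the size comparison in step 1, using three invented records and file names:

```python
import csv
import json
import os

# Invented sample data for the comparison
countries = [
    {"name": "Brazil", "region": "Americas", "rate": 72.3},
    {"name": "Chad", "region": "Africa", "rate": 41.7},
    {"name": "Denmark", "region": "Europe", "rate": 93.2},
]

# Same data, two formats
with open("countries_demo.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "region", "rate"])
    writer.writeheader()
    writer.writerows(countries)

with open("countries_demo.json", "w") as f:
    json.dump(countries, f, indent=2)

# CSV states each key once (in the header); JSON repeats keys per record
print("CSV bytes: ", os.path.getsize("countries_demo.csv"))
print("JSON bytes:", os.path.getsize("countries_demo.json"))
```

Expect the JSON file to be noticeably larger here, because every record repeats the key names and the indentation adds whitespace.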

Exercise 5.25: The data pipeline

Write a complete data processing pipeline that:
  1. Reads a CSV file containing student names and test scores
  2. Calculates each student's average score
  3. Classifies each student as "pass" (average >= 60) or "fail"
  4. Writes a new CSV file with the original data plus two new columns: average and status

First, create the input CSV file with at least 6 students and 3 test score columns each. Then process it.

Guidance This is a synthesis exercise combining file reading, data processing with dictionaries, and file writing. The key steps are:
  1. Use `csv.DictReader` to read
  2. Convert score strings to floats
  3. Compute the average
  4. Add `"average"` and `"status"` keys to each record
  5. Use `csv.DictWriter` with updated fieldnames to write
This read-process-write pattern is the foundation of all data pipeline work you will do later.
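The read-process-write steps, condensed into a sketch (file names and the two-student dataset are invented; the exercise asks for at least six students):

```python
import csv

# Step 0: create an input file (invented data, smaller than the exercise asks)
with open("scores_in.csv", "w", newline="") as f:
    f.write("name,test1,test2,test3\nAda,90,85,95\nBob,40,55,50\n")

# Read, convert strings to floats, compute the average, classify
with open("scores_in.csv", "r") as f:
    records = list(csv.DictReader(f))

for r in records:
    scores = [float(r[k]) for k in ("test1", "test2", "test3")]
    r["average"] = sum(scores) / len(scores)
    r["status"] = "pass" if r["average"] >= 60 else "fail"

# Write with the two new columns appended to the fieldnames
fields = ["name", "test1", "test2", "test3", "average", "status"]
with open("scores_out.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(records)
```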

Part M: Mixed Practice (Chapters 1-4 Review) ⭐⭐

These problems blend current and previous material to build cumulative fluency.


Exercise 5.26: Lifecycle with dictionaries (Chapter 1 + Chapter 5)

Map Marcus's small-business analysis onto the data science lifecycle (from Chapter 1). For each stage, describe both (a) what Marcus would do and (b) which Python data structure from this chapter he would likely use. For example: "In the data collection stage, Marcus would export his POS data. He would store each transaction as a dictionary with keys for date, product, quantity, and price."

Guidance Walk through all six stages:
  1. Question formulation (no data structure needed)
  2. Data collection (list of dictionaries from CSV)
  3. Data cleaning (sets for finding unique values, dictionaries for mapping corrections)
  4. Exploration (loops over list of dicts, comprehensions for filtering)
  5. Modeling (dictionaries for storing results)
  6. Communication (writing results to files)
The specific structures you choose should be justified.

Exercise 5.27: Functions with data structures (Chapter 4 + Chapter 5)

Write the following functions:

  1. lookup_region(country, mapping) — takes a country name and a country-to-region dictionary, returns the region or "Unknown" if the country is not found
  2. filter_by_region(records, region) — takes a list of dictionaries and a region name, returns a list of records matching that region
  3. compute_average(records, key) — takes a list of dictionaries and a key name (e.g., "vaccination_rate"), returns the average value for that key

Test all three functions with sample data.

Guidance
def lookup_region(country, mapping):
    return mapping.get(country, "Unknown")

def filter_by_region(records, region):
    return [r for r in records if r["region"] == region]

def compute_average(records, key):
    values = [r[key] for r in records]
    return sum(values) / len(values) if values else 0

Exercise 5.28: Type conversion meets data structures (Chapter 3 + Chapter 5)

The following data was read from a CSV file, so all values are strings:

raw_data = [
    {"country": "Brazil", "population": "214000000", "rate": "72.3", "vaccinated": "True"},
    {"country": "Chad", "population": "17400000", "rate": "41.7", "vaccinated": "False"},
]

Write a function clean_record(record) that returns a new dictionary with: - population converted to int - rate converted to float - vaccinated converted to bool (careful: bool("False") is True!) - country left as a string

Test it and verify the types are correct.

Guidance The `bool` conversion is the tricky part. `bool("False")` returns `True` because any non-empty string is truthy. Use a comparison instead:
def clean_record(record):
    return {
        "country": record["country"],
        "population": int(record["population"]),
        "rate": float(record["rate"]),
        "vaccinated": record["vaccinated"] == "True"
    }

Exercise 5.29: Conditionals inside comprehensions (Chapter 4 + Chapter 5)

Write a single list comprehension that takes a list of vaccination rates and produces a list of category strings:

rates = [72.3, 85.1, 41.7, 93.2, 68.5, 55.0, 12.8]
# Expected: ["medium", "high", "low", "high", "medium", "medium", "low"]

Where: high >= 80, medium >= 50, low < 50.

Hint: You can use a conditional expression (ternary) inside a comprehension: "high" if x >= 80 else "medium" if x >= 50 else "low".

Guidance
categories = ["high" if r >= 80 else "medium" if r >= 50 else "low" for r in rates]
This works but is at the edge of readability. For more complex logic, a helper function called inside the comprehension is cleaner:
def categorize(rate):
    if rate >= 80: return "high"
    elif rate >= 50: return "medium"
    else: return "low"

categories = [categorize(r) for r in rates]

Exercise 5.30: Jupyter notebook narrative (Chapter 2 + Chapter 5)

Create a Jupyter notebook called ch5_exploration.ipynb that tells a story. Use Markdown cells to explain what you are doing and why. The notebook should:

  1. Start with a title and a one-paragraph introduction
  2. Create a list of dictionaries representing at least 8 countries with name, region, vaccination_rate, and population
  3. Use a comprehension to extract country names
  4. Use a loop to find the country with the highest vaccination rate
  5. Use a dictionary to count countries per region
  6. Write the data to a CSV file and read it back
  7. End with a "Findings" section summarizing what you observed

This exercise practices the notebook as narrative, which you will do extensively starting in Chapter 6.

Guidance The code components use techniques from this chapter. The key addition is the Markdown narrative: explain *why* you are performing each step, not just *what* you are doing. A good notebook reads like a report, not a code dump. Revisit Chapter 2's discussion of Markdown cells if needed.

Part E: Research & Extension ⭐⭐⭐⭐

These are open-ended projects that go beyond the chapter. Spend 30-60 minutes on one.


Exercise 5.31: Beyond built-in: collections module

Research Python's collections module, specifically Counter, defaultdict, and OrderedDict. For each:
  1. Describe what it does in one sentence
  2. Write a short code example using data-science-relevant data
  3. Explain when you would use it instead of a plain dict

Guidance
  - `Counter` is a dictionary subclass for counting hashable objects. Example: `Counter(["A", "B", "A", "C", "A"])` gives `Counter({"A": 3, "B": 1, "C": 1})`. Use instead of the `.get(key, 0) + 1` pattern.
  - `defaultdict` is a dictionary that provides default values for missing keys. Example: `defaultdict(list)` lets you append without checking if a key exists. Use for the "invert mapping" pattern in Exercise 5.17.
  - `OrderedDict` was historically needed for order-preserving dictionaries (before Python 3.7). It is less necessary now but still useful for its `move_to_end()` method.
See the official Python docs: https://docs.python.org/3/library/collections.html
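Minimal sketches of the first two, using region data in the style of this chapter's examples:

```python
from collections import Counter, defaultdict

# Counter replaces the counts.get(key, 0) + 1 pattern from Exercise 5.11
regions = ["Americas", "Africa", "Americas", "Europe", "Africa", "Africa"]
counts = Counter(regions)
print(counts["Africa"])        # 3
print(counts.most_common(1))   # [('Africa', 3)]

# defaultdict(list) replaces the "check, then create empty list" pattern
# from Exercise 5.17's invert_mapping
pairs = [("Americas", "Brazil"), ("Africa", "Chad"), ("Africa", "Ethiopia")]
by_region = defaultdict(list)
for region, country in pairs:
    by_region[region].append(country)  # no existence check needed
print(dict(by_region))
```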

Exercise 5.32: Real data challenge

Find a publicly available CSV file from a real source (e.g., data.gov, the World Bank, Kaggle, or your university's open data portal). Choose a small one — under 100 rows. Then:

  1. Download it
  2. Read it into a list of dictionaries using csv.DictReader
  3. Answer at least two specific questions about the data using the techniques from this chapter
  4. Write a cleaned or enriched version to a new CSV file

Document your work in a Jupyter notebook with Markdown explanations.

Guidance Good sources for small, beginner-friendly datasets:
  - World Bank Open Data (data.worldbank.org) — search for a specific indicator
  - WHO Global Health Observatory (gho.who.int) — health statistics by country
  - data.gov — US government open data
  - Kaggle (kaggle.com/datasets) — filter for small, CSV-format datasets
The goal is to practice the read-process-write pipeline with real, messy data. Expect to encounter issues like missing values, unexpected column names, or inconsistent formatting. That is the point.

End of Chapter 5 Exercises. If Parts A and B felt comfortable and Parts C and D stretched you, you are in exactly the right place. The fluency you are building with data structures will pay dividends starting in the very next chapter, when you load your first real dataset.