Chapter 5 Exercises: Working with Data Structures

How to use these exercises: Work through the sections in order. Each section builds on the previous one, moving from recall through application to synthesis. Type every code exercise into a Jupyter cell and run it — reading code is not the same as writing it.

Difficulty key: ⭐ Foundational | ⭐⭐ Intermediate | ⭐⭐⭐ Advanced | ⭐⭐⭐⭐ Extension


Part A: Conceptual Understanding ⭐

These questions check whether you absorbed the core ideas from the chapter. Aim for clear, concise answers.


Exercise 5.1: Choosing the right structure

For each scenario below, state which data structure (list, dictionary, set, or tuple) is the best choice. Explain your reasoning in one sentence.

  1. Storing the names of all countries in a dataset, in the order they appear
  2. Mapping country names to their ISO 3166-1 alpha-3 codes (e.g., "Brazil" to "BRA")
  3. Finding all unique vaccine manufacturers mentioned in a dataset
  4. Representing a fixed pair of latitude and longitude coordinates
  5. Storing a patient's medical record with named fields like "name," "age," and "blood_type"
Guidance
  1. **List** — You need ordered data and may have duplicates if the same country appears in multiple rows.
  2. **Dictionary** — You need a name-to-value mapping with fast lookup.
  3. **Set** — You need uniqueness, and order does not matter.
  4. **Tuple** — Coordinates are fixed data that should not change; tuples also work as dictionary keys.
  5. **Dictionary** — Named fields (keys) make the data self-documenting and easy to access.

Exercise 5.2: Mutable vs. immutable

Without running the code, predict what each snippet will print. Then run it to check.

# Snippet A
a = [1, 2, 3]
b = a
b.append(4)
print(a)

# Snippet B
x = (1, 2, 3)
y = x
# y.append(4)  — What would happen if you uncommented this line?
print(x)

# Snippet C
s = "hello"
s.upper()
print(s)
Guidance
  - **Snippet A:** Prints `[1, 2, 3, 4]`. `b = a` does not copy the list — both `a` and `b` point to the same list object. Modifying through `b` also changes `a`.
  - **Snippet B:** Prints `(1, 2, 3)`. Uncommenting the append line would raise `AttributeError` because tuples are immutable and do not have an `append` method.
  - **Snippet C:** Prints `hello`. `s.upper()` returns a new string `"HELLO"` but does not modify `s`. Strings are immutable. You would need `s = s.upper()` to update the variable.

Exercise 5.3: Dictionary access patterns

Given this dictionary:

student = {
    "name": "Jordan Kim",
    "major": "Data Science",
    "gpa": 3.7,
    "courses": ["Stats 101", "CS 110", "Data Ethics"]
}

Write the Python expression to access each of the following (do not use variables — write the full expression):

  1. Jordan's major
  2. Jordan's third course
  3. The number of courses Jordan is taking
  4. Whether Jordan's GPA is above 3.5 (should evaluate to True or False)
Guidance
  1. `student["major"]` → `"Data Science"`
  2. `student["courses"][2]` → `"Data Ethics"`
  3. `len(student["courses"])` → `3`
  4. `student["gpa"] > 3.5` → `True`

Exercise 5.4: Comprehension anatomy

Rewrite each list comprehension as an equivalent for loop with .append(). Then rewrite each for loop as an equivalent list comprehension.

# Comprehension 1 — rewrite as a loop
doubled = [n * 2 for n in [5, 10, 15, 20]]

# Comprehension 2 — rewrite as a loop
short_words = [w for w in ["data", "is", "powerful", "and", "fun"] if len(w) <= 3]

# Loop 1 — rewrite as a comprehension
result = []
for temp_c in [0, 20, 37, 100]:
    temp_f = temp_c * 9/5 + 32
    result.append(temp_f)

# Loop 2 — rewrite as a comprehension
names = []
for record in [{"name": "A", "score": 90}, {"name": "B", "score": 55}]:
    if record["score"] >= 60:
        names.append(record["name"])
Guidance
# Comprehension 1 as loop
doubled = []
for n in [5, 10, 15, 20]:
    doubled.append(n * 2)

# Comprehension 2 as loop
short_words = []
for w in ["data", "is", "powerful", "and", "fun"]:
    if len(w) <= 3:
        short_words.append(w)

# Loop 1 as comprehension
result = [temp_c * 9/5 + 32 for temp_c in [0, 20, 37, 100]]

# Loop 2 as comprehension
names = [record["name"] for record in [{"name": "A", "score": 90}, {"name": "B", "score": 55}]
         if record["score"] >= 60]

Exercise 5.5: Reading the error message

Each code snippet below produces an error. Without running the code, identify: (a) the error type, and (b) how to fix it. Then run the code to verify.

# Snippet A
patient = {"name": "Elena", "age": 31}
print(patient["Name"])

# Snippet B
data = [10, 20, 30]
print(data[3])

# Snippet C
coordinates = (40.7, -74.0)
coordinates[0] = 41.0

# Snippet D
import csv
with open("nonexistent_file.csv", "r") as f:
    reader = csv.reader(f)
Guidance
  - **A:** `KeyError: 'Name'` — keys are case-sensitive. Fix: `patient["name"]` (lowercase n).
  - **B:** `IndexError: list index out of range` — index 3 does not exist (valid indices are 0, 1, 2). Fix: `data[2]` for the last item, or `data[-1]`.
  - **C:** `TypeError: 'tuple' object does not support item assignment` — tuples are immutable. Fix: create a new tuple, e.g., `coordinates = (41.0, coordinates[1])`.
  - **D:** `FileNotFoundError` — the file does not exist. Fix: use a filename that actually exists, or create the file first.

Exercise 5.6: The file reading pattern

Fill in the blanks in this description of the file-reading pattern (write your answers before checking):

"To read a CSV file in Python, we import the _ module. We open the file using the _ function, typically inside a _ statement to ensure the file is closed automatically. For CSV files, we create a _ object (or a _ for automatic column-name mapping). Each row from csv.reader is returned as a _. Each row from csv.DictReader is returned as a _. All values from CSV files are always _, so numeric values must be converted using float() or int()."

Guidance In order: `csv`; `open()`; `with`; `csv.reader`; `csv.DictReader`; list (of strings); dictionary; strings.
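Assembled into runnable form, the pattern reads as follows (the file name `scores_demo.csv` and its contents are invented for this sketch):

```python
import csv

# Create a tiny input file so the sketch is self-contained (invented data)
with open("scores_demo.csv", "w", newline="") as f:
    f.write("name,score\nAda,91\nGrace,88\n")

# The pattern: open() inside a with statement, then csv.DictReader
with open("scores_demo.csv", "r") as f:
    reader = csv.DictReader(f)
    rows = list(reader)  # each row is a dictionary of strings

# CSV values are always strings, so convert before doing math
scores = [float(row["score"]) for row in rows]
print(scores)  # [91.0, 88.0]
```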

Exercise 5.7: Set operations in plain English

Given:

enrolled_2024 = {"Alice", "Bob", "Carol", "Dave", "Eve"}
enrolled_2025 = {"Bob", "Carol", "Frank", "Grace"}

Describe in plain English what each expression computes, then verify by running the code:

  1. enrolled_2024 & enrolled_2025
  2. enrolled_2024 - enrolled_2025
  3. enrolled_2025 - enrolled_2024
  4. enrolled_2024 | enrolled_2025
  5. enrolled_2024 ^ enrolled_2025
Guidance
  1. Students enrolled in *both* years: `{"Bob", "Carol"}`
  2. Students who left (in 2024 but not 2025): `{"Alice", "Dave", "Eve"}`
  3. New students (in 2025 but not 2024): `{"Frank", "Grace"}`
  4. All students ever enrolled: `{"Alice", "Bob", "Carol", "Dave", "Eve", "Frank", "Grace"}`
  5. Students in exactly one year (not both): `{"Alice", "Dave", "Eve", "Frank", "Grace"}`

Part B: Applied Practice ⭐⭐

These problems require you to write code. Work in Jupyter and test your solutions.


Exercise 5.8: Building a data record

Create a dictionary representing a single NBA player with the following fields: name, team, position, games_played, points_per_game, rebounds_per_game, assists_per_game, and three_point_pct. Use realistic values for any player you choose (or invent one). Then write a function player_summary(player_dict) that takes this dictionary and returns a formatted string like:

"LeBron James (LAL) — 25.7 PPG, 7.3 RPG, 8.3 APG"
Guidance
player = {
    "name": "LeBron James",
    "team": "LAL",
    "position": "SF",
    "games_played": 71,
    "points_per_game": 25.7,
    "rebounds_per_game": 7.3,
    "assists_per_game": 8.3,
    "three_point_pct": 0.410
}

def player_summary(p):
    return (f"{p['name']} ({p['team']}) — "
            f"{p['points_per_game']} PPG, "
            f"{p['rebounds_per_game']} RPG, "
            f"{p['assists_per_game']} APG")

print(player_summary(player))

Exercise 5.9: Filtering a list of dictionaries

Given the following dataset:

students = [
    {"name": "Alice", "major": "CS", "gpa": 3.8},
    {"name": "Bob", "major": "Data Science", "gpa": 3.2},
    {"name": "Carol", "major": "CS", "gpa": 3.9},
    {"name": "Dave", "major": "Statistics", "gpa": 2.7},
    {"name": "Eve", "major": "Data Science", "gpa": 3.5},
    {"name": "Frank", "major": "CS", "gpa": 3.1},
]

Write code to:
  1. Create a list of names of all students with a GPA of 3.5 or higher (use a list comprehension)
  2. Create a list of all Data Science majors (use a list comprehension)
  3. Calculate the average GPA across all students
  4. Find the student with the highest GPA (without using max() — write a loop)

Guidance
# 1
honor_roll = [s["name"] for s in students if s["gpa"] >= 3.5]
# ["Alice", "Carol", "Eve"]

# 2
ds_majors = [s["name"] for s in students if s["major"] == "Data Science"]
# ["Bob", "Eve"]

# 3
avg_gpa = sum(s["gpa"] for s in students) / len(students)
# 3.3666...

# 4
best = students[0]
for s in students[1:]:
    if s["gpa"] > best["gpa"]:
        best = s
print(f"{best['name']}: {best['gpa']}")  # Carol: 3.9

Exercise 5.10: Dictionary from two lists

Given:

countries = ["Brazil", "Canada", "Chad", "Denmark", "Ethiopia"]
rates = [72.3, 85.1, 41.7, 93.2, 68.5]

Use zip() and a dictionary comprehension to create a dictionary mapping each country to its rate. Then use this dictionary to look up Denmark's rate and print it.

Guidance
country_rates = {country: rate for country, rate in zip(countries, rates)}
print(country_rates["Denmark"])  # 93.2
`zip()` pairs up corresponding elements from two sequences. The comprehension `{k: v for k, v in zip(...)}` builds a dictionary from those pairs. You could also write `dict(zip(countries, rates))` for the same result.

Exercise 5.11: Counting with dictionaries

Write a function count_by_region(records) that takes a list of dictionaries (each with a "region" key) and returns a dictionary mapping each region to the number of countries in that region.

Test it with:

data = [
    {"country": "Brazil", "region": "Americas"},
    {"country": "Canada", "region": "Americas"},
    {"country": "Chad", "region": "Africa"},
    {"country": "Denmark", "region": "Europe"},
    {"country": "Ethiopia", "region": "Africa"},
    {"country": "France", "region": "Europe"},
]

Expected output: {"Americas": 2, "Africa": 2, "Europe": 2}

Guidance
def count_by_region(records):
    counts = {}
    for record in records:
        region = record["region"]
        counts[region] = counts.get(region, 0) + 1
    return counts

print(count_by_region(data))
The `.get(region, 0)` pattern is essential: it returns the current count if the region is already in the dictionary, or 0 if it is the first time seeing that region.

Exercise 5.12: Write and read a CSV

  1. Create a list of at least 5 dictionaries representing books (with keys "title", "author", "year", "pages").
  2. Write this data to a file called books.csv using csv.DictWriter.
  3. Read the file back using csv.DictReader and print each book's title and year.
  4. Verify that the year values you read back are strings. Convert them to integers and compute the average publication year.
Guidance
import csv

books = [
    {"title": "Weapons of Math Destruction", "author": "Cathy O'Neil", "year": "2016", "pages": "272"},
    {"title": "The Signal and the Noise", "author": "Nate Silver", "year": "2012", "pages": "544"},
    {"title": "Python for Data Analysis", "author": "Wes McKinney", "year": "2022", "pages": "579"},
    {"title": "The Art of Statistics", "author": "David Spiegelhalter", "year": "2019", "pages": "426"},
    {"title": "Factfulness", "author": "Hans Rosling", "year": "2018", "pages": "352"},
]

# Write
with open("books.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "author", "year", "pages"])
    writer.writeheader()
    for book in books:
        writer.writerow(book)

# Read
with open("books.csv", "r") as f:
    reader = csv.DictReader(f)
    read_books = list(reader)

for book in read_books:
    print(f"{book['title']} ({book['year']})")
    print(f"  year type: {type(book['year'])}")  # <class 'str'>

# Average year
years = [int(book["year"]) for book in read_books]
print(f"Average publication year: {sum(years) / len(years):.0f}")

Exercise 5.13: Nested data access

Given this nested structure:

university = {
    "name": "State University",
    "departments": {
        "Computer Science": {
            "faculty_count": 45,
            "courses": ["CS 101", "CS 201", "CS 301", "Data Science 110"],
            "chair": "Dr. Park"
        },
        "Statistics": {
            "faculty_count": 28,
            "courses": ["Stats 101", "Stats 201", "Bayesian Methods"],
            "chair": "Dr. Ramirez"
        },
        "Mathematics": {
            "faculty_count": 52,
            "courses": ["Calc I", "Calc II", "Linear Algebra", "Probability"],
            "chair": "Dr. Chen"
        }
    }
}

Write expressions to access:
  1. The name of the Statistics department chair
  2. The second course in Computer Science
  3. The total number of faculty across all departments
  4. A list of all department names
  5. All courses across all departments combined into a single list

Guidance
# 1
print(university["departments"]["Statistics"]["chair"])  # "Dr. Ramirez"

# 2
print(university["departments"]["Computer Science"]["courses"][1])  # "CS 201"

# 3
total = sum(dept["faculty_count"] for dept in university["departments"].values())
print(total)  # 125

# 4
print(list(university["departments"].keys()))
# ["Computer Science", "Statistics", "Mathematics"]

# 5
all_courses = []
for dept in university["departments"].values():
    all_courses.extend(dept["courses"])
print(all_courses)
# or as a comprehension:
all_courses = [course for dept in university["departments"].values()
               for course in dept["courses"]]

Exercise 5.14: Building a frequency table

Write a function frequency_table(items) that takes a list and returns a dictionary mapping each unique item to its count. Test with:

colors = ["red", "blue", "red", "green", "blue", "red", "blue", "green", "red"]
print(frequency_table(colors))
# {"red": 4, "blue": 3, "green": 2}

Then sort the result by count (highest first) and print it. Hint: sorted() can take a key argument.

Guidance
def frequency_table(items):
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts

result = frequency_table(colors)
print(result)

# Sort by count, descending
for color, count in sorted(result.items(), key=lambda x: x[1], reverse=True):
    print(f"{color}: {count}")

Exercise 5.15: Set-based data cleaning

You have two lists of country names from different data sources. They should contain the same countries but may not:

source_a = ["Brazil", "Canada", "Chad", "Denmark", "ethiopia", "France"]
source_b = ["brazil", "Canada", "Chad", "Denmark", "Ethiopia", "France", "Germany"]
  1. Normalize both lists to lowercase
  2. Find countries in both sources
  3. Find countries in source A but not source B
  4. Find countries in source B but not source A
  5. Combine all unique countries into a single sorted list
Guidance
set_a = {c.lower() for c in source_a}
set_b = {c.lower() for c in source_b}

print("In both:", set_a & set_b)
print("Only A:", set_a - set_b)
print("Only B:", set_b - set_a)
print("All:", sorted(set_a | set_b))

Part C: Real-World Application ⭐⭐-⭐⭐⭐

These exercises connect chapter concepts to realistic data scenarios.


Exercise 5.16: Weather data processing

Create a list of dictionaries representing 7 days of weather data with keys "day" (e.g., "Monday"), "high_temp" (Fahrenheit), "low_temp", and "precipitation" (inches). Then:

  1. Calculate the average high temperature for the week
  2. Find the day with the largest temperature range (high minus low)
  3. Calculate total precipitation for the week
  4. Create a list of days where it rained (precipitation > 0)
  5. Write the data to a CSV file and read it back to verify
Guidance Design the data yourself with realistic values. The coding patterns are the same as Exercises 5.9 and 5.12. The key challenge is combining multiple techniques in sequence — this is what real data work looks like.
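For part 2, one pattern worth knowing is `max()` with a `key` function. A minimal sketch, using two invented days rather than the full week:

```python
# Two invented days are enough to show the pattern for part 2
weather = [
    {"day": "Monday", "high_temp": 68, "low_temp": 51, "precipitation": 0.0},
    {"day": "Tuesday", "high_temp": 75, "low_temp": 49, "precipitation": 0.3},
]

# max() with a key function picks the record with the widest high-low range
widest = max(weather, key=lambda d: d["high_temp"] - d["low_temp"])
print(widest["day"])  # Tuesday (range of 26 vs. Monday's 17)
```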

Exercise 5.17: Inverting a mapping

The country_to_region dictionary maps country names to WHO regions. Write a function invert_mapping(mapping) that creates the reverse mapping: region names to lists of countries. For example, "Americas" should map to ["Brazil", "Canada", "Mexico", ...].

country_to_region = {
    "Brazil": "Americas",
    "Canada": "Americas",
    "Chad": "Africa",
    "Denmark": "Europe",
    "Ethiopia": "Africa",
    "France": "Europe",
    "India": "South-East Asia",
    "Nigeria": "Africa",
}

region_to_countries = invert_mapping(country_to_region)
print(region_to_countries["Africa"])  # ["Chad", "Ethiopia", "Nigeria"]
Guidance
def invert_mapping(mapping):
    inverted = {}
    for key, value in mapping.items():
        if value not in inverted:
            inverted[value] = []
        inverted[value].append(key)
    return inverted
Note that a simple dictionary comprehension will not work here because multiple keys can map to the same value. You need to build lists.

Exercise 5.18: Grade distribution analysis

Jordan has grade data for a class:

grades = [
    {"student": "S001", "dept": "CS", "grade": "A"},
    {"student": "S002", "dept": "CS", "grade": "B+"},
    {"student": "S003", "dept": "Stats", "grade": "A-"},
    {"student": "S004", "dept": "CS", "grade": "B"},
    {"student": "S005", "dept": "Stats", "grade": "A"},
    {"student": "S006", "dept": "CS", "grade": "C+"},
    {"student": "S007", "dept": "Stats", "grade": "B+"},
    {"student": "S008", "dept": "CS", "grade": "A-"},
]
  1. Create a grade-point mapping dictionary: {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7, "C+": 2.3, "C": 2.0}
  2. Calculate the average GPA for CS students vs. Statistics students
  3. Which department has the higher average? (This is the kind of question Jordan investigates in the anchor example.)
Guidance
gp_map = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7, "C+": 2.3, "C": 2.0}

cs_grades = [gp_map[g["grade"]] for g in grades if g["dept"] == "CS"]
stats_grades = [gp_map[g["grade"]] for g in grades if g["dept"] == "Stats"]

cs_avg = sum(cs_grades) / len(cs_grades)
stats_avg = sum(stats_grades) / len(stats_grades)

print(f"CS average: {cs_avg:.2f}")
print(f"Stats average: {stats_avg:.2f}")

Exercise 5.19: JSON data exploration

Write a Python script that:
  1. Creates a JSON file called team_roster.json containing a dictionary with keys "team_name", "sport", "season", and "players" (a list of dictionaries, each with "name", "number", and "position")
  2. Reads the file back
  3. Prints the team name and the number of players
  4. Prints the name and position of each player

Use at least 5 players with realistic data.

Guidance Follow the JSON writing pattern from Section 5.7 (`json.dump` with `indent=2`) and the reading pattern from Section 5.6 (`json.load`). The nested access follows the same principles as Exercise 5.13.
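A minimal sketch of the write-then-read round trip (the team and player data are invented, and a full solution needs at least 5 players):

```python
import json

# Invented roster; the exercise asks for at least 5 players
roster = {
    "team_name": "River City Hawks",
    "sport": "basketball",
    "season": "2024-25",
    "players": [
        {"name": "T. Alvarez", "number": 7, "position": "PG"},
        {"name": "K. Osei", "number": 23, "position": "C"},
    ],
}

# Write with indent=2 so the file is human-readable
with open("team_roster.json", "w") as f:
    json.dump(roster, f, indent=2)

# Read it back; json.load reconstructs the nested structure
with open("team_roster.json", "r") as f:
    data = json.load(f)

print(data["team_name"], "-", len(data["players"]), "players")
for player in data["players"]:
    print(player["name"], player["position"])
```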

Exercise 5.20: Marcus's sales analysis

Marcus has daily sales records for his store:

sales = [
    {"date": "2024-03-01", "product": "Widget A", "quantity": 15, "unit_price": 9.99},
    {"date": "2024-03-01", "product": "Widget B", "quantity": 8, "unit_price": 14.99},
    {"date": "2024-03-02", "product": "Widget A", "quantity": 22, "unit_price": 9.99},
    {"date": "2024-03-02", "product": "Widget C", "quantity": 5, "unit_price": 24.99},
    {"date": "2024-03-03", "product": "Widget B", "quantity": 12, "unit_price": 14.99},
    {"date": "2024-03-03", "product": "Widget A", "quantity": 18, "unit_price": 9.99},
    {"date": "2024-03-03", "product": "Widget C", "quantity": 3, "unit_price": 24.99},
]
  1. Calculate total revenue (quantity * unit_price) for each record (add a "revenue" key)
  2. Calculate total revenue per product using a dictionary
  3. Find which product generated the most total revenue
  4. Calculate total revenue per day
  5. Write the enriched data (with the revenue column) to a CSV file
Guidance
# 1
for record in sales:
    record["revenue"] = record["quantity"] * record["unit_price"]

# 2
product_revenue = {}
for record in sales:
    product = record["product"]
    product_revenue[product] = product_revenue.get(product, 0) + record["revenue"]
print(product_revenue)

# 3
best_product = max(product_revenue, key=product_revenue.get)
print(f"Top product: {best_product} (${product_revenue[best_product]:.2f})")

# 4 and 5 follow similar patterns
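One way parts 4 and 5 could look, assuming part 1 has already added the `revenue` key (the data here is abbreviated to three records):

```python
import csv

# Assumes each record already has a computed "revenue" key (part 1)
sales = [
    {"date": "2024-03-01", "product": "Widget A", "revenue": 149.85},
    {"date": "2024-03-01", "product": "Widget B", "revenue": 119.92},
    {"date": "2024-03-02", "product": "Widget A", "revenue": 219.78},
]

# Part 4: total revenue per day, using the same .get() accumulation
daily_revenue = {}
for record in sales:
    day = record["date"]
    daily_revenue[day] = daily_revenue.get(day, 0) + record["revenue"]
print(daily_revenue)

# Part 5: write the enriched records out with csv.DictWriter
with open("sales_enriched.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["date", "product", "revenue"])
    writer.writeheader()
    writer.writerows(sales)
```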

Part D: Synthesis & Critical Thinking ⭐⭐⭐

These problems require you to connect ideas, compare approaches, or design solutions.


Exercise 5.21: Data structure trade-offs

You are building a simple contact book that stores people's names and phone numbers. Compare three implementations:

  1. A list of tuples: [("Alice", "555-0101"), ("Bob", "555-0102"), ...]
  2. A dictionary: {"Alice": "555-0101", "Bob": "555-0102", ...}
  3. A list of dictionaries: [{"name": "Alice", "phone": "555-0101"}, ...]

For each operation below, state which implementation makes it easiest and explain why:
  - Look up Alice's phone number
  - Add a new contact
  - Check if "Carol" is in the contact book
  - Add a second phone number for Alice (home and work)
  - Sort contacts by name

Guidance
  - **Lookup:** Dictionary is fastest — O(1) by key. List of tuples requires searching.
  - **Add:** All three are easy (append or assign).
  - **Check membership:** Dictionary (`"Carol" in contacts`) is fastest.
  - **Second number:** List of dictionaries is most flexible — each dict can have a `"phones"` list. The simple dictionary would need to change its value type from string to list.
  - **Sort:** List of tuples and list of dicts are easy to sort. Dictionaries maintain insertion order in Python 3.7+ but are not designed for ordered retrieval by position.

The deeper point: there is no universally "best" structure. The right choice depends on which operations you do most.

Exercise 5.22: Designing a mini-database

Design a data structure to represent a university's course catalog. It should support these queries efficiently:
  - "What courses does the CS department offer?"
  - "Who teaches Stats 201?"
  - "What are the prerequisites for Data Science 110?"
  - "How many total courses are offered?"

Sketch your data structure in Python (you do not need to populate it with a lot of data — 3-4 courses are enough). Then show Python code that answers each of the four queries.

Guidance One effective design uses a dictionary of dictionaries keyed by department, with each course as a nested dictionary:
catalog = {
    "CS": {
        "CS 101": {"title": "Intro to CS", "instructor": "Dr. Park", "prereqs": []},
        "CS 201": {"title": "Data Structures", "instructor": "Dr. Lee", "prereqs": ["CS 101"]},
    },
    "Stats": {
        "Stats 201": {"title": "Statistical Methods", "instructor": "Dr. Ramirez", "prereqs": ["Stats 101"]},
    },
    "Data Science": {
        "DS 110": {"title": "Intro to Data Science", "instructor": "Dr. Park", "prereqs": ["CS 101", "Stats 101"]},
    }
}

# Query 1
print(list(catalog["CS"].keys()))
# Query 2
print(catalog["Stats"]["Stats 201"]["instructor"])
# Query 3
print(catalog["Data Science"]["DS 110"]["prereqs"])
# Query 4
total = sum(len(courses) for courses in catalog.values())
Other designs are valid. The key is explaining *why* your design supports the required queries.

Exercise 5.23: From spreadsheet to code

Here is a small spreadsheet of data:

City         State   Population   Area_sq_mi
New York     NY      8336817      302.6
Los Angeles  CA      3979576      468.7
Chicago      IL      2693976      227.6
Houston      TX      2304580      671.7
  1. Represent this as a list of dictionaries
  2. Write a function that calculates the population density (population / area) for each city and adds it as a new key
  3. Which city has the highest population density? Write code to find out.
  4. Write a dictionary comprehension mapping city names to population density
  5. Write the enriched data to a CSV file
Guidance This exercise integrates nearly every concept from the chapter: list of dicts, adding keys, loops or comprehensions, dict comprehensions, and csv.DictWriter. Work through it step by step. The answer to "highest density" should be New York (about 27,550 people per square mile).

Exercise 5.24: Comparing file formats

Given the same data (a list of 3 countries with name, region, and vaccination rate), write it to both a CSV file and a JSON file. Then:

  1. Compare the file sizes (use os.path.getsize())
  2. Read each file back and verify you get the same data
  3. In 3-4 sentences, discuss: when would you choose CSV over JSON, and vice versa?
Guidance CSV is more compact for simple tabular data and widely supported by spreadsheet software. JSON handles nested/hierarchical data naturally and is the standard for web APIs. CSV is better for flat tables; JSON is better for complex structures. Both are human-readable text formats.
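A sketch of the size comparison in step 1, using three invented records and file names:

```python
import csv
import json
import os

# Invented sample data for the comparison
countries = [
    {"name": "Brazil", "region": "Americas", "rate": 72.3},
    {"name": "Chad", "region": "Africa", "rate": 41.7},
    {"name": "Denmark", "region": "Europe", "rate": 93.2},
]

# Same data, two formats
with open("countries_demo.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "region", "rate"])
    writer.writeheader()
    writer.writerows(countries)

with open("countries_demo.json", "w") as f:
    json.dump(countries, f, indent=2)

# CSV states each key once (in the header); JSON repeats keys per record
print("CSV bytes: ", os.path.getsize("countries_demo.csv"))
print("JSON bytes:", os.path.getsize("countries_demo.json"))
```

Expect the JSON file to be noticeably larger here, because every record repeats the key names and the indentation adds whitespace.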

Exercise 5.25: The data pipeline

Write a complete data processing pipeline that:
  1. Reads a CSV file containing student names and test scores
  2. Calculates each student's average score
  3. Classifies each student as "pass" (average >= 60) or "fail"
  4. Writes a new CSV file with the original data plus two new columns: average and status

First, create the input CSV file with at least 6 students and 3 test score columns each. Then process it.

Guidance This is a synthesis exercise combining file reading, data processing with dictionaries, and file writing. The key steps are:
  1. Use `csv.DictReader` to read
  2. Convert score strings to floats
  3. Compute the average
  4. Add `"average"` and `"status"` keys to each record
  5. Use `csv.DictWriter` with updated fieldnames to write
This read-process-write pattern is the foundation of all data pipeline work you will do later.
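The read-process-write steps, condensed into a sketch (file names and the two-student dataset are invented; the exercise asks for at least six students):

```python
import csv

# Step 0: create an input file (invented data, smaller than the exercise asks)
with open("scores_in.csv", "w", newline="") as f:
    f.write("name,test1,test2,test3\nAda,90,85,95\nBob,40,55,50\n")

# Read, convert strings to floats, compute the average, classify
with open("scores_in.csv", "r") as f:
    records = list(csv.DictReader(f))

for r in records:
    scores = [float(r[k]) for k in ("test1", "test2", "test3")]
    r["average"] = sum(scores) / len(scores)
    r["status"] = "pass" if r["average"] >= 60 else "fail"

# Write with the two new columns appended to the fieldnames
fields = ["name", "test1", "test2", "test3", "average", "status"]
with open("scores_out.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(records)
```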

Part M: Mixed Practice (Chapters 1-4 Review) ⭐⭐

These problems blend current and previous material to build cumulative fluency.


Exercise 5.26: Lifecycle with dictionaries (Chapter 1 + Chapter 5)

Map Marcus's small-business analysis onto the data science lifecycle (from Chapter 1). For each stage, describe both (a) what Marcus would do and (b) which Python data structure from this chapter he would likely use. For example: "In the data collection stage, Marcus would export his POS data. He would store each transaction as a dictionary with keys for date, product, quantity, and price."

Guidance Walk through all six stages:
  1. Question formulation (no data structure needed)
  2. Data collection (list of dictionaries from CSV)
  3. Data cleaning (sets for finding unique values, dictionaries for mapping corrections)
  4. Exploration (loops over list of dicts, comprehensions for filtering)
  5. Modeling (dictionaries for storing results)
  6. Communication (writing results to files)
The specific structures you choose should be justified.

Exercise 5.27: Functions with data structures (Chapter 4 + Chapter 5)

Write the following functions:

  1. lookup_region(country, mapping) — takes a country name and a country-to-region dictionary, returns the region or "Unknown" if the country is not found
  2. filter_by_region(records, region) — takes a list of dictionaries and a region name, returns a list of records matching that region
  3. compute_average(records, key) — takes a list of dictionaries and a key name (e.g., "vaccination_rate"), returns the average value for that key

Test all three functions with sample data.

Guidance
def lookup_region(country, mapping):
    return mapping.get(country, "Unknown")

def filter_by_region(records, region):
    return [r for r in records if r["region"] == region]

def compute_average(records, key):
    values = [r[key] for r in records]
    return sum(values) / len(values) if values else 0

Exercise 5.28: Type conversion meets data structures (Chapter 3 + Chapter 5)

The following data was read from a CSV file, so all values are strings:

raw_data = [
    {"country": "Brazil", "population": "214000000", "rate": "72.3", "vaccinated": "True"},
    {"country": "Chad", "population": "17400000", "rate": "41.7", "vaccinated": "False"},
]

Write a function clean_record(record) that returns a new dictionary with: - population converted to int - rate converted to float - vaccinated converted to bool (careful: bool("False") is True!) - country left as a string

Test it and verify the types are correct.

Guidance The `bool` conversion is the tricky part. `bool("False")` returns `True` because any non-empty string is truthy. Use a comparison instead:
def clean_record(record):
    return {
        "country": record["country"],
        "population": int(record["population"]),
        "rate": float(record["rate"]),
        "vaccinated": record["vaccinated"] == "True"
    }

Exercise 5.29: Conditionals inside comprehensions (Chapter 4 + Chapter 5)

Write a single list comprehension that takes a list of vaccination rates and produces a list of category strings:

rates = [72.3, 85.1, 41.7, 93.2, 68.5, 55.0, 12.8]
# Expected: ["medium", "high", "low", "high", "medium", "medium", "low"]

Where: high >= 80, medium >= 50, low < 50.

Hint: You can use a conditional expression (ternary) inside a comprehension: "high" if x >= 80 else "medium" if x >= 50 else "low".

Guidance
categories = ["high" if r >= 80 else "medium" if r >= 50 else "low" for r in rates]
This works but is at the edge of readability. For more complex logic, a helper function called inside the comprehension is cleaner:
def categorize(rate):
    if rate >= 80: return "high"
    elif rate >= 50: return "medium"
    else: return "low"

categories = [categorize(r) for r in rates]

Exercise 5.30: Jupyter notebook narrative (Chapter 2 + Chapter 5)

Create a Jupyter notebook called ch5_exploration.ipynb that tells a story. Use Markdown cells to explain what you are doing and why. The notebook should:

  1. Start with a title and a one-paragraph introduction
  2. Create a list of dictionaries representing at least 8 countries with name, region, vaccination_rate, and population
  3. Use a comprehension to extract country names
  4. Use a loop to find the country with the highest vaccination rate
  5. Use a dictionary to count countries per region
  6. Write the data to a CSV file and read it back
  7. End with a "Findings" section summarizing what you observed

This exercise practices the notebook as narrative, which you will do extensively starting in Chapter 6.

Guidance The code components use techniques from this chapter. The key addition is the Markdown narrative: explain *why* you are performing each step, not just *what* you are doing. A good notebook reads like a report, not a code dump. Revisit Chapter 2's discussion of Markdown cells if needed.

Part E: Research & Extension ⭐⭐⭐⭐

These are open-ended projects that go beyond the chapter. Spend 30-60 minutes on one.


Exercise 5.31: Beyond built-in: collections module

Research Python's collections module, specifically Counter, defaultdict, and OrderedDict. For each:
  1. Describe what it does in one sentence
  2. Write a short code example using data-science-relevant data
  3. Explain when you would use it instead of a plain dict

Guidance
  - `Counter` is a dictionary subclass for counting hashable objects. Example: `Counter(["A", "B", "A", "C", "A"])` gives `Counter({"A": 3, "B": 1, "C": 1})`. Use instead of the `.get(key, 0) + 1` pattern.
  - `defaultdict` is a dictionary that provides default values for missing keys. Example: `defaultdict(list)` lets you append without checking if a key exists. Use for the "invert mapping" pattern in Exercise 5.17.
  - `OrderedDict` was historically needed for order-preserving dictionaries (before Python 3.7). It is less necessary now but still useful for its `move_to_end()` method.
See the official Python docs: https://docs.python.org/3/library/collections.html
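Minimal sketches of the first two, using region data in the style of this chapter's examples:

```python
from collections import Counter, defaultdict

# Counter replaces the counts.get(key, 0) + 1 pattern from Exercise 5.11
regions = ["Americas", "Africa", "Americas", "Europe", "Africa", "Africa"]
counts = Counter(regions)
print(counts["Africa"])        # 3
print(counts.most_common(1))   # [('Africa', 3)]

# defaultdict(list) replaces the "check, then create empty list" pattern
# from Exercise 5.17's invert_mapping
pairs = [("Americas", "Brazil"), ("Africa", "Chad"), ("Africa", "Ethiopia")]
by_region = defaultdict(list)
for region, country in pairs:
    by_region[region].append(country)  # no existence check needed
print(dict(by_region))
```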

Exercise 5.32: Real data challenge

Find a publicly available CSV file from a real source (e.g., data.gov, the World Bank, Kaggle, or your university's open data portal). Choose a small one — under 100 rows. Then:

  1. Download it
  2. Read it into a list of dictionaries using csv.DictReader
  3. Answer at least two specific questions about the data using the techniques from this chapter
  4. Write a cleaned or enriched version to a new CSV file

Document your work in a Jupyter notebook with Markdown explanations.

Guidance Good sources for small, beginner-friendly datasets:
  - World Bank Open Data (data.worldbank.org) — search for a specific indicator
  - WHO Global Health Observatory (gho.who.int) — health statistics by country
  - data.gov — US government open data
  - Kaggle (kaggle.com/datasets) — filter for small, CSV-format datasets
The goal is to practice the read-process-write pipeline with real, messy data. Expect to encounter issues like missing values, unexpected column names, or inconsistent formatting. That is the point.

End of Chapter 5 Exercises. If Parts A and B felt comfortable and Parts C and D stretched you, you are in exactly the right place. The fluency you are building with data structures will pay dividends starting in the very next chapter, when you load your first real dataset.