Chapter 5 Exercises: Working with Data Structures
How to use these exercises: Work through the sections in order. Each section builds on the previous one, moving from recall through application to synthesis. Type every code exercise into a Jupyter cell and run it — reading code is not the same as writing it.
Difficulty key: ⭐ Foundational | ⭐⭐ Intermediate | ⭐⭐⭐ Advanced | ⭐⭐⭐⭐ Extension
Part A: Conceptual Understanding ⭐
These questions check whether you absorbed the core ideas from the chapter. Aim for clear, concise answers.
Exercise 5.1 — Choosing the right structure
For each scenario below, state which data structure (list, dictionary, set, or tuple) is the best choice. Explain your reasoning in one sentence.
- Storing the names of all countries in a dataset, in the order they appear
- Mapping country names to their ISO 3166-1 alpha-3 codes (e.g., "Brazil" to "BRA")
- Finding all unique vaccine manufacturers mentioned in a dataset
- Representing a fixed pair of latitude and longitude coordinates
- Storing a patient's medical record with named fields like "name," "age," and "blood_type"
Guidance
1. **List** — You need ordered data and may have duplicates if the same country appears in multiple rows.
2. **Dictionary** — You need a name-to-value mapping with fast lookup.
3. **Set** — You need uniqueness, and order does not matter.
4. **Tuple** — Coordinates are fixed data that should not change; tuples also work as dictionary keys.
5. **Dictionary** — Named fields (keys) make the data self-documenting and easy to access.
Exercise 5.2 — Mutable vs. immutable
Without running the code, predict what each snippet will print. Then run it to check.
# Snippet A
a = [1, 2, 3]
b = a
b.append(4)
print(a)
# Snippet B
x = (1, 2, 3)
y = x
# y.append(4) — What would happen if you uncommented this line?
print(x)
# Snippet C
s = "hello"
s.upper()
print(s)
Guidance
- **Snippet A:** Prints `[1, 2, 3, 4]`. `b = a` does not copy the list — both `a` and `b` point to the same list object. Modifying through `b` also changes `a`.
- **Snippet B:** Prints `(1, 2, 3)`. Uncommenting the append line would raise `AttributeError` because tuples are immutable and do not have an `append` method.
- **Snippet C:** Prints `hello`. `s.upper()` returns a new string `"HELLO"` but does not modify `s`. Strings are immutable. You would need `s = s.upper()` to update the variable.
Exercise 5.3 — Dictionary access patterns
Given this dictionary:
student = {
"name": "Jordan Kim",
"major": "Data Science",
"gpa": 3.7,
"courses": ["Stats 101", "CS 110", "Data Ethics"]
}
Write the Python expression to access each of the following (do not use variables — write the full expression):
- Jordan's major
- Jordan's third course
- The number of courses Jordan is taking
- Whether Jordan's GPA is above 3.5 (should evaluate to `True` or `False`)
Guidance
1. `student["major"]` → `"Data Science"`
2. `student["courses"][2]` → `"Data Ethics"`
3. `len(student["courses"])` → `3`
4. `student["gpa"] > 3.5` → `True`
Exercise 5.4 — Comprehension anatomy
Rewrite each list comprehension as an equivalent for loop with .append(). Then rewrite each for loop as an equivalent list comprehension.
# Comprehension 1 — rewrite as a loop
doubled = [n * 2 for n in [5, 10, 15, 20]]
# Comprehension 2 — rewrite as a loop
short_words = [w for w in ["data", "is", "powerful", "and", "fun"] if len(w) <= 3]
# Loop 1 — rewrite as a comprehension
result = []
for temp_c in [0, 20, 37, 100]:
temp_f = temp_c * 9/5 + 32
result.append(temp_f)
# Loop 2 — rewrite as a comprehension
names = []
for record in [{"name": "A", "score": 90}, {"name": "B", "score": 55}]:
if record["score"] >= 60:
names.append(record["name"])
Guidance
# Comprehension 1 as loop
doubled = []
for n in [5, 10, 15, 20]:
doubled.append(n * 2)
# Comprehension 2 as loop
short_words = []
for w in ["data", "is", "powerful", "and", "fun"]:
if len(w) <= 3:
short_words.append(w)
# Loop 1 as comprehension
result = [temp_c * 9/5 + 32 for temp_c in [0, 20, 37, 100]]
# Loop 2 as comprehension
names = [record["name"] for record in [{"name": "A", "score": 90}, {"name": "B", "score": 55}]
if record["score"] >= 60]
Exercise 5.5 — Reading the error message
Each code snippet below produces an error. Without running the code, identify: (a) the error type, and (b) how to fix it. Then run the code to verify.
# Snippet A
patient = {"name": "Elena", "age": 31}
print(patient["Name"])
# Snippet B
data = [10, 20, 30]
print(data[3])
# Snippet C
coordinates = (40.7, -74.0)
coordinates[0] = 41.0
# Snippet D
import csv
with open("nonexistent_file.csv", "r") as f:
reader = csv.reader(f)
Guidance
- **A:** `KeyError: 'Name'` — keys are case-sensitive. Fix: `patient["name"]` (lowercase n).
- **B:** `IndexError: list index out of range` — index 3 does not exist (valid indices are 0, 1, 2). Fix: `data[2]` for the last item, or `data[-1]`.
- **C:** `TypeError: 'tuple' object does not support item assignment` — tuples are immutable. Fix: create a new tuple, e.g., `coordinates = (41.0, coordinates[1])`.
- **D:** `FileNotFoundError` — the file does not exist. Fix: use a filename that actually exists, or create the file first.
Exercise 5.6 — The file reading pattern
Fill in the blanks in this description of the file-reading pattern (write your answers before checking):
"To read a CSV file in Python, we import the _ module. We open the file using the _ function, typically inside a _ statement to ensure the file is closed automatically. For CSV files, we create a _ object (or a _ for automatic column-name mapping). Each row from csv.reader is returned as a _. Each row from csv.DictReader is returned as a _. All values from CSV files are always _, so numeric values must be converted using float() or int()."
Guidance
`csv`; `open()`; `with`; `csv.reader`; `csv.DictReader`; list (of strings); dictionary; strings.
Exercise 5.7 — Set operations in plain English
Given:
enrolled_2024 = {"Alice", "Bob", "Carol", "Dave", "Eve"}
enrolled_2025 = {"Bob", "Carol", "Frank", "Grace"}
Describe in plain English what each expression computes, then verify by running the code:
enrolled_2024 & enrolled_2025
enrolled_2024 - enrolled_2025
enrolled_2025 - enrolled_2024
enrolled_2024 | enrolled_2025
enrolled_2024 ^ enrolled_2025
Guidance
1. Students enrolled in *both* years: `{"Bob", "Carol"}`
2. Students who left (in 2024 but not 2025): `{"Alice", "Dave", "Eve"}`
3. New students (in 2025 but not 2024): `{"Frank", "Grace"}`
4. All students ever enrolled: `{"Alice", "Bob", "Carol", "Dave", "Eve", "Frank", "Grace"}`
5. Students in exactly one year (not both): `{"Alice", "Dave", "Eve", "Frank", "Grace"}`
Part B: Applied Practice ⭐⭐
These problems require you to write code. Work in Jupyter and test your solutions.
Exercise 5.8 — Building a data record
Create a dictionary representing a single NBA player with the following fields: name, team, position, games_played, points_per_game, rebounds_per_game, assists_per_game, and three_point_pct. Use realistic values for any player you choose (or invent one). Then write a function player_summary(player_dict) that takes this dictionary and returns a formatted string like:
"LeBron James (LAL) — 25.7 PPG, 7.3 RPG, 8.3 APG"
Guidance
player = {
"name": "LeBron James",
"team": "LAL",
"position": "SF",
"games_played": 71,
"points_per_game": 25.7,
"rebounds_per_game": 7.3,
"assists_per_game": 8.3,
"three_point_pct": 0.410
}
def player_summary(p):
return (f"{p['name']} ({p['team']}) — "
f"{p['points_per_game']} PPG, "
f"{p['rebounds_per_game']} RPG, "
f"{p['assists_per_game']} APG")
print(player_summary(player))
Exercise 5.9 — Filtering a list of dictionaries
Given the following dataset:
students = [
{"name": "Alice", "major": "CS", "gpa": 3.8},
{"name": "Bob", "major": "Data Science", "gpa": 3.2},
{"name": "Carol", "major": "CS", "gpa": 3.9},
{"name": "Dave", "major": "Statistics", "gpa": 2.7},
{"name": "Eve", "major": "Data Science", "gpa": 3.5},
{"name": "Frank", "major": "CS", "gpa": 3.1},
]
Write code to:
1. Create a list of names of all students with a GPA of 3.5 or higher (use a list comprehension)
2. Create a list of all Data Science majors (use a list comprehension)
3. Calculate the average GPA across all students
4. Find the student with the highest GPA (without using max() — write a loop)
Guidance
# 1
honor_roll = [s["name"] for s in students if s["gpa"] >= 3.5]
# ["Alice", "Carol", "Eve"]
# 2
ds_majors = [s["name"] for s in students if s["major"] == "Data Science"]
# ["Bob", "Eve"]
# 3
avg_gpa = sum(s["gpa"] for s in students) / len(students)
# 3.3666...
# 4
best = students[0]
for s in students[1:]:
if s["gpa"] > best["gpa"]:
best = s
print(f"{best['name']}: {best['gpa']}") # Carol: 3.9
Exercise 5.10 — Dictionary from two lists
Given:
countries = ["Brazil", "Canada", "Chad", "Denmark", "Ethiopia"]
rates = [72.3, 85.1, 41.7, 93.2, 68.5]
Use zip() and a dictionary comprehension to create a dictionary mapping each country to its rate. Then use this dictionary to look up Denmark's rate and print it.
Guidance
country_rates = {country: rate for country, rate in zip(countries, rates)}
print(country_rates["Denmark"]) # 93.2
`zip()` pairs up corresponding elements from two sequences. The comprehension `{k: v for k, v in zip(...)}` builds a dictionary from those pairs. You could also write `dict(zip(countries, rates))` for the same result.
Exercise 5.11 — Counting with dictionaries
Write a function count_by_region(records) that takes a list of dictionaries (each with a "region" key) and returns a dictionary mapping each region to the number of countries in that region.
Test it with:
data = [
{"country": "Brazil", "region": "Americas"},
{"country": "Canada", "region": "Americas"},
{"country": "Chad", "region": "Africa"},
{"country": "Denmark", "region": "Europe"},
{"country": "Ethiopia", "region": "Africa"},
{"country": "France", "region": "Europe"},
]
Expected output: {"Americas": 2, "Africa": 2, "Europe": 2}
Guidance
def count_by_region(records):
counts = {}
for record in records:
region = record["region"]
counts[region] = counts.get(region, 0) + 1
return counts
print(count_by_region(data))
The `.get(region, 0)` pattern is essential: it returns the current count if the region is already in the dictionary, or 0 if it is the first time seeing that region.
Exercise 5.12 — Write and read a CSV
- Create a list of at least 5 dictionaries representing books (with keys "title", "author", "year", "pages").
- Write this data to a file called books.csv using csv.DictWriter.
- Read the file back using csv.DictReader and print each book's title and year.
- Verify that the year values you read back are strings. Convert them to integers and compute the average publication year.
Guidance
import csv
books = [
{"title": "Weapons of Math Destruction", "author": "Cathy O'Neil", "year": "2016", "pages": "272"},
{"title": "The Signal and the Noise", "author": "Nate Silver", "year": "2012", "pages": "544"},
{"title": "Python for Data Analysis", "author": "Wes McKinney", "year": "2022", "pages": "579"},
{"title": "The Art of Statistics", "author": "David Spiegelhalter", "year": "2019", "pages": "426"},
{"title": "Factfulness", "author": "Hans Rosling", "year": "2018", "pages": "352"},
]
# Write
with open("books.csv", "w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=["title", "author", "year", "pages"])
writer.writeheader()
for book in books:
writer.writerow(book)
# Read
with open("books.csv", "r") as f:
reader = csv.DictReader(f)
read_books = list(reader)
for book in read_books:
print(f"{book['title']} ({book['year']})")
print(f" year type: {type(book['year'])}") # <class 'str'>
# Average year
years = [int(book["year"]) for book in read_books]
print(f"Average publication year: {sum(years) / len(years):.0f}")
Exercise 5.13 — Nested data access
Given this nested structure:
university = {
"name": "State University",
"departments": {
"Computer Science": {
"faculty_count": 45,
"courses": ["CS 101", "CS 201", "CS 301", "Data Science 110"],
"chair": "Dr. Park"
},
"Statistics": {
"faculty_count": 28,
"courses": ["Stats 101", "Stats 201", "Bayesian Methods"],
"chair": "Dr. Ramirez"
},
"Mathematics": {
"faculty_count": 52,
"courses": ["Calc I", "Calc II", "Linear Algebra", "Probability"],
"chair": "Dr. Chen"
}
}
}
Write expressions to access: 1. The name of the Statistics department chair 2. The second course in Computer Science 3. The total number of faculty across all departments 4. A list of all department names 5. All courses across all departments combined into a single list
Guidance
# 1
print(university["departments"]["Statistics"]["chair"]) # "Dr. Ramirez"
# 2
print(university["departments"]["Computer Science"]["courses"][1]) # "CS 201"
# 3
total = sum(dept["faculty_count"] for dept in university["departments"].values())
print(total) # 125
# 4
print(list(university["departments"].keys()))
# ["Computer Science", "Statistics", "Mathematics"]
# 5
all_courses = []
for dept in university["departments"].values():
all_courses.extend(dept["courses"])
print(all_courses)
# or as a comprehension:
all_courses = [course for dept in university["departments"].values()
for course in dept["courses"]]
Exercise 5.14 — Building a frequency table
Write a function frequency_table(items) that takes a list and returns a dictionary mapping each unique item to its count. Test with:
colors = ["red", "blue", "red", "green", "blue", "red", "blue", "green", "red"]
print(frequency_table(colors))
# {"red": 4, "blue": 3, "green": 2}
Then sort the result by count (highest first) and print it. Hint: sorted() can take a key argument.
Guidance
def frequency_table(items):
counts = {}
for item in items:
counts[item] = counts.get(item, 0) + 1
return counts
result = frequency_table(colors)
print(result)
# Sort by count, descending
for color, count in sorted(result.items(), key=lambda x: x[1], reverse=True):
print(f"{color}: {count}")
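For comparison, the standard library's `collections.Counter` (which you will meet again in Exercise 5.31) handles both the counting and the sort-by-count in one step:

```python
from collections import Counter

colors = ["red", "blue", "red", "green", "blue", "red", "blue", "green", "red"]
counts = Counter(colors)  # a dict subclass: Counter({'red': 4, 'blue': 3, 'green': 2})

# most_common() returns (item, count) pairs sorted by count, descending
for color, count in counts.most_common():
    print(f"{color}: {count}")
```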
Exercise 5.15 — Set-based data cleaning
You have two lists of country names from different data sources. They should contain the same countries but may not:
source_a = ["Brazil", "Canada", "Chad", "Denmark", "ethiopia", "France"]
source_b = ["brazil", "Canada", "Chad", "Denmark", "Ethiopia", "France", "Germany"]
- Normalize both lists to lowercase
- Find countries in both sources
- Find countries in source A but not source B
- Find countries in source B but not source A
- Combine all unique countries into a single sorted list
Guidance
set_a = {c.lower() for c in source_a}
set_b = {c.lower() for c in source_b}
print("In both:", set_a & set_b)
print("Only A:", set_a - set_b)
print("Only B:", set_b - set_a)
print("All:", sorted(set_a | set_b))
Part C: Real-World Application ⭐⭐-⭐⭐⭐
These exercises connect chapter concepts to realistic data scenarios.
Exercise 5.16 — Weather data processing
Create a list of dictionaries representing 7 days of weather data with keys "day" (e.g., "Monday"), "high_temp" (Fahrenheit), "low_temp", and "precipitation" (inches). Then:
- Calculate the average high temperature for the week
- Find the day with the largest temperature range (high minus low)
- Calculate total precipitation for the week
- Create a list of days where it rained (precipitation > 0)
- Write the data to a CSV file and read it back to verify
Guidance
Design the data yourself with realistic values. The coding patterns are the same as Exercises 5.9 and 5.12. The key challenge is combining multiple techniques in sequence — this is what real data work looks like.
Exercise 5.17 — Inverting a mapping
The country_to_region dictionary maps country names to WHO regions. Write a function invert_mapping(mapping) that creates the reverse mapping: region names to lists of countries. For example, "Americas" should map to ["Brazil", "Canada", "Mexico", ...].
country_to_region = {
"Brazil": "Americas",
"Canada": "Americas",
"Chad": "Africa",
"Denmark": "Europe",
"Ethiopia": "Africa",
"France": "Europe",
"India": "South-East Asia",
"Nigeria": "Africa",
}
region_to_countries = invert_mapping(country_to_region)
print(region_to_countries["Africa"]) # ["Chad", "Ethiopia", "Nigeria"]
Guidance
def invert_mapping(mapping):
inverted = {}
for key, value in mapping.items():
if value not in inverted:
inverted[value] = []
inverted[value].append(key)
return inverted
Note that a simple dictionary comprehension will not work here because multiple keys can map to the same value. You need to build lists.
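The same inversion can also be written with `collections.defaultdict` (previewed in Exercise 5.31), which supplies the empty list automatically and removes the membership check — a sketch:

```python
from collections import defaultdict

def invert_mapping(mapping):
    inverted = defaultdict(list)  # a missing key starts as an empty list
    for key, value in mapping.items():
        inverted[value].append(key)
    return dict(inverted)  # convert back to a plain dict

# a small sample of the exercise data
country_to_region = {"Brazil": "Americas", "Chad": "Africa", "Ethiopia": "Africa"}
print(invert_mapping(country_to_region))
# {'Americas': ['Brazil'], 'Africa': ['Chad', 'Ethiopia']}
```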
Exercise 5.18 — Grade distribution analysis
Jordan has grade data for a class:
grades = [
{"student": "S001", "dept": "CS", "grade": "A"},
{"student": "S002", "dept": "CS", "grade": "B+"},
{"student": "S003", "dept": "Stats", "grade": "A-"},
{"student": "S004", "dept": "CS", "grade": "B"},
{"student": "S005", "dept": "Stats", "grade": "A"},
{"student": "S006", "dept": "CS", "grade": "C+"},
{"student": "S007", "dept": "Stats", "grade": "B+"},
{"student": "S008", "dept": "CS", "grade": "A-"},
]
- Create a grade-point mapping dictionary: {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7, "C+": 2.3, "C": 2.0}
- Calculate the average GPA for CS students vs. Statistics students
- Which department has the higher average? (This is the kind of question Jordan investigates in the anchor example.)
Guidance
gp_map = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7, "C+": 2.3, "C": 2.0}
cs_grades = [gp_map[g["grade"]] for g in grades if g["dept"] == "CS"]
stats_grades = [gp_map[g["grade"]] for g in grades if g["dept"] == "Stats"]
cs_avg = sum(cs_grades) / len(cs_grades)
stats_avg = sum(stats_grades) / len(stats_grades)
print(f"CS average: {cs_avg:.2f}")
print(f"Stats average: {stats_avg:.2f}")
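To answer the final question, compare the two averages. One way to do it, sketched here with `setdefault` to group grade points by department (using the exercise's data):

```python
gp_map = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7, "C+": 2.3, "C": 2.0}
grades = [
    {"student": "S001", "dept": "CS", "grade": "A"},
    {"student": "S002", "dept": "CS", "grade": "B+"},
    {"student": "S003", "dept": "Stats", "grade": "A-"},
    {"student": "S004", "dept": "CS", "grade": "B"},
    {"student": "S005", "dept": "Stats", "grade": "A"},
    {"student": "S006", "dept": "CS", "grade": "C+"},
    {"student": "S007", "dept": "Stats", "grade": "B+"},
    {"student": "S008", "dept": "CS", "grade": "A-"},
]

# Group grade points by department, then average each group
by_dept = {}
for g in grades:
    by_dept.setdefault(g["dept"], []).append(gp_map[g["grade"]])

averages = {dept: sum(pts) / len(pts) for dept, pts in by_dept.items()}
print(averages)  # CS ≈ 3.26, Stats ≈ 3.67 — Statistics has the higher average
```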
Exercise 5.19 — JSON data exploration
Write a Python script that:
1. Creates a JSON file called team_roster.json containing a dictionary with keys "team_name", "sport", "season", and "players" (a list of dictionaries, each with "name", "number", and "position")
2. Reads the file back
3. Prints the team name and the number of players
4. Prints the name and position of each player
Use at least 5 players with realistic data.
Guidance
Follow the JSON writing pattern from Section 5.7 (`json.dump` with `indent=2`) and the reading pattern from Section 5.6 (`json.load`). The nested access follows the same principles as Exercise 5.13.
Exercise 5.20 — Marcus's sales analysis
Marcus has daily sales records for his store:
sales = [
{"date": "2024-03-01", "product": "Widget A", "quantity": 15, "unit_price": 9.99},
{"date": "2024-03-01", "product": "Widget B", "quantity": 8, "unit_price": 14.99},
{"date": "2024-03-02", "product": "Widget A", "quantity": 22, "unit_price": 9.99},
{"date": "2024-03-02", "product": "Widget C", "quantity": 5, "unit_price": 24.99},
{"date": "2024-03-03", "product": "Widget B", "quantity": 12, "unit_price": 14.99},
{"date": "2024-03-03", "product": "Widget A", "quantity": 18, "unit_price": 9.99},
{"date": "2024-03-03", "product": "Widget C", "quantity": 3, "unit_price": 24.99},
]
- Calculate total revenue (quantity * unit_price) for each record (add a "revenue" key)
- Calculate total revenue per product using a dictionary
- Find which product generated the most total revenue
- Calculate total revenue per day
- Write the enriched data (with the revenue column) to a CSV file
Guidance
# 1
for record in sales:
record["revenue"] = record["quantity"] * record["unit_price"]
# 2
product_revenue = {}
for record in sales:
product = record["product"]
product_revenue[product] = product_revenue.get(product, 0) + record["revenue"]
print(product_revenue)
# 3
best_product = max(product_revenue, key=product_revenue.get)
print(f"Top product: {best_product} (${product_revenue[best_product]:.2f})")
# 4 and 5 follow similar patterns
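For completeness, a sketch of steps 4 and 5 — the per-day total uses the same `.get()` accumulation pattern keyed by date, and the output filename `sales_enriched.csv` is just a suggestion:

```python
import csv

sales = [
    {"date": "2024-03-01", "product": "Widget A", "quantity": 15, "unit_price": 9.99},
    {"date": "2024-03-01", "product": "Widget B", "quantity": 8, "unit_price": 14.99},
    {"date": "2024-03-02", "product": "Widget A", "quantity": 22, "unit_price": 9.99},
    {"date": "2024-03-02", "product": "Widget C", "quantity": 5, "unit_price": 24.99},
    {"date": "2024-03-03", "product": "Widget B", "quantity": 12, "unit_price": 14.99},
    {"date": "2024-03-03", "product": "Widget A", "quantity": 18, "unit_price": 9.99},
    {"date": "2024-03-03", "product": "Widget C", "quantity": 3, "unit_price": 24.99},
]
# Step 1 again: enrich each record with a revenue key
for record in sales:
    record["revenue"] = record["quantity"] * record["unit_price"]

# 4 — total revenue per day
daily_revenue = {}
for record in sales:
    day = record["date"]
    daily_revenue[day] = daily_revenue.get(day, 0) + record["revenue"]
print(daily_revenue)

# 5 — write the enriched records, adding "revenue" to the fieldnames
with open("sales_enriched.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["date", "product", "quantity", "unit_price", "revenue"])
    writer.writeheader()
    writer.writerows(sales)
```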
Part D: Synthesis & Critical Thinking ⭐⭐⭐
These problems require you to connect ideas, compare approaches, or design solutions.
Exercise 5.21 — Data structure trade-offs
You are building a simple contact book that stores people's names and phone numbers. Compare three implementations:
- A list of tuples: [("Alice", "555-0101"), ("Bob", "555-0102"), ...]
- A dictionary: {"Alice": "555-0101", "Bob": "555-0102", ...}
- A list of dictionaries: [{"name": "Alice", "phone": "555-0101"}, ...]
For each operation below, state which implementation makes it easiest and explain why:
- Look up Alice's phone number
- Add a new contact
- Check if "Carol" is in the contact book
- Add a second phone number for Alice (home and work)
- Sort contacts by name
Guidance
- **Lookup:** Dictionary is fastest — O(1) by key. List of tuples requires searching.
- **Add:** All three are easy (append or assign).
- **Check membership:** Dictionary (`"Carol" in contacts`) is fastest.
- **Second number:** List of dictionaries is most flexible — each dict can have a `"phones"` list. The simple dictionary would need to change its value type from string to list.
- **Sort:** List of tuples and list of dicts are easy to sort. Dictionaries maintain insertion order in Python 3.7+ but are not designed for ordered retrieval by position.
The deeper point: there is no universally "best" structure. The right choice depends on which operations you do most.
Exercise 5.22 — Designing a mini-database
Design a data structure to represent a university's course catalog. It should support these queries efficiently:
- "What courses does the CS department offer?"
- "Who teaches Stats 201?"
- "What are the prerequisites for Data Science 110?"
- "How many total courses are offered?"
Sketch your data structure in Python (you do not need to populate it with a lot of data — 3-4 courses are enough). Then show Python code that answers each of the four queries.
Guidance
One effective design uses a dictionary of dictionaries keyed by department, with each course as a nested dictionary:
catalog = {
"CS": {
"CS 101": {"title": "Intro to CS", "instructor": "Dr. Park", "prereqs": []},
"CS 201": {"title": "Data Structures", "instructor": "Dr. Lee", "prereqs": ["CS 101"]},
},
"Stats": {
"Stats 201": {"title": "Statistical Methods", "instructor": "Dr. Ramirez", "prereqs": ["Stats 101"]},
},
"Data Science": {
"DS 110": {"title": "Intro to Data Science", "instructor": "Dr. Park", "prereqs": ["CS 101", "Stats 101"]},
}
}
# Query 1
print(list(catalog["CS"].keys()))
# Query 2
print(catalog["Stats"]["Stats 201"]["instructor"])
# Query 3
print(catalog["Data Science"]["DS 110"]["prereqs"])
# Query 4
total = sum(len(courses) for courses in catalog.values())
Other designs are valid. The key is explaining *why* your design supports the required queries.
Exercise 5.23 — From spreadsheet to code
Here is a small spreadsheet of data:
| City | State | Population | Area_sq_mi |
|---|---|---|---|
| New York | NY | 8336817 | 302.6 |
| Los Angeles | CA | 3979576 | 468.7 |
| Chicago | IL | 2693976 | 227.6 |
| Houston | TX | 2304580 | 671.7 |
- Represent this as a list of dictionaries
- Write a function that calculates the population density (population / area) for each city and adds it as a new key
- Which city has the highest population density? Write code to find out.
- Write a dictionary comprehension mapping city names to population density
- Write the enriched data to a CSV file
Guidance
This exercise integrates nearly every concept from the chapter: list of dicts, adding keys, loops or comprehensions, dict comprehensions, and csv.DictWriter. Work through it step by step. The answer to "highest density" should be New York (about 27,551 people per square mile).
Exercise 5.24 — Comparing file formats
Given the same data (a list of 3 countries with name, region, and vaccination rate), write it to both a CSV file and a JSON file. Then:
- Compare the file sizes (use os.path.getsize())
- Read each file back and verify you get the same data
- In 3-4 sentences, discuss: when would you choose CSV over JSON, and vice versa?
Guidance
CSV is more compact for simple tabular data and widely supported by spreadsheet software. JSON handles nested/hierarchical data naturally and is the standard for web APIs. CSV is better for flat tables; JSON is better for complex structures. Both are human-readable text formats.
Exercise 5.25 — The data pipeline
Write a complete data processing pipeline that:
1. Reads a CSV file containing student names and test scores
2. Calculates each student's average score
3. Classifies each student as "pass" (average >= 60) or "fail"
4. Writes a new CSV file with the original data plus two new columns: average and status
First, create the input CSV file with at least 6 students and 3 test score columns each. Then process it.
Guidance
This is a synthesis exercise combining file reading, data processing with dictionaries, and file writing. The key steps are:
1. Use `csv.DictReader` to read
2. Convert score strings to floats
3. Compute the average
4. Add `"average"` and `"status"` keys to each record
5. Use `csv.DictWriter` with updated fieldnames to write
This read-process-write pattern is the foundation of all data pipeline work you will do later.
Part M: Mixed Practice (Chapters 1-4 Review) ⭐⭐
These problems blend current and previous material to build cumulative fluency.
Exercise 5.26 — Lifecycle with dictionaries (Chapter 1 + Chapter 5)
Map Marcus's small-business analysis onto the data science lifecycle (from Chapter 1). For each stage, describe both (a) what Marcus would do and (b) which Python data structure from this chapter he would likely use. For example: "In the data collection stage, Marcus would export his POS data. He would store each transaction as a dictionary with keys for date, product, quantity, and price."
Guidance
Walk through all six stages: question formulation (no data structure needed), data collection (list of dictionaries from CSV), data cleaning (sets for finding unique values, dictionaries for mapping corrections), exploration (loops over list of dicts, comprehensions for filtering), modeling (dictionaries for storing results), communication (writing results to files). The specific structures you choose should be justified.
Exercise 5.27 — Functions with data structures (Chapter 4 + Chapter 5)
Write the following functions:
- lookup_region(country, mapping) — takes a country name and a country-to-region dictionary, returns the region or "Unknown" if the country is not found
- filter_by_region(records, region) — takes a list of dictionaries and a region name, returns a list of records matching that region
- compute_average(records, key) — takes a list of dictionaries and a key name (e.g., "vaccination_rate"), returns the average value for that key
Test all three functions with sample data.
Guidance
def lookup_region(country, mapping):
return mapping.get(country, "Unknown")
def filter_by_region(records, region):
return [r for r in records if r["region"] == region]
def compute_average(records, key):
values = [r[key] for r in records]
return sum(values) / len(values) if values else 0
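A quick test harness for the three functions — the sample mapping and records below are made-up values for illustration:

```python
def lookup_region(country, mapping):
    return mapping.get(country, "Unknown")

def filter_by_region(records, region):
    return [r for r in records if r["region"] == region]

def compute_average(records, key):
    values = [r[key] for r in records]
    return sum(values) / len(values) if values else 0

# hypothetical sample data
mapping = {"Brazil": "Americas", "Chad": "Africa", "Denmark": "Europe"}
records = [
    {"country": "Brazil", "region": "Americas", "vaccination_rate": 72.3},
    {"country": "Chad", "region": "Africa", "vaccination_rate": 41.7},
    {"country": "Canada", "region": "Americas", "vaccination_rate": 85.1},
]

print(lookup_region("Chad", mapping))                              # Africa
print(lookup_region("Atlantis", mapping))                          # Unknown
print(len(filter_by_region(records, "Americas")))                  # 2
print(round(compute_average(records, "vaccination_rate"), 2))      # 66.37
```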
Exercise 5.28 — Type conversion meets data structures (Chapter 3 + Chapter 5)
The following data was read from a CSV file, so all values are strings:
raw_data = [
{"country": "Brazil", "population": "214000000", "rate": "72.3", "vaccinated": "True"},
{"country": "Chad", "population": "17400000", "rate": "41.7", "vaccinated": "False"},
]
Write a function clean_record(record) that returns a new dictionary with:
- population converted to int
- rate converted to float
- vaccinated converted to bool (careful: bool("False") is True!)
- country left as a string
Test it and verify the types are correct.
Guidance
The `bool` conversion is the tricky part. `bool("False")` returns `True` because any non-empty string is truthy. Use a comparison instead:
def clean_record(record):
return {
"country": record["country"],
"population": int(record["population"]),
"rate": float(record["rate"]),
"vaccinated": record["vaccinated"] == "True"
}
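To confirm the fix, inspect the type of each cleaned value — a short check using the exercise's data:

```python
def clean_record(record):
    return {
        "country": record["country"],
        "population": int(record["population"]),
        "rate": float(record["rate"]),
        "vaccinated": record["vaccinated"] == "True",
    }

raw_data = [
    {"country": "Brazil", "population": "214000000", "rate": "72.3", "vaccinated": "True"},
    {"country": "Chad", "population": "17400000", "rate": "41.7", "vaccinated": "False"},
]
cleaned = [clean_record(r) for r in raw_data]
for rec in cleaned:
    print({key: type(value).__name__ for key, value in rec.items()})
# each row prints: {'country': 'str', 'population': 'int', 'rate': 'float', 'vaccinated': 'bool'}
```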
Exercise 5.29 — Conditionals inside comprehensions (Chapter 4 + Chapter 5)
Write a single list comprehension that takes a list of vaccination rates and produces a list of category strings:
rates = [72.3, 85.1, 41.7, 93.2, 68.5, 55.0, 12.8]
# Expected: ["medium", "high", "low", "high", "medium", "medium", "low"]
Where: high >= 80, medium >= 50, low < 50.
Hint: You can use a conditional expression (ternary) inside a comprehension: "high" if x >= 80 else "medium" if x >= 50 else "low".
Guidance
categories = ["high" if r >= 80 else "medium" if r >= 50 else "low" for r in rates]
This works but is at the edge of readability. For more complex logic, a helper function called inside the comprehension is cleaner:
def categorize(rate):
if rate >= 80: return "high"
elif rate >= 50: return "medium"
else: return "low"
categories = [categorize(r) for r in rates]
Exercise 5.30 — Jupyter notebook narrative (Chapter 2 + Chapter 5)
Create a Jupyter notebook called ch5_exploration.ipynb that tells a story. Use Markdown cells to explain what you are doing and why. The notebook should:
- Start with a title and a one-paragraph introduction
- Create a list of dictionaries representing at least 8 countries with name, region, vaccination_rate, and population
- Use a comprehension to extract country names
- Use a loop to find the country with the highest vaccination rate
- Use a dictionary to count countries per region
- Write the data to a CSV file and read it back
- End with a "Findings" section summarizing what you observed
This exercise practices the notebook as narrative, which you will do extensively starting in Chapter 6.
Guidance
The code components use techniques from this chapter. The key addition is the Markdown narrative: explain *why* you are performing each step, not just *what* you are doing. A good notebook reads like a report, not a code dump. Revisit Chapter 2's discussion of Markdown cells if needed.
Part E: Research & Extension ⭐⭐⭐⭐
These are open-ended projects that go beyond the chapter. Spend 30-60 minutes on one.
Exercise 5.31 — Beyond built-in: collections module
Research Python's collections module, specifically Counter, defaultdict, and OrderedDict. For each:
1. Describe what it does in one sentence
2. Write a short code example using data-science-relevant data
3. Explain when you would use it instead of a plain dict
Guidance
- `Counter` is a dictionary subclass for counting hashable objects. Example: `Counter(["A", "B", "A", "C", "A"])` gives `Counter({"A": 3, "B": 1, "C": 1})`. Use instead of the `.get(key, 0) + 1` pattern.
- `defaultdict` is a dictionary that provides default values for missing keys. Example: `defaultdict(list)` lets you append without checking if a key exists. Use for the "invert mapping" pattern in Exercise 5.17.
- `OrderedDict` was historically needed for order-preserving dictionaries (before Python 3.7). It is less necessary now but still useful for its `move_to_end()` method.
See the official Python docs: https://docs.python.org/3/library/collections.html
Exercise 5.32 — Real data challenge
Find a publicly available CSV file from a real source (e.g., data.gov, the World Bank, Kaggle, or your university's open data portal). Choose a small one — under 100 rows. Then:
- Download it
- Read it into a list of dictionaries using csv.DictReader
- Answer at least two specific questions about the data using the techniques from this chapter
- Write a cleaned or enriched version to a new CSV file
Document your work in a Jupyter notebook with Markdown explanations.
Guidance
Good sources for small, beginner-friendly datasets:
- World Bank Open Data (data.worldbank.org) — search for a specific indicator
- WHO Global Health Observatory (gho.who.int) — health statistics by country
- data.gov — US government open data
- Kaggle (kaggle.com/datasets) — filter for small, CSV-format datasets
The goal is to practice the read-process-write pipeline with real, messy data. Expect to encounter issues like missing values, unexpected column names, or inconsistent formatting. That is the point.
End of Chapter 5 Exercises. If Parts A and B felt comfortable and Parts C and D stretched you, you are in exactly the right place. The fluency you are building with data structures will pay dividends starting in the very next chapter, when you load your first real dataset.