> "A program that handles data without the right structure is like a warehouse with no shelves — everything technically fits, but nothing is where you can find it."
In This Chapter
- Why Data Structures Matter: Representing Business Reality in Code
- Lists: Ordered, Mutable Collections
- Tuples: Ordered, Immutable Records
- Dictionaries: Key-Value Stores
- Sets: Unique Value Collections
- Nested Structures: Combining Data Structures
- Choosing the Right Data Structure: A Decision Guide
- Copying vs. Referencing: A Common Source of Bugs
- A Complete Example: The Customer Database
- Summary
Chapter 7: Data Structures — Lists, Tuples, Dictionaries, and Sets
Why Data Structures Matter: Representing Business Reality in Code
Every business runs on data. A sales manager tracks a list of opportunities. A finance analyst stores quarterly results as fixed records. A customer success team maintains profiles for hundreds of clients. An operations director wants to know which market segments overlap between two product lines.
These aren't abstract computer science problems. They're real business needs — and each one maps naturally to a specific Python data structure.
In Chapter 6, you learned to control the flow of a program with loops and conditionals. But flow control only gets you so far. To work with meaningful business data, you need containers — structures that hold collections of information and give you tools to organize, search, and transform them efficiently.
Python provides four built-in data structures that together can represent virtually any business dataset:
| Structure | Ordered? | Mutable? | Unique Values? | Best For |
|---|---|---|---|---|
| List | Yes | Yes | No | Sequences that change — catalogs, queues, rankings |
| Tuple | Yes | No | No | Fixed records — coordinates, configurations, row data |
| Dictionary | Yes* | Yes | Keys must be unique | Named fields — customer records, configs, lookups |
| Set | No | Yes | Yes | Uniqueness checks — market segments, deduplication |
*Dictionaries preserve insertion order in Python 3.7+.
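The properties in the table can be checked directly in an interpreter. The sketch below uses made-up sample values (all variable names are illustrative, not part of Priya's pipeline) to show one property of each structure:

```python
# Illustrative values only; the names below are hypothetical.
catalog = ["Mouse", "Hub"]              # list: ordered and mutable
catalog.append("Mouse")                 # duplicates are allowed

location = (40.7128, -74.0060)          # tuple: ordered but immutable
tuple_can_grow = hasattr(location, "append")   # tuples have no append method

prices = {"Mouse": 59.95}               # dict: keys are unique
prices["Mouse"] = 49.95                 # reassigning a key overwrites its value

segments = {"retail", "retail", "enterprise"}  # set: duplicates collapse

print(catalog)          # ['Mouse', 'Hub', 'Mouse']
print(tuple_can_grow)   # False
print(prices)           # {'Mouse': 49.95}
print(len(segments))    # 2
```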
This chapter takes you from the basics of each structure through real-world patterns you will reach for over and over again in business code.
Meet our guides for this chapter. At Acme Corp, Priya Okonkwo, a business analyst, is building a sales data pipeline. She needs to load, organize, and summarize data before it hits a proper database. Sandra Chen, VP of Sales, keeps asking for different cuts of the data — by region, by product, by week. Marcus Webb in IT is watching to make sure Priya's code doesn't accidentally corrupt shared data structures. And Maya Reyes, a freelance business consultant, is building her own client portfolio tracker with nothing but Python and her laptop.
Let's start with the most fundamental structure: the list.
Lists: Ordered, Mutable Collections
What a List Is
A list in Python is an ordered, mutable sequence of items. "Ordered" means items have a definite position you can refer to by index number. "Mutable" means you can add, remove, or change items after the list is created.
You create a list with square brackets:
# Acme Corp's current product names
product_names = ["Wireless Headset Pro", "Standing Desk Converter", "Ergonomic Mouse", "USB-C Hub 7-Port"]
# Q1 weekly sales totals (USD)
weekly_sales = [58_320.75, 62_410.00, 71_850.25, 55_200.00]
# Active sales territories
territories = ["Northeast", "Southeast", "Midwest", "West"]
# A list can hold mixed types (though this is uncommon in practice)
mixed = ["Thornfield Logistics", 92_100.00, True, 2024]
Lists are zero-indexed: the first item is at position 0, the second at position 1, and so on. This trips up newcomers who think of "first" as 1, but once it clicks, it's intuitive.
Indexing and Slicing
Access individual items with square brackets and an index number:
product_names = ["Wireless Headset Pro", "Standing Desk Converter", "Ergonomic Mouse", "USB-C Hub 7-Port"]
first_product = product_names[0] # "Wireless Headset Pro"
third_product = product_names[2] # "Ergonomic Mouse"
last_product = product_names[-1] # "USB-C Hub 7-Port" — negative indices count from the end
second_to_last = product_names[-2] # "Ergonomic Mouse"
Negative indexing is a Python convenience that counts backward from the end. product_names[-1] always gives you the last item, regardless of how long the list is — something you use constantly in business code when you want "the most recent entry."
Slicing gives you a sub-list. The syntax is list[start:stop], where start is inclusive and stop is exclusive:
weekly_sales = [58_320.75, 62_410.00, 71_850.25, 55_200.00, 68_900.50, 74_200.00]
first_two_weeks = weekly_sales[0:2] # [58320.75, 62410.00]
middle_weeks = weekly_sales[1:4] # [62410.00, 71850.25, 55200.00]
last_three = weekly_sales[-3:] # [55200.00, 68900.50, 74200.00]
all_except_first = weekly_sales[1:] # everything from index 1 to the end
all_except_last = weekly_sales[:-1] # everything up to (not including) the last
full_copy = weekly_sales[:] # a shallow copy of the entire list
Slicing is how Priya splits her 20-day sales dataset into four weekly buckets without any loops:
daily_sales = [12450, 9820, 15300, 8750, 11600,
14200, 7300, 16850, 13100, 10400,
9050, 18700, 11200, 6800, 14500,
13900, 12050, 17200, 9600, 15800]
week_1 = daily_sales[0:5] # Days 1-5
week_2 = daily_sales[5:10] # Days 6-10
week_3 = daily_sales[10:15] # Days 11-15
week_4 = daily_sales[15:20] # Days 16-20
print(f"Week 1 total: ${sum(week_1):,.2f}")
print(f"Week 2 total: ${sum(week_2):,.2f}")
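Two slicing behaviors are worth seeing once before relying on them. The sketch below, using a shortened version of the weekly sales list, shows that a slice running past the end of a list is harmless (unlike an out-of-range index) and that a slice is a new list, so editing it leaves the source alone:

```python
weekly_sales = [58_320.75, 62_410.00, 71_850.25]

# A slice that runs past the end simply stops at the last item.
first_ten = weekly_sales[0:10]

# An out-of-range index, by contrast, raises IndexError.
try:
    weekly_sales[10]
    index_ok = True
except IndexError:
    index_ok = False

# A slice is a new list: changing it leaves the source untouched.
first_two = weekly_sales[:2]
first_two[0] = 0.0

print(len(first_ten))     # 3
print(index_ok)           # False
print(weekly_sales[0])    # 58320.75
```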
Core List Methods
Python lists come with a rich set of built-in methods. Here are the ones you will use most often in business applications:
Adding Items
catalog = ["Wireless Headset Pro", "Ergonomic Mouse"]
# .append() adds one item to the end — O(1), very fast
catalog.append("4K Webcam Ultra")
# catalog is now: ["Wireless Headset Pro", "Ergonomic Mouse", "4K Webcam Ultra"]
# .insert(index, item) adds an item at a specific position — use sparingly
catalog.insert(0, "Collaboration Bundle") # insert at the beginning
catalog.insert(2, "USB-C Hub 7-Port") # insert at index 2
# .extend() adds all items from another iterable to the end
new_products = ["Standing Desk Converter", "Monitor Riser"]
catalog.extend(new_products)
# Equivalent to: for item in new_products: catalog.append(item)
The difference between append and extend confuses many beginners. Remember: append adds one item (even if that item is a list), while extend unpacks an iterable and adds each element:
a = [1, 2, 3]
a.append([4, 5]) # a is now [1, 2, 3, [4, 5]] — the list is nested inside!
b = [1, 2, 3]
b.extend([4, 5]) # b is now [1, 2, 3, 4, 5] — items are merged in
In business code, you almost always want extend when combining two lists and append when adding a single new record.
Removing Items
products = ["Wireless Headset Pro", "Ergonomic Mouse", "USB-C Hub 7-Port", "Monitor Riser"]
# .remove(value) removes the first occurrence of a value — raises ValueError if not found
products.remove("Monitor Riser")
# Safe removal: check first
item_to_remove = "Discontinued Product"
if item_to_remove in products:
    products.remove(item_to_remove)
# .pop(index) removes and returns the item at the given index
# .pop() with no argument removes the last item
last_item = products.pop() # removes and returns "USB-C Hub 7-Port"
first_item = products.pop(0) # removes and returns "Wireless Headset Pro"
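# del removes by index or slice without returning the value
# (start from a fresh list, since the pops above left only one item)
products = ["Wireless Headset Pro", "Ergonomic Mouse", "USB-C Hub 7-Port", "Monitor Riser"]
del products[0] # remove item at index 0
del products[1:3] # remove a slice; products is now ["Ergonomic Mouse"]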
Sorting and Ordering
prices = [149.99, 39.99, 59.95, 249.00, 129.00]
# sorted() returns a NEW list — the original is unchanged
prices_ascending = sorted(prices) # [39.99, 59.95, 129.0, 149.99, 249.0]
prices_descending = sorted(prices, reverse=True)
# .sort() sorts IN PLACE — the original list is modified
prices.sort() # now prices IS sorted ascending
prices.sort(reverse=True) # now prices is sorted descending
# Sort a list of strings alphabetically
products = ["Webcam", "Mouse", "Hub", "Headset"]
products.sort() # alphabetical: ["Headset", "Hub", "Mouse", "Webcam"]
# Custom sort key: sort products by price using a function
catalog = [("Wireless Headset Pro", 149.99), ("Ergonomic Mouse", 59.95), ("USB-C Hub", 39.99)]
catalog_by_price = sorted(catalog, key=lambda item: item[1])
The key parameter of sorted() and .sort() is one of the most powerful tools in Python. It accepts any function that extracts a comparison value from each element. Lambda functions (covered in Chapter 8) are the usual way to provide it inline, but any function works:
def get_price(product_tuple):
    return product_tuple[1]
catalog_by_price = sorted(catalog, key=get_price)
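A key function may also return a tuple, in which case Python compares the first element and falls back to the next one on ties. The sketch below assumes each catalog entry carries a category field, which is a hypothetical addition for illustration, and sorts by category first, then by price from highest to lowest:

```python
# Hypothetical catalog rows: (name, category, price). The categories are made up.
catalog = [
    ("Wireless Headset Pro", "audio", 149.99),
    ("Ergonomic Mouse", "input", 59.95),
    ("USB-C Hub", "connectivity", 39.99),
    ("Trackball", "input", 89.00),
]

def category_then_price_desc(item):
    """Sort key: category A-Z, then price high-to-low within a category."""
    name, category, price = item
    return (category, -price)   # negating the price reverses its order

by_category = sorted(catalog, key=category_then_price_desc)
names = [name for name, _, _ in by_category]
print(names)
# ['Wireless Headset Pro', 'USB-C Hub', 'Trackball', 'Ergonomic Mouse']
```

Negating a number is a simple way to reverse one part of a compound key; it only works for numeric fields.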
Finding and Counting
reps = ["Alice", "Bob", "Carol", "Alice", "Dave", "Alice"]
# .index(value) returns the position of the FIRST occurrence
first_alice_pos = reps.index("Alice") # 0
# .count(value) counts how many times a value appears
alice_count = reps.count("Alice") # 3
# in operator: True/False membership check
"Bob" in reps # True
"Eve" in reps # False
"Eve" not in reps # True
# len() — number of items
total_reps = len(reps) # 6
Other Useful Methods
# .reverse() reverses in place
items = [1, 2, 3, 4, 5]
items.reverse() # items is now [5, 4, 3, 2, 1]
# .copy() returns a shallow copy
original = [1, 2, 3]
copy_a = original.copy() # same as original[:]
copy_a.append(4) # doesn't affect original
# .clear() removes all elements
temp_queue = ["task1", "task2", "task3"]
temp_queue.clear() # temp_queue is now []
Iterating Over Lists
You already know for loops from Chapter 6. Lists are iterables, so they work naturally:
territories = ["Northeast", "Southeast", "Midwest", "West"]
# Iterate over values
for territory in territories:
    print(f"Preparing report for {territory}...")
# Iterate with index using enumerate()
for index, territory in enumerate(territories):
    print(f" {index + 1}. {territory}")
# Iterate over two lists simultaneously using zip()
sales_totals = [94_500, 87_200, 112_400, 76_800]
for territory, total in zip(territories, sales_totals):
    print(f" {territory:<12}: ${total:>10,.2f}")
enumerate() and zip() are workhorses in data processing. You will reach for them constantly.
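Two details of these functions matter in practice. enumerate() accepts a start argument, so you do not need to add 1 by hand, and zip() stops silently at the end of the shorter input. The sketch below deliberately leaves one sales total out to show the truncation:

```python
territories = ["Northeast", "Southeast", "Midwest", "West"]
sales_totals = [94_500, 87_200, 112_400]   # one value short on purpose

# start=1 produces human-friendly numbering directly.
numbered = list(enumerate(territories, start=1))

# zip() pairs items until the shorter list runs out; "West" is dropped.
paired = list(zip(territories, sales_totals))

print(numbered[0])    # (1, 'Northeast')
print(len(paired))    # 3
```

When the two lists are supposed to be the same length, comparing len() of each before zipping is a cheap safeguard. On Python 3.10 and later, zip(a, b, strict=True) raises ValueError on a length mismatch instead of truncating.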
List Comprehensions
A list comprehension is a compact way to build a new list by transforming or filtering an existing iterable. The syntax reads like an English sentence:
# [expression for item in iterable if condition]
# All amounts, doubled
amounts = [1000, 2500, 4200, 800, 3100]
doubled = [amount * 2 for amount in amounts]
# Filter: only amounts over $2,000
large_amounts = [amount for amount in amounts if amount > 2000]
# Transform: apply a 10% discount to amounts over $3,000
discounted = [
round(amount * 0.90, 2) if amount > 3000 else amount
for amount in amounts
]
Priya uses list comprehensions to transform raw order data into clean summary lists without writing multi-line loops:
orders = [
("ORD-1001", "Thornfield Logistics", 4200.00, "completed"),
("ORD-1002", "Beacon Analytics", 1850.50, "completed"),
("ORD-1003", "Summit Partners", 9100.00, "pending"),
("ORD-1004", "Coastal Media", 760.00, "cancelled"),
]
# Extract only completed order amounts
completed_amounts = [
amount
for _, _, amount, status in orders
if status == "completed"
]
# Build a list of display strings for a report
order_lines = [
f"{order_id}: ${amount:,.2f} ({status})"
for order_id, _, amount, status in orders
]
List comprehensions are more Pythonic (idiomatic) than equivalent for-loops when the logic is simple. Use a regular loop when the logic is complex enough that the comprehension becomes hard to read.
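To make the trade-off concrete, here is the same filter written both ways, reusing the amounts list from the start of this section. Both produce an identical result; the comprehension simply folds the loop, the condition, and the append into one expression:

```python
amounts = [1000, 2500, 4200, 800, 3100]

# Loop version: explicit accumulator, condition, and append.
large_loop = []
for amount in amounts:
    if amount > 2000:
        large_loop.append(amount)

# Comprehension version: the same three pieces in one expression.
large_comp = [amount for amount in amounts if amount > 2000]

print(large_loop == large_comp)   # True
print(large_comp)                 # [2500, 4200, 3100]
```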
When to Use Lists
Use a list when:
- You have an ordered collection of items that may change (grow, shrink, be sorted)
- You need to access items by position
- You need to allow duplicates (same item appearing multiple times)
- You're building up results incrementally
Business examples: a product catalog, a sales leaderboard, a queue of tasks to process, daily transaction records.
Tuples: Ordered, Immutable Records
What a Tuple Is
A tuple is like a list, but with one crucial difference: it cannot be changed after creation. You cannot append to it, remove from it, or change any of its values. This immutability is a feature, not a limitation.
You create a tuple with parentheses (or even just commas):
# Quarterly revenue for one fiscal year: four numbers that form a complete, fixed set
quarterly_results = (2_450_000, 2_680_000, 2_310_000, 2_890_000)
# A geographic coordinate (latitude, longitude) — should never change
hq_location = (40.7128, -74.0060) # New York City
# A database connection configuration
db_config = ("db.acmecorp.internal", 5432, "sales_db")
# Single-element tuple requires a trailing comma
single = (42,) # without the comma, Python reads (42) as just the integer 42
Tuples support indexing and slicing the same way lists do. They also support in, len(), count(), and index(). What they do NOT support is any method that modifies the collection.
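A short sketch of what that looks like in practice, using the db_config tuple from above. Read operations work as they do on lists, modification attempts raise errors, and "changing" a tuple really means building a new one (the port 3306 is an arbitrary illustrative value):

```python
db_config = ("db.acmecorp.internal", 5432, "sales_db")

host = db_config[0]             # indexing works
host_and_port = db_config[:2]   # slicing returns a new tuple
has_port = 5432 in db_config    # membership test works

# Item assignment is rejected with TypeError.
try:
    db_config[1] = 3306
    assignment_error = None
except TypeError as exc:
    assignment_error = type(exc).__name__

# Tuples have no append method at all, so this raises AttributeError.
try:
    db_config.append("extra")
    append_error = None
except AttributeError as exc:
    append_error = type(exc).__name__

# To "change" a value, build a new tuple from pieces of the old one.
new_config = db_config[:1] + (3306,) + db_config[2:]

print(assignment_error)   # TypeError
print(append_error)       # AttributeError
print(new_config)         # ('db.acmecorp.internal', 3306, 'sales_db')
```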
Tuple Unpacking
Unpacking is one of the most elegant features in Python. It lets you assign the elements of a tuple to individual variable names in one line:
quarterly_results = (2_450_000, 2_680_000, 2_310_000, 2_890_000)
q1, q2, q3, q4 = quarterly_results
print(f"Q1: ${q1:>12,.2f}")
print(f"Q2: ${q2:>12,.2f}")
# Swap two variables — classic use of tuple unpacking
a, b = 10, 20
a, b = b, a # a is now 20, b is now 10 — no temp variable needed
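# Unpack a record into named variables (db_config is the tuple defined earlier)
host, port, database = db_config
print(f"Connecting to {host}:{port}/{database}")
# Use _ as a placeholder for positions you don't need
host, _, database = db_config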
# Extended unpacking with * — capture multiple items
first, *rest = (1, 2, 3, 4, 5)
# first = 1, rest = [2, 3, 4, 5]
*beginning, last = (1, 2, 3, 4, 5)
# beginning = [1, 2, 3, 4], last = 5
Tuple unpacking shows up everywhere in Python. When for loops iterate over a list of tuples, the loop variable can unpack in one step:
sales_by_region = [
("Northeast", 94_500.00),
("Southeast", 87_200.00),
("Midwest", 112_400.00),
("West", 76_800.00),
]
for region, total in sales_by_region:
    print(f" {region:<12}: ${total:>10,.2f}")
Named Tuples
Regular tuples are accessed by position (record[0], record[1]), which makes code brittle — if the order ever changes, every access breaks. Named tuples solve this by letting you access fields by name.
The collections.namedtuple factory creates a new class that behaves like a tuple but with named fields:
from collections import namedtuple
# Define the structure once
SalesRecord = namedtuple("SalesRecord", ["region", "product", "amount", "quarter"])
# Create instances
record_1 = SalesRecord(region="Northeast", product="Wireless Headset Pro", amount=4_200.00, quarter="Q1")
record_2 = SalesRecord("Southeast", "Ergonomic Mouse", 1_850.50, "Q1")
# Access by name (much clearer than record_1[2])
print(f"Region: {record_1.region}")
print(f"Product: {record_1.product}")
print(f"Amount: ${record_1.amount:,.2f}")
# Still works like a tuple
print(record_1[0]) # "Northeast"
region, product, amount, qtr = record_1 # unpacking still works
# Named tuples are immutable
# record_1.amount = 5000.00 # This would raise AttributeError
Named tuples are excellent for rows of data — think of a row coming back from a database query or a row in a CSV file. They're lighter weight than a full class, but more descriptive than a plain tuple.
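Because a named tuple cannot be modified, corrections are made by creating a new record. The standard library provides two helper methods for this kind of work: _replace() returns a copy with selected fields changed, and _asdict() converts a record to a dictionary. A brief sketch, with an arbitrary corrected amount:

```python
from collections import namedtuple

SalesRecord = namedtuple("SalesRecord", ["region", "product", "amount", "quarter"])
record = SalesRecord("Northeast", "Wireless Headset Pro", 4_200.00, "Q1")

# _replace() builds a NEW record; the original is unchanged.
corrected = record._replace(amount=4_500.00)

# _asdict() gives a dictionary keyed by field name.
as_dict = record._asdict()

print(record.amount)       # 4200.0
print(corrected.amount)    # 4500.0
print(as_dict["region"])   # Northeast
```

The leading underscore in these method names exists to avoid clashing with your own field names; they are part of the public namedtuple interface.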
When to Use Tuples vs Lists
Use a tuple when:
- The data represents a fixed record where the position of each element has meaning (coordinates, a row of data, a configuration setting)
- You want to signal to other developers (and to yourself) that this data should not change
- You're using the data as a dictionary key (lists cannot be dict keys; tuples can, provided every element is itself hashable)
- Performance matters in extreme cases (tuples are slightly faster than lists)
Use a list when the collection will grow, shrink, or be reordered.
A useful mental model: tuples are records; lists are queues. A quarterly earnings report is a tuple — it represents a complete, fixed snapshot of reality. A product catalog is a list — products get added and removed.
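The dictionary-key point deserves a quick illustration, since dictionaries are introduced in the next section. The sketch below keys sales totals by a (region, quarter) tuple and shows that the same key written as a list is rejected. The Q2 figure is made up for the example:

```python
# Totals keyed by a (region, quarter) tuple. The Q2 amount is illustrative.
sales_by_region_quarter = {
    ("Northeast", "Q1"): 94_500.00,
    ("Northeast", "Q2"): 101_200.00,
}

northeast_q1 = sales_by_region_quarter[("Northeast", "Q1")]

# A list cannot be used as a key because lists are mutable (unhashable).
try:
    bad = {["Northeast", "Q1"]: 94_500.00}
    list_key_error = None
except TypeError as exc:
    list_key_error = type(exc).__name__

print(northeast_q1)      # 94500.0
print(list_key_error)    # TypeError
```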
Dictionaries: Key-Value Stores
What a Dictionary Is
A dictionary stores data as key-value pairs. Instead of accessing data by numeric position (like a list), you access it by a meaningful key — usually a string. This makes dictionaries the natural choice for structured records where each field has a name.
# A customer record as a dictionary
customer = {
"customer_id": "CUST-0042",
"company_name": "Thornfield Logistics",
"contact_name": "Rachel Torres",
"tier": "gold",
"annual_spend_usd": 87_450.00,
"active": True,
}
The string before the colon is the key. The value after the colon is the value. Keys must be unique within a dictionary and must be an immutable type (strings and numbers are most common; tuples work too). Values can be anything — strings, numbers, booleans, lists, even other dictionaries.
Creating Dictionaries
# Literal syntax — most common
config = {"host": "localhost", "port": 5432, "debug": False}
# dict() constructor with keyword arguments
config = dict(host="localhost", port=5432, debug=False)
# Empty dict — then add keys
record = {}
record["name"] = "Thornfield Logistics"
record["tier"] = "gold"
# From two parallel lists using zip()
keys = ["name", "tier", "region"]
values = ["Thornfield Logistics", "gold", "Northeast"]
record = dict(zip(keys, values))
Accessing Values
customer = {
"company_name": "Thornfield Logistics",
"tier": "gold",
"annual_spend_usd": 87_450.00,
}
# Direct access — raises KeyError if key doesn't exist
name = customer["company_name"] # "Thornfield Logistics"
tier = customer["tier"] # "gold"
# This would crash if the key doesn't exist:
# phone = customer["phone"] # KeyError: 'phone'
The .get() method is almost always better than direct bracket access when there's any chance a key might be absent:
# .get(key) returns None if key doesn't exist — no crash
phone = customer.get("phone") # None
discount = customer.get("discount_rate") # None
# .get(key, default) returns the default value if key is absent
phone = customer.get("phone", "Not on file") # "Not on file"
discount = customer.get("discount_rate", 0.0) # 0.0
print(f"Phone: {phone}") # Phone: Not on file
Marcus always uses .get() in production code. "Direct bracket access is like opening a drawer that might not exist," he tells Priya. ".get() politely returns None if the drawer isn't there."
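One caveat applies to Marcus's rule: .get() cannot tell the difference between a key that is missing and a key that is present with the value None. When that distinction matters, the in operator answers the question directly. A minimal sketch with a hypothetical record whose phone field was explicitly stored as None:

```python
# Hypothetical record: "phone" exists but holds None; "fax" is absent.
customer = {"company_name": "Thornfield Logistics", "phone": None}

phone = customer.get("phone", "Not on file")   # key exists, so the default is NOT used
fax = customer.get("fax", "Not on file")       # key is absent, so the default is used

has_phone_key = "phone" in customer
has_fax_key = "fax" in customer

print(phone)           # None
print(fax)             # Not on file
print(has_phone_key)   # True
print(has_fax_key)     # False
```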
Updating, Adding, and Removing
customer = {"company_name": "Thornfield Logistics", "tier": "silver", "annual_spend_usd": 42_000.00}
# Update an existing key
customer["tier"] = "gold"
customer["annual_spend_usd"] = 87_450.00
# Add a new key
customer["account_manager"] = "Sandra Chen"
# .update() modifies multiple keys at once (or adds new keys)
customer.update({"tier": "platinum", "annual_spend_usd": 150_000.00, "vip": True})
# Remove a key with del — raises KeyError if absent
del customer["vip"]
# .pop(key) removes and returns the value
removed_spend = customer.pop("annual_spend_usd") # returns 150000.0
# .pop(key, default) is safe — returns default if key doesn't exist
removed_phone = customer.pop("phone", None) # returns None, no crash
Iterating Over Dictionaries
customer = {
"company_name": "Thornfield Logistics",
"contact_name": "Rachel Torres",
"tier": "gold",
"annual_spend_usd": 87_450.00,
"region": "Northeast",
}
# Iterate over keys (default behavior)
for key in customer:
    print(key)
# Iterate over keys explicitly
for key in customer.keys():
    print(key)
# Iterate over values
for value in customer.values():
    print(value)
# Iterate over key-value pairs (the most common pattern)
for key, value in customer.items():
    print(f" {key}: {value}")
The .items() method is your workhorse for dictionary iteration. Any time you want to display, transform, or process all fields in a record, .items() gives you both pieces of information in each iteration.
Nested Dictionaries
Real-world data is often hierarchical. A customer might have a nested address, or a product might have a nested specification block:
customer = {
"company_name": "Thornfield Logistics",
"tier": "gold",
"address": {
"street": "1400 Harbor View Drive",
"city": "Boston",
"state": "MA",
"zip": "02110",
},
"contacts": {
"primary": {"name": "Rachel Torres", "email": "r.torres@thornfield.com"},
"billing": {"name": "Paul Nguyen", "email": "billing@thornfield.com"},
},
}
# Access nested values with chained bracket notation
city = customer["address"]["city"] # "Boston"
primary_email = customer["contacts"]["primary"]["email"] # "r.torres@thornfield.com"
# Safe access with nested .get()
state = customer.get("address", {}).get("state", "Unknown") # "MA"
fax = customer.get("contacts", {}).get("fax", {}).get("number", "N/A") # "N/A"
The pattern dict.get("key", {}).get("nested_key", default) is a common and safe way to traverse nested structures without crashing on missing intermediate keys.
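Chained .get() calls become hard to read beyond two or three levels, and they still fail if an intermediate key exists but holds None rather than a dictionary. One possible alternative is a small helper that walks a list of keys. The function get_nested below is a hypothetical helper written for this example, not a built-in:

```python
def get_nested(data, keys, default=None):
    """Walk a sequence of keys through nested dicts; return default on any miss."""
    current = data
    for key in keys:
        # Stop as soon as the path leaves dictionary territory or a key is missing.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Sample record: "contacts" is present but set to None.
customer = {
    "address": {"city": "Boston", "state": "MA"},
    "contacts": None,
}

state = get_nested(customer, ["address", "state"], "Unknown")
zip_code = get_nested(customer, ["address", "zip"], "Unknown")
fax = get_nested(customer, ["contacts", "fax", "number"], "N/A")

print(state)      # MA
print(zip_code)   # Unknown
print(fax)        # N/A
```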
The setdefault() Method
setdefault() is a convenient method for "set this key if it doesn't exist, then return its value." It's particularly useful for building dicts of lists:
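# Illustrative sample data: (order_id, region, amount)
transactions = [
    ("ORD-1001", "Northeast", 4200.00),
    ("ORD-1002", "Southeast", 1850.50),
    ("ORD-1005", "Northeast", 760.00),
]
# Group orders by region without setdefault (verbose way)
region_orders = {}
for order_id, region, amount in transactions:
    if region not in region_orders:
        region_orders[region] = []
    region_orders[region].append(order_id)
# Same thing with setdefault (clean way)
region_orders = {}
for order_id, region, amount in transactions:
    region_orders.setdefault(region, []).append(order_id)
# region_orders == {"Northeast": ["ORD-1001", "ORD-1005"], "Southeast": ["ORD-1002"]}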
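The standard library offers another route to the same grouping: collections.defaultdict, which creates the empty list automatically the first time a key is touched. The sketch below uses a small sample transactions list defined inline for illustration:

```python
from collections import defaultdict

# Illustrative sample data: (order_id, region, amount)
transactions = [
    ("ORD-1001", "Northeast", 4200.00),
    ("ORD-1002", "Southeast", 1850.50),
    ("ORD-1005", "Northeast", 760.00),
]

# defaultdict(list) calls list() to supply a value for any missing key.
region_orders = defaultdict(list)
for order_id, region, amount in transactions:
    region_orders[region].append(order_id)

print(dict(region_orders))
# {'Northeast': ['ORD-1001', 'ORD-1005'], 'Southeast': ['ORD-1002']}
```

One trade-off to keep in mind: merely reading a missing key from a defaultdict with square brackets creates that key, whereas setdefault() on a plain dict only adds keys where you call it explicitly.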
Dict Comprehensions
Just as list comprehensions build lists, dict comprehensions build dictionaries:
products = [
("SKU-001", "Wireless Headset Pro", 149.99),
("SKU-002", "Standing Desk Converter", 249.00),
("SKU-003", "Ergonomic Mouse", 59.95),
]
# Build a price lookup: SKU -> price
price_lookup = {sku: price for sku, _, price in products}
# {"SKU-001": 149.99, "SKU-002": 249.00, "SKU-003": 59.95}
# Build a name lookup with filtering
premium_products = {
sku: name
for sku, name, price in products
if price >= 100.00
}
# {"SKU-001": "Wireless Headset Pro", "SKU-002": "Standing Desk Converter"}
# Apply a transformation: add 15% markup to all prices
markup_prices = {
sku: round(price * 1.15, 2)
for sku, _, price in products
}
Merging Dictionaries
Python 3.9+ introduced a clean merge operator:
defaults = {"timeout": 30, "retries": 3, "verbose": False}
overrides = {"timeout": 60, "log_level": "INFO"}
# Merge — overrides wins on conflicts
config = defaults | overrides
# {"timeout": 60, "retries": 3, "verbose": False, "log_level": "INFO"}
# In-place merge
defaults |= overrides
In earlier Python (3.5–3.8), use:
config = {**defaults, **overrides} # double-star unpacking
When to Use Dictionaries
Use a dictionary when:
- Your data has named fields (customer name, order total, region)
- You need to look up a value by a meaningful key (not just a position number)
- You're building an index or lookup table from a list
- You're aggregating data (counting or summing by category)
Business examples: customer records, product specifications, configuration files, aggregated sales totals by region or category.
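For the counting case specifically, the standard library's collections.Counter is a dictionary subclass built for the job. A minimal sketch, using a few illustrative order rows, counts orders per region and asks for the busiest one:

```python
from collections import Counter

# Illustrative sample data: (order_id, region, amount)
orders = [
    ("ORD-1001", "Northeast", 4200.00),
    ("ORD-1002", "Southeast", 1850.50),
    ("ORD-1005", "Northeast", 760.00),
]

# Counter tallies how often each region appears.
orders_per_region = Counter(region for _, region, _ in orders)

# most_common(n) returns the n highest counts as (key, count) pairs.
busiest = orders_per_region.most_common(1)

print(orders_per_region["Northeast"])   # 2
print(orders_per_region["West"])        # 0 (missing keys count as zero)
print(busiest)                          # [('Northeast', 2)]
```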
Sets: Unique Value Collections
What a Set Is
A set is an unordered collection of unique values. Duplicates are automatically removed. There is no index, no guaranteed order. What a set gives you that other structures don't is fast membership testing and powerful set-theory operations.
# Market segments served by Acme Corp's Product Line A
segments_A = {"retail", "wholesale", "e-commerce", "enterprise"}
# Market segments served by Product Line B
segments_B = {"enterprise", "government", "healthcare", "e-commerce"}
# Sets eliminate duplicates automatically
sales_regions = {"Northeast", "Southeast", "Midwest", "West", "Northeast", "Midwest"}
print(sales_regions) # {"Northeast", "Southeast", "Midwest", "West"} — only 4 items
# Create a set from a list (deduplication pattern)
rep_activity_log = ["Alice", "Bob", "Alice", "Carol", "Bob", "Alice"]
unique_reps = set(rep_activity_log)
print(unique_reps) # {"Alice", "Bob", "Carol"} — order may vary
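If you need the duplicates removed but the original order kept, a set alone will not do. Two common approaches are sketched below: dict.fromkeys() relies on dictionaries preserving insertion order (Python 3.7+), and sorted() imposes an alphabetical order on the set's contents:

```python
rep_activity_log = ["Alice", "Bob", "Alice", "Carol", "Bob", "Alice"]

# Keep first-seen order: dict keys are unique and remember insertion order.
unique_in_order = list(dict.fromkeys(rep_activity_log))

# Or impose a predictable order by sorting the set.
unique_sorted = sorted(set(rep_activity_log))

print(unique_in_order)   # ['Alice', 'Bob', 'Carol']
print(unique_sorted)     # ['Alice', 'Bob', 'Carol']
```

The two results happen to match here only because the reps first appear in alphabetical order; with other data they would differ.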
Set Operations
Sets support mathematical set operations that are surprisingly useful in business contexts:
segments_A = {"retail", "wholesale", "e-commerce", "enterprise"}
segments_B = {"enterprise", "government", "healthcare", "e-commerce"}
# Union: all segments covered by either product line
all_segments = segments_A | segments_B
# or: segments_A.union(segments_B)
print(all_segments)
# {"retail", "wholesale", "e-commerce", "enterprise", "government", "healthcare"}
# Intersection: segments served by BOTH product lines
shared_segments = segments_A & segments_B
# or: segments_A.intersection(segments_B)
print(shared_segments)
# {"enterprise", "e-commerce"}
# Difference: segments in A but NOT in B
exclusive_A = segments_A - segments_B
# or: segments_A.difference(segments_B)
print(exclusive_A)
# {"retail", "wholesale"}
# Symmetric difference: segments in one or the other, but NOT both
unique_to_each = segments_A ^ segments_B
# or: segments_A.symmetric_difference(segments_B)
print(unique_to_each)
# {"retail", "wholesale", "government", "healthcare"}
# Subset check: is every element of A also in B?
{"enterprise"}.issubset(segments_A) # True
# Superset check: does A contain all elements of B?
segments_A.issuperset({"retail", "e-commerce"}) # True
# Disjoint: do A and B share no elements at all?
{"retail", "wholesale"}.isdisjoint({"government", "healthcare"}) # True
Modifying Sets
active_customers = {"Thornfield Logistics", "Beacon Analytics", "Summit Partners"}
# Add one item
active_customers.add("Ridgeline Corp")
# Add multiple items
active_customers.update(["Coastal Media", "Apex Ventures"])
# Remove an item — raises KeyError if not present
active_customers.remove("Coastal Media")
# Discard removes without raising an error if absent
active_customers.discard("NonExistent Corp") # silently does nothing
# Pop removes and returns an arbitrary item (since sets are unordered)
removed = active_customers.pop()
# Clear empties the set
active_customers.clear()
Practical Set Patterns
# --- Pattern 1: Find customers who lapsed (were active last month but not this month) ---
last_month_active = {"Thornfield", "Beacon", "Summit", "Ridgeline", "Coastal"}
this_month_active = {"Thornfield", "Beacon", "Summit", "Apex", "Ironwood"}
lapsed = last_month_active - this_month_active
# {"Ridgeline", "Coastal"} — need a check-in call
new_this_month = this_month_active - last_month_active
# {"Apex", "Ironwood"} — welcome new customers
retained = last_month_active & this_month_active
# {"Thornfield", "Beacon", "Summit"}
# --- Pattern 2: Check for duplicate order IDs in a batch ---
incoming_order_ids = ["ORD-1001", "ORD-1002", "ORD-1003", "ORD-1002", "ORD-1004"]
unique_ids = set(incoming_order_ids)
if len(incoming_order_ids) != len(unique_ids):
    print("WARNING: Duplicate order IDs detected in batch!")
# --- Pattern 3: Validate that all required fields are present ---
required_fields = {"customer_id", "company_name", "email", "tier"}
submitted_fields = {"customer_id", "company_name", "phone", "tier"}
missing_fields = required_fields - submitted_fields
# {"email"} — the form is incomplete
if missing_fields:
    print(f"Missing required fields: {missing_fields}")
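Pattern 2 tells you that duplicates exist but not which IDs are affected. One way to find them, sketched below, is to track the IDs seen so far in one set and collect repeats in another:

```python
incoming_order_ids = ["ORD-1001", "ORD-1002", "ORD-1003", "ORD-1002", "ORD-1004"]

seen = set()         # every ID encountered so far
duplicates = set()   # IDs encountered more than once

for order_id in incoming_order_ids:
    if order_id in seen:
        duplicates.add(order_id)
    else:
        seen.add(order_id)

print(duplicates)   # {'ORD-1002'}
```

Each membership check against a set takes roughly constant time, so this stays fast even for large batches.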
When to Use Sets
Use a set when:
- You only care about whether something exists, not how many times or in what order
- You need to deduplicate a list quickly
- You want to compare two groups (who's in both? who's only in one?)
- You're validating completeness (does this record have all required fields?)
Business examples: unique customer IDs, market segments, active product SKUs, valid permission codes.
Nested Structures: Combining Data Structures
Real business data is rarely flat. You almost always end up with combinations of structures. The two most common patterns are:
Lists of Dictionaries (Table Representation)
This is the most important pattern in business Python. It directly mirrors what you'd see in a spreadsheet or a database table: each row is a dictionary, and the whole table is a list of those dictionaries.
# Each row is a dict; the table is a list
sales_records = [
{"order_id": "ORD-1001", "region": "Northeast", "product": "Wireless Headset Pro", "amount": 4200.00, "week": 1},
{"order_id": "ORD-1002", "region": "Southeast", "product": "Ergonomic Mouse", "amount": 1850.50, "week": 1},
{"order_id": "ORD-1003", "region": "West", "product": "USB-C Hub 7-Port", "amount": 9100.00, "week": 2},
{"order_id": "ORD-1004", "region": "Midwest", "product": "4K Webcam Ultra", "amount": 3400.75, "week": 2},
{"order_id": "ORD-1005", "region": "Northeast", "product": "Ergonomic Mouse", "amount": 760.00, "week": 3},
]
# Iterate and display
for record in sales_records:
    print(f" {record['order_id']} {record['region']:<12} ${record['amount']:>9,.2f}")
# Filter by region
northeast_sales = [r for r in sales_records if r["region"] == "Northeast"]
# Aggregate: total by region
region_totals: dict[str, float] = {}
for record in sales_records:
    region = record["region"]
    region_totals[region] = region_totals.get(region, 0.0) + record["amount"]
# Sort by amount, descending
sorted_records = sorted(sales_records, key=lambda r: r["amount"], reverse=True)
This pattern — a list of dicts representing a table — is what you get back from CSV readers, REST APIs, and databases. Mastering it means you can work with almost any real-world data source.
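As a small illustration of the CSV case, the standard library's csv.DictReader yields one dictionary per row. The sketch below reads from an in-memory string instead of a real file (the CSV text is sample data) and converts the amount column, since every value arrives as a string:

```python
import csv
import io

# Sample CSV content held in memory; a real script would open a file instead.
raw_csv = """order_id,region,amount
ORD-1001,Northeast,4200.00
ORD-1002,Southeast,1850.50
"""

# DictReader uses the header row as keys and yields one dict per data row.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# CSV values are always strings, so numeric columns need converting.
for row in rows:
    row["amount"] = float(row["amount"])

print(rows[0]["region"])                   # Northeast
print(sum(row["amount"] for row in rows))  # 6050.5
```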
Dictionaries of Lists
Use this when you want to group items by category:
# Customers grouped by tier
customers_by_tier: dict[str, list[str]] = {
"platinum": ["Summit Partners", "Ironwood Manufacturing"],
"gold": ["Thornfield Logistics", "Apex Ventures", "Blue River Consulting"],
"silver": ["Beacon Analytics", "Ridgeline Corp", "Clearwater Advisory"],
"bronze": ["Coastal Media", "Harborview Tech"],
}
# Access all gold customers
gold_list = customers_by_tier["gold"]
# Add to a category
customers_by_tier["gold"].append("New Gold Customer")
# Iterate over groups
for tier, customers in customers_by_tier.items():
    print(f" {tier.upper()} ({len(customers)} customers): {', '.join(customers)}")
Dicts of Dicts (the Database Pattern)
For fast lookups by a unique ID, a dict of dicts is more efficient than a list of dicts:
# Outer key is the customer_id for O(1) lookup
customer_db: dict[str, dict] = {
"CUST-0001": {"company": "Thornfield Logistics", "tier": "gold", "spend": 92100},
"CUST-0002": {"company": "Beacon Analytics", "tier": "silver", "spend": 34200},
"CUST-0003": {"company": "Summit Partners", "tier": "platinum", "spend": 215000},
}
# Direct lookup — instant, regardless of database size
cust = customer_db.get("CUST-0002") # {"company": "Beacon Analytics", ...}
For large datasets, looking up by dict key is dramatically faster than scanning a list of dicts with a loop or list comprehension. (Chapter 11 covers performance in more depth.)
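In practice you often receive the data as a list of dicts and build the lookup yourself. A dict comprehension does this in one line, as sketched below. The sketch assumes each record carries a customer_id field and that those IDs are unique; if an ID repeated, the later record would silently replace the earlier one:

```python
# A table of records, as it might arrive from a file or an API.
customers = [
    {"customer_id": "CUST-0001", "company": "Thornfield Logistics", "tier": "gold"},
    {"customer_id": "CUST-0002", "company": "Beacon Analytics", "tier": "silver"},
    {"customer_id": "CUST-0003", "company": "Summit Partners", "tier": "platinum"},
]

# Index the table by ID: one pass to build, then direct lookups.
customer_db = {record["customer_id"]: record for record in customers}

company = customer_db["CUST-0002"]["company"]
unknown = customer_db.get("CUST-9999")     # None: no such customer

print(company)    # Beacon Analytics
print(unknown)    # None
```

The index stores references to the same record dictionaries that are in the list, not copies, so an update made through customer_db is visible through customers as well.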
Nested Structures in Practice
Sandra asks Priya for a weekly breakdown by region. Priya builds a nested structure:
# Dict of lists of dicts: {week: [{order records}]}
weekly_data: dict[int, list[dict]] = {1: [], 2: [], 3: [], 4: []}
for record in sales_records:
    weekly_data[record["week"]].append(record)
# Now easily get all orders from week 2
week_2_orders = weekly_data[2]
week_2_total = sum(r["amount"] for r in week_2_orders)
print(f"Week 2 total: ${week_2_total:,.2f}")
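The grouping above is by week only. To get totals by region within each week, one option is a dict of dicts built with the setdefault() and .get() techniques from earlier in the chapter. The sketch below repeats the sales records in trimmed form so that it runs on its own:

```python
# Trimmed copy of the sales records used earlier in this section.
sales_records = [
    {"order_id": "ORD-1001", "region": "Northeast", "amount": 4200.00, "week": 1},
    {"order_id": "ORD-1002", "region": "Southeast", "amount": 1850.50, "week": 1},
    {"order_id": "ORD-1003", "region": "West", "amount": 9100.00, "week": 2},
    {"order_id": "ORD-1004", "region": "Midwest", "amount": 3400.75, "week": 2},
    {"order_id": "ORD-1005", "region": "Northeast", "amount": 760.00, "week": 3},
]

# Structure: {week: {region: total}}
weekly_region_totals = {}
for record in sales_records:
    # Fetch (or create) the inner dict for this record's week.
    week_totals = weekly_region_totals.setdefault(record["week"], {})
    region = record["region"]
    week_totals[region] = week_totals.get(region, 0.0) + record["amount"]

print(weekly_region_totals[1])   # {'Northeast': 4200.0, 'Southeast': 1850.5}
print(weekly_region_totals[2])   # {'West': 9100.0, 'Midwest': 3400.75}
```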
Choosing the Right Data Structure: A Decision Guide
When you sit down to represent a piece of business data, ask yourself these questions:
1. Is the data a sequence with meaningful position?
   - If positions matter (first item is special, items have a natural order) → list or tuple
   - If position does NOT matter (you only care about presence or absence) → set
2. Will the data change after you create it?
   - If yes → list (mutable sequences) or dict (mutable key-value pairs)
   - If no → tuple (immutable records)
3. Does each item have named fields?
   - If yes → dict or named tuple
   - If no (just a value, not a record) → list, tuple, or set
4. Do you need to look up items by a meaningful identifier?
   - If yes → dict (key-value lookup)
   - If no → list (positional access)
5. Do you need to check membership quickly, or find what two groups have in common?
   - If yes → set
Business Scenario Examples
| Business Need | Structure | Reason |
|---|---|---|
| All products in the catalog | list[tuple] | Ordered, may change, each item is a fixed record |
| Single customer profile | dict | Named fields, mutable |
| All customer profiles | list[dict] | Table of records |
| Customer lookup by ID | dict[str, dict] | Fast key-based lookup |
| This quarter's revenue numbers | tuple | Four fixed numbers, shouldn't change |
| Unique active customer IDs | set | Only care about presence |
| Sales by region | dict[str, float] | Named aggregation |
| Weekly orders grouped by region | dict[str, list[dict]] | Hierarchical table |
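Several rows of the table can be written out as code to show how the choice plays out in practice. Every value below is made-up sample data, and the variable names are illustrative.

```python
# All values below are made-up sample data for illustration

# list[tuple]: ordered catalog, each item a fixed (sku, name, price) record
catalog = [("SKU-100", "Stapler", 12.99), ("SKU-200", "Desk Lamp", 34.50)]

# dict: one customer profile with named, mutable fields
profile = {"company": "Thornfield Logistics", "tier": "gold"}

# tuple: four revenue figures that should not change
quarterly_revenue = (410_000.0, 385_500.0, 452_250.0, 498_000.0)

# set: unique active customer IDs; the duplicate collapses automatically
active_ids = {"CUST-0001", "CUST-0003", "CUST-0001"}

# dict[str, float]: named aggregation
sales_by_region = {"Northeast": 126_300.0, "West": 215_000.0}

catalog.append(("SKU-300", "Whiteboard", 89.00))  # lists can grow
profile["tier"] = "platinum"                      # dict fields can change
print(len(active_ids))                            # 2
print("CUST-0003" in active_ids)                  # True

try:
    quarterly_revenue[0] = 0.0                    # tuples reject item assignment
except TypeError as exc:
    print(f"TypeError: {exc}")
```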
Copying vs. Referencing: A Common Source of Bugs
This section covers a concept that trips up nearly every Python developer at least once. Understanding it will save you hours of debugging.
Variables Are References
In Python, a variable doesn't hold a value directly — it holds a reference (a pointer) to an object in memory. When you assign a list to a second variable, you get two references to the same list:
original = [1, 2, 3, 4, 5]
alias = original # both names point to THE SAME list
alias.append(99)
print(original) # [1, 2, 3, 4, 5, 99] — original was modified!
This is usually fine when you intend to have two handles to the same object. It becomes a bug when you think you have an independent copy.
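When it is unclear whether two names refer to the same object, the `is` operator and the built-in id() function can settle it. A short sketch (the name lookalike is illustrative):

```python
original = [1, 2, 3]
alias = original        # a second name for the same object
lookalike = [1, 2, 3]   # a separate object that happens to hold equal values

print(alias is original)           # True: one object, two names
print(lookalike is original)       # False: two distinct objects
print(lookalike == original)       # True: == compares contents, not identity
print(id(alias) == id(original))   # True: id() reports the same object
```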
Shallow Copies
A shallow copy creates a new list object, but the items inside it are still the same references:
import copy
original = [1, 2, 3]
shallow = original.copy() # or: list(original), or original[:]
shallow.append(99)
print(original) # [1, 2, 3]: original is safe, because append() changed only the new list
# BUT with nested structures, shallow copy fails:
db = [{"name": "Thornfield", "tier": "gold"}, {"name": "Beacon", "tier": "silver"}]
backup = db.copy() # shallow copy
backup[0]["tier"] = "MODIFIED"
print(db[0]["tier"]) # "MODIFIED" — the inner dict is still shared!
The shallow copy created a new list, but the dictionaries inside are the same objects. Modifying a nested dict in the copy also modifies the original.
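The same identity check makes the sharing visible. A small sketch that reuses the sample data from the example above:

```python
db = [{"name": "Thornfield", "tier": "gold"}, {"name": "Beacon", "tier": "silver"}]
backup = db.copy()

print(backup is db)        # False: the outer list is a new object
print(backup[0] is db[0])  # True: the inner dict is shared

# Changes to the outer list itself stay local to the copy
backup.append({"name": "Summit", "tier": "platinum"})
print(len(db), len(backup))  # 2 3
```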
Deep Copies
A deep copy creates a completely independent copy of the entire nested structure:
import copy
db = [{"name": "Thornfield", "tier": "gold"}, {"name": "Beacon", "tier": "silver"}]
true_backup = copy.deepcopy(db)
true_backup[0]["tier"] = "MODIFIED"
print(db[0]["tier"]) # "gold" — original is now truly safe
Rule of thumb: use copy.deepcopy() whenever you need to work with a copy of nested data and be certain the original is safe. For flat lists of immutable values (ints, strings, tuples that themselves contain only immutable values), a shallow copy is sufficient because those values can't be mutated.
Marcus enforces this rule in Acme's code reviews: "If you're copying a list of dicts and you need an independent snapshot, use deepcopy. If I see .copy() on a list of dicts, I'm asking questions."
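The question in review usually comes down to how deep the nesting goes. The sketch below compares copying each record individually with a full deep copy. The contacts field is a hypothetical addition that makes the records two levels deep; it is not part of the chapter's data.

```python
import copy

# Hypothetical record with a nested list inside the dict
db = [{"name": "Thornfield", "tier": "gold", "contacts": ["ana@example.com"]}]

# Option 1: copy each dict individually (one level deeper than db.copy())
per_record = [dict(record) for record in db]
per_record[0]["tier"] = "MODIFIED"
print(db[0]["tier"])           # gold: top-level fields are independent

per_record[0]["contacts"].append("new@example.com")
print(len(db[0]["contacts"]))  # 2: the nested list is still shared

# Option 2: deepcopy, independent at every level
snapshot = copy.deepcopy(db)
snapshot[0]["contacts"].append("another@example.com")
print(len(db[0]["contacts"]))  # 2: the original is untouched this time
```

Copying record by record protects only the top-level fields, which is why deepcopy is the safer default when the depth of the data is not known in advance.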
A Complete Example: The Customer Database
Let's put everything together in a realistic scenario. Sandra has asked Priya to build a simple customer database in Python — something they can use to explore data before it goes into the production CRM.
The requirements:
1. Store customer records with named fields
2. Add and remove customers
3. Look up a customer by ID
4. Filter customers by tier
5. Rank customers by spend
6. Calculate total revenue by tier
Here's the core of what Priya builds (the full implementation is in code/customer_database.py):
import copy
from typing import Any
CustomerRecord = dict[str, Any]
def load_customers() -> list[CustomerRecord]:
    """Load the initial customer list."""
    return [
        {
            "customer_id": "CUST-0001",
            "company_name": "Thornfield Logistics",
            "tier": "gold",
            "annual_spend_usd": 92_100.00,
            "region": "Northeast",
            "active": True,
        },
        {
            "customer_id": "CUST-0003",
            "company_name": "Summit Partners",
            "tier": "platinum",
            "annual_spend_usd": 215_000.00,
            "region": "West",
            "active": True,
        },
        # ... more records ...
    ]
def find_by_id(db: list[CustomerRecord], customer_id: str) -> CustomerRecord | None:
    """Return the record with the given ID, or None."""
    for record in db:
        if record["customer_id"] == customer_id:
            return record
    return None
def filter_by_tier(db: list[CustomerRecord], tier: str) -> list[CustomerRecord]:
    """Return deep copies of all records matching the given tier."""
    return [
        copy.deepcopy(record)
        for record in db
        if record["tier"] == tier
    ]
def revenue_by_tier(db: list[CustomerRecord]) -> dict[str, float]:
    """Sum annual spend of active customers for each tier."""
    totals: dict[str, float] = {}
    for record in db:
        if record["active"]:
            t = record["tier"]
            totals[t] = totals.get(t, 0.0) + record["annual_spend_usd"]
    return totals
def top_customers(db: list[CustomerRecord], n: int = 5) -> list[CustomerRecord]:
    """Return the top N customers by annual spend."""
    return sorted(db, key=lambda r: r["annual_spend_usd"], reverse=True)[:n]
Running the database through its paces:
db = load_customers()
# Look up one customer
record = find_by_id(db, "CUST-0001")
print(f"{record['company_name']} is a {record['tier']} customer.")
# Revenue report by tier
tier_totals = revenue_by_tier(db)
for tier, total in sorted(tier_totals.items()):
    print(f" {tier.upper():<10}: ${total:>14,.2f}")
# Top 3 spenders
for i, cust in enumerate(top_customers(db, n=3), start=1):
    print(f" #{i}: {cust['company_name']} (${cust['annual_spend_usd']:,.2f})")
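Requirement 2, adding and removing customers, does not appear in the excerpt above. One way it might look is sketched below. The function names add_customer and remove_customer, the duplicate-ID check, and the True/False return value are assumptions made for illustration; they are not taken from code/customer_database.py.

```python
from typing import Any

CustomerRecord = dict[str, Any]


def add_customer(db: list[CustomerRecord], record: CustomerRecord) -> None:
    """Append a record; reject a customer_id that is already present."""
    if any(r["customer_id"] == record["customer_id"] for r in db):
        raise ValueError(f"Duplicate customer_id: {record['customer_id']}")
    db.append(record)


def remove_customer(db: list[CustomerRecord], customer_id: str) -> bool:
    """Remove the matching record in place; return True if one was removed."""
    for i, record in enumerate(db):
        if record["customer_id"] == customer_id:
            del db[i]  # safe: we return immediately after deleting
            return True
    return False


# Hypothetical usage with a one-record database
db: list[CustomerRecord] = [{"customer_id": "CUST-0001", "tier": "gold"}]
add_customer(db, {"customer_id": "CUST-0002", "tier": "silver"})
print(len(db))                           # 2
print(remove_customer(db, "CUST-0001"))  # True
print(remove_customer(db, "CUST-9999"))  # False
print([r["customer_id"] for r in db])    # ['CUST-0002']
```

Both functions modify the list they are given, so a caller who needs the earlier state should take a deep copy first.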
This pattern — a list of dicts as the database, functions that take the list and return filtered or aggregated results, deep copies to protect the original — is the foundation of almost all data processing in Python before you bring in libraries like Pandas.
Summary
Data structures are how you map the real world into code. In this chapter you learned:
- Lists hold ordered, mutable sequences. Use them for catalogs, rankings, and any collection that grows and changes. Master indexing, slicing, and list comprehensions.
- Tuples hold ordered, immutable records. Use them for fixed data like quarterly results, coordinates, or row records. Unpacking and named tuples make them expressive.
- Dictionaries hold key-value pairs with named fields. They are the right tool for any structured record. Master .get(), .items(), and dict comprehensions.
- Sets hold unique, unordered values. Use them for deduplication, membership testing, and comparing groups with union, intersection, and difference.
- Nested structures — especially lists of dicts — represent tables of business data and are the foundation of nearly all real-world Python data processing.
- Copying requires attention: shallow copies share nested objects; use copy.deepcopy() when you need a true independent snapshot.
The code files accompanying this chapter (list_operations.py, dict_operations.py, customer_database.py) demonstrate all of these concepts in runnable form with realistic business data. Run them, modify them, and experiment.
Continue to Case Study 1 to see Priya build a complete sales data pipeline, or jump to the Exercises to test what you've learned.