> "Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom."
Learning Objectives
- Read, process, and write CSV files with the csv module and DictReader/DictWriter
- Parse and generate JSON data for data interchange
- Make HTTP requests to REST APIs using the requests library
- Process API responses and handle common API errors (rate limits, auth, pagination)
- Combine data from multiple sources into a unified dataset
In This Chapter
- Chapter Overview
- 21.1 Data Is Everywhere
- 21.2 CSV Deep Dive
- 21.3 JSON Deep Dive
- 21.4 What Is an API?
- 21.5 Making HTTP Requests with requests
- 21.6 Working with API Responses
- 21.7 API Authentication and Rate Limits
- 21.8 Combining Data from Multiple Sources
- 21.9 Common Data Processing Patterns
- 21.10 Project Checkpoint: TaskFlow v2.0
- Chapter Summary
- Key Vocabulary
Chapter 21: Working with Data: CSV, JSON, and APIs
"Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom." — Clifford Stoll
Chapter Overview
Open your phone. Check the weather. The app didn't measure the temperature — it asked an API. Check your bank balance. The banking app didn't compute your balance locally — it asked an API. Order food. The delivery app talks to a restaurant API, a payment API, a mapping API, and a notification API, all before the driver even starts moving.
Every modern application is a data consumer. Programs don't live in isolation anymore. They read spreadsheets exported by other systems, parse configuration files, query remote servers for live data, and produce output that other programs will consume. The ability to work with real-world data formats — CSV for tabular data, JSON for structured interchange, and APIs for live remote data — is what separates a script that runs on your laptop from software that participates in the wider world.
In Chapter 10, you learned the fundamentals: opening files, reading and writing text, and got your first taste of CSV and JSON. This chapter goes deeper. You'll handle CSV edge cases that trip up beginners, navigate nested JSON structures that mirror real API responses, and — most importantly — learn to talk to web APIs using Python's requests library. By the end, you'll be pulling live data from the internet and combining it with local files into unified datasets.
In this chapter, you will learn to:
- Read, process, and write CSV files with the csv module, including edge cases like commas in data and quoting
- Parse and generate JSON data for data interchange, including deeply nested structures
- Make HTTP requests to REST APIs using the requests library
- Process API responses and handle common errors (404, 401, 429)
- Combine data from multiple sources into a unified dataset (the data pipeline pattern)
🏃 Fast Track: If you're solid on CSV and JSON from Chapter 10, skim 21.2-21.3 (focus on the edge cases), then slow down at 21.4-21.7 where APIs are introduced.
🔬 Deep Dive: Section 21.8 (combining data sources) and 21.9 (processing patterns) are your springboards — they connect directly to web scraping in Chapter 24 and the capstone data dashboard project.
21.1 Data Is Everywhere
Here's a question that would have seemed absurd thirty years ago: how many different data sources does your phone talk to before you've finished your morning coffee?
Your alarm clock app checks your calendar API. Your weather widget queries a meteorological API. Your email client pulls from an IMAP server. Your news feed aggregates from dozens of content APIs. Your fitness tracker syncs sleep data to a cloud API. Your smart coffee maker — yes, really — might talk to an IoT API.
Before you're even fully awake, your phone has made hundreds of data requests across dozens of formats and protocols.
This is the world your programs live in. And the three technologies that make most of it work are surprisingly simple:
- CSV (Comma-Separated Values): The lingua franca of tabular data. Every spreadsheet program, database, and analytics tool can read and write CSV. When Elena needs to export donor data from her nonprofit's CRM and import it into her Python script, CSV is usually how that happens.
- JSON (JavaScript Object Notation): The dominant format for structured data on the web. When Dr. Patel queries the NCBI database for gene annotations, the response comes back as JSON. When your weather app asks for a forecast, it gets JSON. When a web page loads dynamic content, it's almost certainly JSON behind the scenes.
- APIs (Application Programming Interfaces): The mechanism by which programs talk to other programs over the internet. APIs are not a data format — they're a protocol, a contract that says: "Send me a request that looks like this, and I'll send you a response that looks like that."
🔗 Connection — Spaced Review (Ch 10): In Chapter 10, you learned `open()`, `read()`, `write()`, and got your first exposure to `csv.reader` and `json.load`. Everything in this chapter builds on that foundation. If those concepts feel rusty, revisit Chapter 10, Sections 10.4-10.5, before continuing.
21.2 CSV Deep Dive
You've seen basic CSV reading and writing in Chapter 10. Now let's handle the real world, where data is messy.
21.2.1 Quick Review: csv.reader and csv.writer
A CSV file is just text with values separated by commas, one record per line:
name,department,hours,rate
Elena Vasquez,Programs,42,28.50
Marcus Chen,Development,38,32.00
Reading it with csv.reader:
import csv
with open("employees.csv", "r", newline="") as f:
reader = csv.reader(f)
header = next(reader) # ['name', 'department', 'hours', 'rate']
for row in reader:
name, dept, hours, rate = row
pay = float(hours) * float(rate)
print(f"{name}: ${pay:.2f}")
# Output:
# Elena Vasquez: $1197.00
# Marcus Chen: $1216.00
Writing with csv.writer:
with open("output.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(["name", "total_pay"])
writer.writerow(["Elena Vasquez", 1197.00])
writer.writerow(["Marcus Chen", 1216.00])
⚠️ Pitfall: Always pass `newline=""` when opening CSV files. Without it, you can get extra blank lines on Windows. This is a subtle, platform-specific bug that will cost you an hour of debugging if you forget.
21.2.2 DictReader and DictWriter: Column Names, Not Indices
Using row[0], row[1], row[2] is fragile. If someone reorders the columns in the CSV, your code breaks silently — it won't crash, it'll just produce wrong results. DictReader solves this by mapping each row to a dictionary:
import csv
with open("employees.csv", "r", newline="") as f:
reader = csv.DictReader(f)
for row in reader:
# row is a dict: {'name': 'Elena Vasquez', 'department': 'Programs', ...}
pay = float(row["hours"]) * float(row["rate"])
print(f"{row['name']} ({row['department']}): ${pay:.2f}")
# Output:
# Elena Vasquez (Programs): $1197.00
# Marcus Chen (Development): $1216.00
Writing dictionaries back to CSV with DictWriter:
import csv
employees = [
{"name": "Elena Vasquez", "department": "Programs", "total_pay": 1197.00},
{"name": "Marcus Chen", "department": "Development", "total_pay": 1216.00},
]
fieldnames = ["name", "department", "total_pay"]
with open("summary.csv", "w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(employees)
✅ Best Practice: Prefer `DictReader`/`DictWriter` over `reader`/`writer` in any code that will be maintained. The small overhead of dictionary lookups is worth the readability and resilience to column reordering.
21.2.3 Edge Cases: Commas, Quotes, and Newlines in Data
What happens when the data itself contains commas?
name,address,city
"O'Brien, James","123 Main St, Apt 4B",Portland
The CSV standard (RFC 4180) handles this with quoting: any field containing a comma, a newline, or a double quote is wrapped in double quotes. If a field contains a double quote, it's escaped by doubling it: "".
import csv
# Data with commas and quotes inside fields
tricky_data = [
["O'Brien, James", '123 Main St, Apt 4B', "Portland"],
["Chen, \"Bobby\"", "456 Oak Ave", "Seattle"],
]
with open("tricky.csv", "w", newline="") as f:
writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
writer.writerow(["name", "address", "city"])
writer.writerows(tricky_data)
# The file will contain:
# name,address,city
# "O'Brien, James","123 Main St, Apt 4B",Portland
# "Chen, ""Bobby""",456 Oak Ave,Seattle
The csv module handles all of this automatically. That's why you should always use the csv module instead of splitting lines on commas manually.
⚠️ Pitfall: Never parse CSV by doing `line.split(",")`. It breaks on the first field that contains a comma. The `csv` module exists precisely because CSV parsing is harder than it looks.
21.2.4 Other Delimiters: TSV and Custom Separators
Not all "CSV" files use commas. Tab-separated values (TSV) are common in scientific data:
import csv
# Reading a tab-separated file
with open("data.tsv", "r", newline="") as f:
reader = csv.reader(f, delimiter="\t")
for row in reader:
print(row)
# Writing with a pipe delimiter
with open("data.pipe", "w", newline="") as f:
writer = csv.writer(f, delimiter="|")
writer.writerow(["name", "score"])
writer.writerow(["Alice", 95])
🔄 Check Your Understanding 1: You receive a CSV file where some names contain commas (like "Smith, John"). Without using the `csv` module, explain why `line.split(",")` would produce incorrect results. Then explain how `csv.reader` handles this correctly.
21.3 JSON Deep Dive
JSON (JavaScript Object Notation) has become the universal language of the web. Despite its JavaScript origins, JSON maps almost perfectly to Python's data types, which is why working with JSON in Python feels natural.
21.3.1 Quick Review: The Four JSON Functions
Python's json module has four core functions — two for files, two for strings:
| Function | Direction | Source/Destination |
|---|---|---|
| `json.load(f)` | JSON file → Python | Reads from a file object |
| `json.dump(obj, f)` | Python → JSON file | Writes to a file object |
| `json.loads(s)` | JSON string → Python | Parses a string |
| `json.dumps(obj)` | Python → JSON string | Produces a string |
💡 Intuition: The `s` in `loads` and `dumps` stands for "string." `load` reads from a file, `loads` reads from a string. `dump` writes to a file, `dumps` writes to a string. Once you internalize this naming convention, you'll never mix them up.
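A quick round-trip exercises all four functions in a few lines. This sketch uses a temporary file so it's self-contained:

```python
import json
import tempfile

record = {"name": "Elena", "hours": 42, "active": True}

# String round-trip: dumps (Python → string), loads (string → Python)
text = json.dumps(record)
assert json.loads(text) == record

# File round-trip: dump (Python → file), load (file → Python)
with tempfile.TemporaryFile("w+") as f:
    json.dump(record, f)
    f.seek(0)               # rewind to the start before reading back
    assert json.load(f) == record
```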
21.3.2 Nested JSON: Real-World Structures
Chapter 10 showed you simple, flat JSON. Real API responses are almost always nested — dictionaries containing lists containing dictionaries:
import json
# A realistic API response (weather data)
weather_response = {
"location": {
"city": "Portland",
"state": "OR",
"coordinates": {"lat": 45.5152, "lon": -122.6784}
},
"current": {
"temp_f": 58.3,
"condition": "Partly cloudy",
"wind_mph": 12.5,
"humidity": 72
},
"forecast": [
{"day": "Monday", "high": 62, "low": 48, "condition": "Rain"},
{"day": "Tuesday", "high": 65, "low": 50, "condition": "Sunny"},
{"day": "Wednesday", "high": 59, "low": 45, "condition": "Cloudy"}
]
}
# Navigating nested JSON: chain the keys/indices
city = weather_response["location"]["city"]
lat = weather_response["location"]["coordinates"]["lat"]
temp = weather_response["current"]["temp_f"]
tuesday_high = weather_response["forecast"][1]["high"]
print(f"Weather in {city} (lat {lat}):")
print(f" Current: {temp}°F")
print(f" Tuesday high: {tuesday_high}°F")
# Output:
# Weather in Portland (lat 45.5152):
# Current: 58.3°F
# Tuesday high: 65°F
21.3.3 Pretty-Printing and Sorting Keys
When you're exploring a large JSON structure, pretty-printing with indent is essential:
import json
data = {"users": [{"name": "Alice", "scores": [92, 88]}, {"name": "Bob", "scores": [85, 79]}]}
# Compact (hard to read)
print(json.dumps(data))
# {"users": [{"name": "Alice", "scores": [92, 88]}, {"name": "Bob", "scores": [85, 79]}]}
# Pretty (easy to explore)
print(json.dumps(data, indent=2))
# {
# "users": [
# {
# "name": "Alice",
# "scores": [
# 92,
# 88
# ]
# },
# ...
# ]
# }
# Sorted keys (useful for diffing/comparing)
print(json.dumps(data, indent=2, sort_keys=True))
21.3.4 JSON Gotchas
A few things that will bite you if you're not watching:
import json
# 1. Tuples become lists (and don't come back)
data = {"coords": (42.3, -71.1)}
round_tripped = json.loads(json.dumps(data))
print(type(round_tripped["coords"])) # <class 'list'>, NOT tuple
# 2. Dict keys must be strings
bad = {42: "answer", (1, 2): "tuple key"}
# json.dumps(bad) # TypeError: keys must be strings
# 3. Sets are not JSON-serializable
bad2 = {"tags": {"python", "coding"}}
# json.dumps(bad2) # TypeError: Object of type set is not JSON serializable
# Fix: convert sets to lists before serialization
good = {"tags": sorted(list({"python", "coding"}))}
print(json.dumps(good)) # {"tags": ["coding", "python"]}
# 4. Python's None becomes JSON's null
print(json.dumps({"value": None})) # {"value": null}
21.3.5 CSV vs. JSON: When to Use Each
| Criterion | CSV | JSON |
|---|---|---|
| Best for | Flat, tabular data | Hierarchical, nested data |
| Readability | Spreadsheet-friendly | Developer-friendly |
| Data types | Everything is a string | Strings, numbers, booleans, null, arrays, objects |
| File size | Smaller (no key repetition) | Larger (keys repeated per record) |
| Tool support | Excel, Google Sheets, pandas | Web APIs, config files, NoSQL databases |
| Nesting | Not supported | Native |
| Schema | Implicit (header row) | Flexible (can vary per record) |
✅ Best Practice: Use CSV when your data is naturally a table (rows and columns, same fields for every record). Use JSON when your data has nesting, variable structure, or needs to preserve types.
🔄 Check Your Understanding 2: Elena has a list of 500 donors, each with a name, email, donation amount, and date. Dr. Patel has protein data where each protein has a name, a variable-length list of associated genes, and nested structural metadata. Which format (CSV or JSON) is better suited for each dataset, and why?
21.4 What Is an API?
Here's where this chapter takes a leap. CSV and JSON are data formats — containers for information. An API (Application Programming Interface) is a way for programs to talk to each other over a network.
21.4.1 The Restaurant Analogy
Think of an API like ordering at a restaurant:
- You (the client) look at the menu (the API documentation) to see what's available.
- You give your order (a request) to the waiter (the HTTP protocol).
- The waiter takes your order to the kitchen (the server), which prepares the food.
- The waiter brings back your meal (the response).
You never go into the kitchen. You don't know how the food is prepared. You don't need to. You just need to know: what can I order (the endpoints), what format does my order need to be in (the request format), and what will I get back (the response format).
That's exactly how APIs work. Your Python program sends a request to a server, and the server sends back data — usually as JSON.
21.4.2 The Request-Response Cycle
Every API interaction follows the same pattern:
┌─────────────┐ HTTP Request ┌─────────────┐
│ │ ──────────────────────────► │ │
│ CLIENT │ GET /api/weather?city=PDX │ SERVER │
│ (your │ │ (remote │
│ Python │ ◄────────────────────────── │ service) │
│ script) │ HTTP Response │ │
│ │ 200 OK │ │
│ │ {"temp": 58.3, ...} │ │
└─────────────┘ └─────────────┘
Key components:
- URL/Endpoint: The address of the resource you want, like `https://api.weather.com/v1/forecast`.
- HTTP Method: What you want to do. `GET` means "give me data." `POST` means "here's data, process it."
- Headers: Metadata about the request (authentication, content type, etc.).
- Query Parameters: Filters or options appended to the URL: `?city=Portland&units=imperial`.
- Response Status Code: A number that tells you whether the request succeeded (200), the resource wasn't found (404), you're not authorized (401), or the server had an error (500).
- Response Body: The actual data, usually JSON.
21.4.3 REST: The Architecture of the Web
Most APIs you'll encounter follow REST (Representational State Transfer) conventions. REST is not a protocol — it's a set of design principles:
- Resources have URLs (`/users`, `/users/42`, `/users/42/posts`)
- HTTP methods map to operations (GET = read, POST = create, PUT = update, DELETE = remove)
- Interactions are stateless (each request contains all the information the server needs)
- Data is typically exchanged as JSON
You don't need to memorize REST theory. What matters for a Python programmer is this: most APIs accept GET requests to a URL and return JSON. That's 90% of what you'll do.
🚪 Threshold Concept: APIs as Connective Tissue
Before this chapter: "Apps are self-contained programs that do everything themselves."
After this chapter: "Every app I use talks to dozens of APIs — weather, maps, payments, authentication, social media, analytics, machine learning... My Python script can talk to those same APIs."
This shift in understanding is profound. Once you see APIs, you see them everywhere. That weather widget on your phone? An API. Google Maps embedded in a website? An API. "Sign in with Google"? An API. Credit card processing? An API. Every "smart" feature in every app you use is probably powered by an API call to a specialized service.
Your Python scripts can make those same calls. That means you can build programs that access real-time weather data, translate text, analyze sentiment, geocode addresses, send emails, generate images, and do anything that any API on the internet offers. The entire internet becomes your function library.
21.5 Making HTTP Requests with requests
Python's standard library has urllib, but it's verbose and awkward. The third-party requests library is the de facto standard for HTTP in Python — used by virtually every professional Python developer.
21.5.1 Installing requests
pip install requests
Or if you're using a virtual environment (as you should — see Chapter 23):
python -m pip install requests
21.5.2 Your First GET Request
Let's start with a free, public API that requires no authentication — the JSONPlaceholder API, which provides fake data for testing:
import requests
response = requests.get("https://jsonplaceholder.typicode.com/posts/1")
print(f"Status code: {response.status_code}")
print(f"Content type: {response.headers['Content-Type']}")
print(f"Response body (first 200 chars): {response.text[:200]}")
# Output:
# Status code: 200
# Content type: application/json; charset=utf-8
# Response body (first 200 chars): {
# "userId": 1,
# "id": 1,
# "title": "sunt aut facere repellat provident occaecati excepturi...",
# "body": "quia et suscipit\nsuscipit..."
# }
21.5.3 The Response Object
The response object returned by requests.get() is rich with information:
import requests
response = requests.get("https://jsonplaceholder.typicode.com/posts/1")
# Status code: did the request succeed?
print(response.status_code) # 200
# Was it successful? (True for any 2xx status code)
print(response.ok) # True
# The raw text of the response
print(type(response.text)) # <class 'str'>
# Parse JSON directly (no need for json.loads!)
data = response.json()
print(type(data)) # <class 'dict'>
print(data["title"]) # sunt aut facere repellat...
# Response headers
print(response.headers["Content-Type"]) # application/json; charset=utf-8
💡 Intuition: `response.json()` is just a convenience for `json.loads(response.text)`. It parses the response body as JSON and returns a Python dict (or list). You'll use it constantly.
21.5.4 Common HTTP Status Codes
| Code | Meaning | What to Do |
|---|---|---|
| 200 | OK — success | Process the response data |
| 201 | Created — resource created | Confirm creation succeeded |
| 400 | Bad Request — your request was malformed | Check your parameters |
| 401 | Unauthorized — invalid or missing credentials | Check your API key |
| 403 | Forbidden — you don't have permission | Check your access level |
| 404 | Not Found — the resource doesn't exist | Check the URL |
| 429 | Too Many Requests — rate limited | Slow down, wait, retry |
| 500 | Internal Server Error — the server broke | Not your fault; retry later |
21.5.5 Query Parameters
Most APIs accept parameters to filter or customize the response:
import requests
# Method 1: Parameters in the URL (works but messy)
response = requests.get(
"https://jsonplaceholder.typicode.com/posts?userId=1&_limit=3"
)
# Method 2: params dict (cleaner, auto-encodes special characters)
params = {"userId": 1, "_limit": 3}
response = requests.get(
"https://jsonplaceholder.typicode.com/posts",
params=params
)
print(f"URL that was actually sent: {response.url}")
# https://jsonplaceholder.typicode.com/posts?userId=1&_limit=3
posts = response.json()
print(f"Got {len(posts)} posts")
for post in posts:
print(f" [{post['id']}] {post['title'][:50]}...")
# Output:
# URL that was actually sent: https://jsonplaceholder.typicode.com/posts?userId=1&_limit=3
# Got 3 posts
# [1] sunt aut facere repellat provident occaecati exc...
# [2] qui est esse...
# [3] ea molestias quasi exercitationem repellat qui i...
✅ Best Practice: Always use the `params` dict instead of manually building query strings. It handles URL encoding automatically — spaces, special characters, and Unicode all get encoded correctly.
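You can see what that encoding looks like without making a network request. Under the hood, requests does something similar to the standard library's `urllib.parse.urlencode`:

```python
from urllib.parse import urlencode

# Spaces, non-ASCII characters, and symbols are percent-encoded automatically
params = {"city": "São Paulo", "units": "metric", "q": "rain & wind"}
query = urlencode(params)
print(query)
# city=S%C3%A3o+Paulo&units=metric&q=rain+%26+wind
```

Hand-building this string correctly for every edge case is exactly the kind of tedious, error-prone work the library exists to do for you.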
21.6 Working with API Responses
Getting a response is only step one. The real skill is extracting useful information from it.
21.6.1 Exploring an API Response
When you encounter a new API, the first thing to do is explore what it returns:
import requests
import json
response = requests.get("https://jsonplaceholder.typicode.com/users/1")
user = response.json()
# Pretty-print to understand the structure
print(json.dumps(user, indent=2))
This reveals the structure:
{
"id": 1,
"name": "Leanne Graham",
"username": "Bret",
"email": "Sincere@april.biz",
"address": {
"street": "Kulas Light",
"suite": "Apt. 556",
"city": "Gwenborough",
"zipcode": "92998-3874",
"geo": {
"lat": "-37.3159",
"lng": "81.1496"
}
},
"phone": "1-770-736-8031 x56442",
"website": "hildegard.org",
"company": {
"name": "Romaguera-Crona",
"catchPhrase": "Multi-layered client-server neural-net",
"bs": "harness real-time e-markets"
}
}
Now you can extract exactly what you need:
print(f"Name: {user['name']}")
print(f"City: {user['address']['city']}")
print(f"Company: {user['company']['name']}")
# Output:
# Name: Leanne Graham
# City: Gwenborough
# Company: Romaguera-Crona
21.6.2 Processing Lists of Results
APIs often return lists. Here's how to process them:
import requests
response = requests.get("https://jsonplaceholder.typicode.com/users")
users = response.json()
print(f"Total users: {len(users)}\n")
# Build a summary report
for user in users:
city = user["address"]["city"]
company = user["company"]["name"]
print(f" {user['name']:<25} {city:<20} {company}")
# Output:
# Total users: 10
#
# Leanne Graham Gwenborough Romaguera-Crona
# Ervin Howell Wisokyburgh Deckow-Crist
# Clementine Bauch McKenziehaven Romaguera-Jacobson
# ...
21.6.3 Dr. Patel's Pattern: API to Analysis
Dr. Patel needs protein information from a public bioinformatics API. Here's the pattern she follows — the same pattern you'll use with any API:
import requests
import json
def fetch_protein_info(protein_id: str) -> dict | None:
"""Fetch protein data from the UniProt API.
Returns the protein data as a dict, or None if the request fails.
"""
url = f"https://rest.uniprot.org/uniprotkb/{protein_id}.json"
response = requests.get(url)
if response.ok:
return response.json()
else:
print(f"Error fetching {protein_id}: {response.status_code}")
return None
# Fetch data for human insulin
data = fetch_protein_info("P01308")
if data:
name = data.get("proteinDescription", {}).get("recommendedName", {}).get(
"fullName", {}).get("value", "Unknown")
organism = data.get("organism", {}).get("scientificName", "Unknown")
length = data.get("sequence", {}).get("length", 0)
print(f"Protein: {name}")
print(f"Organism: {organism}")
print(f"Sequence length: {length} amino acids")
# Output:
# Protein: Insulin
# Organism: Homo sapiens
# Sequence length: 110 amino acids
💡 Intuition: Notice the `.get()` chains with default empty dicts: `data.get("organism", {}).get("scientificName", "Unknown")`. This is a defensive pattern for nested JSON. If any key is missing, you get the default value instead of a `KeyError`. You learned `dict.get()` in Chapter 9 — this is where it really pays off.
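You can feel the difference with a stripped-down response that is missing a section entirely:

```python
# A response missing the "organism" section
data = {"sequence": {"length": 110}}

# Direct indexing raises KeyError on the first missing key
try:
    name = data["organism"]["scientificName"]
except KeyError:
    name = None
print(name)  # None

# Chained .get() with {} defaults degrades gracefully instead
organism = data.get("organism", {}).get("scientificName", "Unknown")
length = data.get("sequence", {}).get("length", 0)
print(organism, length)  # Unknown 110
```

The `{}` default in the middle of the chain is what makes this work: a missing intermediate key yields an empty dict, and the next `.get()` on it safely returns its default.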
21.7 API Authentication and Rate Limits
21.7.1 API Keys
Many APIs require authentication. The simplest form is an API key — a unique string that identifies your application:
import requests
API_KEY = "your_api_key_here" # Get from the API provider's website
# Method 1: API key as a query parameter
response = requests.get(
"https://api.example.com/data",
params={"api_key": API_KEY, "query": "Portland"}
)
# Method 2: API key in a header (more secure, more common)
headers = {"Authorization": f"Bearer {API_KEY}"}
response = requests.get(
"https://api.example.com/data",
headers=headers,
params={"query": "Portland"}
)
⚠️ Pitfall: Never hard-code API keys in your source code, especially if the code will be shared or committed to Git. Use environment variables instead:

```python
import os

API_KEY = os.environ.get("WEATHER_API_KEY")
if not API_KEY:
    raise SystemExit("Set the WEATHER_API_KEY environment variable first.")
```
21.7.2 Rate Limiting
APIs limit how many requests you can make in a given time period. This is called rate limiting. If you exceed the limit, you'll get a 429 Too Many Requests response.
import requests
import time
def fetch_with_retry(url: str, max_retries: int = 3, delay: float = 1.0) -> dict | None:
"""Fetch URL with automatic retry on rate limiting."""
for attempt in range(max_retries):
response = requests.get(url)
if response.ok:
return response.json()
elif response.status_code == 429:
wait = delay * (2 ** attempt) # Exponential backoff: 1, 2, 4 sec
print(f"Rate limited. Waiting {wait:.0f} seconds...")
time.sleep(wait)
else:
print(f"Error: {response.status_code}")
return None
print("Max retries exceeded.")
return None
💡 Intuition: Exponential backoff means you wait longer each time you're rate-limited: 1 second, then 2, then 4. This gives the API server time to recover and makes your requests more likely to succeed on retry.
21.7.3 Handling Errors Gracefully
🔗 Connection — Spaced Review (Ch 11): Remember EAFP — Easier to Ask Forgiveness than Permission? That philosophy applies perfectly to API calls. Don't try to predict every error; catch and handle them:
import requests
def safe_api_call(url: str, params: dict | None = None) -> dict | None:
"""Make an API call with comprehensive error handling."""
try:
response = requests.get(url, params=params, timeout=10)
response.raise_for_status() # Raises exception for 4xx/5xx
return response.json()
except requests.exceptions.Timeout:
print("Request timed out. The server might be slow.")
except requests.exceptions.ConnectionError:
print("Connection failed. Check your internet connection.")
except requests.exceptions.HTTPError as e:
print(f"HTTP error: {e.response.status_code}")
except requests.exceptions.JSONDecodeError:
print("Response was not valid JSON.")
return None
🐛 Debugging Walkthrough: Common API Errors
Symptom: `401 Unauthorized`
Cause: Missing or invalid API key. Double-check you're passing the key correctly — some APIs want it in a header, others as a query parameter.
Fix: Verify the key value, check if it's expired, and confirm the header/parameter name matches the API docs.

Symptom: `404 Not Found`
Cause: The URL is wrong, or the specific resource doesn't exist. Maybe you have `/users/999` but there's no user 999.
Fix: Check the URL against the API documentation. Try the URL in a browser.

Symptom: `429 Too Many Requests`
Cause: You're sending requests too fast. Common when looping over many items.
Fix: Add `time.sleep()` between requests. Implement exponential backoff. Check the API's rate limit documentation — many APIs include a `Retry-After` header telling you exactly how long to wait.
21.8 Combining Data from Multiple Sources
This is where everything comes together. Real-world data work almost always involves merging data from multiple sources.
21.8.1 Elena's Multi-Source Report
Elena's nonprofit needs a weekly report that combines:
1. A CSV file of donation records from the CRM
2. A JSON file of program metrics from the internal dashboard
3. API data from a geocoding service to map donor locations
Here's the pattern — a data pipeline:
import csv
import json
import requests
# ── Step 1: Load CSV data ──────────────────────────────
def load_donations(csv_path: str) -> list[dict]:
"""Load donation records from a CSV file."""
donations = []
with open(csv_path, "r", newline="") as f:
reader = csv.DictReader(f)
for row in reader:
row["amount"] = float(row["amount"])
donations.append(row)
return donations
# ── Step 2: Load JSON data ─────────────────────────────
def load_program_metrics(json_path: str) -> dict:
"""Load program metrics from a JSON file."""
with open(json_path, "r") as f:
return json.load(f)
# ── Step 3: Enrich with API data ───────────────────────
def enrich_with_location(donations: list[dict]) -> list[dict]:
"""Add city/state info based on zip code (mock implementation)."""
# In production, this would call a geocoding API.
# Here we simulate the pattern with a lookup dict.
zip_lookup = {
"97201": {"city": "Portland", "state": "OR"},
"97202": {"city": "Portland", "state": "OR"},
"98101": {"city": "Seattle", "state": "WA"},
"94102": {"city": "San Francisco", "state": "CA"},
}
for donation in donations:
location = zip_lookup.get(donation.get("zip", ""), {})
donation["city"] = location.get("city", "Unknown")
donation["state"] = location.get("state", "Unknown")
return donations
# ── Step 4: Aggregate and report ───────────────────────
def generate_report(donations: list[dict], metrics: dict) -> str:
"""Generate a summary report from combined data."""
total = sum(d["amount"] for d in donations)
by_city: dict[str, float] = {}
for d in donations:
by_city[d["city"]] = by_city.get(d["city"], 0) + d["amount"]
lines = ["=== Weekly Report ===", ""]
lines.append(f"Total donations: ${total:,.2f}")
lines.append(f"Number of donors: {len(donations)}")
lines.append(f"Programs active: {metrics.get('active_programs', 'N/A')}")
lines.append("")
lines.append("Donations by city:")
for city, amount in sorted(by_city.items(), key=lambda x: -x[1]):
lines.append(f" {city:<20} ${amount:>10,.2f}")
return "\n".join(lines)
This is the data pipeline pattern: Load → Clean → Enrich → Aggregate → Output. You'll see it everywhere in data engineering, and it's built from the basic tools you already know.
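Here is the same pattern condensed into a self-contained sketch with inline data, so you can run the whole pipeline in one go (the donation records and zip table are made up for illustration):

```python
import csv
import io

# LOAD: parse CSV text (inline here; normally read from a file)
raw = "name,amount,zip\nAva,50,97201\nBen,120,98101\nCara,75,97201\n"
donations = list(csv.DictReader(io.StringIO(raw)))

# CLEAN: fix types (CSV gives us strings)
for d in donations:
    d["amount"] = float(d["amount"])

# ENRICH: add city from a lookup (stand-in for a geocoding API call)
zip_lookup = {"97201": "Portland", "98101": "Seattle"}
for d in donations:
    d["city"] = zip_lookup.get(d["zip"], "Unknown")

# AGGREGATE: total per city
by_city: dict[str, float] = {}
for d in donations:
    by_city[d["city"]] = by_city.get(d["city"], 0) + d["amount"]

# OUTPUT: a tiny report
for city, total in sorted(by_city.items()):
    print(f"{city}: ${total:.2f}")
# Portland: $125.00
# Seattle: $120.00
```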
21.8.2 The Data Pipeline Pattern
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ LOAD │ → │ CLEAN │ → │ ENRICH │ → │ AGGREGATE│ → │ OUTPUT │
│ │ │ │ │ │ │ │ │ │
│ CSV file │ │ Fix types│ │ Add API │ │ Sum, │ │ Report, │
│ JSON file│ │ Remove │ │ data │ │ count, │ │ CSV, │
│ API call │ │ bad rows │ │ Merge │ │ group │ │ JSON │
└──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘
🔗 Connection — Spaced Review (Ch 17): When processing large datasets, Big-O thinking matters. If you're looking up each donation's zip code against a list, that's O(n*m). Using a dictionary lookup (as we did above) makes it O(n). The data structure choices from Chapter 9 and the algorithmic thinking from Chapter 17 become critical when your dataset has 100,000 rows instead of 10.
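The difference is easy to see in code: scanning a list of zip records for every donation versus building a dict once and doing O(1) lookups. The data here is hypothetical; the shape of the code is what matters:

```python
zip_rows = [("97201", "Portland"), ("98101", "Seattle"), ("94102", "San Francisco")]
donations = [{"zip": "98101"}, {"zip": "97201"}, {"zip": "97201"}]

# O(n*m): scan the whole list for every donation
for d in donations:
    d["city_slow"] = next((city for z, city in zip_rows if z == d["zip"]), "Unknown")

# O(n): build the dict once, then each lookup is O(1)
zip_lookup = dict(zip_rows)
for d in donations:
    d["city"] = zip_lookup.get(d["zip"], "Unknown")

print([d["city"] for d in donations])  # ['Seattle', 'Portland', 'Portland']
```

With 10 donations the two are indistinguishable; with 100,000 donations against a large zip table, the list-scan version can be thousands of times slower.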
21.9 Common Data Processing Patterns
Once you can load data from CSV, JSON, and APIs, you'll find yourself using the same processing patterns over and over.
21.9.1 Filtering
Select only the records that meet a condition:
# Filter donations over $100
large_donations = [d for d in donations if d["amount"] > 100]
# Filter API results to a specific category
posts = requests.get("https://jsonplaceholder.typicode.com/posts").json()
user_1_posts = [p for p in posts if p["userId"] == 1]
21.9.2 Transforming
Change the shape or content of each record:
# Transform raw API data into a cleaner format
raw_users = requests.get("https://jsonplaceholder.typicode.com/users").json()
clean_users = [
{
"name": u["name"],
"email": u["email"],
"city": u["address"]["city"],
"company": u["company"]["name"],
}
for u in raw_users
]
21.9.3 Aggregating
Compute summaries across records:
# Count by category
from collections import Counter
categories = [d["category"] for d in donations]
category_counts = Counter(categories)
print(category_counts.most_common(3))
# [('individual', 45), ('corporate', 12), ('foundation', 8)]
# Sum by group
totals_by_state: dict[str, float] = {}
for d in donations:
totals_by_state[d["state"]] = totals_by_state.get(d["state"], 0) + d["amount"]
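The `.get(..., 0)` idiom works fine; `collections.defaultdict` expresses the same grouping without spelling out the default each time. A sketch with hypothetical donation records:

```python
from collections import defaultdict

donations = [  # hypothetical sample records
    {"state": "TX", "amount": 50.0},
    {"state": "CA", "amount": 20.0},
    {"state": "TX", "amount": 30.0},
]

# defaultdict(float) supplies 0.0 automatically for unseen keys
totals_by_state: dict[str, float] = defaultdict(float)
for d in donations:
    totals_by_state[d["state"]] += d["amount"]

print(dict(totals_by_state))  # {'TX': 80.0, 'CA': 20.0}
```

Either form is idiomatic; `defaultdict` pays off when the grouping line is repeated or the default value is something bulkier than zero, like a list.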
21.9.4 Exporting
Write results back to CSV or JSON:
import csv
import json
# Export to CSV
with open("report.csv", "w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=["name", "email", "city", "company"])
writer.writeheader()
writer.writerows(clean_users)
# Export to JSON
with open("report.json", "w") as f:
json.dump(clean_users, f, indent=2)
print(f"Exported {len(clean_users)} records to CSV and JSON")
🧩 Productive Struggle: Before reading further, try this: you need weather data for 50 cities across the US. You have the city names in a CSV file. How would you get the weather data? Think about: (1) where would the weather data come from, (2) how would you match it to your cities, and (3) what would the output look like? Sketch out the steps on paper before looking at the project checkpoint.
🔄 Check Your Understanding 3: Describe the five stages of a data pipeline. For Elena's weekly report, identify which specific data source (CSV, JSON, or API) is used at each stage and what processing happens.
21.10 Project Checkpoint: TaskFlow v2.0
TaskFlow v2.0 adds data interchange capabilities — your task manager can now export and import data in both CSV and JSON formats, and it can fetch inspirational quotes from a free public API to display alongside your tasks.
What's New in v2.0
Building on v1.9 (undo/redo stack and task queue from Chapter 20):
- CSV Export/Import: Export your tasks to a CSV file that you can open in Excel or Google Sheets. Import tasks from a CSV file that someone else created.
- JSON Export/Import: Export tasks as JSON for backup or sharing with other programs.
- API Integration: Fetch a random inspirational quote from a free API to display when you start the app or when you list tasks.
Key Design Decisions
- Export functions write files to the same directory as the TaskFlow data file.
- Import functions add to existing tasks (they don't replace them).
- The quote API call is wrapped in a try/except so the app works even without internet.
- CSV export uses DictWriter so the file has clear column headers.
The Code
"""
TaskFlow v2.0 — Data Interchange & API Integration
Introduction to Computer Science with Python — Chapter 21
New in v2.0:
- Export tasks to CSV
- Import tasks from CSV
- Export tasks to JSON (pretty-printed)
- Import tasks from JSON
- Fetch inspirational quote from free API on startup
Building on v1.9 (Chapter 20):
- Undo/redo stack
- Task queue
"""
import csv
import json
import requests
import time
from datetime import datetime
from pathlib import Path
DATA_FILE = Path(__file__).parent / "taskflow_data.json"
EXPORT_DIR = Path(__file__).parent
# ── Persistence (from v0.9) ──────────────────────────────
def load_tasks(path: Path) -> list[dict]:
if not path.exists():
return []
try:
with open(path, "r") as f:
return json.load(f)
except json.JSONDecodeError:
print(" Warning: data file corrupted. Starting fresh.")
return []
def save_tasks(tasks: list[dict], path: Path) -> None:
with open(path, "w") as f:
json.dump(tasks, f, indent=2)
# ── NEW: CSV Export/Import ───────────────────────────────
CSV_FIELDS = ["title", "priority", "category", "done", "created"]
def export_csv(tasks: list[dict]) -> None:
if not tasks:
print(" No tasks to export.")
return
path = EXPORT_DIR / "taskflow_export.csv"
with open(path, "w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=CSV_FIELDS,
extrasaction="ignore")
writer.writeheader()
writer.writerows(tasks)
print(f" Exported {len(tasks)} tasks to {path.name}")
def import_csv(tasks: list[dict]) -> None:
path = EXPORT_DIR / "taskflow_export.csv"
if not path.exists():
print(f" No file found: {path.name}")
return
count = 0
with open(path, "r", newline="") as f:
reader = csv.DictReader(f)
for row in reader:
row["done"] = row.get("done", "False").lower() == "true"
tasks.append(row)
count += 1
save_tasks(tasks, DATA_FILE)
print(f" Imported {count} tasks from {path.name}")
# ── NEW: JSON Export/Import ──────────────────────────────
def export_json(tasks: list[dict]) -> None:
if not tasks:
print(" No tasks to export.")
return
path = EXPORT_DIR / "taskflow_export.json"
with open(path, "w") as f:
json.dump(tasks, f, indent=2)
print(f" Exported {len(tasks)} tasks to {path.name}")
def import_json(tasks: list[dict]) -> None:
path = EXPORT_DIR / "taskflow_export.json"
if not path.exists():
print(f" No file found: {path.name}")
return
try:
with open(path, "r") as f:
imported = json.load(f)
tasks.extend(imported)
save_tasks(tasks, DATA_FILE)
print(f" Imported {len(imported)} tasks from {path.name}")
except json.JSONDecodeError:
print(f" Error: {path.name} is not valid JSON.")
# ── NEW: API Integration ─────────────────────────────────
def fetch_quote() -> str | None:
"""Fetch an inspirational quote from a free API."""
try:
response = requests.get(
"https://dummyjson.com/quotes/random",
timeout=5
)
if response.ok:
data = response.json()
return f'"{data["quote"]}" — {data["author"]}'
except (requests.exceptions.RequestException, KeyError):
pass
return None
# ── Task operations (carried forward) ────────────────────
PRIORITIES = {"high": 1, "medium": 2, "low": 3}
def add_task(tasks: list[dict]) -> None:
title = input(" Task title: ").strip()
if not title:
print(" Title cannot be empty.")
return
priority = input(" Priority (high/medium/low) [medium]: ").strip().lower()
if priority not in PRIORITIES:
priority = "medium"
category = input(" Category [general]: ").strip() or "general"
task = {
"title": title,
"priority": priority,
"category": category,
"done": False,
"created": datetime.now().strftime("%Y-%m-%d %H:%M"),
}
tasks.append(task)
save_tasks(tasks, DATA_FILE)
print(f" Added: '{title}' [{priority}, {category}]")
def list_tasks(tasks: list[dict]) -> None:
if not tasks:
print(" No tasks yet.")
return
sorted_tasks = sorted(tasks, key=lambda t: (
PRIORITIES.get(t.get("priority", "medium"), 99),
t.get("created", ""),
))
print(f"\n {'#':<4} {'Done':<6} {'Pri':<8} {'Category':<12} {'Title'}")
print(f" {'—' * 4} {'—' * 5} {'—' * 7} {'—' * 11} {'—' * 20}")
for i, task in enumerate(sorted_tasks, start=1):
check = "[x]" if task.get("done") else "[ ]"
print(f" {i:<4} {check:<6} {task.get('priority', '?'):<8} "
f"{task.get('category', '?'):<12} {task.get('title', '?')}")
done = sum(1 for t in tasks if t.get("done"))
print(f"\n {done}/{len(tasks)} complete")
# ── Main loop ────────────────────────────────────────────
MENU = """
╔══════════════════════════════════╗
║ TaskFlow v2.0 ║
║ CSV · JSON · API Integration ║
╚══════════════════════════════════╝
1) Add task 5) Export CSV
2) List tasks 6) Import CSV
3) Complete task 7) Export JSON
4) Delete task 8) Import JSON
9) Fetch new quote
0) Save & quit
"""
def main() -> None:
print("\n Welcome to TaskFlow v2.0!")
tasks = load_tasks(DATA_FILE)
print(f" {len(tasks)} task(s) loaded.\n")
quote = fetch_quote()
if quote:
print(f" 💡 {quote}\n")
else:
print(" (Could not fetch quote — check internet connection)\n")
while True:
print(MENU)
choice = input(" Choose (0-9): ").strip()
match choice:
case "1":
add_task(tasks)
case "2":
list_tasks(tasks)
case "3":
list_tasks(tasks)
if tasks:
try:
num = int(input("\n Task # to complete: "))
sorted_tasks = sorted(tasks, key=lambda t: (
PRIORITIES.get(t.get("priority", "medium"), 99),
t.get("created", ""),
))
if 1 <= num <= len(sorted_tasks):
sorted_tasks[num - 1]["done"] = True
save_tasks(tasks, DATA_FILE)
print(f" Completed: '{sorted_tasks[num-1]['title']}'")
else:
print(" Invalid task number.")
except ValueError:
print(" Please enter a number.")
case "4":
list_tasks(tasks)
if tasks:
try:
num = int(input("\n Task # to delete: "))
sorted_tasks = sorted(tasks, key=lambda t: (
PRIORITIES.get(t.get("priority", "medium"), 99),
t.get("created", ""),
))
if 1 <= num <= len(sorted_tasks):
target = sorted_tasks[num - 1]
tasks.remove(target)
save_tasks(tasks, DATA_FILE)
print(f" Deleted: '{target['title']}'")
else:
print(" Invalid task number.")
except ValueError:
print(" Please enter a number.")
case "5":
export_csv(tasks)
case "6":
import_csv(tasks)
case "7":
export_json(tasks)
case "8":
import_json(tasks)
case "9":
quote = fetch_quote()
if quote:
print(f"\n 💡 {quote}\n")
else:
print("\n Could not fetch quote.\n")
case "0":
save_tasks(tasks, DATA_FILE)
                print("\n Tasks saved. Goodbye!")
break
case _:
print(" Invalid choice. Enter 0-9.")
if __name__ == "__main__":
main()
What to Notice
- CSV export uses extrasaction="ignore" — if tasks have extra fields (like from the undo stack), the CSV writer silently ignores them instead of crashing.
- CSV import converts the done field from a string ("True"/"False") back to a boolean. CSV files store everything as strings — you always need to convert types when reading.
- The quote API call is wrapped in a try/except — the app works perfectly without internet. API integration should enhance the experience, not gatekeep it.
- JSON export/import is simpler than CSV — JSON preserves types natively, so you don't need type conversion after import.
🔗 Bridge to Chapter 22: In the next chapter, you'll learn regular expressions — powerful pattern matching that lets TaskFlow v2.1 support advanced search queries like "find all tasks containing a date" or "find tasks matching 'bug*fix'." Regex will also let you validate data formats (like email addresses and phone numbers) before importing them.
Chapter Summary
This chapter transformed your Python programs from isolated scripts into connected, data-aware applications. You mastered three essential skills:
CSV mastery: You moved beyond basic reading and writing to handle edge cases — commas in data, quoting, custom delimiters, and the critical newline="" parameter. You learned to prefer DictReader/DictWriter for maintainable code.
JSON fluency: You worked with deeply nested JSON structures like those returned by real APIs, used defensive .get() chains to avoid KeyError crashes, and learned when JSON is the better choice over CSV.
API fundamentals: You made HTTP requests with the requests library, parsed JSON responses, handled errors gracefully, and implemented rate limiting with exponential backoff. Most importantly, you internalized the threshold concept: APIs are the connective tissue of the internet, and your Python scripts can talk to any of them.
The data pipeline pattern — Load, Clean, Enrich, Aggregate, Output — is the backbone of data engineering, and you've now built one from scratch. In Chapter 24, you'll extend this to web scraping, and in the capstone projects, you'll build complete data-driven applications.
Key Vocabulary
| Term | Definition |
|---|---|
| API | Application Programming Interface — a set of rules for programs to communicate with each other |
| API key | A unique string that authenticates your application when making API requests |
| authentication | The process of proving your identity to an API, usually via an API key or token |
| CSV | Comma-Separated Values — a plain-text format for tabular data |
| data pipeline | A sequence of steps that load, clean, transform, and output data |
| DictReader | A csv module class that reads CSV rows as dictionaries keyed by column name |
| DictWriter | A csv module class that writes dictionaries as CSV rows |
| GET | An HTTP method that requests data from a server |
| HTTP | HyperText Transfer Protocol — the protocol for communication on the web |
| JSON | JavaScript Object Notation — a text format for structured data interchange |
| pagination | Splitting large API responses across multiple pages/requests |
| POST | An HTTP method that sends data to a server for processing |
| rate limiting | Restrictions on how many API requests you can make in a given time period |
| requests | A third-party Python library for making HTTP requests |
| response | The data a server sends back after receiving a request |
| REST | Representational State Transfer — architectural principles for web API design |
| status code | A three-digit number indicating the result of an HTTP request (200 = OK, 404 = Not Found, etc.) |