Chapter 13 Exercises: Getting Data from the Web
How to use these exercises: Work through the sections in order. Parts A-D focus on Chapter 13 material, building from recall to original analysis. Part E presents ethical scenarios requiring judgment, not just code. Part M mixes in concepts from earlier chapters to reinforce retention. You'll need Python, the `requests` and `beautifulsoup4` libraries, and internet access for most problems.
Difficulty key: 1-star: Foundational | 2-star: Intermediate | 3-star: Advanced | 4-star: Extension
Part A: Conceptual Understanding (1-star)
These questions check whether you absorbed the core ideas from the chapter. Write clear, concise answers.
Exercise 13.1 — The request-response cycle
Explain the HTTP request-response cycle in plain English, using the analogy of ordering food at a restaurant. Map each step of the analogy to a technical concept (request, server, response, status code).
Guidance
A reasonable analogy: You (the client) walk up to the counter and place an order (send a request). The kitchen (the server) receives your order and prepares the food (processes the request). The kitchen hands you a tray with your food and a receipt (the response). The receipt shows either "order complete" (status 200), "we're out of that item" (status 404), or "please wait, we're too busy" (status 429). The order itself specifies what you want (URL, parameters), and your loyalty card identifies you (authentication/API key).
Exercise 13.2 — API vs. scraping
In your own words, explain the difference between using an API and web scraping. List two advantages of APIs over scraping and one scenario where scraping might be necessary.
Guidance
**APIs** are official, structured interfaces that return data in a machine-readable format (usually JSON). **Web scraping** involves downloading HTML web pages and parsing them to extract data. API advantages: (1) The data is structured and documented, so you don't have to guess the HTML layout; (2) APIs are designed for programmatic access, so they're more reliable — website redesigns break scrapers but don't affect APIs. Scraping scenario: The data is only available on a web page with no API (e.g., a small government agency that publishes tables on their website but offers no data download or API).
Exercise 13.3 — Status codes
Match each HTTP status code to its meaning and the appropriate action:
| Code | Meaning? | What should your code do? |
|---|---|---|
| 200 | ||
| 401 | ||
| 404 | ||
| 429 | ||
| 500 |
Guidance
| Code | Meaning | Action |
|------|---------|--------|
| 200 | OK, success | Parse the response data |
| 401 | Unauthorized | Check your API key or credentials |
| 404 | Not found | Check the URL for typos or verify the endpoint exists |
| 429 | Too many requests | Wait (check the Retry-After header), then retry |
| 500 | Internal server error | Wait and retry later; the problem is on the server's end |

Exercise 13.4 — Rate limiting
Explain what rate limiting is and why it exists. Then explain why you should add time.sleep() between API requests even if the API documentation doesn't mention a rate limit.
Guidance
Rate limiting is a policy that restricts how many requests a client can make in a time period (e.g., 60 per minute). It exists to protect the server from being overwhelmed and to ensure fair access for all users. You should add delays even without documented limits because: (1) undocumented limits may still exist and trigger 429 responses; (2) rapid-fire requests from a single client are poor netiquette and can cause problems for small servers; (3) being a courteous user reduces the chance of getting your IP or API key banned.
Exercise 13.5 — robots.txt
What is a robots.txt file, where is it typically located, and what does it tell you? Is it legally binding?
Guidance
`robots.txt` is a plain text file found at the root of a website (e.g., `https://example.com/robots.txt`). It specifies rules for automated programs (bots, scrapers) about which parts of the site they are allowed or disallowed from accessing, and sometimes includes a crawl delay. It is generally *not* legally binding — it's a convention, like a "please keep off the grass" sign — but ignoring it is considered unethical in the web community and may be used as evidence of bad faith in legal proceedings.
Part B: Applied Practice (2-star)
These problems require writing code. Test your solutions with real or simulated data.
Exercise 13.6 — Making a GET request
Write a Python script that:
1. Makes a GET request to https://httpbin.org/get (a free API for testing HTTP requests)
2. Checks the status code
3. Prints the JSON response
4. Extracts and prints the origin field (your IP address) from the response
Guidance
```python
import requests

response = requests.get('https://httpbin.org/get')
if response.status_code == 200:
    data = response.json()
    print(data)
    print(f"My IP: {data['origin']}")
else:
    print(f"Error: {response.status_code}")
```
`httpbin.org` is a real, free service specifically designed for testing HTTP requests. It echoes back information about your request, including your IP address, headers, and parameters.
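A related shortcut not used in the solution above: `requests` responses offer `raise_for_status()`, which raises `requests.HTTPError` for any 4xx/5xx code so you can handle all failures in one `except` block. The sketch below constructs a `Response` object by hand purely so it runs offline — in real code the object comes back from `requests.get()`.

```python
import requests

# Simulate a failed response offline; normally this object
# would come back from requests.get(...).
resp = requests.models.Response()
resp.status_code = 404

try:
    resp.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
    print("Success")
except requests.HTTPError as err:
    print(f"Request failed: {err}")
```

For a 2xx status, `raise_for_status()` does nothing and execution continues normally.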
Exercise 13.7 — Query parameters
Using https://httpbin.org/get, make a request that passes three query parameters: name=Jordan, course=DataScience, chapter=13. Print the URL that requests constructed and verify the parameters appear in the JSON response.
Guidance
```python
import requests

params = {
    'name': 'Jordan',
    'course': 'DataScience',
    'chapter': 13
}
response = requests.get('https://httpbin.org/get', params=params)
print(f"URL: {response.url}")
data = response.json()
print(f"Args: {data['args']}")
```
The `args` field in the response should contain your parameters. The URL should show `?name=Jordan&course=DataScience&chapter=13`.
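Under the hood, `requests` builds that query string for you. The standard library's `urllib.parse.urlencode` does the same encoding by hand, which is a useful way to see exactly what gets appended to the URL — a minimal sketch, no network needed:

```python
from urllib.parse import urlencode

params = {'name': 'Jordan', 'course': 'DataScience', 'chapter': 13}

# urlencode percent-encodes keys/values and joins pairs with '&'
query = urlencode(params)
print(query)  # name=Jordan&course=DataScience&chapter=13

full_url = f"https://httpbin.org/get?{query}"
print(full_url)
```

This is also handy for debugging: if a request misbehaves, print the encoded query to confirm the parameters are what you think they are.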
Exercise 13.8 — Handling errors gracefully
Write a function called safe_request(url) that:
1. Makes a GET request with a 10-second timeout
2. Returns the JSON data if successful (status 200)
3. Prints a helpful error message for status codes 401, 404, 429, and 500
4. Returns None for any error
5. Catches requests.RequestException for connection errors
Guidance
```python
import requests

def safe_request(url, params=None):
    """Make a GET request with error handling."""
    try:
        resp = requests.get(url, params=params, timeout=10)
    except requests.RequestException as e:
        print(f"Connection error: {e}")
        return None
    if resp.status_code == 200:
        return resp.json()
    elif resp.status_code == 401:
        print("Unauthorized: check your API key")
    elif resp.status_code == 404:
        print("Not found: check the URL")
    elif resp.status_code == 429:
        print("Rate limited: slow down requests")
    elif resp.status_code == 500:
        print("Server error: try again later")
    else:
        print(f"Unexpected status: {resp.status_code}")
    return None
```
Exercise 13.9 — Parsing JSON responses into DataFrames
The following JSON represents a simplified API response. Write code to convert it into a pandas DataFrame with columns for id, name, location_city, and location_country:
```json
{
  "status": "ok",
  "count": 3,
  "data": [
    {"id": 1, "name": "Station A",
     "location": {"city": "Portland", "country": "US"}},
    {"id": 2, "name": "Station B",
     "location": {"city": "Vancouver", "country": "CA"}},
    {"id": 3, "name": "Station C",
     "location": {"city": "London", "country": "UK"}}
  ]
}
```
Guidance
```python
import pandas as pd

response_data = {
    "status": "ok",
    "count": 3,
    "data": [
        {"id": 1, "name": "Station A",
         "location": {"city": "Portland", "country": "US"}},
        {"id": 2, "name": "Station B",
         "location": {"city": "Vancouver", "country": "CA"}},
        {"id": 3, "name": "Station C",
         "location": {"city": "London", "country": "UK"}}
    ]
}

# Extract the data array, then normalize
records = response_data['data']
df = pd.json_normalize(records, sep='_')
print(df)
```
Key insight: the actual data is inside `response_data['data']` — you need to extract it before normalizing. The wrapper (`status`, `count`) is metadata, not data.
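`json_normalize` can also dig the nested list out for you via its `record_path` argument, skipping the manual extraction step. A sketch using the same `response_data` dict (re-declared here so the snippet stands alone):

```python
import pandas as pd

response_data = {
    "status": "ok",
    "count": 3,
    "data": [
        {"id": 1, "name": "Station A",
         "location": {"city": "Portland", "country": "US"}},
        {"id": 2, "name": "Station B",
         "location": {"city": "Vancouver", "country": "CA"}},
        {"id": 3, "name": "Station C",
         "location": {"city": "London", "country": "UK"}}
    ]
}

# record_path points json_normalize at the nested list directly
df = pd.json_normalize(response_data, record_path='data', sep='_')
print(list(df.columns))  # ['id', 'name', 'location_city', 'location_country']
```

Either route works; `record_path` becomes more valuable when the wrapper is deeply nested or when you also want wrapper fields carried onto each row (see `json_normalize`'s `meta` argument).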
Exercise 13.10 — BeautifulSoup basics
Given this HTML fragment, write BeautifulSoup code to extract all product names and prices into a list of dictionaries:
```html
<div class="products">
  <div class="product">
    <span class="name">Widget A</span>
    <span class="price">$12.99</span>
  </div>
  <div class="product">
    <span class="name">Widget B</span>
    <span class="price">$8.50</span>
  </div>
  <div class="product">
    <span class="name">Widget C</span>
    <span class="price">$15.00</span>
  </div>
</div>
```
Guidance
```python
from bs4 import BeautifulSoup

html = """..."""  # The HTML above

soup = BeautifulSoup(html, 'html.parser')
products = soup.find_all('div', class_='product')
data = []
for product in products:
    name = product.find('span', class_='name').text
    price = product.find('span', class_='price').text
    data.append({'name': name, 'price': price})
print(data)
# [{'name': 'Widget A', 'price': '$12.99'}, ...]
```
Note: `class_` has an underscore because `class` is a reserved word in Python.
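CSS selectors offer an equivalent route that sidesteps the `class_` quirk entirely: `soup.select()` and `select_one()` accept the same selector syntax a browser does. A sketch on the same fragment (HTML repeated so the snippet runs on its own):

```python
from bs4 import BeautifulSoup

html = """
<div class="products">
  <div class="product">
    <span class="name">Widget A</span>
    <span class="price">$12.99</span>
  </div>
  <div class="product">
    <span class="name">Widget B</span>
    <span class="price">$8.50</span>
  </div>
  <div class="product">
    <span class="name">Widget C</span>
    <span class="price">$15.00</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# 'div.product' matches <div class="product">; '.name' matches by class
data = [
    {'name': p.select_one('.name').text,
     'price': p.select_one('.price').text}
    for p in soup.select('div.product')
]
print(data)
```

Which style to use is mostly taste; `select()` shines when you need nested or combined selectors like `'div.products span.price'`.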
Exercise 13.11 — HTML table scraping
Write code using BeautifulSoup to extract data from this HTML table into a pandas DataFrame:
```html
<table>
  <tr><th>Country</th><th>Capital</th><th>Population</th></tr>
  <tr><td>France</td><td>Paris</td><td>67390000</td></tr>
  <tr><td>Germany</td><td>Berlin</td><td>83240000</td></tr>
  <tr><td>Spain</td><td>Madrid</td><td>47420000</td></tr>
</table>
```
Then compare your approach with pd.read_html().
Guidance
**BeautifulSoup approach:**
```python
from bs4 import BeautifulSoup
import pandas as pd

html = """..."""  # The HTML above

soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('tr')
headers = [th.text for th in rows[0].find_all('th')]
data = []
for row in rows[1:]:
    cells = [td.text for td in row.find_all('td')]
    data.append(cells)
df = pd.DataFrame(data, columns=headers)
```
**pd.read_html approach:**
```python
tables = pd.read_html(html)
df = tables[0]
```
The `pd.read_html()` approach is dramatically simpler for standard tables. (Recent pandas versions expect a file-like object rather than a raw string, so you may need `pd.read_html(StringIO(html))`.) Use BeautifulSoup when data isn't in a `<table>` element.
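One practical difference between the two routes is worth noting: the BeautifulSoup version leaves every column as strings (all cells come from `.text`), while `pd.read_html()` infers numeric dtypes for you. A sketch of the manual conversion, with the scraped values hard-coded so it runs standalone:

```python
import pandas as pd

# What the BeautifulSoup route produces: every column is a string
df = pd.DataFrame(
    [['France', 'Paris', '67390000'],
     ['Germany', 'Berlin', '83240000'],
     ['Spain', 'Madrid', '47420000']],
    columns=['Country', 'Capital', 'Population'])

# Convert before doing any arithmetic; strings would concatenate, not sum
df['Population'] = df['Population'].astype(int)
print(df['Population'].sum())  # 198050000
```

Forgetting this step is a classic scraping bug: sorting or summing string "numbers" silently gives wrong answers.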