Chapter 13 Exercises: Getting Data from the Web

How to use these exercises: Work through the sections in order. Parts A-D focus on Chapter 13 material, building from recall to original analysis. Part E presents ethical scenarios requiring judgment, not just code. Part M mixes in concepts from earlier chapters to reinforce retention. You'll need Python, the requests and beautifulsoup4 libraries, and internet access for most problems.

Difficulty key: ⭐ (1-star): Foundational | ⭐⭐ (2-star): Intermediate | ⭐⭐⭐ (3-star): Advanced | ⭐⭐⭐⭐ (4-star): Extension


Part A: Conceptual Understanding (1-star)

These questions check whether you absorbed the core ideas from the chapter. Write clear, concise answers.


Exercise 13.1 The request-response cycle

Explain the HTTP request-response cycle in plain English, using the analogy of ordering food at a restaurant. Map each step of the analogy to a technical concept (request, server, response, status code).

Guidance A reasonable analogy: You (the client) walk up to the counter and place an order (send a request). The kitchen (the server) receives your order and prepares the food (processes the request). The kitchen hands you a tray with your food and a receipt (the response). The receipt shows either "order complete" (status 200), "we're out of that item" (status 404), or "please wait, we're too busy" (status 429). The order itself specifies what you want (URL, parameters), and your loyalty card identifies you (authentication/API key).

Exercise 13.2 API vs. scraping

In your own words, explain the difference between using an API and web scraping. List two advantages of APIs over scraping and one scenario where scraping might be necessary.

Guidance **APIs** are official, structured interfaces that return data in a machine-readable format (usually JSON). **Web scraping** involves downloading HTML web pages and parsing them to extract data. API advantages: (1) The data is structured and documented, so you don't have to guess the HTML layout; (2) APIs are designed for programmatic access, so they're more reliable — website redesigns break scrapers but don't affect APIs. Scraping scenario: The data is only available on a web page with no API (e.g., a small government agency that publishes tables on their website but offers no data download or API).

Exercise 13.3 Status codes

Match each HTTP status code to its meaning and the appropriate action:

| Code | Meaning? | What should your code do? |
|------|----------|---------------------------|
| 200  |          |                           |
| 401  |          |                           |
| 404  |          |                           |
| 429  |          |                           |
| 500  |          |                           |
Guidance

| Code | Meaning | Action |
|------|---------|--------|
| 200 | OK, success | Parse the response data |
| 401 | Unauthorized | Check your API key or credentials |
| 404 | Not found | Check the URL for typos or verify the endpoint exists |
| 429 | Too many requests | Wait (check the Retry-After header), then retry |
| 500 | Internal server error | Wait and retry later; the problem is on the server's end |
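The status-code table can be turned into a small lookup that a script consults before deciding what to do next. A minimal sketch — the advice strings and the function name are illustrative, not from the chapter:

```python
# Map common status codes to short, human-readable advice (wording is illustrative)
STATUS_ADVICE = {
    200: "OK: parse the response data",
    401: "Unauthorized: check your API key or credentials",
    404: "Not found: check the URL or endpoint",
    429: "Too many requests: wait, then retry",
    500: "Server error: wait and retry later",
}

def describe_status(code):
    """Return a hint for an HTTP status code, with a fallback for unknown codes."""
    return STATUS_ADVICE.get(code, f"Unexpected status: {code}")

print(describe_status(404))
print(describe_status(418))
```

A dictionary like this keeps the response-handling logic in one place instead of a long `if/elif` chain.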

Exercise 13.4 Rate limiting

Explain what rate limiting is and why it exists. Then explain why you should add time.sleep() between API requests even if the API documentation doesn't mention a rate limit.

Guidance Rate limiting is a policy that restricts how many requests a client can make in a time period (e.g., 60 per minute). It exists to protect the server from being overwhelmed and to ensure fair access for all users. You should add delays even without documented limits because: (1) undocumented limits may still exist and trigger 429 responses; (2) rapid-fire requests from a single client are poor netiquette and can cause problems for small servers; (3) being a courteous user reduces the chance of getting your IP or API key banned.
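The courtesy-delay habit can be packaged as a tiny helper. In this sketch, `fetch` is a stand-in for whatever request function you use (the names and the 1-second default are assumptions, not a documented limit):

```python
import time

def fetch_politely(items, fetch, delay=1.0):
    """Apply fetch() to each item, pausing between calls as a courtesy."""
    results = []
    for i, item in enumerate(items):
        results.append(fetch(item))
        if i < len(items) - 1:
            time.sleep(delay)  # pause even when no rate limit is documented
    return results

# Example: the fetch function here is a placeholder, not a real HTTP call
urls = ['https://example.com/a', 'https://example.com/b']
print(fetch_politely(urls, fetch=lambda u: len(u), delay=0.1))
```

In real use you would pass something like `lambda u: requests.get(u, timeout=10)` as `fetch`.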

Exercise 13.5 robots.txt

What is a robots.txt file, where is it typically located, and what does it tell you? Is it legally binding?

Guidance `robots.txt` is a plain text file found at the root of a website (e.g., `https://example.com/robots.txt`). It specifies rules for automated programs (bots, scrapers) about which parts of the site they're allowed or disallowed from accessing, and sometimes includes a crawl delay. It is generally *not* legally binding — it's a convention, like a "please keep off the grass" sign — but ignoring it is considered unethical in the web community and may be used as evidence of bad faith in legal proceedings.
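Python's standard library can interpret these rules for you. A minimal sketch using `urllib.robotparser`, parsing example rules directly rather than fetching them from a live site (the rules themselves are invented for illustration):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (not from a real site)
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch('*', 'https://example.com/data/page'))     # True: allowed
print(rp.can_fetch('*', 'https://example.com/private/page'))  # False: disallowed
print(rp.crawl_delay('*'))                                    # 5
```

In practice you would call `rp.set_url('https://example.com/robots.txt')` followed by `rp.read()` to fetch a site's real rules before checking paths.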

Part B: Applied Practice (2-star)

These problems require writing code. Test your solutions with real or simulated data.


Exercise 13.6 Making a GET request

Write a Python script that:

1. Makes a GET request to https://httpbin.org/get (a free API for testing HTTP requests)
2. Checks the status code
3. Prints the JSON response
4. Extracts and prints the origin field (your IP address) from the response

Guidance
import requests

response = requests.get('https://httpbin.org/get')

if response.status_code == 200:
    data = response.json()
    print(data)
    print(f"My IP: {data['origin']}")
else:
    print(f"Error: {response.status_code}")
`httpbin.org` is a real, free service specifically designed for testing HTTP requests. It echoes back information about your request, including your IP address, headers, and parameters.

Exercise 13.7 Query parameters

Using https://httpbin.org/get, make a request that passes three query parameters: name=Jordan, course=DataScience, chapter=13. Print the URL that requests constructed and verify the parameters appear in the JSON response.

Guidance
import requests

params = {
    'name': 'Jordan',
    'course': 'DataScience',
    'chapter': 13
}

response = requests.get('https://httpbin.org/get',
                        params=params)
print(f"URL: {response.url}")

data = response.json()
print(f"Args: {data['args']}")
The `args` field in the response should contain your parameters. The URL should show `?name=Jordan&course=DataScience&chapter=13`.

Exercise 13.8 Handling errors gracefully

Write a function called safe_request(url) that:

1. Makes a GET request with a 10-second timeout
2. Returns the JSON data if successful (status 200)
3. Prints a helpful error message for status codes 401, 404, 429, and 500
4. Returns None for any error
5. Catches requests.RequestException for connection errors

Guidance
import requests

def safe_request(url, params=None):
    """Make a GET request with error handling."""
    try:
        resp = requests.get(url, params=params,
                            timeout=10)
    except requests.RequestException as e:
        print(f"Connection error: {e}")
        return None

    if resp.status_code == 200:
        return resp.json()
    elif resp.status_code == 401:
        print("Unauthorized: check your API key")
    elif resp.status_code == 404:
        print("Not found: check the URL")
    elif resp.status_code == 429:
        print("Rate limited: slow down requests")
    elif resp.status_code == 500:
        print("Server error: try again later")
    else:
        print(f"Unexpected status: {resp.status_code}")
    return None

Exercise 13.9 Parsing JSON responses into DataFrames

The following JSON represents a simplified API response. Write code to convert it into a pandas DataFrame with columns for id, name, location_city, and location_country:

{
  "status": "ok",
  "count": 3,
  "data": [
    {"id": 1, "name": "Station A",
     "location": {"city": "Portland", "country": "US"}},
    {"id": 2, "name": "Station B",
     "location": {"city": "Vancouver", "country": "CA"}},
    {"id": 3, "name": "Station C",
     "location": {"city": "London", "country": "UK"}}
  ]
}
Guidance
import pandas as pd

response_data = {
    "status": "ok",
    "count": 3,
    "data": [
        {"id": 1, "name": "Station A",
         "location": {"city": "Portland", "country": "US"}},
        {"id": 2, "name": "Station B",
         "location": {"city": "Vancouver", "country": "CA"}},
        {"id": 3, "name": "Station C",
         "location": {"city": "London", "country": "UK"}}
    ]
}

# Extract the data array, then normalize
records = response_data['data']
df = pd.json_normalize(records, sep='_')
print(df)
Key insight: the actual data is inside `response_data['data']` — you need to extract it before normalizing. The wrapper (`status`, `count`) is metadata, not data.

Exercise 13.10 BeautifulSoup basics

Given this HTML fragment, write BeautifulSoup code to extract all product names and prices into a list of dictionaries:

<div class="products">
  <div class="product">
    <span class="name">Widget A</span>
    <span class="price">$12.99</span>
  </div>
  <div class="product">
    <span class="name">Widget B</span>
    <span class="price">$8.50</span>
  </div>
  <div class="product">
    <span class="name">Widget C</span>
    <span class="price">$15.00</span>
  </div>
</div>
Guidance
from bs4 import BeautifulSoup

html = """..."""  # The HTML above

soup = BeautifulSoup(html, 'html.parser')
products = soup.find_all('div', class_='product')

data = []
for product in products:
    name = product.find('span', class_='name').text
    price = product.find('span', class_='price').text
    data.append({'name': name, 'price': price})

print(data)
# [{'name': 'Widget A', 'price': '$12.99'}, ...]
Note: `class_` has an underscore because `class` is a reserved word in Python.

Exercise 13.11 HTML table scraping

Write code using BeautifulSoup to extract data from this HTML table into a pandas DataFrame:

<table>
  <tr><th>Country</th><th>Capital</th><th>Population</th></tr>
  <tr><td>France</td><td>Paris</td><td>67390000</td></tr>
  <tr><td>Germany</td><td>Berlin</td><td>83240000</td></tr>
  <tr><td>Spain</td><td>Madrid</td><td>47420000</td></tr>
</table>

Then compare your approach with pd.read_html().

Guidance **BeautifulSoup approach:**
from bs4 import BeautifulSoup
import pandas as pd

html = """..."""  # The HTML above
soup = BeautifulSoup(html, 'html.parser')

rows = soup.find_all('tr')
headers = [th.text for th in rows[0].find_all('th')]

data = []
for row in rows[1:]:
    cells = [td.text for td in row.find_all('td')]
    data.append(cells)

df = pd.DataFrame(data, columns=headers)
**pd.read_html approach:**
tables = pd.read_html(html)
df = tables[0]
The `pd.read_html()` approach is dramatically simpler for standard tables. Use BeautifulSoup when the data isn't in a `<table>` element or when you need more control over extraction.

Exercise 13.12 Pagination loop

Write a function that collects data from a paginated API. The API returns JSON like:

{"results": [...], "page": 1, "total_pages": 5}

Your function should loop through all pages, collect all results, respect a 1-second delay between requests, and return a combined DataFrame.

Guidance
import requests
import pandas as pd
import time

def collect_all_pages(base_url, params=None, delay=1.0):
    # Copy the caller's params so we never mutate them
    params = dict(params) if params else {}
    all_records = []
    page = 1

    while True:
        params['page'] = page

        resp = requests.get(base_url, params=params,
                            timeout=10)
        if resp.status_code != 200:
            print(f"Error on page {page}: "
                  f"{resp.status_code}")
            break

        data = resp.json()
        all_records.extend(data['results'])

        if page >= data['total_pages']:
            break

        page += 1
        time.sleep(delay)

    return pd.DataFrame(all_records)

Part C: Real-World Scenarios (2-star to 3-star)


Exercise 13.13 API exploration ⭐⭐

Choose one of these real, free, no-authentication-required APIs and retrieve data from it:

- https://api.open-meteo.com/v1/forecast?latitude=45.52&longitude=-122.68&current_weather=true (weather data)
- https://api.agify.io?name=jordan (name-based age estimation)
- https://catfact.ninja/fact (random cat facts)

Make the request, inspect the response structure, and create a pandas DataFrame from the relevant data.

Guidance Example with the weather API:
import requests
import pandas as pd

resp = requests.get(
    'https://api.open-meteo.com/v1/forecast',
    params={
        'latitude': 45.52,
        'longitude': -122.68,
        'current_weather': True
    }
)

data = resp.json()
weather = pd.json_normalize(data['current_weather'])
print(weather)
The key skill: inspect the JSON structure first (`data.keys()`, `type(data['current_weather'])`), then decide how to extract and normalize the data you want.

Exercise 13.14 Building a multi-city weather collector ⭐⭐⭐

Using the Open-Meteo API (no key required), write a script that:

1. Defines a list of 5 cities with their latitude and longitude
2. Loops through the cities, requesting current weather for each
3. Respects a 1-second delay between requests
4. Combines all results into a single DataFrame with a city column
5. Saves the result to a CSV file

Guidance
import requests
import pandas as pd
import time

cities = {
    'Portland': (45.52, -122.68),
    'Denver': (39.74, -104.99),
    'Chicago': (41.88, -87.63),
    'Miami': (25.76, -80.19),
    'Seattle': (47.61, -122.33)
}

records = []
for city, (lat, lon) in cities.items():
    resp = requests.get(
        'https://api.open-meteo.com/v1/forecast',
        params={'latitude': lat, 'longitude': lon,
                'current_weather': True},
        timeout=10
    )
    if resp.status_code == 200:
        weather = resp.json()['current_weather']
        weather['city'] = city
        records.append(weather)
    time.sleep(1)

df = pd.DataFrame(records)
df.to_csv('weather_data.csv', index=False)
print(df)

Exercise 13.15 Debugging a scraper ⭐⭐

A colleague wrote this scraping code, but it isn't working. Identify and fix all the bugs:

import request
from bs4 import BeautifulSoup

response = request.get('https://example.com/data')
soup = BeautifulSoup(response, 'html.parser')
rows = soup.find('tr')
for row in rows:
    cells = row.find('td')
    print(cells.text)
Guidance Four bugs:

1. `import request` should be `import requests` (plural)
2. `BeautifulSoup(response, ...)` should be `BeautifulSoup(response.text, ...)` — you need the text content, not the response object
3. `soup.find('tr')` should be `soup.find_all('tr')` — `find()` returns only the first match, not all of them
4. `row.find('td')` should be `row.find_all('td')` — same issue, and you'd need to loop through the cells

Corrected:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com/data')
soup = BeautifulSoup(response.text, 'html.parser')
rows = soup.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    for cell in cells:
        print(cell.text)

Exercise 13.16 API key safety ⭐⭐

Your teammate writes this code and asks you to review it before they push it to GitHub:

import requests

API_KEY = 'sk_live_a8f3k2j5h7g9d1m4n6p8'
response = requests.get(
    'https://api.dataservice.com/records',
    headers={'Authorization': f'Bearer {API_KEY}'}
)

What is the security problem? Write a corrected version that keeps the API key safe.

Guidance **Problem:** The API key is hardcoded in the source code. If this file is pushed to GitHub, the key becomes publicly visible. Anyone can use (and abuse) it. **Fix:** Store the key outside the code:
import requests
import os

API_KEY = os.environ.get('DATA_SERVICE_API_KEY')
if not API_KEY:
    raise ValueError(
        "Set DATA_SERVICE_API_KEY environment variable"
    )

response = requests.get(
    'https://api.dataservice.com/records',
    headers={'Authorization': f'Bearer {API_KEY}'}
)
Also: if the key has already been committed, it's compromised. Revoke it immediately and generate a new one, even if the commit is later removed (Git history preserves it).

Exercise 13.17 robots.txt analysis ⭐⭐

Fetch and analyze the robots.txt file for three real websites (e.g., https://en.wikipedia.org/robots.txt, https://twitter.com/robots.txt, https://github.com/robots.txt). For each:

1. Which paths are disallowed?
2. Is there a crawl delay specified?
3. Based on the robots.txt, would your data collection script be allowed to access the main content pages?

Guidance
import requests

sites = [
    'https://en.wikipedia.org/robots.txt',
    'https://github.com/robots.txt'
]

for url in sites:
    resp = requests.get(url, timeout=10)
    print(f"\n--- {url} ---")
    print(resp.text[:500])
Review the output for `Disallow` directives and `Crawl-delay` values. Note that some sites (like Wikipedia) are quite permissive, while others restrict access to many paths. The specifics will vary — the point of this exercise is to build the habit of checking.

Part D: Synthesis and Extension (3-star to 4-star)


Exercise 13.18 Complete data pipeline ⭐⭐⭐

Build a complete data collection pipeline that:

1. Fetches data from a free public API
2. Handles errors and timeouts
3. Converts the response to a pandas DataFrame
4. Cleans the data (rename columns, fix types)
5. Saves to CSV
6. Includes an ethics documentation cell

Document each step in a Jupyter notebook with Markdown explanations.

Guidance Use any free API (Open-Meteo, REST Countries, or a public data portal). The pipeline should include error handling (`try/except`), a timeout parameter, status code checking, and a final Markdown cell documenting: the data source, access method, rate limiting approach, and any ethical considerations. This exercise combines technical skills from this chapter with the documentation practices emphasized throughout the book.
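A skeleton of such a pipeline, reusing the Open-Meteo endpoint from Exercises 13.13-13.14. The column rename in `clean()` is an assumption about the response shape, and the ethics documentation belongs in a Markdown cell, not in code:

```python
import pandas as pd
import requests

def fetch(url, params):
    """Steps 1-2: request with a timeout and basic error handling."""
    try:
        resp = requests.get(url, params=params, timeout=10)
    except requests.RequestException as e:
        print(f"Connection error: {e}")
        return None
    return resp.json() if resp.status_code == 200 else None

def clean(payload):
    """Steps 3-4: normalize to a DataFrame, rename columns, fix types."""
    df = pd.json_normalize(payload['current_weather'])
    df = df.rename(columns={'temperature': 'temp_c'})  # assumed column name
    df['time'] = pd.to_datetime(df['time'])
    return df

data = fetch('https://api.open-meteo.com/v1/forecast',
             {'latitude': 45.52, 'longitude': -122.68,
              'current_weather': True})
if data is not None:
    clean(data).to_csv('pipeline_output.csv', index=False)  # step 5
```

Splitting fetch and clean into separate functions makes each step testable on its own — `clean()` can be checked against a saved sample response without touching the network.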

Exercise 13.19 Scraping vs. API comparison ⭐⭐⭐

Find a website that provides data both through a web page and through an API (many government data portals do this). Write two scripts:

1. One that scrapes the data from the web page using BeautifulSoup
2. One that retrieves the same data from the API using requests

Compare the two approaches: Which was easier? Which produced cleaner data? Which is more likely to break if the website changes?

Guidance The comparison should demonstrate that the API approach is more reliable, produces cleaner data, and is less brittle. The scraping approach may be necessary when no API exists, but it depends on the HTML structure (which can change without notice). This exercise reinforces the principle: always prefer APIs when available.

Exercise 13.20 Retry logic with exponential backoff ⭐⭐⭐⭐

Write a function request_with_backoff(url, max_retries=5) that implements exponential backoff: if a request fails, wait 1 second before retrying, then 2 seconds, then 4, then 8, then 16. This is a common pattern in production data pipelines. Explain why exponential backoff is better than a fixed retry delay.

Guidance
import requests
import time

def request_with_backoff(url, params=None,
                         max_retries=5):
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, params=params,
                                timeout=10)
            if resp.status_code == 200:
                return resp
            elif resp.status_code == 429:
                wait = 2 ** attempt
                print(f"Retry {attempt + 1}: "
                      f"waiting {wait}s")
                time.sleep(wait)
            else:
                return resp
        except requests.RequestException:
            wait = 2 ** attempt
            print(f"Connection error. "
                  f"Retry in {wait}s")
            time.sleep(wait)
    return None
Exponential backoff is better than fixed delays because: if the server is overloaded, many clients retrying every 1 second adds to the load. Exponential backoff spreads retries out over time, giving the server a chance to recover.

Exercise 13.21 Caching for repeated API calls ⭐⭐⭐⭐

Write a function that wraps API calls with local file caching. If the data was fetched within the last 24 hours, return the cached version instead of making a new request. Use file modification timestamps to determine freshness.

Guidance
import requests
import json
import os
import time

def cached_request(url, cache_path, max_age=86400):
    """Fetch from API or return cached data."""
    if os.path.exists(cache_path):
        age = time.time() - os.path.getmtime(cache_path)
        if age < max_age:
            with open(cache_path, 'r') as f:
                print("Using cached data")
                return json.load(f)

    resp = requests.get(url, timeout=10)
    if resp.status_code == 200:
        data = resp.json()
        with open(cache_path, 'w') as f:
            json.dump(data, f)
        return data
    return None

Part E: Ethics and Judgment (All levels)

These problems don't have single right answers. They require ethical reasoning and judgment.


Exercise 13.22 Ethical scenario analysis ⭐⭐

For each scenario, decide whether the data collection is ethical, unethical, or "it depends." Justify your reasoning.

  1. A student scrapes public restaurant health inspection scores from a city government website to build a searchable database.
  2. A company scrapes competitor pricing from e-commerce sites every hour to dynamically adjust their own prices.
  3. A researcher scrapes social media profiles to study mental health patterns, using publicly posted data.
  4. A journalist scrapes public court records to investigate patterns in sentencing.
  5. A startup scrapes millions of images from personal blogs to train an AI model.
Guidance

1. **Likely ethical** — government data is public by design, the purpose is public benefit, and no personal harm results. Still, check robots.txt and ToS.
2. **It depends** — competitive intelligence is common, but hourly scraping could violate ToS and strain the competitor's servers. The legality varies by jurisdiction.
3. **Likely unethical** (despite being technically possible) — "publicly posted" doesn't mean "consented to research use." Mental health data is sensitive. This likely requires IRB approval and informed consent.
4. **Likely ethical** — court records are public by design, journalism serves the public interest, and sentencing analysis can expose systemic problems.
5. **Ethically problematic** — personal blog images are copyrighted, the bloggers didn't consent to AI training, and commercial use compounds the concern. This is the Clearview AI pattern.

Part M: Mixed Review (Chapters 1-12)


Exercise 13.23 Review: Data types (Chapter 7) ⭐⭐

After fetching JSON data from an API and loading it into a DataFrame, you notice that a population column has dtype object instead of int64. What are two possible causes and how would you fix each?

Guidance **Cause 1:** Some values are strings (e.g., `"N/A"` or `"unknown"`) mixed with numbers. Fix: `df['population'] = pd.to_numeric(df['population'], errors='coerce')` — this converts to numeric and turns unconvertible values into NaN. **Cause 2:** The JSON had null values, and pandas imported the column as mixed type. Fix: Same as above — `pd.to_numeric()` with `errors='coerce'` handles this cleanly.
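A self-contained demonstration of Cause 1 and the fix (note the result is `float64`, not `int64`, because NaN forces a float column):

```python
import pandas as pd

# Mixed strings and numbers arrive as dtype object
df = pd.DataFrame({'population': ['67390000', 'N/A', '83240000']})
print(df['population'].dtype)   # object

# Coerce: convertible values become numbers, "N/A" becomes NaN
df['population'] = pd.to_numeric(df['population'], errors='coerce')
print(df['population'].dtype)   # float64
```

If you need an integer column after cleaning, pandas' nullable `Int64` dtype (`df['population'].astype('Int64')`) can hold NaN alongside integers.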

Exercise 13.24 Review: Merge patterns (Chapter 9) ⭐⭐

You've collected data from two different APIs: one returns country-level COVID data and the other returns country-level economic data. The first uses a column called iso_code and the second uses country_code. Write the merge code and explain which `how` parameter you'd choose and why.

Guidance
merged = covid_df.merge(
    econ_df,
    left_on='iso_code',
    right_on='country_code',
    how='inner'
)
Use `left_on` and `right_on` when the join columns have different names. Choose `how='inner'` if you only want countries present in both datasets (complete cases), or `how='outer'` if you want to keep all countries and investigate which are missing from each source. Start with `inner` for analysis, use `outer` with `indicator=True` for data quality investigation.
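The `indicator=True` diagnostic mentioned above, shown on toy data (the country codes and values are invented for the example):

```python
import pandas as pd

covid_df = pd.DataFrame({'iso_code': ['USA', 'FRA', 'BRA'],
                         'cases': [100, 200, 300]})
econ_df = pd.DataFrame({'country_code': ['USA', 'FRA', 'DEU'],
                        'gdp': [21.4, 2.6, 3.8]})

check = covid_df.merge(econ_df, left_on='iso_code',
                       right_on='country_code',
                       how='outer', indicator=True)
# _merge is 'both' for matched rows, 'left_only'/'right_only' otherwise
print(check[['iso_code', 'country_code', '_merge']])
```

Here BRA appears only in the COVID data and DEU only in the economic data, so an inner join would silently drop both — exactly the kind of loss the `_merge` column makes visible.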

Exercise 13.25 Review: Dates (Chapter 11) + APIs ⭐⭐⭐

You fetch time series data from an API. The dates come as Unix timestamps (integers representing seconds since January 1, 1970). Write code to:

1. Convert the timestamp column to pandas datetime
2. Set it as the index
3. Resample to weekly frequency

Guidance
import pandas as pd

# Sample data: Unix timestamps (seconds since 1970-01-01)
df = pd.DataFrame({
    'timestamp': [1700000000, 1700086400, 1700604800],
    'value': [10.0, 12.0, 11.0]
})

# Convert Unix timestamps to datetime
df['date'] = pd.to_datetime(df['timestamp'], unit='s')

# Set as index
df = df.set_index('date')

# Resample to weekly
weekly = df.resample('W').mean()
print(weekly.head())
Unix timestamps are common in API responses. The `unit='s'` parameter tells pandas the values are in seconds (some APIs use milliseconds, in which case you'd use `unit='ms'`).