Learning Objectives
- Retrieve data from a REST API using the requests library (GET requests with parameters and headers)
- Parse HTML content with BeautifulSoup to extract structured data from web pages
- Handle authentication, pagination, and rate limiting when collecting web data
- Evaluate the legal and ethical considerations of web data collection
- Build a reusable data collection script that fetches, processes, and stores web data
In This Chapter
- Chapter Overview
- 13.1 How the Web Works (Just Enough)
- 13.2 The requests Library: Your Web Data Swiss Army Knife
- 13.3 API Authentication: Proving Who You Are
- 13.4 Pagination and Rate Limiting: Being a Good API Citizen
- 13.5 Web Scraping with BeautifulSoup: When There's No API
- 13.6 Ethics and Legality: The Hard Questions
- 13.7 Building a Reusable Data Collection Pipeline
- 13.8 Project Checkpoint: Pulling Live Vaccination Data
- 13.9 Spaced Review: Chapters 1-12
- Chapter Summary
- Concept Inventory Update
Chapter 13: Getting Data from the Web — APIs, Web Scraping, and Building Your Own Datasets
"The best time to collect data was before you needed it. The second best time is now." — Adapted from a well-known proverb
Chapter Overview
In Chapter 12, you learned to load data from files sitting on your computer — CSVs, Excel workbooks, JSON files, and databases. That covers a lot of real-world scenarios. But what about data that doesn't exist as a file yet? What about data that lives on someone else's server, updated in real time, accessible only through the internet?
The web is the largest, most diverse, most continuously updated data source ever created. Government agencies publish economic indicators. Weather services broadcast forecasts. Social media platforms stream millions of posts per hour. Sports leagues maintain play-by-play databases. Public health organizations release vaccination and disease surveillance data. And all of it is, in principle, accessible from your Python script.
But "in principle" is doing a lot of work in that sentence. Getting data from the web requires understanding a few things that loading a local file doesn't:
- How the web works — at least enough to understand HTTP requests and responses.
- APIs — the structured, polite way to ask a server for data.
- Web scraping — the sometimes-necessary, sometimes-controversial alternative when no API exists.
- Ethics and legality — because just because you can access data doesn't mean you should.
This chapter teaches all four. By the end, you'll be able to write a Python script that reaches out to the internet, retrieves data, processes it into a DataFrame, and saves it for analysis. You'll also know when it's appropriate to do so — and when it's not.
In this chapter, you will learn to:
- Retrieve data from a REST API using the requests library (GET requests with parameters and headers) (all paths)
- Parse HTML content with BeautifulSoup to extract structured data from web pages (standard + deep dive paths)
- Handle authentication, pagination, and rate limiting when collecting web data (standard + deep dive paths)
- Evaluate the legal and ethical considerations of web data collection (all paths)
- Build a reusable data collection script that fetches, processes, and stores web data (deep dive path)
13.1 How the Web Works (Just Enough)
Before we write any code, let's build a mental model of what happens when you visit a website or request data from the internet.
The Request-Response Cycle
Every interaction on the web follows the same pattern:
- Your computer sends a request to a server. ("Give me the weather for Portland.")
- The server processes the request and prepares a response.
- The server sends the response back to your computer. (The weather data, or an error message if something went wrong.)
When you type a URL into your browser and press Enter, your browser sends a request. The server sends back HTML (the language that web pages are written in), and your browser renders it into the page you see. When you write Python code to request data from an API, the same thing happens — except your Python script sends the request and receives the response, and the response is usually JSON instead of HTML.
HTTP: The Language of the Web
HTTP (HyperText Transfer Protocol) is the protocol — the set of rules — that governs how requests and responses are formatted. You don't need to understand HTTP deeply, but you need to know two things:
HTTP methods tell the server what you want to do:
- GET: "Give me some data." (This is what you use 90% of the time for data collection.)
- POST: "Here's some data — process it." (Used for submitting forms, uploading files, or sending data to an API.)
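The difference between the two methods is easiest to see by constructing one request of each kind without sending it. This sketch uses requests.Request and .prepare(), which build the request object locally; the URL is hypothetical:

```python
import json
import requests

# A GET request carries its inputs in the URL's query string
get_req = requests.Request(
    'GET', 'https://api.example.com/data',
    params={'city': 'Portland'}
).prepare()
print(get_req.method, get_req.url)
# GET https://api.example.com/data?city=Portland

# A POST request carries its inputs in the request body instead
post_req = requests.Request(
    'POST', 'https://api.example.com/data',
    json={'city': 'Portland'}
).prepare()
print(post_req.method, post_req.body)
```

Same data, two different places: the GET puts it in the URL, the POST puts it in the body.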
HTTP status codes tell you what happened:
| Code | Meaning | What to Do |
|---|---|---|
| 200 | OK — success | Parse the response |
| 301 | Moved permanently | Follow the redirect (requests handles this automatically) |
| 400 | Bad request | Check your parameters |
| 401 | Unauthorized | Check your API key / authentication |
| 403 | Forbidden | You don't have permission |
| 404 | Not found | Check the URL |
| 429 | Too many requests | You're hitting the server too fast — slow down |
| 500 | Server error | The server has a problem — try again later |
You'll check status codes in your code to make sure your request succeeded before trying to parse the response.
URLs and Query Parameters
A URL (Uniform Resource Locator) is the address of a resource on the web. For APIs, URLs often include query parameters — key-value pairs that specify what data you want:
https://api.weather.example.com/forecast?city=Portland&units=fahrenheit
Breaking this down:
- https://api.weather.example.com/forecast — the base URL (the API endpoint)
- ? — marks the start of query parameters
- city=Portland — first parameter (key=value)
- & — separates multiple parameters
- units=fahrenheit — second parameter
You'll construct URLs with parameters in your code, but (as we'll see) the requests library handles the formatting for you.
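To see what that formatting looks like under the hood, the standard library's urlencode reproduces it — this is roughly what requests does with a params dictionary (the URL is the hypothetical one from above):

```python
from urllib.parse import urlencode

params = {'city': 'Portland', 'units': 'fahrenheit'}
query = urlencode(params)  # builds 'key=value&key=value'
url = 'https://api.weather.example.com/forecast?' + query
print(url)
# https://api.weather.example.com/forecast?city=Portland&units=fahrenheit

# Special characters are percent-encoded automatically:
print(urlencode({'city': 'São Paulo'}))
# city=S%C3%A3o+Paulo
```

That second example is why you let a library build the URL: spaces and accented characters must be encoded, and doing it by hand is error-prone.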
🔄 Check Your Understanding
- In your own words, describe the request-response cycle.
- What's the difference between a GET request and a POST request?
- If you receive a status code of 429, what should you do?
13.2 The requests Library: Your Web Data Swiss Army Knife
The requests library is Python's go-to tool for making HTTP requests. It's not part of the standard library, but it's so widely used that it's practically a standard:
# Install if needed: pip install requests
import requests
Your First API Call
Let's make a GET request to a public API. We'll use a hypothetical public data API, but the pattern is identical for any REST API you'll encounter:
import requests
response = requests.get(
'https://api.example.com/data/countries'
)
print(response.status_code) # 200 = success
print(type(response)) # <class 'requests.models.Response'>
The response object contains everything the server sent back. The most important attributes:
response.status_code # HTTP status code (200, 404, etc.)
response.text # Response body as a string
response.json() # Response body parsed as JSON (if applicable)
response.headers # Response headers (metadata)
Passing Query Parameters
Instead of manually building URLs with ? and &, pass parameters as a dictionary:
params = {
'country': 'Brazil',
'year': 2023,
'indicator': 'vaccination_rate'
}
response = requests.get(
'https://api.example.com/data/indicators',
params=params
)
# requests builds the URL for you:
print(response.url)
# https://api.example.com/data/indicators?country=Brazil&year=2023&indicator=vaccination_rate
This is cleaner, less error-prone, and handles special characters in parameter values automatically.
Parsing JSON Responses
Most APIs return data as JSON. The response.json() method parses it into Python dictionaries and lists:
response = requests.get(
'https://api.example.com/data/countries'
)
if response.status_code == 200:
data = response.json()
print(type(data)) # Usually list or dict
print(len(data)) # How many records?
else:
print(f"Error: {response.status_code}")
Notice the if check on the status code. Always verify that the request succeeded before trying to parse the response. If the server returned an error (status 400, 404, 500, etc.), calling response.json() might fail or return an error message instead of data.
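That check-then-parse pattern is worth wrapping in a small helper. This is a sketch (the function name is ours, not part of requests); it works with any object exposing .status_code and .json(), which is exactly what a requests.Response provides:

```python
def parse_json_safely(response):
    """Return parsed JSON on success, or None with a message on failure."""
    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        return None
    try:
        return response.json()
    except ValueError:  # body wasn't valid JSON
        print("Response was not valid JSON")
        return None
```

Usage: `data = parse_json_safely(requests.get(url))` — then check `if data is not None:` before building a DataFrame.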
From JSON Response to DataFrame
Once you have the JSON data, you're back in familiar territory from Chapter 12:
import pandas as pd
response = requests.get(
'https://api.example.com/data/countries'
)
if response.status_code == 200:
data = response.json()
# If the response is a flat list of records:
df = pd.DataFrame(data)
# If the response is nested:
df = pd.json_normalize(data)
# If the data is buried inside a wrapper:
records = data['results'] # or data['data'], etc.
df = pd.DataFrame(records)
print(df.shape)
print(df.head())
The exact approach depends on the structure of the JSON response. This is why the "load, explore, extract, normalize" pattern from Chapter 12 is so important — it applies directly to API responses.
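Here is the same decision applied to small in-memory examples (the field names are made up for illustration):

```python
import pandas as pd

# Case 1: flat list of records → pd.DataFrame
flat = [
    {'country': 'Brazil', 'rate': 81},
    {'country': 'Kenya', 'rate': 89},
]
df_flat = pd.DataFrame(flat)
print(df_flat.columns.tolist())  # ['country', 'rate']

# Case 2: nested records → pd.json_normalize flattens with dotted names
nested = [
    {'country': 'Brazil', 'stats': {'rate': 81, 'year': 2023}},
    {'country': 'Kenya', 'stats': {'rate': 89, 'year': 2023}},
]
df_nested = pd.json_normalize(nested)
print(sorted(df_nested.columns))
# ['country', 'stats.rate', 'stats.year']

# Case 3: records buried in a wrapper → extract the list first
wrapped = {'page': 1, 'results': flat}
df_wrapped = pd.DataFrame(wrapped['results'])
```

Printing `data` (or a slice of it) before choosing an approach is the "explore" step from Chapter 12 applied to an API response.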
Adding Headers
Some APIs require you to send additional information in the request headers. Headers are metadata about your request — they're separate from the URL and parameters:
headers = {
'Accept': 'application/json',
'User-Agent': 'MyDataScienceProject/1.0'
}
response = requests.get(
'https://api.example.com/data',
headers=headers
)
The User-Agent header identifies your script to the server. Some APIs require it, and it's good practice to include one even when it's not required — it tells the server operator who's making requests and lets them contact you if there's a problem.
Debugging Walkthrough: "ConnectionError" or "Timeout"
You run requests.get('https://api.example.com/data') and get a ConnectionError, or the request hangs indefinitely.
Common causes:
1. You're not connected to the internet (check your connection)
2. The URL is wrong (typo in the domain name)
3. The server is down (try the URL in your browser)
4. Your network blocks the request (corporate firewall)
Best practice: Always set a timeout so your script doesn't hang forever:
response = requests.get(url, timeout=10)  # 10 seconds
🔄 Check Your Understanding
- What does response.json() return, and when would it fail?
- Why is it better to pass parameters as a dictionary instead of building the URL manually?
- Write the code to make a GET request with a query parameter year=2023 and parse the JSON response.
13.3 API Authentication: Proving Who You Are
Many APIs are free and public — you just send a request and get data back. But many others require authentication: you need to prove your identity before the server will give you anything.
Why APIs Require Authentication
Three reasons:
1. Rate limiting — the server needs to know who's making requests so it can enforce usage limits (e.g., "1,000 requests per hour per user").
2. Access control — some data is only available to authorized users.
3. Accountability — if someone misuses the API, the provider needs to know who did it.
API Keys: The Most Common Method
An API key is a unique string (like a long password) that identifies you. You typically get one by creating an account on the API provider's website. You then include the key in your requests, usually in one of two ways:
As a query parameter:
params = {
'city': 'Portland',
'api_key': 'your_key_here_abc123'
}
response = requests.get(url, params=params)
As a header:
headers = {
'Authorization': 'Bearer your_key_here_abc123'
}
response = requests.get(url, headers=headers)
Which method to use depends on the API — the documentation will tell you.
Keeping Your API Key Safe
This is important: never put your API key directly in your code if you're going to share that code. If you push a script to GitHub with your API key visible, anyone can use (and abuse) your key.
Instead, store your key in an environment variable or a separate file that you don't share:
import os
# Read from environment variable
api_key = os.environ.get('WEATHER_API_KEY')
# Or read from a local file
with open('.api_key', 'r') as f:
api_key = f.read().strip()
headers = {'Authorization': f'Bearer {api_key}'}
response = requests.get(url, headers=headers)
If you use Git, add .api_key to your .gitignore file so it never gets committed.
Action Checklist: API Key Safety
- [ ] Never hardcode API keys in shared scripts
- [ ] Store keys in environment variables or local config files
- [ ] Add config files to .gitignore
- [ ] If you accidentally commit a key, revoke it immediately and generate a new one
- [ ] Use different keys for development and production
13.4 Pagination and Rate Limiting: Being a Good API Citizen
When an API has thousands or millions of records, it doesn't send everything at once. It sends results in pages — a pattern called pagination.
Handling Pagination
A paginated API response typically looks like this:
{
"results": [...],
"page": 1,
"total_pages": 15,
"next_url": "https://api.example.com/data?page=2"
}
To get all the data, you need to loop through the pages:
import time
all_records = []
page = 1
while True:
response = requests.get(
'https://api.example.com/data',
params={'page': page, 'per_page': 100}
)
data = response.json()
all_records.extend(data['results'])
if page >= data['total_pages']:
break
page += 1
time.sleep(1) # Be polite — wait between requests
df = pd.DataFrame(all_records)
print(f"Collected {len(df)} records from {page} pages")
Notice the time.sleep(1) — this pauses for one second between requests. This brings us to rate limiting.
Rate Limiting: Don't Be a Jerk
Rate limiting is a policy that restricts how many requests you can make in a given time period. Common limits are "60 requests per minute" or "1,000 requests per hour."
If you exceed the rate limit, the server returns a 429 (Too Many Requests) status code. Some APIs include rate limit information in the response headers:
response = requests.get(url, headers=headers)
# Check rate limit headers
remaining = response.headers.get('X-RateLimit-Remaining')
reset_time = response.headers.get('X-RateLimit-Reset')
print(f"Requests remaining: {remaining}")
print(f"Limit resets at: {reset_time}")
Best practices for rate limiting:
- Read the documentation. The API docs will tell you the rate limit. Respect it.
- Add delays between requests. time.sleep(1) between requests is a good default.
- Handle 429 responses gracefully. If you get a 429, pause and retry:
import time
def polite_get(url, params=None, max_retries=3):
"""Make a GET request with rate limit handling."""
for attempt in range(max_retries):
response = requests.get(url, params=params)
if response.status_code == 429:
wait = int(response.headers.get(
'Retry-After', 60))
print(f"Rate limited. Waiting {wait}s...")
time.sleep(wait)
elif response.status_code == 200:
return response
else:
print(f"Error {response.status_code}")
return response
return response
- Cache your results. Don't re-request data you've already downloaded. Save intermediate results to a file so you can restart from where you left off if something goes wrong.
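One minimal way to implement that caching idea is to store each page's JSON in a local file keyed by page number. This is a hypothetical helper (the function name and cache layout are ours), sketched so it works with any fetch function:

```python
import json
import os

def fetch_with_cache(page, fetch_fn, cache_dir='api_cache'):
    """Return cached results for a page, calling fetch_fn(page) only on a miss.

    fetch_fn should return a JSON-serializable object (e.g., a list of records).
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f'page_{page}.json')
    if os.path.exists(path):          # cache hit: no network request needed
        with open(path) as f:
            return json.load(f)
    data = fetch_fn(page)             # cache miss: fetch, then store
    with open(path, 'w') as f:
        json.dump(data, f)
    return data
```

Usage with a real API might look like `fetch_with_cache(3, lambda p: requests.get(url, params={'page': p}).json())` — if your script crashes on page 50, rerunning it replays pages 1-49 from disk instead of the network.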
Scenario Walkthrough: Collecting 10,000 Records with Rate Limiting
You need 10,000 records from an API that returns 100 records per page and allows 60 requests per minute.
- Pages needed: 10,000 / 100 = 100 pages
- At 60 requests/minute: 100 pages takes about 1.7 minutes
- With 1-second delays: 100 pages takes about 100 seconds
Your script should: (1) loop through pages, (2) sleep 1 second between requests, (3) check status codes, (4) save progress periodically, and (5) handle interruptions gracefully.
🔄 Check Your Understanding
- What is pagination, and why do APIs use it?
- What does a 429 status code mean, and how should your code handle it?
- Why should you add time.sleep() between API requests, even if you haven't hit the rate limit yet?
13.5 Web Scraping with BeautifulSoup: When There's No API
APIs are the clean, structured, official way to get data from the web. But not every website has an API. Sometimes the data you need is embedded in a web page — an HTML table of statistics, a list of prices, a collection of reviews — and the only way to get it is to download the HTML and extract the data yourself.
This is web scraping: programmatically extracting data from web pages by parsing their HTML structure.
What HTML Looks Like
HTML (HyperText Markup Language) is the language used to structure web pages. It uses tags to mark up content:
<html>
<body>
<h1>Weather Report</h1>
<table>
<tr>
<th>City</th>
<th>Temperature</th>
</tr>
<tr>
<td>Portland</td>
<td>72</td>
</tr>
<tr>
<td>Denver</td>
<td>85</td>
</tr>
</table>
</body>
</html>
Tags come in pairs (<table> ... </table>), and they nest inside each other to create a tree structure. <tr> is a table row, <th> is a table header cell, and <td> is a table data cell.
BeautifulSoup: Navigating the HTML Tree
BeautifulSoup is a Python library for parsing HTML. It turns raw HTML into a tree structure you can navigate and search:
# Install if needed: pip install beautifulsoup4
from bs4 import BeautifulSoup
html = """
<html>
<body>
<h1>Weather Report</h1>
<table>
<tr><th>City</th><th>Temperature</th></tr>
<tr><td>Portland</td><td>72</td></tr>
<tr><td>Denver</td><td>85</td></tr>
</table>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
The soup object is now a navigable representation of the HTML. Here are the most important methods:
find() — find the first matching element:
title = soup.find('h1')
print(title.text) # "Weather Report"
find_all() — find all matching elements:
rows = soup.find_all('tr')
print(len(rows)) # 3 (header row + 2 data rows)
for row in rows:
cells = row.find_all(['th', 'td'])
print([cell.text for cell in cells])
# ['City', 'Temperature']
# ['Portland', '72']
# ['Denver', '85']
select() — find elements using CSS selectors:
# Select all <td> elements
cells = soup.select('td')
# Select elements by class
items = soup.select('.product-name')
# Select elements by ID
header = soup.select('#main-header')
A Real Scraping Workflow
Here's the complete pattern: request a web page, parse the HTML, extract the data, and build a DataFrame:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Step 1: Download the web page
url = 'https://example.com/weather-data'
response = requests.get(url)
# Step 2: Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')
# Step 3: Find the data table
table = soup.find('table')
rows = table.find_all('tr')
# Step 4: Extract data from rows
data = []
for row in rows[1:]: # Skip header row
cells = row.find_all('td')
record = {
'city': cells[0].text.strip(),
'temperature': int(cells[1].text.strip())
}
data.append(record)
# Step 5: Create a DataFrame
df = pd.DataFrame(data)
print(df)
pandas Can Read HTML Tables Directly
For simple HTML tables, pandas has a shortcut that can save you from writing BeautifulSoup code entirely:
tables = pd.read_html('https://example.com/data-page')
# Returns a list of DataFrames, one per <table> on the page
# Usually you want the first (or a specific) table
df = tables[0]
pd.read_html() uses BeautifulSoup under the hood, but it's limited to <table> elements. If the data isn't in a table (e.g., it's in a list, in div elements, or spread across the page), you'll need BeautifulSoup directly.
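When the data lives in divs rather than a table, the select() patterns from earlier still apply. A sketch with made-up HTML and class names:

```python
from bs4 import BeautifulSoup
import pandas as pd

html = """
<div class="listing">
  <div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">24.50</span></div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

records = []
for product in soup.select('.product'):       # one element per product div
    records.append({
        'name': product.select_one('.name').text,
        'price': float(product.select_one('.price').text),
    })

df = pd.DataFrame(records)
print(df)
```

The structure is the same as the table workflow — find the repeating element, extract fields from each one, build a list of dictionaries — only the selectors change.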
Debugging Walkthrough: "The data I see in my browser isn't in the HTML"
You request a web page and try to find a table, but soup.find('table') returns None even though you can see the table in your browser.
Common cause: The table is loaded dynamically using JavaScript after the page loads. When you use requests.get(), you get the initial HTML — before JavaScript runs. The data doesn't exist in that initial HTML.
Solutions:
1. Check if the page has an API (look in the browser's Developer Tools > Network tab for XHR/Fetch requests — the data might come from an API endpoint you can call directly)
2. Use a browser automation tool like Selenium that can execute JavaScript (beyond this chapter's scope)
3. Look for an alternative source that provides the data as a file or API
🔄 Check Your Understanding
- What is web scraping, and how is it different from using an API?
- What do find() and find_all() do in BeautifulSoup?
- Why might data visible in your browser not appear in the HTML returned by requests.get()?
13.6 Ethics and Legality: The Hard Questions
This is the most important section in this chapter. Technical skill without ethical judgment is dangerous.
Web scraping and API access raise questions that don't have clear technical answers — they require judgment, empathy, and awareness of consequences. Let's address them directly.
Ethical Analysis: Five Questions to Ask Before Scraping
Before you write a single line of scraping code, work through these questions:
1. Does an API exist? Always check first. If the data provider offers an API, use it. APIs are structured, documented, and intended for programmatic access. Scraping a website when an API exists is like climbing through a window when the door is open.
2. What does robots.txt say? Most websites have a file at /robots.txt (e.g., https://example.com/robots.txt) that specifies which parts of the site automated programs (like your script) are allowed to access. This file isn't legally binding in all jurisdictions, but respecting it is the ethical baseline.
# Check robots.txt
response = requests.get('https://example.com/robots.txt')
print(response.text)
Common directives:
- User-agent: * — rules apply to all bots
- Disallow: /private/ — don't access this path
- Crawl-delay: 10 — wait 10 seconds between requests
3. What are the Terms of Service? Many websites have Terms of Service (ToS) that explicitly prohibit or restrict scraping. Violating ToS can have legal consequences, even if the data is publicly visible.
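Python's standard library can interpret these rules for you via urllib.robotparser. This sketch parses an in-memory robots.txt; for a live site you would call rp.set_url('https://example.com/robots.txt') followed by rp.read() instead of rp.parse():

```python
from urllib import robotparser

rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Is a given URL allowed for our bot?
print(rp.can_fetch('*', 'https://example.com/private/report.html'))  # False
print(rp.can_fetch('*', 'https://example.com/public/data.html'))     # True

# How long should we wait between requests?
print(rp.crawl_delay('*'))  # 10
```

Checking can_fetch() before every scrape, and honoring crawl_delay() in your time.sleep() calls, turns the ethical baseline into two lines of code.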
4. Could your scraping cause harm? Sending thousands of requests per second can overwhelm a small website's server, effectively creating a denial-of-service attack. This is both unethical and potentially illegal. Always rate-limit your requests.
5. What about the data itself? Even if you can legally scrape data, consider: Does the data contain personal information? Could your use of it harm the people represented in it? A list of public restaurant addresses is different from a collection of people's social media posts.
Legal Landscape: What the Law Says (and Doesn't)
The legality of web scraping varies by country and is evolving rapidly. A few key principles:
- In the United States: The 2022 Ninth Circuit ruling in hiQ Labs v. LinkedIn held that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act (CFAA). However, violating Terms of Service may still have legal consequences under other laws.
- In the European Union: The General Data Protection Regulation (GDPR) places strict limits on collecting personal data, even if it's publicly available. Scraping personal data of EU residents without a legal basis can result in significant fines.
- Universally: Copyright law protects creative content. Scraping someone's articles, images, or databases for commercial use may violate copyright, even if the content is publicly accessible.
The bottom line: "The data is publicly available" does not mean "I can do whatever I want with it." When in doubt, consult the API first, read the ToS, check robots.txt, and consider the impact.
An Ethics Checklist for Web Data Collection
Use this before every scraping project:
Before collecting data from the web:
[ ] Have I checked for an official API?
[ ] Have I read the site's robots.txt?
[ ] Have I read the Terms of Service?
[ ] Am I rate-limiting my requests (at least 1 second between them)?
[ ] Am I identifying myself with a User-Agent header?
[ ] Does the data contain personal information?
[ ] If it does, do I have a legal basis for collecting it?
[ ] Could my data collection harm the website or the people in the data?
[ ] Am I collecting only the data I actually need?
[ ] Am I storing the data securely?
Real-World Application: The Clearview AI Controversy
In 2020, the company Clearview AI scraped billions of photos from social media platforms (Facebook, Twitter, YouTube, and others) to build a facial recognition database, which they sold to law enforcement agencies. The photos were publicly available, but the platforms' Terms of Service explicitly prohibited scraping. Multiple countries and several US states investigated or fined Clearview AI. The case illustrates a critical principle: "publicly accessible" and "freely usable" are not the same thing.
This isn't an edge case. It's the kind of ethical question you'll face in your career — and the answer matters.
13.7 Building a Reusable Data Collection Pipeline
Let's put everything together. We'll build a data collection script that demonstrates professional practices: proper error handling, rate limiting, progress tracking, and caching.
A Complete API Collection Script
import requests
import pandas as pd
import time
import json
import os
def collect_api_data(base_url, params, headers=None,
pages=None, delay=1.0,
cache_file=None):
"""Collect paginated data from a REST API.
Parameters
----------
base_url : str
The API endpoint URL
params : dict
Query parameters
headers : dict, optional
Request headers (e.g., for authentication)
pages : int, optional
Maximum pages to collect (None = all)
delay : float
Seconds to wait between requests
cache_file : str, optional
Path to save intermediate results
Returns
-------
pd.DataFrame
"""
all_records = []
page = 1
# Resume from cache if available
if cache_file and os.path.exists(cache_file):
with open(cache_file, 'r') as f:
cached = json.load(f)
all_records = cached['records']
page = cached['next_page']
print(f"Resuming from page {page} "
f"({len(all_records)} cached records)")
while True:
params['page'] = page
try:
resp = requests.get(
base_url, params=params,
headers=headers, timeout=10
)
except requests.RequestException as e:
print(f"Request failed: {e}")
break
if resp.status_code == 429:
wait = int(resp.headers.get(
'Retry-After', 60))
print(f"Rate limited. Waiting {wait}s...")
time.sleep(wait)
continue
if resp.status_code != 200:
print(f"Error {resp.status_code} on "
f"page {page}")
break
data = resp.json()
# Handle both wrapped ({'results': [...]}) and bare-list responses
records = data.get('results', data) if isinstance(data, dict) else data
if not records:
break
all_records.extend(records)
print(f"Page {page}: {len(records)} records "
f"(total: {len(all_records)})")
# Save progress
if cache_file:
with open(cache_file, 'w') as f:
json.dump({
'records': all_records,
'next_page': page + 1
}, f)
# Check if we've collected enough
total_pages = data.get('total_pages') if isinstance(data, dict) else None
if pages and page >= pages:
break
if total_pages and page >= total_pages:
break
page += 1
time.sleep(delay)
return pd.DataFrame(all_records)
This function handles:
- Pagination (looping through pages)
- Rate limiting (sleeping between requests, handling 429 responses)
- Error handling (catching exceptions, checking status codes)
- Caching (saving progress so you can resume after interruptions)
- Progress tracking (printing status as it runs)
Using the Pipeline
df = collect_api_data(
base_url='https://api.example.com/data',
params={'country': 'all', 'per_page': 100},
headers={'Authorization': 'Bearer YOUR_KEY'},
delay=1.0,
cache_file='data_cache.json'
)
print(f"Collected {len(df)} records")
df.to_csv('collected_data.csv', index=False)
A Web Scraping Pipeline
For scraping, the pattern is similar but requires more custom parsing:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
def scrape_table_pages(base_url, max_pages=10,
delay=2.0):
"""Scrape tables from paginated web pages."""
all_data = []
for page in range(1, max_pages + 1):
url = f"{base_url}?page={page}"
resp = requests.get(url, timeout=10,
headers={'User-Agent': 'DataScienceProject'})
if resp.status_code != 200:
print(f"Stopped at page {page}: "
f"status {resp.status_code}")
break
soup = BeautifulSoup(resp.text, 'html.parser')
table = soup.find('table')
if not table:
break
rows = table.find_all('tr')[1:] # Skip header
for row in rows:
cells = row.find_all('td')
record = [c.text.strip() for c in cells]
all_data.append(record)
print(f"Page {page}: {len(rows)} rows scraped")
time.sleep(delay)
return all_data
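The function returns raw rows of text; turning them into a labeled DataFrame is the last step. The column names and values below are assumptions about the hypothetical page being scraped:

```python
import pandas as pd

# Rows as scrape_table_pages would return them (hypothetical values)
all_data = [
    ['Portland', '72'],
    ['Denver', '85'],
]

df = pd.DataFrame(all_data, columns=['city', 'temperature'])
df['temperature'] = df['temperature'].astype(int)  # scraped text → numbers
print(df.dtypes)
```

Note the astype(int) call: everything scraped from HTML arrives as strings, so converting numeric columns is almost always necessary before analysis.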
13.8 Project Checkpoint: Pulling Live Vaccination Data
Time to extend our running project with live data from the web. In this checkpoint, you'll pull the latest vaccination data from a public health API and discuss the ethics of web data collection.
Step 1: Fetch Data from a Public Health API
import requests
import pandas as pd
# The WHO Global Health Observatory has public APIs
# (This is a simplified example of the pattern)
url = 'https://api.example-health.org/vaccination'
params = {
'indicator': 'MCV1',
'format': 'json',
'per_page': 200
}
response = requests.get(url, params=params, timeout=15)
if response.status_code == 200:
data = response.json()
vacc_api = pd.json_normalize(data['value'])
print(f"Fetched {len(vacc_api)} records from API")
print(vacc_api.head())
else:
print(f"API request failed: {response.status_code}")
Step 2: Compare with Our File-Based Data
# Load the file-based data from Chapter 12
vacc_file = pd.read_csv('vaccination_coverage.csv')
# Check: does the API data have more recent years?
api_years = set(vacc_api['year'].unique())
file_years = set(vacc_file['year'].unique())
new_years = api_years - file_years
print(f"API has data for years: {sorted(api_years)}")
print(f"File has data for years: {sorted(file_years)}")
print(f"New years from API: {sorted(new_years)}")
Step 3: Ethical Reflection
Before incorporating this data, we should document our ethical considerations:
## Data Collection Ethics Note
**Source:** WHO Global Health Observatory API
**Access method:** Public REST API (no authentication required)
**Rate limiting:** Respected API guidelines; 1-second delay between requests
**robots.txt:** Checked; no restrictions on API endpoints
**Terms of Service:** WHO data is available under CC BY-NC-SA 3.0 IGO license
**Personal data:** No individual-level data; all aggregated by country
**Storage:** Data saved locally for academic analysis only
This kind of documentation isn't just good practice — it's increasingly required by journals, institutions, and employers.
What You've Built So Far (Project Progress):
- Chapters 7-8: Loaded and cleaned the vaccination CSV
- Chapters 9-10: Reshaped and transformed the data, cleaned text fields
- Chapter 11: Added date handling and time series features
- Chapter 12: Integrated population data (Excel) and country metadata (JSON)
- Chapter 13: Added live API data and documented collection ethics
- Next: Part III will introduce visualization and exploration
13.9 Spaced Review: Chapters 1-12
Let's revisit key concepts from earlier chapters to keep them fresh.
From Chapter 5 (Data Structures): What's the difference between a list and a dictionary in Python? When would you use each?
Check your answer
A list is an ordered sequence accessed by position (index). A dictionary is a collection of key-value pairs accessed by key. Use a list when you have a collection of items in a specific order (e.g., a sequence of temperatures). Use a dictionary when you need to look up values by name (e.g., mapping country codes to country names). JSON data maps directly to these structures: JSON arrays become Python lists, JSON objects become Python dictionaries.
From Chapter 8 (Cleaning Messy Data): What does df.dropna() do, and when would you use df.fillna() instead?
Check your answer
`dropna()` removes rows (or columns) that contain missing values. `fillna()` replaces missing values with a specified value (a constant, the mean, or a forward/backward fill). Use `dropna()` when missing data is rare and the remaining data is sufficient for analysis. Use `fillna()` when you can reasonably estimate what the missing values should be, or when dropping rows would lose too much data.
From Chapter 10 (Text Data): How would you extract the domain name from a column of email addresses like user@example.com?
Check your answer
`df['domain'] = df['email'].str.split('@').str[1]` — split on `@` and take the second element (index 1). Alternatively, `df['email'].str.extract(r'@(.+)')` uses a regex to capture everything after the `@` sign.
From Chapter 12 (Getting Data from Files): What is the difference between pd.read_json() and pd.json_normalize()?
Check your answer
`pd.read_json()` loads flat, table-like JSON directly into a DataFrame. `pd.json_normalize()` handles nested JSON by flattening hierarchical structures into columns with dot-separated names (e.g., `address.city`). Use `read_json()` for simple arrays of objects with no nesting; use `json_normalize()` when the JSON contains objects inside objects.
Chapter Summary
This chapter was about turning the internet into your data source — responsibly.
You learned the technical tools:
- The requests library for making HTTP requests. You can now send GET requests with parameters and headers, check status codes, and parse JSON responses.
- API patterns including authentication (API keys), pagination (looping through pages), and rate limiting (sleeping between requests and handling 429 responses).
- BeautifulSoup for parsing HTML when no API exists. You can find elements by tag name, class, or CSS selector and extract text content from web pages.
- A reusable pipeline that handles errors, respects rate limits, caches progress, and produces a clean DataFrame.
And you learned the ethical framework:
- Always prefer APIs over scraping.
- Respect robots.txt and Terms of Service.
- Rate-limit your requests to avoid harming servers.
- Consider the people in the data — especially when personal information is involved.
- Document your collection methods so others can evaluate your approach.
The threshold concept for this chapter is that the web is a data source, not just a reading source — but accessing it responsibly requires understanding both the technical protocols and the ethical boundaries. The technical skills get you the data; the ethical judgment keeps you out of trouble and protects the people whose data you're using.
In Part III, we'll shift from collecting and cleaning data to understanding and communicating it — through visualization and exploratory analysis. The datasets you've assembled in Part II are about to come alive.
Concept Inventory Update
New concepts introduced in this chapter:
| # | Concept | First Introduced | Section |
|---|---|---|---|
| 1 | HTTP request-response cycle | This chapter | 13.1 |
| 2 | REST APIs and structured data access | This chapter | 13.2 |
| 3 | Web scraping and HTML parsing | This chapter | 13.5 |
| 4 | Data collection ethics and legality | This chapter | 13.6 |
| 5 | Reusable data pipelines | This chapter | 13.7 |
Running total: 54 concepts, 149 terms, 35 techniques introduced across Chapters 1-13.
Next up: Part III — Visualization and Exploration. You've learned to get data, clean it, and shape it. Now you'll learn to see it. Chapter 14 introduces matplotlib and the grammar of graphics.