Key Takeaways: Getting Data from the Web

This is your reference card for Chapter 13 — the chapter where the internet became your data source. Keep this nearby when you need to retrieve data from an API, scrape a web page, or evaluate whether you should be collecting web data at all.


The requests Library Quick Reference

Making a GET request:

import requests

response = requests.get(
    'https://api.example.com/data',
    params={'key': 'value', 'year': 2023},
    headers={'Authorization': 'Bearer YOUR_KEY',
             'User-Agent': 'MyProject/1.0'},
    timeout=10
)

Checking the response:

response.status_code   # 200, 404, 429, etc.
response.text          # Raw response as string
response.json()        # Parse JSON into dict/list
response.headers       # Response metadata
response.url           # The actual URL that was requested
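Because params are URL-encoded for you, it is not always obvious what was actually sent. One way to inspect the final URL without touching the network is to prepare the request first. A small sketch (the endpoint is made up for illustration):

```python
from requests import Request

# Hypothetical endpoint; prepare() builds the final URL without sending anything
req = Request('GET', 'https://api.example.com/data',
              params={'year': 2023}).prepare()
print(req.url)  # https://api.example.com/data?year=2023
```

After a real request, response.url reports the same information, including any redirects that were followed.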

Status codes you need to know:

Code  Meaning            Your Action
200   Success            Parse the response
400   Bad request        Check your parameters
401   Unauthorized       Check your API key
403   Forbidden          You don't have access
404   Not found          Check the URL
429   Too many requests  Wait, then retry
500   Server error       Wait, then retry

API Request Patterns

Simple request with parameters:

import pandas as pd

resp = requests.get(url, params={'city': 'Portland'})
data = resp.json()
df = pd.DataFrame(data)

Request with authentication:

headers = {'Authorization': f'Bearer {api_key}'}
resp = requests.get(url, headers=headers)

Paginated collection:

import time

all_records = []
page = 1
while True:
    resp = requests.get(url, params={'page': page})
    data = resp.json()
    all_records.extend(data['results'])
    if page >= data['total_pages']:
        break
    page += 1
    time.sleep(1)  # Be polite: pause between pages
df = pd.DataFrame(all_records)

Error handling template:

try:
    resp = requests.get(url, timeout=10)
    if resp.status_code == 200:
        data = resp.json()
    elif resp.status_code == 429:
        time.sleep(60)  # Rate limited: wait, then retry the request
    else:
        print(f"Error: {resp.status_code}")
except requests.RequestException as e:
    print(f"Connection failed: {e}")

API key safety — never do this:

API_KEY = 'sk_live_abc123'  # WRONG: exposed in code

Do this instead:

import os
API_KEY = os.environ.get('MY_API_KEY')  # Returns None if the variable is unset

BeautifulSoup Quick Reference

Setup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_string, 'html.parser')

Finding elements:

Method                               What It Does               Returns
soup.find('tag')                     First matching element     Single Tag or None
soup.find_all('tag')                 All matching elements      List of Tags
soup.find('tag', class_='name')      First with specific class  Single Tag or None
soup.find_all('tag', class_='name')  All with specific class    List of Tags
soup.select('css selector')          CSS selector matching      List of Tags

Extracting content:

element.text           # Text content (no HTML tags)
element.text.strip()   # Text with whitespace trimmed
element['href']        # Attribute value (e.g., link URL)
element.get('class')   # Attribute value (returns None if missing)
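The finding and extracting methods combine naturally. A self-contained sketch on a small invented HTML fragment, so it runs without any network request:

```python
from bs4 import BeautifulSoup

# Invented HTML fragment for illustration
html = """
<ul>
  <li class="item"><a href="/a">Alpha</a></li>
  <li class="item"><a href="/b">Beta</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# Pair each item's text with the URL from its link's href attribute
links = [(li.text.strip(), li.find('a')['href'])
         for li in soup.find_all('li', class_='item')]
print(links)  # [('Alpha', '/a'), ('Beta', '/b')]
```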

Common scraping pattern:

resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
table = soup.find('table')
rows = table.find_all('tr')

data = []
for row in rows[1:]:  # Skip header
    cells = row.find_all('td')
    record = [c.text.strip() for c in cells]
    data.append(record)

df = pd.DataFrame(data, columns=[...])

Shortcut for HTML tables:

tables = pd.read_html(url)  # Returns list of DataFrames
df = tables[0]              # First table on the page

Ethics Checklist for Web Data Collection

Use this before every web data collection project:

BEFORE COLLECTING:
[ ] Check for an official API first
[ ] Read the site's robots.txt
[ ] Read the Terms of Service
[ ] Identify yourself with a User-Agent header
[ ] Rate-limit requests (minimum 1 second between them)

ABOUT THE DATA:
[ ] Collect only what you need
[ ] Check for personal information
[ ] Consider whether data subjects would expect this use
[ ] Assess potential for harm if data is misused

AFTER COLLECTING:
[ ] Store data securely
[ ] Document your collection method
[ ] Document ethical considerations
[ ] Delete data you no longer need
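The robots.txt step of the checklist can be scripted with the standard library's urllib.robotparser. A minimal sketch using invented rules; against a real site you would call rp.set_url('https://example.com/robots.txt') and rp.read() instead of parse():

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt rules for illustration
rules = [
    'User-agent: *',
    'Disallow: /private/',
    'Crawl-delay: 5',
]
rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch('MyProject/1.0', 'https://example.com/data.html'))  # True
print(rp.can_fetch('MyProject/1.0', 'https://example.com/private/x'))  # False
print(rp.crawl_delay('MyProject/1.0'))                                 # 5
```

A Crawl-delay entry, where present, tells you the minimum pause the site asks for between requests.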

Key principles:
- "Publicly accessible" does not mean "freely usable"
- Legal and ethical are separate questions — something can be legal but unethical
- Scale matters — what's fine for 50 data points may be problematic for 50,000
- Always prefer APIs over scraping
- When in doubt, slow down and investigate


Decision Flowchart: How to Get Web Data

Do you need data from the internet?
│
├── Does the source offer an API?
│   ├── Yes → Use the API
│   │   ├── Does it require authentication? → Get an API key
│   │   ├── Is data paginated? → Loop through pages
│   │   └── Are there rate limits? → Add time.sleep()
│   │
│   └── No → Is the data in HTML tables or lists?
│       ├── Yes → Consider scraping (check ethics first!)
│       │   ├── Check robots.txt
│       │   ├── Read Terms of Service
│       │   ├── Use BeautifulSoup or pd.read_html()
│       │   └── Rate-limit your requests
│       │
│       └── No → Is the data loaded by JavaScript?
│           ├── Check Network tab for hidden API calls
│           └── Consider Selenium (advanced, beyond Ch. 13)

Common Mistakes to Avoid

  1. Not setting a timeout. Always use timeout=10 (or similar). Without it, a request to a slow server will hang your script forever.

  2. Not checking status codes. Calling response.json() on a 404 or 500 response parses the server's error body, not your data. Your code will either crash (if the body isn't JSON) or quietly process garbage.

  3. Hammering the server. Always add delays between requests. Even if the API doesn't document a rate limit, rapid-fire requests are rude and may get you blocked.

  4. Hardcoding API keys. Use environment variables. If you commit a key to Git, consider it compromised.

  5. Skipping the ethics check. The easiest scraping project in the world isn't worth doing if it causes harm or violates trust.
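Mistakes 1 through 3 are easiest to avoid when the request logic lives in one helper. A sketch (fetch_json is an invented name, not from the chapter; the get and sleep parameters default to the real functions and are injectable only so the helper can be exercised without a network):

```python
import time
import requests

def fetch_json(url, max_retries=3, get=requests.get, sleep=time.sleep):
    """Illustrative retry helper: timeout, status check, backoff."""
    for attempt in range(max_retries):
        try:
            resp = get(url, timeout=10)        # Mistake 1: always set a timeout
        except requests.RequestException:
            sleep(2 ** attempt)                # Back off on connection errors
            continue
        if resp.status_code == 200:            # Mistake 2: check the status
            return resp.json()
        if resp.status_code in (429, 500):     # Mistake 3: wait, don't hammer
            sleep(2 ** attempt)
            continue
        resp.raise_for_status()                # Other errors: fail loudly
    raise RuntimeError(f'Gave up on {url} after {max_retries} attempts')
```

The exponential backoff (1s, 2s, 4s, ...) keeps retries polite; swap in whatever schedule the API's documentation recommends.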