# Key Takeaways: Getting Data from the Web
This is your reference card for Chapter 13 — the chapter where the internet became your data source. Keep this nearby when you need to retrieve data from an API, scrape a web page, or evaluate whether you should be collecting web data at all.
## The requests Library Quick Reference
Making a GET request:
```python
import requests

response = requests.get(
    'https://api.example.com/data',
    params={'key': 'value', 'year': 2023},
    headers={'Authorization': 'Bearer YOUR_KEY',
             'User-Agent': 'MyProject/1.0'},
    timeout=10
)
```
Checking the response:
```python
response.status_code   # 200, 404, 429, etc.
response.text          # Raw response as string
response.json()        # Parse JSON into dict/list
response.headers       # Response metadata
response.url           # The actual URL that was requested
```
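If you'd rather work with exceptions than inspect `status_code` by hand, `requests` also provides `raise_for_status()`, which raises a `requests.HTTPError` for any 4xx or 5xx response. A small sketch (the `fetch_json` helper name is ours, not from the chapter):

```python
import requests

def fetch_json(url, **kwargs):
    """GET a URL and return parsed JSON, raising on HTTP errors."""
    resp = requests.get(url, timeout=10, **kwargs)
    resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return resp.json()
```

This trades the manual `if resp.status_code == 200:` check for a `try/except requests.HTTPError` at the call site.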
Status codes you need to know:
| Code | Meaning | Your Action |
|---|---|---|
| 200 | Success | Parse the response |
| 400 | Bad request | Check your parameters |
| 401 | Unauthorized | Check your API key |
| 403 | Forbidden | You don't have access |
| 404 | Not found | Check the URL |
| 429 | Too many requests | Wait, then retry |
| 500 | Server error | Wait, then retry |
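The "wait, then retry" advice for 429 and 500 can be folded into a small helper. A sketch only — the retry count, backoff schedule, and the set of retryable codes are our arbitrary choices, not values from the chapter:

```python
import time
import requests

def get_with_retry(url, max_retries=3, **kwargs):
    """GET a URL, retrying on 429/5xx with an increasing wait between attempts."""
    for attempt in range(max_retries):
        resp = requests.get(url, timeout=10, **kwargs)
        if resp.status_code not in (429, 500, 502, 503):
            return resp
        time.sleep(2 ** attempt)  # wait 1s, then 2s, then 4s, ...
    return resp  # still failing after max_retries: return the last response
```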
## API Request Patterns
Simple request with parameters:
```python
resp = requests.get(url, params={'city': 'Portland'})
data = resp.json()
df = pd.DataFrame(data)
```
Request with authentication:
```python
headers = {'Authorization': f'Bearer {api_key}'}
resp = requests.get(url, headers=headers)
```
Paginated collection:
```python
import time

all_records = []
page = 1
while True:
    resp = requests.get(url, params={'page': page})
    data = resp.json()
    all_records.extend(data['results'])
    if page >= data['total_pages']:
        break
    page += 1
    time.sleep(1)  # Be polite: pause between requests

df = pd.DataFrame(all_records)
```
Error handling template:
```python
try:
    resp = requests.get(url, timeout=10)
    if resp.status_code == 200:
        data = resp.json()
    elif resp.status_code == 429:
        time.sleep(60)  # Rate limited: wait, then retry
    else:
        print(f"Error: {resp.status_code}")
except requests.RequestException as e:
    print(f"Connection failed: {e}")
```
API key safety — never do this:
```python
API_KEY = 'sk_live_abc123'  # WRONG: exposed in code
```
Do this instead:
```python
import os

API_KEY = os.environ.get('MY_API_KEY')
```
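To fail fast when the variable is unset (rather than sending `Authorization: Bearer None` and getting a confusing 401 later), you can wrap the lookup in a guard. The `require_env` helper is a hypothetical addition, not part of the chapter's code:

```python
import os

def require_env(name):
    """Return an environment variable's value, failing loudly if it is unset."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f'Set the {name} environment variable first')
    return value

# API_KEY = require_env('MY_API_KEY')
```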
## BeautifulSoup Quick Reference
Setup:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_string, 'html.parser')
```
Finding elements:
| Method | What It Does | Returns |
|---|---|---|
| `soup.find('tag')` | First matching element | Single Tag or `None` |
| `soup.find_all('tag')` | All matching elements | List of Tags |
| `soup.find('tag', class_='name')` | First with specific class | Single Tag or `None` |
| `soup.find_all('tag', class_='name')` | All with specific class | List of Tags |
| `soup.select('css selector')` | CSS selector matching | List of Tags |
Extracting content:
```python
element.text           # Text content (no HTML tags)
element.text.strip()   # Text with whitespace trimmed
element['href']        # Attribute value (e.g., link URL)
element.get('class')   # Attribute value (returns None if missing)
```
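Putting these together — a self-contained sketch that pairs each link's text with its URL (the HTML snippet is invented for illustration):

```python
from bs4 import BeautifulSoup

html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# Pair each link's visible text with its href attribute
links = [(a.text.strip(), a['href']) for a in soup.find_all('a')]
# links == [('First', '/a'), ('Second', '/b')]
```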
Common scraping pattern:
```python
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
table = soup.find('table')

rows = table.find_all('tr')
data = []
for row in rows[1:]:  # Skip header
    cells = row.find_all('td')
    record = [c.text.strip() for c in cells]
    data.append(record)

df = pd.DataFrame(data, columns=[...])
```
Shortcut for HTML tables:
```python
tables = pd.read_html(url)  # Returns list of DataFrames
df = tables[0]              # First table on the page
```
## Ethics Checklist for Web Data Collection
Use this before every web data collection project:
BEFORE COLLECTING:
- [ ] Check for an official API first
- [ ] Read the site's robots.txt
- [ ] Read the Terms of Service
- [ ] Identify yourself with a User-Agent header
- [ ] Rate-limit requests (minimum 1 second between them)

ABOUT THE DATA:
- [ ] Collect only what you need
- [ ] Check for personal information
- [ ] Consider whether data subjects would expect this use
- [ ] Assess potential for harm if data is misused

AFTER COLLECTING:
- [ ] Store data securely
- [ ] Document your collection method
- [ ] Document ethical considerations
- [ ] Delete data you no longer need
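Part of the robots.txt check can be automated with the standard library's `urllib.robotparser`. A sketch — the rules are parsed inline here so it runs offline; in a real project you would point it at the site's live robots.txt with `set_url()` and `read()`:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# In practice, fetch the live file:
#   rp.set_url('https://example.com/robots.txt'); rp.read()
# Parsed inline here so the sketch runs without a network connection:
rp.parse(['User-agent: *', 'Disallow: /private/'])

rp.can_fetch('MyProject/1.0', 'https://example.com/data.html')   # allowed
rp.can_fetch('MyProject/1.0', 'https://example.com/private/x')   # disallowed
```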
Key principles:
- "Publicly accessible" does not mean "freely usable"
- Legal and ethical are separate questions — something can be legal but unethical
- Scale matters — what's fine for 50 data points may be problematic for 50,000
- Always prefer APIs over scraping
- When in doubt, slow down and investigate
## Decision Flowchart: How to Get Web Data
```
Do you need data from the internet?
│
├── Does the source offer an API?
│   ├── Yes → Use the API
│   │   ├── Does it require authentication? → Get an API key
│   │   ├── Is data paginated? → Loop through pages
│   │   └── Are there rate limits? → Add time.sleep()
│   │
│   └── No → Is the data in HTML tables or lists?
│       ├── Yes → Consider scraping (check ethics first!)
│       │   ├── Check robots.txt
│       │   ├── Read Terms of Service
│       │   ├── Use BeautifulSoup or pd.read_html()
│       │   └── Rate-limit your requests
│       │
│       └── No → Is the data loaded by JavaScript?
│           ├── Check Network tab for hidden API calls
│           └── Consider Selenium (advanced, beyond Ch. 13)
```
## Common Mistakes to Avoid
- Not setting a timeout. Always use `timeout=10` (or similar). Without it, a request to a slow server will hang your script forever.
- Not checking status codes. Calling `response.json()` on a 404 or 500 response will give you an error message, not data — and your code will crash or produce garbage.
- Hammering the server. Always add delays between requests. Even if the API doesn't document a rate limit, rapid-fire requests are rude and may get you blocked.
- Hardcoding API keys. Use environment variables. If you commit a key to Git, consider it compromised.
- Skipping the ethics check. The easiest scraping project in the world isn't worth doing if it causes harm or violates trust.