> "The internet is the world's largest database. Most of it has no official API."
In This Chapter
- What You Will Learn
- The Monday Morning Stack
- 20.1 Why Web Scraping Matters for Business Intelligence
- 20.2 Ethical and Legal Principles: Read This First
- 20.3 How the Web Works: HTTP Basics
- 20.4 The requests Library
- 20.5 HTML Structure: What Your Scraper Sees
- 20.6 BeautifulSoup4: Parsing HTML
- 20.7 Navigating the Parse Tree
- 20.8 Extracting Links and Tables
- 20.9 Handling Pagination
- 20.10 Rate Limiting and the Ethics of Speed
- 20.11 Saving Scraped Data
- 20.12 Common Pitfalls
- 20.13 When NOT to Scrape
- 20.14 Priya's Competitor Price Tracker: The Complete Scenario
- 20.15 Pulling Everything Together: A Complete Scraping Workflow
- Chapter Summary
- Key Terms
Chapter 20: Web Scraping for Business Intelligence
"The internet is the world's largest database. Most of it has no official API."
What You Will Learn
By the end of this chapter, you will be able to:
- Explain why web scraping matters for competitive business intelligence and when it is the right approach
- Make HTTP requests using the `requests` library and interpret responses and status codes
- Parse HTML with BeautifulSoup4 using `find()`, `find_all()`, and `select()`
- Navigate the HTML parse tree: parents, children, and siblings
- Extract text, links, and tables from web pages
- Handle pagination across multi-page listings
- Add rate limiting and check `robots.txt` for ethical, responsible scraping
- Save scraped data to CSV and JSON files
- Recognize common pitfalls: JavaScript-rendered pages, encoding issues, and changing page structure
- Know when to use an API instead of scraping
The Monday Morning Stack
It is Monday morning at Acme Corp regional headquarters. Priya Okonkwo, junior analyst, arrives to find three sticky notes on her monitor.
The first is from Sandra Chen, VP of Sales: "Can you pull NorthStar Office competitor pricing on paper reams and ink cartridges? Quarterly review is Thursday."
The second is from a regional sales manager: "I heard NorthStar just dropped their toner prices. Can you confirm before I talk to my team?"
The third is from someone Priya does not recognize: "Hi — are you the data person? What does our main competitor charge for folding tables?"
Before she learned Python, Priya would have spent most of Monday manually visiting each competitor's website, copying prices into a spreadsheet, and trying to remember whether the price she was looking at included shipping. She would do this again next Monday, and the Monday after that, and every Monday for as long as she kept this job.
Today, she has a better option.
Web scraping is the practice of programmatically extracting data from websites — the same data visible in your browser, but collected automatically and saved in a format your code can analyze. For business intelligence, this is transformative. Competitor pricing, job posting trends, publicly available financial data, real estate listings, regulatory filings — vast amounts of business-relevant information is published on the public web, updated constantly, and waiting to be collected systematically.
This chapter teaches you how to collect it responsibly, reliably, and efficiently.
20.1 Why Web Scraping Matters for Business Intelligence
Before writing a single line of code, it is worth understanding what web scraping is genuinely useful for in a business context. The technology is a means, not an end. Here are the most common and legitimate business use cases.
Competitive Pricing Intelligence
If your competitors publish their prices publicly — as most consumer-facing businesses do — you can track those prices over time. A regional distributor like Acme Corp can monitor whether a competitor is running a promotion on paper stock, which would explain a recent dip in Acme's own sales numbers. A retailer can check whether to adjust prices before a holiday weekend. This is not industrial espionage — it is the same information any customer could find by visiting the site. You are simply collecting it systematically rather than manually.
The difference between checking a competitor's price once and tracking it weekly for six months is the difference between a data point and a trend. Trends are where business decisions actually come from.
Market Research and Trend Analysis
Public job postings reveal where industries are investing. If every major competitor in your space is hiring data engineers, that is a signal worth tracking. Public business registrations show where new competitors are forming. Review aggregator sites show what customers consistently complain about — and what they praise. News aggregators show what your industry is talking about before it becomes conventional wisdom.
All of this is public information. Scraping lets you collect it at scale, regularly, and automatically.
Lead Generation (Used Responsibly)
Public business directories, trade association member lists, and professional organization websites sometimes publish contact information intended for legitimate outreach. This is the situation Maya Reyes finds herself in — building a prospecting tool from public directories for her consulting practice. The key word is "public," and even then, responsible use means respecting privacy laws, opt-out signals, and the spirit of why that information was published.
Monitoring Your Own Presence
You can scrape review sites to monitor what customers say about your products. You can track your own product listings on marketplaces where you sell. You can monitor whether your press releases are being picked up and how they are being covered. All of this is data about your own company, and collecting it automatically saves time that would otherwise go to manual monitoring.
Aggregating Public Government and Regulatory Data
Government databases, public financial filings, weather data, transportation schedules — an enormous amount of useful business data is published by public bodies and made freely available. Economic statistics from the Bureau of Labor Statistics, company filings from SEC EDGAR, import/export data from government trade agencies — these can be scraped (or better yet, accessed via their APIs when those exist) to inform business decisions.
20.2 Ethical and Legal Principles: Read This First
Web scraping is powerful, and like any powerful tool, it can be misused. Before writing a single scraper for production use, commit these principles to memory. This is not boilerplate caution — these are practical rules that will keep you out of legal trouble and keep your scrapers running.
Check robots.txt
Every website can publish a file at /robots.txt that specifies which parts of the site automated bots may and may not access. The format is a standard (called the Robots Exclusion Protocol) that has been honored by well-behaved bots since 1994. Ignoring this file is disrespectful at best and has been used as evidence in legal disputes at worst.
You will learn to check it programmatically in Section 20.4.
Read the Terms of Service
Many websites explicitly prohibit automated scraping in their terms of service. Scraping in violation of a site's ToS has led to account terminations, IP bans, and in some cases legal action under computer fraud statutes. When in doubt — and especially for high-volume or commercial scraping — consult legal counsel.
Do Not Hammer Servers
Adding even a one-second pause between requests dramatically reduces the load you place on a site's server. Aggressive scraping that sends hundreds of requests per minute can degrade service for real users and can be construed as a denial-of-service attack regardless of intent. Rate limiting is not optional — it is a basic professional obligation.
Do Not Collect Personal Data Without Legal Basis
GDPR (in the EU), CCPA (in California), and similar privacy regulations restrict how you may collect and use personal information. "It was on a public website" is not always a legal defense for collecting people's personal data, particularly if you plan to use it for marketing outreach or store it persistently. Know your jurisdiction's rules.
When an API Exists, Use It
Many websites that you might want to scrape offer a proper API — an official, designed-for-machine-access interface. APIs are faster, more reliable, more stable, and explicitly sanctioned. You will learn to recognize when to prefer an API over scraping in Section 20.13.
20.3 How the Web Works: HTTP Basics
When you type a URL into your browser and press Enter, a conversation happens between your computer and a remote server. Understanding this conversation is the foundation of web scraping.
Requests and Responses
Your browser sends an HTTP request to a server. The request specifies:
- The method: usually
GET(retrieve a resource) orPOST(submit data) - The URL: the address of the resource
- Headers: metadata about the request, including what software is making it (the
User-Agentheader)
The server sends back an HTTP response, which contains:
- A status code: a number indicating what happened
- Headers: metadata about the response (content type, encoding, caching rules)
- A body: the actual content — usually HTML, JSON, or binary data
Status Codes You Need to Know
| Code | Meaning | What to do in your scraper |
|---|---|---|
| 200 | OK — request succeeded | Proceed with parsing |
| 301 / 302 | Redirect — resource moved | requests follows these automatically |
| 403 | Forbidden — server refuses | May be blocking scrapers; check ToS |
| 404 | Not Found — resource missing | Log and skip; check your URL |
| 429 | Too Many Requests | Slow down; add longer delays |
| 500 | Internal Server Error | Server problem; retry later |
| 503 | Service Unavailable | Server overloaded; retry later |
When your scraper gets a 200, everything worked. When it gets anything else, your code needs to handle that gracefully — which you will see shortly in the requests section.
What Your Browser Does vs. What Python Does
Your browser receives HTML and renders it visually — fonts, colors, layout, images. Python receives the same HTML and sees it as a string of text. Your job as a scraper author is to write code that finds the specific pieces of that text — prices, product names, contact information — within the larger structure of the HTML document.
20.4 The requests Library
Python's standard library includes an http.client module, but it is low-level and verbose. The requests library is the community standard for HTTP in Python. Install it before continuing.
pip install requests
Making Your First Request
import requests
response = requests.get("https://httpbin.org/get")
print(response.status_code) # 200
print(type(response.text)) # <class 'str'>
print(len(response.text)) # number of characters in the response body
requests.get() sends a GET request to the specified URL and returns a Response object. That object contains everything the server sent back.
The Response Object
| Attribute / Method | What it contains |
|---|---|
| `response.status_code` | Integer status code (200, 404, etc.) |
| `response.text` | Response body decoded as a string |
| `response.content` | Response body as raw bytes |
| `response.json()` | Body parsed as JSON (raises error if not JSON) |
| `response.headers` | Response headers as a dictionary |
| `response.url` | Final URL after any redirects |
| `response.encoding` | Character encoding used to decode the body |
| `response.raise_for_status()` | Raises an exception if status >= 400 |
A Production-Ready Fetch Function
Here is a fetch function you will use as a building block throughout this chapter. Notice three things about it: it always sets a timeout, it always handles exceptions, and it uses raise_for_status() to convert error status codes into exceptions.
import time
import requests
SCRAPER_HEADERS = {
"User-Agent": (
"AcmeCorpPriceMonitor/1.0 "
"(Internal business intelligence tool; contact: priya@acmecorp.example.com)"
)
}
def fetch_page_html(url: str, session: requests.Session) -> str | None:
"""Fetch a URL and return its HTML content as a string.
Args:
url: The URL to fetch.
session: A requests.Session object for connection reuse.
Returns:
HTML content as a string, or None if the request fails.
"""
try:
response = session.get(url, timeout=15)
response.raise_for_status()
return response.text
except requests.exceptions.Timeout:
print(f" Timeout fetching: {url}")
return None
except requests.exceptions.HTTPError as error:
status = error.response.status_code
print(f" HTTP {status} error fetching: {url}")
if status == 429:
print(" Rate limited — waiting 30 seconds before next request.")
time.sleep(30)
return None
except requests.exceptions.ConnectionError:
print(f" Connection error fetching: {url}")
return None
except requests.exceptions.RequestException as error:
print(f" Request failed for {url}: {error}")
return None
A few design decisions worth noting:
Why a Session? A requests.Session object reuses the underlying TCP connection across multiple requests to the same host, which is faster and more polite than opening a new connection for every request. It also automatically handles cookies that the server sets — useful for sites that use session cookies to track visitors.
Why check for 429 specifically? A 429 response means the server is explicitly asking you to slow down. Ignoring it and retrying immediately will almost certainly result in your IP being blocked. A 30-second pause is the minimum courteous response.
Why raise_for_status()? Without it, a 403 or 404 response would return successfully (from requests's perspective) and you would try to parse an error page as if it were real data. raise_for_status() converts non-2xx responses into exceptions that your except block handles.
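To wire these pieces together, create one Session for the whole scraping job and attach the identifying headers to it once; every request made through `fetch_page_html` then carries them automatically. A minimal sketch of that setup:

```python
import requests

# Attach the identifying headers once; every session.get() call sends them.
session = requests.Session()
session.headers.update({
    "User-Agent": (
        "AcmeCorpPriceMonitor/1.0 "
        "(Internal business intelligence tool; contact: priya@acmecorp.example.com)"
    )
})

print(session.headers["User-Agent"].split("/")[0])  # AcmeCorpPriceMonitor
```

Pass this configured session into `fetch_page_html` and every other function in this chapter that takes a `session` parameter.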
Checking robots.txt
Before scraping any site, check whether it permits automated access to the pages you want. Python's standard library includes a RobotFileParser for exactly this purpose.
import urllib.robotparser
from urllib.parse import urlparse
def is_scraping_allowed(target_url: str, user_agent: str = "*") -> bool:
"""Check whether a robot may access a given URL per the site's robots.txt.
Args:
target_url: The URL you want to scrape.
user_agent: Your bot's user agent name. Use "*" to check general rules.
Returns:
True if scraping is permitted, False otherwise.
Returns False (conservative) if robots.txt cannot be read.
"""
parsed = urlparse(target_url)
robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
parser = urllib.robotparser.RobotFileParser()
parser.set_url(robots_url)
try:
parser.read()
return parser.can_fetch(user_agent, target_url)
except Exception as error:
print(f" Could not read robots.txt from {robots_url}: {error}")
return False # Conservative: assume not allowed if we cannot verify
# Use it before starting a scraping job
target = "https://northstar-office.example.com/products/"
if is_scraping_allowed(target):
print("robots.txt permits access — proceeding.")
else:
print("robots.txt disallows access — stopping.")
Priya runs this check before every new scraping target. It takes two seconds and has saved her from two situations where robots.txt explicitly blocked the paths she wanted to access.
Working with JSON Responses
Many modern web services return JSON data rather than HTML pages. When the response is JSON, use response.json() instead of response.text.
def fetch_json_data(url: str, params: dict, session: requests.Session) -> dict | None:
"""Fetch JSON data from an API endpoint.
Args:
url: The API endpoint URL.
params: Query parameters to include in the request.
session: A requests.Session object.
Returns:
Parsed JSON data as a Python dictionary, or None on failure.
"""
try:
response = session.get(url, params=params, timeout=15)
response.raise_for_status()
return response.json()
except requests.exceptions.RequestException as error:
print(f" API request failed: {error}")
return None
except ValueError as error:
print(f" Response is not valid JSON: {error}")
return None
If you are scraping a site and the data you want comes back as JSON (visible in the browser's Network tab under XHR/Fetch requests), calling the JSON endpoint directly is vastly preferable to parsing HTML. You get structured data immediately, without any parsing.
20.5 HTML Structure: What Your Scraper Sees
When response.text returns the HTML of a page, it is a text document that looks something like this:
<!DOCTYPE html>
<html lang="en">
<head>
<title>Office Supplies — NorthStar Office Co.</title>
<meta charset="UTF-8">
</head>
<body>
<div class="product-listing" data-category="paper">
<h2 class="product-name">Premium Copy Paper, 500 Sheets</h2>
<span class="price" data-sku="PAPER-A4-500">$8.99</span>
<span class="stock-status in-stock">In Stock</span>
<a href="/products/paper/premium-copy-500" class="product-link">
View Details
</a>
</div>
<div class="product-listing" data-category="ink">
<h2 class="product-name">Standard Ink Cartridge, Black</h2>
<span class="price" data-sku="INK-BK-STD">$24.99</span>
<span class="stock-status out-of-stock">Out of Stock</span>
<a href="/products/ink/standard-black" class="product-link">
View Details
</a>
</div>
<div class="pagination">
<a href="/products/?page=2" class="pagination-next">Next »</a>
</div>
</body>
</html>
HTML is built from tags (<div>, <span>, <h2>, <a>, etc.), which can carry attributes (class="product-listing", data-sku="PAPER-A4-500", href="/products/..."). Tags nest inside one another to form a tree structure — the Document Object Model (DOM).
To extract data from this HTML, you navigate this tree programmatically. That is BeautifulSoup's job.
Understanding CSS Classes and Attributes
The class attribute is your primary navigation tool in most scrapers. Web designers use classes to control how elements look; you use them to identify which elements contain the data you want.
Common patterns you will encounter:
class="price"— a direct, single-class marker. Find withclass_="price".class="stock-status in-stock"— multiple classes on one element. Find withclass_="in-stock"orclass_="stock-status".data-sku="PAPER-A4-500"— data attributes, useful for structured metadata. Find withattrs={"data-sku": "PAPER-A4-500"}.id="main-navigation"— unique identifier per page. Find withid="main-navigation".
CSS Selectors
CSS selectors are patterns for targeting HTML elements. They were invented for styling, but they are equally powerful for scraping. BeautifulSoup supports them via .select().
| Selector | Example | Matches |
|---|---|---|
| Tag name | `div` | All `<div>` elements |
| Class | `.price` | All elements with class containing "price" |
| ID | `#main-nav` | The element with `id="main-nav"` |
| Tag + class | `span.price` | `<span>` elements with class "price" |
| Attribute | `[data-sku]` | Elements with a `data-sku` attribute |
| Attribute value | `[data-sku="INK-BK-STD"]` | Elements where `data-sku` equals that value |
| Descendant | `div.product-listing span.price` | `<span class="price">` anywhere inside `<div class="product-listing">` |
| Direct child | `ul > li` | `<li>` elements that are direct children of `<ul>` |
| Starts with | `a[href^="/products/"]` | `<a>` where `href` begins with "/products/" |
20.6 BeautifulSoup4: Parsing HTML
BeautifulSoup4 takes a raw HTML string and transforms it into a navigable Python object. Install it along with the lxml parser, which is faster than the built-in parser and handles malformed HTML more gracefully.
pip install beautifulsoup4 lxml
Creating a Soup Object
from bs4 import BeautifulSoup
html_string = """
<div class="product-listing">
<h2 class="product-name">Premium Copy Paper, 500 Sheets</h2>
<span class="price" data-sku="PAPER-A4-500">$8.99</span>
<a href="/products/paper/premium-copy-500" class="product-link">View Details</a>
</div>
"""
soup = BeautifulSoup(html_string, "lxml")
Use "lxml" as your parser in production. If lxml is unavailable, "html.parser" is the built-in fallback.
find() — First Match
soup.find() returns the first element matching your criteria, or None if nothing matches. The None return value is important — always check for it before accessing the element's properties.
# Find by tag name alone
first_h2 = soup.find("h2")
print(first_h2.text) # "Premium Copy Paper, 500 Sheets"
# Find by tag name and class
name_tag = soup.find("h2", class_="product-name")
print(name_tag.get_text(strip=True)) # "Premium Copy Paper, 500 Sheets"
# Find by attribute
price_tag = soup.find("span", attrs={"data-sku": "PAPER-A4-500"})
print(price_tag.text) # "$8.99"
# Safe access — always check for None
sale_tag = soup.find("span", class_="sale-price") # Does not exist
if sale_tag is not None:
print(sale_tag.text)
else:
print("No sale price found") # This is what runs
The AttributeError: 'NoneType' object has no attribute 'text' error is the most common beginner mistake in scraping. It happens when you call .text on an element that find() returned as None because it did not exist. Defensive coding with if element is not None prevents it.
find_all() — All Matches
find_all() returns a list of all matching elements. If nothing matches, it returns an empty list — never None. This makes it safe to iterate over without a None check.
# Example HTML with multiple product listings
multi_html = """
<div class="product-listing">
<h2 class="product-name">Premium Copy Paper, 500 Sheets</h2>
<span class="price">$8.99</span>
</div>
<div class="product-listing">
<h2 class="product-name">Standard Ink Cartridge, Black</h2>
<span class="price">$24.99</span>
</div>
"""
soup = BeautifulSoup(multi_html, "lxml")
listings = soup.find_all("div", class_="product-listing")
print(f"Found {len(listings)} products") # Found 2 products
for listing in listings:
name = listing.find("h2", class_="product-name")
price = listing.find("span", class_="price")
name_text = name.get_text(strip=True) if name else "Unknown"
price_text = price.get_text(strip=True) if price else "N/A"
print(f"{name_text}: {price_text}")
select() and select_one() — CSS Selectors
select() and select_one() accept CSS selector strings. select_one() returns the first match (or None); select() returns a list of all matches (or an empty list).
Many experienced scrapers prefer these methods because CSS selectors can express complex targeting concisely.
# Find all price spans inside product listings
prices = soup.select("div.product-listing span.price")
for price in prices:
print(price.text) # $8.99, then $24.99
# The first product name only
first_name = soup.select_one("h2.product-name")
print(first_name.text) # "Premium Copy Paper, 500 Sheets"
# Links that go to the products section
product_links = soup.select("a[href^='/products/']")
# Elements with a specific data attribute
items_with_sku = soup.select("[data-sku]")
Extracting Text and Attributes
element = soup.find("span", class_="price")
# Text content — two ways
print(element.text) # "$8.99" (may include extra whitespace)
print(element.get_text(strip=True)) # "$8.99" (always stripped)
# Attribute values — two ways
price_tag = soup.find("span", attrs={"data-sku": "PAPER-A4-500"})
print(price_tag["data-sku"]) # "PAPER-A4-500" — KeyError if missing
print(price_tag.get("data-sku")) # "PAPER-A4-500" — None if missing
print(price_tag.get("href", "")) # "" — default value if missing (safe)
Always prefer .get("attribute", default) over element["attribute"] in production scrapers. The dictionary-style access raises a KeyError if the attribute is absent; .get() returns a safe default.
20.7 Navigating the Parse Tree
Once you have found an element, you can move relative to it in the tree — up to parents, down to children, and sideways to siblings.
from bs4 import BeautifulSoup
html = """
<div class="catalog-section">
<h3 class="section-title">Paper Products</h3>
<div class="product-listing">
<h2 class="product-name">Premium Copy Paper</h2>
<span class="price">$8.99</span>
</div>
<div class="product-listing">
<h2 class="product-name">Recycled Copy Paper</h2>
<span class="price">$7.49</span>
</div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
first_product = soup.find("div", class_="product-listing")
# Move up: find the parent
parent = first_product.parent
print(parent.get("class")) # ['catalog-section']
# Move sideways: find next sibling of same type
next_product = first_product.find_next_sibling("div", class_="product-listing")
print(next_product.find("h2").text) # "Recycled Copy Paper"
# Move down: iterate over direct children
for child in first_product.children:
if child.name: # Skip NavigableString whitespace nodes
print(f" {child.name}: {child.get_text(strip=True)}")
Searching Within a Found Element
A particularly useful pattern is to locate a container element first, then search within it. This scopes your search and avoids picking up elements from elsewhere on the page.
# Find the paper section specifically
paper_section = soup.find("div", class_="catalog-section")
# Now search only within that section
paper_products = paper_section.find_all("div", class_="product-listing")
# This will not accidentally include products from other sections
20.8 Extracting Links and Tables
Extracting Links
from urllib.parse import urljoin
from bs4 import BeautifulSoup
BASE_URL = "https://northstar-office.example.com"
def extract_all_links(soup: BeautifulSoup, base_url: str) -> list[dict]:
"""Extract all links from a page, converting relative URLs to absolute.
Args:
soup: Parsed BeautifulSoup object.
base_url: Base URL of the site, used to resolve relative links.
Returns:
List of dicts with 'text' and 'url' keys.
"""
links = []
for anchor in soup.find_all("a", href=True):
href = anchor.get("href", "")
absolute_url = urljoin(base_url, href)
link_text = anchor.get_text(strip=True)
links.append({"text": link_text, "url": absolute_url})
return links
def extract_product_links(soup: BeautifulSoup, base_url: str) -> list[str]:
"""Extract product page URLs from a catalog listing.
Args:
soup: Parsed BeautifulSoup object.
base_url: Base URL for resolving relative links.
Returns:
List of absolute product page URLs.
"""
product_anchors = soup.select("a.product-link")
return [
urljoin(base_url, anchor.get("href", ""))
for anchor in product_anchors
if anchor.get("href")
]
Extracting HTML Tables
HTML tables are common in business data — financial reports, comparison tables, government statistics. BeautifulSoup parses them cleanly.
from bs4 import BeautifulSoup
def parse_html_table(soup: BeautifulSoup, table_selector: str) -> list[dict]:
"""Parse an HTML table into a list of dictionaries.
Uses the table's header row as dictionary keys. Each data row becomes
one dictionary in the returned list.
Args:
soup: Parsed BeautifulSoup object.
table_selector: CSS selector string targeting the table element.
Returns:
List of row dictionaries, or empty list if table not found.
"""
table = soup.select_one(table_selector)
if table is None:
print(f" Table not found for selector: {table_selector}")
return []
# Extract column headers from <th> elements
headers = [th.get_text(strip=True) for th in table.select("thead th")]
# Fallback: use first row as headers if no <thead>
if not headers:
first_row = table.select_one("tr")
if first_row:
headers = [td.get_text(strip=True) for td in first_row.select("td, th")]
# Extract data rows from <tbody>
rows = []
for tr in table.select("tbody tr"):
cells = [td.get_text(strip=True) for td in tr.select("td")]
if len(cells) == len(headers):
rows.append(dict(zip(headers, cells)))
return rows
For a quick shortcut when you just need the data from a well-structured public table, pandas can do all of this in one line:
import pandas as pd
# pandas.read_html() fetches the page and returns all tables as DataFrames
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
sp500_df = tables[0] # First table on the page
print(sp500_df.head())
Use pd.read_html() for quick exploration. Use BeautifulSoup for production scrapers where you need custom headers, rate limiting, and detailed error handling.
20.9 Handling Pagination
Most product catalogs and data listings are spread across multiple pages. Your scraper needs to recognize when there is a next page and follow it until the data runs out.
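Both pagination patterns below delegate per-page extraction to a helper, `extract_products_from_page`, which this chapter has not yet defined. A minimal sketch, assuming the NorthStar-style product markup from Section 20.5:

```python
from bs4 import BeautifulSoup

def extract_products_from_page(soup: BeautifulSoup) -> list[dict]:
    """Extract product name and price from each listing on one catalog page.

    Assumes listings shaped like the Section 20.5 markup:
    <div class="product-listing"> containing h2.product-name and span.price.
    """
    products = []
    for listing in soup.find_all("div", class_="product-listing"):
        name = listing.find("h2", class_="product-name")
        price = listing.find("span", class_="price")
        if name is None or price is None:
            continue  # Skip malformed listings rather than crash
        products.append({
            "name": name.get_text(strip=True),
            "price": price.get_text(strip=True),
        })
    return products
```

Adapt the selectors to each target site; the defensive `None` checks keep one malformed listing from aborting a whole scraping run.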
Pattern 1: URL-Based Page Numbers
Many sites use simple URL parameters for pagination:
https://northstar-office.example.com/products/?page=1
https://northstar-office.example.com/products/?page=2
https://northstar-office.example.com/products/?page=3
import time
import requests
from bs4 import BeautifulSoup
def scrape_paginated_catalog(
base_catalog_url: str,
session: requests.Session,
delay_seconds: float = 1.5,
) -> list[dict]:
"""Scrape all pages of a paginated product catalog.
Args:
base_catalog_url: URL of the catalog without page parameter.
session: A requests.Session object.
delay_seconds: Seconds to wait between page requests.
Returns:
List of all product dictionaries collected across all pages.
"""
all_products = []
page_number = 1
while True:
page_url = f"{base_catalog_url}?page={page_number}"
print(f" Fetching page {page_number}: {page_url}")
html = fetch_page_html(page_url, session)
if html is None:
print(f" Failed to fetch page {page_number}. Stopping.")
break
soup = BeautifulSoup(html, "lxml")
page_products = extract_products_from_page(soup)
if not page_products:
print(f" No products found on page {page_number}. Pagination complete.")
break
all_products.extend(page_products)
print(f" Found {len(page_products)} products (running total: {len(all_products)})")
# Check for a "Next" button
next_button = soup.select_one("a.pagination-next, a[aria-label='Next page']")
if next_button is None:
print(" No next-page link found. Pagination complete.")
break
page_number += 1
time.sleep(delay_seconds) # Respectful delay between pages
return all_products
Pattern 2: Following "Next" Links Directly
Some sites use variable next-page URLs rather than sequential numbers. In this case, follow the href of the "Next" link directly.
from urllib.parse import urljoin
def scrape_by_following_next_links(
start_url: str,
base_url: str,
session: requests.Session,
delay_seconds: float = 2.0,
) -> list[dict]:
"""Scrape a site by following 'Next' links until none remain.
Args:
start_url: URL of the first page to scrape.
base_url: Site base URL for resolving relative links.
session: A requests.Session object.
delay_seconds: Seconds to wait between page requests.
Returns:
All products collected across all pages.
"""
all_products = []
current_url: str | None = start_url
while current_url is not None:
html = fetch_page_html(current_url, session)
if html is None:
break
soup = BeautifulSoup(html, "lxml")
all_products.extend(extract_products_from_page(soup))
# Find the next page link — CSS selector may vary by site
next_link = soup.select_one("a.pagination-next, .next-page > a")
if next_link and next_link.get("href"):
current_url = urljoin(base_url, next_link["href"])
time.sleep(delay_seconds)
else:
current_url = None # No more pages
return all_products
20.10 Rate Limiting and the Ethics of Speed
Rate limiting is how you tell a server "I am a polite visitor, not an attack." Every professional scraper implements it. Here is what that looks like in practice.
The Basic Pattern
import time
import random
def respectful_delay(min_seconds: float = 1.0, max_seconds: float = 3.0) -> None:
"""Wait a random interval between requests to avoid overwhelming the server.
Random delays are harder for anti-bot systems to fingerprint and
distribute load more evenly than fixed delays.
Args:
min_seconds: Minimum delay in seconds.
max_seconds: Maximum delay in seconds.
"""
delay = random.uniform(min_seconds, max_seconds)
time.sleep(delay)
Call respectful_delay() between every request. For most business intelligence scraping, a 1–3 second delay is appropriate. If robots.txt specifies a Crawl-delay, use that value instead.
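Honoring a Crawl-delay directive can reuse the same `RobotFileParser` from Section 20.4. A sketch, where `crawl_delay_from_text` is an illustrative helper name of my own:

```python
import urllib.robotparser

def crawl_delay_from_text(robots_txt: str, user_agent: str = "*",
                          default: float = 1.0) -> float:
    """Return the Crawl-delay a robots.txt requests, or a default if none is set."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default

rules = "User-agent: *\nCrawl-delay: 5\n"
print(crawl_delay_from_text(rules))  # 5.0
```

Use the returned value as the minimum for `respectful_delay()` so your scraper never goes faster than the site asks.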
Retry Logic with Exponential Backoff
When requests fail transiently (timeouts, 503 errors), it is appropriate to retry — but with increasing waits between attempts. This is called exponential backoff.
import time
import requests
def fetch_with_retry(
url: str,
session: requests.Session,
max_retries: int = 3,
) -> str | None:
"""Fetch a URL with exponential backoff retry on transient failures.
Args:
url: The URL to fetch.
session: A requests.Session object.
max_retries: Maximum number of retry attempts.
Returns:
HTML content string, or None if all retries exhausted.
"""
for attempt in range(max_retries):
try:
response = session.get(url, timeout=15)
response.raise_for_status()
return response.text
except requests.exceptions.HTTPError as error:
status = error.response.status_code
if status in (403, 404):
return None # Permanent errors — do not retry
wait = 2 ** attempt # 1s, 2s, 4s
print(f" HTTP {status} on attempt {attempt + 1}. Waiting {wait}s.")
time.sleep(wait)
except requests.exceptions.Timeout:
wait = 2 ** attempt
print(f" Timeout on attempt {attempt + 1}. Waiting {wait}s.")
time.sleep(wait)
except requests.exceptions.RequestException as error:
print(f" Request error: {error}")
return None
return None
20.11 Saving Scraped Data
Saving to CSV
import csv
from pathlib import Path
def save_to_csv(
records: list[dict],
output_path: str | Path,
append: bool = False,
) -> None:
"""Save a list of dictionaries to a CSV file.
Args:
records: List of dictionaries to save. All dicts must have the same keys.
output_path: File path for the output CSV.
append: If True, append to existing file. If False, overwrite.
Note:
If records is empty, nothing is written and a message is printed.
"""
if not records:
print(" No records to save — skipping CSV write.")
return
path = Path(output_path)
path.parent.mkdir(parents=True, exist_ok=True)
file_mode = "a" if append else "w"
write_header = not append or not path.exists()
fieldnames = list(records[0].keys())
with open(path, mode=file_mode, newline="", encoding="utf-8") as csv_file:
writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
if write_header:
writer.writeheader()
writer.writerows(records)
action = "Appended" if append else "Saved"
print(f" {action} {len(records)} records to {path}")
Saving to JSON
import json
from pathlib import Path
from datetime import datetime
def save_to_json(records: list[dict], output_path: str | Path) -> None:
"""Save a list of dictionaries to a JSON file with run metadata.
Args:
records: List of dictionaries to save.
output_path: File path for the output JSON.
"""
path = Path(output_path)
path.parent.mkdir(parents=True, exist_ok=True)
output_data = {
"exported_at": datetime.now().isoformat(),
"record_count": len(records),
"records": records,
}
with open(path, "w", encoding="utf-8") as json_file:
json.dump(output_data, json_file, indent=2, ensure_ascii=False)
print(f" Saved {len(records)} records to {path}")
Adding Metadata to Each Record
Always record when you scraped the data and where it came from. This is essential for auditing, debugging, and time-series analysis.
from datetime import date
def add_scrape_metadata(records: list[dict], source_url: str) -> list[dict]:
"""Add scrape date and source URL to each record.
Args:
records: List of scraped data dictionaries.
source_url: URL the data was scraped from.
Returns:
Records with 'scrape_date' and 'source_url' fields added.
"""
scrape_date = date.today().isoformat()
return [
{**record, "scrape_date": scrape_date, "source_url": source_url}
for record in records
]
20.12 Common Pitfalls
JavaScript-Rendered Pages
This is the wall most scrapers hit eventually. Many modern websites use JavaScript frameworks — React, Vue, Angular — to build their pages dynamically in the browser. When you fetch such a page with requests, you get the initial HTML shell before JavaScript has run. The content you are looking for simply does not exist in what requests sees.
How to tell: Save response.text to a file and open it in a text editor. If the content visible in your browser is not in that file, JavaScript is rendering it.
Option 1 (often works): Find the underlying API. Open your browser's Developer Tools, go to the Network tab, filter by Fetch/XHR, and reload the page. Look for JSON requests that return the data you need. Scraping that API endpoint directly is faster, cleaner, and more reliable than trying to execute JavaScript.
Option 2: Use Selenium or Playwright. These tools automate a real browser, allowing JavaScript to execute. This is covered in Chapter 21.
# Example: Instead of scraping rendered HTML, find and call the API directly
import requests
# What you might find in the Network tab:
api_url = "https://shop.northstar-office.example.com/api/products"
params = {"category": "paper", "page": 1, "per_page": 50}
headers = {"Accept": "application/json"}
response = requests.get(api_url, params=params, headers=headers, timeout=15)
data = response.json()
products = data["results"] # Clean, already-structured data
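If the Network tab reveals a paginated JSON endpoint like the one above, you can loop through its pages much as you paginate HTML. This sketch rests on two assumptions you must verify against the real responses: each page returns {"results": [...]}, and the page past the last one returns an empty list.

```python
import requests

def fetch_all_api_pages(api_url: str, session: requests.Session, per_page: int = 50) -> list[dict]:
    """Collect results from every page of a paginated JSON API.

    Assumes each response body is {"results": [...]} and that the page
    after the last returns an empty list. Check both in the Network tab.
    """
    all_items: list[dict] = []
    page = 1
    while True:
        response = session.get(
            api_url,
            params={"page": page, "per_page": per_page},
            timeout=15,
        )
        response.raise_for_status()
        results = response.json().get("results", [])
        if not results:
            break  # Past the last page
        all_items.extend(results)
        page += 1
    return all_items
```

The same rate-limiting rules apply to APIs you discover this way; add a time.sleep() between page requests.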
Inconsistent HTML Structure
Real websites are inconsistent. An element you expect to be present is sometimes absent. A class name changes with a site redesign. A price that was in a <span> is now in a <p>. Defensive coding is mandatory in production scrapers.
def safe_get_text(element, default: str = "") -> str:
"""Extract text from a BeautifulSoup element, returning default if None.
Args:
element: A BeautifulSoup tag, or None.
default: Value to return if element is None.
Returns:
Stripped text content, or the default value.
"""
if element is None:
return default
return element.get_text(strip=True)
def safe_get_attr(element, attr: str, default: str = "") -> str:
"""Extract an attribute from a BeautifulSoup element safely.
Args:
element: A BeautifulSoup tag, or None.
attr: Attribute name to retrieve.
default: Value to return if element is None or attribute missing.
Returns:
Attribute value, or the default value.
"""
if element is None:
return default
return element.get(attr, default)
Use these two helper functions everywhere in your scrapers. They eliminate an entire class of crashes: the AttributeError: 'NoneType' object has no attribute errors that occur whenever an expected element is missing from the page.
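Here is the difference in practice, using a product card that is missing its price entirely. The helper is repeated from above so the snippet runs on its own:

```python
from bs4 import BeautifulSoup

def safe_get_text(element, default: str = "") -> str:
    # Same helper as above, repeated so this snippet is self-contained
    return default if element is None else element.get_text(strip=True)

html = '<div class="card"><h3 class="title">Stapler</h3></div>'  # no price span
card = BeautifulSoup(html, "html.parser").find("div", class_="card")

# Fragile: card.find("span", class_="price").get_text() raises AttributeError,
# because find() returns None. The helper absorbs the missing element instead.
name = safe_get_text(card.find("h3", class_="title"))
price = safe_get_text(card.find("span", class_="price"), default="N/A")
print(name, price)  # Stapler N/A
```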
Encoding Issues
Some websites use non-UTF-8 encodings. requests attempts to auto-detect encoding but does not always get it right, particularly for older sites.
response = requests.get(url, timeout=15)
# Check what encoding requests detected
print(f"Detected encoding: {response.encoding}")
# If the encoding looks wrong, try forcing UTF-8
response.encoding = "utf-8"
html = response.text
# Alternatively, pass response.content (bytes) to BeautifulSoup
# and let it detect the encoding itself from meta tags and byte patterns
soup = BeautifulSoup(response.content, "lxml") # content, not text
Session Handling
Some sites set cookies that must persist between requests — for analytics, session tracking, or anti-bot detection. Using a requests.Session handles this automatically.
session = requests.Session()
session.headers.update(SCRAPER_HEADERS)
# Cookies set by the first request are automatically included in subsequent ones
page1 = session.get("https://example.com/catalog/page/1", timeout=15)
page2 = session.get("https://example.com/catalog/page/2", timeout=15)  # Cookies carry over
20.13 When NOT to Scrape
This topic gets its own section because knowing when not to scrape will save you days of wasted effort.
Use an API If One Exists
If the website offers a documented public API, use it. APIs are:
- Faster than fetching and parsing HTML
- More stable — API schemas change infrequently; HTML changes constantly
- Explicitly permitted — no ToS concerns about authorized access
- More structured — you get clean JSON rather than messy HTML
Search for [site name] developer API or [site name] API documentation before building any scraper. Many business data providers — financial data, company information, logistics — have official APIs. Even if the API has a cost, it may be less expensive than the engineering time to maintain a scraper.
Use Downloadable Data When Available
Government agencies, financial regulators, and research institutions often publish downloadable data files in CSV, Excel, or XML formats. Before scraping a government statistics page, check whether there is a "Download Data" button. There usually is.
Do Not Scrape Your Own Internal Systems
Your CRM, ERP, and accounting software almost certainly have APIs, webhooks, or export functions designed for exactly the kind of access you are trying to get. Scraping a web interface to your own company's internal systems is almost always the wrong approach — it is fragile, slow, and unnecessary.
20.14 Priya's Competitor Price Tracker: The Complete Scenario
Let us follow Priya through the full process of building Acme's competitor price tracker, from the sticky notes on her monitor to a working tool.
Step 1: Research and Legal Review
Priya starts by checking NorthStar Office Supplies' robots.txt at https://northstar-office.example.com/robots.txt. She finds:
User-agent: *
Disallow: /account/
Disallow: /cart/
Disallow: /checkout/
Crawl-delay: 2
Allow: /products/
The products section is explicitly allowed. The crawl delay of 2 seconds is specified — she will honor it. She also reads the Terms of Service, which do not prohibit automated access to public pricing pages for internal competitive research.
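She can double-check her reading of the file with the standard library's urllib.robotparser before writing any scraping code:

```python
from urllib.robotparser import RobotFileParser

robots_text = """\
User-agent: *
Disallow: /account/
Disallow: /cart/
Disallow: /checkout/
Crawl-delay: 2
Allow: /products/
"""

parser = RobotFileParser()
parser.parse(robots_text.splitlines())

print(parser.can_fetch("*", "https://northstar-office.example.com/products/paper"))  # True
print(parser.can_fetch("*", "https://northstar-office.example.com/cart/"))           # False
print(parser.crawl_delay("*"))  # 2
```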
She documents these findings in a comment at the top of her script. Marcus Webb later asks to see this documentation, and having it ready builds his confidence in the project.
Step 2: Inspect the HTML
Priya opens a NorthStar product page in her browser, right-clicks on a product name, and selects "Inspect." She explores the HTML structure until she understands the pattern:
- Each product is in a <div class="ns-product-card">
- Product names are in <h3 class="ns-product-title">
- Prices are in <span class="ns-price">
- SKUs are in data-product-id attributes on the product card
- Pagination uses an <a class="ns-btn-next-page"> link
Step 3: Write the Scraper
from bs4 import BeautifulSoup
from datetime import date
def extract_northstar_products(soup: BeautifulSoup) -> list[dict]:
"""Extract product data from a NorthStar Office catalog page.
Args:
soup: BeautifulSoup object of a NorthStar product listing page.
Returns:
List of product dictionaries with name, sku, and price fields.
"""
products = []
cards = soup.find_all("div", class_="ns-product-card")
for card in cards:
name_tag = card.find("h3", class_="ns-product-title")
price_tag = card.find("span", class_="ns-price")
product = {
"name": safe_get_text(name_tag),
"sku": card.get("data-product-id", ""),
"price": safe_get_text(price_tag),
"competitor": "NorthStar Office Supplies",
"scrape_date": date.today().isoformat(),
}
# Only include records with both a name and a price
if product["name"] and product["price"]:
products.append(product)
return products
Step 4: Report to Sandra
Priya runs the scraper for the first time and gets 147 product records. She loads the CSV into a pandas DataFrame, merges it with Acme's own pricing data, and creates a simple comparison report that flags any product where NorthStar's price is more than 10% different from Acme's.
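That comparison step can be sketched with pandas. The DataFrames and column names below are illustrative stand-ins for the two real CSV files; adjust them to match your actual headers:

```python
import pandas as pd

# Hypothetical data standing in for the two CSV files
acme = pd.DataFrame({"sku": ["P100", "I200"], "acme_price": [4.99, 21.50]})
northstar = pd.DataFrame({"sku": ["P100", "I200"], "ns_price": [4.79, 23.99]})

# Join on SKU, then compute the percentage difference against Acme's price
report = acme.merge(northstar, on="sku", how="inner")
report["pct_diff"] = (report["ns_price"] - report["acme_price"]) / report["acme_price"] * 100

# Flag products where NorthStar's price differs from Acme's by more than 10%
flagged = report[report["pct_diff"].abs() > 10]
print(flagged[["sku", "pct_diff"]])
```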
Sandra Chen's response: "This is exactly what I needed. Can we run this weekly?"
That is Chapter 22's work — scheduling it. For now, Priya has proven the concept.
The complete implementation of this scraper is in code/acme_competitor_tracker.py and case-study-01.md.
20.15 Pulling Everything Together: A Complete Scraping Workflow
Here is the standard structure for any production web scraper you build:
"""
template_scraper.py — Standard scraper structure for business intelligence tasks.
Adapt this template for new scraping targets by implementing:
- extract_data_from_page() for site-specific extraction logic
- The target URL and any site-specific selectors or headers
"""
import csv
import time
import random
import requests
from pathlib import Path
from datetime import date
from bs4 import BeautifulSoup
SCRAPER_HEADERS = {
"User-Agent": "AcmeCorpBI/1.0 (contact: priya@acmecorp.example.com)"
}
OUTPUT_DIR = Path("output")
DELAY_MIN = 1.5 # seconds
DELAY_MAX = 3.0 # seconds
def main() -> None:
"""Main entry point: orchestrate the full scrape-and-save workflow."""
target_url = "https://example.com/products/"
# Step 1: Check robots.txt
if not is_scraping_allowed(target_url):
print("robots.txt disallows scraping this URL. Exiting.")
return
# Step 2: Create session
session = requests.Session()
session.headers.update(SCRAPER_HEADERS)
# Step 3: Scrape all pages
print(f"Starting scrape of {target_url}")
all_records = []
page_number = 1
while True:
page_url = f"{target_url}?page={page_number}"
html = fetch_with_retry(page_url, session)
if html is None:
break
soup = BeautifulSoup(html, "lxml")
page_records = extract_data_from_page(soup) # Implement per site
if not page_records:
break
all_records.extend(page_records)
print(f" Page {page_number}: {len(page_records)} records")
if not soup.select_one("a.next-page"):
break
page_number += 1
time.sleep(random.uniform(DELAY_MIN, DELAY_MAX))
# Step 4: Save results
print(f"\nTotal records collected: {len(all_records)}")
output_file = OUTPUT_DIR / f"scrape_{date.today().isoformat()}.csv"
save_to_csv(all_records, output_file)
print("Done.")
if __name__ == "__main__":
main()
This structure — check permissions, create session, loop through pages with respectful delays, save output — is the template for nearly every business scraper you will build.
Chapter Summary
Web scraping gives you access to the world's largest database: the public web. The core workflow is:
- requests.get(url) fetches the HTML
- BeautifulSoup(html, "lxml") parses it into a navigable tree
- .find(), .find_all(), and .select() locate the elements you need
- .get_text(strip=True) and .get("attribute") extract the values
- A pagination loop repeats this across all pages
- time.sleep() keeps you polite and keeps you unblocked
- csv.DictWriter saves the results for further analysis
The discipline of checking robots.txt, reading Terms of Service, setting honest User-Agent headers, and rate limiting your requests is not optional caution — it is the professional standard. Scrapers that violate these principles get blocked, create legal exposure, and reflect poorly on the organizations that run them.
Priya's competitor price tracker demonstrates the business value: what took half a day of manual work every quarter now takes 30 seconds of computation every week, producing richer data than the manual process ever could.
In Chapter 22, you will schedule this scraper to run automatically every Monday morning without anyone touching a keyboard.
Key Terms
Web scraping — Programmatically extracting data from websites by fetching HTML and parsing its structure.
HTTP — HyperText Transfer Protocol; the communication standard between clients (browsers, scrapers) and web servers.
Status code — A three-digit number in an HTTP response. 200 = success; 404 = not found; 429 = rate limited; 500 = server error.
HTML — HyperText Markup Language; the structured text format used to define web page content and layout.
DOM — Document Object Model; the tree structure that a browser (or parser) builds from HTML, allowing programmatic navigation.
CSS selector — A pattern string for targeting specific HTML elements; used in styling and in BeautifulSoup's .select() method.
BeautifulSoup — A Python library for parsing HTML and XML, providing Pythonic navigation and search over the parse tree.
robots.txt — A standard text file at a website's root that specifies which paths automated bots may or may not access.
Rate limiting — Deliberately slowing down requests to avoid overloading a server or triggering anti-bot defenses.
Pagination — The practice of dividing large datasets across multiple numbered pages in a web interface.
User-Agent — An HTTP request header that identifies the software making the request; used by servers to customize responses and by sites to detect bots.
JSON — JavaScript Object Notation; a lightweight text format for structured data, commonly returned by web APIs.