
> "The internet is the world's largest database. Most of it has no official API."

Chapter 20: Web Scraping for Business Intelligence

"The internet is the world's largest database. Most of it has no official API."


What You Will Learn

By the end of this chapter, you will be able to:

  • Explain why web scraping matters for competitive business intelligence and when it is the right approach
  • Make HTTP requests using the requests library and interpret responses and status codes
  • Parse HTML with BeautifulSoup4 using find(), find_all(), and select()
  • Navigate the HTML parse tree: parents, children, and siblings
  • Extract text, links, and tables from web pages
  • Handle pagination across multi-page listings
  • Add rate limiting and check robots.txt for ethical, responsible scraping
  • Save scraped data to CSV and JSON files
  • Recognize common pitfalls: JavaScript-rendered pages, encoding issues, and changing page structure
  • Know when to use an API instead of scraping

The Monday Morning Stack

It is Monday morning at Acme Corp regional headquarters. Priya Okonkwo, junior analyst, arrives to find three sticky notes on her monitor.

The first is from Sandra Chen, VP of Sales: "Can you pull NorthStar Office competitor pricing on paper reams and ink cartridges? Quarterly review is Thursday."

The second is from a regional sales manager: "I heard NorthStar just dropped their toner prices. Can you confirm before I talk to my team?"

The third is from someone Priya does not recognize: "Hi — are you the data person? What does our main competitor charge for folding tables?"

Before she learned Python, Priya would have spent most of Monday manually visiting each competitor's website, copying prices into a spreadsheet, and trying to remember whether the price she was looking at included shipping. She would do this again next Monday, and the Monday after that, and every Monday for as long as she kept this job.

Today, she has a better option.

Web scraping is the practice of programmatically extracting data from websites — the same data visible in your browser, but collected automatically and saved in a format your code can analyze. For business intelligence, this is transformative. Competitor pricing, job posting trends, publicly available financial data, real estate listings, regulatory filings — vast amounts of business-relevant information is published on the public web, updated constantly, and waiting to be collected systematically.

This chapter teaches you how to collect it responsibly, reliably, and efficiently.


20.1 Why Web Scraping Matters for Business Intelligence

Before writing a single line of code, it is worth understanding what web scraping is genuinely useful for in a business context. The technology is a means, not an end. Here are the most common and legitimate business use cases.

Competitive Pricing Intelligence

If your competitors publish their prices publicly — as most consumer-facing businesses do — you can track those prices over time. A regional distributor like Acme Corp can monitor whether a competitor is running a promotion on paper stock, which would explain a recent dip in Acme's own sales numbers. A retailer can check whether to adjust prices before a holiday weekend. This is not industrial espionage — it is the same information any customer could find by visiting the site. You are simply collecting it systematically rather than manually.

The difference between checking a competitor's price once and tracking it weekly for six months is the difference between a data point and a trend. Trends are where business decisions actually come from.

Market Research and Trend Analysis

Public job postings reveal where industries are investing. If every major competitor in your space is hiring data engineers, that is a signal worth tracking. Public business registrations show where new competitors are forming. Review aggregator sites show what customers consistently complain about — and what they praise. News aggregators show what your industry is talking about before it becomes conventional wisdom.

All of this is public information. Scraping lets you collect it at scale, regularly, and automatically.

Lead Generation (Used Responsibly)

Public business directories, trade association member lists, and professional organization websites sometimes publish contact information intended for legitimate outreach. This is the situation Maya Reyes finds herself in — building a prospecting tool from public directories for her consulting practice. The key word is "public," and even then, responsible use means respecting privacy laws, opt-out signals, and the spirit of why that information was published.

Monitoring Your Own Presence

You can scrape review sites to monitor what customers say about your products. You can track your own product listings on marketplaces where you sell. You can monitor whether your press releases are being picked up and how they are being covered. All of this is data about your own company, and collecting it automatically saves time that would otherwise go to manual monitoring.

Aggregating Public Government and Regulatory Data

Government databases, public financial filings, weather data, transportation schedules — an enormous amount of useful business data is published by public bodies and made freely available. Economic statistics from the Bureau of Labor Statistics, company filings from SEC EDGAR, import/export data from government trade agencies — these can be scraped (or better yet, accessed via their APIs when those exist) to inform business decisions.


20.2 Ethics and Legality: The Ground Rules

Web scraping is powerful, and like any powerful tool, it can be misused. Before writing a single scraper for production use, commit these principles to memory. This is not boilerplate caution — these are practical rules that will keep you out of legal trouble and keep your scrapers running.

Check robots.txt

Every website can publish a file at /robots.txt that specifies which parts of the site automated bots may and may not access. The format is a standard (called the Robots Exclusion Protocol) that has been honored by well-behaved bots since 1994. Ignoring this file is disrespectful at best and has been used as evidence in legal disputes at worst.

You will learn to check it programmatically in Section 20.4.

Read the Terms of Service

Many websites explicitly prohibit automated scraping in their terms of service. Scraping in violation of a site's ToS has led to account terminations, IP bans, and in some cases legal action under computer fraud statutes. When in doubt — and especially for high-volume or commercial scraping — consult legal counsel.

Do Not Hammer Servers

Adding even a one-second pause between requests dramatically reduces the load you place on a site's server. Aggressive scraping that sends hundreds of requests per minute can degrade service for real users and can be construed as a denial-of-service attack regardless of intent. Rate limiting is not optional — it is a basic professional obligation.

Respect Privacy Law

GDPR (in the EU), CCPA (in California), and similar privacy regulations restrict how you may collect and use personal information. "It was on a public website" is not always a legal defense for collecting people's personal data, particularly if you plan to use it for marketing outreach or store it persistently. Know your jurisdiction's rules.

When an API Exists, Use It

Many websites that you might want to scrape offer a proper API — an official, designed-for-machine-access interface. APIs are faster, more reliable, more stable, and explicitly sanctioned. You will learn to recognize when to prefer an API over scraping in Section 20.13.


20.3 How the Web Works: HTTP Basics

When you type a URL into your browser and press Enter, a conversation happens between your computer and a remote server. Understanding this conversation is the foundation of web scraping.

Requests and Responses

Your browser sends an HTTP request to a server. The request specifies:

  • The method: usually GET (retrieve a resource) or POST (submit data)
  • The URL: the address of the resource
  • Headers: metadata about the request, including what software is making it (the User-Agent header)

The server sends back an HTTP response, which contains:

  • A status code: a number indicating what happened
  • Headers: metadata about the response (content type, encoding, caching rules)
  • A body: the actual content — usually HTML, JSON, or binary data

Status Codes You Need to Know

Code | Meaning | What to do in your scraper
200 | OK — request succeeded | Proceed with parsing
301 / 302 | Redirect — resource moved | requests follows these automatically
403 | Forbidden — server refuses | May be blocking scrapers; check ToS
404 | Not Found — resource missing | Log and skip; check your URL
429 | Too Many Requests | Slow down; add longer delays
500 | Internal Server Error | Server problem; retry later
503 | Service Unavailable | Server overloaded; retry later

When your scraper gets a 200, everything worked. When it gets anything else, your code needs to handle that gracefully — which you will see shortly in the requests section.

What Your Browser Does vs. What Python Does

Your browser receives HTML and renders it visually — fonts, colors, layout, images. Python receives the same HTML and sees it as a string of text. Your job as a scraper author is to write code that finds the specific pieces of that text — prices, product names, contact information — within the larger structure of the HTML document.


20.4 The requests Library

Python's standard library includes an http.client module, but it is low-level and verbose. The requests library is the community standard for HTTP in Python. Install it before continuing.

pip install requests

Making Your First Request

import requests

response = requests.get("https://httpbin.org/get")
print(response.status_code)   # 200
print(type(response.text))    # <class 'str'>
print(len(response.text))     # number of characters in the response body

requests.get() sends a GET request to the specified URL and returns a Response object. That object contains everything the server sent back.

The Response Object

Attribute / Method | What it contains
response.status_code | Integer status code (200, 404, etc.)
response.text | Response body decoded as a string
response.content | Response body as raw bytes
response.json() | Body parsed as JSON (raises error if not JSON)
response.headers | Response headers as a dictionary
response.url | Final URL after any redirects
response.encoding | Character encoding used to decode the body
response.raise_for_status() | Raises an exception if status >= 400

A Production-Ready Fetch Function

Here is a fetch function you will use as a building block throughout this chapter. Notice three things about it: it always sets a timeout, it always handles exceptions, and it uses raise_for_status() to convert error status codes into exceptions.

import time
import requests


SCRAPER_HEADERS = {
    "User-Agent": (
        "AcmeCorpPriceMonitor/1.0 "
        "(Internal business intelligence tool; contact: priya@acmecorp.example.com)"
    )
}


def fetch_page_html(url: str, session: requests.Session) -> str | None:
    """Fetch a URL and return its HTML content as a string.

    Args:
        url: The URL to fetch.
        session: A requests.Session object for connection reuse.

    Returns:
        HTML content as a string, or None if the request fails.
    """
    try:
        response = session.get(url, timeout=15)
        response.raise_for_status()
        return response.text

    except requests.exceptions.Timeout:
        print(f"  Timeout fetching: {url}")
        return None

    except requests.exceptions.HTTPError as error:
        status = error.response.status_code
        print(f"  HTTP {status} error fetching: {url}")
        if status == 429:
            print("  Rate limited — waiting 30 seconds before next request.")
            time.sleep(30)
        return None

    except requests.exceptions.ConnectionError:
        print(f"  Connection error fetching: {url}")
        return None

    except requests.exceptions.RequestException as error:
        print(f"  Request failed for {url}: {error}")
        return None

A few design decisions worth noting:

Why a Session? A requests.Session object reuses the underlying TCP connection across multiple requests to the same host, which is faster and more polite than opening a new connection for every request. It also automatically handles cookies that the server sets — useful for sites that use session cookies to track visitors.

Why check for 429 specifically? A 429 response means the server is explicitly asking you to slow down. Ignoring it and retrying immediately will almost certainly result in your IP being blocked. A 30-second pause is the minimum courteous response.

Why raise_for_status()? Without it, a 403 or 404 response would return successfully (from requests's perspective) and you would try to parse an error page as if it were real data. raise_for_status() converts non-2xx responses into exceptions that your except block handles.
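
The function assumes the session is created once, with the identifying headers attached, and then reused for every request in the job. A minimal usage sketch (the catalog URL is illustrative):

session = requests.Session()
session.headers.update(SCRAPER_HEADERS)

html = fetch_page_html("https://northstar-office.example.com/products/", session)
if html is not None:
    print(f"Fetched {len(html)} characters of HTML")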

Checking robots.txt

Before scraping any site, check whether it permits automated access to the pages you want. Python's standard library includes a RobotFileParser for exactly this purpose.

import urllib.robotparser
from urllib.parse import urlparse


def is_scraping_allowed(target_url: str, user_agent: str = "*") -> bool:
    """Check whether a robot may access a given URL per the site's robots.txt.

    Args:
        target_url: The URL you want to scrape.
        user_agent: Your bot's user agent name. Use "*" to check general rules.

    Returns:
        True if scraping is permitted, False otherwise.
        Returns False (conservative) if robots.txt cannot be read.
    """
    parsed = urlparse(target_url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)

    try:
        parser.read()
        return parser.can_fetch(user_agent, target_url)
    except Exception as error:
        print(f"  Could not read robots.txt from {robots_url}: {error}")
        return False  # Conservative: assume not allowed if we cannot verify


# Use it before starting a scraping job
target = "https://northstar-office.example.com/products/"
if is_scraping_allowed(target):
    print("robots.txt permits access — proceeding.")
else:
    print("robots.txt disallows access — stopping.")

Priya runs this check before every new scraping target. It takes two seconds and has saved her from two situations where robots.txt explicitly blocked the paths she wanted to access.

Working with JSON Responses

Many modern web services return JSON data rather than HTML pages. When the response is JSON, use response.json() instead of response.text.

def fetch_json_data(url: str, params: dict, session: requests.Session) -> dict | None:
    """Fetch JSON data from an API endpoint.

    Args:
        url: The API endpoint URL.
        params: Query parameters to include in the request.
        session: A requests.Session object.

    Returns:
        Parsed JSON data as a Python dictionary, or None on failure.
    """
    try:
        response = session.get(url, params=params, timeout=15)
        response.raise_for_status()
        return response.json()
    except ValueError as error:
        # Handle JSON decode errors first: requests' JSONDecodeError subclasses ValueError
        print(f"  Response is not valid JSON: {error}")
        return None
    except requests.exceptions.RequestException as error:
        print(f"  API request failed: {error}")
        return None

If you are scraping a site and the data you want comes back as JSON (visible in the browser's Network tab under XHR/Fetch requests), calling the JSON endpoint directly is vastly preferable to parsing HTML. You get structured data immediately, without any parsing.


20.5 HTML Structure: What Your Scraper Sees

When response.text returns the HTML of a page, it is a text document that looks something like this:

<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Office Supplies — NorthStar Office Co.</title>
    <meta charset="UTF-8">
  </head>
  <body>
    <div class="product-listing" data-category="paper">
      <h2 class="product-name">Premium Copy Paper, 500 Sheets</h2>
      <span class="price" data-sku="PAPER-A4-500">$8.99</span>
      <span class="stock-status in-stock">In Stock</span>
      <a href="/products/paper/premium-copy-500" class="product-link">
        View Details
      </a>
    </div>
    <div class="product-listing" data-category="ink">
      <h2 class="product-name">Standard Ink Cartridge, Black</h2>
      <span class="price" data-sku="INK-BK-STD">$24.99</span>
      <span class="stock-status out-of-stock">Out of Stock</span>
      <a href="/products/ink/standard-black" class="product-link">
        View Details
      </a>
    </div>
    <div class="pagination">
      <a href="/products/?page=2" class="pagination-next">Next &raquo;</a>
    </div>
  </body>
</html>

HTML is built from tags (<div>, <span>, <h2>, <a>, etc.), which can carry attributes (class="product-listing", data-sku="PAPER-A4-500", href="/products/..."). Tags nest inside one another to form a tree structure — the Document Object Model (DOM).

To extract data from this HTML, you navigate this tree programmatically. That is BeautifulSoup's job.

Understanding CSS Classes and Attributes

The class attribute is your primary navigation tool in most scrapers. Web designers use classes to control how elements look; you use them to identify which elements contain the data you want.

Common patterns you will encounter:

  • class="price" — a direct, single-class marker. Find with class_="price".
  • class="stock-status in-stock" — multiple classes on one element. Find with class_="in-stock" or class_="stock-status".
  • data-sku="PAPER-A4-500" — data attributes, useful for structured metadata. Find with attrs={"data-sku": "PAPER-A4-500"}.
  • id="main-navigation" — unique identifier per page. Find with id="main-navigation".

CSS Selectors

CSS selectors are patterns for targeting HTML elements. They were invented for styling, but they are equally powerful for scraping. BeautifulSoup supports them via .select().

Selector | Example | Matches
Tag name | div | All <div> elements
Class | .price | All elements with class containing "price"
ID | #main-nav | The element with id="main-nav"
Tag + class | span.price | <span> elements with class "price"
Attribute | [data-sku] | Elements with a data-sku attribute
Attribute value | [data-sku="INK-BK-STD"] | Elements where data-sku equals that value
Descendant | div.product-listing span.price | <span class="price"> anywhere inside <div class="product-listing">
Direct child | ul > li | <li> elements that are direct children of <ul>
Starts with | a[href^="/products/"] | <a> where href begins with "/products/"

20.6 BeautifulSoup4: Parsing HTML

BeautifulSoup4 takes a raw HTML string and transforms it into a navigable Python object. Install it along with the lxml parser, which is faster than the built-in parser and handles malformed HTML more gracefully.

pip install beautifulsoup4 lxml

Creating a Soup Object

from bs4 import BeautifulSoup

html_string = """
<div class="product-listing">
    <h2 class="product-name">Premium Copy Paper, 500 Sheets</h2>
    <span class="price" data-sku="PAPER-A4-500">$8.99</span>
    <a href="/products/paper/premium-copy-500" class="product-link">View Details</a>
</div>
"""

soup = BeautifulSoup(html_string, "lxml")

Use "lxml" as your parser in production. If lxml is unavailable, "html.parser" is the built-in fallback.

find() — First Match

soup.find() returns the first element matching your criteria, or None if nothing matches. The None return value is important — always check for it before accessing the element's properties.

# Find by tag name alone
first_h2 = soup.find("h2")
print(first_h2.text)         # "Premium Copy Paper, 500 Sheets"

# Find by tag name and class
name_tag = soup.find("h2", class_="product-name")
print(name_tag.get_text(strip=True))  # "Premium Copy Paper, 500 Sheets"

# Find by attribute
price_tag = soup.find("span", attrs={"data-sku": "PAPER-A4-500"})
print(price_tag.text)        # "$8.99"

# Safe access — always check for None
sale_tag = soup.find("span", class_="sale-price")  # Does not exist
if sale_tag is not None:
    print(sale_tag.text)
else:
    print("No sale price found")  # This is what runs

AttributeError: 'NoneType' object has no attribute 'text' is the most common beginner error in scraping. It happens when find() matches nothing and returns None, and you then call .text on that None. Defensive coding with if element is not None prevents it.

find_all() — All Matches

find_all() returns a list of all matching elements. If nothing matches, it returns an empty list — never None. This makes it safe to iterate over without a None check.

# Example HTML with multiple product listings
multi_html = """
<div class="product-listing">
    <h2 class="product-name">Premium Copy Paper, 500 Sheets</h2>
    <span class="price">$8.99</span>
</div>
<div class="product-listing">
    <h2 class="product-name">Standard Ink Cartridge, Black</h2>
    <span class="price">$24.99</span>
</div>
"""

soup = BeautifulSoup(multi_html, "lxml")

listings = soup.find_all("div", class_="product-listing")
print(f"Found {len(listings)} products")  # Found 2 products

for listing in listings:
    name = listing.find("h2", class_="product-name")
    price = listing.find("span", class_="price")

    name_text = name.get_text(strip=True) if name else "Unknown"
    price_text = price.get_text(strip=True) if price else "N/A"

    print(f"{name_text}: {price_text}")

select() and select_one() — CSS Selectors

select() and select_one() accept CSS selector strings. select_one() returns the first match (or None); select() returns a list of all matches (or an empty list).

Many experienced scrapers prefer these methods because CSS selectors can express complex targeting concisely.

# Find all price spans inside product listings
prices = soup.select("div.product-listing span.price")
for price in prices:
    print(price.text)        # $8.99, then $24.99

# The first product name only
first_name = soup.select_one("h2.product-name")
print(first_name.text)      # "Premium Copy Paper, 500 Sheets"

# Links that go to the products section
product_links = soup.select("a[href^='/products/']")

# Elements with a specific data attribute
items_with_sku = soup.select("[data-sku]")

Extracting Text and Attributes

element = soup.find("span", class_="price")

# Text content — two ways
print(element.text)                    # "$8.99" (may include extra whitespace)
print(element.get_text(strip=True))   # "$8.99" (always stripped)

# Attribute values — two ways
price_tag = soup.find("span", attrs={"data-sku": "PAPER-A4-500"})
print(price_tag["data-sku"])           # "PAPER-A4-500" — KeyError if missing
print(price_tag.get("data-sku"))       # "PAPER-A4-500" — None if missing
print(price_tag.get("href", ""))       # "" — default value if missing (safe)

Always prefer .get("attribute", default) over element["attribute"] in production scrapers. The dictionary-style access raises a KeyError if the attribute is absent; .get() returns a safe default.


20.7 Navigating the Parse Tree

Once you have found an element, you can move relative to it in the tree — up to parents, down to children, and sideways to siblings.

from bs4 import BeautifulSoup

html = """
<div class="catalog-section">
    <h3 class="section-title">Paper Products</h3>
    <div class="product-listing">
        <h2 class="product-name">Premium Copy Paper</h2>
        <span class="price">$8.99</span>
    </div>
    <div class="product-listing">
        <h2 class="product-name">Recycled Copy Paper</h2>
        <span class="price">$7.49</span>
    </div>
</div>
"""

soup = BeautifulSoup(html, "lxml")
first_product = soup.find("div", class_="product-listing")

# Move up: find the parent
parent = first_product.parent
print(parent.get("class"))      # ['catalog-section']

# Move sideways: find next sibling of same type
next_product = first_product.find_next_sibling("div", class_="product-listing")
print(next_product.find("h2").text)   # "Recycled Copy Paper"

# Move down: iterate over direct children
for child in first_product.children:
    if child.name:  # Skip NavigableString whitespace nodes
        print(f"  {child.name}: {child.get_text(strip=True)}")

Searching Within a Found Element

A particularly useful pattern is to locate a container element first, then search within it. This scopes your search and avoids picking up elements from elsewhere on the page.

# Find the paper section specifically
paper_section = soup.find("div", class_="catalog-section")

# Now search only within that section
paper_products = paper_section.find_all("div", class_="product-listing")
# This will not accidentally include products from other sections

20.8 Extracting Links and Tables

Extracting Links

Product pages, detail pages, and "Next" pages are all reached through links. Because href values are often relative (like /products/paper/premium-copy-500), convert them to absolute URLs with urljoin() before requesting them.

from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://northstar-office.example.com"

def extract_all_links(soup: BeautifulSoup, base_url: str) -> list[dict]:
    """Extract all links from a page, converting relative URLs to absolute.

    Args:
        soup: Parsed BeautifulSoup object.
        base_url: Base URL of the site, used to resolve relative links.

    Returns:
        List of dicts with 'text' and 'url' keys.
    """
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor.get("href", "")
        absolute_url = urljoin(base_url, href)
        link_text = anchor.get_text(strip=True)
        links.append({"text": link_text, "url": absolute_url})
    return links


def extract_product_links(soup: BeautifulSoup, base_url: str) -> list[str]:
    """Extract product page URLs from a catalog listing.

    Args:
        soup: Parsed BeautifulSoup object.
        base_url: Base URL for resolving relative links.

    Returns:
        List of absolute product page URLs.
    """
    product_anchors = soup.select("a.product-link")
    return [
        urljoin(base_url, anchor.get("href", ""))
        for anchor in product_anchors
        if anchor.get("href")
    ]

Extracting HTML Tables

HTML tables are common in business data — financial reports, comparison tables, government statistics. BeautifulSoup parses them cleanly.

from bs4 import BeautifulSoup


def parse_html_table(soup: BeautifulSoup, table_selector: str) -> list[dict]:
    """Parse an HTML table into a list of dictionaries.

    Uses the table's header row as dictionary keys. Each data row becomes
    one dictionary in the returned list.

    Args:
        soup: Parsed BeautifulSoup object.
        table_selector: CSS selector string targeting the table element.

    Returns:
        List of row dictionaries, or empty list if table not found.
    """
    table = soup.select_one(table_selector)
    if table is None:
        print(f"  Table not found for selector: {table_selector}")
        return []

    # Extract column headers from <th> elements
    headers = [th.get_text(strip=True) for th in table.select("thead th")]

    # Fallback: use first row as headers if no <thead>
    if not headers:
        first_row = table.select_one("tr")
        if first_row:
            headers = [td.get_text(strip=True) for td in first_row.select("td, th")]

    # Extract data rows from <tbody>
    rows = []
    for tr in table.select("tbody tr"):
        cells = [td.get_text(strip=True) for td in tr.select("td")]
        if len(cells) == len(headers):
            rows.append(dict(zip(headers, cells)))

    return rows
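
Applied to a small pricing table (the markup below is illustrative), the function returns one dictionary per data row:

table_html = """
<table id="price-comparison">
  <thead>
    <tr><th>Product</th><th>Price</th></tr>
  </thead>
  <tbody>
    <tr><td>Premium Copy Paper</td><td>$8.99</td></tr>
    <tr><td>Standard Ink Cartridge</td><td>$24.99</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(table_html, "lxml")
price_rows = parse_html_table(soup, "table#price-comparison")
print(price_rows)
# [{'Product': 'Premium Copy Paper', 'Price': '$8.99'},
#  {'Product': 'Standard Ink Cartridge', 'Price': '$24.99'}]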

For a quick shortcut when you just need the data from a well-structured public table, pandas can do all of this in one line:

import pandas as pd

# pandas.read_html() fetches the page and returns all tables as DataFrames
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
sp500_df = tables[0]   # First table on the page
print(sp500_df.head())

Use pd.read_html() for quick exploration. Use BeautifulSoup for production scrapers where you need custom headers, rate limiting, and detailed error handling.


20.9 Handling Pagination

Most product catalogs and data listings are spread across multiple pages. Your scraper needs to recognize when there is a next page and follow it until the data runs out.

Pattern 1: URL-Based Page Numbers

Many sites use simple URL parameters for pagination:

https://northstar-office.example.com/products/?page=1
https://northstar-office.example.com/products/?page=2
https://northstar-office.example.com/products/?page=3

The function below walks these pages in order until the catalog runs out:

import time
import requests
from bs4 import BeautifulSoup


def scrape_paginated_catalog(
    base_catalog_url: str,
    session: requests.Session,
    delay_seconds: float = 1.5,
) -> list[dict]:
    """Scrape all pages of a paginated product catalog.

    Args:
        base_catalog_url: URL of the catalog without page parameter.
        session: A requests.Session object.
        delay_seconds: Seconds to wait between page requests.

    Returns:
        List of all product dictionaries collected across all pages.
    """
    all_products = []
    page_number = 1

    while True:
        page_url = f"{base_catalog_url}?page={page_number}"
        print(f"  Fetching page {page_number}: {page_url}")

        html = fetch_page_html(page_url, session)
        if html is None:
            print(f"  Failed to fetch page {page_number}. Stopping.")
            break

        soup = BeautifulSoup(html, "lxml")
        page_products = extract_products_from_page(soup)

        if not page_products:
            print(f"  No products found on page {page_number}. Pagination complete.")
            break

        all_products.extend(page_products)
        print(f"  Found {len(page_products)} products (running total: {len(all_products)})")

        # Check for a "Next" button
        next_button = soup.select_one("a.pagination-next, a[aria-label='Next page']")
        if next_button is None:
            print("  No next-page link found. Pagination complete.")
            break

        page_number += 1
        time.sleep(delay_seconds)  # Respectful delay between pages

    return all_products

Pattern 2: Following "Next" Links

Some sites use variable next-page URLs rather than sequential numbers. In this case, follow the href of the "Next" link directly.

from urllib.parse import urljoin

def scrape_by_following_next_links(
    start_url: str,
    base_url: str,
    session: requests.Session,
    delay_seconds: float = 2.0,
) -> list[dict]:
    """Scrape a site by following 'Next' links until none remain.

    Args:
        start_url: URL of the first page to scrape.
        base_url: Site base URL for resolving relative links.
        session: A requests.Session object.
        delay_seconds: Seconds to wait between page requests.

    Returns:
        All products collected across all pages.
    """
    all_products = []
    current_url: str | None = start_url

    while current_url is not None:
        html = fetch_page_html(current_url, session)
        if html is None:
            break

        soup = BeautifulSoup(html, "lxml")
        all_products.extend(extract_products_from_page(soup))

        # Find the next page link — CSS selector may vary by site
        next_link = soup.select_one("a.pagination-next, .next-page > a")
        if next_link and next_link.get("href"):
            current_url = urljoin(base_url, next_link["href"])
            time.sleep(delay_seconds)
        else:
            current_url = None   # No more pages

    return all_products

20.10 Rate Limiting and the Ethics of Speed

Rate limiting is how you tell a server "I am a polite visitor, not an attack." Every professional scraper implements it. Here is what that looks like in practice.

The Basic Pattern

import time
import random


def respectful_delay(min_seconds: float = 1.0, max_seconds: float = 3.0) -> None:
    """Wait a random interval between requests to avoid overwhelming the server.

    Random delays are harder for anti-bot systems to fingerprint and
    distribute load more evenly than fixed delays.

    Args:
        min_seconds: Minimum delay in seconds.
        max_seconds: Maximum delay in seconds.
    """
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)

Call respectful_delay() between every request. For most business intelligence scraping, a 1–3 second delay is appropriate. If robots.txt specifies a Crawl-delay, use that value instead.
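
Python's RobotFileParser (already used in Section 20.4) can read that value for you. A small sketch, reusing the NorthStar robots.txt URL from later in the chapter and falling back to 1.5 seconds when no Crawl-delay is declared:

import time
import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://northstar-office.example.com/robots.txt")
parser.read()

crawl_delay = parser.crawl_delay("*")   # None if robots.txt declares no Crawl-delay
delay_seconds = float(crawl_delay) if crawl_delay else 1.5
time.sleep(delay_seconds)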

Retry Logic with Exponential Backoff

When requests fail transiently (timeouts, 503 errors), it is appropriate to retry — but with increasing waits between attempts. This is called exponential backoff.

import time
import requests


def fetch_with_retry(
    url: str,
    session: requests.Session,
    max_retries: int = 3,
) -> str | None:
    """Fetch a URL with exponential backoff retry on transient failures.

    Args:
        url: The URL to fetch.
        session: A requests.Session object.
        max_retries: Maximum number of retry attempts.

    Returns:
        HTML content string, or None if all retries exhausted.
    """
    for attempt in range(max_retries):
        try:
            response = session.get(url, timeout=15)
            response.raise_for_status()
            return response.text

        except requests.exceptions.HTTPError as error:
            status = error.response.status_code
            if status in (403, 404):
                return None   # Permanent errors — do not retry
            wait = 2 ** attempt   # 1s, 2s, 4s
            print(f"  HTTP {status} on attempt {attempt + 1}. Waiting {wait}s.")
            time.sleep(wait)

        except requests.exceptions.Timeout:
            wait = 2 ** attempt
            print(f"  Timeout on attempt {attempt + 1}. Waiting {wait}s.")
            time.sleep(wait)

        except requests.exceptions.RequestException as error:
            print(f"  Request error: {error}")
            return None

    return None

20.11 Saving Scraped Data

Saving to CSV

import csv
from pathlib import Path


def save_to_csv(
    records: list[dict],
    output_path: str | Path,
    append: bool = False,
) -> None:
    """Save a list of dictionaries to a CSV file.

    Args:
        records: List of dictionaries to save. All dicts must have the same keys.
        output_path: File path for the output CSV.
        append: If True, append to existing file. If False, overwrite.
    """
    if not records:
        print("  No records to save — skipping CSV write.")
        return

    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)

    file_mode = "a" if append else "w"
    write_header = not append or not path.exists()

    fieldnames = list(records[0].keys())

    with open(path, mode=file_mode, newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(records)

    action = "Appended" if append else "Saved"
    print(f"  {action} {len(records)} records to {path}")

Saving to JSON

import json
from pathlib import Path
from datetime import datetime


def save_to_json(records: list[dict], output_path: str | Path) -> None:
    """Save a list of dictionaries to a JSON file with run metadata.

    Args:
        records: List of dictionaries to save.
        output_path: File path for the output JSON.
    """
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)

    output_data = {
        "exported_at": datetime.now().isoformat(),
        "record_count": len(records),
        "records": records,
    }

    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(output_data, json_file, indent=2, ensure_ascii=False)

    print(f"  Saved {len(records)} records to {path}")

Adding Metadata to Each Record

Always record when you scraped the data and where it came from. This is essential for auditing, debugging, and time-series analysis.

from datetime import date


def add_scrape_metadata(records: list[dict], source_url: str) -> list[dict]:
    """Add scrape date and source URL to each record.

    Args:
        records: List of scraped data dictionaries.
        source_url: URL the data was scraped from.

    Returns:
        Records with 'scrape_date' and 'source_url' fields added.
    """
    scrape_date = date.today().isoformat()
    return [
        {**record, "scrape_date": scrape_date, "source_url": source_url}
        for record in records
    ]
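
At the end of a scraping run, the metadata helper and the save helpers compose naturally. A brief illustration, assuming all_products holds the scraped records and the file paths are arbitrary:

enriched = add_scrape_metadata(all_products, source_url="https://northstar-office.example.com/products/")
save_to_csv(enriched, "output/northstar_prices.csv", append=True)   # price history grows week over week
save_to_json(enriched, "output/northstar_prices_latest.json")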

20.12 Common Pitfalls

JavaScript-Rendered Pages

This is the wall most scrapers hit eventually. Many modern websites use JavaScript frameworks — React, Vue, Angular — to build their pages dynamically in the browser. When you fetch such a page with requests, you get the initial HTML shell before JavaScript has run. The content you are looking for simply does not exist in what requests sees.

How to tell: Save response.text to a file and open it in a text editor. If the content visible in your browser is not in that file, JavaScript is rendering it.
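
A quick way to capture exactly what requests received (the filename is arbitrary; response is the Response object from your fetch):

with open("debug_page.html", "w", encoding="utf-8") as debug_file:
    debug_file.write(response.text)   # Open this file and search for the data you expect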

Option 1 (often works): Find the underlying API. Open your browser's Developer Tools, go to the Network tab, filter by Fetch/XHR, and reload the page. Look for JSON requests that return the data you need. Scraping that API endpoint directly is faster, cleaner, and more reliable than trying to execute JavaScript.

Option 2: Use Selenium or Playwright. These tools automate a real browser, allowing JavaScript to execute. This is covered in Chapter 21.

# Example: Instead of scraping rendered HTML, find and call the API directly
import requests

# What you might find in the Network tab:
api_url = "https://shop.northstar-office.example.com/api/products"
params = {"category": "paper", "page": 1, "per_page": 50}
headers = {"Accept": "application/json"}

response = requests.get(api_url, params=params, headers=headers, timeout=15)
data = response.json()
products = data["results"]   # Clean, already-structured data

Inconsistent HTML Structure

Real websites are inconsistent. An element you expect to be present is sometimes absent. A class name changes with a site redesign. A price that was in a <span> is now in a <p>. Defensive coding is mandatory in production scrapers.

def safe_get_text(element, default: str = "") -> str:
    """Extract text from a BeautifulSoup element, returning default if None.

    Args:
        element: A BeautifulSoup tag, or None.
        default: Value to return if element is None.

    Returns:
        Stripped text content, or the default value.
    """
    if element is None:
        return default
    return element.get_text(strip=True)


def safe_get_attr(element, attr: str, default: str = "") -> str:
    """Extract an attribute from a BeautifulSoup element safely.

    Args:
        element: A BeautifulSoup tag, or None.
        attr: Attribute name to retrieve.
        default: Value to return if element is None or attribute missing.

    Returns:
        Attribute value, or the default value.
    """
    if element is None:
        return default
    return element.get(attr, default)

Use these two helper functions everywhere in your scrapers. They eliminate the entire class of AttributeError: 'NoneType' crashes.

Encoding Issues

Some websites use non-UTF-8 encodings. requests attempts to auto-detect encoding but does not always get it right, particularly for older sites.

response = requests.get(url, timeout=15)

# Check what encoding requests detected
print(f"Detected encoding: {response.encoding}")

# If the encoding looks wrong, try forcing UTF-8
response.encoding = "utf-8"
html = response.text

# Alternatively, pass response.content (bytes) to BeautifulSoup
# and let lxml detect the encoding from the HTML meta tags
soup = BeautifulSoup(response.content, "lxml")  # content, not text

Session Handling

Some sites set cookies that must persist between requests — for analytics, session tracking, or anti-bot detection. Using a requests.Session handles this automatically.

session = requests.Session()
session.headers.update(SCRAPER_HEADERS)

# Cookies set by the first request are automatically included in subsequent ones
page1 = session.get("https://example.com/catalog/page/1")
page2 = session.get("https://example.com/catalog/page/2")  # Cookies carry over

20.13 When NOT to Scrape

This section deserves its own dedicated space because knowing when not to scrape will save you days of wasted effort.

Use an API If One Exists

If the website offers a documented public API, use it. APIs are:

  • Faster than fetching and parsing HTML
  • More stable — API schemas change infrequently; HTML changes constantly
  • Explicitly permitted — no ToS concerns about authorized access
  • More structured — you get clean JSON rather than messy HTML

Search for [site name] developer API or [site name] API documentation before building any scraper. Many business data providers — financial data, company information, logistics — have official APIs. Even if the API has a cost, it may be less expensive than the engineering time to maintain a scraper.

Use Downloadable Data When Available

Government agencies, financial regulators, and research institutions often publish downloadable data files in CSV, Excel, or XML formats. Before scraping a government statistics page, check whether there is a "Download Data" button. There usually is.

Do Not Scrape Your Own Internal Systems

Your CRM, ERP, and accounting software almost certainly have APIs, webhooks, or export functions designed for exactly the kind of access you are trying to get. Scraping a web interface to your own company's internal systems is almost always the wrong approach — it is fragile, slow, and unnecessary.


20.14 Priya's Competitor Price Tracker: The Complete Scenario

Let us follow Priya through the full process of building Acme's competitor price tracker, from the sticky notes on her monitor to a working tool.

Step 1: Check robots.txt and the Terms of Service

Priya starts by checking NorthStar Office Supplies' robots.txt at https://northstar-office.example.com/robots.txt. She finds:

User-agent: *
Disallow: /account/
Disallow: /cart/
Disallow: /checkout/
Crawl-delay: 2
Allow: /products/

The products section is explicitly allowed. The crawl delay of 2 seconds is specified — she will honor it. She also reads the Terms of Service, which do not prohibit automated access to public pricing pages for internal competitive research.

She documents these findings in a comment at the top of her script. Marcus Webb later asks to see this documentation, and having it ready builds his confidence in the project.
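
That documentation does not need to be elaborate. A module docstring along these lines (wording illustrative) is enough:

"""
acme_competitor_tracker.py

Target: https://northstar-office.example.com/products/
robots.txt: /products/ is allowed for all user agents; Crawl-delay of 2 seconds is honored.
Terms of Service: reviewed; no prohibition on automated access to public pricing pages.
Purpose: internal competitive pricing research for the quarterly review.
Contact: priya@acmecorp.example.com
"""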

Step 2: Inspect the HTML

Priya opens a NorthStar product page in her browser, right-clicks on a product name, and selects "Inspect." She explores the HTML structure until she understands the pattern:

  • Each product is in a <div class="ns-product-card">
  • Product names are in <h3 class="ns-product-title">
  • Prices are in <span class="ns-price">
  • SKUs are in data-product-id attributes on the product card
  • Pagination uses a <a class="ns-btn-next-page"> link

Step 3: Write the Scraper

from bs4 import BeautifulSoup
from datetime import date


def extract_northstar_products(soup: BeautifulSoup) -> list[dict]:
    """Extract product data from a NorthStar Office catalog page.

    Args:
        soup: BeautifulSoup object of a NorthStar product listing page.

    Returns:
        List of product dictionaries with name, sku, and price fields.
    """
    products = []
    cards = soup.find_all("div", class_="ns-product-card")

    for card in cards:
        name_tag = card.find("h3", class_="ns-product-title")
        price_tag = card.find("span", class_="ns-price")

        product = {
            "name": safe_get_text(name_tag),
            "sku": card.get("data-product-id", ""),
            "price": safe_get_text(price_tag),
            "competitor": "NorthStar Office Supplies",
            "scrape_date": date.today().isoformat(),
        }

        # Only include records with both a name and a price
        if product["name"] and product["price"]:
            products.append(product)

    return products
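
A minimal driver that wires this extraction function to the fetch and save helpers from earlier in the chapter might look like the sketch below (single page only; the full paginated version lives in the companion file):

import requests
from bs4 import BeautifulSoup

# Assumes SCRAPER_HEADERS, fetch_page_html, and save_to_csv from earlier sections are in scope
session = requests.Session()
session.headers.update(SCRAPER_HEADERS)

catalog_url = "https://northstar-office.example.com/products/"
html = fetch_page_html(catalog_url, session)

if html is not None:
    soup = BeautifulSoup(html, "lxml")
    products = extract_northstar_products(soup)
    save_to_csv(products, "output/northstar_prices.csv", append=True)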

Step 4: Report to Sandra

Priya runs the scraper for the first time and gets 147 product records. She loads the CSV into a pandas DataFrame, merges it with Acme's own pricing data, and creates a simple comparison report that flags any product where NorthStar's price is more than 10% different from Acme's.

Sandra Chen's response: "This is exactly what I needed. Can we run this weekly?"

That is Chapter 22's work — scheduling it. For now, Priya has proven the concept.

The complete implementation of this scraper is in code/acme_competitor_tracker.py and case-study-01.md.


20.15 Pulling Everything Together: A Complete Scraping Workflow

Here is the standard structure for any production web scraper you build:

"""
template_scraper.py — Standard scraper structure for business intelligence tasks.

Adapt this template for new scraping targets by implementing:
  - extract_data_from_page() for site-specific extraction logic
  - The target URL and any site-specific selectors or headers
"""
import csv
import time
import random
import requests
from pathlib import Path
from datetime import date
from bs4 import BeautifulSoup


SCRAPER_HEADERS = {
    "User-Agent": "AcmeCorpBI/1.0 (contact: priya@acmecorp.example.com)"
}
OUTPUT_DIR = Path("output")
DELAY_MIN = 1.5   # seconds
DELAY_MAX = 3.0   # seconds


def main() -> None:
    """Main entry point: orchestrate the full scrape-and-save workflow."""
    target_url = "https://example.com/products/"

    # Step 1: Check robots.txt
    if not is_scraping_allowed(target_url):
        print("robots.txt disallows scraping this URL. Exiting.")
        return

    # Step 2: Create session
    session = requests.Session()
    session.headers.update(SCRAPER_HEADERS)

    # Step 3: Scrape all pages
    print(f"Starting scrape of {target_url}")
    all_records = []
    page_number = 1

    while True:
        page_url = f"{target_url}?page={page_number}"
        html = fetch_with_retry(page_url, session)

        if html is None:
            break

        soup = BeautifulSoup(html, "lxml")
        page_records = extract_data_from_page(soup)   # Implement per site

        if not page_records:
            break

        all_records.extend(page_records)
        print(f"  Page {page_number}: {len(page_records)} records")

        if not soup.select_one("a.next-page"):
            break

        page_number += 1
        time.sleep(random.uniform(DELAY_MIN, DELAY_MAX))

    # Step 4: Save results
    print(f"\nTotal records collected: {len(all_records)}")
    output_file = OUTPUT_DIR / f"scrape_{date.today().isoformat()}.csv"
    save_to_csv(all_records, output_file)
    print("Done.")


if __name__ == "__main__":
    main()

This structure — check permissions, create session, loop through pages with respectful delays, save output — is the template for nearly every business scraper you will build.


Chapter Summary

Web scraping gives you access to the world's largest database: the public web. The core workflow is:

  1. requests.get(url) fetches the HTML
  2. BeautifulSoup(html, "lxml") parses it into a navigable tree
  3. .find(), .find_all(), and .select() locate the elements you need
  4. .get_text(strip=True) and .get("attribute") extract the values
  5. A pagination loop repeats this across all pages
  6. time.sleep() keeps you polite and keeps you unblocked
  7. csv.DictWriter saves the results for further analysis

The discipline of checking robots.txt, reading Terms of Service, setting honest User-Agent headers, and rate limiting your requests is not optional caution — it is the professional standard. Scrapers that violate these principles get blocked, create legal exposure, and reflect poorly on the organizations that run them.

Priya's competitor price tracker demonstrates the business value: what took half a day of manual work every quarter now takes 30 seconds of computation every week, producing richer data than the manual process ever could.

In Chapter 22, you will schedule this scraper to run automatically every Monday morning without anyone touching a keyboard.


Key Terms

Web scraping — Programmatically extracting data from websites by fetching HTML and parsing its structure.

HTTP — HyperText Transfer Protocol; the communication standard between clients (browsers, scrapers) and web servers.

Status code — A three-digit number in an HTTP response. 200 = success; 404 = not found; 429 = rate limited; 500 = server error.

HTML — HyperText Markup Language; the structured text format used to define web page content and layout.

DOM — Document Object Model; the tree structure that a browser (or parser) builds from HTML, allowing programmatic navigation.

CSS selector — A pattern string for targeting specific HTML elements; used in styling and in BeautifulSoup's .select() method.

BeautifulSoup — A Python library for parsing HTML and XML, providing Pythonic navigation and search over the parse tree.

robots.txt — A standard text file at a website's root that specifies which paths automated bots may or may not access.

Rate limiting — Deliberately slowing down requests to avoid overloading a server or triggering anti-bot defenses.

Pagination — The practice of dividing large datasets across multiple numbered pages in a web interface.

User-Agent — An HTTP request header that identifies the software making the request; used by servers to customize responses and by sites to detect bots.

JSON — JavaScript Object Notation; a lightweight text format for structured data, commonly returned by web APIs.