Chapter 20 Key Takeaways: Web Scraping for Business Intelligence

The Core Mental Model

Web scraping has two phases: fetching and parsing. requests.get(url) fetches — it retrieves the raw HTML from the server (available as a string via response.text). BeautifulSoup(html, "lxml") parses — it converts that string into a navigable tree structure. Everything else is navigating that tree to find the data you need. Keep these phases separate in your thinking and in your code.


The Non-Negotiable Rules

Rule 1: Check robots.txt before every new scraping target. It takes five seconds. urllib.robotparser.RobotFileParser makes it trivial in Python. Skipping this check is not a time-saving shortcut — it is a professional failure.
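A minimal sketch of the check. Here the rules are fed in as literal lines for illustration; in practice you would call rp.set_url("https://example.com/robots.txt") and rp.read() to download the real file. The user-agent name is a placeholder.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Hypothetical rules; normally loaded via rp.set_url(...) + rp.read()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
])

print(rp.can_fetch("MyScraper/1.0", "https://example.com/products"))   # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/private/x"))  # False
```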

Rule 2: Read the Terms of Service. robots.txt covers technical access rules. Terms of Service covers legal access rules. You need to check both. If the ToS prohibits scraping, stop and find an alternative data source.

Rule 3: Rate limit every scraper. time.sleep() between requests is not optional polish — it is the basic obligation of a responsible web citizen. A minimum of one to two seconds between requests to the same host. Honor the Crawl-delay in robots.txt if one is specified.
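One way to honor both obligations at once, sketched below: take the Crawl-delay from robots.txt when one exists, and fall back to your own minimum otherwise. The 1.5-second default is an assumption to tune per site, not a standard.

```python
import time
from urllib.robotparser import RobotFileParser

def polite_delay(robot_parser, user_agent, minimum=1.5):
    """Seconds to sleep between requests to one host: the site's
    Crawl-delay if robots.txt specifies one, else our own minimum."""
    crawl_delay = robot_parser.crawl_delay(user_agent)
    return max(minimum, crawl_delay or 0)

# Usage sketch, inside a scraping loop:
# time.sleep(polite_delay(rp, "MyScraper/1.0"))
```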

Rule 4: Identify yourself honestly in the User-Agent. Set a descriptive, truthful User-Agent string that identifies what your script is and how to reach you. Do not impersonate a browser's User-Agent. Honesty in your bot's identity is both ethical and practically better — servers that actively block bots look for browser impersonation.
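A sketch of what an honest identity looks like. The bot name, version, and contact URL below are placeholders; a requests.Session applies the header to every request it makes.

```python
import requests

session = requests.Session()
session.headers.update({
    # Descriptive and truthful: what the bot is, plus a way to reach you
    "User-Agent": "AcmePriceMonitor/1.0 (+https://example.com/bot-info)"
})
# html = session.get(url).text
```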

Rule 5: When an API exists, use it. Scraping HTML is a workaround for the absence of a better option. If the website offers a public API, use it. APIs are faster, more stable, and explicitly sanctioned for machine access.


The Key Technical Facts

requests.get() returns a Response object, not HTML. Access the HTML via response.text. Check the status via response.status_code. Always verify the status before proceeding.
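A minimal fetch helper illustrating that order of operations: check the status code first, and only then touch response.text. Real code would add rate limiting and logging; the function name is this sketch's own.

```python
import requests

def fetch_html(url, session=None):
    """Return the page's HTML string, or None on a non-200 status."""
    getter = session or requests
    response = getter.get(url, timeout=10)
    if response.status_code != 200:
        return None           # caller decides whether to log, skip, or retry
    return response.text      # the raw HTML string, not the Response object
```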

soup.find() returns one element or None. soup.find_all() returns a list, never None. This distinction matters for how you write your code:

  - After find(): always check for None before accessing .text or .get()
  - After find_all(): safe to iterate directly — an empty list is a valid result
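Both behaviors in miniature, using made-up class names:

```python
from bs4 import BeautifulSoup

html = '<div><span class="price">$9.99</span></div>'
soup = BeautifulSoup(html, "html.parser")

missing = soup.find("span", class_="title")     # no match -> None
print(missing)                                  # None: must check before .text

titles = soup.find_all("span", class_="title")  # no match -> [] (never None)
for t in titles:                                # safe: the loop body simply
    print(t.text)                               # never runs on an empty list
```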

element.get("attr") is safer than element["attr"]. Dictionary-style access raises KeyError if the attribute is absent. .get() returns None (or a specified default). Use .get() in production; use ["attr"] only when you are certain the attribute exists.
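The difference in two lines, on an anchor tag that deliberately lacks an href:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="link">No href here</a>', "html.parser")
a = soup.find("a")

print(a.get("href"))       # None: safe when the attribute is absent
print(a.get("href", "#"))  # or supply a default, here "#"
# a["href"] would raise KeyError on this element
```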

CSS selectors are often more concise than chained .find() calls. soup.select("div.product-card span.price") is cleaner than soup.find("div", class_="product-card").find("span", class_="price"). Both work, but CSS selectors express complex targeting more clearly.
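The two styles side by side, on hypothetical product-card markup matching the selector in the text. Note that each link of the chained-find() version can return None, so production code needs a check at every step:

```python
from bs4 import BeautifulSoup

html = """
<div class="product-card">
  <span class="price">$19.99</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector: the whole path in one expression, returns a list
via_select = soup.select("div.product-card span.price")
print(via_select[0].get_text(strip=True))   # $19.99

# Equivalent chained find() calls
card = soup.find("div", class_="product-card")
via_find = card.find("span", class_="price") if card else None
```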

pd.read_html(url) scrapes all tables in one call. Returns a list of DataFrames, one per <table> element. The fastest path when you need tabular public data and do not need custom headers or rate limiting.
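A sketch with inline HTML standing in for a fetched page (literal HTML goes through StringIO; with a live page you would pass the URL directly). The table content is invented for illustration.

```python
import pandas as pd
from io import StringIO

html = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
</table>
"""
tables = pd.read_html(StringIO(html))  # one DataFrame per <table>
print(len(tables))                     # 1
df = tables[0]
print(list(df.columns))                # ['Product', 'Price']
```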

urljoin(base_url, relative_url) correctly resolves all relative URL forms. String concatenation breaks for paths starting with / versus paths without one. Always use urljoin from urllib.parse.
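The three cases that trip up string concatenation, with an example base URL:

```python
from urllib.parse import urljoin

base = "https://example.com/catalog/page1.html"

print(urljoin(base, "page2.html"))         # https://example.com/catalog/page2.html
print(urljoin(base, "/about"))             # https://example.com/about
print(urljoin(base, "https://other.com/")) # absolute URLs pass through unchanged
```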


Status Codes You Must Know

Code     Meaning       Your Action
200      Success       Parse the response
301/302  Redirect      requests follows automatically
403      Forbidden     Stop — check ToS and robots.txt
404      Not Found     Log and skip — the URL is wrong
429      Rate Limited  Wait (30+ seconds) before retrying
500/503  Server Error  Retry with backoff
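A sketch of the "retry with backoff" schedule for the 429/500/503 rows: delays that double on each attempt, up to a cap. The base, retry count, and cap are assumptions to tune per site; if the server sends a Retry-After header, honor that instead.

```python
import time

def backoff_delays(base=2.0, retries=4, cap=60.0):
    """Exponential backoff schedule: 2s, 4s, 8s, 16s..., capped."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]

# Usage sketch:
# for delay in backoff_delays():
#     response = session.get(url)
#     if response.status_code not in (429, 500, 503):
#         break
#     time.sleep(delay)
```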

The Defensive Coding Pattern

The single most common scraping bug is AttributeError: 'NoneType' object has no attribute 'text'. It happens when a page changes structure and find() returns None. The fix is defensive extraction:

# Write this:
def safe_text(element, default=""):
    return element.get_text(strip=True) if element else default

# Not this:
text = soup.find("span", class_="price").text  # crashes if element is missing

Write every element access defensively. Log a warning when expected elements are absent — structure changes should be visible in your logs before they cause complete failures.


What You Can and Cannot Scrape

You can scrape:

  - Static HTML pages where content is present in the initial HTTP response
  - Public data pages that robots.txt and ToS permit
  - HTML tables (with BeautifulSoup or pd.read_html())
  - Paginated listings (with a loop following next-page links)

You cannot easily scrape:

  - JavaScript-rendered pages (React, Vue, Angular SPAs) — content is not in the initial HTML
  - Pages behind login walls (complex session management)
  - CAPTCHA-protected pages (designed to block automation)
  - Real-time WebSocket data

When you encounter JavaScript-rendered content: First, look for the underlying API in your browser's Developer Tools (Network → XHR/Fetch tab). If you can find a JSON API endpoint, call it directly — this is faster and more reliable than browser automation. If no API is accessible, Selenium or Playwright (not covered in this book) can automate a real browser.


Building for Longevity

Web scrapers break when websites change. Design yours to fail gracefully:

  1. Validate your extractions — check that prices are numeric, titles are non-empty, URLs are valid
  2. Log when structure seems wrong — zero results on a page that usually has twenty is a signal worth capturing
  3. Separate configuration from logic — store CSS selectors and site URLs in a configuration dict, not scattered through the code. When the site changes, you update one place.
  4. Add timestamps to all saved data — you need to know when each record was scraped for debugging and time-series analysis
  5. Use append mode for historical data — accumulating data over time is often more valuable than any single snapshot
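Points 3 through 5 above can be sketched together: selectors and URLs live in one config dict (the names and selectors here are hypothetical), and every saved record gets a UTC timestamp and is appended, never overwritten.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Configuration separated from logic: when the site changes,
# update this dict, not the scraping code.
CONFIG = {
    "url": "https://example.com/products",
    "selectors": {"card": "div.product-card", "price": "span.price"},
}

def append_record(path, record):
    """Append one scraped record to a CSV, stamping it with UTC time."""
    record = {**record, "scraped_at": datetime.now(timezone.utc).isoformat()}
    path = Path(path)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(record))
        if is_new:
            writer.writeheader()  # header only on the first write
        writer.writerow(record)
```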

What the Characters Learned

Priya (Acme Corp): Competitor price monitoring went from a half-day quarterly exercise to a fully automated weekly process. The automation produced richer data (weekly trends instead of quarterly snapshots) while consuming less human time. The critical discipline was doing the due diligence first: checking robots.txt and ToS before writing a single line of scraping code.

Maya (Freelance Consultant): Daily job board monitoring automated the tedious "check two sites, manually filter 40 listings, notice 5 relevant ones" routine. The keyword filtering is the system's highest-value feature — it makes decisions that were previously mental overhead. The result: consistent daily monitoring with ten minutes of human review instead of fifteen minutes of browsing.


Common Mistakes to Avoid

  1. Not checking robots.txt — makes your scraper disrespectful and potentially legally exposed.

  2. No rate limiting — hammering a server will get you blocked and may harm the site for real users.

  3. Not handling None from find() — the most common crash. Use defensive coding with if element checks.

  4. String concatenation for URL joining — use urljoin(). String concatenation breaks on paths starting with /.

  5. Scraping when an API exists — check for APIs first. A sanctioned API is almost always the faster, more stable option.

  6. Using a browser-impersonating User-Agent — dishonest and flagged by anti-bot systems. Use a descriptive, honest user agent.

  7. Overwriting historical data — use append mode when building time-series datasets. You cannot recreate historical prices after the fact.

  8. Assuming the page structure will not change — it will. Write defensive code and monitor for unexpected zero-result runs.