Chapter 20 Further Reading: Web Scraping for Business Intelligence

Official Documentation

requests library https://requests.readthedocs.io/en/latest/ The definitive reference for Python's requests library. The "Quickstart" section is excellent for beginners. The "Advanced Usage" section covers sessions, authentication, streaming, and SSL certificate handling — all relevant to production scrapers.
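The advanced-usage material translates into very little code. As a sketch (the User-Agent string, backoff factor, and status list here are illustrative choices, not requirements), a production-leaning session might look like:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# A Session reuses TCP connections and carries shared headers and cookies
# across requests -- both covered in the "Advanced Usage" section.
session = requests.Session()
session.headers.update(
    {"User-Agent": "acme-price-monitor/1.0 (contact@example.com)"}
)

# Retry transient failures automatically; the status codes and backoff
# factor below are one reasonable starting point, not the "right" values.
retries = Retry(total=3, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503])
session.mount("https://", HTTPAdapter(max_retries=retries))
```

Every `session.get(...)` call then inherits the identifying User-Agent and the retry behavior without repeating either.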

BeautifulSoup4 Documentation https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Comprehensive documentation covering the full API: all parser options, search methods, CSS selectors, tree navigation, output formatting, and encoding handling. The "Searching the Parse Tree" section is particularly thorough.
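The two search styles the documentation covers, keyword-based `find()` and CSS selectors via `select_one()`, can be compared side by side. The HTML snippet below is invented, loosely modeled on a product listing:

```python
from bs4 import BeautifulSoup

html = """
<div class="product">
  <h3><a title="Sapiens">Sapiens</a></h3>
  <p class="price_color">£34.99</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() searches by tag name and attributes; tag attributes are
# accessed like dictionary keys.
title = soup.find("a")["title"]

# select_one() takes a CSS selector, often more concise for class-based lookups.
price = soup.select_one("p.price_color").get_text(strip=True)

print(title, price)  # Sapiens £34.99
```

Both reach the same data; which reads better usually depends on how the target page structures its markup.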

urllib.robotparser https://docs.python.org/3/library/urllib.robotparser.html Python standard library documentation for RobotFileParser. Covers the full API for reading and querying robots.txt files, including crawl delay parsing and request rate handling.
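A minimal sketch of the API (the rules and bot name below are invented for illustration; in practice you would call `set_url()` and `read()` against a real robots.txt):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse() accepts the file's lines directly; read() would fetch them
# from the URL set with set_url() instead.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
])

print(rp.can_fetch("my-bot", "https://example.com/private/report"))  # False
print(rp.can_fetch("my-bot", "https://example.com/catalogue/"))      # True
print(rp.crawl_delay("my-bot"))                                      # 5
```

Checking `can_fetch()` before every request, and honoring `crawl_delay()` between requests, is the baseline for a polite scraper.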

pandas.read_html() https://pandas.pydata.org/docs/reference/api/pandas.read_html.html API reference for pandas.read_html(). Lists all parameters including match strings, header row specification, index column handling, and converters for data type coercion.
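In outline (the table below is invented; note that recent pandas versions expect literal HTML to be wrapped in `StringIO` rather than passed as a bare string):

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>$19.99</td></tr>
  <tr><td>Gadget</td><td>$4.50</td></tr>
</table>
"""

# read_html() returns a list of DataFrames, one per <table> found.
tables = pd.read_html(StringIO(html))
df = tables[0]

# Prices arrive as strings; a quick cleanup step makes them numeric.
df["Price"] = df["Price"].str.lstrip("$").astype(float)
print(df)
```

On a real page you would typically pass the `match` parameter to select the table you want out of many.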

lxml Documentation https://lxml.de/ Documentation for the lxml library, one of the parser backends BeautifulSoup can use. Useful if you encounter edge cases in HTML parsing or need to use lxml's XPath capabilities directly.


The Robots Exclusion Protocol

robots.txt Standard https://www.robotstxt.org/robotstxt.html The original documentation for the Robots Exclusion Protocol, explaining all directives, wildcards, and common patterns. Essential reading before deploying any production scraper.

Google's robots.txt Specification https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt Google's interpretation of the robots.txt standard, which is widely considered the authoritative practical reference. Includes guidance on wildcards (*), specific path matching, and the Allow directive behavior.
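To make the directives concrete, here is a hypothetical robots.txt combining the patterns both references describe (the paths and bot name are invented):

```
User-agent: *
Disallow: /admin/
Disallow: /*?sessionid=   # wildcard: block any URL containing this query parameter
Allow: /admin/public/     # Allow carves an exception out of the broader Disallow
Crawl-delay: 10

User-agent: badbot
Disallow: /               # this one bot is banned from the entire site
```

Note that Crawl-delay is honored by many crawlers and by Python's RobotFileParser, but it is not part of the formal standard.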


Legal and Ethical Considerations

hiQ Labs v. LinkedIn — Ninth Circuit Decision https://law.justia.com/cases/federal/appellate-courts/ca9/17-16783/17-16783-2022-04-18.html The landmark 2022 Ninth Circuit ruling holding that scraping publicly available data does not constitute unauthorized computer access under the CFAA. Provides important legal context for public data scraping in the United States. (Note: consult a lawyer for actual legal advice — this is educational context only.)

GDPR and Web Scraping https://gdpr.eu/ The official GDPR information portal. If you scrape data involving EU residents (names, email addresses, or any personally identifiable information), GDPR applies regardless of where your company is located. The "What is GDPR?" section is a clear overview.

The Ethical Scraper's Manifesto https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01 A practitioner-written overview of ethical scraping principles. Not a legal document — a professional standards discussion. Useful for developing your own framework for evaluating whether a scraping project is appropriate.


Practice Environments

books.toscrape.com https://books.toscrape.com/ The practice site used throughout this chapter. A fully functional fake bookstore with categories, pagination, star ratings, and price data — all designed for scraping practice. No authentication, no CAPTCHA, no ToS restrictions.

toscrape.com https://toscrape.com/ The parent site of books.toscrape.com. Also hosts quotes.toscrape.com (a quotes site for scraping practice) and links to other ethical scraping sandboxes.

httpbin.org https://httpbin.org/ A request/response testing service useful for understanding HTTP mechanics. /headers returns the headers your request sent; /status/429 returns a 429 response; /delay/5 delays the response for 5 seconds. Excellent for testing your error handling code.
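As an illustration of the kind of error handling you might exercise against /status/429, here is a sketch of a retry helper with exponential backoff. The function name and parameters are invented, and the HTTP callable is injected so the logic can be tested without a live request:

```python
import time


def fetch_with_retry(get, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry on HTTP 429, doubling the wait before each new attempt.

    `get` is any callable returning an object with a .status_code
    attribute (e.g. requests.get); passing it in keeps this testable.
    """
    for attempt in range(max_attempts):
        response = get(url)
        if response.status_code != 429:
            return response
        sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    raise RuntimeError(f"Gave up on {url} after {max_attempts} attempts")
```

Calling `fetch_with_retry(requests.get, "https://httpbin.org/status/429")` would exhaust all attempts and raise, which is exactly the failure mode you want to see handled before pointing a scraper at a real site.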

Wikipedia https://en.wikipedia.org/ Wikipedia explicitly allows scraping of its content (see its Terms of Use and robots.txt). The wealth of HTML tables on Wikipedia makes it an excellent practice target for table scraping. For programmatic access to Wikipedia data, the MediaWiki API is even better.


Books

Web Scraping with Python by Ryan Mitchell (O'Reilly) The most comprehensive Python web scraping book available. Covers BeautifulSoup, Selenium, handling JavaScript, working with APIs, storing scraped data, and legal/ethical considerations. An excellent next step after this chapter.

Mining the Social Web by Matthew A. Russell and Mikhail Klassen (O'Reilly) Focused on scraping social media platforms and public web APIs. Covers Twitter/X, LinkedIn, GitHub, Gmail, and other platforms. More API-focused than HTML-scraping focused, which complements this chapter well.

Practical Web Scraping for Data Science by Seppe vanden Broucke and Bart Baesens (Apress) A more data-science-oriented treatment that emphasizes integrating scraped data with analytical workflows — a natural bridge between this chapter and later chapters on data analysis.


Tools and Frameworks

Scrapy https://scrapy.org/ A full-featured Python web scraping framework for large-scale projects. Provides a structured project layout, middleware for handling delays and retries, pipelines for data processing, and built-in robots.txt compliance. Appropriate when you are building a scraper that runs continuously, scrapes many sites, or handles high volume. For the business use cases in this book (weekly price checks, daily job monitoring), requests + BeautifulSoup is simpler and entirely sufficient.

Selenium https://selenium-python.readthedocs.io/ The standard tool for scraping JavaScript-rendered pages by automating a real browser. Useful when requests returns an empty shell rather than the visible content. Slower and more resource-intensive than static HTML scraping — use only when necessary.

Playwright https://playwright.dev/python/ A newer alternative to Selenium for browser automation. Faster, more reliable, and with better async support. Increasingly preferred over Selenium for new projects.

httpx https://www.python-httpx.org/ A modern HTTP client that supports both synchronous and asynchronous usage. A potential replacement for requests when you need async scraping for better performance at higher volumes.

parsel https://github.com/scrapy/parsel The HTML/XML parsing library underlying Scrapy, usable independently. Supports both CSS selectors and XPath, which can be more powerful than CSS selectors for complex navigation tasks.


Data Sources Worth Knowing

SEC EDGAR (Financial Filings) https://www.sec.gov/cgi-bin/browse-edgar https://efts.sec.gov/LATEST/search-index?q= (full-text search API) The US Securities and Exchange Commission's public database of company filings. Financial statements, quarterly reports, insider transactions — all publicly available and downloadable. EDGAR has a formal API (no scraping needed for most use cases).

US Bureau of Labor Statistics https://www.bls.gov/developers/ Employment statistics, Consumer Price Index, Producer Price Index, and more. Has a public API that is preferable to scraping the HTML pages.

US Census Bureau Data https://www.census.gov/data/developers/data-sets.html Population data, economic surveys, business statistics. Comprehensive API available.

World Bank Open Data https://data.worldbank.org/ Global development indicators, GDP, inflation, trade data. Excellent API, also downloadable as CSV.

Google Trends https://trends.google.com/trends/ Relative search interest over time for any keyword. Unofficial Python library pytrends provides programmatic access. Useful for monitoring brand awareness and market trends.


Next Steps After This Chapter

APIs as a scraping alternative: If you find yourself frequently scraping a particular site, investigate whether it offers an official API. Searching for [site name] + "developer API" is a worthwhile first step before building any scraper.

Scrapy for larger projects: Once your scraping needs grow beyond a few pages to a few thousand, Scrapy's framework provides session management, concurrent requests, retry middleware, and item pipelines that make large-scale work manageable.

Database storage: For scrapers that run regularly and accumulate data, move from CSV files to a proper database. SQLite (built into Python) requires no server and handles millions of rows efficiently. PostgreSQL is appropriate for larger production deployments.
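As a sketch of that migration (the `prices` table schema and sample rows are invented for illustration):

```python
import sqlite3

# In-memory database for illustration; a real scraper would pass a
# file path such as "prices.db" so data persists between runs.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS prices (
        scraped_at TEXT,
        product    TEXT,
        price      REAL
    )
""")

# Each scraper run appends its rows; history accumulates in one place.
rows = [("2024-01-15", "Widget", 19.99), ("2024-01-15", "Gadget", 4.50)]
conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)
conn.commit()

# Queries then work across the entire accumulated history.
cheapest = conn.execute("SELECT product, MIN(price) FROM prices").fetchone()
print(cheapest)  # ('Gadget', 4.5)
```

Because sqlite3 ships with Python, this requires no new installation; swapping the connection for PostgreSQL later leaves the overall shape of the code unchanged.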

Monitoring and alerting: Production scrapers should alert you when they fail or when they return unexpected results (zero items on a page that usually has twenty). The email automation from Chapter 19 combined with the scraping patterns from this chapter creates a complete monitoring pipeline.
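One simple shape for that check (the function and its thresholds are hypothetical, and the returned warnings would feed the Chapter 19 email code):

```python
def check_scrape_health(items, expected_min=1):
    """Return a list of warning strings; an empty list means the run looks healthy."""
    warnings = []
    # A page that usually has items but returned none often signals a
    # layout change or a block, not a genuinely empty page.
    if len(items) < expected_min:
        warnings.append(
            f"Only {len(items)} items scraped (expected at least {expected_min})"
        )
    # Missing fields are the other common symptom of a changed page layout.
    missing_price = [item for item in items if not item.get("price")]
    if missing_price:
        warnings.append(f"{len(missing_price)} items missing a price")
    return warnings
```

A scraper would call this after each run and email the warnings (if any) rather than silently writing a suspicious result to disk.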

Combining with Chapter 19: The natural next project is a scraper that runs on a schedule, collects data, analyzes it, and emails a report. The Acme Corp price monitor combined with the automated email system from Chapter 19 is precisely this — and the tools for building it are all now in your toolkit.