Further Reading: Getting Data from the Web
You've just learned to treat the internet as a data source — a skill that dramatically expands what you can analyze. Here are resources to deepen your understanding, organized by what caught your interest.
Tier 1: Verified Sources
These are published books with full bibliographic details.
Ryan Mitchell, Web Scraping with Python: Collecting More Data from the Modern Web (O'Reilly, 3rd edition, 2024). This is the definitive book on web scraping with Python. Mitchell covers BeautifulSoup in depth, then goes far beyond what we covered — Scrapy (a scraping framework), Selenium (for JavaScript-heavy sites), handling CAPTCHAs, and working with APIs. She also dedicates significant space to the legal and ethical dimensions of scraping. If web data collection is going to be a regular part of your work, this book is essential. The third edition is updated for modern web technologies.
Leonard Richardson and Mike Amundsen, RESTful Web APIs: Services for a Changing World (O'Reilly, 2013). If the API section of this chapter made you curious about how APIs are designed — not just how to use them — this book provides a thorough treatment of REST architecture, API design patterns, and hypermedia. It's more relevant to API designers than API consumers, but understanding how APIs are built makes you a more effective user of them.
Carl Bergstrom and Jevin West, Calling Bullshit: The Art of Skepticism in a Data-Driven World (Random House, 2020). While not specifically about web scraping, this book is essential reading for anyone collecting data from the web. Bergstrom and West (professors at the University of Washington) teach you to critically evaluate data claims — a skill you'll need when deciding whether web-scraped data is reliable, representative, and ethically collected.
Matthew A. Russell and Mikhail Klassen, Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, and Other Social Media Sites (O'Reilly, 3rd edition, 2019). Covers accessing social media data through official APIs, including authentication, pagination, and rate limiting. The book walks through practical examples for multiple platforms. Important caveat: social media APIs change frequently, so some specific code examples may be outdated — but the patterns and concepts remain valid.
Tier 2: Attributed Resources
These are well-known online resources and documentation. We describe each in enough detail to find it with a search, since specific links change over time.
The requests library documentation. The official documentation for the Python requests library is one of the best-written library docs in the Python ecosystem. It includes a "Quickstart" guide that covers everything from this chapter and more, plus a detailed API reference for advanced features (sessions, SSL certificates, proxies, streaming downloads). Search for "python requests library documentation" to find it at docs.python-requests.org.
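As a refresher on the basics the Quickstart covers, here is a minimal sketch of how requests encodes query parameters for you. The URL below is a placeholder, not a real endpoint; preparing the request (rather than sending it) lets you inspect the result without touching the network.

```python
import requests

# Build (but don't send) a GET request with query parameters.
# requests handles URL encoding of the params dict automatically.
# "api.example.com" is a placeholder host, not a real API.
req = requests.Request(
    "GET",
    "https://api.example.com/v1/search",
    params={"q": "web scraping", "page": 2},
)
prepared = req.prepare()
print(prepared.url)  # query string with encoded parameters appended

# To actually send a request like this, you would use:
#   requests.get("https://api.example.com/v1/search",
#                params={"q": "web scraping", "page": 2}, timeout=10)
```

Passing a `params` dict instead of hand-building the query string avoids encoding bugs (spaces, special characters) and is the idiom the official docs recommend.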
The BeautifulSoup documentation. The official documentation for BeautifulSoup (also called "Beautiful Soup Documentation") is thorough and includes many examples. It covers navigating the parse tree, searching by CSS class or attribute, and handling different parsers (html.parser, lxml, html5lib). Search for "beautifulsoup4 documentation" to find it.
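The core pattern the BeautifulSoup docs walk through — parse, search by tag and CSS class, extract text — can be sketched in a few lines. The HTML below is an inline example (modeled loosely on the structure of practice sites like quotes.toscrape.com), not a live page:

```python
from bs4 import BeautifulSoup

# Inline sample HTML standing in for a fetched page.
html = """
<html><body>
  <div class="quote"><span class="text">To be or not to be</span>
    <small class="author">Shakespeare</small></div>
  <div class="quote"><span class="text">I think, therefore I am</span>
    <small class="author">Descartes</small></div>
</body></html>
"""

# Parse with the stdlib parser; lxml or html5lib are drop-in alternatives.
soup = BeautifulSoup(html, "html.parser")

# Find every quote block, then pull out the text and author of each.
quotes = [
    (q.find("span", class_="text").get_text(),
     q.find("small", class_="author").get_text())
    for q in soup.find_all("div", class_="quote")
]
print(quotes)
```

Note the `class_` keyword (with trailing underscore), which BeautifulSoup uses because `class` is a reserved word in Python.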
Kenneth Reitz's "The Hitchhiker's Guide to Python" — HTTP section. Reitz, who created the requests library, co-authored this guide to Python best practices. The HTTP and web scraping sections provide practical advice on structuring web data collection scripts. Available at docs.python-guide.org.
The Electronic Frontier Foundation (EFF) — Web Scraping Resources. The EFF has published multiple analyses of the legal landscape around web scraping in the United States, including commentary on the hiQ Labs v. LinkedIn case. Search for "EFF web scraping legal" to find their articles. These are valuable for understanding the evolving legal framework.
Public APIs List (github.com/public-apis/public-apis). A community-curated list of free APIs organized by category (Animals, Business, Finance, Health, Sports, Weather, and many more). This is the best starting point when you're looking for a data source to practice with or to find data for a project. Search for "public-apis github" to find the repository.
Recommended Next Steps
- If you want to practice with real APIs: Start with free, no-authentication APIs like Open-Meteo (weather), REST Countries (country data), or the Public APIs list on GitHub. Build a small data collection project for a topic you care about.
- If web scraping fascinated you: Read Ryan Mitchell's Web Scraping with Python for comprehensive coverage. Then practice on sites that explicitly allow scraping for educational purposes (like books.toscrape.com and quotes.toscrape.com, created specifically for scraping practice).
- If the ethics discussion resonated: Read the Clearview AI case coverage in major news outlets (The New York Times, Wired, and The Verge all covered it extensively in 2020-2021). For academic treatment, search for "web scraping ethics data science" in Google Scholar.
- If you want to handle JavaScript-heavy sites: Look into Selenium (a browser automation tool) or Playwright (a newer alternative). Both allow your Python script to control a real browser, executing JavaScript and interacting with dynamic content. These are beyond our scope but essential for advanced web data collection.
- If you're ready to move on: Part III awaits. You've spent seven chapters learning to get data, clean it, reshape it, and connect it. Now it's time to see it — Chapter 14 introduces data visualization with matplotlib, and the datasets you've built will finally come alive as charts, graphs, and visual stories.
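To get started with the first suggestion, a no-authentication API call can be as short as the sketch below. It uses Open-Meteo's forecast endpoint with the parameter names documented at the time of writing (endpoint and response fields may change, so check the current Open-Meteo docs before relying on this):

```python
import requests

# Open-Meteo is a free weather API that requires no API key.
# Endpoint and parameter names are as documented at the time of
# writing; verify against the current Open-Meteo documentation.
resp = requests.get(
    "https://api.open-meteo.com/v1/forecast",
    params={
        "latitude": 52.52,        # Berlin, as an example location
        "longitude": 13.41,
        "current_weather": "true",
    },
    timeout=10,
)
resp.raise_for_status()           # fail loudly on HTTP errors
data = resp.json()

weather = data["current_weather"]
print(f"Temperature: {weather['temperature']} °C, "
      f"wind speed: {weather['windspeed']} km/h")
```

The same pattern — build a params dict, set a timeout, check the status, parse the JSON — carries over to most of the APIs on the Public APIs list.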