Chapter 20 Exercises: Web Scraping for Business Intelligence
These exercises build progressively from understanding HTML structure to building production-quality scrapers. Work through them in order within each tier. The primary practice site for live scraping is https://books.toscrape.com — it is explicitly designed for scraping practice with no ToS restrictions. Other exercises use local HTML strings so you can work offline.
Tier 1 — Comprehension (No Coding Required)
Exercise 1.1: HTML Structure Analysis
Given the following HTML snippet, answer the questions below:
<section class="product-catalog" id="main-catalog">
  <h1>Widget Catalog — Q4 2024</h1>
  <div class="filters">
    <a href="?cat=standard" class="filter-link active">Standard</a>
    <a href="?cat=pro" class="filter-link">Professional</a>
  </div>
  <ul class="product-list">
    <li class="product-item" data-sku="WG-100">
      <span class="name">Basic Widget</span>
      <span class="price" data-currency="USD">$29.99</span>
      <span class="stock in-stock">In Stock</span>
    </li>
    <li class="product-item" data-sku="WG-200">
      <span class="name">Pro Widget</span>
      <span class="price" data-currency="USD">$79.99</span>
      <span class="stock out-of-stock">Out of Stock</span>
    </li>
  </ul>
</section>
a. What is the parent element of the two <li> tags?
b. What are the siblings of the <ul class="product-list"> element?
c. How would you identify the section element using BeautifulSoup? Write two ways: by tag+class and by id.
d. The data-sku attribute contains the product SKU. Write the BeautifulSoup call to find the element with SKU "WG-200".
e. Write the CSS selector that would find all product items that are currently in stock.
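If you want to check your reasoning interactively, load a snippet into BeautifulSoup and experiment. The sandbox below uses a different, made-up fragment (a navigation menu) so it demonstrates the lookup techniques without giving away the answers:

```python
# A sandbox for rehearsing BeautifulSoup lookups before answering
# Exercise 1.1. The HTML here is invented; paste the catalog snippet
# in its place when you are ready to verify your answers.
from bs4 import BeautifulSoup

html = """
<nav class="site-menu" id="top-nav">
  <ul class="links">
    <li class="link-item" data-key="home"><a href="/">Home</a></li>
    <li class="link-item" data-key="docs"><a href="/docs">Docs</a></li>
  </ul>
</nav>
"""
soup = BeautifulSoup(html, "html.parser")

# Finding one element three ways: by tag + class, by id, and by a
# data-* attribute (attrs= takes a dict of attribute -> value).
menu_by_class = soup.find("nav", class_="site-menu")
menu_by_id = soup.find(id="top-nav")
docs_item = soup.find("li", attrs={"data-key": "docs"})

# CSS selectors work through .select() / .select_one():
links = soup.select("ul.links > li.link-item a")

print(docs_item.a.get_text())  # Docs
print(len(links))              # 2
```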
Exercise 1.2: True or False
Mark each statement True or False and explain your reasoning.
a. requests.get() renders JavaScript and returns the final visible content of a page.
b. A website's robots.txt file is legally binding — violating it is always illegal.
c. soup.find("div") returns a list of all <div> elements in the document.
d. Adding a time.sleep(2) delay between requests is a legal requirement for web scrapers.
e. pd.read_html() can scrape all tables from a web page in a single function call.
f. A 429 status code means the server found the page but access is forbidden.
g. The "lxml" parser is faster than Python's built-in "html.parser".
Exercise 1.3: robots.txt Interpretation
Read the following robots.txt file and answer the questions:
User-agent: *
Disallow: /admin/
Disallow: /api/private/
Disallow: /user/
Allow: /api/public/
Crawl-delay: 5
User-agent: Googlebot
Allow: /
Crawl-delay: 1
User-agent: BadBot
Disallow: /
a. Is a generic Python scraper allowed to access /products/widgets/?
b. Is a generic Python scraper allowed to access /api/public/data?
c. Is Googlebot allowed to access /admin/?
d. If your scraper identifies itself as "BadBot", can it access any page?
e. How long should a generic Python scraper wait between requests?
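After answering on paper, you can verify your reading with the standard library's robots.txt parser. The sketch below feeds it the rules from this exercise directly (note that crawl_delay() only works after the parser believes the rules are fresh, hence the modified() call):

```python
# Verifying robots.txt interpretations with urllib.robotparser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/
Disallow: /api/private/
Disallow: /user/
Allow: /api/public/
Crawl-delay: 5

User-agent: Googlebot
Allow: /
Crawl-delay: 1

User-agent: BadBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())
rp.modified()  # mark the rules as freshly loaded so crawl_delay() works

print(rp.can_fetch("python-scraper", "/products/widgets/"))  # generic agent
print(rp.can_fetch("BadBot", "/products/widgets/"))          # banned agent
print(rp.crawl_delay("python-scraper"))                      # seconds to wait
```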
Exercise 1.4: HTTP Status Code Reference
Match each status code to its meaning and the appropriate action to take:
| Code | Meaning | Action |
|---|---|---|
| 200 | ? | ? |
| 301 | ? | ? |
| 403 | ? | ? |
| 404 | ? | ? |
| 429 | ? | ? |
| 503 | ? | ? |
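The standard library can confirm the canonical reason phrase for each code once you have filled in the "Meaning" column yourself (the "Action" column is still yours to reason through):

```python
# Looking up registered status-code names with the standard library.
from http import HTTPStatus

for code in (200, 301, 403, 404, 429, 503):
    status = HTTPStatus(code)
    print(code, status.phrase, "-", status.description)
```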
Exercise 1.5: CSS Selector Writing
Write CSS selectors for each of the following:
a. All <a> tags with the class "product-link"
b. The first <h1> inside a <div> with the id "page-header"
c. All <span> tags that have a data-price attribute
d. All <li> tags that are direct children of <ul class="menu">
e. All <a> tags where the href attribute starts with "https"
f. The element with id "featured-product"
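A small harness makes it easy to test your selectors before committing to them. The HTML below is a made-up fragment covering the cases in parts a–f; the three example calls spoil parts a, e, and f, so write your own answers first:

```python
# A selector test harness for Exercise 1.5: paste a selector into
# soup.select() and inspect what comes back. The fragment is invented.
from bs4 import BeautifulSoup

html = """
<div id="page-header"><h1>Shop</h1></div>
<ul class="menu"><li>Home</li><li>About</li></ul>
<a class="product-link" href="https://example.com/a">A</a>
<a class="product-link" href="/relative/b">B</a>
<span data-price="9.99">Gadget</span>
<p id="featured-product">Deal of the day</p>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("a.product-link")))         # part a -> 2 matches
print(len(soup.select('a[href^="https"]')))       # part e -> 1 match
print(soup.select_one("#featured-product").text)  # part f
```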
Tier 2 — Guided Practice (Short Code Exercises)
Use https://books.toscrape.com for these exercises.
Exercise 2.1: Your First Fetch
Write a Python script that:
1. Imports requests
2. Fetches the homepage of books.toscrape.com
3. Prints the status code
4. Prints the first 500 characters of the HTML
5. Prints the value of the Content-Type header
Exercise 2.2: Parsing with BeautifulSoup
Fetch https://books.toscrape.com/ and answer these questions using BeautifulSoup:
a. What is the page title? (Find the <title> tag)
b. How many <article> elements are on the page?
c. Find the first book's title. (Hint: the full title is in the title attribute of the <a> inside <h3>)
d. What is the price of the first book? (Find p.price_color)
e. How many star ratings of "Five" are there on the first page? (Look for p.star-rating.Five)
Exercise 2.3: Extracting a List
Write a function get_all_book_titles(html: str) -> list[str] that:
- Takes the HTML of a books.toscrape.com page as a string
- Returns a list of all book titles on that page (the full title, not truncated)
- Tests: the first page should return exactly 20 titles
Exercise 2.4: Attribute Extraction
Write a function get_all_book_links(html: str, base_url: str) -> list[str] that:
- Finds all book listing links on a page
- Converts relative URLs to absolute URLs using urljoin
- Returns a list of absolute URLs
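Before writing the function, it is worth testing how urljoin resolves different kinds of href values: relative paths resolve against the base URL's directory, ".." climbs a level, and already-absolute URLs pass through unchanged.

```python
# Exploring urljoin's resolution rules with a books.toscrape.com base URL.
from urllib.parse import urljoin

base = "https://books.toscrape.com/catalogue/page-2.html"

print(urljoin(base, "page-3.html"))
# https://books.toscrape.com/catalogue/page-3.html
print(urljoin(base, "../index.html"))
# https://books.toscrape.com/index.html
print(urljoin(base, "https://example.com/x"))
# https://example.com/x
```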
Exercise 2.5: Table Scraping with pandas
Using pd.read_html():
1. Scrape the first table from https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue
2. Print the column names
3. Print the first 5 rows
4. Find the company with the highest revenue
5. Save to a CSV file called largest_companies.csv
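Steps 2–5 can be rehearsed offline on a toy DataFrame before pointing pd.read_html() at the live Wikipedia page. The company names and figures below are invented for illustration:

```python
# Rehearsing the pandas steps of Exercise 2.5 on made-up data; for the
# real exercise, the DataFrame comes from pd.read_html(url)[0].
import pandas as pd

df = pd.DataFrame({
    "Name": ["AlphaCo", "BetaCorp", "GammaLtd"],
    "Revenue (USD millions)": [512000, 611289, 324000],
})

print(df.columns.tolist())   # step 2: column names
print(df.head())             # step 3: first rows

# Step 4: idxmax() gives the index label of the largest value,
# and .loc pulls out that whole row.
top = df.loc[df["Revenue (USD millions)"].idxmax()]
print(top["Name"])

# Step 5: persist to CSV without the index column.
df.to_csv("largest_companies.csv", index=False)
```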
Tier 3 — Applied Exercises (Write Working Scrapers)
Exercise 3.1: Single Category Scraper
Write a complete scraper for one category on books.toscrape.com:
- Navigate to https://books.toscrape.com/catalogue/category/books/mystery_3/index.html
- Extract all books: title, price, rating (as a number 1–5), availability
- Handle the £ symbol in prices — store the price as a float
- Convert the text star rating ("One", "Two", etc.) to a number
- Save to mystery_books.csv
- Print a summary: total books, average price, highest-rated books
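The two conversions in this exercise are good candidates for small helper functions. A sketch, assuming prices look like "£51.77" (depending on how you decode the response, the pound sign sometimes arrives mangled as "Â£", so a regex is more robust than str.replace):

```python
# Helper sketches for Exercise 3.1: price text -> float and
# star-rating class list -> integer.
import re

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_price(text: str) -> float:
    """Extract a float from a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())

def parse_rating(class_list: list[str]) -> int:
    """Map a class list like ['star-rating', 'Three'] to the number 3."""
    for cls in class_list:
        if cls in RATING_WORDS:
            return RATING_WORDS[cls]
    raise ValueError(f"no rating word in {class_list}")

print(parse_price("Â£51.77"))                  # 51.77
print(parse_rating(["star-rating", "Three"]))  # 3
```

On a book element, the class list comes from something like `book.find("p", class_="star-rating")["class"]`.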
Exercise 3.2: Multi-Page Scraper with Rate Limiting
Extend Exercise 3.1 to scrape all pages of a category:
- Scrape the first page
- Find the "next" pagination link (li.next a)
- Follow it to the next page
- Continue until there is no "next" link
- Add a 1.5–3 second random delay between pages
- Print progress: "Scraping page 1/5..." on each page
- Save all results to a single CSV
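One minimal way to implement the 1.5–3 second requirement; the random jitter makes the request pattern look less mechanical than a fixed sleep:

```python
# A polite-delay helper sketch for Exercise 3.2.
import random
import time

def polite_sleep(low: float = 1.5, high: float = 3.0) -> float:
    """Sleep for a random duration in [low, high] seconds and return it."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay

# Between page fetches in your scraping loop:
# delay = polite_sleep()
# print(f"Waited {delay:.1f}s before the next page")
```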
Exercise 3.3: Error-Resilient Scraper
Modify your scraper from Exercise 3.2 to handle errors gracefully:
- If a page fetch fails (any non-200 status), log the error and skip to the next page
- If a specific book's data cannot be extracted (missing element), log a warning and skip that book
- If a price cannot be converted to float, store None and log a warning
- Add a --verbose flag that shows debug-level logging when enabled
- The scraper should never crash on a single page or book failure
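One way to wire the --verbose flag to logging levels with argparse (a sketch; the scraper body itself is your Exercise 3.2 code). The explicit argument list passed to parse_args simulates command-line input so the snippet runs anywhere:

```python
# Wiring --verbose to debug-level logging for Exercise 3.3.
import argparse
import logging

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Error-resilient book scraper")
    parser.add_argument("--verbose", action="store_true",
                        help="enable debug-level logging")
    return parser

def configure_logging(verbose: bool) -> None:
    level = logging.DEBUG if verbose else logging.INFO
    logging.basicConfig(level=level,
                        format="%(asctime)s %(levelname)s %(message)s")

args = build_parser().parse_args(["--verbose"])  # simulate: script.py --verbose
configure_logging(args.verbose)
logging.debug("debug messages are now visible")
```

In the real script, call `parse_args()` with no arguments so it reads sys.argv.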
Exercise 3.4: BeautifulSoup Table Parser
Write a function scrape_html_table(url: str, table_index: int = 0) -> list[dict] that:
- Fetches the page at url
- Finds the table at the given index
- Extracts headers from <th> elements
- Extracts data from <td> elements
- Returns a list of dictionaries (one per row, keyed by header)
- Test it on a Wikipedia table of your choice
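The parsing core can be developed and tested offline on an inline table before adding the fetch step. A sketch under that assumption (the hypothetical parse_html_table below is the inner half of scrape_html_table; swap in requests.get(url).text for the real exercise, and note that real Wikipedia tables sometimes put <th> cells in body rows too):

```python
# Offline sketch of the row-to-dict logic for Exercise 3.4.
from bs4 import BeautifulSoup

def parse_html_table(html: str, table_index: int = 0) -> list[dict]:
    """Parse the table at table_index into a list of per-row dicts."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[table_index]
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td>, so it is skipped here
            rows.append(dict(zip(headers, cells)))
    return rows

html = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Oslo</td><td>709037</td></tr>
  <tr><td>Bergen</td><td>291940</td></tr>
</table>
"""
print(parse_html_table(html))
```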
Exercise 3.5: Link Extractor with robots.txt Check
Build a tool that:
- Accepts a URL as command-line input
- Checks robots.txt before proceeding
- Fetches the page if allowed
- Extracts all unique absolute URLs from <a> tags
- Categorizes links as internal (same domain) or external (different domain)
- Saves results to CSV with columns: url, link_text, type (internal/external)
- Prints a summary: N internal links, N external links found
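The internal/external decision reduces to comparing hostnames, which urllib.parse handles for you. A minimal sketch (the helper name is ours, not part of the exercise):

```python
# Classifying links as internal or external for Exercise 3.5.
from urllib.parse import urljoin, urlparse

def classify_link(page_url: str, href: str) -> tuple[str, str]:
    """Return (absolute_url, 'internal' or 'external')."""
    absolute = urljoin(page_url, href)  # also resolves relative hrefs
    same_host = urlparse(absolute).netloc == urlparse(page_url).netloc
    return absolute, "internal" if same_host else "external"

page = "https://books.toscrape.com/index.html"
print(classify_link(page, "catalogue/page-2.html"))
print(classify_link(page, "https://example.com/about"))
```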
Tier 4 — Business Scenarios (Multi-Step Projects)
Exercise 4.1: Acme Catalog Price Monitor (Simplified)
Build a price monitoring tool for books.toscrape.com that simulates competitive price tracking:
- Scrape the "Science" category (https://books.toscrape.com/catalogue/category/books/science_22/index.html)
- Save prices with a timestamp: book_title, price_gbp, rating_stars, scraped_date
- Run the scraper twice (with at least a 10-second interval between runs)
- Load both datasets and compare: "Did any prices change?"
- On the second run, append new data to the existing CSV (do not overwrite)
- Print a "price change report": list any books with different prices
Note: Books.toscrape.com does not actually change prices. Use this to practice the append-and-compare pattern, not to find real changes.
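The comparison step boils down to diffing two title-to-price mappings, one per run. A stdlib sketch with made-up numbers (books only present in one run are ignored here; you may want to report those separately):

```python
# The compare half of the append-and-compare pattern for Exercise 4.1.
def price_changes(old: dict[str, float],
                  new: dict[str, float]) -> dict[str, tuple]:
    """Return {title: (old_price, new_price)} for books whose price moved."""
    return {
        title: (old[title], new[title])
        for title in old.keys() & new.keys()  # titles present in both runs
        if old[title] != new[title]
    }

run_1 = {"Book A": 51.77, "Book B": 22.60}
run_2 = {"Book A": 49.99, "Book B": 22.60, "Book C": 17.46}
print(price_changes(run_1, run_2))  # {'Book A': (51.77, 49.99)}
```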
Exercise 4.2: Multi-Category Price Comparison
Extend Exercise 4.1 to scrape multiple categories:
- Scrape three categories: Science, History, Business
- Build a combined DataFrame with a category column
- Calculate: average price by category, most expensive book per category, percentage of 5-star books per category
- Export a formatted Excel file (using openpyxl) with:
  - Sheet 1: Raw data (all books)
  - Sheet 2: Summary statistics by category
  - Appropriate column formatting
Exercise 4.3: Job Listing Filter (Custom Version)
Build a simplified version of Maya's job monitor using a public, scraping-friendly job board:
- Identify a job board that permits scraping in its robots.txt (check before writing any code)
- Scrape the first two pages of a relevant job category
- Define five to ten keywords relevant to a profession you choose (not necessarily analytics)
- Filter listings for keyword matches in title and description
- Save matching jobs to CSV with: title, company, location, posted_date, url, matched_keywords
- Print: total scraped, total matched, new since last run
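The keyword-matching step is plain string work. A hedged sketch (the listing text and keywords here are invented; the real inputs come from whatever board you choose):

```python
# Keyword filtering sketch for Exercise 4.3.
def matched_keywords(text: str, keywords: list[str]) -> list[str]:
    """Return the keywords that appear in text, case-insensitively."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]

keywords = ["python", "sql", "pandas"]
listing = "Data Analyst - strong Python and SQL required"
print(matched_keywords(listing, keywords))  # ['python', 'sql']
```

Join the result with ";" when writing the matched_keywords CSV column, and keep rows where the list is non-empty.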
Exercise 4.4: Wikipedia Economic Data Pipeline
Build a data pipeline using Wikipedia public data:
- Scrape https://en.wikipedia.org/wiki/List_of_S%26P_500_companies
- Extract the full table (500 companies) using either pd.read_html() or BeautifulSoup
- Clean the data: remove footnote markers, handle missing values
- Add a column sector_count showing how many companies share each sector
- Find the five sectors with the most S&P 500 companies
- Scrape the Wikipedia page for each of those five sectors (their main Wikipedia article)
- Extract a one-paragraph summary from each sector article
- Build a final report: sector name, company count, Wikipedia summary
- Save to both CSV and a formatted plain text file
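Wikipedia cells often carry footnote markers such as "Energy[3]". A regex handles the cleanup step; run it over each affected column (with pandas, via `df["Sector"].map(strip_footnotes)`):

```python
# Footnote-marker cleanup sketch for the Wikipedia pipeline.
import re

def strip_footnotes(value: str) -> str:
    """Remove [1]-style footnote markers and surrounding whitespace."""
    return re.sub(r"\[\d+\]", "", value).strip()

print(strip_footnotes("Information Technology[4]"))  # Information Technology
print(strip_footnotes("Energy [12] "))               # Energy
```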
Tier 5 — Capstone Project
Exercise 5.1: Business Intelligence Dashboard Data Pipeline
Build a complete, automated data pipeline that combines scraping, processing, and reporting.
Business context: You are building a weekly market intelligence report for a hypothetical company that sells books (using books.toscrape.com as a proxy for real market data).
Component 1 — Market Scanner
- Scrape all categories available on https://books.toscrape.com/catalogue/category/books/index.html
- For each category, follow its link and scrape all books (all pages)
- Total: approximately 1000 books across 50 categories
- Rate limit appropriately — complete the full scrape in a single session without getting blocked
- Save to market_data/books_full_catalog.csv
Component 2 — Analysis Module
Create a Python module market_analysis.py with these functions:
- get_price_distribution(df) → returns percentile statistics (10th, 25th, 50th, 75th, 90th) overall and by category
- get_top_categories_by_avg_price(df, n=10) → returns the N most expensive categories by average book price
- get_rating_distribution(df) → returns count and percentage for each star rating (1–5)
- find_value_books(df, max_price, min_rating) → returns books under a price with a rating at or above a threshold
- get_category_summary(df) → returns a summary DataFrame with one row per category
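To make the function specs concrete, here is one possible shape for get_price_distribution, exercised on a tiny made-up catalog (in the real pipeline, df is loaded from books_full_catalog.csv and the column names are whatever your scraper produced; "category" and "price" are assumptions here):

```python
# A sketch of get_price_distribution from Component 2.
import pandas as pd

def get_price_distribution(df: pd.DataFrame) -> pd.DataFrame:
    """Percentile price statistics per category, plus an 'ALL' row."""
    quantiles = [0.10, 0.25, 0.50, 0.75, 0.90]
    # One row per category, one column per percentile:
    by_category = df.groupby("category")["price"].quantile(quantiles).unstack()
    # Overall percentiles as a single extra row labelled 'ALL':
    overall = df["price"].quantile(quantiles).rename("ALL")
    return pd.concat([by_category, overall.to_frame().T])

df = pd.DataFrame({
    "category": ["Science", "Science", "History", "History"],
    "price": [10.0, 30.0, 20.0, 40.0],
})
print(get_price_distribution(df))
```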
Component 3 — Report Generator
Using the analysis functions, generate a weekly market intelligence report:
- A formatted plain text report saved to reports/market_report_YYYY-MM-DD.txt
- An Excel workbook with multiple sheets (one per analysis function's output)
- A simple HTML page reports/market_report_YYYY-MM-DD.html with a styled summary table
Component 4 — Email Delivery
Combine with Chapter 19 skills:
- Send the report HTML as an email body to yourself
- Attach the Excel workbook
- The subject line includes the date
Component 5 — Scheduling and Logging
- Add a --run-log flag that appends a summary of each run to run_history.csv
- Add a --dry-run flag that generates the report without sending the email
- Ensure all scraping includes rate limiting, error handling, and a polite user agent
Deliverables:
- All Python modules with proper docstrings
- A requirements.txt file listing all dependencies
- Sample output: the full catalog CSV, one market report (text), one Excel workbook
- A README.txt explaining how to run the pipeline from scratch