Chapter 20 Exercises: Web Scraping for Business Intelligence

These exercises build progressively from understanding HTML structure to building production-quality scrapers. Work through them in order within each tier. The primary practice site for live scraping is https://books.toscrape.com — it is explicitly designed for scraping practice with no ToS restrictions. Other exercises use local HTML strings so you can work offline.


Tier 1 — Comprehension (No Coding Required)

Exercise 1.1: HTML Structure Analysis

Given the following HTML snippet, answer the questions below:

<section class="product-catalog" id="main-catalog">
  <h1>Widget Catalog — Q4 2024</h1>
  <div class="filters">
    <a href="?cat=standard" class="filter-link active">Standard</a>
    <a href="?cat=pro" class="filter-link">Professional</a>
  </div>
  <ul class="product-list">
    <li class="product-item" data-sku="WG-100">
      <span class="name">Basic Widget</span>
      <span class="price" data-currency="USD">$29.99</span>
      <span class="stock in-stock">In Stock</span>
    </li>
    <li class="product-item" data-sku="WG-200">
      <span class="name">Pro Widget</span>
      <span class="price" data-currency="USD">$79.99</span>
      <span class="stock out-of-stock">Out of Stock</span>
    </li>
  </ul>
</section>

a. What is the parent element of the two <li> tags?

b. What are the siblings of the <ul class="product-list"> element?

c. How would you identify the <section> element using BeautifulSoup? Write two ways: by tag and class, and by id.

d. The data-sku attribute contains the product SKU. Write the BeautifulSoup call to find the element with SKU "WG-200".

e. Write the CSS selector that would find all product items that are currently in stock.
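If you want to check your answers hands-on, the snippet can be loaded straight into BeautifulSoup and explored offline (assuming beautifulsoup4 is installed; the two example lookups are illustrations, not the answers):

```python
from bs4 import BeautifulSoup

# The Exercise 1.1 snippet, pasted as a string so you can experiment offline
SNIPPET = """
<section class="product-catalog" id="main-catalog">
  <h1>Widget Catalog — Q4 2024</h1>
  <div class="filters">
    <a href="?cat=standard" class="filter-link active">Standard</a>
    <a href="?cat=pro" class="filter-link">Professional</a>
  </div>
  <ul class="product-list">
    <li class="product-item" data-sku="WG-100">
      <span class="name">Basic Widget</span>
      <span class="price" data-currency="USD">$29.99</span>
      <span class="stock in-stock">In Stock</span>
    </li>
    <li class="product-item" data-sku="WG-200">
      <span class="name">Pro Widget</span>
      <span class="price" data-currency="USD">$79.99</span>
      <span class="stock out-of-stock">Out of Stock</span>
    </li>
  </ul>
</section>
"""

soup = BeautifulSoup(SNIPPET, "html.parser")

# Check your answers interactively, for example:
print(soup.find("li", attrs={"data-sku": "WG-100"}))     # lookup by a data-* attribute
print([s.get_text() for s in soup.select("span.name")])  # lookup by CSS selector
```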

Exercise 1.2: True or False

Mark each statement True or False and explain your reasoning.

a. requests.get() renders JavaScript and returns the final visible content of a page.

b. A website's robots.txt file is legally binding — violating it is always illegal.

c. soup.find("div") returns a list of all <div> elements in the document.

d. Adding a time.sleep(2) delay between requests is a legal requirement for all scrapers.

e. pd.read_html() can scrape all tables from a web page in a single function call.

f. A 429 status code means the server found the page but access is forbidden.

g. The "lxml" parser is faster than Python's built-in "html.parser".

Exercise 1.3: robots.txt Interpretation

Read the following robots.txt file and answer the questions:

User-agent: *
Disallow: /admin/
Disallow: /api/private/
Disallow: /user/
Allow: /api/public/
Crawl-delay: 5

User-agent: Googlebot
Allow: /
Crawl-delay: 1

User-agent: BadBot
Disallow: /

a. Is a generic Python scraper allowed to access /products/widgets/?

b. Is a generic Python scraper allowed to access /api/public/data?

c. Is Googlebot allowed to access /admin/?

d. If your scraper identifies itself as "BadBot", can it access any page?

e. How long should a generic Python scraper wait between requests?
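You can verify your answers with the standard library's urllib.robotparser. This sketch feeds the rules directly to the parser rather than fetching them, and includes only the generic and BadBot sections:

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /admin/
Disallow: /api/private/
Disallow: /user/
Allow: /api/public/
Crawl-delay: 5

User-agent: BadBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("*", "/products/widgets/"))       # True: no rule disallows it
print(rp.can_fetch("*", "/api/public/data"))         # True: explicitly allowed
print(rp.can_fetch("BadBot", "/products/widgets/"))  # False: BadBot is banned entirely
print(rp.crawl_delay("*"))                           # 5 (seconds between requests)
```

Against a live site you would instead call rp.set_url("https://example.com/robots.txt") followed by rp.read().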

Exercise 1.4: HTTP Status Code Reference

Match each status code to its meaning and the appropriate action to take:

Code | Meaning | Action
-----|---------|-------
200  |    ?    |   ?
301  |    ?    |   ?
403  |    ?    |   ?
404  |    ?    |   ?
429  |    ?    |   ?
503  |    ?    |   ?
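Once you have filled in the table, it is worth encoding it as a lookup, since the same pattern appears inside real scrapers. This is a hypothetical helper with one reasonable action per code; other choices are defensible:

```python
# One reasonable meaning/action per status code; compare against your table.
STATUS_ACTIONS = {
    200: ("OK", "process the response"),
    301: ("Moved Permanently", "follow the redirect (requests does this automatically)"),
    403: ("Forbidden", "stop; the server is refusing your requests"),
    404: ("Not Found", "log and skip this URL"),
    429: ("Too Many Requests", "back off, wait, and retry more slowly"),
    503: ("Service Unavailable", "retry later with a longer delay"),
}

def action_for(status: int) -> str:
    """Return the scraper action for a status code, with a safe default."""
    meaning, action = STATUS_ACTIONS.get(status, ("Unknown", "log and skip"))
    return action

print(action_for(429))  # back off, wait, and retry more slowly
```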

Exercise 1.5: CSS Selector Writing

Write CSS selectors for each of the following:

a. All <a> tags with the class "product-link"

b. The first <h1> inside a <div> with the id "page-header"

c. All <span> tags that have a data-price attribute

d. All <li> tags that are direct children of <ul class="menu">

e. All <a> tags where the href attribute starts with "https"

f. The element with id "featured-product"


Tier 2 — Guided Practice (Short Code Exercises)

Use https://books.toscrape.com for these exercises.

Exercise 2.1: Your First Fetch

Write a Python script that:

  1. Imports requests
  2. Fetches the homepage of books.toscrape.com
  3. Prints the status code
  4. Prints the first 500 characters of the HTML
  5. Prints the value of the Content-Type header

Exercise 2.2: Parsing with BeautifulSoup

Fetch https://books.toscrape.com/ and answer these questions using BeautifulSoup:

a. What is the page title? (Find the <title> tag)

b. How many <article> elements are on the page?

c. Find the first book's title. (Hint: the full title is in the title attribute of the <a> inside <h3>)

d. What is the price of the first book? (Find p.price_color)

e. How many star ratings of "Five" are there on the first page? (Look for p.star-rating.Five)

Exercise 2.3: Extracting a List

Write a function get_all_book_titles(html: str) -> list[str] that:

  - Takes the HTML of a books.toscrape.com page as a string
  - Returns a list of all book titles on that page (the full title, not truncated)

Test: the first page should return exactly 20 titles.

Exercise 2.4: Attribute Extraction

Write a function get_all_book_links(html: str, base_url: str) -> list[str] that:

  - Finds all book listing links on a page
  - Converts relative URLs to absolute URLs using urljoin
  - Returns a list of absolute URLs
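urljoin from the standard library does the relative-to-absolute conversion; a quick sketch of its behavior before you build the function:

```python
from urllib.parse import urljoin

base = "https://books.toscrape.com/catalogue/page-1.html"

# A sibling-relative link stays in the same directory:
print(urljoin(base, "page-2.html"))
# -> https://books.toscrape.com/catalogue/page-2.html

# A root-relative link replaces the whole path:
print(urljoin(base, "/media/cache/cover.jpg"))
# -> https://books.toscrape.com/media/cache/cover.jpg

# An already-absolute link is returned unchanged:
print(urljoin(base, "https://example.com/other"))
# -> https://example.com/other
```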

Exercise 2.5: Table Scraping with pandas

Using pd.read_html():

  1. Scrape the first table from https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue
  2. Print the column names
  3. Print the first 5 rows
  4. Find the company with the highest revenue
  5. Save to a CSV file called largest_companies.csv


Tier 3 — Applied Exercises (Write Working Scrapers)

Exercise 3.1: Single Category Scraper

Write a complete scraper for one category on books.toscrape.com:

  1. Navigate to https://books.toscrape.com/catalogue/category/books/mystery_3/index.html
  2. Extract all books: title, price, rating (as a number 1–5), availability
  3. Handle the £ symbol in prices — store as a float
  4. Convert the text star rating ("One", "Two", etc.) to a number
  5. Save to mystery_books.csv
  6. Print a summary: total books, average price, highest-rated books
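Steps 3 and 4 are pure string work and are worth isolating into small helpers before wiring up the scraper. A sketch (the function names are my own; the "Â£" variant handles a common artifact when the page's UTF-8 is misread):

```python
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_price(text: str) -> float:
    """Convert a price string like '£53.74' (or the mis-decoded 'Â£53.74')
    into a float."""
    cleaned = text.strip().replace("Â", "").lstrip("£$").replace(",", "")
    return float(cleaned)

def parse_rating(star_classes: list[str]) -> int:
    """Convert the class list of <p class="star-rating Three"> to 3."""
    for word in star_classes:
        if word in RATING_WORDS:
            return RATING_WORDS[word]
    raise ValueError(f"No rating word in {star_classes!r}")

print(parse_price("£53.74"))                    # 53.74
print(parse_rating(["star-rating", "Three"]))   # 3
```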

Exercise 3.2: Multi-Page Scraper with Rate Limiting

Extend Exercise 3.1 to scrape all pages of a category:

  1. Scrape the first page
  2. Find the "next" pagination link (li.next a)
  3. Follow it to the next page
  4. Continue until there is no "next" link
  5. Add a 1.5–3 second random delay between pages
  6. Print progress: "Scraping page 1/5..." on each page
  7. Save all results to a single CSV

Exercise 3.3: Error-Resilient Scraper

Modify your scraper from Exercise 3.2 to handle errors gracefully:

  1. If a page fetch fails (any non-200 status), log the error and skip to the next page
  2. If a specific book's data cannot be extracted (missing element), log a warning and skip that book
  3. If a price cannot be converted to float, store None and log a warning
  4. Add a --verbose flag that shows debug-level logging when enabled
  5. The scraper should never crash on a single page or book failure

Exercise 3.4: BeautifulSoup Table Parser

Write a function scrape_html_table(url: str, table_index: int = 0) -> list[dict] that:

  1. Fetches the page at url
  2. Finds the table at the given index
  3. Extracts headers from <th> elements
  4. Extracts data from <td> elements
  5. Returns a list of dictionaries (one per row, keyed by header)
  6. Test it on a Wikipedia table of your choice
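The header/cell pairing at the core of steps 3–5 looks roughly like this (parsing only; fetching is left to you, and the function name is my own):

```python
from bs4 import BeautifulSoup

def parse_table_html(html: str, table_index: int = 0) -> list[dict]:
    """Parse the table at table_index into a list of row dicts."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[table_index]
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip the header row, which has no <td>
            rows.append(dict(zip(headers, cells)))
    return rows
```

Be aware that real Wikipedia tables often use rowspan/colspan, which breaks this simple zip; part of the exercise is deciding how to handle that.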

Exercise 3.5: Link Extractor

Build a tool that:

  1. Accepts a URL as command-line input
  2. Checks robots.txt before proceeding
  3. Fetches the page if allowed
  4. Extracts all unique absolute URLs from <a> tags
  5. Categorizes links as internal (same domain) or external (different domain)
  6. Saves results to CSV with columns: url, link_text, type (internal/external)
  7. Prints a summary: N internal links, N external links found
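The internal/external split in step 5 comes down to comparing hostnames (a sketch using only the standard library; note that relative URLs have an empty netloc, so resolve them with urljoin first):

```python
from urllib.parse import urlparse

def link_type(link_url: str, base_url: str) -> str:
    """Classify a link as 'internal' or 'external' relative to base_url."""
    return ("internal"
            if urlparse(link_url).netloc == urlparse(base_url).netloc
            else "external")

print(link_type("https://books.toscrape.com/catalogue/page-2.html",
                "https://books.toscrape.com/"))  # internal
print(link_type("https://example.com/about",
                "https://books.toscrape.com/"))  # external
```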

Tier 4 — Business Scenarios (Multi-Step Projects)

Exercise 4.1: Acme Catalog Price Monitor (Simplified)

Build a price monitoring tool for books.toscrape.com that simulates competitive price tracking:

  1. Scrape the "Science" category (https://books.toscrape.com/catalogue/category/books/science_22/index.html)
  2. Save prices with a timestamp: book_title, price_gbp, rating_stars, scraped_date
  3. Run the scraper twice (with at least a 10-second interval between runs)
  4. Load both datasets and compare: "Did any prices change?"
  5. On the second run, append new data to the existing CSV (do not overwrite)
  6. Print a "price change report": list any books with different prices

Note: Books.toscrape.com does not actually change prices. Use this to practice the append-and-compare pattern, not to find real changes.
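The compare step in 4 and 6 reduces to matching titles across the two runs. A sketch with plain dicts (in practice you would build them from the two CSV loads, e.g. title-to-price mappings):

```python
def price_changes(old: dict[str, float],
                  new: dict[str, float]) -> dict[str, tuple[float, float]]:
    """Return {title: (old_price, new_price)} for titles whose price changed."""
    return {title: (old[title], new[title])
            for title in old.keys() & new.keys()  # titles present in both runs
            if old[title] != new[title]}

run_1 = {"A Brief History": 25.99, "Cosmos": 18.50}
run_2 = {"A Brief History": 23.99, "Cosmos": 18.50}
print(price_changes(run_1, run_2))  # {'A Brief History': (25.99, 23.99)}
```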

Exercise 4.2: Multi-Category Price Comparison

Extend Exercise 4.1 to scrape multiple categories:

  1. Scrape three categories: Science, History, Business
  2. Build a combined DataFrame with a category column
  3. Calculate: average price by category, most expensive book per category, percentage of 5-star books per category
  4. Export a formatted Excel file (using openpyxl) with:
     - Sheet 1: Raw data (all books)
     - Sheet 2: Summary statistics by category
     - Appropriate column formatting
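The per-category statistics in step 3 are one groupby away. A sketch on toy data, assuming pandas is installed (the column names mirror the exercise, not a fixed schema):

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["Science", "Science", "History", "Business"],
    "price":    [10.00, 20.00, 30.00, 12.50],
    "rating":   [5, 3, 5, 4],
})

summary = df.groupby("category").agg(
    avg_price=("price", "mean"),
    max_price=("price", "max"),
    pct_five_star=("rating", lambda s: 100 * (s == 5).mean()),
)
print(summary)
```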

Exercise 4.3: Job Listing Filter (Custom Version)

Build a simplified version of Maya's job monitor using a public, scraping-friendly job board:

  1. Identify a job board that permits scraping in its robots.txt (check before writing any code)
  2. Scrape the first two pages of a relevant job category
  3. Define five to ten keywords relevant to a profession you choose (not necessarily analytics)
  4. Filter listings for keyword matches in title and description
  5. Save matching jobs to CSV with: title, company, location, posted_date, url, matched_keywords
  6. Print: total scraped, total matched, new since last run

Exercise 4.4: Wikipedia Economic Data Pipeline

Build a data pipeline using Wikipedia public data:

  1. Scrape https://en.wikipedia.org/wiki/List_of_S%26P_500_companies
  2. Extract the full table (500 companies) using either pd.read_html() or BeautifulSoup
  3. Clean the data: remove footnote markers, handle missing values
  4. Add a column sector_count showing how many companies share each sector
  5. Find the five sectors with the most S&P 500 companies
  6. Scrape the Wikipedia page for each of those five sectors (their main Wikipedia article)
  7. Extract a one-paragraph summary from each sector article
  8. Build a final report: sector name, company count, Wikipedia summary
  9. Save to both CSV and a formatted plain text file
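For step 3, Wikipedia footnote markers such as "[4]" or "[note 2]" can be stripped from scraped cell text with a small regex (a sketch; the function name is my own):

```python
import re

def strip_footnotes(text: str) -> str:
    """Remove footnote markers like [4] or [note 2] from scraped cell text."""
    return re.sub(r"\[[^\]]*\]", "", text).strip()

print(strip_footnotes("Information Technology[4]"))   # Information Technology
print(strip_footnotes("Berkshire Hathaway[note 2]"))  # Berkshire Hathaway
```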

Tier 5 — Capstone Project

Exercise 5.1: Business Intelligence Dashboard Data Pipeline

Build a complete, automated data pipeline that combines scraping, processing, and reporting.

Business context: You are building a weekly market intelligence report for a hypothetical company that sells books (using books.toscrape.com as a proxy for real market data).

Component 1 — Market Scanner

  - Scrape all categories available on https://books.toscrape.com/catalogue/category/books/index.html
  - For each category, follow its link and scrape all books (all pages)
  - Total: approximately 1000 books across 50 categories
  - Rate limit appropriately — complete the full scrape in a single session without getting blocked
  - Save to market_data/books_full_catalog.csv

Component 2 — Analysis Module

Create a Python module market_analysis.py with these functions:

  - get_price_distribution(df) → returns percentile statistics (10th, 25th, 50th, 75th, 90th) overall and by category
  - get_top_categories_by_avg_price(df, n=10) → returns the N most expensive categories by average book price
  - get_rating_distribution(df) → returns count and percentage for each star rating (1–5)
  - find_value_books(df, max_price, min_rating) → returns books under a price with a rating at or above a threshold
  - get_category_summary(df) → returns a summary DataFrame with one row per category
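Some of these functions are close to one-liners once the data is in a DataFrame. For instance, the overall half of get_price_distribution might look like this (a sketch on toy data; the by-category half is left to you):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0, 40.0, 50.0]})

# Percentile statistics at the 10th, 25th, 50th, 75th, and 90th percentiles
percentiles = df["price"].quantile([0.10, 0.25, 0.50, 0.75, 0.90])
print(percentiles)
```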

Component 3 — Report Generator

Using the analysis functions, generate a weekly market intelligence report:

  - A formatted plain text report saved to reports/market_report_YYYY-MM-DD.txt
  - An Excel workbook with multiple sheets (one per analysis function's output)
  - A simple HTML page reports/market_report_YYYY-MM-DD.html with a styled summary table

Component 4 — Email Delivery

Combine with Chapter 19 skills:

  - Send the report HTML as an email body to yourself
  - Attach the Excel workbook
  - Include the date in the subject line

Component 5 — Scheduling and Logging

  - Add a --run-log flag that appends a summary of each run to run_history.csv
  - Add a --dry-run flag that generates the report without sending the email
  - Ensure all scraping includes rate limiting, error handling, and a polite user agent

Deliverables:

  - All Python modules with proper docstrings
  - A requirements.txt file listing all dependencies
  - Sample output: the full catalog CSV, one market report (text), one Excel workbook
  - A README.txt explaining how to run the pipeline from scratch