

Learning Objectives

  • Parse HTML documents and extract data using Beautiful Soup
  • Use CSS selectors and Beautiful Soup methods to locate specific elements
  • Build a complete web scraper with error handling, rate limiting, and polite headers
  • Apply ethical scraping practices: robots.txt, rate limits, Terms of Service
  • Automate file management, report generation, and batch processing tasks
  • Schedule scripts to run automatically with cron, Task Scheduler, and the schedule library
  • Design and implement a fetch-parse-transform-output automation pipeline

Chapter 24: Web Scraping and Automation

"Automate the boring stuff so you can focus on the interesting stuff." — Al Sweigart

Chapter Overview

Every Friday afternoon, Elena Vasquez sat down at her desk at Harbor Community Services and began her four-hour ritual. Open three browser tabs. Navigate to three county data portals. Click "Export." Wait. Download. Copy-paste each dataset into a master spreadsheet. Check for obvious errors — sometimes missing them, like the column swap that once led to wrong numbers at a board meeting. Write formulas. Format the report. Email it to twelve stakeholders. Four hours, every single week.

You first met Elena in Chapter 1. You've watched her story evolve — her first loop in Chapter 5, file I/O in Chapter 10, data processing in Chapter 21, regex validation in Chapter 22. Now, in Chapter 24, her story reaches its payoff. Elena's four-hour Friday becomes a 30-second script execution, scheduled to run at 6 AM every Friday before she even gets to the office.

This chapter teaches you two related skills: web scraping (extracting data from web pages programmatically) and automation (making the computer do repetitive work without your involvement). Web scraping lets your programs read web pages the way you do — but faster and without getting bored. Automation turns any repetitive workflow into a script that runs on command or on schedule.


🏃 Fast Track: If you're familiar with HTML from web development, skim 24.2, then focus on 24.3 (Beautiful Soup API) and 24.5 (ethics). If you just want the pipeline pattern, jump to 24.8.

🔬 Deep Dive: The case studies explore Elena's fully automated report pipeline and a detailed ethical analysis of five scraping scenarios along a spectrum from clearly ethical to clearly problematic.

Spaced Review: This chapter builds directly on Chapter 21 (HTTP requests with requests, JSON/CSV processing) and Chapter 22 (regex for text extraction). You'll also use file I/O from Chapter 10 and module structure from Chapter 12. If requests.get() or re.findall() feel rusty, revisit those chapters before continuing.


24.1 Why Automate?

Here's a question worth sitting with: how much of your work is genuinely creative, and how much is mechanical repetition?

Elena's weekly report was valuable — the executive director relied on it. But the process of creating it was pure drudgery. The same clicks, the same copy-pastes, the same formulas, the same formatting, every single week. Nothing required human judgment except catching errors — and humans are terrible at catching errors in repetitive work. (Ask Elena about the column-swap incident.)

Consider the math. Elena's report took 4 hours per week. Over a 50-week working year, that's 200 hours — five full work weeks. Her Python script takes 30 seconds. Even accounting for the 8 hours she spent writing and debugging the script, she saves roughly 191.5 hours in the first year alone. Every subsequent year, she saves the full 200 hours.

But the real multiplier isn't just time. It's reliability. A script doesn't get tired on Friday afternoon. It doesn't accidentally swap two columns. It doesn't forget to check for negative numbers in the data. It runs the same validation checks every time, catches anomalies that a human would miss, and produces output in exactly the format stakeholders expect.

Here's what Elena's pipeline does now versus what she used to do manually:

Step                           Manual (Before)                          Automated (After)
-----------------------------  ---------------------------------------  ---------------------------------------
Download CSVs from 3 counties  Open 3 browser tabs, click Export, save  requests.get() in a loop
Check for errors               Eyeball spreadsheets                     Validation function with explicit rules
Merge and aggregate            Copy-paste into master sheet             Python dicts and loops
Calculate statistics           Write Excel formulas                     sum(), list comprehensions
Format report                  Copy into Word template                  String formatting, f-strings
Save and distribute            Save, attach, email                      pathlib write, automated naming
Total time                     ~4 hours                                 ~30 seconds

🚪 Threshold Concept: Automation as a Multiplier

Before this chapter: "Programming is about solving one problem at a time."

After this chapter: "Programming is about building machines that solve the same problem forever — while I move on to the next one."

This shift changes how you look at every repetitive task. When you catch yourself doing the same thing for the third time, a voice in your head will ask: "Could I automate this?" Often the answer is yes. Renaming 500 files? Script. Checking a website for updates? Script. Generating a weekly report from three data sources? Script. Sending reminder emails to your study group? Script.

More importantly, automated pipelines are reliable. They don't have bad days. They don't forget steps. They produce consistent, auditable results every run. In Elena's case, the column-swap error that embarrassed her at a board meeting (Chapter 1) is now structurally impossible — the script validates data before any calculations happen.

Once you see the world through the automation lens, you can't unsee it. Every manual process becomes a potential script. Every repetitive task becomes an opportunity to build something permanent. The cost of automation is fixed; the benefit compounds.


24.2 HTML Basics for Scraping

Web scraping means extracting data from web pages. Web pages are written in HTML (HyperText Markup Language) — a language of nested tags that tells your browser what to display. You don't need to become an HTML expert, but you do need to understand enough to tell your scraper where to find the data.

24.2.1 Tags, Attributes, and Content

HTML is built from elements, each defined by a tag:

<p class="description">Learn web scraping with Beautiful Soup.</p>

Let's break this down:

Part                   What It Is   What It Does
---------------------  -----------  ----------------------------
<p>                    Opening tag  Starts the paragraph element
class="description"    Attribute    Metadata about the element
Learn web scraping...  Content      The visible text
</p>                   Closing tag  Ends the paragraph element

Tags nest inside each other, forming a tree structure:

<div class="event" data-date="2025-09-15">
  <h2>Python Workshop</h2>
  <p class="description">Learn web scraping with Beautiful Soup.</p>
  <span class="location">Room 204, CS Building</span>
</div>

Here, the <div> is the parent element. The <h2>, <p>, and <span> inside it are children. This parent-child relationship is how HTML organizes information — and how your scraper will navigate to find the data you need.

24.2.2 The Tags You'll See Most Often

Tag                  Purpose                    Example
-------------------  -------------------------  ---------------------------------
<div>                Generic container          <div class="product">...</div>
<p>                  Paragraph                  <p>Description here</p>
<h1>-<h6>            Headings                   <h2>Product Name</h2>
<a>                  Link (href holds the URL)  <a href="/page2">Next</a>
<span>               Inline container           <span class="price">$12.99</span>
<ul>, <li>           List and list items        <ul><li>Item 1</li></ul>
<table>, <tr>, <td>  Table, row, cell           Tabular data

24.2.3 Classes and IDs

Two attributes are especially important for scraping:

  • class: A label shared by multiple elements. On a product page, every product might be inside a <div class="product">. You use classes to find all elements of a certain type.
  • id: A unique identifier. Only one element on the page should have a given ID. You use IDs to find one specific element.

24.2.4 HTML as a Tree

HTML's nested structure forms a tree:

html
├── head
│   └── title
└── body
    ├── div#results
    │   ├── div.product
    │   │   ├── h2.product-name  ->  "Widget A"
    │   │   └── span.price       ->  "$12.99"
    │   └── div.product
    │       ├── h2.product-name  ->  "Widget B"
    │       └── span.price       ->  "$8.49"

Beautiful Soup turns this tree into Python objects you can navigate — by tag name, by class, by ID, or by position in the tree.

💡 Tip: Most web browsers let you inspect a page's HTML structure. Right-click on any element and choose "Inspect" (Chrome, Edge) or "Inspect Element" (Firefox, Safari). The developer tools panel shows you the exact HTML tags, classes, and IDs — invaluable for figuring out what to target in your scraper.

🔗 Connection — Spaced Review (Ch 10): In Chapter 10, you learned to read files line by line and parse their structure. HTML parsing is conceptually similar — you're extracting structured data from a text format. The difference is that HTML's structure is a tree of nested elements, not flat lines.


24.3 Beautiful Soup: Parsing HTML

Beautiful Soup is a Python library that takes an HTML document and turns it into a navigable tree of Python objects. Instead of writing regex to extract data from raw HTML text (which is fragile, error-prone, and painful), you use Beautiful Soup to search the tree by tag name, class, ID, or CSS selector.

24.3.1 Installation and First Parse

Beautiful Soup is a third-party library. Install it the way you learned in Chapter 23:

pip install beautifulsoup4 requests

The package name on PyPI is beautifulsoup4, but you import it as bs4:

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Campus Events</title></head>
<body>
  <h1 class="page-title">Upcoming Events</h1>
  <div id="events">
    <div class="event" data-date="2025-09-15">
      <h2>Python Workshop</h2>
      <p class="description">Learn web scraping with Beautiful Soup.</p>
      <span class="location">Room 204, CS Building</span>
    </div>
    <div class="event" data-date="2025-09-18">
      <h2>Data Science Meetup</h2>
      <p class="description">Exploring pandas and matplotlib.</p>
      <span class="location">Auditorium B</span>
    </div>
  </div>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

Two things to notice:

  1. Always specify the parser. The second argument "html.parser" tells Beautiful Soup to use Python's built-in HTML parser. If you omit it, you'll get a GuessedAtParserWarning, and the behavior might differ across machines. Always be explicit.

  2. soup is now a tree. You can navigate it, search it, and extract data from it using Beautiful Soup's methods.

24.3.2 Finding Elements: find() and find_all()

The two methods you'll use most:

# find() — returns the FIRST matching element (or None)
title = soup.find("title")
print(title.text)  # "Campus Events"

heading = soup.find("h1", class_="page-title")
print(heading.text)  # "Upcoming Events"

# find_all() — returns a LIST of ALL matching elements
events = soup.find_all("div", class_="event")
print(len(events))  # 2

You can search by tag name, by attribute, or by both:

# By tag name only
all_h2s = soup.find_all("h2")

# By class (note the underscore: class_ because "class" is a Python keyword)
descriptions = soup.find_all("p", class_="description")

# By ID (unique — so find() makes more sense than find_all())
events_container = soup.find("div", id="events")

# By custom attribute
dated_events = soup.find_all("div", attrs={"data-date": True})

Once you have a Tag object, you can extract its content:

for event in soup.find_all("div", class_="event"):
    name = event.find("h2").text              # Get visible text
    desc = event.find("p", class_="description").text
    location = event.find("span", class_="location").text
    date = event["data-date"]                 # Get an attribute value

    print(f"{name} ({date}) -- {location}")
    print(f"  {desc}")

Output:

Python Workshop (2025-09-15) -- Room 204, CS Building
  Learn web scraping with Beautiful Soup.
Data Science Meetup (2025-09-18) -- Auditorium B
  Exploring pandas and matplotlib.

Key extraction methods at a glance:

Property/Method  Returns                                           Use When
---------------  ------------------------------------------------  --------------------------------------------------
tag.text         All text content (including nested elements)      You want the visible text
tag.string       Direct text content (only if the tag has exactly  Simple, single-text elements
                 one text child)
tag["attr"]      The attribute value                               You need href, src, data-*, etc.
tag.get("attr")  The attribute value, or None if missing           Safely accessing an attribute that might not exist
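The differences are easiest to see side by side. Here is a minimal sketch; the one-element snippet is invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet to contrast the four access patterns
snippet = ('<div class="event"><h2>Python Workshop</h2>'
           '<span class="location">Room 204</span></div>')
soup = BeautifulSoup(snippet, "html.parser")
div = soup.find("div")

print(div.text)               # "Python WorkshopRoom 204" -- all nested text, concatenated
print(div.find("h2").string)  # "Python Workshop" -- the tag's single text child
print(div["class"])           # ["event"] -- class is multi-valued, so a list
print(div.get("id"))          # None -- missing attribute, no KeyError
```

Note that tag.string on the outer <div> would return None, because the div has more than one child. That's one reason .text is usually the safer default.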

24.3.3 CSS Selectors: select() and select_one()

Beautiful Soup also supports CSS selectors — the same syntax used in web development to style elements. CSS selectors are often more concise for complex queries:

# By class
descriptions = soup.select(".description")

# By ID
events_div = soup.select_one("#events")

# Nested: spans inside elements with class "event"
locations = soup.select(".event .location")

# Tag with specific class
event_divs = soup.select("div.event")

# By attribute
dated = soup.select("div[data-date]")

# Attribute with specific value
sept15 = soup.select('div[data-date="2025-09-15"]')

Here's a quick reference for CSS selector syntax:

Selector      Meaning                            Example
------------  ---------------------------------  -----------------
.class        Elements with this class           ".price"
#id           Element with this ID               "#main"
parent child  Child (any depth) inside parent    ".product .price"
tag.class     Tag with this class                "div.item"
[attr]        Elements that have this attribute  "[data-id]"
[attr=val]    Attribute with specific value      '[type="email"]'

When to use find() vs. select()? Both work. find() and find_all() are Beautiful Soup's native API — slightly more verbose but readable, with keyword arguments for common attributes. select() uses CSS selector strings — more concise for complex queries, especially nested selections. Use whichever feels more natural for the task at hand.

📊 Decision Guide: find() vs. select()

Use find()/find_all() when...                 Use select()/select_one() when...
--------------------------------------------  --------------------------------------------
You know the tag name and one attribute       You need nested selectors (".parent .child")
You want to pass keyword arguments            You're comfortable with CSS syntax
You're searching for a specific ID            You want concise multi-criteria queries
You want maximum readability for Python devs  You want maximum readability for web devs
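Both styles can express the same query. A short sketch against markup like the event page above (the snippet is reconstructed here so the example stands alone):

```python
from bs4 import BeautifulSoup

html = """
<div class="event"><span class="location">Room 204, CS Building</span></div>
<div class="event"><span class="location">Auditorium B</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Native API: keyword arguments, one level at a time
native = [div.find("span", class_="location").text
          for div in soup.find_all("div", class_="event")]

# CSS selectors: one nested query string
css = [span.text for span in soup.select(".event .location")]

print(native == css)  # True -- same elements, different query styles
```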

🔄 Check Your Understanding 1: Given the HTML above, write Beautiful Soup code to extract just the location text from the second event. Do it two ways: once with find_all() and once with select(). Which approach do you find more readable?


24.4 Building a Web Scraper

Let's build a real scraper. We'll use http://quotes.toscrape.com — a website created specifically for practicing web scraping. It's designed for educational use, so we don't need to worry about overloading a production server.

24.4.1 Step 1: Fetch the Page

We need requests (from Chapter 21) to download the HTML, and Beautiful Soup to parse it:

import time
import requests
from bs4 import BeautifulSoup

def fetch_page(url: str, delay: float = 1.0) -> BeautifulSoup | None:
    """Fetch a web page and return a BeautifulSoup object.

    Args:
        url: The URL to fetch.
        delay: Seconds to wait before making the request (rate limiting).

    Returns:
        A BeautifulSoup object, or None if the request failed.
    """
    time.sleep(delay)  # Be polite -- wait before requesting

    headers = {
        "User-Agent": "CS1-Textbook-Example/1.0 (Educational purposes)"
    }

    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return BeautifulSoup(response.text, "html.parser")
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

Three essentials every scraper should include:

  1. time.sleep(delay) -- Rate limiting. Don't hammer the server (Section 24.5).
  2. A descriptive User-Agent header -- The default python-requests/2.x is often blocked. Transparency is both polite and practical.
  3. Error handling -- Networks fail, servers go down. try/except and timeout=10 ensure graceful failure.

24.4.2 Step 2: Parse and Extract

Now we need to understand the page's HTML structure. By inspecting quotes.toscrape.com in a browser (right-click, "Inspect"), we discover that each quote lives inside a <div class="quote">, with the text in <span class="text">, the author in <small class="author">, and tags inside <a class="tag"> elements.

def scrape_quotes(soup: BeautifulSoup) -> list[dict]:
    """Extract quotes, authors, and tags from a page."""
    quotes = []

    for quote_div in soup.select(".quote"):
        text = quote_div.select_one(".text").text
        text = text.strip("\u201c\u201d")  # Remove curly quote characters

        author = quote_div.select_one(".author").text

        tags = [tag.text for tag in quote_div.select(".tag")]

        quotes.append({
            "text": text,
            "author": author,
            "tags": tags,
        })

    return quotes

24.4.3 Step 3: Handle Pagination

Most websites split content across multiple pages. The quotes site has a "Next" button that links to the next page. We can follow it:

def get_next_page_url(soup: BeautifulSoup, base_url: str) -> str | None:
    """Find the URL of the next page, if it exists."""
    next_link = soup.select_one("li.next a")
    if next_link:
        return base_url + next_link["href"]
    return None
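String concatenation works here because this site's Next link is always a root-relative path like /page/2/. For sites that mix relative and absolute links, urllib.parse.urljoin from the standard library is a more defensive choice (shown as an alternative, not a change to the scraper above):

```python
from urllib.parse import urljoin

# urljoin resolves an href against the page it appeared on,
# whether the href is root-relative, relative, or absolute
print(urljoin("http://quotes.toscrape.com/page/2/", "/page/3/"))
# http://quotes.toscrape.com/page/3/

print(urljoin("http://quotes.toscrape.com/page/2/", "http://example.com/other"))
# http://example.com/other
```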

24.4.4 Step 4: Put It All Together

def scrape_all_quotes(max_pages: int = 3) -> list[dict]:
    """Scrape quotes from multiple pages."""
    base_url = "http://quotes.toscrape.com"
    current_url = base_url
    all_quotes = []
    page_num = 1

    while current_url and page_num <= max_pages:
        print(f"Scraping page {page_num}: {current_url}")

        soup = fetch_page(current_url)
        if soup is None:
            print("  Failed to fetch page. Stopping.")
            break

        page_quotes = scrape_quotes(soup)
        all_quotes.extend(page_quotes)
        print(f"  Found {len(page_quotes)} quotes "
              f"(total: {len(all_quotes)})")

        current_url = get_next_page_url(soup, base_url)
        page_num += 1

    return all_quotes


# --- Run it ---
if __name__ == "__main__":
    quotes = scrape_all_quotes(max_pages=3)
    for q in quotes[:3]:
        print(f'"{q["text"]}" -- {q["author"]}')
        print(f"  Tags: {', '.join(q['tags'])}\n")

This scraper respects the server (1-second delays between requests), handles errors (try/except around network calls), stops after a configurable number of pages (no infinite crawling), and produces clean, structured data (list of dictionaries).

Notice the max_pages parameter. This is deliberate self-limitation. Even on a practice site, there's no reason to scrape every page when three pages demonstrate the concept. On real sites, this kind of restraint is both polite and practical.

✅ Best Practice: Defensive Scraping

Real websites change their HTML structure without warning. Selectors that work today might break tomorrow. Build your scrapers defensively:

  • Check for None before accessing .text or attributes
  • Wrap extraction in try/except blocks
  • Log what you couldn't parse instead of crashing
  • Validate extracted data (are prices actually numbers? are dates valid?)

🔗 Connection -- Spaced Review (Ch 21): Notice how similar this is to the API pattern from Chapter 21 -- make a request, check for errors, extract data from the response. The difference is that APIs return structured JSON, while web pages return HTML that we have to parse. Same pattern, different data format.

🔄 Check Your Understanding 2: The scraper above uses max_pages=3 to limit how many pages it fetches. Why is this important? What could happen if you removed this limit and the site had 10,000 pages?


24.5 The Ethics of Scraping

Just because you can scrape a website doesn't mean you should. Web scraping sits in an ethical and legal gray area, and responsible practitioners follow a set of norms. This section isn't optional -- it's arguably the most important part of this chapter.

24.5.1 Check robots.txt First

The robots.txt file at the root of a website tells automated crawlers which parts of the site are off-limits. It's a voluntary standard -- nothing technically prevents you from ignoring it -- but ethical scrapers always respect it. Here's a typical one:

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /api/internal/
Crawl-delay: 5

This says: all crawlers (User-agent: *) should avoid /admin/, /private/, and /api/internal/. The Crawl-delay: 5 means you should wait at least 5 seconds between requests.

You can check robots.txt programmatically:

import requests

response = requests.get("http://quotes.toscrape.com/robots.txt")
print(response.text)
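Beyond printing the file, the standard library's urllib.robotparser can evaluate its rules for you. A sketch using the example rules from above (the bot name and URLs are illustrative):

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt, parsed directly from a string. Against a
# live site, pass the robots.txt URL to set_url() and call read().
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("MyBot/1.0", "http://example.com/products/"))    # True
print(rp.can_fetch("MyBot/1.0", "http://example.com/admin/users"))  # False
print(rp.crawl_delay("MyBot/1.0"))                                  # 5
```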

Always respect robots.txt. If a path is disallowed, don't scrape it. Period.

⚠️ Important: robots.txt is a voluntary standard, not a technical barrier. Your scraper can ignore it -- but it shouldn't. Ignoring robots.txt is like ignoring a "Do Not Enter" sign on an unlocked door. You can walk in, but you're violating the owner's stated wishes.

24.5.2 Rate Limiting: Don't Be a Bully

A web server is a shared resource. When you send 100 requests per second, you're consuming server capacity that real users need. Aggressive scraping can slow down a website -- or crash it entirely for a small site.

Rule of thumb: add at least time.sleep(1) between requests. If robots.txt specifies a Crawl-delay, use that instead. When in doubt, scrape slower.

import time

for url in urls_to_scrape:
    time.sleep(2)  # Wait 2 seconds between requests
    response = requests.get(url, headers=headers, timeout=10)
    # process response...

Guidelines for responsible rate limiting:

  • Wait at least 1 second between requests to the same server
  • Respect Crawl-delay in robots.txt if specified
  • Back off if you receive 429 (Too Many Requests) or 503 (Service Unavailable) status codes
  • Consider the site's size -- a personal blog can handle far less traffic than Amazon
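The "back off" guideline can be made concrete with a small retry helper. This is a sketch of one common policy; the function name and the doubling schedule are choices of this example, not a requests feature:

```python
import time
import requests

def fetch_with_backoff(url: str, headers: dict, max_retries: int = 3,
                       base_delay: float = 2.0) -> requests.Response:
    """Retry with growing delays while the server signals overload."""
    delay = base_delay
    response = requests.get(url, headers=headers, timeout=10)
    for _ in range(max_retries):
        if response.status_code not in (429, 503):
            break  # success, or a failure that retrying won't fix
        print(f"Server busy (HTTP {response.status_code}); "
              f"waiting {delay:.0f}s before retrying")
        time.sleep(delay)
        delay *= 2  # exponential backoff: 2s, 4s, 8s, ...
        response = requests.get(url, headers=headers, timeout=10)
    return response
```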

24.5.3 Terms of Service

Many websites explicitly address scraping in their Terms of Service. Read the ToS before you scrape. If they prohibit scraping, your options are: don't scrape, use their API instead, or contact the site owner for permission.

24.5.4 The Legal Landscape

The legal landscape is evolving. The US case hiQ Labs v. LinkedIn (2019) established that scraping publicly available data generally doesn't violate the Computer Fraud and Abuse Act, but ToS violations can still create contract liability. The EU's GDPR and California's CCPA regulate collection of personal data even when publicly visible. Scraping behind login walls or circumventing access controls is much riskier.

The safe path: scrape public data, respect robots.txt and ToS, rate-limit aggressively, and use APIs when available.

24.5.5 The Ethical Scraping Checklist

Before you scrape any website, work through this checklist:

Step  Question                                    If "No"
----  ------------------------------------------  ----------------------------------
1     Is there an API?                            Use the API instead
2     Does robots.txt allow it?                   Don't scrape those paths
3     Do the Terms of Service permit it?          Don't scrape, or contact the owner
4     Am I rate-limiting responsibly?             Add time.sleep() between requests
5     Am I identifying myself with a User-Agent?  Add a descriptive header
6     Is the data public (not behind a login)?    Don't scrape gated content
7     Would I be comfortable if the site owner    If not, reconsider
      saw my code?

⚖️ Ethical Analysis: "It's technically public" is not the same as "it's ethical to bulk-collect." Aggregating personal data -- even from public profiles -- can cause real harm. Consider the reasonable expectations of the people whose data you're collecting, not just what's technically accessible. See Case Study 2 for five scenarios that explore this boundary in depth.

🔄 Check Your Understanding 3: A website's robots.txt allows scraping of /products/ but its Terms of Service say "no automated access." What should you do? Does robots.txt permission override the Terms of Service?


24.6 Automation Beyond Scraping

Web scraping is one form of automation, but Python can automate virtually any repetitive computer task. Let's look at some of the most common patterns.

24.6.1 File Organization

Your Downloads folder is a mess. Here's a script that sorts files into subdirectories by type:

import shutil
from pathlib import Path

def organize_downloads(download_dir: Path) -> dict[str, int]:
    """Sort files into subdirectories by extension."""
    extension_map = {
        ".pdf": "documents", ".docx": "documents", ".txt": "documents",
        ".jpg": "images", ".png": "images", ".gif": "images",
        ".mp3": "audio", ".wav": "audio",
        ".py": "code", ".js": "code", ".html": "code",
        ".csv": "data", ".json": "data", ".xlsx": "data",
    }

    moved_counts: dict[str, int] = {}

    for file_path in download_dir.iterdir():
        if file_path.is_file():
            ext = file_path.suffix.lower()
            folder_name = extension_map.get(ext, "other")

            dest_dir = download_dir / folder_name
            dest_dir.mkdir(exist_ok=True)

            shutil.move(str(file_path), str(dest_dir / file_path.name))
            moved_counts[folder_name] = moved_counts.get(folder_name, 0) + 1

    return moved_counts

24.6.2 Batch File Renaming

You took 200 photos named IMG_4382.jpg, IMG_4383.jpg, etc. You want rome_001.jpg, rome_002.jpg:

from pathlib import Path

def batch_rename(directory: Path, prefix: str, dry_run: bool = True):
    """Rename files with a consistent numbering scheme."""
    files = sorted(f for f in directory.iterdir() if f.is_file())

    for i, file_path in enumerate(files, start=1):
        new_name = f"{prefix}_{i:03d}{file_path.suffix}"

        if dry_run:
            print(f"  [DRY RUN] {file_path.name} -> {new_name}")
        else:
            file_path.rename(file_path.parent / new_name)
            print(f"  Renamed: {file_path.name} -> {new_name}")

Notice the dry_run parameter -- a professional practice that lets you preview what will happen before committing to irreversible file operations.

24.6.3 Automated Report Generation

Reports from any data source follow the same pattern: load data, compute statistics, format output, save to file.

from datetime import datetime
from pathlib import Path

def generate_directory_report(target_dir: Path) -> str:
    """Generate a summary report of a directory's contents."""
    file_count = 0
    total_size = 0
    extension_counts: dict[str, int] = {}

    for item in target_dir.rglob("*"):
        if item.is_file():
            file_count += 1
            total_size += item.stat().st_size
            ext = item.suffix.lower() or "(no extension)"
            extension_counts[ext] = extension_counts.get(ext, 0) + 1

    lines = [
        f"Directory Report: {target_dir}",
        f"Generated: {datetime.now():%Y-%m-%d %H:%M:%S}",
        "=" * 50,
        f"Total files: {file_count:,}",
        f"Total size: {total_size / 1024:.1f} KB",
        "",
        "By file type:",
    ]
    for ext, count in sorted(
        extension_counts.items(), key=lambda x: x[1], reverse=True
    ):
        lines.append(f"  {ext:15s} {count:>5} files")

    return "\n".join(lines)

🔗 Connection -- Spaced Review (Ch 10): The file I/O patterns here -- Path.mkdir(), Path.iterdir(), Path.rglob(), Path.stat() -- are the same pathlib tools you learned in Chapter 10. Automation is rarely about learning new tools; it's about combining familiar tools into workflows that save real time.


24.7 Scheduling Scripts

Writing an automation script is great. Having it run automatically -- without you remembering to execute it -- is even better. There are three main approaches.

24.7.1 cron (Linux/macOS)

cron is the standard Unix job scheduler. You define when to run a command using a five-field syntax:

# +------------ minute (0-59)
# | +---------- hour (0-23)
# | | +-------- day of month (1-31)
# | | | +------ month (1-12)
# | | | | +---- day of week (0-7, 0 and 7 are Sunday)
# | | | | |
# * * * * *  command

To edit your cron schedule:

crontab -e

Common examples:

# Every day at 8:00 AM
0 8 * * * python3 /home/elena/reports/daily_report.py

# Every Friday at 6:00 AM
0 6 * * 5 python3 /home/elena/reports/weekly_pipeline.py

# Every 30 minutes
*/30 * * * * python3 /home/elena/scripts/check_updates.py

# First day of every month at midnight
0 0 1 * * python3 /home/elena/scripts/monthly_cleanup.py

💡 Tip: The website crontab.guru lets you type a cron expression and see a plain-English explanation of when it will run. Bookmark it.

Two important details: use full paths for both Python and your script (cron doesn't know your PATH or virtual environment), and redirect output to a log file (>> /path/to/log.txt 2>&1) so you can debug failures.
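Putting both details together, a complete crontab entry for Elena's Friday job might look like this (the interpreter and file paths are illustrative; yours will differ):

```
0 6 * * 5 /usr/bin/python3 /home/elena/reports/weekly_pipeline.py >> /home/elena/reports/cron.log 2>&1
```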

24.7.2 Task Scheduler (Windows)

Windows has a GUI-based Task Scheduler and a command-line tool called schtasks:

schtasks /create /tn "Weekly Report" /tr "python C:\Users\Elena\reports\weekly_pipeline.py" /sc weekly /d FRI /st 06:00

24.7.3 The schedule Library (Cross-Platform)

If you want scheduling that's simpler than cron and works on any operating system, the schedule library provides a clean Python API:

pip install schedule

import schedule
import time

def daily_report():
    print("Generating daily report...")
    # your report logic here

def check_for_updates():
    print("Checking for updates...")
    # your update logic here

schedule.every().day.at("08:00").do(daily_report)
schedule.every(30).minutes.do(check_for_updates)
schedule.every().friday.at("06:00").do(daily_report)

# This loop must keep running for the schedule to work
while True:
    schedule.run_pending()
    time.sleep(60)  # Check every minute

The trade-off: schedule is simpler to set up, but it requires your Python process to stay running. If the process crashes or the computer restarts, the schedule stops. For production automation, cron or Task Scheduler is more robust because they're managed by the operating system.

🔄 Check Your Understanding 4: Elena wants her report to run every Friday at 6 AM on her macOS laptop. Write the cron expression. What happens if her laptop is turned off at 6 AM on Friday?


24.8 A Complete Automation Pipeline

Let's put everything together. The general pattern for any automation project is Fetch, Parse, Transform, Output -- four stages, each with a clear responsibility.

+-----------+    +-----------+    +-----------+    +-----------+
|   FETCH   | -> |   PARSE   | -> | TRANSFORM | -> |  OUTPUT   |
|           |    |           |    |           |    |           |
| Download  |    | Extract   |    | Clean,    |    | Save to   |
| HTML, CSV |    | data from |    | compute,  |    | CSV/JSON, |
| or API    |    | raw HTML  |    | validate  |    | report    |
+-----------+    +-----------+    +-----------+    +-----------+

Here's a simplified version of Elena's pipeline. Each stage is a function with one job; each function's output feeds the next:

"""Elena's automated report pipeline -- simplified."""

import csv
import json
import time
from datetime import datetime
from pathlib import Path


def fetch_data(csv_text: str) -> list[dict]:
    """Stage 1: Parse CSV into records."""
    reader = csv.DictReader(csv_text.strip().splitlines())
    return [
        {"county": row["county"], "service": row["service"],
         "clients": int(row["clients_served"]),
         "hours": int(row["hours_delivered"])}
        for row in reader
    ]


def validate(records: list[dict]) -> tuple[list[dict], list[str]]:
    """Stage 2: Separate clean records from errors."""
    clean, errors = [], []
    for r in records:
        if r["clients"] < 0:
            errors.append(f"Negative clients: {r}")
        elif r["hours"] < 0:
            errors.append(f"Negative hours: {r}")
        else:
            clean.append(r)
    return clean, errors


def summarize(records: list[dict]) -> dict:
    """Stage 3: Compute summary statistics."""
    by_county: dict[str, dict] = {}
    for r in records:
        county = r["county"]
        if county not in by_county:
            by_county[county] = {"clients": 0, "hours": 0}
        by_county[county]["clients"] += r["clients"]
        by_county[county]["hours"] += r["hours"]

    return {
        "generated": datetime.now().isoformat(),
        "total_clients": sum(r["clients"] for r in records),
        "total_hours": sum(r["hours"] for r in records),
        "by_county": by_county,
    }


def save_report(summary: dict, output_dir: Path) -> Path:
    """Stage 4: Generate and save formatted report."""
    output_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    lines = [
        "=" * 55,
        "  HARBOR COMMUNITY SERVICES -- IMPACT REPORT",
        f"  Generated: {summary['generated'][:19]}",
        f"  Clients: {summary['total_clients']:,}",
        f"  Hours:   {summary['total_hours']:,}",
        "",
    ]
    for county, data in sorted(summary["by_county"].items()):
        lines.append(f"  {county:15s} {data['clients']:>5,} clients"
                     f"  {data['hours']:>5,} hours")
    lines.append("=" * 55)

    path = output_dir / f"report_{timestamp}.txt"
    path.write_text("\n".join(lines), encoding="utf-8")
    return path


def run_pipeline():
    """Execute the full pipeline."""
    start = time.time()
    records = fetch_data(SAMPLE_DATA)  # SAMPLE_DATA: CSV text, defined in the full version
    clean, errors = validate(records)
    summary = summarize(clean)
    path = save_report(summary, Path("reports"))
    elapsed = time.time() - start
    print(f"Report saved: {path} ({elapsed:.2f}s)")
    print("Elena's old process: ~4 hours.")

The full version with sample data is in code/automation_pipeline.py; Elena's complete five-stage pipeline is in Case Study 1.

Why this structure matters. Each stage has one job. If the data format changes, you modify fetch_data(). If validation rules change, you modify validate(). If the report format changes, you modify save_report(). The stages are independent, testable, and replaceable -- the same design principles you learned in Chapter 6 (functions) and Chapter 12 (modules).

This architecture gives you testability (test each stage independently with known data), debuggability (you know which stage failed), and extensibility (add a new data source to Stage 1 or a new output format to Stage 4 without changing other stages).
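Because each stage is a plain function, you can exercise it with hand-built data -- no network, no files. A minimal sketch of testing the validation stage in isolation (the validate() definition is repeated here so the snippet runs on its own):

```python
def validate(records):
    """Separate clean records from errors (same logic as the pipeline stage)."""
    clean, errors = [], []
    for r in records:
        if r["clients"] < 0:
            errors.append(f"Negative clients: {r}")
        elif r["hours"] < 0:
            errors.append(f"Negative hours: {r}")
        else:
            clean.append(r)
    return clean, errors

# Hand-built records: one good, one deliberately bad.
records = [
    {"county": "Harbor", "service": "meals", "clients": 42, "hours": 10},
    {"county": "Harbor", "service": "rides", "clients": -1, "hours": 5},
]
clean, errors = validate(records)
print(len(clean), len(errors))  # 1 1
```

If a structure change upstream breaks validation, a tiny check like this catches it before the report ever reaches stakeholders.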

🔗 Connection -- Spaced Review (Ch 12): This pipeline is the module pattern from Chapter 12 applied to data processing. Each stage could live in its own module: fetch.py, validate.py, transform.py, output.py.

🔗 Connection -- Spaced Review (Ch 18): Scraping a website with nested categories is naturally recursive -- follow links to sub-pages, which have their own links, until a base case (maximum depth or no more links). This is tree traversal from Chapter 18.
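A minimal sketch of that traversal, using a fake in-memory "site" (a dict mapping pages to their links) so the recursive logic runs without any network access -- swapping in real link extraction with requests and Beautiful Soup is the only change needed:

```python
# Fake site: page -> list of linked pages (illustrative data).
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": [],
    "/a/1": [],
}

def crawl(page, visited=None, depth=0, max_depth=2):
    """Visit a page, then recursively visit its links (tree traversal)."""
    if visited is None:
        visited = set()
    if depth > max_depth or page in visited:  # base cases
        return visited
    visited.add(page)
    for link in FAKE_SITE.get(page, []):
        crawl(link, visited, depth + 1, max_depth)
    return visited

print(sorted(crawl("/")))  # ['/', '/a', '/a/1', '/b']
```

The visited set prevents infinite loops on pages that link back to each other, and max_depth bounds how far the recursion descends -- both essential on real sites.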


24.9 When NOT to Scrape

Knowing when not to scrape is as important as knowing how. Here's when you should reach for something else.

24.9.1 Use an API Instead

If a website offers an API, use it. APIs return clean JSON with documented fields and versioned endpoints. Scraped HTML depends on the page's visual structure, which can change any time the site redesigns. Scraping when an API exists is like climbing through a window when the front door is open.
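The contrast is concrete. Compare pulling one number out of a hypothetical API response versus the equivalent HTML (both strings are made up for illustration):

```python
import json

# The same figure, as an API response vs. scraped HTML.
api_response = '{"county": "Harbor", "clients_served": 412}'
html = '<div class="stats"><span class="clients">412</span></div>'

# API: one documented field away.
print(json.loads(api_response)["clients_served"])  # 412

# HTML: tied to class names and markup that can change at any time.
start = html.index('class="clients">') + len('class="clients">')
end = html.index("</span>", start)
print(int(html[start:end]))  # 412
```

The JSON access survives a site redesign; the HTML version breaks the moment the class name or markup changes.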

24.9.2 Other Situations to Avoid

| Situation | Better Alternative |
|---|---|
| Site offers an API | Use the API |
| robots.txt disallows it | Respect their wishes |
| Terms of Service prohibit it | Find another source |
| Data is behind a login/paywall | Request API access |
| Data contains personal information | Obtain consent or use anonymized data |
| JavaScript-rendered pages | Selenium, Playwright, or check for an API |
| Small site that can't handle the load | Contact the owner directly |

If a page's content is rendered by JavaScript, Beautiful Soup won't see it -- it only gets the raw HTML before JavaScript runs. Before reaching for a headless browser like Selenium, check the browser's Network tab for XHR/fetch requests. Often the data is loaded from a JSON endpoint you can query directly.

💡 Intuition: If you have to work hard to get the data -- evading bot detection, rotating proxies, solving CAPTCHAs -- that's the website telling you "no."


24.10 Project Checkpoint: TaskFlow v2.3

It's time to add automation features to TaskFlow. Building on v2.2 (Chapter 23, which added a virtual environment and rich terminal output), we'll add two new capabilities:

  1. Daily motivational quote scraper -- Fetch a random quote from quotes.toscrape.com and display it when the app starts, with caching so we don't re-scrape unnecessarily.
  2. Automated task report generation -- Generate a formatted status report of all tasks and save it to a timestamped file.

24.10.1 Feature 1: Daily Quote Scraper

The quote scraper demonstrates every pattern from this chapter in miniature: fetch a page, parse HTML, extract data, handle errors, and cache the result to avoid unnecessary requests.

import json
import requests
from bs4 import BeautifulSoup
from datetime import datetime
from pathlib import Path

QUOTE_CACHE = Path("taskflow_quote_cache.json")


def fetch_daily_quote() -> dict | None:
    """Scrape a random quote, with daily caching."""
    # Check cache -- don't re-scrape if we have today's quote
    if QUOTE_CACHE.exists():
        try:
            cache = json.loads(QUOTE_CACHE.read_text(encoding="utf-8"))
            if cache.get("date") == datetime.now().strftime("%Y-%m-%d"):
                return cache.get("quote")
        except (json.JSONDecodeError, KeyError):
            pass

    # Fetch a new quote
    try:
        headers = {"User-Agent": "TaskFlow/2.3 (Educational project)"}
        response = requests.get(
            "http://quotes.toscrape.com/random",
            headers=headers, timeout=10,
        )
        response.raise_for_status()

        soup = BeautifulSoup(response.text, "html.parser")
        quote_div = soup.select_one(".quote")

        if quote_div:
            text = quote_div.select_one(".text").text.strip("\u201c\u201d")
            author = quote_div.select_one(".author").text
            quote = {"text": text, "author": author}

            # Cache for today
            QUOTE_CACHE.write_text(json.dumps(
                {"date": datetime.now().strftime("%Y-%m-%d"), "quote": quote},
                indent=2), encoding="utf-8")
            return quote

    except requests.exceptions.RequestException as e:
        print(f"  Could not fetch quote: {e}")

    return None  # Caller displays a fallback offline quote

The caching pattern is worth noting: check the cache before making a network request, and if the network is down, degrade gracefully rather than crashing. The app works with or without internet.
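The same cache-or-fetch shape works for any daily value, not just quotes. A stdlib-only sketch with a stand-in fetch function instead of a real network call (cached_daily and fake_fetch are illustrative names, not TaskFlow's actual API):

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path

def cached_daily(cache_path: Path, fetch):
    """Return today's cached value, or call fetch() and cache the result."""
    today = datetime.now().strftime("%Y-%m-%d")
    if cache_path.exists():
        try:
            cache = json.loads(cache_path.read_text(encoding="utf-8"))
            if cache.get("date") == today:
                return cache["value"]
        except (json.JSONDecodeError, KeyError):
            pass  # corrupt cache -- fall through and re-fetch
    value = fetch()
    cache_path.write_text(json.dumps({"date": today, "value": value}),
                          encoding="utf-8")
    return value

calls = []
def fake_fetch():
    calls.append(1)  # stand-in for a real network request
    return "hello"

cache = Path(tempfile.mkdtemp()) / "cache.json"
print(cached_daily(cache, fake_fetch))  # hello
print(cached_daily(cache, fake_fetch))  # hello
print(len(calls))  # 1 -- the second call hit the cache
```

Passing the fetch step as a callable keeps the caching logic testable without touching the network -- the same separation of concerns as the pipeline stages.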

24.10.2 Feature 2: Automated Task Report

The report generator follows the transform-output pattern from Section 24.8 -- compute summary statistics from the task list, format a text report, and save it with a timestamped filename:

def generate_task_report(tasks: list[dict]) -> str:
    """Generate a formatted task status report."""
    lines = [
        "=" * 55,
        "  TASKFLOW -- TASK STATUS REPORT",
        f"  Generated: {datetime.now():%Y-%m-%d %H:%M:%S}",
        f"  Total tasks: {len(tasks)}",
        "=" * 55, "",
    ]

    for priority in ["high", "medium", "low"]:
        count = sum(1 for t in tasks if t.get("priority") == priority)
        lines.append(f"  {priority:8s}  {count:>3}  {'#' * count}")
    lines.append("")

    for i, task in enumerate(tasks, 1):
        status = "[x]" if task.get("status") == "done" else "[ ]"
        lines.append(f"  {i:>3}. {status} {task.get('name', 'Untitled')}"
                     f" ({task.get('priority', 'medium')})")

    return "\n".join(lines)

The menu integration is straightforward: display a daily quote on startup (with an offline fallback), and add a "Generate report" option that calls generate_task_report(), prints the result, and saves it. The complete code is in code/project-checkpoint.py.

24.10.3 What Changed from v2.2 to v2.3

| Aspect | v2.2 (Chapter 23) | v2.3 (This chapter) |
|---|---|---|
| External data | None | Quote scraper (Beautiful Soup + requests) |
| Caching | None | Daily quote cached to JSON |
| Report output | Screen only | Saved to timestamped text files |
| New dependencies | rich | requests, beautifulsoup4 |
| Offline handling | N/A | Fallback quotes when network unavailable |

🔗 Spaced Review (Ch 12): TaskFlow's modular structure from v1.1 pays off again. The quote scraper and report generator are self-contained functions that slot into the existing codebase without modifying the core task management logic. Good module boundaries make features additive, not invasive.

🔗 Spaced Review (Ch 22): In v2.1, you added regex-powered search to TaskFlow. If you wanted to validate scraped data -- say, checking that a quote's author name contains only expected characters -- regex would be the natural tool. Combine re.fullmatch(r"[A-Za-z .'-]+", author) with the Beautiful Soup extraction for robust data validation.
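That validation idea works standalone. A small sketch, assuming the character class suggested above is the policy you want (valid_author is an illustrative name):

```python
import re

def valid_author(name: str) -> bool:
    """True if the name uses only letters, spaces, periods, apostrophes, hyphens."""
    return re.fullmatch(r"[A-Za-z .'-]+", name) is not None

print(valid_author("Ursula K. Le Guin"))  # True
print(valid_author("J.R.R. Tolkien"))     # True
print(valid_author("Author<script>"))     # False
```

re.fullmatch (rather than re.search) ensures the entire string matches -- a stray character anywhere rejects the whole name.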


24.11 Debugging Walkthrough

🐛 Debugging Walkthrough: Common Scraping Errors

Symptom: GuessedAtParserWarning: No parser was explicitly specified.
Cause: You wrote BeautifulSoup(html) without specifying a parser.
Fix: Always pass the parser explicitly: BeautifulSoup(html, "html.parser"). This makes your code behave consistently across machines, whatever parsers happen to be installed.

Symptom: Your scraper returns empty results, even though you can see the data in the browser.
Cause 1: The page uses JavaScript to load content dynamically. requests.get() fetches the initial HTML, but JavaScript-rendered content isn't in the raw source.
Fix: Check "View Page Source" (not "Inspect Element" -- they show different things). If the data isn't in the source, you need Selenium or Playwright, or check whether the site has an API. Look in the browser's Network tab for XHR/fetch requests -- the data might be loaded from a JSON endpoint you can query directly.
Cause 2: Your CSS selector doesn't match. Maybe the site uses class="product-card" and you're selecting ".product_card" (underscore instead of hyphen).
Fix: Copy the exact class name from the browser's Inspect panel. Test your selector on a small HTML snippet first.

Symptom: AttributeError: 'NoneType' object has no attribute 'text'
Cause: find() or select_one() returned None because no element matched, and you tried to call .text on it.
Fix: Always check for None before accessing attributes:

element = soup.select_one(".price")
if element:
    price = element.text
else:
    price = "N/A"

Or use the walrus operator for concise conditional extraction:

price = el.text if (el := soup.select_one(".price")) else "N/A"

Symptom: ConnectionError or Timeout when fetching pages.
Cause: Network issues, or the server is blocking your requests (no User-Agent, too many requests too fast).
Fix: Add a descriptive User-Agent header, increase the time.sleep() between requests, pass timeout=10 to requests.get(), and wrap the request in try/except.
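For transient network failures, a retry with exponential backoff often beats a single try/except. A sketch with a stand-in flaky fetch function (fetch_with_retry and flaky are illustrative names; requests' exceptions ultimately subclass IOError/OSError, so the same except clause would catch them):

```python
import time

def fetch_with_retry(fetch, max_attempts=3, base_delay=1.0):
    """Call fetch(); on network errors, wait and retry with backoff."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except OSError as e:
            if attempt == max_attempts - 1:
                raise  # out of attempts -- let the caller handle it
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {delay}s")
            time.sleep(delay)

attempts = []
def flaky():
    """Stand-in for a network call that fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError("connection reset")
    return "ok"

print(fetch_with_retry(flaky, base_delay=0.01))  # ok
```

Doubling the delay on each attempt also makes your scraper politer: a struggling server gets progressively more breathing room instead of a hammering loop.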

Symptom: The scraper works for a few days, then suddenly breaks.
Cause: The website redesigned its HTML structure -- CSS class names changed or elements were rearranged.
Fix: This is inherent to scraping. Document your assumptions, add validation to detect structure changes, and consider using an API if available. Scrapers require ongoing maintenance.
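One cheap way to detect structure changes is a sanity check right after parsing that fails loudly when expected fields go missing, instead of letting bad data flow downstream. A sketch (sanity_check and the field names are illustrative):

```python
def sanity_check(records):
    """Fail loudly if the scraper's structural assumptions no longer hold."""
    if not records:
        raise ValueError("no records scraped -- did the site's HTML change?")
    required = {"text", "author"}
    for r in records:
        missing = required - r.keys()
        if missing:
            raise ValueError(f"record missing {missing}: {r}")

sanity_check([{"text": "To be...", "author": "Shakespeare"}])  # passes silently
try:
    sanity_check([{"text": "orphaned quote"}])  # author field vanished
except ValueError as e:
    print(f"caught: {e}")
```

A loud failure the day the site changes is far better than weeks of silently empty reports.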


Chapter Summary

This chapter covered two of the most practical skills in programming: web scraping and automation.

You learned to parse HTML with Beautiful Soup -- searching by tag, class, ID, or CSS selector to extract structured data from web pages. You built a complete web scraper with error handling, rate limiting, and pagination. You confronted the ethics of scraping -- robots.txt, Terms of Service, and the question of when scraping is responsible. You explored automation beyond scraping -- file organization, batch renaming, and report generation. You learned to schedule scripts with cron, Task Scheduler, and the schedule library. And you saw the pipeline pattern: Fetch, Parse, Transform, Output.

The threshold concept -- automation as a multiplier -- is this: every hour spent writing an automation script pays for itself many times over. Elena spent 8 hours building her pipeline. It saves her 200 hours per year. But more than speed, the script is reliable. It doesn't get tired, doesn't make transcription errors, and doesn't forget to validate the data. The best automation isn't just faster -- it's better.


Spaced Review

Concepts from earlier chapters that appeared in this chapter:

| Concept | Original Chapter | How It Appeared Here |
|---|---|---|
| File I/O with pathlib | Ch 10 | Saving reports, organizing files, Path.iterdir() |
| Error handling with try/except | Ch 11 | Wrapping network requests, handling None results |
| Module organization | Ch 12 | Pipeline stages as separate functions |
| Recursion | Ch 18 | Recursive site crawling pattern (mentioned) |
| HTTP requests with requests | Ch 21 | Fetching web pages, response.raise_for_status() |
| CSV and JSON processing | Ch 21 | Saving and loading pipeline data |
| Regular expressions | Ch 22 | Extracting patterns from scraped text |
| Virtual environments and pip | Ch 23 | Installing beautifulsoup4 and requests |

What's Next

In Chapter 25, you'll learn version control with Git -- how to track your code's history, experiment fearlessly on branches, and collaborate with other developers. You'll initialize a Git repository for TaskFlow and learn the workflow that powers open-source development worldwide.


"The first rule of automation: if you have to do it twice, write a script. The second rule: if the script works, schedule it."