Case Study 1: Reviewing a Code Review

Analyzing AI-Generated Web Scraper Code


The Scenario

You are a junior developer at a small data analytics startup. Your team lead asks you to review a piece of code that a colleague generated using an AI coding assistant. The code is a web scraper designed to collect product information from an e-commerce website and save it to a CSV file. Your colleague says the code "seems to work," but before it goes into production (running daily as an automated job), someone needs to review it thoroughly.

Your task: perform a complete code review using the five-phase process from Chapter 7 and identify every issue, no matter how small.

The Code Under Review

import requests
from bs4 import BeautifulSoup
import csv
import time
import json

URL = "https://example-store.com/products"
OUTPUT_FILE = "products.csv"
API_KEY = "sk-prod-abc123def456ghi789"

def scrape_products(url, pages=10):
    all_products = []

    for page in range(1, pages):
        response = requests.get(url + f"?page={page}",
                               headers={"Authorization": f"Bearer {API_KEY}"})

        soup = BeautifulSoup(response.text, "html.parser")

        products = soup.find_all("div", class_="product-card")

        for product in products:
            name = product.find("h2").text
            price = product.find("span", class_="price").text
            price = float(price.replace("$", "").replace(",", ""))
            rating = product.find("div", class_="rating")
            if rating:
                rating = float(rating.text.split("/")[0])

            all_products.append({
                "name": name,
                "price": price,
                "rating": rating,
                "url": url + "/" + name.lower().replace(" ", "-")
            })

        time.sleep(0.5)

    return all_products

def save_to_csv(products, filename):
    f = open(filename, "w")
    writer = csv.DictWriter(f, fieldnames=["name", "price", "rating", "url"])
    writer.writeheader()
    for product in products:
        writer.writerow(product)
    f.close()
    print(f"Saved {len(products)} products to {filename}")

def get_product_details(product_url):
    resp = requests.get(product_url)
    soup = BeautifulSoup(resp.text, "html.parser")
    desc = soup.find("div", class_="description")
    if desc:
        return desc.text
    return ""

def analyze_prices(products):
    prices = []
    for p in products:
        prices.append(p["price"])

    avg = sum(prices) / len(prices)

    expensive = []
    for p in products:
        if p["price"] > avg * 1.5:
            expensive.append(p)

    cheap = []
    for p in products:
        if p["price"] < avg * 0.5:
            cheap.append(p)

    return {
        "average": avg,
        "min": min(prices),
        "max": max(prices),
        "expensive_products": expensive,
        "cheap_products": cheap,
        "total_products": len(products)
    }

if __name__ == "__main__":
    print("Starting scraper...")
    products = scrape_products(URL)
    save_to_csv(products, OUTPUT_FILE)
    analysis = analyze_prices(products)
    print(f"Average price: ${analysis['average']:.2f}")
    print(f"Price range: ${analysis['min']:.2f} - ${analysis['max']:.2f}")
    print(f"Expensive products: {len(analysis['expensive_products'])}")
    print(f"Cheap products: {len(analysis['cheap_products'])}")

Phase 1: Structural Scan

Let us begin with the bird's-eye view.

Imports: The code imports requests, BeautifulSoup, csv, time, and json. The json import is unused — it is never referenced anywhere in the code. This is a minor issue but indicates the AI included it "just in case."

Constants: Three module-level constants are defined: URL, OUTPUT_FILE, and API_KEY. The presence of API_KEY as a hardcoded constant immediately raises a security flag, which we will examine in detail later.

Functions: Four functions are defined:

  • scrape_products() — the main scraping logic
  • save_to_csv() — file output
  • get_product_details() — fetches individual product pages
  • analyze_prices() — statistical analysis

Entry point: A __name__ == "__main__" block orchestrates the workflow.

Observation: The get_product_details() function is defined but never called. It appears to be dead code — perhaps the AI generated it anticipating future needs, or it was part of an earlier conversation iteration that was not cleaned up.

Structural issues found: 2 (unused import, unused function)


Phase 2: Logic Trace

Let us trace through scrape_products(URL, pages=10).

The for page in range(1, pages) loop will iterate over pages 1 through 9. This is the first bug: the range(1, pages) call with pages=10 generates values 1 through 9, not 1 through 10. If the user expects 10 pages to be scraped (pages 1 through 10), this is an off-by-one error. The function parameter name pages=10 strongly implies "scrape 10 pages," but only 9 are scraped.
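The discrepancy is easy to verify in isolation:

```python
# range() excludes its stop value, so range(1, pages) with pages=10
# yields only 9 values. Extending the stop by one restores the intent.
pages = 10

buggy = list(range(1, pages))       # 9 values: pages 1 through 9
fixed = list(range(1, pages + 1))   # 10 values: pages 1 through 10

print(len(buggy), buggy[-1])
print(len(fixed), fixed[-1])
```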

Inside the loop, for each page:

  1. A GET request is made. No error checking on the response status code. If the server returns a 404 or 500 error, the code will attempt to parse the error page as product HTML.

  2. The HTML is parsed, and all div.product-card elements are found.

  3. For each product, name = product.find("h2").text is called. If any product card lacks an h2 element, this will raise an AttributeError because .find() returns None and None has no .text attribute.

  4. The price parsing float(price.replace("$", "").replace(",", "")) assumes the price always starts with "$" and uses commas as thousand separators. What about prices in other formats, like "EUR 29.99" or "Free"? Those would crash the parser.

  5. The rating is handled more carefully with an if rating: check, but when rating is None (the product has no rating), the None value is appended to the product dictionary. This is not necessarily a bug, but it could cause issues downstream if code expects a float.

  6. The URL is constructed by taking the base URL and appending a slugified version of the product name. This URL construction is fragile — it assumes a URL pattern that may not match the actual website.

  7. time.sleep(0.5) provides rate limiting. This is reasonable but may be insufficient for some websites and could be excessive for others.
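The null-safety hazard in step 3 can be reproduced with a few lines of hypothetical markup. find() returns None when the element is missing, so guarding before touching .text avoids the AttributeError:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second card has no h2, which is exactly the
# situation that crashes the original name extraction.
html = """
<div class="product-card"><h2>Widget</h2></div>
<div class="product-card"><span class="price">$5.00</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

for card in soup.find_all("div", class_="product-card"):
    h2 = card.find("h2")
    if h2 is None:          # guard: find() returned no element
        print("skipping card with no h2")
        continue
    print(h2.text)
```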

Logic issues found: 3 (off-by-one in range, no HTTP error checking, no null checking on HTML elements)


Phase 3: Quality Check

Naming: Mostly acceptable. scrape_products, save_to_csv, and analyze_prices are descriptive function names. However, f for the file handle in save_to_csv, p for each product in analyze_prices, and resp instead of response in get_product_details are unnecessarily abbreviated.

Type hints: Completely absent. None of the functions have parameter or return type annotations.

Docstrings: Completely absent. None of the functions have docstrings explaining what they do, what they accept, or what they return.

Single responsibility: scrape_products does too much. It handles HTTP requests, HTML parsing, data extraction, and URL construction all in one function. These could be separated for better testability and maintainability.

DRY violations: The analyze_prices function loops through the products list three times with near-identical boilerplate (once to collect prices, once for expensive products, once for cheap products). The average must be computed before classifying, so one collection pass is unavoidable, but the two classification loops can be merged into a single pass.
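A sketch of the merged classification pass, using made-up prices:

```python
# One pass collects the prices; once the average is known, a single
# loop with elif assigns each product to at most one bucket.
products = [{"price": 10.0}, {"price": 20.0}, {"price": 100.0}, {"price": 2.0}]

prices = [p["price"] for p in products]
avg = sum(prices) / len(prices)   # 33.0 for this sample

expensive, cheap = [], []
for p in products:
    if p["price"] > avg * 1.5:
        expensive.append(p)
    elif p["price"] < avg * 0.5:
        cheap.append(p)

print([p["price"] for p in expensive])   # [100.0]
print([p["price"] for p in cheap])       # [10.0, 2.0]
```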

Quality issues found: Multiple (no type hints, no docstrings, weak naming in places, function doing too much)


Phase 4: Issue Scan

Now let us systematically look for bugs and potential failures.

Issue 1 (Critical): Hardcoded API Key The line API_KEY = "sk-prod-abc123def456ghi789" contains what appears to be a production API key directly in the source code. If this code is committed to a Git repository, the key is exposed to anyone with repository access. Severity: Critical.

Issue 2 (High): No HTTP Error Handling The requests.get() call does not check the response status code. A failed request (404, 500, timeout, connection error) will either crash the program with an unhandled exception or silently produce garbage results from parsing an error page. Severity: High.

Issue 3 (High): Resource Leak in save_to_csv The save_to_csv function opens a file with f = open(filename, "w") but uses manual f.close() instead of a context manager. If writer.writerow() raises an exception (for example, due to a Unicode encoding error), the file will never be closed. Severity: High.
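A minimal demonstration of the context-manager fix, writing to a temporary path:

```python
import csv
import os
import tempfile

# The with-block closes the file on exit, including when writerow()
# raises, which a bare f.close() after the loop does not guarantee.
path = os.path.join(tempfile.gettempdir(), "demo_products.csv")

with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerow({"name": "Widget", "price": 5.0})

print(f.closed)   # True: closed without an explicit f.close()
os.remove(path)
```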

Issue 4 (Medium): Off-by-One Error in Pagination range(1, pages) with pages=10 scrapes pages 1-9, not 1-10. This means the last page of results is always missed. Severity: Medium.

Issue 5 (Medium): No Null Safety on HTML Elements product.find("h2").text will crash with AttributeError if any product card does not contain an h2 element. The same applies to product.find("span", class_="price").text. Severity: Medium.

Issue 6 (Low): Fragile Price Parsing `float(price.replace("$", "").replace(",", ""))` assumes a specific price format. Any variation (different currency symbol, "Free", "Price on request", a range like "$10-$20") will crash the program. Severity: Low to Medium depending on the data.

Issue 7 (Low): Division by Zero in analyze_prices If products is an empty list (for example, if the scraper found no products), sum(prices) / len(prices) will raise a ZeroDivisionError. Severity: Low (but could crash the entire program).
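Guarding the empty case is a one-line check:

```python
def safe_average(prices: list[float]) -> float:
    # sum([]) / len([]) raises ZeroDivisionError; return 0.0 instead.
    if not prices:
        return 0.0
    return sum(prices) / len(prices)

print(safe_average([]))            # 0.0
print(safe_average([10.0, 20.0]))  # 15.0
```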

Issues found: 7


Phase 5: Security and Performance

Security:

  • The hardcoded API key (already noted as Issue 1) is the most critical security problem.
  • The get_product_details function makes HTTP requests based on URLs constructed from scraped data. If an attacker could inject malicious URLs into the product data, this could be exploited. Given that this is a scraper reading from a third-party site, this risk is low but worth noting.

Performance:

  • The analyze_prices function makes three passes through the products list where fewer would suffice. For a typical scrape result of a few hundred products, this is negligible.
  • The time.sleep(0.5) between pages is the dominant performance factor and is intentional rate limiting.
  • Memory: all products are held in a list in memory. For a very large scrape (millions of products), this could be an issue, but for the expected scale it is fine.


Summary of All Seven Issues

#  Issue                                                            Severity    Category             Section
1  Hardcoded API key in source code                                 Critical    Security             Constants
2  No HTTP error handling (status codes, timeouts, connection       High        Error Handling       scrape_products
   errors)
3  Resource leak — file opened without context manager              High        Resource Management  save_to_csv
4  Off-by-one error in pagination (scrapes 9 pages, not 10)         Medium      Logic Bug            scrape_products
5  No null safety on HTML element access (.find().text)             Medium      Error Handling       scrape_products
6  Fragile price parsing assumes specific format                    Low-Medium  Robustness           scrape_products
7  Division by zero if product list is empty                        Low         Edge Case            analyze_prices

Additional Observations (Not Bugs, but Improvements)

  • Missing type hints on all functions
  • Missing docstrings on all functions
  • Unused import json
  • Unused function get_product_details()
  • No logging (uses print() instead)
  • No retry logic for failed HTTP requests
  • No configuration file or command-line arguments for URL, pages, output file
  • CSV file does not use newline="" parameter recommended by the csv module on Windows

The Corrected Version

Here is how the code should look after addressing all seven issues:

"""Web scraper for collecting product information from an e-commerce site."""

import csv
import logging
import os
import time
from dataclasses import dataclass

import requests
from bs4 import BeautifulSoup, Tag

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

URL = "https://example-store.com/products"
OUTPUT_FILE = "products.csv"
API_KEY = os.environ.get("SCRAPER_API_KEY")  # Issue 1: Fixed


@dataclass
class Product:
    """Represents a scraped product."""
    name: str
    price: float
    rating: float | None
    url: str


def scrape_page(url: str, page: int) -> requests.Response:
    """Fetch a single page with error handling and retries."""
    if not API_KEY:
        raise RuntimeError("SCRAPER_API_KEY environment variable is required")

    try:
        response = requests.get(
            f"{url}?page={page}",
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,  # Issue 2: Added timeout
        )
        response.raise_for_status()  # Issue 2: Check status code
        return response
    except requests.RequestException as e:
        logger.error("Failed to fetch page %d: %s", page, e)
        raise


def parse_product(card: Tag, base_url: str) -> Product | None:
    """Parse a single product card, returning None if parsing fails."""
    # Issue 5: Null safety
    name_elem = card.find("h2")
    price_elem = card.find("span", class_="price")

    if not name_elem or not price_elem:
        logger.warning("Skipping product card with missing elements")
        return None

    name = name_elem.text.strip()

    # Issue 6: Robust price parsing
    try:
        price_text = price_elem.text.strip()
        price = float(price_text.replace("$", "").replace(",", ""))
    except ValueError:
        logger.warning("Could not parse price '%s' for '%s'", price_text, name)
        return None

    rating_elem = card.find("div", class_="rating")
    rating = None
    if rating_elem:
        try:
            rating = float(rating_elem.text.split("/")[0])
        except (ValueError, IndexError):
            logger.warning("Could not parse rating for '%s'", name)

    slug = name.lower().replace(" ", "-")
    product_url = f"{base_url}/{slug}"

    return Product(name=name, price=price, rating=rating, url=product_url)


def scrape_products(url: str, pages: int = 10) -> list[Product]:
    """Scrape products from multiple pages."""
    all_products: list[Product] = []

    for page in range(1, pages + 1):  # Issue 4: Fixed off-by-one
        logger.info("Scraping page %d of %d", page, pages)
        try:
            response = scrape_page(url, page)
        except requests.RequestException:
            logger.warning("Skipping page %d due to request failure", page)
            continue

        soup = BeautifulSoup(response.text, "html.parser")
        cards = soup.find_all("div", class_="product-card")

        for card in cards:
            product = parse_product(card, url)
            if product is not None:
                all_products.append(product)

        time.sleep(0.5)

    return all_products


def save_to_csv(products: list[Product], filename: str) -> None:
    """Save products to a CSV file."""
    # Issue 3: Fixed resource leak with context manager
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["name", "price", "rating", "url"]
        )
        writer.writeheader()
        for product in products:
            writer.writerow({
                "name": product.name,
                "price": product.price,
                "rating": product.rating,
                "url": product.url,
            })
    logger.info("Saved %d products to %s", len(products), filename)


def analyze_prices(products: list[Product]) -> dict:
    """Analyze price distribution of products."""
    # Issue 7: Handle empty list
    if not products:
        return {
            "average": 0.0,
            "min": 0.0,
            "max": 0.0,
            "expensive_products": [],
            "cheap_products": [],
            "total_products": 0,
        }

    prices = [p.price for p in products]
    avg = sum(prices) / len(prices)

    return {
        "average": avg,
        "min": min(prices),
        "max": max(prices),
        "expensive_products": [p for p in products if p.price > avg * 1.5],
        "cheap_products": [p for p in products if p.price < avg * 0.5],
        "total_products": len(products),
    }

Lessons Learned

This case study demonstrates several key points:

  1. AI-generated code that "seems to work" can have critical issues. The original code would run successfully on a happy-path test, but contained a security vulnerability, resource leaks, and logic bugs that would surface in production.

  2. The five-phase review process catches issues systematically. By progressing from structural scan to logic trace to quality check to issue scan to security review, we found every problem without needing to rely on intuition alone.

  3. Severity assessment matters. Not all issues are equal. The hardcoded API key is a critical security risk that must be fixed immediately. The off-by-one error is a functional bug that produces incorrect results. The missing type hints are a quality concern that improves maintainability but does not affect functionality.

  4. Every bug you find is a checklist item for the future. After this review, you might add "Check for hardcoded API keys" and "Verify range() boundaries in pagination" to your personal code review checklist.