Case Study 1: Reviewing a Code Review
Analyzing AI-Generated Web Scraper Code
The Scenario
You are a junior developer at a small data analytics startup. Your team lead asks you to review a piece of code that a colleague generated using an AI coding assistant. The code is a web scraper designed to collect product information from an e-commerce website and save it to a CSV file. Your colleague says the code "seems to work," but before it goes into production (running daily as an automated job), someone needs to review it thoroughly.
Your task: perform a complete code review using the five-phase process from Chapter 7 and identify every issue, no matter how small.
The Code Under Review
```python
import requests
from bs4 import BeautifulSoup
import csv
import time
import json

URL = "https://example-store.com/products"
OUTPUT_FILE = "products.csv"
API_KEY = "sk-prod-abc123def456ghi789"

def scrape_products(url, pages=10):
    all_products = []
    for page in range(1, pages):
        response = requests.get(url + f"?page={page}",
                                headers={"Authorization": f"Bearer {API_KEY}"})
        soup = BeautifulSoup(response.text, "html.parser")
        products = soup.find_all("div", class_="product-card")
        for product in products:
            name = product.find("h2").text
            price = product.find("span", class_="price").text
            price = float(price.replace("$", "").replace(",", ""))
            rating = product.find("div", class_="rating")
            if rating:
                rating = float(rating.text.split("/")[0])
            all_products.append({
                "name": name,
                "price": price,
                "rating": rating,
                "url": url + "/" + name.lower().replace(" ", "-")
            })
        time.sleep(0.5)
    return all_products

def save_to_csv(products, filename):
    f = open(filename, "w")
    writer = csv.DictWriter(f, fieldnames=["name", "price", "rating", "url"])
    writer.writeheader()
    for product in products:
        writer.writerow(product)
    f.close()
    print(f"Saved {len(products)} products to {filename}")

def get_product_details(product_url):
    resp = requests.get(product_url)
    soup = BeautifulSoup(resp.text, "html.parser")
    desc = soup.find("div", class_="description")
    if desc:
        return desc.text
    return ""

def analyze_prices(products):
    prices = []
    for p in products:
        prices.append(p["price"])
    avg = sum(prices) / len(prices)
    expensive = []
    for p in products:
        if p["price"] > avg * 1.5:
            expensive.append(p)
    cheap = []
    for p in products:
        if p["price"] < avg * 0.5:
            cheap.append(p)
    return {
        "average": avg,
        "min": min(prices),
        "max": max(prices),
        "expensive_products": expensive,
        "cheap_products": cheap,
        "total_products": len(products)
    }

if __name__ == "__main__":
    print("Starting scraper...")
    products = scrape_products(URL)
    save_to_csv(products, OUTPUT_FILE)
    analysis = analyze_prices(products)
    print(f"Average price: ${analysis['average']:.2f}")
    print(f"Price range: ${analysis['min']:.2f} - ${analysis['max']:.2f}")
    print(f"Expensive products: {len(analysis['expensive_products'])}")
    print(f"Cheap products: {len(analysis['cheap_products'])}")
```
Phase 1: Structural Scan
Let us begin with the bird's-eye view.
Imports: The code imports requests, BeautifulSoup, csv, time, and json. The json import is unused — it is never referenced anywhere in the code. This is a minor issue but indicates the AI included it "just in case."
Constants: Three module-level constants are defined: URL, OUTPUT_FILE, and API_KEY. The presence of API_KEY as a hardcoded constant immediately raises a security flag, which we will examine in detail later.
Functions: Four functions are defined:
- scrape_products() — the main scraping logic
- save_to_csv() — file output
- get_product_details() — fetches individual product pages
- analyze_prices() — statistical analysis
Entry point: A __name__ == "__main__" block orchestrates the workflow.
Observation: The get_product_details() function is defined but never called. It appears to be dead code — perhaps the AI generated it anticipating future needs, or it was part of an earlier conversation iteration that was not cleaned up.
Structural issues found: 2 (unused import, unused function)
Phase 2: Logic Trace
Let us trace through scrape_products(URL, pages=10).
The for page in range(1, pages) loop will iterate over pages 1 through 9. This is the first bug: the range(1, pages) call with pages=10 generates values 1 through 9, not 1 through 10. If the user expects 10 pages to be scraped (pages 1 through 10), this is an off-by-one error. The function parameter name pages=10 strongly implies "scrape 10 pages," but only 9 are scraped.
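The boundary behavior is easy to confirm in isolation, since Python's `range` excludes its stop value:

```python
pages = 10
# range's stop value is exclusive, so range(1, pages) stops at pages - 1
scraped = list(range(1, pages))      # pages 1 through 9: only 9 pages
fixed = list(range(1, pages + 1))    # pages 1 through 10, as intended
print(len(scraped), len(fixed))
```

A quick check like this at the REPL takes seconds and settles any doubt about loop boundaries.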
Inside the loop, for each page:
- A GET request is made with no error checking on the response status code. If the server returns a 404 or 500 error, the code will attempt to parse the error page as product HTML.
- The HTML is parsed, and all `div.product-card` elements are found.
- For each product, `name = product.find("h2").text` is called. If any product card lacks an `h2` element, this will raise an `AttributeError`, because `.find()` returns `None` and `None` has no `.text` attribute.
- The price parsing `float(price.replace("$", "").replace(",", ""))` assumes the price always starts with `$` and uses commas as thousand separators. Prices in other formats, such as "EUR 29.99" or "Free", would crash it.
- The rating is handled more carefully with an `if rating:` check, but when the rating is `None` (the product has no rating), that `None` value is appended to the product dictionary. This is not necessarily a bug, but it could cause issues downstream if code expects a float.
- The URL is constructed by appending a slugified version of the product name to the base URL. This construction is fragile: it assumes a URL pattern that may not match the actual website.
- `time.sleep(0.5)` provides rate limiting. This is reasonable but may be insufficient for some websites and excessive for others.
Logic issues found: 3 (off-by-one in range, no HTTP error checking, no null checking on HTML elements)
Phase 3: Quality Check
Naming: Mostly acceptable. `scrape_products`, `save_to_csv`, and `analyze_prices` are descriptive function names. However, `f` for the file handle, `p` for a product in `analyze_prices`, and `resp` instead of `response` in `get_product_details` are unnecessarily abbreviated.
Type hints: Completely absent. None of the functions have parameter or return type annotations.
Docstrings: Completely absent. None of the functions have docstrings explaining what they do, what they accept, or what they return.
Single responsibility: scrape_products does too much. It handles HTTP requests, HTML parsing, data extraction, and URL construction all in one function. These could be separated for better testability and maintainability.
DRY violations: The analyze_prices function loops through products three times (once to collect prices, once for expensive products, once for cheap products). This could be done in a single pass.
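Computing the average necessarily requires one full pass over the prices, but the two classification loops can be merged into a single pass. A sketch of the merged loop (the function name is illustrative, not from the original code):

```python
def classify_by_price(products, avg):
    """Split products into expensive and cheap buckets in a single pass."""
    expensive, cheap = [], []
    for p in products:
        if p["price"] > avg * 1.5:
            expensive.append(p)
        elif p["price"] < avg * 0.5:
            cheap.append(p)
    return expensive, cheap

items = [{"price": 10.0}, {"price": 100.0}, {"price": 2.0}]
avg = sum(p["price"] for p in items) / len(items)
expensive, cheap = classify_by_price(items, avg)
```

The `elif` works because a price cannot be both above 1.5x and below 0.5x the average at once.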
Quality issues found: Multiple (no type hints, no docstrings, weak naming in places, function doing too much)
Phase 4: Issue Scan
Now let us systematically look for bugs and potential failures.
Issue 1 (Critical): Hardcoded API Key
The line API_KEY = "sk-prod-abc123def456ghi789" contains what appears to be a production API key directly in the source code. If this code is committed to a Git repository, the key is exposed to anyone with repository access. Severity: Critical.
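The standard remedy is to read the key from the environment at startup and fail fast when it is missing. A minimal sketch (the variable name `SCRAPER_API_KEY` and the helper are illustrative):

```python
import os

def load_api_key(env_var="SCRAPER_API_KEY"):
    """Fetch the API key from the environment, failing fast when it is absent."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} environment variable is required")
    return key
```

Note that removing the key from the source is not sufficient once it has been committed: Git history retains it, so the exposed key should also be rotated.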
Issue 2 (High): No HTTP Error Handling
The requests.get() call does not check the response status code. A failed request (404, 500, timeout, connection error) will either crash the program with an unhandled exception or silently produce garbage results from parsing an error page. Severity: High.
Issue 3 (High): Resource Leak in save_to_csv
The save_to_csv function opens a file with f = open(filename, "w") but uses manual f.close() instead of a context manager. If writer.writerow() raises an exception (for example, due to a Unicode encoding error), the file will never be closed. Severity: High.
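The idiomatic fix is a `with` block, which guarantees the file is closed even if an exception escapes the body. A minimal illustration (the file path is arbitrary):

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "products_demo.csv")
rows = [{"name": "Widget", "price": 9.99}]

# The context manager closes the file even if writerow() raises,
# unlike a bare open()/close() pair.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

assert f.closed  # guaranteed once the with block exits
```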
Issue 4 (Medium): Off-by-One Error in Pagination
range(1, pages) with pages=10 scrapes pages 1-9, not 1-10. This means the last page of results is always missed. Severity: Medium.
Issue 5 (Medium): No Null Safety on HTML Elements
product.find("h2").text will crash with AttributeError if any product card does not contain an h2 element. The same applies to product.find("span", class_="price").text. Severity: Medium.
Issue 6 (Low to Medium): Fragile Price Parsing
`float(price.replace("$", "").replace(",", ""))` assumes a specific price format. Any variation (different currency symbol, "Free", "Price on request", a range like "$10-$20") will crash the program. Severity: Low to Medium depending on the data.
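A more defensive parser extracts the first numeric token and degrades gracefully on unparseable text. This is a sketch; the regex and the return-`None` convention are design choices, not requirements:

```python
import re

def parse_price(text):
    """Extract a numeric price from text; return None when no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))
```

Note that a range like "$10-$20" would still yield only the first number, so genuinely ambiguous inputs need an explicit policy decision.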
Issue 7 (Low): Division by Zero in analyze_prices
If products is an empty list (for example, if the scraper found no products), sum(prices) / len(prices) will raise a ZeroDivisionError. Severity: Low (but could crash the entire program).
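Guarding the empty case is a one-line check; for example:

```python
def safe_average(prices):
    """Return the mean of prices, or None when the list is empty."""
    if not prices:
        return None
    return sum(prices) / len(prices)
```

The standard library's `statistics.mean` raises `StatisticsError` on an empty sequence, so an explicit guard (or a sentinel return value) is needed either way.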
Issues found: 7
Phase 5: Security and Performance
Security:
- The hardcoded API key (already noted as Issue 1) is the most critical security problem.
- The get_product_details function makes HTTP requests based on URLs constructed from scraped data. If an attacker could inject malicious URLs into the product data, this could be exploited. Given that this is a scraper reading from a third-party site, this risk is low but worth noting.
Performance:
- The analyze_prices function makes three passes through the products list where one would suffice. For a typical scrape result of a few hundred products, this is negligible.
- The time.sleep(0.5) between pages is the dominant performance factor and is intentional rate limiting.
- No pagination memory concern — all products are stored in a list in memory. For a very large scrape (millions of products), this could be an issue, but for the expected scale, it is fine.
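Should the scrape ever grow to that scale, the in-memory list could be replaced by a generator consumed row by row while writing. A sketch under that assumption (the function name is illustrative):

```python
import csv

def stream_to_csv(product_iter, filename):
    """Write products to CSV as they arrive, keeping memory usage flat."""
    count = 0
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price", "rating", "url"])
        writer.writeheader()
        for product in product_iter:  # accepts any iterable, including generators
            writer.writerow(product)
            count += 1
    return count
```

Because the iterator is consumed lazily, only one product dictionary needs to exist in memory at a time.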
Summary of All Seven Issues
| # | Issue | Severity | Category | Section |
|---|---|---|---|---|
| 1 | Hardcoded API key in source code | Critical | Security | Constants |
| 2 | No HTTP error handling (status codes, timeouts, connection errors) | High | Error Handling | scrape_products |
| 3 | Resource leak — file opened without context manager | High | Resource Management | save_to_csv |
| 4 | Off-by-one error in pagination (scrapes 9 pages, not 10) | Medium | Logic Bug | scrape_products |
| 5 | No null safety on HTML element access (.find().text) | Medium | Error Handling | scrape_products |
| 6 | Fragile price parsing assumes specific format | Low-Medium | Robustness | scrape_products |
| 7 | Division by zero if product list is empty | Low | Edge Case | analyze_prices |
Additional Observations (Not Bugs, but Improvements)
- Missing type hints on all functions
- Missing docstrings on all functions
- Unused `import json`
- Unused function `get_product_details()`
- No logging (uses `print()` instead)
- No retry logic for failed HTTP requests
- No configuration file or command-line arguments for URL, pages, output file
- CSV file does not use the `newline=""` parameter recommended by the `csv` module on Windows
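Of these, retry logic matters most for a daily automated job. One way to sketch it, decoupled from `requests` so the policy itself is testable (the helper name and defaults are illustrative):

```python
import time

def retry(fetch, retries=3, backoff=0.1):
    """Call fetch(), retrying on any exception with exponential backoff."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(backoff * 2 ** attempt)
```

In the scraper this could wrap a page fetch, e.g. `retry(lambda: scrape_page(url, page))`; in production you would likely narrow the caught exception type to transient network errors.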
The Corrected Version
Here is how the code should look after addressing all seven issues:
```python
"""Web scraper for collecting product information from an e-commerce site."""
import csv
import logging
import os
import time
from dataclasses import dataclass

import requests
from bs4 import BeautifulSoup, Tag

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

URL = "https://example-store.com/products"
OUTPUT_FILE = "products.csv"
API_KEY = os.environ.get("SCRAPER_API_KEY")  # Issue 1: Fixed


@dataclass
class Product:
    """Represents a scraped product."""
    name: str
    price: float
    rating: float | None
    url: str


def scrape_page(url: str, page: int) -> requests.Response:
    """Fetch a single page with error handling."""
    if not API_KEY:
        raise RuntimeError("SCRAPER_API_KEY environment variable is required")
    try:
        response = requests.get(
            f"{url}?page={page}",
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,  # Issue 2: Added timeout
        )
        response.raise_for_status()  # Issue 2: Check status code
        return response
    except requests.RequestException as e:
        logger.error("Failed to fetch page %d: %s", page, e)
        raise


def parse_product(card: Tag, base_url: str) -> Product | None:
    """Parse a single product card, returning None if parsing fails."""
    # Issue 5: Null safety
    name_elem = card.find("h2")
    price_elem = card.find("span", class_="price")
    if not name_elem or not price_elem:
        logger.warning("Skipping product card with missing elements")
        return None
    name = name_elem.text.strip()
    # Issue 6: Robust price parsing
    try:
        price_text = price_elem.text.strip()
        price = float(price_text.replace("$", "").replace(",", ""))
    except ValueError:
        logger.warning("Could not parse price '%s' for '%s'", price_text, name)
        return None
    rating_elem = card.find("div", class_="rating")
    rating = None
    if rating_elem:
        try:
            rating = float(rating_elem.text.split("/")[0])
        except (ValueError, IndexError):
            logger.warning("Could not parse rating for '%s'", name)
    slug = name.lower().replace(" ", "-")
    product_url = f"{base_url}/{slug}"
    return Product(name=name, price=price, rating=rating, url=product_url)


def scrape_products(url: str, pages: int = 10) -> list[Product]:
    """Scrape products from multiple pages."""
    all_products: list[Product] = []
    for page in range(1, pages + 1):  # Issue 4: Fixed off-by-one
        logger.info("Scraping page %d of %d", page, pages)
        try:
            response = scrape_page(url, page)
        except requests.RequestException:
            logger.warning("Skipping page %d due to request failure", page)
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        cards = soup.find_all("div", class_="product-card")
        for card in cards:
            product = parse_product(card, url)
            if product is not None:
                all_products.append(product)
        time.sleep(0.5)
    return all_products


def save_to_csv(products: list[Product], filename: str) -> None:
    """Save products to a CSV file."""
    # Issue 3: Fixed resource leak with context manager
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["name", "price", "rating", "url"]
        )
        writer.writeheader()
        for product in products:
            writer.writerow({
                "name": product.name,
                "price": product.price,
                "rating": product.rating,
                "url": product.url,
            })
    logger.info("Saved %d products to %s", len(products), filename)


def analyze_prices(products: list[Product]) -> dict:
    """Analyze price distribution of products."""
    # Issue 7: Handle empty list
    if not products:
        return {
            "average": 0.0,
            "min": 0.0,
            "max": 0.0,
            "expensive_products": [],
            "cheap_products": [],
            "total_products": 0,
        }
    prices = [p.price for p in products]
    avg = sum(prices) / len(prices)
    return {
        "average": avg,
        "min": min(prices),
        "max": max(prices),
        "expensive_products": [p for p in products if p.price > avg * 1.5],
        "cheap_products": [p for p in products if p.price < avg * 0.5],
        "total_products": len(products),
    }
```
Lessons Learned
This case study demonstrates several key points:
- AI-generated code that "seems to work" can have critical issues. The original code would run successfully on a happy-path test, but contained a security vulnerability, resource leaks, and logic bugs that would surface in production.
- The five-phase review process catches issues systematically. By progressing from structural scan to logic trace to quality check to issue scan to security review, we found every problem without needing to rely on intuition alone.
- Severity assessment matters. Not all issues are equal. The hardcoded API key is a critical security risk that must be fixed immediately. The off-by-one error is a functional bug that produces incorrect results. The missing type hints are a quality concern that improves maintainability but does not affect functionality.
- Every bug you find is a checklist item for the future. After this review, you might add "Check for hardcoded API keys" and "Verify range() boundaries in pagination" to your personal code review checklist.