Case Study 20-1: Acme Corp Competitor Price Monitoring
The Business Problem
Marcus Webb, Acme Corp's Sales Director, needs to know where Acme's widget prices stand relative to the competition. Acme sells three product lines: Standard Widgets, Professional Widgets, and Enterprise Widget Bundles. Three competitors — TechWidgets.net, WidgetWorld.com, and ProParts Direct — publish their catalog prices publicly on their websites.
Previously, the pricing intelligence process looked like this: once per quarter, an analyst spent half a day manually visiting each competitor's site, finding the relevant products, noting down the prices, and building a comparison spreadsheet. The problems with this approach:
- Infrequent updates — prices can change monthly or more often
- Time-intensive — half a day per cycle, consumed by mechanical work
- Error-prone — manual transcription introduces mistakes
- No history — a snapshot every quarter does not show trends
Marcus's request to Priya: "Can you make this automated? Ideally I want weekly price data so I can see when competitors change prices."
Priya's response: "Yes. But first, some homework."
Step 1: Due Diligence Before Writing a Single Line of Code
Priya's first task is not coding — it is compliance checking.
Competitor 1: TechWidgets.net
Priya navigates to https://www.techwidgets.net/robots.txt:
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Crawl-delay: 3
User-agent: Googlebot
Allow: /
The public product catalog (/products/, /catalog/) is not disallowed. She checks the Terms of Service at /terms. There is a clause about "automated scraping of pricing data" being prohibited for commercial re-publication, but Priya's use is internal business intelligence, not resale of the data. She makes a note and proceeds.
Competitor 2: WidgetWorld.com
robots.txt shows:
User-agent: *
Disallow: /api/
Disallow: /internal/
Allow: /
Crawl-delay: 2
All clear for the public product pages.
Competitor 3: ProParts Direct
robots.txt shows:
User-agent: *
Disallow: /
A blanket Disallow: / means this site explicitly asks all bots not to crawl it. Priya excludes ProParts Direct from the automated system entirely. If Marcus needs ProParts pricing, it will be collected manually.
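Priya's three verdicts can be reproduced programmatically with the standard library's urllib.robotparser, which the full script below also uses. A minimal sketch, feeding the rule lines in directly instead of fetching them from a live site:

```python
import urllib.robotparser

# ProParts-style rules: a blanket disallow for every user agent.
blanket = urllib.robotparser.RobotFileParser()
blanket.parse([
    "User-agent: *",
    "Disallow: /",
])

# WidgetWorld-style rules: only specific paths are off-limits.
scoped = urllib.robotparser.RobotFileParser()
scoped.parse([
    "User-agent: *",
    "Disallow: /api/",
    "Allow: /",
])

print(blanket.can_fetch("AcmePriceMonitor", "/products/"))  # False
print(scoped.can_fetch("AcmePriceMonitor", "/products/"))   # True
print(scoped.can_fetch("AcmePriceMonitor", "/api/data"))    # False
```

The blanket `Disallow: /` makes `can_fetch()` return False for every path, which is exactly why ProParts Direct is excluded wholesale rather than handled as a special case.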
Step 2: Inspecting the Target Pages
Priya opens TechWidgets.net's product catalog in her browser, right-clicks a product listing, and selects "Inspect Element." She is looking for a consistent HTML structure that Python can navigate reliably.
She observes the following structure (representative; real competitor sites vary):
<div class="catalog-grid">
<div class="product-card" data-sku="TW-STD-001">
<div class="product-header">
<h3 class="product-name">Standard Widget Pro 500</h3>
<span class="product-sku">SKU: TW-STD-001</span>
</div>
<div class="pricing-block">
<span class="list-price">$89.99</span>
<span class="unit-label">/ unit</span>
<div class="bulk-pricing">
<span>10+ units: $84.99</span>
<span>50+ units: $79.99</span>
</div>
</div>
<div class="product-meta">
<span class="category">Standard Widgets</span>
<span class="in-stock">In Stock</span>
</div>
</div>
<!-- more product cards... -->
</div>
Good news: the structure is consistent across all products, with identifiable class names.
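Before committing to a full script, the selectors can be sanity-checked against a trimmed copy of that markup. A quick sketch (the class names come from the inspected structure above; the sample HTML is abbreviated):

```python
from bs4 import BeautifulSoup

SAMPLE = """
<div class="catalog-grid">
  <div class="product-card" data-sku="TW-STD-001">
    <h3 class="product-name">Standard Widget Pro 500</h3>
    <span class="product-sku">SKU: TW-STD-001</span>
    <span class="list-price">$89.99</span>
    <span class="category">Standard Widgets</span>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
card = soup.select_one("div.product-card")

name = card.select_one("h3.product-name").get_text(strip=True)
price = card.select_one("span.list-price").get_text(strip=True)
sku = card["data-sku"]  # the data attribute mirrors the visible SKU text

print(name, price, sku)  # Standard Widget Pro 500 $89.99 TW-STD-001
```

A few minutes of this kind of interactive probing confirms the selectors before any request hits the competitor's server.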
Step 3: The Scraping Script
"""
competitor_price_monitor.py
===========================
Acme Corp — Competitor Price Intelligence
Weekly scraper for public competitor product catalogs.
Respects robots.txt and ToS. Internal use only.
Data is saved to a running CSV file for trend analysis.
Author: Priya Kapoor, Acme Corp Operations
"""
import csv
import logging
import random
import time
import urllib.robotparser
from datetime import date, datetime
from pathlib import Path
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup
logger = logging.getLogger(__name__)
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s — %(levelname)s — %(message)s"
)
# ---------------------------------------------------------------------------
# Configuration — update when competitor site structure changes
# ---------------------------------------------------------------------------
COMPETITORS = [
{
"name": "TechWidgets",
"base_url": "https://www.techwidgets.net",
"catalog_url": "https://www.techwidgets.net/catalog/widgets/",
"product_selector": "div.product-card",
"name_selector": "h3.product-name",
"price_selector": "span.list-price",
"sku_selector": "span.product-sku",
"category_selector": "span.category",
"crawl_delay": 3, # from robots.txt
},
{
"name": "WidgetWorld",
"base_url": "https://www.widgetworld.com",
"catalog_url": "https://www.widgetworld.com/products/",
"product_selector": "li.product-listing",
"name_selector": "span.item-title",
"price_selector": "div.item-price strong",
"sku_selector": "span.item-sku",
"category_selector": "span.item-category",
"crawl_delay": 2, # from robots.txt
},
]
OUTPUT_CSV = Path("data/competitor_prices.csv")
USER_AGENT = (
"AcmePriceMonitor/1.0 "
"(Internal business intelligence tool; "
"contact priya@acmecorp.com)"
)
HEADERS = {
"User-Agent": USER_AGENT,
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "en-US,en;q=0.9",
}
# ---------------------------------------------------------------------------
# robots.txt compliance
# ---------------------------------------------------------------------------
def is_scraping_allowed(catalog_url: str) -> bool:
"""Return True if robots.txt permits scraping the catalog URL."""
parsed = urlparse(catalog_url)
robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
try:
rp = urllib.robotparser.RobotFileParser()
rp.set_url(robots_url)
rp.read()
allowed = rp.can_fetch(USER_AGENT, catalog_url)
logger.info(
f"robots.txt for {parsed.netloc}: "
f"{'ALLOWED' if allowed else 'DISALLOWED'}"
)
return allowed
except Exception as e:
        logger.warning(f"Could not read robots.txt: {e}. Assuming access is allowed.")
return True
# ---------------------------------------------------------------------------
# Fetching
# ---------------------------------------------------------------------------
def fetch_catalog_page(url: str, crawl_delay: float) -> str | None:
"""Fetch a catalog page with delay and error handling."""
time.sleep(crawl_delay + random.uniform(0, 1)) # honor crawl-delay + jitter
try:
response = requests.get(url, headers=HEADERS, timeout=15)
if response.status_code == 200:
return response.text
elif response.status_code == 429:
logger.warning("Rate limited. Waiting 60s...")
time.sleep(60)
return None
else:
logger.error(f"HTTP {response.status_code} for {url}")
return None
except requests.RequestException as e:
logger.error(f"Request failed: {e}")
return None
# ---------------------------------------------------------------------------
# Price parsing
# ---------------------------------------------------------------------------
def parse_price(price_text: str) -> float | None:
"""
Convert a price string like "$89.99" or "£49.99" to a float.
Returns None if parsing fails.
"""
if not price_text:
return None
cleaned = (
price_text.strip()
.replace("$", "").replace("£", "").replace("€", "")
.replace(",", "").replace(" ", "")
)
try:
return float(cleaned)
except ValueError:
return None
# ---------------------------------------------------------------------------
# Data extraction
# ---------------------------------------------------------------------------
def extract_products(html: str, competitor: dict) -> list[dict]:
"""
Extract product data from a catalog page using the competitor's
configured CSS selectors.
Args:
html: Page HTML content.
competitor: Competitor configuration dict from COMPETITORS list.
Returns:
List of product dictionaries.
"""
soup = BeautifulSoup(html, "lxml")
products = []
product_elements = soup.select(competitor["product_selector"])
if not product_elements:
logger.warning(
f"No product elements found using selector "
f"'{competitor['product_selector']}'. "
"The site structure may have changed."
)
return []
logger.info(
f"Found {len(product_elements)} product listings on "
f"{competitor['name']} catalog page."
)
for element in product_elements:
# Safe extraction helpers
def get_text(selector: str) -> str:
tag = element.select_one(selector)
return tag.get_text(strip=True) if tag else ""
name = get_text(competitor["name_selector"])
price_text = get_text(competitor["price_selector"])
sku_text = get_text(competitor["sku_selector"])
category = get_text(competitor["category_selector"])
# Skip empty records (malformed elements)
if not name or not price_text:
            logger.debug("Skipping empty product element.")
continue
# Clean up SKU — remove "SKU:" prefix if present
sku = sku_text.replace("SKU:", "").replace("SKU", "").strip()
price_numeric = parse_price(price_text)
products.append({
"competitor": competitor["name"],
"sku": sku,
"product_name": name,
"category": category,
"price_display": price_text,
"price_usd": price_numeric,
"scraped_date": date.today().isoformat(),
"scraped_at": datetime.now().isoformat(),
"source_url": competitor["catalog_url"],
})
return products
# ---------------------------------------------------------------------------
# Storage: append to running CSV (tracks price history over time)
# ---------------------------------------------------------------------------
FIELDNAMES = [
"competitor",
"sku",
"product_name",
"category",
"price_display",
"price_usd",
"scraped_date",
"scraped_at",
"source_url",
]
def append_prices_to_csv(products: list[dict]) -> None:
"""Append new price records to the running history CSV."""
OUTPUT_CSV.parent.mkdir(parents=True, exist_ok=True)
file_exists = OUTPUT_CSV.exists()
with open(OUTPUT_CSV, "a", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(
f, fieldnames=FIELDNAMES, extrasaction="ignore"
)
if not file_exists:
writer.writeheader()
logger.info(f"Created new price history file: {OUTPUT_CSV}")
writer.writerows(products)
logger.info(f"Appended {len(products)} price records to {OUTPUT_CSV}")
# ---------------------------------------------------------------------------
# Analysis: detect price changes since last scrape
# ---------------------------------------------------------------------------
def detect_price_changes(
new_products: list[dict],
history_csv: Path,
) -> list[dict]:
"""
Compare today's prices to the most recent historical prices.
Args:
new_products: Prices scraped today.
history_csv: Path to the historical price CSV.
Returns:
List of dicts describing price changes (competitor, sku, old, new, change%).
"""
if not history_csv.exists():
return [] # No history to compare to
# Load historical data
historical = {} # key: (competitor, sku), value: most recent price
with open(history_csv, newline="", encoding="utf-8") as f:
reader = csv.DictReader(f)
for row in reader:
key = (row["competitor"], row["sku"])
# Keep only the most recent record per product
if key not in historical or row["scraped_date"] > historical[key]["scraped_date"]:
historical[key] = row
changes = []
for product in new_products:
key = (product["competitor"], product["sku"])
if key not in historical:
continue # New product, no history to compare
        old_price = float(historical[key].get("price_usd") or 0)
        new_price = product.get("price_usd")
        if new_price is None:
            continue  # Parse failed today; don't report it as a drop to $0
        if old_price > 0 and old_price != new_price:
change_pct = ((new_price - old_price) / old_price) * 100
changes.append({
"competitor": product["competitor"],
"sku": product["sku"],
"product_name": product["product_name"],
"old_price": old_price,
"new_price": new_price,
"change_pct": change_pct,
"direction": "up" if new_price > old_price else "down",
"detected_date": date.today().isoformat(),
})
return changes
# ---------------------------------------------------------------------------
# Main pipeline
# ---------------------------------------------------------------------------
def run_price_monitor() -> None:
"""Run the weekly competitor price monitoring pipeline."""
today = date.today().strftime("%A, %B %d, %Y")
    print("Acme Corp Competitor Price Monitor")
print(f"Date: {today}")
print()
all_products = []
for competitor in COMPETITORS:
print(f"Scraping {competitor['name']}...")
# robots.txt check
if not is_scraping_allowed(competitor["catalog_url"]):
print(f" SKIPPED: robots.txt disallows access.")
continue
# Fetch catalog
html = fetch_catalog_page(
competitor["catalog_url"],
competitor["crawl_delay"]
)
if not html:
print(f" FAILED: Could not fetch catalog page.")
continue
# Extract products
products = extract_products(html, competitor)
print(f" Extracted {len(products)} products.")
if products:
all_products.extend(products)
# Save to history CSV
if all_products:
# Detect changes before appending new data
changes = detect_price_changes(all_products, OUTPUT_CSV)
append_prices_to_csv(all_products)
print(f"\nTotal products recorded: {len(all_products)}")
print(f"Price history saved to: {OUTPUT_CSV}")
if changes:
print(f"\nPRICE CHANGES DETECTED ({len(changes)}):")
print(f" {'Competitor':<15} {'SKU':<12} {'Old Price':>10} {'New Price':>10} {'Change':>8}")
print(f" {'-'*15} {'-'*12} {'-'*10} {'-'*10} {'-'*8}")
for change in sorted(changes, key=lambda x: abs(x["change_pct"]), reverse=True):
print(
f" {change['competitor']:<15} {change['sku']:<12} "
f"${change['old_price']:>9.2f} ${change['new_price']:>9.2f} "
f"{change['change_pct']:>+7.1f}%"
)
else:
print("\nNo price changes detected since last scrape.")
else:
print("\nNo products were scraped. Check configuration and logs.")
if __name__ == "__main__":
run_price_monitor()
Step 4: Building the Comparison Table
After several weeks of running the scraper, Marcus has enough data to build a price comparison table. He uses a simple pandas script to pivot the data:
import pandas as pd
df = pd.read_csv("data/competitor_prices.csv")
# Get the most recent price for each competitor/product
latest = (
df.sort_values("scraped_date")
.groupby(["competitor", "sku", "product_name"])
.last()
.reset_index()[["competitor", "sku", "product_name", "category", "price_usd"]]
)
# Pivot: rows = products, columns = competitors
comparison = latest.pivot_table(
index=["sku", "product_name", "category"],
columns="competitor",
values="price_usd",
)
# Add Acme's own prices (from internal system)
acme_prices = pd.read_csv("data/acme_prices.csv")
comparison = comparison.join(
acme_prices.set_index("sku")["acme_price_usd"].rename("Acme"),
on="sku"
)
print(comparison.to_string())
comparison.to_csv("output/price_comparison.csv")
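To make Acme's position explicit, the same pivot can be extended with a delta against the cheapest rival. A sketch using invented numbers in the shape the script above produces (the figures are illustrative, not real scraped data):

```python
import pandas as pd

# Illustrative comparison table in the shape produced by the pivot above.
comparison = pd.DataFrame({
    "TechWidgets": [89.99, 249.00],
    "WidgetWorld": [92.50, 239.00],
    "Acme":        [94.99, 235.00],
}, index=["TW-STD-001", "TW-ENT-010"])

competitor_cols = [c for c in comparison.columns if c != "Acme"]
comparison["cheapest_rival"] = comparison[competitor_cols].min(axis=1)

# Positive delta: Acme is priced above the cheapest competitor.
comparison["acme_delta_pct"] = (
    (comparison["Acme"] - comparison["cheapest_rival"])
    / comparison["cheapest_rival"] * 100
).round(1)

print(comparison)
```

A single sorted column of deltas answers Marcus's original question ("where do we stand?") at a glance.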
Outcome
Three months after deployment, Marcus's team has:
- Weekly price data instead of quarterly snapshots
- Automated price change alerts — the script emails Marcus when any competitor price changes by more than 5%
- A full price history in competitor_prices.csv showing when each price changed
- A comparison dashboard (simple CSV fed into a chart) showing Acme's position relative to competitors
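The email alert is not part of the script shown earlier. One way it could be wired up, sketched with placeholder addresses and SMTP host, and with the send step kept separate so the formatting logic can be exercised without a mail server:

```python
import smtplib
from email.message import EmailMessage

ALERT_THRESHOLD_PCT = 5.0  # notify only on moves larger than this

def format_alert(changes: list[dict]) -> str:
    """Build a plain-text alert body for significant price changes."""
    lines = ["Competitor price changes exceeding threshold:", ""]
    for c in changes:
        if abs(c["change_pct"]) < ALERT_THRESHOLD_PCT:
            continue
        lines.append(
            f"{c['competitor']} {c['sku']}: "
            f"${c['old_price']:.2f} -> ${c['new_price']:.2f} "
            f"({c['change_pct']:+.1f}%)"
        )
    return "\n".join(lines)

def send_alert(body: str) -> None:
    """Send the alert body by email. Host and addresses are placeholders."""
    msg = EmailMessage()
    msg["Subject"] = "Competitor price alert"
    msg["From"] = "pricebot@acmecorp.com"   # placeholder address
    msg["To"] = "marcus@acmecorp.com"       # placeholder address
    msg.set_content(body)
    with smtplib.SMTP("smtp.acmecorp.com") as server:  # placeholder host
        server.send_message(msg)

# Example input in the shape produced by detect_price_changes().
changes = [
    {"competitor": "TechWidgets", "sku": "TW-STD-001",
     "old_price": 89.99, "new_price": 79.19, "change_pct": -12.0},
    {"competitor": "WidgetWorld", "sku": "WW-200",
     "old_price": 50.00, "new_price": 51.00, "change_pct": 2.0},
]
print(format_alert(changes))  # only the -12.0% change is listed
```

Separating formatting from sending also makes the alert text easy to unit-test.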
One finding: TechWidgets dropped their Standard Widget price by 12% in October — a change that went unnoticed in the old quarterly process. The automated scraper caught it the following Monday. Marcus was able to adjust Acme's pricing strategy within the week rather than discovering the change three months later.
Priya's time investment: two afternoons to build and test the scraper. Time saved annually: approximately 26 hours of manual research. More importantly, the data quality improved and the response time to market changes shrank from months to days.
Key Lessons from This Case Study
robots.txt is not optional. Priya excluded ProParts Direct not because their site was hard to scrape but because they explicitly requested no bot access. Respecting this is both ethical and practical — violating it creates legal exposure and reputational risk.
Separate configuration from logic. The COMPETITORS list holds all the site-specific details. When a competitor changes their site structure, Priya updates only the configuration, not the scraping logic. The extraction function is generic.
Append, do not overwrite. Saving price data to an append-only CSV builds a history automatically. This time-series data is far more valuable than any single snapshot.
Price change detection is the real value. The raw prices are less interesting than the changes. The script's most useful feature is not the CSV — it is the alert that something changed.
Test with your own site first. Priya built the scraper using Acme's own public website (which she obviously has permission to scrape) before pointing it at competitors. This confirmed the logic was correct before touching any external site.
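The same habit applies at the function level: parse_price can be pinned down with assertions on awkward inputs before any live fetch. A sketch that restates the function inline so it runs standalone:

```python
def parse_price(price_text):
    """Same logic as competitor_price_monitor.parse_price, restated inline."""
    if not price_text:
        return None
    cleaned = (
        price_text.strip()
        .replace("$", "").replace("£", "").replace("€", "")
        .replace(",", "").replace(" ", "")
    )
    try:
        return float(cleaned)
    except ValueError:
        return None

# Edge cases worth pinning down before touching a live site.
assert parse_price("$89.99") == 89.99
assert parse_price("$1,249.00") == 1249.00
assert parse_price("Call for pricing") is None
assert parse_price("") is None
print("all parse_price checks passed")
```

Catching the "Call for pricing" case offline is far cheaper than discovering it in production as a mysterious blank cell in the history CSV.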