Case Study 20-1: Acme Corp Competitor Price Monitoring

The Business Problem

Marcus Webb, Acme Corp's Sales Director, needs to know where Acme's widget prices stand relative to the competition. Acme sells three product lines: Standard Widgets, Professional Widgets, and Enterprise Widget Bundles. Three competitors — TechWidgets.net, WidgetWorld.com, and ProParts Direct — publish their catalog prices publicly on their websites.

Previously, the pricing intelligence process looked like this: once per quarter, an analyst spent half a day manually visiting each competitor's site, finding the relevant products, noting down the prices, and building a comparison spreadsheet. The problems with this approach:

  • Infrequent updates — prices can change monthly or more often
  • Time-intensive — half a day per cycle, consumed by mechanical work
  • Error-prone — manual transcription introduces mistakes
  • No history — a snapshot every quarter does not show trends

Marcus's request to Priya: "Can you make this automated? Ideally I want weekly price data so I can see when competitors change prices."

Priya's response: "Yes. But first, some homework."


Step 1: Due Diligence Before Writing a Single Line of Code

Priya's first task is not coding — it is compliance checking.

Competitor 1: TechWidgets.net

Priya navigates to https://www.techwidgets.net/robots.txt:

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Crawl-delay: 3

User-agent: Googlebot
Allow: /

The public product catalog (/products/, /catalog/) is not disallowed. She checks the Terms of Service at /terms. There is a clause about "automated scraping of pricing data" being prohibited for commercial re-publication, but Priya's use is internal business intelligence, not resale of the data. She makes a note and proceeds.

Competitor 2: WidgetWorld.com

robots.txt shows:

User-agent: *
Disallow: /api/
Disallow: /internal/
Allow: /
Crawl-delay: 2

All clear for the public product pages.

Competitor 3: ProParts Direct

robots.txt shows:

User-agent: *
Disallow: /

A blanket Disallow: / means this site explicitly asks all bots not to crawl it. Priya excludes ProParts Direct from the automated system entirely. If Marcus needs ProParts pricing, it will be collected manually.
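These three robots.txt rulings can be verified programmatically with the standard library's `urllib.robotparser`, the same module the scraper uses later. A quick sketch (the ProParts hostname is a stand-in, since the real site is excluded anyway):

```python
import urllib.robotparser

def allowed(robots_lines: list[str], url: str, agent: str = "AcmePriceMonitor") -> bool:
    """Parse robots.txt lines and report whether `agent` may fetch `url`."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(agent, url)

techwidgets = [
    "User-agent: *",
    "Disallow: /cart/",
    "Disallow: /checkout/",
    "Disallow: /account/",
    "Crawl-delay: 3",
]
proparts = ["User-agent: *", "Disallow: /"]

print(allowed(techwidgets, "https://www.techwidgets.net/catalog/widgets/"))  # True
print(allowed(techwidgets, "https://www.techwidgets.net/cart/"))             # False
print(allowed(proparts, "https://www.propartsdirect.example/anything"))      # False
```

The blanket `Disallow: /` makes the parser answer False for every path on the site, which is exactly why ProParts Direct drops out of the automated system.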


Step 2: Inspecting the Target Pages

Priya opens TechWidgets.net's product catalog in her browser, right-clicks a product listing, and selects "Inspect Element." She is looking for a consistent HTML structure that Python can navigate reliably.

She observes the following structure (representative; real competitor sites vary):

<div class="catalog-grid">
  <div class="product-card" data-sku="TW-STD-001">
    <div class="product-header">
      <h3 class="product-name">Standard Widget Pro 500</h3>
      <span class="product-sku">SKU: TW-STD-001</span>
    </div>
    <div class="pricing-block">
      <span class="list-price">$89.99</span>
      <span class="unit-label">/ unit</span>
      <div class="bulk-pricing">
        <span>10+ units: $84.99</span>
        <span>50+ units: $79.99</span>
      </div>
    </div>
    <div class="product-meta">
      <span class="category">Standard Widgets</span>
      <span class="in-stock">In Stock</span>
    </div>
  </div>
  <!-- more product cards... -->
</div>

Good news: the structure is consistent across all products, with identifiable class names.
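Before writing the full script, it is worth confirming in a few lines that BeautifulSoup can navigate this structure. A minimal sketch against a trimmed-down version of the markup above (the selectors mirror the configuration used in the script below):

```python
from bs4 import BeautifulSoup

html = """
<div class="catalog-grid">
  <div class="product-card" data-sku="TW-STD-001">
    <h3 class="product-name">Standard Widget Pro 500</h3>
    <span class="product-sku">SKU: TW-STD-001</span>
    <span class="list-price">$89.99</span>
    <span class="category">Standard Widgets</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # stdlib parser; lxml also works
for card in soup.select("div.product-card"):
    name = card.select_one("h3.product-name").get_text(strip=True)
    price = card.select_one("span.list-price").get_text(strip=True)
    sku = card.select_one("span.product-sku").get_text(strip=True).removeprefix("SKU:").strip()
    print(sku, name, price)  # TW-STD-001 Standard Widget Pro 500 $89.99
```

If this loop prints sensible values for a handful of pasted product cards, the selectors are sound and the rest is plumbing.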


Step 3: The Scraping Script

"""
competitor_price_monitor.py
===========================
Acme Corp — Competitor Price Intelligence
Weekly scraper for public competitor product catalogs.

Respects robots.txt and ToS. Internal use only.
Data is saved to a running CSV file for trend analysis.

Author: Priya Kapoor, Acme Corp Operations
"""

import csv
import logging
import random
import time
import urllib.robotparser
from datetime import date, datetime
from pathlib import Path
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s — %(levelname)s — %(message)s"
)

# ---------------------------------------------------------------------------
# Configuration — update when competitor site structure changes
# ---------------------------------------------------------------------------

COMPETITORS = [
    {
        "name": "TechWidgets",
        "base_url": "https://www.techwidgets.net",
        "catalog_url": "https://www.techwidgets.net/catalog/widgets/",
        "product_selector": "div.product-card",
        "name_selector": "h3.product-name",
        "price_selector": "span.list-price",
        "sku_selector": "span.product-sku",
        "category_selector": "span.category",
        "crawl_delay": 3,  # from robots.txt
    },
    {
        "name": "WidgetWorld",
        "base_url": "https://www.widgetworld.com",
        "catalog_url": "https://www.widgetworld.com/products/",
        "product_selector": "li.product-listing",
        "name_selector": "span.item-title",
        "price_selector": "div.item-price strong",
        "sku_selector": "span.item-sku",
        "category_selector": "span.item-category",
        "crawl_delay": 2,  # from robots.txt
    },
]

OUTPUT_CSV = Path("data/competitor_prices.csv")
USER_AGENT = (
    "AcmePriceMonitor/1.0 "
    "(Internal business intelligence tool; "
    "contact priya@acmecorp.com)"
)

HEADERS = {
    "User-Agent": USER_AGENT,
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}


# ---------------------------------------------------------------------------
# robots.txt compliance
# ---------------------------------------------------------------------------

def is_scraping_allowed(catalog_url: str) -> bool:
    """Return True if robots.txt permits scraping the catalog URL."""
    parsed = urlparse(catalog_url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    try:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        allowed = rp.can_fetch(USER_AGENT, catalog_url)
        logger.info(
            f"robots.txt for {parsed.netloc}: "
            f"{'ALLOWED' if allowed else 'DISALLOWED'}"
        )
        return allowed
    except Exception as e:
        logger.warning(f"Could not read robots.txt: {e}. Skipping site to be safe.")
        return False  # fail closed: no robots.txt answer means no scraping


# ---------------------------------------------------------------------------
# Fetching
# ---------------------------------------------------------------------------

def fetch_catalog_page(url: str, crawl_delay: float) -> str | None:
    """Fetch a catalog page with delay and error handling."""
    time.sleep(crawl_delay + random.uniform(0, 1))  # honor crawl-delay + jitter
    try:
        response = requests.get(url, headers=HEADERS, timeout=15)
        if response.status_code == 200:
            return response.text
        elif response.status_code == 429:
            logger.warning("Rate limited. Waiting 60s...")
            time.sleep(60)
            return None
        else:
            logger.error(f"HTTP {response.status_code} for {url}")
            return None
    except requests.RequestException as e:
        logger.error(f"Request failed: {e}")
        return None


# ---------------------------------------------------------------------------
# Price parsing
# ---------------------------------------------------------------------------

def parse_price(price_text: str) -> float | None:
    """
    Convert a price string like "$89.99" or "£49.99" to a float.
    Returns None if parsing fails.
    """
    if not price_text:
        return None
    cleaned = (
        price_text.strip()
        .replace("$", "").replace("£", "").replace("€", "")
        .replace(",", "").replace(" ", "")
    )
    try:
        return float(cleaned)
    except ValueError:
        return None


# ---------------------------------------------------------------------------
# Data extraction
# ---------------------------------------------------------------------------

def extract_products(html: str, competitor: dict) -> list[dict]:
    """
    Extract product data from a catalog page using the competitor's
    configured CSS selectors.

    Args:
        html: Page HTML content.
        competitor: Competitor configuration dict from COMPETITORS list.

    Returns:
        List of product dictionaries.
    """
    soup = BeautifulSoup(html, "lxml")
    products = []

    product_elements = soup.select(competitor["product_selector"])
    if not product_elements:
        logger.warning(
            f"No product elements found using selector "
            f"'{competitor['product_selector']}'. "
            "The site structure may have changed."
        )
        return []

    logger.info(
        f"Found {len(product_elements)} product listings on "
        f"{competitor['name']} catalog page."
    )

    for element in product_elements:
        # Safe extraction helpers
        def get_text(selector: str) -> str:
            tag = element.select_one(selector)
            return tag.get_text(strip=True) if tag else ""

        name = get_text(competitor["name_selector"])
        price_text = get_text(competitor["price_selector"])
        sku_text = get_text(competitor["sku_selector"])
        category = get_text(competitor["category_selector"])

        # Skip empty records (malformed elements)
        if not name or not price_text:
            logger.debug("Skipping product element with missing name or price.")
            continue

        # Clean up SKU — remove "SKU:" prefix if present
        sku = sku_text.replace("SKU:", "").replace("SKU", "").strip()

        price_numeric = parse_price(price_text)

        products.append({
            "competitor": competitor["name"],
            "sku": sku,
            "product_name": name,
            "category": category,
            "price_display": price_text,
            "price_usd": price_numeric,
            "scraped_date": date.today().isoformat(),
            "scraped_at": datetime.now().isoformat(),
            "source_url": competitor["catalog_url"],
        })

    return products


# ---------------------------------------------------------------------------
# Storage: append to running CSV (tracks price history over time)
# ---------------------------------------------------------------------------

FIELDNAMES = [
    "competitor",
    "sku",
    "product_name",
    "category",
    "price_display",
    "price_usd",
    "scraped_date",
    "scraped_at",
    "source_url",
]


def append_prices_to_csv(products: list[dict]) -> None:
    """Append new price records to the running history CSV."""
    OUTPUT_CSV.parent.mkdir(parents=True, exist_ok=True)
    file_exists = OUTPUT_CSV.exists()

    with open(OUTPUT_CSV, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=FIELDNAMES, extrasaction="ignore"
        )
        if not file_exists:
            writer.writeheader()
            logger.info(f"Created new price history file: {OUTPUT_CSV}")
        writer.writerows(products)

    logger.info(f"Appended {len(products)} price records to {OUTPUT_CSV}")


# ---------------------------------------------------------------------------
# Analysis: detect price changes since last scrape
# ---------------------------------------------------------------------------

def detect_price_changes(
    new_products: list[dict],
    history_csv: Path,
) -> list[dict]:
    """
    Compare today's prices to the most recent historical prices.

    Args:
        new_products: Prices scraped today.
        history_csv: Path to the historical price CSV.

    Returns:
        List of dicts describing price changes (competitor, sku, old, new, change%).
    """
    if not history_csv.exists():
        return []  # No history to compare to

    # Load historical data
    historical = {}  # key: (competitor, sku), value: most recent price

    with open(history_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            key = (row["competitor"], row["sku"])
            # Keep only the most recent record per product
            if key not in historical or row["scraped_date"] > historical[key]["scraped_date"]:
                historical[key] = row

    changes = []
    for product in new_products:
        key = (product["competitor"], product["sku"])
        if key not in historical:
            continue  # New product, no history to compare

        old_price = float(historical[key].get("price_usd") or 0)
        new_price = product.get("price_usd")
        if new_price is None:
            continue  # price failed to parse today; nothing to compare

        if old_price > 0 and old_price != new_price:
            change_pct = ((new_price - old_price) / old_price) * 100
            changes.append({
                "competitor": product["competitor"],
                "sku": product["sku"],
                "product_name": product["product_name"],
                "old_price": old_price,
                "new_price": new_price,
                "change_pct": change_pct,
                "direction": "up" if new_price > old_price else "down",
                "detected_date": date.today().isoformat(),
            })

    return changes


# ---------------------------------------------------------------------------
# Main pipeline
# ---------------------------------------------------------------------------

def run_price_monitor() -> None:
    """Run the weekly competitor price monitoring pipeline."""
    today = date.today().strftime("%A, %B %d, %Y")
    print("Acme Corp Competitor Price Monitor")
    print(f"Date: {today}")
    print()

    all_products = []

    for competitor in COMPETITORS:
        print(f"Scraping {competitor['name']}...")

        # robots.txt check
        if not is_scraping_allowed(competitor["catalog_url"]):
            print("  SKIPPED: robots.txt disallows access.")
            continue

        # Fetch catalog
        html = fetch_catalog_page(
            competitor["catalog_url"],
            competitor["crawl_delay"]
        )
        if not html:
            print("  FAILED: Could not fetch catalog page.")
            continue

        # Extract products
        products = extract_products(html, competitor)
        print(f"  Extracted {len(products)} products.")

        if products:
            all_products.extend(products)

    # Save to history CSV
    if all_products:
        # Detect changes before appending new data
        changes = detect_price_changes(all_products, OUTPUT_CSV)
        append_prices_to_csv(all_products)

        print(f"\nTotal products recorded: {len(all_products)}")
        print(f"Price history saved to:  {OUTPUT_CSV}")

        if changes:
            print(f"\nPRICE CHANGES DETECTED ({len(changes)}):")
            print(f"  {'Competitor':<15} {'SKU':<12} {'Old Price':>10} {'New Price':>10} {'Change':>8}")
            print(f"  {'-'*15} {'-'*12} {'-'*10} {'-'*10} {'-'*8}")
            for change in sorted(changes, key=lambda x: abs(x["change_pct"]), reverse=True):
                print(
                    f"  {change['competitor']:<15} {change['sku']:<12} "
                    f"${change['old_price']:>9.2f} ${change['new_price']:>9.2f} "
                    f"{change['change_pct']:>+7.1f}%"
                )
        else:
            print("\nNo price changes detected since last scrape.")
    else:
        print("\nNo products were scraped. Check configuration and logs.")


if __name__ == "__main__":
    run_price_monitor()

Step 4: Building the Comparison Table

After several weeks of running the scraper, Marcus has enough data to build a price comparison table. He uses a simple pandas script to pivot the data:

import pandas as pd

df = pd.read_csv("data/competitor_prices.csv")

# Get the most recent price for each competitor/product
latest = (
    df.sort_values("scraped_date")
    .groupby(["competitor", "sku", "product_name"])
    .last()
    .reset_index()[["competitor", "sku", "product_name", "category", "price_usd"]]
)

# Pivot: rows = products, columns = competitors
comparison = latest.pivot_table(
    index=["sku", "product_name", "category"],
    columns="competitor",
    values="price_usd",
)

# Add Acme's own prices (from internal system)
acme_prices = pd.read_csv("data/acme_prices.csv")
comparison = comparison.join(
    acme_prices.set_index("sku")["acme_price_usd"].rename("Acme"),
    on="sku"
)

print(comparison.to_string())
comparison.to_csv("output/price_comparison.csv")
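To make Acme's position readable at a glance, a derived "percent vs. Acme" column per competitor helps. A sketch with toy numbers standing in for the pivoted comparison table (the real frame comes from the script above):

```python
import pandas as pd

# Toy latest-price frame: rows = SKUs, columns = competitors plus Acme.
comparison = pd.DataFrame(
    {"TechWidgets": [89.99, 249.00], "WidgetWorld": [92.50, 239.00], "Acme": [94.99, 245.00]},
    index=["TW-STD-001", "TW-PRO-010"],
)

# Negative values mean the competitor undercuts Acme on that SKU.
for col in ["TechWidgets", "WidgetWorld"]:
    comparison[f"{col}_vs_Acme_pct"] = (
        (comparison[col] - comparison["Acme"]) / comparison["Acme"] * 100
    ).round(1)

print(comparison)
```

Here TechWidgets undercuts Acme by 5.3% on the standard widget but charges 1.6% more on the pro model — the kind of per-SKU view Marcus wants.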

Outcome

Three months after deployment, Marcus's team has:

  • Weekly price data instead of quarterly snapshots
  • Automated price change alerts — a small extension of the script emails Marcus when any competitor price moves by more than 5%
  • A full price history in competitor_prices.csv showing when each price changed
  • A comparison dashboard (simple CSV fed into a chart) showing Acme's position relative to competitors

One finding: TechWidgets dropped their Standard Widget price by 12% in October — a change that went unnoticed in the old quarterly process. The automated scraper caught it the following Monday. Marcus was able to adjust Acme's pricing strategy within the week rather than discovering the change three months later.

Priya's time investment: two afternoons to build and test the scraper. Time saved annually: approximately 26 hours of manual research. More importantly, the data quality improved and the response time to market changes shrank from months to days.


Key Lessons from This Case Study

robots.txt is not optional. Priya excluded ProParts Direct not because their site was hard to scrape but because they explicitly requested no bot access. Respecting this is both ethical and practical — violating it creates legal exposure and reputational risk.

Separate configuration from logic. The COMPETITORS list holds all the site-specific details. When a competitor changes their site structure, Priya updates only the configuration, not the scraping logic. The extraction function is generic.

Append, do not overwrite. Saving price data to an append-only CSV builds a history automatically. This time-series data is far more valuable than any single snapshot.

Price change detection is the real value. The raw prices are less interesting than the changes. The script's most useful feature is not the CSV — it is the alert that something changed.

Test with your own site first. Priya built the scraper using Acme's own public website (which she obviously has permission to scrape) before pointing it at competitors. This confirmed the logic was correct before touching any external site.