Case Study 1: Priya Consolidates the Regional Sales Reports

Characters: Priya Okonkwo (Analyst, Acme Corp), Sandra Chen (VP Sales), Marcus Webb (IT)
Concepts: pathlib, glob, csv.DictReader, csv.DictWriter, json.dump, processing metadata


The Situation

Every Monday morning at 8:45 a.m., Priya Okonkwo's inbox fills with the same message from Marcus Webb in IT: "Regional files are in the shared drive." Four CSV files — one for each of Acme Corp's sales regions — land in a network folder called data/regional_reports/. Each file covers the previous week's sales data.

The current process: Priya opens each file in Excel, copies the data, pastes it into a master workbook, saves, and emails it to Sandra Chen by 10:00 a.m. It takes an hour. The pasting introduces errors. Sandra has sent back corrections twice this month alone.

Sandra stopped by Priya's desk on Friday. "I need the consolidated report by nine, not ten. Can you do anything about that?" Sandra's tone was polite. Her deadline was not.

Priya has been working through this Python textbook for three weeks. She opens her laptop on Sunday evening and starts thinking through the problem.


Breaking Down the Problem

Priya sketches out what the script needs to do:

  1. Look in data/regional_reports/ and find all CSV files
  2. Read each one, collecting all rows
  3. Validate that each file has the expected columns (fail loudly if something is wrong, rather than silently producing a bad report)
  4. Write all rows into a single output/q1_consolidated.csv
  5. Write a JSON metadata file recording which files were processed, how many records came from each, and whether any errors occurred
  6. Print a clear summary to the terminal so Priya knows at a glance that everything worked

She also thinks about failure modes — what happens when one region's file is missing? Or malformed? She wants the script to handle a bad file gracefully: log the error, skip that file, and still produce output for the remaining three regions.


The Regional CSV Format

Each regional file has the same structure (Priya confirmed this with Marcus). Here is a sample from north_q1.csv:

rep_id,rep_name,region,product_line,units_sold,unit_price,revenue,quota,month
R001,Daniel Park,North,Enterprise,42,4500.00,189000.00,175000.00,January
R002,Fatima Hassan,North,SMB,88,1200.00,105600.00,100000.00,January
R001,Daniel Park,North,Enterprise,50,4500.00,225000.00,190000.00,February
...

The region field is always consistent within a file, but Priya wants to keep it — Sandra sometimes filters by region in the consolidated report.
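Before writing the full script, it helps to see the shape of the data csv.DictReader produces. This minimal sketch uses a two-row inline excerpt in the same format as the regional files (a stand-in for the real north_q1.csv); each row arrives as a dict keyed by the header:

```python
import csv
import io

# Two rows in the regional file format, inline for illustration
sample = """rep_id,rep_name,region,product_line,units_sold,unit_price,revenue,quota,month
R001,Daniel Park,North,Enterprise,42,4500.00,189000.00,175000.00,January
R002,Fatima Hassan,North,SMB,88,1200.00,105600.00,100000.00,January
"""

reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

# Every value arrives as a string; numeric fields need explicit conversion.
print(rows[0]["rep_name"])                                        # Daniel Park
print(int(rows[0]["units_sold"]) * float(rows[0]["unit_price"]))  # 189000.0
```

The string-by-default behavior is why the script below converts units_sold, unit_price, revenue, and quota explicitly.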


The Script

"""
acme_consolidate_regional_sales.py
Priya Okonkwo — Acme Corp Analytics
Consolidates Q1 regional sales CSVs into a single master report.
"""

import csv
import json
from datetime import datetime
from pathlib import Path


# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------

REGIONAL_DIR = Path("data") / "regional_reports"
OUTPUT_DIR   = Path("output") / "consolidated"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

MASTER_CSV    = OUTPUT_DIR / "q1_master_sales.csv"
METADATA_JSON = OUTPUT_DIR / "processing_metadata.json"

# Every regional file must have exactly these columns.
REQUIRED_COLUMNS = {
    "rep_id", "rep_name", "region", "product_line",
    "units_sold", "unit_price", "revenue", "quota", "month",
}

# Output column order (controls the consolidated CSV structure)
OUTPUT_FIELDNAMES = [
    "rep_id", "rep_name", "region", "product_line",
    "units_sold", "unit_price", "revenue", "quota", "month",
]


# ---------------------------------------------------------------------------
# Step 1: Discover regional CSV files
# ---------------------------------------------------------------------------

def find_regional_files(directory: Path) -> list[Path]:
    """
    Return a sorted list of all CSV files in the regional reports directory.
    Raises FileNotFoundError if the directory itself does not exist.
    """
    if not directory.exists():
        raise FileNotFoundError(
            f"Regional reports directory not found: {directory}\n"
            "Check that the network drive is mounted and the path is correct."
        )

    csv_files = sorted(directory.glob("*.csv"))

    if not csv_files:
        raise FileNotFoundError(
            f"No CSV files found in {directory}.\n"
            "The regional files may not have been uploaded yet."
        )

    print(f"[discover] Found {len(csv_files)} CSV files in {directory}:")
    for f in csv_files:
        size_kb = f.stat().st_size / 1024
        print(f"  {f.name:<30}  {size_kb:6.1f} KB")
    print()

    return csv_files


# ---------------------------------------------------------------------------
# Step 2: Validate a single regional file
# ---------------------------------------------------------------------------

def validate_csv_columns(file_path: Path) -> tuple[bool, str]:
    """
    Check that the CSV's header row contains all required columns.

    Returns
    -------
    (True, "")                    if valid
    (False, error_message)        if a column is missing
    """
    with open(file_path, mode="r", newline="", encoding="utf-8") as csv_file:
        reader = csv.DictReader(csv_file)
        actual_columns = set(reader.fieldnames or [])

    missing = REQUIRED_COLUMNS - actual_columns
    if missing:
        return False, f"Missing columns: {', '.join(sorted(missing))}"

    return True, ""


# ---------------------------------------------------------------------------
# Step 3: Read one regional file and return typed records
# ---------------------------------------------------------------------------

def read_regional_csv(file_path: Path) -> tuple[list[dict], int]:
    """
    Read a regional CSV file and return (records, skipped_count).
    Numeric fields are converted from strings; invalid rows are skipped.
    """
    records = []
    skipped = 0

    with open(file_path, mode="r", newline="", encoding="utf-8") as csv_file:
        reader = csv.DictReader(csv_file)
        for row_number, row in enumerate(reader, start=2):  # start=2: header is row 1
            try:
                records.append({
                    "rep_id":       row["rep_id"].strip(),
                    "rep_name":     row["rep_name"].strip(),
                    "region":       row["region"].strip(),
                    "product_line": row["product_line"].strip(),
                    "units_sold":   int(row["units_sold"]),
                    "unit_price":   float(row["unit_price"]),
                    "revenue":      float(row["revenue"]),
                    "quota":        float(row["quota"]),
                    "month":        row["month"].strip(),
                })
            except (ValueError, KeyError) as parse_error:
                print(
                    f"  [skip] Row {row_number} in {file_path.name}: "
                    f"{parse_error}"
                )
                skipped += 1

    return records, skipped


# ---------------------------------------------------------------------------
# Step 4: Consolidate all regional files
# ---------------------------------------------------------------------------

def consolidate_regional_files(csv_files: list[Path]) -> dict:
    """
    Validate and read all regional CSV files.
    Returns a dict with combined records and per-file processing stats.
    """
    all_records = []
    file_stats = []

    for csv_path in csv_files:
        print(f"[process] {csv_path.name}")

        # Validate columns before reading
        is_valid, error_message = validate_csv_columns(csv_path)
        if not is_valid:
            print(f"  [ERROR] Validation failed — {error_message}")
            print(f"  [skip]  Skipping {csv_path.name} entirely.")
            file_stats.append({
                "filename":     csv_path.name,
                "status":       "error",
                "error":        error_message,
                "records_read": 0,
                "rows_skipped": 0,
            })
            continue

        # Read and type-convert records
        records, skipped_count = read_regional_csv(csv_path)
        all_records.extend(records)

        print(
            f"  [ok]    {len(records)} records read"
            + (f", {skipped_count} rows skipped" if skipped_count > 0 else "")
        )

        file_stats.append({
            "filename":     csv_path.name,
            "status":       "success",
            "error":        "",
            "records_read": len(records),
            "rows_skipped": skipped_count,
        })

    print()
    return {
        "all_records": all_records,
        "file_stats":  file_stats,
    }


# ---------------------------------------------------------------------------
# Step 5: Write the consolidated master CSV
# ---------------------------------------------------------------------------

def write_master_csv(file_path: Path, records: list[dict]) -> None:
    """
    Write all records to the master consolidated CSV.
    Sorted by region, then rep_name, then month for consistent output.
    """
    month_order = {
        "January": 1, "February": 2, "March": 3, "April": 4,
        "May": 5, "June": 6, "July": 7, "August": 8,
        "September": 9, "October": 10, "November": 11, "December": 12,
    }

    sorted_records = sorted(
        records,
        key=lambda r: (r["region"], r["rep_name"], month_order.get(r["month"], 99)),
    )

    with open(file_path, mode="w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(
            csv_file,
            fieldnames=OUTPUT_FIELDNAMES,
            extrasaction="ignore",
        )
        writer.writeheader()
        writer.writerows(sorted_records)

    print(f"[write] Master CSV: {len(sorted_records)} records → {file_path}")


# ---------------------------------------------------------------------------
# Step 6: Build and write processing metadata (JSON)
# ---------------------------------------------------------------------------

def build_processing_metadata(
    file_stats: list[dict],
    all_records: list[dict],
) -> dict:
    """
    Assemble a metadata summary of the consolidation run.
    """
    # Compute revenue totals per region from the consolidated data
    revenue_by_region: dict[str, float] = {}
    for record in all_records:
        region = record["region"]
        revenue_by_region[region] = revenue_by_region.get(region, 0.0) + record["revenue"]

    successful_files = [f for f in file_stats if f["status"] == "success"]
    error_files      = [f for f in file_stats if f["status"] == "error"]

    return {
        "report_title":         "Q1 Consolidated Sales Report",
        "generated_at":         datetime.now().isoformat(timespec="seconds"),
        "generated_by":         "acme_consolidate_regional_sales.py",
        "total_source_files":   len(file_stats),
        "successful_files":     len(successful_files),
        "error_files":          len(error_files),
        "total_records":        len(all_records),
        "total_revenue":        round(sum(r["revenue"] for r in all_records), 2),
        "revenue_by_region":    {k: round(v, 2) for k, v in sorted(revenue_by_region.items())},
        "output_file":          str(MASTER_CSV),
        "file_details":         file_stats,
    }


def write_metadata_json(file_path: Path, metadata: dict) -> None:
    """Write the processing metadata dict to a JSON file."""
    with open(file_path, mode="w", encoding="utf-8") as json_file:
        json.dump(metadata, json_file, indent=2)
    print(f"[write] Metadata JSON → {file_path}")


# ---------------------------------------------------------------------------
# Step 7: Print a human-readable summary to the terminal
# ---------------------------------------------------------------------------

def print_run_summary(metadata: dict) -> None:
    """Print a formatted summary of the consolidation run."""
    print()
    print("=" * 60)
    print("  ACME CORP — Q1 SALES CONSOLIDATION COMPLETE")
    print("=" * 60)
    print(f"  Generated   : {metadata['generated_at']}")
    print(f"  Files read  : {metadata['successful_files']} of {metadata['total_source_files']}")

    if metadata["error_files"] > 0:
        print(f"  *** {metadata['error_files']} FILE(S) HAD ERRORS — SEE METADATA JSON ***")

    print(f"  Total records: {metadata['total_records']:,}")
    print(f"  Total revenue: ${metadata['total_revenue']:>12,.2f}")
    print()
    print("  Revenue by Region:")
    for region, revenue in metadata["revenue_by_region"].items():
        bar = "#" * int(revenue / 20000)    # one # per $20,000
        print(f"    {region:<8}  ${revenue:>12,.2f}  {bar}")
    print()
    print(f"  Output: {metadata['output_file']}")
    print("=" * 60)


# ---------------------------------------------------------------------------
# Main entry point
# ---------------------------------------------------------------------------

def main() -> None:
    print("=" * 60)
    print("  Acme Corp — Regional Sales Consolidation Script")
    print(f"  Started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("=" * 60)
    print()

    # Step 1: Find files
    csv_files = find_regional_files(REGIONAL_DIR)

    # Steps 2–4: Validate and consolidate
    consolidation_result = consolidate_regional_files(csv_files)
    all_records = consolidation_result["all_records"]
    file_stats  = consolidation_result["file_stats"]

    if not all_records:
        print("ERROR: No records were successfully read. Aborting.")
        return

    # Step 5: Write master CSV
    write_master_csv(MASTER_CSV, all_records)

    # Step 6: Write metadata
    metadata = build_processing_metadata(file_stats, all_records)
    write_metadata_json(METADATA_JSON, metadata)

    # Step 7: Print summary
    print_run_summary(metadata)


if __name__ == "__main__":
    main()

What the Script Produces

After a successful run, the output/consolidated/ folder contains two files.

q1_master_sales.csv — all records from all four regional files, sorted by region, then rep name, then month. Priya can send this directly to Sandra without opening Excel.

processing_metadata.json — a machine-readable record of the run:

{
  "report_title": "Q1 Consolidated Sales Report",
  "generated_at": "2024-04-01T08:52:14",
  "generated_by": "acme_consolidate_regional_sales.py",
  "total_source_files": 4,
  "successful_files": 4,
  "error_files": 0,
  "total_records": 72,
  "total_revenue": 2177300.00,
  "revenue_by_region": {
    "East":  591900.00,
    "North": 621900.00,
    "South": 484200.00,
    "West": 479300.00
  },
  "output_file": "output/consolidated/q1_master_sales.csv",
  "file_details": [
    {
      "filename": "east_q1.csv",
      "status": "success",
      "error": "",
      "records_read": 18,
      "rows_skipped": 0
    }
  ]
}

If the South region's file is malformed, the script logs the error in file_details, continues with the other three regions, and records "error_files": 1 in the summary. A missing file is subtler: glob simply finds three files instead of four, so Priya watches the "Files read" count in the terminal summary. Either way, she and Sandra get three regions' worth of data and a clear signal rather than a crash.


The Monday Morning Workflow

Priya sets the script up to run from the command line with one command:

python acme_consolidate_regional_sales.py

Total run time: under two seconds. She sends Sandra the output CSV file path. Sandra views it directly from the shared drive. No pasting. No formatting errors. Priya's Monday mornings now start with coffee instead of spreadsheets.


What Priya Learned

Path.glob("*.csv") is the right tool for batch file discovery. glob() yields paths in an arbitrary, filesystem-dependent order, so wrapping the result in sorted() makes the order deterministic — the master CSV comes out the same way every time, which makes week-over-week anomalies easier to spot.
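The determinism point can be demonstrated in isolation. This sketch uses a temporary directory rather than the real data/regional_reports/ path, creating the four regional filenames in a deliberately scrambled order:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)

    # Create empty files in a deliberately scrambled order
    for name in ["west_q1.csv", "east_q1.csv", "south_q1.csv", "north_q1.csv"]:
        (directory / name).touch()

    # glob() order is filesystem-dependent; sorted() makes it deterministic
    csv_files = sorted(directory.glob("*.csv"))
    names = [f.name for f in csv_files]
    print(names)  # ['east_q1.csv', 'north_q1.csv', 'south_q1.csv', 'west_q1.csv']
```

Regardless of the order the files were written, the sorted result is always alphabetical.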

Validation before reading catches problems at the source. Checking that every required column exists before loading a file means a truncated or renamed column produces a clear error message, not a confusing KeyError buried in the middle of the output.
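The set-difference check at the heart of validate_csv_columns can be tried on its own. This sketch assumes a hypothetical truncated file whose header dropped the quota column:

```python
# The columns every regional file must have (same set as in the script)
REQUIRED_COLUMNS = {
    "rep_id", "rep_name", "region", "product_line",
    "units_sold", "unit_price", "revenue", "quota", "month",
}

# Header from a hypothetical truncated file: "quota" is missing
actual_columns = {
    "rep_id", "rep_name", "region", "product_line",
    "units_sold", "unit_price", "revenue", "month",
}

# Set difference names exactly what is absent
missing = REQUIRED_COLUMNS - actual_columns
if missing:
    print(f"Missing columns: {', '.join(sorted(missing))}")  # Missing columns: quota
```

The error message names the missing column directly, instead of the file failing later with a KeyError on some arbitrary row.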

JSON metadata is worth the five minutes it takes to write. When Sandra asks "why is the East revenue 3% lower than last week?", Priya can open the JSON file and immediately see exactly how many records came from each file, whether any rows were skipped, and when the script ran.
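Answering that kind of question takes only a few lines of json parsing. This sketch uses an inline string as a stand-in for the real processing_metadata.json (a toy subset of the structure shown earlier):

```python
import json

# Inline stand-in for a subset of processing_metadata.json
metadata_text = """{
  "total_records": 72,
  "revenue_by_region": {"East": 591900.00, "North": 621900.00},
  "file_details": [
    {"filename": "east_q1.csv", "status": "success",
     "records_read": 18, "rows_skipped": 0}
  ]
}"""

metadata = json.loads(metadata_text)

# How many records came from each file, and were any rows skipped?
for detail in metadata["file_details"]:
    print(f'{detail["filename"]}: {detail["records_read"]} records, '
          f'{detail["rows_skipped"]} skipped')
```

In practice Priya would call json.load on the open metadata file instead of json.loads on a string; the lookup logic is identical.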

extrasaction="ignore" in DictWriter makes output resilient. When Marcus occasionally adds an extra column to a regional file for debugging, the consolidation script continues working correctly instead of crashing on the unexpected key.
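The behavior difference is easy to see with an in-memory writer. This sketch passes a record carrying a hypothetical debug_note column that is not in the writer's fieldnames:

```python
import csv
import io

# A record with an extra key not listed in the writer's fieldnames
record = {"rep_id": "R001", "region": "North", "debug_note": "added by Marcus"}

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["rep_id", "region"],
    extrasaction="ignore",   # silently drop unexpected keys
)
writer.writeheader()
writer.writerow(record)      # the "debug_note" key is discarded, no error

print(buffer.getvalue())
# rep_id,region
# R001,North
```

With the default extrasaction="raise", the same writerow call raises ValueError; "ignore" trades strictness for resilience, which suits a report that must go out every Monday.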