Case Study 1: Priya Consolidates the Regional Sales Reports
Characters: Priya Okonkwo (Analyst, Acme Corp), Sandra Chen (VP Sales), Marcus Webb (IT)
Concepts: pathlib, glob, csv.DictReader, csv.DictWriter, json.dump, processing metadata
The Situation
Every Monday morning at 8:45 a.m., Priya Okonkwo's inbox fills with the same message from Marcus Webb in IT: "Regional files are in the shared drive." Four CSV files — one for each of Acme Corp's sales regions — land in a network folder called data/regional_reports/. Each file covers the previous week's sales data.
The current process: Priya opens each file in Excel, copies the data, pastes it into a master workbook, saves, and emails it to Sandra Chen by 10:00 a.m. It takes an hour. The pasting introduces errors. Sandra has sent back corrections twice this month alone.
Sandra stopped by Priya's desk on Friday. "I need the consolidated report by nine, not ten. Can you do anything about that?" Sandra's tone was polite. Her deadline was not.
Priya has been working through this Python textbook for three weeks. She opens her laptop on Sunday evening and starts thinking through the problem.
Breaking Down the Problem
Priya sketches out what the script needs to do:
- Look in data/regional_reports/ and find all CSV files
- Read each one, collecting all rows
- Validate that each file has the expected columns (fail loudly if something is wrong, rather than silently producing a bad report)
- Write all rows into a single output/q1_consolidated.csv
- Write a JSON metadata file recording which files were processed, how many records came from each, and whether any errors occurred
- Print a clear summary to the terminal so Priya knows at a glance that everything worked
She also thinks about failure modes — what happens when one region's file is missing? Or malformed? She wants the script to handle a bad file gracefully: log the error, skip that file, and still produce output for the remaining three regions.
The Regional CSV Format
Each regional file has the same structure (Priya confirmed this with Marcus). Here is a sample from north_q1.csv:
rep_id,rep_name,region,product_line,units_sold,unit_price,revenue,quota,month
R001,Daniel Park,North,Enterprise,42,4500.00,189000.00,175000.00,January
R002,Fatima Hassan,North,SMB,88,1200.00,105600.00,100000.00,January
R001,Daniel Park,North,Enterprise,50,4500.00,225000.00,190000.00,February
...
The region field is always consistent within a file, but Priya wants to keep it — Sandra sometimes filters by region in the consolidated report.
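A quick sketch of what csv.DictReader does with this format: each data row becomes a dict keyed by the header, and every value arrives as a string until the script converts it. (The sample rows are inlined here via io.StringIO so the snippet runs on its own; the real file lives on the shared drive.)

```python
import csv
import io

# Two lines in the regional format, inlined for a self-contained demo.
sample = (
    "rep_id,rep_name,region,product_line,units_sold,unit_price,revenue,quota,month\n"
    "R001,Daniel Park,North,Enterprise,42,4500.00,189000.00,175000.00,January\n"
)

reader = csv.DictReader(io.StringIO(sample))
row = next(reader)
print(row["rep_name"])    # Daniel Park
print(row["units_sold"])  # 42 -- but as the string "42" until converted
```

This is why the script's reading step calls int() and float() on the numeric fields: DictReader never guesses types.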
The Script
"""
acme_consolidate_regional_sales.py
Priya Okonkwo — Acme Corp Analytics
Consolidates Q1 regional sales CSVs into a single master report.
"""
import csv
import json
from datetime import datetime
from pathlib import Path
# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
REGIONAL_DIR = Path("data") / "regional_reports"
OUTPUT_DIR = Path("output") / "consolidated"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
MASTER_CSV = OUTPUT_DIR / "q1_master_sales.csv"
METADATA_JSON = OUTPUT_DIR / "processing_metadata.json"
# Every regional file must have exactly these columns.
REQUIRED_COLUMNS = {
"rep_id", "rep_name", "region", "product_line",
"units_sold", "unit_price", "revenue", "quota", "month",
}
# Output column order (controls the consolidated CSV structure)
OUTPUT_FIELDNAMES = [
"rep_id", "rep_name", "region", "product_line",
"units_sold", "unit_price", "revenue", "quota", "month",
]
# ---------------------------------------------------------------------------
# Step 1: Discover regional CSV files
# ---------------------------------------------------------------------------
def find_regional_files(directory: Path) -> list[Path]:
"""
Return a sorted list of all CSV files in the regional reports directory.
Raises FileNotFoundError if the directory itself does not exist.
"""
if not directory.exists():
raise FileNotFoundError(
f"Regional reports directory not found: {directory}\n"
"Check that the network drive is mounted and the path is correct."
)
csv_files = sorted(directory.glob("*.csv"))
if not csv_files:
raise FileNotFoundError(
f"No CSV files found in {directory}.\n"
"The regional files may not have been uploaded yet."
)
print(f"[discover] Found {len(csv_files)} CSV files in {directory}:")
for f in csv_files:
size_kb = f.stat().st_size / 1024
print(f" {f.name:<30} {size_kb:6.1f} KB")
print()
return csv_files
# ---------------------------------------------------------------------------
# Step 2: Validate a single regional file
# ---------------------------------------------------------------------------
def validate_csv_columns(file_path: Path) -> tuple[bool, str]:
"""
Check that the CSV's header row contains all required columns.
Returns
-------
(True, "") if valid
(False, error_message) if a column is missing
"""
with open(file_path, mode="r", newline="", encoding="utf-8") as csv_file:
reader = csv.DictReader(csv_file)
actual_columns = set(reader.fieldnames or [])
missing = REQUIRED_COLUMNS - actual_columns
if missing:
return False, f"Missing columns: {', '.join(sorted(missing))}"
return True, ""
# ---------------------------------------------------------------------------
# Step 3: Read one regional file and return typed records
# ---------------------------------------------------------------------------
def read_regional_csv(file_path: Path) -> tuple[list[dict], int]:
    """
    Read a regional CSV file and return (records, skipped_count).
    Numeric fields are converted from strings; invalid rows are skipped.
    """
records = []
skipped = 0
with open(file_path, mode="r", newline="", encoding="utf-8") as csv_file:
reader = csv.DictReader(csv_file)
for row_number, row in enumerate(reader, start=2): # start=2: header is row 1
try:
records.append({
"rep_id": row["rep_id"].strip(),
"rep_name": row["rep_name"].strip(),
"region": row["region"].strip(),
"product_line": row["product_line"].strip(),
"units_sold": int(row["units_sold"]),
"unit_price": float(row["unit_price"]),
"revenue": float(row["revenue"]),
"quota": float(row["quota"]),
"month": row["month"].strip(),
})
except (ValueError, KeyError) as parse_error:
print(
f" [skip] Row {row_number} in {file_path.name}: "
f"{parse_error}"
)
skipped += 1
return records, skipped
# ---------------------------------------------------------------------------
# Step 4: Consolidate all regional files
# ---------------------------------------------------------------------------
def consolidate_regional_files(csv_files: list[Path]) -> dict:
"""
Validate and read all regional CSV files.
Returns a dict with combined records and per-file processing stats.
"""
all_records = []
file_stats = []
for csv_path in csv_files:
print(f"[process] {csv_path.name}")
# Validate columns before reading
is_valid, error_message = validate_csv_columns(csv_path)
if not is_valid:
print(f" [ERROR] Validation failed — {error_message}")
print(f" [skip] Skipping {csv_path.name} entirely.")
file_stats.append({
"filename": csv_path.name,
"status": "error",
"error": error_message,
"records_read": 0,
"rows_skipped": 0,
})
continue
# Read and type-convert records
records, skipped_count = read_regional_csv(csv_path)
all_records.extend(records)
print(
f" [ok] {len(records)} records read"
+ (f", {skipped_count} rows skipped" if skipped_count > 0 else "")
)
file_stats.append({
"filename": csv_path.name,
"status": "success",
"error": "",
"records_read": len(records),
"rows_skipped": skipped_count,
})
print()
return {
"all_records": all_records,
"file_stats": file_stats,
}
# ---------------------------------------------------------------------------
# Step 5: Write the consolidated master CSV
# ---------------------------------------------------------------------------
def write_master_csv(file_path: Path, records: list[dict]) -> None:
"""
Write all records to the master consolidated CSV.
Sorted by region, then rep_name, then month for consistent output.
"""
month_order = {
"January": 1, "February": 2, "March": 3, "April": 4,
"May": 5, "June": 6, "July": 7, "August": 8,
"September": 9, "October": 10, "November": 11, "December": 12,
}
sorted_records = sorted(
records,
key=lambda r: (r["region"], r["rep_name"], month_order.get(r["month"], 99)),
)
with open(file_path, mode="w", newline="", encoding="utf-8") as csv_file:
writer = csv.DictWriter(
csv_file,
fieldnames=OUTPUT_FIELDNAMES,
extrasaction="ignore",
)
writer.writeheader()
writer.writerows(sorted_records)
print(f"[write] Master CSV: {len(sorted_records)} records → {file_path}")
# ---------------------------------------------------------------------------
# Step 6: Build and write processing metadata (JSON)
# ---------------------------------------------------------------------------
def build_processing_metadata(
file_stats: list[dict],
all_records: list[dict],
) -> dict:
"""
Assemble a metadata summary of the consolidation run.
"""
# Compute revenue totals per region from the consolidated data
revenue_by_region: dict[str, float] = {}
for record in all_records:
region = record["region"]
revenue_by_region[region] = revenue_by_region.get(region, 0.0) + record["revenue"]
successful_files = [f for f in file_stats if f["status"] == "success"]
error_files = [f for f in file_stats if f["status"] == "error"]
return {
"report_title": "Q1 Consolidated Sales Report",
"generated_at": datetime.now().isoformat(timespec="seconds"),
"generated_by": "acme_consolidate_regional_sales.py",
"total_source_files": len(file_stats),
"successful_files": len(successful_files),
"error_files": len(error_files),
"total_records": len(all_records),
"total_revenue": round(sum(r["revenue"] for r in all_records), 2),
"revenue_by_region": {k: round(v, 2) for k, v in sorted(revenue_by_region.items())},
"output_file": str(MASTER_CSV),
"file_details": file_stats,
}
def write_metadata_json(file_path: Path, metadata: dict) -> None:
"""Write the processing metadata dict to a JSON file."""
with open(file_path, mode="w", encoding="utf-8") as json_file:
json.dump(metadata, json_file, indent=2)
print(f"[write] Metadata JSON → {file_path}")
# ---------------------------------------------------------------------------
# Step 7: Print a human-readable summary to the terminal
# ---------------------------------------------------------------------------
def print_run_summary(metadata: dict) -> None:
"""Print a formatted summary of the consolidation run."""
print()
print("=" * 60)
print(" ACME CORP — Q1 SALES CONSOLIDATION COMPLETE")
print("=" * 60)
print(f" Generated : {metadata['generated_at']}")
print(f" Files read : {metadata['successful_files']} of {metadata['total_source_files']}")
if metadata["error_files"] > 0:
print(f" *** {metadata['error_files']} FILE(S) HAD ERRORS — SEE METADATA JSON ***")
print(f" Total records: {metadata['total_records']:,}")
print(f" Total revenue: ${metadata['total_revenue']:>12,.2f}")
print()
print(" Revenue by Region:")
for region, revenue in metadata["revenue_by_region"].items():
bar = "#" * int(revenue / 20000) # one # per $20,000
print(f" {region:<8} ${revenue:>12,.2f} {bar}")
print()
print(f" Output: {metadata['output_file']}")
print("=" * 60)
# ---------------------------------------------------------------------------
# Main entry point
# ---------------------------------------------------------------------------
def main() -> None:
print("=" * 60)
print(" Acme Corp — Regional Sales Consolidation Script")
print(f" Started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("=" * 60)
print()
# Step 1: Find files
csv_files = find_regional_files(REGIONAL_DIR)
# Steps 2–4: Validate and consolidate
consolidation_result = consolidate_regional_files(csv_files)
all_records = consolidation_result["all_records"]
file_stats = consolidation_result["file_stats"]
if not all_records:
print("ERROR: No records were successfully read. Aborting.")
return
# Step 5: Write master CSV
write_master_csv(MASTER_CSV, all_records)
# Step 6: Write metadata
metadata = build_processing_metadata(file_stats, all_records)
write_metadata_json(METADATA_JSON, metadata)
# Step 7: Print summary
print_run_summary(metadata)
if __name__ == "__main__":
main()
What the Script Produces
After a successful run, the output/consolidated/ folder contains two files.
q1_master_sales.csv — all records from all four regional files, sorted by region, then rep name, then month. Priya can send this directly to Sandra without opening Excel.
processing_metadata.json — a machine-readable record of the run:
{
"report_title": "Q1 Consolidated Sales Report",
"generated_at": "2024-04-01T08:52:14",
"generated_by": "acme_consolidate_regional_sales.py",
"total_source_files": 4,
"successful_files": 4,
"error_files": 0,
"total_records": 72,
  "total_revenue": 2177300.0,
  "revenue_by_region": {
    "East": 591900.0,
    "North": 621900.0,
    "South": 484200.0,
    "West": 479300.0
  },
"output_file": "output/consolidated/q1_master_sales.csv",
"file_details": [
{
"filename": "east_q1.csv",
"status": "success",
"error": "",
"records_read": 18,
"rows_skipped": 0
}
]
}
If the South region's file is missing or malformed, the script logs the error in file_details, continues with the other three regions, and records "error_files": 1 in the summary. Priya and Sandra get three regions' worth of data and a clear error message rather than a crash.
The Monday Morning Workflow
Priya sets the script up to run from the command line with one command:
python acme_consolidate_regional_sales.py
Total run time: under two seconds. She sends Sandra the output CSV file path. Sandra views it directly from the shared drive. No pasting. No formatting errors. Priya's Monday mornings now start with coffee instead of spreadsheets.
What Priya Learned
Path.glob("*.csv") is the right tool for batch file discovery. Sorting the result with sorted() ensures the order is deterministic — the master CSV comes out in the same order every time, making it easier to spot anomalies between weeks.
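A minimal sketch of the pattern, run in a throwaway directory (created with tempfile so the snippet works anywhere; the file names are invented for the demo):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    reports = Path(tmp)
    # Create a few dummy files, deliberately out of order, plus a non-CSV.
    for name in ("west_q1.csv", "north_q1.csv", "east_q1.csv", "notes.txt"):
        (reports / name).write_text("rep_id\n")

    # glob() matches only *.csv and yields paths in no guaranteed order;
    # sorted() makes the result deterministic.
    found = [p.name for p in sorted(reports.glob("*.csv"))]
    print(found)  # ['east_q1.csv', 'north_q1.csv', 'west_q1.csv']
```

Note that notes.txt never appears — the pattern does the filtering, so no manual extension check is needed.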
Validation before reading catches problems at the source. Checking that every required column exists before loading a file means a truncated or renamed column produces a clear error message, not a confusing KeyError buried in the middle of the output.
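The heart of that check is one set subtraction, sketched here with a trimmed column set (REQUIRED and header are illustration names, not from the script):

```python
# Trimmed-down version of the check in validate_csv_columns.
REQUIRED = {"rep_id", "rep_name", "region", "month"}

# Imagine a truncated file whose header lost the "region" column.
header = ["rep_id", "rep_name", "month"]

missing = REQUIRED - set(header)  # set difference: required but absent
if missing:
    print(f"Missing columns: {', '.join(sorted(missing))}")  # Missing columns: region
```

Because the check reads only the header, it costs almost nothing and runs before any row of data is trusted.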
JSON metadata is worth the five minutes it takes to write. When Sandra asks "why is the East revenue 3% lower than last week?", Priya can open the JSON file and immediately see exactly how many records came from each file, whether any rows were skipped, and when the script ran.
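That round trip is symmetric: whatever json.dump writes, json.load reads back as the same Python structure. A self-contained sketch with a trimmed metadata dict (written to an io.StringIO buffer here instead of the real file):

```python
import io
import json

# A trimmed version of the script's metadata dict.
metadata = {
    "total_records": 72,
    "file_details": [
        {"filename": "east_q1.csv", "records_read": 18, "rows_skipped": 0},
    ],
}

buffer = io.StringIO()
json.dump(metadata, buffer, indent=2)  # write, as the script does
buffer.seek(0)

restored = json.load(buffer)           # read back, as Priya does when asked
print(restored["file_details"][0]["records_read"])  # 18
```

Answering Sandra's question is then a dict lookup, not an archaeology dig through last week's terminal output.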
extrasaction="ignore" in DictWriter makes output resilient. When Marcus occasionally adds an extra column to a regional file for debugging, the consolidation script continues working correctly instead of crashing on the unexpected key.
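A small demonstration of that resilience, with a trimmed fieldnames list and an invented debug_note key standing in for Marcus's extra column (output goes to an io.StringIO buffer so the snippet is self-contained):

```python
import csv
import io

fieldnames = ["rep_id", "rep_name", "region"]  # trimmed for the example

# One row carrying an extra key the master report doesn't want.
row = {"rep_id": "R001", "rep_name": "Daniel Park", "region": "North",
       "debug_note": "checked by hand"}

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
writer.writeheader()
writer.writerow(row)   # debug_note is silently dropped
print(out.getvalue())
```

With the default extrasaction="raise", that same writerow() call raises ValueError on the unexpected key — exactly the crash the setting avoids.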