Case Study 1: Priya Reconciles Invoices and Generates Branded Reports
Part A: The Invoice Reconciliation Problem
Sandra drops 47 PDF invoices from Q4 vendors on the shared drive and tells Priya she needs the totals reconciled against the PO system by end of day. It is 10:15 AM.
At Priya's current manual pace — open PDF, find total, type into spreadsheet, close, repeat — this will take until at least 2 PM and will almost certainly contain errors. Three and a half hours of Priya's time, for a task that is almost entirely mechanical.
She decides to spend 30 minutes building a script and reclaim the afternoon.
Analyzing the Invoice Population
Priya opens a random sample of ten invoices before writing any code. She is doing reconnaissance.
She notices:

- Eight of the ten use the phrase "Total Due" on the final line
- One uses "Amount Due"
- One uses "Invoice Total"
- All ten are text-based PDFs (she can select text in her PDF viewer)
- Invoice numbers are always in a consistent location near the top
- Dates vary — some show MM/DD/YYYY, some show "Month DD, YYYY"
This is a good situation. Consistent enough for automation, with predictable variation.
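Before writing the full script, a ten-second sanity check confirms that a single alternation covers the three phrasings she observed. The sample lines below are invented for the test; the pattern mirrors the one the script will use:

```python
import re

# One alternation covers all three "total" phrasings seen in the sample.
# The optional ":" and "$" handle lines like "Total Due: $12,450.00".
TOTAL_RE = re.compile(
    r"(?:total\s+due|amount\s+due|invoice\s+total)\s*:?\s*\$?\s*([\d,]+\.\d{2})",
    re.IGNORECASE,
)

samples = [
    "Total Due: $12,450.00",
    "Amount Due $8,234.50",
    "Invoice Total: 4,180.00",
]
for line in samples:
    m = TOTAL_RE.search(line)
    print(m.group(1) if m else "no match")
# → 12,450.00 / 8,234.50 / 4,180.00
```

Note that both the colon and the dollar sign must be optional and allowed together; a character class like `[:\$]?` would consume only one of them and silently miss "Total Due: $12,450.00".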
The Script
"""
reconcile_invoices.py
Extracts invoice totals from vendor PDF files and exports a reconciliation
spreadsheet matching invoices to purchase orders.
Usage:
python reconcile_invoices.py /data/invoices/q4_vendor_invoices/
python reconcile_invoices.py /data/invoices/q4_vendor_invoices/ --output reconciliation.xlsx
"""
import csv
import re
from dataclasses import dataclass, field
from pathlib import Path

import pypdf
@dataclass
class InvoiceRecord:
    filename: str
    invoice_number: str = ""
    invoice_date: str = ""
    vendor_hint: str = ""
    total_amount: float | None = None
    confidence: str = "low"
    notes: list[str] = field(default_factory=list)
def extract_text_from_pdf(pdf_path: Path) -> str:
    """Extract all text from a PDF file."""
    parts = []
    with open(pdf_path, "rb") as f:
        reader = pypdf.PdfReader(f)
        if reader.is_encrypted:
            return ""
        for page in reader.pages:
            text = page.extract_text()
            if text:
                parts.append(text)
    return "\n".join(parts)
def parse_invoice(pdf_path: Path) -> InvoiceRecord:
    """Extract key fields from a vendor invoice PDF."""
    record = InvoiceRecord(filename=pdf_path.name)
    text = extract_text_from_pdf(pdf_path)
    if not text.strip():
        record.notes.append("No text layer — possible scanned image")
        return record

    # ── TOTAL AMOUNT ──────────────────────────────────────────────────────
    total_patterns = [
        # Most specific first — "Total Due" is the most reliable signal.
        # Both ":" and "$" are optional and may appear together,
        # e.g. "Total Due: $12,450.00".
        (r"(?:total\s+due|amount\s+due|balance\s+due)\s*:?\s*\$?\s*([\d,]+\.?\d*)", "total due"),
        (r"(?:invoice\s+total|grand\s+total)\s*:?\s*\$?\s*([\d,]+\.?\d*)", "invoice total"),
        (r"\btotal\b\s*:?\s*\$?\s*([\d,]+\.\d{2})\b", "total (with cents)"),
    ]
    for pattern, label in total_patterns:
        m = re.search(pattern, text, re.IGNORECASE)
        if m:
            try:
                record.total_amount = float(m.group(1).replace(",", ""))
                record.notes.append(f"Total found via '{label}' pattern")
                break
            except ValueError:
                continue
    if record.total_amount is None:
        record.notes.append("Total not found — manual review needed")

    # ── INVOICE NUMBER ────────────────────────────────────────────────────
    for pattern in [
        r"invoice\s+(?:no\.?|number|#)\s*[:\s]?\s*([A-Z0-9\-]{4,20})",
        r"inv[.\s#]*([A-Z0-9\-]{6,20})",
    ]:
        m = re.search(pattern, text, re.IGNORECASE)
        if m:
            record.invoice_number = m.group(1).strip()
            break

    # ── DATE ──────────────────────────────────────────────────────────────
    for pattern in [
        r"(?:invoice\s+date|date\s+issued)\s*[:\s]+(\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4})",
        r"(?:invoice\s+date|date)\s*[:\s]+(\w+\s+\d{1,2},\s+\d{4})",
    ]:
        m = re.search(pattern, text, re.IGNORECASE)
        if m:
            record.invoice_date = m.group(1).strip()
            break

    # ── VENDOR HINT (first line of text, often the vendor name) ───────────
    first_lines = [line.strip() for line in text.split("\n") if line.strip()]
    if first_lines:
        # The vendor name is often the first substantial text in the document
        record.vendor_hint = first_lines[0][:50]  # Cap at 50 chars

    # ── CONFIDENCE ────────────────────────────────────────────────────────
    found = sum([
        record.total_amount is not None,  # "is not None" so a $0.00 total still counts
        bool(record.invoice_number),
        bool(record.invoice_date),
    ])
    if found == 3:
        record.confidence = "high"
    elif found >= 1:
        record.confidence = "medium"
    else:
        record.confidence = "low"
    return record
def process_invoice_folder(invoices_dir: Path) -> list[InvoiceRecord]:
    """Process all PDFs in a folder and return extraction results."""
    pdf_files = sorted(invoices_dir.glob("*.pdf"))
    if not pdf_files:
        print(f"No PDF files found in: {invoices_dir}")
        return []
    print(f"Processing {len(pdf_files)} invoice PDFs...\n")
    results = []
    for pdf_file in pdf_files:
        record = parse_invoice(pdf_file)
        total_str = (
            f"${record.total_amount:,.2f}"
            if record.total_amount is not None
            else "NOT FOUND"
        )
        print(
            f"  {pdf_file.name:<45} "
            f"{total_str:<16} "
            f"[{record.confidence}]"
        )
        results.append(record)
    return results
def export_to_csv(records: list[InvoiceRecord], output_path: Path) -> None:
    """
    Export extraction results to a CSV for review and reconciliation.
    High-confidence records go in the main CSV.
    Low/medium confidence go in a separate '_review_needed' CSV.
    """
    confirmed = [r for r in records if r.confidence == "high"]
    review = [r for r in records if r.confidence != "high"]
    fieldnames = [
        "filename", "invoice_number", "invoice_date",
        "vendor_hint", "total_amount", "confidence", "notes",
    ]

    def write_csv(path: Path, subset: list[InvoiceRecord]) -> None:
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            for record in subset:
                writer.writerow({
                    "filename": record.filename,
                    "invoice_number": record.invoice_number,
                    "invoice_date": record.invoice_date,
                    "vendor_hint": record.vendor_hint,
                    "total_amount": record.total_amount,
                    "confidence": record.confidence,
                    "notes": "; ".join(record.notes),
                })

    # Main output
    write_csv(output_path, confirmed)

    # Review queue
    if review:
        review_path = output_path.parent / (output_path.stem + "_review_needed.csv")
        write_csv(review_path, review)
        print(f"\nReview queue: {review_path}")
    print(f"Confirmed results: {output_path}")
def print_summary(records: list[InvoiceRecord]) -> None:
    """Print a processing summary."""
    high = [r for r in records if r.confidence == "high"]
    medium = [r for r in records if r.confidence == "medium"]
    low = [r for r in records if r.confidence == "low"]
    total_confirmed = sum(r.total_amount or 0 for r in high)
    print("\n" + "=" * 60)
    print("EXTRACTION SUMMARY")
    print("=" * 60)
    print(f"  Total invoices:           {len(records)}")
    print(f"  High confidence:          {len(high)}")
    print(f"  Medium confidence:        {len(medium)}")
    print(f"  Low confidence (manual):  {len(low)}")
    print(f"  Sum (high confidence):    ${total_confirmed:,.2f}")
    print("=" * 60)
    if low or medium:
        print(f"\n  {len(low) + len(medium)} invoices require manual review.")
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Extract and reconcile vendor invoice PDFs")
    parser.add_argument("invoices_dir", type=Path)
    parser.add_argument("--output", type=Path, default=Path("invoice_reconciliation.csv"))
    args = parser.parse_args()

    records = process_invoice_folder(args.invoices_dir.resolve())
    if records:
        print_summary(records)
        export_to_csv(records, args.output.resolve())
Results
Priya runs the script at 10:45 AM. It processes all 47 PDFs in 12 seconds.
Processing 47 invoice PDFs...

  acme_supplies_inv_2024_1042.pdf               $12,450.00       [high]
  allied_packaging_Q42024.pdf                   $8,234.50        [high]
  benson_tech_invoice_dec.pdf                   $4,180.00        [high]
  central_logistics_12_2024.pdf                 NOT FOUND        [low]
  coastal_cleaning_Q4.pdf                       $2,890.25        [medium]
  ...

============================================================
EXTRACTION SUMMARY
============================================================
  Total invoices:           47
  High confidence:          38
  Medium confidence:        4
  Low confidence (manual):  5
  Sum (high confidence):    $187,432.10
============================================================
Five invoices need manual review. Priya opens each one. Three are scanned images (she can confirm this because text selection fails in her PDF viewer). Two have unusual layouts where "Net Amount Payable" is used instead of "Total Due."
She adjusts the two unusual layouts manually in the CSV, adds the three scanned invoice totals by hand, and sends the reconciled file to finance by noon. The total cost: about 30 minutes to write the script, 12 seconds of runtime, and 20 minutes of manual review for the five edge cases. Total: ~50 minutes versus the projected 3.5 hours.
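The two "Net Amount Payable" layouts also point to a cheap improvement for next quarter: one more entry appended to the script's total_patterns list. A sketch (the dollar figure below is invented for illustration):

```python
import re

# Candidate pattern for the "Net Amount Payable" layout seen in two invoices;
# same optional ":" and "$" handling as the existing patterns.
NET_PAYABLE = (
    r"net\s+amount\s+payable\s*:?\s*\$?\s*([\d,]+\.\d{2})",
    "net amount payable",
)

m = re.search(NET_PAYABLE[0], "Net Amount Payable: $3,075.40", re.IGNORECASE)
print(m.group(1))  # → 3,075.40
```

Because total_patterns is ordered most-specific-first, the new entry can go anywhere above the generic "total" fallback without changing behavior for the other 45 invoices.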
Part B: Generating Branded Regional Reports for Sandra
Sandra needs eight regional Q4 sales reports by Friday. Each report has the same structure: cover page, executive summary, key highlights, product performance table, and notes section. The content differs by region.
Priya has the regional data in a pandas DataFrame from the acme_sales_2023.csv file. She needs to turn that data into eight .docx files.
The Generation Script
"""
generate_regional_reports.py
Generate branded Word reports for all Acme Corp regions from sales data.
Usage:
python generate_regional_reports.py
python generate_regional_reports.py --template /templates/report_template.docx
"""
import datetime
from pathlib import Path
import docx
import pandas
from docx.shared import Inches, Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
OUTPUT_DIR = Path("/data/reports/q4_2024_regional")
SALES_DATA = Path("/data/acme/acme_sales_2023.csv")
REGIONS = [
    "Chicago", "Nashville", "Cincinnati", "St. Louis",
    "Memphis", "Louisville", "Indianapolis", "Detroit",
]

COLOR_DARK_BLUE = RGBColor(0x00, 0x35, 0x6B)
COLOR_GRAY = RGBColor(0x44, 0x44, 0x44)
COLOR_LIGHT_GRAY = RGBColor(0x99, 0x99, 0x99)
def load_regional_summary(csv_path: Path, region: str) -> dict:
    """Load and aggregate sales data for one region."""
    df = pandas.read_csv(csv_path)

    # Filter to Q4 (months 10, 11, 12) and the specific region
    df["date"] = pandas.to_datetime(df["date"])
    q4_mask = (df["date"].dt.month >= 10) & (df["region"] == region)
    region_df = df[q4_mask].copy()
    if region_df.empty:
        return {}

    total_revenue = region_df["revenue"].sum()
    total_units = region_df["units_sold"].sum()
    product_summary = (
        region_df.groupby("product")
        .agg(
            revenue=("revenue", "sum"),
            units=("units_sold", "sum"),
        )
        .reset_index()
        .sort_values("revenue", ascending=False)
    )

    # Compute mock target variances for illustration
    # (In production, join against a targets table)
    product_rows = []
    for _, row in product_summary.iterrows():
        product_rows.append([
            row["product"],
            f"${row['revenue']:,.0f}",
            f"{row['units']:,}",
            "+/--%",  # Would come from targets data in production
        ])

    return {
        "region": region,
        "total_revenue": f"${total_revenue:,.0f}",
        "total_units": f"{total_units:,}",
        "product_rows": product_rows,
    }
def build_report(region_data: dict, output_path: Path) -> Path:
    """Generate one Word report for one region."""
    region = region_data["region"]
    doc = docx.Document()

    # Page margins
    for section in doc.sections:
        section.top_margin = Inches(1.0)
        section.bottom_margin = Inches(1.0)
        section.left_margin = Inches(1.25)
        section.right_margin = Inches(1.25)

    # Cover
    title_para = doc.add_paragraph()
    title_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
    title_run = title_para.add_run("Q4 2024 Regional Sales Review")
    title_run.font.size = Pt(22)
    title_run.font.bold = True
    title_run.font.color.rgb = COLOR_DARK_BLUE

    sub_para = doc.add_paragraph()
    sub_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
    sub_run = sub_para.add_run(f"{region} Region | October – December 2024")
    sub_run.font.size = Pt(14)
    sub_run.font.color.rgb = COLOR_GRAY

    doc.add_paragraph()
    meta_para = doc.add_paragraph()
    meta_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
    meta_run = meta_para.add_run(
        f"Prepared by Priya Okonkwo, Analytics\n"
        f"{datetime.date.today().strftime('%B %d, %Y')}\n"
        f"Acme Corp Confidential"
    )
    meta_run.font.size = Pt(11)
    meta_run.font.italic = True
    meta_run.font.color.rgb = COLOR_GRAY
    doc.add_page_break()

    # Summary
    doc.add_heading("Executive Summary", level=1)
    doc.add_paragraph(
        f"The {region} region delivered Q4 2024 total revenue of "
        f"{region_data.get('total_revenue', 'N/A')}, representing "
        f"{region_data.get('total_units', 'N/A')} units sold across all product "
        f"categories. Full year-over-year comparison and variance analysis "
        f"are available in the attached data appendix."
    )
    doc.add_paragraph()

    # Product table
    doc.add_heading("Product Category Performance", level=1)
    if region_data.get("product_rows"):
        headers = ["Product Category", "Q4 Revenue", "Units Sold", "vs Target"]
        table = doc.add_table(
            rows=1 + len(region_data["product_rows"]),
            cols=len(headers),
        )
        table.style = "Light Shading Accent 1"
        for col_i, header in enumerate(headers):
            cell = table.rows[0].cells[col_i]
            cell.text = header
            for run in cell.paragraphs[0].runs:
                run.font.bold = True
        for row_i, row_data in enumerate(region_data["product_rows"], start=1):
            for col_i, value in enumerate(row_data):
                table.rows[row_i].cells[col_i].text = str(value)

    output_path.parent.mkdir(parents=True, exist_ok=True)
    doc.save(str(output_path))
    return output_path
def main():
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    created = []
    print(f"Generating reports for {len(REGIONS)} regions...\n")
    for region in REGIONS:
        print(f"  Processing: {region}...")
        region_data = load_regional_summary(SALES_DATA, region)
        if not region_data:
            print(f"    No data found for {region} — skipping")
            continue
        safe_name = region.lower().replace(" ", "_").replace(".", "")
        output_path = OUTPUT_DIR / f"q4_2024_{safe_name}_report.docx"
        created_path = build_report(region_data, output_path)
        print(f"    Created: {created_path.name}")
        created.append(created_path)
    print(f"\nDone. {len(created)} reports created in: {OUTPUT_DIR}")


if __name__ == "__main__":
    main()
Results
Priya runs the generation script on Thursday afternoon. It finishes in under two seconds. Eight .docx files appear in the output folder, each correctly populated with the region's data and branded with the Acme Corp title formatting.
She opens each one in Word, reads through the summaries, and finds one minor issue: the "vs Target" column shows "+/--%" because the targets table is not yet joined in. She notes this for the next iteration — the current script is still dramatically faster than creating eight documents by hand.
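Filling the "vs Target" column in that next iteration would be a small pandas merge against a targets table. A sketch, assuming a hypothetical targets source with product and target_revenue columns (the numbers below are invented):

```python
import pandas

# Hypothetical Q4 actuals and targets; in the real script these would come
# from product_summary and a targets CSV respectively.
actuals = pandas.DataFrame({
    "product": ["Widgets", "Gears"],
    "revenue": [120_000, 80_000],
})
targets = pandas.DataFrame({
    "product": ["Widgets", "Gears"],
    "target_revenue": [100_000, 100_000],
})

merged = actuals.merge(targets, on="product", how="left")
# Variance vs target as a signed percentage string, e.g. "+20.0%"
merged["vs_target"] = (
    (merged["revenue"] / merged["target_revenue"] - 1) * 100
).map(lambda v: f"{v:+.1f}%")
print(merged["vs_target"].tolist())  # → ['+20.0%', '-20.0%']
```

A left merge keeps every product row even when a target is missing; missing targets surface as NaN, which is itself a useful review signal.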
She sends the reports to Sandra by 4:00 PM Thursday. Sandra asks for one additional column in the product table. Priya adds it to the script and regenerates all eight reports in seconds.
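Sandra's extra column illustrates why the change is so cheap: it touches exactly two places, the headers list and the per-product row. A sketch with an assumed average-price column (the helper name and figures are hypothetical):

```python
# Two touchpoints: the table headers and the row-building logic.
headers = ["Product Category", "Q4 Revenue", "Units Sold", "Avg Price", "vs Target"]

def build_row(product: str, revenue: float, units: int) -> list[str]:
    """Build one product table row, including the new average-price cell."""
    avg_price = revenue / units if units else 0.0
    return [
        product,
        f"${revenue:,.0f}",
        f"{units:,}",
        f"${avg_price:,.2f}",  # the new column
        "+/--%",
    ]

print(build_row("Widgets", 120000.0, 4800))
# → ['Widgets', '$120,000', '4,800', '$25.00', '+/--%']
```

Because build_report() sizes the table from len(headers) and len(product_rows[0]), the rest of the generation code needs no changes.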
What Priya Learned
Pattern-matching confidence levels are load-bearing. The three-tier confidence system (high/medium/low) is the difference between useful automation and dangerous automation. Routing low-confidence results to a review queue meant no data was silently wrong.
PDF extraction is a starting point, not an endpoint. The 38 high-confidence extractions handled themselves. The 9 remaining required human judgment. The script's job was to eliminate the mechanical work, not eliminate the human entirely.
Template-based document generation scales linearly. Once the build_report() function was working correctly for one region, generating reports for all eight regions took no more human effort than generating one. This is the compounding return on automation investment.
Iteration is free when the code is right. Sandra's request for an additional column would have meant manually editing eight Word documents. With the generation script, it meant changing a few lines of code and rerunning the script.