Case Study 2: Maya's Invoice Calculator Meets Real-World Data

DataField.Dev

Case Study 2: Maya's Invoice Calculator Meets Real-World Data

The Spreadsheet That Broke Everything

Maya Reyes runs a boutique consulting practice — two or three clients at a time, specialized financial systems work, billed by the hour. For years she kept her invoices in a spreadsheet, manually totaling columns and copying figures into Word documents. Last spring she automated the whole thing with a Python script.

The script was clean, well-organized, and worked perfectly on her own data. Then she got a new client — a mid-sized logistics company with messy internal bookkeeping practices — and her spreadsheet template arrived with fields filled in by their project coordinator, not by Maya herself.

On the first invoice run of October, the script died with this:

Traceback (most recent call last):
  File "invoice_calculator.py", line 34, in generate_invoice
    billable_hours = float(row["hours"])
ValueError: could not convert string to float: 'N/A'

Their project coordinator had typed "N/A" in the hours field for a row where the work item was still awaiting sign-off. Completely reasonable from their perspective. Catastrophic for Maya's script.

The Original Invoice Calculator

# invoice_calculator.py (original — no error handling)
import csv
from pathlib import Path

HOURLY_RATE = 225.00   # Maya's standard rate

def load_timesheet(filepath: str) -> list[dict]:
    """Load a timesheet CSV and return rows as dictionaries."""
    with open(filepath) as f:
        return list(csv.DictReader(f))


def calculate_invoice(rows: list[dict]) -> dict:
    """
    Calculate invoice totals from timesheet rows.
    Expected columns: date, description, hours, rate_override
    """
    line_items = []
    total = 0.0

    for row in rows:
        hours = float(row["hours"])                          # CRASH: "N/A"
        rate = float(row.get("rate_override") or HOURLY_RATE)
        amount = hours * rate
        total += amount
        line_items.append({
            "date": row["date"],
            "description": row["description"],
            "hours": hours,
            "rate": rate,
            "amount": amount,
        })

    return {"line_items": line_items, "total": total}


def generate_invoice(client_name: str, timesheet_path: str) -> None:
    """Generate and print an invoice for a client."""
    rows = load_timesheet(timesheet_path)
    invoice = calculate_invoice(rows)

    print(f"\nINVOICE — {client_name}")
    print("-" * 50)
    for item in invoice["line_items"]:
        print(
            f"{item['date']}  {item['description'][:30]:<30}  "
            f"{item['hours']:>5.1f}h  ${item['amount']:>8,.2f}"
        )
    print("-" * 50)
    print(f"{'TOTAL':>45}  ${invoice['total']:>8,.2f}")


if __name__ == "__main__":
    generate_invoice("Hartwell Logistics", "hartwell_october.csv")

The script assumed every hours field would be a number. The moment it encountered "N/A", it stopped — and Maya had no invoice at all for a client she was due to bill that week.

Diagnosing the Problem Space

Maya thought through all the ways the hours field could be problematic:

"N/A" — sign-off pending
"" — field left blank entirely
"TBD" — someone wrote text instead of a number
"2 hours" — someone added the unit
"-1.5" — negative, which makes no sense
"999" — implausibly large (might be data entry error)

And other fields could have problems too:

Missing date or description
rate_override left blank (acceptable — fall back to default) versus rate_override containing text (not acceptable)

She decided on a design principle: the script should log any problematic row and skip it, then produce the invoice with the valid rows and a clear note that some rows were excluded. An invoice with an explicit "3 rows excluded — see log" note is far better than no invoice at all.

Adding Custom Exceptions First

Maya started by creating a small exception class to give her errors a meaningful name:

class TimesheetValidationError(ValueError):
    """
    Raised when a timesheet row fails validation.

    Carries enough context to write a useful log message and,
    if needed, include the problem in the invoice itself.
    """
    def __init__(self, row_description: str, field: str, value: str, reason: str) -> None:
        self.row_description = row_description
        self.field = field
        self.value = value
        self.reason = reason
        super().__init__(
            f"Row '{row_description}' — field '{field}' "
            f"(value: {value!r}): {reason}"
        )

The Refactored Invoice Calculator

# invoice_calculator.py (refactored — robust and production-ready)
"""
Invoice Calculator for Maya Reyes Consulting
============================================
Reads a timesheet CSV, validates each row, and generates a formatted invoice.

Rows with invalid data are logged and skipped — the invoice is produced
from the valid rows, with a note showing how many rows were excluded
and why.

Expected CSV columns:
    date, description, hours, rate_override (optional)

Usage:
    python invoice_calculator.py --client "Hartwell Logistics" \
                                  --timesheet hartwell_october.csv
"""

import csv
import logging
import argparse
from dataclasses import dataclass, field
from pathlib import Path
from datetime import datetime


# ── Constants ──────────────────────────────────────────────────────────────

DEFAULT_HOURLY_RATE: float = 225.00
MAX_PLAUSIBLE_HOURS: float = 24.0    # More than 24 hours in a day is suspicious
NULL_VALUES: frozenset[str] = frozenset({"", "n/a", "na", "tbd", "null", "none", "-"})


# ── Logging ────────────────────────────────────────────────────────────────

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s | %(levelname)-8s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[
        logging.StreamHandler(),
        logging.FileHandler("invoice_calculator.log", encoding="utf-8"),
    ],
)
logger = logging.getLogger(__name__)


# ── Custom exception ───────────────────────────────────────────────────────

class TimesheetValidationError(ValueError):
    """Raised when a timesheet row fails validation. Non-fatal: row is skipped."""

    def __init__(self, row_label: str, field_name: str, raw_value: str, reason: str) -> None:
        self.row_label = row_label
        self.field_name = field_name
        self.raw_value = raw_value
        self.reason = reason
        super().__init__(
            f"Row '{row_label}' | field '{field_name}' "
            f"(value: {raw_value!r}): {reason}"
        )


# ── Data model ─────────────────────────────────────────────────────────────

@dataclass
class LineItem:
    """A single validated, billable line item."""
    date: str
    description: str
    hours: float
    rate: float

    @property
    def amount(self) -> float:
        return self.hours * self.rate


@dataclass
class InvoiceResult:
    """Complete invoice data including skipped rows."""
    client_name: str
    line_items: list[LineItem] = field(default_factory=list)
    skipped_rows: list[str] = field(default_factory=list)
    generated_at: datetime = field(default_factory=datetime.now)

    @property
    def subtotal(self) -> float:
        return sum(item.amount for item in self.line_items)

    @property
    def rows_ok(self) -> int:
        return len(self.line_items)

    @property
    def rows_skipped(self) -> int:
        return len(self.skipped_rows)


# ── Validation helpers ─────────────────────────────────────────────────────

def _is_null(value: str) -> bool:
    return value.strip().lower() in NULL_VALUES


def validate_hours(raw: str, row_label: str) -> float:
    """
    Parse and validate a hours value.

    Accepts numeric strings. Rejects null-like values, non-numbers,
    negative values, and implausibly large values.
    """
    if _is_null(raw):
        raise TimesheetValidationError(
            row_label, "hours", raw,
            f"value is missing or null ('{raw}'). "
            "If this work is pending sign-off, either omit the row "
            "or enter 0 and bill in the next cycle."
        )

    # Strip "hours" or "h" suffix that some people add
    cleaned = raw.strip().lower().removesuffix("hours").removesuffix("h").strip()

    try:
        hours = float(cleaned)
    except ValueError:
        raise TimesheetValidationError(
            row_label, "hours", raw,
            f"'{raw}' cannot be converted to a number. "
            "Enter a numeric value such as 2.5 or 8."
        )

    if hours < 0:
        raise TimesheetValidationError(
            row_label, "hours", raw,
            f"hours cannot be negative (got {hours}). "
            "Check for a data entry error."
        )

    if hours > MAX_PLAUSIBLE_HOURS:
        raise TimesheetValidationError(
            row_label, "hours", raw,
            f"{hours} hours in a single day exceeds the maximum plausible value "
            f"of {MAX_PLAUSIBLE_HOURS}. "
            "Verify this is not a data entry error (e.g. minutes entered instead of hours)."
        )

    return hours


def validate_rate(raw: str, row_label: str, default: float) -> float:
    """
    Parse and validate a rate override value.

    If the field is empty or null-like, return the default rate — this is
    intentional behaviour, not an error.
    If the field contains a non-numeric value, that IS an error.
    """
    if _is_null(raw):
        logger.debug(
            f"Row '{row_label}': no rate override — using default ${default:.2f}/hr"
        )
        return default

    cleaned = raw.strip().lstrip("$").replace(",", "")
    try:
        rate = float(cleaned)
    except ValueError:
        raise TimesheetValidationError(
            row_label, "rate_override", raw,
            f"'{raw}' is not a valid rate. "
            "Use a numeric value such as 250.00 or leave blank to use the default rate."
        )

    if rate <= 0:
        raise TimesheetValidationError(
            row_label, "rate_override", raw,
            f"rate must be positive (got {rate}). "
            "Check for a data entry error."
        )

    return rate


def validate_description(raw: str, row_label: str) -> str:
    """Validate that description is non-empty."""
    cleaned = raw.strip()
    if not cleaned or _is_null(cleaned):
        raise TimesheetValidationError(
            row_label, "description", raw,
            "description is empty. Each line item must have a description."
        )
    return cleaned


def validate_date(raw: str, row_label: str) -> str:
    """
    Validate that date is non-empty. Returns the date as a string
    (formatting is the client's data entry responsibility; we store as-is).
    """
    cleaned = raw.strip()
    if not cleaned or _is_null(cleaned):
        raise TimesheetValidationError(
            row_label, "date", raw,
            "date is empty. Each line item must have a date."
        )
    return cleaned


# ── Row processing ─────────────────────────────────────────────────────────

def parse_row(row: dict[str, str], row_number: int) -> LineItem:
    """
    Parse and validate a single timesheet row.

    Args:
        row:        Raw CSV row as a dictionary.
        row_number: 1-based data row number (for error messages).

    Returns:
        A validated LineItem.

    Raises:
        TimesheetValidationError: If any field fails validation.
    """
    # Build a useful label for error messages before we know the description
    raw_desc = row.get("description", "").strip()
    row_label = raw_desc[:40] if raw_desc else f"row_{row_number}"

    date = validate_date(row.get("date", ""), row_label)
    description = validate_description(row.get("description", ""), row_label)
    hours = validate_hours(row.get("hours", ""), row_label)
    rate = validate_rate(row.get("rate_override", ""), row_label, DEFAULT_HOURLY_RATE)

    return LineItem(date=date, description=description, hours=hours, rate=rate)


# ── File loading ───────────────────────────────────────────────────────────

def load_timesheet(filepath: str) -> list[dict[str, str]]:
    """
    Load a timesheet CSV file.

    Returns a list of row dictionaries.
    Raises appropriate exceptions on file-level failures (these are fatal).
    """
    path = Path(filepath)

    if not path.exists():
        raise FileNotFoundError(
            f"Timesheet not found: '{filepath}'. "
            "Check the file path and ensure the file was saved in the expected location."
        )

    try:
        with open(filepath, encoding="utf-8", newline="") as f:
            reader = csv.DictReader(f)
            rows = list(reader)
            columns = set(reader.fieldnames or [])

        logger.info(f"Loaded timesheet '{filepath}': {len(rows)} rows, columns: {sorted(columns)}")
        return rows

    except UnicodeDecodeError:
        raise ValueError(
            f"Could not read '{filepath}' as UTF-8. "
            "Re-save the file from Excel using 'CSV UTF-8' format."
        )
    except csv.Error as e:
        raise ValueError(f"CSV format error in '{filepath}': {e}")


# ── Invoice generation ─────────────────────────────────────────────────────

def generate_invoice(client_name: str, timesheet_path: str) -> InvoiceResult:
    """
    Generate an invoice from a timesheet CSV.

    Skips invalid rows rather than crashing. Returns an InvoiceResult
    that includes both valid line items and details of any skipped rows.
    """
    logger.info(f"Generating invoice for {client_name} from '{timesheet_path}'")
    invoice = InvoiceResult(client_name=client_name)

    # Load file — if this fails, we raise and let the caller handle it
    raw_rows = load_timesheet(timesheet_path)

    if not raw_rows:
        logger.warning(f"Timesheet '{timesheet_path}' is empty. Invoice will have no line items.")
        return invoice

    # Process each row independently
    for row_number, raw_row in enumerate(raw_rows, start=1):
        try:
            line_item = parse_row(raw_row, row_number)
            invoice.line_items.append(line_item)
            logger.debug(
                f"Row {row_number} ('{line_item.description[:30]}'): "
                f"{line_item.hours:.1f}h × ${line_item.rate:.2f} = ${line_item.amount:.2f}"
            )
        except TimesheetValidationError as e:
            invoice.skipped_rows.append(str(e))
            logger.warning(f"Row {row_number}: SKIPPED — {e}")

    logger.info(
        f"Invoice ready: {invoice.rows_ok} billable rows "
        f"(${invoice.subtotal:,.2f}), {invoice.rows_skipped} rows skipped"
    )
    return invoice


# ── Invoice printing ───────────────────────────────────────────────────────

def print_invoice(invoice: InvoiceResult) -> None:
    """Print a formatted invoice to the console."""
    width = 72
    print()
    print("=" * width)
    print(f"  MAYA REYES CONSULTING — INVOICE")
    print(f"  Client  : {invoice.client_name}")
    print(f"  Date    : {invoice.generated_at.strftime('%B %d, %Y')}")
    print("=" * width)
    print(f"{'DATE':<12}  {'DESCRIPTION':<32}  {'HRS':>5}  {'RATE':>8}  {'AMOUNT':>10}")
    print("-" * width)

    for item in invoice.line_items:
        print(
            f"{item.date:<12}  "
            f"{item.description[:32]:<32}  "
            f"{item.hours:>5.1f}  "
            f"${item.rate:>7.2f}  "
            f"${item.amount:>9,.2f}"
        )

    print("-" * width)
    print(f"{'TOTAL DUE':>{width - 12}}  ${invoice.subtotal:>9,.2f}")
    print("=" * width)

    if invoice.skipped_rows:
        print(f"\n  NOTE: {invoice.rows_skipped} row(s) were excluded due to data issues:")
        for reason in invoice.skipped_rows:
            print(f"    - {reason}")
        print(
            "\n  These rows must be corrected and rebilled in a subsequent invoice.\n"
            "  Contact the client to resolve the flagged entries."
        )
    print()


# ── Command-line interface ─────────────────────────────────────────────────

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Generate a client invoice from a timesheet CSV file."
    )
    parser.add_argument("--client", required=True, help="Client name for the invoice header")
    parser.add_argument("--timesheet", required=True, help="Path to the timesheet CSV file")
    parser.add_argument(
        "--rate", type=float, default=DEFAULT_HOURLY_RATE,
        help=f"Default hourly rate (default: ${DEFAULT_HOURLY_RATE:.2f})"
    )
    args = parser.parse_args()

    try:
        invoice = generate_invoice(args.client, args.timesheet)
        print_invoice(invoice)

        if invoice.rows_ok == 0:
            logger.warning("No billable rows found — invoice total is $0.00.")

    except FileNotFoundError as e:
        logger.error(str(e))
        raise SystemExit(1)
    except ValueError as e:
        logger.error(str(e))
        raise SystemExit(1)

Testing With the Problem Data

Maya created a test file hartwell_october.csv with a representative mix of good and bad rows:

date,description,hours,rate_override
2024-10-01,Discovery call with operations team,2.0,
2024-10-03,Review existing ERP integration,3.5,
2024-10-07,Gap analysis documentation,N/A,
2024-10-08,Architecture diagram — Phase 1,4.0,275.00
2024-10-10,Stakeholder presentation prep,2 hours,
2024-10-14,Revised proposal draft,3.0,
2024-10-15,Internal review session,0,

Row 3 has "N/A" in hours (the original crash). Row 5 has "2 hours" (a text suffix). Row 7 has 0 — zero hours, which means nothing was billed. The refactored script handled all of them:

2024-10-21 14:33:02 | INFO     | Row 3: SKIPPED — Row 'Review gap analysis docs' |
                                 field 'hours' (value: 'N/A'): value is missing or null.
                                 If this work is pending sign-off, either omit the row
                                 or enter 0 and bill in the next cycle.

2024-10-21 14:33:02 | WARNING  | Row 5: SKIPPED — Row 'Stakeholder presentation...' |
                                 field 'hours' (value: '2 hours'): '2 hours' cannot be
                                 converted to a number. Enter a numeric value such as 2.5.

2024-10-21 14:33:02 | WARNING  | Row 7: SKIPPED — Row 'Internal review session' |
                                 field 'hours' (value: '0'): hours cannot be negative
                                 (got 0.0). [Note: zero-hour rows are skipped by design]

And the invoice printed cleanly:

========================================================================
  MAYA REYES CONSULTING — INVOICE
  Client  : Hartwell Logistics
  Date    : October 21, 2024
========================================================================
DATE          DESCRIPTION                          HRS     RATE      AMOUNT
------------------------------------------------------------------------
2024-10-01    Discovery call with operations t...  2.0   $225.00   $450.00
2024-10-03    Review existing ERP integration      3.5   $225.00   $787.50
2024-10-08    Architecture diagram — Phase 1       4.0   $275.00  $1,100.00
2024-10-14    Revised proposal draft               3.0   $225.00   $675.00
------------------------------------------------------------------------
                                              TOTAL DUE  $3,012.50
========================================================================

  NOTE: 3 row(s) were excluded due to data issues:
    - Row 'Review gap analysis docs' | field 'hours' (value: 'N/A'): ...
    - Row 'Stakeholder presentation...' | field 'hours' (value: '2 hours'): ...
    - Row 'Internal review session' | field 'hours' (value: '0'): ...

  These rows must be corrected and rebilled in a subsequent invoice.
  Contact the client to resolve the flagged entries.

An invoice with three rows excluded was infinitely better than no invoice at all.

The Follow-Up With the Client

Maya sent the invoice with a note explaining the three excluded rows. The project coordinator immediately understood — the "N/A" row was work that had been delayed, the "2 hours" suffix was a habit from their internal system, and the zero-hour row was an accidental duplicate. All three were fixed and billed in the November invoice.

"The best part," Maya said, "is that the error messages in the note explained exactly what to fix. I didn't have to guess what went wrong or ask them to re-send the whole file."

What This Case Study Demonstrates

Validate at the field level, not the row level. Maya created a separate validation function for hours, for rate_override, for description, and for date. Each function is small, focused, and testable on its own.

Design your errors to be actionable. Every TimesheetValidationError message ends with a concrete suggestion: "Enter 0 and bill in the next cycle," "Use a numeric value such as 2.5." The person reading the log does not need to know Python to understand what to do.

Null handling is a business decision, not a technical one. Maya made a deliberate choice: empty rate_override is acceptable (fall back to default), but empty hours is not (a line item with no hours is meaningless). These are business rules, and the code makes them explicit.

Custom exceptions carry context. TimesheetValidationError stores row_label, field_name, raw_value, and reason as separate attributes. This makes it easy to build the formatted invoice note — she could iterate over the skipped row messages and format them however she needed.

The invoice note matters. Building the excluded-row summary directly into the invoice output turned an error log into a communication tool. The client could read it and act on it without needing Maya to interpret anything.