Case Study 23-2: Maya Builds Her Business Database

Characters

  • Maya Reyes — Freelance Business Consultant

Situation

Maya Reyes has been running her consulting practice for three years using a system she is not proud of: a Clients.xlsx file, a Projects.xlsx file, an Invoices.xlsx file, and a TimeTracking.xlsx file that she fills in at the end of each day. Mostly.

The problems are accumulating:

  • She has to cross-reference three files to answer the question "how much does TechNova still owe me?"
  • The Projects file has a "Client Name" column where she typed the same client's name four different ways over three years ("TechNova Inc", "TechNova", "Tech Nova Inc", "Technova")
  • She cannot easily answer "what was my hourly rate by client last quarter?" without building a VLOOKUP maze
  • Last month she accidentally saved an interim draft over the invoice file and lost three days of entries

She has been reading about databases for a month. This weekend, she is finally doing it.


Designing the Schema

Before writing any code, Maya sketches her data model on paper. She asks: "What are the core things my business needs to track?"

Entities (things that need their own table): 1. Clients — companies and individuals she works with 2. Projects — engagements for a client; a client can have multiple projects 3. Invoices — billing documents; one invoice can cover multiple projects or time periods 4. TimeEntries — individual work sessions; the raw source of truth for billable hours

Relationships: - A Client has many Projects - A Project belongs to one Client - A Project has many TimeEntries - An Invoice belongs to one Client - An Invoice can reference multiple TimeEntries (an invoice "covers" a set of time entries)

She draws this on paper:

Clients
  id, name, contact_name, email, rate_per_hour, industry, active

  |-- Projects (client_id FK)
  |     id, client_id, name, description, status, start_date, end_date, rate_per_hour
  |
  |     |-- TimeEntries (project_id FK)
  |           id, project_id, date, hours, description, invoice_id (nullable)
  |
  |-- Invoices (client_id FK)
        id, client_id, invoice_number, issued_date, due_date, amount, status

The invoice_id on TimeEntries is a nullable foreign key — it is NULL when the time entry has not yet been billed, and filled in when she generates an invoice. This lets her query "which hours are unbilled?" efficiently.


Building the Database

# consulting_db.py
"""
Maya Reyes Consulting — Business Database
Replaces: Clients.xlsx, Projects.xlsx, Invoices.xlsx, TimeTracking.xlsx
"""

import sqlite3
import pandas as pd
import os
from datetime import date, datetime
from pathlib import Path

DB_PATH = "maya_consulting.db"


def get_connection() -> sqlite3.Connection:
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row
    conn.execute("PRAGMA foreign_keys = ON")
    return conn


def create_schema(conn: sqlite3.Connection) -> None:
    """Create all tables for Maya's consulting business."""

    conn.executescript("""
        CREATE TABLE IF NOT EXISTS clients (
            id              INTEGER PRIMARY KEY AUTOINCREMENT,
            name            TEXT    NOT NULL,
            contact_name    TEXT,
            email           TEXT    NOT NULL UNIQUE,
            phone           TEXT,
            industry        TEXT,
            default_rate    REAL    NOT NULL DEFAULT 150.0,
            active          INTEGER NOT NULL DEFAULT 1,
            notes           TEXT,
            created_at      TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS projects (
            id              INTEGER PRIMARY KEY AUTOINCREMENT,
            client_id       INTEGER NOT NULL REFERENCES clients(id),
            name            TEXT    NOT NULL,
            description     TEXT,
            status          TEXT    NOT NULL DEFAULT 'active',
            rate_per_hour   REAL,
            start_date      TEXT,
            end_date        TEXT,
            budget_hours    REAL,
            notes           TEXT,
            created_at      TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS invoices (
            id              INTEGER PRIMARY KEY AUTOINCREMENT,
            client_id       INTEGER NOT NULL REFERENCES clients(id),
            invoice_number  TEXT    NOT NULL UNIQUE,
            issued_date     TEXT    NOT NULL,
            due_date        TEXT    NOT NULL,
            amount          REAL    NOT NULL DEFAULT 0.0,
            status          TEXT    NOT NULL DEFAULT 'draft',
            notes           TEXT,
            created_at      TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS time_entries (
            id              INTEGER PRIMARY KEY AUTOINCREMENT,
            project_id      INTEGER NOT NULL REFERENCES projects(id),
            entry_date      TEXT    NOT NULL,
            hours           REAL    NOT NULL CHECK (hours > 0),
            description     TEXT    NOT NULL,
            invoice_id      INTEGER REFERENCES invoices(id),
            created_at      TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
        );

        -- Index for fast lookups of unbilled time entries
        CREATE INDEX IF NOT EXISTS idx_time_entries_invoice
            ON time_entries(invoice_id);

        CREATE INDEX IF NOT EXISTS idx_time_entries_project
            ON time_entries(project_id);

        CREATE INDEX IF NOT EXISTS idx_projects_client
            ON projects(client_id);
    """)

    conn.commit()
    print("[setup] Schema created.")

Migrating from CSV Files

Maya's old data lives in four CSV files. She writes a migration script to pull it in:

def migrate_from_csv(conn: sqlite3.Connection) -> None:
    """
    Migrate data from Maya's legacy CSV files into the database.
    Handles the messy client name inconsistencies with a normalization map.
    """

    # --- STEP 1: Normalize client names before importing ---
    # Three years of typos, all mapped to the correct canonical name.
    name_corrections = {
        "TechNova":       "TechNova Inc",
        "Tech Nova Inc":  "TechNova Inc",
        "Technova":       "TechNova Inc",
        "GreenLeaf":      "GreenLeaf Solutions",
        "Green Leaf":     "GreenLeaf Solutions",
        "Meridian":       "Meridian Group LLC",
        "Meridian Group": "Meridian Group LLC",
    }

    # --- STEP 2: Import clients ---
    clients_df = pd.read_csv("legacy_clients.csv")

    # Apply name normalization
    clients_df["name"] = clients_df["name"].replace(name_corrections)

    # Deduplicate (the inconsistent names created duplicate rows)
    clients_df = clients_df.drop_duplicates(subset=["name"])

    for _, row in clients_df.iterrows():
        conn.execute(
            """
            INSERT OR IGNORE INTO clients
                (name, contact_name, email, phone, industry, default_rate)
            VALUES (?, ?, ?, ?, ?, ?)
            """,
            (
                row.get("name"),
                row.get("contact_name"),
                row.get("email"),
                row.get("phone"),
                row.get("industry"),
                float(row.get("hourly_rate", 150.0))
            )
        )
    conn.commit()
    print(f"[migrate] Imported {len(clients_df)} clients.")

    # --- STEP 3: Build a name -> id lookup ---
    cursor = conn.execute("SELECT id, name FROM clients")
    client_id_map = {row["name"]: row["id"] for row in cursor.fetchall()}

    # --- STEP 4: Import projects ---
    projects_df = pd.read_csv("legacy_projects.csv")
    projects_df["client_name"] = projects_df["client_name"].replace(name_corrections)

    projects_inserted = 0
    projects_skipped  = 0

    for _, row in projects_df.iterrows():
        client_id = client_id_map.get(row.get("client_name"))
        if client_id is None:
            print(f"  [warn] Unknown client: {row.get('client_name')!r} — skipping project")
            projects_skipped += 1
            continue

        conn.execute(
            """
            INSERT OR IGNORE INTO projects
                (client_id, name, description, status, rate_per_hour,
                 start_date, end_date)
            VALUES (?, ?, ?, ?, ?, ?, ?)
            """,
            (
                client_id,
                row.get("project_name"),
                row.get("description"),
                row.get("status", "completed"),
                float(row.get("rate_per_hour", 0)),
                row.get("start_date"),
                row.get("end_date"),
            )
        )
        projects_inserted += 1

    conn.commit()
    print(f"[migrate] Projects: {projects_inserted} imported, {projects_skipped} skipped.")

    # --- STEP 5: Import time entries ---
    # Build project lookup map
    cursor = conn.execute("""
        SELECT p.id, p.name, c.name AS client_name
        FROM projects p INNER JOIN clients c ON c.id = p.client_id
    """)
    project_map = {
        (row["client_name"], row["name"]): row["id"]
        for row in cursor.fetchall()
    }

    time_df = pd.read_csv("legacy_time_tracking.csv")
    time_df["client_name"] = time_df["client_name"].replace(name_corrections)

    entries_inserted = 0
    entries_skipped  = 0

    for _, row in time_df.iterrows():
        project_id = project_map.get(
            (row.get("client_name"), row.get("project_name"))
        )
        if project_id is None:
            entries_skipped += 1
            continue

        conn.execute(
            """
            INSERT INTO time_entries
                (project_id, entry_date, hours, description)
            VALUES (?, ?, ?, ?)
            """,
            (
                project_id,
                row.get("date"),
                float(row.get("hours", 0)),
                row.get("description", ""),
            )
        )
        entries_inserted += 1

    conn.commit()
    print(f"[migrate] Time entries: {entries_inserted} imported, "
          f"{entries_skipped} skipped.")

The migration script finds 12 duplicate client rows that existed in the spreadsheet because of the inconsistent naming. In the database, each client exists exactly once.


Business Queries Maya Can Now Run Instantly

After the migration, Maya opens a Python REPL and starts asking questions she could never easily answer before.

How much does each client owe me right now?

conn = get_connection()

df = pd.read_sql("""
    SELECT
        c.name                      AS client,
        COUNT(DISTINCT i.id)        AS open_invoices,
        ROUND(SUM(i.amount), 2)     AS total_outstanding
    FROM invoices AS i
    INNER JOIN clients AS c ON c.id = i.client_id
    WHERE i.status IN ('sent', 'overdue')
    GROUP BY c.id, c.name
    ORDER BY total_outstanding DESC
""", conn)

print(df.to_string(index=False))
              client  open_invoices  total_outstanding
         TechNova Inc              2            6400.00
  GreenLeaf Solutions              1            2250.00
   Meridian Group LLC              1            1875.00

What are my unbilled hours right now?

df = pd.read_sql("""
    SELECT
        c.name                          AS client,
        p.name                          AS project,
        COUNT(te.id)                    AS entries,
        ROUND(SUM(te.hours), 1)         AS unbilled_hours,
        COALESCE(p.rate_per_hour, c.default_rate) AS rate,
        ROUND(SUM(te.hours) *
              COALESCE(p.rate_per_hour, c.default_rate), 2) AS unbilled_amount
    FROM time_entries AS te
    INNER JOIN projects AS p ON p.id = te.project_id
    INNER JOIN clients  AS c ON c.id = p.client_id
    WHERE te.invoice_id IS NULL
    GROUP BY c.id, c.name, p.id, p.name
    ORDER BY unbilled_amount DESC
""", conn)

print(df.to_string(index=False))
total_unbilled = df["unbilled_amount"].sum()
print(f"\nTotal unbilled: ${total_unbilled:,.2f}")

What was my effective hourly rate by client last quarter?

df = pd.read_sql("""
    SELECT
        c.name                              AS client,
        ROUND(SUM(te.hours), 1)             AS hours_worked,
        ROUND(SUM(i.amount), 2)             AS revenue,
        ROUND(SUM(i.amount) / SUM(te.hours), 2) AS effective_rate
    FROM time_entries AS te
    INNER JOIN projects AS p  ON p.id = te.project_id
    INNER JOIN clients  AS c  ON c.id = p.client_id
    INNER JOIN invoices AS i  ON i.id = te.invoice_id
    WHERE
        te.entry_date >= date('now', '-3 months')
        AND i.status = 'paid'
    GROUP BY c.id, c.name
    ORDER BY effective_rate DESC
""", conn)

print(df.to_string(index=False))

The answer surprises her. Her effective rate with TechNova is $127/hour even though she bills them $150 — because she has been logging more hours than she invoices. With GreenLeaf, her effective rate is $158/hour because her estimates consistently come in under budget on that engagement.

Which projects are running over their hour budget?

df = pd.read_sql("""
    SELECT
        c.name                          AS client,
        p.name                          AS project,
        p.budget_hours,
        ROUND(SUM(te.hours), 1)         AS hours_logged,
        ROUND(SUM(te.hours) - p.budget_hours, 1) AS over_under,
        p.status
    FROM projects AS p
    INNER JOIN clients     AS c  ON c.id = p.client_id
    LEFT  JOIN time_entries AS te ON te.project_id = p.id
    WHERE p.budget_hours IS NOT NULL AND p.status = 'active'
    GROUP BY p.id, c.name, p.name, p.budget_hours, p.status
    ORDER BY over_under DESC
""", conn)

print(df.to_string(index=False))

Adding New Data Going Forward

Maya writes two helper functions for her daily workflow:

def log_time(conn, project_id: int, entry_date: str,
             hours: float, description: str) -> int:
    """Add a time entry. Returns the new entry's ID."""
    cursor = conn.execute(
        """
        INSERT INTO time_entries (project_id, entry_date, hours, description)
        VALUES (?, ?, ?, ?)
        """,
        (project_id, entry_date, hours, description)
    )
    conn.commit()
    return cursor.lastrowid


def create_invoice(conn, client_id: int, invoice_number: str,
                   issued_date: str, due_date: str) -> int:
    """
    Create an invoice for a client and mark all unbilled time entries
    for that client's projects as billed.
    """
    try:
        conn.execute("BEGIN")

        # Calculate invoice total from unbilled hours
        result = conn.execute("""
            SELECT
                ROUND(SUM(te.hours * COALESCE(p.rate_per_hour, c.default_rate)), 2)
                AS total
            FROM time_entries AS te
            INNER JOIN projects AS p ON p.id = te.project_id
            INNER JOIN clients  AS c ON c.id = p.client_id
            WHERE c.id = ? AND te.invoice_id IS NULL
        """, (client_id,)).fetchone()

        total = result["total"] or 0.0
        if total == 0:
            conn.rollback()
            raise ValueError(f"No unbilled hours for client {client_id}")

        # Create the invoice
        cursor = conn.execute(
            """
            INSERT INTO invoices
                (client_id, invoice_number, issued_date, due_date, amount, status)
            VALUES (?, ?, ?, ?, ?, 'sent')
            """,
            (client_id, invoice_number, issued_date, due_date, total)
        )
        invoice_id = cursor.lastrowid

        # Link unbilled time entries to this invoice
        conn.execute("""
            UPDATE time_entries
            SET invoice_id = ?
            WHERE project_id IN (
                SELECT id FROM projects WHERE client_id = ?
            )
            AND invoice_id IS NULL
        """, (invoice_id, client_id))

        conn.commit()
        print(f"[invoice] Created {invoice_number} for ${total:,.2f}")
        return invoice_id

    except Exception as e:
        conn.rollback()
        print(f"[error] Invoice creation failed: {e}")
        raise

What Changed

Before:

  • "How much does TechNova owe me?" — 5 minutes, three files, a VLOOKUP
  • "What's my effective rate per client?" — 20 minutes, a complex spreadsheet formula
  • "Which projects are over budget?" — had to build a new tab every time
  • Data entry errors (misspelled client names) — chronic problem with no solution
  • Risk of losing data — one bad save, gone

After:

  • Any of those questions — 30 seconds, one SQL query
  • Client names consistent by design — the database enforces it via UNIQUE and foreign keys
  • Historical data preserved — the time entry log is append-only; she never overwrites it
  • Backup is one command: cp maya_consulting.db maya_consulting_backup_$(date +%Y%m%d).db

What Maya Learned

  1. Schema design before code. Sketching the tables and relationships on paper before writing any Python saved her from rebuilding the database three times.

  2. Normalization solves the "TechNova vs Tech Nova Inc" problem. Storing the client name once (in the clients table) and referencing it everywhere else via foreign key means she can only have one canonical version.

  3. Null is information. The invoice_id IS NULL on time entries is not a problem — it is how she tracks which hours are unbilled. Null as a meaningful state is a database concept she had not appreciated before.

  4. Migrations are hard; schema design is an investment. The messiest part of this project was cleaning the CSV data. Good schema design from day one makes future data easier to manage.

  5. The REPL is a great database client. Maya now opens a Python REPL with conn = get_connection() to explore data whenever she has a question. It is faster than any GUI tool she has tried.


Try It Yourself

Design and build a database for a simple business you know well. Ideas:

  • A food blog with recipes, ingredients, and categories
  • A personal library with books, authors, and reading status
  • A plant nursery with plants, suppliers, and sales

For each scenario: 1. Identify the entities (tables) 2. Identify the relationships (foreign keys) 3. Write the CREATE TABLE statements 4. Insert five rows of sample data per table 5. Write three SQL queries that answer useful business questions