Case Study 23-2: Maya Builds Her Business Database
Characters
- Maya Reyes — Freelance Business Consultant
Situation
Maya Reyes has been running her consulting practice for three years using a system she is not proud of: a Clients.xlsx file, a Projects.xlsx file, an Invoices.xlsx file, and a TimeTracking.xlsx file that she fills in at the end of each day. Mostly.
The problems are accumulating:
- She has to cross-reference three files to answer the question "how much does TechNova still owe me?"
- The
Projectsfile has a "Client Name" column where she typed the same client's name four different ways over three years ("TechNova Inc", "TechNova", "Tech Nova Inc", "Technova") - She cannot easily answer "what was my hourly rate by client last quarter?" without building a VLOOKUP maze
- Last month she accidentally saved an interim draft over the invoice file and lost three days of entries
She has been reading about databases for a month. This weekend, she is finally doing it.
Designing the Schema
Before writing any code, Maya sketches her data model on paper. She asks: "What are the core things my business needs to track?"
Entities (things that need their own table): 1. Clients — companies and individuals she works with 2. Projects — engagements for a client; a client can have multiple projects 3. Invoices — billing documents; one invoice can cover multiple projects or time periods 4. TimeEntries — individual work sessions; the raw source of truth for billable hours
Relationships: - A Client has many Projects - A Project belongs to one Client - A Project has many TimeEntries - An Invoice belongs to one Client - An Invoice can reference multiple TimeEntries (an invoice "covers" a set of time entries)
She draws this on paper:
Clients
id, name, contact_name, email, rate_per_hour, industry, active
|-- Projects (client_id FK)
| id, client_id, name, description, status, start_date, end_date, rate_per_hour
|
| |-- TimeEntries (project_id FK)
| id, project_id, date, hours, description, invoice_id (nullable)
|
|-- Invoices (client_id FK)
id, client_id, invoice_number, issued_date, due_date, amount, status
The invoice_id on TimeEntries is a nullable foreign key — it is NULL when the time entry has not yet been billed, and filled in when she generates an invoice. This lets her query "which hours are unbilled?" efficiently.
Building the Database
# consulting_db.py
"""
Maya Reyes Consulting — Business Database
Replaces: Clients.xlsx, Projects.xlsx, Invoices.xlsx, TimeTracking.xlsx
"""
import sqlite3
import pandas as pd
import os
from datetime import date, datetime
from pathlib import Path
DB_PATH = "maya_consulting.db"
def get_connection() -> sqlite3.Connection:
conn = sqlite3.connect(DB_PATH)
conn.row_factory = sqlite3.Row
conn.execute("PRAGMA foreign_keys = ON")
return conn
def create_schema(conn: sqlite3.Connection) -> None:
"""Create all tables for Maya's consulting business."""
conn.executescript("""
CREATE TABLE IF NOT EXISTS clients (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT NOT NULL,
contact_name TEXT,
email TEXT NOT NULL UNIQUE,
phone TEXT,
industry TEXT,
default_rate REAL NOT NULL DEFAULT 150.0,
active INTEGER NOT NULL DEFAULT 1,
notes TEXT,
created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS projects (
id INTEGER PRIMARY KEY AUTOINCREMENT,
client_id INTEGER NOT NULL REFERENCES clients(id),
name TEXT NOT NULL,
description TEXT,
status TEXT NOT NULL DEFAULT 'active',
rate_per_hour REAL,
start_date TEXT,
end_date TEXT,
budget_hours REAL,
notes TEXT,
created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS invoices (
id INTEGER PRIMARY KEY AUTOINCREMENT,
client_id INTEGER NOT NULL REFERENCES clients(id),
invoice_number TEXT NOT NULL UNIQUE,
issued_date TEXT NOT NULL,
due_date TEXT NOT NULL,
amount REAL NOT NULL DEFAULT 0.0,
status TEXT NOT NULL DEFAULT 'draft',
notes TEXT,
created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS time_entries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
project_id INTEGER NOT NULL REFERENCES projects(id),
entry_date TEXT NOT NULL,
hours REAL NOT NULL CHECK (hours > 0),
description TEXT NOT NULL,
invoice_id INTEGER REFERENCES invoices(id),
created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Index for fast lookups of unbilled time entries
CREATE INDEX IF NOT EXISTS idx_time_entries_invoice
ON time_entries(invoice_id);
CREATE INDEX IF NOT EXISTS idx_time_entries_project
ON time_entries(project_id);
CREATE INDEX IF NOT EXISTS idx_projects_client
ON projects(client_id);
""")
conn.commit()
print("[setup] Schema created.")
Migrating from CSV Files
Maya's old data lives in four CSV files. She writes a migration script to pull it in:
def migrate_from_csv(conn: sqlite3.Connection) -> None:
"""
Migrate data from Maya's legacy CSV files into the database.
Handles the messy client name inconsistencies with a normalization map.
"""
# --- STEP 1: Normalize client names before importing ---
# Three years of typos, all mapped to the correct canonical name.
name_corrections = {
"TechNova": "TechNova Inc",
"Tech Nova Inc": "TechNova Inc",
"Technova": "TechNova Inc",
"GreenLeaf": "GreenLeaf Solutions",
"Green Leaf": "GreenLeaf Solutions",
"Meridian": "Meridian Group LLC",
"Meridian Group": "Meridian Group LLC",
}
# --- STEP 2: Import clients ---
clients_df = pd.read_csv("legacy_clients.csv")
# Apply name normalization
clients_df["name"] = clients_df["name"].replace(name_corrections)
# Deduplicate (the inconsistent names created duplicate rows)
clients_df = clients_df.drop_duplicates(subset=["name"])
for _, row in clients_df.iterrows():
conn.execute(
"""
INSERT OR IGNORE INTO clients
(name, contact_name, email, phone, industry, default_rate)
VALUES (?, ?, ?, ?, ?, ?)
""",
(
row.get("name"),
row.get("contact_name"),
row.get("email"),
row.get("phone"),
row.get("industry"),
float(row.get("hourly_rate", 150.0))
)
)
conn.commit()
print(f"[migrate] Imported {len(clients_df)} clients.")
# --- STEP 3: Build a name -> id lookup ---
cursor = conn.execute("SELECT id, name FROM clients")
client_id_map = {row["name"]: row["id"] for row in cursor.fetchall()}
# --- STEP 4: Import projects ---
projects_df = pd.read_csv("legacy_projects.csv")
projects_df["client_name"] = projects_df["client_name"].replace(name_corrections)
projects_inserted = 0
projects_skipped = 0
for _, row in projects_df.iterrows():
client_id = client_id_map.get(row.get("client_name"))
if client_id is None:
print(f" [warn] Unknown client: {row.get('client_name')!r} — skipping project")
projects_skipped += 1
continue
conn.execute(
"""
INSERT OR IGNORE INTO projects
(client_id, name, description, status, rate_per_hour,
start_date, end_date)
VALUES (?, ?, ?, ?, ?, ?, ?)
""",
(
client_id,
row.get("project_name"),
row.get("description"),
row.get("status", "completed"),
float(row.get("rate_per_hour", 0)),
row.get("start_date"),
row.get("end_date"),
)
)
projects_inserted += 1
conn.commit()
print(f"[migrate] Projects: {projects_inserted} imported, {projects_skipped} skipped.")
# --- STEP 5: Import time entries ---
# Build project lookup map
cursor = conn.execute("""
SELECT p.id, p.name, c.name AS client_name
FROM projects p INNER JOIN clients c ON c.id = p.client_id
""")
project_map = {
(row["client_name"], row["name"]): row["id"]
for row in cursor.fetchall()
}
time_df = pd.read_csv("legacy_time_tracking.csv")
time_df["client_name"] = time_df["client_name"].replace(name_corrections)
entries_inserted = 0
entries_skipped = 0
for _, row in time_df.iterrows():
project_id = project_map.get(
(row.get("client_name"), row.get("project_name"))
)
if project_id is None:
entries_skipped += 1
continue
conn.execute(
"""
INSERT INTO time_entries
(project_id, entry_date, hours, description)
VALUES (?, ?, ?, ?)
""",
(
project_id,
row.get("date"),
float(row.get("hours", 0)),
row.get("description", ""),
)
)
entries_inserted += 1
conn.commit()
print(f"[migrate] Time entries: {entries_inserted} imported, "
f"{entries_skipped} skipped.")
The migration script finds 12 duplicate client rows that existed in the spreadsheet because of the inconsistent naming. In the database, each client exists exactly once.
Business Queries Maya Can Now Run Instantly
After the migration, Maya opens a Python REPL and starts asking questions she could never easily answer before.
How much does each client owe me right now?
conn = get_connection()
df = pd.read_sql("""
SELECT
c.name AS client,
COUNT(DISTINCT i.id) AS open_invoices,
ROUND(SUM(i.amount), 2) AS total_outstanding
FROM invoices AS i
INNER JOIN clients AS c ON c.id = i.client_id
WHERE i.status IN ('sent', 'overdue')
GROUP BY c.id, c.name
ORDER BY total_outstanding DESC
""", conn)
print(df.to_string(index=False))
client open_invoices total_outstanding
TechNova Inc 2 6400.00
GreenLeaf Solutions 1 2250.00
Meridian Group LLC 1 1875.00
What are my unbilled hours right now?
df = pd.read_sql("""
SELECT
c.name AS client,
p.name AS project,
COUNT(te.id) AS entries,
ROUND(SUM(te.hours), 1) AS unbilled_hours,
COALESCE(p.rate_per_hour, c.default_rate) AS rate,
ROUND(SUM(te.hours) *
COALESCE(p.rate_per_hour, c.default_rate), 2) AS unbilled_amount
FROM time_entries AS te
INNER JOIN projects AS p ON p.id = te.project_id
INNER JOIN clients AS c ON c.id = p.client_id
WHERE te.invoice_id IS NULL
GROUP BY c.id, c.name, p.id, p.name
ORDER BY unbilled_amount DESC
""", conn)
print(df.to_string(index=False))
total_unbilled = df["unbilled_amount"].sum()
print(f"\nTotal unbilled: ${total_unbilled:,.2f}")
What was my effective hourly rate by client last quarter?
df = pd.read_sql("""
SELECT
c.name AS client,
ROUND(SUM(te.hours), 1) AS hours_worked,
ROUND(SUM(i.amount), 2) AS revenue,
ROUND(SUM(i.amount) / SUM(te.hours), 2) AS effective_rate
FROM time_entries AS te
INNER JOIN projects AS p ON p.id = te.project_id
INNER JOIN clients AS c ON c.id = p.client_id
INNER JOIN invoices AS i ON i.id = te.invoice_id
WHERE
te.entry_date >= date('now', '-3 months')
AND i.status = 'paid'
GROUP BY c.id, c.name
ORDER BY effective_rate DESC
""", conn)
print(df.to_string(index=False))
The answer surprises her. Her effective rate with TechNova is $127/hour even though she bills them $150 — because she has been logging more hours than she invoices. With GreenLeaf, her effective rate is $158/hour because her estimates consistently come in under budget on that engagement.
Which projects are running over their hour budget?
df = pd.read_sql("""
SELECT
c.name AS client,
p.name AS project,
p.budget_hours,
ROUND(SUM(te.hours), 1) AS hours_logged,
ROUND(SUM(te.hours) - p.budget_hours, 1) AS over_under,
p.status
FROM projects AS p
INNER JOIN clients AS c ON c.id = p.client_id
LEFT JOIN time_entries AS te ON te.project_id = p.id
WHERE p.budget_hours IS NOT NULL AND p.status = 'active'
GROUP BY p.id, c.name, p.name, p.budget_hours, p.status
ORDER BY over_under DESC
""", conn)
print(df.to_string(index=False))
Adding New Data Going Forward
Maya writes two helper functions for her daily workflow:
def log_time(conn, project_id: int, entry_date: str,
hours: float, description: str) -> int:
"""Add a time entry. Returns the new entry's ID."""
cursor = conn.execute(
"""
INSERT INTO time_entries (project_id, entry_date, hours, description)
VALUES (?, ?, ?, ?)
""",
(project_id, entry_date, hours, description)
)
conn.commit()
return cursor.lastrowid
def create_invoice(conn, client_id: int, invoice_number: str,
issued_date: str, due_date: str) -> int:
"""
Create an invoice for a client and mark all unbilled time entries
for that client's projects as billed.
"""
try:
conn.execute("BEGIN")
# Calculate invoice total from unbilled hours
result = conn.execute("""
SELECT
ROUND(SUM(te.hours * COALESCE(p.rate_per_hour, c.default_rate)), 2)
AS total
FROM time_entries AS te
INNER JOIN projects AS p ON p.id = te.project_id
INNER JOIN clients AS c ON c.id = p.client_id
WHERE c.id = ? AND te.invoice_id IS NULL
""", (client_id,)).fetchone()
total = result["total"] or 0.0
if total == 0:
conn.rollback()
raise ValueError(f"No unbilled hours for client {client_id}")
# Create the invoice
cursor = conn.execute(
"""
INSERT INTO invoices
(client_id, invoice_number, issued_date, due_date, amount, status)
VALUES (?, ?, ?, ?, ?, 'sent')
""",
(client_id, invoice_number, issued_date, due_date, total)
)
invoice_id = cursor.lastrowid
# Link unbilled time entries to this invoice
conn.execute("""
UPDATE time_entries
SET invoice_id = ?
WHERE project_id IN (
SELECT id FROM projects WHERE client_id = ?
)
AND invoice_id IS NULL
""", (invoice_id, client_id))
conn.commit()
print(f"[invoice] Created {invoice_number} for ${total:,.2f}")
return invoice_id
except Exception as e:
conn.rollback()
print(f"[error] Invoice creation failed: {e}")
raise
What Changed
Before:
- "How much does TechNova owe me?" — 5 minutes, three files, a VLOOKUP
- "What's my effective rate per client?" — 20 minutes, a complex spreadsheet formula
- "Which projects are over budget?" — had to build a new tab every time
- Data entry errors (misspelled client names) — chronic problem with no solution
- Risk of losing data — one bad save, gone
After:
- Any of those questions — 30 seconds, one SQL query
- Client names consistent by design — the database enforces it via
UNIQUEand foreign keys - Historical data preserved — the time entry log is append-only; she never overwrites it
- Backup is one command:
cp maya_consulting.db maya_consulting_backup_$(date +%Y%m%d).db
What Maya Learned
-
Schema design before code. Sketching the tables and relationships on paper before writing any Python saved her from rebuilding the database three times.
-
Normalization solves the "TechNova vs Tech Nova Inc" problem. Storing the client name once (in the
clientstable) and referencing it everywhere else via foreign key means she can only have one canonical version. -
Null is information. The
invoice_id IS NULLon time entries is not a problem — it is how she tracks which hours are unbilled. Null as a meaningful state is a database concept she had not appreciated before. -
Migrations are hard; schema design is an investment. The messiest part of this project was cleaning the CSV data. Good schema design from day one makes future data easier to manage.
-
The REPL is a great database client. Maya now opens a Python REPL with
conn = get_connection()to explore data whenever she has a question. It is faster than any GUI tool she has tried.
Try It Yourself
Design and build a database for a simple business you know well. Ideas:
- A food blog with recipes, ingredients, and categories
- A personal library with books, authors, and reading status
- A plant nursery with plants, suppliers, and sales
For each scenario:
1. Identify the entities (tables)
2. Identify the relationships (foreign keys)
3. Write the CREATE TABLE statements
4. Insert five rows of sample data per table
5. Write three SQL queries that answer useful business questions