Chapter 18: Database Design and Data Modeling

"Bad programmers worry about the code. Good programmers worry about data structures and their relationships." -- Linus Torvalds

In the preceding chapters, you built command-line tools (Chapter 15), web frontends (Chapter 16), and REST APIs (Chapter 17). But every one of those projects shares a common need: persistent, structured data. A task manager that forgets your tasks when you close the terminal is a toy. A web application that loses user registrations on restart is useless. A REST API without reliable data storage is a facade with nothing behind it.

This chapter teaches you how to design, build, and manage the data layer that underpins real software. You will learn relational database fundamentals, write SQL with AI assistance, use SQLAlchemy to interact with databases from Python, manage schema evolution with Alembic, and understand when relational databases are not the right tool. By the end, you will be able to ask an AI assistant to help you design a database schema and critically evaluate the result -- knowing which suggestions to keep, which to modify, and which to reject entirely.

Database design is one of the areas where AI assistants shine brightest and stumble most dangerously. An AI can generate a perfectly normalized schema in seconds, but it can also produce one with subtle flaws -- missing indexes, incorrect relationship cardinalities, or denormalization choices that will haunt you at scale. The vibe coder who understands the underlying principles will consistently get better results than one who accepts AI output uncritically.

Learning Objectives

By the end of this chapter, you will be able to:

  • Define (Bloom's: Remember) the core components of a relational database: tables, rows, columns, primary keys, and foreign keys
  • Explain (Bloom's: Understand) normalization through Third Normal Form and articulate why it matters for data integrity
  • Write (Bloom's: Apply) SQL queries including SELECT, INSERT, UPDATE, DELETE, JOINs, GROUP BY, and aggregate functions with AI assistance
  • Construct (Bloom's: Apply) SQLAlchemy ORM models with relationships, constraints, and type hints
  • Manage (Bloom's: Apply) database schema changes using Alembic migrations
  • Analyze (Bloom's: Analyze) schema design patterns including one-to-many, many-to-many, and polymorphic associations
  • Evaluate (Bloom's: Evaluate) indexing strategies and their impact on query performance
  • Compare (Bloom's: Analyze) relational and NoSQL databases to select the appropriate tool for a given problem
  • Design (Bloom's: Create) a complete data layer with validation, constraints, and proper access patterns

18.1 Database Fundamentals for Vibe Coders

Before you can effectively prompt an AI to design your database, you need a mental model of how databases work. This section builds that model from first principles.

What Is a Database?

A database is an organized collection of data that supports efficient storage, retrieval, and modification. While you could store data in flat files, CSV spreadsheets, or JSON documents (as you did in Chapter 6), databases provide critical guarantees that files do not:

  • Concurrent access: Multiple users or processes can read and write simultaneously without corrupting data.
  • Durability: Once data is saved (committed), it survives crashes, power failures, and restarts.
  • Integrity constraints: Rules enforced by the database itself prevent invalid data from being stored.
  • Efficient querying: Indexes and query optimizers make it fast to find specific records among millions.

Relational Databases: The Workhorse

A relational database organizes data into tables (also called relations). Each table has:

  • Columns (fields/attributes): Define the type of data stored. A users table might have columns id, username, email, and created_at.
  • Rows (records/tuples): Each row is one entry in the table -- one user, one order, one product.
  • Primary key: A column (or set of columns) that uniquely identifies each row. Typically an auto-incrementing integer id or a UUID.
  • Foreign key: A column that references the primary key of another table, establishing a relationship between tables.

Here is a concrete example:

users table:
+----+----------+-------------------+---------------------+
| id | username | email             | created_at          |
+----+----------+-------------------+---------------------+
|  1 | alice    | alice@example.com | 2025-01-15 09:30:00 |
|  2 | bob      | bob@example.com   | 2025-02-20 14:15:00 |
+----+----------+-------------------+---------------------+

tasks table:
+----+---------+----------------+-----------+
| id | user_id | title          | completed |
+----+---------+----------------+-----------+
|  1 |       1 | Buy groceries  | false     |
|  2 |       1 | Write chapter  | true      |
|  3 |       2 | Fix bug #42    | false     |
+----+---------+----------------+-----------+

The user_id column in the tasks table is a foreign key pointing to users.id. This relationship means "each task belongs to one user, and each user can have many tasks."
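The database can enforce this relationship itself. Here is a minimal, runnable sketch using Python's built-in sqlite3 module, with a trimmed-down version of the tables above (note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in to FK enforcement

conn.execute("""CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE,
    email TEXT NOT NULL UNIQUE)""")
conn.execute("""CREATE TABLE tasks (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    title TEXT NOT NULL,
    completed INTEGER NOT NULL DEFAULT 0)""")

conn.execute("INSERT INTO users (username, email) VALUES ('alice', 'alice@example.com')")
conn.execute("INSERT INTO tasks (user_id, title) VALUES (1, 'Buy groceries')")

# A task pointing at a nonexistent user is rejected by the database itself
try:
    conn.execute("INSERT INTO tasks (user_id, title) VALUES (99, 'Orphan task')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

This is the integrity-constraint guarantee in action: invalid references never make it into storage, regardless of what the application code does.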

The ACID Guarantees

Relational databases provide four guarantees known by the acronym ACID:

  • Atomicity: A transaction either completes entirely or has no effect. If you transfer money between accounts, both the debit and credit happen, or neither does.
  • Consistency: Every transaction moves the database from one valid state to another. Constraints are never violated.
  • Isolation: Concurrent transactions do not interfere with each other. Each transaction sees a consistent snapshot of the data.
  • Durability: Once a transaction is committed, it is permanent -- even if the server crashes immediately afterward.
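Atomicity is easy to see in code. In this sketch (the accounts table and balances are invented for illustration, using stdlib sqlite3), the first half of a transfer succeeds, the second half violates a CHECK constraint, and the database rolls back both:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

try:
    with conn:  # the connection context manager commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
        # The debit overdraws account 1, violating the CHECK constraint...
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 1")
except sqlite3.IntegrityError:
    pass

# ...so the credit that already ran is rolled back too: all or nothing
print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())  # [(100,), (50,)]
```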

Vibe Coding Insight: When you ask an AI to "create a database for my application," these guarantees are what you are really asking for. Understanding ACID helps you evaluate whether an AI's suggestion of using a simple JSON file versus a real database is appropriate for your use case.

For vibe coding projects, you will most commonly encounter:

Database     Best For                                   Key Trait
SQLite       Prototyping, small apps, embedded systems  Zero configuration, file-based
PostgreSQL   Production web apps, complex queries       Feature-rich, highly reliable
MySQL        Web applications, read-heavy workloads     Widely deployed, well-known

SQLite is the ideal starting point for vibe coders. It requires no server setup -- the entire database lives in a single file. You can start building immediately and migrate to PostgreSQL when your application demands it.

Practical Tip: When prompting an AI to build your data layer, start with SQLite for development and specify PostgreSQL for production. SQLAlchemy (Section 18.4) makes switching between them nearly painless.


18.2 Relational Database Design

Designing a good database schema is part science and part art. This section teaches the science -- normalization -- and gives you the intuition to apply it effectively.

Thinking in Relationships

The first step in database design is identifying the entities (things) in your system and the relationships between them. Consider an online bookstore:

  • Entities: Books, Authors, Customers, Orders, Reviews
  • Relationships: A book has one or more authors. A customer places orders. An order contains books. A customer writes reviews about books.

Each entity becomes a table. Each relationship becomes either a foreign key or a junction table (more on this shortly).

Types of Relationships

There are three fundamental relationship types:

One-to-Many (1:N): The most common relationship. One user has many tasks. One department has many employees. Implemented with a foreign key on the "many" side.

departments          employees
+----+--------+     +----+---------+---------------+
| id | name   |     | id | dept_id | name          |
+----+--------+     +----+---------+---------------+
|  1 | Sales  |     |  1 |       1 | Alice         |
|  2 | Eng    |     |  2 |       2 | Bob           |
+----+--------+     |  3 |       2 | Charlie       |
                     +----+---------+---------------+

One-to-One (1:1): Each row in table A corresponds to exactly one row in table B. A user has one profile. Implemented with a foreign key plus a unique constraint.

Many-to-Many (M:N): Books have multiple authors; authors write multiple books. Implemented with a junction table (also called an association table or bridge table):

books              book_authors         authors
+----+---------+   +---------+-----------+   +----+---------+
| id | title   |   | book_id | author_id |   | id | name    |
+----+---------+   +---------+-----------+   +----+---------+
|  1 | Book A  |   |       1 |         1 |   |  1 | Smith   |
|  2 | Book B  |   |       1 |         2 |   |  2 | Jones   |
+----+---------+   |       2 |         2 |   +----+---------+
                    +---------+-----------+
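Answering "which authors wrote which books" means joining through the junction table twice. Here is a runnable sketch with Python's sqlite3 module, loaded with the exact rows from the diagram above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book_authors (
        book_id INTEGER REFERENCES books(id),
        author_id INTEGER REFERENCES authors(id),
        PRIMARY KEY (book_id, author_id)
    );
    INSERT INTO books VALUES (1, 'Book A'), (2, 'Book B');
    INSERT INTO authors VALUES (1, 'Smith'), (2, 'Jones');
    INSERT INTO book_authors VALUES (1, 1), (1, 2), (2, 2);
""")

# Walk books -> book_authors -> authors with two JOINs
rows = conn.execute("""
    SELECT books.title, authors.name
    FROM books
    JOIN book_authors ON book_authors.book_id = books.id
    JOIN authors ON authors.id = book_authors.author_id
    ORDER BY books.title, authors.name
""").fetchall()
print(rows)  # [('Book A', 'Jones'), ('Book A', 'Smith'), ('Book B', 'Jones')]
```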

Normalization: Eliminating Redundancy

Normalization is the process of organizing your tables to minimize data redundancy and dependency. There are several "normal forms," but for practical purposes, understanding the first three is sufficient.

First Normal Form (1NF): No Repeating Groups

A table is in 1NF if every column contains atomic (indivisible) values -- no lists, no sets, no comma-separated values in a single column.

Bad (violates 1NF):

+----+----------+-------------------------+
| id | name     | phone_numbers           |
+----+----------+-------------------------+
|  1 | Alice    | 555-0101, 555-0102      |
+----+----------+-------------------------+

Good (1NF):

users:                    phone_numbers:
+----+-------+           +----+---------+----------+
| id | name  |           | id | user_id | number   |
+----+-------+           +----+---------+----------+
|  1 | Alice |           |  1 |       1 | 555-0101 |
+----+-------+           |  2 |       1 | 555-0102 |
                          +----+---------+----------+

Intuition: If you ever find yourself wanting to store a comma-separated list in a database column, that is a strong signal you need a separate table. The AI might sometimes suggest array columns (PostgreSQL supports them), but a separate table is almost always the better design for queryability and integrity.
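The queryability gain is concrete. Here is the 1NF design above as a runnable sqlite3 sketch -- per-number constraints and counts become ordinary SQL, which is impossible against a comma-separated blob:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phone_numbers (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        number TEXT UNIQUE  -- a constraint you cannot put on a comma-separated column
    );
    INSERT INTO users VALUES (1, 'Alice');
    INSERT INTO phone_numbers (user_id, number) VALUES (1, '555-0101'), (1, '555-0102');
""")

# With one value per row, per-number queries are plain SQL
count = conn.execute(
    "SELECT COUNT(*) FROM phone_numbers WHERE user_id = 1"
).fetchone()[0]
print(count)  # 2
```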

Second Normal Form (2NF): No Partial Dependencies

A table is in 2NF if it is in 1NF and every non-key column depends on the entire primary key. This matters most when you have composite primary keys.

Bad (violates 2NF): In a table with composite key (student_id, course_id), if student_name depends only on student_id, that is a partial dependency.

+------------+-----------+--------------+-------+
| student_id | course_id | student_name | grade |
+------------+-----------+--------------+-------+
|          1 |       101 | Alice        | A     |
|          1 |       102 | Alice        | B     |
+------------+-----------+--------------+-------+

"Alice" is repeated because student_name depends only on student_id, not on the full key. Fix: move student_name to a separate students table.

Third Normal Form (3NF): No Transitive Dependencies

A table is in 3NF if it is in 2NF and no non-key column depends on another non-key column.

Bad (violates 3NF):

+----+---------+-----------+-------------+
| id | zip_code| city      | state       |
+----+---------+-----------+-------------+
|  1 | 10001   | New York  | NY          |
|  2 | 10001   | New York  | NY          |
+----+---------+-----------+-------------+

Here city and state depend on zip_code, not on id. Fix: create a separate zip_codes table.

Vibe Coding Insight: When you prompt an AI with "design a database for a bookstore," it will usually produce a reasonably normalized schema. But check for these common issues: (1) denormalized fields added for "convenience" that will cause update anomalies, (2) missing junction tables for many-to-many relationships, and (3) overly normalized designs that split data into too many tiny tables. The sweet spot for most applications is 3NF with strategic, intentional denormalization where performance demands it.

When to Break the Rules

Normalization is not an absolute mandate. Sometimes you deliberately denormalize for performance:

  • Caching computed values: Storing order_total on the orders table instead of recalculating from line items every time.
  • Read-heavy workloads: Duplicating a username alongside a foreign key to avoid a JOIN in frequently-read views.
  • Reporting tables: Creating flattened, denormalized tables specifically for analytics queries.

The key is that denormalization should be a conscious, documented decision -- not an accident.
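For the first bullet, the hard part is keeping the cached value consistent. One common approach is a database trigger; here is a sketch in SQLite (the orders/line_items schema is invented for illustration, with money stored as integer cents to avoid floating-point drift):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        total_cents INTEGER NOT NULL DEFAULT 0  -- cached, denormalized value
    );
    CREATE TABLE line_items (
        id INTEGER PRIMARY KEY,
        order_id INTEGER REFERENCES orders(id),
        amount_cents INTEGER NOT NULL
    );
    -- The trigger keeps the cached total in sync on every insert
    CREATE TRIGGER line_items_after_insert AFTER INSERT ON line_items
    BEGIN
        UPDATE orders SET total_cents = total_cents + NEW.amount_cents
        WHERE id = NEW.order_id;
    END;
    INSERT INTO orders (id) VALUES (1);
    INSERT INTO line_items (order_id, amount_cents) VALUES (1, 1999), (1, 501);
""")

total = conn.execute("SELECT total_cents FROM orders WHERE id = 1").fetchone()[0]
print(total)  # 2500
```

A trigger is one option among several (application-level updates and periodic recomputation are others); whichever you choose, write it down so the denormalization stays a documented decision.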


18.3 SQL Through AI: Queries, Joins, and Aggregations

SQL (Structured Query Language) is the universal language for interacting with relational databases. Even if you use an ORM (Section 18.4), understanding SQL is essential because you need to verify what the ORM generates and write complex queries that ORMs handle poorly.

The Four Fundamental Operations (CRUD)

CREATE (INSERT):

INSERT INTO users (username, email, created_at)
VALUES ('alice', 'alice@example.com', '2025-01-15 09:30:00');

READ (SELECT):

SELECT username, email FROM users WHERE id = 1;

UPDATE:

UPDATE users SET email = 'newalice@example.com' WHERE id = 1;

DELETE:

DELETE FROM users WHERE id = 1;

Warning

A DELETE or UPDATE without a WHERE clause affects every row in the table. This is one of the most dangerous mistakes in SQL. Always double-check your WHERE clause, even when AI generates the query. Some teams enforce a policy of running SELECT with the same WHERE clause first to verify which rows will be affected.

Filtering and Sorting

-- Multiple conditions
SELECT * FROM tasks
WHERE completed = false AND user_id = 1
ORDER BY created_at DESC
LIMIT 10;

-- Pattern matching
SELECT * FROM users WHERE email LIKE '%@example.com';

-- Range queries
SELECT * FROM orders
WHERE total BETWEEN 100 AND 500
  AND created_at >= '2025-01-01';

JOINs: Combining Data from Multiple Tables

JOINs are where SQL becomes powerful. They let you combine rows from different tables based on related columns.

INNER JOIN returns only rows that have matches in both tables:

SELECT users.username, tasks.title, tasks.completed
FROM users
INNER JOIN tasks ON users.id = tasks.user_id;

LEFT JOIN returns all rows from the left table and matched rows from the right. Unmatched right-side columns become NULL:

SELECT users.username, COUNT(tasks.id) AS task_count
FROM users
LEFT JOIN tasks ON users.id = tasks.user_id
GROUP BY users.username;

This query returns all users, even those with zero tasks. An INNER JOIN would exclude users without tasks.

Why This Matters for Vibe Coding: When you prompt an AI to "get all users with their task counts," the AI might generate either an INNER JOIN or LEFT JOIN. You need to know the difference to verify the output matches your intent. Do you want to see users with zero tasks? Then you need a LEFT JOIN.
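The difference is easy to demonstrate. In this sqlite3 sketch, bob has no tasks: the INNER JOIN drops him entirely, while the LEFT JOIN keeps him with a count of zero:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE tasks (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO tasks VALUES (1, 1, 'Buy groceries');
""")

inner = conn.execute("""
    SELECT users.username, COUNT(tasks.id)
    FROM users INNER JOIN tasks ON users.id = tasks.user_id
    GROUP BY users.username ORDER BY users.username
""").fetchall()

left = conn.execute("""
    SELECT users.username, COUNT(tasks.id)
    FROM users LEFT JOIN tasks ON users.id = tasks.user_id
    GROUP BY users.username ORDER BY users.username
""").fetchall()

print(inner)  # [('alice', 1)] -- bob vanished
print(left)   # [('alice', 1), ('bob', 0)] -- bob kept, with count 0
```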

Aggregation with GROUP BY

Aggregation functions (COUNT, SUM, AVG, MIN, MAX) summarize data across groups of rows:

-- Count tasks per user
SELECT user_id, COUNT(*) AS task_count
FROM tasks
GROUP BY user_id;

-- Average order total by month
SELECT
    DATE_TRUNC('month', created_at) AS month,
    AVG(total) AS avg_total,
    COUNT(*) AS order_count
FROM orders
GROUP BY DATE_TRUNC('month', created_at)
ORDER BY month;

-- Filter groups with HAVING
SELECT user_id, COUNT(*) AS task_count
FROM tasks
GROUP BY user_id
HAVING COUNT(*) > 5;

Important Distinction: WHERE filters individual rows before grouping. HAVING filters groups after aggregation. AI assistants sometimes confuse these -- always check that the filter is in the right clause.
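Both filters can appear in one query. In this runnable sqlite3 sketch, WHERE keeps only completed tasks, then HAVING keeps only users with at least two of them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tasks (id INTEGER PRIMARY KEY, user_id INTEGER, completed INTEGER);
    INSERT INTO tasks (user_id, completed) VALUES
        (1, 1), (1, 1), (1, 0),  -- user 1: two completed, one not
        (2, 1),                  -- user 2: one completed
        (3, 0);                  -- user 3: none completed
""")

rows = conn.execute("""
    SELECT user_id, COUNT(*) AS completed_count
    FROM tasks
    WHERE completed = 1      -- row filter: runs before grouping
    GROUP BY user_id
    HAVING COUNT(*) >= 2     -- group filter: runs after aggregation
""").fetchall()
print(rows)  # [(1, 2)] -- only user 1 has two or more completed tasks
```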

Subqueries and Common Table Expressions

For more complex queries, you can nest queries inside each other:

-- Subquery: Find users with more than average tasks
SELECT username FROM users
WHERE id IN (
    SELECT user_id FROM tasks
    GROUP BY user_id
    HAVING COUNT(*) > (SELECT AVG(task_count) FROM (
        SELECT COUNT(*) AS task_count FROM tasks GROUP BY user_id
    ) AS counts)
);

-- CTE (Common Table Expression): Same query, more readable
WITH task_counts AS (
    SELECT user_id, COUNT(*) AS cnt
    FROM tasks
    GROUP BY user_id
),
avg_count AS (
    SELECT AVG(cnt) AS avg_cnt FROM task_counts
)
SELECT u.username
FROM users u
JOIN task_counts tc ON u.id = tc.user_id
CROSS JOIN avg_count ac
WHERE tc.cnt > ac.avg_cnt;

Vibe Coding Insight: CTEs (WITH clauses) produce more readable SQL. When asking an AI to write complex queries, explicitly request: "Use CTEs instead of nested subqueries." The AI will usually comply, and the result will be much easier to verify and maintain.

Prompting AI for SQL Queries

Here is an effective prompt pattern for generating SQL:

I have a PostgreSQL database with these tables:
- users (id, username, email, created_at)
- tasks (id, user_id, title, description, completed, priority, created_at)
- tags (id, name)
- task_tags (task_id, tag_id)

Write a SQL query that finds the top 5 users by number of completed
high-priority tasks in the last 30 days. Include the user's email
and the count. Use CTEs for readability.

By providing the schema, specifying the database engine, and requesting a style preference, you give the AI enough context to produce accurate, usable SQL.


18.4 SQLAlchemy ORM: Python Database Access

While raw SQL is powerful, writing it directly in Python code is tedious, error-prone, and vulnerable to SQL injection attacks. An Object-Relational Mapper (ORM) lets you interact with your database using Python classes and objects instead of raw SQL strings.

SQLAlchemy is the most popular Python ORM, and it is the standard choice for serious Python applications. It provides two distinct APIs:

  • SQLAlchemy Core: A SQL expression language that gives you Pythonic access to SQL without full ORM overhead.
  • SQLAlchemy ORM: A full object-relational mapper that maps Python classes to database tables.

We will focus on the ORM, as it is the most common choice for web applications and the approach AI assistants default to.

Defining Models

A SQLAlchemy model is a Python class that represents a database table. Here is a complete example using the modern SQLAlchemy 2.0 style with type annotations:

from datetime import datetime
from sqlalchemy import String, Text, Boolean, ForeignKey, func
from sqlalchemy.orm import (
    DeclarativeBase,
    Mapped,
    mapped_column,
    relationship,
)


class Base(DeclarativeBase):
    """Base class for all models."""
    pass


class User(Base):
    __tablename__ = "users"

    id: Mapped[int] = mapped_column(primary_key=True)
    username: Mapped[str] = mapped_column(String(50), unique=True)
    email: Mapped[str] = mapped_column(String(255), unique=True)
    created_at: Mapped[datetime] = mapped_column(
        server_default=func.now()
    )

    # Relationship: one user has many tasks
    tasks: Mapped[list["Task"]] = relationship(
        back_populates="user", cascade="all, delete-orphan"
    )

    def __repr__(self) -> str:
        return f"<User(id={self.id}, username='{self.username}')>"


class Task(Base):
    __tablename__ = "tasks"

    id: Mapped[int] = mapped_column(primary_key=True)
    user_id: Mapped[int] = mapped_column(ForeignKey("users.id"))
    title: Mapped[str] = mapped_column(String(200))
    description: Mapped[str | None] = mapped_column(Text, default=None)
    completed: Mapped[bool] = mapped_column(Boolean, default=False)
    created_at: Mapped[datetime] = mapped_column(
        server_default=func.now()
    )

    # Relationship: each task belongs to one user
    user: Mapped["User"] = relationship(back_populates="tasks")

    def __repr__(self) -> str:
        return f"<Task(id={self.id}, title='{self.title}')>"

Notice several key features:

  • Type hints via Mapped[...]: SQLAlchemy 2.0 uses Python type annotations to define column types. Mapped[str] becomes a non-nullable string column; Mapped[str | None] becomes a nullable one.
  • mapped_column(): Configures column properties like constraints, defaults, and foreign keys.
  • relationship(): Defines how Python objects reference each other. user.tasks returns a list of Task objects; task.user returns the User object.
  • cascade="all, delete-orphan": When a user is deleted, all their tasks are deleted too.

Creating the Engine and Session

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

# SQLite for development
engine = create_engine("sqlite:///app.db", echo=True)

# Create all tables
Base.metadata.create_all(engine)

# Create a session for database operations
with Session(engine) as session:
    # Create a new user
    user = User(username="alice", email="alice@example.com")
    session.add(user)
    session.commit()

The echo=True parameter logs all generated SQL, which is invaluable for learning and debugging.

Querying with the ORM

SQLAlchemy 2.0 uses a select() function for queries:

from sqlalchemy import select

with Session(engine) as session:
    # Get a single user by ID
    user = session.get(User, 1)

    # Query with filters
    stmt = select(User).where(User.username == "alice")
    user = session.scalars(stmt).first()

    # Query with joins
    stmt = (
        select(User, func.count(Task.id).label("task_count"))
        .join(Task, isouter=True)
        .group_by(User.id)
    )
    results = session.execute(stmt).all()

    # Query with ordering and limit
    stmt = (
        select(Task)
        .where(Task.completed == False)
        .order_by(Task.created_at.desc())
        .limit(10)
    )
    tasks = session.scalars(stmt).all()

Vibe Coding Insight: When asking AI to write SQLAlchemy code, always specify "SQLAlchemy 2.0 style" in your prompt. The 1.x style (using session.query()) is still common in AI training data and online examples, but the 2.0 style with select() is the modern standard. If the AI generates session.query(User).filter_by(...), that is the legacy API.

Working with Relationships

One of the ORM's greatest strengths is how naturally it handles relationships:

with Session(engine) as session:
    # Create user with tasks in one operation
    user = User(
        username="bob",
        email="bob@example.com",
        tasks=[
            Task(title="Buy groceries"),
            Task(title="Write report"),
        ],
    )
    session.add(user)
    session.commit()

    # Access related objects naturally
    for task in user.tasks:
        print(f"{task.title} - completed: {task.completed}")

    # Navigate the reverse relationship
    task = session.get(Task, 1)
    print(f"Task '{task.title}' belongs to {task.user.username}")

Session Management Patterns

Proper session management is critical. The session is a "unit of work" that tracks changes and writes them to the database on commit:

# Pattern 1: Context manager (recommended)
with Session(engine) as session:
    with session.begin():
        user = User(username="charlie", email="charlie@example.com")
        session.add(user)
    # Transaction is committed when the inner 'with' block exits
# Session is closed when the outer 'with' block exits

# Pattern 2: Explicit commit/rollback
session = Session(engine)
try:
    user = User(username="diana", email="diana@example.com")
    session.add(user)
    session.commit()
except Exception:
    session.rollback()
    raise
finally:
    session.close()

Warning

Never leave sessions open longer than necessary. A long-lived session holds database connections and can see stale data. For web applications, the standard pattern is one session per request, as we covered in Chapter 17.


18.5 Database Migrations with Alembic

Your database schema will evolve over time. You will add columns, create new tables, modify constraints, and remove obsolete fields. Migrations are versioned scripts that apply these changes in a controlled, repeatable way.

Alembic is the standard migration tool for SQLAlchemy. It tracks which migrations have been applied and can upgrade or downgrade your schema.

Setting Up Alembic

# Install Alembic
pip install alembic

# Initialize Alembic in your project
alembic init migrations

This creates an alembic.ini configuration file (database URL and other settings) in your project root, plus a migrations/ directory containing:

  • migrations/env.py: Script that runs migrations
  • migrations/versions/: Directory where migration scripts live

Configure env.py to use your SQLAlchemy models:

# In migrations/env.py
from app.models import Base
target_metadata = Base.metadata

Creating Migrations

Alembic can autogenerate migrations by comparing your models to the current database schema:

# Generate a migration based on model changes
alembic revision --autogenerate -m "add priority column to tasks"

This produces a migration file like:

"""add priority column to tasks

Revision ID: a1b2c3d4e5f6
Revises: 9z8y7x6w5v4u
Create Date: 2025-03-15 10:30:00.000000
"""
from alembic import op
import sqlalchemy as sa

revision = "a1b2c3d4e5f6"
down_revision = "9z8y7x6w5v4u"


def upgrade() -> None:
    op.add_column(
        "tasks",
        sa.Column("priority", sa.Integer(), nullable=True, server_default="0"),
    )


def downgrade() -> None:
    op.drop_column("tasks", "priority")

Running Migrations

# Apply all pending migrations
alembic upgrade head

# Upgrade one version at a time
alembic upgrade +1

# Downgrade one version
alembic downgrade -1

# See current version
alembic current

# See migration history
alembic history

Migration Best Practices

Vibe Coding Insight: AI assistants can generate Alembic migrations, but always review them before running. Common AI mistakes include: (1) forgetting to handle existing data when adding NOT NULL columns, (2) generating destructive downgrade functions that drop data, and (3) not including proper server_default values for new columns on tables with existing rows.

Rules to follow:

  1. Never edit a migration that has been applied to a shared database. Create a new migration instead.
  2. Always write downgrade functions. You will need them when things go wrong.
  3. Test migrations on a copy of production data before running them in production.
  4. Add data migrations when needed. Schema changes sometimes require transforming existing data.

Here is an example of a data migration:

def upgrade() -> None:
    # Step 1: Add column as nullable
    op.add_column(
        "tasks",
        sa.Column("priority", sa.Integer(), nullable=True),
    )

    # Step 2: Populate with default value
    op.execute("UPDATE tasks SET priority = 0 WHERE priority IS NULL")

    # Step 3: Make column non-nullable
    op.alter_column("tasks", "priority", nullable=False)

This three-step pattern -- add nullable, populate, make non-nullable -- is essential when adding required columns to tables that already contain data.


18.6 Schema Design Patterns

This section covers common patterns you will encounter when designing database schemas. Knowing these patterns helps you prompt AI more effectively and evaluate the schemas it generates.

One-to-Many with Self-Reference

An employee can manage other employees. This is a one-to-many relationship where both sides are in the same table:

class Employee(Base):
    __tablename__ = "employees"

    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column(String(100))
    manager_id: Mapped[int | None] = mapped_column(
        ForeignKey("employees.id")
    )

    manager: Mapped["Employee | None"] = relationship(
        remote_side=[id], back_populates="reports"
    )
    reports: Mapped[list["Employee"]] = relationship(
        back_populates="manager"
    )

Many-to-Many with Extra Data

When a junction table needs additional columns, use an association object pattern:

class Enrollment(Base):
    """A student's enrollment in a course with grade tracking."""
    __tablename__ = "enrollments"

    student_id: Mapped[int] = mapped_column(
        ForeignKey("students.id"), primary_key=True
    )
    course_id: Mapped[int] = mapped_column(
        ForeignKey("courses.id"), primary_key=True
    )
    grade: Mapped[str | None] = mapped_column(String(2))
    enrolled_at: Mapped[datetime] = mapped_column(
        server_default=func.now()
    )

    student: Mapped["Student"] = relationship(
        back_populates="enrollments"
    )
    course: Mapped["Course"] = relationship(
        back_populates="enrollments"
    )

This pattern is essential whenever your many-to-many relationship carries data -- an enrollment date, a role in a team, a quantity in an order.

Polymorphic Associations

Sometimes different types of objects share a common structure. For example, both blog posts and images can have comments. SQLAlchemy supports several polymorphic patterns.

Single Table Inheritance stores all types in one table with a discriminator column:

class Content(Base):
    __tablename__ = "content"

    id: Mapped[int] = mapped_column(primary_key=True)
    type: Mapped[str] = mapped_column(String(50))
    title: Mapped[str] = mapped_column(String(200))
    created_at: Mapped[datetime] = mapped_column(
        server_default=func.now()
    )

    __mapper_args__ = {
        "polymorphic_on": "type",
        "polymorphic_identity": "content",
    }


class BlogPost(Content):
    body: Mapped[str | None] = mapped_column(Text, default=None)

    __mapper_args__ = {
        "polymorphic_identity": "blog_post",
    }


class Image(Content):
    url: Mapped[str | None] = mapped_column(String(500), default=None)
    alt_text: Mapped[str | None] = mapped_column(
        String(200), default=None
    )

    __mapper_args__ = {
        "polymorphic_identity": "image",
    }

Trade-off: Single table inheritance is simple and fast (no JOINs needed), but it wastes space because columns specific to one type are NULL for other types. Use it when subtypes share most of their columns. Use joined table inheritance when subtypes have many distinct columns.

Soft Deletes

Instead of actually deleting records, mark them as deleted:

class SoftDeleteMixin:
    """Mixin that adds soft delete functionality."""
    deleted_at: Mapped[datetime | None] = mapped_column(default=None)

    @property
    def is_deleted(self) -> bool:
        return self.deleted_at is not None


class User(SoftDeleteMixin, Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    username: Mapped[str] = mapped_column(String(50))
    # ... other columns

Soft deletes preserve data for auditing and allow recovery, but they require every query to include a WHERE deleted_at IS NULL filter. Many teams add a default query filter to handle this automatically.
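Here is the resulting query pattern as a runnable sketch, using plain sqlite3 (deleted_at stored as text for simplicity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        username TEXT,
        deleted_at TEXT  -- NULL means the row is live
    );
    INSERT INTO users VALUES (1, 'alice', NULL), (2, 'bob', '2025-03-01 12:00:00');
""")

# "Deleting" bob meant stamping deleted_at, so his row still exists --
# every normal query must now exclude soft-deleted rows explicitly:
live = conn.execute(
    "SELECT username FROM users WHERE deleted_at IS NULL"
).fetchall()
print(live)  # [('alice',)]
```

Forgetting the WHERE deleted_at IS NULL filter in even one query silently resurrects deleted records, which is why a default query filter is worth setting up.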

Timestamps Mixin

Almost every table needs created and updated timestamps:

class TimestampMixin:
    """Mixin that adds created_at and updated_at columns."""
    created_at: Mapped[datetime] = mapped_column(
        server_default=func.now()
    )
    updated_at: Mapped[datetime] = mapped_column(
        server_default=func.now(),
        onupdate=func.now(),
    )

Practical Tip: Define mixins early in your project. When prompting AI, say: "All models should inherit from TimestampMixin for automatic timestamp tracking." This ensures consistency across your entire schema.


18.7 Indexing and Query Optimization

A database with millions of rows but no indexes is like a library with no catalog -- finding anything requires scanning every shelf. Indexes are data structures that make lookups fast.

How Indexes Work

An index is a sorted data structure (typically a B-tree) that maps column values to row locations. When you create an index on users.email, the database maintains a sorted list of all email values with pointers to their rows. Looking up a user by email goes from scanning millions of rows (O(n)) to a tree lookup (O(log n)).

When to Add Indexes

Create indexes on columns that are:

  1. Used in WHERE clauses: SELECT * FROM users WHERE email = 'alice@example.com'
  2. Used in JOIN conditions: ON users.id = tasks.user_id
  3. Used in ORDER BY: ORDER BY created_at DESC
  4. Used in GROUP BY: GROUP BY category_id

Primary keys and columns with UNIQUE constraints are automatically indexed.

Creating Indexes in SQLAlchemy

from sqlalchemy import Index

class Task(Base):
    __tablename__ = "tasks"

    id: Mapped[int] = mapped_column(primary_key=True)
    user_id: Mapped[int] = mapped_column(
        ForeignKey("users.id"), index=True  # Simple index
    )
    title: Mapped[str] = mapped_column(String(200))
    completed: Mapped[bool] = mapped_column(default=False)
    priority: Mapped[int] = mapped_column(default=0)
    created_at: Mapped[datetime] = mapped_column(
        server_default=func.now()
    )

    # Composite index for common query pattern
    __table_args__ = (
        Index("ix_tasks_user_completed", "user_id", "completed"),
        Index(
            "ix_tasks_priority_created",
            "priority",
            "created_at",
        ),
    )

The Cost of Indexes

Indexes are not free:

  • Write overhead: Every INSERT, UPDATE, or DELETE must also update all relevant indexes.
  • Storage space: Indexes consume disk space proportional to the indexed data.
  • Maintenance: Indexes can become fragmented over time and need periodic maintenance.

Rule of Thumb: Index columns you query on frequently. Do not index every column -- that slows writes and wastes storage. For a typical web application, start with indexes on foreign keys and commonly filtered columns, then add more based on actual query performance data.

Using EXPLAIN to Analyze Queries

The EXPLAIN command shows how the database plans to execute a query:

EXPLAIN ANALYZE
SELECT u.username, COUNT(t.id)
FROM users u
LEFT JOIN tasks t ON u.id = t.user_id
WHERE t.completed = true
GROUP BY u.username;

The output reveals whether the database is using indexes (good) or doing full table scans (potentially bad). Key things to look for:

  • Seq Scan: A sequential scan (reading every row). Fine for small tables, problematic for large ones.
  • Index Scan: Using an index. This is what you want for selective queries.
  • Sort: The database is sorting in memory. Consider an index if this is slow.
  • Hash Join / Nested Loop: Join strategies. Hash joins are generally faster for large data sets.

Vibe Coding Insight: You can paste EXPLAIN output directly into your AI prompt and ask: "How can I optimize this query? What indexes would help?" The AI is remarkably good at reading execution plans and suggesting improvements.
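You can also inspect query plans programmatically. SQLite's variant of the command is EXPLAIN QUERY PLAN, which is easy to run from Python's built-in sqlite3 module (the table and index names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX ix_users_email ON users (email)")

rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?",
    ("alice@example.com",),
).fetchall()

# The last column of each row describes a plan step; for this query
# it should mention the index rather than a full scan of the table.
for row in rows:
    print(row[-1])
```

A useful trick is to run this kind of check in your test suite for your hottest queries, so a dropped index fails a test instead of slowing down production.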

Common Optimization Patterns

Select only the columns you need:

# Bad: fetches all columns
stmt = select(User).where(User.id == 1)

# Better: fetches only needed columns
stmt = select(User.username, User.email).where(User.id == 1)

Avoid N+1 queries:

# Bad: N+1 queries (1 for users, N for each user's tasks)
users = session.scalars(select(User)).all()
for user in users:
    print(len(user.tasks))  # Each access triggers a query

# Good: eager loading
from sqlalchemy.orm import selectinload

stmt = select(User).options(selectinload(User.tasks))
users = session.scalars(stmt).all()
for user in users:
    print(len(user.tasks))  # No additional queries

The N+1 problem is the single most common performance issue in ORM-based applications. When you ask an AI to write code that iterates over related objects, always check whether it includes eager loading.


18.8 NoSQL Options: When and Why

Relational databases are the right choice for most applications, but they are not the only choice. This section covers when and why you might reach for a NoSQL database.

Document Databases: MongoDB

MongoDB stores data as flexible JSON-like documents (BSON). Each document can have a different structure, making it appealing for:

  • Rapidly evolving schemas: When your data model changes frequently during early development.
  • Hierarchical data: When data is naturally nested (e.g., a blog post with embedded comments).
  • Content management: Where each piece of content might have different fields.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.myapp

# Insert a document -- no predefined schema needed
db.users.insert_one({
    "username": "alice",
    "email": "alice@example.com",
    "preferences": {
        "theme": "dark",
        "language": "en",
        "notifications": {"email": True, "sms": False},
    },
})

# Query with flexible filters
user = db.users.find_one({"preferences.theme": "dark"})

When NOT to use MongoDB:

  • When your data has complex relationships (use a relational database)
  • When you need ACID transactions across many documents (MongoDB has supported multi-document transactions since version 4.0, but with more caveats and overhead than a relational database)
  • When you need complex JOINs and aggregations (SQL is far more capable)

Honest Assessment: MongoDB became popular partly due to marketing and the appeal of "schemaless" development. In practice, most applications benefit from the structure and guarantees of a relational database. Use MongoDB when you genuinely need document flexibility, not because it seems simpler.

Key-Value Stores: Redis

Redis is an in-memory data store that excels at:

  • Caching: Store frequently accessed data in memory for microsecond access times.
  • Session storage: Web session data that needs to be fast but can tolerate loss.
  • Rate limiting: Count API requests per user with automatic expiration.
  • Queues: Simple job queues and pub/sub messaging.

import redis
from datetime import datetime


class RateLimitExceeded(Exception):
    """Raised when a client exceeds its request quota."""


r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Simple key-value
r.set("user:1:name", "alice")
name = r.get("user:1:name")

# Caching with expiration (60 seconds)
r.setex("api:weather:nyc", 60, '{"temp": 72, "condition": "sunny"}')

# Rate limiting: count requests in the current minute, expire the counter
pipe = r.pipeline()
key = f"rate_limit:user:1:{datetime.now().minute}"
pipe.incr(key)
pipe.expire(key, 60)
count, _ = pipe.execute()
if count > 100:
    raise RateLimitExceeded()

Choosing the Right Database

  Requirement                           Best Choice
  ------------------------------------  ------------------
  Structured data with relationships    PostgreSQL/MySQL
  Rapid prototyping, file-based         SQLite
  Flexible documents, nested data       MongoDB
  Caching, sessions, rate limiting      Redis
  Time-series data                      TimescaleDB
  Full-text search                      Elasticsearch
  Graph relationships                   Neo4j

Practical Tip: For most vibe coding projects, start with PostgreSQL (or SQLite for local development). Add Redis for caching if and when performance requires it. You almost certainly do not need MongoDB unless you have a specific use case that demands document flexibility.


18.9 Data Validation and Integrity

Good data design enforces validity at multiple levels. This section covers the layers of defense against bad data.

Layer 1: Database Constraints

The database itself should be your last line of defense. Constraints in the schema prevent invalid data even if application code has bugs:

class User(Base):
    __tablename__ = "users"

    id: Mapped[int] = mapped_column(primary_key=True)
    username: Mapped[str] = mapped_column(
        String(50), unique=True, nullable=False
    )
    email: Mapped[str] = mapped_column(
        String(255), unique=True, nullable=False
    )
    age: Mapped[int | None] = mapped_column(
        CheckConstraint("age >= 0 AND age <= 150", name="valid_age")
    )

    __table_args__ = (
        CheckConstraint(
            "length(username) >= 3",
            name="username_min_length",
        ),
    )

Common database constraints:

  • NOT NULL: Column must have a value
  • UNIQUE: No two rows can have the same value in this column
  • CHECK: Arbitrary boolean expression that must be true
  • FOREIGN KEY: Value must exist in the referenced table
  • DEFAULT: Value used when none is specified
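To see these constraints doing their job, here is a quick sqlite3 sketch (with an illustrative table) in which each constraint type rejects a bad row regardless of what the application sends:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id       INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE CHECK (length(username) >= 3),
        age      INTEGER CHECK (age BETWEEN 0 AND 150)
    )
""")
conn.execute("INSERT INTO users (username, age) VALUES ('alice', 30)")

# Each of these rows violates a different constraint:
# too-short username (CHECK), duplicate (UNIQUE),
# negative age (CHECK), missing username (NOT NULL).
violations = []
for row in [("al", 30), ("alice", 30), ("bob", -1), (None, 30)]:
    try:
        conn.execute("INSERT INTO users (username, age) VALUES (?, ?)", row)
    except sqlite3.IntegrityError as exc:
        violations.append(str(exc))
```

All four inserts fail at the database level, which is exactly the behavior you want when a bug slips past the application layers above.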

Layer 2: Application-Level Validation with Pydantic

Pydantic validates data before it reaches the database. This catches errors early and provides clear error messages:

from pydantic import BaseModel, EmailStr, Field, field_validator


class UserCreate(BaseModel):
    """Schema for creating a new user."""
    username: str = Field(
        min_length=3, max_length=50, pattern=r"^[a-zA-Z0-9_]+$"
    )
    email: EmailStr
    age: int | None = Field(default=None, ge=0, le=150)

    @field_validator("username")
    @classmethod
    def username_not_reserved(cls, v: str) -> str:
        reserved = {"admin", "root", "system", "null"}
        if v.lower() in reserved:
            raise ValueError(f"Username '{v}' is reserved")
        return v.lower()


class UserUpdate(BaseModel):
    """Schema for updating user fields (all optional)."""
    username: str | None = Field(
        default=None, min_length=3, max_length=50
    )
    email: EmailStr | None = None
    age: int | None = Field(default=None, ge=0, le=150)

This validation layer integrates seamlessly with the REST APIs from Chapter 17. A FastAPI endpoint using these schemas automatically validates request data before your database code ever sees it.

Layer 3: Business Logic Validation

Some validation rules require database queries -- for example, checking that a username is not already taken:

def create_user(session: Session, data: UserCreate) -> User:
    """Create a new user with full validation."""
    # Check uniqueness (business logic requiring DB access)
    existing = session.scalars(
        select(User).where(
            (User.username == data.username)
            | (User.email == data.email)
        )
    ).first()

    if existing:
        if existing.username == data.username:
            raise ValueError(f"Username '{data.username}' is taken")
        raise ValueError(f"Email '{data.email}' is already registered")

    user = User(
        username=data.username,
        email=data.email,
        age=data.age,
    )
    session.add(user)
    session.flush()  # Get the generated ID without committing
    return user

The Defense-in-Depth Principle

Request Data
    │
    ▼
[Pydantic Validation]  ← Catches type errors, format issues
    │
    ▼
[Business Logic]        ← Checks uniqueness, permissions, rules
    │
    ▼
[Database Constraints]  ← Final safety net catches anything missed

Vibe Coding Insight: When asking AI to build a data layer, explicitly request all three validation layers: "Validate input with Pydantic, check business rules in the service layer, and enforce constraints at the database level." Without this guidance, AI often implements only one layer, leaving your data vulnerable.


18.10 Building a Data Layer with AI

This final section ties everything together into a practical workflow for building a complete data layer using AI assistance.

Step 1: Describe Your Domain

Start by writing a clear description of your domain entities and relationships. This becomes your primary prompt:

I am building a project management application. The entities are:

- Users: have a username, email, and role (admin, manager, member)
- Projects: have a name, description, and owner (a user)
- Tasks: belong to a project and are assigned to a user. They have
  a title, description, status (todo, in_progress, done), priority
  (1-5), and due date.
- Comments: belong to a task, written by a user, contain text and
  a timestamp.

Relationships:
- A user can own multiple projects
- A project has multiple tasks
- A task is assigned to one user
- A task can have multiple comments
- A user can be a member of multiple projects (many-to-many)

Generate SQLAlchemy 2.0 models with:
- Type hints using Mapped[]
- Proper relationships with back_populates
- A TimestampMixin for created_at/updated_at
- Appropriate indexes
- __repr__ methods

Step 2: Review and Refine the Schema

When the AI generates models, check for:

  • Missing indexes on foreign keys: AI sometimes forgets index=True on foreign key columns.
  • Cascade behavior: Are orphaned records handled correctly when parents are deleted?
  • Nullable fields: Should that field really be optional, or should it be required?
  • String lengths: Are VARCHAR lengths appropriate for the data?
  • Missing constraints: Are there CHECK constraints for enum-like fields (e.g., status values)?

Step 3: Generate Pydantic Schemas

Prompt the AI to create matching Pydantic models for input validation:

Based on the SQLAlchemy models above, generate Pydantic schemas for:
- Creating each entity (required fields only)
- Updating each entity (all fields optional)
- Reading each entity (includes id and timestamps)

Use Pydantic v2 syntax with Field validators.

Step 4: Build Repository Functions

The repository pattern encapsulates database access behind a clean interface:

class TaskRepository:
    """Handles all database operations for tasks."""

    def __init__(self, session: Session) -> None:
        self.session = session

    def get_by_id(self, task_id: int) -> Task | None:
        return self.session.get(Task, task_id)

    def list_for_project(
        self,
        project_id: int,
        status: str | None = None,
        limit: int = 50,
        offset: int = 0,
    ) -> list[Task]:
        stmt = (
            select(Task)
            .where(Task.project_id == project_id)
            .order_by(Task.priority.desc(), Task.created_at)
            .limit(limit)
            .offset(offset)
        )
        if status:
            stmt = stmt.where(Task.status == status)
        return list(self.session.scalars(stmt).all())

    def create(self, data: TaskCreate) -> Task:
        task = Task(**data.model_dump())
        self.session.add(task)
        self.session.flush()
        return task

    def update(self, task: Task, data: TaskUpdate) -> Task:
        for field, value in data.model_dump(exclude_unset=True).items():
            setattr(task, field, value)
        self.session.flush()
        return task

    def delete(self, task: Task) -> None:
        self.session.delete(task)
        self.session.flush()

Step 5: Set Up Migrations

With models defined, initialize Alembic and create your initial migration:

alembic init migrations
# Configure env.py to point to your models
alembic revision --autogenerate -m "initial schema"
alembic upgrade head
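The env.py configuration step mostly consists of importing your declarative Base and assigning its metadata (the `app.models` path below is an assumption about your project layout; adjust it to wherever your models live):

```python
# migrations/env.py (excerpt)
from app.models import Base  # assumption: your Base lives in app/models.py

# Alembic compares this metadata against the live database
# when you run `alembic revision --autogenerate`.
target_metadata = Base.metadata
```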

Step 6: Integration with Your API

Connect the data layer to your FastAPI application from Chapter 17:

from fastapi import FastAPI, Depends, HTTPException
from sqlalchemy.orm import Session

app = FastAPI()


def get_session():
    """Dependency that provides a database session."""
    with Session(engine) as session:
        with session.begin():
            yield session


@app.post("/tasks", response_model=TaskRead)
def create_task(
    data: TaskCreate,
    session: Session = Depends(get_session),
):
    repo = TaskRepository(session)
    return repo.create(data)


@app.get("/projects/{project_id}/tasks", response_model=list[TaskRead])
def list_project_tasks(
    project_id: int,
    status: str | None = None,
    session: Session = Depends(get_session),
):
    repo = TaskRepository(session)
    return repo.list_for_project(project_id, status=status)

Vibe Coding Insight: This layered architecture -- Pydantic schemas for validation, repository classes for data access, FastAPI endpoints for HTTP handling -- is the standard pattern for Python web applications. When you ask an AI to build a feature, frame your request in terms of these layers: "Add a Pydantic schema for task filtering, a repository method for searching tasks, and a FastAPI endpoint that ties them together."

The Complete Workflow

Here is the full development cycle for building a data layer with AI:

  1. Design: Describe your domain entities and relationships to the AI
  2. Model: Have AI generate SQLAlchemy models; review for correctness
  3. Validate: Generate Pydantic schemas for input/output validation
  4. Access: Build repository classes for clean database access
  5. Migrate: Use Alembic to version your schema changes
  6. Integrate: Connect to your API layer with dependency injection
  7. Optimize: Add indexes based on actual query patterns; use EXPLAIN to verify
  8. Test: Write tests using an in-memory SQLite database for speed

Each of these steps is a natural prompting opportunity. Rather than asking for everything at once, work through them methodically. A focused prompt ("generate a repository class for tasks with methods for CRUD, filtering by status, and pagination") produces better results than a vague one ("build my data layer").


Chapter Summary

This chapter covered the full spectrum of database design and data modeling for vibe coders. You learned relational fundamentals -- tables, relationships, keys, and normalization -- that give you the vocabulary to communicate with AI about data design. You wrote SQL queries and understood JOINs, aggregations, and CTEs well enough to verify AI-generated SQL. You built Python data layers with SQLAlchemy 2.0, managed schema evolution with Alembic, and learned when to consider NoSQL alternatives.

Most importantly, you learned the layered approach to data integrity: Pydantic for input validation, business logic for domain rules, and database constraints as the final safety net. This defense-in-depth strategy protects your data even when individual layers have bugs.

The patterns in this chapter -- the repository pattern, the timestamps mixin, soft deletes, eager loading to avoid N+1 queries -- are the building blocks of professional data layers. In Chapter 19, you will combine the API skills from Chapter 17 with the database skills from this chapter to build full-stack applications with persistent data, authentication, and real-world deployment considerations.

Key Reminder: Database design decisions are among the hardest to change later. A poorly designed schema embedded in a running application with real user data is extremely expensive to fix. Invest the time to get your data model right early, even if it means spending an extra hour reviewing AI-generated schemas before writing a single line of application code.


Next chapter: Chapter 19 -- Full-Stack Development, where you combine frontend, backend, and database into complete applications.