> "Bad programmers worry about the code. Good programmers worry about data structures and their relationships." -- Linus Torvalds
In This Chapter
- Learning Objectives
- 18.1 Database Fundamentals for Vibe Coders
- 18.2 Relational Database Design
- 18.3 SQL Through AI: Queries, Joins, and Aggregations
- 18.4 SQLAlchemy ORM: Python Database Access
- 18.5 Database Migrations with Alembic
- 18.6 Schema Design Patterns
- 18.7 Indexing and Query Optimization
- 18.8 NoSQL Options: When and Why
- 18.9 Data Validation and Integrity
- 18.10 Building a Data Layer with AI
- Chapter Summary
Chapter 18: Database Design and Data Modeling
In the preceding chapters, you built command-line tools (Chapter 15), web frontends (Chapter 16), and REST APIs (Chapter 17). But every one of those projects shares a common need: persistent, structured data. A task manager that forgets your tasks when you close the terminal is a toy. A web application that loses user registrations on restart is useless. A REST API without reliable data storage is a facade with nothing behind it.
This chapter teaches you how to design, build, and manage the data layer that underpins real software. You will learn relational database fundamentals, write SQL with AI assistance, use SQLAlchemy to interact with databases from Python, manage schema evolution with Alembic, and understand when relational databases are not the right tool. By the end, you will be able to ask an AI assistant to help you design a database schema and critically evaluate the result -- knowing which suggestions to keep, which to modify, and which to reject entirely.
Database design is one of the areas where AI assistants shine brightest and stumble most dangerously. An AI can generate a perfectly normalized schema in seconds, but it can also produce one with subtle flaws -- missing indexes, incorrect relationship cardinalities, or denormalization choices that will haunt you at scale. The vibe coder who understands the underlying principles will consistently get better results than one who accepts AI output uncritically.
Learning Objectives
By the end of this chapter, you will be able to:
- Define (Bloom's: Remember) the core components of a relational database: tables, rows, columns, primary keys, and foreign keys
- Explain (Bloom's: Understand) normalization through Third Normal Form and articulate why it matters for data integrity
- Write (Bloom's: Apply) SQL queries including SELECT, INSERT, UPDATE, DELETE, JOINs, GROUP BY, and aggregate functions with AI assistance
- Construct (Bloom's: Apply) SQLAlchemy ORM models with relationships, constraints, and type hints
- Manage (Bloom's: Apply) database schema changes using Alembic migrations
- Analyze (Bloom's: Analyze) schema design patterns including one-to-many, many-to-many, and polymorphic associations
- Evaluate (Bloom's: Evaluate) indexing strategies and their impact on query performance
- Compare (Bloom's: Analyze) relational and NoSQL databases to select the appropriate tool for a given problem
- Design (Bloom's: Create) a complete data layer with validation, constraints, and proper access patterns
18.1 Database Fundamentals for Vibe Coders
Before you can effectively prompt an AI to design your database, you need a mental model of how databases work. This section builds that model from first principles.
What Is a Database?
A database is an organized collection of data that supports efficient storage, retrieval, and modification. While you could store data in flat files, CSV spreadsheets, or JSON documents (as you did in Chapter 6), databases provide critical guarantees that files do not:
- Concurrent access: Multiple users or processes can read and write simultaneously without corrupting data.
- Durability: Once data is saved (committed), it survives crashes, power failures, and restarts.
- Integrity constraints: Rules enforced by the database itself prevent invalid data from being stored.
- Efficient querying: Indexes and query optimizers make it fast to find specific records among millions.
Relational Databases: The Workhorse
A relational database organizes data into tables (also called relations). Each table has:
- Columns (fields/attributes): Define the type of data stored. A users table might have columns id, username, email, and created_at.
- Rows (records/tuples): Each row is one entry in the table -- one user, one order, one product.
- Primary key: A column (or set of columns) that uniquely identifies each row. Typically an auto-incrementing integer id or a UUID.
- Foreign key: A column that references the primary key of another table, establishing a relationship between tables.
Here is a concrete example:
users table:
+----+----------+-------------------+---------------------+
| id | username | email | created_at |
+----+----------+-------------------+---------------------+
| 1 | alice | alice@example.com | 2025-01-15 09:30:00 |
| 2 | bob | bob@example.com | 2025-02-20 14:15:00 |
+----+----------+-------------------+---------------------+
tasks table:
+----+---------+----------------+-----------+
| id | user_id | title | completed |
+----+---------+----------------+-----------+
| 1 | 1 | Buy groceries | false |
| 2 | 1 | Write chapter | true |
| 3 | 2 | Fix bug #42 | false |
+----+---------+----------------+-----------+
The user_id column in the tasks table is a foreign key pointing to users.id. This relationship means "each task belongs to one user, and each user can have many tasks."
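To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module that creates the two tables above, loads the sample rows, and shows the foreign key doing its job. (The schema and data mirror the tables shown; the in-memory database is just for illustration.)

```python
import sqlite3

# In-memory database; note SQLite does not enforce foreign keys unless asked
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE users (
        id         INTEGER PRIMARY KEY,
        username   TEXT NOT NULL,
        email      TEXT NOT NULL,
        created_at TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE tasks (
        id        INTEGER PRIMARY KEY,
        user_id   INTEGER NOT NULL REFERENCES users(id),
        title     TEXT NOT NULL,
        completed INTEGER NOT NULL DEFAULT 0
    )
""")

conn.executemany(
    "INSERT INTO users (id, username, email, created_at) VALUES (?, ?, ?, ?)",
    [(1, "alice", "alice@example.com", "2025-01-15 09:30:00"),
     (2, "bob", "bob@example.com", "2025-02-20 14:15:00")],
)
conn.executemany(
    "INSERT INTO tasks (id, user_id, title, completed) VALUES (?, ?, ?, ?)",
    [(1, 1, "Buy groceries", 0), (2, 1, "Write chapter", 1), (3, 2, "Fix bug #42", 0)],
)

# The foreign key makes invalid references fail loudly
try:
    conn.execute("INSERT INTO tasks (user_id, title) VALUES (999, 'orphan')")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

The rejected insert is the whole point of the foreign key: the database itself refuses a task that points at a nonexistent user, rather than leaving the bug for your application code to find later.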
The ACID Guarantees
Relational databases provide four guarantees known by the acronym ACID:
- Atomicity: A transaction either completes entirely or has no effect. If you transfer money between accounts, both the debit and credit happen, or neither does.
- Consistency: Every transaction moves the database from one valid state to another. Constraints are never violated.
- Isolation: Concurrent transactions do not interfere with each other. Each transaction sees a consistent snapshot of the data.
- Durability: Once a transaction is committed, it is permanent -- even if the server crashes immediately afterward.
Vibe Coding Insight: When you ask an AI to "create a database for my application," these guarantees are what you are really asking for. Understanding ACID helps you evaluate whether an AI's suggestion of using a simple JSON file versus a real database is appropriate for your use case.
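Atomicity is easy to see in code. The sketch below (illustrative table and amounts) attempts the classic money transfer inside one transaction; the debit violates a CHECK constraint, so the whole transfer rolls back and neither account changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

try:
    with conn:  # one transaction: commit on success, rollback on error
        # This debit would drive the balance negative, violating the CHECK
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
except sqlite3.IntegrityError:
    pass  # the whole transfer rolled back; neither update stuck

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 100, 2: 50} -- unchanged
```

That "both or neither" behavior is atomicity; a JSON file updated with two separate writes gives you no such guarantee if the process dies between them.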
Popular Relational Databases
For vibe coding projects, you will most commonly encounter:
| Database | Best For | Key Trait |
|---|---|---|
| SQLite | Prototyping, small apps, embedded systems | Zero configuration, file-based |
| PostgreSQL | Production web apps, complex queries | Feature-rich, highly reliable |
| MySQL | Web applications, read-heavy workloads | Widely deployed, well-known |
SQLite is the ideal starting point for vibe coders. It requires no server setup -- the entire database lives in a single file. You can start building immediately and migrate to PostgreSQL when your application demands it.
Practical Tip: When prompting an AI to build your data layer, start with SQLite for development and specify PostgreSQL for production. SQLAlchemy (Section 18.4) makes switching between them nearly painless.
18.2 Relational Database Design
Designing a good database schema is part science and part art. This section teaches the science -- normalization -- and gives you the intuition to apply it effectively.
Thinking in Relationships
The first step in database design is identifying the entities (things) in your system and the relationships between them. Consider an online bookstore:
- Entities: Books, Authors, Customers, Orders, Reviews
- Relationships: A book has one or more authors. A customer places orders. An order contains books. A customer writes reviews about books.
Each entity becomes a table. Each relationship becomes either a foreign key or a junction table (more on this shortly).
Types of Relationships
There are three fundamental relationship types:
One-to-Many (1:N): The most common relationship. One user has many tasks. One department has many employees. Implemented with a foreign key on the "many" side.
departments employees
+----+--------+ +----+---------+---------------+
| id | name | | id | dept_id | name |
+----+--------+ +----+---------+---------------+
| 1 | Sales | | 1 | 1 | Alice |
| 2 | Eng | | 2 | 2 | Bob |
+----+--------+ | 3 | 2 | Charlie |
+----+---------+---------------+
One-to-One (1:1): Each row in table A corresponds to exactly one row in table B. A user has one profile. Implemented with a foreign key plus a unique constraint.
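The "foreign key plus unique constraint" recipe can be seen directly in SQL. A minimal sqlite3 sketch (illustrative users/profiles tables) shows the UNIQUE constraint rejecting a second profile for the same user:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL)")
# UNIQUE on user_id means each user can have at most one profile: a 1:1 link
conn.execute("""
    CREATE TABLE profiles (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL UNIQUE REFERENCES users(id),
        bio     TEXT
    )
""")
conn.execute("INSERT INTO users (id, username) VALUES (1, 'alice')")
conn.execute("INSERT INTO profiles (user_id, bio) VALUES (1, 'First profile')")

try:
    conn.execute("INSERT INTO profiles (user_id, bio) VALUES (1, 'Second profile')")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)  # unique constraint on profiles.user_id
```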
Many-to-Many (M:N): Books have multiple authors; authors write multiple books. Implemented with a junction table (also called an association table or bridge table):
books book_authors authors
+----+---------+ +---------+-----------+ +----+---------+
| id | title | | book_id | author_id | | id | name |
+----+---------+ +---------+-----------+ +----+---------+
| 1 | Book A | | 1 | 1 | | 1 | Smith |
| 2 | Book B | | 1 | 2 | | 2 | Jones |
+----+---------+ | 2 | 2 | +----+---------+
+---------+-----------+
Normalization: Eliminating Redundancy
Normalization is the process of organizing your tables to minimize data redundancy and dependency. There are several "normal forms," but for practical purposes, understanding the first three is sufficient.
First Normal Form (1NF): No Repeating Groups
A table is in 1NF if every column contains atomic (indivisible) values -- no lists, no sets, no comma-separated values in a single column.
Bad (violates 1NF):
+----+----------+-------------------------+
| id | name | phone_numbers |
+----+----------+-------------------------+
| 1 | Alice | 555-0101, 555-0102 |
+----+----------+-------------------------+
Good (1NF):
users: phone_numbers:
+----+-------+ +----+---------+----------+
| id | name | | id | user_id | number |
+----+-------+ +----+---------+----------+
| 1 | Alice | | 1 | 1 | 555-0101 |
+----+-------+ | 2 | 1 | 555-0102 |
+----+---------+----------+
Intuition: If you ever find yourself wanting to store a comma-separated list in a database column, that is a strong signal you need a separate table. The AI might sometimes suggest array columns (PostgreSQL supports them), but a separate table is almost always the better design for queryability and integrity.
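The payoff of the 1NF design is queryability. With a child table, "who has this phone number?" is an ordinary WHERE clause rather than a fragile LIKE over a comma-separated string. A small sqlite3 sketch of the tables above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE phone_numbers (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        number  TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO users VALUES (1, 'Alice')")
conn.executemany(
    "INSERT INTO phone_numbers (user_id, number) VALUES (?, ?)",
    [(1, "555-0101"), (1, "555-0102")],
)

# An exact-match lookup, which could also be backed by an index --
# impossible to do cleanly against '555-0101, 555-0102' in one column
row = conn.execute(
    "SELECT u.name FROM users u "
    "JOIN phone_numbers p ON p.user_id = u.id WHERE p.number = ?",
    ("555-0102",),
).fetchone()
print(row)  # ('Alice',)
```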
Second Normal Form (2NF): No Partial Dependencies
A table is in 2NF if it is in 1NF and every non-key column depends on the entire primary key. This matters most when you have composite primary keys.
Bad (violates 2NF): In a table with composite key (student_id, course_id), if student_name depends only on student_id, that is a partial dependency.
+------------+-----------+--------------+-------+
| student_id | course_id | student_name | grade |
+------------+-----------+--------------+-------+
| 1 | 101 | Alice | A |
| 1 | 102 | Alice | B |
+------------+-----------+--------------+-------+
"Alice" is repeated because student_name depends only on student_id, not on the full key. Fix: move student_name to a separate students table.
Third Normal Form (3NF): No Transitive Dependencies
A table is in 3NF if it is in 2NF and no non-key column depends on another non-key column.
Bad (violates 3NF):
+----+---------+-----------+-------------+
| id | zip_code| city | state |
+----+---------+-----------+-------------+
| 1 | 10001 | New York | NY |
| 2 | 10001 | New York | NY |
+----+---------+-----------+-------------+
Here city and state depend on zip_code, not on id. Fix: create a separate zip_codes table.
Vibe Coding Insight: When you prompt an AI with "design a database for a bookstore," it will usually produce a reasonably normalized schema. But check for these common issues: (1) denormalized fields added for "convenience" that will cause update anomalies, (2) missing junction tables for many-to-many relationships, and (3) overly normalized designs that split data into too many tiny tables. The sweet spot for most applications is 3NF with strategic, intentional denormalization where performance demands it.
When to Break the Rules
Normalization is not an absolute mandate. Sometimes you deliberately denormalize for performance:
- Caching computed values: Storing order_total on the orders table instead of recalculating from line items every time.
- Read-heavy workloads: Duplicating a username alongside a foreign key to avoid a JOIN in frequently-read views.
- Reporting tables: Creating flattened, denormalized tables specifically for analytics queries.
The key is that denormalization should be a conscious, documented decision -- not an accident.
18.3 SQL Through AI: Queries, Joins, and Aggregations
SQL (Structured Query Language) is the universal language for interacting with relational databases. Even if you use an ORM (Section 18.4), understanding SQL is essential because you need to verify what the ORM generates and write complex queries that ORMs handle poorly.
The Four Fundamental Operations (CRUD)
CREATE (INSERT):
INSERT INTO users (username, email, created_at)
VALUES ('alice', 'alice@example.com', '2025-01-15 09:30:00');
READ (SELECT):
SELECT username, email FROM users WHERE id = 1;
UPDATE:
UPDATE users SET email = 'newalice@example.com' WHERE id = 1;
DELETE:
DELETE FROM users WHERE id = 1;
Warning
A DELETE or UPDATE without a WHERE clause affects every row in the table. This is one of the most dangerous mistakes in SQL. Always double-check your WHERE clause, even when AI generates the query. Some teams enforce a policy of running SELECT with the same WHERE clause first to verify which rows will be affected.
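One way to make that team policy mechanical is to build the WHERE clause once and reuse it for both the preview SELECT and the destructive statement. A sqlite3 sketch (the where fragment here is a trusted constant; user-supplied values still go through parameters):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

where = "id = ?"   # one shared clause for preview and delete
params = (1,)

# Step 1: preview exactly which rows the WHERE clause matches
preview = conn.execute(f"SELECT * FROM users WHERE {where}", params).fetchall()
print(preview)  # [(1, 'alice')]

# Step 2: only then run the destructive statement with the same clause
cur = conn.execute(f"DELETE FROM users WHERE {where}", params)
print(cur.rowcount)  # 1
```

Because both statements share the same string, the preview cannot silently drift out of sync with the delete.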
Filtering and Sorting
-- Multiple conditions
SELECT * FROM tasks
WHERE completed = false AND user_id = 1
ORDER BY created_at DESC
LIMIT 10;
-- Pattern matching
SELECT * FROM users WHERE email LIKE '%@example.com';
-- Range queries
SELECT * FROM orders
WHERE total BETWEEN 100 AND 500
AND created_at >= '2025-01-01';
JOINs: Combining Data from Multiple Tables
JOINs are where SQL becomes powerful. They let you combine rows from different tables based on related columns.
INNER JOIN returns only rows that have matches in both tables:
SELECT users.username, tasks.title, tasks.completed
FROM users
INNER JOIN tasks ON users.id = tasks.user_id;
LEFT JOIN returns all rows from the left table and matched rows from the right. Unmatched right-side columns become NULL:
SELECT users.username, COUNT(tasks.id) AS task_count
FROM users
LEFT JOIN tasks ON users.id = tasks.user_id
GROUP BY users.username;
This query returns all users, even those with zero tasks. An INNER JOIN would exclude users without tasks.
Why This Matters for Vibe Coding: When you prompt an AI to "get all users with their task counts," the AI might generate either an INNER JOIN or LEFT JOIN. You need to know the difference to verify the output matches your intent. Do you want to see users with zero tasks? Then you need a LEFT JOIN.
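The difference is easy to demonstrate. In the sqlite3 sketch below (illustrative data: carol has no tasks), the INNER JOIN silently drops users with zero tasks while the LEFT JOIN keeps them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "alice"), (2, "bob"), (3, "carol")])
# Only alice has tasks; bob and carol have none
conn.executemany("INSERT INTO tasks VALUES (?, ?, ?)",
                 [(1, 1, "Buy groceries"), (2, 1, "Write chapter")])

inner = conn.execute("""
    SELECT u.username, COUNT(t.id)
    FROM users u INNER JOIN tasks t ON u.id = t.user_id
    GROUP BY u.username ORDER BY u.username
""").fetchall()

left = conn.execute("""
    SELECT u.username, COUNT(t.id)
    FROM users u LEFT JOIN tasks t ON u.id = t.user_id
    GROUP BY u.username ORDER BY u.username
""").fetchall()

print(inner)  # [('alice', 2)] -- bob and carol vanish
print(left)   # [('alice', 2), ('bob', 0), ('carol', 0)]
```

Note that COUNT(t.id), not COUNT(*), gives the zero counts: unmatched left-join rows have a NULL t.id, and COUNT ignores NULLs.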
Aggregation with GROUP BY
Aggregation functions (COUNT, SUM, AVG, MIN, MAX) summarize data across groups of rows:
-- Count tasks per user
SELECT user_id, COUNT(*) AS task_count
FROM tasks
GROUP BY user_id;
-- Average order total by month
SELECT
DATE_TRUNC('month', created_at) AS month,
AVG(total) AS avg_total,
COUNT(*) AS order_count
FROM orders
GROUP BY DATE_TRUNC('month', created_at)
ORDER BY month;
-- Filter groups with HAVING
SELECT user_id, COUNT(*) AS task_count
FROM tasks
GROUP BY user_id
HAVING COUNT(*) > 5;
Important Distinction: WHERE filters individual rows before grouping. HAVING filters groups after aggregation. AI assistants sometimes confuse these -- always check that the filter is in the right clause.
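A quick sqlite3 sketch of both clauses in one query (illustrative data; completed is stored as 0/1, as SQLite has no native boolean):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, user_id INTEGER, completed INTEGER)")
# user 1: two completed tasks and one open; user 2: one completed task
conn.executemany("INSERT INTO tasks VALUES (?, ?, ?)",
                 [(1, 1, 1), (2, 1, 1), (3, 1, 0), (4, 2, 1)])

rows = conn.execute("""
    SELECT user_id, COUNT(*) FROM tasks
    WHERE completed = 1          -- WHERE trims rows before grouping
    GROUP BY user_id
    HAVING COUNT(*) >= 2         -- HAVING then trims whole groups
    ORDER BY user_id
""").fetchall()
print(rows)  # [(1, 2)]
```

User 1 survives both filters (two completed tasks); user 2's single completed task passes the WHERE but the group is then dropped by HAVING.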
Subqueries and Common Table Expressions
For more complex queries, you can nest queries inside each other:
-- Subquery: Find users with more than average tasks
SELECT username FROM users
WHERE id IN (
SELECT user_id FROM tasks
GROUP BY user_id
HAVING COUNT(*) > (SELECT AVG(task_count) FROM (
SELECT COUNT(*) AS task_count FROM tasks GROUP BY user_id
) AS counts)
);
-- CTE (Common Table Expression): Same query, more readable
WITH task_counts AS (
SELECT user_id, COUNT(*) AS cnt
FROM tasks
GROUP BY user_id
),
avg_count AS (
SELECT AVG(cnt) AS avg_cnt FROM task_counts
)
SELECT u.username
FROM users u
JOIN task_counts tc ON u.id = tc.user_id
CROSS JOIN avg_count ac
WHERE tc.cnt > ac.avg_cnt;
Vibe Coding Insight: CTEs (WITH clauses) produce more readable SQL. When asking an AI to write complex queries, explicitly request: "Use CTEs instead of nested subqueries." The AI will usually comply, and the result will be much easier to verify and maintain.
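CTEs are not PostgreSQL-specific: SQLite supports WITH as well, so you can test this exact pattern locally. A self-contained sketch of the "above-average task count" query (illustrative data: alice has three tasks, bob one, so the average is two):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.executemany("INSERT INTO tasks VALUES (?, ?)", [(1, 1), (2, 1), (3, 1), (4, 2)])

rows = conn.execute("""
    WITH task_counts AS (
        SELECT user_id, COUNT(*) AS cnt FROM tasks GROUP BY user_id
    ),
    avg_count AS (
        SELECT AVG(cnt) AS avg_cnt FROM task_counts
    )
    SELECT u.username
    FROM users u
    JOIN task_counts tc ON u.id = tc.user_id
    CROSS JOIN avg_count ac
    WHERE tc.cnt > ac.avg_cnt
""").fetchall()
print(rows)  # [('alice',)]
```

Each CTE is a named, independently readable step, which is exactly what makes the query easy to verify against your intent.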
Prompting AI for SQL Queries
Here is an effective prompt pattern for generating SQL:
I have a PostgreSQL database with these tables:
- users (id, username, email, created_at)
- tasks (id, user_id, title, description, completed, priority, created_at)
- tags (id, name)
- task_tags (task_id, tag_id)
Write a SQL query that finds the top 5 users by number of completed
high-priority tasks in the last 30 days. Include the user's email
and the count. Use CTEs for readability.
By providing the schema, specifying the database engine, and requesting a style preference, you give the AI enough context to produce accurate, usable SQL.
18.4 SQLAlchemy ORM: Python Database Access
While raw SQL is powerful, writing it directly in Python code is tedious, error-prone, and vulnerable to SQL injection attacks. An Object-Relational Mapper (ORM) lets you interact with your database using Python classes and objects instead of raw SQL strings.
SQLAlchemy is the most popular Python ORM, and it is the standard choice for serious Python applications. It provides two distinct APIs:
- SQLAlchemy Core: A SQL expression language that gives you Pythonic access to SQL without full ORM overhead.
- SQLAlchemy ORM: A full object-relational mapper that maps Python classes to database tables.
We will focus on the ORM, as it is the most common choice for web applications and the approach AI assistants default to.
Defining Models
A SQLAlchemy model is a Python class that represents a database table. Here is a complete example using the modern SQLAlchemy 2.0 style with type annotations:
from datetime import datetime
from sqlalchemy import String, Text, Boolean, ForeignKey, func
from sqlalchemy.orm import (
DeclarativeBase,
Mapped,
mapped_column,
relationship,
)
class Base(DeclarativeBase):
"""Base class for all models."""
pass
class User(Base):
__tablename__ = "users"
id: Mapped[int] = mapped_column(primary_key=True)
username: Mapped[str] = mapped_column(String(50), unique=True)
email: Mapped[str] = mapped_column(String(255), unique=True)
created_at: Mapped[datetime] = mapped_column(
server_default=func.now()
)
# Relationship: one user has many tasks
tasks: Mapped[list["Task"]] = relationship(
back_populates="user", cascade="all, delete-orphan"
)
def __repr__(self) -> str:
return f"<User(id={self.id}, username='{self.username}')>"
class Task(Base):
__tablename__ = "tasks"
id: Mapped[int] = mapped_column(primary_key=True)
user_id: Mapped[int] = mapped_column(ForeignKey("users.id"))
title: Mapped[str] = mapped_column(String(200))
description: Mapped[str | None] = mapped_column(Text, default=None)
completed: Mapped[bool] = mapped_column(Boolean, default=False)
created_at: Mapped[datetime] = mapped_column(
server_default=func.now()
)
# Relationship: each task belongs to one user
user: Mapped["User"] = relationship(back_populates="tasks")
def __repr__(self) -> str:
return f"<Task(id={self.id}, title='{self.title}')>"
Notice several key features:
- Type hints via Mapped[...]: SQLAlchemy 2.0 uses Python type annotations to define column types. Mapped[str] becomes a non-nullable string column; Mapped[str | None] becomes a nullable one.
- mapped_column(): Configures column properties like constraints, defaults, and foreign keys.
- relationship(): Defines how Python objects reference each other. user.tasks returns a list of Task objects; task.user returns the User object.
- cascade="all, delete-orphan": When a user is deleted, all their tasks are deleted too.
Creating the Engine and Session
from sqlalchemy import create_engine
from sqlalchemy.orm import Session
# SQLite for development
engine = create_engine("sqlite:///app.db", echo=True)
# Create all tables
Base.metadata.create_all(engine)
# Create a session for database operations
with Session(engine) as session:
# Create a new user
user = User(username="alice", email="alice@example.com")
session.add(user)
session.commit()
The echo=True parameter logs all generated SQL, which is invaluable for learning and debugging.
Querying with the ORM
SQLAlchemy 2.0 uses a select() function for queries:
from sqlalchemy import select
with Session(engine) as session:
# Get a single user by ID
user = session.get(User, 1)
# Query with filters
stmt = select(User).where(User.username == "alice")
user = session.scalars(stmt).first()
# Query with joins
stmt = (
select(User, func.count(Task.id).label("task_count"))
.join(Task, isouter=True)
.group_by(User.id)
)
results = session.execute(stmt).all()
# Query with ordering and limit
stmt = (
select(Task)
.where(Task.completed == False)
.order_by(Task.created_at.desc())
.limit(10)
)
tasks = session.scalars(stmt).all()
Vibe Coding Insight: When asking AI to write SQLAlchemy code, always specify "SQLAlchemy 2.0 style" in your prompt. The 1.x style (using session.query()) is still common in AI training data and online examples, but the 2.0 style with select() is the modern standard. If the AI generates session.query(User).filter_by(...), that is the legacy API.
Working with Relationships
One of the ORM's greatest strengths is how naturally it handles relationships:
with Session(engine) as session:
# Create user with tasks in one operation
user = User(
username="bob",
email="bob@example.com",
tasks=[
Task(title="Buy groceries"),
Task(title="Write report"),
],
)
session.add(user)
session.commit()
# Access related objects naturally
for task in user.tasks:
print(f"{task.title} - completed: {task.completed}")
# Navigate the reverse relationship
task = session.get(Task, 1)
print(f"Task '{task.title}' belongs to {task.user.username}")
Session Management Patterns
Proper session management is critical. The session is a "unit of work" that tracks changes and writes them to the database on commit:
# Pattern 1: Context manager (recommended)
with Session(engine) as session:
with session.begin():
user = User(username="charlie", email="charlie@example.com")
session.add(user)
# Transaction is committed when the inner 'with' block exits
# Session is closed when the outer 'with' block exits
# Pattern 2: Explicit commit/rollback
session = Session(engine)
try:
user = User(username="diana", email="diana@example.com")
session.add(user)
session.commit()
except Exception:
session.rollback()
raise
finally:
session.close()
Warning
Never leave sessions open longer than necessary. A long-lived session holds database connections and can see stale data. For web applications, the standard pattern is one session per request, as we covered in Chapter 17.
18.5 Database Migrations with Alembic
Your database schema will evolve over time. You will add columns, create new tables, modify constraints, and remove obsolete fields. Migrations are versioned scripts that apply these changes in a controlled, repeatable way.
Alembic is the standard migration tool for SQLAlchemy. It tracks which migrations have been applied and can upgrade or downgrade your schema.
Setting Up Alembic
# Install Alembic
pip install alembic
# Initialize Alembic in your project
alembic init migrations
This creates a migrations/ directory with:
- alembic.ini: Configuration file (database URL, etc.)
- migrations/env.py: Script that runs migrations
- migrations/versions/: Directory where migration scripts live
Configure env.py to use your SQLAlchemy models:
# In migrations/env.py
from app.models import Base
target_metadata = Base.metadata
Creating Migrations
Alembic can autogenerate migrations by comparing your models to the current database schema:
# Generate a migration based on model changes
alembic revision --autogenerate -m "add priority column to tasks"
This produces a migration file like:
"""add priority column to tasks
Revision ID: a1b2c3d4e5f6
Revises: 9z8y7x6w5v4u
Create Date: 2025-03-15 10:30:00.000000
"""
from alembic import op
import sqlalchemy as sa
revision = "a1b2c3d4e5f6"
down_revision = "9z8y7x6w5v4u"
def upgrade() -> None:
op.add_column(
"tasks",
sa.Column("priority", sa.Integer(), nullable=True, server_default="0"),
)
def downgrade() -> None:
op.drop_column("tasks", "priority")
Running Migrations
# Apply all pending migrations
alembic upgrade head
# Upgrade one version at a time
alembic upgrade +1
# Downgrade one version
alembic downgrade -1
# See current version
alembic current
# See migration history
alembic history
Migration Best Practices
Vibe Coding Insight: AI assistants can generate Alembic migrations, but always review them before running. Common AI mistakes include: (1) forgetting to handle existing data when adding NOT NULL columns, (2) generating destructive downgrade functions that drop data, and (3) not including proper server_default values for new columns on tables with existing rows.
Rules to follow:
- Never edit a migration that has been applied to a shared database. Create a new migration instead.
- Always write downgrade functions. You will need them when things go wrong.
- Test migrations on a copy of production data before running them in production.
- Add data migrations when needed. Schema changes sometimes require transforming existing data.
Here is an example of a data migration:
def upgrade() -> None:
# Step 1: Add column as nullable
op.add_column(
"tasks",
sa.Column("priority", sa.Integer(), nullable=True),
)
# Step 2: Populate with default value
op.execute("UPDATE tasks SET priority = 0 WHERE priority IS NULL")
# Step 3: Make column non-nullable
op.alter_column("tasks", "priority", nullable=False)
This three-step pattern -- add nullable, populate, make non-nullable -- is essential when adding required columns to tables that already contain data.
18.6 Schema Design Patterns
This section covers common patterns you will encounter when designing database schemas. Knowing these patterns helps you prompt AI more effectively and evaluate the schemas it generates.
One-to-Many with Self-Reference
An employee can manage other employees. This is a one-to-many relationship where both sides are in the same table:
class Employee(Base):
__tablename__ = "employees"
id: Mapped[int] = mapped_column(primary_key=True)
name: Mapped[str] = mapped_column(String(100))
manager_id: Mapped[int | None] = mapped_column(
ForeignKey("employees.id")
)
manager: Mapped["Employee | None"] = relationship(
remote_side=[id], back_populates="reports"
)
reports: Mapped[list["Employee"]] = relationship(
back_populates="manager"
)
Many-to-Many with Extra Data
When a junction table needs additional columns, use an association object pattern:
class Enrollment(Base):
"""A student's enrollment in a course with grade tracking."""
__tablename__ = "enrollments"
student_id: Mapped[int] = mapped_column(
ForeignKey("students.id"), primary_key=True
)
course_id: Mapped[int] = mapped_column(
ForeignKey("courses.id"), primary_key=True
)
grade: Mapped[str | None] = mapped_column(String(2))
enrolled_at: Mapped[datetime] = mapped_column(
server_default=func.now()
)
student: Mapped["Student"] = relationship(
back_populates="enrollments"
)
course: Mapped["Course"] = relationship(
back_populates="enrollments"
)
This pattern is essential whenever your many-to-many relationship carries data -- an enrollment date, a role in a team, a quantity in an order.
Polymorphic Associations
Sometimes different types of objects share a common structure. For example, both blog posts and images can have comments. SQLAlchemy supports several polymorphic patterns.
Single Table Inheritance stores all types in one table with a discriminator column:
class Content(Base):
__tablename__ = "content"
id: Mapped[int] = mapped_column(primary_key=True)
type: Mapped[str] = mapped_column(String(50))
title: Mapped[str] = mapped_column(String(200))
created_at: Mapped[datetime] = mapped_column(
server_default=func.now()
)
__mapper_args__ = {
"polymorphic_on": "type",
"polymorphic_identity": "content",
}
class BlogPost(Content):
body: Mapped[str | None] = mapped_column(Text, default=None)
__mapper_args__ = {
"polymorphic_identity": "blog_post",
}
class Image(Content):
url: Mapped[str | None] = mapped_column(String(500), default=None)
alt_text: Mapped[str | None] = mapped_column(
String(200), default=None
)
__mapper_args__ = {
"polymorphic_identity": "image",
}
Trade-off: Single table inheritance is simple and fast (no JOINs needed), but it wastes space because columns specific to one type are NULL for other types. Use it when subtypes share most of their columns. Use joined table inheritance when subtypes have many distinct columns.
Soft Deletes
Instead of actually deleting records, mark them as deleted:
class SoftDeleteMixin:
"""Mixin that adds soft delete functionality."""
deleted_at: Mapped[datetime | None] = mapped_column(default=None)
@property
def is_deleted(self) -> bool:
return self.deleted_at is not None
class User(SoftDeleteMixin, Base):
__tablename__ = "users"
id: Mapped[int] = mapped_column(primary_key=True)
username: Mapped[str] = mapped_column(String(50))
# ... other columns
Soft deletes preserve data for auditing and allow recovery, but they require every query to include a WHERE deleted_at IS NULL filter. Many teams add a default query filter to handle this automatically.
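One plain-SQL way to get that automatic filter (an illustration, not the only approach; ORMs also offer query-level filters) is a database view that bakes the condition in. A sqlite3 sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, deleted_at TEXT)")
# bob has been soft-deleted; alice is still active
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, "alice", None), (2, "bob", "2025-03-01 12:00:00")])

# The view bakes in the filter, so ordinary reads cannot forget it
conn.execute("CREATE VIEW active_users AS "
             "SELECT * FROM users WHERE deleted_at IS NULL")

active = conn.execute("SELECT username FROM active_users").fetchall()
print(active)  # [('alice',)]

# The raw table still has everything, for audits and undeletes
total = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(total)  # 2
```

Application queries read from active_users; only audit and recovery code touches the underlying table directly.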
Timestamps Mixin
Almost every table needs created and updated timestamps:
class TimestampMixin:
"""Mixin that adds created_at and updated_at columns."""
created_at: Mapped[datetime] = mapped_column(
server_default=func.now()
)
updated_at: Mapped[datetime] = mapped_column(
server_default=func.now(),
onupdate=func.now(),
)
Practical Tip: Define mixins early in your project. When prompting AI, say: "All models should inherit from TimestampMixin for automatic timestamp tracking." This ensures consistency across your entire schema.
18.7 Indexing and Query Optimization
A database with millions of rows but no indexes is like a library with no catalog -- finding anything requires scanning every shelf. Indexes are data structures that make lookups fast.
How Indexes Work
An index is a sorted data structure (typically a B-tree) that maps column values to row locations. When you create an index on users.email, the database maintains a sorted list of all email values with pointers to their rows. Looking up a user by email goes from scanning millions of rows (O(n)) to a tree lookup (O(log n)).
When to Add Indexes
Create indexes on columns that are:
- Used in WHERE clauses:
SELECT * FROM users WHERE email = 'alice@example.com' - Used in JOIN conditions:
ON users.id = tasks.user_id - Used in ORDER BY:
ORDER BY created_at DESC - Used in GROUP BY:
GROUP BY category_id
Primary keys and columns with UNIQUE constraints are automatically indexed.
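You can watch an index change the query plan with SQLite's EXPLAIN QUERY PLAN (its lightweight analogue of EXPLAIN). A sketch, noting that the exact wording of the plan text varies between SQLite versions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")

# Before the index: the planner must scan the whole table
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = 'alice@example.com'"
).fetchall()
print(plan_before[0][-1])  # e.g. 'SCAN users'

conn.execute("CREATE INDEX ix_users_email ON users (email)")

# After the index: the same query becomes an index lookup
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = 'alice@example.com'"
).fetchall()
print(plan_after[0][-1])  # e.g. 'SEARCH users USING INDEX ix_users_email (email=?)'
```

The shift from SCAN to SEARCH ... USING INDEX is the O(n) to O(log n) improvement described above, made visible.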
Creating Indexes in SQLAlchemy
from sqlalchemy import Index
class Task(Base):
__tablename__ = "tasks"
id: Mapped[int] = mapped_column(primary_key=True)
user_id: Mapped[int] = mapped_column(
ForeignKey("users.id"), index=True # Simple index
)
title: Mapped[str] = mapped_column(String(200))
completed: Mapped[bool] = mapped_column(default=False)
priority: Mapped[int] = mapped_column(default=0)
created_at: Mapped[datetime] = mapped_column(
server_default=func.now()
)
# Composite index for common query pattern
__table_args__ = (
Index("ix_tasks_user_completed", "user_id", "completed"),
Index(
"ix_tasks_priority_created",
"priority",
"created_at",
),
)
The Cost of Indexes
Indexes are not free:
- Write overhead: Every INSERT, UPDATE, or DELETE must also update all relevant indexes.
- Storage space: Indexes consume disk space proportional to the indexed data.
- Maintenance: Indexes can become fragmented over time and need periodic maintenance.
Rule of Thumb: Index columns you query on frequently. Do not index every column -- that slows writes and wastes storage. For a typical web application, start with indexes on foreign keys and commonly filtered columns, then add more based on actual query performance data.
Using EXPLAIN to Analyze Queries
The EXPLAIN command shows how the database plans to execute a query:
EXPLAIN ANALYZE
SELECT u.username, COUNT(t.id)
FROM users u
LEFT JOIN tasks t ON u.id = t.user_id
WHERE t.completed = true
GROUP BY u.username;
The output reveals whether the database is using indexes (good) or doing full table scans (potentially bad). Key things to look for:
- Seq Scan: A sequential scan (reading every row). Fine for small tables, problematic for large ones.
- Index Scan: Using an index. This is what you want for selective queries.
- Sort: The database is sorting in memory. Consider an index if this is slow.
- Hash Join / Nested Loop: Join strategies. Hash joins are generally faster for large data sets.
Vibe Coding Insight: You can paste EXPLAIN output directly into your AI prompt and ask: "How can I optimize this query? What indexes would help?" The AI is remarkably good at reading execution plans and suggesting improvements.
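You can experiment with execution plans locally using SQLite's EXPLAIN QUERY PLAN, which is terser than PostgreSQL's output but shows the same scan-versus-index distinction. A self-contained sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(1000)],
)

query = "SELECT * FROM users WHERE email = ?"

# Before indexing: the plan's detail column reports a full table scan.
before = conn.execute(
    "EXPLAIN QUERY PLAN " + query, ("user500@example.com",)
).fetchone()[3]

conn.execute("CREATE INDEX ix_users_email ON users (email)")

# After indexing: the plan reports a search using ix_users_email.
after = conn.execute(
    "EXPLAIN QUERY PLAN " + query, ("user500@example.com",)
).fetchone()[3]
```

Printing `before` and `after` shows something like "SCAN users" versus "SEARCH users USING INDEX ix_users_email (email=?)" -- the SQLite equivalents of Seq Scan and Index Scan.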
Common Optimization Patterns
Select only the columns you need:
# Bad: fetches all columns
stmt = select(User).where(User.id == 1)
# Better: fetches only needed columns
stmt = select(User.username, User.email).where(User.id == 1)
Avoid N+1 queries:
# Bad: N+1 queries (1 for users, N for each user's tasks)
users = session.scalars(select(User)).all()
for user in users:
print(len(user.tasks)) # Each access triggers a query
# Good: eager loading
from sqlalchemy.orm import selectinload
stmt = select(User).options(selectinload(User.tasks))
users = session.scalars(stmt).all()
for user in users:
print(len(user.tasks)) # No additional queries
The N+1 problem is the single most common performance issue in ORM-based applications. When you ask an AI to write code that iterates over related objects, always check whether it includes eager loading.
18.8 NoSQL Options: When and Why
Relational databases are the right choice for most applications, but they are not the only choice. This section covers when and why you might reach for a NoSQL database.
Document Databases: MongoDB
MongoDB stores data as flexible JSON-like documents (BSON). Each document can have a different structure, making it appealing for:
- Rapidly evolving schemas: When your data model changes frequently during early development.
- Hierarchical data: When data is naturally nested (e.g., a blog post with embedded comments).
- Content management: Where each piece of content might have different fields.
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017")
db = client.myapp
# Insert a document -- no predefined schema needed
db.users.insert_one({
"username": "alice",
"email": "alice@example.com",
"preferences": {
"theme": "dark",
"language": "en",
"notifications": {"email": True, "sms": False},
},
})
# Query with flexible filters
user = db.users.find_one({"preferences.theme": "dark"})
When NOT to use MongoDB:
- When your data has complex relationships (use a relational database)
- When you need ACID transactions across multiple documents (MongoDB has limited transaction support)
- When you need complex JOINs and aggregations (SQL is far more capable)
Honest Assessment: MongoDB became popular partly due to marketing and the appeal of "schemaless" development. In practice, most applications benefit from the structure and guarantees of a relational database. Use MongoDB when you genuinely need document flexibility, not because it seems simpler.
Key-Value Stores: Redis
Redis is an in-memory data store that excels at:
- Caching: Store frequently accessed data in memory for microsecond access times.
- Session storage: Web session data that needs to be fast but can tolerate loss.
- Rate limiting: Count API requests per user with automatic expiration.
- Queues: Simple job queues and pub/sub messaging.
import redis
from datetime import datetime
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
# Simple key-value
r.set("user:1:name", "alice")
name = r.get("user:1:name")
# Caching with expiration (60 seconds)
r.setex("api:weather:nyc", 60, '{"temp": 72, "condition": "sunny"}')
# Rate limiting
pipe = r.pipeline()
key = f"rate_limit:user:1:{datetime.now().minute}"
pipe.incr(key)
pipe.expire(key, 60)
count, _ = pipe.execute()
if count > 100:
raise RateLimitExceeded()
Choosing the Right Database
| Requirement | Best Choice |
|---|---|
| Structured data with relationships | PostgreSQL/MySQL |
| Rapid prototyping, file-based | SQLite |
| Flexible documents, nested data | MongoDB |
| Caching, sessions, rate limiting | Redis |
| Time-series data | TimescaleDB |
| Full-text search | Elasticsearch |
| Graph relationships | Neo4j |
Practical Tip: For most vibe coding projects, start with PostgreSQL (or SQLite for local development). Add Redis for caching if and when performance requires it. You almost certainly do not need MongoDB unless you have a specific use case that demands document flexibility.
18.9 Data Validation and Integrity
Good data design enforces validity at multiple levels. This section covers the layers of defense against bad data.
Layer 1: Database Constraints
The database itself should be your last line of defense. Constraints in the schema prevent invalid data even if application code has bugs:
class User(Base):
__tablename__ = "users"
id: Mapped[int] = mapped_column(primary_key=True)
username: Mapped[str] = mapped_column(
String(50), unique=True, nullable=False
)
email: Mapped[str] = mapped_column(
String(255), unique=True, nullable=False
)
age: Mapped[int | None] = mapped_column(
CheckConstraint("age >= 0 AND age <= 150", name="valid_age")
)
__table_args__ = (
CheckConstraint(
"length(username) >= 3",
name="username_min_length",
),
)
Common database constraints:
- NOT NULL: Column must have a value
- UNIQUE: No two rows can have the same value in this column
- CHECK: Arbitrary boolean expression that must be true
- FOREIGN KEY: Value must exist in the referenced table
- DEFAULT: Value used when none is specified
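You can watch the database reject bad data firsthand. A sketch using the standard-library sqlite3 module with the same username and age rules as the model above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id       INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE CHECK (length(username) >= 3),
        age      INTEGER CHECK (age >= 0 AND age <= 150)
    )
""")

# Valid row: accepted.
conn.execute("INSERT INTO users (username, age) VALUES (?, ?)", ("alice", 30))

# Too-short username: the CHECK constraint rejects it even though
# no application-level validation ran first.
rejected = False
try:
    conn.execute("INSERT INTO users (username, age) VALUES (?, ?)", ("al", 30))
except sqlite3.IntegrityError:
    rejected = True
```

This is exactly the "last line of defense" at work: even a buggy code path that skips Pydantic cannot persist an invalid row.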
Layer 2: Application-Level Validation with Pydantic
Pydantic validates data before it reaches the database. This catches errors early and provides clear error messages:
from pydantic import BaseModel, EmailStr, Field, field_validator
class UserCreate(BaseModel):
"""Schema for creating a new user."""
username: str = Field(
min_length=3, max_length=50, pattern=r"^[a-zA-Z0-9_]+$"
)
email: EmailStr
age: int | None = Field(default=None, ge=0, le=150)
@field_validator("username")
@classmethod
def username_not_reserved(cls, v: str) -> str:
reserved = {"admin", "root", "system", "null"}
if v.lower() in reserved:
raise ValueError(f"Username '{v}' is reserved")
return v.lower()
class UserUpdate(BaseModel):
"""Schema for updating user fields (all optional)."""
username: str | None = Field(
default=None, min_length=3, max_length=50
)
email: EmailStr | None = None
age: int | None = Field(default=None, ge=0, le=150)
This validation layer integrates seamlessly with the REST APIs from Chapter 17. A FastAPI endpoint using these schemas automatically validates request data before your database code ever sees it.
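If you want to unit-test the reserved-name rule without pulling in a web framework, note that the validator's core is plain Python. A minimal stand-alone sketch (the function name is hypothetical, not a Pydantic API):

```python
RESERVED_USERNAMES = {"admin", "root", "system", "null"}

def validate_username(v: str) -> str:
    """Mirror of the field_validator above, usable outside Pydantic."""
    if not (3 <= len(v) <= 50):
        raise ValueError("username must be 3-50 characters")
    if v.lower() in RESERVED_USERNAMES:
        raise ValueError(f"Username '{v}' is reserved")
    return v.lower()
```

Keeping rules like this in ordinary functions makes them reusable from scripts and background jobs that never construct a Pydantic model.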
Layer 3: Business Logic Validation
Some validation rules require database queries -- for example, checking that a username is not already taken:
def create_user(session: Session, data: UserCreate) -> User:
"""Create a new user with full validation."""
# Check uniqueness (business logic requiring DB access)
existing = session.scalars(
select(User).where(
(User.username == data.username)
| (User.email == data.email)
)
).first()
if existing:
if existing.username == data.username:
raise ValueError(f"Username '{data.username}' is taken")
raise ValueError(f"Email '{data.email}' is already registered")
user = User(
username=data.username,
email=data.email,
age=data.age,
)
session.add(user)
session.flush() # Get the generated ID without committing
return user
The Defense-in-Depth Principle
Request Data
│
▼
[Pydantic Validation] ← Catches type errors, format issues
│
▼
[Business Logic] ← Checks uniqueness, permissions, rules
│
▼
[Database Constraints] ← Final safety net catches anything missed
Vibe Coding Insight: When asking AI to build a data layer, explicitly request all three validation layers: "Validate input with Pydantic, check business rules in the service layer, and enforce constraints at the database level." Without this guidance, AI often implements only one layer, leaving your data vulnerable.
18.10 Building a Data Layer with AI
This final section ties everything together into a practical workflow for building a complete data layer using AI assistance.
Step 1: Describe Your Domain
Start by writing a clear description of your domain entities and relationships. This becomes your primary prompt:
I am building a project management application. The entities are:
- Users: have a username, email, and role (admin, manager, member)
- Projects: have a name, description, and owner (a user)
- Tasks: belong to a project and are assigned to a user. They have
a title, description, status (todo, in_progress, done), priority
(1-5), and due date.
- Comments: belong to a task, written by a user, contain text and
a timestamp.
Relationships:
- A user can own multiple projects
- A project has multiple tasks
- A task is assigned to one user
- A task can have multiple comments
- A user can be a member of multiple projects (many-to-many)
Generate SQLAlchemy 2.0 models with:
- Type hints using Mapped[]
- Proper relationships with back_populates
- A TimestampMixin for created_at/updated_at
- Appropriate indexes
- __repr__ methods
Step 2: Review and Refine the Schema
When the AI generates models, check for:
- Missing indexes on foreign keys: AI sometimes forgets index=True on foreign key columns.
- Cascade behavior: Are orphaned records handled correctly when parents are deleted?
- Nullable fields: Should that field really be optional, or should it be required?
- String lengths: Are VARCHAR lengths appropriate for the data?
- Missing constraints: Are there CHECK constraints for enum-like fields (e.g., status values)?
Step 3: Generate Pydantic Schemas
Prompt the AI to create matching Pydantic models for input validation:
Based on the SQLAlchemy models above, generate Pydantic schemas for:
- Creating each entity (required fields only)
- Updating each entity (all fields optional)
- Reading each entity (includes id and timestamps)
Use Pydantic v2 syntax with Field validators.
Step 4: Build Repository Functions
The repository pattern encapsulates database access behind a clean interface:
class TaskRepository:
"""Handles all database operations for tasks."""
def __init__(self, session: Session) -> None:
self.session = session
def get_by_id(self, task_id: int) -> Task | None:
return self.session.get(Task, task_id)
def list_for_project(
self,
project_id: int,
status: str | None = None,
limit: int = 50,
offset: int = 0,
) -> list[Task]:
stmt = (
select(Task)
.where(Task.project_id == project_id)
.order_by(Task.priority.desc(), Task.created_at)
.limit(limit)
.offset(offset)
)
if status:
stmt = stmt.where(Task.status == status)
return list(self.session.scalars(stmt).all())
def create(self, data: TaskCreate) -> Task:
task = Task(**data.model_dump())
self.session.add(task)
self.session.flush()
return task
def update(self, task: Task, data: TaskUpdate) -> Task:
for field, value in data.model_dump(exclude_unset=True).items():
setattr(task, field, value)
self.session.flush()
return task
def delete(self, task: Task) -> None:
self.session.delete(task)
self.session.flush()
Step 5: Set Up Migrations
With models defined, initialize Alembic and create your initial migration:
alembic init migrations
# Configure env.py to point to your models
alembic revision --autogenerate -m "initial schema"
alembic upgrade head
Step 6: Integration with Your API
Connect the data layer to your FastAPI application from Chapter 17:
from fastapi import FastAPI, Depends, HTTPException
from sqlalchemy.orm import Session
app = FastAPI()
def get_session():
"""Dependency that provides a database session."""
with Session(engine) as session:
with session.begin():
yield session
@app.post("/tasks", response_model=TaskRead)
def create_task(
data: TaskCreate,
session: Session = Depends(get_session),
):
repo = TaskRepository(session)
return repo.create(data)
@app.get("/projects/{project_id}/tasks", response_model=list[TaskRead])
def list_project_tasks(
project_id: int,
status: str | None = None,
session: Session = Depends(get_session),
):
repo = TaskRepository(session)
return repo.list_for_project(project_id, status=status)
Vibe Coding Insight: This layered architecture -- Pydantic schemas for validation, repository classes for data access, FastAPI endpoints for HTTP handling -- is the standard pattern for Python web applications. When you ask an AI to build a feature, frame your request in terms of these layers: "Add a Pydantic schema for task filtering, a repository method for searching tasks, and a FastAPI endpoint that ties them together."
The Complete Workflow
Here is the full development cycle for building a data layer with AI:
- Design: Describe your domain entities and relationships to the AI
- Model: Have AI generate SQLAlchemy models; review for correctness
- Validate: Generate Pydantic schemas for input/output validation
- Access: Build repository classes for clean database access
- Migrate: Use Alembic to version your schema changes
- Integrate: Connect to your API layer with dependency injection
- Optimize: Add indexes based on actual query patterns; use EXPLAIN to verify
- Test: Write tests using an in-memory SQLite database for speed
Each of these steps is a natural prompting opportunity. Rather than asking for everything at once, work through them methodically. A focused prompt ("generate a repository class for tasks with methods for CRUD, filtering by status, and pagination") produces better results than a vague one ("build my data layer").
Chapter Summary
This chapter covered the full spectrum of database design and data modeling for vibe coders. You learned relational fundamentals -- tables, relationships, keys, and normalization -- that give you the vocabulary to communicate with AI about data design. You wrote SQL queries and understood JOINs, aggregations, and CTEs well enough to verify AI-generated SQL. You built Python data layers with SQLAlchemy 2.0, managed schema evolution with Alembic, and learned when to consider NoSQL alternatives.
Most importantly, you learned the layered approach to data integrity: Pydantic for input validation, business logic for domain rules, and database constraints as the final safety net. This defense-in-depth strategy protects your data even when individual layers have bugs.
The patterns in this chapter -- the repository pattern, the timestamp mixin, soft deletes, eager loading to avoid N+1 queries -- are the building blocks of professional data layers. In Chapter 19, you will combine the API skills from Chapter 17 with the database skills from this chapter to build full-stack applications with persistent data, authentication, and real-world deployment considerations.
Key Reminder: Database design decisions are among the hardest to change later. A poorly designed schema embedded in a running application with real user data is extremely expensive to fix. Invest the time to get your data model right early, even if it means spending an extra hour reviewing AI-generated schemas before writing a single line of application code.
Next chapter: Chapter 19 -- Full-Stack Development, where you combine frontend, backend, and database into complete applications.