Case Study 01: Architecting a Multi-Tenant SaaS Platform
Overview
This case study follows a development team as they use AI-assisted design conversations to architect a multi-tenant Software-as-a-Service (SaaS) platform for project management. The team consists of four developers with experience building web applications but limited experience with multi-tenancy, data isolation, and SaaS-specific architectural concerns.
The case study demonstrates how to use the RADIO framework, iterate through design proposals with an AI assistant, evaluate trade-offs, and produce Architecture Decision Records for the key choices.
The Business Context
Product: TaskFlow -- a project management platform for small-to-medium businesses. Customers (tenants) sign up, invite their team members, and use the platform to manage projects, tasks, deadlines, and collaboration.
Business requirements:
- Support 500 organizations (tenants) in Year 1, growing to 5,000 in Year 3
- Each tenant has 10-200 users
- Tenants must not see or access each other's data under any circumstances
- Three pricing tiers: Free (up to 5 users), Pro (up to 50 users), Enterprise (unlimited users with SSO and audit logs)
- Self-service signup and billing via Stripe
- 99.9% uptime SLA for Pro and Enterprise tiers
Team:
- 4 full-stack developers (2 senior, 2 mid-level)
- 1 designer (part-time)
- Budget: $5,000/month for infrastructure in Year 1
Phase 1: The Initial Design Conversation
The tech lead opens a conversation with an AI assistant using the RADIO framework:
I'm designing a multi-tenant SaaS platform called TaskFlow. Help me
think through the architecture.
Requirements:
- Multi-tenant project management (projects, tasks, milestones, comments)
- Three pricing tiers with different feature sets and user limits
- Real-time updates when team members change tasks
- File attachments (up to 50MB per file)
- REST API for third-party integrations
- Webhook support for Enterprise customers
- Full-text search across tasks and comments
Actors:
- Organization admins (manage billing, users, settings)
- Project managers (create projects, assign tasks, view reports)
- Team members (create/update tasks, comment, attach files)
- External systems (via API and webhooks)
Data:
- Each tenant's data must be completely isolated
- Expected data volume: ~1GB per tenant per year (mostly file attachments)
- 500 tenants in Year 1, 5,000 in Year 3
- Task updates happen ~50 times per minute per active tenant
Interfaces:
- Web application (React frontend)
- REST API with API key authentication
- Stripe integration for billing
- Email notifications (SendGrid)
- File storage (S3 or equivalent)
Operational constraints:
- 4-person development team
- $5,000/month infrastructure budget
- Deploying to AWS
- Team has experience with Python (Django/FastAPI), PostgreSQL, and React
- 6 months to MVP, 12 months to general availability
What architecture would you recommend for the multi-tenancy model, and
how should we structure the overall system?
The AI's Initial Proposal
The AI proposes a single-database, shared-schema multi-tenancy model with a well-structured monolithic backend:
Multi-tenancy model: All tenants share a single PostgreSQL database. Every table includes a tenant_id column. Row-level security (RLS) in PostgreSQL enforces data isolation at the database level, preventing tenants from accessing each other's data even if application code contains a bug.
Application structure:
taskflow/
api/ # REST API layer (FastAPI)
routes/
middleware/
auth/
core/ # Business logic
projects/
tasks/
comments/
files/
search/
tenancy/ # Multi-tenancy infrastructure
models.py # Tenant, Subscription, Plan
middleware.py # Tenant context extraction
rls.py # Row-level security setup
billing/ # Stripe integration
notifications/ # Email and real-time notifications
integrations/ # Webhooks and API key management
data/ # Database models and repositories
models/
repositories/
migrations/
Key infrastructure:
- FastAPI application deployed on AWS ECS (Elastic Container Service)
- PostgreSQL on AWS RDS with Row-Level Security
- Redis for caching, session storage, and real-time pub/sub
- S3 for file storage with tenant-prefixed paths
- Elasticsearch for full-text search
- Celery with Redis for background tasks (emails, webhooks, search indexing)
Phase 2: Stress Testing the Proposal
The team challenges the proposal with follow-up questions:
Challenge 1: Data Isolation Risks
How confident can we be that Row-Level Security will prevent all data
leakage between tenants? What are the failure modes?
The AI explains that PostgreSQL RLS, when properly configured, enforces isolation at the database engine level. However, there are risks:
- Superuser bypass: RLS policies do not apply to database superusers. Migration scripts and admin tools must use restricted roles.
- Application-layer leaks: If any query bypasses the RLS context (e.g., a raw SQL query that forgets to set the tenant context), data could leak. The mitigation is to set the tenant context in database middleware that runs before every request.
- Shared resources: Some data is inherently cross-tenant (e.g., plan definitions, feature flags). These tables must be clearly separated from tenant-scoped tables.
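The superuser caveat can be addressed directly in the role setup. A sketch of the migration-time SQL, with illustrative role and table names:

```sql
-- Application role that cannot bypass RLS; migrations run as a
-- separate owner role, never as a superuser.
CREATE ROLE taskflow_app LOGIN NOBYPASSRLS;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO taskflow_app;

-- Table owners are exempt from RLS by default; FORCE makes even the
-- owning role subject to the tenant policies.
ALTER TABLE tasks FORCE ROW LEVEL SECURITY;
```

Running the application under a `NOBYPASSRLS` role means a compromised or buggy application query still cannot read another tenant's rows.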
Challenge 2: Noisy Neighbor Problem
What happens if one Enterprise tenant with 200 users is generating
massive amounts of activity while Free tier tenants are trying to use the
system? How do we prevent one tenant from degrading the experience
for others?
The AI proposes several mitigations:
- Rate limiting per tenant at the API gateway level
- Connection pooling per tier (Enterprise tenants get a larger share of database connections)
- Query timeout enforcement to prevent long-running queries from blocking the connection pool
- Background task prioritization (Enterprise webhooks and notifications processed before Free tier)
- Monitoring and alerting on per-tenant resource consumption
Challenge 3: Search Scalability
Elasticsearch is a significant operational burden. Is it necessary for MVP?
What would a simpler alternative look like?
The AI acknowledges that Elasticsearch is powerful but operationally heavy for a 4-person team. For MVP, it suggests:
- Phase 1 (MVP): Use PostgreSQL full-text search with tsvector and GIN indexes. This handles moderate search volumes well and requires no additional infrastructure.
- Phase 2 (Post-launch): If search performance becomes a bottleneck, migrate to Elasticsearch or a managed service like AWS OpenSearch. The abstraction layer in the search/ module means the rest of the application should not need to change.
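The Phase 1 approach can be sketched as a migration; the column and configuration names here are assumptions:

```sql
-- Maintain the search vector automatically (PostgreSQL 12+ generated column)
ALTER TABLE tasks ADD COLUMN search_vector tsvector
    GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(description, ''))
    ) STORED;

CREATE INDEX tasks_search_idx ON tasks USING GIN (search_vector);

-- Example query; RLS still scopes results to the current tenant
SELECT id, title
FROM tasks
WHERE search_vector @@ websearch_to_tsquery('english', 'deployment bug');
```

Because the tsvector is a generated column, there is no indexing pipeline to operate, which is exactly the operational saving the AI is pointing at.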
Phase 3: ADR for the Multi-Tenancy Decision
Based on the design conversation, the team writes their first ADR:
# ADR-001: Shared Database with Row-Level Security for Multi-Tenancy
## Status
Accepted
## Context
TaskFlow is a multi-tenant SaaS platform that must support 500 tenants
in Year 1 and 5,000 in Year 3. Tenant data isolation is a critical
security requirement. Our team has 4 developers and a $5,000/month
infrastructure budget.
## Decision
We will use a single PostgreSQL database with a shared schema. Every
tenant-scoped table will include a `tenant_id` column. PostgreSQL
Row-Level Security (RLS) will enforce data isolation at the database
engine level. The tenant context will be set in middleware that runs
before every database query.
## Alternatives Considered
### Database per tenant
- Pros: Complete physical isolation; easy to back up and restore
individual tenants; no risk of cross-tenant data leakage
- Cons: Unmanageable at 5,000 tenants (5,000 databases); complex
migration management; expensive infrastructure; connection pooling
across thousands of databases is challenging
- Rejected because: Does not scale to our projected tenant count
within budget
### Schema per tenant
- Pros: Good isolation within a single database; simpler than database
per tenant
- Cons: Schema migrations must run on every schema; hundreds of schemas
impact database performance; PostgreSQL has practical limits on
schema count
- Rejected because: Schema proliferation creates operational complexity
beyond our team's capacity
### Application-level isolation only (no RLS)
- Pros: Simpler to implement initially
- Cons: A single bug in any query could leak data between tenants;
security depends entirely on application code correctness
- Rejected because: The risk of data leakage is too high for a SaaS
product; RLS provides defense-in-depth
## Consequences
- All queries are automatically scoped to the current tenant by RLS
- We must configure RLS policies for every tenant-scoped table
- Migration scripts must use restricted database roles (not superuser)
- We need middleware to set the tenant context on every request
- Cross-tenant queries (for admin dashboards, billing) require a
separate database role with RLS bypass
- Performance testing must validate RLS overhead at scale
## Decision Date
2025-04-01
Phase 4: Detailed Design of the Tenancy Module
The team asks the AI to help design the tenancy middleware and data model:
# Simplified version of the tenant context middleware
from contextvars import ContextVar
from fastapi import Request
from starlette.middleware.base import BaseHTTPMiddleware
# Context-local storage for the current tenant (isolated per thread
# and per async task, so concurrent requests cannot see each other's value)
current_tenant_id: ContextVar[str] = ContextVar("current_tenant_id")
class TenantMiddleware(BaseHTTPMiddleware):
"""Extracts tenant ID from the request and sets it in context."""
async def dispatch(self, request: Request, call_next):
# Extract tenant from JWT token or API key
tenant_id = self.extract_tenant_id(request)
token = current_tenant_id.set(tenant_id)
try:
# Set PostgreSQL RLS context
await self.set_rls_context(tenant_id)
response = await call_next(request)
return response
finally:
current_tenant_id.reset(token)
The AI also helps design the database session management that sets the RLS context:
async def set_rls_context(self, tenant_id: str) -> None:
    """Set the current tenant in the database session for RLS.

    Note: the setting is per connection, so this must run on the same
    session/connection that serves the request's queries. SET does not
    accept bind parameters, so set_config() is used instead.
    """
    async with get_db_session() as session:
        await session.execute(
            text("SELECT set_config('app.current_tenant_id', :tenant_id, false)"),
            {"tenant_id": tenant_id},
        )
And the RLS policy creation in migrations:
-- Enable RLS on the tasks table
ALTER TABLE tasks ENABLE ROW LEVEL SECURITY;
-- Create policy that restricts rows to the current tenant
CREATE POLICY tenant_isolation ON tasks
USING (tenant_id = current_setting('app.current_tenant_id')::uuid);
Phase 5: Real-Time Architecture Decision
The team needs real-time updates when tasks change. The AI proposes a design using Redis Pub/Sub:
Tenant A, User 1 updates Task 42
--> API saves to PostgreSQL
--> API publishes to Redis channel "tenant:{tenant_id}:tasks"
--> WebSocket connections for Tenant A receive the update
--> Users 2, 3, 4 (same tenant) see the change in real time
--> Tenant B users receive NOTHING (channel isolation)
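The channel-per-tenant scheme can be illustrated with an in-memory stand-in for Redis Pub/Sub (the broker class and channel helper are hypothetical; production code would use a Redis client's publish/subscribe):

```python
from collections import defaultdict

class InMemoryPubSub:
    """In-memory stand-in for Redis Pub/Sub, to show channel isolation."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # channel -> list of inboxes

    def subscribe(self, channel: str) -> list:
        inbox = []
        self.subscribers[channel].append(inbox)
        return inbox

    def publish(self, channel: str, message: dict) -> None:
        for inbox in self.subscribers[channel]:
            inbox.append(message)

def task_channel(tenant_id: str) -> str:
    # Same naming scheme as the design: "tenant:{tenant_id}:tasks"
    return f"tenant:{tenant_id}:tasks"

broker = InMemoryPubSub()
tenant_a_inbox = broker.subscribe(task_channel("tenant-a"))
tenant_b_inbox = broker.subscribe(task_channel("tenant-b"))

broker.publish(task_channel("tenant-a"), {"task_id": 42, "status": "done"})
# Tenant A's subscribers receive the update; Tenant B's channel stays empty.
```

Isolation here rests entirely on the channel name including the tenant ID, which is why the WebSocket layer must derive the channel from the authenticated tenant, never from client input.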
The team writes ADR-002 documenting the choice of Redis Pub/Sub over alternatives (WebSocket-only fan-out, third-party services like Pusher/Ably, and polling).
Phase 6: File Storage Design
For file attachments, the AI recommends S3 with tenant-scoped paths:
s3://taskflow-files/{tenant_id}/{project_id}/{file_id}_{filename}
Key design decisions:
- Pre-signed URLs: Files are uploaded directly from the browser to S3 using pre-signed URLs, avoiding the application server as a bottleneck
- Tenant isolation: S3 bucket policies prevent cross-tenant access
- Size limits: Enforced at the API level (50MB per file) and at the S3 presigned URL level
- Virus scanning: Files are scanned asynchronously via a Lambda function triggered by S3 upload events
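The tenant-scoped key scheme and the API-level size check can be sketched as follows. The helper names are assumptions; the actual browser upload would go through an S3 pre-signed request that repeats the size limit (e.g. as a content-length-range condition):

```python
MAX_ATTACHMENT_BYTES = 50 * 1024 * 1024  # 50 MB, per the requirements

def attachment_key(tenant_id: str, project_id: str,
                   file_id: str, filename: str) -> str:
    """Build the tenant-scoped S3 object key for an attachment."""
    # Keep user-supplied filenames from injecting extra path segments
    safe_name = filename.replace("/", "_")
    return f"{tenant_id}/{project_id}/{file_id}_{safe_name}"

def upload_size_ok(size_bytes: int) -> bool:
    """API-level size check, mirrored in the pre-signed upload conditions."""
    return 0 < size_bytes <= MAX_ATTACHMENT_BYTES
```

Prefixing every key with the tenant ID is what lets bucket policies and IAM conditions express "this tenant may only touch objects under its own prefix".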
Outcomes and Lessons Learned
After building and launching TaskFlow, the team reflected on the architecture process:
What worked well:
- The RADIO framework ensured the AI had enough context to make useful recommendations
- Iterative stress-testing caught the Elasticsearch complexity issue early, saving weeks of work
- ADRs helped onboard a new developer in Month 4 -- they could read the decision log and understand why the system was designed the way it was
- PostgreSQL RLS provided strong tenant isolation with minimal performance overhead (measured at under 3% query time increase)
What they would do differently:
- They underestimated the complexity of real-time updates. Redis Pub/Sub worked for MVP but needed to be replaced with a more robust solution (Redis Streams) when connection counts grew
- They should have defined the billing module's interface more carefully upfront. The tight coupling between billing and tenancy caused several integration headaches
- More ADRs would have been useful -- they only wrote 5 for the MVP but retrospectively identified 12 decisions that deserved documentation
Key takeaway: The architecture conversation with AI was not a one-time event. The team continued to use AI for design discussions throughout development, treating it as a senior architect who was always available, infinitely patient, and able to recall patterns from thousands of systems.