Case Study 2: Cloud-Native Prediction Market: Scaling from 100 to 100,000 Users

Overview

This case study follows MarketPulse, a startup prediction market platform, through three distinct scaling phases over 18 months. Starting as a monolithic Django application handling 100 beta users, MarketPulse grows to serve 100,000 active users across 500 markets. Each phase introduces new architectural patterns, infrastructure decisions, and operational challenges drawn from the scaling principles of Chapter 33.

Phase 1: Monolith (Months 1--6, 100--1,000 Users)

Starting Architecture

MarketPulse launches with the simplest possible architecture:

  • Single Django application running on a $40/month cloud VM (4 vCPUs, 8 GB RAM)
  • PostgreSQL 14 on the same VM
  • Nginx as a reverse proxy and static file server
  • Celery + Redis for background tasks (market resolution, email notifications)
  • SQLite for the matching engine's order book (simple enough at this scale)

Total monthly infrastructure cost: $65

What Works at This Scale

At 100--1,000 users, the monolith handles everything adequately:

Metric                      | Target SLO | Actual
Order latency (p50)         | < 50 ms    | 12 ms
Order latency (p99)         | < 200 ms   | 85 ms
Availability                | 99.5%      | 99.8%
Concurrent users supported  | 100        | ~500

The matching engine processes orders sequentially, which is trivially correct. The database handles 50--200 queries per second without breaking a sweat. Redis caches the 20 most active market prices.
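
To make the "trivially correct" claim concrete, the sketch below shows the shape of Phase 1 order handling: a single loop drains one queue, so only one order ever touches a book at a time and no locking is needed. The Order class and the matching logic are illustrative simplifications, not MarketPulse's actual engine.

    import queue
    from dataclasses import dataclass

    @dataclass
    class Order:
        market_id: str
        side: str        # "buy" or "sell"
        price: float     # contract price in [0.0, 1.0]
        quantity: int

    order_queue: queue.Queue = queue.Queue()   # the API layer enqueues orders here
    books: dict = {}                           # market_id -> list of resting Orders

    def crosses(incoming: Order, resting: Order) -> bool:
        if incoming.side == resting.side:
            return False
        buy, sell = (incoming, resting) if incoming.side == "buy" else (resting, incoming)
        return buy.price >= sell.price

    def matching_loop() -> None:
        while True:
            order = order_queue.get()               # blocks until an order arrives
            book = books.setdefault(order.market_id, [])
            for resting in list(book):              # a real engine uses price-time priority
                if order.quantity == 0:
                    break
                if crosses(order, resting):
                    fill = min(order.quantity, resting.quantity)
                    order.quantity -= fill
                    resting.quantity -= fill
                    if resting.quantity == 0:
                        book.remove(resting)
                    # a real system would append a trade event here
            if order.quantity > 0:
                book.append(order)                  # rest the remainder on the book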

Early Decisions That Pay Off Later

Even at this early stage, the team makes decisions informed by scaling principles:

  1. Event log from day one: Every order, trade, and market lifecycle event is logged to an append-only events table. This is not yet a full event sourcing system, but the audit trail exists.

  2. Stateless session handling: Sessions are stored in Redis, not Django's default database-backed sessions. This means the application server can be replaced without losing user sessions.

  3. API-first design: The frontend communicates with the backend exclusively via a REST API. No server-rendered templates embed business logic.

  4. Idempotent order submission: Every order has a client-generated idempotency key. Duplicate submissions (from retries or double-clicks) are detected and rejected.
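
As a concrete illustration of decision 4, the sketch below enforces the idempotency key with a database unique constraint. The OrderSubmission model and return values are assumptions for illustration, not MarketPulse's actual schema; the pattern is that a retried or double-clicked submission hits the constraint instead of creating a second order.

    from django.db import models, IntegrityError

    class OrderSubmission(models.Model):
        idempotency_key = models.UUIDField(unique=True)   # generated by the client
        payload = models.JSONField()
        created_at = models.DateTimeField(auto_now_add=True)

    def submit_order(idempotency_key, payload):
        try:
            OrderSubmission.objects.create(
                idempotency_key=idempotency_key, payload=payload
            )
        except IntegrityError:
            # Duplicate key: the order was already accepted once, so reject the
            # retry without side effects.
            return {"status": "duplicate", "idempotency_key": str(idempotency_key)}
        return {"status": "accepted", "idempotency_key": str(idempotency_key)}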

First Growing Pain: The Weekend Spike

Six months in, a popular podcast mentions MarketPulse. Traffic spikes 10x on Saturday morning. The single VM's CPU hits 95%, the database connection pool (10 connections) is exhausted, and the site returns 502 errors for 45 minutes.

Lessons learned:

  • A single VM is a single point of failure.
  • Connection pooling must be sized for peak, not average.
  • There is no monitoring to detect issues before users complain.

Phase 2: Split and Scale (Months 7--12, 1,000--10,000 Users)

Architectural Changes

The team refactors to separate concerns and eliminate single points of failure:

1. Database separation
   • PostgreSQL moves to a managed cloud database (AWS RDS or equivalent) with automatic backups, failover, and read replicas.
   • Connection pooling via PgBouncer: 200 max client connections, pool size 30.
   • One read replica handles all read-only queries (market listings, portfolio views, leaderboards); see the routing sketch after this list.

2. Application server scaling
   • The Django application runs in Docker containers behind an application load balancer.
   • Two instances minimum, auto-scaling to six based on CPU utilization (threshold: 70%).

3. Event sourcing adoption
   • The events table is formalized into a proper event store with aggregate versioning and optimistic concurrency control (Section 33.2); a sketch of the append path follows this list.
   • A MarketAggregate class rebuilds market state from events.
   • Snapshots are taken every 500 events per market.

4. CQRS introduction
   • The write path (order placement, matching) uses the event store as its source of truth.
   • Read models (order book, market summary, portfolio) are built as projections that subscribe to the event stream; see the projection sketch after this list.
   • Projections run in a separate process, so read-heavy traffic does not affect write performance.

5. Monitoring
   • Prometheus collects metrics from all services.
   • Grafana dashboards show the four golden signals: latency, traffic, errors, and saturation.
   • PagerDuty alerts fire when p99 latency exceeds 200 ms or the error rate exceeds 1%.
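
The sketches below illustrate three of these changes. First, the read/write split from change 1 can be expressed as a standard Django database router; the "default" and "replica" aliases are assumptions that must match the DATABASES setting, with the router listed in DATABASE_ROUTERS.

    class ReadReplicaRouter:
        """Send reads to the replica, writes to the primary."""

        def db_for_read(self, model, **hints):
            # Market listings, portfolio views and leaderboards tolerate
            # a small amount of replica lag.
            return "replica"

        def db_for_write(self, model, **hints):
            return "default"

        def allow_relation(self, obj1, obj2, **hints):
            # Both aliases point at the same logical database.
            return True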
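
Second, a minimal sketch of the optimistic-concurrency append from change 3, assuming a psycopg2-style connection and an events table with (aggregate_id, version, type, payload) columns backed by a UNIQUE (aggregate_id, version) constraint; the table and column names are illustrative.

    class ConcurrencyError(Exception):
        pass

    def append_events(conn, aggregate_id, expected_version, new_events):
        """Append new_events only if the aggregate is still at expected_version."""
        with conn:  # one transaction per append (psycopg2 commits on success)
            cur = conn.cursor()
            cur.execute(
                "SELECT COALESCE(MAX(version), 0) FROM events WHERE aggregate_id = %s",
                (aggregate_id,),
            )
            current_version = cur.fetchone()[0]
            if current_version != expected_version:
                # Another writer got there first; the caller reloads and retries.
                raise ConcurrencyError(
                    f"expected version {expected_version}, found {current_version}"
                )
            for offset, event in enumerate(new_events, start=1):
                cur.execute(
                    "INSERT INTO events (aggregate_id, version, type, payload) "
                    "VALUES (%s, %s, %s, %s)",
                    (aggregate_id, expected_version + offset,
                     event["type"], event["payload"]),
                )
            # The UNIQUE (aggregate_id, version) constraint is what actually
            # enforces this check under concurrent appends.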
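
Third, a projection from change 4 can be a small, separate process that consumes events in commit order and maintains a denormalized summary in Redis. The event types, key layout, and checkpoint key below are assumptions for illustration.

    import redis

    r = redis.Redis()

    def project_market_summary(event):
        key = f"market:{event['market_id']}:summary"
        if event["type"] == "TradeExecuted":
            r.hset(key, mapping={
                "last_price": event["price"],
                "last_trade_at": event["timestamp"],
            })
            r.hincrby(key, "volume", int(event["quantity"]))
        elif event["type"] == "MarketResolved":
            r.hset(key, "status", "resolved")

    def run_projection(event_stream):
        """event_stream yields stored events in commit order (e.g. polled from the store)."""
        for event in event_stream:
            project_market_summary(event)
            # Checkpoint so the projection can resume (or be rebuilt) after a restart.
            r.set("projection:market_summary:position", event["sequence"])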

Updated Architecture

Internet
    |
    v
[Load Balancer]
    |
    +---> [API Server 1]  --+
    +---> [API Server 2]  --+--> [Event Store (RDS Primary)]
    +---> [API Server 3]  --+        |
                                     v
                              [RDS Read Replica]
                                     |
    [Matching Engine] <--- [Redis Pub/Sub] ---> [Projection Service]
         |                                          |
         v                                          v
    [Event Store]                           [Read Model Cache (Redis)]

Performance at This Scale

Metric                      | Target SLO | Actual
Order latency (p50)         | < 50 ms    | 18 ms
Order latency (p99)         | < 200 ms   | 95 ms
Availability                | 99.9%      | 99.92%
Concurrent users supported  | 5,000      | ~8,000

Monthly infrastructure cost: $1,200

Growing Pain: Election Night Surprise

A local election drives unexpected traffic. 8,000 concurrent users arrive. The matching engine (still single-process) becomes a bottleneck: order processing latency spikes to 500 ms. The WebSocket server, running alongside the API, drops connections when memory pressure increases.

Root cause: The matching engine processes all markets sequentially. When 15 election markets receive heavy trading simultaneously, orders queue up.

Phase 3: Cloud-Native (Months 13--18, 10,000--100,000 Users)

Architectural Transformation

1. Matching engine sharding
   • Markets are distributed across 4 matching engine shards using consistent hashing; see the router sketch after this list.
   • Each shard handles roughly 125 markets independently.
   • Sharding key: market_id (deterministic, no cross-shard coordination needed).

2. Dedicated WebSocket service
   • WebSocket handling is separated into its own service, subscribed to Redis Pub/Sub for price updates.
   • Three WebSocket server instances, each supporting 50,000 connections.
   • Connection load balancing via IP hash for session affinity.

3. Message queue backbone
   • All inter-service communication goes through a message queue (RabbitMQ or AWS SQS).
   • Priority queues: order matching (critical), notifications (normal), analytics (low); a queue-topology sketch follows this list.
   • Dead letter queues capture failed messages for manual inspection.

4. Multi-level caching
   • L1: In-process LRU cache (1-second TTL) for the hottest data.
   • L2: Redis cluster for shared cache (TTLs of 2 to 300 seconds depending on data type).
   • L3: Database (source of truth).
   • Event-driven cache invalidation: when a trade executes, the affected market's cache is immediately invalidated. A caching sketch follows this list.

5. Comprehensive monitoring and alerting
   • Custom Prometheus metrics for prediction-market-specific concerns:
     - market_spread (bid-ask spread per market)
     - matching_engine_queue_depth (orders waiting to be processed)
     - order_to_trade_latency (time from order placement to match)
     - websocket_connections_active (per-server connection count)
   • Anomaly detection: a rolling z-score on order volume flags unusual activity (sketch after this list).
   • Runbook automation: common incidents (cache failure, replica lag) have automated remediation.

6. Disaster recovery
   • Database: continuous WAL archiving to object storage, point-in-time recovery tested monthly.
   • Event store: all events are also written to a cross-region backup.
   • Failover: automated database failover; matching engine shards have hot standbys.
   • Recovery Time Objective (RTO): 5 minutes. Recovery Point Objective (RPO): 0 (no data loss).
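
The sketches below flesh out four of these changes. For change 1, a consistent-hash router assigns each market_id to one of the four shards; the shard names and virtual-node count here are illustrative (the case-study code referenced at the end of this section includes a fuller router).

    import bisect
    import hashlib

    class ShardRouter:
        """Consistent-hash ring mapping market_id -> matching engine shard."""

        def __init__(self, shards, vnodes=64):
            self.ring = []                        # sorted (hash, shard) pairs
            for shard in shards:
                for i in range(vnodes):           # virtual nodes smooth the distribution
                    self.ring.append((self._hash(f"{shard}:{i}"), shard))
            self.ring.sort()
            self._keys = [h for h, _ in self.ring]

        @staticmethod
        def _hash(value: str) -> int:
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        def shard_for(self, market_id: str) -> str:
            idx = bisect.bisect_right(self._keys, self._hash(market_id)) % len(self.ring)
            return self.ring[idx][1]

    router = ShardRouter(["matching-1", "matching-2", "matching-3", "matching-4"])
    # Every order for the same market deterministically reaches the same shard,
    # so no cross-shard coordination is needed.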
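
For change 3, one way to realize the priority and dead-letter topology with RabbitMQ is sketched below using pika; the queue and exchange names are assumptions, while x-max-priority and dead-letter exchanges are standard RabbitMQ features.

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()

    # Failed messages are routed here for manual inspection.
    channel.exchange_declare(exchange="dlx", exchange_type="fanout")
    channel.queue_declare(queue="dead-letters", durable=True)
    channel.queue_bind(queue="dead-letters", exchange="dlx")

    # Order matching is critical: its queue supports per-message priorities.
    channel.queue_declare(
        queue="orders.matching",
        durable=True,
        arguments={"x-max-priority": 10, "x-dead-letter-exchange": "dlx"},
    )
    channel.queue_declare(queue="notifications", durable=True,
                          arguments={"x-dead-letter-exchange": "dlx"})
    channel.queue_declare(queue="analytics", durable=True)

    channel.basic_publish(
        exchange="",
        routing_key="orders.matching",
        body=b'{"order_id": "abc123"}',
        properties=pika.BasicProperties(priority=9, delivery_mode=2),  # persistent
    )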
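
For change 4, a minimal sketch of the read-through path and event-driven invalidation, using cachetools.TTLCache as the in-process L1 and Redis as the shared L2; the key names and the 30-second L2 TTL are assumptions.

    import json

    import redis
    from cachetools import TTLCache

    l1 = TTLCache(maxsize=1024, ttl=1)      # L1: hottest data, 1-second TTL
    l2 = redis.Redis()                      # L2: shared cache

    def get_market_summary(market_id, load_from_db):
        key = f"market:{market_id}:summary"
        if key in l1:
            return l1[key]
        cached = l2.get(key)
        if cached is not None:
            value = json.loads(cached)
        else:
            value = load_from_db(market_id)           # L3: the database
            l2.set(key, json.dumps(value), ex=30)     # TTL varies by data type
        l1[key] = value
        return value

    def on_trade_executed(event):
        """Event-driven invalidation: a trade immediately invalidates its market."""
        key = f"market:{event['market_id']}:summary"
        l1.pop(key, None)       # other instances clear their L1 when the 1 s TTL expires
        l2.delete(key)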
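
And for change 5, the rolling z-score check on order volume can be as small as the sketch below; the window size, warm-up length, and threshold of 3 standard deviations are illustrative.

    from collections import deque
    from statistics import mean, stdev

    class VolumeAnomalyDetector:
        def __init__(self, window=60, threshold=3.0):
            self.samples = deque(maxlen=window)   # e.g. orders per minute
            self.threshold = threshold

        def observe(self, orders_per_minute: float) -> bool:
            """Return True if this sample is far outside the recent distribution."""
            anomalous = False
            if len(self.samples) >= 10:           # warm-up before alerting
                mu = mean(self.samples)
                sigma = stdev(self.samples)
                if sigma > 0 and abs(orders_per_minute - mu) / sigma > self.threshold:
                    anomalous = True
            self.samples.append(orders_per_minute)
            return anomalous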

Final Architecture

Internet
    |
    v
[CDN / Rate Limiting]
    |
    v
[Application Load Balancer]
    |
    +---> [API Servers (4-40 instances, auto-scaling)]
    |          |
    |          v
    |     [Message Queue (RabbitMQ)]
    |          |
    |     +----+----+----+----+
    |     v    v    v    v    v
    |    [MS1][MS2][MS3][MS4][Settlement]
    |     Matching Engine Shards
    |          |
    |          v
    |     [Event Store (Primary + Replica + Cross-Region)]
    |
    +---> [WebSocket Servers (3 instances)]
    |          |
    |          v
    |     [Redis Pub/Sub]
    |
    +---> [Projection Service]
              |
              v
         [Read Model (Redis Cluster)]

Performance at Full Scale

Metric                      | Target SLO        | Actual
Order latency (p50)         | < 50 ms           | 22 ms
Order latency (p99)         | < 200 ms          | 110 ms
API availability            | 99.95%            | 99.97%
Concurrent users supported  | 100,000           | ~120,000
WebSocket connections       | 50,000            | tested to 80,000
Peak orders per second      | 2,000             | tested to 5,000
Event store throughput      | 10,000 events/sec | tested to 25,000

Monthly infrastructure cost: $8,500

Cost Per User Analysis

Phase        | Users  | Monthly Cost | Cost/User/Month
Monolith     | 500    | $65          | $0.13
Split        | 5,000  | $1,200       | $0.24
Cloud-Native | 50,000 | $8,500       | $0.17

The cost per user initially increases during the "split" phase (infrastructure overhead), then decreases as the cloud-native architecture achieves economies of scale.

Key Lessons

1. Don't Optimize Prematurely, But Plan Ahead

MarketPulse's monolith served well for six months. Premature microservices would have slowed development. But the early decisions (event log, stateless sessions, API-first) made the transition smooth.

2. Event Sourcing Is Worth the Investment

The transition from a traditional CRUD model to event sourcing was the single most valuable architectural change. It enabled:

  • CQRS (separate read/write scaling)
  • Complete audit trail (regulatory compliance)
  • Temporal queries (debugging, analytics)
  • Replay-based testing (run new code against production events)

3. Shard the Bottleneck, Not Everything

Only the matching engine needed sharding. The API servers are stateless and horizontally scalable without sharding. The database handles the write load with a single primary. Sharding everything would have added unnecessary complexity.

4. Monitor Business Metrics, Not Just System Metrics

CPU and memory are necessary but insufficient. The metrics that caught real issues were prediction-market-specific: matching queue depth, bid-ask spread widening, and order-to-trade latency. These are leading indicators of user-facing problems.

5. Test at Scale Before You Need It

MarketPulse load-tested to 2x their expected peak before every major event. The election night surprise in Phase 2 happened because they did not test the matching engine under concurrent multi-market load. After that incident, every component was tested independently and together.

Implementation

The full implementation supporting this case study is available in code/case-study-code.py. It includes the event sourcing system, CQRS projections, matching engine shard router, monitoring metrics, and a simulation of the three scaling phases.