Case Study 1: Scaling for Election Night

Background

PredictAll is a mid-sized prediction market platform with approximately 50,000 registered users and 200 active markets. On a typical day, the platform processes 5,000 trades and handles 100,000 API requests. The architecture consists of:

  • Two API servers behind an Nginx load balancer
  • A single PostgreSQL database (32 GB RAM, 8 cores)
  • A Redis instance for caching
  • A single matching engine process
  • WebSocket server for real-time price feeds

A major national election is three months away. PredictAll has 15 election-related markets that historically attract heavy interest. Based on the previous election cycle and the platform's growth trajectory, the engineering team estimates election night will bring:

  • 100,000+ concurrent users (a 100x increase)
  • 500,000+ trades over 6 hours
  • 2,000,000+ API requests over 6 hours
  • 50,000+ simultaneous WebSocket connections

The team has a budget of $50,000 for infrastructure scaling and three months to prepare.

Phase 1: Assessment (Month 1)

Load Testing the Current System

The team begins by load testing the current system to establish baselines and identify breaking points.

Test 1: Steady Ramp. Using Locust, they simulate a steady increase from 100 to 5,000 concurrent users over 30 minutes.
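
A minimal Locust sketch of this kind of ramp is shown below; the endpoints, payloads, and spawn rate are placeholders, not PredictAll's actual API.

# locustfile.py --- steady ramp from 100 to 5,000 users over 30 minutes (sketch;
# endpoints and payloads are illustrative)
from locust import HttpUser, LoadTestShape, between, task

class Trader(HttpUser):
    wait_time = between(1, 3)  # seconds of think time between actions

    @task(3)
    def view_markets(self):
        self.client.get("/api/markets")

    @task(1)
    def place_order(self):
        self.client.post("/api/orders",
                         json={"market_id": 1, "side": "buy", "price": 0.55, "quantity": 10})

class SteadyRamp(LoadTestShape):
    start_users, end_users, duration = 100, 5000, 30 * 60

    def tick(self):
        elapsed = self.get_run_time()
        if elapsed > self.duration:
            return None  # stop the test
        users = self.start_users + int(
            (self.end_users - self.start_users) * elapsed / self.duration)
        return users, 50  # (target user count, spawn rate per second)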

Results:

Concurrent Users   Avg Latency (ms)   p99 Latency (ms)   Error Rate   Orders/sec
100                12                 45                 0%           15
500                18                 85                 0%           65
1,000              35                 220                0.1%         110
2,000              95                 890                2.5%         145
3,000              450                3,200              15%          120
5,000              timeout            timeout            85%          30

Analysis: The system breaks down around 2,000 concurrent users. The primary bottleneck is database connection exhaustion: the PostgreSQL connection pool is configured for 20 connections, and at 2,000 concurrent users, every connection is occupied, causing new requests to queue and eventually time out.

Test 2: WebSocket Stress. Simulating 10,000 WebSocket connections reveals that Nginx's default configuration allows only 1,024 connections per worker. The single WebSocket server runs out of file descriptors at approximately 8,000 connections.
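
A crude way to reproduce this limit is to open connections until the server starts refusing them. The sketch below uses the websockets library against an assumed /feed endpoint; the load-generating machine needs its own file descriptor limit raised first.

# ws_stress.py --- hold thousands of WebSocket connections open (sketch; assumed endpoint)
import asyncio
import websockets

FEED_URL = "ws://localhost:8080/feed"   # placeholder price-feed endpoint
TARGET = 10_000

async def hold_connection(url):
    async with websockets.connect(url) as ws:
        await ws.recv()            # wait for the first price update
        await asyncio.sleep(600)   # then just keep the connection open

async def main():
    tasks = [asyncio.create_task(hold_connection(FEED_URL)) for _ in range(TARGET)]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    failures = sum(isinstance(r, Exception) for r in results)
    print(f"held {TARGET - failures} of {TARGET} connections")

asyncio.run(main())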

Test 3: Matching Engine. Feeding orders directly to the matching engine (bypassing the API) shows it can handle approximately 5,000 orders per second on the current hardware. This is adequate for election night, but only if the rest of the system does not bottleneck first.

Bottleneck Identification

  1. Database connections (critical): 20-connection pool is exhausted at modest load.
  2. WebSocket capacity (critical): Cannot support more than 8,000 simultaneous connections.
  3. API server capacity (high): The two existing servers cannot handle the projected request volume.
  4. No auto-scaling (high): Cannot respond to rapid traffic increases.
  5. No read replicas (medium): All reads go to the primary database.
  6. Cache coverage (medium): Many frequently accessed queries are not cached.

Phase 2: Architecture Changes (Month 2)

Database Scaling

Connection pooling with PgBouncer: Deploy PgBouncer in front of PostgreSQL, configured in transaction pooling mode. This allows thousands of application connections to share a smaller number of actual database connections.

Configuration:

[pgbouncer]
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 50
reserve_pool_size = 10
reserve_pool_timeout = 3

Read replicas: Deploy two read replicas. Route all read-only queries (market listings, portfolio views, trade history) to replicas, reserving the primary for writes (order placement, matching, settlement).
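
One lightweight way to implement this routing is to keep separate connection pools and choose one per query. The sketch below uses psycopg2 pools; the DSNs, pool sizes, and table names are illustrative, not PredictAll's actual schema.

# db_router.py --- route read-only queries to replicas, writes to the primary (sketch)
import random
from psycopg2.pool import ThreadedConnectionPool

primary = ThreadedConnectionPool(1, 20, dsn="postgresql://app@primary/predictall")
replicas = [
    ThreadedConnectionPool(1, 20, dsn="postgresql://app@replica1/predictall"),
    ThreadedConnectionPool(1, 20, dsn="postgresql://app@replica2/predictall"),
]

def run_query(sql, params=(), readonly=False):
    pool = random.choice(replicas) if readonly else primary
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            rows = cur.fetchall() if readonly else None
        conn.commit()
        return rows
    finally:
        pool.putconn(conn)

# Reads (trade history, portfolio views) go to a replica...
trades = run_query("SELECT * FROM trades WHERE market_id = %s LIMIT 100", (42,), readonly=True)
# ...while writes (order placement) always hit the primary.
run_query("INSERT INTO orders (market_id, side, price, quantity) VALUES (%s, %s, %s, %s)",
          (42, "buy", 0.55, 10))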

Database indexes: Add the partial index on open orders and the composite index on trades by market and time. This reduces average query time from 8ms to 2ms for the most common queries.
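
Expressed as DDL (the table and column names here are assumed), the two indexes can be created without blocking live trading by using CONCURRENTLY:

# create_indexes.py --- add the two hot-path indexes without locking writes (sketch;
# table and column names are illustrative, not PredictAll's actual schema)
import psycopg2

DDL = [
    # Partial index: only open orders are scanned by the matching path.
    """CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_open
       ON orders (market_id, price) WHERE status = 'open'""",
    # Composite index: trade history queries filter by market and sort by time.
    """CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_trades_market_time
       ON trades (market_id, executed_at DESC)""",
]

conn = psycopg2.connect("postgresql://admin@primary/predictall")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction
with conn.cursor() as cur:
    for stmt in DDL:
        cur.execute(stmt)
conn.close()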

API Server Scaling

Stateless refactoring: Move all session data to Redis. Remove any in-process caching that is not backed by Redis. This allows any API server to handle any request.
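
A minimal sketch of a Redis-backed session store, assuming redis-py; the key names and TTL are illustrative.

# sessions.py --- keep session state in Redis so any API server can serve any request (sketch)
import json
import secrets
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)
SESSION_TTL = 3600  # seconds

def create_session(user_id):
    token = secrets.token_urlsafe(32)
    r.setex(f"session:{token}", SESSION_TTL, json.dumps({"user_id": user_id}))
    return token

def get_session(token):
    data = r.get(f"session:{token}")
    if data is None:
        return None
    r.expire(f"session:{token}", SESSION_TTL)  # sliding expiration
    return json.loads(data)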

Container deployment: Package the API server as a Docker container and deploy to Kubernetes. Configure a Horizontal Pod Autoscaler (HPA) with the following rules:

  • Minimum: 4 pods
  • Maximum: 40 pods
  • Scale up when CPU > 60% for 2 minutes
  • Scale down when CPU < 30% for 10 minutes
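
The HPA itself is an ordinary Kubernetes object. The sketch below creates an equivalent spec with the official Python client; it assumes a recent client with the autoscaling/v2 API, and the deployment name and namespace are placeholders. The 30% scale-down rule has no direct HPA equivalent and is only approximated by a long scale-down stabilization window.

# hpa.py --- create the API-server autoscaler via the Kubernetes Python client (sketch)
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "api-server", "namespace": "predictall"},
    "spec": {
        "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "api-server"},
        "minReplicas": 4,
        "maxReplicas": 40,
        "metrics": [{
            "type": "Resource",
            "resource": {"name": "cpu",
                         "target": {"type": "Utilization", "averageUtilization": 60}},
        }],
        "behavior": {
            # Scale up quickly on sustained load, scale down slowly to avoid flapping.
            "scaleUp": {"stabilizationWindowSeconds": 120},
            "scaleDown": {"stabilizationWindowSeconds": 600},
        },
    },
}

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="predictall", body=hpa)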

WebSocket Infrastructure

Dedicated WebSocket servers: Separate WebSocket handling from the REST API. Deploy dedicated WebSocket servers that subscribe to a Redis Pub/Sub channel for price updates.

Fan-out architecture: The matching engine publishes price updates to Redis Pub/Sub. Each WebSocket server subscribes and pushes updates to its connected clients. This allows horizontal scaling of WebSocket servers independently.
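
A stripped-down version of one WebSocket server in this fan-out tier, using redis-py's asyncio Pub/Sub together with the websockets library; the channel name, port, and message format are assumptions.

# ws_fanout.py --- one server in the WebSocket fan-out tier (sketch)
import asyncio
import redis.asyncio as aioredis
import websockets

CHANNEL = "price-updates"        # assumed Pub/Sub channel
clients = set()                  # currently connected WebSocket clients

async def handle_client(ws, path=None):   # 'path' kept for older websockets versions
    clients.add(ws)
    try:
        async for _ in ws:       # clients never send; loop exits when they disconnect
            pass
    finally:
        clients.discard(ws)

async def relay_prices():
    pubsub = aioredis.Redis(host="redis", port=6379).pubsub()
    await pubsub.subscribe(CHANNEL)
    async for msg in pubsub.listen():
        if msg["type"] != "message":
            continue
        # Fan the matching engine's update out to every connected client.
        await asyncio.gather(*(ws.send(msg["data"]) for ws in clients),
                             return_exceptions=True)

async def main():
    async with websockets.serve(handle_client, "0.0.0.0", 8080):
        await relay_prices()

asyncio.run(main())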

Configuration for each WebSocket server:

  • Max connections: 50,000 (with OS-level file descriptor limits raised)
  • Deploy 3 instances for election night (150,000 total connection capacity)
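
The soft file descriptor limit can be raised when the process starts, as in this small sketch; the hard limit itself has to be raised in the systemd unit, ulimit settings, or container spec.

# Raise this process's open-file limit toward the hard limit before accepting connections.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 65536 if hard == resource.RLIM_INFINITY else min(65536, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))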

Caching

Comprehensive caching layer: Implement the multi-level caching strategy from Section 33.5:

  • Market list: 30-second TTL
  • Market prices: 2-second TTL with event-driven invalidation
  • Order book snapshots: 1-second TTL
  • User portfolios: 10-second TTL with event-driven invalidation
  • Leaderboard: 5-minute TTL
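
A cache-aside helper along these lines, with per-kind TTLs mirroring the list above; the Redis key names and loader functions are illustrative.

# cache.py --- cache-aside helper with per-kind TTLs and event-driven invalidation (sketch)
import json
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)

TTL = {                 # seconds; mirrors the policy above
    "market_list": 30,
    "market_price": 2,
    "order_book": 1,
    "portfolio": 10,
    "leaderboard": 300,
}

def cached(kind, key, loader):
    """Return the cached value if present, otherwise load it and cache it."""
    cache_key = f"{kind}:{key}"
    hit = r.get(cache_key)
    if hit is not None:
        return json.loads(hit)
    value = loader()                      # e.g. a read-replica query
    r.setex(cache_key, TTL[kind], json.dumps(value))
    return value

def invalidate(kind, key):
    """Event-driven invalidation: call when a trade or settlement changes the data."""
    r.delete(f"{kind}:{key}")

# Usage: price = cached("market_price", market_id, lambda: load_price_from_db(market_id))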

Cache warming: 15 minutes before election night coverage begins, run a cache warmer that pre-loads all election markets.
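
The warmer can be as simple as requesting every election market through the public API, which populates the same caches real users will hit. A sketch, with placeholder endpoints and market IDs:

# cache_warmer.py --- pre-warm caches by requesting each election market (sketch)
import requests

API = "https://api.predictall.example"   # placeholder base URL
ELECTION_MARKET_IDS = range(101, 116)    # placeholder IDs for the 15 election markets

def warm():
    session = requests.Session()
    session.get(f"{API}/api/markets", timeout=5)                        # market list cache
    for market_id in ELECTION_MARKET_IDS:
        session.get(f"{API}/api/markets/{market_id}", timeout=5)        # price cache
        session.get(f"{API}/api/markets/{market_id}/book", timeout=5)   # order book cache

if __name__ == "__main__":
    warm()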

Monitoring

Prometheus and Grafana: Deploy Prometheus for metrics collection and Grafana for dashboards. Create the following dashboards:

  • System Overview: CPU, memory, network, disk for all instances
  • API Performance: Request rate, latency percentiles, error rate by endpoint
  • Matching Engine: Orders per second, match rate, queue depth
  • Database: Connection pool utilization, query latency, replication lag
  • Business Metrics: Active users, trades per minute, total volume
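
On the application side, these dashboards are fed by a handful of counters, gauges, and histograms. A sketch using prometheus_client; the metric names and wrapper are illustrative.

# metrics.py --- application-side metrics behind the dashboards (sketch)
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("api_requests_total", "API requests", ["endpoint", "status"])
LATENCY = Histogram("api_request_seconds", "API request latency", ["endpoint"])
QUEUE_DEPTH = Gauge("matching_engine_queue_depth", "Orders waiting to be matched")

def observed(endpoint, handler):
    """Wrap a request handler so every call is counted and timed."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = handler(*args, **kwargs)
            REQUESTS.labels(endpoint=endpoint, status="ok").inc()
            return result
        except Exception:
            REQUESTS.labels(endpoint=endpoint, status="error").inc()
            raise
        finally:
            LATENCY.labels(endpoint=endpoint).observe(time.perf_counter() - start)
    return wrapper

start_http_server(9100)   # Prometheus scrapes http://<instance>:9100/metrics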

Alerting: Configure PagerDuty alerts for:

  • API error rate > 5% for 2 minutes (page on-call engineer)
  • Database connection pool > 80% for 5 minutes (page on-call engineer)
  • Matching engine queue depth > 1,000 (page on-call engineer)
  • Replication lag > 10 seconds (page database on-call)

Phase 3: Testing and Preparation (Month 3)

Full-Scale Load Test

Two weeks before the election, the team runs a full-scale load test simulating election night traffic:

Test scenario: Simulate 100,000 concurrent users over 4 hours with the election night task distribution (45% order placement, 25% market viewing, 15% portfolio checking, 10% browsing, 5% cancellation).
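
In Locust terms, that task distribution is just a set of task weights. A sketch, using the same placeholder endpoints as the earlier ramp test:

# Election-night task mix expressed as Locust task weights (endpoints are placeholders)
from locust import HttpUser, between, task

class ElectionNightUser(HttpUser):
    wait_time = between(0.5, 2)

    @task(45)
    def place_order(self):
        self.client.post("/api/orders", json={"market_id": 101, "side": "buy",
                                              "price": 0.62, "quantity": 5})

    @task(25)
    def view_market(self):
        self.client.get("/api/markets/101")

    @task(15)
    def check_portfolio(self):
        self.client.get("/api/portfolio")

    @task(10)
    def browse_markets(self):
        self.client.get("/api/markets")

    @task(5)
    def cancel_order(self):
        self.client.delete("/api/orders/123")   # placeholder order id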

Results with the scaled architecture:

Concurrent Users   Avg Latency (ms)   p99 Latency (ms)   Error Rate   Orders/sec
10,000             8                  35                 0%           850
25,000             12                 55                 0%           1,900
50,000             18                 95                 0.01%        3,200
75,000             28                 145                0.05%        4,100
100,000            42                 195                0.1%         4,800
125,000            65                 310                0.5%         5,100

The scaled system handles 100,000 concurrent users within SLO targets. At 125,000 users, p99 latency exceeds the 200ms target but the system remains stable.

Runbook Preparation

The team writes runbooks for the following scenarios:

  • Database primary failure during election night
  • Matching engine crash (specific market shard)
  • Redis cluster failure
  • WebSocket server out of connections
  • Sudden traffic 2x beyond projections
  • DDoS attack during election coverage

War Room Plan

For election night, the team establishes a war room with:

  • Staffing: 2 backend engineers, 1 database engineer, 1 SRE, 1 product manager
  • Shifts: 6 PM to 2 AM, with on-call backup until 6 AM
  • Communication: Dedicated Slack channel, shared Grafana dashboard on a large screen
  • Decision authority: Pre-authorized to spend up to $10,000 on additional infrastructure without management approval

Election Night: Execution

Timeline

5:00 PM --- Pre-Event

  • Cache warmer activated for all election markets.
  • Kubernetes HPA minimum raised from 4 to 15 pods.
  • WebSocket servers pre-scaled to 5 instances.
  • All team members in the war room.
  • Baseline metrics recorded.

6:00 PM --- Polls Close in East Coast States

  • Traffic increases 5x above normal.
  • HPA scales API pods from 15 to 22.
  • All systems green. Latency well within SLOs.

7:00 PM --- First Major Call

  • A key state is called for a candidate. Traffic spikes to 30x normal within 2 minutes.
  • Order volume surges to 2,500/second.
  • HPA scales to 30 pods.
  • WebSocket connections reach 45,000 across 5 servers.
  • Database connection pool utilization: 65%.
  • All systems green.

8:15 PM --- Surprising Result

  • An unexpected result in a major state causes a massive trading surge.
  • Traffic hits 80x normal. Order volume: 4,200/second.
  • HPA reaches 38 pods (near maximum of 40).
  • Action: Engineer increases HPA maximum to 60 pods.
  • Database connection pool: 78%. Warning alert triggered.
  • Action: Database engineer increases PgBouncer pool from 50 to 75.
  • p99 latency: 165ms. Within SLO but elevated.

9:30 PM --- Peak Traffic

  • Multiple states called simultaneously.
  • Traffic: 110x normal. Order volume: 5,500/second.
  • HPA at 48 pods.
  • WebSocket connections: 82,000. Sixth WebSocket server added.
  • Database replication lag: 3 seconds (elevated but not critical).
  • Redis memory: 72% utilized.
  • p99 latency: 188ms. Very close to the 200ms SLO.
  • Action: Team activates "high-load mode" --- reduces non-essential API rate limits, disables leaderboard updates, increases cache TTLs.

10:45 PM --- Race Called

  • The overall election is called. One final surge of trading.
  • Peak order volume: 6,800/second (exceeds load test maximum).
  • p99 latency spikes to 240ms for approximately 90 seconds.
  • Error rate: 0.3% (brief connection timeouts).
  • Matching engine queue depth: 2,800 (briefly).
  • System recovers within 2 minutes as the initial surge subsides.

11:30 PM --- Cooldown

  • Traffic drops to 20x normal.
  • HPA begins scaling down.
  • Team reviews the 90-second SLO breach.

1:00 AM --- Markets Suspended

  • Election markets suspended pending official certification.
  • Traffic at 3x normal.
  • HPA scaled down to 10 pods.
  • War room winds down; the on-call engineer takes over.

Results Summary

Metric                          Target            Actual
Peak concurrent users           100,000           ~110,000
Peak orders/second              5,000             6,800
p99 latency (sustained)         < 200ms           188ms
p99 latency (peak spike)        < 200ms           240ms (90 seconds)
Error rate                      < 0.1%            0.08% (excl. spike: 0.3%)
Availability                    99.95%            99.97%
Total trades                    500,000           680,000
WebSocket connections (peak)    100,000           82,000
Infrastructure cost             $50,000 budget    $38,000 actual

Post-Mortem

What Went Well

  1. Capacity planning was accurate: The load test predictions matched reality within 20%.
  2. Auto-scaling worked: Kubernetes HPA responded to traffic increases within 2--3 minutes.
  3. Monitoring provided visibility: The team saw problems developing before they became critical.
  4. Pre-authorized decision authority: Engineers could act immediately without waiting for management approval.
  5. Cache warming: Pre-loading election market data eliminated cold-start latency.

What Could Be Improved

  1. 90-second SLO breach: The final surge exceeded load test capacity. Future capacity planning should test at 150% of projections, not 125%.
  2. HPA ceiling hit: The initial maximum of 40 pods was too low. The manual increase to 60 introduced risk. Solution: Set higher initial maximum with cost alerts.
  3. Manual WebSocket scaling: WebSocket servers were scaled manually. Solution: Automate WebSocket server scaling based on connection count.
  4. Replication lag during peak: 3-second lag on read replicas caused some users to see slightly stale prices. Solution: Use synchronous replication for critical read replicas, or route critical reads to primary.
  5. No pre-scaling for known events: The team knew election night would be high traffic but relied on reactive auto-scaling. Solution: Pre-scale to expected peak 30 minutes before the event.

Action Items

  1. Implement pre-scaling automation for scheduled high-traffic events.
  2. Increase default HPA maximum to account for unexpected surges.
  3. Automate WebSocket server scaling.
  4. Run load tests at 200% of projected peak.
  5. Investigate synchronous replication for critical read paths.
  6. Document the "high-load mode" playbook for future events.

Lessons Learned

  1. Prediction markets have some of the most extreme traffic patterns of any application type. The 110x peak-to-normal ratio exceeded what most infrastructure is designed to handle. Plan accordingly.

  2. Load testing is the most valuable investment. The $5,000 spent on load testing infrastructure saved the platform from catastrophic failure during the most important event of the year.

  3. Monitoring is not optional. Without real-time dashboards, the team would not have been able to respond to the p99 latency increase before it became an outage.

  4. Human judgment is still essential. Auto-scaling handled the gradual ramp, but the sudden spike required human intervention (increasing HPA limits, adding WebSocket servers, activating high-load mode).

  5. Practice the war room. The team ran two practice war room sessions with simulated incidents. This meant that on election night, everyone knew their role and could act without confusion.


See code/case-study-code.py for implementations of the capacity planner, load test scenarios, and monitoring dashboard configurations used in this case study.