Chapter 26 Key Takeaways: Real-Time Analytics Systems
Quick Reference Summary
System Architecture Fundamentals
┌─────────────────────────────────────────────────────────────────────┐
│ REAL-TIME ANALYTICS PIPELINE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ DATA SOURCES INGESTION PROCESSING DELIVERY │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Play-by │───────▶│ Message │────▶│ Stream │────▶│WebSocket│ │
│ │ Play │ │ Queue │ │Processor│ │ Push │ │
│ └─────────┘ │ (Kafka) │ │ │ └─────────┘ │
│ ┌─────────┐ │ │ │ ┌─────┐ │ ┌─────────┐ │
│ │Tracking │───────▶│ │────▶│ │Model│ │────▶│ API │ │
│ │ Data │ │ │ │ └─────┘ │ │ REST │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Redis │ │PostgreSQL│ │Dashboard│ │
│ │ (Cache) │ │(Archive) │ │ (React) │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
The Five Layers of Real-Time Systems
| Layer | Purpose | Technologies | Latency Target |
|---|---|---|---|
| Ingestion | Receive and queue data | Kafka, RabbitMQ | < 10ms |
| Processing | Validate, transform, compute | Python, Flink | < 50ms |
| Storage | Cache hot data, archive history | Redis, PostgreSQL | < 5ms read |
| Delivery | Push updates to clients | WebSocket, SSE | < 20ms |
| Presentation | Visualize and interact | React, D3.js | < 16ms render |
Essential Design Patterns
1. Event-Driven Architecture
# Events flow through the system asynchronously
class PlayEvent:
game_id: str
play_id: str
event_type: str # 'play_start', 'play_end', 'score_change'
timestamp: datetime
data: Dict
# Subscribers react to events they care about
engine.subscribe('play_end', update_win_probability)
engine.subscribe('play_end', update_player_stats)
engine.subscribe('score_change', broadcast_to_clients)
2. Backpressure Management
# Slow down producers when consumers can't keep up
if queue.size() > MAX_QUEUE_SIZE:
# Apply backpressure
producer.pause()
logger.warning(f"Queue depth {queue.size()} - applying backpressure")
3. Circuit Breaker
# Prevent cascading failures
class CircuitBreaker:
def call(self, func, *args):
if self.state == 'open':
if time.time() - self.last_failure > self.timeout:
self.state = 'half-open'
else:
raise CircuitOpenError()
try:
result = func(*args)
self.failure_count = 0
self.state = 'closed'
return result
except Exception as e:
self.failure_count += 1
if self.failure_count >= self.threshold:
self.state = 'open'
self.last_failure = time.time()
raise
Data Validation Checklist
| Validation Type | What to Check | Example |
|---|---|---|
| Schema | Required fields present | play_id, game_id, timestamp |
| Type | Correct data types | score is integer, time is string |
| Range | Values within bounds | 0 ≤ yard_line ≤ 100 |
| Logical | Business rules | Score only increases, time only decreases |
| Temporal | Ordering correct | Events in chronological order |
| Completeness | All expected data | 22 players on field for tracking |
Win Probability Model - Quick Reference
Key Features for Live Win Probability: 1. Score differential (adjusted for remaining time) 2. Time remaining (seconds) 3. Field position (yards from end zone) 4. Down and distance 5. Possession indicator 6. Timeouts remaining
Win Probability Added (WPA):
WPA = WP_after_play - WP_before_play
Leverage Index:
LI = |WP_swing_potential| / average_swing
- High leverage (>2.0): Critical situation
- Normal leverage (0.5-2.0): Standard situation
- Low leverage (<0.5): Game largely decided
Fourth-Down Decision Framework
┌────────────────────────────────────────────────────────────────┐
│ FOURTH-DOWN DECISION TREE │
├────────────────────────────────────────────────────────────────┤
│ │
│ Current Situation: 4th and X at Y yard line │
│ │
│ Option 1: GO FOR IT │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ E[WP] = P(convert) × WP(1st down) │ │
│ │ + P(fail) × WP(turnover at Y) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Option 2: FIELD GOAL (if in range) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ E[WP] = P(make) × WP(kickoff from 35) │ │
│ │ + P(miss) × WP(opp ball at Y) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Option 3: PUNT │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ E[WP] = WP(opp ball at expected punt distance) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ RECOMMENDATION: Option with highest E[WP] │
│ │
└────────────────────────────────────────────────────────────────┘
Latency Budget Breakdown
For a 100ms end-to-end target:
| Stage | Budget | Typical Actual |
|---|---|---|
| Event ingestion | 10ms | 2-5ms |
| Validation | 5ms | 1-2ms |
| Feature engineering | 15ms | 5-10ms |
| Model prediction | 20ms | 10-15ms |
| Cache update | 5ms | 1-2ms |
| WebSocket broadcast | 10ms | 3-5ms |
| Network transit | 20ms | 5-15ms |
| Client render | 15ms | 10-16ms |
| Total | 100ms | 37-70ms |
Production Monitoring Essentials
Key Metrics to Track:
CRITICAL_METRICS = {
'latency_p99': 'Processing time 99th percentile',
'throughput': 'Events processed per second',
'error_rate': 'Percentage of failed events',
'queue_depth': 'Messages waiting to process',
'active_connections': 'Connected WebSocket clients',
'memory_usage': 'RAM consumption percentage',
'cpu_usage': 'CPU utilization percentage',
'data_quality_score': 'Percentage passing validation'
}
ALERT_THRESHOLDS = {
'latency_p99': 500, # ms - alert if > 500ms
'error_rate': 0.01, # alert if > 1%
'queue_depth': 10000, # alert if backlog > 10k
'memory_usage': 0.85, # alert if > 85%
'cpu_usage': 0.80, # alert if > 80%
'data_quality': 0.95 # alert if < 95%
}
Scaling Strategies
| Load Pattern | Strategy | Implementation |
|---|---|---|
| Predictable spike | Pre-scale | Schedule capacity increase before games |
| Variable load | Auto-scale | Kubernetes HPA based on CPU/queue depth |
| Geographic | CDN/Edge | Deploy processing close to data sources |
| Data volume | Partition | Shard by game_id or region |
Technology Stack Recommendations
For College Football Analytics:
| Component | Recommended | Alternative |
|---|---|---|
| Message Queue | Apache Kafka | RabbitMQ, AWS Kinesis |
| Stream Processing | Apache Flink | Kafka Streams, Spark Streaming |
| Cache | Redis | Memcached |
| Database | PostgreSQL | TimescaleDB, InfluxDB |
| WebSocket Server | Python asyncio | Node.js, Go |
| Dashboard | React + D3.js | Vue + Chart.js |
| Container | Docker | Podman |
| Orchestration | Kubernetes | Docker Swarm |
Common Pitfalls to Avoid
| Pitfall | Problem | Solution |
|---|---|---|
| Synchronous processing | Blocks on slow operations | Use async/await everywhere |
| No backpressure | System overwhelmed | Implement queue limits and flow control |
| Missing validation | Bad data propagates | Validate at ingestion |
| Single point of failure | System goes down | Redundancy at every layer |
| No graceful degradation | All or nothing | Circuit breakers, fallbacks |
| Polling instead of push | High latency, wasted resources | WebSockets for real-time |
| No monitoring | Blind to problems | Comprehensive metrics and alerts |
Code Quality Checklist
- [ ] All events have unique IDs for idempotency
- [ ] Timestamps use UTC consistently
- [ ] Error handling at every boundary
- [ ] Logging includes correlation IDs
- [ ] Health check endpoints exposed
- [ ] Graceful shutdown implemented
- [ ] Configuration externalized
- [ ] Secrets managed securely
- [ ] Unit tests for business logic
- [ ] Integration tests for pipelines
Quick Formulas
Throughput Capacity:
max_throughput = num_workers × events_per_second_per_worker
Queue Wait Time:
avg_wait = queue_depth / processing_rate
Required Replicas for Availability:
replicas = ceiling(1 / (1 - target_availability))
# For 99.9% availability with 99% per-instance: need 3 replicas
Data Quality Score:
quality_score = valid_events / total_events
Summary: The Real-Time Analytics Mindset
- Design for Failure - Assume components will fail; build resilience
- Measure Everything - You can't improve what you don't measure
- Latency is a Feature - Every millisecond matters in live sports
- Data Quality First - Bad data produces bad insights
- Scale Horizontally - Add capacity by adding machines, not upgrading
- Automate Operations - Manual processes don't scale
- Test Under Load - Performance testing before production
- Iterate Rapidly - Deploy small changes frequently
Key Terms Quick Reference
| Term | Definition |
|---|---|
| Backpressure | Mechanism to slow producers when consumers can't keep up |
| Circuit Breaker | Pattern to prevent cascading failures |
| Event Sourcing | Storing state changes as a sequence of events |
| Idempotency | Processing the same event multiple times has the same effect |
| Leverage Index | Measure of situation importance in a game |
| Streaming | Processing data continuously as it arrives |
| WebSocket | Protocol for bidirectional real-time communication |
| WPA | Win Probability Added - impact of a play on win probability |