Chapter 27 Quiz: Building a Complete Analytics System

Instructions

  • 40 questions total
  • Mix of multiple choice, true/false, and short answer
  • Time limit: 50 minutes
  • Passing score: 70%

Section 1: System Requirements and Design (12 questions)

Question 1

In a stakeholder analysis for a football analytics system, which group can typically tolerate the LONGEST response time?

  A) Head coach during games
  B) Recruiting coordinator reviewing prospects
  C) Athletic director reviewing quarterly reports
  D) Analytics staff during game preparation


Question 2

True or False: Non-functional requirements are less important than functional requirements in analytics systems.


Question 3

What is the recommended maximum API response time for real-time coaching dashboards?

  A) 5 seconds
  B) 1 second
  C) 500 milliseconds
  D) 50 milliseconds


Question 4

Which architectural pattern is BEST suited for decoupling data ingestion from processing?

  A) Monolithic architecture
  B) Message queue / event-driven architecture
  C) Direct database writes
  D) Synchronous API calls


Question 5

True or False: A single database can efficiently serve both real-time queries and historical analytics in a production system.


Question 6

Short Answer: List three non-functional requirements that are critical for a game-day analytics system.


Question 7

In a microservices architecture, what component typically handles communication between services?

  A) Shared database
  B) Message broker (Kafka, RabbitMQ)
  C) Direct HTTP calls only
  D) File system


Question 8

Which storage technology is most appropriate for caching frequently accessed game state data?

  A) PostgreSQL
  B) Redis
  C) S3
  D) SQLite


Question 9

True or False: Role-based access control (RBAC) should restrict recruiting data to only recruiting staff and analytics personnel.


Question 10

What is the primary purpose of the "repository pattern" in system design?

  A) Store data in multiple databases
  B) Abstract data access from business logic
  C) Improve query performance
  D) Handle user authentication


Question 11

Short Answer: Explain why horizontal scaling is preferred over vertical scaling for game-day analytics loads.


Question 12

A "service registry" in a microservices architecture is used for:

  A) Registering user accounts
  B) Managing service discovery and dependency injection
  C) Storing service logs
  D) Scheduling background jobs


Section 2: Data Pipeline (10 questions)

Question 13

When ingesting data from external APIs, what should be the FIRST validation step?

  A) Calculate EPA values
  B) Check data freshness
  C) Validate schema (required fields present)
  D) Compare to historical averages


Question 14

True or False: Data ingestion pipelines should immediately stop processing when a single record fails validation.


Question 15

What is "backpressure" in the context of data pipelines?

  A) Storing data in reverse order
  B) Mechanism to slow producers when consumers can't keep up
  C) Compressing data for storage
  D) Prioritizing certain data types


Question 16

When calculating EPA, what determines the expected points BEFORE a play?

  A) The play result
  B) Down, distance, and field position
  C) Player statistics
  D) Win probability


Question 17

Short Answer: Why is it important to log data quality issues even when they don't prevent processing?


Question 18

An "idempotent" data ingestion process means:

  A) Data is processed faster
  B) Processing the same data multiple times produces the same result
  C) Data is compressed
  D) Processing happens in parallel


Question 19

True or False: Data transformations should be applied during ingestion rather than at query time for performance.


Question 20

What is the purpose of a "data quality score" in an analytics pipeline?

  A) Ranking data sources by cost
  B) Measuring the completeness and accuracy of ingested data
  C) Prioritizing which data to process first
  D) Determining storage requirements


Question 21

When processing play-by-play data, EPA for a touchdown should approximately equal:

  A) 1.0
  B) 3.0
  C) 7.0
  D) Variable based on field position


Question 22

Which approach is BEST for handling late-arriving data in a streaming pipeline?

  A) Reject all late data
  B) Use watermarks and late data handling windows
  C) Process late data with the next batch
  D) Store late data in a separate table


Section 3: Analytics Implementation (10 questions)

Question 23

In a win probability model, which feature typically has the LARGEST impact on predictions?

  A) Home field advantage
  B) Score differential (adjusted for time)
  C) Current down and distance
  D) Weather conditions


Question 24

True or False: Win probability for the home and away teams should always sum to exactly 1.0.


Question 25

Short Answer: Describe how Win Probability Added (WPA) is calculated for a single play.


Question 26

For fourth-down decisions, the "expected win probability" of going for it equals:

  A) The conversion probability
  B) Win probability if successful minus win probability if failed
  C) (P(convert) × WP if convert) + (P(fail) × WP if fail)
  D) Win probability after the decision


Question 27

A "leverage index" of 2.5 indicates:

  A) The team is losing by 2.5 touchdowns
  B) The situation is 2.5x more important than average
  C) There are 2.5 quarters remaining
  D) The conversion probability is 25%


Question 28

True or False: A well-calibrated win probability model should show that teams with 80% win probability actually win approximately 80% of the time.


Question 29

When generating opponent scouting reports, which analysis should be broken down by game situation?

  A) Player heights and weights
  B) Run/pass tendencies by down and field position
  C) Historical win/loss records
  D) Stadium capacity


Question 30

The success rate metric considers a first-down play successful if it gains:

  A) Any positive yards
  B) At least 40% of the needed yards
  C) 10 or more yards
  D) More than the defense expected


Question 31

What is the purpose of caching model predictions in a real-time system?

  A) Reduce storage costs
  B) Improve latency for repeated queries
  C) Ensure predictions are consistent
  D) Track model accuracy


Question 32

True or False: EPA can be negative for a play that gains positive yards.


Section 4: Operations and Deployment (8 questions)

Question 33

In a Docker deployment, which file defines multi-container applications?

  A) Dockerfile
  B) docker-compose.yml
  C) package.json
  D) requirements.txt


Question 34

True or False: Health check endpoints should only verify database connectivity.


Question 35

What is the primary purpose of Kubernetes Horizontal Pod Autoscaler (HPA)?

  A) Automatically deploy new code
  B) Scale pods up/down based on metrics
  C) Manage database connections
  D) Handle SSL certificates


Question 36

Short Answer: List four metrics that should be monitored for a production analytics system.


Question 37

The "circuit breaker" pattern in distributed systems is used to:

  A) Physically disconnect servers
  B) Prevent cascading failures by stopping calls to failing services
  C) Encrypt data in transit
  D) Balance load across servers


Question 38

True or False: Production systems should log all API requests including full request bodies for debugging.


Question 39

What is the purpose of a "blue-green deployment" strategy?

  A) Color-coding different environments
  B) Zero-downtime deployments by switching between two identical environments
  C) Deploying to multiple geographic regions
  D) Running tests before deployment


Question 40

When should automated alerts be triggered for a game-day analytics system?

  A) Only when the system is completely down
  B) When latency, error rate, or data freshness exceed thresholds
  C) Every hour during games
  D) Only after receiving user complaints


Answer Key

Section 1: System Requirements and Design

  1. C) Athletic director reviewing quarterly reports - Executive reports have the longest acceptable response times as they are used for strategic planning rather than real-time decisions.

  2. False - Non-functional requirements (performance, reliability, security) are equally critical, especially for real-time systems where game-day uptime is essential.

  3. C) 500 milliseconds - Real-time coaching dashboards should respond quickly enough that users don't perceive delay, typically under 500ms.

  4. B) Message queue / event-driven architecture - Message queues decouple producers from consumers, allowing each to scale independently.

  5. False - Production systems typically use separate storage solutions optimized for different query patterns (OLTP vs. OLAP).

  6. Sample Answer: Three critical non-functional requirements: (1) 99.9% uptime during games; (2) Response time under 500ms for dashboard queries; (3) Support for 50+ concurrent users during peak game-day loads.

  7. B) Message broker (Kafka, RabbitMQ) - Message brokers enable asynchronous, decoupled communication between services.

  8. B) Redis - Redis provides sub-millisecond latency for key-value lookups, ideal for caching frequently accessed data.

  9. True - Recruiting data is sensitive competitive information that should be restricted to personnel who need it.

  10. B) Abstract data access from business logic - The repository pattern provides a clean separation between data persistence and business logic.

  11. Sample Answer: Horizontal scaling (adding more servers) is preferred because: (1) It allows adding capacity on-demand for game-day spikes; (2) It's more cost-effective than continually upgrading individual servers; (3) It provides better fault tolerance through redundancy; (4) It enables geographic distribution for lower latency.

  12. B) Managing service discovery and dependency injection - Service registries allow services to find each other and manage dependencies.
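As a rough illustration of the repository pattern described in the answers above, the sketch below keeps a piece of business logic (averaging EPA) independent of how plays are stored. All names here (`Play`, `PlayRepository`, `InMemoryPlayRepository`, `average_epa`) are hypothetical and chosen only for this example; they are not taken from the chapter's code.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Play:
    play_id: str
    team: str
    epa: float


class PlayRepository(Protocol):
    """Data-access interface that the business logic depends on."""

    def add(self, play: Play) -> None: ...

    def plays_for_team(self, team: str) -> list[Play]: ...


class InMemoryPlayRepository:
    """One possible backend; a SQL-backed class could expose the same methods."""

    def __init__(self) -> None:
        self._plays: dict[str, Play] = {}

    def add(self, play: Play) -> None:
        self._plays[play.play_id] = play

    def plays_for_team(self, team: str) -> list[Play]:
        return [p for p in self._plays.values() if p.team == team]


def average_epa(repo: PlayRepository, team: str) -> float:
    """Business logic: knows nothing about how plays are stored."""
    plays = repo.plays_for_team(team)
    if not plays:
        return 0.0
    return sum(p.epa for p in plays) / len(plays)


# Illustrative data only.
repo = InMemoryPlayRepository()
repo.add(Play("p1", "HOME", 0.5))
repo.add(Play("p2", "HOME", -0.1))
repo.add(Play("p3", "AWAY", 1.2))
home_avg = average_epa(repo, "HOME")
```

Because `average_epa` only sees the `PlayRepository` interface, swapping the in-memory store for a database-backed one should not require changing the analytics code.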

Section 2: Data Pipeline

  13. C) Validate schema (required fields present) - Schema validation should happen first to ensure basic data structure before more complex checks.

  14. False - Pipelines should log errors and continue processing valid records; stopping for single failures would make systems fragile.

  15. B) Mechanism to slow producers when consumers can't keep up - Backpressure prevents system overload by coordinating flow rates.

  16. B) Down, distance, and field position - Expected points before a play depends on the game situation, not the play result.

  17. Sample Answer: Logging quality issues is important because: (1) Allows trend analysis to detect degrading data sources; (2) Provides context for debugging analytics anomalies; (3) Enables proactive outreach to data providers; (4) Creates an audit trail for data lineage.

  18. B) Processing the same data multiple times produces the same result - Idempotency makes retries safe; combined with at-least-once delivery, it gives effectively-once results.

  19. True - Pre-computing transformations during ingestion improves query performance at the cost of some storage.

  20. B) Measuring the completeness and accuracy of ingested data - Quality scores quantify how reliable the data is.

  21. D) Variable based on field position - EPA for a touchdown is approximately 7 minus the expected points of the situation before the play, so it depends on where the play started.

  22. B) Use watermarks and late data handling windows - Modern streaming systems use watermarks to handle out-of-order data gracefully.
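The answers on schema validation, continuing past bad records, and idempotency can be combined in one small ingestion loop. The following is a minimal sketch under stated assumptions: the field names in `REQUIRED_FIELDS`, the `ingest` function, and the dict used as a store are all hypothetical, and the quality score shown is only the share of records that pass the schema check, which is one simple way to define it.

```python
import logging

logger = logging.getLogger("ingest")

# Hypothetical required fields for a play-by-play record.
REQUIRED_FIELDS = ("play_id", "down", "distance", "yard_line")


def missing_fields(record: dict) -> list[str]:
    """Schema check: which required fields are absent or None?"""
    return [f for f in REQUIRED_FIELDS if record.get(f) is None]


def ingest(records: list[dict], store: dict) -> dict:
    """Validate each record, skip and log bad ones, upsert good ones by play_id."""
    accepted = rejected = 0
    for record in records:
        missing = missing_fields(record)
        if missing:
            # Log and keep going rather than aborting the whole batch.
            logger.warning("Skipping %r: missing %s", record.get("play_id"), missing)
            rejected += 1
            continue
        # Keyed upsert: replaying the same record overwrites itself, so the
        # store ends up in the same state (idempotent).
        store[record["play_id"]] = record
        accepted += 1
    total = accepted + rejected
    quality = accepted / total if total else 1.0
    return {"accepted": accepted, "rejected": rejected, "quality_score": quality}


# Illustrative batch: the second record lacks yard_line.
batch = [
    {"play_id": "g1-001", "down": 1, "distance": 10, "yard_line": 25},
    {"play_id": "g1-002", "down": 2, "distance": 7},
    {"play_id": "g1-003", "down": 3, "distance": 7, "yard_line": 28},
]
store: dict = {}
first = ingest(batch, store)
second = ingest(batch, store)  # replaying the batch leaves the store unchanged
```

Running the batch twice leaves two plays in `store` both times, which is the property that makes retries safe.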

Section 3: Analytics Implementation

  23. B) Score differential (adjusted for time) - Score differential, especially late in games, is the strongest predictor of win probability.

  24. True - This is the basic property of probabilities for mutually exclusive, exhaustive outcomes.

  25. Sample Answer: WPA = Win Probability after the play - Win Probability before the play. It quantifies how much a single play changed the team's likelihood of winning.

  26. C) (P(convert) × WP if convert) + (P(fail) × WP if fail) - Expected value is the probability-weighted average of all outcomes.

  27. B) The situation is 2.5x more important than average - Leverage index measures how much more impactful than average the current situation is.

  28. True - This is the definition of calibration - predicted probabilities should match observed frequencies.

  29. B) Run/pass tendencies by down and field position - Situational tendencies are crucial for game planning.

  30. B) At least 40% of the needed yards - Success rate uses different thresholds by down (40% on 1st, 60% on 2nd, 100% on 3rd/4th).

  31. B) Improve latency for repeated queries - Caching avoids redundant computations for identical inputs.

  32. True - EPA can be negative for positive-yard plays if the situation worsened (e.g., 2nd & 10 becomes 3rd & 8).
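The formulas in the answers above for WPA, fourth-down expected win probability, and success rate translate directly into a few lines of code. This is a sketch only: the function names are hypothetical, and the probabilities in the example calls are made-up illustrative inputs, not outputs of any real model.

```python
def wpa(wp_before: float, wp_after: float) -> float:
    """Win Probability Added: change in a team's win probability over one play."""
    return wp_after - wp_before


def expected_wp_go(p_convert: float, wp_if_convert: float, wp_if_fail: float) -> float:
    """Probability-weighted win probability of going for it on fourth down."""
    p_fail = 1.0 - p_convert
    return p_convert * wp_if_convert + p_fail * wp_if_fail


# Share of needed yards (in percent) that counts as a success, by down,
# using the 40% / 60% / 100% thresholds from the answer key.
REQUIRED_PCT = {1: 40, 2: 60, 3: 100, 4: 100}


def is_successful(down: int, distance: int, yards_gained: int) -> bool:
    """Success-rate rule; integer arithmetic avoids float rounding at the boundary."""
    return yards_gained * 100 >= REQUIRED_PCT[down] * distance


# Illustrative inputs only.
play_wpa = wpa(wp_before=0.48, wp_after=0.55)          # about 0.07
go_wp = expected_wp_go(0.55, 0.62, 0.41)               # about 0.5255
first_down_ok = is_successful(down=1, distance=10, yards_gained=4)
second_down_ok = is_successful(down=2, distance=6, yards_gained=3)
```

In a fourth-down decision tool, `go_wp` would then be compared against the win probability of the alternatives (punting or a field-goal attempt), which are estimated the same probability-weighted way.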

Section 4: Operations and Deployment

  33. B) docker-compose.yml - Docker Compose defines and runs multi-container applications.

  34. False - Health checks should verify all critical dependencies (database, cache, external APIs, disk space, etc.).

  35. B) Scale pods up/down based on metrics - HPA automatically adjusts replica count based on CPU, memory, or custom metrics.

  36. Sample Answer: Four metrics to monitor: (1) API response latency (avg, p95, p99); (2) Error rate percentage; (3) Database connection pool utilization; (4) Data freshness (time since last update).

  37. B) Prevent cascading failures by stopping calls to failing services - Circuit breakers allow systems to degrade gracefully.

  38. False - Logging full request bodies can expose sensitive data and create storage issues; log judiciously.

  39. B) Zero-downtime deployments by switching between two identical environments - Blue-green allows instant rollback and testing before switching traffic.

  40. B) When latency, error rate, or data freshness exceed thresholds - Proactive alerting catches issues before users notice.
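To make the circuit breaker answer above more concrete, here is a deliberately small sketch of the pattern. The class, its parameters (`failure_threshold`, `reset_timeout`), and the injectable `clock` are assumptions made for this example; production libraries typically add richer half-open handling, metrics, and thread safety.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], object]) -> object:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast without touching the struggling service.
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


# Demo with a fake clock so the example does not need to sleep.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def flaky() -> str:
    raise ConnectionError("stats feed unavailable")


outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 11.0  # pretend the reset timeout has passed
recovered = breaker.call(lambda: "ok")
```

After two real failures the third call is rejected without reaching the failing service, which is how the pattern keeps one unhealthy dependency from dragging down its callers.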


Scoring Guide

Score   Grade   Feedback
36-40   A       Excellent systems understanding, ready for production work
32-35   B       Strong grasp of concepts, review deployment topics
28-31   C       Satisfactory, focus on data pipeline design
24-27   D       Needs improvement in architecture concepts
<24     F       Re-study chapter material thoroughly