Chapter 17: Quiz - Introduction to Predictive Analytics

Instructions

  • 30 questions total
  • Mix of multiple choice, true/false, and short answer
  • Estimated completion time: 45 minutes

Section 1: Fundamentals of Prediction (Questions 1-8)

Question 1 (Multiple Choice)

Predicting whether a team will win their next game is an example of:

A) Regression
B) Classification
C) Clustering
D) Dimensionality reduction


Question 2 (Multiple Choice)

Predicting the exact score of a football game is an example of:

A) Binary classification
B) Multi-class classification
C) Regression
D) Probability estimation


Question 3 (True/False)

In predictive modeling, the irreducible error ($\epsilon$) represents mistakes made by the model that could be corrected with more training data.


Question 4 (Multiple Choice)

Which of the following is NOT a valid target variable for a classification problem?

A) Win/Loss outcome
B) Defensive coverage type (Cover 2, Cover 3, etc.)
C) Yards gained on a play
D) Whether a fourth down was converted


Question 5 (Short Answer)

Explain the difference between descriptive, predictive, and prescriptive analytics. Provide a football example for each.


Question 6 (Multiple Choice)

When training a model to predict game outcomes, using the final score of the game as a feature would be an example of:

A) Feature engineering
B) Data leakage
C) Regularization
D) Cross-validation


Question 7 (True/False)

A model that achieves 100% accuracy on the training set is always a good model.


Question 8 (Multiple Choice)

What is the primary purpose of a train-test split?

A) To reduce computational time
B) To estimate how the model will perform on new, unseen data
C) To increase model accuracy
D) To reduce the number of features


Section 2: Machine Learning Workflow (Questions 9-15)

Question 9 (Multiple Choice)

In time series prediction problems like game outcome prediction, what is the correct way to split training and test data?

A) Random split
B) Stratified split by win/loss
C) Temporal split (train on past, test on future)
D) Split by team


Question 10 (Short Answer)

List three important questions you should answer during the "Problem Definition" stage of the ML workflow.


Question 11 (Multiple Choice)

Feature scaling is particularly important when using which type of model?

A) Decision Trees
B) Random Forests
C) Logistic Regression
D) All of the above


Question 12 (True/False)

In cross-validation, the model is trained and evaluated on the same data to get more stable performance estimates.


Question 13 (Multiple Choice)

Which cross-validation strategy is most appropriate for football game prediction when games are ordered by date?

A) Standard K-Fold
B) Stratified K-Fold
C) Time Series Split
D) Leave-One-Out


Question 14 (Short Answer)

Explain why you should fit the StandardScaler only on training data, not on the combined training and test data.


Question 15 (Multiple Choice)

What does a "baseline model" help you understand?

A) The maximum possible accuracy
B) The minimum acceptable performance your model should beat
C) The optimal hyperparameters
D) The best features to use


Section 3: Evaluation Metrics (Questions 16-22)

Question 16 (Multiple Choice)

For a game prediction model, if we want to minimize predicting losses as wins, we should optimize for:

A) Precision
B) Recall
C) F1 Score
D) AUC-ROC


Question 17 (True/False)

A model with 60% accuracy is always better than random guessing for a binary classification problem.


Question 18 (Multiple Choice)

The Brier Score measures:

A) Classification accuracy
B) Probability calibration
C) Feature importance
D) Model complexity


Question 19 (Short Answer)

A win prediction model has 70% accuracy, but the home team wins 62% of games in the dataset. Calculate the model's improvement over the "always predict home win" baseline.


Question 20 (Multiple Choice)

What does AUC-ROC of 0.5 indicate?

A) Perfect model
B) Model performs no better than random
C) Model is perfectly calibrated
D) Model has no false positives


Question 21 (True/False)

For probability predictions, being well-calibrated means that when the model predicts 70% win probability, the team wins approximately 70% of such games.


Question 22 (Multiple Choice)

RMSE (Root Mean Squared Error) is appropriate for evaluating:

A) Classification models
B) Regression models
C) Both classification and regression models
D) Neither


Section 4: Common Pitfalls (Questions 23-27)

Question 23 (Multiple Choice)

Which of the following is an example of temporal leakage?

A) Using 2022 games in training and 2023 games in testing
B) Using 2023 games in training and 2022 games in testing
C) Using home field advantage as a feature
D) Using team rankings from last season


Question 24 (Short Answer)

A college football team plays 12 regular season games. If you want to build a model with 20 features using only that team's games, explain why this might be problematic.


Question 25 (Multiple Choice)

Selection bias in football prediction might occur when:

A) Using only games with available tracking data
B) Analyzing only playoff teams
C) Excluding games where key players were injured
D) All of the above


Question 26 (True/False)

When predicting whether a play will result in a first down, it's acceptable to use the yards gained on that play as a feature.


Question 27 (Multiple Choice)

To avoid overfitting, which strategy would NOT be helpful?

A) Using regularization
B) Reducing the number of features
C) Collecting more training data
D) Increasing model complexity


Section 5: Practical Applications (Questions 28-30)

Question 28 (Short Answer)

Describe two ways you could measure whether a win probability model is well-calibrated.


Question 29 (Multiple Choice)

For real-time in-game predictions (like win probability), which is typically MORE important?

A) Maximum model accuracy
B) Low prediction latency
C) Using as many features as possible
D) Perfect probability calibration


Question 30 (Short Answer)

A team wants to deploy a game prediction model. List three things they should monitor after deployment to ensure the model continues to perform well.


Answer Key

Section 1: Fundamentals of Prediction

Question 1: B) Classification. Predicting a binary outcome (win/loss) is classification.

Question 2: C) Regression. Predicting a continuous value (the exact score) is regression.

Question 3: False. Irreducible error represents inherent randomness in the outcome that cannot be eliminated regardless of the model or how much data you collect. It exists because the outcome has some natural unpredictability.

Question 4: C) Yards gained on a play. Yards gained is a continuous value, so predicting it is a regression problem; the other options are categorical outcomes.

Question 5: Sample answer:
  • Descriptive analytics: summarizing what happened (e.g., "Team X averaged 4.5 yards per carry this season")
  • Predictive analytics: forecasting what will happen (e.g., "Team X has a 65% chance of winning their next game")
  • Prescriptive analytics: recommending actions (e.g., "Team X should go for it on 4th and 2 because the expected value is +1.2 points")

Question 6: B) Data leakage. Using information that wouldn't be available at prediction time is data leakage.

Question 7: False. 100% training accuracy often indicates overfitting - the model may have memorized the training data rather than learning generalizable patterns.

Question 8: B) To estimate how the model will perform on new, unseen data. The test set simulates new data to give an unbiased estimate of model performance.

Section 2: Machine Learning Workflow

Question 9: C) Temporal split (train on past, test on future). For time series data, you must train on past data and test on future data to avoid temporal leakage.
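
As an illustration, here is a minimal sketch of a temporal split, assuming a pandas DataFrame loaded from a hypothetical games file with a date column and a home_win target (all names are placeholders):

    import pandas as pd

    games = pd.read_csv("games.csv")  # hypothetical file, one row per game

    # Sort chronologically so earlier games come first
    games = games.sort_values("date").reset_index(drop=True)

    # Train on the earliest 80% of games, test on the most recent 20%
    cutoff = int(len(games) * 0.8)
    train, test = games.iloc[:cutoff], games.iloc[cutoff:]
    X_train, y_train = train.drop(columns=["home_win", "date"]), train["home_win"]
    X_test, y_test = test.drop(columns=["home_win", "date"]), test["home_win"]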

Question 10: Sample answer (any three of the following):
  1. What exactly are we trying to predict?
  2. When do we need the prediction (timing)?
  3. What decisions will this prediction inform?
  4. How accurate does the prediction need to be to be useful?
  5. What is the cost of different types of errors?

Question 11: C) Logistic Regression. Logistic regression (especially when regularized) and other models that depend on feature magnitudes or distances (SVMs, neural networks) are sensitive to feature scales. Tree-based models (Decision Trees, Random Forests) are not.
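
One common way to handle this is to wrap the scaler and the model in a scikit-learn pipeline, so scaling is always fit only on the data the model is trained on (the X_train/X_test names below are placeholders from a temporal split):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # The scaler is fit inside the pipeline, so test-set statistics never leak into training
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))  # accuracy on held-out games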

Question 12: False. In cross-validation, the model is trained on different folds and evaluated on held-out data in each iteration. The model never trains and evaluates on the same data in a single fold.

Question 13: C) Time Series Split. Time Series Split ensures that training data always comes before test data chronologically, preventing temporal leakage.
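
A minimal sketch with scikit-learn's TimeSeriesSplit, assuming the games (and therefore X and y, placeholder names) are already sorted by date and `model` is any classifier such as the pipeline above:

    from sklearn.model_selection import TimeSeriesSplit, cross_val_score

    # Each fold trains on earlier games and validates on the games that follow them
    tscv = TimeSeriesSplit(n_splits=5)
    scores = cross_val_score(model, X, y, cv=tscv, scoring="accuracy")
    print(scores.mean())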

Question 14: Sample answer: Fitting the scaler on combined data causes data leakage - the scaler would use information from the test set (mean and standard deviation) during training. In production, you won't have access to future data when scaling new observations, so training should simulate this.
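
Sketch of the correct pattern, with placeholder train/test arrays:

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
    X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test set
    # Leaky version to avoid: fitting the scaler on training and test data combined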

Question 15: B) The minimum acceptable performance your model should beat. Baselines establish a floor - if your model can't beat simple approaches like predicting the majority class, it's not useful.
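
For example, scikit-learn's DummyClassifier provides a quick majority-class baseline (placeholder data names again):

    from sklearn.dummy import DummyClassifier

    # "Always predict the most common class" (e.g., the home team wins)
    baseline = DummyClassifier(strategy="most_frequent")
    baseline.fit(X_train, y_train)
    print(baseline.score(X_test, y_test))  # your real model should beat this accuracy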

Section 3: Evaluation Metrics

Question 16: A) Precision. Precision measures what fraction of predicted wins are actually wins. High precision minimizes false positives (predicting a win when the actual result is a loss).
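
As a quick illustration with scikit-learn, assuming hypothetical arrays of true labels and predicted labels:

    from sklearn.metrics import precision_score, recall_score

    # precision: of the games predicted as wins, how many were actually wins
    # recall: of the actual wins, how many the model identified
    print(precision_score(y_test, y_pred))
    print(recall_score(y_test, y_pred))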

Question 17: False. If one class comprises more than 50% of the data, predicting that class every time would exceed 50% accuracy. A 60% accurate model might still be worse than this simple baseline.

Question 18: B) Probability calibration. The Brier Score measures the mean squared difference between predicted probabilities and actual outcomes, assessing how well calibrated the probabilities are.
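
A minimal sketch, assuming `y_test` holds the 0/1 outcomes and `y_prob` holds the predicted win probabilities (placeholder names):

    from sklearn.metrics import brier_score_loss

    # Mean squared difference between predicted probability and the 0/1 outcome;
    # 0.0 is perfect, and always predicting 0.5 scores 0.25
    print(brier_score_loss(y_test, y_prob))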

Question 19: Sample answer:
  Model accuracy: 70%
  Baseline accuracy (always predict home win): 62%
  Absolute improvement: 70% - 62% = 8 percentage points
  Relative improvement: 8% / (100% - 62%) = 8% / 38% ≈ 21.1% reduction in the baseline's error

Question 20: B) Model performs no better than random. An AUC-ROC of 0.5 indicates the model cannot distinguish between classes better than random guessing; a perfect model scores 1.0.
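
For reference, it can be computed from the same placeholder outcome and probability arrays used above:

    from sklearn.metrics import roc_auc_score

    # 0.5 = no better than random ranking of games; 1.0 = wins and losses perfectly separated
    print(roc_auc_score(y_test, y_prob))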

Question 21: True. This is the definition of probability calibration - predicted probabilities should match observed frequencies.

Question 22: B) Regression models. RMSE measures the magnitude of prediction errors for continuous outcomes, making it appropriate for regression but not classification.
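
A minimal sketch for a score-prediction model, with hypothetical arrays of actual and predicted point totals:

    import numpy as np
    from sklearn.metrics import mean_squared_error

    # RMSE is in the same units as the target (points), which makes it easy to interpret
    rmse = np.sqrt(mean_squared_error(y_true_points, y_pred_points))
    print(rmse)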

Section 4: Common Pitfalls

Question 23: B) Using 2023 games in training and 2022 games in testing. This is temporal leakage - the model is trained on data from the future relative to the games it is tested on.

Question 24: Sample answer: With only 12 games and 20 features, you have just 0.6 observations per feature, while a common rule of thumb is at least 10-20 observations per feature. With so little data there is a severe risk of overfitting - the model can easily find patterns that fit the training data but don't generalize. Solutions include drastically reducing the number of features (to one or two) or aggregating data across multiple seasons and teams.

Question 25: D) All of the above. All three examples could cause selection bias by analyzing a non-representative subset of games.

Question 26: False. This is data leakage - yards gained determines the outcome, so using it as a feature means relying on information that wouldn't be available at prediction time.

Question 27: D) Increasing model complexity. Increasing complexity typically makes overfitting worse, not better; the other strategies all help prevent overfitting.

Section 5: Practical Applications

Question 28: Sample answer (any two of the following; a code sketch follows the list):
  1. Calibration curve: plot predicted probability against actual win frequency. A well-calibrated model shows points close to the diagonal line.
  2. Expected Calibration Error (ECE): the weighted average difference between predicted probability and actual frequency across probability bins. Lower is better.
  3. Brier Score: the mean squared error of the probability predictions. Lower indicates better calibration.
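
A minimal sketch of the first two checks, assuming 0/1 outcomes in `y_test` and predicted probabilities in `y_prob` (placeholder names); for simplicity this calibration-error summary is unweighted rather than weighted by bin counts:

    import numpy as np
    from sklearn.calibration import calibration_curve

    # Observed win rate vs. mean predicted probability in each of 10 probability bins
    prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)

    # Simple (unweighted) average gap between predicted and observed win rates
    print(np.mean(np.abs(prob_true - prob_pred)))  # closer to 0 means better calibrated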

Question 29: B) Low prediction latency. For real-time applications, predictions must be delivered quickly enough to be useful; a slightly less accurate model that responds in milliseconds is more valuable than a perfect model that takes seconds.

Question 30: Sample answer (any three of the following; a monitoring sketch follows the list):
  1. Prediction accuracy over time (rolling accuracy)
  2. Calibration drift (are the predicted probabilities still accurate?)
  3. Feature distribution changes (are input patterns shifting?)
  4. Prediction latency (is response time still acceptable?)
  5. Error patterns (are certain game types being predicted poorly?)
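
One possible sketch of the first item, a rolling-accuracy check over a hypothetical prediction log with `predicted_win` and `actual_win` columns (all names are placeholders):

    import pandas as pd

    log = pd.read_csv("prediction_log.csv")  # hypothetical log, one row per predicted game
    log["correct"] = (log["predicted_win"] == log["actual_win"]).astype(int)

    # Accuracy over the most recent 50 games; a sustained drop suggests model drift
    log["rolling_accuracy"] = log["correct"].rolling(window=50).mean()
    print(log["rolling_accuracy"].tail())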


Scoring Guide

Score   Grade   Interpretation
27-30   A       Excellent understanding of predictive analytics
24-26   B       Good grasp of concepts with minor gaps
21-23   C       Adequate understanding, review recommended
18-20   D       Significant gaps in understanding
<18     F       Review chapter material thoroughly

Topics to Review Based on Incorrect Answers

  • Questions 1-8 wrong: Review "Fundamentals of Prediction" section
  • Questions 9-15 wrong: Review "Machine Learning Workflow" section
  • Questions 16-22 wrong: Review "Evaluation Metrics" section
  • Questions 23-27 wrong: Review "Common Pitfalls" section
  • Questions 28-30 wrong: Review "Practical Applications" section