Case Study 2: Predicting Game Outcomes — Priya's Random Forest for NBA
Tier 3 — Illustrative/Composite Example: Priya is a composite character introduced in Chapter 1. The NBA statistics and analysis described here are constructed for pedagogical purposes, drawing on publicly known patterns in basketball analytics. All specific data points, model outputs, and article excerpts are fictional. No specific teams, players, or publications are represented.
The Setting
Priya has been covering the NBA for six years, and she's watched the league's analytics revolution from the press box. Teams now employ entire analytics departments. Broadcasters flash "win probability" numbers on screen in real time. Fans argue about PER, true shooting percentage, and net rating on social media.
But Priya's editor has challenged her to go beyond reporting what the analytics departments say. "Build your own model," he told her. "Predict game outcomes, track your accuracy over a month, and write about what you learn — both about basketball and about how these models actually work."
Priya completed the first four parts of this textbook last semester. She's comfortable with pandas, matplotlib, and basic modeling. Now she's putting it all together with the tools from Part V.
The Question
Priya asks three questions:
- Can publicly available pre-game statistics predict which team will win an NBA game? (The feasibility question.)
- What factors matter most for predicting wins? (The insight question — this becomes the article.)
- How accurate can a model be, and what does that tell us about basketball's inherent unpredictability? (The meta question — this is what makes it interesting to readers.)
The Data
Priya gathers game-level data for three full NBA seasons (about 3,690 regular-season games). For each game, she calculates pre-game features using only information available before tip-off:
Team-level features (computed from season-to-date stats):
- home_win_pct: Home team's winning percentage this season
- away_win_pct: Away team's winning percentage
- home_avg_pts: Home team's average points per game
- away_avg_pts: Away team's average points per game
- home_avg_pts_allowed: Home team's average points allowed
- away_avg_pts_allowed: Away team's average points allowed
- home_rest_days: Days since the home team's last game
- away_rest_days: Days since the away team's last game
- home_streak: Home team's current winning/losing streak (positive = winning)
- away_streak: Away team's current winning/losing streak
Matchup features:
- win_pct_diff: home_win_pct - away_win_pct
- net_rating_diff: (home offensive rating - home defensive rating) - (away offensive rating - away defensive rating)
Target:
- home_win: 1 if the home team won, 0 if the away team won
The historical home-team win rate is about 58% — the well-documented home-court advantage.
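Computing these season-to-date features without leaking future information is the trickiest part of the data prep. A minimal sketch of the pattern on a tiny hypothetical game log (column names are illustrative, not Priya's actual schema):

```python
import pandas as pd

# Tiny hypothetical game log, one row per team per game.
log = pd.DataFrame({
    'team': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': pd.to_datetime(['2023-10-01', '2023-10-03', '2023-10-06',
                            '2023-10-01', '2023-10-04', '2023-10-06']),
    'won':  [1, 0, 1, 0, 1, 1],
})
log = log.sort_values(['team', 'date']).reset_index(drop=True)

# shift(1) so each row sees only games played BEFORE it -- otherwise the
# feature would include the game's own result (target leakage).
log['win_pct'] = (log.groupby('team')['won']
                     .transform(lambda s: s.shift(1).expanding().mean()))
log['rest_days'] = log.groupby('team')['date'].diff().dt.days

print(log)
```

The `shift(1)` is the whole trick: every pre-game feature must be computed as if the model were standing at tip-off, knowing only the past.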
Building the Model
Attempt 1: A Single Decision Tree
Priya starts simple:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

games = pd.read_csv('nba_games_features.csv')

features = ['home_win_pct', 'away_win_pct', 'home_avg_pts',
            'away_avg_pts', 'home_rest_days', 'away_rest_days',
            'win_pct_diff', 'net_rating_diff']
X = games[features]
y = games['home_win']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

dt = DecisionTreeClassifier(max_depth=4, random_state=42)
dt.fit(X_train, y_train)
print(f"Test accuracy: {dt.score(X_test, y_test):.3f}")
```

```
Test accuracy: 0.654
```
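Before judging that number, it's worth pinning down the baseline it has to beat. "Always pick the home team" is right exactly when the home team wins, so the baseline accuracy is just the home-win rate of the test labels. A quick sketch with made-up labels (Priya's actual rate is about 58%):

```python
import numpy as np

# Hypothetical test labels: 1 = home win, 0 = away win.
y_test = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1])

# Accuracy of always predicting a home win.
baseline_acc = y_test.mean()
print(f"Always-pick-home baseline: {baseline_acc:.3f}")
```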
65.4%. Better than the 58% baseline (always picking the home team), but not by a lot. Priya prints the tree rules:
```python
print(export_text(dt, feature_names=features, max_depth=3))
```

```
|--- win_pct_diff <= -0.06
|   |--- net_rating_diff <= -3.41
|   |   |--- class: away win
|   |--- net_rating_diff > -3.41
|   |   |--- home_rest_days <= 0.50
|   |   |   |--- class: away win
|   |   |--- home_rest_days > 0.50
|   |   |   |--- class: home win
|--- win_pct_diff > -0.06
|   |--- net_rating_diff <= -5.12
|   |   |--- class: away win
|   |--- net_rating_diff > -5.12
|   |   |--- class: home win
```
The tree's story is clear: the difference in winning percentages and net ratings between the two teams is what matters most. If the home team's record isn't meaningfully worse than the visitor's (win_pct_diff > -0.06) and they aren't dramatically outmatched on net rating, predict a home win.
This is intuitive — almost too intuitive. "Good teams beat bad teams" isn't exactly breaking news.
But there's a nugget: rest days appear in the tree. When the home team is a modest underdog but has at least a full day of rest, the model still gives them a chance. That's a story Priya can write about.
Attempt 2: A Random Forest
Priya wants better accuracy, so she switches to a random forest:
```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=8,
    max_features='sqrt',
    random_state=42
)
rf.fit(X_train, y_train)
print(f"Test accuracy: {rf.score(X_test, y_test):.3f}")
```

```
Test accuracy: 0.687
```
68.7%. An improvement, but still modest. Priya is initially disappointed. Then she puts it in perspective.
The Reality Check
Priya does some research and discovers that professional sports prediction models — the ones used by Vegas oddsmakers and sophisticated analytics sites — typically achieve around 65-70% accuracy for NBA games. Her random forest, built in an afternoon with publicly available data, is in that range.
Why can't models do better? Because basketball has enormous inherent randomness. A last-second three-pointer, a key player rolling an ankle in warm-ups, a referee's controversial call — these events are unpredictable from pre-game statistics. No model, no matter how sophisticated, can predict whether a buzzer-beater will go in.
This becomes the central insight of Priya's article: the ceiling for pre-game prediction models is surprisingly low, and that's actually what makes basketball exciting.
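One way to act on that ceiling in the article is to report win probabilities rather than hard picks, using the classifier's `predict_proba` method. A self-contained sketch on synthetic data (the features and labels here are stand-ins, not Priya's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
# Synthetic stand-in for pre-game features: one informative column plus noise.
X = rng.normal(size=(600, 3))
y = (X[:, 0] + 0.7 * rng.normal(size=600) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Each row is [P(class 0), P(class 1)] -- e.g. [P(away win), P(home win)].
probs = rf.predict_proba(X[:3])
print(probs.round(2))
```

A 55%-home-win game and an 85%-home-win game produce the same hard prediction, but they tell very different stories; probabilities make the model's uncertainty part of the article rather than hiding it.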
Feature Importance: The Story
Priya extracts feature importances from the random forest:
```python
import pandas as pd
import matplotlib.pyplot as plt

importances = pd.Series(rf.feature_importances_, index=features)
importances = importances.sort_values(ascending=True)

plt.figure(figsize=(8, 5))
importances.plot(kind='barh', color='darkorange')
plt.xlabel('Feature Importance')
plt.title('What Predicts NBA Wins?')
plt.tight_layout()
plt.show()
```
The results:
| Feature | Importance |
|---|---|
| net_rating_diff | 0.28 |
| win_pct_diff | 0.22 |
| home_win_pct | 0.13 |
| away_win_pct | 0.12 |
| home_avg_pts | 0.07 |
| away_avg_pts | 0.06 |
| home_rest_days | 0.06 |
| away_rest_days | 0.06 |
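One caveat before building an article on these numbers: impurity-based importances tend to split credit between correlated features (and win_pct_diff and net_rating_diff overlap heavily). Permutation importance on held-out data is a common cross-check. A sketch on synthetic data with a deliberately duplicated feature (names and data are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x1 = rng.normal(size=800)
x2 = x1 + 0.1 * rng.normal(size=800)      # nearly a duplicate of x1
noise = rng.normal(size=800)
X = np.column_stack([x1, x2, noise])
y = (x1 + 0.5 * rng.normal(size=800) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)

# Permutation importance asks: how much does held-out accuracy drop when
# one column is shuffled? Correlated twins can mask each other here, too,
# but the measure is tied to actual predictive performance.
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
for name, imp, pi in zip(['x1', 'x2', 'noise'],
                         rf.feature_importances_, perm.importances_mean):
    print(f"{name}: impurity={imp:.2f}  permutation={pi:.2f}")
```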
The story writes itself. Priya drafts:
"When you strip away the narratives, the revenge games, and the emotional storylines that make the NBA appointment television, what actually predicts who wins? According to our model, it comes down to two things: which team has been better overall (their season record) and by how much (their net rating — the gap between how many points they score and how many they allow).
Rest days? They matter, but less than you'd think. Our model assigns them about 6% of its predictive weight — enough to tip a close game, but nowhere near as important as simply being the better team. The narrative that 'tired legs' decide games may be overstated.
The bigger story is what the model CAN'T predict. Even with a season's worth of data on both teams, the best we can do is predict the winner about 69% of the time. The other 31%? That's where basketball lives — in the chaos, the upsets, the performances that defy statistics. That 31% is why we watch."
A Deeper Dive: Stability Testing
Priya wonders: does her model's accuracy hold up across different seasons? She trains on two seasons and tests on the third, rotating which season is held out:
```python
results = []
for test_season in [2021, 2022, 2023]:
    train = games[games['season'] != test_season]
    test = games[games['season'] == test_season]
    rf_temp = RandomForestClassifier(n_estimators=300, max_depth=8,
                                     random_state=42)
    rf_temp.fit(train[features], train['home_win'])
    acc = rf_temp.score(test[features], test['home_win'])
    results.append({'test_season': test_season, 'accuracy': acc})
    print(f"Test season {test_season}: accuracy = {acc:.3f}")
```

```
Test season 2021: accuracy = 0.661
Test season 2022: accuracy = 0.693
Test season 2023: accuracy = 0.678
```
The accuracy is consistent across seasons — ranging from 66% to 69%. This tells Priya that her model isn't overfitting to a particular season and that the underlying predictive factors are stable over time. The slight variation is expected: some seasons are more predictable than others (fewer upsets, more dominant teams).
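Priya's rotation is a standard pattern, leave-one-group-out cross-validation, and scikit-learn can run the loop for her. A sketch with synthetic features and invented season labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-ins: 300 "games" with 4 features, 100 per "season".
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)
seasons = np.repeat([2021, 2022, 2023], 100)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
# One accuracy score per held-out season, same as the manual loop above.
scores = cross_val_score(rf, X, y, groups=seasons, cv=LeaveOneGroupOut())
print(scores.round(3))
```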
What Priya Learned (and What She Wrote)
Priya's final article weaves together the technical and human sides of prediction:
Technical insights:
- Net rating difference and winning percentage difference are the two most important predictors — together accounting for 50% of the model's decisions.
- Rest days have a measurable but small effect — about 6% of predictive weight.
- The model is 68-69% accurate across seasons, matching professional prediction models.
- The 31% that can't be predicted isn't a model failure — it's basketball's inherent unpredictability.

Writing insights:
- The decision tree was more useful for storytelling than the random forest. The tree's simple rules ("if the home team has a better record AND a better net rating, they'll probably win") are easy to explain in an article. The random forest's improved accuracy came at the cost of narrative clarity.
- Feature importance gave her the framework for the article's structure: lead with what matters most, acknowledge what matters less, and close with what can't be predicted.
- The model's limitations were actually the most interesting part of the article. "Here's what we can't predict" is a better story than "here's what we can predict."
Discussion Questions
- Priya's model achieves 69% accuracy. Is this good or bad? How does comparing to the baseline (58%) and to professional models (65-70%) help you evaluate this number?
- The model's top two features — net_rating_diff and win_pct_diff — are highly correlated with each other. Why might the model still benefit from having both? What problem could this correlation cause for feature importance interpretation?
- Priya excluded player-level features (injuries, individual stats) from her model. How might including them improve accuracy? What practical challenges would they introduce?
- If Priya used her model to make bets, she'd need to beat the Vegas line, not just predict winners. Why is this a much harder problem? (Hint: think about what information Vegas already incorporates.)
- Priya says the "31% that can't be predicted is why we watch." Is this just poetic writing, or does it reflect something real about the nature of prediction? Can you think of other domains where inherent unpredictability is a feature, not a bug?