Case Study 1: Combining Player Stats and Team Records — Priya Merges NBA Datasets
Tier 3 — Illustrative/Composite Example: This case study uses a fictional but realistic NBA dataset constructed for pedagogical purposes. The player names, team names, specific statistics, and all numerical values are invented. No specific real players or teams are represented. The analytical workflow mirrors what a sports journalist would actually perform.
The Setting
Priya is four months into her data science journey, and she's finally working on the question that got her interested in data science in the first place: Has the rise of three-point shooting fundamentally changed which teams win in the NBA?
Her editor at the campus sports magazine wants a data-driven article for the upcoming March issue. The pitch: compare teams that live and die by the three-pointer against teams that rely on traditional inside scoring, and see which style produces more wins. The editor wants numbers, not opinions.
Priya has been collecting data all semester. She now has two CSV files:
File 1: player_stats.csv — Per-game statistics for 150 players across 30 teams.
| Column | Description | Example |
|---|---|---|
player_name |
Player's name | "DeShawn Williams" |
team_id |
Three-letter team code | "ATL" |
position |
Playing position | "SG" (shooting guard) |
games_played |
Games played this season | 72 |
pts_per_game |
Points per game | 18.4 |
three_pt_attempts |
Three-point attempts per game | 6.2 |
three_pt_pct |
Three-point shooting percentage | 0.381 |
two_pt_attempts |
Two-point attempts per game | 8.1 |
File 2: team_records.csv — Season records for all 30 teams.
| Column | Description | Example |
|---|---|---|
team_code |
Three-letter team code | "ATL" |
team_name |
Full team name | "Atlanta Hawks" |
wins |
Wins this season | 41 |
losses |
Losses this season | 41 |
conference |
Conference | "East" |
Priya notices something immediately: the key column is called team_id in the player stats but team_code in the team records. She'll need to handle that in the merge.
Step 1: Loading and Inspecting
Priya loads both files and does her standard inspection:
import pandas as pd
players = pd.read_csv('player_stats.csv')
teams = pd.read_csv('team_records.csv')
print(f"Players: {players.shape[0]} rows, {players.shape[1]} columns")
print(f"Teams: {teams.shape[0]} rows, {teams.shape[1]} columns")
print(f"\nPlayer team_ids: {players['team_id'].nunique()} unique")
print(f"Team codes: {teams['team_code'].nunique()} unique")
Players: 150 rows, 8 columns
Teams: 30 rows, 5 columns
Player team_ids: 30 unique
Team codes: 30 unique
Good — 30 unique teams on each side. But Priya knows to check alignment, not just counts:
# Are all player team_ids present in the team records?
missing = set(players['team_id']) - set(teams['team_code'])
print(f"Player teams not in team records: {missing}")
Player teams not in team records: set()
No mismatches. She also checks data types:
print(f"player team_id dtype: {players['team_id'].dtype}")
print(f"team team_code dtype: {teams['team_code'].dtype}")
player team_id dtype: object
team team_code dtype: object
Both strings. She's clear to merge.
Step 2: The Merge
Priya wants every player to have their team's win-loss record attached. She uses a left join with different key column names:
merged = pd.merge(
players,
teams,
left_on='team_id',
right_on='team_code',
how='left',
validate='many_to_one' # many players per team, one team record
)
print(f"Rows after merge: {merged.shape[0]}")
print(f"Columns after merge: {merged.shape[1]}")
print(f"Any NaN in wins? {merged['wins'].isna().sum()}")
Rows after merge: 150
Columns after merge: 12
Any NaN in wins? 0
150 rows in, 150 rows out. The validate='many_to_one' didn't complain, confirming that each team has exactly one record. No NaN values in the wins column means every player successfully matched to a team.
Priya cleans up the redundant column:
merged = merged.drop(columns='team_code')
Step 3: Creating the Analysis Metric
Priya's hypothesis is about three-point reliance. She creates a metric she calls "three-point dependency" — the fraction of a player's shot attempts that are three-pointers:
merged['three_pt_dependency'] = (
merged['three_pt_attempts'] /
(merged['three_pt_attempts'] + merged['two_pt_attempts'])
)
print(merged[['player_name', 'team_id', 'three_pt_dependency']].head())
player_name team_id three_pt_dependency
0 DeShawn Williams ATL 0.433
1 Marcus Torres ATL 0.512
2 Jamal Harper ATL 0.287
3 Kevin O'Donnell BOS 0.601
4 Tyler Nguyen BOS 0.489
Step 4: GroupBy to Characterize Teams
Now she uses groupby to compute team-level statistics:
team_profile = (
merged
.groupby('team_id')
.agg(
avg_three_pt_dep=('three_pt_dependency', 'mean'),
avg_three_pt_pct=('three_pt_pct', 'mean'),
avg_pts=('pts_per_game', 'mean'),
roster_size=('player_name', 'count'),
wins=('wins', 'first'), # same for all players on a team
losses=('losses', 'first')
)
.reset_index()
)
team_profile['win_pct'] = team_profile['wins'] / (team_profile['wins'] + team_profile['losses'])
print(team_profile[['team_id', 'avg_three_pt_dep', 'win_pct']].head(10))
team_id avg_three_pt_dep win_pct
0 ATL 0.411 0.500
1 BOS 0.523 0.695
2 BRK 0.478 0.402
3 CHI 0.389 0.488
4 CLE 0.461 0.622
5 DAL 0.501 0.610
6 DEN 0.432 0.659
7 DET 0.356 0.317
8 GSW 0.548 0.549
9 HOU 0.515 0.512
Priya notices something interesting: teams with higher three-point dependency tend to have better winning percentages, but the relationship isn't perfect. Detroit has the lowest three-point dependency (0.356) and the worst record (0.317), while Boston has the highest (0.523) and one of the best records (0.695).
Step 5: Categorizing Teams
To make the comparison clearer, Priya categorizes teams into "high three-point" and "low three-point" groups based on the median dependency:
median_dep = team_profile['avg_three_pt_dep'].median()
print(f"Median three-point dependency: {median_dep:.3f}")
team_profile['shooting_style'] = (
team_profile['avg_three_pt_dep']
.apply(lambda x: 'Three-Point Heavy' if x >= median_dep else 'Traditional')
)
style_comparison = (
team_profile
.groupby('shooting_style')
.agg(
avg_win_pct=('win_pct', 'mean'),
avg_pts_per_game=('avg_pts', 'mean'),
num_teams=('team_id', 'count')
)
.round(3)
)
print(style_comparison)
avg_win_pct avg_pts_per_game num_teams
shooting_style
Three-Point Heavy 0.567 22.314 15
Traditional 0.433 20.876 15
There it is. Teams with above-median three-point dependency won 56.7% of their games on average, while traditional teams won 43.3%. The three-point-heavy teams also scored more points per game.
Step 6: The Transform Insight
Priya wants one more piece of analysis: for each player, how does their three-point dependency compare to their team's average? This will show whether the team's style is driven by a few specialists or is team-wide:
merged['team_avg_dep'] = (
merged
.groupby('team_id')['three_pt_dependency']
.transform('mean')
)
merged['dep_vs_team'] = merged['three_pt_dependency'] - merged['team_avg_dep']
# Find the biggest outliers — players who deviate most from their team's style
outliers = merged.nlargest(5, 'dep_vs_team')[['player_name', 'team_id', 'three_pt_dependency', 'team_avg_dep', 'dep_vs_team']]
print(outliers)
player_name team_id three_pt_dependency team_avg_dep dep_vs_team
23 Tony Ramirez DET 0.621 0.356 0.265
87 Chris Okafor MEM 0.589 0.372 0.217
112 Andre Mitchell ORL 0.602 0.401 0.201
5 Marcus Torres ATL 0.512 0.411 0.101
45 David Kim CLE 0.555 0.461 0.094
Interesting — Tony Ramirez on Detroit has a three-point dependency of 0.621 while his team averages only 0.356. He's a three-point specialist on a team that otherwise plays a traditional style. This is exactly the kind of individual story that makes a sports article compelling.
The Findings
Priya writes up her article with three key findings:
-
Teams that rely more heavily on three-point shooting won more games — 56.7% win rate vs. 43.3% for traditional teams. The difference is substantial (13.4 percentage points).
-
The relationship isn't universal. Denver had a relatively low three-point dependency (0.432) but the third-best record in the league (0.659), driven by exceptional two-point efficiency and defense.
-
Three-point dependency is mostly a team-wide strategy, not a specialist effect. Most players on high-three-point teams shoot more threes. The outliers — like Tony Ramirez shooting 62% threes on a team that averages 36% — are rare.
The Data Science Takeaway
Priya's analysis used every major technique from Chapter 9:
- Merge (left join with different key names) to combine player stats with team records
- GroupBy with named aggregation to compute team-level profiles
- Transform to add team averages to individual player rows
- Method chaining to build readable analysis pipelines
But the most important skill wasn't technical — it was knowing which join type to use (many-to-one: many players, one team record), checking for key alignment before merging, and using validate to catch unexpected duplicates. The merge itself took one line of code. The thinking that preceded it took five minutes of careful inspection.
Priya also learned something about the limits of her analysis. Correlation isn't causation — she can't say three-point shooting causes winning. Teams with more talent might both shoot more threes and win more games. And her sample is just one season. Still, for a campus sports article, the data tells a compelling and honest story. Her editor publishes it with the headline: "The Numbers Don't Lie: Three-Point Teams Are Winning Teams."