Case Study 1: Combining Player Stats and Team Records — Priya Merges NBA Datasets

Contributors to Introduction to Data Science

Case Study 1: Combining Player Stats and Team Records — Priya Merges NBA Datasets

Tier 3 — Illustrative/Composite Example: This case study uses a fictional but realistic NBA dataset constructed for pedagogical purposes. The player names, team names, specific statistics, and all numerical values are invented. No specific real players or teams are represented. The analytical workflow mirrors what a sports journalist would actually perform.

The Setting

Priya is four months into her data science journey, and she's finally working on the question that got her interested in data science in the first place: Has the rise of three-point shooting fundamentally changed which teams win in the NBA?

Her editor at the campus sports magazine wants a data-driven article for the upcoming March issue. The pitch: compare teams that live and die by the three-pointer against teams that rely on traditional inside scoring, and see which style produces more wins. The editor wants numbers, not opinions.

Priya has been collecting data all semester. She now has two CSV files:

File 1: player_stats.csv — Per-game statistics for 150 players across 30 teams.

Column	Description	Example
`player_name`	Player's name	"DeShawn Williams"
`team_id`	Three-letter team code	"ATL"
`position`	Playing position	"SG" (shooting guard)
`games_played`	Games played this season	72
`pts_per_game`	Points per game	18.4
`three_pt_attempts`	Three-point attempts per game	6.2
`three_pt_pct`	Three-point shooting percentage	0.381
`two_pt_attempts`	Two-point attempts per game	8.1

File 2: team_records.csv — Season records for all 30 teams.

Column	Description	Example
`team_code`	Three-letter team code	"ATL"
`team_name`	Full team name	"Atlanta Hawks"
`wins`	Wins this season	41
`losses`	Losses this season	41
`conference`	Conference	"East"

Priya notices something immediately: the key column is called team_id in the player stats but team_code in the team records. She'll need to handle that in the merge.

Step 1: Loading and Inspecting

Priya loads both files and does her standard inspection:

import pandas as pd

players = pd.read_csv('player_stats.csv')
teams = pd.read_csv('team_records.csv')

print(f"Players: {players.shape[0]} rows, {players.shape[1]} columns")
print(f"Teams: {teams.shape[0]} rows, {teams.shape[1]} columns")
print(f"\nPlayer team_ids: {players['team_id'].nunique()} unique")
print(f"Team codes: {teams['team_code'].nunique()} unique")

Players: 150 rows, 8 columns
Teams: 30 rows, 5 columns

Player team_ids: 30 unique
Team codes: 30 unique

Good — 30 unique teams on each side. But Priya knows to check alignment, not just counts:

# Are all player team_ids present in the team records?
missing = set(players['team_id']) - set(teams['team_code'])
print(f"Player teams not in team records: {missing}")

Player teams not in team records: set()

No mismatches. She also checks data types:

print(f"player team_id dtype: {players['team_id'].dtype}")
print(f"team team_code dtype: {teams['team_code'].dtype}")

player team_id dtype: object
team team_code dtype: object

Both strings. She's clear to merge.

Step 2: The Merge

Priya wants every player to have their team's win-loss record attached. She uses a left join with different key column names:

merged = pd.merge(
    players,
    teams,
    left_on='team_id',
    right_on='team_code',
    how='left',
    validate='many_to_one'  # many players per team, one team record
)

print(f"Rows after merge: {merged.shape[0]}")
print(f"Columns after merge: {merged.shape[1]}")
print(f"Any NaN in wins? {merged['wins'].isna().sum()}")

Rows after merge: 150
Columns after merge: 12
Any NaN in wins? 0

150 rows in, 150 rows out. The validate='many_to_one' didn't complain, confirming that each team has exactly one record. No NaN values in the wins column means every player successfully matched to a team.

Priya cleans up the redundant column:

merged = merged.drop(columns='team_code')

Step 3: Creating the Analysis Metric

Priya's hypothesis is about three-point reliance. She creates a metric she calls "three-point dependency" — the fraction of a player's shot attempts that are three-pointers:

merged['three_pt_dependency'] = (
    merged['three_pt_attempts'] /
    (merged['three_pt_attempts'] + merged['two_pt_attempts'])
)

print(merged[['player_name', 'team_id', 'three_pt_dependency']].head())

       player_name team_id  three_pt_dependency
0  DeShawn Williams     ATL             0.433
1     Marcus Torres     ATL             0.512
2      Jamal Harper     ATL             0.287
3    Kevin O'Donnell     BOS             0.601
4      Tyler Nguyen     BOS             0.489

Step 4: GroupBy to Characterize Teams

Now she uses groupby to compute team-level statistics:

team_profile = (
    merged
    .groupby('team_id')
    .agg(
        avg_three_pt_dep=('three_pt_dependency', 'mean'),
        avg_three_pt_pct=('three_pt_pct', 'mean'),
        avg_pts=('pts_per_game', 'mean'),
        roster_size=('player_name', 'count'),
        wins=('wins', 'first'),       # same for all players on a team
        losses=('losses', 'first')
    )
    .reset_index()
)

team_profile['win_pct'] = team_profile['wins'] / (team_profile['wins'] + team_profile['losses'])

print(team_profile[['team_id', 'avg_three_pt_dep', 'win_pct']].head(10))

  team_id  avg_three_pt_dep  win_pct
0     ATL             0.411    0.500
1     BOS             0.523    0.695
2     BRK             0.478    0.402
3     CHI             0.389    0.488
4     CLE             0.461    0.622
5     DAL             0.501    0.610
6     DEN             0.432    0.659
7     DET             0.356    0.317
8     GSW             0.548    0.549
9     HOU             0.515    0.512

Priya notices something interesting: teams with higher three-point dependency tend to have better winning percentages, but the relationship isn't perfect. Detroit has the lowest three-point dependency (0.356) and the worst record (0.317), while Boston has the highest (0.523) and one of the best records (0.695).

Step 5: Categorizing Teams

To make the comparison clearer, Priya categorizes teams into "high three-point" and "low three-point" groups based on the median dependency:

median_dep = team_profile['avg_three_pt_dep'].median()
print(f"Median three-point dependency: {median_dep:.3f}")

team_profile['shooting_style'] = (
    team_profile['avg_three_pt_dep']
    .apply(lambda x: 'Three-Point Heavy' if x >= median_dep else 'Traditional')
)

style_comparison = (
    team_profile
    .groupby('shooting_style')
    .agg(
        avg_win_pct=('win_pct', 'mean'),
        avg_pts_per_game=('avg_pts', 'mean'),
        num_teams=('team_id', 'count')
    )
    .round(3)
)

print(style_comparison)

                   avg_win_pct  avg_pts_per_game  num_teams
shooting_style
Three-Point Heavy        0.567            22.314         15
Traditional              0.433            20.876         15

There it is. Teams with above-median three-point dependency won 56.7% of their games on average, while traditional teams won 43.3%. The three-point-heavy teams also scored more points per game.

Step 6: The Transform Insight

Priya wants one more piece of analysis: for each player, how does their three-point dependency compare to their team's average? This will show whether the team's style is driven by a few specialists or is team-wide:

merged['team_avg_dep'] = (
    merged
    .groupby('team_id')['three_pt_dependency']
    .transform('mean')
)

merged['dep_vs_team'] = merged['three_pt_dependency'] - merged['team_avg_dep']

# Find the biggest outliers — players who deviate most from their team's style
outliers = merged.nlargest(5, 'dep_vs_team')[['player_name', 'team_id', 'three_pt_dependency', 'team_avg_dep', 'dep_vs_team']]
print(outliers)

       player_name team_id  three_pt_dependency  team_avg_dep  dep_vs_team
23     Tony Ramirez     DET                0.621         0.356        0.265
87    Chris Okafor      MEM                0.589         0.372        0.217
112   Andre Mitchell    ORL                0.602         0.401        0.201
5     Marcus Torres     ATL                0.512         0.411        0.101
45    David Kim         CLE                0.555         0.461        0.094

Interesting — Tony Ramirez on Detroit has a three-point dependency of 0.621 while his team averages only 0.356. He's a three-point specialist on a team that otherwise plays a traditional style. This is exactly the kind of individual story that makes a sports article compelling.

The Findings

Priya writes up her article with three key findings:

Teams that rely more heavily on three-point shooting won more games — 56.7% win rate vs. 43.3% for traditional teams. The difference is substantial (13.4 percentage points).
The relationship isn't universal. Denver had a relatively low three-point dependency (0.432) but the third-best record in the league (0.659), driven by exceptional two-point efficiency and defense.
Three-point dependency is mostly a team-wide strategy, not a specialist effect. Most players on high-three-point teams shoot more threes. The outliers — like Tony Ramirez shooting 62% threes on a team that averages 36% — are rare.

The Data Science Takeaway

Priya's analysis used every major technique from Chapter 9:

Merge (left join with different key names) to combine player stats with team records
GroupBy with named aggregation to compute team-level profiles
Transform to add team averages to individual player rows
Method chaining to build readable analysis pipelines

But the most important skill wasn't technical — it was knowing which join type to use (many-to-one: many players, one team record), checking for key alignment before merging, and using validate to catch unexpected duplicates. The merge itself took one line of code. The thinking that preceded it took five minutes of careful inspection.

Priya also learned something about the limits of her analysis. Correlation isn't causation — she can't say three-point shooting causes winning. Teams with more talent might both shoot more threes and win more games. And her sample is just one season. Still, for a campus sports article, the data tells a compelling and honest story. Her editor publishes it with the headline: "The Numbers Don't Lie: Three-Point Teams Are Winning Teams."