Case Study 2: Finding the Story in NBA Statistics with seaborn

Contributors to Introduction to Data Science

Tier 3 — Illustrative/Composite Example: This case study follows Jordan, a sports journalism student, as he uses seaborn to explore NBA player statistics. All player statistics, team names used generically, and specific numerical values are fictional composites inspired by typical patterns in professional basketball data. The dataset structure mirrors what basketball-reference.com provides, but no real player data is quoted. The analytical workflow and visualization choices are representative of real sports analytics.

The Setting

Jordan is a junior at a university journalism program with a minor in data science. His editor at the campus newspaper wants a data-driven feature article: "What makes an NBA player worth their salary?" Jordan has a CSV file of player statistics from a recent season — 450 players, 22 columns — and two days to produce five charts with written analysis.

Jordan knows matplotlib from Chapter 15, but his editor wants charts that look polished without a graphic designer's intervention. He decides to use seaborn.

The Data

Jordan's dataset, nba_players_2024.csv, has the following key columns:

Column	Description	Example
`player`	Player name	"J. Williams"
`team`	Team abbreviation	"LAL"
`position`	Position (PG, SG, SF, PF, C)	"SF"
`age`	Player age	27
`games_played`	Games played	72
`minutes_per_game`	Minutes per game	34.5
`points_per_game`	Points per game	22.3
`rebounds_per_game`	Rebounds per game	7.1
`assists_per_game`	Assists per game	4.8
`fg_pct`	Field goal percentage	0.485
`three_pct`	Three-point percentage	0.372
`salary_millions`	Annual salary in millions USD	28.5
`experience_years`	Years in the league	6
`player_efficiency`	Player efficiency rating (PER)	21.4

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid", palette="muted",
              context="notebook")

df = pd.read_csv("nba_players_2024.csv")
print(f"{df.shape[0]} players, {df.shape[1]} columns")
df.head()

Chart 1: What Does the Salary Distribution Look Like?

Jordan starts with the most obvious question — how are salaries distributed?

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.histplot(df["salary_millions"], bins=30,
             kde=True, ax=axes[0])
axes[0].set_title("Salary Distribution")
axes[0].set_xlabel("Salary (millions USD)")

sns.boxplot(data=df, x="position",
            y="salary_millions",
            order=["PG", "SG", "SF", "PF", "C"],
            ax=axes[1])
axes[1].set_title("Salary by Position")
axes[1].set_xlabel("Position")
axes[1].set_ylabel("Salary (millions USD)")

plt.tight_layout()

The histogram reveals a heavily right-skewed distribution. The vast majority of players earn between 1 and 10 million dollars, with a long tail extending past 40 million. The KDE overlay emphasizes this shape: the peak sits at the low end, and the curve falls off steeply at first, then flattens into a long, gradual tail.

The box plot by position shows that all positions have similar median salaries (around 5-8 million) but different spreads. Centers and power forwards have slightly lower medians than guards and small forwards, and small forwards have the highest outliers (the superstar max contracts).

Jordan writes in his draft: "NBA salaries are deeply unequal. The median player earns about $6 million, but the top earners make seven times that. Position matters less than you'd think — the real dividing line is between stars and everyone else."
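The numbers in a sentence like Jordan's can be checked directly rather than read off the chart. The following is a minimal sketch, assuming a log-normal sample as a stand-in for `df["salary_millions"]` (the real file is not reproduced here, so the printed values are illustrative only). It compares the median with the mean and with a high percentile, and reports the skewness.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df["salary_millions"]: a log-normal sample,
# chosen only because it is right-skewed like the histogram described above.
rng = np.random.default_rng(0)
salary = pd.Series(rng.lognormal(mean=1.8, sigma=0.9, size=450),
                   name="salary_millions")

median = salary.median()
mean = salary.mean()
top = salary.quantile(0.99)  # a robust proxy for "top earners"

print(f"median: {median:.1f}M, mean: {mean:.1f}M")
print(f"99th percentile / median: {top / median:.1f}x")
print(f"skewness: {salary.skew():.2f}")  # > 0 indicates a long right tail
```

In a right-skewed distribution the mean sits above the median, which is one reason a journalist would usually quote the median as the "typical" salary.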

Chart 2: Points vs. Salary — Does Scoring Pay?

Jordan's editor specifically asked about the relationship between performance and pay:

g = sns.lmplot(data=df, x="points_per_game",
               y="salary_millions",
               hue="position", height=6, aspect=1.3,
               scatter_kws={"alpha": 0.6, "s": 30})
# Label through the returned FacetGrid rather than pyplot's current axes
g.set_axis_labels("Points per Game",
                  "Salary (millions USD)")

The scatter plot shows a positive relationship, but it is not as clean as Jordan expected. There is a clear upward trend — players who score more earn more — but the scatter is wide. Several players score 15+ points per game but earn under 5 million (young players on rookie contracts), and some players who score under 10 points per game earn over 20 million (aging veterans on old contracts).

The regression lines by position reveal something interesting: the slope is steepest for guards (PG, SG) and shallowest for centers. A center who scores 15 points per game earns roughly the same as one who scores 10, but a point guard who scores 15 earns significantly more than one who scores 10. Scoring is valued differently by position.
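The claim about slopes can be read off as numbers instead of judged by eye. Here is a rough sketch that fits one straight line per position with `np.polyfit`. The data is synthetic, and the `true_slope` values are invented for illustration, so only the method carries over to the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: the per-position slopes below are made up for illustration.
rng = np.random.default_rng(1)
true_slope = {"PG": 1.4, "SG": 1.3, "SF": 1.0, "PF": 0.8, "C": 0.5}
frames = []
for pos, slope in true_slope.items():
    ppg = rng.uniform(2, 30, size=90)
    salary = 5.0 + slope * ppg + rng.normal(0, 1.5, size=90)
    frames.append(pd.DataFrame({"position": pos,
                                "points_per_game": ppg,
                                "salary_millions": salary}))
demo = pd.concat(frames, ignore_index=True)

def fit_slope(group):
    # np.polyfit with deg=1 returns [slope, intercept]
    slope, _intercept = np.polyfit(group["points_per_game"],
                                   group["salary_millions"], deg=1)
    return slope

# One fitted slope per position: millions of USD per extra point per game
slopes = pd.Series({pos: fit_slope(group)
                    for pos, group in demo.groupby("position")})
slopes = slopes.sort_values(ascending=False)
print(slopes.round(2))
```

On the real data, swapping `demo` for `df` would give the slope behind each colored line in the lmplot.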

Jordan highlights a cluster of dots in the upper-left: high salary, low scoring. He makes a mental note to investigate — these might be defensive specialists, injured players, or veterans on legacy contracts.
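Following up on that mental note could start with a boolean filter. The sketch below uses a tiny invented table. The thresholds (under 10 points per game, over 20 million) echo the ranges mentioned above, but they are assumptions that an analyst would tune.

```python
import pandas as pd

# Tiny hypothetical table; names and numbers are invented.
demo = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E"],
    "points_per_game": [6.2, 24.1, 8.9, 17.5, 9.4],
    "salary_millions": [22.0, 35.0, 3.1, 4.2, 25.5],
    "age": [33, 28, 22, 21, 34],
    "games_played": [41, 75, 60, 78, 70],
})

# Assumed thresholds: "low scoring" is under 10 PPG, "high salary" is over 20M.
upper_left = demo[(demo["points_per_game"] < 10)
                  & (demo["salary_millions"] > 20)]

# Age and games played hint at veterans or injured players.
print(upper_left.sort_values("salary_millions", ascending=False))
```

Listing age and games played next to the filtered rows is a quick way to separate legacy contracts from injury-shortened seasons.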

Chart 3: The Complete Picture — Pair Plot

Before diving deeper, Jordan creates a pair plot of the key performance metrics:

cols = ["points_per_game", "rebounds_per_game",
        "assists_per_game", "player_efficiency",
        "salary_millions", "position"]

sns.pairplot(df[cols], hue="position",
             diag_kind="kde",
             plot_kws={"alpha": 0.4, "s": 15},
             height=2)

The 5x5 grid is dense, but patterns emerge:

Player efficiency (PER) is the strongest predictor of salary — the PER-salary scatter plot shows the tightest positive correlation of any pair.
Positions cluster differently on different axes. Centers dominate rebounds; point guards dominate assists. But on PER and salary, positions overlap heavily.
Assists and rebounds are negatively correlated — players who grab many rebounds tend not to dish many assists, and vice versa. This is a position effect (centers rebound, guards assist).
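"Tightest positive correlation" is a visual judgment that a correlation table can back up. This is a minimal sketch on synthetic data, where the coefficients that tie each metric to PER are invented. It ranks the metrics by their Pearson correlation with salary.

```python
import numpy as np
import pandas as pd

# Synthetic data: the relationships below are assumptions for illustration.
rng = np.random.default_rng(2)
n = 450
per = rng.normal(15, 5, size=n)
demo = pd.DataFrame({
    "player_efficiency": per,
    "points_per_game": 0.8 * per + rng.normal(0, 5, size=n),
    "rebounds_per_game": 0.2 * per + rng.normal(0, 2.5, size=n),
    "salary_millions": 1.5 * per + rng.normal(0, 4, size=n),
})

# Pearson correlation of every metric with salary, strongest first
corr_with_salary = (demo.corr()["salary_millions"]
                    .drop("salary_millions")
                    .sort_values(ascending=False))
print(corr_with_salary.round(2))
```

Pearson correlation only measures linear association, so it complements the pair plot rather than replacing it.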

Jordan marks the PER-salary relationship for his next chart.

Chart 4: Player Efficiency Rating — The Best Predictor

Jordan creates a detailed regression plot with faceting:

g = sns.lmplot(data=df, x="player_efficiency",
               y="salary_millions",
               col="position", col_wrap=3,
               height=3.5, aspect=1.1,
               scatter_kws={"alpha": 0.5, "s": 20})
g.set_axis_labels("Player Efficiency Rating",
                  "Salary (millions USD)")
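# g.figure is the current name for the deprecated g.fig attribute;
# on older seaborn versions, use g.fig instead.
g.figure.suptitle("Does Efficiency Pay? PER vs. Salary "
                  "by Position", y=1.02, fontsize=13)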

The faceted view confirms the pair plot's suggestion: PER is the best single predictor of salary across all positions. The slopes are similar — about 1.5 million dollars per point of PER — but the intercepts differ. At the same PER, small forwards and point guards tend to earn slightly more than centers.

Jordan also notices that the confidence bands widen at high PER values. There are very few elite players, so the salary prediction becomes less precise at the top. This makes sense — supermax contracts are negotiated individually, not determined by statistics alone.
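The widening bands have a statistical cause as well: seaborn's regression plots draw them by bootstrapping the fit (the `ci` and `n_boot` parameters of `regplot`), and a fitted line is least certain far from the bulk of the data. The sketch below imitates that procedure with NumPy on synthetic data. The slope, intercept, and noise level are assumptions, not values from Jordan's file.

```python
import numpy as np

# Synthetic PER and salary values; all coefficients are illustrative.
rng = np.random.default_rng(3)
n = 120
per = rng.normal(15, 4, size=n)  # most players sit near an average PER
salary = 1.5 * per - 12 + rng.normal(0, 5, size=n)

grid = np.array([15.0, 30.0])  # a typical PER and an elite one
n_boot = 1000
preds = np.empty((n_boot, grid.size))
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)  # resample players with replacement
    slope, intercept = np.polyfit(per[idx], salary[idx], deg=1)
    preds[b] = intercept + slope * grid

# 95% percentile interval of the fitted line at each grid point
lo, hi = np.percentile(preds, [2.5, 97.5], axis=0)
width = hi - lo
print(f"95% band width at PER 15: {width[0]:.2f}M")
print(f"95% band width at PER 30: {width[1]:.2f}M")
```

The band is wider at the elite PER because few resampled players land there, so small changes in the fitted slope move the prediction a lot.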

Chart 5: Age and Experience — The Career Arc

For his final chart, Jordan explores how player careers evolve:

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

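# Left: violin plot of PER by experience bracket
# include_lowest=True keeps players with 0 years of experience;
# by default pd.cut excludes the left edge of the first bin.
df["exp_bracket"] = pd.cut(
    df["experience_years"],
    bins=[0, 3, 7, 12, 22],
    labels=["Rookie (0-3)", "Prime (4-7)",
            "Veteran (8-12)", "Elder (13+)"],
    include_lowest=True
)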

sns.violinplot(data=df, x="exp_bracket",
               y="player_efficiency",
               inner="quartile", ax=axes[0])
axes[0].set_title("Efficiency by Career Stage")
axes[0].set_xlabel("Experience Bracket")
axes[0].set_ylabel("Player Efficiency Rating")

# Right: scatter of age vs salary, colored by PER
scatter = axes[1].scatter(
    df["age"], df["salary_millions"],
    c=df["player_efficiency"],
    cmap="YlOrRd", alpha=0.6, s=30)
plt.colorbar(scatter, ax=axes[1],
             label="PER")
axes[1].set_title("Age vs. Salary (colored by PER)")
axes[1].set_xlabel("Age")
axes[1].set_ylabel("Salary (millions USD)")

plt.tight_layout()

The violin plot tells the career-arc story clearly. Rookies (0-3 years) have a wide distribution: some are immediate stars, others are bench warmers learning the game. Players in the Prime bracket (4-7 years of experience) have the highest median PER, which suggests this is when players peak. Veterans (8-12) maintain good efficiency but with more spread. Elders (13+) have a bimodal distribution: a few aging superstars maintaining elite PER, and many role players hanging on at low efficiency.

The scatter plot on the right shows a triangular shape: young players cluster at low salary (rookie contracts) regardless of PER. Between ages 25 and 32, salaries and PER both peak. After 32, some players maintain high salaries despite declining PER (those legacy contracts Jordan spotted earlier), while others see both salary and efficiency decline together.
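The experience brackets behind the violin plot deserve a quick check, because `pd.cut` treats bin edges as open on the left by default. The sketch below uses a handful of hypothetical experience values and compares the default behavior with `include_lowest=True`, using the same bin edges and labels as the chart.

```python
import pandas as pd

# Hypothetical experience values chosen to sit on the bin edges
years = pd.Series([0, 1, 3, 4, 7, 8, 12, 13, 20], name="experience_years")
bins = [0, 3, 7, 12, 22]
labels = ["Rookie (0-3)", "Prime (4-7)", "Veteran (8-12)", "Elder (13+)"]

# Default: intervals are (0, 3], (3, 7], ... so a value of 0 falls outside
default_cut = pd.cut(years, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, keeping 0
inclusive_cut = pd.cut(years, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"years": years,
                    "default": default_cut,
                    "include_lowest": inclusive_cut}))
```

With the defaults, a first-year player recorded as 0 years would become `NaN` and silently drop out of the violin plot, so the choice matters whenever the data contains zeros.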

Jordan's Article

Jordan writes his feature article around the five charts:

What Makes an NBA Player Worth Their Salary?

The NBA spends over $4 billion annually on player salaries. But what determines who gets paid? The answer, according to the data, is more nuanced than "the best players earn the most."

Scoring matters — but not equally for all positions. A point guard who can put up 20 points per game commands a significant premium over one who scores 12. For centers, the scoring premium is smaller; the league values their rebounding and defense more.

The single best statistical predictor of salary is Player Efficiency Rating, a composite metric that captures scoring, rebounding, assists, and defensive contributions in one number. Each additional point of PER is worth roughly $1.5 million in annual salary.

But statistics are not destiny. Young players on rookie contracts are systematically underpaid relative to their performance — a league rule that keeps costs predictable for teams. And aging veterans on legacy contracts are often overpaid relative to their declining output. The salary-performance relationship is strongest for players between 25 and 32 years old — the "prime earning window."

The career arc is clear in the data. Players peak in efficiency between their 4th and 7th NBA seasons, and those who reach that peak with strong numbers secure the max contracts that define the top of the salary distribution. The few who maintain elite performance past age 32 become the most valuable players in the league — not because they are the best they have ever been, but because sustaining excellence is rare.

Pedagogical Reflection

This case study demonstrates several important seaborn practices:

Start with distributions (histogram, box plot) to understand the shape of your key variable before exploring relationships.
Use pair plots as a roadmap to identify which pairwise relationships are worth investigating in detail.
Facet by category (position, experience bracket) to check whether aggregate patterns hold within subgroups.
Combine seaborn and matplotlib when needed. The age-salary scatter with a continuous color scale used matplotlib's scatter() with a colorbar, because seaborn's hue summarizes a numeric variable with a legend of sampled values rather than a colorbar.
Let visualizations raise questions. Jordan did not plan five specific charts in advance. Each chart suggested the next question: distribution led to position comparison, which led to pair plot exploration, which led to PER investigation, which led to career-arc analysis.
Write for the reader. Each chart in Jordan's article answers a specific question in plain language. The reader does not need to know what a "KDE" or "violin plot" is — they see the chart and read the interpretation.