Case Study 1: Organizing NBA Player Statistics — From Spreadsheet to Code


Tier 3 — Illustrative/Composite Example: This case study uses publicly available NBA statistical categories and realistic values, but the specific dataset, analysis workflow, and narrative are constructed for pedagogical purposes. Priya is a fictional character from the book's anchor examples. Player names and approximate statistics are drawn from publicly reported NBA data; the analysis process is designed to illustrate Chapter 5 concepts.


The Assignment

Priya lands her first real data assignment at the sports desk. Her editor drops a spreadsheet on her desk — well, emails her a CSV file — and says: "I need a piece on which players have been the most efficient scorers this season. You've got two days."

The file contains 30 rows, one per NBA player, with columns for player name, team abbreviation, games played, minutes per game, points per game, field goal percentage, three-point percentage, and free throw percentage. It is small enough to scan by eye in a spreadsheet, but Priya has been learning Python and sees an opportunity to practice what she has learned in her data science course.

Her question is specific: Which players score the most points per game while also shooting efficiently (above-average field goal percentage)? This is the kind of question that separates casual fans from data-driven analysts — raw scoring totals can be misleading if a player takes a huge number of inefficient shots.

Step 1: Representing a Player as a Dictionary

Priya's first decision is how to represent each player in Python. She could use a list:

# Player as a list: name, team, games, mpg, ppg, fg_pct, three_pct, ft_pct
player_list = ["Jayson Tatum", "BOS", 74, 35.8, 26.9, 0.471, 0.376, 0.855]

But what does player_list[4] mean? She would have to remember that index 4 is points per game. If she adds a column later (say, rebounds), all the indices shift.

A dictionary is better:

player = {
    "name": "Jayson Tatum",
    "team": "BOS",
    "games_played": 74,
    "minutes_per_game": 35.8,
    "points_per_game": 26.9,
    "fg_pct": 0.471,
    "three_pct": 0.376,
    "ft_pct": 0.855
}

Now player["points_per_game"] is self-documenting. No guessing, no memorizing index positions.
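One quick illustration of why keyed access is robust: adding a new field later (here a hypothetical rebounds_per_game, not part of Priya's dataset) does not disturb any existing lookups, whereas inserting a column into the list version would shift every later index.

```python
player = {
    "name": "Jayson Tatum",
    "team": "BOS",
    "points_per_game": 26.9,
}

# Adding a field later changes nothing about existing lookups
player["rebounds_per_game"] = 8.1  # hypothetical new column

print(player["points_per_game"])  # still 26.9 -- keys never shift
```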

Step 2: Building the Full Dataset as a List of Dictionaries

Each player is a dictionary. The full dataset is a list of those dictionaries — one per player:

players = [
    {"name": "Jayson Tatum", "team": "BOS", "games_played": 74,
     "minutes_per_game": 35.8, "points_per_game": 26.9,
     "fg_pct": 0.471, "three_pct": 0.376, "ft_pct": 0.855},

    {"name": "Luka Doncic", "team": "DAL", "games_played": 70,
     "minutes_per_game": 33.8, "points_per_game": 33.9,
     "fg_pct": 0.487, "three_pct": 0.356, "ft_pct": 0.786},

    {"name": "Giannis Antetokounmpo", "team": "MIL", "games_played": 73,
     "minutes_per_game": 35.2, "points_per_game": 30.4,
     "fg_pct": 0.611, "three_pct": 0.274, "ft_pct": 0.657},

    {"name": "Shai Gilgeous-Alexander", "team": "OKC", "games_played": 75,
     "minutes_per_game": 34.0, "points_per_game": 30.1,
     "fg_pct": 0.535, "three_pct": 0.353, "ft_pct": 0.874},

    {"name": "Kevin Durant", "team": "PHX", "games_played": 75,
     "minutes_per_game": 37.2, "points_per_game": 27.1,
     "fg_pct": 0.523, "three_pct": 0.413, "ft_pct": 0.856},
    # ... (25 more players in the full dataset)
]

Priya types in five players to test her code, then plans to read the rest from the CSV file.

Step 3: Reading the Data from CSV

The CSV file nba_stats.csv looks like this:

name,team,games_played,minutes_per_game,points_per_game,fg_pct,three_pct,ft_pct
Jayson Tatum,BOS,74,35.8,26.9,0.471,0.376,0.855
Luka Doncic,DAL,70,33.8,33.9,0.487,0.356,0.786
...

Priya writes the reading code:

import csv

players = []
with open("nba_stats.csv", "r") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Convert numeric strings to actual numbers
        player = {
            "name": row["name"],
            "team": row["team"],
            "games_played": int(row["games_played"]),
            "minutes_per_game": float(row["minutes_per_game"]),
            "points_per_game": float(row["points_per_game"]),
            "fg_pct": float(row["fg_pct"]),
            "three_pct": float(row["three_pct"]),
            "ft_pct": float(row["ft_pct"]),
        }
        players.append(player)

print(f"Loaded {len(players)} players")
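The per-field conversions above work, but the repetition invites typos. One possible refactor (a sketch, not part of Priya's script) keeps the numeric field names in two sets and converts in a loop; the field names match the CSV header used in this case study:

```python
INT_FIELDS = {"games_played"}
FLOAT_FIELDS = {"minutes_per_game", "points_per_game",
                "fg_pct", "three_pct", "ft_pct"}

def convert_row(row):
    """Return a copy of a csv.DictReader row with numeric fields converted."""
    player = dict(row)  # copy so the original row is untouched
    for field in INT_FIELDS:
        player[field] = int(player[field])
    for field in FLOAT_FIELDS:
        player[field] = float(player[field])
    return player
```

With this helper, the loop body shrinks to players.append(convert_row(row)), and adding a new numeric column means editing one set instead of one line per field.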

She almost forgets the type conversion. Her first attempt compares row["points_per_game"] > 25 and gets nonsensical results because "9.3" > "25.0" is True in string comparison (the character "9" comes after "2" in ASCII order). She debugs it by printing type(row["points_per_game"]) — it is <class 'str'>. Lesson learned: CSV values are always strings.
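Her bug is easy to reproduce in isolation. This tiny check (not part of her script) shows the string comparison going wrong and the fix:

```python
ppg = "9.3"  # what csv gives you: a string

# Lexicographic comparison: "9" sorts after "2", so this is True
print(ppg > "25.0")       # True -- nonsense for numbers

# Convert first, and the comparison behaves as expected
print(float(ppg) > 25.0)  # False
```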

Step 4: Analysis with Comprehensions

With the data loaded correctly, Priya answers her question.

Finding the average field goal percentage:

avg_fg = sum(p["fg_pct"] for p in players) / len(players)
print(f"Average FG%: {avg_fg:.3f}")  # e.g., Average FG%: 0.478
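The same average can also come from the standard library's statistics module; a minor variant, shown here only as an alternative (the values are the fg_pct figures from the five sample players):

```python
from statistics import mean

fg_values = [0.471, 0.487, 0.611, 0.535, 0.523]
avg_fg = mean(fg_values)
print(f"Average FG%: {avg_fg:.3f}")  # Average FG%: 0.525
```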

Identifying efficient scorers (above-average FG% AND 20+ PPG):

efficient_scorers = [
    p for p in players
    if p["fg_pct"] > avg_fg and p["points_per_game"] >= 20
]

print(f"\nEfficient high scorers ({len(efficient_scorers)} players):")
for p in sorted(efficient_scorers, key=lambda x: x["points_per_game"], reverse=True):
    print(f"  {p['name']} ({p['team']}): "
          f"{p['points_per_game']} PPG, "
          f"{p['fg_pct']:.1%} FG")

This gives Priya her story angle: players like Giannis Antetokounmpo and Shai Gilgeous-Alexander score at elite levels while also shooting efficiently — a combination that is rarer than casual fans might think.

Building a quick lookup dictionary:

player_lookup = {p["name"]: p for p in players}

# Now she can quickly check any player
kevin = player_lookup["Kevin Durant"]
print(f"KD: {kevin['points_per_game']} PPG on {kevin['fg_pct']:.1%} shooting")
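One caveat with this comprehension, though it does not bite in a 30-player dataset with unique names: dictionary keys are unique, so if two records share a name, the later one silently overwrites the earlier. A quick sketch with hypothetical records:

```python
records = [
    {"name": "A. Player", "team": "BOS"},
    {"name": "A. Player", "team": "DAL"},  # hypothetical duplicate name
]

lookup = {r["name"]: r for r in records}
print(len(lookup))                  # 1 -- one record was silently dropped
print(lookup["A. Player"]["team"])  # DAL -- the last duplicate wins
```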

Grouping players by team:

teams = {}
for p in players:
    team = p["team"]
    if team not in teams:
        teams[team] = []
    teams[team].append(p["name"])

# How many players per team in the dataset?
for team, names in sorted(teams.items()):
    print(f"{team}: {len(names)} player(s)")
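The explicit membership check above works fine; Python also offers two common shorthands for this grouping pattern. A sketch using dict.setdefault and collections.defaultdict (same behavior, a matter of taste):

```python
from collections import defaultdict

players = [
    {"name": "Jayson Tatum", "team": "BOS"},
    {"name": "Luka Doncic", "team": "DAL"},
    {"name": "Kevin Durant", "team": "PHX"},
]

# Variant 1: setdefault inserts an empty list the first time a key appears
teams = {}
for p in players:
    teams.setdefault(p["team"], []).append(p["name"])

# Variant 2: defaultdict creates the empty list automatically on first access
teams_dd = defaultdict(list)
for p in players:
    teams_dd[p["team"]].append(p["name"])

print(dict(teams_dd) == teams)  # True -- same grouping either way
```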

Step 5: Writing the Results

Priya writes her filtered results to a new CSV for her editor:

import csv

with open("efficient_scorers.csv", "w", newline="") as f:
    fieldnames = ["name", "team", "points_per_game", "fg_pct"]
    writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for p in efficient_scorers:
        writer.writerow(p)

print("Results saved to efficient_scorers.csv")

The extrasaction="ignore" parameter tells DictWriter to skip any keys in the dictionary that are not in fieldnames — a handy option when you want to write only a subset of columns.
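Without that option, DictWriter raises a ValueError when a row contains keys missing from fieldnames. A small demonstration, written to an in-memory buffer rather than a file:

```python
import csv
import io

row = {"name": "Kevin Durant", "team": "PHX",
       "points_per_game": 27.1, "fg_pct": 0.523,
       "ft_pct": 0.856}  # ft_pct is not in fieldnames below

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "team", "points_per_game", "fg_pct"])
try:
    writer.writerow(row)  # default is extrasaction="raise"
except ValueError as e:
    print("Refused:", e)

# With extrasaction="ignore", the extra key is simply skipped
writer = csv.DictWriter(buf, fieldnames=["name", "team", "points_per_game", "fg_pct"],
                        extrasaction="ignore")
writer.writerow(row)
```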

What Priya Learned

Looking back at her two days of work, Priya realizes she used almost every concept from Chapter 5:

Concept                     How She Used It
Dictionary                  Represented each player as a named record
List of dictionaries        Represented the full dataset as a table
List comprehension          Filtered for efficient scorers in one line
Dictionary comprehension    Built a name-to-player lookup
csv.DictReader              Loaded data from the CSV file
csv.DictWriter              Saved filtered results to a new CSV
Type conversion             Converted string values from CSV to int/float
sorted() with key           Ranked players by points per game

The most important lesson was not about any single Python feature. It was about the shift in thinking: she stopped seeing data as rows in a spreadsheet and started seeing it as structured collections that she could query, filter, and transform programmatically. That shift — from spreadsheet consumer to data programmer — is the threshold concept of this chapter.


Discussion Questions

  1. Structure choice. Priya chose a list of dictionaries to represent her dataset. What would change if she had used a dictionary of lists (column-oriented) instead? Which operations would become easier, and which would become harder?

  2. The type conversion trap. Priya's first bug came from comparing strings to numbers. Why does Python's csv module return everything as strings? Would it be better if the module tried to guess the types automatically? What could go wrong with automatic type guessing?

  3. Efficiency beyond FG%. Priya used field goal percentage as her efficiency metric, but basketball analytics has more sophisticated measures (true shooting percentage, player efficiency rating). How would you modify the data structure to include multiple efficiency metrics? Would the overall approach change?

  4. From 30 to 30,000. Priya's dataset had 30 rows — manageable even by hand. What if she had 30,000 rows (every player-season in NBA history)? Would her approach still work? What would need to change? (This question previews why pandas exists.)


Try It Yourself

Using Priya's approach as a template, create your own mini-dataset of 8-10 items from a domain you care about — movies, songs, recipes, countries, athletes from a different sport, or anything else. For each item, create a dictionary with at least 5 fields including at least one numeric field.

  1. Hard-code 3-4 records to test your code, then move the full dataset into a CSV file and read it in
  2. Calculate the average of your numeric field
  3. Filter for items above the average
  4. Write the filtered results to a new CSV

Pay attention to the moment when the data clicks into place — when you see your real-world domain as structured collections of key-value pairs. That is the thinking shift this chapter is about.