Case Study 1: Calculating a Batting Average — Sports Data with Python Basics

Contributors to Introduction to Data Science

Tier 3 — Illustrative Example: This case study uses Priya, one of our anchor characters, in a simplified scenario constructed for pedagogical purposes. The basketball statistics described are realistic in structure but are fictional numbers chosen to illustrate Python concepts. No specific NBA player or season is represented.

The Setting

Priya is a sports journalist who covers the NBA for an online publication. She's just finished Chapter 3 and has a working Jupyter notebook with Python basics under her belt: variables, arithmetic, strings, f-strings, and type conversion.

Today she's got a very specific task. Her editor wants a "by the numbers" sidebar for an article about three players on a local team. For each player, Priya needs to calculate several performance statistics and format them into a clean, readable summary. The stats she needs are:

Points per game (PPG): total points divided by games played
Field goal percentage (FG%): field goals made divided by field goals attempted, times 100
Three-point percentage (3P%): three-pointers made divided by three-pointers attempted, times 100
Free throw percentage (FT%): free throws made divided by free throws attempted, times 100

She could do this on a calculator. She's done it before — typing in numbers, writing results on sticky notes, hoping she doesn't transpose a digit. But with three players and four calculations each, that's twelve separate computations, each one a chance for a typo. And if her editor says "actually, can you also add rebounds per game?" she has to start over.

Let's see how Python makes this faster, less error-prone, and repeatable.

The Data

Priya has the following season statistics for three players. (In a future chapter, she'll load this from a CSV file. For now, she's typing it in.)

Player	Games	Points	FG Made	FG Att	3P Made	3P Att	FT Made	FT Att
Amara Johnson	72	1584	576	1210	144	398	288	331
DeShaun Williams	68	1122	408	892	102	295	204	240
Kenji Nakamura	79	987	372	814	81	248	162	194

Step 1: Storing the Data in Variables

Priya opens her Jupyter notebook and creates variables for the first player:

# Player 1: Amara Johnson
p1_name = "Amara Johnson"
p1_games = 72
p1_points = 1584
p1_fg_made = 576
p1_fg_att = 1210
p1_3p_made = 144
p1_3p_att = 398
p1_ft_made = 288
p1_ft_att = 331

She runs the cell. No output — that's expected. The variables are stored in memory, waiting to be used.

A few things to notice about her variable names:

They use snake_case (p1_fg_made, not p1FgMade).
They have a consistent prefix (p1_ for player 1) so she can keep track of which player each variable belongs to.
They're descriptive enough that someone reading the code can understand what each one is without checking back.
p1_3p_made starts with p1_, not 3, because a Python variable name can't begin with a digit.
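The last rule can be checked without risking a syntax error. The lines below are a minimal sketch that uses the built-in string method isidentifier(); the names starts_with_letter and starts_with_digit are made up for this illustration and are not part of Priya's notebook.

```python
# str.isidentifier() reports whether a string would be a legal Python name.
starts_with_letter = "p1_3p_made".isidentifier()
starts_with_digit = "3p_made".isidentifier()

print(starts_with_letter)  # True: begins with a letter
print(starts_with_digit)   # False: begins with a digit
```

Typing 3p_made = 144 directly into a cell would stop with a SyntaxError instead, which is why Priya puts the p1_ prefix first.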

She does the same for players 2 and 3:

# Player 2: DeShaun Williams
p2_name = "DeShaun Williams"
p2_games = 68
p2_points = 1122
p2_fg_made = 408
p2_fg_att = 892
p2_3p_made = 102
p2_3p_att = 295
p2_ft_made = 204
p2_ft_att = 240

# Player 3: Kenji Nakamura
p3_name = "Kenji Nakamura"
p3_games = 79
p3_points = 987
p3_fg_made = 372
p3_fg_att = 814
p3_3p_made = 81
p3_3p_att = 248
p3_ft_made = 162
p3_ft_att = 194

What Priya notices: Typing all these variables is tedious. Three players, nine variables each — that's 27 variables. She can already see that this approach won't scale to an entire roster of 15 players, let alone a league of 450. In Chapter 5, she'll learn about dictionaries and lists, which will let her organize this data much more cleanly. For now, this works.
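As a preview of where this is heading, here is a rough sketch of the same data held in a list of dictionaries. The key names ("name", "games", "points") and the loop are assumptions made for this illustration; the later chapter may organize things differently, and only two of the eight statistics are shown.

```python
# Preview only: one dictionary per player, collected in a list.
players = [
    {"name": "Amara Johnson", "games": 72, "points": 1584},
    {"name": "DeShaun Williams", "games": 68, "points": 1122},
    {"name": "Kenji Nakamura", "games": 79, "points": 987},
]

# The calculation is written once and applied to every player.
for player in players:
    ppg = player["points"] / player["games"]
    print(f"{player['name']}: {ppg:.1f} PPG")
```

Adding a fourth player would mean adding one line to the list rather than nine new variables.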

Step 2: Computing the Statistics

Now Priya writes the calculations for Player 1:

# Calculations for Amara Johnson
p1_ppg = p1_points / p1_games
p1_fg_pct = p1_fg_made / p1_fg_att * 100
p1_3p_pct = p1_3p_made / p1_3p_att * 100
p1_ft_pct = p1_ft_made / p1_ft_att * 100

She prints the results to check:

print(f"Points per game: {p1_ppg}")
print(f"FG%: {p1_fg_pct}")
print(f"3P%: {p1_3p_pct}")
print(f"FT%: {p1_ft_pct}")

Points per game: 22.0
FG%: 47.603305785123965
3P%: 36.18090452261307
FT%: 87.00906344410876

The math is right, but nobody wants that many decimal places in a sidebar. Priya adds the format specifier :.1f to each f-string, which displays the value rounded to one decimal place:

print(f"{p1_name}")
print(f"  PPG:  {p1_ppg:.1f}")
print(f"  FG%:  {p1_fg_pct:.1f}%")
print(f"  3P%:  {p1_3p_pct:.1f}%")
print(f"  FT%:  {p1_ft_pct:.1f}%")

Amara Johnson
  PPG:  22.0
  FG%:  47.6%
  3P%:  36.2%
  FT%:  87.0%

Much better. She repeats the same pattern for players 2 and 3:

# Player 2 calculations
p2_ppg = p2_points / p2_games
p2_fg_pct = p2_fg_made / p2_fg_att * 100
p2_3p_pct = p2_3p_made / p2_3p_att * 100
p2_ft_pct = p2_ft_made / p2_ft_att * 100

# Player 3 calculations
p3_ppg = p3_points / p3_games
p3_fg_pct = p3_fg_made / p3_fg_att * 100
p3_3p_pct = p3_3p_made / p3_3p_att * 100
p3_ft_pct = p3_ft_made / p3_ft_att * 100
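One detail worth checking: the :.1f specifier used above changes only the text that gets displayed, not the number stored in the variable. The short sketch below illustrates this with Amara Johnson's field goal numbers from the table; the names fg_pct and shown are chosen just for this example.

```python
fg_pct = 576 / 1210 * 100        # Amara Johnson's FG%, from the table

shown = f"{fg_pct:.1f}"          # a new string, rounded for display
print(shown)                     # 47.6
print(type(shown))               # <class 'str'>
print(fg_pct == 47.6)            # False: the stored float keeps all its digits
print(round(fg_pct, 1) == 47.6)  # True: round() returns a new, rounded float
```

Keeping the full-precision value in the variable and rounding only when printing means later calculations are not affected by display choices.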

Step 3: Formatting the Report

Priya's editor wants a clean sidebar. She builds a formatted output:

print("=" * 40)
print("PLAYER PERFORMANCE SUMMARY")
print("=" * 40)

print(f"\n{p1_name}")
print(f"  PPG: {p1_ppg:.1f} | FG: {p1_fg_pct:.1f}%"
      f" | 3P: {p1_3p_pct:.1f}% | FT: {p1_ft_pct:.1f}%")

print(f"\n{p2_name}")
print(f"  PPG: {p2_ppg:.1f} | FG: {p2_fg_pct:.1f}%"
      f" | 3P: {p2_3p_pct:.1f}% | FT: {p2_ft_pct:.1f}%")

print(f"\n{p3_name}")
print(f"  PPG: {p3_ppg:.1f} | FG: {p3_fg_pct:.1f}%"
      f" | 3P: {p3_3p_pct:.1f}% | FT: {p3_ft_pct:.1f}%")

print("\n" + "=" * 40)

========================================
PLAYER PERFORMANCE SUMMARY
========================================

Amara Johnson
  PPG: 22.0 | FG: 47.6% | 3P: 36.2% | FT: 87.0%

DeShaun Williams
  PPG: 16.5 | FG: 45.7% | 3P: 34.6% | FT: 85.0%

Kenji Nakamura
  PPG: 12.5 | FG: 45.7% | 3P: 32.7% | FT: 83.5%

========================================
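The two-line print calls in this step rely on a Python feature the code does not announce: string literals placed next to each other inside parentheses are joined into a single string. Here is a small sketch of that behavior using Amara Johnson's numbers; the variable names ppg, fg_pct, three_pct, ft_pct, and line are assumptions for this example only.

```python
ppg = 1584 / 72
fg_pct = 576 / 1210 * 100
three_pct = 144 / 398 * 100
ft_pct = 288 / 331 * 100

# Two adjacent f-strings inside the parentheses become one string.
line = (f"  PPG: {ppg:.1f} | FG: {fg_pct:.1f}%"
        f" | 3P: {three_pct:.1f}% | FT: {ft_pct:.1f}%")
print(line)
```

Splitting the literal this way keeps each line of code short without inserting a line break into the printed output.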

Step 4: Adding Comparisons

Priya's editor calls back: "Can you tell me which player has the best three-point percentage?" She adds some boolean comparisons:

p1_best_3p = p1_3p_pct > p2_3p_pct and p1_3p_pct > p3_3p_pct
p2_best_3p = p2_3p_pct > p1_3p_pct and p2_3p_pct > p3_3p_pct
p3_best_3p = p3_3p_pct > p1_3p_pct and p3_3p_pct > p2_3p_pct

print(f"{p1_name} has best 3P%: {p1_best_3p}")
print(f"{p2_name} has best 3P%: {p2_best_3p}")
print(f"{p3_name} has best 3P%: {p3_best_3p}")

Amara Johnson has best 3P%: True
DeShaun Williams has best 3P%: False
Kenji Nakamura has best 3P%: False

It works, but Priya can see the problem: this comparison approach becomes unwieldy with more players. In Chapter 4, she'll learn about if/elif/else to handle this more elegantly. And in Part II, she'll use pandas to compute these statistics for entire rosters in a single line of code.
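For a sense of what that might look like, here is one possible sketch using if/elif/else. It is an illustration only, not necessarily the version the next chapter builds, and the variable best is a name invented for this example. It uses >= rather than >, so a tie goes to the earlier player; with the strict > comparisons above, a tie for first place would leave all three results False.

```python
p1_3p_pct = 144 / 398 * 100   # Amara Johnson
p2_3p_pct = 102 / 295 * 100   # DeShaun Williams
p3_3p_pct = 81 / 248 * 100    # Kenji Nakamura

# Exactly one branch runs, so exactly one name is chosen.
if p1_3p_pct >= p2_3p_pct and p1_3p_pct >= p3_3p_pct:
    best = "Amara Johnson"
elif p2_3p_pct >= p3_3p_pct:
    best = "DeShaun Williams"
else:
    best = "Kenji Nakamura"

print(f"Best 3P%: {best}")
```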

What Priya Learned

Looking back at her notebook, Priya realizes she's used almost every concept from Chapter 3:

Variables to store player data and computed results
Integers for counts (games, shots made, shots attempted)
Floats for computed percentages (in Python 3, the / operator always returns a float, even when both operands are integers)
Strings for player names
Arithmetic operators (/ for division, * for multiplication)
f-strings with format specifiers (:.1f) for clean output
Booleans, the comparison operator >, and the logical operator and to identify the best performer
String repetition ("=" * 40) for visual formatting
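The types in this list can be confirmed with the built-in type() function. Below is a minimal sketch using Amara Johnson's games and points; the unprefixed variable names are chosen for this example rather than taken from Priya's notebook.

```python
games = 72
points = 1584
ppg = points / games

print(type(games))            # <class 'int'>
print(type(ppg))              # <class 'float'>
print(ppg)                    # 22.0
print(type("Amara Johnson"))  # <class 'str'>
print(type(ppg > 20))         # <class 'bool'>
print("=" * 5)                # =====
```

Note that ppg prints as 22.0, not 22: the division produced a float even though 1584 divides evenly by 72.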

More importantly, she understands why Python is better than a calculator for this task:

Transparency. Every calculation is visible. If the FG% formula is wrong, she can see it and fix it. On a calculator, the wrong keystroke is gone forever.
Repeatability. If a player's stats get updated, she changes one number and re-runs the cells, and every calculation that depends on it is recomputed.
Scalability. Adding a fourth player means copying the same pattern. (And in later chapters, even the copying becomes unnecessary.)
Communication. The notebook itself — code, output, and explanatory Markdown cells — is a document she can share with her editor. It shows how she got the numbers, not just what they are.

Discussion Questions

Priya used the naming convention p1_fg_made, p2_fg_made, etc. What are the advantages and disadvantages of this approach compared to just using fg_made_1, fg_made_2?
When Priya computed field goal percentage, she wrote p1_fg_made / p1_fg_att * 100. Because / and * have the same precedence and are evaluated left to right, this is computed as (p1_fg_made / p1_fg_att) * 100, which is correct. What would happen if she wrote p1_fg_made / (p1_fg_att * 100) instead? Would Python raise an error, or would it silently give the wrong answer?
Imagine a player attempted zero three-point shots. What would happen when Priya tries to calculate their three-point percentage? What type of error would Python raise? (Hint: think about what 0 does in a denominator.)
The code is repetitive — the same four-line calculation block appears three times with slightly different variable names. In Chapter 4, you'll learn about functions that let you write the calculation once and reuse it. What would you want the function to take as input, and what would it return?