> "The computer is incredibly fast, accurate, and stupid. Man is incredibly slow, inaccurate, and brilliant. The marriage of the two is a force beyond calculation."

Learning Objectives

  • Set up a Python environment with Jupyter notebooks for statistical analysis
  • Perform basic data operations using pandas (load, view, filter, sort)
  • Use Excel/Google Sheets for basic data entry and simple calculations
  • Navigate between Python and spreadsheet approaches for common tasks
  • Write and run simple Python code for data exploration

Chapter 3: Your Data Toolkit: Python, Excel, and Jupyter Notebooks

"The computer is incredibly fast, accurate, and stupid. Man is incredibly slow, inaccurate, and brilliant. The marriage of the two is a force beyond calculation." — Attributed to Leo Cherne

Chapter Overview

Okay, here's the moment things get real.

In Chapter 1, I promised you that statistics is a superpower — a way of seeing through noise, spotting patterns, and making better decisions. In Chapter 2, you learned the vocabulary: categorical and numerical variables, observational units, data dictionaries. You've been building the thinking side of statistics.

Now it's time to pick up the tools.

In this chapter, you're going to write your first line of Python code, load a real dataset, and start asking it questions. If you've never programmed before, I know that might sound intimidating. I've seen the look on students' faces when I say "we're going to code today" — it's somewhere between "I didn't sign up for this" and pure dread.

But here's what I want you to know: you don't need to become a programmer. You need to become someone who can use a programming tool to explore data. That's a much smaller ask. It's the difference between becoming a mechanic and learning to drive a car. By the end of this chapter, you'll be driving.

We'll also talk about Excel and Google Sheets, because those tools are genuinely useful for certain tasks. The goal isn't to pick sides — it's to understand when each tool shines and when it struggles, so you can grab the right one for the job.

In this chapter, you will learn to:

  • Set up a Python environment using Google Colab (zero installation required)
  • Write and run Python code in a Jupyter notebook
  • Load a CSV file into pandas and explore it
  • Filter, sort, and summarize data with a few lines of code
  • Compare the Python approach with spreadsheets for common tasks

Fast Track: If you already have Python experience, skim Sections 3.1-3.3 and jump to Section 3.5 ("Your First Real Dataset"). Complete quiz questions 1, 8, and 15 to verify your foundation.

Deep Dive: After this chapter, work through Case Study 1 (exploring public health data with pandas) for extended practice. Then read Case Study 2 to sharpen your intuition about when to use Python vs. spreadsheets.


3.1 Why Do We Need Tools At All?

Let's start with a question that might be on your mind: why can't we just do statistics by hand?

Short answer: you can. For tiny datasets. People did statistics by hand for centuries — and they made incredible discoveries doing it. Florence Nightingale created some of the most influential data visualizations in history using nothing but pen, paper, and a ruler. Ronald Fisher developed foundational statistical methods while hand-calculating results from agricultural experiments.

But here's the thing: the datasets you'll work with in this course have hundreds or thousands of rows. Real-world datasets have millions. Nobody is going to calculate the mean of 10,000 numbers by hand. Nobody should.

Tools let you:

  • Work faster. What takes an hour by hand takes a second with code.
  • Make fewer mistakes. Computers don't get tired, skip rows, or misread numbers.
  • Reproduce your work. If you write code, anyone can re-run it and get the same result. Try reproducing a series of Excel clicks from memory three months later.
  • Scale up. The code that analyzes 100 rows works identically on 100,000 rows.

In Chapter 1, we talked about statistical thinking — the ability to see variation, uncertainty, and patterns in data. Tools don't replace that thinking. They amplify it. A calculator doesn't make you smarter at math, but it frees your brain to focus on the interesting parts instead of the arithmetic.

The Two Main Paths: Code vs. Spreadsheets

For data analysis, you have two main options:

Spreadsheets (Excel, Google Sheets): You see your data laid out in a grid. You click on cells, write formulas, drag things around. It's visual, intuitive, and most people have at least some experience with it.

Code (Python, R): You write text instructions that tell the computer what to do with your data. It's less visual at first, but much more powerful and reproducible once you learn the basics.

This course uses Python as the primary tool, but we won't ignore spreadsheets. Both have their place, and a good data analyst knows when to use each. Here's the quick version:

| Task | Spreadsheet | Python |
| --- | --- | --- |
| Quick data entry (< 50 rows) | Excellent | Overkill |
| Simple calculations on small data | Great | Fine |
| Exploring a large dataset | Slow, clunky | Fast, elegant |
| Reproducible analysis | Difficult | Built-in |
| Creating publication-quality graphs | Decent | Excellent |
| Sharing your analysis steps | Hard | Easy (Jupyter) |
| Cleaning messy data | Painful | Manageable |
| Statistical tests | Limited | Comprehensive |

Intuition: Think of a spreadsheet as a Swiss Army knife — handy, portable, good for many small tasks. Python is more like a full workshop — takes a bit of setup, but once you're in, you can build anything.


3.2 Setting Up Your Workspace: Google Colab

Here's the best news in this entire chapter: you don't need to install anything.

Google Colab (short for Colaboratory) is a free, browser-based tool that gives you a Jupyter notebook environment with Python already set up. If you have a Google account and a web browser, you're ready to go. No downloads, no installation wizards, no "dependency conflicts" that make you want to throw your laptop out a window.

Step-by-Step: Your First Notebook

Step 1: Open your web browser and go to colab.research.google.com.

Step 2: Sign in with your Google account (any Gmail account works).

Step 3: Click "New Notebook" in the bottom-right of the dialog box that appears. If no dialog appears, go to File > New notebook.

That's it. You're now looking at a Jupyter notebook — the tool you'll use for the rest of this course.

What Is a Jupyter Notebook?

A Jupyter notebook is an interactive document that combines code, text, and output all in one place. It's like a lab notebook for data analysis. You write code in chunks called cells, run each cell, and see the results immediately below it.

Here's why this matters: in a traditional programming setup, you'd write an entire script, run it, and hope for the best. In a Jupyter notebook, you work cell by cell. Write a little code, run it, see what happens, adjust, repeat. It's conversational — almost like having a dialogue with your data.

The name "Jupyter" comes from three programming languages: Julia, Python, and R. But we'll be using Python exclusively.

Anatomy of a Notebook

Your new notebook has a few key parts:

  • The title bar at the top — click "Untitled0.ipynb" to rename your notebook (try "Ch3_My_First_Notebook")
  • The menu bar — File, Edit, View, Insert, Runtime, Tools, Help
  • The toolbar — buttons for common actions
  • Cells — the rectangular boxes where you write code or text

There are two types of cells:

  1. Code cells — where you write Python code (they have a play button on the left)
  2. Text cells — where you write notes and explanations using Markdown formatting

The Kernel: Your Invisible Assistant

Behind the scenes, your notebook is connected to a kernel — a running instance of Python that executes your code. Think of it as a calculator that's always on, waiting for your instructions.

When you run a code cell, the kernel processes it and returns the result. The kernel remembers everything you've done in previous cells, so if you define a variable in cell 1, you can use it in cell 5.

Sometimes the kernel gets confused or stuck. If that happens:

  • Runtime > Restart runtime will restart the kernel (like rebooting your calculator)
  • After restarting, you'll need to re-run your cells from the top

Math Anxiety Note: If all of this sounds like a lot, take a breath. You don't need to memorize the anatomy of a notebook. You need to know three things: (1) type code in a cell, (2) press the play button (or Shift+Enter), (3) read the output. Everything else you'll pick up naturally as we go.


3.3 Your First Python Cell: Hello, Statistics!

Let's write some code. For real. Right now.

In your Google Colab notebook, click on the empty code cell. Type exactly this:

print("Hello, Statistics!")

Now press Shift+Enter (or click the play button on the left side of the cell).

You should see this appear below the cell:

Hello, Statistics!

Congratulations. You just wrote and ran your first Python program. That's not a small thing — millions of people talk about learning to code and never actually do it. You just did.

Let's try a few more things. Click on the next cell (or press the + Code button to create one) and type:

2 + 3

Press Shift+Enter. You'll see 5 appear below. Python is now your calculator.

Try this one:

100 * 0.38

That gives you 38.0. If that calculation looks familiar, it should — that's 38% of 100, the kind of percentage calculation you'll do constantly in statistics.

Variables: Giving Names to Numbers

In statistics, we work with data that has meaning. Let's give our numbers names:

sample_size = 65
makes = 25
shooting_pct = makes / sample_size
print(shooting_pct)

Output:

0.38461538461538464

Look at that — we just calculated a shooting percentage. If Sam Okafor were working with Daria's three-point data from Chapter 1, this is exactly the kind of calculation he'd start with. The variable shooting_pct now holds that long decimal (roughly 0.385), and we can use it in later cells.

A few things to notice about Python:

  • Variable names can't have spaces (use underscores instead: sample_size, not sample size)
  • Python uses * for multiplication (not x or times)
  • Python uses / for division
  • You don't need to declare variable types — Python figures it out

Round That Number

That shooting percentage has way too many decimal places. Let's clean it up:

print(round(shooting_pct, 3))

Output:

0.385

The round() function takes two arguments: the number you want to round, and the number of decimal places. Much better.
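
If you only want to change how a number is displayed, Python's f-strings offer an alternative to round(). This quick sketch reuses the shooting-percentage variables from above:

```python
sample_size = 65
makes = 25
shooting_pct = makes / sample_size

# round() changes the value itself; an f-string changes only the display
print(round(shooting_pct, 3))    # 0.385
print(f"{shooting_pct:.3f}")     # 0.385
```

You'll see this f-string style again later in the chapter when we print average BMIs.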

Check Your Understanding (try to answer before looking)

  1. What would print(2 ** 3) produce? (Hint: ** means "to the power of")
  2. If you type x = 10 in one cell and run it, then type print(x + 5) in the next cell and run it, what happens?
  3. What happens if you restart the kernel and then run only the second cell?

Verify

  1. It produces 8 (2 to the power of 3 = 2 x 2 x 2 = 8).
  2. It prints 15. The kernel remembers that x = 10 from the previous cell.
  3. You get a NameError because x was never defined — restarting the kernel erased all variables. You'd need to re-run the first cell before the second one will work.

3.4 Libraries: Standing on the Shoulders of Giants

Python on its own is a general-purpose programming language. It can do math, manipulate text, and run programs — but it doesn't know anything about statistics or data analysis by default. That's where libraries come in.

A library (also called a package) is a collection of pre-written code that adds new capabilities to Python. Instead of writing hundreds of lines of code to load a CSV file, calculate an average, or create a graph, you import a library that someone else already built and tested.

Here are the three libraries you'll use most in this course:

| Library | What It Does | When You'll Use It |
| --- | --- | --- |
| pandas | Data loading, manipulation, and analysis | Every chapter from here on |
| matplotlib | Creating graphs and visualizations | Starting in Chapter 5 |
| scipy | Statistical tests and probability distributions | Starting in Chapter 8 |

For today, we only need pandas.

What Is pandas?

pandas is a Python library specifically designed for working with data. It was created by Wes McKinney in 2008 while he was working at a financial firm and got frustrated with the limitations of Excel for serious data analysis. (Yes, really — pandas exists because Excel made a Wall Street analyst angry enough to build something better.)

pandas gives you a powerful data structure called a DataFrame — think of it as a supercharged spreadsheet that lives inside your Python code. A DataFrame has rows and columns, just like Excel, but you can manipulate it with code instead of mouse clicks.

Importing pandas

To use pandas, you need to import it — tell Python "I want access to this library." Here's how:

import pandas as pd

That's it. One line. Let's break it down:

  • import pandas loads the pandas library
  • as pd gives it a shorter nickname so you can type pd instead of pandas everywhere

This pd nickname is a universal convention. Every data scientist on the planet writes import pandas as pd. If you see pd. something in code, it means "this came from pandas."

Why pd and not pandas? Because you'll type it hundreds of times. Saving five characters each time adds up. pd.read_csv() is easier to type than pandas.read_csv(). Convention matters in coding — when everyone uses the same abbreviation, code becomes easier to read across teams and projects.


3.5 Your First Real Dataset

Enough setup. Let's load some actual data.

We'll start with a dataset that Dr. Maya Chen might use in her public health work: data from the CDC's Behavioral Risk Factor Surveillance System (BRFSS), which surveys hundreds of thousands of Americans about their health behaviors and conditions.

For this chapter, I've prepared a simplified version with 500 rows and 8 columns. Here's how to load it in Google Colab:

import pandas as pd

url = "https://raw.githubusercontent.com/intro-stats-data/datasets/main/brfss_sample_500.csv"
health = pd.read_csv(url)

Let's break down what just happened:

  1. We imported pandas (you only need to do this once per notebook)
  2. We stored the URL of our data file in a variable called url
  3. pd.read_csv(url) told pandas to read the CSV file at that URL and turn it into a DataFrame
  4. We stored the result in a variable called health

A CSV (Comma-Separated Values) file is the most common format for sharing data. It's a plain text file where each row is a line, and columns are separated by commas. Think of it as the universal language that all data tools can speak.
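
You can see the format for yourself: pandas reads CSV text from anything file-like, including a plain string wrapped in io.StringIO. A tiny sketch with made-up values, not the real survey:

```python
import io
import pandas as pd

# Raw CSV text: one header row plus two data rows, commas between columns
csv_text = "age,sex,state\n45,F,CA\n62,M,TX"

tiny = pd.read_csv(io.StringIO(csv_text))
print(tiny.shape)   # (2, 3): two rows, three columns
```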

What if the URL doesn't work? Don't panic. Your instructor may provide a different URL or a downloadable file. If you have a CSV file on your computer, you can upload it to Colab (File > Upload) and load it with pd.read_csv("filename.csv"). See the Debugging Spotlight in Section 3.8 for common file-loading errors.

Looking at Your Data: .head() and .tail()

The first thing you should always do with a new dataset is look at it. Let's see the first five rows:

health.head()

Output:

   age  sex     bmi  smoker  exercise  gen_health  sleep_hours  state
0   45    F   27.3       0         1           3            7     CA
1   62    M   31.5       1         0           4            6     TX
2   33    F   22.1       0         1           2            8     NY
3   57    M   28.8       0         1           3            7     FL
4   28    F   24.6       0         1           1            7     CA

This is your data. Each row is one observational unit — one person who responded to the survey (remember that term from Chapter 2?). Each column is a variable.

If you want to see the last five rows instead:

health.tail()

You can also specify how many rows to show:

health.head(10)  # first 10 rows

How Big Is This Dataset? .shape

health.shape

Output:

(500, 8)

That means 500 rows (observations) and 8 columns (variables). The .shape attribute gives you a quick size check — useful when you're not sure how large your dataset is.
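
Because .shape is an ordinary Python tuple, you can unpack it into named variables. A sketch on a toy DataFrame, not the health data:

```python
import pandas as pd

# Toy DataFrame: three rows, two columns
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

n_rows, n_cols = df.shape            # tuple unpacking
print(f"{n_rows} rows, {n_cols} columns")
```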

What Types of Data Are We Working With? .dtypes

Remember from Chapter 2 how important it is to know your variable types? Here's how pandas sees them:

health.dtypes

Output:

age            int64
sex           object
bmi          float64
smoker         int64
exercise       int64
gen_health     int64
sleep_hours    int64
state         object
dtype: object

Let's translate:

  • int64 means integer (whole number) — pandas thinks these are numerical
  • float64 means decimal number — also numerical
  • object means text — pandas's way of saying "this is categorical"

But wait — look at smoker and exercise. They're coded as integers (0 and 1), but they're actually categorical variables! A 0 means "no" and a 1 means "yes." This is exactly the kind of classification challenge we discussed in Chapter 2: the computer's type and the statistical type aren't always the same.

Similarly, gen_health is coded 1-5 (Excellent to Poor), making it an ordinal variable stored as numbers. You, the analyst, need to bring your Chapter 2 knowledge to the table — the computer won't catch these distinctions for you.
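
One way to make such coded columns self-documenting is to translate the codes into labels with .map(). A sketch on a toy stand-in with hypothetical values, not the BRFSS sample:

```python
import pandas as pd

# Toy smoker column coded the same way: 0 = no, 1 = yes
df = pd.DataFrame({'smoker': [0, 1, 0, 0, 1]})

# .map() replaces each code with its label
df['smoker_label'] = df['smoker'].map({0: 'no', 1: 'yes'})
print(df['smoker_label'].tolist())   # ['no', 'yes', 'no', 'no', 'yes']
```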

The Full Picture: .info()

For a comprehensive overview of your dataset, use .info():

health.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   age          500 non-null    int64
 1   sex          500 non-null    object
 2   bmi          487 non-null    float64
 3   smoker       500 non-null    int64
 4   exercise     498 non-null    int64
 5   gen_health   500 non-null    int64
 6   sleep_hours  500 non-null    int64
 7   state        500 non-null    object
dtypes: float64(1), int64(5), object(2)
memory usage: 31.4+ KB

This tells you everything at a glance: 500 rows, 8 columns, the data type of each column, and — critically — the non-null count. Notice that bmi has only 487 non-null values (meaning 13 are missing) and exercise has 498 (2 missing). Missing data is a fact of life in real datasets. We'll learn how to handle it properly in Chapter 7.

Quick Statistics: .describe()

Here's where it gets fun. One command gives you a statistical summary of every numerical column:

health.describe()

Output:

              age         bmi      smoker    exercise  gen_health  sleep_hours
count  500.000000  487.000000  500.000000  498.000000  500.000000   500.000000
mean    47.820000   28.143530    0.178000    0.725904    2.834000     6.918000
std     16.543210    6.217845    0.382891    0.446434    1.057321     1.234567
min     18.000000   15.200000    0.000000    0.000000    1.000000     3.000000
25%     34.000000   23.800000    0.000000    0.000000    2.000000     6.000000
50%     47.000000   27.400000    0.000000    1.000000    3.000000     7.000000
75%     61.000000   31.600000    0.000000    1.000000    4.000000     8.000000
max     89.000000   52.300000    1.000000    1.000000    5.000000    12.000000

In one line of code, you just got the count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum for every numerical column. We'll learn exactly what each of these means in Chapter 6. For now, just appreciate the power: eight summary statistics for six variables, instantly.

Look at the bmi column: the count is 487, not 500, because .describe() automatically excludes missing values. The mean BMI is about 28.1, the youngest respondent is 18, the oldest is 89, and the average age is about 48.
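
You don't have to summarize the whole table at once; .describe() also works on a single column. A sketch on a toy series of made-up ages, not the survey's:

```python
import pandas as pd

ages = pd.Series([18, 34, 47, 61, 89])

print(ages.describe())   # count, mean, std, min, quartiles, max for one column
print(ages.median())     # 47.0, the same number as the 50% row
```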

Check Your Understanding (try to answer before looking)

  1. Using the .describe() output above, what's the median age in this dataset?
  2. How many respondents are missing BMI data?
  3. The smoker variable has a mean of 0.178. What does that tell you, given that smoker is coded as 0 (no) or 1 (yes)?

Verify

  1. The median age is 47 (the "50%" row represents the median — the middle value when data is sorted).
  2. 500 total rows minus 487 non-null BMI values = 13 missing BMI values.
  3. When a variable is coded 0/1, its mean equals the proportion of 1s. So 0.178 means about 17.8% of respondents are smokers. This is a clever trick — remember it!
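
The 0/1 trick is easy to verify on a tiny example (toy codes, not the survey data):

```python
import pandas as pd

codes = pd.Series([0, 1, 0, 0, 1])   # two "yes" answers out of five

print(codes.sum())    # 2: summing counts the 1s
print(codes.mean())   # 0.4: the mean is the proportion of 1s
```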

3.6 Asking Questions: Filtering and Sorting

Loading data and computing summaries is useful, but the real power comes when you start asking specific questions. Let's explore two essential skills: filtering and sorting.

Filtering: "Show Me Only..."

Filtering means selecting only the rows that meet a specific condition. The syntax looks a little strange at first, but you'll get used to it fast.

Question: Which respondents are older than 65?

seniors = health[health['age'] > 65]
print(seniors.shape)

Output:

(98, 8)

There are 98 respondents over age 65 in our dataset. Let's break down the syntax:

  • health['age'] grabs the age column
  • health['age'] > 65 creates a True/False value for every row (True if age > 65, False otherwise)
  • health[...] keeps only the rows where the condition is True

The pattern is always: dataframe[dataframe['column'] condition]

Question: What's the average BMI of smokers vs. non-smokers?

smoker_bmi = health[health['smoker'] == 1]['bmi'].mean()
nonsmoker_bmi = health[health['smoker'] == 0]['bmi'].mean()
print(f"Smoker avg BMI: {round(smoker_bmi, 1)}")
print(f"Non-smoker avg BMI: {round(nonsmoker_bmi, 1)}")

Output:

Smoker avg BMI: 28.9
Non-smoker avg BMI: 27.9

Notice the == (double equals) for checking equality. This is different from = (single equals), which assigns a value. This trips up every beginner at least once. You'll learn to catch it quickly.

Question: How many respondents from California exercise regularly?

ca_exercisers = health[(health['state'] == 'CA') & (health['exercise'] == 1)]
print(len(ca_exercisers))

Output:

42

When you combine multiple conditions, use & for "and" and | for "or," and wrap each condition in parentheses.
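
Here is the same pattern with | ("or"), sketched on a toy DataFrame rather than the real survey:

```python
import pandas as pd

# Hypothetical mini-sample of four respondents
df = pd.DataFrame({'state': ['CA', 'TX', 'NY', 'CA'],
                   'age':   [70,   40,   68,   30]})

# Rows that are from CA OR over age 65; each condition gets its own parentheses
either = df[(df['state'] == 'CA') | (df['age'] > 65)]
print(len(either))   # 3: both CA rows plus the 68-year-old from NY
```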

Sorting: "Show Me in Order..."

Sorting arranges your data from smallest to largest (or largest to smallest) based on a column.

Question: Who are the 5 youngest respondents?

health.sort_values('age').head(5)

Question: Who are the 5 respondents with the highest BMI?

health.sort_values('bmi', ascending=False).head(5)

The ascending=False parameter flips the sort order — highest first.
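
.sort_values() also accepts a list of columns, sorting by the first and breaking ties with the next. A sketch on toy rows with hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({'position': ['PG', 'SG', 'PG'],
                   'pts':      [14.0, 17.0, 12.0]})

# Sort by position A to Z, then by points highest-first within each position
ordered = df.sort_values(['position', 'pts'], ascending=[True, False])
print(ordered['pts'].tolist())   # [14.0, 12.0, 17.0]
```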

Counting Categories: .value_counts()

For categorical variables (or any column with repeated values), .value_counts() counts how many times each value appears:

health['state'].value_counts().head(10)

Output:

CA    62
TX    48
FL    41
NY    38
PA    27
OH    24
IL    22
NC    19
MI    18
GA    16
Name: state, dtype: int64

California has the most respondents (62), followed by Texas (48) and Florida (41). This makes sense — these are the most populous states.

What about the gen_health variable?

health['gen_health'].value_counts().sort_index()

Output:

1     68
2    112
3    165
4    108
5     47
Name: gen_health, dtype: int64

Sorting by index (.sort_index()) arranges the values in order (1 through 5) rather than by count. Remember, this is an ordinal variable: 1 = Excellent, 2 = Very Good, 3 = Good, 4 = Fair, 5 = Poor. The most common response is "Good" (3), with 165 respondents.
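
If you want proportions instead of raw counts, .value_counts() takes a normalize=True argument. A sketch on a toy series:

```python
import pandas as pd

responses = pd.Series(['CA', 'TX', 'CA', 'NY', 'CA'])

print(responses.value_counts())                 # CA appears 3 times
print(responses.value_counts(normalize=True))   # CA is 0.6 of all rows
```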

Spaced Review (Chapter 1): Take a moment to think about this: when we calculated that 17.8% of respondents are smokers, were we doing descriptive statistics or inferential statistics? If we used that 17.8% to claim that "about 18% of American adults smoke," what kind of statistics would that be?

Verify

Calculating the percentage from our 500-person sample is descriptive statistics — we're summarizing the data we have. But generalizing to "all American adults" would be inferential statistics — we're reaching beyond our sample to make a claim about the population. The quality of that inference depends on how well our 500 respondents represent the U.S. adult population (which we'll study in Chapter 4).


3.7 Loading Alex Rivera's StreamVibe Data

Let's practice these skills with a second dataset — one that Alex Rivera might use at StreamVibe.

url2 = "https://raw.githubusercontent.com/intro-stats-data/datasets/main/streamvibe_users_300.csv"
stream = pd.read_csv(url2)
stream.head()

Output:

   user_id  age subscription  watch_hours  sessions  recommended_clicks  satisfaction  joined_year
0    U1001   24      premium        38.2        22                  15             4         2021
1    U1002   31         free        12.5         8                   3             2         2023
2    U1003   19      premium        45.7        31                  28             5         2020
3    U1004   42         free        15.8        10                   5             3         2022
4    U1005   27        basic        22.3        15                  10             3         2021

Let's get the full picture:

stream.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   user_id             300 non-null    object
 1   age                 300 non-null    int64
 2   subscription        300 non-null    object
 3   watch_hours         294 non-null    float64
 4   sessions            300 non-null    int64
 5   recommended_clicks  300 non-null    int64
 6   satisfaction        300 non-null    int64
 7   joined_year         300 non-null    int64
dtypes: float64(1), int64(5), object(2)
memory usage: 18.9+ KB

Now let's practice our variable classification skills from Chapter 2:

| Variable | pandas Type | Statistical Type | Subtype |
| --- | --- | --- | --- |
| user_id | object | Categorical | Nominal |
| age | int64 | Numerical | Continuous (treated as discrete here) |
| subscription | object | Categorical | Ordinal (free < basic < premium) |
| watch_hours | float64 | Numerical | Continuous |
| sessions | int64 | Numerical | Discrete |
| recommended_clicks | int64 | Numerical | Discrete |
| satisfaction | int64 | Categorical (ordinal) | Ordinal (1-5 scale) |
| joined_year | int64 | Numerical | Discrete |

Notice again: pandas can't tell you the statistical type. It just sees numbers and text. You need to bring your data classification skills from Chapter 2 to the table.

Let's ask Alex's central question — do premium subscribers watch more?

stream.groupby('subscription')['watch_hours'].mean().round(1)

Output:

subscription
basic      21.4
free       13.8
premium    39.6
Name: watch_hours, dtype: float64

Premium subscribers watch nearly three times as much as free users. Interesting — but is that because the premium subscription causes more watching, or because people who already watch a lot are more likely to pay for premium? That's the correlation vs. causation question from Chapter 1, and we'll need proper study design (Chapter 4) and statistical tests (Chapters 13-16) to answer it.
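
When one summary per group isn't enough, .agg() computes several at once. A sketch of the idea on a toy version of the subscription data, with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'subscription': ['free', 'premium', 'free', 'premium'],
                   'watch_hours':  [12.0,  40.0,      14.0,  38.0]})

# Count and mean of watch hours for each subscription tier
summary = df.groupby('subscription')['watch_hours'].agg(['count', 'mean'])
print(summary)
```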


3.8 Debugging Spotlight: When Things Go Wrong

Code breaks. It happens to everyone — beginners and experts alike. The difference between a frustrated beginner and a productive coder isn't that one makes fewer mistakes. It's that the productive coder reads the error message, identifies the problem, and fixes it.

Here are the most common errors you'll encounter, what they mean, and how to fix them.

Error 1: NameError

print(heatlh.head())
NameError: name 'heatlh' is not defined

What happened: You misspelled health as heatlh. Python is case-sensitive and doesn't do spell-check.

Fix: Check your spelling carefully. Also make sure you've run the cell that defined the variable. If you restarted the kernel, you need to re-run previous cells.

Error 2: FileNotFoundError

df = pd.read_csv("my_data.csv")
FileNotFoundError: [Errno 2] No such file or directory: 'my_data.csv'

What happened: Python can't find the file at the path you specified. Either the filename is wrong, the file isn't in the right location, or you're working in Google Colab and haven't uploaded the file.

Fix: In Colab, use the file panel on the left to upload your file, or use a URL with pd.read_csv(). Double-check the exact filename, including capitalization and the .csv extension.

Error 3: KeyError

health['Age']
KeyError: 'Age'

What happened: The column name doesn't exist in your DataFrame. Column names are case-sensitive — 'Age' is not the same as 'age'.

Fix: Use health.columns to see the exact column names, then match them exactly.

print(health.columns.tolist())

Error 4: TypeError

health['age' > 65]
TypeError: '>' not supported between instances of 'str' and 'int'

What happened: The bracket placement is wrong. The condition ended up inside the square brackets in place of the column name, so Python tried to compare the text 'age' to the number 65 — which it can't do.

Fix: Remember the pattern: df[df['column'] > value], not df['column' > value].

health[health['age'] > 65]  # correct

Error 5: ModuleNotFoundError

import panda as pd
ModuleNotFoundError: No module named 'panda'

What happened: The library name is misspelled. It's pandas (with an s), not panda.

Fix: Double-check library names. The common ones: pandas, numpy, matplotlib, scipy.

Pro tip: When you see an error, read the last line first. That's where Python tells you what went wrong. The lines above it (the "traceback") show where it went wrong. Most of the time, the last line plus the line number will tell you exactly what to fix.


3.9 The Spreadsheet Side: Excel and Google Sheets

Python isn't the only tool in your kit. Spreadsheets are genuinely useful for certain tasks, and many workplaces use them daily. Let's see how common operations compare.

Opening a CSV in Google Sheets

  1. Go to sheets.google.com and create a new spreadsheet
  2. Go to File > Import > Upload and select your CSV file
  3. Choose "Replace spreadsheet" and click "Import data"

Your data appears in the familiar grid format — rows and columns, everything visible at once.

Side-by-Side Comparison: Common Tasks

Here's how the same tasks look in Python vs. Google Sheets:

Task 1: See the first 5 rows

| Python | Sheets |
| --- | --- |
| health.head() | Just scroll up — the first rows are visible |

Task 2: Count the number of rows

| Python | Sheets |
| --- | --- |
| len(health) or health.shape[0] | Look at the row numbers on the left, or use =COUNTA(A:A)-1 |

Task 3: Calculate the average age

| Python | Sheets |
| --- | --- |
| health['age'].mean() | =AVERAGE(A2:A501) (assuming age is in column A) |

Task 4: Count smokers

| Python | Sheets |
| --- | --- |
| health['smoker'].sum() | =COUNTIF(D2:D501, 1) (assuming smoker is in column D) |

Task 5: Filter for people over 65

| Python | Sheets |
| --- | --- |
| health[health['age'] > 65] | Data > Create a filter > Click age column dropdown > Condition: Greater than 65 |

Task 6: Sort by BMI (highest first)

| Python | Sheets |
| --- | --- |
| health.sort_values('bmi', ascending=False) | Data > Sort range > By bmi column Z→A |

When to Use Which

Here's my honest recommendation:

Use a spreadsheet when:

  • You're doing quick, one-off calculations on small data (under ~1,000 rows)
  • You need to manually enter or edit data
  • You're sharing results with someone who doesn't know Python
  • You want to quickly eyeball data by scrolling through it

Use Python when:

  • Your dataset has more than ~1,000 rows
  • You need to reproduce your analysis later (or share exact steps)
  • You're doing anything that requires multiple steps (filter, then calculate, then graph)
  • You'll need to do the same analysis again on new data
  • You need statistical tests beyond basic averages and percentages

For this course, we'll primarily use Python because it scales, it's reproducible, and it teaches you a skill that transfers directly to the workplace. But don't be a tool snob — if the fastest way to check something is to paste 20 numbers into a spreadsheet, do it.

Spaced Review (Chapter 2): In the StreamVibe dataset above, subscription has three values: free, basic, and premium. Is this a nominal or ordinal variable? What about user_id?

Verify

subscription is ordinal — the categories have a natural order (free < basic < premium) that reflects increasing levels of service. user_id is nominal — it's a label with no meaningful order. Both are categorical variables, but the distinction matters because ordinal variables preserve rank information that nominal variables don't.


3.10 Sam Okafor's Basketball Data: Putting It All Together

Let's bring in one more dataset — the kind Sam Okafor would work with during his sports analytics internship. This time, you'll see a complete analysis workflow from load to insight.

url3 = "https://raw.githubusercontent.com/intro-stats-data/datasets/main/basketball_stats_200.csv"
ball = pd.read_csv(url3)
print(ball.shape)
ball.head()

Output:

(200, 9)
   player_id       team  games_played  pts_per_game  fg_pct  three_pct  rebounds  assists position
0      P0001  Raptors          72         18.4    0.462      0.371       6.2      3.1       SF
1      P0002  Hawks            68         12.1    0.438      0.342       4.8      5.7       PG
2      P0003  Lakers           75         22.7    0.487      0.389       8.1      2.4        C
3      P0004  Celtics          70         15.3    0.451      0.358       5.5      4.2       SG
4      P0005  Raptors          65         10.8    0.421      0.312       3.9      6.8       PG

Step 1: Understand the data

ball.describe().round(2)

This gives us summary statistics for all numerical columns — points per game, field goal percentage, three-point percentage, rebounds, and assists across 200 players.

Step 2: Ask a question

Sam might wonder: which position scores the most points per game?

ball.groupby('position')['pts_per_game'].mean().round(1).sort_values(ascending=False)

Output:

position
SG    17.8
SF    16.9
C     16.2
PF    15.5
PG    14.1
Name: pts_per_game, dtype: float64

Shooting guards (SG) average the most points per game. Centers (C) are third. Point guards (PG) average the fewest — which makes sense, since their primary role is distributing the ball, not scoring.

Step 3: Dig deeper

Are the best three-point shooters also the top scorers?

elite_shooters = ball[ball['three_pct'] > 0.40]
print(f"Elite shooters (>40% from three): {len(elite_shooters)}")
print(f"Their avg points: {elite_shooters['pts_per_game'].mean():.1f}")
print(f"League avg points: {ball['pts_per_game'].mean():.1f}")

Output:

Elite shooters (>40% from three): 28
Their avg points: 18.3
League avg points: 16.1

Players who shoot over 40% from three average 18.3 points per game, compared to a league average of 16.1. They score more — but we'd need to be careful about concluding that good three-point shooting causes higher scoring. Maybe the best scorers simply get more open looks from three. Correlation vs. causation strikes again.

Check Your Understanding (try to answer before looking)

  1. Write the code to find how many players are on the Raptors.
  2. Write the code to sort all players by rebounds (highest first) and show the top 3.
  3. What would ball['position'].value_counts() tell you?

Verify

  1. len(ball[ball['team'] == 'Raptors']) — filters for Raptors players and counts them.
  2. ball.sort_values('rebounds', ascending=False).head(3) — sorts by rebounds descending and shows top 3.
  3. It would show how many players play each position (PG, SG, SF, PF, C) — a frequency table for the categorical variable position.

3.11 Local Installation: Anaconda and JupyterLab (Optional)

Google Colab is great for getting started, and many students use it for the entire course. But if you want Python on your own machine — working offline, faster performance, more control — here's the recommended path.

Installing Anaconda

Anaconda is a free platform that bundles Python, Jupyter, pandas, and dozens of other libraries into one download.

  1. Go to anaconda.com/download
  2. Download the installer for your operating system (Windows, Mac, or Linux)
  3. Run the installer — accept all defaults
  4. When finished, open Anaconda Navigator from your applications

The entire installation is about 500 MB and takes 5-10 minutes on most computers.

Launching JupyterLab

From Anaconda Navigator, click Launch under JupyterLab. Your web browser will open with a local Jupyter environment. From here:

  1. Navigate to a folder where you want to save your work
  2. Click the + button and select Python 3 (ipykernel) under Notebook
  3. You now have a local Jupyter notebook — it works identically to Google Colab

What's an IDE?

You might hear the term IDE (Integrated Development Environment). An IDE is a software application that provides tools for writing, running, and debugging code. JupyterLab is one IDE. Others include VS Code, PyCharm, and Spyder (which comes with Anaconda). For this course, JupyterLab or Google Colab is all you need.

My recommendation: Start with Google Colab. It works immediately, requires no installation, and your notebooks are saved to Google Drive automatically. If you find yourself wanting faster performance or offline access later, install Anaconda then. You won't lose any skills — the code is identical in both environments.


3.12 Project Checkpoint: Loading Your Data Detective Dataset

It's time to apply what you've learned to your own project. In Chapters 1 and 2, you chose a dataset for your Data Detective Portfolio and identified the variable types. Now you're going to load it into a Jupyter notebook and run your first exploration.

Your Tasks

Open a new notebook (in Google Colab or JupyterLab) and complete these steps:

1. Create a title cell (Markdown/Text cell):

Write a header with your project title, your name, the date, and a one-sentence description of your dataset.

2. Load your data:

import pandas as pd

# Replace with your dataset's URL or file path
df = pd.read_csv("your_dataset.csv")

3. Initial exploration — run each of these and write a brief note about what you observe:

# How many rows and columns?
df.shape

# What do the first few rows look like?
df.head()

# What data types does pandas see?
df.dtypes

# Full column info (including missing values)
df.info()

# Summary statistics for numerical columns
df.describe()

4. Answer these questions in text cells:

- How many observational units are in your dataset?
- How many variables are there? How many are numerical? How many are categorical?
- Are there any missing values? Which columns have them?
- Does pandas correctly identify the variable types, or are some categorical variables stored as numbers?
- What's one interesting thing you notice from the .describe() output?

5. Try one filter or sort:

Write a filtering or sorting operation that answers a question you're curious about. For example:

- Filter for a specific subgroup and compare their average to the whole dataset
- Sort by a numerical variable and look at the extreme values
- Use .value_counts() on a categorical variable
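If you want a feel for what those operations look like before touching your own file, here's a sketch on a tiny made-up DataFrame; the column names group and score are placeholders for whatever your dataset actually contains:

```python
import pandas as pd

# Tiny made-up dataset; 'group' and 'score' are placeholder names.
df = pd.DataFrame({
    'group': ['A', 'B', 'A', 'B', 'A'],
    'score': [10, 15, 12, 20, 8],
})

# Filter a subgroup and compare its average to the whole dataset
print(df[df['group'] == 'A']['score'].mean())  # subgroup average: 10.0
print(df['score'].mean())                      # overall average: 13.0

# Sort and look at the extreme values
print(df.sort_values('score', ascending=False).head(2))

# Frequency table for a categorical variable
print(df['group'].value_counts())
```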

Suggested datasets and their CSV URLs:

- CDC BRFSS: Search "CDC BRFSS data" or use the course-provided sample
- Gapminder: gapminder.org/data (download as CSV)
- World Happiness Report: Available on Kaggle (search "world happiness report csv")
- College Scorecard: data.ed.gov (search "College Scorecard")
- NOAA Climate Data: ncdc.noaa.gov


Chapter Summary

You just learned to drive. Let's review what you covered:

The Big Ideas

  1. Tools amplify statistical thinking. They don't replace the judgment you developed in Chapters 1 and 2 — they make it possible to apply that judgment to real datasets with hundreds or thousands of rows.

  2. Jupyter notebooks combine code, output, and explanation. They're living documents where you write code in cells, run each cell, and see results immediately. Google Colab gives you this environment for free in your browser.

  3. pandas is your data Swiss Army knife. With pd.read_csv() you load data. With .head(), .info(), .describe(), and .value_counts(), you explore it. With filtering (df[df['col'] > val]) and sorting (df.sort_values()), you ask questions.

  4. Spreadsheets and Python both have their place. Spreadsheets excel at quick, small-scale, visual tasks. Python excels at large-scale, reproducible, multi-step analyses. Know when to use each.

Commands You Learned

| Command | What It Does |
| --- | --- |
| import pandas as pd | Load the pandas library |
| pd.read_csv(path) | Load a CSV file into a DataFrame |
| df.head(n) | Show first n rows (default 5) |
| df.tail(n) | Show last n rows |
| df.shape | Number of rows and columns |
| df.dtypes | Data type of each column |
| df.info() | Full summary: types, non-null counts, memory |
| df.describe() | Statistical summary of numerical columns |
| df.columns | List all column names |
| df['col'].value_counts() | Count occurrences of each value |
| df[df['col'] > val] | Filter rows by condition |
| df.sort_values('col') | Sort by column (ascending) |
| df.sort_values('col', ascending=False) | Sort by column (descending) |
| df.groupby('col')['other'].mean() | Group by one column and average another |
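To see how these commands chain together, here's a short sketch on a made-up four-row DataFrame (the team and score columns are illustrative, not from a course dataset):

```python
import pandas as pd

# Made-up data for demonstrating the commands above.
df = pd.DataFrame({
    'team':  ['A', 'A', 'B', 'B'],
    'score': [10, 14, 8, 12],
})

print(df.shape)                            # (4, 2): rows and columns
print(df.head())                           # peek at the data
print(df[df['score'] > 9].shape[0])        # filter, then count: 3 rows
print(df.groupby('team')['score'].mean())  # group by team, average scores
```

The pattern is always the same: load, inspect, filter or group, summarize. Every analysis in later chapters is a variation on this chain.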

What's Coming Next

In Chapter 4, you'll learn how data gets collected — sampling methods, experimental design, and the biases that can make a dataset misleading even if the analysis is perfect. You'll evaluate how your own dataset was collected and what limitations that creates.

Then in Chapter 5, you'll learn to visualize data with graphs and charts. You'll combine your pandas skills with matplotlib and seaborn to create histograms, bar charts, and scatterplots that reveal patterns no summary statistic can capture.

And in Chapter 7, you'll tackle one of the messiest real-world challenges: cleaning data. Those missing values we spotted in the bmi and exercise columns? You'll learn exactly how to handle them.


Spaced Review: Strengthening Previous Learning

These questions revisit concepts from earlier chapters at expanding intervals, helping you build long-term retention.

SR.1 (From Chapter 1 — Statistical Thinking): In Section 3.5, we calculated that 17.8% of respondents are smokers and then asked whether we could generalize to all Americans. This is the bridge between descriptive and inferential statistics — the core of statistical thinking. In your own words, why can't we simply assume our sample percentage matches the population percentage?

Check your thinking Because our 500 respondents are a **sample**, not the entire **population** of American adults. Different samples would give different percentages (sampling variability). The 17.8% is a **statistic** (from the sample) that estimates the true **parameter** (the actual smoking rate in the population). How close our estimate is depends on factors like sample size and how the sample was selected — topics we'll formalize in Chapters 11-12.
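You can watch sampling variability happen with a short simulation. This sketch assumes, purely for illustration, that the true population smoking rate is exactly 17.8%, then draws five independent samples of 500 "people" each:

```python
import random

random.seed(42)  # make this sketch reproducible

TRUE_RATE = 0.178  # assumed population smoking rate, for illustration only

sample_rates = []
for _ in range(5):
    # Each "person" smokes with probability TRUE_RATE
    smokers = sum(random.random() < TRUE_RATE for _ in range(500))
    sample_rates.append(smokers / 500)

print([f"{100 * p:.1f}%" for p in sample_rates])
```

Each pass through the loop is a different "survey," and each gives a slightly different percentage even though the population never changed. That spread is exactly why a single sample statistic can't be assumed to equal the population parameter.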

SR.2 (From Chapter 2 — Categorical vs. Numerical Variables): We noticed that pandas stored smoker (0/1), exercise (0/1), and gen_health (1-5) as integers, but they're actually categorical variables. Why does this misclassification matter? What could go wrong if you treated gen_health as a truly numerical variable?

Check your thinking If you treat `gen_health` as numerical, you might calculate its mean (we saw it was 2.834) and interpret it as if the scale has equal intervals — but the "distance" between Excellent (1) and Very Good (2) might not be the same as between Fair (4) and Poor (5). It's an **ordinal** variable: the numbers represent ranks, not measured quantities. Computing a mean of ranks can be misleading. This is why variable classification from Chapter 2 matters before you start analyzing.
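Here's the pitfall in miniature, with made-up gen_health codes (1 = Excellent through 5 = Poor):

```python
import pandas as pd

# Made-up ordinal codes: 1 = Excellent ... 5 = Poor
gen_health = pd.Series([1, 2, 2, 3, 5, 5, 5, 5])

# The mean treats the codes as equally spaced measurements...
print(gen_health.mean())  # 3.5, but "3.5" is not a health category

# ...while a frequency table respects the ordinal scale.
print(gen_health.value_counts().sort_index())
```

The frequency table shows what the mean hides: half of these made-up respondents chose Poor, and nobody chose anything between Good and Poor. Averaging rank codes smooths over exactly that kind of structure.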

SR.3 (From Chapter 2 — Observational Units): When we loaded the basketball dataset and ran ball.shape to get (200, 9), what are the observational units? If the dataset instead had one row per game (not per player), how would the observational units change, and what would that mean for the variables?

Check your thinking In the current dataset, each **observational unit** is a player — each row represents one player's season statistics. If the data were organized by game instead, each observational unit would be a single game performance. The same player would appear in multiple rows (one per game), and variables like `pts_per_game` would become `pts_in_this_game`. The choice of observational unit shapes what questions you can answer and how you interpret the results.

What's Next

You now have tools in your hands. You can load data, look at it, summarize it, filter it, and sort it. That's a real foundation.

But here's the thing — every analysis we did in this chapter assumed the data was trustworthy. We took the BRFSS survey data at face value without asking how it was collected. We compared smokers to non-smokers without considering whether the comparison was fair. We noted that premium StreamVibe subscribers watch more without asking whether the subscription caused the extra watching.

In Chapter 4: Designing Studies, you'll learn why these questions matter — and how the way data is collected determines what conclusions you can draw from it. You'll encounter sampling bias, confounding variables, and the crucial distinction between observational studies and experiments. And you'll evaluate how your own Data Detective dataset was collected.

The tools you learned today will be with you for the rest of the course. In every chapter from here forward, you'll open a notebook, load data, and apply new techniques. The foundation is set. The toolkit is open. Now the real exploration begins.