Learning Objectives
- Create DataFrames from dictionaries, lists, and CSV files using pandas constructors and read_csv
- Select columns and filter rows using bracket notation, .loc, and .iloc with boolean conditions
- Sort DataFrames by one or multiple columns and reset indexes appropriately
- Create new computed columns using vectorized operations and apply()
- Explain the relationship between Series and DataFrame and why vectorized operations are preferred over loops
In This Chapter
- Chapter Overview
- 7.1 Why pandas? — The Case for a Data Library
- 7.2 Your First DataFrame
- 7.3 Series: The Building Block
- 7.4 Selecting Data: Columns and Rows
- 7.5 Filtering Rows: Boolean Indexing
- 7.6 Sorting Data
- 7.7 Creating New Columns: Vectorized Operations
- 7.8 Loading Real Data: read_csv Deep Dive
- 7.9 Method Chaining: Composing Operations
- 7.10 Debugging Walkthrough: Common pandas Errors
- 7.11 Project Checkpoint: Rebuilding Chapter 6 in pandas
- 7.12 Spaced Review: Concepts from Chapters 1-6
- 7.13 The Grammar of Data Manipulation: A Framework
- 7.14 What's Coming Next
- Chapter Summary
Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
"The purpose of computing is insight, not numbers." — Richard Hamming
Chapter Overview
Remember the end of Chapter 6?
You had just loaded nearly 5,000 rows of WHO vaccination data using Python's csv module. You wrote loops to count records by region. You built a function to extract numeric values from string columns, carefully skipping empty strings. You computed means and medians by hand. You grouped data by region using nested loops and dictionaries. And when you were done, you had a working analysis — one that required roughly 60 lines of careful, manual Python to answer what felt like simple questions.
It worked. And it taught you a lot. But be honest with yourself: did it feel... heavy? Did you notice how much code was dedicated to plumbing — converting strings to numbers, handling missing values, iterating row by row — rather than expressing what you actually wanted to know?
This chapter is where that weight lifts.
You're about to meet pandas, the Python library that transforms how you work with data. The operation that took you 12 lines in Chapter 6 — filtering vaccination records for a specific region, extracting numeric coverage values, and computing the mean — becomes this:
df[df['region'] == 'AFRO']['coverage_pct'].mean()
One line. Readable. Expressive. No loops. No manual type conversion. No string-to-float gymnastics. Just a clear statement of intent: "Give me the mean coverage percentage for the AFRO region."
That single line contains five new ideas (DataFrames, column selection, boolean indexing, Series methods, and vectorized operations), and by the end of this chapter, every piece of it will feel natural to you. More importantly, you'll understand why it works — not just how to type it.
In this chapter, you will learn to:
- Create DataFrames from dictionaries, lists, and CSV files using pandas constructors and read_csv
- Select columns and filter rows using bracket notation, .loc, and .iloc with boolean conditions
- Sort DataFrames by one or multiple columns and reset indexes appropriately
- Create new computed columns using vectorized operations and apply()
- Explain the relationship between Series and DataFrame and why vectorized operations are preferred over loops
7.1 Why pandas? — The Case for a Data Library
Before we dive into syntax, let's understand why pandas exists by looking at what we built in Chapter 6 and what it cost us.
The Pain You Already Felt
Cast your mind back to the code you wrote for computing mean vaccination coverage by region. Here's a condensed version:
# Chapter 6 approach: Mean coverage by region (pure Python)
import csv
data = []
with open("who_vaccination_data.csv", "r", encoding="utf-8") as f:
reader = csv.DictReader(f)
for row in reader:
data.append(row)
# Group by region and compute means
region_stats = {}
for row in data:
region = row["region"]
raw = row["coverage_pct"].strip()
if raw == "":
continue
try:
value = float(raw)
except ValueError:
continue
if region not in region_stats:
region_stats[region] = []
region_stats[region].append(value)
for region in sorted(region_stats):
values = region_stats[region]
mean_val = sum(values) / len(values)
print(f"{region}: {mean_val:.1f}%")
Count the lines. Count the concerns. This code is doing at least six things:
1. File I/O: Opening the file, creating a reader, iterating rows
2. Type conversion: Converting strings to floats
3. Missing value handling: Skipping empty strings
4. Error handling: Catching ValueError for non-numeric strings
5. Grouping logic: Building a dictionary of region-to-values-list
6. Aggregation: Computing the mean manually
And only item 6 — the aggregation — is the thing we actually care about. The rest is plumbing.
The pandas Way
Here's the same analysis in pandas:
import pandas as pd
df = pd.read_csv("who_vaccination_data.csv")
df.groupby("region")["coverage_pct"].mean()
Three lines. And two of them are the import and the file load — things you do once per notebook. The actual analysis is one line.
But this isn't just about fewer keystrokes. The pandas version is:
- Faster to write — you express intent, not mechanics
- Faster to run — pandas operates on entire columns at once using optimized C code under the hood
- Easier to read — even someone unfamiliar with pandas can guess what groupby("region")["coverage_pct"].mean() does
- Safer — pandas handles type conversion and missing values (NaN) automatically
This is the promise of pandas: you think about what you want to know, and pandas handles how to compute it.
What Is pandas, Exactly?
pandas is an open-source Python library for data manipulation and analysis. The name comes from "panel data," a term from econometrics for multi-dimensional structured data (though the creator, Wes McKinney, has also said it stands for "Python Data Analysis Library"). McKinney created pandas in 2008 while working at a hedge fund, because he needed tools for financial data analysis that didn't exist in Python at the time.
Today, pandas is one of the most-used libraries in all of Python. If you do data science in Python, you use pandas. It provides two fundamental data structures:
- DataFrame — a two-dimensional table, like a spreadsheet or a SQL table
- Series — a one-dimensional column, like a single column from a spreadsheet
Everything else in pandas — filtering, sorting, grouping, merging, reshaping — is built on operations over these two structures.
The Import Convention
By universal convention, pandas is imported with the alias pd:
import pandas as pd
You'll see this line at the top of virtually every data science notebook in the world. The alias pd is so standard that writing import pandas (without the alias) would actually confuse experienced readers. Stick with pd.
Check Your Understanding
Before we go further, make sure you can articulate:
- What were the two biggest sources of friction when analyzing CSV data with pure Python in Chapter 6?
- In the pandas one-liner df.groupby("region")["coverage_pct"].mean(), what does each piece do?
- Why might "fewer lines of code" not be a sufficient argument for using a library? (Hint: what else matters?)
Answers
- (a) Manual type conversion — everything loads as strings; (b) manual iteration — computing grouped statistics required nested loops and dictionaries.
- df is the DataFrame; .groupby("region") groups rows by the region column; ["coverage_pct"] selects the coverage column within each group; .mean() computes the average for each group.
- Readability, correctness, performance, and maintainability also matter. A clever one-liner that nobody can understand is worse than a clear five-liner.
7.2 Your First DataFrame
Let's build DataFrames from scratch before we load files. Understanding how DataFrames are constructed will help you understand how they work.
Creating a DataFrame from a Dictionary
The most common way to create a DataFrame by hand is from a Python dictionary, where each key is a column name and each value is a list of values for that column:
import pandas as pd
# Each key = column name, each value = list of column values
country_data = {
"country": ["Brazil", "Canada", "Chad", "Denmark", "Ethiopia"],
"region": ["AMRO", "AMRO", "AFRO", "EURO", "AFRO"],
"coverage_pct": [72.3, 85.1, 41.7, 93.2, 68.5],
"year": [2022, 2022, 2022, 2022, 2022]
}
df = pd.DataFrame(country_data)
print(df)
country region coverage_pct year
0 Brazil AMRO 72.3 2022
1 Canada AMRO 85.1 2022
2 Chad AFRO 41.7 2022
3 Denmark EURO 93.2 2022
4 Ethiopia AFRO 68.5 2022
Look at that output. It's a table. Rows and columns, with labels. The numbers on the left (0, 1, 2, 3, 4) are the index — pandas's way of labeling each row. By default, the index is just sequential integers starting at 0, like list indices.
Notice what you didn't have to do:
- You didn't specify data types. pandas figured out that coverage_pct is a float and year is an integer.
- You didn't worry about alignment. pandas made sure each list has the same length (if they didn't, you'd get an error — which is actually a helpful safety check).
- You didn't write a display function. DataFrames know how to display themselves as pretty tables.
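That length check is easy to see for yourself. A minimal sketch, assuming nothing beyond pandas itself, that feeds the constructor deliberately mismatched columns:

```python
import pandas as pd

# Mismatched column lengths raise a ValueError instead of
# silently misaligning the data.
try:
    pd.DataFrame({
        "country": ["Brazil", "Canada", "Chad"],  # 3 values
        "coverage_pct": [72.3, 85.1],             # only 2 values
    })
except ValueError as e:
    print(f"pandas refused the mismatched columns: {e}")
```

Failing loudly here is a feature: a table where column values have drifted out of alignment is far harder to debug than an error at construction time.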
Creating a DataFrame from a List of Dictionaries
Remember our Chapter 6 data structure? We loaded CSV data into a list of dictionaries, where each dictionary was one row. pandas can create a DataFrame directly from that:
# This is exactly what csv.DictReader gave us in Chapter 6
rows = [
{"country": "Brazil", "region": "AMRO", "coverage_pct": 72.3},
{"country": "Canada", "region": "AMRO", "coverage_pct": 85.1},
{"country": "Chad", "region": "AFRO", "coverage_pct": 41.7},
]
df = pd.DataFrame(rows)
print(df)
country region coverage_pct
0 Brazil AMRO 72.3
1 Canada AMRO 85.1
2 Chad AFRO 41.7
This means you could take everything you loaded in Chapter 6 and turn it into a DataFrame in one line: df = pd.DataFrame(data). We'll do something even better shortly with read_csv.
Inspecting a DataFrame: shape, dtypes, head, describe
Every time you create or load a DataFrame, your first move should be to inspect it. pandas provides a set of attributes and methods that give you the same information you painstakingly extracted in Chapter 6 — but instantly.
import pandas as pd
# Create a slightly larger example
df = pd.DataFrame({
"country": ["Brazil", "Canada", "Chad", "Denmark", "Ethiopia",
"Finland", "Ghana", "Haiti", "India", "Japan"],
"region": ["AMRO", "AMRO", "AFRO", "EURO", "AFRO",
"EURO", "AFRO", "AMRO", "SEARO", "WPRO"],
"coverage_pct": [72.3, 85.1, 41.7, 93.2, 68.5,
95.0, 78.4, 52.1, 88.0, 97.5],
"year": [2022, 2022, 2022, 2022, 2022,
2022, 2022, 2022, 2022, 2022]
})
shape — the dimensions, as a tuple of (rows, columns):
print(df.shape)
(10, 4)
In Chapter 6, this was (len(data), len(data[0])) — two separate expressions. Now it's one attribute.
dtypes — the data type of each column:
print(df.dtypes)
country object
region object
coverage_pct float64
year int64
dtype: object
pandas uses object for text/string data, float64 for decimal numbers, and int64 for whole numbers. In Chapter 6, all columns were strings because csv.DictReader doesn't do type inference. pandas does.
head() — the first few rows (default: 5):
print(df.head(3))
country region coverage_pct year
0 Brazil AMRO 72.3 2022
1 Canada AMRO 85.1 2022
2 Chad AFRO 41.7 2022
In Chapter 6, this was for row in data[:3]: print(row). Now it's one method call.
tail() — the last few rows:
print(df.tail(2))
country region coverage_pct year
8 India SEARO 88.0 2022
9 Japan WPRO 97.5 2022
describe() — summary statistics for all numeric columns:
print(df.describe())
coverage_pct year
count 10.000000 10.000000
mean 77.180000 2022.000000
std 18.724245 0.000000
min 41.700000 2022.000000
25% 69.450000 2022.000000
50% 81.750000 2022.000000
75% 91.900000 2022.000000
max 97.500000 2022.000000
Stop and take that in. In Chapter 6, you wrote a summarize() function with 12 lines of code that computed count, min, max, mean, median, and range. The describe() method gives you count, mean, standard deviation, min, three quartiles (25%, 50%, 75%), and max — for every numeric column at once. With zero lines of custom code.
This is what pandas does. It turns your custom plumbing into one-word method calls.
info() — a concise summary of the DataFrame:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 10 non-null object
1 region 10 non-null object
2 coverage_pct 10 non-null float64
3 year 10 non-null int64
dtypes: float64(1), int64(1), object(2)
memory usage: 448.0+ bytes
This single call tells you: how many rows, how many columns, each column's name, how many non-null values each has (instantly revealing missing data), each column's type, and even memory usage. It's like your Chapter 6 data dictionary built automatically.
Comparison Table: Chapter 6 vs. Chapter 7
| Task | Chapter 6 (Pure Python) | Chapter 7 (pandas) |
|---|---|---|
| Load CSV | csv.DictReader + loop + list | pd.read_csv() |
| Row count | len(data) | df.shape[0] or len(df) |
| Column names | list(data[0].keys()) | df.columns |
| First 5 rows | for row in data[:5]: print(row) | df.head() |
| Data types | Everything is strings | df.dtypes (auto-detected) |
| Summary stats | Custom summarize() function | df.describe() |
| Missing value count | Loop + counter | df.isnull().sum() |
| Column values | [row["col"] for row in data] | df["col"] |
7.3 Series: The Building Block
A Series is a one-dimensional labeled array. If a DataFrame is a table, a Series is a single column (or row) of that table.
Extracting a Series from a DataFrame
When you select a single column from a DataFrame, you get a Series:
coverage = df["coverage_pct"]
print(type(coverage))
<class 'pandas.core.series.Series'>
print(coverage)
0 72.3
1 85.1
2 41.7
3 93.2
4 68.5
5 95.0
6 78.4
7 52.1
8 88.0
9 97.5
Name: coverage_pct, dtype: float64
A Series has three key properties:
- Values: the actual data (72.3, 85.1, 41.7, ...)
- Index: the labels for each value (0, 1, 2, ...)
- Name: the column name it came from (coverage_pct)
Series Methods: Statistics in One Word
Here's where the magic happens. A Series has methods for every common statistical operation:
coverage = df["coverage_pct"]
print(f"Mean: {coverage.mean():.1f}")
print(f"Median: {coverage.median():.1f}")
print(f"Std: {coverage.std():.1f}")
print(f"Min: {coverage.min():.1f}")
print(f"Max: {coverage.max():.1f}")
print(f"Sum: {coverage.sum():.1f}")
print(f"Count: {coverage.count()}")
Mean: 77.2
Median: 81.8
Std: 18.7
Min: 41.7
Max: 97.5
Sum: 771.8
Count: 10
Every single one of those operations would have required a function or loop in Chapter 6. Now they're one method call each.
Creating a Series Directly
You can also create a Series on its own:
# From a list
temperatures = pd.Series([22.1, 19.8, 25.4, 18.3, 22.1])
print(temperatures)
0 22.1
1 19.8
2 25.4
3 18.3
4 22.1
dtype: float64
# With a custom index
temperatures = pd.Series(
[22.1, 19.8, 25.4, 18.3, 22.1],
index=["Mon", "Tue", "Wed", "Thu", "Fri"],
name="temperature_c"
)
print(temperatures)
Mon 22.1
Tue 19.8
Wed 25.4
Thu 18.3
Fri 22.1
Name: temperature_c, dtype: float64
Now temperatures["Wed"] returns 25.4. The index gives your data meaning beyond just position.
The Relationship Between Series and DataFrame
Here's the conceptual model to keep in your head:
- A DataFrame is a collection of Series that share the same index
- Each column of a DataFrame is a Series
- Each Series has the same number of elements as the DataFrame has rows
# A DataFrame is like a dictionary of Series
print(type(df["country"])) # <class 'pandas.core.series.Series'>
print(type(df["region"])) # <class 'pandas.core.series.Series'>
print(type(df["coverage_pct"])) # <class 'pandas.core.series.Series'>
This isn't just a metaphor — it's literally how pandas is structured internally. Understanding this relationship will help you predict what operations return: selecting one column gives you a Series; selecting multiple columns gives you a DataFrame.
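That prediction is easy to verify directly. A short sketch (column names reuse the chapter's running example) showing that a single name in brackets yields a Series while a list of names yields a DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Brazil", "Canada", "Chad"],
    "coverage_pct": [72.3, 85.1, 41.7],
})

one_col = df["coverage_pct"]                  # single name -> Series
many_cols = df[["country", "coverage_pct"]]   # list of names -> DataFrame
one_col_frame = df[["coverage_pct"]]          # a list with ONE name -> still a DataFrame

print(type(one_col).__name__)        # Series
print(type(many_cols).__name__)      # DataFrame
print(type(one_col_frame).__name__)  # DataFrame
```

The last case is worth remembering: it is the brackets-inside-brackets, not the number of columns, that determines whether you get a Series or a one-column DataFrame.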
7.4 Selecting Data: Columns and Rows
Selecting specific columns and rows is the most fundamental operation in data manipulation. pandas gives you multiple ways to do this, and understanding when to use which one will save you hours of confusion.
Selecting Columns
Single column — bracket notation (returns a Series):
countries = df["country"]
print(countries)
0 Brazil
1 Canada
2 Chad
3 Denmark
4 Ethiopia
5 Finland
6 Ghana
7 Haiti
8 India
9 Japan
Name: country, dtype: object
Single column — dot notation (also returns a Series):
countries = df.country # Same as df["country"]
Dot notation is shorter, but it has limitations: it doesn't work if the column name has spaces, special characters, or conflicts with a DataFrame method. For example, df.count would call the count() method, not select a column named "count." My recommendation: always use bracket notation. It's more explicit and works in every case.
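Here is that name collision in action. A small sketch using a hypothetical column deliberately named count to trigger the conflict:

```python
import pandas as pd

df = pd.DataFrame({"count": [3, 1, 4], "label": ["a", "b", "c"]})

# Dot notation resolves to the DataFrame's count() method, not the column
print(callable(df.count))     # True -- this is a method, not your data

# Bracket notation always means "the column"
print(df["count"].tolist())   # [3, 1, 4]
```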
Multiple columns — pass a list (returns a DataFrame):
subset = df[["country", "coverage_pct"]]
print(subset)
country coverage_pct
0 Brazil 72.3
1 Canada 85.1
2 Chad 41.7
3 Denmark 93.2
4 Ethiopia 68.5
5 Finland 95.0
6 Ghana 78.4
7 Haiti 52.1
8 India 88.0
9 Japan 97.5
Notice the double brackets: df[["country", "coverage_pct"]]. The outer brackets are the selection operator. The inner brackets create a list of column names. This is a common source of confusion for beginners — if you write df["country", "coverage_pct"] (single brackets with two arguments), you'll get a KeyError.
Selecting Rows by Position: iloc
iloc stands for "integer location." It selects rows (and columns) by their numeric position, just like list indexing:
# First row
print(df.iloc[0])
country Brazil
region AMRO
coverage_pct 72.3
year 2022
Name: 0, dtype: object
# Rows 2 through 4 (exclusive end, just like list slicing)
print(df.iloc[2:5])
country region coverage_pct year
2 Chad AFRO 41.7 2022
3 Denmark EURO 93.2 2022
4 Ethiopia AFRO 68.5 2022
# Specific rows and columns by position
print(df.iloc[0:3, 0:2]) # First 3 rows, first 2 columns
country region
0 Brazil AMRO
1 Canada AMRO
2 Chad AFRO
When to use iloc: When you know the position of what you want. "Give me the first three rows" or "give me row 42."
Selecting Rows by Label: loc
loc selects by label — the index values and column names:
# With the default integer index, loc and iloc look similar
print(df.loc[0]) # Row with index label 0
country Brazil
region AMRO
coverage_pct 72.3
year 2022
Name: 0, dtype: object
The difference between loc and iloc becomes crucial when the index isn't simple integers. We'll see this more in later chapters. For now, the key rule:
- loc — uses labels (index values and column names)
- iloc — uses integer positions (like list indices)
# loc can select specific rows and columns by name
print(df.loc[0:3, ["country", "coverage_pct"]])
country coverage_pct
0 Brazil 72.3
1 Canada 85.1
2 Chad 41.7
3 Denmark 93.2
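A quick sketch of why the distinction starts to matter once the index carries labels rather than integers (the country-indexed frame here is an illustration, not the chapter's main dataset):

```python
import pandas as pd

df = pd.DataFrame(
    {"coverage_pct": [72.3, 85.1, 41.7]},
    index=["Brazil", "Canada", "Chad"],  # labels instead of 0, 1, 2
)

print(df.loc["Canada", "coverage_pct"])  # by label -> 85.1
print(df.iloc[1]["coverage_pct"])        # by position -> 85.1 (same row)
# df.loc[1] would raise a KeyError here: 1 is not a label in this index
```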
Warning
Notice that loc[0:3] includes row 3, while iloc[0:3] excludes it. This is a critical difference:
- iloc uses Python-style slicing (exclusive end)
- loc uses label-based slicing (inclusive end)
This catches many beginners off guard. When in doubt, check your output.
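You can see the off-by-one difference side by side. A minimal sketch with a throwaway five-row frame and the default integer index:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40, 50]})

print(len(df.iloc[0:3]))  # 3 rows: positions 0, 1, 2 (exclusive end)
print(len(df.loc[0:3]))   # 4 rows: labels 0, 1, 2, 3 (inclusive end)
```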
A Quick Reference for Selection
| What You Want | Syntax | Returns |
|---|---|---|
| One column | df["col"] |
Series |
| Multiple columns | df[["col1", "col2"]] |
DataFrame |
| One row by position | df.iloc[n] |
Series |
| Multiple rows by position | df.iloc[start:stop] |
DataFrame |
| One row by label | df.loc[label] |
Series |
| Rows + columns by position | df.iloc[rows, cols] |
DataFrame |
| Rows + columns by label | df.loc[rows, cols] |
DataFrame |
7.5 Filtering Rows: Boolean Indexing
This is one of the most powerful ideas in pandas, and it's built on a concept you already know from Chapter 4: boolean expressions. In pure Python, you filter data with a for loop and an if statement. In pandas, you filter with boolean indexing.
The Concept: Boolean Masks
Watch what happens when you write a comparison on a Series:
print(df["coverage_pct"] > 80)
0 False
1 True
2 False
3 True
4 False
5 True
6 False
7 False
8 True
9 True
Name: coverage_pct, dtype: bool
The result is a Series of True and False values — one for every row in the DataFrame. This is called a boolean mask. It's like a filter template: True means "keep this row," False means "drop it."
Now, pass that mask back into the DataFrame's brackets:
high_coverage = df[df["coverage_pct"] > 80]
print(high_coverage)
country region coverage_pct year
1 Canada AMRO 85.1 2022
3 Denmark EURO 93.2 2022
5 Finland EURO 95.0 2022
8 India SEARO 88.0 2022
9 Japan WPRO 97.5 2022
Five countries with coverage above 80%. No loops. No if statements. One line.
Contrast with Chapter 6
Here's how you would have done this in Chapter 6:
# Chapter 6 approach: Filter for high coverage
high_coverage = []
for row in data:
raw = row["coverage_pct"].strip()
if raw == "":
continue
try:
value = float(raw)
except ValueError:
continue
if value > 80:
high_coverage.append(row)
print(f"Found {len(high_coverage)} countries with coverage > 80%")
That's eleven lines, including type conversion, error handling, and missing value logic. The pandas version is one line because pandas already knows the column is numeric (it converted it during read_csv) and already handles missing values (they're NaN, and comparisons against NaN evaluate to False, so those rows are simply excluded).
Combining Conditions
You can combine boolean conditions using & (and), | (or), and ~ (not). Important: Each condition must be wrapped in parentheses.
# Countries in AFRO with coverage above 60
afro_high = df[(df["region"] == "AFRO") & (df["coverage_pct"] > 60)]
print(afro_high)
country region coverage_pct year
4 Ethiopia AFRO 68.5 2022
6 Ghana AFRO 78.4 2022
# Countries in AMRO or EURO
amro_or_euro = df[(df["region"] == "AMRO") | (df["region"] == "EURO")]
print(amro_or_euro)
country region coverage_pct year
0 Brazil AMRO 72.3 2022
1 Canada AMRO 85.1 2022
3 Denmark EURO 93.2 2022
5 Finland EURO 95.0 2022
7 Haiti AMRO 52.1 2022
For checking membership in a set of values, use isin():
# Cleaner alternative for multiple values
amro_or_euro = df[df["region"].isin(["AMRO", "EURO"])]
For negation, use ~:
# Everything EXCEPT AFRO
not_afro = df[~(df["region"] == "AFRO")]
# Or equivalently:
not_afro = df[df["region"] != "AFRO"]
Common Mistake: Using and instead of &
Python's built-in and operator doesn't work with pandas boolean masks. If you write df[df["region"] == "AFRO" and df["coverage_pct"] > 60], you'll get a ValueError with the message "The truth value of a Series is ambiguous." Use & for "and," | for "or," and ~ for "not," and always wrap each condition in parentheses.
7.6 Sorting Data
Sorting is straightforward in pandas, and the method name tells you exactly what it does.
Sorting by a Single Column
# Sort by coverage, lowest first (ascending is the default)
sorted_df = df.sort_values("coverage_pct")
print(sorted_df)
country region coverage_pct year
2 Chad AFRO 41.7 2022
7 Haiti AMRO 52.1 2022
4 Ethiopia AFRO 68.5 2022
0 Brazil AMRO 72.3 2022
6 Ghana AFRO 78.4 2022
1 Canada AMRO 85.1 2022
8 India SEARO 88.0 2022
3 Denmark EURO 93.2 2022
5 Finland EURO 95.0 2022
9 Japan WPRO 97.5 2022
Notice that the index values (0, 2, 7, 4, ...) are preserved — they travel with their rows. This means the original row identity is maintained even after sorting.
# Sort by coverage, highest first
sorted_df = df.sort_values("coverage_pct", ascending=False)
print(sorted_df.head())
country region coverage_pct year
9 Japan WPRO 97.5 2022
5 Finland EURO 95.0 2022
3 Denmark EURO 93.2 2022
8 India SEARO 88.0 2022
1 Canada AMRO 85.1 2022
Sorting by Multiple Columns
# Sort by region (alphabetically), then by coverage (highest first) within each region
sorted_df = df.sort_values(["region", "coverage_pct"], ascending=[True, False])
print(sorted_df)
country region coverage_pct year
6 Ghana AFRO 78.4 2022
4 Ethiopia AFRO 68.5 2022
2 Chad AFRO 41.7 2022
1 Canada AMRO 85.1 2022
0 Brazil AMRO 72.3 2022
7 Haiti AMRO 52.1 2022
5 Finland EURO 95.0 2022
3 Denmark EURO 93.2 2022
8 India SEARO 88.0 2022
9 Japan WPRO 97.5 2022
The ascending parameter takes a list matching the columns list: True for ascending, False for descending.
Resetting the Index
After sorting (or filtering), you might want a clean 0-based index:
sorted_df = df.sort_values("coverage_pct").reset_index(drop=True)
print(sorted_df)
country region coverage_pct year
0 Chad AFRO 41.7 2022
1 Haiti AMRO 52.1 2022
2 Ethiopia AFRO 68.5 2022
3 Brazil AMRO 72.3 2022
...
The drop=True parameter prevents pandas from adding the old index as a new column. Without drop=True, you'd get an extra column called index — rarely what you want.
7.7 Creating New Columns: Vectorized Operations
This is the section where the threshold concept of this chapter comes alive.
Threshold Concept: Thinking in Vectors Rather Than Loops
Up until now, when you wanted to do something to every item in a collection, you wrote a loop:
# Loop thinking: process one item at a time
results = []
for value in coverage_values:
    results.append(value / 100)
In pandas, you think differently. Instead of processing one value at a time, you operate on entire columns at once:
# Vector thinking: operate on the whole column
df["coverage_decimal"] = df["coverage_pct"] / 100
That single line divides every value in the coverage_pct column by 100 and stores the results in a new column called coverage_decimal. No loop. No appending. The operation applies to all rows simultaneously.
This is called a vectorized operation, and it's the fundamental mental shift of working with pandas. Instead of asking "how do I process each row?" ask "what operation do I want to apply to the whole column?"
Here's why this matters beyond convenience:
- Speed: Vectorized operations use optimized C code under the hood. For a million rows, a vectorized operation can be 100x faster than a Python loop.
- Readability: df["coverage_pct"] / 100 clearly states intent. A loop buries the intent inside iteration mechanics.
- Safety: Vectorized operations handle missing values (NaN) automatically — NaN / 100 produces NaN, not an error.
This shift — from "loop over items" to "operate on columns" — is one of the most important mental transitions in your data science education. It will feel unnatural at first. That's normal. Keep practicing, and it will become your default way of thinking.
Arithmetic on Columns
# Create a new column: coverage as a decimal
df["coverage_decimal"] = df["coverage_pct"] / 100
print(df[["country", "coverage_pct", "coverage_decimal"]])
country coverage_pct coverage_decimal
0 Brazil 72.3 0.723
1 Canada 85.1 0.851
2 Chad 41.7 0.417
3 Denmark 93.2 0.932
4 Ethiopia 68.5 0.685
5 Finland 95.0 0.950
6 Ghana 78.4 0.784
7 Haiti 52.1 0.521
8 India 88.0 0.880
9 Japan 97.5 0.975
Combining Columns
You can use multiple columns in a single expression:
# Imagine we have target_population and doses_administered
df2 = pd.DataFrame({
"country": ["Brazil", "Canada", "Chad"],
"target_pop": [50000, 28000, 15000],
"doses_given": [36150, 23800, 6255]
})
# Compute coverage from the raw numbers
df2["computed_coverage"] = (df2["doses_given"] / df2["target_pop"]) * 100
print(df2)
country target_pop doses_given computed_coverage
0 Brazil 50000 36150 72.3
1 Canada 28000 23800 85.0
2 Chad 15000 6255 41.7
Conditional Columns with Where or Apply
Sometimes you need logic more complex than simple arithmetic. For categorization, you can use apply() with a function:
def classify_coverage(pct):
"""Classify a coverage percentage into a category."""
if pct >= 90:
return "High"
elif pct >= 70:
return "Medium"
else:
return "Low"
df["coverage_level"] = df["coverage_pct"].apply(classify_coverage)
print(df[["country", "coverage_pct", "coverage_level"]])
country coverage_pct coverage_level
0 Brazil 72.3 Medium
1 Canada 85.1 Medium
2 Chad 41.7 Low
3 Denmark 93.2 High
4 Ethiopia 68.5 Low
5 Finland 95.0 High
6 Ghana 78.4 Medium
7 Haiti 52.1 Low
8 India 88.0 Medium
9 Japan 97.5 High
The apply() method takes a function and calls it on every value in the Series. It's a bridge between the loop-thinking you know and the vector-thinking you're learning. Use it when you need conditional logic that's too complex for simple arithmetic.
Performance Note: apply() is slower than true vectorized operations because it still iterates through values under the hood (just in a nicer wrapper). For simple conditions, np.where() or pd.cut() are faster alternatives — you'll learn these in Chapter 9. For now, apply() is perfectly fine for datasets up to hundreds of thousands of rows.
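As a small preview of that faster alternative, here is a sketch using np.where, which comes from NumPy (the library pandas is built on). The two-way split below is a simplified stand-in for the three-way classify_coverage function above, not a drop-in replacement:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"coverage_pct": [72.3, 93.2, 41.7]})

# np.where(condition, value_if_true, value_if_false) is fully vectorized:
# the condition is evaluated for the whole column at once
df["at_target"] = np.where(df["coverage_pct"] >= 90, "yes", "no")
print(df["at_target"].tolist())  # ['no', 'yes', 'no']
```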
String Operations on Columns
Series has a .str accessor for string operations on text columns:
# Convert country names to uppercase
print(df["country"].str.upper())
0 BRAZIL
1 CANADA
2 CHAD
3 DENMARK
4 ETHIOPIA
5 FINLAND
6 GHANA
7 HAITI
8 INDIA
9 JAPAN
Name: country, dtype: object
# Check which countries start with a specific letter
print(df["country"].str.startswith("C"))
0 False
1 True
2 True
3 False
4 False
5 False
6 False
7 False
8 False
9 False
Name: country, dtype: bool
This will become especially powerful in Chapter 10 (Working with Text Data), but knowing it exists now gives you a preview of how pandas extends vectorized thinking beyond just numbers.
7.8 Loading Real Data: read_csv Deep Dive
You've been building DataFrames by hand. Now let's load real data the way professionals do: with pd.read_csv().
Basic Usage
import pandas as pd
df = pd.read_csv("who_vaccination_data.csv")
print(df.shape)
print(df.head())
(4892, 7)
country region year vaccine coverage_pct target_population doses_administered
0 Afghanistan EMRO 2019 MCV1 64.0 NaN NaN
1 Afghanistan EMRO 2020 MCV1 66.0 NaN NaN
2 Afghanistan EMRO 2021 MCV1 58.0 NaN NaN
3 Afghanistan EMRO 2022 MCV1 62.0 NaN NaN
4 Albania EURO 2019 MCV1 96.0 28000.0 26880.0
Look at the differences from Chapter 6:
- One line to load the entire file — no open(), no DictReader, no loop, no list
- Automatic type detection — year is an integer, coverage_pct is a float, text columns are objects
- Missing values become NaN — not empty strings that crash your math. NaN (Not a Number) is pandas's sentinel for missing data. It participates safely in computations: NaN + 5 is NaN, not an error.
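Those NaN claims are easy to verify with a tiny Series (np.nan is NumPy's NaN value, which pandas uses to mark missing entries):

```python
import numpy as np
import pandas as pd

s = pd.Series([64.0, np.nan, 96.0])

print(s + 5)       # NaN stays NaN; the other values become 69.0 and 101.0
print(s.mean())    # 80.0 -- NaN is skipped, not treated as zero
print(s.count())   # 2   -- only non-missing values are counted
```

Note the subtlety in mean(): if NaN were counted as zero, the mean would be about 53.3, not 80.0. pandas excludes missing values from both the numerator and the denominator.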
Useful read_csv Parameters
read_csv has dozens of parameters. Here are the ones you'll use most often:
# Specify which columns to load (saves memory on large files)
df = pd.read_csv("who_vaccination_data.csv",
usecols=["country", "region", "year", "coverage_pct"])
# Parse a column as dates
df = pd.read_csv("sales_data.csv",
parse_dates=["sale_date"])
# Use a specific column as the index
df = pd.read_csv("who_vaccination_data.csv",
index_col="country")
# Handle different delimiters (tab-separated, semicolon-separated)
df = pd.read_csv("data.tsv", sep="\t")
# Specify the encoding (important for international data)
df = pd.read_csv("data.csv", encoding="latin-1")
# Skip bad lines instead of crashing
df = pd.read_csv("messy_data.csv", on_bad_lines="skip")
# Read only the first N rows (useful for previewing huge files)
df = pd.read_csv("huge_file.csv", nrows=100)
The NaN Revolution
One of the biggest quality-of-life improvements over pure Python is how pandas handles missing data. In Chapter 6, empty cells loaded as empty strings (""), and you had to check for them manually before every calculation:
# Chapter 6: The missing value dance
raw = row["coverage_pct"].strip()
if raw == "":
continue
try:
value = float(raw)
except ValueError:
continue
In pandas, missing values are NaN (Not a Number), and every statistical method handles them automatically:
# pandas: Just... works
df["coverage_pct"].mean() # Ignores NaN automatically
df["coverage_pct"].count() # Counts only non-NaN values
df["coverage_pct"].isnull() # True where values are missing
We'll dive deep into missing value handling strategies in Chapter 8 (Cleaning Messy Data). For now, the important thing to know is that NaN is your friend, not your enemy — it marks missing data clearly and doesn't crash your calculations.
7.9 Method Chaining: Composing Operations
One of the most elegant aspects of pandas is method chaining — stringing multiple operations together in a single expression, where each operation feeds its result to the next.
The Basic Idea
Instead of writing:
# Step by step (creating intermediate variables)
filtered = df[df["region"] == "AFRO"]
sorted_df = filtered.sort_values("coverage_pct", ascending=False)
result = sorted_df[["country", "coverage_pct"]].head(5)
You can write:
# Method chain (one flowing expression)
result = (df[df["region"] == "AFRO"]
.sort_values("coverage_pct", ascending=False)
[["country", "coverage_pct"]]
.head(5))
Read it as a sentence: "Take the DataFrame, filter for AFRO region, sort by coverage descending, select the country and coverage columns, and show the top 5."
The parentheses around the entire expression let you break it across multiple lines for readability — a formatting convention you'll see in professional pandas code.
Why Chain?
Method chaining encourages you to think about data operations as a pipeline — a sequence of transformations where data flows through each step. This mirrors how data scientists describe their work in plain English:
- "Filter for the African region"
- "Sort by coverage from highest to lowest"
- "Show the top 5 countries"
Each verb maps to a pandas method. This is what we mean by the "grammar of data manipulation" — a consistent vocabulary of operations (select, filter, sort, create, summarize) that you compose into analytical sentences.
A Practical Example
Let's ask a real question: "What are the top 3 vaccines by mean coverage in the EURO region?"
result = (df[df["region"] == "EURO"]
.groupby("vaccine")["coverage_pct"]
.mean()
.sort_values(ascending=False)
.head(3))
print(result)
We haven't formally covered groupby yet (that's Chapter 9), but you can probably read this chain and understand what it does. That's the power of expressive code.
When Not to Chain
Method chaining is elegant, but don't take it too far. If a chain exceeds 5-6 steps, or if intermediate results would be useful for debugging, break it into named variables. Readability always beats cleverness.
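To make that debugging point concrete, here is the earlier EURO question unrolled into named steps. The tiny DataFrame is our own stand-in for the WHO file, just to keep the sketch self-contained:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["EURO", "EURO", "AFRO"],
    "vaccine": ["MCV1", "DTP3", "MCV1"],
    "coverage_pct": [96.0, 94.0, 72.0],
})

# Each intermediate result gets a name, so you can inspect it mid-pipeline
euro = df[df["region"] == "EURO"]
means = euro.groupby("vaccine")["coverage_pct"].mean()
top = means.sort_values(ascending=False).head(3)

print(euro.shape)    # (2, 3) - sanity-check the filter before moving on
print(top.tolist())  # [96.0, 94.0]
```

When a chain misbehaves, this unrolled form lets you check each step's `.shape` or `.head()` to find where the data went wrong.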
7.10 Debugging Walkthrough: Common pandas Errors
Every pandas beginner hits the same set of errors. Let's walk through them so you can diagnose them quickly when they happen to you.
Error 1: KeyError — Column Not Found
# You typed the column name wrong
df["Coverage_Pct"]
KeyError: 'Coverage_Pct'
Why it happens: Column names are case-sensitive. "Coverage_Pct" is not the same as "coverage_pct".
How to fix it: Check the exact column names with df.columns:
print(df.columns.tolist())
# ['country', 'region', 'year', 'vaccine', 'coverage_pct',
# 'target_population', 'doses_administered']
Then use the exact name. Copy-pasting from the output is safer than typing from memory.
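If you want a friendlier failure than a raw KeyError, a membership check on df.columns works as a cheap guard. This sketch uses a one-column toy frame of our own:

```python
import pandas as pd

df = pd.DataFrame({"coverage_pct": [58.0, 62.0]})

# Guard before indexing: fail with a helpful message, not a KeyError
name = "Coverage_Pct"
if name in df.columns:
    print(df[name].mean())
else:
    print(f"No column {name!r}; available: {df.columns.tolist()}")
```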
Error 2: SettingWithCopyWarning
This is pandas's most confusing warning, and almost every beginner encounters it:
# This might trigger SettingWithCopyWarning
subset = df[df["region"] == "AFRO"]
subset["coverage_level"] = "unknown" # Warning!
SettingWithCopyWarning: A value is trying to be set on a copy of a slice
from a DataFrame.
Why it happens: When you filter a DataFrame, pandas sometimes returns a view (linked to the original) and sometimes a copy (independent). If it's a view and you modify it, you might accidentally change the original DataFrame. pandas warns you about this ambiguity.
How to fix it: Use .copy() when you intend to create an independent subset:
# Explicit copy — no warning, no ambiguity
subset = df[df["region"] == "AFRO"].copy()
subset["coverage_level"] = "unknown" # Safe!
Or use .loc for direct modification:
# Modify the original DataFrame directly
df.loc[df["region"] == "AFRO", "coverage_level"] = "unknown"
The rule of thumb: if you're creating a subset that you'll modify, call .copy(). If you're modifying the original DataFrame, use .loc.
Error 3: ValueError with Boolean Operators
# Using Python's 'and' instead of '&'
df[df["region"] == "AFRO" and df["coverage_pct"] > 80]
ValueError: The truth value of a Series is ambiguous.
Use a.empty, a.bool(), a.item(), a.any() or a.all().
Why it happens: Python's and operator tries to evaluate the entire Series as a single True or False, which is ambiguous (is a Series with some True and some False values True or False?).
How to fix it: Use & for element-wise "and," | for "or," and wrap each condition in parentheses:
df[(df["region"] == "AFRO") & (df["coverage_pct"] > 80)]
Error 4: Double Bracket Confusion
# Forgot the inner list brackets
df["country", "region"]
KeyError: ('country', 'region')
Why it happens: df["country", "region"] looks for a single column whose name is the tuple ('country', 'region'). That doesn't exist.
How to fix it: Pass a list of column names:
df[["country", "region"]] # Note the double brackets
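A quick way to internalize the bracket rule: single brackets return a Series, while a list inside the brackets returns a DataFrame, even for one column. The two-row frame here is just an illustration:

```python
import pandas as pd

df = pd.DataFrame({"country": ["Albania", "Chad"],
                   "region": ["EURO", "AFRO"]})

print(type(df["country"]).__name__)              # Series - one column
print(type(df[["country"]]).__name__)            # DataFrame - list of one column
print(type(df[["country", "region"]]).__name__)  # DataFrame - list of two
```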
Retrieval Practice
Without looking at the examples above, try to answer:
- What error do you get if you type a column name with wrong capitalization?
- When should you use .copy() on a filtered DataFrame?
- Why can't you use Python's and keyword between two pandas boolean conditions?
Answers
- KeyError — pandas is case-sensitive about column names.
- When you plan to modify the filtered subset (add or change columns).
- Because and tries to evaluate the entire Series as a single boolean, which is ambiguous. Use & instead.
7.11 Project Checkpoint: Rebuilding Chapter 6 in pandas
Time to put everything together. Let's rebuild the Chapter 6 analysis of the WHO vaccination data — but this time using pandas. You'll feel the difference viscerally.
Step 1: Load the Data
Chapter 6 (11 lines):
import csv
data = []
with open("who_vaccination_data.csv", "r", encoding="utf-8") as f:
reader = csv.DictReader(f)
for row in reader:
data.append(row)
num_rows = len(data)
num_cols = len(data[0]) if data else 0
print(f"Shape: {num_rows} rows x {num_cols} columns")
Chapter 7 (3 lines):
import pandas as pd
df = pd.read_csv("who_vaccination_data.csv")
print(f"Shape: {df.shape[0]} rows x {df.shape[1]} columns")
Shape: 4892 rows x 7 columns
Step 2: Inspect the Data
Chapter 6: Multiple loops and print statements to see columns, first rows, data types, unique values.
Chapter 7:
print(df.head())
print()
print(df.dtypes)
print()
print(df.describe())
print()
df.info()
That's it. Four method calls give you everything that took 30+ lines in Chapter 6.
Step 3: Count Records by Region
Chapter 6 (7 lines):
region_counts = {}
for row in data:
region = row["region"]
region_counts[region] = region_counts.get(region, 0) + 1
for region, count in sorted(region_counts.items()):
print(f" {region}: {count} records")
Chapter 7 (1 line):
print(df["region"].value_counts().sort_index())
AFRO 1520
AMRO 820
EMRO 564
EURO 1040
SEARO 308
WPRO 640
Name: region, dtype: int64
Step 4: Unique Countries per Region
Chapter 6 (8 lines):
countries_per_region = {}
for row in data:
region = row["region"]
country = row["country"]
if region not in countries_per_region:
countries_per_region[region] = set()
countries_per_region[region].add(country)
for region in sorted(countries_per_region):
print(f" {region}: {len(countries_per_region[region])} countries")
Chapter 7 (1 line):
print(df.groupby("region")["country"].nunique())
region
AFRO 47
AMRO 35
EMRO 22
EURO 53
SEARO 11
WPRO 26
Name: country, dtype: int64
Step 5: Summary Statistics by Region
Chapter 6 (15+ lines): Required get_numeric_values(), compute_mean(), compute_median(), manual filtering and grouping.
Chapter 7 (1 line):
print(df.groupby("region")["coverage_pct"].describe())
count mean std min 25% 50% 75% max
region
AFRO 1485.0 72.3 22.139... 6.0 58.00 76.0 91.0 99.0
AMRO 805.0 86.4 14.247... 20.0 82.00 90.0 95.0 99.0
EMRO 555.0 80.5 19.876... 11.0 72.00 85.0 94.0 99.0
EURO 1025.0 93.1 6.789... 42.0 91.00 95.0 97.0 99.0
SEARO 295.0 84.8 14.532... 34.0 79.00 88.0 94.0 99.0
WPRO 650.0 88.2 14.673... 15.0 83.00 93.0 97.0 99.0
Count, mean, standard deviation, min, quartiles, and max — for every region — in one line.
Step 6: Filter and Sort
Question: "Which countries in the AFRO region had the lowest MCV1 coverage in 2022?"
Chapter 6 (12+ lines): Filter with nested if statements, sort with sorted() and a key function, format output manually.
Chapter 7 (4 lines):
result = (df[(df["region"] == "AFRO") &
(df["vaccine"] == "MCV1") &
(df["year"] == 2022)]
.sort_values("coverage_pct")
.head(10))
print(result[["country", "coverage_pct"]])
Step 7: Create a Computed Column
Let's add a column classifying coverage as "Low," "Medium," or "High":
def classify(pct):
if pct >= 90:
return "High"
elif pct >= 70:
return "Medium"
else:
return "Low"
df["coverage_level"] = df["coverage_pct"].apply(classify)
# How many records in each category?
print(df["coverage_level"].value_counts())
High 2547
Medium 1386
Low 882
Name: coverage_level, dtype: int64
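One subtlety worth flagging: comparisons against NaN are always False, so a threshold function like classify quietly labels missing coverage values as "Low". A guarded variant (the "Unknown" label is our own naming, not from the chapter's dataset) makes the missing values visible instead:

```python
import pandas as pd

def classify(pct):
    # NaN compares False against every threshold, so without this
    # guard a missing value would silently fall through to "Low".
    if pd.isna(pct):
        return "Unknown"
    if pct >= 90:
        return "High"
    elif pct >= 70:
        return "Medium"
    else:
        return "Low"

s = pd.Series([95.0, 75.0, None, 40.0])
print(s.apply(classify).tolist())  # ['High', 'Medium', 'Unknown', 'Low']
```

Chapter 8 covers missing-value strategies in depth; for now, checking pd.isna() first keeps a classifier honest.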
The Scoreboard
| Operation | Chapter 6 Lines | Chapter 7 Lines | Reduction |
|---|---|---|---|
| Load + inspect | ~20 | 4 | 80% |
| Count by group | 7 | 1 | 86% |
| Unique per group | 8 | 1 | 88% |
| Summary stats by group | 15+ | 1 | 93% |
| Filter + sort | 12+ | 4 | 67% |
| Computed column | 8+ | 4 | 50% |
The total analysis went from roughly 70 lines of careful Python to about 15 lines of pandas. And the pandas version is more readable, handles missing values automatically, and runs faster on large datasets.
This is the payoff for all the work you put in during Part I. You earned this.
7.12 Spaced Review: Concepts from Chapters 1-6
Learning sticks when you revisit it. Here are retrieval practice questions that connect earlier material to what you've just learned.
Spaced Review Block
Try to answer each question from memory before checking.
From Chapter 1: The data science lifecycle has six stages. Which stage does this chapter's work (selecting, filtering, sorting) belong to? What stage came before it in Chapter 6?
Answer
This chapter's work belongs to the data wrangling/exploration stage. In Chapter 6, you were also in exploration but with raw tools. The lifecycle stages are: question formulation, data collection, data cleaning, exploratory analysis, modeling, and communication.
From Chapter 3: Why does pd.read_csv() handle type conversion automatically while csv.DictReader does not? What Python concept explains the difference?
Answer
csv.DictReader reads every field as a string because CSV is a text format with no type information. read_csv performs type inference — it examines the values and converts them to appropriate Python/NumPy types (int64, float64, object). The Python concept is type conversion (Ch. 3) — pandas just does it for you automatically.
From Chapter 4: The apply() method takes a function as an argument. What is this pattern called, and where did you first encounter it?
Answer
This is a higher-order function — a function that takes another function as an argument. You encountered this concept in Chapter 4 with sorted(data, key=some_function). The apply() method extends the same idea to pandas Series.
From Chapter 5: A DataFrame is conceptually similar to what Python data structure from Chapter 5? How is it different?
Answer
A DataFrame is conceptually similar to a list of dictionaries (each dictionary is a row, each key is a column name). The differences: a DataFrame enforces that all rows have the same columns, stores data in columns rather than rows (more efficient for column operations), provides type information, and offers built-in methods for common operations.
From Chapter 6: What is the threshold concept from Chapter 6, and how does pandas change your relationship to it?
Answer
Chapter 6's threshold concept was "EDA as a conversation with data." pandas doesn't change this concept — you still ask questions and let answers guide you. But pandas makes the conversation faster. When each question takes 1 line instead of 15, you can ask more questions in the same amount of time, making the conversation richer and more productive.
7.13 The Grammar of Data Manipulation: A Framework
Before we close this chapter, let's step back and see the big picture. The operations you've learned — selecting, filtering, sorting, creating columns — aren't random. They're part of a systematic vocabulary for data manipulation. Data scientists sometimes call this the grammar of data manipulation, by analogy with natural language grammar.
Just as English sentences are built from verbs, nouns, and adjectives, data analysis is built from a small set of fundamental operations:
| Verb | Meaning | pandas Method | Example |
|---|---|---|---|
| Select | Choose columns | df[["col1", "col2"]] | Keep only country and coverage |
| Filter | Choose rows | df[df["col"] > value] | Keep only high-coverage rows |
| Sort | Order rows | df.sort_values("col") | Arrange by coverage |
| Mutate | Create/modify columns | df["new"] = expression | Add a coverage_level column |
| Summarize | Aggregate | df["col"].mean() | Compute mean coverage |
| Group | Split-apply-combine | df.groupby("col") | Statistics by region |
Every analysis you'll ever do is some combination of these six verbs. Learning pandas is largely about learning to express your analytical questions in terms of these operations — to translate "Which regions have average coverage below 80%?" into df.groupby("region")["coverage_pct"].mean() < 80.
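That translation can be completed into a tiny runnable sketch. The four-row DataFrame is our own stand-in for the WHO file, just to make the example self-contained:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["AFRO", "AFRO", "EURO", "EURO"],
    "coverage_pct": [72.0, 76.0, 93.0, 95.0],
})

# "Which regions have average coverage below 80%?"
means = df.groupby("region")["coverage_pct"].mean()
print(means[means < 80].index.tolist())  # ['AFRO']
```

Group, summarize, filter: three verbs, one analytical sentence.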
We've covered the first four verbs in this chapter. Summarize and Group will be developed fully in Chapter 9 (Reshaping and Transforming). But you've already seen previews of both.
7.14 What's Coming Next
You now have the foundation for everything in Part II. Here's what the next chapters build on top of:
- Chapter 8 (Cleaning Messy Data): You'll learn to handle the NaN values you saw in read_csv output — filling them, dropping them, understanding their impact. You'll also fix data type problems, remove duplicates, and standardize messy text.
- Chapter 9 (Reshaping and Transforming): You'll go deeper on groupby, learn to merge datasets together, and reshape data between "wide" and "long" formats with pivot_table and melt.
- Chapter 10 (Working with Text Data): You'll use the .str accessor you previewed here along with regular expressions to extract information from messy text fields.
Each of these chapters assumes you're comfortable with the operations from this chapter: creating DataFrames, selecting columns, filtering rows, sorting, and creating new columns. If any of those feel shaky, go back and practice with the exercises before moving forward.
Chapter Summary
This chapter introduced pandas — the library that transforms Python from a general-purpose programming language into a data analysis powerhouse. You learned:
- DataFrames are two-dimensional tables; Series are one-dimensional columns. Together they form the backbone of pandas.
- Inspection methods (shape, dtypes, head(), describe(), info()) give you instant insight into any dataset — replacing dozens of lines of manual Python.
- Column selection uses bracket notation (df["col"] for one column, df[["col1", "col2"]] for multiple).
- Row selection uses iloc (by position) and loc (by label).
- Boolean indexing (df[df["col"] > value]) replaces loops-with-if-statements for filtering, and conditions are combined with &, |, and ~.
- Sorting uses sort_values() with support for multiple columns and ascending/descending control.
- Vectorized operations let you operate on entire columns at once (df["new"] = df["old"] * 2), and apply() handles more complex per-row logic.
- read_csv loads CSV files in one line, with automatic type detection and NaN for missing values.
- Method chaining lets you compose operations into readable pipelines.
The threshold concept — thinking in vectors rather than loops — is the mental shift that unlocks pandas. When you catch yourself reaching for a for loop to process DataFrame data, stop and ask: "Is there a vectorized way to do this?" Almost always, the answer is yes.
You rebuilt your Chapter 6 analysis and saw the code shrink from 70 lines to 15 — not by cutting corners, but by expressing your intent at a higher level. That's not laziness. That's power.
Welcome to Part II. The real work starts now.