
Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation

"The purpose of computing is insight, not numbers." — Richard Hamming


Chapter Overview

Remember the end of Chapter 6?

You had just loaded nearly 5,000 rows of WHO vaccination data using Python's csv module. You wrote loops to count records by region. You built a function to extract numeric values from string columns, carefully skipping empty strings. You computed means and medians by hand. You grouped data by region using nested loops and dictionaries. And when you were done, you had a working analysis — one that required roughly 60 lines of careful, manual Python to answer what felt like simple questions.

It worked. And it taught you a lot. But be honest with yourself: did it feel... heavy? Did you notice how much code was dedicated to plumbing — converting strings to numbers, handling missing values, iterating row by row — rather than expressing what you actually wanted to know?

This chapter is where that weight lifts.

You're about to meet pandas, the Python library that transforms how you work with data. The operation that took you 12 lines in Chapter 6 — filtering vaccination records for a specific region, extracting numeric coverage values, and computing the mean — becomes this:

df[df['region'] == 'AFRO']['coverage_pct'].mean()

One line. Readable. Expressive. No loops. No manual type conversion. No string-to-float gymnastics. Just a clear statement of intent: "Give me the mean coverage percentage for the AFRO region."

That single line contains five new ideas (DataFrames, column selection, boolean indexing, Series methods, and vectorized operations), and by the end of this chapter, every piece of it will feel natural to you. More importantly, you'll understand why it works — not just how to type it.

In this chapter, you will learn to:

  1. Create DataFrames from dictionaries, lists, and CSV files using pandas constructors and read_csv
  2. Select columns and filter rows using bracket notation, .loc, and .iloc with boolean conditions
  3. Sort DataFrames by one or multiple columns and reset indexes appropriately
  4. Create new computed columns using vectorized operations and apply()
  5. Explain the relationship between Series and DataFrame and why vectorized operations are preferred over loops

7.1 Why pandas? — The Case for a Data Library

Before we dive into syntax, let's understand why pandas exists by looking at what we built in Chapter 6 and what it cost us.

The Pain You Already Felt

Cast your mind back to the code you wrote for computing mean vaccination coverage by region. Here's a condensed version:

# Chapter 6 approach: Mean coverage by region (pure Python)
import csv

data = []
with open("who_vaccination_data.csv", "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        data.append(row)

# Group by region and compute means
region_stats = {}
for row in data:
    region = row["region"]
    raw = row["coverage_pct"].strip()
    if raw == "":
        continue
    try:
        value = float(raw)
    except ValueError:
        continue
    if region not in region_stats:
        region_stats[region] = []
    region_stats[region].append(value)

for region in sorted(region_stats):
    values = region_stats[region]
    mean_val = sum(values) / len(values)
    print(f"{region}: {mean_val:.1f}%")

Count the lines. Count the concerns. This code is doing at least six things:

  1. File I/O: Opening the file, creating a reader, iterating rows
  2. Type conversion: Converting strings to floats
  3. Missing value handling: Skipping empty strings
  4. Error handling: Catching ValueError for non-numeric strings
  5. Grouping logic: Building a dictionary of region-to-values-list
  6. Aggregation: Computing the mean manually

And only item 6 is the thing we actually care about. The rest is plumbing.

The pandas Way

Here's the same analysis in pandas:

import pandas as pd

df = pd.read_csv("who_vaccination_data.csv")
df.groupby("region")["coverage_pct"].mean()

Three lines. And two of them are the import and the file load — things you do once per notebook. The actual analysis is one line.

But this isn't just about fewer keystrokes. The pandas version is:

  • Faster to write — you express intent, not mechanics
  • Faster to run — pandas operates on entire columns at once using optimized C code under the hood
  • Easier to read — even someone unfamiliar with pandas can guess what groupby("region")["coverage_pct"].mean() does
  • Safer — pandas handles type conversion and missing values (NaN) automatically

This is the promise of pandas: you think about what you want to know, and pandas handles how to compute it.

What Is pandas, Exactly?

pandas is an open-source Python library for data manipulation and analysis. The name comes from "panel data," a term from econometrics for multi-dimensional structured data (though the creator, Wes McKinney, has also said it stands for "Python Data Analysis Library"). McKinney created pandas in 2008 while working at a hedge fund, because he needed tools for financial data analysis that didn't exist in Python at the time.

Today, pandas is one of the most-used libraries in all of Python. If you do data science in Python, you use pandas. It provides two fundamental data structures:

  • DataFrame — a two-dimensional table, like a spreadsheet or a SQL table
  • Series — a one-dimensional column, like a single column from a spreadsheet

Everything else in pandas — filtering, sorting, grouping, merging, reshaping — is built on operations over these two structures.

The Import Convention

By universal convention, pandas is imported with the alias pd:

import pandas as pd

You'll see this line at the top of virtually every data science notebook in the world. The alias pd is so standard that writing import pandas (without the alias) would actually confuse experienced readers. Stick with pd.

Check Your Understanding

Before we go further, make sure you can articulate:

  1. What were the two biggest sources of friction when analyzing CSV data with pure Python in Chapter 6?
  2. In the pandas one-liner df.groupby("region")["coverage_pct"].mean(), what does each piece do?
  3. Why might "fewer lines of code" not be a sufficient argument for using a library? (Hint: what else matters?)

Answers

  1. (a) Manual type conversion — everything loads as strings; (b) Manual iteration — computing grouped statistics required nested loops and dictionaries.
  2. df is the DataFrame; .groupby("region") groups rows by the region column; ["coverage_pct"] selects the coverage column within each group; .mean() computes the average for each group.
  3. Readability, correctness, performance, and maintainability also matter. A clever one-liner that nobody can understand is worse than a clear five-liner.

7.2 Your First DataFrame

Let's build DataFrames from scratch before we load files. Understanding how DataFrames are constructed will help you understand how they work.

Creating a DataFrame from a Dictionary

The most common way to create a DataFrame by hand is from a Python dictionary, where each key is a column name and each value is a list of values for that column:

import pandas as pd

# Each key = column name, each value = list of column values
country_data = {
    "country": ["Brazil", "Canada", "Chad", "Denmark", "Ethiopia"],
    "region": ["AMRO", "AMRO", "AFRO", "EURO", "AFRO"],
    "coverage_pct": [72.3, 85.1, 41.7, 93.2, 68.5],
    "year": [2022, 2022, 2022, 2022, 2022]
}

df = pd.DataFrame(country_data)
print(df)
     country region  coverage_pct  year
0     Brazil   AMRO          72.3  2022
1     Canada   AMRO          85.1  2022
2       Chad   AFRO          41.7  2022
3    Denmark   EURO          93.2  2022
4   Ethiopia   AFRO          68.5  2022

Look at that output. It's a table. Rows and columns, with labels. The numbers on the left (0, 1, 2, 3, 4) are the index — pandas's way of labeling each row. By default, the index is just sequential integers starting at 0, like list indices.

Notice what you didn't have to do:

  • You didn't specify data types. pandas figured out that coverage_pct is a float and year is an integer.
  • You didn't worry about alignment. pandas made sure each list has the same length (if they didn't, you'd get an error — which is actually a helpful safety check).
  • You didn't write a display function. DataFrames know how to display themselves as pretty tables.
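That length check is worth triggering once on purpose so you recognize the error when you see it. A minimal sketch with a deliberately mismatched dictionary (the data here is made up):

```python
import pandas as pd

# One column is shorter than the other -- pandas refuses to guess
bad_data = {
    "country": ["Brazil", "Canada", "Chad"],
    "coverage_pct": [72.3, 85.1],  # only 2 values for 3 countries
}

try:
    pd.DataFrame(bad_data)
except ValueError as e:
    print("pandas rejected the mismatched columns:", e)
```

Failing loudly at construction time is far better than silently misaligning rows and discovering it in your results.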

Creating a DataFrame from a List of Dictionaries

Remember our Chapter 6 data structure? We loaded CSV data into a list of dictionaries, where each dictionary was one row. pandas can create a DataFrame directly from that:

# This is exactly what csv.DictReader gave us in Chapter 6
rows = [
    {"country": "Brazil", "region": "AMRO", "coverage_pct": 72.3},
    {"country": "Canada", "region": "AMRO", "coverage_pct": 85.1},
    {"country": "Chad", "region": "AFRO", "coverage_pct": 41.7},
]

df = pd.DataFrame(rows)
print(df)
  country region  coverage_pct
0  Brazil   AMRO          72.3
1  Canada   AMRO          85.1
2    Chad   AFRO          41.7

This means you could take everything you loaded in Chapter 6 and turn it into a DataFrame in one line: df = pd.DataFrame(data). We'll do something even better shortly with read_csv.

Inspecting a DataFrame: shape, dtypes, head, describe

Every time you create or load a DataFrame, your first move should be to inspect it. pandas provides a set of attributes and methods that give you the same information you painstakingly extracted in Chapter 6 — but instantly.

import pandas as pd

# Create a slightly larger example
df = pd.DataFrame({
    "country": ["Brazil", "Canada", "Chad", "Denmark", "Ethiopia",
                "Finland", "Ghana", "Haiti", "India", "Japan"],
    "region": ["AMRO", "AMRO", "AFRO", "EURO", "AFRO",
               "EURO", "AFRO", "AMRO", "SEARO", "WPRO"],
    "coverage_pct": [72.3, 85.1, 41.7, 93.2, 68.5,
                     95.0, 78.4, 52.1, 88.0, 97.5],
    "year": [2022, 2022, 2022, 2022, 2022,
             2022, 2022, 2022, 2022, 2022]
})

shape — the dimensions, as a tuple of (rows, columns):

print(df.shape)
(10, 4)

In Chapter 6, this was (len(data), len(data[0])) — two separate expressions. Now it's one attribute.

dtypes — the data type of each column:

print(df.dtypes)
country          object
region           object
coverage_pct    float64
year              int64
dtype: object

pandas uses object for text/string data, float64 for decimal numbers, and int64 for whole numbers. In Chapter 6, all columns were strings because csv.DictReader doesn't do type inference. pandas does.

head() — the first few rows (default: 5):

print(df.head(3))
  country region  coverage_pct  year
0  Brazil   AMRO          72.3  2022
1  Canada   AMRO          85.1  2022
2    Chad   AFRO          41.7  2022

In Chapter 6, this was for row in data[:3]: print(row). Now it's one method call.

tail() — the last few rows:

print(df.tail(2))
  country region  coverage_pct  year
8   India  SEARO          88.0  2022
9   Japan   WPRO          97.5  2022

describe() — summary statistics for all numeric columns:

print(df.describe())
       coverage_pct         year
count     10.000000    10.000000
mean      77.180000  2022.000000
std       18.724245     0.000000
min       41.700000  2022.000000
25%       69.450000  2022.000000
50%       81.750000  2022.000000
75%       91.900000  2022.000000
max       97.500000  2022.000000

Stop and take that in. In Chapter 6, you wrote a summarize() function with 12 lines of code that computed count, min, max, mean, median, and range. The describe() method gives you count, mean, standard deviation, min, three quartiles (25%, 50%, 75%), and max — for every numeric column at once. With zero lines of custom code.

This is what pandas does. It turns your custom plumbing into one-word method calls.

info() — a concise summary of the DataFrame:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   country       10 non-null     object
 1   region        10 non-null     object
 2   coverage_pct  10 non-null     float64
 3   year          10 non-null     int64
dtypes: float64(1), int64(1), object(2)
memory usage: 448.0+ bytes

This single call tells you: how many rows, how many columns, each column's name, how many non-null values each has (instantly revealing missing data), each column's type, and even memory usage. It's like your Chapter 6 data dictionary built automatically.

Comparison Table: Chapter 6 vs. Chapter 7

| Task | Chapter 6 (Pure Python) | Chapter 7 (pandas) |
|---|---|---|
| Load CSV | csv.DictReader + loop + list | pd.read_csv() |
| Row count | len(data) | df.shape[0] or len(df) |
| Column names | list(data[0].keys()) | df.columns |
| First 5 rows | for row in data[:5]: print(row) | df.head() |
| Data types | Everything is strings | df.dtypes (auto-detected) |
| Summary stats | Custom summarize() function | df.describe() |
| Missing value count | Loop + counter | df.isnull().sum() |
| Column values | [row["col"] for row in data] | df["col"] |
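The right-hand column of that table can be exercised in a few lines. A self-contained sketch with a made-up three-row frame, including one deliberately missing value:

```python
import pandas as pd

# Tiny sample frame with one missing coverage value
df = pd.DataFrame({
    "country": ["Brazil", "Canada", "Chad"],
    "coverage_pct": [72.3, None, 41.7],
})

print(df.shape[0])             # row count: 3
print(list(df.columns))        # ['country', 'coverage_pct']
print(df.isnull().sum())       # missing values per column: coverage_pct has 1
```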

7.3 Series: The Building Block

A Series is a one-dimensional labeled array. If a DataFrame is a table, a Series is a single column (or row) of that table.

Extracting a Series from a DataFrame

When you select a single column from a DataFrame, you get a Series:

coverage = df["coverage_pct"]
print(type(coverage))
<class 'pandas.core.series.Series'>
print(coverage)
0    72.3
1    85.1
2    41.7
3    93.2
4    68.5
5    95.0
6    78.4
7    52.1
8    88.0
9    97.5
Name: coverage_pct, dtype: float64

A Series has three key properties:

  • Values: the actual data (72.3, 85.1, 41.7, ...)
  • Index: the labels for each value (0, 1, 2, ...)
  • Name: the column name it came from (coverage_pct)
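A small sketch confirming those three properties, rebuilt here with a three-row subset so the snippet is self-contained:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Brazil", "Canada", "Chad"],
    "coverage_pct": [72.3, 85.1, 41.7],
})
coverage = df["coverage_pct"]

print(coverage.name)          # 'coverage_pct' -- the column it came from
print(list(coverage.index))   # [0, 1, 2] -- the row labels
print(list(coverage.values))  # [72.3, 85.1, 41.7] -- the actual data
```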

Series Methods: Statistics in One Word

Here's where the magic happens. A Series has methods for every common statistical operation:

coverage = df["coverage_pct"]

print(f"Mean:   {coverage.mean():.1f}")
print(f"Median: {coverage.median():.1f}")
print(f"Std:    {coverage.std():.1f}")
print(f"Min:    {coverage.min():.1f}")
print(f"Max:    {coverage.max():.1f}")
print(f"Sum:    {coverage.sum():.1f}")
print(f"Count:  {coverage.count()}")
Mean:   77.2
Median: 81.8
Std:    18.7
Min:    41.7
Max:    97.5
Sum:    771.8
Count:  10

Every single one of those operations would have required a function or loop in Chapter 6. Now they're one method call each.

Creating a Series Directly

You can also create a Series on its own:

# From a list
temperatures = pd.Series([22.1, 19.8, 25.4, 18.3, 22.1])
print(temperatures)
0    22.1
1    19.8
2    25.4
3    18.3
4    22.1
dtype: float64

# With a custom index
temperatures = pd.Series(
    [22.1, 19.8, 25.4, 18.3, 22.1],
    index=["Mon", "Tue", "Wed", "Thu", "Fri"],
    name="temperature_c"
)
print(temperatures)
Mon    22.1
Tue    19.8
Wed    25.4
Thu    18.3
Fri    22.1
Name: temperature_c, dtype: float64

Now temperatures["Wed"] returns 25.4. The index gives your data meaning beyond just position.

The Relationship Between Series and DataFrame

Here's the conceptual model to keep in your head:

  • A DataFrame is a collection of Series that share the same index
  • Each column of a DataFrame is a Series
  • Each Series has the same number of elements as the DataFrame has rows

# A DataFrame is like a dictionary of Series
print(type(df["country"]))   # <class 'pandas.core.series.Series'>
print(type(df["region"]))    # <class 'pandas.core.series.Series'>
print(type(df["coverage_pct"]))  # <class 'pandas.core.series.Series'>

This isn't just a metaphor — it's literally how pandas is structured internally. Understanding this relationship will help you predict what operations return: selecting one column gives you a Series; selecting multiple columns gives you a DataFrame.
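You can make the "dictionary of Series" idea concrete by building a DataFrame from Series directly — a minimal sketch (the values are made up):

```python
import pandas as pd

# Two Series sharing the same default index 0, 1, 2
names = pd.Series(["Brazil", "Canada", "Chad"], name="country")
cov = pd.Series([72.3, 85.1, 41.7], name="coverage_pct")

# A DataFrame really is a dictionary of aligned Series
df = pd.DataFrame({"country": names, "coverage_pct": cov})
print(df.shape)  # (3, 2)
```

pandas aligns the Series by index, not by position — a detail that becomes important once indexes are no longer simple integers.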


7.4 Selecting Data: Columns and Rows

Selecting specific columns and rows is the most fundamental operation in data manipulation. pandas gives you multiple ways to do this, and understanding when to use which one will save you hours of confusion.

Selecting Columns

Single column — bracket notation (returns a Series):

countries = df["country"]
print(countries)
0      Brazil
1      Canada
2        Chad
3     Denmark
4    Ethiopia
5     Finland
6       Ghana
7       Haiti
8       India
9       Japan
Name: country, dtype: object

Single column — dot notation (also returns a Series):

countries = df.country  # Same as df["country"]

Dot notation is shorter, but it has limitations: it doesn't work if the column name has spaces, special characters, or conflicts with a DataFrame method. For example, df.count would call the count() method, not select a column named "count." My recommendation: always use bracket notation. It's more explicit and works in every case.

Multiple columns — pass a list (returns a DataFrame):

subset = df[["country", "coverage_pct"]]
print(subset)
     country  coverage_pct
0     Brazil          72.3
1     Canada          85.1
2       Chad          41.7
3    Denmark          93.2
4   Ethiopia          68.5
5    Finland          95.0
6       Ghana          78.4
7      Haiti          52.1
8      India          88.0
9      Japan          97.5

Notice the double brackets: df[["country", "coverage_pct"]]. The outer brackets are the selection operator. The inner brackets create a list of column names. This is a common source of confusion for beginners — if you write df["country", "coverage_pct"] (single brackets with two arguments), you'll get a KeyError.

Selecting Rows by Position: iloc

iloc stands for "integer location." It selects rows (and columns) by their numeric position, just like list indexing:

# First row
print(df.iloc[0])
country         Brazil
region            AMRO
coverage_pct      72.3
year              2022
Name: 0, dtype: object

# Rows 2 through 4 (exclusive end, just like list slicing)
print(df.iloc[2:5])
    country region  coverage_pct  year
2      Chad   AFRO          41.7  2022
3   Denmark   EURO          93.2  2022
4  Ethiopia   AFRO          68.5  2022

# Specific rows and columns by position
print(df.iloc[0:3, 0:2])  # First 3 rows, first 2 columns
  country region
0  Brazil   AMRO
1  Canada   AMRO
2    Chad   AFRO

When to use iloc: When you know the position of what you want. "Give me the first three rows" or "give me row 42."

Selecting Rows by Label: loc

loc selects by label — the index values and column names:

# With the default integer index, loc and iloc look similar
print(df.loc[0])  # Row with index label 0
country         Brazil
region            AMRO
coverage_pct      72.3
year              2022
Name: 0, dtype: object

The difference between loc and iloc becomes crucial when the index isn't simple integers. We'll see this more in later chapters. For now, the key rule:

  • loc — uses labels (index values and column names)
  • iloc — uses integer positions (like list indices)

# loc can select specific rows and columns by name
print(df.loc[0:3, ["country", "coverage_pct"]])
   country  coverage_pct
0   Brazil          72.3
1   Canada          85.1
2     Chad          41.7
3  Denmark          93.2

Warning

Notice that loc[0:3] includes row 3, while iloc[0:3] excludes it. This is a critical difference:

  • iloc uses Python-style slicing (exclusive end)
  • loc uses label-based slicing (inclusive end)

This catches many beginners off guard. When in doubt, check your output.
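A quick way to convince yourself of the inclusive/exclusive difference is to compare lengths — a minimal sketch with a made-up five-row frame:

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# Label-based slice: inclusive end -> 4 rows (labels 0, 1, 2, 3)
print(len(df.loc[0:3]))   # 4
# Position-based slice: exclusive end -> 3 rows (positions 0, 1, 2)
print(len(df.iloc[0:3]))  # 3
```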

A Quick Reference for Selection

| What You Want | Syntax | Returns |
|---|---|---|
| One column | df["col"] | Series |
| Multiple columns | df[["col1", "col2"]] | DataFrame |
| One row by position | df.iloc[n] | Series |
| Multiple rows by position | df.iloc[start:stop] | DataFrame |
| One row by label | df.loc[label] | Series |
| Rows + columns by position | df.iloc[rows, cols] | DataFrame |
| Rows + columns by label | df.loc[rows, cols] | DataFrame |

7.5 Filtering Rows: Boolean Indexing

This is one of the most powerful ideas in pandas, and it's built on a concept you already know from Chapter 4: boolean expressions. In pure Python, you filter data with a for loop and an if statement. In pandas, you filter with boolean indexing.

The Concept: Boolean Masks

Watch what happens when you write a comparison on a Series:

print(df["coverage_pct"] > 80)
0    False
1     True
2    False
3     True
4    False
5     True
6    False
7    False
8     True
9     True
Name: coverage_pct, dtype: bool

The result is a Series of True and False values — one for every row in the DataFrame. This is called a boolean mask. It's like a filter template: True means "keep this row," False means "drop it."

Now, pass that mask back into the DataFrame's brackets:

high_coverage = df[df["coverage_pct"] > 80]
print(high_coverage)
   country region  coverage_pct  year
1   Canada   AMRO          85.1  2022
3  Denmark   EURO          93.2  2022
5  Finland   EURO          95.0  2022
8    India  SEARO          88.0  2022
9    Japan   WPRO          97.5  2022

Five countries with coverage above 80%. No loops. No if statements. One line.
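The mask itself is an ordinary Series you can store in a variable and inspect before filtering. Because True counts as 1 in arithmetic, summing a mask counts matching rows — a trick you'll use constantly. A self-contained sketch with a made-up four-row frame:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Brazil", "Canada", "Chad", "Denmark"],
    "coverage_pct": [72.3, 85.1, 41.7, 93.2],
})

# The mask is just a boolean Series; True counts as 1 when summed
high = df["coverage_pct"] > 80
print(high.sum())         # 2 -- number of rows that pass the filter
print(df[high].shape[0])  # 2 -- the same rows, now as a filtered DataFrame
```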

Contrast with Chapter 6

Here's how you would have done this in Chapter 6:

# Chapter 6 approach: Filter for high coverage
high_coverage = []
for row in data:
    raw = row["coverage_pct"].strip()
    if raw == "":
        continue
    try:
        value = float(raw)
    except ValueError:
        continue
    if value > 80:
        high_coverage.append(row)

print(f"Found {len(high_coverage)} countries with coverage > 80%")

That's a dozen lines, including type conversion, error handling, and missing value logic. The pandas version is one line because pandas already knows the column is numeric (it converted it during read_csv) and already handles missing values (they're NaN, which comparisons safely ignore).

Combining Conditions

You can combine boolean conditions using & (and), | (or), and ~ (not). Important: Each condition must be wrapped in parentheses.

# Countries in AFRO with coverage above 60
afro_high = df[(df["region"] == "AFRO") & (df["coverage_pct"] > 60)]
print(afro_high)
    country region  coverage_pct  year
4  Ethiopia   AFRO          68.5  2022
6     Ghana   AFRO          78.4  2022

# Countries in AMRO or EURO
amro_or_euro = df[(df["region"] == "AMRO") | (df["region"] == "EURO")]
print(amro_or_euro)
   country region  coverage_pct  year
0   Brazil   AMRO          72.3  2022
1   Canada   AMRO          85.1  2022
3  Denmark   EURO          93.2  2022
5  Finland   EURO          95.0  2022
7    Haiti   AMRO          52.1  2022

For checking membership in a set of values, use isin():

# Cleaner alternative for multiple values
amro_or_euro = df[df["region"].isin(["AMRO", "EURO"])]

For negation, use ~:

# Everything EXCEPT AFRO
not_afro = df[~(df["region"] == "AFRO")]
# Or equivalently:
not_afro = df[df["region"] != "AFRO"]

Common Mistake: Using and instead of &

Python's built-in and operator doesn't work with pandas boolean masks. If you write df[df["region"] == "AFRO" and df["coverage_pct"] > 60], you'll get a ValueError with the message "The truth value of a Series is ambiguous." Use & for "and," | for "or," and ~ for "not," and always wrap each condition in parentheses.


7.6 Sorting Data

Sorting is straightforward in pandas, and the method name tells you exactly what it does.

Sorting by a Single Column

# Sort by coverage, lowest first (ascending is the default)
sorted_df = df.sort_values("coverage_pct")
print(sorted_df)
     country region  coverage_pct  year
2       Chad   AFRO          41.7  2022
7      Haiti   AMRO          52.1  2022
4   Ethiopia   AFRO          68.5  2022
0     Brazil   AMRO          72.3  2022
6      Ghana   AFRO          78.4  2022
1     Canada   AMRO          85.1  2022
8      India  SEARO          88.0  2022
3    Denmark   EURO          93.2  2022
5    Finland   EURO          95.0  2022
9      Japan   WPRO          97.5  2022

Notice that the index values (0, 2, 7, 4, ...) are preserved — they travel with their rows. This means the original row identity is maintained even after sorting.

# Sort by coverage, highest first
sorted_df = df.sort_values("coverage_pct", ascending=False)
print(sorted_df.head())
   country region  coverage_pct  year
9    Japan   WPRO          97.5  2022
5  Finland   EURO          95.0  2022
3  Denmark   EURO          93.2  2022
8    India  SEARO          88.0  2022
1   Canada   AMRO          85.1  2022

Sorting by Multiple Columns

# Sort by region (alphabetically), then by coverage (highest first) within each region
sorted_df = df.sort_values(["region", "coverage_pct"], ascending=[True, False])
print(sorted_df)
     country region  coverage_pct  year
6      Ghana   AFRO          78.4  2022
4   Ethiopia   AFRO          68.5  2022
2       Chad   AFRO          41.7  2022
1     Canada   AMRO          85.1  2022
0     Brazil   AMRO          72.3  2022
7      Haiti   AMRO          52.1  2022
5    Finland   EURO          95.0  2022
3    Denmark   EURO          93.2  2022
8      India  SEARO          88.0  2022
9      Japan   WPRO          97.5  2022

The ascending parameter takes a list matching the columns list: True for ascending, False for descending.

Resetting the Index

After sorting (or filtering), you might want a clean 0-based index:

sorted_df = df.sort_values("coverage_pct").reset_index(drop=True)
print(sorted_df)
     country region  coverage_pct  year
0       Chad   AFRO          41.7  2022
1      Haiti   AMRO          52.1  2022
2   Ethiopia   AFRO          68.5  2022
3     Brazil   AMRO          72.3  2022
...

The drop=True parameter prevents pandas from adding the old index as a new column. Without drop=True, you'd get an extra column called index — rarely what you want.
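Here's that difference side by side — a minimal sketch showing the old index surviving as a column when drop=True is omitted:

```python
import pandas as pd

df = pd.DataFrame({"coverage_pct": [72.3, 41.7, 93.2]})
sorted_df = df.sort_values("coverage_pct")

# Without drop=True, the old index becomes a new column called 'index'
kept = sorted_df.reset_index()
print(list(kept.columns))   # ['index', 'coverage_pct']

# With drop=True, the old index is discarded and rows get 0, 1, 2
clean = sorted_df.reset_index(drop=True)
print(list(clean.index))    # [0, 1, 2]
```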


7.7 Creating New Columns: Vectorized Operations

This is the section where the threshold concept of this chapter comes alive.

Threshold Concept: Thinking in Vectors Rather Than Loops

Up until now, when you wanted to do something to every item in a collection, you wrote a loop:

# Loop thinking: process one item at a time
results = []
for value in coverage_values:
    results.append(value / 100)

In pandas, you think differently. Instead of processing one value at a time, you operate on entire columns at once:

# Vector thinking: operate on the whole column
df["coverage_decimal"] = df["coverage_pct"] / 100

That single line divides every value in the coverage_pct column by 100 and stores the results in a new column called coverage_decimal. No loop. No appending. The operation applies to all rows simultaneously.

This is called a vectorized operation, and it's the fundamental mental shift of working with pandas. Instead of asking "how do I process each row?" ask "what operation do I want to apply to the whole column?"

Here's why this matters beyond convenience:

  • Speed: Vectorized operations use optimized C code under the hood. For a million rows, a vectorized operation can be 100x faster than a Python loop.
  • Readability: df["coverage_pct"] / 100 clearly states intent. A loop buries the intent inside iteration mechanics.
  • Safety: Vectorized operations handle missing values (NaN) automatically — NaN / 100 produces NaN, not an error.

This shift — from "loop over items" to "operate on columns" — is one of the most important mental transitions in your data science education. It will feel unnatural at first. That's normal. Keep practicing, and it will become your default way of thinking.
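You can verify for yourself that the two mental models produce identical numbers. A minimal sketch with a made-up three-value Series:

```python
import pandas as pd

coverage = pd.Series([72.3, 85.1, 41.7])

# Loop thinking: one value at a time
loop_result = []
for value in coverage:
    loop_result.append(value / 100)

# Vector thinking: one operation on the whole column
vector_result = coverage / 100

# Same numbers either way -- but the vectorized form also keeps the index
print(list(vector_result) == loop_result)  # True
```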

Arithmetic on Columns

# Create a new column: coverage as a decimal
df["coverage_decimal"] = df["coverage_pct"] / 100
print(df[["country", "coverage_pct", "coverage_decimal"]])
     country  coverage_pct  coverage_decimal
0     Brazil          72.3             0.723
1     Canada          85.1             0.851
2       Chad          41.7             0.417
3    Denmark          93.2             0.932
4   Ethiopia          68.5             0.685
5    Finland          95.0             0.950
6      Ghana          78.4             0.784
7      Haiti          52.1             0.521
8      India          88.0             0.880
9      Japan          97.5             0.975

Combining Columns

You can use multiple columns in a single expression:

# Imagine we have target_population and doses_administered
df2 = pd.DataFrame({
    "country": ["Brazil", "Canada", "Chad"],
    "target_pop": [50000, 28000, 15000],
    "doses_given": [36150, 23800, 6255]
})

# Compute coverage from the raw numbers
df2["computed_coverage"] = (df2["doses_given"] / df2["target_pop"]) * 100
print(df2)
  country  target_pop  doses_given  computed_coverage
0  Brazil       50000        36150               72.3
1  Canada       28000        23800               85.0
2    Chad       15000         6255               41.7

Conditional Columns with Where or Apply

Sometimes you need logic more complex than simple arithmetic. For categorization, you can use apply() with a function:

def classify_coverage(pct):
    """Classify a coverage percentage into a category."""
    if pct >= 90:
        return "High"
    elif pct >= 70:
        return "Medium"
    else:
        return "Low"

df["coverage_level"] = df["coverage_pct"].apply(classify_coverage)
print(df[["country", "coverage_pct", "coverage_level"]])
     country  coverage_pct coverage_level
0     Brazil          72.3         Medium
1     Canada          85.1         Medium
2       Chad          41.7            Low
3    Denmark          93.2           High
4   Ethiopia          68.5            Low
5    Finland          95.0           High
6      Ghana          78.4         Medium
7      Haiti          52.1            Low
8      India          88.0         Medium
9      Japan          97.5           High

The apply() method takes a function and calls it on every value in the Series. It's a bridge between the loop-thinking you know and the vector-thinking you're learning. Use it when you need conditional logic that's too complex for simple arithmetic.

Performance Note: apply() is slower than true vectorized operations because it still iterates through values under the hood (just in a nicer wrapper). For simple conditions, np.where() or pd.cut() are faster alternatives — you'll learn these in Chapter 9. For now, apply() is perfectly fine for datasets up to hundreds of thousands of rows.
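As a quick preview of that faster alternative (covered properly in Chapter 9), np.where handles a single two-way condition without calling a Python function once per value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"coverage_pct": [72.3, 41.7, 93.2]})

# Vectorized two-way split: condition, value if True, value if False
df["meets_target"] = np.where(df["coverage_pct"] >= 70, "Yes", "No")
print(list(df["meets_target"]))  # ['Yes', 'No', 'Yes']
```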

String Operations on Columns

Series has a .str accessor for string operations on text columns:

# Convert country names to uppercase
print(df["country"].str.upper())
0      BRAZIL
1      CANADA
2        CHAD
3     DENMARK
4    ETHIOPIA
5     FINLAND
6       GHANA
7       HAITI
8       INDIA
9       JAPAN
Name: country, dtype: object

# Check which countries start with a specific letter
print(df["country"].str.startswith("C"))
0    False
1     True
2     True
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: country, dtype: bool

This will become especially powerful in Chapter 10 (Working with Text Data), but knowing it exists now gives you a preview of how pandas extends vectorized thinking beyond just numbers.
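Because .str methods return boolean Series, they plug directly into the boolean indexing from Section 7.5. A small sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Brazil", "Canada", "Chad", "Denmark"],
})

# A .str method produces a boolean mask, so it works as a row filter
c_countries = df[df["country"].str.startswith("C")]
print(list(c_countries["country"]))  # ['Canada', 'Chad']
```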


7.8 Loading Real Data: read_csv Deep Dive

You've been building DataFrames by hand. Now let's load real data the way professionals do: with pd.read_csv().

Basic Usage

import pandas as pd

df = pd.read_csv("who_vaccination_data.csv")
print(df.shape)
print(df.head())
(4892, 7)
       country region  year vaccine  coverage_pct  target_population  doses_administered
0  Afghanistan   EMRO  2019    MCV1          64.0                NaN                 NaN
1  Afghanistan   EMRO  2020    MCV1          66.0                NaN                 NaN
2  Afghanistan   EMRO  2021    MCV1          58.0                NaN                 NaN
3  Afghanistan   EMRO  2022    MCV1          62.0                NaN                 NaN
4      Albania   EURO  2019    MCV1          96.0            28000.0             26880.0

Look at the differences from Chapter 6:

  1. One line to load the entire file — no open(), no DictReader, no loop, no list
  2. Automatic type detection — year is an integer, coverage_pct is a float, text columns are objects
  3. Missing values become NaN — not empty strings that crash your math. NaN (Not a Number) is pandas's sentinel for missing data. It participates safely in computations: NaN + 5 is NaN, not an error.
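You can watch both behaviors on an in-memory CSV, with io.StringIO standing in for a file (a small made-up snippet, not the WHO dataset):

```python
import io
import pandas as pd

csv_text = """country,year,coverage_pct
Brazil,2022,72.3
Chad,2022,
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)
# country          object
# year              int64
# coverage_pct    float64

print(df["coverage_pct"].isnull().sum())  # 1 -- Chad's empty cell became NaN
```

No conversion code, no empty-string checks: read_csv inferred the types and marked the gap for you.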

Useful read_csv Parameters

read_csv has dozens of parameters. Here are the ones you'll use most often:

# Specify which columns to load (saves memory on large files)
df = pd.read_csv("who_vaccination_data.csv",
                 usecols=["country", "region", "year", "coverage_pct"])

# Parse a column as dates
df = pd.read_csv("sales_data.csv",
                 parse_dates=["sale_date"])

# Use a specific column as the index
df = pd.read_csv("who_vaccination_data.csv",
                 index_col="country")

# Handle different delimiters (tab-separated, semicolon-separated)
df = pd.read_csv("data.tsv", sep="\t")

# Specify the encoding (important for international data)
df = pd.read_csv("data.csv", encoding="latin-1")

# Skip bad lines instead of crashing
df = pd.read_csv("messy_data.csv", on_bad_lines="skip")

# Read only the first N rows (useful for previewing huge files)
df = pd.read_csv("huge_file.csv", nrows=100)

The NaN Revolution

One of the biggest quality-of-life improvements over pure Python is how pandas handles missing data. In Chapter 6, empty cells loaded as empty strings (""), and you had to check for them manually before every calculation:

# Chapter 6: The missing value dance
raw = row["coverage_pct"].strip()
if raw == "":
    continue
try:
    value = float(raw)
except ValueError:
    continue

In pandas, missing values are NaN (Not a Number), and every statistical method handles them automatically:

# pandas: Just... works
df["coverage_pct"].mean()    # Ignores NaN automatically
df["coverage_pct"].count()   # Counts only non-NaN values
df["coverage_pct"].isnull()  # True where values are missing

We'll dive deep into missing value handling strategies in Chapter 8 (Cleaning Messy Data). For now, the important thing to know is that NaN is your friend, not your enemy — it marks missing data clearly and doesn't crash your calculations.
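Here is that NaN-aware behavior verified on a tiny hand-built Series with one missing value:

```python
import numpy as np
import pandas as pd

s = pd.Series([64.0, np.nan, 96.0])

print(s.mean())             # 80.0 -- the NaN is skipped, not averaged in
print(s.count())            # 2   -- only non-missing values are counted
print(s.isnull().tolist())  # [False, True, False]
```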


7.9 Method Chaining: Composing Operations

One of the most elegant aspects of pandas is method chaining — stringing multiple operations together in a single expression, where each operation feeds its result to the next.

The Basic Idea

Instead of writing:

# Step by step (creating intermediate variables)
filtered = df[df["region"] == "AFRO"]
sorted_df = filtered.sort_values("coverage_pct", ascending=False)
result = sorted_df[["country", "coverage_pct"]].head(5)

You can write:

# Method chain (one flowing expression)
result = (df[df["region"] == "AFRO"]
          .sort_values("coverage_pct", ascending=False)
          [["country", "coverage_pct"]]
          .head(5))

Read it as a sentence: "Take the DataFrame, filter for AFRO region, sort by coverage descending, select the country and coverage columns, and show the top 5."

The parentheses around the entire expression let you break it across multiple lines for readability — a formatting convention you'll see in professional pandas code.
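To see each step feed the next, here is the same chain shape run end to end on a tiny hand-built DataFrame (the numbers are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Chad", "Ghana", "Ethiopia", "Canada"],
    "region": ["AFRO", "AFRO", "AFRO", "AMRO"],
    "coverage_pct": [41.7, 78.4, 68.5, 85.1],
})

# Filter -> sort descending -> select columns -> take the top 2
result = (df[df["region"] == "AFRO"]
          .sort_values("coverage_pct", ascending=False)
          [["country", "coverage_pct"]]
          .head(2))
print(result["country"].tolist())  # ['Ghana', 'Ethiopia']
```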

Why Chain?

Method chaining encourages you to think about data operations as a pipeline — a sequence of transformations where data flows through each step. This mirrors how data scientists describe their work in plain English:

  • "Filter for the African region"
  • "Sort by coverage from highest to lowest"
  • "Show the top 5 countries"

Each verb maps to a pandas method. This is what we mean by the "grammar of data manipulation" — a consistent vocabulary of operations (select, filter, sort, create, summarize) that you compose into analytical sentences.

A Practical Example

Let's ask a real question: "What are the top 3 vaccines by mean coverage in the EURO region?"

result = (df[df["region"] == "EURO"]
          .groupby("vaccine")["coverage_pct"]
          .mean()
          .sort_values(ascending=False)
          .head(3))
print(result)

We haven't formally covered groupby yet (that's Chapter 9), but you can probably read this chain and understand what it does. That's the power of expressive code.

When Not to Chain

Method chaining is elegant, but don't take it too far. If a chain exceeds 5-6 steps, or if intermediate results would be useful for debugging, break it into named variables. Readability always beats cleverness.


7.10 Debugging Walkthrough: Common pandas Errors

Every pandas beginner hits the same set of errors. Let's walk through them so you can diagnose them quickly when they happen to you.

Error 1: KeyError — Column Not Found

# You typed the column name wrong
df["Coverage_Pct"]
KeyError: 'Coverage_Pct'

Why it happens: Column names are case-sensitive. "Coverage_Pct" is not the same as "coverage_pct".

How to fix it: Check the exact column names with df.columns:

print(df.columns.tolist())
# ['country', 'region', 'year', 'vaccine', 'coverage_pct',
#  'target_population', 'doses_administered']

Then use the exact name. Copy-pasting from the output is safer than typing from memory.
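If you want scripts to fail with a friendlier message than a bare KeyError, a small guard works well. This is a defensive pattern of our own, not anything pandas requires — the get_column helper is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"coverage_pct": [64.0, 96.0]})

def get_column(frame, name):
    """Fetch a column, listing the real column names on a typo."""
    if name not in frame.columns:
        raise KeyError(f"{name!r} not found; available: {frame.columns.tolist()}")
    return frame[name]

print(get_column(df, "coverage_pct").mean())  # 80.0
```

Calling get_column(df, "Coverage_Pct") now tells you exactly which names do exist.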

Error 2: SettingWithCopyWarning

This is pandas's most confusing warning, and almost every beginner encounters it:

# This might trigger SettingWithCopyWarning
subset = df[df["region"] == "AFRO"]
subset["coverage_level"] = "unknown"   # Warning!
SettingWithCopyWarning: A value is trying to be set on a copy of a slice
from a DataFrame.

Why it happens: When you filter a DataFrame, pandas sometimes returns a view (linked to the original) and sometimes a copy (independent). If it's a view and you modify it, you might accidentally change the original DataFrame. pandas warns you about this ambiguity.

How to fix it: Use .copy() when you intend to create an independent subset:

# Explicit copy — no warning, no ambiguity
subset = df[df["region"] == "AFRO"].copy()
subset["coverage_level"] = "unknown"   # Safe!

Or use .loc for direct modification:

# Modify the original DataFrame directly
df.loc[df["region"] == "AFRO", "coverage_level"] = "unknown"

The rule of thumb: if you're creating a subset that you'll modify, call .copy(). If you're modifying the original DataFrame, use .loc.
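You can verify the two halves of that rule directly on a small made-up frame — the .copy() version leaves the original untouched, while .loc modifies it:

```python
import pandas as pd

df = pd.DataFrame({"region": ["AFRO", "AMRO"], "coverage_pct": [41.7, 85.1]})

# Independent copy: modifying it never touches df
subset = df[df["region"] == "AFRO"].copy()
subset["coverage_level"] = "unknown"
print("coverage_level" in df.columns)  # False -- df is unchanged

# Direct modification of the original via .loc
df.loc[df["region"] == "AFRO", "coverage_level"] = "unknown"
print(df["coverage_level"].tolist())   # ['unknown', nan]
```

The non-AFRO row gets NaN for the new column, since .loc only assigned to the rows matching the mask.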

Error 3: ValueError with Boolean Operators

# Using Python's 'and' instead of '&'
df[df["region"] == "AFRO" and df["coverage_pct"] > 80]
ValueError: The truth value of a Series is ambiguous.
Use a.empty, a.bool(), a.item(), a.any() or a.all().

Why it happens: Python's and operator calls bool() on each operand, trying to collapse the entire Series into a single True or False. That collapse is ambiguous (is a Series with some True and some False values True or False?), so pandas raises an error instead of guessing.

How to fix it: Use & for element-wise "and," | for "or," and wrap each condition in parentheses:

df[(df["region"] == "AFRO") & (df["coverage_pct"] > 80)]

Error 4: Double Bracket Confusion

# Forgot the inner list brackets
df["country", "region"]
KeyError: ('country', 'region')

Why it happens: df["country", "region"] looks for a single column whose name is the tuple ('country', 'region'). That doesn't exist.

How to fix it: Pass a list of column names:

df[["country", "region"]]   # Note the double brackets
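The bracket count also controls what type you get back, which is worth internalizing early: single brackets return a Series, double brackets return a DataFrame (even for one column):

```python
import pandas as pd

df = pd.DataFrame({"country": ["Chad", "Ghana"], "region": ["AFRO", "AFRO"]})

print(type(df["country"]).__name__)              # Series
print(type(df[["country"]]).__name__)            # DataFrame
print(type(df[["country", "region"]]).__name__)  # DataFrame
```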

Retrieval Practice

Without looking at the examples above, try to answer:

  1. What error do you get if you type a column name with wrong capitalization?
  2. When should you use .copy() on a filtered DataFrame?
  3. Why can't you use Python's and keyword between two pandas boolean conditions?

Answers

  1. KeyError — pandas is case-sensitive about column names.
  2. When you plan to modify the filtered subset (add or change columns).
  3. Because and tries to evaluate the entire Series as a single boolean, which is ambiguous. Use & instead.

7.11 Project Checkpoint: Rebuilding Chapter 6 in pandas

Time to put everything together. Let's rebuild the Chapter 6 analysis of the WHO vaccination data — but this time using pandas. You'll feel the difference viscerally.

Step 1: Load the Data

Chapter 6 (11 lines):

import csv

data = []
with open("who_vaccination_data.csv", "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        data.append(row)

num_rows = len(data)
num_cols = len(data[0]) if data else 0
print(f"Shape: {num_rows} rows x {num_cols} columns")

Chapter 7 (3 lines):

import pandas as pd

df = pd.read_csv("who_vaccination_data.csv")
print(f"Shape: {df.shape[0]} rows x {df.shape[1]} columns")
Shape: 4892 rows x 7 columns

Step 2: Inspect the Data

Chapter 6: Multiple loops and print statements to see columns, first rows, data types, unique values.

Chapter 7:

print(df.head())
print()
print(df.dtypes)
print()
print(df.describe())
print()
df.info()

That's it. Four short commands (three method calls plus the dtypes attribute) give you everything that took 30+ lines in Chapter 6.

Step 3: Count Records by Region

Chapter 6 (7 lines):

region_counts = {}
for row in data:
    region = row["region"]
    region_counts[region] = region_counts.get(region, 0) + 1

for region, count in sorted(region_counts.items()):
    print(f"  {region}: {count} records")

Chapter 7 (1 line):

print(df["region"].value_counts().sort_index())
AFRO    1520
AMRO     820
EMRO     564
EURO    1040
SEARO    308
WPRO     640
Name: region, dtype: int64

Step 4: Unique Countries per Region

Chapter 6 (8 lines):

countries_per_region = {}
for row in data:
    region = row["region"]
    country = row["country"]
    if region not in countries_per_region:
        countries_per_region[region] = set()
    countries_per_region[region].add(country)

for region in sorted(countries_per_region):
    print(f"  {region}: {len(countries_per_region[region])} countries")

Chapter 7 (1 line):

print(df.groupby("region")["country"].nunique())
region
AFRO     47
AMRO     35
EMRO     22
EURO     53
SEARO    11
WPRO     26
Name: country, dtype: int64

Step 5: Summary Statistics by Region

Chapter 6 (15+ lines): Required get_numeric_values(), compute_mean(), compute_median(), manual filtering and grouping.

Chapter 7 (1 line):

print(df.groupby("region")["coverage_pct"].describe())
        count   mean        std   min    25%    50%    75%    max
region
AFRO   1485.0  72.3  22.139...   6.0  58.00  76.0   91.0   99.0
AMRO    805.0  86.4  14.247...  20.0  82.00  90.0   95.0   99.0
EMRO    555.0  80.5  19.876...  11.0  72.00  85.0   94.0   99.0
EURO   1025.0  93.1   6.789...  42.0  91.00  95.0   97.0   99.0
SEARO   295.0  84.8  14.532...  34.0  79.00  88.0   94.0   99.0
WPRO    650.0  88.2  14.673...  15.0  83.00  93.0   97.0   99.0

Count, mean, standard deviation, min, quartiles, and max — for every region — in one line.

Step 6: Filter and Sort

Question: "Which countries in the AFRO region had the lowest MCV1 coverage in 2022?"

Chapter 6 (12+ lines): Filter with nested if statements, sort with sorted() and a key function, format output manually.

Chapter 7 (4 lines):

result = (df[(df["region"] == "AFRO") &
             (df["vaccine"] == "MCV1") &
             (df["year"] == 2022)]
          .sort_values("coverage_pct")
          .head(10))
print(result[["country", "coverage_pct"]])

Step 7: Create a Computed Column

Let's add a column classifying coverage as "Low," "Medium," or "High":

def classify(pct):
    if pct >= 90:
        return "High"
    elif pct >= 70:
        return "Medium"
    else:
        return "Low"

df["coverage_level"] = df["coverage_pct"].apply(classify)

# How many records in each category?
print(df["coverage_level"].value_counts())
High      2547
Medium    1386
Low        882
Name: coverage_level, dtype: int64

The Scoreboard

Operation               Chapter 6 Lines   Chapter 7 Lines   Reduction
Load + inspect          ~20               4                 80%
Count by group          7                 1                 86%
Unique per group        8                 1                 88%
Summary stats by group  15+               1                 93%
Filter + sort           12+               4                 67%
Computed column         8+                4                 50%

The total analysis went from roughly 70 lines of careful Python to about 15 lines of pandas. And the pandas version is more readable, handles missing values automatically, and runs faster on large datasets.

This is the payoff for all the work you put in during Part I. You earned this.


7.12 Spaced Review: Concepts from Chapters 1-6

Learning sticks when you revisit it. Here are retrieval practice questions that connect earlier material to what you've just learned.

Spaced Review Block

Try to answer each question from memory before checking.

From Chapter 1: The data science lifecycle has six stages. Which stage does this chapter's work (selecting, filtering, sorting) belong to? What stage came before it in Chapter 6?

Answer This chapter's work belongs to the data wrangling/exploration stage. In Chapter 6, you were also in exploration but with raw tools. The lifecycle stages are: question formulation, data collection, data cleaning, exploratory analysis, modeling, and communication.

From Chapter 3: Why does pd.read_csv() handle type conversion automatically while csv.DictReader does not? What Python concept explains the difference?

Answer csv.DictReader reads every field as a string because CSV is a text format with no type information. read_csv performs type inference — it examines the values and converts them to appropriate Python/NumPy types (int64, float64, object). The Python concept is type conversion (Ch.3) — pandas just does it for you automatically.

From Chapter 4: The apply() method takes a function as an argument. What is this pattern called, and where did you first encounter it?

Answer This is a higher-order function — a function that takes another function as an argument. You encountered this concept in Chapter 4 with sorted(data, key=some_function). The apply() method extends the same idea to pandas Series.

From Chapter 5: A DataFrame is conceptually similar to what Python data structure from Chapter 5? How is it different?

Answer A DataFrame is conceptually similar to a list of dictionaries (each dictionary is a row, each key is a column name). The differences: a DataFrame enforces that all rows have the same columns, stores data in columns rather than rows (more efficient for column operations), provides type information, and offers built-in methods for common operations.

From Chapter 6: What is the threshold concept from Chapter 6, and how does pandas change your relationship to it?

Answer Chapter 6's threshold concept was "EDA as a conversation with data." pandas doesn't change this concept — you still ask questions and let answers guide you. But pandas makes the conversation faster. When each question takes 1 line instead of 15, you can ask more questions in the same amount of time, making the conversation richer and more productive.


7.13 The Grammar of Data Manipulation: A Framework

Before we close this chapter, let's step back and see the big picture. The operations you've learned — selecting, filtering, sorting, creating columns — aren't random. They're part of a systematic vocabulary for data manipulation. Data scientists sometimes call this the grammar of data manipulation, by analogy with natural language grammar.

Just as English sentences are built from verbs, nouns, and adjectives, data analysis is built from a small set of fundamental operations:

Verb       Meaning                pandas Method              Example
Select     Choose columns         df[["col1", "col2"]]       Keep only country and coverage
Filter     Choose rows            df[df["col"] > value]      Keep only high-coverage rows
Sort       Order rows             df.sort_values("col")      Arrange by coverage
Mutate     Create/modify columns  df["new"] = expression     Add a coverage_level column
Summarize  Aggregate              df["col"].mean()           Compute mean coverage
Group      Split-apply-combine    df.groupby("col")          Statistics by region

Every analysis you'll ever do is some combination of these six verbs. Learning pandas is largely about learning to express your analytical questions in terms of these operations — to translate "Which regions have average coverage below 80%?" into df.groupby("region")["coverage_pct"].mean() < 80.
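Run on a toy frame, that translation looks like this (invented numbers; the chain returns a boolean Series indexed by region):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["AFRO", "AFRO", "EURO", "EURO"],
    "coverage_pct": [60.0, 70.0, 92.0, 94.0],
})

below_80 = df.groupby("region")["coverage_pct"].mean() < 80
print(below_80)
# region
# AFRO     True
# EURO    False
# Name: coverage_pct, dtype: bool
```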

We've covered the first four verbs in this chapter. Summarize and Group will be developed fully in Chapter 9 (Reshaping and Transforming). But you've already seen previews of both.


7.14 What's Coming Next

You now have the foundation for everything in Part II. Here's what the next chapters build on top of:

  • Chapter 8 (Cleaning Messy Data): You'll learn to handle the NaN values you saw in read_csv output — filling them, dropping them, understanding their impact. You'll also fix data type problems, remove duplicates, and standardize messy text.

  • Chapter 9 (Reshaping and Transforming): You'll go deeper on groupby, learn to merge datasets together, and reshape data between "wide" and "long" formats with pivot_table and melt.

  • Chapter 10 (Working with Text Data): You'll use the .str accessor you previewed here along with regular expressions to extract information from messy text fields.

Each of these chapters assumes you're comfortable with the operations from this chapter: creating DataFrames, selecting columns, filtering rows, sorting, and creating new columns. If any of those feel shaky, go back and practice with the exercises before moving forward.


Chapter Summary

This chapter introduced pandas — the library that transforms Python from a general-purpose programming language into a data analysis powerhouse. You learned:

  1. DataFrames are two-dimensional tables; Series are one-dimensional columns. Together they form the backbone of pandas.

  2. Inspection methods (shape, dtypes, head(), describe(), info()) give you instant insight into any dataset — replacing dozens of lines of manual Python.

  3. Column selection uses bracket notation (df["col"] for one column, df[["col1", "col2"]] for multiple).

  4. Row selection uses iloc (by position) and loc (by label).

  5. Boolean indexing (df[df["col"] > value]) replaces loops-with-if-statements for filtering, and conditions are combined with &, |, and ~.

  6. Sorting uses sort_values() with support for multiple columns and ascending/descending control.

  7. Vectorized operations let you operate on entire columns at once (df["new"] = df["old"] * 2), and apply() handles more complex per-row logic.

  8. read_csv loads CSV files in one line, with automatic type detection and NaN for missing values.

  9. Method chaining lets you compose operations into readable pipelines.

The threshold concept — thinking in vectors rather than loops — is the mental shift that unlocks pandas. When you catch yourself reaching for a for loop to process DataFrame data, stop and ask: "Is there a vectorized way to do this?" Almost always, the answer is yes.

You rebuilt your Chapter 6 analysis and saw the code shrink from 70 lines to 15 — not by cutting corners, but by expressing your intent at a higher level. That's not laziness. That's power.

Welcome to Part II. The real work starts now.