Key Takeaways: Introduction to pandas

Contributors to Introduction to Data Science

This is your reference card for Chapter 7 — the chapter where Python stopped being a general-purpose programming language and became a data analysis powerhouse. Keep this nearby whenever you're working with DataFrames.

The Five Core Concepts

DataFrame — A two-dimensional labeled table. Rows and columns, with an index. The primary data structure in pandas.
Series — A one-dimensional labeled array. A single column of a DataFrame. Has an index, values, and a name.
Vectorized operations — Applying an operation to an entire column at once (df["col"] * 2) rather than looping through values one by one. Faster, safer, more readable.
Boolean indexing — Filtering rows using a True/False mask (df[df["col"] > value]). The pandas replacement for loop-with-if-statement.
read_csv — One-line CSV loading with automatic type detection and NaN for missing values. Replaces csv.DictReader + loop + manual type conversion.
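
To see all five concepts in one place, here is a minimal sketch. The CSV text and its values are made up for illustration; the column names simply mirror the ones used later on this card, and `io.StringIO` stands in for a real file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content. The empty coverage_pct field for Chad
# is read as NaN.
csv_text = """country,region,coverage_pct
Ghana,AFRO,92
Chad,AFRO,
Peru,AMRO,85
"""

# read_csv: one call loads the data and infers column types.
df = pd.read_csv(io.StringIO(csv_text))

# DataFrame vs. Series: selecting one column gives a Series.
coverage = df["coverage_pct"]

# Vectorized operation: divide the whole column at once, no loop.
fraction = coverage / 100

# Boolean indexing: keep only rows where the mask is True.
# Comparisons against NaN are False, so Chad is dropped.
high = df[df["coverage_pct"] > 90]
print(high)
```

Pass a file path to `pd.read_csv()` instead of the `StringIO` object when you load a real dataset.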

DataFrame Operations Quick Reference

Inspection

Method / Attribute	What It Returns
`df.shape`	Tuple: (rows, columns)
`df.dtypes`	Data type of each column
`df.columns`	Column names
`df.head(n)`	First n rows (default 5)
`df.tail(n)`	Last n rows (default 5)
`df.info()`	Concise summary: types, non-null counts, memory
`df.describe()`	Summary statistics for numeric columns
`df["col"].value_counts()`	Count of each unique value
`df["col"].unique()`	Array of unique values
`df["col"].nunique()`	Number of unique values
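
A quick way to get comfortable with these is to run them on a tiny table. The sketch below uses a small invented DataFrame (the countries and numbers are placeholders, not real figures).

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "country": ["Ghana", "Chad", "Peru", "Fiji"],
    "region": ["AFRO", "AFRO", "AMRO", "WPRO"],
    "coverage_pct": [92.0, 58.0, 85.0, 95.0],
})

print(df.shape)       # (4, 3)
print(df.dtypes)      # one dtype per column
print(df.head(2))     # first two rows
df.info()             # prints its summary directly
print(df.describe())  # statistics for the numeric column

region_counts = df["region"].value_counts()  # rows per region
n_regions = df["region"].nunique()           # number of distinct regions
unique_regions = df["region"].unique()       # the distinct values themselves
```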

Selection

What You Want	Syntax	Returns
One column	`df["col"]`	Series
Multiple columns	`df[["col1", "col2"]]`	DataFrame
Rows by position	`df.iloc[start:stop]`	DataFrame
Rows by label	`df.loc[start:stop]`	DataFrame
Rows + columns by position	`df.iloc[rows, cols]`	varies
Rows + columns by label	`df.loc[rows, cols]`	varies

Key difference: an iloc slice excludes its stop position (like ordinary Python slicing). A loc slice includes its stop label.
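
The difference between the two slice ends is easiest to see with a non-numeric index. A small sketch, using an invented five-row DataFrame with letter labels:

```python
import pandas as pd

# Hypothetical data: letter labels make position and label easy to tell apart.
df = pd.DataFrame(
    {"value": [10, 20, 30, 40, 50]},
    index=["a", "b", "c", "d", "e"],
)

# iloc: positions 1 and 2. Position 3 is excluded.
by_position = df.iloc[1:3]

# loc: labels "b", "c", and "d". The stop label is included.
by_label = df.loc["b":"d"]

# A single row label plus a single column label returns a scalar,
# which is why the table above says "varies".
one_value = df.loc["b", "value"]
```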

Filtering

Pattern	Example
Single condition	`df[df["col"] > value]`
AND (both true)	`df[(df["col1"] > x) & (df["col2"] == y)]`
OR (either true)	`df[(df["col1"] > x) | (df["col2"] == y)]`
NOT	`df[~(df["col"] == value)]`
Membership	`df[df["col"].isin(["a", "b", "c"])]`

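Always wrap each condition in parentheses when combining with & or |. Both operators bind more tightly than comparisons such as > and ==, so without parentheses Python groups the expression incorrectly.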
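
Here is each filtering pattern applied to the same small table, as a sketch. The data is invented for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "country": ["Ghana", "Chad", "Peru", "Fiji"],
    "region": ["AFRO", "AFRO", "AMRO", "WPRO"],
    "coverage_pct": [92, 58, 85, 95],
})

# AND: both conditions must hold.
both = df[(df["region"] == "AFRO") & (df["coverage_pct"] > 80)]

# OR: at least one condition must hold.
either = df[(df["region"] == "AFRO") | (df["coverage_pct"] > 90)]

# NOT: invert the mask.
not_afro = df[~(df["region"] == "AFRO")]

# Membership: value is one of several options.
members = df[df["region"].isin(["AMRO", "WPRO"])]
```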

Sorting

df.sort_values("col")                           # Ascending (default)
df.sort_values("col", ascending=False)           # Descending
df.sort_values(["col1", "col2"], ascending=[True, False])  # Multi-column
df.sort_values("col").reset_index(drop=True)     # Clean index after sort
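
The sketch below shows why `reset_index(drop=True)` is worth adding after a sort: the sorted rows keep their original index labels until you reset them. The data is invented for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "country": ["Ghana", "Chad", "Peru", "Fiji"],
    "region": ["AFRO", "AFRO", "AMRO", "WPRO"],
    "coverage_pct": [92, 58, 85, 95],
})

# Descending sort: rows move, but each keeps its old index label.
ranked = df.sort_values("coverage_pct", ascending=False)
print(ranked.index.tolist())   # [3, 0, 2, 1]

# Reset to a clean 0..n-1 index and discard the old labels.
clean = ranked.reset_index(drop=True)

# Multi-column: region A-Z, then highest coverage first within each region.
multi = df.sort_values(["region", "coverage_pct"], ascending=[True, False])
```

Note that `sort_values` returns a new DataFrame; `df` itself is unchanged.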

Creating and Modifying Columns

df["new_col"] = df["col"] * 2                    # Vectorized arithmetic
df["new_col"] = df["col1"] + df["col2"]          # Combine columns
df["new_col"] = df["col"].apply(some_function)   # Apply custom logic
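
As a concrete sketch of both styles: vectorized arithmetic for simple math, and `apply` when the logic needs a custom function. The `label_coverage` helper, the 90 threshold, and the data are all hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "country": ["Ghana", "Chad"],
    "coverage_pct": [92, 58],
})

# Vectorized arithmetic: computed for every row at once.
df["shortfall_pct"] = 100 - df["coverage_pct"]

# Custom logic: a hypothetical helper with an arbitrary threshold.
def label_coverage(pct):
    """Return 'high' for coverage of 90 or more, otherwise 'low'."""
    return "high" if pct >= 90 else "low"

# apply calls label_coverage once per value and returns a new Series.
df["coverage_level"] = df["coverage_pct"].apply(label_coverage)
```

Prefer the vectorized form when one exists; `apply` runs your Python function value by value.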

Method Chaining Pattern

result = (df[df["region"] == "AFRO"]              # Filter
          .sort_values("coverage_pct",             # Sort
                       ascending=False)
          [["country", "coverage_pct"]]            # Select columns
          .head(10))                               # Limit rows

Read chains as sentences. Break across lines using parentheses. Keep chains to about five or six steps at most. Use intermediate variables when debugging.
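
When a chain misbehaves, you can unroll it into intermediate variables and inspect each step. The sketch below runs the chain above on invented data, then repeats it step by step; the variable names are only suggestions.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "country": ["Ghana", "Chad", "Peru", "Fiji"],
    "region": ["AFRO", "AFRO", "AMRO", "WPRO"],
    "coverage_pct": [92, 58, 85, 95],
})

# The chained version.
result = (df[df["region"] == "AFRO"]
          .sort_values("coverage_pct", ascending=False)
          [["country", "coverage_pct"]]
          .head(10))

# The same steps, unrolled for debugging. Inspect any variable
# (for example with .shape or .head()) to find where things go wrong.
afro = df[df["region"] == "AFRO"]
afro_sorted = afro.sort_values("coverage_pct", ascending=False)
result_unrolled = afro_sorted[["country", "coverage_pct"]].head(10)
```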

Common Gotchas

Gotcha	Symptom	Fix
Wrong column name	`KeyError`	Check `df.columns.tolist()` — names are case-sensitive
`and` instead of `&`	`ValueError: ambiguous truth value`	Use `&` for AND, `|` for OR, `~` for NOT
Single brackets for multiple columns	`KeyError` with a tuple	Use double brackets: `df[["col1", "col2"]]`
Modifying a filtered subset	`SettingWithCopyWarning`	Call `.copy()` on the subset, or assign through `.loc` on the original DataFrame
`loc` vs `iloc` slice end	Off-by-one errors	`loc` is inclusive, `iloc` is exclusive
Dot notation on special names	Wrong result or `AttributeError`	Always use bracket notation: `df["count"]` not `df.count` (which is the built-in `count` method, not your column)
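
Three of these gotchas are easy to reproduce and fix in a few lines. A minimal sketch on invented data; the exact warning or error text may differ between pandas versions.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "country": ["Ghana", "Chad", "Peru", "Fiji"],
    "region": ["AFRO", "AFRO", "AMRO", "WPRO"],
    "coverage_pct": [92, 58, 85, 95],
})

# Gotcha: `and` asks for a single True/False from a whole Series.
try:
    df[(df["coverage_pct"] > 80) and (df["region"] == "AFRO")]
except ValueError as err:
    message = str(err)   # mentions an ambiguous truth value

# Fix 1: take an explicit copy before modifying a filtered subset.
subset = df[df["region"] == "AFRO"].copy()
subset["coverage_pct"] = 0          # df itself is untouched

# Fix 2: assign through .loc with a row mask and a column name.
updated = df.copy()
updated.loc[updated["region"] == "AFRO", "coverage_pct"] = 100

# Gotcha: a column named "count" collides with the count() method.
counts_df = pd.DataFrame({"count": [1, 2]})
method = counts_df.count            # the method, not the column
column = counts_df["count"]         # the column
```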

Terms to Remember

Term	Definition
pandas	Open-source Python library for data manipulation and analysis; provides DataFrame and Series
DataFrame	Two-dimensional labeled data structure; a table with rows and columns
Series	One-dimensional labeled array; a single column of data with an index
index	Row labels in a DataFrame or Series; default is 0-based integers
column	A named vertical slice of a DataFrame; each column is a Series
row	A horizontal slice of a DataFrame; represents one observation or record
loc	Label-based indexer for selecting by index values and column names
iloc	Integer-position-based indexer for selecting by numeric position
boolean indexing	Filtering rows using a Series of True/False values (a boolean mask)
vectorized operation	An operation applied to an entire array/column at once, without explicit looping
apply	Series method that calls a function on each value, returning a new Series
read_csv	pandas function to load a CSV file into a DataFrame with automatic type detection
dtypes	DataFrame attribute showing the data type of each column
shape	DataFrame attribute returning (rows, columns) as a tuple
describe	DataFrame/Series method returning summary statistics for numeric data
head	DataFrame/Series method returning the first n rows (default 5)

What You Should Be Able to Do Now

Use this checklist to verify you've absorbed the chapter. If any item feels shaky, revisit the relevant section or practice with the exercises.

[ ] Import pandas with the standard pd alias
[ ] Create DataFrames from dictionaries, lists of dictionaries, and CSV files
[ ] Inspect any DataFrame with shape, dtypes, head(), describe(), and info()
[ ] Select columns using bracket notation (single and multiple)
[ ] Select rows using iloc (by position) and loc (by label)
[ ] Filter rows using boolean indexing with single and combined conditions
[ ] Sort by one or multiple columns, ascending or descending
[ ] Create new columns using vectorized arithmetic and apply()
[ ] Load a CSV file with pd.read_csv() and understand what NaN means
[ ] Chain methods into readable multi-step pipelines
[ ] Diagnose common errors: KeyError, SettingWithCopyWarning, ValueError with boolean operators
[ ] Explain why vectorized operations are preferred over loops
[ ] Translate English questions into pandas expressions using the grammar of data manipulation

If you checked every box, you're ready for Chapter 8, where you'll learn to handle the messy reality that NaN values and dirty data bring. The tools get sharper from here.