Key Takeaways: Introduction to pandas

This is your reference card for Chapter 7, the chapter where Python went from a general-purpose programming language to a data analysis powerhouse. Keep this nearby whenever you're working with DataFrames.


The Five Core Concepts

  1. DataFrame — A two-dimensional labeled table. Rows and columns, with an index. The primary data structure in pandas.

  2. Series — A one-dimensional labeled array. A single column of a DataFrame. Has an index, values, and a name.

  3. Vectorized operations — Applying an operation to an entire column at once (df["col"] * 2) rather than looping through values one by one. Faster, safer, more readable.

  4. Boolean indexing — Filtering rows using a True/False mask (df[df["col"] > value]). The pandas replacement for loop-with-if-statement.

  5. read_csv — One-line CSV loading with automatic type detection and NaN for missing values. Replaces csv.DictReader + loop + manual type conversion.
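The five concepts fit together in a few lines. The sketch below is illustrative only: the inline CSV text, its column names, and its values are made up for the example, and io.StringIO stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content (not from the chapter); Chad's coverage is left blank.
csv_text = """country,region,coverage_pct
Ghana,AFRO,95
Chad,AFRO,
Peru,AMRO,88
"""

df = pd.read_csv(io.StringIO(csv_text))  # read_csv: one line, types inferred
coverage = df["coverage_pct"]            # Series: a single labeled column
fraction = coverage / 100                # Vectorized: the whole column at once
mask = coverage > 90                     # Boolean mask; comparing NaN gives False
high = df[mask]                          # Boolean indexing keeps the True rows

print(df.dtypes)  # the blank cell became NaN, so coverage_pct is a float column
print(high)
```

Because of the missing value, pandas stores coverage_pct as floating point rather than integer, which is worth checking with df.dtypes after any load.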


DataFrame Operations Quick Reference

Inspection

Method / Attribute          What It Returns
df.shape                    Tuple: (rows, columns)
df.dtypes                   Data type of each column
df.columns                  Column names
df.head(n)                  First n rows (default 5)
df.tail(n)                  Last n rows (default 5)
df.info()                   Prints a concise summary (types, non-null counts, memory); returns None
df.describe()               Summary statistics for numeric columns
df["col"].value_counts()    Count of each unique value
df["col"].unique()          Array of unique values
df["col"].nunique()         Number of unique values
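As a quick way to try these, the sketch below runs the inspection calls on a small hand-built DataFrame. The data is hypothetical and exists only to make the outputs easy to verify by eye.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "country": ["Ghana", "Chad", "Peru", "Fiji"],
    "region": ["AFRO", "AFRO", "AMRO", "WPRO"],
    "coverage_pct": [95.0, 61.0, 88.0, 99.0],
})

print(df.shape)                     # (rows, columns)
print(df.dtypes)                    # one dtype per column
print(df.head(2))                   # first two rows
print(df["region"].value_counts())  # how often each region appears
print(df["region"].nunique())       # number of distinct regions

summary = df.describe()             # a DataFrame of statistics for numeric columns
info_result = df.info()             # prints its summary; the return value is None
```

Note that df.info() is the odd one out: it prints rather than returning something you can assign and reuse.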

Selection

What You Want                 Syntax                  Returns
One column                    df["col"]               Series
Multiple columns              df[["col1", "col2"]]    DataFrame
Rows by position              df.iloc[start:stop]     DataFrame
Rows by label                 df.loc[start:stop]      DataFrame
Rows + columns by position    df.iloc[rows, cols]     Scalar, Series, or DataFrame (depends on the selectors)
Rows + columns by label       df.loc[rows, cols]      Scalar, Series, or DataFrame (depends on the selectors)

Key difference: iloc uses exclusive end (like Python slicing). loc uses inclusive end.
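The difference is easiest to see when the same slice is passed to both indexers on a default integer index. This is a minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical data; the default index is 0, 1, 2, 3.
df = pd.DataFrame({"coverage_pct": [95, 61, 88, 99]})

by_position = df.iloc[0:2]          # positions 0 and 1; the stop is excluded
by_label = df.loc[0:2]              # labels 0, 1, and 2; the stop is included

cell = df.loc[1, "coverage_pct"]    # single row + single column: a scalar
column = df.loc[:, "coverage_pct"]  # all rows + single column: a Series

print(len(by_position), len(by_label))
```

With the default index the labels happen to equal the positions, which is exactly when this off-by-one is hardest to notice.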

Filtering

Pattern             Example
Single condition    df[df["col"] > value]
AND (both true)     df[(df["col1"] > x) & (df["col2"] == y)]
OR (either true)    df[(df["col1"] > x) | (df["col2"] == y)]
NOT                 df[~(df["col"] == value)]
Membership          df[df["col"].isin(["a", "b", "c"])]

Always wrap each condition in parentheses when combining with & or |.
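The filtering patterns can be checked against a small table. The data and variable names below are hypothetical, chosen so each result is easy to predict.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "country": ["Ghana", "Chad", "Peru", "Fiji"],
    "region": ["AFRO", "AFRO", "AMRO", "WPRO"],
    "coverage_pct": [95, 61, 88, 99],
})

afro_high = df[(df["region"] == "AFRO") & (df["coverage_pct"] > 90)]  # AND
amro_or_high = df[(df["region"] == "AMRO") | (df["coverage_pct"] > 90)]  # OR
not_afro = df[~(df["region"] == "AFRO")]                              # NOT
picked = df[df["region"].isin(["AMRO", "WPRO"])]                      # Membership

print(afro_high["country"].tolist())
print(amro_or_high["country"].tolist())
```

The parentheses matter because Python gives & and | higher precedence than comparisons such as > and ==, so an unparenthesized condition is grouped differently from how it reads.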

Sorting

df.sort_values("col")                           # Ascending (default)
df.sort_values("col", ascending=False)           # Descending
df.sort_values(["col1", "col2"], ascending=[True, False])  # Multi-column
df.sort_values("col").reset_index(drop=True)     # Clean index after sort
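To see what these calls do to the index, here is a small runnable sketch. The data is hypothetical; the point is that sort_values returns a new DataFrame whose rows keep their original index labels until reset_index(drop=True) renumbers them.

```python
import pandas as pd

# Hypothetical sample data, deliberately out of order.
df = pd.DataFrame({
    "country": ["Peru", "Chad", "Fiji", "Ghana"],
    "region": ["AMRO", "AFRO", "WPRO", "AFRO"],
    "coverage_pct": [88, 61, 99, 95],
})

# Region A to Z, then coverage high to low within each region.
ranked = df.sort_values(["region", "coverage_pct"], ascending=[True, False])
print(ranked.index.tolist())   # original labels, now shuffled

clean = ranked.reset_index(drop=True)
print(clean.index.tolist())    # renumbered from 0

print(df["country"].tolist())  # the original DataFrame is unchanged
```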

Creating and Modifying Columns

df["new_col"] = df["col"] * 2                    # Vectorized arithmetic
df["new_col"] = df["col1"] + df["col2"]          # Combine columns
df["new_col"] = df["col"].apply(some_function)   # Apply custom logic
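A runnable version of these three patterns, using hypothetical column names and a hypothetical helper function (waste_label) that are not from the chapter:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "country": ["Ghana", "Chad", "Peru"],
    "doses_used": [950, 610, 880],
    "doses_wasted": [50, 90, 20],
})


def waste_label(pct):
    """Hypothetical rule: flag waste above 10 percent."""
    return "high" if pct > 10 else "ok"


# Combine columns: element-wise addition, no loop.
df["doses_total"] = df["doses_used"] + df["doses_wasted"]

# Vectorized arithmetic across whole columns.
df["waste_pct"] = df["doses_wasted"] / df["doses_total"] * 100

# Custom logic: apply calls waste_label once per value.
df["waste_level"] = df["waste_pct"].apply(waste_label)

print(df)
```

Prefer plain arithmetic where it can express the calculation, and reach for apply when the logic needs branching that arithmetic alone does not cover.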

Method Chaining Pattern

result = (df[df["region"] == "AFRO"]              # Filter
          .sort_values("coverage_pct",             # Sort
                       ascending=False)
          [["country", "coverage_pct"]]            # Select columns
          .head(10))                               # Limit rows

Read chains as sentences. Break across lines using parentheses. Keep chains under 5-6 steps. Use intermediate variables when debugging.
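The advice about intermediate variables can be sketched as follows. The data is hypothetical, and the chain is shortened to the top two rows so the result is easy to check; the second half spells out the same steps one variable at a time, which is handy when a chain returns something unexpected.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "country": ["Ghana", "Chad", "Peru", "Kenya"],
    "region": ["AFRO", "AFRO", "AMRO", "AFRO"],
    "coverage_pct": [95, 61, 88, 89],
})

# Chained form: filter, sort, select columns, limit rows.
result = (df[df["region"] == "AFRO"]
          .sort_values("coverage_pct", ascending=False)
          [["country", "coverage_pct"]]
          .head(2))

# Debugging form: one step per variable, so each can be inspected.
afro = df[df["region"] == "AFRO"]
ranked = afro.sort_values("coverage_pct", ascending=False)
trimmed = ranked[["country", "coverage_pct"]]
top_two = trimmed.head(2)

print(result)
print(result.equals(top_two))  # both forms produce the same table
```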


Common Gotchas

Gotcha                                  Symptom                              Fix
Wrong column name                       KeyError                             Check df.columns.tolist(); names are case-sensitive
and instead of &                        ValueError: ambiguous truth value    Use & for AND, | for OR, ~ for NOT
Single brackets for multiple columns    KeyError with a tuple                Use double brackets: df[["col1", "col2"]]
Modifying a filtered subset             SettingWithCopyWarning               Use .copy() or .loc
loc vs iloc slice end                   Off-by-one errors                    loc is inclusive, iloc is exclusive
Dot notation on special names           Wrong result or AttributeError       Always use bracket notation: df["count"], not df.count
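Several of these gotchas can be reproduced safely with try/except. The sketch below uses hypothetical data with a column deliberately named "count"; the seen dictionary is just a helper for recording which exception each mistake raised.

```python
import pandas as pd

# Hypothetical sample data; "count" clashes with the DataFrame.count method.
df = pd.DataFrame({
    "country": ["Ghana", "Chad", "Peru"],
    "region": ["AFRO", "AFRO", "AMRO"],
    "count": [95, 61, 88],
})

seen = {}  # gotcha -> name of the exception it raised

try:
    df["Country"]  # wrong case: the column is "country"
except KeyError as exc:
    seen["wrong column name"] = type(exc).__name__

try:
    df[(df["count"] > 90) and (df["region"] == "AFRO")]  # should be &
except ValueError as exc:
    seen["and instead of &"] = type(exc).__name__

try:
    df["country", "count"]  # should be df[["country", "count"]]
except KeyError as exc:
    seen["single brackets"] = type(exc).__name__

# Take an explicit copy before modifying a filtered subset.
afro = df[df["region"] == "AFRO"].copy()
afro["count"] = afro["count"] + 1  # df itself is not changed

# Dot notation finds the method, not the column.
dot_is_method = callable(df.count)

print(seen)
print(dot_is_method)
```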

Terms to Remember

Term                    Definition
pandas                  Open-source Python library for data manipulation and analysis; provides DataFrame and Series
DataFrame               Two-dimensional labeled data structure; a table with rows and columns
Series                  One-dimensional labeled array; a single column of data with an index
index                   Row labels in a DataFrame or Series; default is 0-based integers
column                  A named vertical slice of a DataFrame; each column is a Series
row                     A horizontal slice of a DataFrame; represents one observation or record
loc                     Label-based indexer for selecting by index values and column names
iloc                    Integer-position-based indexer for selecting by numeric position
boolean indexing        Filtering rows using a Series of True/False values (a boolean mask)
vectorized operation    An operation applied to an entire array/column at once, without explicit looping
apply                   Series method that calls a function on each value, returning a new Series
read_csv                pandas function to load a CSV file into a DataFrame with automatic type detection
dtypes                  DataFrame attribute showing the data type of each column
shape                   DataFrame attribute returning (rows, columns) as a tuple
describe                DataFrame/Series method returning summary statistics for numeric data
head                    DataFrame/Series method returning the first n rows (default 5)

What You Should Be Able to Do Now

Use this checklist to verify you've absorbed the chapter. If any item feels shaky, revisit the relevant section or practice with the exercises.

  • [ ] Import pandas with the standard pd alias
  • [ ] Create DataFrames from dictionaries, lists of dictionaries, and CSV files
  • [ ] Inspect any DataFrame with shape, dtypes, head(), describe(), and info()
  • [ ] Select columns using bracket notation (single and multiple)
  • [ ] Select rows using iloc (by position) and loc (by label)
  • [ ] Filter rows using boolean indexing with single and combined conditions
  • [ ] Sort by one or multiple columns, ascending or descending
  • [ ] Create new columns using vectorized arithmetic and apply()
  • [ ] Load a CSV file with pd.read_csv() and understand what NaN means
  • [ ] Chain methods into readable multi-step pipelines
  • [ ] Diagnose common errors: KeyError, SettingWithCopyWarning, ValueError with boolean operators
  • [ ] Explain why vectorized operations are preferred over loops
  • [ ] Translate English questions into pandas expressions using the grammar of data manipulation

If you checked every box, you're ready for Chapter 8, where you'll learn to handle the messy reality that NaN values and dirty data bring. The tools get sharper from here.