Key Takeaways: Chapter 10 — Introduction to pandas


The Big Ideas

1. pandas gives Python the data analysis power that Excel provides through a GUI — but faster, more reproducible, and at any scale.

Excel is a great tool for small datasets and ad hoc exploration. pandas is the right tool when you need to handle more rows than Excel can manage, produce consistent repeatable results, automate a recurring analysis, or communicate your method transparently to colleagues. For data-driven business work in Python, pandas is not optional — it is the foundation.

2. Two data structures underpin everything: the Series and the DataFrame.

A Series is a one-dimensional labeled array. Every column in a spreadsheet is a Series. A DataFrame is a two-dimensional labeled table. Your entire spreadsheet is a DataFrame. Understanding these two structures and how they relate — a DataFrame is a collection of aligned Series — is the conceptual key to all of pandas.
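The relationship can be seen directly in code. A minimal sketch using made-up product data (the SKU labels and prices are hypothetical):

```python
import pandas as pd

# A Series: a one-dimensional labeled array (one spreadsheet column)
prices = pd.Series([19.99, 4.50, 12.00],
                   index=["ACM-001", "ACM-002", "ACM-003"])

# A DataFrame: a two-dimensional labeled table — several Series
# aligned on a shared row index (the whole spreadsheet)
df = pd.DataFrame(
    {
        "unit_price": [19.99, 4.50, 12.00],
        "unit_cost": [12.00, 2.25, 9.00],
    },
    index=["ACM-001", "ACM-002", "ACM-003"],
)

# Pulling one column out of a DataFrame gives you back a Series
print(type(df["unit_price"]))  # <class 'pandas.core.series.Series'>
```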

3. The universal import convention is import pandas as pd.

This alias is used everywhere: tutorials, documentation, Stack Overflow, textbooks, production codebases. Using pd is not just a convention — it is a professional signal that you understand the ecosystem. Never write import pandas as pandas, and never use from pandas import * in business code.

4. Always inspect before you analyze.

Before performing any analysis on a new dataset, run .info(), .describe(), .head(), and check .shape and .dtypes. This 60-second habit prevents wasted hours investigating results that are wrong because of a data type issue, unexpected nulls, or a column layout that differs from what you assumed. Priya's three-step mantra: load, inspect, then calculate.
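The inspection habit takes only a few lines. A sketch with hypothetical product data standing in for a freshly loaded file:

```python
import pandas as pd

# Stand-in for data you just loaded and have not yet verified
df = pd.DataFrame({
    "product": ["Widget", "Gadget", "Gizmo"],
    "unit_price": [19.99, 4.50, 12.00],
})

# Load, inspect, then calculate:
print(df.shape)       # (rows, columns)
print(df.dtypes)      # one dtype per column — catch numbers stored as text
print(df.head())      # eyeball the first rows against your assumptions
df.info()             # structure, dtypes, and non-null counts in one view
print(df.describe())  # summary statistics for the numeric columns
```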

5. df['col'] vs df[['col']] — single vs. double brackets matter.

df['unit_price'] returns a Series (one-dimensional). df[['unit_price']] returns a single-column DataFrame (two-dimensional). They look similar but behave differently when passed to other functions or when chaining operations. Use bracket notation (df['col']) consistently over dot notation (df.col), since dot notation breaks with spaces in column names and can silently conflict with DataFrame method names.
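The difference is easy to verify. A small sketch (hypothetical columns):

```python
import pandas as pd

df = pd.DataFrame({"unit_price": [19.99, 4.50],
                   "unit_cost": [12.00, 2.25]})

s = df["unit_price"]        # single brackets  -> Series (1-D)
frame = df[["unit_price"]]  # double brackets  -> DataFrame (2-D)

print(type(s).__name__)     # Series
print(type(frame).__name__) # DataFrame
print(s.shape)              # (2,)   — one dimension
print(frame.shape)          # (2, 1) — two dimensions, one column
```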

6. .loc[] is label-based; .iloc[] is position-based. They are not interchangeable.

.loc['ACM-003'] finds the row labeled ACM-003. .iloc[2] finds the row at the third integer position. After filtering or sorting a DataFrame, the label and the position of a row can diverge. Using the wrong indexer is a common source of silent bugs. Default rule: use .loc[] when you know the label, use .iloc[] when you want the Nth row.
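The divergence after sorting is worth seeing once. A sketch reusing the hypothetical SKU index:

```python
import pandas as pd

df = pd.DataFrame(
    {"unit_price": [19.99, 4.50, 12.00]},
    index=["ACM-001", "ACM-002", "ACM-003"],
)

# Before sorting, label and position happen to agree
print(df.loc["ACM-003", "unit_price"])   # 12.0 — by label
print(df.iloc[2]["unit_price"])          # 12.0 — by position

# After sorting, they diverge
sorted_df = df.sort_values("unit_price")
print(sorted_df.iloc[0].name)            # ACM-002 — cheapest row is now first
print(sorted_df.loc["ACM-001", "unit_price"])  # 19.99 — label still works
```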

7. Boolean filtering is the pandas equivalent of AutoFilter — and it is far more powerful.

df[df['margin_pct'] < 0.20] returns every row where margin is below 20%. Combine conditions with & (and) and | (or), never with Python's and / or keywords. Always wrap each condition in parentheses when combining. Use .isin([...]) for "matches any of these values." Use ~ to negate a condition.
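All five patterns in one sketch, using made-up products and margins:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["Widget", "Gadget", "Gizmo", "Doohickey"],
    "category": ["tools", "toys", "toys", "tools"],
    "margin_pct": [0.35, 0.15, 0.22, 0.08],
})

low = df[df["margin_pct"] < 0.20]                               # single condition
low_toys = df[(df["margin_pct"] < 0.20) & (df["category"] == "toys")]   # AND
either = df[(df["margin_pct"] < 0.20) | (df["category"] == "toys")]     # OR
not_toys = df[~(df["category"] == "toys")]                      # NOT
tools = df[df["category"].isin(["tools"])]                      # match any of a list

print(low["product"].tolist())       # ['Gadget', 'Doohickey']
print(low_toys["product"].tolist())  # ['Gadget']
```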

8. Vectorized operations beat loops every time.

df['margin'] = (df['unit_price'] - df['unit_cost']) / df['unit_price'] applies to all rows simultaneously. An equivalent for loop over .iterrows() does the same thing but can be 100–1000x slower and is harder to read. The pandas mindset is column-first, not row-first. When you find yourself writing a loop over a DataFrame, stop and ask: can I express this as a column operation?
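Both forms side by side, with the loop shown only for contrast (hypothetical prices and costs):

```python
import pandas as pd

df = pd.DataFrame({
    "unit_price": [19.99, 4.50, 12.00],
    "unit_cost": [12.00, 2.25, 9.00],
})

# Vectorized: one expression applied to every row at once
df["margin"] = (df["unit_price"] - df["unit_cost"]) / df["unit_price"]

# The equivalent .iterrows() loop — same result, far slower at scale
margins = []
for _, row in df.iterrows():
    margins.append((row["unit_price"] - row["unit_cost"]) / row["unit_price"])

print(df["margin"].round(3).tolist())
```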

9. Most pandas methods return new objects; they do not modify in place.

df.sort_values('price') gives you a sorted DataFrame but leaves df unchanged. df.drop('col', axis=1) returns a DataFrame without that column but leaves df unchanged. To apply the change, always reassign: df = df.sort_values('price'). inplace=True is available but widely considered harder to reason about in complex code.
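A sketch showing the difference between calling and reassigning (hypothetical prices):

```python
import pandas as pd

df = pd.DataFrame({"price": [12.00, 4.50, 19.99]})

df.sort_values("price")        # returns a new DataFrame; df is unchanged
print(df["price"].iloc[0])     # 12.0 — still the original order

df = df.sort_values("price")   # reassign to keep the result
print(df["price"].iloc[0])     # 4.5 — now actually sorted
```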

10. Code is more reproducible than clicks.

The most valuable quality of a pandas analysis is that it can be re-run. When Marcus gives Priya a new product export next quarter, she re-runs her script. When a stakeholder asks "what would this look like with a 25% margin threshold instead of 20%?", Priya changes one number and re-runs. This reproducibility — impossible with a manually filtered spreadsheet — is the deepest business value of learning pandas.


Method Quick Reference

Inspection

| Method / Attribute | What It Returns |
| --- | --- |
| df.shape | Tuple (rows, columns) |
| df.dtypes | Data type of each column |
| df.columns | Index of column names |
| df.index | Row index |
| df.head(n) | First n rows (default 5) |
| df.tail(n) | Last n rows (default 5) |
| df.info() | Structure, dtypes, non-null counts |
| df.describe() | Summary statistics for numeric columns |

Selection

| Operation | Syntax |
| --- | --- |
| Single column (Series) | df['col'] |
| Multiple columns (DataFrame) | df[['col1', 'col2']] |
| Row by label | df.loc['label'] |
| Row by position | df.iloc[n] |
| Row + column by label | df.loc['label', ['col1', 'col2']] |
| Row + column by position | df.iloc[n, 0:3] |

Filtering

| Operation | Syntax |
| --- | --- |
| Single condition | df[df['col'] > value] |
| AND | df[(condition1) & (condition2)] |
| OR | df[(condition1) \| (condition2)] |
| NOT | df[~(condition)] |
| Match any of list | df[df['col'].isin([a, b, c])] |

Modifying

| Operation | Syntax |
| --- | --- |
| Add column | df['new_col'] = expression |
| Drop column | df = df.drop('col', axis=1) |
| Drop row | df = df.drop('label') |
| Sort | df = df.sort_values('col', ascending=False) |
| Set index | df = df.set_index('col') |
| Reset index | df = df.reset_index() |

Common Mistakes to Avoid

| Mistake | What Happens | Fix |
| --- | --- | --- |
| Using and / or in Boolean filters | ValueError (truth value of a Series is ambiguous) | Use & / \| with parentheses |
| Forgetting parentheses in combined conditions | TypeError or a silently wrong result (operator precedence) | Wrap each condition: (cond1) & (cond2) |
| Not assigning the .sort_values() result | df remains unsorted | df = df.sort_values(...) |
| Using .iloc[] with label-based indexes | Wrong row returned | Use .loc[] for labels |
| Expecting df['col'] and df[['col']] to be the same | Type errors in downstream code | 'col' → Series, ['col'] → DataFrame |
| Using loops when vectorized operations are available | Very slow at scale | Replace loops with column arithmetic |

Connection to Previous and Future Chapters

Chapter 9 (Functions) — The functions you write in Chapter 9 become more useful when they accept and return DataFrames. Many of the analysis patterns here (filter → calculate → summarize) are natural candidates for encapsulation as functions.

Chapter 11 (Reading Real Data) — This chapter built DataFrames by hand from Python dictionaries. In real work, you will load data from CSV files, Excel workbooks, and databases. Chapter 11 covers pd.read_csv(), pd.read_excel(), handling missing values, and fixing data type issues.

Chapter 12 (Grouping and Aggregation) — The groupby() method unlocks category-level summaries: "total revenue by region," "average margin by product category," "count of projects by status." This is where pandas goes from a data viewer to a genuine analytical engine.

Chapter 14 (Visualization) — DataFrames connect directly to matplotlib and seaborn. Once you can build and filter a DataFrame, you are one method call away from a chart.


Chapter 10 in One Sentence

pandas provides two data structures — the Series and the DataFrame — and a toolkit of vectorized operations that let you inspect, select, filter, transform, and sort business data faster and more reproducibly than any spreadsheet application.