Key Takeaways: Chapter 10 — Introduction to pandas
The Big Ideas
1. pandas gives Python the data analysis power that Excel provides through a GUI — but faster, more reproducible, and at any scale.
Excel is a great tool for small datasets and ad hoc exploration. pandas is the right tool when you need to handle more rows than Excel can manage, produce consistent, repeatable results, automate a recurring analysis, or communicate your method transparently to colleagues. For data-driven business work in Python, pandas is not optional — it is the foundation.
2. Two data structures underpin everything: the Series and the DataFrame.
A Series is a one-dimensional labeled array. Every column in a spreadsheet is a Series. A DataFrame is a two-dimensional labeled table. Your entire spreadsheet is a DataFrame. Understanding these two structures and how they relate — a DataFrame is a collection of aligned Series — is the conceptual key to all of pandas.
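A minimal sketch of the two structures, using a small made-up product table (the column names here are illustrative, not from the chapter's dataset):

```python
import pandas as pd

# A Series: a one-dimensional labeled array (one spreadsheet column)
prices = pd.Series([19.99, 4.50, 12.00], name="unit_price")

# A DataFrame: a two-dimensional labeled table (the whole spreadsheet)
df = pd.DataFrame({
    "product": ["Widget", "Bolt", "Gasket"],
    "unit_price": [19.99, 4.50, 12.00],
})

# Each column of the DataFrame is itself a Series
print(type(df["unit_price"]).__name__)  # Series
print(df.shape)                         # (3, 2)
```

Note that selecting any single column out of the DataFrame hands you back a Series, which is exactly the "collection of aligned Series" idea above.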
3. The universal import convention is import pandas as pd.
This alias is used everywhere: tutorials, documentation, Stack Overflow, textbooks, production codebases. Using pd is not just convention — it is a professional signal that you understand the ecosystem. Never write import pandas as pandas (or any nonstandard alias), and never use from pandas import * in business code.
4. Always inspect before you analyze.
Before performing any analysis on a new dataset, run .info(), .describe(), .head(), and check .shape and .dtypes. This 60-second habit prevents wasted hours investigating results that are wrong because of a data type issue, unexpected nulls, or a column layout that differs from what you assumed. Priya's three-step mantra: load, inspect, then calculate.
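The full inspection routine on a small hypothetical dataset (in real work the DataFrame would come from a file, which Chapter 11 covers):

```python
import pandas as pd

# Hypothetical product export; stands in for a real file load
df = pd.DataFrame({
    "sku": ["ACM-001", "ACM-002", "ACM-003"],
    "unit_price": [19.99, 4.50, 12.00],
    "unit_cost": [12.00, 3.10, 7.25],
})

print(df.shape)       # (3, 3) — rows, columns
print(df.dtypes)      # data type of each column
df.info()             # structure, dtypes, non-null counts
print(df.describe())  # summary statistics for numeric columns
print(df.head())      # first rows, eyeball the layout
```

Running all five before touching the data is the 60-second habit: any surprise (an object dtype where you expected a number, a non-null count below the row count) shows up here, not three hours into the analysis.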
5. df['col'] vs df[['col']] — single vs. double brackets matter.
df['unit_price'] returns a Series (one-dimensional). df[['unit_price']] returns a single-column DataFrame (two-dimensional). They look similar but behave differently when passed to other functions or when chaining operations. Use bracket notation (df['col']) consistently over dot notation (df.col), since dot notation breaks with spaces in column names and can silently conflict with DataFrame method names.
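A quick sketch of the single- vs. double-bracket difference:

```python
import pandas as pd

df = pd.DataFrame({"unit_price": [19.99, 4.50], "unit_cost": [12.00, 3.10]})

series = df["unit_price"]     # single brackets → Series (1-D)
frame = df[["unit_price"]]    # double brackets → DataFrame (2-D)

print(type(series).__name__, series.shape)  # Series (2,)
print(type(frame).__name__, frame.shape)    # DataFrame (2, 1)
```

The shapes tell the story: (2,) is one-dimensional, (2, 1) is a table with one column. Functions that expect one will often fail, or behave subtly differently, when given the other.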
6. .loc[] is label-based; .iloc[] is position-based. They are not interchangeable.
.loc['ACM-003'] finds the row labeled ACM-003. .iloc[2] finds the row at the third integer position. After filtering or sorting a DataFrame, the label and the position of a row can diverge. Using the wrong indexer is a common source of silent bugs. Default rule: use .loc[] when you know the label, use .iloc[] when you want the Nth row.
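A small illustration of labels and positions diverging after a sort (SKU labels are made up for the example):

```python
import pandas as pd

df = pd.DataFrame(
    {"unit_price": [19.99, 4.50, 12.00]},
    index=["ACM-001", "ACM-002", "ACM-003"],
)

# Sorting reorders the rows; labels travel with their rows
df = df.sort_values("unit_price")

by_label = df.loc["ACM-003"]   # the row labeled ACM-003, wherever it now sits
by_position = df.iloc[2]       # the third row of the *sorted* frame

print(by_label["unit_price"])     # 12.0
print(by_position["unit_price"])  # 19.99 — a different row entirely
```

Before the sort, ACM-003 was also the row at position 2; after it, label and position point at different rows. That divergence is exactly the silent bug the rule above guards against.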
7. Boolean filtering is the pandas equivalent of AutoFilter — and it is far more powerful.
df[df['margin_pct'] < 0.20] returns every row where margin is below 20%. Combine conditions with & (and) and | (or), never with Python's and / or keywords. Always wrap each condition in parentheses when combining. Use .isin([...]) for "matches any of these values." Use ~ to negate a condition.
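All four filtering patterns on one hypothetical table:

```python
import pandas as pd

df = pd.DataFrame({
    "sku": ["ACM-001", "ACM-002", "ACM-003", "ACM-004"],
    "margin_pct": [0.35, 0.15, 0.22, 0.05],
    "region": ["East", "West", "East", "North"],
})

low_margin = df[df["margin_pct"] < 0.20]                # single condition
east_west = df[df["region"].isin(["East", "West"])]     # matches any of a list
combo = df[(df["margin_pct"] < 0.20) & (df["region"] == "West")]  # AND, each side parenthesized
not_east = df[~(df["region"] == "East")]                # negation with ~

print(len(low_margin), len(east_west), len(combo), len(not_east))  # 2 3 1 2
```

Each expression inside the outer brackets evaluates to a Boolean Series, one True/False per row; the DataFrame keeps only the True rows. That is why & and | (element-wise operators) work and Python's and / or (which demand a single True/False) raise an error.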
8. Vectorized operations beat loops every time.
df['margin'] = (df['unit_price'] - df['unit_cost']) / df['unit_price'] applies to all rows simultaneously. An equivalent for loop over .iterrows() does the same thing but can be 100–1000x slower and is harder to read. The pandas mindset is column-first, not row-first. When you find yourself writing a loop over a DataFrame, stop and ask: can I express this as a column operation?
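The margin calculation from the paragraph above, as a complete runnable sketch (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "unit_price": [20.0, 10.0, 8.0],
    "unit_cost": [12.0, 7.5, 2.0],
})

# Vectorized: one expression computes margin for every row at once,
# no loop and no .iterrows()
df["margin"] = (df["unit_price"] - df["unit_cost"]) / df["unit_price"]

print(df["margin"].tolist())  # [0.4, 0.25, 0.75]
```

The subtraction and division act on whole columns; pandas aligns the rows and delegates the arithmetic to optimized compiled code, which is where the 100–1000x speedup over row-by-row loops comes from.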
9. Most pandas methods return new objects; they do not modify in place.
df.sort_values('price') gives you a sorted DataFrame but leaves df unchanged. df.drop('col', axis=1) returns a DataFrame without that column but leaves df unchanged. To apply the change, always reassign: df = df.sort_values('price'). inplace=True is available but widely considered harder to reason about in complex code.
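A two-line demonstration of the reassignment rule:

```python
import pandas as pd

df = pd.DataFrame({"price": [3.0, 1.0, 2.0]})

df.sort_values("price")        # returns a sorted copy; df itself is untouched
print(df["price"].tolist())    # [3.0, 1.0, 2.0] — still the original order

df = df.sort_values("price")   # reassign to keep the change
print(df["price"].tolist())    # [1.0, 2.0, 3.0]
```

The first call silently discards its result, which is the classic "why isn't my data sorted?" bug; the reassignment on the last line is what actually applies the change.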
10. Code is more reproducible than clicks.
The most valuable quality of a pandas analysis is that it can be re-run. When Marcus gives Priya a new product export next quarter, she re-runs her script. When a stakeholder asks "what would this look like with a 25% margin threshold instead of 20%?", Priya changes one number and re-runs. This reproducibility — impossible with a manually filtered spreadsheet — is the deepest business value of learning pandas.
Method Quick Reference
Inspection
| Method / Attribute | What It Returns |
|---|---|
| df.shape | Tuple (rows, columns) |
| df.dtypes | Data type of each column |
| df.columns | Index of column names |
| df.index | Row index |
| df.head(n) | First n rows (default 5) |
| df.tail(n) | Last n rows (default 5) |
| df.info() | Structure, dtypes, non-null counts |
| df.describe() | Summary statistics for numeric columns |
Selection
| Operation | Syntax |
|---|---|
| Single column (Series) | df['col'] |
| Multiple columns (DataFrame) | df[['col1', 'col2']] |
| Row by label | df.loc['label'] |
| Row by position | df.iloc[n] |
| Row + column by label | df.loc['label', ['col1', 'col2']] |
| Row + column by position | df.iloc[n, 0:3] |
Filtering
| Operation | Syntax |
|---|---|
| Single condition | df[df['col'] > value] |
| AND | df[(condition1) & (condition2)] |
| OR | df[(condition1) \| (condition2)] |
| NOT | df[~(condition)] |
| Match any of list | df[df['col'].isin([a, b, c])] |
Modifying
| Operation | Syntax |
|---|---|
| Add column | df['new_col'] = expression |
| Drop column | df = df.drop('col', axis=1) |
| Drop row | df = df.drop('label') |
| Sort | df = df.sort_values('col', ascending=False) |
| Set index | df = df.set_index('col') |
| Reset index | df = df.reset_index() |
Common Mistakes to Avoid
| Mistake | What Happens | Fix |
|---|---|---|
| Using and / or in Boolean filters | ValueError | Use & / \| with parentheses |
| Forgetting parentheses in combined conditions | TypeError | Wrap each condition: (cond1) & (cond2) |
| Not assigning .sort_values() result | df remains unsorted | df = df.sort_values(...) |
| Using .iloc[] with label-based indexes | Wrong row returned | Use .loc[] for labels |
| Expecting df['col'] and df[['col']] to be the same | Type errors in downstream code | Know: 'col' → Series, ['col'] → DataFrame |
| Using loops when vectorized operations are available | Very slow at scale | Replace loops with column arithmetic |
Connection to Previous and Future Chapters
Chapter 9 (Functions) — The functions you write in Chapter 9 become more useful when they accept and return DataFrames. Many of the analysis patterns here (filter → calculate → summarize) are natural candidates for encapsulation as functions.
Chapter 11 (Reading Real Data) — This chapter built DataFrames by hand from Python dictionaries. In real work, you will load data from CSV files, Excel workbooks, and databases. Chapter 11 covers pd.read_csv(), pd.read_excel(), handling missing values, and fixing data type issues.
Chapter 12 (Grouping and Aggregation) — The groupby() method unlocks category-level summaries: "total revenue by region," "average margin by product category," "count of projects by status." This is where pandas goes from a data viewer to a genuine analytical engine.
Chapter 14 (Visualization) — DataFrames connect directly to matplotlib and seaborn. Once you can build and filter a DataFrame, you are one method call away from a chart.
Chapter 10 in One Sentence
pandas provides two data structures — the Series and the DataFrame — and a toolkit of vectorized operations that let you inspect, select, filter, transform, and sort business data faster and more reproducibly than any spreadsheet application.