Key Takeaways: Chapter 10 — Introduction to pandas


The Big Ideas

1. pandas gives Python the data analysis power that Excel provides through a GUI — but faster, more reproducible, and at any scale.

Excel is a great tool for small datasets and ad hoc exploration. pandas is the right tool when you need to handle more rows than Excel can manage, produce consistent repeatable results, automate a recurring analysis, or communicate your method transparently to colleagues. For data-driven business work in Python, pandas is not optional — it is the foundation.

2. Two data structures underpin everything: the Series and the DataFrame.

A Series is a one-dimensional labeled array. Every column in a spreadsheet is a Series. A DataFrame is a two-dimensional labeled table. Your entire spreadsheet is a DataFrame. Understanding these two structures and how they relate — a DataFrame is a collection of aligned Series — is the conceptual key to all of pandas.
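The relationship can be seen directly in code. A minimal sketch using made-up product data (the SKU labels and prices are hypothetical):

```python
import pandas as pd

# A Series: a one-dimensional labeled array (one spreadsheet column)
prices = pd.Series([19.99, 4.50, 12.00],
                   index=["ACM-001", "ACM-002", "ACM-003"])

# A DataFrame: a two-dimensional labeled table — several Series
# aligned on a shared row index (the whole spreadsheet)
df = pd.DataFrame(
    {
        "unit_price": [19.99, 4.50, 12.00],
        "unit_cost": [12.00, 2.25, 9.00],
    },
    index=["ACM-001", "ACM-002", "ACM-003"],
)

# Pulling one column out of a DataFrame gives you back a Series
print(type(df["unit_price"]))  # <class 'pandas.core.series.Series'>
```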

3. The universal import convention is import pandas as pd.

This alias is used everywhere: tutorials, documentation, Stack Overflow, textbooks, production codebases. Using pd is not just a convention — it is a professional signal that you understand the ecosystem. Never write import pandas as pandas, and never use from pandas import * in business code.

4. Always inspect before you analyze.

Before performing any analysis on a new dataset, run .info(), .describe(), .head(), and check .shape and .dtypes. This 60-second habit prevents wasted hours investigating results that are wrong because of a data type issue, unexpected nulls, or a column layout that differs from what you assumed. Priya's three-step mantra: load, inspect, then calculate.
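The inspection habit takes only a few lines. A sketch with hypothetical product data standing in for a freshly loaded file:

```python
import pandas as pd

# Stand-in for data you just loaded and have not yet verified
df = pd.DataFrame({
    "product": ["Widget", "Gadget", "Gizmo"],
    "unit_price": [19.99, 4.50, 12.00],
})

# Load, inspect, then calculate:
print(df.shape)       # (rows, columns)
print(df.dtypes)      # one dtype per column — catch numbers stored as text
print(df.head())      # eyeball the first rows against your assumptions
df.info()             # structure, dtypes, and non-null counts in one view
print(df.describe())  # summary statistics for the numeric columns
```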

5. df['col'] vs df[['col']] — single vs. double brackets matter.

df['unit_price'] returns a Series (one-dimensional). df[['unit_price']] returns a single-column DataFrame (two-dimensional). They look similar but behave differently when passed to other functions or when chaining operations. Use bracket notation (df['col']) consistently over dot notation (df.col), since dot notation breaks with spaces in column names and can silently conflict with DataFrame method names.
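The difference is easy to verify. A small sketch (hypothetical columns):

```python
import pandas as pd

df = pd.DataFrame({"unit_price": [19.99, 4.50],
                   "unit_cost": [12.00, 2.25]})

s = df["unit_price"]        # single brackets  -> Series (1-D)
frame = df[["unit_price"]]  # double brackets  -> DataFrame (2-D)

print(type(s).__name__)     # Series
print(type(frame).__name__) # DataFrame
print(s.shape)              # (2,)   — one dimension
print(frame.shape)          # (2, 1) — two dimensions, one column
```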

6. .loc[] is label-based; .iloc[] is position-based. They are not interchangeable.

.loc['ACM-003'] finds the row labeled ACM-003. .iloc[2] finds the row at the third integer position. After filtering or sorting a DataFrame, the label and the position of a row can diverge. Using the wrong indexer is a common source of silent bugs. Default rule: use .loc[] when you know the label, use .iloc[] when you want the Nth row.
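The divergence after sorting is worth seeing once. A sketch reusing the hypothetical SKU index:

```python
import pandas as pd

df = pd.DataFrame(
    {"unit_price": [19.99, 4.50, 12.00]},
    index=["ACM-001", "ACM-002", "ACM-003"],
)

# Before sorting, label and position happen to agree
print(df.loc["ACM-003", "unit_price"])   # 12.0 — by label
print(df.iloc[2]["unit_price"])          # 12.0 — by position

# After sorting, they diverge
sorted_df = df.sort_values("unit_price")
print(sorted_df.iloc[0].name)            # ACM-002 — cheapest row is now first
print(sorted_df.loc["ACM-001", "unit_price"])  # 19.99 — label still works
```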

7. Boolean filtering is the pandas equivalent of AutoFilter — and it is far more powerful.

df[df['margin_pct'] < 0.20] returns every row where margin is below 20%. Combine conditions with & (and) and | (or), never with Python's and / or keywords. Always wrap each condition in parentheses when combining. Use .isin([...]) for "matches any of these values." Use ~ to negate a condition.
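All five patterns in one sketch, using made-up products and margins:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["Widget", "Gadget", "Gizmo", "Doohickey"],
    "category": ["tools", "toys", "toys", "tools"],
    "margin_pct": [0.35, 0.15, 0.22, 0.08],
})

low = df[df["margin_pct"] < 0.20]                               # single condition
low_toys = df[(df["margin_pct"] < 0.20) & (df["category"] == "toys")]   # AND
either = df[(df["margin_pct"] < 0.20) | (df["category"] == "toys")]     # OR
not_toys = df[~(df["category"] == "toys")]                      # NOT
tools = df[df["category"].isin(["tools"])]                      # match any of a list

print(low["product"].tolist())       # ['Gadget', 'Doohickey']
print(low_toys["product"].tolist())  # ['Gadget']
```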

8. Vectorized operations beat loops every time.

df['margin'] = (df['unit_price'] - df['unit_cost']) / df['unit_price'] applies to all rows simultaneously. An equivalent for loop over .iterrows() does the same thing but can be 100–1000x slower and is harder to read. The pandas mindset is column-first, not row-first. When you find yourself writing a loop over a DataFrame, stop and ask: can I express this as a column operation?
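Both forms side by side, with the loop shown only for contrast (hypothetical prices and costs):

```python
import pandas as pd

df = pd.DataFrame({
    "unit_price": [19.99, 4.50, 12.00],
    "unit_cost": [12.00, 2.25, 9.00],
})

# Vectorized: one expression applied to every row at once
df["margin"] = (df["unit_price"] - df["unit_cost"]) / df["unit_price"]

# The equivalent .iterrows() loop — same result, far slower at scale
margins = []
for _, row in df.iterrows():
    margins.append((row["unit_price"] - row["unit_cost"]) / row["unit_price"])

print(df["margin"].round(3).tolist())
```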

9. Most pandas methods return new objects; they do not modify in place.

df.sort_values('price') gives you a sorted DataFrame but leaves df unchanged. df.drop('col', axis=1) returns a DataFrame without that column but leaves df unchanged. To apply the change, always reassign: df = df.sort_values('price'). inplace=True is available but widely considered harder to reason about in complex code.
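A sketch showing the difference between calling and reassigning (hypothetical prices):

```python
import pandas as pd

df = pd.DataFrame({"price": [12.00, 4.50, 19.99]})

df.sort_values("price")        # returns a new DataFrame; df is unchanged
print(df["price"].iloc[0])     # 12.0 — still the original order

df = df.sort_values("price")   # reassign to keep the result
print(df["price"].iloc[0])     # 4.5 — now actually sorted
```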

10. Code is more reproducible than clicks.

The most valuable quality of a pandas analysis is that it can be re-run. When Marcus gives Priya a new product export next quarter, she re-runs her script. When a stakeholder asks "what would this look like with a 25% margin threshold instead of 20%?", Priya changes one number and re-runs. This reproducibility — impossible with a manually filtered spreadsheet — is the deepest business value of learning pandas.


Method Quick Reference

Inspection

| Method / Attribute | What It Returns |
| --- | --- |
| df.shape | Tuple (rows, columns) |
| df.dtypes | Data type of each column |
| df.columns | Index of column names |
| df.index | Row index |
| df.head(n) | First n rows (default 5) |
| df.tail(n) | Last n rows (default 5) |
| df.info() | Structure, dtypes, non-null counts |
| df.describe() | Summary statistics for numeric columns |

Selection

| Operation | Syntax |
| --- | --- |
| Single column (Series) | df['col'] |
| Multiple columns (DataFrame) | df[['col1', 'col2']] |
| Row by label | df.loc['label'] |
| Row by position | df.iloc[n] |
| Row + column by label | df.loc['label', ['col1', 'col2']] |
| Row + column by position | df.iloc[n, 0:3] |

Filtering

| Operation | Syntax |
| --- | --- |
| Single condition | df[df['col'] > value] |
| AND | df[(condition1) & (condition2)] |
| OR | df[(condition1) \| (condition2)] |
| NOT | df[~(condition)] |
| Match any of list | df[df['col'].isin([a, b, c])] |

Modifying

| Operation | Syntax |
| --- | --- |
| Add column | df['new_col'] = expression |
| Drop column | df = df.drop('col', axis=1) |
| Drop row | df = df.drop('label') |
| Sort | df = df.sort_values('col', ascending=False) |
| Set index | df = df.set_index('col') |
| Reset index | df = df.reset_index() |

Common Mistakes to Avoid

| Mistake | What Happens | Fix |
| --- | --- | --- |
| Using and / or in Boolean filters | ValueError (truth value of a Series is ambiguous) | Use & / \| with parentheses |
| Forgetting parentheses in combined conditions | TypeError or a silently wrong result (operator precedence) | Wrap each condition: (cond1) & (cond2) |
| Not assigning the .sort_values() result | df remains unsorted | df = df.sort_values(...) |
| Using .iloc[] with label-based indexes | Wrong row returned | Use .loc[] for labels |
| Expecting df['col'] and df[['col']] to be the same | Type errors in downstream code | 'col' → Series, ['col'] → DataFrame |
| Using loops when vectorized operations are available | Very slow at scale | Replace loops with column arithmetic |

Connection to Previous and Future Chapters

Chapter 9 (Functions) — The functions you write in Chapter 9 become more useful when they accept and return DataFrames. Many of the analysis patterns here (filter → calculate → summarize) are natural candidates for encapsulation as functions.

Chapter 11 (Reading Real Data) — This chapter built DataFrames by hand from Python dictionaries. In real work, you will load data from CSV files, Excel workbooks, and databases. Chapter 11 covers pd.read_csv(), pd.read_excel(), handling missing values, and fixing data type issues.

Chapter 12 (Grouping and Aggregation) — The groupby() method unlocks category-level summaries: "total revenue by region," "average margin by product category," "count of projects by status." This is where pandas goes from a data viewer to a genuine analytical engine.

Chapter 14 (Visualization) — DataFrames connect directly to matplotlib and seaborn. Once you can build and filter a DataFrame, you are one method call away from a chart.


Chapter 10 in One Sentence

pandas provides two data structures — the Series and the DataFrame — and a toolkit of vectorized operations that let you inspect, select, filter, transform, and sort business data faster and more reproducibly than any spreadsheet application.