> "The goal is to turn data into information, and information into insight."
In This Chapter
- What You Will Learn
- Opening Scene
- 10.1 What Is pandas and Why Does It Matter?
- 10.2 Installing and Importing pandas
- 10.3 The Series: pandas' One-Dimensional Building Block
- 10.4 The DataFrame: Your In-Memory Spreadsheet
- 10.5 Inspecting a DataFrame
- 10.6 Selecting Columns
- 10.7 Selecting Rows: .loc[] and .iloc[]
- 10.8 Boolean Filtering: The Core of Data Analysis
- 10.9 Adding and Modifying Columns
- 10.10 Dropping Columns and Rows
- 10.11 Sorting a DataFrame
- 10.12 The pandas Mindset: Vectorized Operations vs. Manual Loops
- 10.13 A Complete Business Example: Building Acme's Product Analysis
- 10.14 Common Beginner Mistakes (and How to Avoid Them)
- 10.15 Chapter Summary
- Key Terms
Chapter 10: Introduction to pandas: Your Business Data Toolkit
"The goal is to turn data into information, and information into insight." — Carly Fiorina
What You Will Learn
By the end of this chapter, you will be able to:
- Explain what pandas is and why it is the standard tool for business data work in Python
- Create pandas Series and DataFrame objects from lists, dictionaries, and raw data
- Inspect a DataFrame using .info(), .describe(), .head(), .tail(), .shape, .dtypes, and .columns
- Select specific columns and rows using bracket notation, .loc[], and .iloc[]
- Filter rows using Boolean conditions
- Add, modify, and drop columns and rows
- Sort a DataFrame by one or more columns
- Explain the "vectorized operations" mindset that makes pandas faster than manual loops
Opening Scene
It is a Tuesday morning at Acme Corp. Priya Okonkwo, the company's lead data analyst, has just received an email from Sandra Chen, VP of Sales. Attached is a spreadsheet with 4,200 rows of product catalog data: SKU codes, categories, unit prices, cost of goods, current inventory, and reorder thresholds. Sandra's request is straightforward but urgent: "Before the 2 p.m. budget review, can you tell me which products have margins below 20 percent, and which ones are below their reorder point?"
In a previous life, Priya would have opened Excel, applied filters, written VLOOKUP formulas, and spent 45 minutes doing this work. Today, she opens her code editor, types twelve lines of Python, and has the answer in under three seconds — including a clean, sorted table she can paste directly into her presentation.
The tool that made the difference is pandas.
10.1 What Is pandas and Why Does It Matter?
pandas (the name is derived from "panel data," a term from econometrics) is an open-source Python library for data manipulation and analysis. It was created in 2008 by Wes McKinney while he was working at a hedge fund, because he needed a fast, expressive tool to work with financial data that Python simply did not have at the time. What he built became one of the most widely used software libraries in the world.
At its core, pandas gives you two powerful data structures:
- Series — a one-dimensional labeled array, like a single column of a spreadsheet
- DataFrame — a two-dimensional labeled table, like a full spreadsheet with rows and columns
Everything else pandas offers — filtering, sorting, grouping, merging, reshaping, reading CSV files, writing to Excel — is built on top of these two structures.
The Excel vs. pandas Mindset
If you have been using Excel for business analysis, you already understand the concept of tabular data. You know what a column header is, what a row of values represents, and what it means to filter a table. That conceptual foundation transfers directly to pandas. The difference is in how you interact with the data.
In Excel, your workflow is largely manual and visual. You click cells, drag formulas, apply filters through menus, and format results by hand. For 50 rows, this is fast and comfortable. For 50,000 rows — or 500,000 — it becomes painful or impossible. Excel will slow to a crawl, crash, or simply refuse to open the file.
In pandas, your workflow is written in code. Every operation you perform is a line of Python that can be re-run instantly, modified in seconds, and shared with a colleague who can reproduce your exact results. Pandas handles millions of rows without breaking a sweat, because it is built on top of NumPy, a library that performs numerical computations using highly optimized C code under the hood.
Here is a concrete comparison. Suppose you have a spreadsheet of 10,000 sales transactions and you want to calculate the profit margin for every row. In Excel, you might type a formula in column F and drag it down 10,000 rows. If someone later changes the column layout, your formula breaks.
In pandas, you write:
df['margin'] = (df['unit_price'] - df['cost']) / df['unit_price']
This single line applies the calculation to every row simultaneously. It does not loop. It does not drag. It is what practitioners call a vectorized operation — the library applies the operation to the entire column at once, which is dramatically faster than any loop you could write yourself.
This is the pandas mindset shift: stop thinking row by row, start thinking column by column.
10.2 Installing and Importing pandas
pandas does not ship with Python's standard library, but it comes pre-installed with the Anaconda distribution (the most common Python environment for data work). If you are using a standard Python installation, you can install it with:
pip install pandas
Once installed, the universal convention — used in virtually every tutorial, book, blog post, and Stack Overflow answer in the world — is to import it with the alias pd:
import pandas as pd
You will also frequently import NumPy alongside pandas, as the two libraries work closely together:
import numpy as np
import pandas as pd
Throughout this chapter, every code example assumes these two import lines have already been run.
10.3 The Series: pandas' One-Dimensional Building Block
Before we build a full DataFrame, we need to understand its smallest component: the Series.
A pandas Series is a one-dimensional array of values with an associated index. Think of it as a single column from a spreadsheet, but smarter — each value has a label (the index) that lets you retrieve it by name rather than just by position.
Creating a Series from a List
The simplest way to create a Series is to pass a Python list to pd.Series():
import pandas as pd
monthly_revenue = pd.Series([42000, 58000, 63500, 71000, 49000, 55000])
print(monthly_revenue)
Output:
0 42000
1 58000
2 63500
3 71000
4 49000
5 55000
dtype: int64
Notice that pandas automatically created a numeric index (0, 1, 2, ...). The dtype at the bottom tells you the data type pandas inferred for the values.
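To see inference in action, here is a minimal sketch (the values are arbitrary): pandas picks the narrowest numeric type that fits every value and falls back to the generic object dtype for mixed data.

```python
import pandas as pd

# All integers: pandas infers int64
print(pd.Series([1, 2, 3]).dtype)        # int64

# A single float promotes the whole Series to float64
print(pd.Series([1, 2, 3.5]).dtype)      # float64

# Mixed types fall back to the generic object dtype
print(pd.Series([1, 'two', 3.0]).dtype)  # object
```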
Creating a Series with a Custom Index
The real power of a Series becomes apparent when you give it a meaningful index:
monthly_revenue = pd.Series(
[42000, 58000, 63500, 71000, 49000, 55000],
index=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
)
print(monthly_revenue)
Output:
Jan 42000
Feb 58000
Mar 63500
Apr 71000
May 49000
Jun 55000
dtype: int64
Now you can access values by label:
print(monthly_revenue['Mar']) # 63500
print(monthly_revenue['Jan':'Mar']) # Slice from Jan through Mar
Creating a Series from a Dictionary
When your data already exists as a Python dictionary, pandas will use the dictionary keys as the index automatically:
q1_sales = {
'January': 42000,
'February': 58000,
'March': 63500
}
sales_series = pd.Series(q1_sales)
print(sales_series)
Output:
January 42000
February 58000
March 63500
dtype: int64
Basic Series Operations
Series support all the arithmetic operations you would expect, and they are applied in vectorized fashion, element by element across the entire Series:
# Add a 10% commission to every month's revenue
commission = monthly_revenue * 0.10
print(commission)
# Find months where revenue exceeded 60000
high_revenue_months = monthly_revenue[monthly_revenue > 60000]
print(high_revenue_months)
Some useful Series methods:
| Method | What It Does |
|---|---|
| .sum() | Sum of all values |
| .mean() | Average value |
| .median() | Median value |
| .min() | Minimum value |
| .max() | Maximum value |
| .count() | Number of non-null values |
| .std() | Standard deviation |
| .value_counts() | Frequency count of each unique value |
| .unique() | Array of unique values |
| .sort_values() | Returns sorted Series |
| .head(n) | First n values (default 5) |
| .tail(n) | Last n values (default 5) |
print(f"Total H1 Revenue: ${monthly_revenue.sum():,.0f}")
print(f"Average Monthly Revenue: ${monthly_revenue.mean():,.0f}")
print(f"Best Month Revenue: ${monthly_revenue.max():,.0f}")
Series vs. a Python List
You might be wondering: why not just use a Python list? The answer comes in several parts:
- Labeled access. A list can only be accessed by integer position. A Series can be accessed by label, making code far more readable.
- Vectorized math. monthly_revenue * 1.10 works on a Series. On a list, you would need a loop or list comprehension.
- Built-in statistics. .mean(), .std(), and .describe() are one-liners on a Series.
- Missing data handling. pandas has first-class support for NaN (Not a Number) values. Python lists have no such concept.
- Integration with DataFrames. Each column of a DataFrame is a Series, so understanding Series is fundamental.
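To make the contrast concrete, here is the same 10 percent increase written both ways (a sketch; the list version is what pandas saves you from writing):

```python
import pandas as pd

revenue_list = [42000, 58000, 63500]

# Plain Python: an explicit list comprehension, positional access only
increased_list = [value * 1.10 for value in revenue_list]

# pandas: one vectorized expression, with labels preserved
revenue_series = pd.Series(revenue_list, index=['Jan', 'Feb', 'Mar'])
increased_series = revenue_series * 1.10

print(increased_list)
print(increased_series['Feb'])  # access February's value by label
```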
10.4 The DataFrame: Your In-Memory Spreadsheet
The DataFrame is the workhorse of pandas. It is a two-dimensional table with labeled rows and labeled columns. Every column in a DataFrame is a Series, and all the columns share a common row index.
If you have used a relational database, a DataFrame is essentially a table. If you have used Excel, it is a worksheet. But unlike either of those tools, a DataFrame lives entirely in memory, is manipulated with code, and can be transformed, filtered, merged, and analyzed with extraordinary speed.
Creating a DataFrame from a Dictionary of Lists
The most common way to create a DataFrame by hand is to pass a dictionary where each key is a column name and each value is a list of column values:
import pandas as pd
product_catalog = {
'sku': ['ACM-001', 'ACM-002', 'ACM-003', 'ACM-004', 'ACM-005'],
'product_name': ['Widget A', 'Widget B', 'Gadget X', 'Gadget Y', 'Component Z'],
'category': ['Widgets', 'Widgets', 'Gadgets', 'Gadgets', 'Components'],
'unit_price': [29.99, 49.99, 89.99, 129.99, 14.99],
'unit_cost': [12.50, 22.00, 41.00, 58.50, 8.25],
'inventory': [250, 180, 95, 42, 600]
}
df = pd.DataFrame(product_catalog)
print(df)
Output:
sku product_name category unit_price unit_cost inventory
0 ACM-001 Widget A Widgets 29.99 12.50 250
1 ACM-002 Widget B Widgets 49.99 22.00 180
2 ACM-003 Gadget X Gadgets 89.99 41.00 95
3 ACM-004 Gadget Y Gadgets 129.99 58.50 42
4 ACM-005 Component Z Components 14.99 8.25 600
Pandas automatically assigned a numeric row index (0 through 4). The column headers come directly from the dictionary keys.
Setting a Custom Row Index
Often you have a natural identifier for each row — a SKU, a customer ID, a date. You can set that as the index:
df = df.set_index('sku')
print(df)
Output:
product_name category unit_price unit_cost inventory
sku
ACM-001 Widget A Widgets 29.99 12.50 250
ACM-002 Widget B Widgets 49.99 22.00 180
ACM-003 Gadget X Gadgets 89.99 41.00 95
ACM-004 Gadget Y Gadgets 129.99 58.50 42
ACM-005 Component Z Components 14.99 8.25 600
Now rows can be accessed by SKU label. We can also reset to the default integer index at any time with .reset_index().
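A minimal sketch of the round trip: .reset_index() moves the SKU labels back into an ordinary column and creates a fresh 0-based integer index.

```python
import pandas as pd

df = pd.DataFrame({
    'sku': ['ACM-001', 'ACM-002'],
    'unit_price': [29.99, 49.99]
}).set_index('sku')

print(df.index.tolist())      # ['ACM-001', 'ACM-002']

df = df.reset_index()         # 'sku' becomes a regular column again
print(df.columns.tolist())    # ['sku', 'unit_price']
print(df.index.tolist())      # [0, 1]
```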
10.5 Inspecting a DataFrame
Before you do any analysis, the first thing you should always do with a new DataFrame is look at it. Pandas provides a full toolkit of inspection methods.
.head() and .tail()
These methods show you the first or last N rows (default 5). They are your first line of defense when you load a new dataset:
print(df.head()) # First 5 rows
print(df.head(3)) # First 3 rows
print(df.tail(2)) # Last 2 rows
In a Jupyter Notebook, you can type df as the last line of a cell and run it to render a nicely formatted table. In a script, you need print(df.head()).
.shape
Returns a tuple of (number_of_rows, number_of_columns):
print(df.shape)
# (5, 5) -- 5 rows, 5 columns (after setting sku as index)
This is extremely useful for a sanity check after loading data: "Did I load all 4,200 rows? Did I get all 12 columns?"
.dtypes
Shows the data type of each column:
print(df.dtypes)
Output:
product_name object
category object
unit_price float64
unit_cost float64
inventory int64
dtype: object
The most common data types you will encounter:
| pandas dtype | Python/NumPy equivalent | Typical use |
|---|---|---|
| int64 | Integer | Counts, quantities |
| float64 | Float | Prices, rates, measurements |
| object | String (or mixed) | Names, labels, categories |
| bool | Boolean | True/False flags |
| datetime64 | Date and time | Dates, timestamps |
| category | Categorical | Repeated labels (efficient) |
When you load data from a CSV, pandas will infer data types automatically — but it does not always get it right. A column of ZIP codes might be read as integers, stripping leading zeros. A column of dates might be read as strings. Checking .dtypes early saves a lot of debugging later.
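A common defensive fix is to tell pandas the intended type up front via the dtype parameter of pd.read_csv(). In this sketch the column names are invented and StringIO stands in for a real file:

```python
import pandas as pd
from io import StringIO

csv_text = "customer,zip\nAlice,02134\nBob,10001\n"

# Default inference reads the ZIP column as integers, stripping the leading zero
df_bad = pd.read_csv(StringIO(csv_text))
print(df_bad['zip'].tolist())   # [2134, 10001]

# Forcing the column to string preserves it
df_good = pd.read_csv(StringIO(csv_text), dtype={'zip': str})
print(df_good['zip'].tolist())  # ['02134', '10001']
```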
.columns
Returns the column names as a pandas Index object:
print(df.columns)
# Index(['product_name', 'category', 'unit_price', 'unit_cost', 'inventory'], dtype='object')
You can convert this to a list with list(df.columns), which is useful when you want to iterate over column names or check whether a specific column exists.
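A typical defensive check before a script touches a column (a small sketch; the required-column set is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'unit_price': [29.99], 'unit_cost': [12.50]})

# 'in' works directly on df.columns; no list() conversion required
if 'unit_price' in df.columns:
    print('price column present')

# Or verify a whole set of required columns at once
required = {'unit_price', 'unit_cost', 'inventory'}
missing = required - set(df.columns)
print(missing)  # {'inventory'}
```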
.info()
This is the most comprehensive quick-inspection method. It shows the index type, number of rows, each column name, the count of non-null values, and the data type:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, ACM-001 to ACM-005
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 product_name 5 non-null object
1 category 5 non-null object
2 unit_price 5 non-null float64
3 unit_cost 5 non-null float64
4 inventory 5 non-null int64
dtypes: float64(2), int64(1), object(2)
memory usage: 240.0+ bytes
The "Non-Null Count" column is particularly valuable. If you have 4,200 rows but a column shows only 3,800 non-null values, you have 400 missing values to deal with before you analyze that column.
.describe()
Generates summary statistics for all numeric columns:
print(df.describe())
Output:
unit_price unit_cost inventory
count 5.000000 5.000000 5.000000
mean 63.190000 28.450000 233.400000
std 44.804018 19.537655 218.777285
min 14.990000 8.250000 42.000000
25% 29.990000 12.500000 95.000000
50% 49.990000 22.000000 180.000000
75% 89.990000 41.000000 250.000000
max 129.990000 58.500000 600.000000
In four seconds, you have learned that your average unit price is $63.19, that the cheapest product costs $14.99 and the most expensive costs $129.99, and that inventory ranges from 42 to 600 units. This kind of instant summary is why .describe() is one of the first methods data analysts call on any new dataset.
10.6 Selecting Columns
Selecting columns is one of the most frequent operations you will perform. Pandas gives you several ways to do it.
Selecting a Single Column
To select one column, use bracket notation with the column name as a string:
prices = df['unit_price']
print(type(prices)) # <class 'pandas.core.series.Series'>
print(prices)
This returns a Series. Remember: every column in a DataFrame is a Series.
You can also use attribute-style access (dot notation) for a single column — but only when the column name is a valid Python identifier (no spaces, does not start with a number, does not conflict with a DataFrame method name):
prices = df.unit_price # Same result as df['unit_price']
Recommendation: Use bracket notation (df['column_name']) consistently. Dot notation looks cleaner, but it breaks when column names have spaces, and it can silently return a DataFrame method instead of your column if names conflict. Bracket notation always works.
Selecting Multiple Columns
To select multiple columns, pass a list of column names inside the brackets. Note the double brackets — the outer brackets are the selection operator, the inner brackets define the list:
price_and_cost = df[['unit_price', 'unit_cost']]
print(type(price_and_cost)) # <class 'pandas.core.frame.DataFrame'>
print(price_and_cost)
Output:
unit_price unit_cost
sku
ACM-001 29.99 12.50
ACM-002 49.99 22.00
ACM-003 89.99 41.00
ACM-004 129.99 58.50
ACM-005 14.99 8.25
When you select multiple columns, you get back a DataFrame, not a Series. This is a common source of confusion for beginners: df['col'] returns a Series, df[['col']] returns a single-column DataFrame. They look similar when printed, but they behave differently.
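The difference shows up most clearly through .shape (a quick sketch):

```python
import pandas as pd

df = pd.DataFrame({'unit_price': [29.99, 49.99, 89.99]})

s = df['unit_price']     # Series: one-dimensional
d = df[['unit_price']]   # DataFrame: two-dimensional, one column

print(type(s).__name__, s.shape)  # Series (3,)
print(type(d).__name__, d.shape)  # DataFrame (3, 1)

# A Series has no columns attribute of its own; a DataFrame does
print(d.columns.tolist())         # ['unit_price']
```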
10.7 Selecting Rows: .loc[] and .iloc[]
Selecting rows is slightly more nuanced than selecting columns. Pandas provides two primary indexers: .loc[] for label-based selection and .iloc[] for integer-position-based selection.
.loc[] — Label-Based Selection
.loc[] selects rows (and optionally columns) by their label — the value in the index.
# Select a single row by its index label
print(df.loc['ACM-003'])
# Select multiple rows
print(df.loc[['ACM-001', 'ACM-004']])
# Select a row AND specific columns
print(df.loc['ACM-002', ['unit_price', 'unit_cost']])
# Select a range of rows (inclusive on both ends with label-based slicing)
print(df.loc['ACM-001':'ACM-003'])
The comma syntax is powerful: df.loc[row_selector, column_selector]. You can pass a label, a list of labels, or a slice for either dimension.
.iloc[] — Integer-Position-Based Selection
.iloc[] selects rows and columns by their integer position, regardless of what the index labels are. This is like accessing a list by index number.
# Select the first row (position 0)
print(df.iloc[0])
# Select the third row (position 2)
print(df.iloc[2])
# Select rows 1 through 3 (exclusive of the end, like Python slicing)
print(df.iloc[1:4])
# Select row 0, columns 0 and 1
print(df.iloc[0, 0:2])
# Select the last row
print(df.iloc[-1])
When to Use Which
| Situation | Use |
|---|---|
| You know the row's index label (SKU, name, date) | .loc[] |
| You want "the first 10 rows" or "every other row" | .iloc[] |
| Your index is integers AND those integers match positions | Either, but .loc[] is clearer |
| Combining row and column selection by label | .loc[] |
| Combining row and column selection by position | .iloc[] |
A common beginner mistake is to use .iloc[] on a DataFrame whose index has been reset or shuffled, expecting label-based results. When in doubt, use .loc[] for label-based work and .iloc[] for positional work — and keep them straight.
10.8 Boolean Filtering: The Core of Data Analysis
Boolean filtering is one of the most powerful features in pandas, and it is the technique you will use for the vast majority of real business analysis tasks. It lets you select rows that meet a condition — the pandas equivalent of Excel's AutoFilter.
How Boolean Filtering Works
The mechanics are simple. First, you create a Boolean Series — a series of True/False values, one per row, based on some condition. Then you use that Boolean Series to index the DataFrame, which returns only the rows where the value is True.
# Step 1: Create a Boolean Series
is_gadget = df['category'] == 'Gadgets'
print(is_gadget)
Output:
sku
ACM-001 False
ACM-002 False
ACM-003 True
ACM-004 True
ACM-005 False
dtype: bool
# Step 2: Use it to filter the DataFrame
gadgets_only = df[is_gadget]
print(gadgets_only)
In practice, you almost always combine these two steps into one line:
gadgets_only = df[df['category'] == 'Gadgets']
print(gadgets_only)
This reads almost like English: "give me the rows from df where the category is Gadgets."
Comparison Operators
All standard Python comparison operators work with pandas:
| Operator | Meaning | Example |
|---|---|---|
| == | Equal to | df['category'] == 'Widgets' |
| != | Not equal to | df['category'] != 'Components' |
| > | Greater than | df['unit_price'] > 50 |
| >= | Greater than or equal | df['inventory'] >= 100 |
| < | Less than | df['unit_cost'] < 20 |
| <= | Less than or equal | df['margin'] <= 0.20 |
Combining Conditions with & and |
To combine multiple conditions, use & (and) and | (or). Note that you must wrap each condition in parentheses when combining them:
# Products that are Gadgets AND cost more than $80
expensive_gadgets = df[(df['category'] == 'Gadgets') & (df['unit_price'] > 80)]
# Products that are either Widgets or have price below $20
widget_or_cheap = df[(df['category'] == 'Widgets') | (df['unit_price'] < 20)]
The parentheses are not optional — Python's operator precedence rules mean that & binds more tightly than ==, which produces a confusing error if you omit them.
Negation with ~
To negate a condition (select rows where the condition is False), use ~:
# All products that are NOT in the Gadgets category
non_gadgets = df[~(df['category'] == 'Gadgets')]
# Equivalent to: df[df['category'] != 'Gadgets']
.isin() for Multiple Values
When you want to match any of several values, .isin() is cleaner than a chain of | conditions:
# Products in either Widgets or Components category
widgets_or_components = df[df['category'].isin(['Widgets', 'Components'])]
Real Business Example: Finding Low-Margin Products
Here is how Priya answers Sandra's question using Boolean filtering. First, she adds a margin column, then she filters:
# Calculate gross margin percentage
df['margin_pct'] = (df['unit_price'] - df['unit_cost']) / df['unit_price']
# Find products with margin below 20%
low_margin_products = df[df['margin_pct'] < 0.20]
print(f"Products below 20% margin:\n{low_margin_products[['product_name', 'unit_price', 'unit_cost', 'margin_pct']]}")
Twelve lines of code. Three seconds of runtime. Sandra has her answer before her second cup of coffee.
10.9 Adding and Modifying Columns
Adding a new calculated column to a DataFrame is one of the most common operations in data analysis. The syntax mirrors a Python dictionary:
df['new_column_name'] = expression
Adding a Calculated Column
# Add a gross profit column
df['gross_profit'] = df['unit_price'] - df['unit_cost']
# Add a margin percentage column
df['margin_pct'] = df['gross_profit'] / df['unit_price']
# Add a revenue column (assuming all inventory is sold)
df['potential_revenue'] = df['unit_price'] * df['inventory']
print(df[['product_name', 'gross_profit', 'margin_pct', 'potential_revenue']])
Each of these operations is vectorized — pandas applies the arithmetic to every row simultaneously without any explicit loop.
Modifying an Existing Column
You can overwrite an existing column the same way you created it:
# Apply a 5% price increase to all products
df['unit_price'] = df['unit_price'] * 1.05
# Round prices to 2 decimal places
df['unit_price'] = df['unit_price'].round(2)
Adding a Column Based on a Condition
Sometimes you want to categorize rows based on their values. The np.where() function (from NumPy) is the pandas-idiomatic way to do this — it is the vectorized equivalent of a conditional expression:
import numpy as np
# Tag products as 'Low Margin' or 'Healthy Margin'
df['margin_status'] = np.where(df['margin_pct'] < 0.25, 'Low Margin', 'Healthy Margin')
For more than two categories, the .apply() method with a custom function, or pd.cut() for numeric bins, are the appropriate tools — but those belong in a more advanced chapter.
10.10 Dropping Columns and Rows
Dropping Columns
Use .drop() with axis=1 to remove columns (axis=1 means "along the column dimension"):
# Drop a single column
df_cleaned = df.drop('potential_revenue', axis=1)
# Drop multiple columns
df_cleaned = df.drop(['gross_profit', 'potential_revenue'], axis=1)
Note that .drop() returns a new DataFrame by default — it does not modify df in place. To modify in place, use inplace=True:
df.drop('potential_revenue', axis=1, inplace=True)
Many experienced pandas users avoid inplace=True because it makes code harder to reason about (you cannot easily undo in-place operations). The preferred pattern is assignment: df = df.drop(...).
Dropping Rows
To drop rows, use .drop() with the row labels (or positions):
# Drop the row with index label 'ACM-005'
df_trimmed = df.drop('ACM-005')
# Drop multiple rows
df_trimmed = df.drop(['ACM-003', 'ACM-004'])
You can also drop rows based on conditions using Boolean filtering — simply filter for the rows you want to keep, rather than specifying which to drop. This is usually more readable:
# Keep only products with inventory above 50 (effectively dropping the rest)
df_in_stock = df[df['inventory'] > 50]
10.11 Sorting a DataFrame
Sorting is accomplished with .sort_values(). You specify the column (or columns) to sort by, and whether to sort ascending or descending.
Sorting by a Single Column
# Sort by unit price, lowest to highest (ascending is the default)
df_sorted_by_price = df.sort_values('unit_price')
# Sort by unit price, highest to lowest
df_sorted_by_price_desc = df.sort_values('unit_price', ascending=False)
# Sort by margin percentage, worst margins first
df_sorted_by_margin = df.sort_values('margin_pct', ascending=True)
Sorting by Multiple Columns
Pass a list of column names to sort by multiple levels. Pandas sorts by the first column, then uses subsequent columns as tiebreakers:
# Sort by category (A-Z), then by unit price within each category (high to low)
df_multi_sorted = df.sort_values(
by=['category', 'unit_price'],
ascending=[True, False]
)
Sorting the Index
To sort by the DataFrame's row index, use .sort_index():
df_index_sorted = df.sort_index() # Ascending
df_index_sorted_desc = df.sort_index(ascending=False)
Like most pandas methods, .sort_values() and .sort_index() return new DataFrames by default. Use inplace=True to modify in place, or reassign.
10.12 The pandas Mindset: Vectorized Operations vs. Manual Loops
Now that you have seen the core operations, it is worth pausing to address a pattern that trips up nearly every Python programmer coming from a procedural background: the temptation to use loops.
Suppose you have a DataFrame with 100,000 rows and you want to calculate the margin for each product. A programmer comfortable with lists and loops might write:
# DO NOT DO THIS — it is very slow and unpythonic
margins = []
for index, row in df.iterrows():
margin = (row['unit_price'] - row['unit_cost']) / row['unit_price']
margins.append(margin)
df['margin'] = margins
This code works, but it is orders of magnitude slower than the vectorized approach. On 100,000 rows, the loop approach might take 5–10 seconds. The vectorized approach takes milliseconds:
# DO THIS INSTEAD — fast and readable
df['margin'] = (df['unit_price'] - df['unit_cost']) / df['unit_price']
Why is the vectorized approach so much faster? Because pandas hands the entire column to NumPy's underlying C implementation, which processes it in bulk using optimized machine code. The loop approach, by contrast, incurs Python interpreter overhead for every single row.
The Mental Model Shift
The shift required is moving from "what do I do to each row?" to "what do I do to each column?"
| Procedural mindset | Pandas mindset |
|---|---|
| For each product, calculate its margin | Calculate the margin for the whole margin column |
| For each sale, check if it is high value | Create a Boolean column: is this row a high-value sale? |
| For each customer, add their orders | Group by customer, then sum orders |
| Loop through rows and filter manually | Use Boolean indexing to filter the DataFrame |
This shift is genuinely hard at first. After years of writing loops, it feels unnatural to think about columns. But it is a skill that pays dividends every time you touch data in Python. By the end of this book, it will feel as natural as filtering a spreadsheet.
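The second row of the table, "is this a high-value sale?", looks like this in code (a sketch with invented order data and threshold):

```python
import pandas as pd

sales = pd.DataFrame({'order_id': [101, 102, 103, 104],
                      'amount': [250.0, 80.0, 1200.0, 40.0]})

# One vectorized comparison produces the whole Boolean column at once
sales['high_value'] = sales['amount'] > 500

print(sales)
print(sales['high_value'].sum())  # True counts as 1 when summed
```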
10.13 A Complete Business Example: Building Acme's Product Analysis
Let's bring everything in this chapter together with a realistic, complete example. Priya is building a product analysis tool for Acme Corp.
import pandas as pd
import numpy as np
# ------------------------------------------------------------------ #
# Build the product catalog DataFrame
# ------------------------------------------------------------------ #
product_data = {
'sku': ['ACM-001', 'ACM-002', 'ACM-003', 'ACM-004', 'ACM-005',
'ACM-006', 'ACM-007', 'ACM-008'],
'product_name': [
'Widget A', 'Widget B', 'Gadget X', 'Gadget Y',
'Component Z', 'Widget C', 'Gadget Z', 'Component W'
],
'category': [
'Widgets', 'Widgets', 'Gadgets', 'Gadgets',
'Components', 'Widgets', 'Gadgets', 'Components'
],
'unit_price': [29.99, 49.99, 89.99, 129.99, 14.99, 39.99, 199.99, 24.99],
'unit_cost': [12.50, 22.00, 41.00, 58.50, 8.25, 34.50, 88.00, 11.00],
'inventory': [250, 180, 95, 42, 600, 310, 18, 420],
'reorder_point': [100, 75, 50, 25, 200, 150, 20, 300]
}
df = pd.DataFrame(product_data)
df = df.set_index('sku')
# ------------------------------------------------------------------ #
# Inspect
# ------------------------------------------------------------------ #
print("=== Dataset Overview ===")
print(f"Shape: {df.shape}")
df.info()
print("\n=== Summary Statistics ===")
print(df.describe())
# ------------------------------------------------------------------ #
# Add calculated columns
# ------------------------------------------------------------------ #
df['gross_profit'] = df['unit_price'] - df['unit_cost']
df['margin_pct'] = (df['gross_profit'] / df['unit_price']).round(4)
df['below_reorder'] = df['inventory'] < df['reorder_point']
# ------------------------------------------------------------------ #
# Analysis 1: Low-margin products (below 30%)
# ------------------------------------------------------------------ #
low_margin = df[df['margin_pct'] < 0.30].sort_values('margin_pct')
print("\n=== Products Below 30% Margin ===")
print(low_margin[['product_name', 'category', 'unit_price', 'unit_cost', 'margin_pct']])
# ------------------------------------------------------------------ #
# Analysis 2: Products below reorder point
# ------------------------------------------------------------------ #
needs_reorder = df[df['below_reorder']].sort_values('inventory')
print("\n=== Products Needing Reorder ===")
print(needs_reorder[['product_name', 'inventory', 'reorder_point']])
# ------------------------------------------------------------------ #
# Analysis 3: High-value, low-stock gadgets
# ------------------------------------------------------------------ #
critical_gadgets = df[
(df['category'] == 'Gadgets') &
(df['inventory'] < df['reorder_point'])
]
print("\n=== Critical Gadget Inventory Alert ===")
print(critical_gadgets[['product_name', 'unit_price', 'inventory', 'reorder_point']])
# ------------------------------------------------------------------ #
# Summary by category
# ------------------------------------------------------------------ #
print("\n=== Category Summary ===")
category_summary = df.groupby('category').agg(
product_count=('product_name', 'count'),
avg_price=('unit_price', 'mean'),
avg_margin=('margin_pct', 'mean'),
total_inventory=('inventory', 'sum')
).round(2)
print(category_summary)
This is the kind of analysis that used to take Priya 45 minutes in Excel. With pandas, it runs in under a second, produces consistent results every time, and can be adapted in minutes to answer new questions.
10.14 Common Beginner Mistakes (and How to Avoid Them)
Mistake 1: Confusing df['col'] and df[['col']]
df['unit_price'] returns a Series. df[['unit_price']] returns a single-column DataFrame. This matters when you pass data to a function that expects one or the other.
Mistake 2: Forgetting Parentheses in Combined Boolean Conditions
# This will raise a ValueError:
bad = df[df['category'] == 'Gadgets' & df['unit_price'] > 80]
# This is correct:
good = df[(df['category'] == 'Gadgets') & (df['unit_price'] > 80)]
Mistake 3: Using Python's and / or Instead of & / |
Python's and and or keywords do not work with Series. You will get a confusing error. Always use & and | for element-wise Boolean operations on pandas Series.
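A minimal reproduction of the problem and the fix (the DataFrame is invented for illustration; the failing line is left commented out because running it raises an error):

```python
import pandas as pd

df = pd.DataFrame({'category': ['Gadgets', 'Widgets'],
                   'unit_price': [89.99, 29.99]})

# Raises ValueError: the truth value of a Series is ambiguous
# bad = df[(df['category'] == 'Gadgets') and (df['unit_price'] > 80)]

# Element-wise & works on Series
good = df[(df['category'] == 'Gadgets') & (df['unit_price'] > 80)]
print(good)
```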
Mistake 4: Assuming Changes Are In-Place
Most pandas methods return new DataFrames. If you write df.sort_values('price') and do not assign the result, df is unchanged. Always assign: df = df.sort_values('price') — or use inplace=True if you are certain you want to modify in place.
Mistake 5: Using .iloc[] When You Mean .loc[]
If your row index is integers (0, 1, 2...), df.loc[3] and df.iloc[3] return the same row. But after filtering and sorting, the integer labels and the integer positions diverge. df.loc[3] will find the row labeled 3 (which may be anywhere in the DataFrame), while df.iloc[3] will always return the fourth row regardless of its label.
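A short sketch of that divergence (invented price data; the default integer index is kept so labels and positions start out identical):

```python
import pandas as pd

df = pd.DataFrame({'price': [10, 20, 30, 40, 50]})  # index labels 0..4

filtered = df[df['price'] > 20]   # keeps labels 2, 3, 4 at positions 0, 1, 2

print(filtered.loc[3, 'price'])   # label 3  -> 40
print(filtered.iloc[0]['price'])  # position 0 -> the row labeled 2, price 30
```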
10.15 Chapter Summary
This chapter introduced pandas, the foundational Python library for business data analysis. You have covered a significant amount of ground:
You learned that pandas is built on two core data structures: the Series (one-dimensional labeled array) and the DataFrame (two-dimensional labeled table). You learned the universal import convention import pandas as pd and how to create both structures from Python lists and dictionaries.
You explored the DataFrame inspection toolkit: .info() gives a structural overview; .describe() produces summary statistics; .head() and .tail() show rows at the top and bottom; .shape, .dtypes, and .columns tell you the table's dimensions and structure.
You practiced selecting data three ways: single and multiple column selection with bracket notation; row selection by label using .loc[]; row selection by position using .iloc[]. You learned the all-important Boolean filtering technique and how to combine conditions with & and |.
You added and modified columns with vectorized assignment, and you dropped unwanted data with .drop(). You sorted DataFrames with .sort_values().
Most importantly, you began to internalize the pandas mindset: think in columns, not loops. Vectorized operations are faster, more readable, and more maintainable than row-by-row iteration.
In the next chapter, you will take these foundations further: reading real data from CSV files and Excel spreadsheets, handling missing values, and performing grouped aggregations — the operations that transform raw data exports into genuine business intelligence.
Key Terms
Boolean filtering — Selecting rows from a DataFrame by applying a condition that produces a True/False Series, then using that Series to index the DataFrame.
DataFrame — The primary two-dimensional data structure in pandas, organized as labeled rows and labeled columns. Equivalent to a spreadsheet table or database table.
dtype — Data type of a pandas column. Common types include int64, float64, object (string), bool, and datetime64.
index — The row labels of a pandas Series or DataFrame. By default, an integer range starting at 0; can be set to any unique labels.
NaN — "Not a Number." The sentinel value pandas uses to represent missing data.
Series — The primary one-dimensional data structure in pandas. A labeled array equivalent to a single spreadsheet column.
vectorized operation — An operation applied to an entire array (column) at once, processed by optimized low-level code, rather than by iterating row by row in Python.
.iloc[] — Integer-location-based indexing. Selects rows and columns by their integer position, like Python list indexing.
.loc[] — Label-based indexing. Selects rows and columns by their index label.
Next: Chapter 11 — Reading and Cleaning Real Data: CSVs, Missing Values, and Data Types