Chapter 10: Introduction to pandas: Your Business Data Toolkit

"The goal is to turn data into information, and information into insight." — Carly Fiorina


What You Will Learn

By the end of this chapter, you will be able to:

  • Explain what pandas is and why it is the standard tool for business data work in Python
  • Create pandas Series and DataFrame objects from lists, dictionaries, and raw data
  • Inspect a DataFrame using .info(), .describe(), .head(), .tail(), .shape, .dtypes, and .columns
  • Select specific columns and rows using bracket notation, .loc[], and .iloc[]
  • Filter rows using Boolean conditions
  • Add, modify, and drop columns and rows
  • Sort a DataFrame by one or more columns
  • Explain the "vectorized operations" mindset that makes pandas faster than manual loops

Opening Scene

It is a Tuesday morning at Acme Corp. Priya Okonkwo, the company's lead data analyst, has just received an email from Sandra Chen, VP of Sales. Attached is a spreadsheet with 4,200 rows of product catalog data: SKU codes, categories, unit prices, cost of goods, current inventory, and reorder thresholds. Sandra's request is straightforward but urgent: "Before the 2 p.m. budget review, can you tell me which products have margins below 20 percent, and which ones are below their reorder point?"

In a previous life, Priya would have opened Excel, applied filters, written VLOOKUP formulas, and spent 45 minutes doing this work. Today, she opens her code editor, types twelve lines of Python, and has the answer in under three seconds — including a clean, sorted table she can paste directly into her presentation.

The tool that made the difference is pandas.


10.1 What Is pandas and Why Does It Matter?

pandas (the name is derived from "panel data," a term from econometrics) is an open-source Python library for data manipulation and analysis. It was created in 2008 by Wes McKinney while he was working at a hedge fund, because he needed a fast, expressive tool to work with financial data that Python simply did not have at the time. What he built became one of the most widely used software libraries in the world.

At its core, pandas gives you two powerful data structures:

  1. Series — a one-dimensional labeled array, like a single column of a spreadsheet
  2. DataFrame — a two-dimensional labeled table, like a full spreadsheet with rows and columns

Everything else pandas offers — filtering, sorting, grouping, merging, reshaping, reading CSV files, writing to Excel — is built on top of these two structures.

The Excel vs. pandas Mindset

If you have been using Excel for business analysis, you already understand the concept of tabular data. You know what a column header is, what a row of values represents, and what it means to filter a table. That conceptual foundation transfers directly to pandas. The difference is in how you interact with the data.

In Excel, your workflow is largely manual and visual. You click cells, drag formulas, apply filters through menus, and format results by hand. For 50 rows, this is fast and comfortable. For 50,000 rows — or 500,000 — it becomes painful or impossible. Excel will slow to a crawl, crash, or simply refuse to open the file.

In pandas, your workflow is written in code. Every operation you perform is a line of Python that can be re-run instantly, modified in seconds, and shared with a colleague who can reproduce your exact results. Pandas handles millions of rows without breaking a sweat, because it is built on top of NumPy, a library that performs numerical computations using highly optimized C code under the hood.

Here is a concrete comparison. Suppose you have a spreadsheet of 10,000 sales transactions and you want to calculate the profit margin for every row. In Excel, you might type a formula in column F and drag it down 10,000 rows. If someone later changes the column layout, your formula breaks.

In pandas, you write:

df['margin'] = (df['unit_price'] - df['cost']) / df['unit_price']

This single line applies the calculation to every row simultaneously. It does not loop. It does not drag. It is what practitioners call a vectorized operation — the library applies the operation to the entire column at once, which is dramatically faster than any loop you could write yourself.

This is the pandas mindset shift: stop thinking row by row, start thinking column by column.


10.2 Installing and Importing pandas

pandas does not ship with Python's standard library, but it comes pre-installed with the Anaconda distribution (the most common Python environment for data work). If you are using a standard Python installation, you can install it with:

pip install pandas

Once installed, the universal convention — used in virtually every tutorial, book, blog post, and Stack Overflow answer in the world — is to import it with the alias pd:

import pandas as pd

You will also frequently import NumPy alongside pandas, as the two libraries work closely together:

import numpy as np
import pandas as pd

Throughout this chapter, every code example assumes these two import lines have already been run.


10.3 The Series: pandas' One-Dimensional Building Block

Before we build a full DataFrame, we need to understand its smallest component: the Series.

A pandas Series is a one-dimensional array of values with an associated index. Think of it as a single column from a spreadsheet, but smarter — each value has a label (the index) that lets you retrieve it by name rather than just by position.

Creating a Series from a List

The simplest way to create a Series is to pass a Python list to pd.Series():

import pandas as pd

monthly_revenue = pd.Series([42000, 58000, 63500, 71000, 49000, 55000])
print(monthly_revenue)

Output:

0    42000
1    58000
2    63500
3    71000
4    49000
5    55000
dtype: int64

Notice that pandas automatically created a numeric index (0, 1, 2, ...). The dtype at the bottom tells you the data type pandas inferred for the values.

Creating a Series with a Custom Index

The real power of a Series becomes apparent when you give it a meaningful index:

monthly_revenue = pd.Series(
    [42000, 58000, 63500, 71000, 49000, 55000],
    index=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
)
print(monthly_revenue)

Output:

Jan    42000
Feb    58000
Mar    63500
Apr    71000
May    49000
Jun    55000
dtype: int64

Now you can access values by label:

print(monthly_revenue['Mar'])   # 63500
print(monthly_revenue['Jan':'Mar'])  # Slice from Jan through Mar

Creating a Series from a Dictionary

When your data already exists as a Python dictionary, pandas will use the dictionary keys as the index automatically:

q1_sales = {
    'January': 42000,
    'February': 58000,
    'March': 63500
}

sales_series = pd.Series(q1_sales)
print(sales_series)

Output:

January     42000
February    58000
March       63500
dtype: int64

Basic Series Operations

Series support all the arithmetic operations you would expect, and every operation is applied in vectorized fashion:

# Add a 10% commission to every month's revenue
commission = monthly_revenue * 0.10
print(commission)

# Find months where revenue exceeded 60000
high_revenue_months = monthly_revenue[monthly_revenue > 60000]
print(high_revenue_months)

Some useful Series methods:

Method What It Does
.sum() Sum of all values
.mean() Average value
.median() Median value
.min() Minimum value
.max() Maximum value
.count() Number of non-null values
.std() Standard deviation
.value_counts() Frequency count of each unique value
.unique() Array of unique values
.sort_values() Returns sorted Series
.head(n) First n values (default 5)
.tail(n) Last n values (default 5)

For example:

print(f"Total H1 Revenue: ${monthly_revenue.sum():,.0f}")
print(f"Average Monthly Revenue: ${monthly_revenue.mean():,.0f}")
print(f"Best Month Revenue: ${monthly_revenue.max():,.0f}")

Series vs. a Python List

You might be wondering: why not just use a Python list? There are several reasons:

  1. Labeled access. A list can only be accessed by integer position. A Series can be accessed by label, making code far more readable.
  2. Vectorized math. monthly_revenue * 1.10 works on a Series. On a list, you would need a loop or list comprehension.
  3. Built-in statistics. .mean(), .std(), .describe() are one-liners on a Series.
  4. Missing data handling. Pandas has first-class support for NaN (Not a Number) values. Python lists have no such concept.
  5. Integration with DataFrames. Each column of a DataFrame is a Series, so understanding Series is fundamental.
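Points 2 through 4 above are easy to see in action. A small sketch using made-up revenue figures, including one missing month:

```python
import numpy as np
import pandas as pd

# A revenue series with one missing month (NaN)
revenue = pd.Series([42000, np.nan, 63500, 71000])

# Vectorized math and built-in statistics are one-liners
print(revenue * 1.10)        # 10% uplift applied to every value at once
print(revenue.mean())        # average of the three known values

# First-class missing-data support: NaN is skipped, not a crash
print(revenue.count())       # 3 -- counts non-null values only
print(revenue.isna().sum())  # 1 -- how many values are missing
```

A plain Python list would need a loop for the uplift and has no notion of a "missing" value at all.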

10.4 The DataFrame: Your In-Memory Spreadsheet

The DataFrame is the workhorse of pandas. It is a two-dimensional table with labeled rows and labeled columns. Every column in a DataFrame is a Series, and all the columns share a common row index.

If you have used a relational database, a DataFrame is essentially a table. If you have used Excel, it is a worksheet. But unlike either of those tools, a DataFrame lives entirely in memory, is manipulated with code, and can be transformed, filtered, merged, and analyzed with extraordinary speed.

Creating a DataFrame from a Dictionary of Lists

The most common way to create a DataFrame by hand is to pass a dictionary where each key is a column name and each value is a list of column values:

import pandas as pd

product_catalog = {
    'sku': ['ACM-001', 'ACM-002', 'ACM-003', 'ACM-004', 'ACM-005'],
    'product_name': ['Widget A', 'Widget B', 'Gadget X', 'Gadget Y', 'Component Z'],
    'category': ['Widgets', 'Widgets', 'Gadgets', 'Gadgets', 'Components'],
    'unit_price': [29.99, 49.99, 89.99, 129.99, 14.99],
    'unit_cost': [12.50, 22.00, 41.00, 58.50, 8.25],
    'inventory': [250, 180, 95, 42, 600]
}

df = pd.DataFrame(product_catalog)
print(df)

Output:

       sku product_name    category  unit_price  unit_cost  inventory
0  ACM-001     Widget A     Widgets       29.99      12.50        250
1  ACM-002     Widget B     Widgets       49.99      22.00        180
2  ACM-003     Gadget X     Gadgets       89.99      41.00         95
3  ACM-004     Gadget Y     Gadgets      129.99      58.50         42
4  ACM-005  Component Z  Components       14.99       8.25        600

Pandas automatically assigned a numeric row index (0 through 4). The column headers come directly from the dictionary keys.

Setting a Custom Row Index

Often you have a natural identifier for each row — a SKU, a customer ID, a date. You can set that as the index:

df = df.set_index('sku')
print(df)

Output:

         product_name    category  unit_price  unit_cost  inventory
sku
ACM-001      Widget A     Widgets       29.99      12.50        250
ACM-002      Widget B     Widgets       49.99      22.00        180
ACM-003      Gadget X     Gadgets       89.99      41.00         95
ACM-004      Gadget Y     Gadgets      129.99      58.50         42
ACM-005   Component Z  Components       14.99       8.25        600

Now rows can be accessed by SKU label. We can also reset to the default integer index at any time with .reset_index().
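A quick sketch of the round trip between the two index styles:

```python
import pandas as pd

df = pd.DataFrame({
    'sku': ['ACM-001', 'ACM-002'],
    'unit_price': [29.99, 49.99]
})

df = df.set_index('sku')      # 'sku' labels become the row index
print(df.loc['ACM-002'])      # label-based access now works

df = df.reset_index()         # 'sku' goes back to being an ordinary column
print(df.columns.tolist())    # ['sku', 'unit_price']
```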


10.5 Inspecting a DataFrame

Before you do any analysis, the first thing you should always do with a new DataFrame is look at it. Pandas provides a full toolkit of inspection methods.

.head() and .tail()

These methods show you the first or last N rows (default 5). They are your first line of defense when you load a new dataset:

print(df.head())     # First 5 rows
print(df.head(3))    # First 3 rows
print(df.tail(2))    # Last 2 rows

In a Jupyter Notebook, you can run a cell whose last line is simply df to render a nicely formatted table. In a script, you need print(df.head()).

.shape

Returns a tuple of (number_of_rows, number_of_columns):

print(df.shape)
# (5, 5)  -- 5 rows, 5 columns (after setting sku as index)

This is extremely useful for a sanity check after loading data: "Did I load all 4,200 rows? Did I get all 12 columns?"

.dtypes

Shows the data type of each column:

print(df.dtypes)

Output:

product_name     object
category         object
unit_price      float64
unit_cost       float64
inventory         int64
dtype: object

The most common data types you will encounter:

pandas dtype Python/NumPy equivalent Typical use
int64 Integer Counts, quantities
float64 Float Prices, rates, measurements
object String (or mixed) Names, labels, categories
bool Boolean True/False flags
datetime64 Date and time Dates, timestamps
category Categorical Repeated labels (efficient)

When you load data from a CSV, pandas will infer data types automatically — but it does not always get it right. A column of ZIP codes might be read as integers, stripping leading zeros. A column of dates might be read as strings. Checking .dtypes early saves a lot of debugging later.
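The ZIP-code pitfall is easy to reproduce. A minimal sketch, using an in-memory CSV (via io.StringIO) in place of a real file — the column names here are made up for illustration:

```python
import io
import pandas as pd

csv_data = "zip_code,order_date\n02134,2024-01-15\n90210,2024-02-01\n"

# Default inference: the leading zero is lost, and the date arrives as a string
df_bad = pd.read_csv(io.StringIO(csv_data))
print(df_bad['zip_code'].iloc[0])   # 2134 -- leading zero stripped
print(df_bad.dtypes)

# Explicit dtypes and date parsing fix both problems
df_good = pd.read_csv(
    io.StringIO(csv_data),
    dtype={'zip_code': str},        # keep ZIP codes as strings
    parse_dates=['order_date']      # parse dates into datetime64
)
print(df_good['zip_code'].iloc[0])  # '02134'
print(df_good.dtypes)
```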

.columns

Returns the column names as a pandas Index object:

print(df.columns)
# Index(['product_name', 'category', 'unit_price', 'unit_cost', 'inventory'], dtype='object')

You can convert this to a list with list(df.columns), which is useful when you want to iterate over column names or check whether a specific column exists.

.info()

This is the most comprehensive quick-inspection method. It shows the index type, number of rows, each column name, the count of non-null values, and the data type:

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, ACM-001 to ACM-005
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   product_name  5 non-null      object
 1   category      5 non-null      object
 2   unit_price    5 non-null      float64
 3   unit_cost     5 non-null      float64
 4   inventory     5 non-null      int64
dtypes: float64(2), int64(1), object(2)
memory usage: 240.0+ bytes

The "Non-Null Count" column is particularly valuable. If you have 4,200 rows but a column shows only 3,800 non-null values, you have 400 missing values to deal with before you analyze that column.

.describe()

Generates summary statistics for all numeric columns:

print(df.describe())

Output:

       unit_price   unit_cost   inventory
count    5.000000    5.000000    5.000000
mean    63.190000   28.450000  233.400000
std     44.804018   19.537655  218.777285
min     14.990000    8.250000   42.000000
25%     29.990000   12.500000   95.000000
50%     49.990000   22.000000  180.000000
75%     89.990000   41.000000  250.000000
max    129.990000   58.500000  600.000000

In seconds, you have learned that your average unit price is $63.19, that the cheapest product costs $14.99 and the most expensive costs $129.99, and that inventory ranges from 42 to 600 units. This kind of instant summary is why .describe() is one of the first methods data analysts call on any new dataset.


10.6 Selecting Columns

Selecting columns is one of the most frequent operations you will perform. Pandas gives you several ways to do it.

Selecting a Single Column

To select one column, use bracket notation with the column name as a string:

prices = df['unit_price']
print(type(prices))    # <class 'pandas.core.series.Series'>
print(prices)

This returns a Series. Remember: every column in a DataFrame is a Series.

You can also use attribute-style access (dot notation) for a single column — but only when the column name is a valid Python identifier (no spaces, does not start with a number, does not conflict with a DataFrame method name):

prices = df.unit_price   # Same result as df['unit_price']

Recommendation: Use bracket notation (df['column_name']) consistently. Dot notation looks cleaner, but it breaks when column names have spaces, and it can silently return a DataFrame method instead of your column if names conflict. Bracket notation always works.
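The name-conflict problem is easy to demonstrate. A sketch using a hypothetical column named count, which collides with the built-in .count() method:

```python
import pandas as pd

df = pd.DataFrame({'count': [10, 20, 30], 'price': [1.0, 2.0, 3.0]})

# Dot notation silently returns the DataFrame *method*, not your column
print(df.count)       # <bound method DataFrame.count ...>

# Bracket notation always returns the column
print(df['count'])    # the Series with values 10, 20, 30
```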

Selecting Multiple Columns

To select multiple columns, pass a list of column names inside the brackets. Note the double brackets — the outer brackets are the selection operator, the inner brackets define the list:

price_and_cost = df[['unit_price', 'unit_cost']]
print(type(price_and_cost))  # <class 'pandas.core.frame.DataFrame'>
print(price_and_cost)

Output:

         unit_price  unit_cost
sku
ACM-001       29.99      12.50
ACM-002       49.99      22.00
ACM-003       89.99      41.00
ACM-004      129.99      58.50
ACM-005       14.99       8.25

When you select multiple columns, you get back a DataFrame, not a Series. This is a common source of confusion for beginners: df['col'] returns a Series, df[['col']] returns a single-column DataFrame. They look similar when printed, but they behave differently.


10.7 Selecting Rows: .loc[] and .iloc[]

Selecting rows is slightly more nuanced than selecting columns. Pandas provides two primary indexers: .loc[] for label-based selection and .iloc[] for integer-position-based selection.

.loc[] — Label-Based Selection

.loc[] selects rows (and optionally columns) by their label — the value in the index.

# Select a single row by its index label
print(df.loc['ACM-003'])

# Select multiple rows
print(df.loc[['ACM-001', 'ACM-004']])

# Select a row AND specific columns
print(df.loc['ACM-002', ['unit_price', 'unit_cost']])

# Select a range of rows (inclusive on both ends with label-based slicing)
print(df.loc['ACM-001':'ACM-003'])

The comma syntax is powerful: df.loc[row_selector, column_selector]. You can pass a label, a list of labels, or a slice for either dimension.

.iloc[] — Integer-Position-Based Selection

.iloc[] selects rows and columns by their integer position, regardless of what the index labels are. This is like accessing a list by index number.

# Select the first row (position 0)
print(df.iloc[0])

# Select the third row (position 2)
print(df.iloc[2])

# Select rows 1 through 3 (exclusive of the end, like Python slicing)
print(df.iloc[1:4])

# Select row 0, columns 0 and 1
print(df.iloc[0, 0:2])

# Select the last row
print(df.iloc[-1])

When to Use Which

Situation Use
You know the row's index label (SKU, name, date) .loc[]
You want "the first 10 rows" or "every other row" .iloc[]
Your index is integers AND those integers match positions Either, but .loc[] is clearer
Combining row and column selection by label .loc[]
Combining row and column selection by position .iloc[]

A common beginner mistake is to use .iloc[] on a DataFrame whose index has been reset or shuffled, expecting label-based results. When in doubt, use .loc[] for label-based work and .iloc[] for positional work — and keep them straight.


10.8 Boolean Filtering: The Core of Data Analysis

Boolean filtering is one of the most powerful features in pandas, and it is the technique you will use for the vast majority of real business analysis tasks. It lets you select rows that meet a condition — the pandas equivalent of Excel's AutoFilter.

How Boolean Filtering Works

The mechanics are simple. First, you create a Boolean Series — a series of True/False values, one per row, based on some condition. Then you use that Boolean Series to index the DataFrame, which returns only the rows where the value is True.

# Step 1: Create a Boolean Series
is_gadget = df['category'] == 'Gadgets'
print(is_gadget)

Output:

sku
ACM-001    False
ACM-002    False
ACM-003     True
ACM-004     True
ACM-005    False
dtype: bool

# Step 2: Use it to filter the DataFrame
gadgets_only = df[is_gadget]
print(gadgets_only)

In practice, you almost always combine these two steps into one line:

gadgets_only = df[df['category'] == 'Gadgets']
print(gadgets_only)

This reads almost like English: "give me the rows from df where the category is Gadgets."

Comparison Operators

All standard Python comparison operators work with pandas:

Operator Meaning Example
== Equal to df['category'] == 'Widgets'
!= Not equal to df['category'] != 'Components'
> Greater than df['unit_price'] > 50
>= Greater than or equal df['inventory'] >= 100
< Less than df['unit_cost'] < 20
<= Less than or equal df['margin'] <= 0.20

Combining Conditions with & and |

To combine multiple conditions, use & (and) and | (or). Note that you must wrap each condition in parentheses when combining them:

# Products that are Gadgets AND cost more than $80
expensive_gadgets = df[(df['category'] == 'Gadgets') & (df['unit_price'] > 80)]

# Products that are either Widgets or have price below $20
widget_or_cheap = df[(df['category'] == 'Widgets') | (df['unit_price'] < 20)]

The parentheses are not optional — Python's operator precedence rules mean that & binds more tightly than ==, which produces a confusing error if you omit them.

Negation with ~

To negate a condition (select rows where the condition is False), use ~:

# All products that are NOT in the Gadgets category
non_gadgets = df[~(df['category'] == 'Gadgets')]
# Equivalent to: df[df['category'] != 'Gadgets']

.isin() for Multiple Values

When you want to match any of several values, .isin() is cleaner than a chain of | conditions:

# Products in either Widgets or Components category
widgets_or_components = df[df['category'].isin(['Widgets', 'Components'])]

Real Business Example: Finding Low-Margin Products

Here is how Priya answers Sandra's question using Boolean filtering. First, she adds a margin column, then she filters:

# Calculate gross margin percentage
df['margin_pct'] = (df['unit_price'] - df['unit_cost']) / df['unit_price']

# Find products with margin below 20%
low_margin_products = df[df['margin_pct'] < 0.20]
print(f"Products below 20% margin:\n{low_margin_products[['product_name', 'unit_price', 'unit_cost', 'margin_pct']]}")

Twelve lines of code. Three seconds of runtime. Sandra has her answer before her second cup of coffee.


10.9 Adding and Modifying Columns

Adding a new calculated column to a DataFrame is one of the most common operations in data analysis. The syntax mirrors a Python dictionary:

df['new_column_name'] = expression

Adding a Calculated Column

# Add a gross profit column
df['gross_profit'] = df['unit_price'] - df['unit_cost']

# Add a margin percentage column
df['margin_pct'] = df['gross_profit'] / df['unit_price']

# Add a revenue column (assuming all inventory is sold)
df['potential_revenue'] = df['unit_price'] * df['inventory']

print(df[['product_name', 'gross_profit', 'margin_pct', 'potential_revenue']])

Each of these operations is vectorized — pandas applies the arithmetic to every row simultaneously without any explicit loop.

Modifying an Existing Column

You can overwrite an existing column the same way you created it:

# Apply a 5% price increase to all products
df['unit_price'] = df['unit_price'] * 1.05

# Round prices to 2 decimal places
df['unit_price'] = df['unit_price'].round(2)

Adding a Column Based on a Condition

Sometimes you want to categorize rows based on their values. The np.where() function (from NumPy) is the pandas-idiomatic way to do this — it is the vectorized equivalent of a conditional expression:

import numpy as np

# Tag products as 'Low Margin' or 'Healthy Margin'
df['margin_status'] = np.where(df['margin_pct'] < 0.25, 'Low Margin', 'Healthy Margin')

For more than two categories, the .apply() method with a custom function, or pd.cut() for numeric bins, are the appropriate tools — but those belong in a more advanced chapter.


10.10 Dropping Columns and Rows

Dropping Columns

Use .drop() with axis=1 to remove columns (axis=1 means "along the column dimension"):

# Drop a single column
df_cleaned = df.drop('potential_revenue', axis=1)

# Drop multiple columns
df_cleaned = df.drop(['gross_profit', 'potential_revenue'], axis=1)

Note that .drop() returns a new DataFrame by default — it does not modify df in place. To modify in place, use inplace=True:

df.drop('potential_revenue', axis=1, inplace=True)

Many experienced pandas users avoid inplace=True because it makes code harder to reason about (you cannot easily undo in-place operations). The preferred pattern is assignment: df = df.drop(...).

Dropping Rows

To drop rows, use .drop() with the row labels (or positions):

# Drop the row with index label 'ACM-005'
df_trimmed = df.drop('ACM-005')

# Drop multiple rows
df_trimmed = df.drop(['ACM-003', 'ACM-004'])

You can also drop rows based on conditions using Boolean filtering — simply filter for the rows you want to keep, rather than specifying which to drop. This is usually more readable:

# Keep only products with inventory above 50 (effectively dropping the rest)
df_in_stock = df[df['inventory'] > 50]

10.11 Sorting a DataFrame

Sorting is accomplished with .sort_values(). You specify the column (or columns) to sort by, and whether to sort ascending or descending.

Sorting by a Single Column

# Sort by unit price, lowest to highest (ascending is the default)
df_sorted_by_price = df.sort_values('unit_price')

# Sort by unit price, highest to lowest
df_sorted_by_price_desc = df.sort_values('unit_price', ascending=False)

# Sort by margin percentage, worst margins first
df_sorted_by_margin = df.sort_values('margin_pct', ascending=True)

Sorting by Multiple Columns

Pass a list of column names to sort by multiple levels. Pandas sorts by the first column, then uses subsequent columns as tiebreakers:

# Sort by category (A-Z), then by unit price within each category (high to low)
df_multi_sorted = df.sort_values(
    by=['category', 'unit_price'],
    ascending=[True, False]
)

Sorting the Index

To sort by the DataFrame's row index, use .sort_index():

df_index_sorted = df.sort_index()           # Ascending
df_index_sorted_desc = df.sort_index(ascending=False)

Like most pandas methods, .sort_values() and .sort_index() return new DataFrames by default. Use inplace=True to modify in place, or reassign.


10.12 The pandas Mindset: Vectorized Operations vs. Manual Loops

Now that you have seen the core operations, it is worth pausing to address a pattern that trips up nearly every Python programmer coming from a procedural background: the temptation to use loops.

Suppose you have a DataFrame with 100,000 rows and you want to calculate the margin for each product. A programmer comfortable with lists and loops might write:

# DO NOT DO THIS — it is very slow and unpythonic
margins = []
for index, row in df.iterrows():
    margin = (row['unit_price'] - row['unit_cost']) / row['unit_price']
    margins.append(margin)
df['margin'] = margins

This code works, but it is orders of magnitude slower than the vectorized approach. On 100,000 rows, the loop approach might take 5–10 seconds. The vectorized approach takes milliseconds:

# DO THIS INSTEAD — fast and readable
df['margin'] = (df['unit_price'] - df['unit_cost']) / df['unit_price']

Why is the vectorized approach so much faster? Because pandas passes the entire arrays to NumPy's underlying C implementation, which processes them in bulk using optimized machine code. The loop approach, by contrast, invokes Python overhead for every single row.
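You can measure the gap yourself. A rough benchmark sketch using synthetic data — exact timings vary by machine, so no specific numbers are claimed:

```python
import time
import numpy as np
import pandas as pd

# Build a synthetic 100,000-row catalog
n = 100_000
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'unit_price': rng.uniform(10, 200, n),
    'unit_cost': rng.uniform(5, 100, n),
})

# Loop approach via .iterrows() -- pays Python overhead on every row
start = time.perf_counter()
margins = []
for _, row in df.iterrows():
    margins.append((row['unit_price'] - row['unit_cost']) / row['unit_price'])
loop_time = time.perf_counter() - start

# Vectorized approach -- the whole column is handed to NumPy's C code
start = time.perf_counter()
vectorized = (df['unit_price'] - df['unit_cost']) / df['unit_price']
vec_time = time.perf_counter() - start

print(f"Loop:       {loop_time:.3f} s")
print(f"Vectorized: {vec_time:.4f} s")
# Both produce identical numbers; only the speed differs
```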

The Mental Model Shift

The shift required is moving from "what do I do to each row?" to "what do I do to each column?"

Procedural mindset Pandas mindset
For each product, calculate its margin Calculate the margin for the whole margin column
For each sale, check if it is high value Create a Boolean column: is this row a high-value sale?
For each customer, add their orders Group by customer, then sum orders
Loop through rows and filter manually Use Boolean indexing to filter the DataFrame

This shift is genuinely hard at first. After years of writing loops, it feels unnatural to think about columns. But it is a skill that pays dividends every time you touch data in Python. By the end of this book, it will feel as natural as filtering a spreadsheet.


10.13 A Complete Business Example: Building Acme's Product Analysis

Let's bring everything in this chapter together with a realistic, complete example. Priya is building a product analysis tool for Acme Corp.

import pandas as pd
import numpy as np

# ------------------------------------------------------------------ #
# Build the product catalog DataFrame
# ------------------------------------------------------------------ #
product_data = {
    'sku': ['ACM-001', 'ACM-002', 'ACM-003', 'ACM-004', 'ACM-005',
            'ACM-006', 'ACM-007', 'ACM-008'],
    'product_name': [
        'Widget A', 'Widget B', 'Gadget X', 'Gadget Y',
        'Component Z', 'Widget C', 'Gadget Z', 'Component W'
    ],
    'category': [
        'Widgets', 'Widgets', 'Gadgets', 'Gadgets',
        'Components', 'Widgets', 'Gadgets', 'Components'
    ],
    'unit_price': [29.99, 49.99, 89.99, 129.99, 14.99, 39.99, 199.99, 24.99],
    'unit_cost':  [12.50, 22.00, 41.00, 58.50,  8.25, 34.50,  88.00, 11.00],
    'inventory':  [250,   180,   95,    42,      600,  310,    18,    420],
    'reorder_point': [100, 75, 50, 25, 200, 150, 20, 300]
}

df = pd.DataFrame(product_data)
df = df.set_index('sku')

# ------------------------------------------------------------------ #
# Inspect
# ------------------------------------------------------------------ #
print("=== Dataset Overview ===")
print(f"Shape: {df.shape}")
df.info()
print("\n=== Summary Statistics ===")
print(df.describe())

# ------------------------------------------------------------------ #
# Add calculated columns
# ------------------------------------------------------------------ #
df['gross_profit']  = df['unit_price'] - df['unit_cost']
df['margin_pct']    = (df['gross_profit'] / df['unit_price']).round(4)
df['below_reorder'] = df['inventory'] < df['reorder_point']

# ------------------------------------------------------------------ #
# Analysis 1: Low-margin products (below 30%)
# ------------------------------------------------------------------ #
low_margin = df[df['margin_pct'] < 0.30].sort_values('margin_pct')
print("\n=== Products Below 30% Margin ===")
print(low_margin[['product_name', 'category', 'unit_price', 'unit_cost', 'margin_pct']])

# ------------------------------------------------------------------ #
# Analysis 2: Products below reorder point
# ------------------------------------------------------------------ #
needs_reorder = df[df['below_reorder']].sort_values('inventory')
print("\n=== Products Needing Reorder ===")
print(needs_reorder[['product_name', 'inventory', 'reorder_point']])

# ------------------------------------------------------------------ #
# Analysis 3: High-value, low-stock gadgets
# ------------------------------------------------------------------ #
critical_gadgets = df[
    (df['category'] == 'Gadgets') &
    (df['inventory'] < df['reorder_point'])
]
print("\n=== Critical Gadget Inventory Alert ===")
print(critical_gadgets[['product_name', 'unit_price', 'inventory', 'reorder_point']])

# ------------------------------------------------------------------ #
# Summary by category
# ------------------------------------------------------------------ #
print("\n=== Category Summary ===")
category_summary = df.groupby('category').agg(
    product_count=('product_name', 'count'),
    avg_price=('unit_price', 'mean'),
    avg_margin=('margin_pct', 'mean'),
    total_inventory=('inventory', 'sum')
).round(2)
print(category_summary)

This is the kind of analysis that used to take Priya 45 minutes in Excel. With pandas, it runs in under a second, produces consistent results every time, and can be adapted in minutes to answer new questions.


10.14 Common Beginner Mistakes (and How to Avoid Them)

Mistake 1: Confusing df['col'] and df[['col']]

df['unit_price'] returns a Series. df[['unit_price']] returns a single-column DataFrame. This matters when you pass data to a function that expects one or the other.
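A minimal demonstration of the difference:

```python
import pandas as pd

df = pd.DataFrame({'unit_price': [29.99, 49.99]})

series_result = df['unit_price']     # a Series
frame_result = df[['unit_price']]    # a single-column DataFrame

print(type(series_result).__name__)  # Series
print(type(frame_result).__name__)   # DataFrame

# The difference shows up in behavior, e.g. .mean():
print(series_result.mean())          # a single number
print(frame_result.mean())           # a Series (one mean per column)
```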

Mistake 2: Forgetting Parentheses in Combined Boolean Conditions

# This raises an error (& binds more tightly than the comparisons):
bad = df[df['category'] == 'Gadgets' & df['unit_price'] > 80]

# This is correct:
good = df[(df['category'] == 'Gadgets') & (df['unit_price'] > 80)]

Mistake 3: Using Python's and / or Instead of & / |

Python's and and or keywords do not work with Series. You will get a confusing error. Always use & and | for element-wise Boolean operations on pandas Series.
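A sketch of what goes wrong, and the fix:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['Widgets', 'Gadgets'],
    'unit_price': [29.99, 89.99]
})

# Python's `and` asks for the truth value of a whole Series -- which is ambiguous
try:
    bad = df[(df['category'] == 'Gadgets') and (df['unit_price'] > 80)]
except ValueError as err:
    print(err)   # "The truth value of a Series is ambiguous..."

# Element-wise & works as intended
good = df[(df['category'] == 'Gadgets') & (df['unit_price'] > 80)]
print(good)
```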

Mistake 4: Assuming Changes Are In-Place

Most pandas methods return new DataFrames. If you write df.sort_values('price') and do not assign the result, df is unchanged. Always assign: df = df.sort_values('price') — or use inplace=True if you are certain you want to modify in place.
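A quick demonstration of the difference:

```python
import pandas as pd

df = pd.DataFrame({'price': [30.0, 10.0, 20.0]})

df.sort_values('price')        # result computed... and thrown away
print(df['price'].tolist())    # [30.0, 10.0, 20.0] -- unchanged

df = df.sort_values('price')   # assign the result back
print(df['price'].tolist())    # [10.0, 20.0, 30.0]
```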

Mistake 5: Using .iloc[] When You Mean .loc[]

If your row index is integers (0, 1, 2...), df.loc[3] and df.iloc[3] return the same row. But after filtering and sorting, the integer labels and the integer positions diverge. df.loc[3] will find the row labeled 3 (which may be anywhere in the DataFrame), while df.iloc[3] will always return the fourth row regardless of its label.
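A sketch of how labels and positions diverge after sorting:

```python
import pandas as pd

df = pd.DataFrame({'price': [40, 10, 30, 20]})   # default index 0..3
df_sorted = df.sort_values('price')              # labels now out of order: 1, 3, 2, 0

print(df_sorted.loc[3])    # the row whose *label* is 3   (price 20)
print(df_sorted.iloc[3])   # the row in *position* 3      (price 40)
```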


10.15 Chapter Summary

This chapter introduced pandas, the foundational Python library for business data analysis. You have covered a significant amount of ground:

You learned that pandas is built on two core data structures: the Series (one-dimensional labeled array) and the DataFrame (two-dimensional labeled table). You learned the universal import convention import pandas as pd and how to create both structures from Python lists and dictionaries.

You explored the DataFrame inspection toolkit: .info() gives a structural overview; .describe() produces summary statistics; .head() and .tail() show rows at the top and bottom; .shape, .dtypes, and .columns tell you the table's dimensions and structure.

You practiced selecting data three ways: single and multiple column selection with bracket notation; row selection by label using .loc[]; row selection by position using .iloc[]. You learned the all-important Boolean filtering technique and how to combine conditions with & and |.

You added and modified columns with vectorized assignment, and you dropped unwanted data with .drop(). You sorted DataFrames with .sort_values().

Most importantly, you began to internalize the pandas mindset: think in columns, not loops. Vectorized operations are faster, more readable, and more maintainable than row-by-row iteration.

In the next chapter, you will take these foundations further: reading real data from CSV files and Excel spreadsheets, handling missing values, and performing grouped aggregations — the operations that transform raw data exports into genuine business intelligence.


Key Terms

Boolean filtering — Selecting rows from a DataFrame by applying a condition that produces a True/False Series, then using that Series to index the DataFrame.

DataFrame — The primary two-dimensional data structure in pandas, organized as labeled rows and labeled columns. Equivalent to a spreadsheet table or database table.

dtype — Data type of a pandas column. Common types include int64, float64, object (string), bool, and datetime64.

index — The row labels of a pandas Series or DataFrame. By default, an integer range starting at 0; can be set to any unique labels.

NaN — "Not a Number." The sentinel value pandas uses to represent missing data.

Series — The primary one-dimensional data structure in pandas. A labeled array equivalent to a single spreadsheet column.

vectorized operation — An operation applied to an entire array (column) at once, processed by optimized low-level code, rather than by iterating row by row in Python.

.iloc[] — Integer-location-based indexing. Selects rows and columns by their integer position, like Python list indexing.

.loc[] — Label-based indexing. Selects rows and columns by their index label.


Next: Chapter 11 — Reading and Cleaning Real Data: CSVs, Missing Values, and Data Types