In This Chapter
- 3.1 Why Python? The Business Case
- 3.2 Setting Up Your Environment
- 3.3 The Jupyter Notebook Workflow
- 3.4 Python Basics: Variables and Data Types
- 3.5 Control Flow: Making Decisions and Repeating Actions
- 3.6 Data Structures: Organizing Your Data
- 3.7 Functions: Writing Reusable Code
- 3.8 Introduction to pandas
- 3.9 Basic Data Manipulation with pandas
- 3.10 Your First Analysis: Athena Retail Sales Data
- 3.11 Common Errors and Debugging
- 3.12 Putting It All Together: Your Python Toolkit
- Chapter Summary
Chapter 3: Python for the Business Professional
NK Adeyemi stared at the empty Jupyter notebook on her laptop screen. The cursor blinked at her from inside a gray cell, patient and indifferent. She had spent the previous evening reading the installation instructions Professor Okonkwo had posted to the course portal, and everything had gone smoothly — Anaconda installed, JupyterLab launched, new notebook created. But now that she was here, the blankness felt enormous.
Tom Kowalski, sitting in the next seat, glanced at her screen. "Just type something," he said.
NK looked at him skeptically. Tom had been coding since high school, had spent four years as a software engineer at a fintech company before his MBA. Of course it was easy for him. She turned back to the screen, took a breath, and typed:
print("Hello, business world")
She pressed Shift+Enter. The words appeared below the cell, plain and unformatted.
Hello, business world
"That's it?" she said.
Tom grinned. "That's it. Now let's do something useful."
Professor Okonkwo walked to the front of the room and opened her own notebook on the projector. "Good morning. Before we begin, I want to show you something." She typed three lines of code:
import pandas as pd
sales = pd.read_csv("athena_monthly_sales.csv")
sales.groupby("region")["revenue"].mean()
The output appeared instantly — a clean table showing average revenue by region. "Last year," she said, "an analyst at a company I advise spent four hours building this answer in Excel. Pivot tables, VLOOKUP chains, manual reformatting. These three lines produce the same result in under a second." She paused. "This chapter is not about turning you into software engineers. It is about giving you a tool that lets you ask better questions, faster. By the end of today, every one of you will have written code that analyzes real business data. Let's begin."
3.1 Why Python? The Business Case
Before we write another line of code, it is worth understanding why Python has become the dominant language for business analytics and artificial intelligence. You might reasonably ask: What is wrong with Excel? Why not R? Why not just use SQL?
Nothing is wrong with any of those tools. Each has its place, and you will likely use all of them at various points in your career. The question is not which tool is best in the abstract — it is which tool gives you the broadest leverage for the kinds of work you will do in this course and in the AI-driven business environment you are entering.
Python vs. Excel
Excel is extraordinary for what it does. It is visual, immediate, and nearly universal. For quick calculations, small datasets, and ad hoc analysis, nothing beats a spreadsheet. But Excel has limits:
- Scale. Excel slows down noticeably with datasets above 100,000 rows and has a hard ceiling of roughly 1.05 million rows. Many real-world business datasets — transaction logs, customer interactions, web analytics — exceed this easily.
- Reproducibility. A complex Excel analysis built with formulas, pivot tables, and macros is difficult to audit, share, and reproduce. If you hand someone a 15-tab workbook, they must reverse-engineer your logic by clicking through cells.
- Automation. If you need to run the same analysis every week — say, a Monday morning sales dashboard — Excel requires you to do it manually each time (or build brittle VBA macros). Python scripts run on a schedule.
- Integration. Python connects natively to databases, APIs, cloud services, machine learning libraries, and visualization tools. Excel's connections are limited and often require add-ins.
Business Insight: The goal is not to replace Excel. Many professionals use Python to clean, transform, and analyze data, then export polished results to Excel for stakeholders who prefer the familiar interface. Python and Excel are complementary.
Python vs. R
R is a powerful statistical language with excellent packages for visualization and modeling. In academic statistics departments and some research-heavy organizations, R is the standard. However, Python has surpassed R in business and industry for several reasons:
- General-purpose language. Python is used for web development, automation, DevOps, and software engineering — not just data analysis. This means Python skills transfer across roles.
- Ecosystem. The AI and machine learning ecosystem (TensorFlow, PyTorch, Hugging Face, OpenAI APIs, LangChain) is overwhelmingly Python-first.
- Hiring. Job postings for data-related roles increasingly list Python as a requirement. R remains valued in specific niches (biostatistics, academic research) but Python dominates in industry.
- Readability. Python's syntax was designed to look like pseudocode. This makes it unusually approachable for beginners and easy to read months later.
Python vs. SQL
SQL is the language of databases and remains essential for querying data. You will likely encounter SQL throughout your career. But SQL is a query language — it retrieves and aggregates data. It does not build models, create visualizations, call APIs, or automate workflows. Python complements SQL; in practice, many analysts write SQL to pull data from a database and then switch to Python for analysis and modeling.
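That hand-off can be sketched with Python's built-in sqlite3 module standing in for a company database. The table name, columns, and figures below are invented for illustration:

```python
import sqlite3
import pandas as pd

# Build a tiny in-memory database as a stand-in for a real data warehouse
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Northeast", 1250000), ("Southeast", 980000), ("Midwest", 1100000)],
)

# Step 1: SQL retrieves and filters the data
df = pd.read_sql("SELECT region, revenue FROM sales WHERE revenue > 1000000", conn)

# Step 2: Python/pandas takes over for analysis
print(df["revenue"].mean())
```

The division of labor is the point: the database does the heavy filtering, and pandas handles everything after the data arrives.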
Definition: Python — A high-level, general-purpose programming language created by Guido van Rossum in 1991. Known for its readable syntax and vast ecosystem of third-party libraries. As of 2025, Python is the most widely used programming language in the world (TIOBE Index, IEEE Spectrum, Stack Overflow Developer Survey).
The Bottom Line
Python gives you a single tool that scales from quick calculations to enterprise machine learning pipelines. It is free, open-source, and backed by the largest developer community in the world. Every major AI framework speaks Python. Every cloud provider offers Python SDKs. When you invest in learning Python, you are investing in the lingua franca of modern data work.
3.2 Setting Up Your Environment
Let's get your tools installed. We will use Anaconda, a free distribution that bundles Python with hundreds of commonly used libraries (including pandas, NumPy, and matplotlib), and JupyterLab, an interactive coding environment designed for data analysis.
Step 1: Install Anaconda
- Visit anaconda.com/download in your web browser.
- Download the installer for your operating system (Windows, macOS, or Linux). Choose the latest Python 3.x version (Python 3.11 or later as of this writing).
- Run the installer.
  - Windows: Double-click the .exe file. Accept the license agreement. Choose "Install for: Just Me." Leave the default installation path. On the "Advanced Options" screen, check "Add Anaconda to my PATH environment variable" (despite the warning — it simplifies things). Click Install.
  - macOS: Double-click the .pkg file. Follow the prompts. Accept defaults.
  - Linux: Open a terminal. Run bash Anaconda3-<version>-Linux-x86_64.sh. Accept the license. Accept the default location. Say "yes" when asked to initialize Anaconda.
- Wait for the installation to complete. This may take several minutes — Anaconda installs over 250 packages.
Caution
If you already have a Python installation on your machine, Anaconda will install a separate, independent copy. This is fine. Anaconda manages its own environment and will not interfere with your existing setup.
Step 2: Launch JupyterLab
- Windows: Open "Anaconda Navigator" from your Start menu. Click the "Launch" button under JupyterLab. Alternatively, open "Anaconda Prompt" and type jupyter lab.
- macOS/Linux: Open a terminal and type jupyter lab.
- Your default web browser will open with JupyterLab running at http://localhost:8888/lab.
Step 3: Create Your First Notebook
- In JupyterLab, you will see a launcher page. Under "Notebook," click "Python 3."
- A new tab opens with an empty notebook. The file is called "Untitled.ipynb" by default.
- Right-click the tab and choose "Rename Notebook." Name it chapter_03_practice.ipynb.
You are ready to code.
Try It: If you have not already done so, install Anaconda and launch JupyterLab now. Create a notebook called chapter_03_practice.ipynb. You will use it throughout this chapter.
3.3 The Jupyter Notebook Workflow
A Jupyter notebook is an interactive document that mixes code, text, and output. It is the standard tool for data exploration, prototyping, and communication in data science. Think of it as a lab notebook for data work — you write code, see results immediately, and annotate your thought process as you go.
Cells
A notebook is made up of cells. Each cell is a container that holds either code or text.
- Code cells contain Python code. When you run a code cell (Shift+Enter), Python executes the code and displays the output directly below.
- Markdown cells contain formatted text. You use these to explain your analysis, add section headings, or include notes. To change a cell to Markdown, click on it and select "Markdown" from the dropdown menu in the toolbar (which defaults to "Code").
Running Cells
There are three ways to run a cell:
| Method | Action |
|---|---|
| Shift+Enter | Run the current cell and move to the next cell |
| Ctrl+Enter | Run the current cell and stay on it |
| Alt+Enter | Run the current cell and create a new cell below |
Shift+Enter is the one you will use 90% of the time.
Execution Order Matters
Cells run in the order you execute them, not the order they appear on the page. This is a critical concept. If you define a variable in cell 5 but run cell 10 first, cell 10 will not know about that variable. The number in brackets to the left of each cell — [1], [2], [3] — tells you the execution order.
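The failure mode is easy to demonstrate. The sketch below simulates two cells run in the wrong order within a single script — referencing a name before the "cell" that defines it has run raises a NameError:

```python
# In a notebook, if this cell ran before the defining cell below,
# Python would not know the name yet and would raise a NameError.
try:
    print(monthly_target)
except NameError as err:
    print(f"NameError: {err}")

# The "defining cell" — once it has run, the name exists
monthly_target = 40000
print(monthly_target)
```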
Caution
If your notebook starts behaving unexpectedly, go to the "Kernel" menu and select "Restart Kernel and Run All Cells." This clears all variables and re-runs your cells from top to bottom, ensuring everything is in order.
Useful Keyboard Shortcuts
| Shortcut | Action |
|---|---|
| Esc then A | Insert cell above |
| Esc then B | Insert cell below |
| Esc then DD | Delete cell |
| Esc then M | Convert cell to Markdown |
| Esc then Y | Convert cell to Code |
| Tab | Autocomplete variable or function name |
| Shift+Tab | Show documentation for a function |
Business Insight: Jupyter notebooks are not just coding tools — they are communication tools. At many organizations, data scientists share notebooks as reports. The mix of code, output, and explanatory text means a notebook tells the complete story of an analysis: what was done, why, and what the results mean.
3.4 Python Basics: Variables and Data Types
Now let's learn the fundamentals. Every programming language has a few core building blocks, and Python's are especially intuitive.
Variables
A variable is a name that stores a value. You create a variable by using the equals sign (=), which in Python means "assign," not "equals."
store_count = 47
revenue = 1250000.50
company_name = "Athena Retail Group"
is_profitable = True
After running this cell, Python remembers these four values. You can use them later by referencing their names:
print(company_name)
print(store_count)
Athena Retail Group
47
Definition: Variable — A named reference to a value stored in memory. Variables let you reuse values throughout your code without retyping them. If the underlying value changes, you update it in one place.
Naming conventions. Python variable names:
- Must start with a letter or underscore (not a number)
- Can contain letters, numbers, and underscores
- Are case-sensitive (Revenue and revenue are different variables)
- By convention, use lowercase with underscores for readability: monthly_revenue, not monthlyRevenue or MonthlyRevenue
Data Types
Every value in Python has a type that determines what operations you can perform on it. The four fundamental types are:
| Type | Python Name | Example | Business Use |
|---|---|---|---|
| Integer | int | 47 | Counts, IDs, quantities |
| Float | float | 1250000.50 | Revenue, percentages, prices |
| String | str | "Athena Retail Group" | Names, categories, descriptions |
| Boolean | bool | True or False | Flags, yes/no indicators |
You can check the type of any value using the type() function:
print(type(store_count))
print(type(revenue))
print(type(company_name))
print(type(is_profitable))
<class 'int'>
<class 'float'>
<class 'str'>
<class 'bool'>
Basic Operators
Python supports the arithmetic operators you would expect, plus a few extras:
# Arithmetic
total_sales = 500000 + 750000 # Addition: 1,250,000
net_revenue = 1250000 - 300000 # Subtraction: 950,000
tax = 950000 * 0.21 # Multiplication: 199,500
per_store = 1250000 / 47 # Division: 26,595.74...
whole_units = 100 // 7 # Floor division: 14 (drops the decimal)
remainder = 100 % 7 # Modulo: 2 (the remainder)
growth = 1.05 ** 3 # Exponentiation: 1.157625 (5% growth over 3 years)
String operations work differently from numbers:
first_name = "Nkechi"
last_name = "Adeyemi"
full_name = first_name + " " + last_name # Concatenation
print(full_name)
Nkechi Adeyemi
Caution
You cannot add a string and a number directly. "Revenue: " + 1250000 will cause a TypeError. Instead, convert the number to a string first: "Revenue: " + str(1250000), or better yet, use an f-string:
revenue = 1250000
print(f"Total revenue: ${revenue:,.2f}")
Total revenue: $1,250,000.00
The f before the quotation mark tells Python this is a formatted string literal (f-string). Anything inside curly braces {} is evaluated as Python code. The ,.2f part is a format specifier: comma separators, two decimal places, formatted as a float.
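The same format-specifier syntax covers other common business formats. A few illustrative examples (the values here are made up):

```python
growth_rate = 0.073
units_sold = 1250
unit_price = 19.5

print(f"Growth: {growth_rate:.1%}")      # .1% multiplies by 100, one decimal, percent sign
print(f"Units sold: {units_sold:,}")     # comma adds a thousands separator
print(f"Unit price: ${unit_price:.2f}")  # .2f fixes two decimal places
```

This prints Growth: 7.3%, Units sold: 1,250, and Unit price: $19.50.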
Try It: Create variables for a business you know — store count, annual revenue, company name, and whether the company is publicly traded. Print a sentence using an f-string that incorporates all four variables.
Comparison Operators
Comparisons return Boolean values (True or False) and are essential for filtering data:
revenue = 1250000
target = 1000000
print(revenue > target) # True — revenue exceeds target
print(revenue == target) # False — they are not equal
print(revenue != target) # True — they are not equal
print(revenue >= 1250000) # True — greater than or equal
True
False
True
True
Notice that == (double equals) is the comparison operator, while = (single equals) is the assignment operator. This is a common source of confusion for beginners.
3.5 Control Flow: Making Decisions and Repeating Actions
So far, our code runs straight through from top to bottom. But real analysis requires decisions (do different things depending on the data) and repetition (do the same thing for many items). Python handles both elegantly.
If/Else Statements
An if statement runs a block of code only when a condition is true:
quarterly_revenue = 320000
quarterly_target = 300000
if quarterly_revenue >= quarterly_target:
print("Target met! Great quarter.")
surplus = quarterly_revenue - quarterly_target
print(f"Surplus: ${surplus:,.2f}")
else:
print("Target not met. Review needed.")
shortfall = quarterly_target - quarterly_revenue
print(f"Shortfall: ${shortfall:,.2f}")
Target met! Great quarter.
Surplus: $20,000.00
Key syntax rules:
- The condition (quarterly_revenue >= quarterly_target) is followed by a colon :
- The indented block below the if runs only when the condition is True
- The else block runs when the condition is False
- Indentation matters. Python uses indentation (4 spaces, by convention) to define code blocks. This is not optional — it is how Python knows which lines belong to the if and which belong to the else.
For multiple conditions, use elif (short for "else if"):
satisfaction_score = 7.2
if satisfaction_score >= 9.0:
rating = "Excellent"
elif satisfaction_score >= 7.0:
rating = "Good"
elif satisfaction_score >= 5.0:
rating = "Average"
else:
rating = "Needs Improvement"
print(f"Customer satisfaction rating: {rating}")
Customer satisfaction rating: Good
Business Insight: If/else logic maps directly to business rules. Discount tiers, customer segmentation criteria, risk classification — all of these are conditional logic that can be expressed as if/elif/else chains. When you formalize these rules in code, they become testable, auditable, and repeatable.
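As a concrete sketch of that idea, here is a hypothetical discount-tier rule expressed as an if/elif/else chain. The thresholds are invented for illustration:

```python
order_total = 650  # illustrative order amount

# Hypothetical discount tiers — thresholds made up for this sketch
if order_total >= 1000:
    discount = 0.15
elif order_total >= 500:
    discount = 0.10
elif order_total >= 100:
    discount = 0.05
else:
    discount = 0.0

print(f"Order of ${order_total:,} qualifies for a {discount:.0%} discount")
```

Because the conditions are checked top to bottom, an order of $650 matches the second branch and gets 10% — exactly the kind of rule that, once coded, can be tested against every edge case.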
For Loops
A for loop repeats a block of code for each item in a sequence:
stores = ["Downtown", "Airport", "Mall", "Suburban", "University"]
for store in stores:
print(f"Generating report for: {store}")
Generating report for: Downtown
Generating report for: Airport
Generating report for: Mall
Generating report for: Suburban
Generating report for: University
The variable store takes on a new value from the list each time the loop runs. You can name this variable anything — store, s, location — but descriptive names make your code readable.
Combining loops with conditionals:
monthly_sales = [45000, 52000, 38000, 61000, 29000, 55000]
target = 40000
for month_num, sales in enumerate(monthly_sales, start=1):
if sales >= target:
status = "Met"
else:
status = "Missed"
print(f"Month {month_num}: ${sales:,} — Target {status}")
Month 1: $45,000 — Target Met
Month 2: $52,000 — Target Met
Month 3: $38,000 — Target Missed
Month 4: $61,000 — Target Met
Month 5: $29,000 — Target Missed
Month 6: $55,000 — Target Met
Code Explanation: The
enumerate()function pairs each item in a list with its index number. Thestart=1parameter makes the numbering begin at 1 instead of 0. This is a common pattern when you want both the position and the value.
While Loops
A while loop repeats as long as a condition remains true:
balance = 10000
monthly_expense = 1500
month = 0
while balance > 0:
balance -= monthly_expense # Subtract monthly expense
month += 1
print(f"Funds depleted after {month} months")
Funds depleted after 7 months
Caution
A while loop can run forever if the condition never becomes false. Always ensure something inside the loop changes the condition. If your notebook seems frozen, press the stop button (square icon) in the toolbar or go to Kernel > Interrupt Kernel.
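One defensive pattern, sketched here with the runway example from above, is an explicit iteration cap so a mistaken condition cannot loop forever:

```python
balance = 10000
monthly_expense = 1500
month = 0
max_months = 120  # safety cap: stop after 10 years even if the condition never turns false

while balance > 0 and month < max_months:
    balance -= monthly_expense  # subtract monthly expense
    month += 1

print(f"Loop stopped after {month} months")
```

The loop still stops after 7 months as before; the cap only matters if a bug keeps the condition true.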
Try It: Write a for loop that iterates through a list of five product prices. For each price, apply a 15% discount and print the original price and the discounted price.
3.6 Data Structures: Organizing Your Data
Variables hold single values. But business data comes in collections — lists of customers, tables of transactions, mappings of product codes to descriptions. Python provides three essential data structures for organizing collections.
Lists
A list is an ordered, changeable collection of items. Lists are the workhorse data structure in Python.
regions = ["Northeast", "Southeast", "Midwest", "West", "International"]
quarterly_revenue = [1250000, 980000, 1100000, 1450000, 620000]
Accessing items — Lists are zero-indexed, meaning the first item is at position 0:
print(regions[0]) # Northeast (first item)
print(regions[2]) # Midwest (third item)
print(regions[-1]) # International (last item)
Common list operations:
# Add an item
regions.append("Canada")
print(regions)
# ['Northeast', 'Southeast', 'Midwest', 'West', 'International', 'Canada']
# Get the length
print(len(regions)) # 6
# Slice a list (items at index 1, 2, 3 — not including 4)
print(regions[1:4]) # ['Southeast', 'Midwest', 'West']
# Sum numeric lists
print(sum(quarterly_revenue)) # 5400000
print(max(quarterly_revenue)) # 1450000
print(min(quarterly_revenue)) # 620000
List comprehensions — A compact way to create lists. This is one of Python's most powerful features:
# Standard approach
discounted_prices = []
original_prices = [29.99, 49.99, 99.99, 149.99]
for price in original_prices:
discounted_prices.append(price * 0.85)
# List comprehension (same result, one line)
discounted_prices = [price * 0.85 for price in original_prices]
print(discounted_prices)
[25.4915, 42.4915, 84.9915, 127.4915]
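Comprehensions can also filter with an optional if clause at the end. A small sketch, reusing the prices above:

```python
original_prices = [29.99, 49.99, 99.99, 149.99]

# Keep only prices under $100, discounting the ones that qualify by 15%
affordable = [price * 0.85 for price in original_prices if price < 100]
print(affordable)
```

The $149.99 item fails the condition and is dropped, so the result has three prices, not four.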
Dictionaries
A dictionary stores key-value pairs. Use dictionaries when you need to look up values by name rather than by position.
store_info = {
"name": "Downtown Flagship",
"region": "Northeast",
"employees": 45,
"annual_revenue": 3200000,
"is_flagship": True
}
Accessing values:
print(store_info["name"]) # Downtown Flagship
print(store_info["annual_revenue"]) # 3200000
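One wrinkle worth knowing: asking for a key that does not exist with square brackets raises a KeyError. The .get() method returns a fallback value instead, which is useful for incomplete records. A minimal sketch with a trimmed-down store record:

```python
# A trimmed-down store record for illustration
store_info = {"name": "Downtown Flagship", "region": "Northeast"}

print(store_info.get("region", "Unknown"))      # key exists: prints Northeast
print(store_info.get("manager", "Unassigned"))  # key missing: falls back to Unassigned
```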
Adding and updating values:
store_info["year_opened"] = 2018 # Add new key-value pair
store_info["employees"] = 48 # Update existing value
Iterating through a dictionary:
for key, value in store_info.items():
print(f"{key}: {value}")
name: Downtown Flagship
region: Northeast
employees: 48
annual_revenue: 3200000
is_flagship: True
year_opened: 2018
Business Insight: Dictionaries are natural representations of business records. A customer profile, a product specification, a financial report — each is a collection of named attributes. When you later learn about DataFrames in pandas, you will see that each row is essentially a dictionary.
Tuples
A tuple is like a list, but it cannot be changed after creation. Use tuples for data that should not be modified — coordinate pairs, database records, or function return values.
headquarters = ("New York", "NY", 10001)
fiscal_year = (2025, "Q1", "January", "March")
# Access works like lists
print(headquarters[0]) # New York
# But you cannot modify a tuple
# headquarters[0] = "Boston" # This would cause a TypeError
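Tuples pair naturally with unpacking, which assigns each element to its own name in a single statement:

```python
headquarters = ("New York", "NY", 10001)

# Unpacking: one variable per element, in order
city, state, zip_code = headquarters
print(f"{city}, {state} {zip_code}")
```

This is why functions often return tuples — the caller can unpack the results directly into named variables.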
When to Use Each
| Structure | Use When | Example |
|---|---|---|
| List | You have an ordered collection that may change | Store names, monthly sales figures |
| Dictionary | You need to look up values by name | Customer profiles, configuration settings |
| Tuple | You have fixed data that should not change | Geographic coordinates, RGB color values |
Try It: Create a dictionary representing a product in a retail catalog. Include at least five attributes (name, price, category, SKU, in-stock status). Then write a for loop that prints each attribute and its value.
3.7 Functions: Writing Reusable Code
As your code grows, you will find yourself repeating the same operations. Functions let you wrap reusable logic into named blocks that you can call whenever you need them.
Defining a Function
def calculate_profit_margin(revenue, costs):
"""Calculate profit margin as a percentage."""
profit = revenue - costs
margin = (profit / revenue) * 100
return margin
Let's break this down:
- def tells Python you are defining a function.
- calculate_profit_margin is the function's name (descriptive, lowercase, underscores).
- revenue and costs are parameters — inputs the function expects.
- The triple-quoted string ("""...""") is a docstring — a brief description of what the function does. This is a best practice.
- return margin sends the result back to the caller.
Calling a Function
q1_margin = calculate_profit_margin(revenue=1250000, costs=875000)
print(f"Q1 Profit Margin: {q1_margin:.1f}%")
Q1 Profit Margin: 30.0%
Functions with Default Parameters
def apply_discount(price, discount_rate=0.10):
"""Apply a discount to a price. Default discount is 10%."""
return price * (1 - discount_rate)
# Using the default discount
print(apply_discount(99.99)) # 89.991
# Specifying a different discount
print(apply_discount(99.99, 0.25)) # 74.9925
Functions That Process Lists
def summarize_sales(sales_list):
"""Return a summary dictionary for a list of sales figures."""
return {
"total": sum(sales_list),
"average": sum(sales_list) / len(sales_list),
"max": max(sales_list),
"min": min(sales_list),
"count": len(sales_list)
}
monthly_sales = [45000, 52000, 38000, 61000, 29000, 55000,
48000, 63000, 41000, 57000, 34000, 72000]
summary = summarize_sales(monthly_sales)
for metric, value in summary.items():
print(f"{metric}: ${value:,.2f}")
total: $595,000.00
average: $49,583.33
max: $72,000.00
min: $29,000.00
count: $12.00
Code Explanation: The function returns a dictionary with five summary statistics. After calling the function, we iterate through the dictionary to print each metric. Notice that count displays as $12.00 — that is a formatting artifact because we applied the dollar format to all values. In practice, you would format each metric appropriately. This is the kind of small issue you catch through iteration.
Why Functions Matter for Business
Functions embody a principle from both software engineering and business process management: do not repeat yourself. If you write the same calculation in five notebook cells, and then the formula changes, you must find and update all five. With a function, you update it once.
Functions also make your code self-documenting. Compare:
# Without a function
result = (revenue - costs) / revenue * 100
# With a function
result = calculate_profit_margin(revenue, costs)
The second version tells you what the code does even if you do not remember the formula.
Try It: Write a function called classify_customer that takes a customer's annual spending as input and returns a string: "Platinum" for spending above $10,000, "Gold" for $5,000-$10,000, "Silver" for $1,000-$5,000, and "Bronze" for below $1,000. Test it with several values.
3.8 Introduction to pandas
Everything you have learned so far — variables, loops, functions — is standard Python. Now we introduce the library that makes Python the dominant language for business data analysis: pandas.
Pandas (the name comes from "panel data," an econometrics term) provides two core data structures: the Series (a single column of data) and the DataFrame (a table of data with rows and columns). If Python is the language, pandas is the dialect that data analysts speak.
Importing pandas
import pandas as pd
This line loads the pandas library and gives it the nickname pd. This is a universal convention — every pandas tutorial, every data science team, every Stack Overflow answer uses pd. When you see pd.something, it means "use a function from pandas."
Series: A Single Column
A Series is a one-dimensional labeled array:
revenue = pd.Series(
[1250000, 980000, 1100000, 1450000, 620000],
index=["Northeast", "Southeast", "Midwest", "West", "International"],
name="Annual Revenue"
)
print(revenue)
Northeast 1250000
Southeast 980000
Midwest 1100000
West 1450000
International 620000
Name: Annual Revenue, dtype: int64
You can perform operations on the entire Series at once:
# Revenue in millions
print(revenue / 1_000_000)
# Total revenue
print(f"Total: ${revenue.sum():,.0f}")
# Average revenue
print(f"Average: ${revenue.mean():,.0f}")
DataFrame: A Table of Data
A DataFrame is a two-dimensional table — like a spreadsheet or a SQL table. Each column is a Series, and columns can have different data types.
data = {
"store": ["Downtown", "Airport", "Mall", "Suburban", "University"],
"region": ["Northeast", "Northeast", "Southeast", "Midwest", "West"],
"employees": [45, 30, 55, 20, 15],
"annual_revenue": [3200000, 2800000, 4100000, 1500000, 900000],
"year_opened": [2018, 2020, 2015, 2021, 2022]
}
stores_df = pd.DataFrame(data)
print(stores_df)
store region employees annual_revenue year_opened
0 Downtown Northeast 45 3200000 2018
1 Airport Northeast 30 2800000 2020
2 Mall Southeast 55 4100000 2015
3 Suburban Midwest 20 1500000 2021
4 University West 15 900000 2022
Definition: DataFrame — A two-dimensional, size-mutable, and potentially heterogeneous tabular data structure provided by the pandas library. Think of it as a spreadsheet in your code — rows are records, columns are fields, and the whole thing is programmable.
Reading Data from Files
In practice, you will rarely type data into your code. Instead, you load it from files — most commonly CSV (Comma-Separated Values) files.
# Read a CSV file into a DataFrame
sales_df = pd.read_csv("athena_monthly_sales.csv")
# See the first 5 rows
sales_df.head()
Other common formats:
# Excel files
df = pd.read_excel("report.xlsx", sheet_name="Q1 Sales")
# JSON files
df = pd.read_json("api_response.json")
# From a SQL database
# df = pd.read_sql("SELECT * FROM sales", connection)
Exploring Your Data
When you load a new dataset, your first step should always be understanding its shape and content. Pandas provides several methods for this:
# How many rows and columns?
print(sales_df.shape) # (240, 6) means 240 rows, 6 columns
# Column names and data types
print(sales_df.dtypes)
# Summary statistics for numeric columns
sales_df.describe()
# First few rows
sales_df.head()
# Last few rows
sales_df.tail()
# General information (data types, non-null counts, memory usage)
sales_df.info()
Business Insight: The describe() method is your one-line executive summary. It shows count, mean, standard deviation, min, max, and quartiles for every numeric column. Get in the habit of calling it first whenever you load new data.
Try It: If you have any CSV file on your computer (a spreadsheet export, a downloaded dataset), try loading it with pd.read_csv(). If not, create a small DataFrame from a dictionary as shown above. Call .head(), .shape, .dtypes, and .describe() on it.
3.9 Basic Data Manipulation with pandas
Now that you can load data into a DataFrame, let's learn how to work with it. These operations — selecting, filtering, grouping, and aggregating — are the core of data analysis and the ones you will use in every chapter of this book going forward.
Selecting Columns
# Select a single column (returns a Series)
regions = sales_df["region"]
# Select multiple columns (returns a DataFrame)
subset = sales_df[["store", "revenue", "region"]]
Filtering Rows
Filtering is how you answer questions like "Which stores missed their target?" or "Show me only Northeast region data."
# Filter for rows where revenue exceeds 50,000
high_revenue = sales_df[sales_df["revenue"] > 50000]
# Filter for a specific region
northeast = sales_df[sales_df["region"] == "Northeast"]
# Combine conditions with & (and) or | (or)
# Stores in the Northeast with revenue above 50,000
ne_high = sales_df[
(sales_df["region"] == "Northeast") & (sales_df["revenue"] > 50000)
]
Caution
When combining conditions in pandas filters, each condition must be wrapped in parentheses, and you must use & (not and) and | (not or). This is one of the most common beginner mistakes.
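When a filter needs to match several values in the same column, the .isin() method reads more cleanly than chaining | conditions. A sketch with a small illustrative DataFrame (the data here is made up):

```python
import pandas as pd

# Illustrative data — column names mirror the examples above
sales_df = pd.DataFrame({
    "store": ["Downtown", "Airport", "Mall", "Suburban"],
    "region": ["Northeast", "Northeast", "Southeast", "Midwest"],
    "revenue": [62000, 48000, 71000, 39000],
})

# Keep rows whose region matches any value in the list
east = sales_df[sales_df["region"].isin(["Northeast", "Southeast"])]
print(east)
```

The equivalent with | would need two parenthesized conditions; .isin() scales to any number of values without getting longer.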
Sorting
# Sort by revenue, highest first
sorted_df = sales_df.sort_values("revenue", ascending=False)
# Sort by region (alphabetically), then by revenue within each region
sorted_df = sales_df.sort_values(["region", "revenue"], ascending=[True, False])
Creating New Columns
# Revenue per employee
sales_df["revenue_per_employee"] = sales_df["revenue"] / sales_df["employees"]
# Performance category
sales_df["performance"] = sales_df["revenue"].apply(
lambda x: "High" if x > 50000 else "Low"
)
Code Explanation: The apply() method runs a function on every value in a column. The lambda keyword creates a small anonymous function inline. This pattern is extremely common for creating categorical columns from numeric data.
GroupBy and Aggregation
groupby() is the pandas equivalent of a pivot table — and one of the most powerful operations in data analysis. It splits your data into groups, applies a calculation to each group, and combines the results.
# Average revenue by region
sales_df.groupby("region")["revenue"].mean()
region
International 41500.0
Midwest 52000.0
Northeast 63750.0
Southeast 58000.0
West 45500.0
Name: revenue, dtype: float64
# Multiple aggregations at once
region_summary = sales_df.groupby("region")["revenue"].agg(["sum", "mean", "count"])
print(region_summary)
# Group by multiple columns
category_region = sales_df.groupby(["region", "category"])["revenue"].sum()
print(category_region)
Business Insight: The groupby() operation is your most versatile analytical tool. "What is the average order value by customer segment?" — groupby. "What is the total revenue by quarter and product line?" — groupby. "How many support tickets were opened by region and priority level?" — groupby. Master this one method, and you can answer most business reporting questions.

Try It: Using any DataFrame you have created, practice these operations:
1. Select two columns and display them.
2. Filter for rows meeting a specific condition.
3. Create a new calculated column.
4. Group by one column and calculate the mean of another.
3.10 Your First Analysis: Athena Retail Sales Data
It is time to put everything together. In this section, we will walk through a complete analysis of sales data from Athena Retail Group — the same kind of analysis that Ravi Mehta's data team runs each month to identify underperforming store categories.
Athena Update: Ravi Mehta, VP of Data & AI at Athena Retail Group, has tasked his team with building a standardized monthly sales report. Previously, this report was assembled manually in Excel — an analyst would download data from the point-of-sale system, create pivot tables, format charts, and email the result to regional managers. The process took most of a day. Ravi wants to automate it. This analysis is the first step.
The Dataset
We will work with a synthetic dataset representing 12 months of sales data across Athena's stores. The CSV file has the following columns:
| Column | Description | Example |
|---|---|---|
| month | Month of the record | "2025-01" |
| store | Store name | "Downtown Flagship" |
| region | Geographic region | "Northeast" |
| category | Product category | "Electronics" |
| revenue | Monthly revenue ($) | 87500 |
| units_sold | Number of units sold | 1250 |
Let's begin by creating the dataset. In a real workflow, you would load this from a CSV file that Athena's systems generate. Here, we will create it programmatically so the analysis is fully reproducible.
import pandas as pd
import numpy as np
# Set random seed for reproducibility
np.random.seed(42)
# Define the dimensions
months = pd.date_range("2025-01", periods=12, freq="MS").strftime("%Y-%m").tolist()
stores = {
    "Downtown Flagship": "Northeast",
    "Airport Express": "Northeast",
    "Riverside Mall": "Southeast",
    "Lakeside Plaza": "Southeast",
    "Central Station": "Midwest",
    "Prairie View": "Midwest",
    "Harbor Point": "West",
    "Mountain Ridge": "West",
    "University Square": "West",
    "Metro Hub": "Northeast"
}
categories = ["Electronics", "Apparel", "Home & Garden", "Grocery"]
# Generate the data
rows = []
for month in months:
    for store, region in stores.items():
        for category in categories:
            # Base revenue varies by store type and category
            base = {
                "Electronics": 65000, "Apparel": 40000,
                "Home & Garden": 30000, "Grocery": 55000
            }[category]
            # Regional multipliers
            region_mult = {
                "Northeast": 1.2, "Southeast": 1.0,
                "Midwest": 0.85, "West": 1.1
            }[region]
            # Store-specific adjustment
            store_mult = 1.3 if "Flagship" in store else (
                0.7 if "Prairie" in store else 1.0
            )
            # Seasonal pattern (higher in Q4)
            month_num = int(month.split("-")[1])
            seasonal = 1.0 + (0.3 if month_num in [11, 12] else (
                -0.1 if month_num in [1, 2] else 0.0
            ))
            revenue = base * region_mult * store_mult * seasonal
            revenue = int(revenue * np.random.uniform(0.85, 1.15))
            units = int(revenue / np.random.uniform(35, 75))
            rows.append({
                "month": month,
                "store": store,
                "region": region,
                "category": category,
                "revenue": revenue,
                "units_sold": units
            })
sales_df = pd.DataFrame(rows)
print(f"Dataset created: {sales_df.shape[0]} rows, {sales_df.shape[1]} columns")
sales_df.head(10)
Code Explanation: This block creates a realistic dataset with 480 rows (12 months x 10 stores x 4 categories). The revenue figures are not purely random — they reflect real business patterns: regional differences in purchasing power, flagship stores outperforming smaller locations, and a seasonal spike in Q4 (November-December). The np.random.uniform(0.85, 1.15) call adds noise so the numbers are not perfectly formulaic.
Step 1: Explore the Data
# Basic shape and structure
print(f"Rows: {sales_df.shape[0]}")
print(f"Columns: {sales_df.shape[1]}")
print(f"Months covered: {sales_df['month'].nunique()}")
print(f"Stores: {sales_df['store'].nunique()}")
print(f"Categories: {sales_df['category'].nunique()}")
print()
# Summary statistics
sales_df.describe()
# Check for missing values
print(sales_df.isnull().sum())
Code Explanation: The nunique() method counts the number of unique values in a column — a quick way to understand the cardinality of categorical fields. The isnull().sum() check is a habit you should develop: always verify whether your data contains missing values before analyzing it.
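When the check does find missing values, you have two standard options: drop the incomplete rows or fill the gaps. A minimal sketch, using an invented three-row DataFrame with one missing revenue figure:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "store": ["A", "B", "C"],
    "revenue": [62000, np.nan, 58000],
})

print(df.isnull().sum())  # revenue has 1 missing value

# Option 1: drop rows with any missing value
dropped = df.dropna()
print(len(dropped))  # 2

# Option 2: fill the gap, here with the column mean
filled = df.fillna({"revenue": df["revenue"].mean()})
print(filled["revenue"].isnull().sum())  # 0
```

Which option is right is a business question: dropping rows discards information, while filling with the mean can mask a data-quality problem upstream. For the Athena dataset we generate below, no values are missing, so the check simply confirms a clean starting point.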
Step 2: Summary Statistics
# Total revenue across all stores and months
total_revenue = sales_df["revenue"].sum()
print(f"Total Annual Revenue: ${total_revenue:,.0f}")
# Average monthly revenue per store
avg_monthly = sales_df.groupby("store")["revenue"].mean().sort_values(ascending=False)
print("\nAverage Monthly Revenue by Store:")
print(avg_monthly.apply(lambda x: f"${x:,.0f}"))
Step 3: Regional Analysis
# Revenue by region
region_summary = sales_df.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    avg_revenue=("revenue", "mean"),
    total_units=("units_sold", "sum"),
    store_count=("store", "nunique")
).sort_values("total_revenue", ascending=False)
# Add revenue per store
region_summary["revenue_per_store"] = (
    region_summary["total_revenue"] / region_summary["store_count"]
)
print("Regional Performance Summary:")
print(region_summary)
Code Explanation: The agg() method lets you apply multiple aggregation functions at once and name the resulting columns. The keyword argument total_revenue=("revenue", "sum") means "apply the sum function to the revenue column and call the result total_revenue." This is more flexible than calling .sum() or .mean() individually.
Step 4: Identify Underperforming Stores
This is the analysis Ravi Mehta's team cares about most — which store-category combinations are falling behind?
# Average revenue by store and category
store_category = sales_df.groupby(["store", "category"])["revenue"].mean().reset_index()
store_category.columns = ["store", "category", "avg_monthly_revenue"]
# Calculate the overall average for each category
category_avg = sales_df.groupby("category")["revenue"].mean()
print("Category Averages:")
for cat, avg in category_avg.items():
    print(f"  {cat}: ${avg:,.0f}")
# Flag underperformers: stores whose category average is below 80% of
# the overall category average
underperformers = []
for _, row in store_category.iterrows():
    cat_avg = category_avg[row["category"]]
    if row["avg_monthly_revenue"] < cat_avg * 0.80:
        underperformers.append({
            "store": row["store"],
            "category": row["category"],
            "avg_revenue": row["avg_monthly_revenue"],
            "category_avg": cat_avg,
            "pct_of_avg": (row["avg_monthly_revenue"] / cat_avg) * 100
        })
underperformer_df = pd.DataFrame(underperformers)
underperformer_df = underperformer_df.sort_values("pct_of_avg")
print("\nUnderperforming Store-Categories (below 80% of category average):")
print(f"Found {len(underperformer_df)} underperforming combinations.\n")
print(underperformer_df.to_string(index=False))
Business Insight: Defining "underperforming" requires a threshold. We used 80% of the category average, but this is a business decision, not a technical one. Different thresholds surface different stores. In practice, you would work with stakeholders to define what "underperforming" means for their context.
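One way to make that business decision explicit is to turn the threshold into a parameter, so stakeholders can see how the list of flagged stores changes as it moves. The sketch below wraps the flagging logic in a hypothetical helper, flag_underperformers(), and runs it on a tiny invented dataset:

```python
import pandas as pd

def flag_underperformers(df, threshold=0.80):
    """Return store-category rows whose average revenue falls below
    `threshold` times the overall category average."""
    cat_avg = df.groupby("category")["revenue"].mean()
    by_store = df.groupby(["store", "category"])["revenue"].mean().reset_index()
    by_store["pct_of_avg"] = by_store.apply(
        lambda r: r["revenue"] / cat_avg[r["category"]] * 100, axis=1
    )
    return by_store[by_store["pct_of_avg"] < threshold * 100]

# Toy data: store B's Electronics lags far behind store A's
toy = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "category": ["Electronics", "Grocery", "Electronics", "Grocery"],
    "revenue": [70000, 50000, 30000, 48000],
})

# A stricter threshold surfaces more combinations
print(len(flag_underperformers(toy, threshold=0.80)))  # 1
print(len(flag_underperformers(toy, threshold=0.99)))  # 2
```

Presenting results at two or three thresholds side by side is often more persuasive to stakeholders than defending a single cutoff.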
Step 5: Category Mix by Region
# What percentage of each region's revenue comes from each category?
region_category = sales_df.groupby(["region", "category"])["revenue"].sum().reset_index()
region_totals = sales_df.groupby("region")["revenue"].sum()
region_category["pct_of_regional_revenue"] = region_category.apply(
    lambda row: (row["revenue"] / region_totals[row["region"]]) * 100,
    axis=1
)
# Pivot for a cleaner view
category_mix = region_category.pivot(
    index="region", columns="category", values="pct_of_regional_revenue"
).round(1)
print("Category Mix by Region (% of regional revenue):")
print(category_mix)
Code Explanation: The pivot() method reshapes data from "long" format (one row per region-category combination) to "wide" format (one row per region, one column per category). This is the pandas equivalent of creating a pivot table in Excel.
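One caveat worth knowing: pivot() requires each index/column pair to appear exactly once, and raises an error if there are duplicates. The related pivot_table() method handles duplicates by aggregating them. A minimal sketch with an invented long-format DataFrame:

```python
import pandas as pd

# Two rows share the (Northeast, Electronics) combination
long_df = pd.DataFrame({
    "region": ["Northeast", "Northeast", "West"],
    "category": ["Electronics", "Electronics", "Grocery"],
    "revenue": [60000, 65000, 40000],
})

# pivot() cannot decide which duplicate to keep, so it raises
try:
    long_df.pivot(index="region", columns="category", values="revenue")
except ValueError as err:
    print("pivot() failed:", err)

# pivot_table() aggregates duplicates instead (here, summing them)
wide = long_df.pivot_table(index="region", columns="category",
                           values="revenue", aggfunc="sum")
print(wide)
```

In the Step 5 code above, pivot() is safe because the groupby has already collapsed the data to one row per region-category pair; on raw transaction data, reach for pivot_table() instead.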
Step 6: Create a Simple Visualization
import matplotlib.pyplot as plt
# Monthly revenue trend
monthly_revenue = sales_df.groupby("month")["revenue"].sum()
plt.figure(figsize=(10, 5))
plt.bar(monthly_revenue.index, monthly_revenue.values, color="steelblue")
plt.title("Athena Retail Group — Monthly Revenue (2025)", fontsize=14)
plt.xlabel("Month")
plt.ylabel("Total Revenue ($)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Code Explanation: matplotlib is Python's foundational plotting library. The plt.bar() function creates a bar chart. The figsize=(10, 5) argument sets the chart width to 10 inches and height to 5 inches. plt.tight_layout() prevents labels from being cut off. We will explore visualization in depth in Chapter 5.
# Revenue by region — horizontal bar chart
region_revenue = sales_df.groupby("region")["revenue"].sum().sort_values()
plt.figure(figsize=(8, 4))
plt.barh(region_revenue.index, region_revenue.values, color="darkorange")
plt.title("Total Revenue by Region (2025)", fontsize=14)
plt.xlabel("Total Revenue ($)")
plt.tight_layout()
plt.show()
Step 7: Draw Business Conclusions
# Compile key findings
print("=" * 60)
print("ATHENA RETAIL GROUP — MONTHLY SALES ANALYSIS")
print("Period: January 2025 – December 2025")
print("=" * 60)
total = sales_df["revenue"].sum()
print(f"\nTotal Revenue: ${total:,.0f}")
print(f"Total Units Sold: {sales_df['units_sold'].sum():,}")
print(f"\nTop Region by Revenue: "
      f"{sales_df.groupby('region')['revenue'].sum().idxmax()}")
print(f"Top Store by Avg Monthly Revenue: "
      f"{sales_df.groupby('store')['revenue'].mean().idxmax()}")
print(f"Top Category by Revenue: "
      f"{sales_df.groupby('category')['revenue'].sum().idxmax()}")
print(f"\nUnderperforming Store-Categories: {len(underperformer_df)}")
print("(Defined as below 80% of category average)")
print("\nRecommendation: Regional managers should review the"
      f" {len(underperformer_df)} underperforming store-category"
      " combinations identified above. Priority should be given"
      " to stores where multiple categories underperform, as this"
      " may indicate systemic issues (staffing, location, inventory).")
NK stared at the output on her screen. Forty minutes earlier, she had not known what a DataFrame was. Now she had a complete analysis — summary statistics, regional breakdowns, underperformer identification, charts, and a written recommendation — all in a single notebook.
"I expected this to take days," she said.
Tom looked at her notebook. "You know what's interesting? I would have written half of this differently — more loops, more complex code. But yours is cleaner. You went straight to the pandas functions because you didn't know the 'hard way' yet."
Professor Okonkwo smiled from the front of the room. "That is precisely the point. You are not learning to be software engineers. You are learning to ask better questions — and to get answers faster than your competitors."
Athena Update: In the real Athena Retail Group, Ravi Mehta's team has automated a version of this analysis. It runs every Monday morning, pulls fresh data from the point-of-sale system, and generates a notebook that regional managers can open in their browsers. The analysis that once took an analyst most of a day now runs in under 30 seconds — and the analyst who used to build the report manually has been reassigned to more strategic work: building the customer segmentation model we will see in Chapter 9.
3.11 Common Errors and Debugging
Every programmer makes mistakes. The difference between beginners and experienced coders is not that experienced coders make fewer errors — it is that they read error messages and fix problems faster. Here are the errors you will encounter most often.
NameError
print(revnue)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-1-...> in <module>
----> 1 print(revnue)
NameError: name 'revnue' is not defined
What happened: You referenced a variable that does not exist. Usually a typo — here, revnue instead of revenue.
Fix: Check the spelling. If you are sure the variable exists, you may not have run the cell that defines it. Check the execution order.
TypeError
revenue = 1250000
label = "Revenue: " + revenue
TypeError: can only concatenate str (not "int") to str
What happened: You tried to combine a string and a number with +. Python does not automatically convert types.
Fix: Convert the number to a string: "Revenue: " + str(revenue), or use an f-string: f"Revenue: {revenue}".
SyntaxError
if revenue > 1000000
    print("High revenue")
SyntaxError: expected ':'
What happened: You forgot the colon at the end of the if statement.
Fix: Add the colon: if revenue > 1000000:
IndentationError
if revenue > 1000000:
print("High revenue")
IndentationError: expected an indented block after 'if' statement
What happened: The line after if must be indented. Python uses indentation to define code blocks.
Fix: Add four spaces (or press Tab) before print:
if revenue > 1000000:
    print("High revenue")
KeyError (pandas)
sales_df["revenues"]
KeyError: 'revenues'
What happened: You tried to access a column that does not exist in the DataFrame. The column is called revenue (singular), not revenues.
Fix: Check column names with print(sales_df.columns) or sales_df.dtypes.
IndexError
regions = ["Northeast", "Southeast", "Midwest"]
print(regions[5])
IndexError: list index out of range
What happened: You tried to access index 5, but the list only has 3 items (indices 0, 1, 2).
Fix: Check the length with len(regions) and use a valid index.
The Debugging Mindset
When you encounter an error:
- Read the error message. Python's error messages tell you the type of error and the exact line where it occurred. Start there.
- Check the last line of the traceback. It contains the most specific information about what went wrong.
- Use print() to inspect variables. If you are not sure what a variable contains, print it. Print its type with type(). Print its shape with .shape if it is a DataFrame.
- Restart and run all. If your notebook is in a confusing state, go to Kernel > Restart Kernel and Run All Cells. This resets everything and runs cells in order.
- Search the error message. Copy the error message into a search engine. You are almost never the first person to encounter a given error, and Stack Overflow usually has the answer.
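The inspection habit from step 3 can be as simple as a few print calls dropped into a cell. A quick sketch on an invented DataFrame:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["Northeast", "West"],
    "revenue": [60000, 40000],
})

# When a result looks wrong, inspect before guessing:
print(type(sales))               # confirm it is a DataFrame, not a Series
print(sales.shape)               # (2, 2) — rows and columns
print(sales.columns.tolist())    # exact column names (catches typos)
print(sales.dtypes)              # data types (catches numbers stored as text)
```

Checking the column names alone resolves a surprising share of KeyErrors, and checking dtypes catches the classic case of numeric columns loaded as strings.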
Business Insight: Debugging is not a sign of failure — it is a normal part of the coding process. Senior data scientists at top companies spend significant time reading error messages and testing their code. The willingness to confront errors calmly and systematically is a professional skill, not just a technical one.
Try It: Intentionally introduce each of the errors described above into your notebook. Run the code, read the error message, and then fix it. Building familiarity with error messages now will save you hours of frustration later.
3.12 Putting It All Together: Your Python Toolkit
Let's take stock of what you have learned in this chapter:
| Concept | What It Does | Business Application |
|---|---|---|
| Variables | Store values | Hold metrics, thresholds, settings |
| Data types | Define what operations are valid | Ensure data integrity |
| If/else | Make decisions | Apply business rules |
| For loops | Repeat operations | Process lists of stores, products |
| Functions | Encapsulate reusable logic | Standardize calculations |
| Lists | Store ordered collections | Hold categories, time periods |
| Dictionaries | Store key-value pairs | Represent records, configurations |
| pandas DataFrame | Store tabular data | Your primary analysis surface |
| Filtering | Subset data by condition | Find underperformers, outliers |
| GroupBy | Aggregate by category | Regional summaries, segment analysis |
| Visualization | Create charts | Communicate findings |
This is your foundation. Every chapter from here forward builds on these skills:
- Chapter 5 (Exploratory Data Analysis) will deepen your pandas and visualization work with the EDAReport tool.
- Chapter 7 (Supervised Learning — Classification) will use pandas to prepare data for a churn prediction model.
- Chapter 19 (Prompt Engineering) will combine Python with AI APIs.
You do not need to memorize every function or method. You need to understand the concepts well enough to know what is possible, and then look up the specifics when you need them. The pandas documentation, Stack Overflow, and — increasingly — AI assistants are legitimate tools that every professional data analyst uses daily.
Chapter Summary
NK closed her laptop at the end of class and sat still for a moment. She had arrived expecting to be lost — she had told herself that coding was for "technical people," a category she had spent her career carefully staying outside of. Instead, she had loaded a dataset, filtered it, grouped it by region, identified underperformers, and created a chart. The code was not elegant, and she had made mistakes along the way (one NameError, two missing colons, and a filter that returned zero rows because she forgot the parentheses). But the mistakes had been fixable, and the final result was real.
Tom, packing up next to her, said: "You know what I learned today?"
"How to be patient with a beginner?" NK said.
"No — well, yes, that too. But explaining things to you made me realize how much I do on autopilot. You asked me why I used a dictionary instead of a list, and I had to actually think about it. I've been writing code for years without thinking about why I make certain choices."
Professor Okonkwo, passing by, overheard. "That," she said, "is the difference between a coder and a professional. The professional asks why."
NK nodded. She was not a software engineer, and she did not need to be. But she was no longer afraid of the blinking cursor.
Looking Ahead: In the next chapter, we will explore Data Strategy and Data Literacy — the organizational and governance questions that determine whether data becomes a strategic asset or a liability. Then, in Chapter 5, we will return to Python with a deeper dive into exploratory data analysis, visualization, and the EDAReport class that will become a standard tool in your analytical workflow.