Case Study 11-1: Priya Explores the Acme Sales Dataset
Characters: Priya Okonkwo (Senior Analyst), Marcus Webb (IT Manager)
Setting: Acme Corp headquarters, a Monday morning in early January 2024
Skill focus: Systematic dataset exploration, using .info(), .describe(), .value_counts(), and missing value detection
The Setup
Priya Okonkwo arrives at her desk to find a calendar invite from Sandra Chen, VP of Sales, scheduled for Wednesday afternoon. The subject line: "2023 Annual Sales Performance Review — need your analysis." Attached to the invite is a two-sentence note:
"Priya, I need a complete picture of how 2023 went: which regions performed best, which products drove revenue, and whether our discount strategy was effective. Marcus will send you the raw data this morning."
At 9:14 AM, an email arrives from Marcus Webb:
"Priya — attached is the full 2023 sales transaction file:
acme_sales_2023.csv. I pulled it directly from the CRM export. 2,500 rows, should be everything. Let me know if anything looks off."
Priya downloads the file, opens a new Python script, and begins her standard data exploration workflow. She has learned from experience: do not assume a file is clean just because it came from a reliable source.
Step 1: Loading the File
Priya starts with a careful, intentional load. She knows the file is a CSV from the CRM system, and she knows dates and product SKUs need special treatment.
import pandas as pd
# Load the 2023 sales file
# - parse_dates: convert date column to datetime objects
# - dtype: keep product_sku as string (SKUs can have leading zeros)
# - na_values: catch common missing-value representations from the CRM export
df = pd.read_csv(
"acme_sales_2023.csv",
parse_dates=["date"],
dtype={"product_sku": str},
na_values=["N/A", "n/a", "--", ""]
)
print("File loaded successfully.")
print(f"Shape: {df.shape}")
Output:
File loaded successfully.
Shape: (2500, 9)
Good — the row count matches what Marcus said. Nine columns: date, region, salesperson, product_sku, category, units_sold, unit_price, discount_rate, total_revenue.
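She confirms the column names with a one-liner before moving on:
# Confirm the nine column names straight from the DataFrame
print(list(df.columns))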
Step 2: First Visual Check
Priya always looks at the first and last few rows immediately. This catches formatting issues, extra header rows, or summary totals that sometimes get accidentally included in data exports.
print("=== FIRST 5 ROWS ===")
print(df.head())
print("\n=== LAST 5 ROWS ===")
print(df.tail())
Output (first 5 rows):
date region salesperson product_sku category units_sold unit_price discount_rate total_revenue
0 2023-01-03 Northeast Jordan Blake OFF-CH-0045 Office Chairs 2 249.99 0.10 449.98
1 2023-01-03 Southwest Casey Rivera DESK-SIT-012 Desks 1 899.99 0.15 764.99
2 2023-01-04 Midwest Alex Torres TECH-MON-089 Technology 3 349.99 0.05 997.47
3 2023-01-05 NaN Marcus Webb OFF-CH-0046 Office Chairs 1 299.99 0.00 299.99
4 2023-01-05 Northeast Jordan Blake STOR-CAB-023 Storage 4 129.99 0.10 467.96
Priya stops immediately at row 3. The region column is NaN — Marcus Webb's transaction is missing a region. She notes this but continues her systematic review before drawing conclusions.
The last 5 rows look normal: no summary row, no "TOTAL" label, no blank rows at the bottom.
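If she wanted to isolate the flagged rows immediately, a boolean mask would do it, though she defers the deeper look until the full review:
# Pull only the rows where region is missing
print(df[df["region"].isna()].head())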
Step 3: Data Types
print("=== DATA TYPES ===")
print(df.dtypes)
Output:
date datetime64[ns]
region object
salesperson object
product_sku object
category object
units_sold int64
unit_price float64
discount_rate float64
total_revenue float64
dtype: object
Priya checks each type against her expectations:
- date is datetime64[ns]: correct, the parse_dates parameter worked.
- region, salesperson, product_sku, category are object (string): expected.
- units_sold is int64: expected.
- unit_price, discount_rate, total_revenue are float64: expected.
No surprises here. But she has already seen a missing region value. The real question is how many there are.
Step 4: The .info() Summary
print("=== DATASET INFO ===")
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 2500 non-null datetime64[ns]
1 region 2487 non-null object
2 salesperson 2500 non-null object
3 product_sku 2500 non-null object
4 category 2495 non-null object
5 units_sold 2500 non-null int64
6 unit_price 2500 non-null float64
7 discount_rate 2500 non-null float64
8 total_revenue 2500 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(1), object(4)
memory usage: 175.9+ KB
Priya highlights two lines and writes in her notebook:
Issue #1: region has 13 missing values (2500 - 2487 = 13)
Issue #2: category has 5 missing values (2500 - 2495 = 5)
Thirteen out of 2,500 is 0.52% — small but not trivial. If she is grouping revenue by region, those 13 transactions will be excluded or placed in an "Unknown" bucket. She needs to decide what to do with them.
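Rather than subtracting non-null counts by hand, she can ask pandas for the missing counts directly:
# Missing values per column, computed directly
print(df.isna().sum())
# The same counts as a percentage of all rows
print((df.isna().mean() * 100).round(2))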
Step 5: Statistical Summary
print("=== DESCRIPTIVE STATISTICS ===")
print(df.describe().round(2))
Output:
units_sold unit_price discount_rate total_revenue
count 2500.00 2500.00 2500.00 2500.00
mean 3.81 312.46 0.09 1053.23
std 2.16 198.23 0.06 789.23
min 1.00 12.99 0.00 12.99
25% 2.00 149.99 0.05 384.35
50% 3.00 279.99 0.10 839.94
75% 5.00 449.99 0.15 1574.92
max 50.00 1899.99 0.30 94999.50
Priya scans systematically:
units_sold: Min of 1 and max of 50. That maximum seems high: a single transaction for 50 units of an office chair? She makes a note to investigate the largest transactions. The mean is 3.81 and the median is 3, which are close, so the distribution as a whole is not badly skewed by those large orders.
unit_price: Range of $12.99 to $1,899.99. Plausible for office furniture and technology products.
discount_rate: Values from 0.00 to 0.30 (0% to 30%). The mean is 0.09 (9%) and the median is 0.10 (10%). This looks reasonable.
total_revenue: Here is something interesting. The mean is $1,053 but the maximum is $94,999.50 — there are some very large transactions pulling the average up significantly above the median of $840. She wants to look at those high-value outliers.
# Find the top 5 highest-value transactions
print("=== TOP 5 TRANSACTIONS BY REVENUE ===")
print(df.nlargest(5, "total_revenue")[
["date", "salesperson", "product_sku", "category", "units_sold",
"unit_price", "total_revenue"]
])
Output:
date salesperson product_sku category units_sold unit_price total_revenue
1847 2023-11-21 Jordan Blake TECH-LAP-055 Technology 50 1899.99 94999.50
2103 2023-08-14 Morgan Lee TECH-LAP-055 Technology 40 1899.99 75999.60
...
Fifty laptops at $1,899.99 each — that is an enterprise sale. These are legitimate bulk orders, not data errors. Good.
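Since total_revenue should be derivable from the other columns, a cross-field consistency check is cheap insurance. A sketch, allowing a one-cent tolerance for rounding:
# Recompute revenue from its components and compare to the stored column
expected = df["units_sold"] * df["unit_price"] * (1 - df["discount_rate"])
mismatch = (expected - df["total_revenue"]).abs() > 0.01
print(f"Rows where revenue does not reconcile: {mismatch.sum()}")
The rows visible so far all reconcile (2 × 249.99 × 0.90 = 449.98, for example), but running the check over all 2,500 rows is the only way to be sure.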
Step 6: Region and Category Value Counts
print("=== REGION DISTRIBUTION ===")
print(df["region"].value_counts(dropna=False))
Output:
Northeast 612
Southwest 521
Midwest 489
Southeast 468
West 397
NaN 13
Name: region, dtype: int64
Five regions plus 13 missing. The dropna=False parameter ensures the NaN count is visible — an important habit. Without it, Priya would see only the five regions and might not notice the missing values.
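For contrast, the default call silently drops NaN, so the 13 missing rows would simply vanish from the tally:
# Default behavior: value_counts() hides missing values entirely
print(df["region"].value_counts())   # only the five named regions appear
print(df["region"].isna().sum())     # the 13 missing rows, counted explicitly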
print("\n=== CATEGORY DISTRIBUTION ===")
print(df["category"].value_counts(dropna=False))
Output:
Office Chairs 847
Desks 623
Technology 589
Storage 426
office chairs 7
OFFICE CHAIRS 3
NaN 5
Name: category, dtype: int64
Priya leans forward in her chair.
Issue #3: Inconsistent category capitalization. "Office Chairs", "office chairs", and "OFFICE CHAIRS" are clearly the same category but will be treated as three separate values by pandas. Her pivot tables and group-by operations will split this category into three incorrect rows.
She quickly checks the other categories:
# Check all unique category values (sorted, to make casing issues obvious)
print("All unique category values:")
for val in sorted(df["category"].dropna().unique()):
print(f" '{val}'")
Output:
All unique category values:
'Desks'
'OFFICE CHAIRS'
'Office Chairs'
'Storage'
'Technology'
'office chairs'
Desks, Storage, and Technology appear only in title case — no inconsistencies there. The problem is isolated to "Office Chairs." She counts: 847 + 7 + 3 = 857 total "Office Chairs" transactions, but only 847 are labeled correctly right now.
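The fix itself belongs to the cleaning phase in Chapter 12, but a one-line preview confirms that normalizing case collapses the three spellings into one. A sketch using .str.title(), one of the options she notes later:
# Preview: category counts after case normalization
# "Office Chairs" should show 857 (847 + 7 + 3)
print(df["category"].str.title().value_counts(dropna=False))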
Step 7: Unique Value Checks
print("=== UNIQUE VALUE COUNTS ===")
print(df.nunique())
Output:
date 248
region 5
salesperson 22
product_sku 127
category 6
units_sold 18
unit_price 89
discount_rate 7
total_revenue 832
dtype: int64
Wait — category shows 6 unique values, not 4. That is because "Office Chairs", "office chairs", and "OFFICE CHAIRS" are counted as three separate unique values. If the data were clean, she would expect exactly 4 category values (Office Chairs, Desks, Technology, Storage).
She notes: 22 salespersons, 127 distinct product SKUs, 7 distinct discount rates, 248 transaction dates. The 248 dates roughly match 365 days minus weekends and holidays — plausible.
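One way to test the weekends hypothesis directly is to count transactions by day of week (in pandas, 0 is Monday and 6 is Sunday):
# Transactions per weekday; near-zero weekend counts would support the theory
print(df["date"].dt.dayofweek.value_counts().sort_index())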
# Check discount rates (should be a small set of standard values)
print("=== DISCOUNT RATES ===")
print(df["discount_rate"].value_counts().sort_index())
Output:
0.00 378
0.05 412
0.10 689
0.15 521
0.20 356
0.25 109
0.30 35
Name: discount_rate, dtype: int64
Seven standard discount tiers: 0%, 5%, 10%, 15%, 20%, 25%, 30%. All whole percentages — no oddities like 0.127 or 0.33. The discount structure is clean.
Step 8: Checking for Duplicate Rows
print("=== DUPLICATE ROWS ===")
n_duplicates = df.duplicated().sum()
print(f"Exact duplicate rows: {n_duplicates}")
if n_duplicates > 0:
print("\nSample duplicate rows:")
print(df[df.duplicated(keep=False)].head(10))
Output:
Exact duplicate rows: 8
Sample duplicate rows:
date region salesperson product_sku category units_sold unit_price discount_rate total_revenue
12 2023-01-15 Northeast Jordan Blake OFF-CH-0045 Office Chairs 2 249.99 0.10 449.98
47 2023-01-15 Northeast Jordan Blake OFF-CH-0045 Office Chairs 2 249.99 0.10 449.98
...
Issue #4: 8 exact duplicate rows. These appear to be double-entered transactions — the same salesperson, same product, same date, same quantity, same price. This is almost certainly a data entry error or a CRM export bug, not a legitimate case of the same person selling the same thing at the same time for the same amount.
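A quick follow-up she could run to check whether these are simple double entries rather than triples: group the flagged rows by every column and inspect the group sizes. A size of 2 means the transaction was entered exactly twice:
# Each group is one distinct transaction; its size is how many times it was entered
flagged = df[df.duplicated(keep=False)]
print(flagged.groupby(list(df.columns), dropna=False).size().value_counts())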
Step 9: Date Range Check
print("=== DATE RANGE ===")
print(f"Earliest date: {df['date'].min().strftime('%Y-%m-%d')}")
print(f"Latest date: {df['date'].max().strftime('%Y-%m-%d')}")
print(f"Date span: {(df['date'].max() - df['date'].min()).days} days")
Output:
Earliest date: 2023-01-02
Latest date: 2023-12-29
Date span: 361 days
The data spans essentially the full year. The earliest transaction is January 2nd and the latest is December 29th — no January 1st (a holiday) and the last few days of December are missing, but this is expected given the CRM export date.
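To make sure the year has no month-long gaps, one more quick check is transactions per month:
# Transaction counts by month; any near-empty month would be a red flag
print(df["date"].dt.month.value_counts().sort_index())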
Step 10: Priya's Data Quality Summary
Priya writes up her findings in a structured note to herself before she begins any analysis:
DATA QUALITY ISSUES FOUND — acme_sales_2023.csv
Explored: 2024-01-08 by Priya Okonkwo
Dataset: 2,500 rows, 9 columns, date range Jan 2 – Dec 29, 2023
ISSUES TO FIX BEFORE ANALYSIS:
Issue #1 — Missing region (13 rows, 0.52%)
Affected column: region
Action: Investigate if the salesperson always works in one region
and backfill accordingly (see the sketch after this note).
If ambiguous, assign "Unknown".
Issue #2 — Missing category (5 rows, 0.20%)
Affected column: category
Action: Look up the product_sku to determine the correct category.
If SKU lookup not possible, assign "Unknown" or drop rows.
Issue #3 — Inconsistent category capitalization (10 rows)
Affected column: category
Values: "Office Chairs" (847), "office chairs" (7), "OFFICE CHAIRS" (3)
Action: Standardize all to "Office Chairs" using .str.title() or .replace()
Issue #4 — Duplicate rows (8 exact duplicates)
Action: Drop duplicates using .drop_duplicates() and document
which rows were removed and why.
DATA THAT LOOKS CLEAN:
- unit_price: plausible range $12.99 to $1,899.99
- discount_rate: 7 standard tiers, 0% to 30%
- date: full year coverage, dates are valid datetime objects
- units_sold: 1 to 50, bulk orders plausible
- total_revenue: large transactions verified as legitimate bulk orders
- No type issues: all numeric columns loaded as numbers
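Before emailing Marcus, she sketches the Issue #1 check referenced above: if every salesperson maps to exactly one region, the 13 missing values can be backfilled from that salesperson's other rows. A minimal sketch (groupby with nunique is one option among several):
# For each salesperson, count how many distinct regions they appear in
regions_per_rep = (
    df.dropna(subset=["region"])
      .groupby("salesperson")["region"]
      .nunique()
)
# Reps appearing in more than one region are ambiguous; backfill only the rest
print(regions_per_rep[regions_per_rep > 1])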
She emails Marcus:
"Marcus — the file loaded cleanly. I found a few data quality items: 13 rows missing region, 5 missing category, some duplicate entries (8 rows), and the 'Office Chairs' category is spelled in three different ways in the file. Can you check whether the CRM has a known issue with case-sensitivity on product categories? I'll clean these up on my end for the analysis, but you might want to fix it at the source too."
Marcus replies within the hour: "Yeah, the CRM export was applying different text formatting depending on which user entered the record. Known bug, it's on the backlog. Sorry about that. The missing regions are probably from Marcus Webb (me) — I sometimes forget to tag the region when entering bulk orders. They should all be 'Northeast'."
Priya now has everything she needs to begin the cleaning work — which she will tackle in Chapter 12.
What Priya Learned
This exploration took Priya approximately 20 minutes. In that time she discovered four distinct data quality issues that would have silently corrupted her analysis if she had dived straight into pivot tables:
- The "Office Chairs" capitalization issue would have split the category into three rows in any group-by report, making it appear that two obscure sub-categories existed.
- The duplicate rows would have inflated total revenue by a small but real amount.
- The missing regions would have caused 0.52% of transactions to be excluded from regional analysis without any warning.
- Knowing the missing regions are all "Northeast" (from Marcus's reply) means she can fill them with confidence rather than guessing.
The lesson: systematic exploration is not overhead — it is insurance against incorrect conclusions.
Key Code from This Case Study
import pandas as pd
# Load with intent
df = pd.read_csv(
"acme_sales_2023.csv",
parse_dates=["date"],
dtype={"product_sku": str},
na_values=["N/A", "n/a", "--", ""]
)
# Quick dimensions check
print(df.shape)
# Visual inspection — first and last rows
print(df.head())
print(df.tail())
# Types and missing values
df.info()
# Statistical ranges
print(df.describe().round(2))
# Category distribution — always use dropna=False
print(df["region"].value_counts(dropna=False))
print(df["category"].value_counts(dropna=False))
# Case-sensitivity check
for val in sorted(df["category"].dropna().unique()):
print(f" '{val}'")
# Duplicate detection
print(f"Duplicates: {df.duplicated().sum()}")
# Date range
print(f"Date range: {df['date'].min()} to {df['date'].max()}")