Case Study 11-1: Priya Explores the Acme Sales Dataset

Characters: Priya Okonkwo (Senior Analyst), Marcus Webb (IT Manager)
Setting: Acme Corp headquarters, a Monday morning in early January 2024
Skill focus: Systematic dataset exploration, using .info(), .describe(), .value_counts(), and missing value detection


The Setup

Priya Okonkwo arrives at her desk to find a calendar invite from Sandra Chen, VP of Sales, scheduled for Wednesday afternoon. The subject line: "2023 Annual Sales Performance Review — need your analysis." Attached to the invite is a two-sentence note:

"Priya, I need a complete picture of how 2023 went: which regions performed best, which products drove revenue, and whether our discount strategy was effective. Marcus will send you the raw data this morning."

At 9:14 AM, an email arrives from Marcus Webb:

"Priya — attached is the full 2023 sales transaction file: acme_sales_2023.csv. I pulled it directly from the CRM export. 2,500 rows, should be everything. Let me know if anything looks off."

Priya downloads the file, opens a new Python script, and begins her standard data exploration workflow. She has learned from experience: do not assume a file is clean just because it came from a reliable source.


Step 1: Loading the File

Priya starts with a careful, intentional load. She knows the file is a CSV from the CRM system, and she knows dates and product SKUs need special treatment.

import pandas as pd

# Load the 2023 sales file
# - parse_dates: convert date column to datetime objects
# - dtype: keep product_sku as string (SKUs can have leading zeros)
# - na_values: catch common missing-value representations from the CRM export
df = pd.read_csv(
    "acme_sales_2023.csv",
    parse_dates=["date"],
    dtype={"product_sku": str},
    na_values=["N/A", "n/a", "--", ""]
)

print("File loaded successfully.")
print(f"Shape: {df.shape}")

Output:

File loaded successfully.
Shape: (2500, 9)

Good — the row count matches what Marcus said. Nine columns: date, region, salesperson, product_sku, category, units_sold, unit_price, discount_rate, total_revenue.


Step 2: First Visual Check

Priya always looks at the first and last few rows immediately. This catches formatting issues, extra header rows, or summary totals that sometimes get accidentally included in data exports.

print("=== FIRST 5 ROWS ===")
print(df.head())

print("\n=== LAST 5 ROWS ===")
print(df.tail())

Output (first 5 rows):

        date     region     salesperson product_sku       category  units_sold  unit_price  discount_rate  total_revenue
0 2023-01-03  Northeast    Jordan Blake  OFF-CH-0045  Office Chairs           2      249.99           0.10         449.98
1 2023-01-03  Southwest    Casey Rivera  DESK-SIT-012         Desks           1      899.99           0.15         764.99
2 2023-01-04    Midwest      Alex Torres  TECH-MON-089   Technology           3      349.99           0.05         997.47
3 2023-01-05        NaN     Marcus Webb  OFF-CH-0046  Office Chairs           1      299.99           0.00         299.99
4 2023-01-05  Northeast    Jordan Blake  STOR-CAB-023       Storage           4      129.99           0.10         467.96

Priya stops immediately at row 3. The region column is NaN — Marcus Webb's transaction is missing a region. She notes this but continues her systematic review before drawing conclusions.

The last 5 rows look normal: no summary row, no "TOTAL" label, no blank rows at the bottom.


Step 3: Data Types

print("=== DATA TYPES ===")
print(df.dtypes)

Output:

date              datetime64[ns]
region                    object
salesperson               object
product_sku               object
category                  object
units_sold                 int64
unit_price               float64
discount_rate            float64
total_revenue            float64
dtype: object

Priya checks each type against her expectations:

  • date is datetime64[ns] — correct, the parse_dates parameter worked.
  • region, salesperson, product_sku, category are object (string) — expected.
  • units_sold is int64 — expected.
  • unit_price, discount_rate, total_revenue are float64 — expected.

No surprises here. But she has already seen a missing region value. The real question is how many there are.


Step 4: The .info() Summary

print("=== DATASET INFO ===")
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   date           2500 non-null   datetime64[ns]
 1   region         2487 non-null   object
 2   salesperson    2500 non-null   object
 3   product_sku    2500 non-null   object
 4   category       2495 non-null   object
 5   units_sold     2500 non-null   int64
 6   unit_price     2500 non-null   float64
 7   discount_rate  2500 non-null   float64
 8   total_revenue  2500 non-null   float64
dtypes: datetime64[ns](1), float64(3), int64(1), object(4)
memory usage: 175.9+ KB

Priya highlights two lines and writes in her notebook:

Issue #1: region — 13 missing values (2500 - 2487 = 13)
Issue #2: category — 5 missing values (2500 - 2495 = 5)

Thirteen out of 2,500 is 0.52% — small but not trivial. If she is grouping revenue by region, those 13 transactions will be excluded or placed in an "Unknown" bucket. She needs to decide what to do with them.
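The same missing-value counts can be computed directly rather than read off the .info() output — useful when a dataset has dozens of columns. A minimal sketch (the small frame here stands in for the real df):

```python
import pandas as pd

# Small stand-in frame; in Priya's session this would be the loaded sales file
df = pd.DataFrame({
    "region": ["Northeast", None, "Midwest", "West"],
    "category": ["Desks", "Storage", None, "Desks"],
})

# Missing count and percentage per column, worst first
missing = df.isna().sum().sort_values(ascending=False)
pct = (missing / len(df) * 100).round(2)
print(pd.DataFrame({"missing": missing, "pct": pct}))
```

On the real file this would report 13 missing for region (0.52%) and 5 for category (0.20%) in one table.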


Step 5: Statistical Summary

print("=== DESCRIPTIVE STATISTICS ===")
print(df.describe().round(2))

Output:

       units_sold   unit_price  discount_rate  total_revenue
count     2500.00      2500.00        2500.00        2500.00
mean         3.81       312.46           0.09        1053.23
std          2.16       198.23           0.06        2716.48
min          1.00        12.99           0.00          12.99
25%          2.00       149.99           0.05         384.35
50%          3.00       279.99           0.10         839.94
75%          5.00       449.99           0.15        1574.92
max         50.00      1899.99           0.30       94999.50

Priya scans systematically:

units_sold: Min of 1 and max of 50. That maximum seems high — a single transaction for 50 units of an office chair? She makes a note to investigate the largest transactions. The mean is 3.81 and the median is 3, which are close — probably not a distribution problem.

unit_price: Range of $12.99 to $1,899.99. Plausible for office furniture and technology products.

discount_rate: Values from 0.00 to 0.30 (0% to 30%). The mean is 0.09 (9%) and the median is 0.10 (10%). This looks reasonable.

total_revenue: Here is something interesting. The mean is $1,053 but the maximum is $94,999.50 — there are some very large transactions pulling the average up significantly above the median of $840. She wants to look at those high-value outliers.

# Find the top 5 highest-value transactions
print("=== TOP 5 TRANSACTIONS BY REVENUE ===")
print(df.nlargest(5, "total_revenue")[
    ["date", "salesperson", "product_sku", "category", "units_sold",
     "unit_price", "total_revenue"]
])

Output:

          date    salesperson product_sku   category  units_sold  unit_price  total_revenue
1847  2023-11-21  Jordan Blake  TECH-LAP-055  Technology          50     1899.99       94999.50
2103  2023-08-14  Morgan Lee   TECH-LAP-055  Technology          40     1899.99       75999.60
...

Fifty laptops at $1,899.99 each — that is an enterprise sale. These are legitimate bulk orders, not data errors. Good.


Step 6: Region and Category Value Counts

print("=== REGION DISTRIBUTION ===")
print(df["region"].value_counts(dropna=False))

Output:

Northeast    612
Southwest    521
Midwest      489
Southeast    468
West         397
NaN           13
Name: region, dtype: int64

Five regions plus 13 missing. The dropna=False parameter ensures the NaN count is visible — an important habit. Without it, Priya would see only the five regions and might not notice the missing values.

print("\n=== CATEGORY DISTRIBUTION ===")
print(df["category"].value_counts(dropna=False))

Output:

Office Chairs    847
Desks            623
Technology       589
Storage          426
office chairs     7
OFFICE CHAIRS     3
NaN               5
Name: category, dtype: int64

Priya leans forward in her chair.

Issue #3: Inconsistent category capitalization. "Office Chairs", "office chairs", and "OFFICE CHAIRS" are clearly the same category but will be treated as three separate values by pandas. Her pivot tables and group-by operations will split this category into three incorrect rows.

She quickly checks the other categories:

# Check all unique category values (sorted, to make casing issues obvious)
print("All unique category values:")
for val in sorted(df["category"].dropna().unique()):
    print(f"  '{val}'")

Output:

All unique category values:
  'Desks'
  'OFFICE CHAIRS'
  'Office Chairs'
  'Storage'
  'Technology'
  'office chairs'

Desks, Storage, and Technology appear only in title case — no inconsistencies there. The problem is isolated to "Office Chairs." She counts: 847 + 7 + 3 = 857 total "Office Chairs" transactions, but only 847 are labeled correctly right now.
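Before touching the real data, the fix she has in mind can be previewed. A minimal sketch of the normalization, on stand-in values rather than the real column:

```python
import pandas as pd

# Stand-in for the category column, with the three observed spellings
cat = pd.Series(["Office Chairs", "office chairs", "OFFICE CHAIRS", "Desks"])

# .str.title() capitalizes the first letter of each word, which
# collapses all three spellings into "Office Chairs"
normalized = cat.str.strip().str.title()
print(normalized.value_counts())
```

Applied to the real column, this would merge the 847 + 7 + 3 rows into a single 857-row "Office Chairs" category.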


Step 7: Unique Value Checks

print("=== UNIQUE VALUE COUNTS ===")
print(df.nunique())

Output:

date             248
region             5
salesperson       22
product_sku      127
category           6
units_sold        18
unit_price        89
discount_rate      7
total_revenue    832
dtype: int64

Wait — category shows 6 unique values, not 4. That is because "Office Chairs", "office chairs", and "OFFICE CHAIRS" are counted as three separate unique values. If the data were clean, she would expect exactly 4 category values (Office Chairs, Desks, Technology, Storage).

She notes: 22 salespersons, 127 distinct product SKUs, 7 distinct discount rates, 248 transaction dates. The 248 dates roughly match 365 days minus weekends and holidays — plausible.
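Her weekday intuition is easy to verify: pandas can generate the 2023 business-day calendar, and the gap between that count and the 248 observed dates is the number of weekdays with no recorded sales. A quick sketch:

```python
import pandas as pd

# All Monday-Friday dates in 2023
weekdays_2023 = pd.bdate_range("2023-01-01", "2023-12-31")
print(f"Weekdays in 2023: {len(weekdays_2023)}")

# With 248 distinct transaction dates in the file, that leaves
# 260 - 248 = 12 weekdays with no sales (holidays, closures)
print(f"Weekdays with no transactions: {len(weekdays_2023) - 248}")
```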

# Check discount rates (should be a small set of standard values)
print("=== DISCOUNT RATES ===")
print(df["discount_rate"].value_counts().sort_index())

Output:

0.00    378
0.05    412
0.10    689
0.15    521
0.20    356
0.25    109
0.30     35
Name: discount_rate, dtype: int64

Seven standard discount tiers: 0%, 5%, 10%, 15%, 20%, 25%, 30%. All whole percentages — no oddities like 0.127 or 0.33. The discount structure is clean.
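A further sanity check — not part of Priya's session, but natural given these columns — is that total_revenue should equal units_sold × unit_price × (1 − discount_rate). A sketch on stand-in rows copied from the head of the file:

```python
import pandas as pd

# Stand-in rows mirroring the first transactions in the export
df = pd.DataFrame({
    "units_sold": [2, 1, 3],
    "unit_price": [249.99, 899.99, 349.99],
    "discount_rate": [0.10, 0.15, 0.05],
    "total_revenue": [449.98, 764.99, 997.47],
})

# Recompute revenue and flag any row that disagrees beyond rounding
expected = (df["units_sold"] * df["unit_price"]
            * (1 - df["discount_rate"])).round(2)
mismatches = df[(df["total_revenue"] - expected).abs() > 0.01]
print(f"Rows with inconsistent revenue: {len(mismatches)}")
```

A nonzero count here would point to calculation errors in the CRM rather than entry mistakes.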


Step 8: Checking for Duplicate Rows

print("=== DUPLICATE ROWS ===")
n_duplicates = df.duplicated().sum()
print(f"Exact duplicate rows: {n_duplicates}")

if n_duplicates > 0:
    print("\nSample duplicate rows:")
    print(df[df.duplicated(keep=False)].head(10))

Output:

Exact duplicate rows: 8

Sample duplicate rows:
         date     region     salesperson product_sku       category  units_sold  unit_price  discount_rate  total_revenue
12  2023-01-15  Northeast    Jordan Blake  OFF-CH-0045  Office Chairs           2      249.99           0.10         449.98
47  2023-01-15  Northeast    Jordan Blake  OFF-CH-0045  Office Chairs           2      249.99           0.10         449.98
...

Issue #4: 8 exact duplicate rows. These appear to be double-entered transactions — the same salesperson, same product, same date, same quantity, same price. This is almost certainly a data entry error or a CRM export bug, not a legitimate case of the same person selling the same thing at the same time for the same amount.
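The fix itself waits for the cleaning work, but it is a one-liner. A minimal sketch on a stand-in frame containing one double-entered row:

```python
import pandas as pd

# Stand-in frame with one double-entered transaction (rows 0 and 1)
df = pd.DataFrame({
    "date": ["2023-01-15", "2023-01-15", "2023-01-16"],
    "salesperson": ["Jordan Blake", "Jordan Blake", "Casey Rivera"],
    "total_revenue": [449.98, 449.98, 764.99],
})

# keep="first" (the default) retains one copy of each duplicated row
deduped = df.drop_duplicates()
print(f"Rows before: {len(df)}, after: {len(deduped)}")
```

On the real file this would remove the 8 duplicates, leaving 2,492 rows.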


Step 9: Date Range Check

print("=== DATE RANGE ===")
print(f"Earliest date: {df['date'].min().strftime('%Y-%m-%d')}")
print(f"Latest date:   {df['date'].max().strftime('%Y-%m-%d')}")
print(f"Date span:     {(df['date'].max() - df['date'].min()).days} days")

Output:

Earliest date: 2023-01-02
Latest date:   2023-12-29
Date span:     361 days

The data spans essentially the full year. The earliest transaction is January 2nd and the latest is December 29th — no January 1st (a holiday), and December 30–31, 2023 fell on a weekend, so both gaps are expected.
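A related check, not shown above, is counting transactions per calendar month — a month silently missing from the export would stand out immediately. A sketch with illustrative dates (the counts here are stand-ins, not from the real file):

```python
import pandas as pd

# Stand-in dates spanning several months
dates = pd.to_datetime(["2023-01-03", "2023-01-15",
                        "2023-02-07", "2023-03-21"])
df = pd.DataFrame({"date": dates})

# Transactions per calendar month; a zero or absent month is a red flag
per_month = df["date"].dt.to_period("M").value_counts().sort_index()
print(per_month)
```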


Step 10: Priya's Data Quality Summary

Priya writes up her findings in a structured note to herself before she begins any analysis:

DATA QUALITY ISSUES FOUND — acme_sales_2023.csv
Explored: 2024-01-08 by Priya Okonkwo

Dataset: 2,500 rows, 9 columns, date range Jan 2 – Dec 29, 2023

ISSUES TO FIX BEFORE ANALYSIS:

Issue #1 — Missing region (13 rows, 0.52%)
  Affected column: region
  Action: Investigate if the salesperson always works in one region
          and backfill accordingly. If ambiguous, assign "Unknown".

Issue #2 — Missing category (5 rows, 0.20%)
  Affected column: category
  Action: Look up the product_sku to determine the correct category.
          If SKU lookup not possible, assign "Unknown" or drop rows.

Issue #3 — Inconsistent category capitalization (10 rows)
  Affected column: category
  Values: "Office Chairs" (847), "office chairs" (7), "OFFICE CHAIRS" (3)
  Action: Standardize all to "Office Chairs" using .str.title() or .replace()

Issue #4 — Duplicate rows (8 exact duplicates)
  Action: Drop duplicates using .drop_duplicates() and document
          which rows were removed and why.

DATA THAT LOOKS CLEAN:
  - unit_price: plausible range $12.99 to $1,899.99
  - discount_rate: 7 standard tiers, 0% to 30%
  - date: full year coverage, dates are valid datetime objects
  - units_sold: 1 to 50, bulk orders plausible
  - total_revenue: large transactions verified as legitimate bulk orders
  - No type issues: all numeric columns loaded as numbers
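The Issue #1 action — checking whether each salesperson maps to a single region before backfilling — can be sketched like this (the tiny frame is a stand-in for the real data):

```python
import pandas as pd

# Stand-in frame: one salesperson with a missing region
df = pd.DataFrame({
    "salesperson": ["Marcus Webb", "Marcus Webb", "Jordan Blake"],
    "region": ["Northeast", None, "Northeast"],
})

# How many distinct (non-null) regions does each salesperson cover?
regions_per_rep = df.groupby("salesperson")["region"].nunique()
print(regions_per_rep)

# If every rep maps to exactly one region, the gaps can be
# backfilled safely from that mapping
fill_map = df.dropna(subset=["region"]).groupby("salesperson")["region"].first()
df["region"] = df["region"].fillna(df["salesperson"].map(fill_map))
print(df)
```

If any rep shows more than one region, those rows stay ambiguous and fall back to "Unknown," exactly as the note prescribes.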

She emails Marcus:

"Marcus — the file loaded cleanly. I found a few data quality items: 13 rows missing region, 5 missing category, some duplicate entries (8 rows), and the 'Office Chairs' category is spelled in three different ways in the file. Can you check whether the CRM has a known issue with case-sensitivity on product categories? I'll clean these up on my end for the analysis, but you might want to fix it at the source too."

Marcus replies within the hour: "Yeah, the CRM export was applying different text formatting depending on which user entered the record. Known bug, it's on the backlog. Sorry about that. The missing regions are probably from Marcus Webb (me) — I sometimes forget to tag the region when entering bulk orders. They should all be 'Northeast'."

Priya now has everything she needs to begin the cleaning work — which she will tackle in Chapter 12.


What Priya Learned

This exploration took Priya approximately 20 minutes. In that time she discovered four distinct data quality issues that would have silently corrupted her analysis if she had dived straight into pivot tables:

  1. The "Office Chairs" capitalization issue would have split the category into three rows in any group-by report, making it appear that two obscure sub-categories existed.
  2. The duplicate rows would have inflated total revenue by a small but real amount.
  3. The missing regions would have caused 0.52% of transactions to be excluded from regional analysis without any warning.
  4. The missing categories would have silently dropped five transactions from any category-level report.

And because Marcus confirmed the missing regions are all "Northeast," she can fill them with confidence rather than guessing.

The lesson: systematic exploration is not overhead — it is insurance against incorrect conclusions.


Key Code from This Case Study

import pandas as pd

# Load with intent
df = pd.read_csv(
    "acme_sales_2023.csv",
    parse_dates=["date"],
    dtype={"product_sku": str},
    na_values=["N/A", "n/a", "--", ""]
)

# Quick dimensions check
print(df.shape)

# Visual inspection — first and last rows
print(df.head())
print(df.tail())

# Types and missing values
df.info()

# Statistical ranges
print(df.describe().round(2))

# Category distribution — always use dropna=False
print(df["region"].value_counts(dropna=False))
print(df["category"].value_counts(dropna=False))

# Case-sensitivity check
for val in sorted(df["category"].dropna().unique()):
    print(f"  '{val}'")

# Duplicate detection
print(f"Duplicates: {df.duplicated().sum()}")

# Date range
print(f"Date range: {df['date'].min()} to {df['date'].max()}")