Case Study 11-2: Maya Explores Her Project and Customer Data
Character: Maya Reyes (Freelance Business Consultant) Setting: Maya's home office, a Tuesday morning Skill focus: Loading CSV and Excel files, exploring data from different sources, identifying inconsistent categorical values
The Setup
Maya Reyes runs a small but growing freelance consulting practice. She has been meaning to analyze her project history for months — partly to understand where her revenue comes from, partly to prepare for a conversation with a potential business partner who wants to see her "portfolio analytics."
She has two data sources:
-
maya_projects.csv— A CSV file she has maintained in a spreadsheet and periodically exports. Contains all client projects since she started freelancing: project name, client, start date, end date, status, hours logged, hourly rate, and total billed. -
maya_clients.xlsx— An Excel workbook with two sheets: "Active Clients" and "Client Notes." She created this manually in Excel.
She opens her laptop, starts a new Python script, and begins.
Loading the Projects CSV
import pandas as pd
# Load the projects file
# parse_dates: start_date and end_date should be datetime objects
# dtype: project_id should remain a string to preserve any alphanumeric IDs
projects = pd.read_csv(
"maya_projects.csv",
parse_dates=["start_date", "end_date"],
dtype={"project_id": str}
)
print(f"Projects loaded: {projects.shape[0]} rows, {projects.shape[1]} columns")
Output:
Projects loaded: 87 rows, 9 columns
87 projects since she started freelancing about four years ago. That feels right — roughly 20 to 25 engagements per year.
print(projects.head())
Output:
project_id project_name client start_date end_date status hours_logged hourly_rate total_billed
0 P001 Market Entry Study Brightfield LLC 2020-03-15 2020-06-30 Completed 82.5 125.0 10312.50
1 P002 Website Copy Refresh Nova Dynamics 2020-04-01 2020-05-15 completed 24.0 125.0 3000.00
2 P003 Supply Chain Audit Harrington & Sons 2020-05-10 2020-09-30 Complete 145.0 150.0 21750.00
3 P004 Brand Strategy Deck LivTech Inc. 2020-07-01 2020-08-15 DONE 38.5 125.0 4812.50
4 P005 Competitive Analysis Brightfield LLC 2020-09-01 2020-10-31 Completed 40.0 125.0 5000.00
Maya's eyes go immediately to the status column. Look at those values:
- "Completed" (row 0)
- "completed" (row 1)
- "Complete" (row 2)
- "DONE" (row 3)
Four rows and already four different representations of what is probably the same status. This was entered manually over four years, and the inconsistency shows.
Checking Data Types
print("=== DATA TYPES ===")
print(projects.dtypes)
Output:
project_id object
project_name object
client object
start_date datetime64[ns]
end_date datetime64[ns]
status object
hours_logged float64
hourly_rate float64
total_billed float64
dtype: object
The dates loaded correctly as datetime64[ns]. The financial columns are float64. The status column is object (string) — which is correct, though the values clearly need standardization.
Running .info()
projects.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 project_id 87 non-null object
1 project_name 87 non-null object
2 client 87 non-null object
3 start_date 87 non-null datetime64[ns]
4 end_date 72 non-null datetime64[ns]
5 status 87 non-null object
6 hours_logged 81 non-null float64
7 hourly_rate 87 non-null object
8 total_billed 87 non-null float64
dtypes: float64(2), object(5), datetime64[ns](2)
memory usage: 6.2 KB
Several immediate findings:
end_date: 72 non-null out of 87 — 15 projects have no end date. This is actually expected: active and in-progress projects do not have end dates yet. Maya will confirm this below.
hours_logged: 81 non-null out of 87 — 6 projects have no hours logged. This is a problem. Every project should have hours recorded. She suspects these are early projects where she did not track time carefully, or perhaps projects that were very short and she forgot to log.
hourly_rate shows as object! This should be float64. Maya stares at this for a moment. Hourly rate should be a number — why is it a string? She checks.
# Show unique values in hourly_rate to understand why it's object type
print("=== HOURLY RATE UNIQUE VALUES ===")
print(projects["hourly_rate"].unique())
Output:
['125.0' '$150' '150.0' '125' '$125' '175.0' '$175' '200' '150' '125.0']
There it is. Some entries have a $ prefix (probably copy-pasted from invoices), others are plain numbers, some have .0 decimal notation and others do not. pandas could not parse all of these as numbers, so it kept the entire column as strings.
Maya makes a note: hourly_rate needs the $ stripped and then a conversion to float. She will handle this in the cleaning chapter.
Examining the Status Problem in Detail
print("=== STATUS VALUE COUNTS ===")
print(projects["status"].value_counts(dropna=False))
Output:
Completed 41
In Progress 12
completed 8
Complete 7
On Hold 5
DONE 4
Cancelled 4
complete 3
in progress 2
COMPLETED 1
Name: status, dtype: int64
Ten distinct values for what should be a small, controlled set of project statuses. Maya thinks about what the canonical values should be:
| Intended Status | Variations Found |
|---|---|
| Completed | "Completed", "completed", "Complete", "complete", "DONE", "COMPLETED" |
| In Progress | "In Progress", "in progress" |
| On Hold | "On Hold" |
| Cancelled | "Cancelled" |
She writes: "Issue: status column has 10 distinct values representing 4 intended statuses. Need to standardize."
# How many projects are currently "active" (not completed/cancelled)?
active_statuses = ["In Progress", "in progress", "On Hold"]
active_projects = projects[projects["status"].isin(active_statuses)]
print(f"\nActive projects (various spellings): {len(active_projects)}")
# This count would be WRONG if we only checked one spelling:
wrong_count = len(projects[projects["status"] == "In Progress"])
print(f"'In Progress' exact match only: {wrong_count} (misses {12 + 2 - wrong_count})")
Output:
Active projects (various spellings): 19
'In Progress' exact match only: 12 (misses 2)
This is exactly why standardization matters. If Maya filtered for status == "In Progress" without cleaning, she would miss 2 active projects — potentially billing errors or follow-up tasks that slip through the cracks.
Checking for Missing hours_logged
# Which projects are missing hours?
missing_hours = projects[projects["hours_logged"].isna()]
print(f"\n=== PROJECTS MISSING HOURS ({len(missing_hours)} records) ===")
print(missing_hours[["project_id", "project_name", "client",
"status", "hours_logged", "total_billed"]])
Output:
project_id project_name client status hours_logged total_billed
5 P006 Brand Workshop OakPath Media Completed NaN 3500.0
11 P012 Logo Consult Brightfield LLC Completed NaN 1200.0
...
Six completed projects with no hours logged. Maya recognizes these: they are older fixed-fee projects from her early freelancing days where she agreed to a flat project fee and did not track hours separately. The total_billed is present, just not the hours_logged.
# What's the average hours for completed projects where hours ARE logged?
avg_hours = projects[
(projects["status"] == "Completed") & projects["hours_logged"].notna()
]["hours_logged"].mean()
print(f"Average hours for completed projects (where logged): {avg_hours:.1f}")
She decides the correct approach is to fill hours_logged with 0 for these fixed-fee projects and add a flag column indicating they were fixed-fee rather than hourly. She notes this for the cleaning chapter.
Loading the Excel Client File
# Load the Excel workbook — first, check what sheets exist
xl_file = pd.ExcelFile("maya_clients.xlsx")
print(f"\n=== EXCEL SHEETS AVAILABLE ===")
print(xl_file.sheet_names)
Output:
['Active Clients', 'Client Notes']
# Load the Active Clients sheet
clients = pd.read_excel(
"maya_clients.xlsx",
sheet_name="Active Clients",
dtype={"client_id": str}
)
print(f"\nClients loaded: {clients.shape[0]} rows, {clients.shape[1]} columns")
print(clients.head())
Output:
Clients loaded: 23 rows, 7 columns
client_id client_name industry account_manager first_contact_date last_project_date client_tier
0 C001 Brightfield LLC Consulting Maya Reyes 2020-03-01 2023-11-15 Gold
1 C002 Nova Dynamics Technology Maya Reyes 2020-04-01 2022-08-30 Silver
2 C003 Harrington & Sons Manufacturing Maya Reyes 2020-05-10 2023-09-30 Gold
3 C004 LivTech Inc. Technology Maya Reyes 2020-07-01 2021-12-15 None
4 C005 OakPath Media Creative Maya Reyes 2020-09-01 2023-04-20 Silver
clients.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 client_id 23 non-null object
1 client_name 23 non-null object
2 industry 23 non-null object
3 account_manager 23 non-null object
4 first_contact_date 20 non-null object
5 last_project_date 23 non-null object
6 client_tier 21 non-null object
dtypes: object(7)
memory usage: 1.4 KB
Two new issues:
first_contact_date is object type, not datetime — Maya did not use parse_dates for the Excel file. She will need to convert these manually. She also notices 3 missing values.
client_tier has 2 missing values — Two clients have no tier assigned. Maya checks which ones:
print("\n=== CLIENT TIER DISTRIBUTION ===")
print(clients["client_tier"].value_counts(dropna=False))
Output:
Gold 9
Silver 8
Bronze 4
None 2
NaN 2
Name: client_tier, dtype: int64
Interesting — there are entries with the string "None" (2 rows) AND Python NaN (2 rows). These are two different things that represent the same concept (missing tier), but pandas treats them differently. The string "None" will be counted as a value in .value_counts(), but Python NaN will not (unless dropna=False). Maya needs to standardize these to a single representation.
Date Format Investigation
# Look at the raw date values in the client file
print("=== FIRST CONTACT DATE SAMPLES ===")
print(clients["first_contact_date"].head(10).tolist())
Output:
['2020-03-01', '2020-04-01', '05/10/2020', '2020-07-01', 'September 2020',
'2020-10-15', nan, '11-15-2020', '2020-12-01', '2021-01-20']
Maya groans. Four different date formats in one column:
- "2020-03-01" — ISO format (YYYY-MM-DD)
- "05/10/2020" — US format (MM/DD/YYYY)
- "September 2020" — Month and year only, no day
- "11-15-2020" — Another US variation (MM-DD-YYYY)
This happened because she typed dates directly into the Excel cell over several years without enforcing a consistent format. pandas cannot automatically parse all of these in a single call because they are inconsistent. She needs a more careful conversion strategy.
Maya's Data Quality Summary
After 30 minutes of exploration, Maya creates her data quality notes:
PROJECTS DATA (maya_projects.csv) — 87 rows, 9 columns
Explored: 2024-01-09 by Maya Reyes
Issue #1 — status column has 10 variants (should be 4)
"Completed", "completed", "Complete", "complete", "DONE", "COMPLETED" → "Completed"
"In Progress", "in progress" → "In Progress"
"On Hold" → "On Hold" (already clean)
"Cancelled" → "Cancelled" (already clean)
Impact: Without fixing, any filter on status returns wrong counts.
Issue #2 — hourly_rate is object type (should be float)
Some values have "$" prefix (e.g., "$125", "$150")
Others are plain numbers ("125.0", "150", "175")
Fix: Strip "$" then convert to float.
Issue #3 — hours_logged missing for 6 projects (early fixed-fee work)
Fix: Fill with 0 and add is_fixed_fee flag.
Issue #4 — end_date missing for 15 projects
These are active/in-progress projects — expected, not an error.
No fix needed; filter out when calculating project duration.
CLIENTS DATA (maya_clients.xlsx, Active Clients sheet) — 23 rows, 7 columns
Issue #5 — first_contact_date loaded as object (not datetime)
Four different date formats in the column.
Fix: Use pd.to_datetime() with errors='coerce', then handle NaT values.
Issue #6 — client_tier has both NaN and string "None" (should both be NaN)
Fix: Replace string "None" with actual NaN, then decide whether to fill.
She closes her notebook, satisfied. She knows exactly what she is dealing with. Now she needs to fix it — which is what Chapter 12 is for.
What This Case Study Demonstrates
Maya's exploration highlights several patterns that appear repeatedly in real business data:
Manual data entry creates inconsistency. When humans type values into a spreadsheet over months or years, capitalization and format vary. Any field that is typed rather than selected from a dropdown is vulnerable.
Mixed formats in a single column are a loading challenge. The hourly_rate column — some with $, some without — is a classic example. pandas cannot resolve this ambiguity at load time, so the column comes in as object.
"Missing" is not always simple. Maya's hours_logged has two kinds of "missing": genuinely missing data (an error) and intentionally absent data (fixed-fee projects where hours were never tracked). These require different treatments.
Excel files need the same care as CSV files. Just because a file is "organized" in Excel does not mean it is analysis-ready. Date formats, merged cells, and inconsistent text entries are just as common in Excel as in CSV exports.
Key Code from This Case Study
import pandas as pd
# Load CSV with date parsing
projects = pd.read_csv(
"maya_projects.csv",
parse_dates=["start_date", "end_date"],
dtype={"project_id": str}
)
# Basic exploration
print(projects.shape)
print(projects.head())
projects.info()
print(projects.describe())
# Identify inconsistent status values
print(projects["status"].value_counts(dropna=False))
# Check for missing values
missing = projects.isna().sum()
print(missing[missing > 0])
# Load Excel with multiple sheets
xl_file = pd.ExcelFile("maya_clients.xlsx")
print(xl_file.sheet_names)
clients = pd.read_excel("maya_clients.xlsx", sheet_name="Active Clients")
# Check for mixed NaN and string "None"
print(clients["client_tier"].value_counts(dropna=False))
# Inspect raw date format variety
print(clients["first_contact_date"].head(10).tolist())