Case Study 11-2: Maya Explores Her Project and Customer Data

Character: Maya Reyes (Freelance Business Consultant) Setting: Maya's home office, a Tuesday morning Skill focus: Loading CSV and Excel files, exploring data from different sources, identifying inconsistent categorical values


The Setup

Maya Reyes runs a small but growing freelance consulting practice. She has been meaning to analyze her project history for months — partly to understand where her revenue comes from, partly to prepare for a conversation with a potential business partner who wants to see her "portfolio analytics."

She has two data sources:

  1. maya_projects.csv — A CSV file she has maintained in a spreadsheet and periodically exports. Contains all client projects since she started freelancing: project name, client, start date, end date, status, hours logged, hourly rate, and total billed.

  2. maya_clients.xlsx — An Excel workbook with two sheets: "Active Clients" and "Client Notes." She created this manually in Excel.

She opens her laptop, starts a new Python script, and begins.


Loading the Projects CSV

import pandas as pd

# Load the projects file
# parse_dates: start_date and end_date should be datetime objects
# dtype: project_id should remain a string to preserve any alphanumeric IDs
projects = pd.read_csv(
    "maya_projects.csv",
    parse_dates=["start_date", "end_date"],
    dtype={"project_id": str}
)

print(f"Projects loaded: {projects.shape[0]} rows, {projects.shape[1]} columns")

Output:

Projects loaded: 87 rows, 9 columns

87 projects since she started freelancing about four years ago. That feels right — roughly 20 to 25 engagements per year.

print(projects.head())

Output:

  project_id           project_name             client start_date   end_date      status  hours_logged  hourly_rate  total_billed
0       P001     Market Entry Study     Brightfield LLC 2020-03-15 2020-06-30   Completed          82.5        125.0      10312.50
1       P002   Website Copy Refresh      Nova Dynamics 2020-04-01 2020-05-15   completed          24.0        125.0       3000.00
2       P003   Supply Chain Audit  Harrington & Sons   2020-05-10 2020-09-30    Complete          145.0       150.0      21750.00
3       P004    Brand Strategy Deck       LivTech Inc. 2020-07-01 2020-08-15       DONE          38.5        125.0       4812.50
4       P005  Competitive Analysis  Brightfield LLC    2020-09-01 2020-10-31  Completed            40.0       125.0       5000.00

Maya's eyes go immediately to the status column. Look at those values: - "Completed" (row 0) - "completed" (row 1) - "Complete" (row 2) - "DONE" (row 3)

Four rows and already four different representations of what is probably the same status. This was entered manually over four years, and the inconsistency shows.


Checking Data Types

print("=== DATA TYPES ===")
print(projects.dtypes)

Output:

project_id       object
project_name     object
client           object
start_date       datetime64[ns]
end_date         datetime64[ns]
status           object
hours_logged     float64
hourly_rate      float64
total_billed     float64
dtype: object

The dates loaded correctly as datetime64[ns]. The financial columns are float64. The status column is object (string) — which is correct, though the values clearly need standardization.


Running .info()

projects.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   project_id    87 non-null     object
 1   project_name  87 non-null     object
 2   client        87 non-null     object
 3   start_date    87 non-null     datetime64[ns]
 4   end_date      72 non-null     datetime64[ns]
 5   status        87 non-null     object
 6   hours_logged  81 non-null     float64
 7   hourly_rate   87 non-null     object
 8   total_billed  87 non-null     float64
dtypes: float64(2), object(5), datetime64[ns](2)
memory usage: 6.2 KB

Several immediate findings:

end_date: 72 non-null out of 87 — 15 projects have no end date. This is actually expected: active and in-progress projects do not have end dates yet. Maya will confirm this below.

hours_logged: 81 non-null out of 87 — 6 projects have no hours logged. This is a problem. Every project should have hours recorded. She suspects these are early projects where she did not track time carefully, or perhaps projects that were very short and she forgot to log.

hourly_rate shows as object! This should be float64. Maya stares at this for a moment. Hourly rate should be a number — why is it a string? She checks.

# Show unique values in hourly_rate to understand why it's object type
print("=== HOURLY RATE UNIQUE VALUES ===")
print(projects["hourly_rate"].unique())

Output:

['125.0' '$150' '150.0' '125' '$125' '175.0' '$175' '200' '150' '125.0']

There it is. Some entries have a $ prefix (probably copy-pasted from invoices), others are plain numbers, some have .0 decimal notation and others do not. pandas could not parse all of these as numbers, so it kept the entire column as strings.

Maya makes a note: hourly_rate needs the $ stripped and then a conversion to float. She will handle this in the cleaning chapter.


Examining the Status Problem in Detail

print("=== STATUS VALUE COUNTS ===")
print(projects["status"].value_counts(dropna=False))

Output:

Completed       41
In Progress     12
completed        8
Complete         7
On Hold          5
DONE             4
Cancelled        4
complete         3
in progress      2
COMPLETED        1
Name: status, dtype: int64

Ten distinct values for what should be a small, controlled set of project statuses. Maya thinks about what the canonical values should be:

Intended Status Variations Found
Completed "Completed", "completed", "Complete", "complete", "DONE", "COMPLETED"
In Progress "In Progress", "in progress"
On Hold "On Hold"
Cancelled "Cancelled"

She writes: "Issue: status column has 10 distinct values representing 4 intended statuses. Need to standardize."

# How many projects are currently "active" (not completed/cancelled)?
active_statuses = ["In Progress", "in progress", "On Hold"]
active_projects = projects[projects["status"].isin(active_statuses)]
print(f"\nActive projects (various spellings): {len(active_projects)}")

# This count would be WRONG if we only checked one spelling:
wrong_count = len(projects[projects["status"] == "In Progress"])
print(f"'In Progress' exact match only: {wrong_count} (misses {12 + 2 - wrong_count})")

Output:

Active projects (various spellings): 19
'In Progress' exact match only: 12 (misses 2)

This is exactly why standardization matters. If Maya filtered for status == "In Progress" without cleaning, she would miss 2 active projects — potentially billing errors or follow-up tasks that slip through the cracks.


Checking for Missing hours_logged

# Which projects are missing hours?
missing_hours = projects[projects["hours_logged"].isna()]
print(f"\n=== PROJECTS MISSING HOURS ({len(missing_hours)} records) ===")
print(missing_hours[["project_id", "project_name", "client",
                      "status", "hours_logged", "total_billed"]])

Output:

  project_id        project_name           client    status  hours_logged  total_billed
5       P006   Brand Workshop       OakPath Media  Completed           NaN       3500.0
11      P012   Logo Consult      Brightfield LLC  Completed           NaN       1200.0
...

Six completed projects with no hours logged. Maya recognizes these: they are older fixed-fee projects from her early freelancing days where she agreed to a flat project fee and did not track hours separately. The total_billed is present, just not the hours_logged.

# What's the average hours for completed projects where hours ARE logged?
avg_hours = projects[
    (projects["status"] == "Completed") & projects["hours_logged"].notna()
]["hours_logged"].mean()
print(f"Average hours for completed projects (where logged): {avg_hours:.1f}")

She decides the correct approach is to fill hours_logged with 0 for these fixed-fee projects and add a flag column indicating they were fixed-fee rather than hourly. She notes this for the cleaning chapter.


Loading the Excel Client File

# Load the Excel workbook — first, check what sheets exist
xl_file = pd.ExcelFile("maya_clients.xlsx")
print(f"\n=== EXCEL SHEETS AVAILABLE ===")
print(xl_file.sheet_names)

Output:

['Active Clients', 'Client Notes']
# Load the Active Clients sheet
clients = pd.read_excel(
    "maya_clients.xlsx",
    sheet_name="Active Clients",
    dtype={"client_id": str}
)

print(f"\nClients loaded: {clients.shape[0]} rows, {clients.shape[1]} columns")
print(clients.head())

Output:

Clients loaded: 23 rows, 7 columns

  client_id           client_name     industry  account_manager  first_contact_date  last_project_date  client_tier
0      C001       Brightfield LLC    Consulting        Maya Reyes         2020-03-01         2023-11-15         Gold
1      C002         Nova Dynamics    Technology        Maya Reyes         2020-04-01         2022-08-30       Silver
2      C003    Harrington & Sons  Manufacturing        Maya Reyes         2020-05-10         2023-09-30         Gold
3      C004          LivTech Inc.    Technology        Maya Reyes         2020-07-01         2021-12-15         None
4      C005        OakPath Media       Creative        Maya Reyes         2020-09-01         2023-04-20       Silver
clients.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   client_id           23 non-null     object
 1   client_name         23 non-null     object
 2   industry            23 non-null     object
 3   account_manager     23 non-null     object
 4   first_contact_date  20 non-null     object
 5   last_project_date   23 non-null     object
 6   client_tier         21 non-null     object
dtypes: object(7)
memory usage: 1.4 KB

Two new issues:

first_contact_date is object type, not datetime — Maya did not use parse_dates for the Excel file. She will need to convert these manually. She also notices 3 missing values.

client_tier has 2 missing values — Two clients have no tier assigned. Maya checks which ones:

print("\n=== CLIENT TIER DISTRIBUTION ===")
print(clients["client_tier"].value_counts(dropna=False))

Output:

Gold      9
Silver    8
Bronze    4
None      2
NaN       2
Name: client_tier, dtype: int64

Interesting — there are entries with the string "None" (2 rows) AND Python NaN (2 rows). These are two different things that represent the same concept (missing tier), but pandas treats them differently. The string "None" will be counted as a value in .value_counts(), but Python NaN will not (unless dropna=False). Maya needs to standardize these to a single representation.


Date Format Investigation

# Look at the raw date values in the client file
print("=== FIRST CONTACT DATE SAMPLES ===")
print(clients["first_contact_date"].head(10).tolist())

Output:

['2020-03-01', '2020-04-01', '05/10/2020', '2020-07-01', 'September 2020',
 '2020-10-15', nan, '11-15-2020', '2020-12-01', '2021-01-20']

Maya groans. Four different date formats in one column: - "2020-03-01" — ISO format (YYYY-MM-DD) - "05/10/2020" — US format (MM/DD/YYYY) - "September 2020" — Month and year only, no day - "11-15-2020" — Another US variation (MM-DD-YYYY)

This happened because she typed dates directly into the Excel cell over several years without enforcing a consistent format. pandas cannot automatically parse all of these in a single call because they are inconsistent. She needs a more careful conversion strategy.


Maya's Data Quality Summary

After 30 minutes of exploration, Maya creates her data quality notes:

PROJECTS DATA (maya_projects.csv) — 87 rows, 9 columns
Explored: 2024-01-09 by Maya Reyes

Issue #1 — status column has 10 variants (should be 4)
  "Completed", "completed", "Complete", "complete", "DONE", "COMPLETED" → "Completed"
  "In Progress", "in progress" → "In Progress"
  "On Hold" → "On Hold" (already clean)
  "Cancelled" → "Cancelled" (already clean)
  Impact: Without fixing, any filter on status returns wrong counts.

Issue #2 — hourly_rate is object type (should be float)
  Some values have "$" prefix (e.g., "$125", "$150")
  Others are plain numbers ("125.0", "150", "175")
  Fix: Strip "$" then convert to float.

Issue #3 — hours_logged missing for 6 projects (early fixed-fee work)
  Fix: Fill with 0 and add is_fixed_fee flag.

Issue #4 — end_date missing for 15 projects
  These are active/in-progress projects — expected, not an error.
  No fix needed; filter out when calculating project duration.

CLIENTS DATA (maya_clients.xlsx, Active Clients sheet) — 23 rows, 7 columns

Issue #5 — first_contact_date loaded as object (not datetime)
  Four different date formats in the column.
  Fix: Use pd.to_datetime() with errors='coerce', then handle NaT values.

Issue #6 — client_tier has both NaN and string "None" (should both be NaN)
  Fix: Replace string "None" with actual NaN, then decide whether to fill.

She closes her notebook, satisfied. She knows exactly what she is dealing with. Now she needs to fix it — which is what Chapter 12 is for.


What This Case Study Demonstrates

Maya's exploration highlights several patterns that appear repeatedly in real business data:

Manual data entry creates inconsistency. When humans type values into a spreadsheet over months or years, capitalization and format vary. Any field that is typed rather than selected from a dropdown is vulnerable.

Mixed formats in a single column are a loading challenge. The hourly_rate column — some with $, some without — is a classic example. pandas cannot resolve this ambiguity at load time, so the column comes in as object.

"Missing" is not always simple. Maya's hours_logged has two kinds of "missing": genuinely missing data (an error) and intentionally absent data (fixed-fee projects where hours were never tracked). These require different treatments.

Excel files need the same care as CSV files. Just because a file is "organized" in Excel does not mean it is analysis-ready. Date formats, merged cells, and inconsistent text entries are just as common in Excel as in CSV exports.


Key Code from This Case Study

import pandas as pd

# Load CSV with date parsing
projects = pd.read_csv(
    "maya_projects.csv",
    parse_dates=["start_date", "end_date"],
    dtype={"project_id": str}
)

# Basic exploration
print(projects.shape)
print(projects.head())
projects.info()
print(projects.describe())

# Identify inconsistent status values
print(projects["status"].value_counts(dropna=False))

# Check for missing values
missing = projects.isna().sum()
print(missing[missing > 0])

# Load Excel with multiple sheets
xl_file = pd.ExcelFile("maya_clients.xlsx")
print(xl_file.sheet_names)

clients = pd.read_excel("maya_clients.xlsx", sheet_name="Active Clients")

# Check for mixed NaN and string "None"
print(clients["client_tier"].value_counts(dropna=False))

# Inspect raw date format variety
print(clients["first_contact_date"].head(10).tolist())