Case Study 12-2: Maya Cleans Her Project Data
Character: Maya Reyes (Freelance Business Consultant)
Setting: Maya's home office, the day after she explored her data in Chapter 11
Skill focus: Standardizing status values with .map(), filling missing numeric values, handling mixed date formats, correcting a column stored as string when it should be float
The Recap
In Chapter 11, Maya explored maya_projects.csv and found four data quality issues:
- Issue #1: status column has 9 distinct values representing 4 intended statuses
- Issue #2: hourly_rate is object type (some values have a $ prefix)
- Issue #3: hours_logged missing for 2 early fixed-fee projects
- Issue #4: first_contact_date in the client Excel file appears in three different formats
Now she cleans each issue systematically. This is not just tidying up — it matters for her upcoming portfolio presentation and for generating accurate revenue reports for her accountant.
Setting Up
import pandas as pd
import numpy as np
from io import StringIO
# Simulate the projects data Maya has
# (matches what was described in the Chapter 11 case study)
projects_csv = """project_id,project_name,client,start_date,end_date,status,hours_logged,hourly_rate,total_billed
P001,Market Entry Study,Brightfield LLC,2020-03-15,2020-06-30,Completed,82.5,$125.00,10312.50
P002,Website Copy Refresh,Nova Dynamics,2020-04-01,2020-05-15,completed,24.0,$125,3000.00
P003,Supply Chain Audit,Harrington & Sons,2020-05-10,2020-09-30,Complete,145.0,150.0,21750.00
P004,Brand Strategy Deck,LivTech Inc.,2020-07-01,2020-08-15,DONE,38.5,$125.00,4812.50
P005,Competitive Analysis,Brightfield LLC,2020-09-01,2020-10-31,Completed,40.0,$125,5000.00
P006,Brand Workshop,OakPath Media,2020-10-01,2020-12-15,Completed,,125.0,3500.00
P007,Process Improvement,Nova Dynamics,2021-01-10,2021-04-30,Completed,110.0,$150,16500.00
P008,Market Segmentation,Harrington & Sons,2021-03-01,2021-07-31,Completed,88.0,$150.00,13200.00
P009,Growth Strategy,Brightfield LLC,2021-05-15,2021-09-30,Completed,120.0,150,18000.00
P010,Tech Stack Review,LivTech Inc.,2021-08-01,2021-12-15,COMPLETED,65.0,$150.00,9750.00
P011,Logo Consult,Brightfield LLC,2021-09-01,2021-09-30,Completed,,125.0,1200.00
P012,Sales Process Audit,OakPath Media,2022-01-10,2022-04-30,completed,95.0,$175.00,16625.00
P013,Pricing Strategy,Nova Dynamics,2022-03-01,2022-06-30,Completed,72.0,$175,12600.00
P014,Operations Review,RiverStone Co,2022-05-15,2022-10-31,Completed,135.0,$175.00,23625.00
P015,CRM Implementation,LivTech Inc.,2022-07-01,,In Progress,42.0,$175,7350.00
P016,Customer Journey Map,Brightfield LLC,2022-09-01,2022-12-15,Completed,58.0,$175.00,10150.00
P017,Vendor Assessment,Harrington & Sons,2022-11-01,2023-02-28,DONE,80.0,$175.00,14000.00
P018,Digital Transform Plan,CoreTech Ltd,2023-01-15,,In Progress,35.0,$200,7000.00
P019,Supply Chain Opt,RiverStone Co,2023-03-01,2023-07-31,Completed,105.0,$200.00,21000.00
P020,Go-to-Market Plan,OakPath Media,2023-04-01,,in progress,28.0,$200,5600.00
P021,Competitor Intelligence,Nova Dynamics,2023-05-15,2023-09-30,Completed,90.0,$200.00,18000.00
P022,Expansion Strategy,CoreTech Ltd,2023-07-01,,On Hold,20.0,$200,4000.00
P023,Cost Reduction Study,Brightfield LLC,2023-08-15,,In Progress,45.0,$200.00,9000.00
P024,Talent Strategy,LivTech Inc.,2023-10-01,,Cancelled,12.0,$200,2400.00
"""
# Simulate the clients data
clients_data = {
"client_id": ["C001", "C002", "C003", "C004", "C005", "C006"],
"client_name": ["Brightfield LLC", "Nova Dynamics", "Harrington & Sons",
"LivTech Inc.", "OakPath Media", "RiverStone Co"],
"industry": ["Consulting", "Technology", "Manufacturing",
"Technology", "Creative", "Logistics"],
"first_contact_date": ["2020-03-01", "2020-04-01", "05/10/2020",
"2020-07-01", "September 2020", "2022-05-15"],
"client_tier": ["Gold", "Silver", "Gold", "None", "Silver", "Bronze"]
}
# Load the data
projects = pd.read_csv(
StringIO(projects_csv),
dtype={"project_id": str}
# Note: NOT using parse_dates — we will handle dates manually
)
clients = pd.DataFrame(clients_data)
# Save a raw copy
projects_raw = projects.copy()
print(f"Projects loaded: {len(projects)} rows")
print(f"Clients loaded: {len(clients)} rows")
Pre-Cleaning Check
print("\n=== PRE-CLEANING STATE ===")
projects.info()
print()
print("Status value counts:")
print(projects["status"].value_counts(dropna=False))
print()
print("hourly_rate dtype:", projects["hourly_rate"].dtype)
print("hourly_rate sample:", projects["hourly_rate"].head(5).tolist())
print()
print("Missing hours_logged:", projects["hours_logged"].isna().sum())
Output:
Projects loaded: 24 rows
Clients loaded: 6 rows
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 project_id 24 non-null object
1 project_name 24 non-null object
2 client 24 non-null object
3 start_date 24 non-null object
4 end_date 18 non-null object
5 status 24 non-null object
6 hours_logged 22 non-null float64
7 hourly_rate 24 non-null object ← should be float!
8 total_billed 24 non-null float64
Status value counts:
Completed 12
In Progress 3
completed 2
DONE 2
Complete 1
COMPLETED 1
in progress 1
On Hold 1
Cancelled 1
hourly_rate dtype: object
hourly_rate sample: ['$125.00', '$125', '150.0', '$125.00', '$125']
Missing hours_logged: 2
Issues confirmed. Time to work through the cleaning checklist.
Step 1: Fix hourly_rate — Strip Dollar Signs and Convert to Float
Maya's hourly_rate column is a mess of "$125.00", "$125", "150.0", and "150". She needs all of these to become the float values they represent.
print("\n=== STEP 1: FIX hourly_rate ===")
print("Before:")
print(f" dtype: {projects['hourly_rate'].dtype}")
print(f" unique values: {sorted(projects['hourly_rate'].unique().tolist())}")
# Step 1a: Convert to string to ensure we can use .str methods
projects["hourly_rate"] = projects["hourly_rate"].astype(str)
# Step 1b: Strip dollar signs
projects["hourly_rate"] = projects["hourly_rate"].str.replace("$", "", regex=False)
# Step 1c: Strip any whitespace that might remain
projects["hourly_rate"] = projects["hourly_rate"].str.strip()
# Step 1d: Convert to numeric
projects["hourly_rate"] = pd.to_numeric(projects["hourly_rate"], errors="coerce")
print("\nAfter:")
print(f" dtype: {projects['hourly_rate'].dtype}")
print(f" unique values: {sorted(projects['hourly_rate'].dropna().unique().tolist())}")
print(f" NaN after conversion: {projects['hourly_rate'].isna().sum()}")
Output:
Before:
dtype: object
unique values: ['$125', '$125.00', '$150', '$150.00', '$175', '$175.00', '$200', '$200.00', '125.0', '150', '150.0']
After:
dtype: float64
unique values: [125.0, 150.0, 175.0, 200.0]
NaN after conversion: 0
Four clean hourly rates: $125, $150, $175, $200 — her four rate tiers over her career. No NaN values — every row converted successfully.
cleaning_log = []
cleaning_log.append(
"Step 1 — Fixed hourly_rate: stripped '$', converted 11 string variants "
"to 4 clean float values: [125.0, 150.0, 175.0, 200.0]. 0 conversion failures."
)
Step 2: Standardize Status Values with .map()
Maya has 9 distinct status values that should collapse to 4 canonical ones.
print("\n=== STEP 2: STANDARDIZE STATUS ===")
print("Before:")
print(projects["status"].value_counts(dropna=False))
# Define the canonical mapping
status_map = {
"Completed": "Completed",
"completed": "Completed",
"Complete": "Completed",
"COMPLETED": "Completed",
"DONE": "Completed",
# Active
"In Progress": "In Progress",
"in progress": "In Progress",
# Other
"On Hold": "On Hold",
"on hold": "On Hold",
# Ended
"Cancelled": "Cancelled",
"Canceled": "Cancelled",
}
projects["status"] = projects["status"].map(status_map)
print("\nAfter:")
print(projects["status"].value_counts(dropna=False))
# Defensive check: any unmapped values became NaN.
# The column itself was overwritten, so look up the originals in projects_raw.
n_unmapped = projects["status"].isna().sum()
if n_unmapped > 0:
    print(f"\n*** WARNING: {n_unmapped} values not in status_map (became NaN)")
    print("Update status_map to cover these values (originals from the raw copy):")
    print(projects_raw.loc[projects["status"].isna(), ["project_id", "status"]])
Output:
Before:
Completed 12
In Progress 3
completed 2
DONE 2
Complete 1
COMPLETED 1
in progress 1
On Hold 1
Cancelled 1
After:
Completed 18
In Progress 4
On Hold 1
Cancelled 1
Nine messy variants collapsed to four clean values. No NaN — every variant was accounted for in the mapping.
cleaning_log.append(
"Step 2 — Standardized status using .map(): 9 variants → 4 canonical values "
"(Completed: 18, In Progress: 4, On Hold: 1, Cancelled: 1). 0 unmapped."
)
Step 3: Handle Missing hours_logged
Two projects have no hours logged. Maya checks which ones:
print("\n=== STEP 3: HANDLE MISSING hours_logged ===")
missing_hours = projects[projects["hours_logged"].isna()]
print(f"Projects missing hours_logged: {len(missing_hours)}")
print(missing_hours[
["project_id", "project_name", "client", "status",
"hours_logged", "total_billed"]
].to_string(index=False))
Output:
Projects missing hours_logged: 2
project_id project_name client status hours_logged total_billed
P006 Brand Workshop OakPath Media Completed NaN 3500.0
P011 Logo Consult Brightfield LLC Completed NaN 1200.0
Maya recognizes these: P006 (Brand Workshop) and P011 (Logo Consult) were fixed-fee projects from 2020-2021 where she agreed to a flat rate and never tracked hours. The total_billed is present; only hours_logged is missing.
She decides:
1. Set hours_logged to 0 for these projects (they were not hourly)
2. Add a billing_type column to distinguish fixed-fee from hourly projects
# Add billing_type column for all projects
# Start by assuming all are hourly
projects["billing_type"] = "Hourly"
# Mark fixed-fee projects (those with missing hours but a valid total_billed)
fixed_fee_mask = projects["hours_logged"].isna()
projects.loc[fixed_fee_mask, "billing_type"] = "Fixed Fee"
# Now fill missing hours with 0
projects["hours_logged"] = projects["hours_logged"].fillna(0)
print("After handling:")
print(f" Missing hours_logged: {projects['hours_logged'].isna().sum()}")
print(f"\nBilling type distribution:")
print(projects["billing_type"].value_counts())
print(f"\nFixed-fee projects:")
print(projects[projects["billing_type"] == "Fixed Fee"][
["project_id", "project_name", "billing_type", "hours_logged", "total_billed"]
].to_string(index=False))
Output:
After handling:
Missing hours_logged: 0
Billing type distribution:
Hourly 22
Fixed Fee 2
Fixed-fee projects:
project_id project_name billing_type hours_logged total_billed
P006 Brand Workshop Fixed Fee 0.0 3500.0
P011 Logo Consult Fixed Fee 0.0 1200.0
cleaning_log.append(
"Step 3 — Handled 2 missing hours_logged values (P006, P011). "
"These are fixed-fee projects. Set hours to 0 and added 'billing_type' "
"column ('Hourly' / 'Fixed Fee')."
)
Step 4: Convert Date Columns to datetime
Both start_date and end_date are stored as strings. Maya needs them as datetime objects for duration calculations.
print("\n=== STEP 4: CONVERT DATE COLUMNS ===")
print(f"start_date dtype before: {projects['start_date'].dtype}")
print(f"Sample start_dates: {projects['start_date'].head(5).tolist()}")
print(f"end_date dtype before: {projects['end_date'].dtype}")
print(f"Missing end_dates: {projects['end_date'].isna().sum()} "
f"(expected — active projects have no end date)")
# The start_dates look uniform (ISO format) — check before converting
date_sample = projects["start_date"].head(10)
print(f"\nFirst 10 start_dates: {date_sample.tolist()}")
# Convert start_date — all appear to be ISO format
projects["start_date"] = pd.to_datetime(projects["start_date"], errors="coerce")
# Convert end_date — same format, NaT for missing (active projects)
projects["end_date"] = pd.to_datetime(projects["end_date"], errors="coerce")
print(f"\nstart_date dtype after: {projects['start_date'].dtype}")
print(f"end_date dtype after: {projects['end_date'].dtype}")
print(f"NaT in end_date: {projects['end_date'].isna().sum()} (active projects)")
print(f"NaT in start_date: {projects['start_date'].isna().sum()} (errors if > 0)")
Output:
start_date dtype before: object
Sample start_dates: ['2020-03-15', '2020-04-01', '2020-05-10', '2020-07-01', '2020-09-01']
end_date dtype before: object
Missing end_dates: 6 (expected — active projects have no end date)
First 10 start_dates: ['2020-03-15', '2020-04-01', '2020-05-10', '2020-07-01', '2020-09-01', '2020-10-01', '2021-01-10', '2021-03-01', '2021-05-15', '2021-08-01']
start_date dtype after: datetime64[ns]
end_date dtype after: datetime64[ns]
NaT in end_date: 6 (active projects)
NaT in start_date: 0 (errors if > 0)
No parsing errors on start_date. Six missing end_date values: these are the projects (In Progress, On Hold, Cancelled) that have no end date recorded.
# Now calculate project duration for completed projects
completed = projects[projects["status"] == "Completed"].copy()
completed["duration_days"] = (
completed["end_date"] - completed["start_date"]
).dt.days
print(f"\nCompleted project duration stats (days):")
print(completed["duration_days"].describe().round(0))
avg_duration = completed["duration_days"].mean()
print(f"\nAverage completed project duration: {avg_duration:.0f} days "
f"({avg_duration/30:.1f} months)")
Output:
Completed project duration stats (days):
count 18.0
mean 108.0
std 42.0
min 29.0
25% 82.0
50% 114.0
75% 138.0
max 169.0
Average completed project duration: 108 days (3.6 months)
This analysis was impossible while the dates were stored as strings. The average completed project runs about 3.6 months, useful information for project planning and client proposals.
cleaning_log.append(
"Step 4 — Converted start_date and end_date to datetime64. "
"0 start_date conversion failures. "
"6 end_date NaT values (expected: active/cancelled projects)."
)
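When a column is supposed to contain exactly one format, passing an explicit format string is stricter than letting pandas infer: anything that deviates becomes NaT and is immediately countable. A sketch with hypothetical values (the second has an invalid month, the third is the wrong format entirely):

```python
import pandas as pd

dates = pd.Series(["2020-03-15", "2020-13-01", "15/03/2020"])

# Require strict ISO YYYY-MM-DD; deviations become NaT instead of being
# silently parsed under some other inferred format
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
print(parsed.isna().sum())  # 2
```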
Step 5: Fix Client Data — Dates and "None" Tier
Now Maya turns to the clients DataFrame:
print("\n=== STEP 5: CLEAN CLIENT DATA ===")
print("client_tier before:")
print(clients["client_tier"].value_counts(dropna=False))
print()
print("first_contact_date sample:")
print(clients["first_contact_date"].tolist())
Output:
client_tier before:
Gold 2
Silver 2
Bronze 1
None 1 ← string "None", not actual NaN
first_contact_date sample:
['2020-03-01', '2020-04-01', '05/10/2020', '2020-07-01', 'September 2020', '2022-05-15']
Fixing client_tier — the string "None" problem:
# Replace the string "None" with actual NaN
clients["client_tier"] = clients["client_tier"].replace("None", np.nan)
print("client_tier after replacing string 'None' with NaN:")
print(clients["client_tier"].value_counts(dropna=False))
# LivTech Inc. (C004) has no tier assigned — Maya checks her notes
# and determines they should be "Bronze" (newer client, lower engagement)
clients.loc[clients["client_name"] == "LivTech Inc.", "client_tier"] = "Bronze"
print("\nclient_tier after filling LivTech:")
print(clients["client_tier"].value_counts(dropna=False))
Output:
client_tier after replacing string 'None' with NaN:
Gold 2
Silver 2
Bronze 1
NaN 1
client_tier after filling LivTech:
Gold 2
Silver 2
Bronze 2
Fixing first_contact_date — mixed formats:
print("\nfirst_contact_date conversion:")
print("Raw values:")
for i, val in enumerate(clients["first_contact_date"]):
print(f" [{i}] '{val}'")
# format='mixed' (pandas >= 2.0) parses each element independently;
# errors='coerce' turns anything unparseable into NaT.
# 'September 2020' is month-year only — it parses, but defaults to
# September 1, 2020 (the first of the month)
clients["first_contact_date"] = pd.to_datetime(
    clients["first_contact_date"],
    format="mixed",
    errors="coerce"
)
print("\nAfter conversion:")
for name, parsed in zip(clients["client_name"], clients["first_contact_date"]):
    date_str = parsed.strftime("%Y-%m-%d") if pd.notna(parsed) else "NaT"
    print(f"  {name}: {date_str}")
Output:
first_contact_date conversion:
Raw values:
[0] '2020-03-01'
[1] '2020-04-01'
[2] '05/10/2020'
[3] '2020-07-01'
[4] 'September 2020'
[5] '2022-05-15'
After conversion:
Brightfield LLC: 2020-03-01
Nova Dynamics: 2020-04-01
Harrington & Sons: 2020-05-10
LivTech Inc.: 2020-07-01
OakPath Media: 2020-09-01 ← "September 2020" → Sept 1
RiverStone Co: 2022-05-15
"September 2020" was parsed as September 1, 2020 — the first day of that month. This is a reasonable interpretation (Maya knows this client was acquired sometime in September 2020, just not the exact day). She adds a note:
# Add a flag for dates where only month/year was known
clients["contact_date_approx"] = clients["client_name"] == "OakPath Media"
# (In a real scenario, this flag would be set for all rows where the
# original date was month/year only)
cleaning_log.append(
"Step 5 — Cleaned clients data: replaced string 'None' in client_tier with NaN, "
"filled LivTech Inc. as 'Bronze'. "
"Converted first_contact_date from mixed formats to datetime64. "
"'September 2020' defaulted to 2020-09-01 (flagged as approximate)."
)
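The generalization noted in the comment above, flagging every row whose original date was month/year only, can be done by matching the raw strings against a pattern before conversion. A sketch under that assumption (the raw strings must still be available, e.g. from a saved copy):

```python
import pandas as pd

# Hypothetical raw date strings, captured before pd.to_datetime
raw_dates = pd.Series(["2020-03-01", "September 2020", "May 2021", "05/10/2020"])

# A month name followed by a 4-digit year, with no day component
month_year_only = raw_dates.str.match(r"[A-Za-z]+ \d{4}$")
print(month_year_only.tolist())  # [False, True, True, False]
```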
Step 6: Final Validation
print("\n=== STEP 6: FINAL VALIDATION ===")
checks = []
# Projects checks
checks.append(("hourly_rate is float64",
str(projects["hourly_rate"].dtype) == "float64",
f"dtype: {projects['hourly_rate'].dtype}"))
checks.append(("No $ in hourly_rate",
not projects["hourly_rate"].astype(str).str.contains(r"\$").any(),
"checked"))
checks.append(("status has 4 canonical values",
set(projects["status"].unique()) ==
{"Completed", "In Progress", "On Hold", "Cancelled"},
f"found: {set(projects['status'].unique())}"))
checks.append(("No missing hours_logged",
projects["hours_logged"].isna().sum() == 0,
f"missing: {projects['hours_logged'].isna().sum()}"))
checks.append(("start_date is datetime",
str(projects["start_date"].dtype).startswith("datetime"),
f"dtype: {projects['start_date'].dtype}"))
checks.append(("billing_type column exists",
"billing_type" in projects.columns,
f"values: {projects['billing_type'].value_counts().to_dict()}"))
# Clients checks
checks.append(("No string 'None' in client_tier",
"None" not in clients["client_tier"].values,
f"values: {clients['client_tier'].unique().tolist()}"))
checks.append(("No missing client_tier",
clients["client_tier"].isna().sum() == 0,
f"missing: {clients['client_tier'].isna().sum()}"))
checks.append(("first_contact_date is datetime",
str(clients["first_contact_date"].dtype).startswith("datetime"),
f"dtype: {clients['first_contact_date'].dtype}"))
# Print
for check_name, passed, detail in checks:
    status = "PASS" if passed else "FAIL"
    print(f"  [{status}] {check_name} — {detail}")
all_passed = all(p for _, p, _ in checks)
print(f"\n{'All checks passed!' if all_passed else 'Some checks failed.'}")
Output:
[PASS] hourly_rate is float64 — dtype: float64
[PASS] No $ in hourly_rate — checked
[PASS] status has 4 canonical values — found: {'Completed', 'In Progress', 'On Hold', 'Cancelled'}
[PASS] No missing hours_logged — missing: 0
[PASS] start_date is datetime — dtype: datetime64[ns]
[PASS] billing_type column exists — values: {'Hourly': 22, 'Fixed Fee': 2}
[PASS] No string 'None' in client_tier — values: ['Gold', 'Silver', 'Bronze']
[PASS] No missing client_tier — missing: 0
[PASS] first_contact_date is datetime — dtype: datetime64[ns]
All checks passed!
Step 7: What Clean Data Enables
Now that the data is clean, Maya runs a quick analysis she could not have run before:
print("\n=== ANALYSIS NOW POSSIBLE ON CLEAN DATA ===")
# Revenue by status (only meaningful with clean status column)
print("Total billed by status:")
print(projects.groupby("status")["total_billed"].sum().apply(
lambda x: f"${x:,.2f}"
))
# Revenue trend by year (only possible with datetime start_date)
projects["year"] = projects["start_date"].dt.year
print("\nTotal billed by year:")
print(projects.groupby("year")["total_billed"].sum().apply(
lambda x: f"${x:,.2f}"
))
# Average rate by year (shows Maya's rate increases over time)
print("\nAverage hourly rate by year (career progression):")
hourly_only = projects[projects["billing_type"] == "Hourly"]
print(hourly_only.groupby("year")["hourly_rate"].mean().apply(
lambda x: f"${x:.2f}/hr"
))
# Revenue per client
print("\nTotal billed by client (top 4):")
top = projects.groupby("client")["total_billed"].sum().nlargest(4)
print(top.apply(lambda x: f"${x:,.2f}"))
Output:
Total billed by status:
status
Cancelled $2,400.00
Completed $223,025.00
In Progress $28,950.00
On Hold $4,000.00
Total billed by year:
year
2020 $48,375.00
2021 $58,650.00
2022 $84,350.00
2023 $67,000.00
Average hourly rate by year (career progression):
year
2020 $130.00/hr
2021 $150.00/hr
2022 $175.00/hr
2023 $200.00/hr
Total billed by client (top 4):
client
Brightfield LLC $53,662.50
Nova Dynamics $50,100.00
Harrington & Sons $48,950.00
RiverStone Co $44,625.00
Maya's billing has grown: $48,375 in 2020, $58,650 in 2021, and a peak of $84,350 in 2022; 2023 stands at $67,000 with several projects still in progress. Her average hourly rate has risen every year, from $130/hr in 2020 to $200/hr in 2023. Brightfield LLC is her top client by total billing. None of this was visible or calculable from the raw, messy data.
The Cleaning Log
print("\n=== CLEANING LOG ===")
for i, entry in enumerate(cleaning_log, 1):
    print(f"\n[{i}] {entry}")
Output:
[1] Step 1 — Fixed hourly_rate: stripped '$', converted 11 string variants
to 4 clean float values: [125.0, 150.0, 175.0, 200.0]. 0 conversion failures.
[2] Step 2 — Standardized status using .map(): 9 variants → 4 canonical values
(Completed: 18, In Progress: 4, On Hold: 1, Cancelled: 1). 0 unmapped.
[3] Step 3 — Handled 2 missing hours_logged values (P006, P011). These are fixed-fee
projects. Set hours to 0 and added 'billing_type' column ('Hourly' / 'Fixed Fee').
[4] Step 4 — Converted start_date and end_date to datetime64. 0 start_date
conversion failures. 6 end_date NaT values (expected: active/cancelled projects).
[5] Step 5 — Cleaned clients data: replaced string 'None' in client_tier with NaN,
filled LivTech Inc. as 'Bronze'. Converted first_contact_date from mixed
formats to datetime64. 'September 2020' defaulted to 2020-09-01 (flagged).
What Maya Learned
Manual data systems accumulate inconsistencies over time. Maya created this spreadsheet herself, and it still ended up with 9 status variants and mixed date formats. If even a careful, technically-minded person produces this kind of inconsistency in a personal spreadsheet, it is almost guaranteed to happen in any business system touched by multiple people over multiple years.
Type correction unlocks analysis. Every calculation Maya tried to do before cleaning either failed or gave wrong results. The hourly_rate dollar-sign problem prevented revenue calculations. The string dates prevented duration analysis. After one cleaning session, her full career revenue trend was suddenly visible.
.map() is better than a chain of .replace() calls. A single dictionary mapping every status variant to its canonical value is cleaner, more readable, and easier to maintain than a separate .replace() call for each variant. The bonus: any unmapped value becomes NaN, a built-in error-detection mechanism.
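The contrast is easiest to see in miniature (values hypothetical): chained .replace() calls repeat boilerplate for every variant and let unknown values pass through silently, while .map() centralizes the mapping and surfaces anything it does not recognize:

```python
import pandas as pd

s = pd.Series(["done", "DONE", "wip"])

# Chained .replace(): verbose, and the unknown "wip" slips through unchanged
chained = s.replace("done", "Completed").replace("DONE", "Completed")
print(chained.tolist())  # ['Completed', 'Completed', 'wip']

# .map(): one dictionary, and the unknown "wip" surfaces as NaN
mapped = s.map({"done": "Completed", "DONE": "Completed"})
print(mapped.tolist())   # ['Completed', 'Completed', nan]
```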
Not all "missing" values mean the same thing. The 7 missing end_date values are expected (active projects). The 2 missing hours_logged values are also expected (fixed-fee projects). Neither warrants filling with median or dropping rows. Understanding the business context is what drives the correct decision.
Complete Cleaning Code
import pandas as pd
import numpy as np
# Load
projects = pd.read_csv("maya_projects.csv", dtype={"project_id": str})
clients = pd.read_excel("maya_clients.xlsx", sheet_name="Active Clients")
# 1. Fix hourly_rate
projects["hourly_rate"] = (
projects["hourly_rate"]
.astype(str)
.str.replace("$", "", regex=False)
.str.strip()
)
projects["hourly_rate"] = pd.to_numeric(projects["hourly_rate"], errors="coerce")
# 2. Standardize status
status_map = {
"Completed": "Completed", "completed": "Completed",
"Complete": "Completed", "COMPLETED": "Completed", "DONE": "Completed",
"In Progress": "In Progress", "in progress": "In Progress",
"On Hold": "On Hold", "Cancelled": "Cancelled", "Canceled": "Cancelled",
}
projects["status"] = projects["status"].map(status_map)
# 3. Handle missing hours
projects["billing_type"] = "Hourly"
projects.loc[projects["hours_logged"].isna(), "billing_type"] = "Fixed Fee"
projects["hours_logged"] = projects["hours_logged"].fillna(0)
# 4. Convert dates
projects["start_date"] = pd.to_datetime(projects["start_date"], errors="coerce")
projects["end_date"] = pd.to_datetime(projects["end_date"], errors="coerce")
# 5. Clean clients
clients["client_tier"] = clients["client_tier"].replace("None", np.nan)
clients.loc[clients["client_name"] == "LivTech Inc.", "client_tier"] = "Bronze"
clients["first_contact_date"] = pd.to_datetime(
    clients["first_contact_date"], format="mixed", errors="coerce"
)
# Save
projects.to_csv("maya_projects_clean.csv", index=False)
print("Cleaning complete.")