Case Study 12-2: Maya Cleans Her Project Data
Character: Maya Reyes (Freelance Business Consultant)
Setting: Maya's home office, the day after she explored her data in Chapter 11
Skill focus: Standardizing status values with .map(), filling missing numeric values, handling mixed date formats, correcting a column stored as string when it should be float
The Recap
In Chapter 11, Maya explored maya_projects.csv and found four data quality issues:
- Issue #1: status column has 9 distinct values representing 4 intended statuses
- Issue #2: hourly_rate is object type (some values have a $ prefix)
- Issue #3: hours_logged missing for 2 early fixed-fee projects
- Issue #4: first_contact_date in the client Excel file appears in three different formats
Now she cleans each issue systematically. This is not just tidying up — it matters for her upcoming portfolio presentation and for generating accurate revenue reports for her accountant.
Setting Up
import pandas as pd
import numpy as np
from io import StringIO
# Simulate the projects data Maya has
# (matches what was described in the Chapter 11 case study)
projects_csv = """project_id,project_name,client,start_date,end_date,status,hours_logged,hourly_rate,total_billed
P001,Market Entry Study,Brightfield LLC,2020-03-15,2020-06-30,Completed,82.5,$125.00,10312.50
P002,Website Copy Refresh,Nova Dynamics,2020-04-01,2020-05-15,completed,24.0,$125,3000.00
P003,Supply Chain Audit,Harrington & Sons,2020-05-10,2020-09-30,Complete,145.0,150.0,21750.00
P004,Brand Strategy Deck,LivTech Inc.,2020-07-01,2020-08-15,DONE,38.5,$125.00,4812.50
P005,Competitive Analysis,Brightfield LLC,2020-09-01,2020-10-31,Completed,40.0,$125,5000.00
P006,Brand Workshop,OakPath Media,2020-10-01,2020-12-15,Completed,,125.0,3500.00
P007,Process Improvement,Nova Dynamics,2021-01-10,2021-04-30,Completed,110.0,$150,16500.00
P008,Market Segmentation,Harrington & Sons,2021-03-01,2021-07-31,Completed,88.0,$150.00,13200.00
P009,Growth Strategy,Brightfield LLC,2021-05-15,2021-09-30,Completed,120.0,150,18000.00
P010,Tech Stack Review,LivTech Inc.,2021-08-01,2021-12-15,COMPLETED,65.0,$150.00,9750.00
P011,Logo Consult,Brightfield LLC,2021-09-01,2021-09-30,Completed,,125.0,1200.00
P012,Sales Process Audit,OakPath Media,2022-01-10,2022-04-30,completed,95.0,$175.00,16625.00
P013,Pricing Strategy,Nova Dynamics,2022-03-01,2022-06-30,Completed,72.0,$175,12600.00
P014,Operations Review,RiverStone Co,2022-05-15,2022-10-31,Completed,135.0,$175.00,23625.00
P015,CRM Implementation,LivTech Inc.,2022-07-01,,In Progress,42.0,$175,7350.00
P016,Customer Journey Map,Brightfield LLC,2022-09-01,2022-12-15,Completed,58.0,$175.00,10150.00
P017,Vendor Assessment,Harrington & Sons,2022-11-01,2023-02-28,DONE,80.0,$175.00,14000.00
P018,Digital Transform Plan,CoreTech Ltd,2023-01-15,,In Progress,35.0,$200,7000.00
P019,Supply Chain Opt,RiverStone Co,2023-03-01,2023-07-31,Completed,105.0,$200.00,21000.00
P020,Go-to-Market Plan,OakPath Media,2023-04-01,,in progress,28.0,$200,5600.00
P021,Competitor Intelligence,Nova Dynamics,2023-05-15,2023-09-30,Completed,90.0,$200.00,18000.00
P022,Expansion Strategy,CoreTech Ltd,2023-07-01,,On Hold,20.0,$200,4000.00
P023,Cost Reduction Study,Brightfield LLC,2023-08-15,,In Progress,45.0,$200.00,9000.00
P024,Talent Strategy,LivTech Inc.,2023-10-01,,Cancelled,12.0,$200,2400.00
"""
# Simulate the clients data
clients_data = {
"client_id": ["C001", "C002", "C003", "C004", "C005", "C006"],
"client_name": ["Brightfield LLC", "Nova Dynamics", "Harrington & Sons",
"LivTech Inc.", "OakPath Media", "RiverStone Co"],
"industry": ["Consulting", "Technology", "Manufacturing",
"Technology", "Creative", "Logistics"],
"first_contact_date": ["2020-03-01", "2020-04-01", "05/10/2020",
"2020-07-01", "September 2020", "2022-05-15"],
"client_tier": ["Gold", "Silver", "Gold", "None", "Silver", "Bronze"]
}
# Load the data
projects = pd.read_csv(
StringIO(projects_csv),
dtype={"project_id": str}
# Note: NOT using parse_dates — we will handle dates manually
)
clients = pd.DataFrame(clients_data)
# Save a raw copy
projects_raw = projects.copy()
print(f"Projects loaded: {len(projects)} rows")
print(f"Clients loaded: {len(clients)} rows")
Pre-Cleaning Check
print("\n=== PRE-CLEANING STATE ===")
projects.info()
print()
print("Status value counts:")
print(projects["status"].value_counts(dropna=False))
print()
print("hourly_rate dtype:", projects["hourly_rate"].dtype)
print("hourly_rate sample:", projects["hourly_rate"].head(5).tolist())
print()
print("Missing hours_logged:", projects["hours_logged"].isna().sum())
Output:
Projects loaded: 24 rows
Clients loaded: 6 rows
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 project_id 24 non-null object
1 project_name 24 non-null object
2 client 24 non-null object
3 start_date 24 non-null object
4 end_date 18 non-null object
5 status 24 non-null object
6 hours_logged 22 non-null float64
7 hourly_rate 24 non-null object ← should be float!
8 total_billed 24 non-null float64
Status value counts:
Completed 12
In Progress 3
completed 2
DONE 2
Complete 1
COMPLETED 1
in progress 1
On Hold 1
Cancelled 1
hourly_rate dtype: object
hourly_rate sample: ['$125.00', '$125', '150.0', '$125.00', '$125']
Missing hours_logged: 2
Issues confirmed. Time to work through the cleaning checklist.
Step 1: Fix hourly_rate — Strip Dollar Signs and Convert to Float
Maya's hourly_rate column is a mess of "$125.00", "$125", "150.0", and "150". She needs all of these to become the float values they represent.
print("\n=== STEP 1: FIX hourly_rate ===")
print("Before:")
print(f" dtype: {projects['hourly_rate'].dtype}")
print(f" unique values: {sorted(projects['hourly_rate'].unique().tolist())}")
# Step 1a: Convert to string to ensure we can use .str methods
projects["hourly_rate"] = projects["hourly_rate"].astype(str)
# Step 1b: Strip dollar signs
projects["hourly_rate"] = projects["hourly_rate"].str.replace("$", "", regex=False)
# Step 1c: Strip any whitespace that might remain
projects["hourly_rate"] = projects["hourly_rate"].str.strip()
# Step 1d: Convert to numeric
projects["hourly_rate"] = pd.to_numeric(projects["hourly_rate"], errors="coerce")
print("\nAfter:")
print(f" dtype: {projects['hourly_rate'].dtype}")
print(f" unique values: {sorted(projects['hourly_rate'].dropna().unique().tolist())}")
print(f" NaN after conversion: {projects['hourly_rate'].isna().sum()}")
Output:
Before:
dtype: object
unique values: ['$125', '$125.00', '$150', '$150.00', '$175', '$175.00', '$200', '$200.00', '125.0', '150', '150.0']
After:
dtype: float64
unique values: [125.0, 150.0, 175.0, 200.0]
NaN after conversion: 0
Four clean hourly rates: $125, $150, $175, $200 — her four rate tiers over her career. No NaN values — every row converted successfully.
cleaning_log = []
cleaning_log.append(
"Step 1 — Fixed hourly_rate: stripped '$', converted 11 string variants "
"to 4 clean float values: [125.0, 150.0, 175.0, 200.0]. 0 conversion failures."
)
Step 2: Standardize Status Values with .map()
Maya has 9 distinct status values that should collapse to 4 canonical ones.
print("\n=== STEP 2: STANDARDIZE STATUS ===")
print("Before:")
print(projects["status"].value_counts(dropna=False))
# Define the canonical mapping
status_map = {
"Completed": "Completed",
"completed": "Completed",
"Complete": "Completed",
"COMPLETED": "Completed",
"DONE": "Completed",
# Active
"In Progress": "In Progress",
"in progress": "In Progress",
# Other
"On Hold": "On Hold",
"on hold": "On Hold",
# Ended
"Cancelled": "Cancelled",
"Canceled": "Cancelled",
}
projects["status"] = projects["status"].map(status_map)
print("\nAfter:")
print(projects["status"].value_counts(dropna=False))
# Defensive check: any unmapped values became NaN.
# The column itself was overwritten, so look up the originals in projects_raw.
n_unmapped = projects["status"].isna().sum()
if n_unmapped > 0:
    print(f"\n*** WARNING: {n_unmapped} values not in status_map (became NaN)")
    print("Update status_map to cover these values (originals from the raw copy):")
    print(projects_raw.loc[projects["status"].isna(), ["project_id", "status"]])
Output:
Before:
Completed 12
In Progress 3
completed 2
DONE 2
Complete 1
COMPLETED 1
in progress 1
On Hold 1
Cancelled 1
After:
Completed 18
In Progress 4
On Hold 1
Cancelled 1
Nine messy variants collapsed to four clean values. No NaN — every variant was accounted for in the mapping.
cleaning_log.append(
"Step 2 — Standardized status using .map(): 9 variants → 4 canonical values "
"(Completed: 18, In Progress: 4, On Hold: 1, Cancelled: 1). 0 unmapped."
)
Step 3: Handle Missing hours_logged
Two projects have no hours logged. Maya checks which ones:
print("\n=== STEP 3: HANDLE MISSING hours_logged ===")
missing_hours = projects[projects["hours_logged"].isna()]
print(f"Projects missing hours_logged: {len(missing_hours)}")
print(missing_hours[
["project_id", "project_name", "client", "status",
"hours_logged", "total_billed"]
].to_string(index=False))
Output:
Projects missing hours_logged: 2
project_id project_name client status hours_logged total_billed
P006 Brand Workshop OakPath Media Completed NaN 3500.0
P011 Logo Consult Brightfield LLC Completed NaN 1200.0
Maya recognizes these: P006 (Brand Workshop) and P011 (Logo Consult) were fixed-fee projects from 2020-2021 where she agreed to a flat rate and never tracked hours. The total_billed is present; only hours_logged is missing.
She decides:
1. Set hours_logged to 0 for these projects (they were not hourly)
2. Add a billing_type column to distinguish fixed-fee from hourly projects
# Add billing_type column for all projects
# Start by assuming all are hourly
projects["billing_type"] = "Hourly"
# Mark fixed-fee projects (those with missing hours but a valid total_billed)
fixed_fee_mask = projects["hours_logged"].isna()
projects.loc[fixed_fee_mask, "billing_type"] = "Fixed Fee"
# Now fill missing hours with 0
projects["hours_logged"] = projects["hours_logged"].fillna(0)
print("After handling:")
print(f" Missing hours_logged: {projects['hours_logged'].isna().sum()}")
print(f"\nBilling type distribution:")
print(projects["billing_type"].value_counts())
print(f"\nFixed-fee projects:")
print(projects[projects["billing_type"] == "Fixed Fee"][
["project_id", "project_name", "billing_type", "hours_logged", "total_billed"]
].to_string(index=False))
Output:
After handling:
Missing hours_logged: 0
Billing type distribution:
Hourly 22
Fixed Fee 2
Fixed-fee projects:
project_id project_name billing_type hours_logged total_billed
P006 Brand Workshop Fixed Fee 0.0 3500.0
P011 Logo Consult Fixed Fee 0.0 1200.0
cleaning_log.append(
"Step 3 — Handled 2 missing hours_logged values (P006, P011). "
"These are fixed-fee projects. Set hours to 0 and added 'billing_type' "
"column ('Hourly' / 'Fixed Fee')."
)
Step 4: Convert Date Columns to datetime
Both start_date and end_date are stored as strings. Maya needs them as datetime objects for duration calculations.
print("\n=== STEP 4: CONVERT DATE COLUMNS ===")
print(f"start_date dtype before: {projects['start_date'].dtype}")
print(f"Sample start_dates: {projects['start_date'].head(5).tolist()}")
print(f"end_date dtype before: {projects['end_date'].dtype}")
print(f"Missing end_dates: {projects['end_date'].isna().sum()} "
f"(expected — active projects have no end date)")
# The start_dates look uniform (ISO format) — check before converting
date_sample = projects["start_date"].head(10)
print(f"\nFirst 10 start_dates: {date_sample.tolist()}")
# Convert start_date — all appear to be ISO format
projects["start_date"] = pd.to_datetime(projects["start_date"], errors="coerce")
# Convert end_date — same format, NaT for missing (active projects)
projects["end_date"] = pd.to_datetime(projects["end_date"], errors="coerce")
print(f"\nstart_date dtype after: {projects['start_date'].dtype}")
print(f"end_date dtype after: {projects['end_date'].dtype}")
print(f"NaT in end_date: {projects['end_date'].isna().sum()} (active projects)")
print(f"NaT in start_date: {projects['start_date'].isna().sum()} (errors if > 0)")
Output:
start_date dtype before: object
Sample start_dates: ['2020-03-15', '2020-04-01', '2020-05-10', '2020-07-01', '2020-09-01']
end_date dtype before: object
Missing end_dates: 6 (expected — active projects have no end date)
First 10 start_dates: ['2020-03-15', '2020-04-01', '2020-05-10', '2020-07-01', '2020-09-01', '2020-10-01', '2021-01-10', '2021-03-01', '2021-05-15', '2021-08-01']
start_date dtype after: datetime64[ns]
end_date dtype after: datetime64[ns]
NaT in end_date: 6 (active projects)
NaT in start_date: 0 (errors if > 0)
No parsing errors on start_date. Six missing end_date values: these are the projects (In Progress, On Hold, Cancelled) that have no end date recorded.
# Now calculate project duration for completed projects
completed = projects[projects["status"] == "Completed"].copy()
completed["duration_days"] = (
completed["end_date"] - completed["start_date"]
).dt.days
print(f"\nCompleted project duration stats (days):")
print(completed["duration_days"].describe().round(0))
avg_duration = completed["duration_days"].mean()
print(f"\nAverage completed project duration: {avg_duration:.0f} days "
f"({avg_duration/30:.1f} months)")
Output:
Completed project duration stats (days):
count 18.0
mean 108.0
std 42.0
min 29.0
25% 82.0
50% 114.0
75% 138.0
max 169.0
Average completed project duration: 108 days (3.6 months)
This analysis was impossible while the dates were stored as strings. The average completed project runs about 3.6 months, useful information for project planning and client proposals.
cleaning_log.append(
"Step 4 — Converted start_date and end_date to datetime64. "
"0 start_date conversion failures. "
"6 end_date NaT values (expected: active/cancelled projects)."
)
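When a column is supposed to contain exactly one format, passing an explicit format string is stricter than letting pandas infer: anything that deviates becomes NaT and is immediately countable. A sketch with hypothetical values (the second has an invalid month, the third is the wrong format entirely):

```python
import pandas as pd

dates = pd.Series(["2020-03-15", "2020-13-01", "15/03/2020"])

# Require strict ISO YYYY-MM-DD; deviations become NaT instead of being
# silently parsed under some other inferred format
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
print(parsed.isna().sum())  # 2
```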
Step 5: Fix Client Data — Dates and "None" Tier
Now Maya turns to the clients DataFrame:
print("\n=== STEP 5: CLEAN CLIENT DATA ===")
print("client_tier before:")
print(clients["client_tier"].value_counts(dropna=False))
print()
print("first_contact_date sample:")
print(clients["first_contact_date"].tolist())
Output:
client_tier before:
Gold 2
Silver 2
Bronze 1
None 1 ← string "None", not actual NaN
first_contact_date sample:
['2020-03-01', '2020-04-01', '05/10/2020', '2020-07-01', 'September 2020', '2022-05-15']
Fixing client_tier — the string "None" problem:
# Replace the string "None" with actual NaN
clients["client_tier"] = clients["client_tier"].replace("None", np.nan)
print("client_tier after replacing string 'None' with NaN:")
print(clients["client_tier"].value_counts(dropna=False))
# LivTech Inc. (C004) has no tier assigned — Maya checks her notes
# and determines they should be "Bronze" (newer client, lower engagement)
clients.loc[clients["client_name"] == "LivTech Inc.", "client_tier"] = "Bronze"
print("\nclient_tier after filling LivTech:")
print(clients["client_tier"].value_counts(dropna=False))
Output:
client_tier after replacing string 'None' with NaN:
Gold 2
Silver 2
Bronze 1
NaN 1
client_tier after filling LivTech:
Gold 2
Silver 2
Bronze 2
Fixing first_contact_date — mixed formats:
print("\nfirst_contact_date conversion:")
print("Raw values:")
for i, val in enumerate(clients["first_contact_date"]):
print(f" [{i}] '{val}'")
# format='mixed' (pandas >= 2.0) parses each element independently;
# errors='coerce' turns anything unparseable into NaT.
# 'September 2020' is month-year only — it parses, but defaults to
# September 1, 2020 (the first of the month)
clients["first_contact_date"] = pd.to_datetime(
    clients["first_contact_date"],
    format="mixed",
    errors="coerce"
)
print("\nAfter conversion:")
for name, parsed in zip(clients["client_name"], clients["first_contact_date"]):
    date_str = parsed.strftime("%Y-%m-%d") if pd.notna(parsed) else "NaT"
    print(f"  {name}: {date_str}")
Output:
first_contact_date conversion:
Raw values:
[0] '2020-03-01'
[1] '2020-04-01'
[2] '05/10/2020'
[3] '2020-07-01'
[4] 'September 2020'
[5] '2022-05-15'
After conversion:
Brightfield LLC: 2020-03-01
Nova Dynamics: 2020-04-01
Harrington & Sons: 2020-05-10
LivTech Inc.: 2020-07-01
OakPath Media: 2020-09-01 ← "September 2020" → Sept 1
RiverStone Co: 2022-05-15
"September 2020" was parsed as September 1, 2020 — the first day of that month. This is a reasonable interpretation (Maya knows this client was acquired sometime in September 2020, just not the exact day). She adds a note:
# Add a flag for dates where only month/year was known
clients["contact_date_approx"] = clients["client_name"] == "OakPath Media"
# (In a real scenario, this flag would be set for all rows where the
# original date was month/year only)
cleaning_log.append(
"Step 5 — Cleaned clients data: replaced string 'None' in client_tier with NaN, "
"filled LivTech Inc. as 'Bronze'. "
"Converted first_contact_date from mixed formats to datetime64. "
"'September 2020' defaulted to 2020-09-01 (flagged as approximate)."
)
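The generalization noted in the comment above, flagging every row whose original date was month/year only, can be done by matching the raw strings against a pattern before conversion. A sketch under that assumption (the raw strings must still be available, e.g. from a saved copy):

```python
import pandas as pd

# Hypothetical raw date strings, captured before pd.to_datetime
raw_dates = pd.Series(["2020-03-01", "September 2020", "May 2021", "05/10/2020"])

# A month name followed by a 4-digit year, with no day component
month_year_only = raw_dates.str.match(r"[A-Za-z]+ \d{4}$")
print(month_year_only.tolist())  # [False, True, True, False]
```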
Step 6: Final Validation
print("\n=== STEP 6: FINAL VALIDATION ===")
checks = []
# Projects checks
checks.append(("hourly_rate is float64",
str(projects["hourly_rate"].dtype) == "float64",
f"dtype: {projects['hourly_rate'].dtype}"))
checks.append(("No $ in hourly_rate",
not projects["hourly_rate"].astype(str).str.contains(r"\$").any(),
"checked"))
checks.append(("status has 4 canonical values",
set(projects["status"].unique()) ==
{"Completed", "In Progress", "On Hold", "Cancelled"},
f"found: {set(projects['status'].unique())}"))
checks.append(("No missing hours_logged",
projects["hours_logged"].isna().sum() == 0,
f"missing: {projects['hours_logged'].isna().sum()}"))
checks.append(("start_date is datetime",
str(projects["start_date"].dtype).startswith("datetime"),
f"dtype: {projects['start_date'].dtype}"))
checks.append(("billing_type column exists",
"billing_type" in projects.columns,
f"values: {projects['billing_type'].value_counts().to_dict()}"))
# Clients checks
checks.append(("No string 'None' in client_tier",
"None" not in clients["client_tier"].values,
f"values: {clients['client_tier'].unique().tolist()}"))
checks.append(("No missing client_tier",
clients["client_tier"].isna().sum() == 0,
f"missing: {clients['client_tier'].isna().sum()}"))
checks.append(("first_contact_date is datetime",
str(clients["first_contact_date"].dtype).startswith("datetime"),
f"dtype: {clients['first_contact_date'].dtype}"))
# Print
for check_name, passed, detail in checks:
    status = "PASS" if passed else "FAIL"
    print(f"  [{status}] {check_name} — {detail}")
all_passed = all(p for _, p, _ in checks)
print(f"\n{'All checks passed!' if all_passed else 'Some checks failed.'}")
Output:
[PASS] hourly_rate is float64 — dtype: float64
[PASS] No $ in hourly_rate — checked
[PASS] status has 4 canonical values — found: {'Completed', 'In Progress', 'On Hold', 'Cancelled'}
[PASS] No missing hours_logged — missing: 0
[PASS] start_date is datetime — dtype: datetime64[ns]
[PASS] billing_type column exists — values: {'Hourly': 22, 'Fixed Fee': 2}
[PASS] No string 'None' in client_tier — values: ['Gold', 'Silver', 'Bronze']
[PASS] No missing client_tier — missing: 0
[PASS] first_contact_date is datetime — dtype: datetime64[ns]
All checks passed!
Step 7: What Clean Data Enables
Now that the data is clean, Maya runs a quick analysis she could not have run before:
print("\n=== ANALYSIS NOW POSSIBLE ON CLEAN DATA ===")
# Revenue by status (only meaningful with clean status column)
print("Total billed by status:")
print(projects.groupby("status")["total_billed"].sum().apply(
lambda x: f"${x:,.2f}"
))
# Revenue trend by year (only possible with datetime start_date)
projects["year"] = projects["start_date"].dt.year
print("\nTotal billed by year:")
print(projects.groupby("year")["total_billed"].sum().apply(
lambda x: f"${x:,.2f}"
))
# Average rate by year (shows Maya's rate increases over time)
print("\nAverage hourly rate by year (career progression):")
hourly_only = projects[projects["billing_type"] == "Hourly"]
print(hourly_only.groupby("year")["hourly_rate"].mean().apply(
lambda x: f"${x:.2f}/hr"
))
# Revenue per client
print("\nTotal billed by client (top 4):")
top = projects.groupby("client")["total_billed"].sum().nlargest(4)
print(top.apply(lambda x: f"${x:,.2f}"))
Output:
Total billed by status:
status
Cancelled $2,400.00
Completed $223,025.00
In Progress $28,950.00
On Hold $4,000.00
Total billed by year:
year
2020 $48,375.00
2021 $58,650.00
2022 $84,350.00
2023 $67,000.00
Average hourly rate by year (career progression):
year
2020 $130.00/hr
2021 $150.00/hr
2022 $175.00/hr
2023 $200.00/hr
Total billed by client (top 4):
client
Brightfield LLC $53,662.50
Nova Dynamics $50,100.00
Harrington & Sons $48,950.00
RiverStone Co $44,625.00
Maya's billing has grown: $48,375 in 2020, $58,650 in 2021, and a peak of $84,350 in 2022; 2023 stands at $67,000 with several projects still in progress. Her average hourly rate has risen every year, from $130/hr in 2020 to $200/hr in 2023. Brightfield LLC is her top client by total billing. None of this was visible or calculable from the raw, messy data.
The Cleaning Log
print("\n=== CLEANING LOG ===")
for i, entry in enumerate(cleaning_log, 1):
    print(f"\n[{i}] {entry}")
Output:
[1] Step 1 — Fixed hourly_rate: stripped '$', converted 11 string variants
to 4 clean float values: [125.0, 150.0, 175.0, 200.0]. 0 conversion failures.
[2] Step 2 — Standardized status using .map(): 9 variants → 4 canonical values
(Completed: 18, In Progress: 4, On Hold: 1, Cancelled: 1). 0 unmapped.
[3] Step 3 — Handled 2 missing hours_logged values (P006, P011). These are fixed-fee
projects. Set hours to 0 and added 'billing_type' column ('Hourly' / 'Fixed Fee').
[4] Step 4 — Converted start_date and end_date to datetime64. 0 start_date
conversion failures. 6 end_date NaT values (expected: active/cancelled projects).
[5] Step 5 — Cleaned clients data: replaced string 'None' in client_tier with NaN,
filled LivTech Inc. as 'Bronze'. Converted first_contact_date from mixed
formats to datetime64. 'September 2020' defaulted to 2020-09-01 (flagged).
What Maya Learned
Manual data systems accumulate inconsistencies over time. Maya created this spreadsheet herself, and it still ended up with 9 status variants and mixed date formats. If even a careful, technically-minded person produces this kind of inconsistency in a personal spreadsheet, it is almost guaranteed to happen in any business system touched by multiple people over multiple years.
Type correction unlocks analysis. Every calculation Maya tried to do before cleaning either failed or gave wrong results. The hourly_rate dollar-sign problem prevented revenue calculations. The string dates prevented duration analysis. After one cleaning session, her full career revenue trend was suddenly visible.
.map() is better than a chain of .replace() calls. A single dictionary mapping every status variant to its canonical value is cleaner, more readable, and easier to maintain than a separate .replace() call for each variant. The bonus: any unmapped value becomes NaN, a built-in error-detection mechanism.
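The contrast is easiest to see in miniature (values hypothetical): chained .replace() calls repeat boilerplate for every variant and let unknown values pass through silently, while .map() centralizes the mapping and surfaces anything it does not recognize:

```python
import pandas as pd

s = pd.Series(["done", "DONE", "wip"])

# Chained .replace(): verbose, and the unknown "wip" slips through unchanged
chained = s.replace("done", "Completed").replace("DONE", "Completed")
print(chained.tolist())  # ['Completed', 'Completed', 'wip']

# .map(): one dictionary, and the unknown "wip" surfaces as NaN
mapped = s.map({"done": "Completed", "DONE": "Completed"})
print(mapped.tolist())   # ['Completed', 'Completed', nan]
```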
Not all "missing" values mean the same thing. The 7 missing end_date values are expected (active projects). The 2 missing hours_logged values are also expected (fixed-fee projects). Neither warrants filling with median or dropping rows. Understanding the business context is what drives the correct decision.
Complete Cleaning Code
import pandas as pd
import numpy as np
# Load
projects = pd.read_csv("maya_projects.csv", dtype={"project_id": str})
clients = pd.read_excel("maya_clients.xlsx", sheet_name="Active Clients")
# 1. Fix hourly_rate
projects["hourly_rate"] = (
projects["hourly_rate"]
.astype(str)
.str.replace("$", "", regex=False)
.str.strip()
)
projects["hourly_rate"] = pd.to_numeric(projects["hourly_rate"], errors="coerce")
# 2. Standardize status
status_map = {
"Completed": "Completed", "completed": "Completed",
"Complete": "Completed", "COMPLETED": "Completed", "DONE": "Completed",
"In Progress": "In Progress", "in progress": "In Progress",
"On Hold": "On Hold", "Cancelled": "Cancelled", "Canceled": "Cancelled",
}
projects["status"] = projects["status"].map(status_map)
# 3. Handle missing hours
projects["billing_type"] = "Hourly"
projects.loc[projects["hours_logged"].isna(), "billing_type"] = "Fixed Fee"
projects["hours_logged"] = projects["hours_logged"].fillna(0)
# 4. Convert dates
projects["start_date"] = pd.to_datetime(projects["start_date"], errors="coerce")
projects["end_date"] = pd.to_datetime(projects["end_date"], errors="coerce")
# 5. Clean clients
clients["client_tier"] = clients["client_tier"].replace("None", np.nan)
clients.loc[clients["client_name"] == "LivTech Inc.", "client_tier"] = "Bronze"
clients["first_contact_date"] = pd.to_datetime(
    clients["first_contact_date"], format="mixed", errors="coerce"
)
# Save
projects.to_csv("maya_projects_clean.csv", index=False)
print("Cleaning complete.")