Case Study 34-02: Maya Builds a Project Overrun Predictor
Character: Maya Reyes, freelance business analytics consultant

Client: Thornfield Advisory, a 25-person management consulting firm

Setting: Thornfield has a persistent problem: projects that look profitable at proposal stage frequently run over budget. Claire Whitmore, the engagement director, wants a model that predicts overrun risk early enough to act on it.
The Problem
Thornfield Advisory closes roughly 80 consulting engagements per year. Each project is sold on a fixed-fee basis — the client pays an agreed amount regardless of how many hours Thornfield spends. When a project runs over its internal hour budget, it eats directly into margin.
Claire had pulled together data on every engagement from the past three years: 243 projects. Forty-four had run over budget by more than 20%. Those forty-four projects had cost Thornfield an average of $31,000 each in absorbed overrun costs — roughly $1.4 million in lost margin across three years.
"We know some kinds of projects always bleed," Claire told Maya at their first meeting. "Healthcare regulatory work, anything that involves a merger, multinational clients. But we take the work anyway because the relationships matter. What I want to know is: can we flag these at proposal stage so the engagement manager knows to price in a buffer?"
Maya examined the dataset. Thornfield had tracked everything that mattered at proposal stage: client industry, project type, team size, initial fee estimate, number of deliverables scoped, whether a legal review was required, previous work with the client, and the engagement manager assigned. She also had the post-engagement actuals: hours spent, final invoice amount, and whether the project had run over.
Step 1: Understanding the Data
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    classification_report, roc_auc_score, confusion_matrix,
)


def load_and_profile_engagements(csv_path: str) -> pd.DataFrame:
    """
    Load Thornfield engagement data and print a diagnostic profile.

    Args:
        csv_path: Path to the engagement history CSV.

    Returns:
        Cleaned DataFrame ready for feature engineering.
    """
    df = pd.read_csv(csv_path)
    print(f"Total projects: {len(df)}")
    print(
        f"Projects with >20% overrun: {df['over_budget'].sum()} "
        f"({df['over_budget'].mean():.1%})"
    )
    print("\nOverrun rate by project type:")
    print(
        df.groupby("project_type")["over_budget"]
        .mean().sort_values(ascending=False).to_string()
    )
    print("\nOverrun rate by client industry:")
    print(
        df.groupby("client_industry")["over_budget"]
        .mean().sort_values(ascending=False).to_string()
    )
    return df
```
When Maya ran the profile, the patterns jumped out immediately. Regulatory compliance projects had a 51% overrun rate. Healthcare clients ran over 44% of the time. Projects requiring legal review ran over 38% of the time versus 12% for those without. The raw data confirmed what Claire's team already knew anecdotally — and pointed toward which features would matter in a model.
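On a 243-row dataset, a group's overrun rate is only as trustworthy as the number of projects behind it, so it helps to report each rate alongside its group size. A minimal sketch of the idea on made-up data (the column names follow the Thornfield schema, but the values here are invented):

```python
import pandas as pd

# Invented mini-sample for illustration; not the Thornfield data.
df = pd.DataFrame({
    "project_type": ["Regulatory", "Regulatory", "Strategy", "Strategy", "Strategy"],
    "over_budget": [1, 1, 0, 0, 1],
})

# Named aggregation reports the rate together with the count behind it:
# a 51% rate over 4 projects and over 40 projects mean very different things.
summary = (
    df.groupby("project_type")["over_budget"]
    .agg(rate="mean", n="count")
    .sort_values("rate", ascending=False)
)
print(summary)
```

A rate computed over a handful of projects can swing wildly between samples; showing `n` next to it keeps the reader honest about which patterns are real.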
Step 2: Feature Engineering
The proposal-stage features available to Thornfield were a mix of categorical and numeric data. Maya's feature engineering work focused on creating variables that captured the known risk drivers:
```python
def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create predictive features from raw proposal-stage data.

    Args:
        df: Raw engagement DataFrame from load_and_profile_engagements().

    Returns:
        Feature-engineered DataFrame with encoded categoricals.
    """
    engineered = df.copy()

    # Interaction: fee per deliverable (proxy for scope tightness).
    # Tightly scoped projects (high fee per deliverable) tend to run leaner.
    engineered["fee_per_deliverable"] = (
        engineered["initial_fee_estimate"] / engineered["num_deliverables"].clip(lower=1)
    )

    # Interaction: team size relative to project duration.
    # Large teams on short projects create coordination overhead.
    engineered["team_density"] = (
        engineered["team_size"] / engineered["estimated_duration_weeks"].clip(lower=1)
    )

    # Flag: is this a new client? (No previous work history means more unknowns.)
    engineered["is_new_client"] = (engineered["prior_engagements"] == 0).astype(int)

    # Categorical encoding
    engineered = pd.get_dummies(
        engineered,
        columns=["client_industry", "project_type", "engagement_manager_tier"],
        drop_first=True,
    )

    # Drop columns not available at proposal stage
    cols_to_drop = [
        "project_id", "client_name", "actual_hours", "final_invoice",
        "engagement_manager_name",  # Use tier instead (anonymized)
    ]
    engineered.drop(columns=cols_to_drop, errors="ignore", inplace=True)
    return engineered
```
The two interaction terms were deliberate choices based on domain logic, not just data mining. A project scoped at $5,000 per deliverable will be managed very differently from one scoped at $50,000 per deliverable. And a team of eight consultants on a three-week engagement faces different coordination pressures than the same team on a six-month engagement.
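The fee-per-deliverable contrast can be made concrete with two hypothetical proposals (figures invented for illustration, not drawn from the Thornfield data):

```python
def fee_per_deliverable(initial_fee_estimate: float, num_deliverables: int) -> float:
    """Fee per scoped deliverable, guarding against a zero deliverable count."""
    return initial_fee_estimate / max(num_deliverables, 1)

# Same total fee, very different scope tightness (hypothetical numbers).
tight = fee_per_deliverable(150_000, 3)   # $50,000 per deliverable
loose = fee_per_deliverable(150_000, 30)  # $5,000 per deliverable
print(tight, loose)
```

The loosely scoped proposal packs ten times as many deliverables into the same fee, which in Thornfield's history correlated with overruns.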
Step 3: Model Training and Selection
With 243 projects, the dataset was small enough that cross-validation results could be meaningfully affected by random variance. Maya ran five-fold stratified cross-validation and examined not just mean performance but the spread.
```python
def compare_models(
    df_features: pd.DataFrame,
    target_col: str = "over_budget",
    random_state: int = 42,
) -> None:
    """
    Compare logistic regression vs. random forest with cross-validation.

    Args:
        df_features: Engineered feature DataFrame.
        target_col: Binary target column name.
        random_state: Random seed.
    """
    feature_cols = [c for c in df_features.columns if c != target_col]
    X = df_features[feature_cols]
    y = df_features[target_col]
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)

    logistic = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(
            class_weight="balanced",
            C=0.5,
            max_iter=1000,
            random_state=random_state,
        )),
    ])
    forest = RandomForestClassifier(
        n_estimators=200,
        max_depth=5,
        min_samples_leaf=8,
        class_weight="balanced",
        random_state=random_state,
    )

    for name, model in [("Logistic Regression", logistic), ("Random Forest", forest)]:
        f1_scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
        auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
        recall_scores = cross_val_score(model, X, y, cv=cv, scoring="recall")
        print(f"\n{name}:")
        print(f"  F1:     {f1_scores.mean():.3f} ± {f1_scores.std():.3f}")
        print(f"  AUC:    {auc_scores.mean():.3f} ± {auc_scores.std():.3f}")
        print(f"  Recall: {recall_scores.mean():.3f} ± {recall_scores.std():.3f}")
        print(f"  F1 scores by fold: {[f'{s:.3f}' for s in f1_scores]}")
```
The output:
```
Logistic Regression:
  F1:     0.571 ± 0.071
  AUC:    0.821 ± 0.048
  Recall: 0.682 ± 0.082
  F1 scores by fold: ['0.500', '0.632', '0.545', '0.600', '0.579']

Random Forest:
  F1:     0.593 ± 0.058
  AUC:    0.847 ± 0.039
  Recall: 0.705 ± 0.071
  F1 scores by fold: ['0.533', '0.632', '0.581', '0.611', '0.607']
```
Random Forest edged ahead on all three metrics. Equally important: the standard deviation was lower — more consistent performance across different subsets of the small dataset. With only 243 rows, consistency matters more than peak performance.
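One implementation note: each `cross_val_score` call refits the model from scratch, so scoring three metrics costs three full cross-validation passes. scikit-learn's `cross_validate` accepts a list of scorers and fits each fold once. A sketch on a synthetic stand-in dataset (the Thornfield data is private; the class balance roughly mirrors the 18% overrun rate):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in: 243 rows, ~18% positive class, like the real dataset.
X, y = make_classification(
    n_samples=243, n_features=12, weights=[0.82, 0.18], random_state=42
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(
    n_estimators=200, max_depth=5, min_samples_leaf=8,
    class_weight="balanced", random_state=42,
)

# One fit per fold; all three scorers evaluate the same fitted estimator.
scores = cross_validate(model, X, y, cv=cv, scoring=["f1", "roc_auc", "recall"])
for metric in ("f1", "roc_auc", "recall"):
    vals = scores[f"test_{metric}"]
    print(f"{metric}: {vals.mean():.3f} ± {vals.std():.3f}")
```

With a fixed `random_state` on the `StratifiedKFold`, repeated `cross_val_score` calls already see identical folds, so the gain here is purely avoiding redundant fitting.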
Step 4: The Final Model and Feature Interpretation
Maya trained the production model on 80% of the data and evaluated it on the held-out 20%.
```python
def train_overrun_model(
    df_features: pd.DataFrame,
    target_col: str = "over_budget",
    random_state: int = 42,
) -> tuple:
    """
    Train final Random Forest model and return model plus test results.

    Args:
        df_features: Engineered feature DataFrame.
        target_col: Binary target column.
        random_state: Random seed.

    Returns:
        Tuple of (trained model, feature column list, test metrics dict).
    """
    feature_cols = [c for c in df_features.columns if c != target_col]
    X = df_features[feature_cols]
    y = df_features[target_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=random_state
    )

    model = RandomForestClassifier(
        n_estimators=200,
        max_depth=5,
        min_samples_leaf=8,
        class_weight="balanced",
        random_state=random_state,
    )
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    print("\n=== Test Set Performance ===")
    print(classification_report(y_test, y_pred, target_names=["On Budget", "Over Budget"]))
    print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.3f}")

    # Feature importance
    importance_df = pd.DataFrame({
        "feature": feature_cols,
        "importance": model.feature_importances_,
    }).sort_values("importance", ascending=False).head(10)
    print("\n=== Top 10 Features by Importance ===")
    for _, row in importance_df.iterrows():
        bar = "#" * int(row["importance"] * 150)
        print(f"  {row['feature']:<35} {row['importance']:.3f} {bar}")

    return model, feature_cols, {"auc": roc_auc_score(y_test, y_prob)}
```
The test set results:
```
=== Test Set Performance ===
              precision    recall  f1-score   support

   On Budget       0.93      0.89      0.91        40
 Over Budget       0.62      0.75      0.68         9

    accuracy                           0.88        49
   macro avg       0.78      0.82      0.80        49
weighted avg       0.89      0.88      0.88        49

ROC AUC: 0.882

=== Top 10 Features by Importance ===
  project_type_Regulatory             0.201 ##############################
  fee_per_deliverable                 0.158 #######################
  client_industry_Healthcare          0.134 ####################
  requires_legal_review               0.112 ################
  team_density                        0.098 ##############
  is_new_client                       0.071 ##########
  estimated_duration_weeks            0.058 ########
  initial_fee_estimate                0.047 #######
  prior_engagements                   0.039 #####
  num_deliverables                    0.032 ####
```
The model caught 75% of the over-budget projects (recall = 0.75) with 62% precision. The feature importances confirmed the domain logic: regulatory project type and healthcare industry dominated, followed by the engineered fee_per_deliverable variable — which captures how tightly the scope is priced.
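One caveat worth checking on a dataset this small: impurity-based importances from a Random Forest can overstate continuous features (like fee_per_deliverable) relative to binary dummies. Permutation importance on the held-out set is a standard cross-check. A sketch on synthetic stand-in data, since the engagement data is not public:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in shaped like the Thornfield problem (243 rows).
X, y = make_classification(n_samples=243, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature on the test set and measure the AUC drop;
# features whose shuffling barely hurts AUC carry little real signal.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=20, random_state=42
)
for i in np.argsort(result.importances_mean)[::-1][:3]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f} "
          f"± {result.importances_std[i]:.3f}")
```

If the permutation ranking broadly agrees with the impurity ranking, as a consultant would hope here, the feature story told to the client rests on firmer ground.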
Step 5: The Practical Output
Maya built a scoring function that Claire's team could run on any new proposal:
```python
def score_new_proposal(
    model,
    feature_cols: list,
    proposal_data: dict,
) -> None:
    """
    Score a single new proposal and print overrun risk assessment.

    Args:
        model: Trained RandomForestClassifier.
        feature_cols: Ordered list of feature columns from training.
        proposal_data: Dictionary of feature values for the new proposal,
            keyed by the same one-hot-encoded column names used in training
            (e.g. "project_type_Regulatory": 1).
    """
    proposal_df = pd.DataFrame([proposal_data])

    # Align columns to match training data; dummy columns the proposal
    # does not mention are filled with 0.
    for col in feature_cols:
        if col not in proposal_df.columns:
            proposal_df[col] = 0
    proposal_df = proposal_df[feature_cols]

    prob = model.predict_proba(proposal_df)[0][1]
    risk_level = "HIGH" if prob >= 0.60 else "MEDIUM" if prob >= 0.35 else "LOW"

    print("\n=== Overrun Risk Assessment ===")
    print(f"Predicted overrun probability: {prob:.1%}")
    print(f"Risk level: {risk_level}")
    if risk_level == "HIGH":
        print("Recommendation: Conduct pre-engagement scope review.")
        print("  Consider 15-20% buffer in internal hour budget.")
    elif risk_level == "MEDIUM":
        print("Recommendation: Schedule mid-project checkpoint at 40% completion.")
    else:
        print("Recommendation: Standard engagement management procedures.")
```
Running this on a new regulatory compliance project with a healthcare client returned: Predicted overrun probability: 71.4% — Risk level: HIGH.
The proposal was not rejected — the relationship was too important — but the engagement manager built a 25% hour buffer into the internal budget and scheduled two mid-project scope reviews. The project ultimately ran 11% over internal hours rather than the typical 40%+ for this category.
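For Claire's team to run these assessments week after week without retraining, the fitted model and its feature column order need to be persisted together. A minimal sketch using joblib, which scikit-learn installs as a dependency; the synthetic model below stands in for the real output of train_overrun_model():

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the real trained model and feature list (synthetic data).
X, y = make_classification(n_samples=243, n_features=8, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
feature_cols = [f"feature_{i}" for i in range(8)]

# Persist model and column order as one bundle so they can never drift apart.
joblib.dump({"model": model, "feature_cols": feature_cols}, "overrun_model.joblib")

# Later, at proposal time: load the bundle and score.
bundle = joblib.load("overrun_model.joblib")
prob = bundle["model"].predict_proba(X[:1])[0][1]
print(f"{prob:.1%}")
```

Bundling the column list with the model matters because score_new_proposal() depends on reproducing the exact training column order; shipping them separately is a common source of silent scoring bugs.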
What Claire Got — and What She Did Not
Maya was direct with Claire about what the model could and could not do.
What it can do: Flag high-risk proposals at the point of signing, before any work begins. This gives engagement managers time to price in a buffer, have a more explicit scope conversation with the client, or schedule additional checkpoints.
What it cannot do: Predict the specific reason a project will overrun, control for project manager quality in a way the data cannot see, or account for market events that change client priorities mid-engagement.
What surprised Claire: The fee_per_deliverable feature — the engineered interaction term — was the second most important predictor. Projects that were priced loosely (low fee per deliverable) ran over more often than projects priced tightly. The implication: Thornfield's pricing process itself contained information about project risk that was not being surfaced in engagement management.
That finding was worth more than the model.
Summary
| Item | Value |
|---|---|
| Dataset | 243 projects, 3 years |
| Overrun rate | 18.1% |
| Model | Random Forest (200 trees, max_depth=5) |
| Cross-validated AUC | 0.847 ± 0.039 |
| Test-set recall (overrun class) | 75% |
| Test-set precision (overrun class) | 62% |
| Most predictive features | Regulatory project type, fee per deliverable, healthcare industry |
| Business outcome | 15-25% reduction in absorbed overrun costs on flagged projects |