Case Study 34-02: Maya Builds a Project Overrun Predictor

Character: Maya Reyes, freelance business analytics consultant
Client: Thornfield Advisory, a 25-person management consulting firm
Setting: Thornfield has a persistent problem: projects that look profitable at proposal stage frequently run over budget. Claire Whitmore, the engagement director, wants a model that predicts overrun risk early enough to act on it.


The Problem

Thornfield Advisory closes roughly 80 consulting engagements per year. Each project is sold on a fixed-fee basis — the client pays an agreed amount regardless of how many hours Thornfield spends. When a project runs over its internal hour budget, it eats directly into margin.

Claire had pulled together data on every engagement from the past three years: 243 projects. Forty-four had run over budget by more than 20%. Those forty-four projects had cost Thornfield an average of $31,000 each in absorbed overrun costs — roughly $1.4 million in lost margin across three years.

"We know some kinds of projects always bleed," Claire told Maya at their first meeting. "Healthcare regulatory work, anything that involves a merger, multinational clients. But we take the work anyway because the relationships matter. What I want to know is: can we flag these at proposal stage so the engagement manager knows to price in a buffer?"

Maya examined the dataset. Thornfield had tracked everything that mattered at proposal stage: client industry, project type, team size, initial fee estimate, number of deliverables scoped, whether a legal review was required, previous work with the client, and the engagement manager assigned. She also had the post-engagement actuals: hours spent, final invoice amount, and whether the project had run over.


Step 1: Understanding the Data

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    classification_report, roc_auc_score, confusion_matrix,
)


def load_and_profile_engagements(csv_path: str) -> pd.DataFrame:
    """
    Load Thornfield engagement data and print a diagnostic profile.

    Args:
        csv_path: Path to the engagement history CSV.

    Returns:
        Cleaned DataFrame ready for feature engineering.
    """
    df = pd.read_csv(csv_path)

    print(f"Total projects: {len(df)}")
    print(f"Projects with >20% overrun: {df['over_budget'].sum()} ({df['over_budget'].mean():.1%})")
    print("\nOverrun rate by project type:")
    print(df.groupby("project_type")["over_budget"].mean().sort_values(ascending=False).to_string())
    print("\nOverrun rate by client industry:")
    print(df.groupby("client_industry")["over_budget"].mean().sort_values(ascending=False).to_string())

    return df

When Maya ran the profile, the patterns jumped out immediately. Regulatory compliance projects had a 51% overrun rate. Healthcare clients ran over 44% of the time. Projects requiring legal review ran over 38% of the time versus 12% for those without. The raw data confirmed what Claire's team already knew anecdotally — and pointed toward which features would matter in a model.
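The legal-review split (38% vs. 12%) comes from the same groupby pattern the profiler applies to project type and industry. A minimal sketch of that check, using a hypothetical eight-row sample with the column names described for Maya's dataset:

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as Maya's dataset;
# the real file has 243 rows.
sample = pd.DataFrame({
    "requires_legal_review": [1, 1, 1, 0, 0, 0, 0, 0],
    "over_budget":           [1, 1, 0, 0, 1, 0, 0, 0],
})

# Overrun rate split by the legal-review flag, mirroring the groupby
# calls in load_and_profile_engagements().
rates = sample.groupby("requires_legal_review")["over_budget"].mean()
print(rates.to_string())
```

On the real data this one-liner is how each of the headline rates above is produced.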


Step 2: Feature Engineering

The proposal-stage features available to Thornfield were a mix of categorical and numeric data. Maya's feature engineering work focused on creating variables that captured the known risk drivers:

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create predictive features from raw proposal-stage data.

    Args:
        df: Raw engagement DataFrame from load_and_profile_engagements().

    Returns:
        Feature-engineered DataFrame with encoded categoricals.
    """
    engineered = df.copy()

    # Interaction: fee per deliverable (proxy for scope tightness)
    # Tightly scoped projects (high fee per deliverable) tend to run leaner
    engineered["fee_per_deliverable"] = (
        engineered["initial_fee_estimate"] / engineered["num_deliverables"].clip(lower=1)
    )

    # Interaction: team size relative to project duration
    # Large teams on short projects create coordination overhead
    engineered["team_density"] = (
        engineered["team_size"] / engineered["estimated_duration_weeks"].clip(lower=1)
    )

    # Flag: is this a new client? (no previous work history means more unknowns)
    engineered["is_new_client"] = (engineered["prior_engagements"] == 0).astype(int)

    # Categorical encoding
    engineered = pd.get_dummies(
        engineered,
        columns=["client_industry", "project_type", "engagement_manager_tier"],
        drop_first=True,
    )

    # Drop columns not available at proposal stage
    cols_to_drop = [
        "project_id", "client_name", "actual_hours", "final_invoice",
        "engagement_manager_name",  # Use tier instead (anonymized)
    ]
    engineered.drop(columns=cols_to_drop, errors="ignore", inplace=True)

    return engineered

The two interaction terms were deliberate choices based on domain logic, not just data mining. A project scoped at $5,000 per deliverable will be managed very differently from one scoped at $50,000 per deliverable. And a team of eight consultants on a three-week engagement faces different coordination pressures than the same team on a six-month engagement.
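The arithmetic behind those two contrasts can be sketched directly. The numbers below are hypothetical proposals chosen to match the examples in the paragraph (a $5,000-per-deliverable project vs. a $50,000-per-deliverable one, and a team of eight over three weeks vs. roughly six months):

```python
import pandas as pd

# Two hypothetical proposals illustrating the interaction terms.
proposals = pd.DataFrame({
    "initial_fee_estimate":     [50_000, 150_000],
    "num_deliverables":         [10, 3],
    "team_size":                [8, 8],
    "estimated_duration_weeks": [3, 26],
})

# Same formulas as engineer_features(); clip(lower=1) guards against
# zero deliverables or zero weeks in badly entered proposals.
proposals["fee_per_deliverable"] = (
    proposals["initial_fee_estimate"] / proposals["num_deliverables"].clip(lower=1)
)
proposals["team_density"] = (
    proposals["team_size"] / proposals["estimated_duration_weeks"].clip(lower=1)
)

print(proposals[["fee_per_deliverable", "team_density"]])
```

The first proposal comes out at $5,000 per deliverable with a team density near 2.7 consultants per week; the second at $50,000 per deliverable and roughly 0.3.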


Step 3: Model Training and Selection

With 243 projects, the dataset was small enough that cross-validation results could be meaningfully affected by random variance. Maya ran five-fold stratified cross-validation and examined not just mean performance but the spread.

def compare_models(
    df_features: pd.DataFrame,
    target_col: str = "over_budget",
    random_state: int = 42,
) -> None:
    """
    Compare logistic regression vs. random forest with cross-validation.

    Args:
        df_features: Engineered feature DataFrame.
        target_col: Binary target column name.
        random_state: Random seed.
    """
    feature_cols = [c for c in df_features.columns if c != target_col]
    X = df_features[feature_cols]
    y = df_features[target_col]

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)

    logistic = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(
            class_weight="balanced",
            C=0.5,
            max_iter=1000,
            random_state=random_state,
        )),
    ])

    forest = RandomForestClassifier(
        n_estimators=200,
        max_depth=5,
        min_samples_leaf=8,
        class_weight="balanced",
        random_state=random_state,
    )

    for name, model in [("Logistic Regression", logistic), ("Random Forest", forest)]:
        f1_scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
        auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
        recall_scores = cross_val_score(model, X, y, cv=cv, scoring="recall")

        print(f"\n{name}:")
        print(f"  F1:     {f1_scores.mean():.3f} ± {f1_scores.std():.3f}")
        print(f"  AUC:    {auc_scores.mean():.3f} ± {auc_scores.std():.3f}")
        print(f"  Recall: {recall_scores.mean():.3f} ± {recall_scores.std():.3f}")
        print(f"  F1 scores by fold: {[f'{s:.3f}' for s in f1_scores]}")

The output:

Logistic Regression:
  F1:     0.571 ± 0.045
  AUC:    0.821 ± 0.048
  Recall: 0.682 ± 0.082
  F1 scores by fold: ['0.500', '0.632', '0.545', '0.600', '0.579']

Random Forest:
  F1:     0.593 ± 0.034
  AUC:    0.847 ± 0.039
  Recall: 0.705 ± 0.071
  F1 scores by fold: ['0.533', '0.632', '0.581', '0.611', '0.607']

Random Forest edged ahead on all three metrics. Equally important: the standard deviation was lower — more consistent performance across different subsets of the small dataset. With only 243 rows, consistency matters more than peak performance.
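One practical note on compare_models(): calling cross_val_score three times refits each model fifteen times in total. scikit-learn's cross_validate collects several scorers in a single pass, one fit per fold. A sketch on a synthetic stand-in dataset (roughly Thornfield's shape and class balance; the make_classification call is an assumption, not Maya's data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in: ~243 rows, imbalanced binary target (~18% positive).
X, y = make_classification(
    n_samples=243, n_features=12, weights=[0.82, 0.18], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(
    n_estimators=200, max_depth=5, min_samples_leaf=8,
    class_weight="balanced", random_state=42,
)

# One fit per fold, all three metrics collected together.
scores = cross_validate(model, X, y, cv=cv, scoring=["f1", "roc_auc", "recall"])
for metric in ("f1", "roc_auc", "recall"):
    vals = scores[f"test_{metric}"]
    print(f"{metric}: {vals.mean():.3f} ± {vals.std():.3f}")
```

On 243 rows the refitting cost is negligible, but the single-pass version also guarantees all three metrics are computed on identical fold splits.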


Step 4: The Final Model and Feature Interpretation

Maya trained the production model on 80% of the data and evaluated it on the held-out 20%.

def train_overrun_model(
    df_features: pd.DataFrame,
    target_col: str = "over_budget",
    random_state: int = 42,
) -> tuple:
    """
    Train final Random Forest model and return model plus test results.

    Args:
        df_features: Engineered feature DataFrame.
        target_col: Binary target column.
        random_state: Random seed.

    Returns:
        Tuple of (trained model, feature column list, test metrics dict).
    """
    feature_cols = [c for c in df_features.columns if c != target_col]
    X = df_features[feature_cols]
    y = df_features[target_col]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=random_state
    )

    model = RandomForestClassifier(
        n_estimators=200,
        max_depth=5,
        min_samples_leaf=8,
        class_weight="balanced",
        random_state=random_state,
    )
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    print("\n=== Test Set Performance ===")
    print(classification_report(y_test, y_pred, target_names=["On Budget", "Over Budget"]))
    print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.3f}")

    # Feature importance
    importance_df = pd.DataFrame({
        "feature": feature_cols,
        "importance": model.feature_importances_,
    }).sort_values("importance", ascending=False).head(10)

    print("\n=== Top 10 Features by Importance ===")
    for _, row in importance_df.iterrows():
        bar = "#" * int(row["importance"] * 150)
        print(f"  {row['feature']:<35} {row['importance']:.3f}  {bar}")

    return model, feature_cols, {"auc": roc_auc_score(y_test, y_prob)}

The test set results:

=== Test Set Performance ===
              precision    recall  f1-score   support

   On Budget       0.93      0.89      0.91        40
 Over Budget       0.62      0.75      0.68         9

    accuracy                           0.88        49
   macro avg       0.78      0.82      0.80        49
weighted avg       0.89      0.88      0.88        49

ROC AUC: 0.882

=== Top 10 Features by Importance ===
  project_type_Regulatory          0.201  ##############################
  fee_per_deliverable              0.158  #######################
  client_industry_Healthcare       0.134  ####################
  requires_legal_review            0.112  ################
  team_density                     0.098  ##############
  is_new_client                    0.071  ##########
  estimated_duration_weeks         0.058  ########
  initial_fee_estimate             0.047  #######
  prior_engagements                0.039  #####
  num_deliverables                 0.032  ####

The model caught 75% of the over-budget projects (recall = 0.75) with 62% precision. The feature importances confirmed the domain logic: regulatory project type and healthcare industry dominated, followed by the engineered fee_per_deliverable variable — which captures how tightly the scope is priced.
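That 75%/62% trade-off is set by the default 0.5 decision threshold in model.predict(). Sliding the threshold trades recall for precision, which is worth inspecting before choosing risk bands. A minimal sketch with hypothetical held-out labels and probabilities (in practice these come from y_test and model.predict_proba(X_test)):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical held-out labels and predicted probabilities, just to show
# the mechanics of a threshold sweep.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.35, 0.55, 0.70, 0.40, 0.60, 0.75, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# Each threshold trades recall for precision: lowering it catches more
# overruns but flags more on-budget projects as risky.
for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
    print(f"threshold >= {t:.2f}: precision {p:.2f}, recall {r:.2f}")
```

For a problem like Thornfield's, where a missed overrun costs ~$31,000 and a false flag costs only a scope review, a threshold below 0.5 may well be justified.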


Step 5: The Practical Output

Maya built a scoring function that Claire's team could run on any new proposal:

def score_new_proposal(
    model,
    feature_cols: list,
    proposal_data: dict,
) -> None:
    """
    Score a single new proposal and print overrun risk assessment.

    Args:
        model: Trained RandomForestClassifier.
        feature_cols: Ordered list of feature columns from training.
        proposal_data: Dictionary of feature values for the new proposal.
    """
    proposal_df = pd.DataFrame([proposal_data])

    # Align columns to match training data
    for col in feature_cols:
        if col not in proposal_df.columns:
            proposal_df[col] = 0

    proposal_df = proposal_df[feature_cols]

    prob = model.predict_proba(proposal_df)[0][1]
    risk_level = "HIGH" if prob >= 0.60 else "MEDIUM" if prob >= 0.35 else "LOW"

    print(f"\n=== Overrun Risk Assessment ===")
    print(f"Predicted overrun probability: {prob:.1%}")
    print(f"Risk level:                    {risk_level}")
    if risk_level == "HIGH":
        print("Recommendation: Conduct pre-engagement scope review.")
        print("                Consider 15-20% buffer in internal hour budget.")
    elif risk_level == "MEDIUM":
        print("Recommendation: Schedule mid-project checkpoint at 40% completion.")
    else:
        print("Recommendation: Standard engagement management procedures.")

Running this on a new regulatory compliance project with a healthcare client returned: Predicted overrun probability: 71.4% — Risk level: HIGH.
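The banding logic inside score_new_proposal() is worth factoring into its own helper so the thresholds live in one place if Claire's team later retunes them. A small sketch of that refactor (the function name is illustrative, not from Maya's code):

```python
def risk_level(prob: float) -> str:
    """Map an overrun probability to the banding used in score_new_proposal()."""
    if prob >= 0.60:
        return "HIGH"
    if prob >= 0.35:
        return "MEDIUM"
    return "LOW"

# The 71.4% probability from the healthcare regulatory proposal lands in HIGH.
print(risk_level(0.714))  # HIGH
print(risk_level(0.40))   # MEDIUM
print(risk_level(0.10))   # LOW
```

Keeping the cut-offs in one function also makes them easy to unit-test at the boundaries (0.60 and 0.35 are inclusive on the higher band).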

The proposal was not rejected — the relationship was too important — but the engagement manager built a 25% hour buffer into the internal budget and scheduled two mid-project scope reviews. The project ultimately ran 11% over internal hours rather than the typical 40%+ for this category.


What Claire Got — and What She Did Not

Maya was direct with Claire about what the model could and could not do.

What it can do: Flag high-risk proposals at the point of signing, before any work begins. This gives engagement managers time to price in a buffer, have a more explicit scope conversation with the client, or schedule additional checkpoints.

What it cannot do: Predict the specific reason a project will overrun, account for engagement-manager quality beyond what the data captures, or anticipate market events that change client priorities mid-engagement.

What surprised Claire: The fee_per_deliverable feature — the engineered interaction term — was the second most important predictor. Projects that were priced loosely (low fee per deliverable) ran over more often than projects priced tightly. The implication: Thornfield's pricing process itself contained information about project risk that was not being surfaced in engagement management.

That finding was worth more than the model.


Summary

Item                                 Value
Dataset                              243 projects, 3 years
Overrun rate                         18.1%
Model                                Random Forest (200 trees, max_depth=5)
Cross-validated AUC                  0.847 ± 0.039
Test-set recall (overrun class)      75%
Test-set precision (overrun class)   62%
Most predictive features             Regulatory project type, fee per deliverable, healthcare industry
Business outcome                     15-25% reduction in absorbed overrun costs on flagged projects