
Chapter 33: Introduction to Machine Learning for Business

"Machine learning is not magic. It is pattern-finding at scale. Understanding that distinction is the difference between using it wisely and wasting six months chasing a solution to a problem you didn't actually have." — Maya Reyes, consulting debrief notes


Who This Chapter Is For

You have heard the words "machine learning," "AI," and "predictive analytics" in boardrooms, vendor pitches, and headlines. You may have nodded along while privately wondering what any of it actually means. Or you may have started reading an ML tutorial, hit a wall of mathematical notation, and closed the tab.

This chapter is the chapter that should have come first in every one of those tutorials.

By the end of it you will understand what machine learning actually is, how it fits into a business context, when it is worth pursuing, and when a better spreadsheet or a cleaner dashboard would serve you just as well. You will also have a working mental model of the entire ML workflow — from framing a business problem to deploying a solution — and hands-on experience with scikit-learn, Python's standard ML library.

No calculus required. No statistics degree required. A healthy dose of skepticism is actively encouraged.


33.1 What Machine Learning Actually Is

Let's start with the most important thing you will read in this chapter.

Machine learning is pattern-finding in data.

That is it. That is the whole definition. Everything else is details.

A machine learning algorithm looks at historical data and extracts patterns. Then it uses those patterns to make predictions or decisions about new data it has never seen. The "learning" is not mysterious: the algorithm is adjusting its internal parameters to get better at a task as it sees more examples, in the same way that a new employee gets better at routing customer calls after handling a few hundred of them.

Let's make this concrete with a business example.

Suppose you work at a subscription software company. You have data on 50,000 customers: how often they log in, how many features they use, how often they contact support, whether they have an active payment method, and a dozen other things. You also know which of those customers renewed their subscription last year and which did not.

You give all of that data to a machine learning algorithm. The algorithm finds patterns: customers who log in fewer than twice per month AND have contacted support three or more times in the last quarter AND have not used the mobile app are much more likely to cancel. It quantifies these patterns into a model.

Now you give the model a new customer's data. It has never seen this customer before. Based on the patterns it learned, it estimates: "this customer has a 73% probability of canceling in the next 90 days."

That is machine learning. Pattern-finding in historical data, applied to new cases.

What Machine Learning Is NOT

Understanding what ML is not will save you enormous amounts of time and money.

ML is not magic. The model can only find patterns that exist in your data. If your data does not contain information that predicts the outcome you care about, no algorithm — however sophisticated — will manufacture it. Garbage in, garbage out is not a cliche; it is the most important law in applied ML.

ML is not Artificial General Intelligence. The phrase "AI" in popular usage conjures images of systems that think, reason, understand, and have goals. The ML models we will work with are none of these things. A churn prediction model does not "understand" churn. It is a mathematical function that maps input numbers to output numbers. It will fail in ways that no human would fail if the data changes in unexpected ways.

ML is not a replacement for business judgment. A model can tell you that a customer is likely to churn. It cannot tell you whether you should call them, offer them a discount, or accept the loss. Those are business decisions that require context, ethics, and judgment that the model does not have.

ML is not a solution to every problem. This point deserves its own section, and it gets one later in this chapter. For now: many business problems are better solved with clearer reporting, better processes, or a well-designed spreadsheet.

ML is not necessarily expensive or slow. Simple, effective ML models can be trained in seconds on a laptop. You do not need a data science team of 20 people and a cloud cluster to get value from ML. You also do not need the most sophisticated algorithm available.


33.2 Three Types of Machine Learning

The field of machine learning is organized around three main paradigms. All three appear in business contexts, though one dominates practical applications by a wide margin.

Supervised Learning: Learning from Labeled Examples

Supervised learning is the most common type of ML in business, and it is the type we will spend most of our time on.

In supervised learning, you train a model using data where the answer is already known. Each training example has:

  • Features (also called inputs or predictors): the things you measure or observe
  • A label (also called the target or outcome): the thing you are trying to predict

The algorithm learns the relationship between features and labels from historical examples, then applies that relationship to new cases where you know the features but not the label.

Business examples of supervised learning:

Business Problem | Features | Label
Customer churn prediction | Login frequency, support contacts, plan type, tenure | Did the customer cancel? (Yes/No)
Sales forecasting | Month, region, marketing spend, seasonal factors | Revenue for the period
Credit risk scoring | Income, debt-to-income ratio, payment history | Will this applicant default? (Yes/No)
Lead scoring | Company size, industry, source, engagement score | Will this lead convert? (Yes/No)
Customer inquiry classification | Text of the support ticket | Category: Billing / Technical / Complaint / Other
Demand forecasting | Historical sales, price, promotions, weather | Units sold next week

Supervised learning breaks down into two sub-types:

Classification: The label is a category. Will this customer churn (yes/no)? What category is this email (spam/not spam)? What product is in this image? Classification algorithms output either a predicted class or a probability for each class.

Regression: The label is a continuous number. What will revenue be next quarter? What price should we set for this item? How long will this manufacturing process take? Regression algorithms output a number.
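Both sub-types follow the same workflow; only the model and the type of label change. Here is a minimal sketch on synthetic data (make_classification and make_regression are scikit-learn's toy data generators, used here so the example is self-contained):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: the label is a category (here 0 or 1)
X_cls, y_cls = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_cls, y_cls)
print(clf.predict(X_cls[:3]))   # predicted classes

# Regression: the label is a continuous number
X_reg, y_reg = make_regression(n_samples=200, n_features=5, random_state=0)
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict(X_reg[:3]))   # predicted numeric values
```

The only structural difference is what the model outputs: discrete classes in the first case, real numbers in the second.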

Unsupervised Learning: Finding Structure in Unlabeled Data

In unsupervised learning, there are no labels. The algorithm explores the data and finds structure on its own. You are not telling it what the "right answer" is; you are asking it to discover what's interesting.

Business examples of unsupervised learning:

Clustering: Grouping customers into segments based on their behavior, without pre-defining the segments. You might discover that your customers fall into four distinct groups — power users, casual users, price-sensitive users, and dormant accounts — without ever having explicitly defined those categories. This feeds into product development, marketing strategy, and customer success prioritization.

Anomaly detection: Finding data points that don't fit the normal pattern. Fraud detection is the classic example: most transactions follow predictable patterns; the anomalies are the suspicious ones. Manufacturing quality control works similarly: most units are fine; the statistical outliers warrant inspection.

Dimensionality reduction: Compressing data with many features into a smaller set of features that captures most of the important variation. This is mostly useful as a preprocessing step or for visualization.

Unsupervised learning is more exploratory and harder to evaluate (there's no "right answer" to compare against), but it is genuinely useful in business contexts, especially in the early stages of understanding a new dataset.
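As a sketch of the clustering case, here is KMeans grouping a hypothetical two-feature customer dataset. The feature values and the two-segment structure are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Hypothetical behavior features: [logins per month, support contacts]
light_users = rng.normal([2, 4], 1.0, size=(50, 2))
power_users = rng.normal([25, 1], 1.0, size=(50, 2))
X = np.vstack([light_users, power_users])  # note: no labels anywhere

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5])       # cluster assignment for each customer
print(kmeans.cluster_centers_)  # the "typical" member of each segment
```

The algorithm recovers the two groups from the data alone; what those groups mean for the business is still your job to interpret.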

Reinforcement Learning: Learning from Feedback

Reinforcement learning is the most technically complex of the three paradigms, and the least commonly used in standard business applications, though that is changing.

In reinforcement learning, an agent learns by taking actions in an environment and receiving rewards or penalties based on the outcomes. It learns to maximize cumulative reward over time.

Business examples where RL appears:

  • Dynamic pricing (adjusting prices in real time based on demand signals)
  • Recommendation systems (learning which content or products to show to maximize engagement or purchases)
  • Supply chain optimization (learning ordering policies to minimize cost and stockouts)
  • Automated trading systems

For most business ML projects, reinforcement learning is not the right tool. We mention it for completeness and because you will encounter the term. The chapters that follow focus on supervised learning, which solves the vast majority of practical business prediction problems.


33.3 The Machine Learning Workflow

Machine learning is a process, not a button. Here is the complete workflow, from business problem to deployed solution.

Step 1: Frame the Business Problem as an ML Problem

This is the most underrated step, and where most ML projects fail before they even begin.

A business problem ("we're losing customers") is not yet an ML problem. To convert it, you need to answer five questions:

1. What exactly are you trying to predict? Be precise. "Customer churn" is vague. "Whether a customer with an active subscription will cancel in the next 90 days" is specific. The prediction target must be something you can measure.

2. When does the prediction need to happen? A churn prediction made 90 days before cancellation is useful — you have time to intervene. A prediction made the day before is not. The timing of the prediction determines what data you can use as features (you can't use features measured after the prediction is made).

3. What data do you have, and is it labeled? Do you have historical records where the outcome is known? Supervised learning requires this. How much data do you have? A few hundred labeled examples is often not enough.

4. How will the prediction be used? Will a human review each prediction and decide what to do? Will it be automated? This affects how much accuracy you need and how much interpretability you need.

5. What does success look like? Define this before you build anything. "Better than chance" is not a success criterion. What accuracy rate makes the model worth deploying? What is the cost of a wrong prediction? Both false positives (predicting churn for a customer who would have stayed) and false negatives (missing a customer who actually does churn) have different business costs.

Step 2: Collect and Prepare Data

Once the problem is framed, you need data. In practice, this step consumes 60–80% of the total time on most ML projects.

Data collection: Identify what data sources you have. Transactional databases, CRM systems, marketing platforms, support ticket systems, web analytics. Understanding what data exists and how to access it is a project in itself.

Data cleaning: Handle missing values, fix encoding errors, remove duplicates, standardize formats. We covered these skills extensively in earlier chapters.

Feature engineering: This is the art of the field. Raw data is rarely in the best form for a model. You create new features from existing data: days since last purchase, ratio of support contacts to account age, whether the customer has used the mobile app this month. Good features encode domain knowledge that the algorithm would otherwise have to discover on its own.

Data splitting: You hold out a portion of your labeled data to evaluate the model after training. The model never sees this test set during training. This is how you get an honest estimate of how well the model will perform on new data.

Step 3: Choose and Train a Model

"Training" a model means finding the parameter values that make the model's predictions match the labels in your training data as closely as possible.

The choice of algorithm matters less than beginners think. Start simple. A logistic regression model or a decision tree trained on good data will often outperform a deep neural network trained on messy data with poorly engineered features.

For business problems, start with:

  • Logistic regression for classification (simple, interpretable, fast)
  • Linear regression for regression (same advantages)
  • Decision trees when you need interpretability and non-linear patterns
  • Random forests when you need higher accuracy and can sacrifice some interpretability

We cover all of these in Chapter 34.

Step 4: Evaluate the Model

Never trust a model's performance on the data it was trained on. A model that perfectly memorizes the training data may perform terribly on new data — this is called overfitting, and it is the central challenge of ML.

Evaluate on the held-out test set. Use the appropriate metrics for your problem type (classification or regression). Compare against a baseline — what would a naive prediction strategy achieve? If your model doesn't beat the baseline by a meaningful margin, it has not learned anything useful.

We cover evaluation metrics in depth in Section 33.7.

Step 5: Deploy and Monitor

A model that lives in a Jupyter notebook helps no one. Deployment means integrating the model into the workflow where predictions will be used — a dashboard, a database query, an automated email system, a CRM workflow.

Deployment also requires monitoring. Models degrade over time as the world changes. A churn model trained on data from 2022 may perform poorly in 2025 if customer behavior has shifted. You need to track model performance over time and retrain periodically.


33.4 Introducing scikit-learn

Scikit-learn (imported as sklearn) is the standard Python library for machine learning. It is not the only library — TensorFlow, PyTorch, and XGBoost all play important roles in specialized contexts — but for the types of business ML we cover in this book, scikit-learn is the right tool.

What makes scikit-learn exceptional is not any single algorithm. It is the design philosophy.

The sklearn API: Fit, Transform, Predict

Every estimator (model, preprocessor, or transformer) in scikit-learn follows the same interface. Once you understand it, every new algorithm feels familiar.

.fit(X, y) — Train the model. Give it your training data (features X and labels y). The model learns from this data and stores what it learned internally. For unsupervised methods, .fit(X) takes only features.

.predict(X) — Make predictions. Give it new data (features only). The model applies what it learned to generate predictions.

.transform(X) — For preprocessing steps: transform data into a new form (scale it, encode it, fill missing values). Used with preprocessors, not predictive models.

.fit_transform(X) — Fit and transform in one step. Common shorthand for preprocessing on training data.

.score(X, y) — Evaluate the model. Returns a default metric (accuracy for classifiers, R² for regressors).

.predict_proba(X) — For classifiers: return probabilities for each class rather than a hard prediction. This is often more useful for business decisions.

Here is the complete, canonical sklearn workflow in code:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assumes X (features) and y (labels) are already defined,
# e.g. as a pandas DataFrame and Series

# 1. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Create the model
model = LogisticRegression()

# 3. Train (fit) the model
model.fit(X_train, y_train)

# 4. Make predictions
y_pred = model.predict(X_test)

# 5. Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")

This pattern is the same regardless of whether you are using logistic regression, a decision tree, a random forest, or a gradient boosting model. That consistency is the entire value of the scikit-learn API.
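To make that concrete, here is a hedged sketch that runs three different classifiers through the identical pattern on synthetic data; only the constructor call changes:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

for model in [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=42),
              RandomForestClassifier(random_state=42)]:
    model.fit(X_train, y_train)                    # same call for every algorithm
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{type(model).__name__}: {acc:.3f}")
```

Swapping algorithms costs one line, which is exactly what makes systematic model comparison practical.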

Pipelines: Chaining Steps Together

In practice, ML workflows involve multiple steps: preprocessing, then modeling. Scikit-learn's Pipeline object chains these together so they behave as a single estimator. This prevents a common and subtle bug called data leakage (where preprocessing steps are accidentally fit on data that includes the test set, letting test-set information influence training).

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

pipeline.fit(X_train, y_train)
pipeline.predict(X_test)

We use pipelines throughout this and the following chapters.


33.5 When ML Is Worth It — and When It's Overkill

This is the section that distinguishes a skilled data practitioner from someone chasing trends.

Machine learning is not the answer to every question. It has real prerequisites, real costs, and real failure modes. Before investing time in an ML solution, answer these questions honestly.

The ML Worthiness Checklist

Do you have enough labeled data?

Supervised learning requires historical examples with known outcomes. How many? It depends on the complexity of the problem and the number of features. As a rough guide: fewer than a few hundred labeled examples is usually not enough. A few thousand is a reasonable starting point for simple models. Complex models with many features may need tens of thousands.

If you do not have labeled historical data, supervised learning is not an option. You would need to collect it first, which may take months or years.

Is the pattern stable over time?

ML models assume the future will resemble the past in the ways that matter. If the relationship between your features and your target is changing rapidly, the model will struggle to keep up. A churn model trained before a major product overhaul may be worthless afterward.

Would a simpler approach solve the problem?

Before building a model, ask: could I solve this with a rule? With a threshold? With a better dashboard or report?

"Customers who haven't logged in for 60 days are at risk" is a simple, interpretable, actionable rule. You do not need an ML model to produce it. If a business analyst can articulate the key rules from domain knowledge, those rules might perform nearly as well as a trained model — and they will be much easier to understand, explain, and maintain.

Can you afford to be wrong — and in what direction?

Every model makes mistakes. In business, different kinds of mistakes have different costs. A false positive (predicting churn when the customer was not going to churn) might mean sending an unnecessary retention offer, which costs money but is not catastrophic. A false negative (missing a customer who does churn) means lost revenue.

Some applications have much more severe asymmetric costs: in medical screening, missing a disease (false negative) is catastrophic; in fraud detection, blocking a legitimate transaction (false positive) damages customer trust. These constraints should shape your choice of model and your definition of "good enough."

Do you need to explain the prediction?

Regulatory requirements in financial services, healthcare, and other industries may require that you can explain why a model made a particular prediction. A simple linear model or a decision tree can be explained. A complex ensemble of hundreds of trees is much harder. If interpretability is a hard requirement, it constrains your algorithm choices.

Do you have the infrastructure to deploy and maintain a model?

Training a model is the easy part. Deploying it so it generates predictions on new data in production, monitoring it, retraining it when performance degrades, tracking model versions — all of this requires engineering infrastructure and ongoing attention. If your organization cannot support this, a model that lives on a data scientist's laptop helps no one.

The Simpler-Tool Decision Tree

Work through this before starting any ML project:

  1. Do you understand the problem well enough to write explicit rules? → Try rules first.
  2. Can a better summary statistic or dashboard answer the question? → Build the dashboard.
  3. Do you have enough historical labeled data? → If no, collect it or reconsider the problem.
  4. Is the current state measurably worse than "good enough"? → If no, nothing to fix.
  5. Have you established a baseline that ML must beat? → Do this before training any model.

If you get through all five questions with ML still looking like the right tool, proceed.


33.6 The ML Hype Antidote

The business technology press is filled with case studies of organizations achieving miraculous results with AI and machine learning. Here is what those case studies almost never tell you.

The Base Rate Problem

When someone tells you their model is "95% accurate," the first question to ask is: what would you get if you just predicted the most common outcome every time?

If 95% of your customers do not churn, a model that always predicts "will not churn" is 95% accurate — and completely useless. This is called the base rate or majority class baseline. Your model must beat this baseline by a meaningful margin to be worth anything.

For class-imbalanced problems (where one outcome is much more common than the other), accuracy is almost always the wrong metric. We cover better metrics in Section 33.7.
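scikit-learn ships a DummyClassifier precisely for constructing this baseline. A sketch with invented labels (95% "no churn"):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Imbalanced labels: 95 customers stayed (0), 5 churned (1)
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are irrelevant to this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(accuracy_score(y, baseline.predict(X)))  # 0.95 — high accuracy, zero usefulness
```

Any model you build should be compared against this number, not against 50%.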

The Benchmark Problem

Real ML results should be compared to the best non-ML alternative, not to doing nothing. Often the right baseline is not random guessing but an existing rule-based system or the performance of an experienced human.

If an experienced sales rep can identify at-risk accounts with 70% precision based on intuition, a model that achieves 72% precision is not a dramatic improvement. It may still be worthwhile (scale, consistency, speed), but the bar should be the existing solution, not zero.

The Deployment Gap

In 2019, Gartner estimated that fewer than 10% of ML models ever made it into production. The number has improved since then, but the gap between "working model in a notebook" and "deployed model generating business value" remains one of the biggest challenges in the field. Factor in deployment complexity when deciding whether to pursue an ML approach.

The Interpretability Trade-off

Complex models (deep neural networks, large ensembles) often achieve higher accuracy on test data than simple models, but at the cost of interpretability. In many business contexts, a slightly less accurate model that can be explained, audited, and trusted is more valuable than a highly accurate black box.

"Why did the model predict this customer would churn?" is a question that stakeholders, customers, and regulators will ask. Make sure you can answer it.


33.7 Key Terminology

Before we get into code, let's establish the vocabulary you will encounter throughout this and the following chapters.

Features and Labels

Features (also called independent variables, inputs, predictors, or X): the data columns used as inputs to the model. In a churn prediction model, features might include: account_age_days, logins_last_30_days, support_contacts_last_90_days, has_mobile_app, plan_type.

Labels (also called the target variable, dependent variable, output, or y): the column you are trying to predict. In churn prediction: churned (1) or not churned (0).

Feature engineering: creating new features from existing data. Example: instead of using raw login counts, you create a "logins per day of tenure" ratio that normalizes for account age.
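A sketch of that ratio in pandas, using the hypothetical column names from this section (the values are invented):

```python
import pandas as pd

customers = pd.DataFrame({
    "account_age_days": [30, 365, 900],
    "logins_last_30_days": [15, 15, 15],
})

# Same raw login count, very different engagement relative to tenure
customers["logins_per_tenure_day"] = (
    customers["logins_last_30_days"] / customers["account_age_days"]
)
print(customers)
```

The 30-day-old account logging in 15 times is highly engaged; the 900-day-old account with the same count may be going dormant. The raw count hides that distinction; the ratio exposes it.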

Training and Test Sets

Training set: the data used to fit (train) the model. The algorithm adjusts its parameters to minimize errors on this data.

Test set: data held out from training and used only for final evaluation. This is your honest estimate of how well the model will perform on data it has never seen.

Validation set: a third split sometimes used to tune model hyperparameters without contaminating the test set.

The key principle: never evaluate a model on the data it was trained on. A model that has memorized the training data will appear to perform perfectly on training data while failing on new data. This failure mode is called overfitting.

Overfitting and Underfitting

Overfitting: the model learned the training data too well — including noise and idiosyncrasies that are not real patterns. It performs well on training data and poorly on new data. Symptom: large gap between training performance and test performance.

Underfitting: the model is too simple to capture the real patterns in the data. It performs poorly on both training and test data. Symptom: poor performance everywhere.

The goal is to find the sweet spot between the two — a model that is complex enough to capture the real signal in the data but not so complex that it memorizes the noise.

Overfitting is more common than underfitting in practice. The standard remedies are: more training data, simpler model, regularization, or feature selection.
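The "simpler model" remedy is easy to demonstrate. In this sketch (synthetic data with deliberate label noise via flip_y), an unconstrained decision tree memorizes the training set, while a depth-limited one shows a much smaller train/test gap:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 randomly flips 20% of labels, simulating noisy real-world data
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Unconstrained tree: perfect on training data, worse on test data (overfitting)
print("deep   :", deep.score(X_train, y_train), deep.score(X_test, y_test))
# Depth-limited tree: lower training score, smaller train/test gap
print("shallow:", shallow.score(X_train, y_train), shallow.score(X_test, y_test))
```

The symptom described above is visible directly in the output: the deep tree's training score is far above its test score.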

Hyperparameters

Parameters are what the model learns from data (e.g., the coefficients in a linear regression).

Hyperparameters are settings you choose before training that control how the model learns (e.g., the maximum depth of a decision tree, the regularization strength of a logistic regression). Choosing hyperparameters is the main activity in model tuning.


33.8 The Train/Test Split and Why It Matters

The train/test split is the foundational technique for honest model evaluation.

The basic approach: shuffle your labeled data randomly, then allocate some fraction (commonly 70–80%) to training and hold the rest (20–30%) for testing. Train the model only on the training set. Evaluate only on the test set. Never let the model see the test set during training.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,           # features
    y,           # labels
    test_size=0.2,       # 20% held out for testing
    random_state=42,     # reproducible split
    stratify=y           # maintain class proportions in both splits
)

print(f"Training examples: {len(X_train)}")
print(f"Test examples: {len(X_test)}")

The stratify=y parameter is important for classification problems with imbalanced classes. It ensures that both the training and test sets have the same proportion of each class as the original dataset. Without it, by chance, your test set might have very few positive examples, making evaluation unreliable.

The random_state parameter: Setting this to any integer makes the split reproducible. When you run the code again next week, you get the same split. This is essential for debugging and for ensuring that reported results are reproducible.

The One Sacred Rule

Fit your preprocessing on the training data only, and apply it to the test data.

This sounds technical, but the intuition is simple: you cannot use information from the test set to make decisions during training — that would be cheating. If you scale your features using the mean and standard deviation of the entire dataset (training + test), you have allowed information from the test set to influence the preprocessing. The result is an overly optimistic estimate of model performance.

This is called data leakage, and it is one of the most common sources of overly optimistic results in published ML work.

The fix is always to use scikit-learn Pipelines (introduced in Section 33.4), which handle this correctly by design.
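For intuition, here is what the rule looks like done by hand on invented data (a Pipeline automates exactly this): the scaler's mean and standard deviation are learned from the training split only, then reused on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training data only
X_test_scaled = scaler.transform(X_test)        # same statistics reused — never re-fit
```

Calling fit_transform on the full dataset before splitting is the leakage bug; calling transform (not fit_transform) on the test split is the fix.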


33.9 Cross-Validation: A Better Evaluation Approach

The train/test split has a weakness: its results depend on which examples happened to end up in the test set. With a different random split, you might get different performance estimates.

Cross-validation solves this by using multiple train/test splits.

K-Fold Cross-Validation

The most common approach is k-fold cross-validation (k is typically 5 or 10):

  1. Shuffle the data
  2. Split it into k equal-sized "folds"
  3. For each fold: train the model on all other folds, evaluate on this fold
  4. Average the k evaluation scores

The result is a more reliable estimate of model performance than a single train/test split, along with an estimate of the variability (standard deviation across folds).

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# 5-fold cross-validation
scores = cross_val_score(
    model, X, y,
    cv=5,
    scoring="accuracy"
)

print(f"CV Accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

The +/- 2 * std range gives you a rough 95% confidence interval. A model with accuracy 0.84 +/- 0.02 is more reliable than one with accuracy 0.87 +/- 0.09 — the second one's performance estimate is much noisier.

When to Use Cross-Validation vs. Train/Test Split

Use cross-validation when:

  • Your dataset is small (less than 10,000 examples) — you cannot afford to hold out 20% for testing
  • You are comparing multiple models or hyperparameter settings — cross-validation gives more reliable comparisons
  • You want to understand the variance of your performance estimate

Use a held-out test set when:

  • Your dataset is large enough that you can afford it
  • You want a final, honest evaluation after all model development decisions are made

Best practice: use cross-validation during development, and evaluate once on a held-out test set at the very end to report final performance.


33.10 Evaluation Metrics

Choosing the right metric for your problem is as important as choosing the right algorithm. The wrong metric will send you in the wrong direction.

For Classification Problems

Accuracy = (correct predictions) / (total predictions)

The simplest metric. Percentage of predictions that are correct. Do not use this as your primary metric when classes are imbalanced — it will mislead you (see the base rate problem in Section 33.6).

Confusion Matrix: A table showing how predictions map to actual outcomes. For a binary classification problem:

                 Predicted: No Churn    Predicted: Churn
Actual: No Churn    True Negative (TN)    False Positive (FP)
Actual: Churn       False Negative (FN)   True Positive (TP)

  • True Positive (TP): correctly predicted churn
  • True Negative (TN): correctly predicted no churn
  • False Positive (FP): predicted churn, but customer was fine (Type I error)
  • False Negative (FN): predicted no churn, but customer actually churned (Type II error)

Precision = TP / (TP + FP)

Of all the customers the model predicted would churn, what fraction actually did? High precision means few false alarms. Relevant when false positives are costly (e.g., you don't want to offer unnecessary discounts).

Recall (also called Sensitivity or True Positive Rate) = TP / (TP + FN)

Of all customers who actually churned, what fraction did the model catch? High recall means few missed churns. Relevant when false negatives are costly (e.g., you cannot afford to miss at-risk customers).

The Precision-Recall Trade-off: Increasing the prediction threshold (making the model more conservative about predicting churn) increases precision but decreases recall. Decreasing the threshold has the opposite effect. The right threshold depends on the relative costs of false positives and false negatives in your specific business context.

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

The harmonic mean of precision and recall. Useful single-number summary when both matter. Ranges from 0 (worst) to 1 (best).

ROC AUC (Area Under the Receiver Operating Characteristic Curve): measures the model's ability to discriminate between classes across all possible thresholds. Ranges from 0.5 (no better than random) to 1.0 (perfect). Less sensitive to class imbalance than accuracy. A good default metric for classification.

Practical guidance for business classification:

Situation                                                                   Recommended Primary Metric
Classes are roughly balanced                                                Accuracy or F1
False negatives are much costlier (don't miss the disease/fraud/churn)      Recall
False positives are much costlier (don't waste resources on false alarms)   Precision
You need a single threshold-independent metric                              ROC AUC
Severe class imbalance                                                      PR AUC (area under precision-recall curve)

For Regression Problems

MAE (Mean Absolute Error) = average of |actual - predicted|

The average error, in the same units as the target. Easy to interpret: a sales forecast with MAE of $5,000 is off by $5,000 on average. Less dominated by occasional large errors than RMSE.

RMSE (Root Mean Squared Error) = sqrt(average of (actual - predicted)²)

Penalizes large errors more than MAE does (because it squares the errors before averaging). More sensitive to outliers. A good choice when large errors are disproportionately costly.

R² (R-squared, the coefficient of determination): measures how much of the variance in the target is explained by the model. Ranges from 0 (the model explains nothing, doing no better than predicting the mean) to 1 (perfect predictions). It can be negative for models that perform worse than simply predicting the mean.

Plain-English interpretation of R²: "If R² = 0.75, the model explains 75% of the variation in the outcome. 25% of the variation is unexplained — attributable to factors not in the model or to inherent unpredictability."

Practical guidance for business regression:

Use RMSE when large errors are especially costly (e.g., inventory planning where being way off means running out of stock). Use MAE when errors are roughly symmetric in their costs and you want interpretability. Always report R² as a sense-check — it tells you whether the model is capturing real signal at all.
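The three regression metrics side by side, on a made-up five-week sales forecast. Note how the single $4,000 miss pulls RMSE above MAE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical weekly sales (dollars); numbers are illustrative only
actual    = np.array([12000, 15000,  9000, 20000, 11000])
predicted = np.array([13000, 14000, 10000, 16000, 12000])

mae = mean_absolute_error(actual, predicted)
rmse = np.sqrt(mean_squared_error(actual, predicted))  # squaring amplifies the $4,000 miss
r2 = r2_score(actual, predicted)

print(f"MAE  = ${mae:,.0f}")    # average miss
print(f"RMSE = ${rmse:,.0f}")   # larger than MAE because of the one big error
print(f"R²   = {r2:.3f}")
```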


33.11 Business Framing for Common ML Applications

Let's apply everything we have covered to four common business ML problems, framing each one properly before any code is written.

Churn Prediction

The business problem: We are losing customers and want to do something about it before it happens.

The ML framing: Predict which active customers will cancel within the next 90 days.

Label: churned_in_90_days (1 = yes, 0 = no). You construct this from historical data: for each customer at some point in the past, look forward 90 days and record whether they churned.

Features: Account age, login frequency (last 7, 30, 90 days), support contacts, plan type, payment method status, feature usage, NPS scores if available.

Evaluation metric: Recall (you cannot afford to miss many at-risk customers) with a precision floor (you cannot call every customer). ROC AUC for model selection.

Baseline: What is your current churn rate? A model predicting that every customer will churn would have 100% recall, but its precision would be no better than your base churn rate. Your real baseline should be the current detection method (e.g., account manager intuition, manual review).

How the output is used: A sorted list of at-risk customers delivered to account managers weekly. They decide whom to call.
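The look-forward label construction described above can be sketched in pandas. The column names here (snapshot_date, churn_date) are illustrative, not prescribed by any standard:

```python
import pandas as pd

# Hypothetical snapshot table: one row per customer at a past point in time
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "snapshot_date": pd.to_datetime(["2024-01-01"] * 4),
    "churn_date": pd.to_datetime(["2024-02-15", "2024-06-01", pd.NaT, "2024-03-30"]),
})

# Label: did the customer churn within 90 days of the snapshot?
window_end = df["snapshot_date"] + pd.Timedelta(days=90)
df["churned_in_90_days"] = (
    df["churn_date"].notna() & (df["churn_date"] <= window_end)
).astype(int)

print(df[["customer_id", "churned_in_90_days"]])
```

Customers who never churned (NaT) or churned after the window get label 0; the features for each row must be computed as of the snapshot date, never after it, or you leak the future into training.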

Demand Forecasting

The business problem: We are either over-ordering inventory (capital tied up, waste) or under-ordering (stockouts, lost sales).

The ML framing: Predict unit sales for each SKU-location combination for the next 2 weeks.

Label: units_sold for each (SKU, location, week) combination.

Features: Historical sales (lag features), promotional calendar, seasonality indicators (day of week, month, holiday proximity), weather (for relevant categories), price.

Evaluation metric: RMSE when large misses are disproportionately costly (one big stockout can hurt more than several small ones); MAE when errors carry roughly symmetric costs. Business teams often prefer MAPE (Mean Absolute Percentage Error) for its interpretability.

Baseline: Naive forecasting (last week's sales), seasonal naive (same week last year), exponential smoothing.

Caution: Time-series data requires special care with train/test splitting — you cannot randomly shuffle time-series data. The test set must be chronologically after the training set.
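scikit-learn's TimeSeriesSplit handles this constraint for you: every test fold comes chronologically after its training fold. A minimal sketch with index positions standing in for weeks:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 10 weeks of data, already in chronological order
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    print(f"train weeks {train_idx.min()}-{train_idx.max()}, "
          f"test weeks {test_idx.min()}-{test_idx.max()}")
    # The training fold always ends before the test fold begins
    assert train_idx.max() < test_idx.min()
```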

Anomaly Detection

The business problem: We have too many transactions to review manually; we need to flag suspicious ones.

The ML framing: Assign each transaction a "suspiciousness score" — how different is this from typical transactions?

This is often unsupervised if you do not have labeled fraud examples. If you have historical fraud labels, it becomes a supervised classification problem (with severe class imbalance — fraud is rare).

Baseline: Rule-based systems (flag transactions above $X, or from unusual geographies). These often capture the obvious cases; the ML value is in the subtle ones.

Key challenge: False positives (blocking legitimate transactions) damage customer trust. Set thresholds conservatively.
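One common unsupervised choice for this kind of scoring is scikit-learn's IsolationForest. The sketch below uses synthetic transaction amounts and is purely illustrative, not a production fraud system:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic transaction amounts: mostly typical, plus two extreme ones
typical = rng.normal(loc=50, scale=10, size=(200, 1))
extreme = np.array([[500.0], [750.0]])
X = np.vstack([typical, extreme])

# IsolationForest learns what "typical" looks like with no fraud labels at all
model = IsolationForest(contamination=0.01, random_state=0).fit(X)
scores = model.score_samples(X)  # lower score = more anomalous

# The two extreme transactions (indices 200 and 201) should rank most suspicious
most_suspicious = np.argsort(scores)[:2]
print(sorted(most_suspicious.tolist()))
```

In practice you would sort by score and send only the top handful to human review, which is exactly how a conservative threshold protects customer trust.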

Customer Inquiry Classification

The business problem: Support tickets arrive and need to be routed to the right team, but manual routing is slow and error-prone.

The ML framing: Classify each incoming ticket into a category (Billing, Technical, Feature Request, Complaint, Other).

Label: Category, assigned from historical tickets (using past human-assigned categories as labels).

Features: Text of the ticket (requires text feature extraction — TF-IDF or embeddings). Subject line, customer tier, product version.

Evaluation metric: Accuracy (if classes are balanced). Weighted F1 (if some categories are rarer but equally important).

Key consideration: When the model is uncertain, route to a human rather than guessing. A confidence threshold for automated routing reduces the impact of mistakes.
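A minimal sketch combining TF-IDF text features with a confidence floor for routing. The tickets, labels, and threshold here are all made up for illustration; a real system would train on thousands of historical tickets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny invented training set (real training data = historical tickets)
tickets = [
    "I was charged twice on my invoice", "refund my last payment please",
    "the app crashes when I open settings", "error 500 when uploading a file",
    "please add a dark mode option", "would love an export to Excel feature",
]
labels = ["Billing", "Billing", "Technical", "Technical",
          "Feature Request", "Feature Request"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),          # turns raw text into numeric features
    ("model", LogisticRegression(max_iter=1000)),
]).fit(tickets, labels)

# Route automatically only when the model is confident; otherwise, to a human
CONFIDENCE_FLOOR = 0.5  # illustrative value; tune on real data
new_ticket = ["my payment failed and I was billed anyway"]
probs = clf.predict_proba(new_ticket)[0]
best = probs.argmax()
if probs[best] >= CONFIDENCE_FLOOR:
    print(f"auto-route to {clf.classes_[best]}")
else:
    print("route to human review")
```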


33.12 From Theory to Practice: A First Look at the Code

Before diving into the full workflow in the companion code file, let's preview the structure of a complete ML pipeline in scikit-learn. This preview is meant to show you that the code is approachable — each piece maps directly to a concept in this chapter.

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.pipeline import Pipeline

# --- 1. Load and prepare data ---
df = pd.read_csv("customer_data.csv")

# Features and label
feature_cols = [
    "account_age_days",
    "logins_last_30_days",
    "support_contacts_last_90_days",
    "features_used_count",
    "payment_failures_last_year",
]
X = df[feature_cols]
y = df["churned"]

# --- 2. Split ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# --- 3. Build pipeline: preprocessing + model ---
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(random_state=42)),
])

# --- 4. Train ---
pipeline.fit(X_train, y_train)

# --- 5. Evaluate ---
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.3f}")

# --- 6. Cross-validation for reliable estimate ---
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"CV ROC AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std()*2:.3f})")

Read through this code against the workflow steps from Section 33.3. Each step in the workflow corresponds to one section of the code. This mapping — conceptual workflow to concrete code — is what the companion code file ml_workflow.py develops in full detail.


33.13 Chapter Summary

Machine learning is pattern-finding at scale. The fundamental workflow is always: frame the problem, prepare data, train a model, evaluate honestly, deploy, monitor.

Scikit-learn provides a consistent API (fit, predict, transform) that makes every new algorithm feel familiar once you understand the basics.

The most important skills in applied ML are not mathematical — they are judgment calls: Is this the right problem to solve with ML? Do I have enough data? Am I evaluating honestly? Does the model beat a sensible baseline? Can I explain the predictions to stakeholders?

Chapter 34 builds on everything here, implementing complete regression and classification models for business prediction problems.


Companion Files

  • code/ml_workflow.py — Complete end-to-end ML workflow: data loading, splitting, training, evaluation, and interpretation for a customer churn classification problem
  • case-study-01.md — Priya frames the churn prediction problem at Acme Corp before writing a single line of code
  • case-study-02.md — Maya walks a client through the ML decision framework and discovers that better reporting, not ML, solves their problem
  • exercises.md — 20 exercises across 5 difficulty tiers
  • quiz.md — 20 questions with answer key
  • key-takeaways.md — Chapter summary and core concepts
  • further-reading.md — Curated resources for going deeper