Chapter 4: The Math Behind ML — Probability, Linear Algebra, Calculus, and Loss Functions

Every chapter in this book will use math. Not theorem-proof math. Not "derive this from first principles" math. The kind of math that shows up when your model won't converge at 11 PM and you need to understand why — not just which hyperparameter to wiggle next.

This chapter covers the four pillars of machine learning mathematics: probability (how we represent uncertainty), linear algebra (how we represent data), calculus (how models learn), and loss functions (what models are trying to learn). If you remember these four things well, you will understand what scikit-learn is doing under the hood. If you understand what scikit-learn is doing under the hood, you will debug failures that stump your colleagues.

A promise before we start: every formula in this chapter gets three presentations.

  1. The intuition — a plain English explanation, often with a physical analogy
  2. The math — formal notation, so you can read papers and documentation
  3. The numpy code — a computational demonstration you can run, modify, and break

If you skipped calculus in college, you will survive this chapter. If you took calculus, you will finally understand what it was for.

Math Panic? Read This — If the sight of Greek letters triggers anxiety, you are not alone. Here is the survival strategy: read the intuition first. Read the numpy code second. If those two make sense, the notation in between is just a compact way of writing what you already understand. You do not need to memorize any formula in this chapter. You need to recognize them when they appear in documentation, error messages, and conference talks. That is a much lower bar, and you are going to clear it.


4.1 Probability: The Language of Uncertainty

Machine learning is fundamentally about uncertainty. When your churn model says a customer has a 73% probability of canceling, it is not saying "this customer will cancel." It is saying "given what we know, cancellation is about three times more likely than retention." Probability is the language that makes this statement precise.

Probability Distributions

A probability distribution describes how likely different outcomes are. You already know this intuitively — if someone asks "how tall is a random adult American male?" you would not give a single number. You would say something like "probably around 5'9", but it could be anywhere from 5'2" to 6'4"." That informal answer is a probability distribution.

The three distributions you will encounter most often in ML:

The Normal (Gaussian) Distribution. The bell curve. It shows up everywhere because of the Central Limit Theorem: when you average enough independent random things, the result is approximately normal, regardless of what the individual things look like.

The intuition: Most values cluster around the mean. The further from the mean, the less likely.

The math:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.

The numpy code:

import numpy as np
import matplotlib.pyplot as plt

# Generate 10,000 samples from a normal distribution
# Mean usage = 45 hours/month, std dev = 12 hours
usage_hours = np.random.default_rng(42).normal(loc=45, scale=12, size=10_000)

print(f"Mean: {usage_hours.mean():.1f}")
print(f"Std:  {usage_hours.std():.1f}")
print(f"68% of values between {usage_hours.mean() - usage_hours.std():.1f} "
      f"and {usage_hours.mean() + usage_hours.std():.1f}")
Mean: 44.9
Std:  11.9
68% of values between 33.0 and 56.8

In the StreamFlow churn context, subscriber usage hours per month are roughly normally distributed. Most customers watch around 45 hours. A few watch 80+. A few watch under 10. The customers under 10 are the ones your churn model should be worried about.
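
The Central Limit Theorem claim above is easy to verify empirically. Here is a quick sketch (the uniform distribution and the sample size of 50 are arbitrary illustrative choices): average many draws from a decidedly non-bell-shaped distribution and watch the averages turn normal.

```python
import numpy as np

rng = np.random.default_rng(0)

# A uniform distribution on [0, 1] is flat, not bell-shaped:
# mean 0.5, std sqrt(1/12) ~= 0.289.
# Average 50 independent draws, repeated 10,000 times.
sample_means = rng.uniform(0, 1, size=(10_000, 50)).mean(axis=1)

# The CLT predicts the means are approximately normal with
# mean 0.5 and std 0.289 / sqrt(50) ~= 0.041.
print(f"Mean of means: {sample_means.mean():.3f}")
print(f"Std of means:  {sample_means.std():.3f}")

# For a normal distribution, ~68% of values fall within one std
within_1sd = (np.abs(sample_means - 0.5) < sample_means.std()).mean()
print(f"Within 1 std:  {within_1sd:.1%}")
```

The individual draws are flat; the averages are a bell curve. That is the CLT in four lines.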

The Binomial Distribution. Counts the number of successes in a fixed number of independent trials. Each trial has two outcomes (success/failure) with the same probability.

The intuition: Flip a biased coin $n$ times. How many heads?

The math:

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

The numpy code:

# Out of 100 customers, how many churn if each has 8.2% probability?
rng = np.random.default_rng(42)
churned_counts = rng.binomial(n=100, p=0.082, size=10_000)

print(f"Most likely count: {np.median(churned_counts):.0f}")
print(f"Range (5th-95th percentile): {np.percentile(churned_counts, 5):.0f} "
      f"to {np.percentile(churned_counts, 95):.0f}")
Most likely count: 8
Range (5th-95th percentile): 4 to 13

StreamFlow's monthly churn rate is 8.2%. In a cohort of 100 customers, you would expect about 8 to churn — but it could easily be 4 or 13. That variability is not noise you should ignore. It is the reality your model must navigate.

The Poisson Distribution. Models the count of events in a fixed interval when events occur independently at a constant rate.

The intuition: How many support tickets does a customer file per month?

The math:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

where $\lambda$ is the average rate.

The numpy code:

# Average support tickets per customer per month: 0.3
tickets = rng.poisson(lam=0.3, size=10_000)

print(f"0 tickets: {(tickets == 0).mean():.1%}")
print(f"1 ticket:  {(tickets == 1).mean():.1%}")
print(f"2+ tickets: {(tickets >= 2).mean():.1%}")
0 tickets: 74.2%
1 ticket:  22.2%
2+ tickets: 3.6%

Most customers never contact support. A few contact support once. The ones who contact twice or more in a single month are rare — and they churn at 4x the base rate. That is a feature worth engineering.
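
Turning that observation into a model input takes one line. A sketch (the 2+ cutoff comes from the discussion above, not from tuning; the ticket counts are regenerated so the snippet is self-contained):

```python
import numpy as np

# Regenerate simulated ticket counts (average 0.3 per customer per month)
rng = np.random.default_rng(42)
tickets = rng.poisson(lam=0.3, size=10_000)

# Binary feature: did the customer file 2 or more tickets this month?
high_ticket_flag = (tickets >= 2).astype(int)
print(f"Flagged customers: {high_ticket_flag.mean():.1%}")
```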

Bayes' Theorem: Updating Beliefs with Evidence

Bayes' theorem is one of the most important ideas in all of data science. It tells you how to update your beliefs when you get new evidence. The formula is simple. The implications are profound.

The intuition: You have a prior belief about something (say, the probability that a patient will be readmitted). You observe new evidence (say, the patient's lab results). Bayes' theorem tells you exactly how to combine your prior belief with the new evidence to get an updated belief.

The math:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

Or, in ML language:

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

Let us make this concrete with the Metro General Hospital anchor.

The Metro General Readmission Problem. Metro General's base readmission rate is 15% — that is, 15% of discharged patients return within 30 days. This is the prior: before you know anything about a specific patient, your best guess is 15%.

Now a patient is discharged. Their blood work shows elevated inflammatory markers. You know from historical data that among patients who were readmitted, 60% had elevated markers. Among patients who were not readmitted, only 20% had elevated markers.

What is the probability this patient will be readmitted?

The math:

  • $P(\text{readmit}) = 0.15$ (prior)
  • $P(\text{elevated} | \text{readmit}) = 0.60$ (likelihood — how likely is this evidence if the patient will be readmitted?)
  • $P(\text{elevated} | \text{no readmit}) = 0.20$ (likelihood under the alternative)
  • $P(\text{elevated}) = 0.60 \times 0.15 + 0.20 \times 0.85 = 0.26$ (total probability of elevated markers)

$$P(\text{readmit} | \text{elevated}) = \frac{0.60 \times 0.15}{0.26} = 0.346$$

The numpy code:

prior_readmit = 0.15
likelihood_elevated_given_readmit = 0.60
likelihood_elevated_given_no_readmit = 0.20

# Total probability of elevated markers (law of total probability)
p_elevated = (likelihood_elevated_given_readmit * prior_readmit +
              likelihood_elevated_given_no_readmit * (1 - prior_readmit))

# Bayes' theorem
posterior_readmit = (likelihood_elevated_given_readmit * prior_readmit) / p_elevated

print(f"Prior (before lab results):   {prior_readmit:.1%}")
print(f"Posterior (after lab results): {posterior_readmit:.1%}")
print(f"Risk increase:                {posterior_readmit / prior_readmit:.1f}x")
Prior (before lab results):   15.0%
Posterior (after lab results): 34.6%
Risk increase:                2.3x

The elevated markers more than doubled the readmission probability — from 15% to 34.6%. Notice that the posterior (34.6%) is not the same as the likelihood (60%). This is the most common Bayesian mistake: confusing $P(\text{evidence} | \text{hypothesis})$ with $P(\text{hypothesis} | \text{evidence})$. The fact that 60% of readmitted patients had elevated markers does not mean that 60% of patients with elevated markers will be readmitted. Bayes' theorem accounts for the base rate.
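
If the algebra feels slippery, you can confirm the same answer by simulation. A sketch using the rates from the example (one million simulated patients, a number chosen only to make the estimate stable): generate readmissions at the 15% base rate, generate markers at the two conditional rates, then measure the readmission rate among marker-positive patients.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# 15% base readmission rate (the prior)
readmitted = rng.random(n) < 0.15

# Elevated markers: 60% among readmitted patients, 20% among the rest
elevated = np.where(readmitted,
                    rng.random(n) < 0.60,
                    rng.random(n) < 0.20)

# Among patients with elevated markers, what fraction were readmitted?
p_readmit_given_elevated = readmitted[elevated].mean()
print(f"P(readmit | elevated) by simulation: {p_readmit_given_elevated:.3f}")
# Lands close to the analytic answer of 0.346
```

Counting instead of deriving gives the same number, which is a useful sanity check whenever a Bayes calculation feels counterintuitive.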

War Story — A hospital IT team once built a readmission alert that flagged any patient whose predicted probability exceeded 50%. Sounds reasonable. But the base rate was 12%, and their model's strongest signal — a complex comorbidity index — only pushed the posterior to about 35%. The result: the alert never fired. Zero patients flagged. Zero interventions. The clinical team assumed the model was broken. The model was fine. The threshold was wrong because someone confused "high risk relative to baseline" with "more likely than not." They changed the threshold to 25% (roughly 2x the base rate), flagged the right patients, and reduced readmissions by 11% in the first quarter.

Conditional Probability: The Foundation of Everything

Before we move on, one more probability concept that pervades ML. Conditional probability is the probability of something given that you already know something else.

The math:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

The intuition: What fraction of StreamFlow customers who filed a support ticket (B) also churned (A)? That is $P(\text{churn} | \text{ticket})$.

# Simulated StreamFlow data
n = 10_000
rng = np.random.default_rng(42)

# 8.2% churn rate overall
churned = rng.random(n) < 0.082

# Support ticket probability: 5% for retained, 25% for churners
filed_ticket = np.where(churned,
                         rng.random(n) < 0.25,
                         rng.random(n) < 0.05)

# Conditional probabilities
p_churn = churned.mean()
p_churn_given_ticket = churned[filed_ticket].mean()
p_churn_given_no_ticket = churned[~filed_ticket].mean()

print(f"P(churn):                   {p_churn:.3f}")
print(f"P(churn | filed ticket):    {p_churn_given_ticket:.3f}")
print(f"P(churn | no ticket):       {p_churn_given_no_ticket:.3f}")
print(f"Relative risk (ticket):     {p_churn_given_ticket / p_churn:.1f}x")
P(churn):                   0.081
P(churn | filed ticket):    0.270
P(churn | no ticket):       0.069
Relative risk (ticket):     3.3x

Filing a support ticket increases the churn probability by 3.3x. This is exactly the kind of signal that a churn model captures as a feature weight. Conditional probability is not an abstract concept — it is the mathematical engine behind every feature in your model. When the model assigns a negative weight to tenure_months, it is encoding the fact that $P(\text{churn} | \text{high tenure}) < P(\text{churn})$.

Why Probability Matters for ML

Every classifier you build in this book is, at its core, estimating a probability distribution. When logistic regression outputs 0.73 for a StreamFlow subscriber, it is saying "the estimated probability of churn is 73%." When a random forest outputs 0.73, it is saying the same thing, just computed differently. Understanding probability distributions gives you:

  • Calibration awareness — Is a predicted 70% actually right 70% of the time? (Chapter 16 goes deep on this.)
  • Decision threshold intuition — Why the default 0.5 threshold is almost never optimal.
  • Loss function understanding — Cross-entropy loss directly comes from probability theory, as we will see in Section 4.4.
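
Calibration checking is just bucketing and counting. Here is a minimal sketch on synthetic data (the outcomes are generated to match the predicted probabilities, so this toy model is perfectly calibrated by construction — real models usually are not):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic predicted probabilities, and outcomes that occur
# with exactly the predicted probability
p_pred = rng.uniform(0, 1, n)
y_obs = rng.random(n) < p_pred

# Bucket predictions into deciles and compare predicted vs observed rates
bins = np.linspace(0, 1, 11)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (p_pred >= lo) & (p_pred < hi)
    print(f"predicted {lo:.1f}-{hi:.1f}: observed rate {y_obs[mask].mean():.3f}")
```

For a well-calibrated model, each bucket's observed rate sits near the middle of its predicted range. A model whose 0.7-0.8 bucket churns at 0.4 is miscalibrated, even if its ranking of customers is perfect.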

4.2 Linear Algebra: How Machines See Data

When you look at a pandas DataFrame, you see rows and columns with names and meaning. When your ML model looks at the same data, it sees a matrix — a rectangular grid of numbers. Linear algebra is the mathematics of matrices, and understanding it tells you what your model is actually computing.

Vectors: One Row, One Customer

A vector is an ordered list of numbers. In ML, a single data point (one customer, one patient, one sensor reading) is a vector.

The intuition: A StreamFlow customer described by three features — tenure (months), monthly usage (hours), and support tickets (count) — is a point in three-dimensional space.

The math: $\mathbf{x} = [24, 38.5, 1]$ is a vector in $\mathbb{R}^3$.

The numpy code:

import numpy as np

# One customer: 24 months tenure, 38.5 hours/month, 1 support ticket
customer = np.array([24, 38.5, 1])
print(f"Shape: {customer.shape}")   # (3,) — a 1D array with 3 elements
print(f"Dimensions: {customer.ndim}")  # 1
Shape: (3,)
Dimensions: 1

Matrices: The Feature Matrix

A matrix is a rectangular grid of numbers. In ML, your entire dataset is a matrix: rows are observations, columns are features. This matrix is universally called $\mathbf{X}$ (capital, bold).

The intuition: Stack all your customer vectors vertically, and you get the feature matrix.

The math:

$$\mathbf{X} = \begin{bmatrix} 24 & 38.5 & 1 \\ 6 & 12.0 & 3 \\ 48 & 62.1 & 0 \\ 11 & 8.3 & 5 \end{bmatrix}$$

This is a $4 \times 3$ matrix: 4 customers, 3 features.

The numpy code:

# Four customers, three features each
X = np.array([
    [24, 38.5, 1],   # Long tenure, moderate usage, 1 ticket
    [ 6, 12.0, 3],   # Short tenure, low usage, 3 tickets (high churn risk)
    [48, 62.1, 0],   # Long tenure, heavy usage, 0 tickets (low churn risk)
    [11,  8.3, 5],   # Short tenure, very low usage, 5 tickets (highest risk)
])

print(f"Shape: {X.shape}")  # (4, 3) — 4 rows, 3 columns
print(f"Customer 2: {X[1]}")  # Zero-indexed
print(f"Feature 'usage' (all customers): {X[:, 1]}")
Shape: (4, 3)
Customer 2: [ 6.  12.   3. ]
Feature 'usage' (all customers): [38.5 12.  62.1  8.3]

Common Mistake — Confusing $n$ and $p$. In ML, $n$ is the number of samples (rows) and $p$ is the number of features (columns). The feature matrix $\mathbf{X}$ has shape $(n, p)$. When you see X.shape return (2400000, 47), that is 2.4 million customers with 47 features each. Many beginners transpose this in their heads and wonder why the matrix seems "too tall."

The Dot Product: How Models Score Customers

The dot product is the single most important operation in linear algebra for ML. When a linear model "makes a prediction," it is computing a dot product.

The intuition: Multiply each feature by a weight, then add them all up. Features with large (positive) weights push the prediction higher. Features with large (negative) weights push it lower.

The math:

$$\hat{y} = \mathbf{w} \cdot \mathbf{x} = \sum_{j=1}^{p} w_j x_j = w_1 x_1 + w_2 x_2 + \cdots + w_p x_p$$

The numpy code:

# Model weights (learned during training)
# Negative weight on tenure/usage = more tenure/usage means less churn
# Positive weight on tickets = more tickets means more churn
weights = np.array([-0.02, -0.03, 0.15])

# Score each customer
for i, customer in enumerate(X):
    score = np.dot(weights, customer)
    print(f"Customer {i+1}: features={customer}, score={score:.3f}")
Customer 1: features=[24.  38.5  1. ], score=-1.485
Customer 2: features=[ 6.  12.   3. ], score=-0.030
Customer 3: features=[48.  62.1  0. ], score=-2.823
Customer 4: features=[11.   8.3  5. ], score=0.281

Lower scores mean less churn risk. Customer 3 (long tenure, heavy usage, no tickets) has the lowest score. Customer 4 (short tenure, very low usage, 5 tickets) has the highest. That is exactly what you would expect. The dot product formalized your intuition into a computation.

Matrix Multiplication: Scoring All Customers at Once

When you compute dot products for all customers simultaneously, you are doing matrix multiplication.

The math:

$$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w}$$

The numpy code:

# Score all customers at once with matrix multiplication
scores = X @ weights  # The @ operator is matrix multiplication in numpy

print(f"All scores: {scores}")
# Same as: np.dot(X, weights)
All scores: [-1.485 -0.03  -2.823  0.281]

Same result as the loop, but computed in one vectorized operation. When StreamFlow has 2.4 million customers, the difference between a loop and a matrix multiply is the difference between minutes and milliseconds.

Transpose and Inverse

Two more operations you will see constantly:

Transpose ($\mathbf{X}^T$) flips rows and columns. An $(n, p)$ matrix becomes $(p, n)$.

print(f"X shape:    {X.shape}")
print(f"X^T shape:  {X.T.shape}")
X shape:    (4, 3)
X^T shape:  (3, 4)

You will see $\mathbf{X}^T \mathbf{X}$ everywhere in ML — it computes the covariance structure of your features and shows up in the closed-form solution for linear regression.
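
The covariance connection is exact: center each column of the data first, and $\mathbf{X}^T\mathbf{X} / (n-1)$ is the sample covariance matrix. A quick check against numpy's own np.cov (the random data here is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(0, 1, (500, 3))  # 500 samples, 3 features

# Center each column by subtracting its mean
X_centered = X_demo - X_demo.mean(axis=0)

# X^T X on centered data, divided by n - 1, is the sample covariance
cov_manual = X_centered.T @ X_centered / (len(X_demo) - 1)
cov_numpy = np.cov(X_demo, rowvar=False)  # rowvar=False: features in columns

print(np.allclose(cov_manual, cov_numpy))  # True
```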

Inverse ($\mathbf{A}^{-1}$) is the matrix equivalent of division. For a square matrix $\mathbf{A}$, the inverse satisfies $\mathbf{A} \mathbf{A}^{-1} = \mathbf{I}$ (the identity matrix). The closed-form solution for linear regression is:

$$\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

# The normal equation for linear regression
# (We'll use gradient descent instead in practice, but this shows the math)
y = np.array([0, 1, 0, 1])  # 0 = retained, 1 = churned

XtX = X.T @ X              # (3, 4) @ (4, 3) = (3, 3)
Xty = X.T @ y              # (3, 4) @ (4,)   = (3,)
w_exact = np.linalg.solve(XtX, Xty)  # More numerically stable than inv()

print(f"Exact weights: {w_exact}")
Exact weights: [-0.01616162  -0.01338384   0.11555556]

Math Sidebar — You rarely compute the inverse directly in practice. np.linalg.solve(A, b) solves $\mathbf{Ax} = \mathbf{b}$ without explicitly forming $\mathbf{A}^{-1}$, which is faster and more numerically stable. When you see $(\mathbf{X}^T \mathbf{X})^{-1}$ in a textbook, read it as "solve this system" not "compute the inverse." And for large datasets, even the exact solution is too slow — which is why we need gradient descent.
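
To see that the two routes agree, here is a toy system (in real code you would only ever call solve):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(0, 1, (3, 3))
b = rng.normal(0, 1, 3)

x_inv = np.linalg.inv(A) @ b     # explicit inverse — avoid in practice
x_solve = np.linalg.solve(A, b)  # solves Ax = b directly

print(np.allclose(x_inv, x_solve))  # True
```

Same answer, but solve skips forming the inverse, which matters for speed and numerical stability as matrices get larger or closer to singular.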

A Preview: Why Linear Algebra Keeps Coming Back

Linear algebra is not just a one-chapter topic. It is the computational backbone of nearly every algorithm in this book:

  • PCA (Chapter 21) finds the directions of maximum variance in your data by computing eigenvectors of the covariance matrix $\mathbf{X}^T\mathbf{X}$.
  • SVMs (Chapter 12) find the maximum-margin hyperplane using matrix operations.
  • Recommender systems (Chapter 24) decompose a user-item matrix into low-rank factors.
  • Every scikit-learn model stores its learned parameters as numpy arrays and computes predictions via matrix multiplication.

When a scikit-learn model takes 3 seconds on 10,000 rows, you might expect about 100x that — 5 minutes — on 1,000,000 rows. If it takes an hour instead, the reason is usually that a matrix operation crossed a threshold where the working set no longer fits in CPU cache. Understanding matrix shapes and operations gives you the vocabulary to reason about computational cost.

# Computational cost comparison
import time

# Small matrix multiply
X_small = np.random.default_rng(42).normal(0, 1, (1_000, 50))
w_small = np.random.default_rng(42).normal(0, 1, 50)

start = time.perf_counter()
for _ in range(1000):
    _ = X_small @ w_small
small_time = time.perf_counter() - start

# Larger matrix multiply
X_large = np.random.default_rng(42).normal(0, 1, (100_000, 50))

start = time.perf_counter()
for _ in range(1000):
    _ = X_large @ w_small
large_time = time.perf_counter() - start

print(f"1,000 rows x 1000 repeats:   {small_time:.3f}s")
print(f"100,000 rows x 1000 repeats: {large_time:.3f}s")
print(f"Ratio: {large_time / small_time:.1f}x (data is 100x larger)")

The ratio will not be exactly 100x. Numpy leverages BLAS (Basic Linear Algebra Subprograms) libraries that are optimized for specific matrix sizes — another reason to let libraries handle the linear algebra rather than writing loops.


4.3 Calculus and Gradient Descent: How Models Learn

Here is the central question of machine learning: you have a loss function that measures how wrong your model is, and you have parameters (weights) you can adjust. How do you find the parameters that make the loss as small as possible?

Calculus provides the answer: take the derivative of the loss with respect to each parameter, and it tells you which direction to adjust. Gradient descent is the algorithm that does this repeatedly until the loss stops decreasing.

Derivatives: The Slope of the Loss

The intuition: A derivative tells you the slope of a function at a specific point. If you are standing on a hillside, the slope tells you which direction is downhill and how steep it is.

The math: For a function $f(x)$, the derivative $f'(x)$ (or $\frac{df}{dx}$) gives the rate of change at point $x$.

The numpy code:

def f(x):
    """A simple loss function: L(w) = (w - 3)^2"""
    return (x - 3) ** 2

def f_derivative(x):
    """Derivative: dL/dw = 2(w - 3)"""
    return 2 * (x - 3)

# Evaluate at different points
for w in [0, 1, 3, 5]:
    print(f"w={w}: loss={f(w):.1f}, slope={f_derivative(w):.1f}")
w=0: loss=9.0, slope=-6.0
w=1: loss=4.0, slope=-4.0
w=3: loss=0.0, slope=0.0
w=5: loss=4.0, slope=4.0

At $w=0$, the slope is $-6$ — the loss decreases if you increase $w$ (move right). At $w=5$, the slope is $+4$ — the loss decreases if you decrease $w$ (move left). At $w=3$, the slope is $0$ — you are at the minimum. This is the key insight: the sign of the derivative tells you which direction to go, and the magnitude tells you how far.
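
A habit worth building now: whenever you derive a derivative by hand, verify it numerically. The slope estimated from two nearby points should match the analytic formula. A sketch using the same toy loss (the helper numerical_derivative is introduced here for illustration):

```python
def f(x):
    """Same toy loss as above: L(w) = (w - 3)^2"""
    return (x - 3) ** 2

def f_derivative(x):
    """Analytic derivative: dL/dw = 2(w - 3)"""
    return 2 * (x - 3)

def numerical_derivative(func, x, h=1e-6):
    """Central difference: (f(x+h) - f(x-h)) / (2h)"""
    return (func(x + h) - func(x - h)) / (2 * h)

for w in [0.0, 1.0, 5.0]:
    analytic = f_derivative(w)
    numeric = numerical_derivative(f, w)
    print(f"w={w}: analytic={analytic:.4f}, numeric={numeric:.4f}")
```

If the two disagree, the bug is almost always in the hand-derived formula. This trick, called a gradient check, scales up to the multi-parameter gradients in the next section.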

Partial Derivatives and the Gradient

Real ML models have many parameters, not just one. A model with 47 features has 47 weights (plus a bias term). You need to know the slope in every direction simultaneously.

The intuition: If you are standing on a 47-dimensional hillside (just go with it), you need to know which direction is steepest downhill. The gradient is a vector that points in that direction.

The math: For a loss function $L(w_1, w_2, \ldots, w_p)$, the gradient is:

$$\nabla L = \left[\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_p}\right]$$

Each $\frac{\partial L}{\partial w_j}$ is a partial derivative — the slope of the loss if you wiggle just that one parameter while holding the others fixed.

The numpy code:

def compute_gradient_mse(X, y, w):
    """
    Gradient of MSE loss for linear regression.

    MSE = (1/n) * sum((y - Xw)^2)
    Gradient = -(2/n) * X^T @ (y - Xw)
    """
    n = len(y)
    predictions = X @ w
    residuals = y - predictions
    gradient = -(2 / n) * X.T @ residuals
    return gradient

# Using our 4-customer example
y = np.array([0.0, 1.0, 0.0, 1.0])
w = np.zeros(3)  # Start with all zeros

grad = compute_gradient_mse(X, y, w)
print(f"Gradient at w=[0,0,0]: {grad}")
print(f"Interpretation: move w1 by {-grad[0]:+.3f}, w2 by {-grad[1]:+.3f}, w3 by {-grad[2]:+.3f}")
Gradient at w=[0,0,0]: [ -8.5   -10.15   -4.  ]
Interpretation: move w1 by +8.500, w2 by +10.150, w3 by +4.000

The negative gradient points in the direction of steepest descent. Here it says "increase all three weights" — which makes sense because all weights start at zero and we need positive weights to predict the churners (customers 2 and 4 both churned).

Gradient Descent: Walking Downhill in Fog

Gradient descent is the algorithm that repeatedly follows the gradient downhill until it reaches a minimum.

The intuition: You are lost in fog on a hillside. You cannot see the valley, but you can feel the slope under your feet. Strategy: take a step in the steepest downhill direction. Feel the slope again. Take another step. Repeat until the ground feels flat.

The math:

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \nabla L(\mathbf{w}_t)$$

where $\alpha$ is the learning rate — how big a step you take each time.

The numpy code:

def gradient_descent(X, y, learning_rate=0.0001, n_iterations=1000):
    """
    Gradient descent for linear regression (MSE loss).

    Returns: final weights, loss history
    """
    n, p = X.shape
    w = np.zeros(p)
    loss_history = []

    for i in range(n_iterations):
        # Forward pass: compute predictions
        predictions = X @ w

        # Compute loss (MSE)
        loss = np.mean((y - predictions) ** 2)
        loss_history.append(loss)

        # Backward pass: compute gradient
        gradient = -(2 / n) * X.T @ (y - predictions)

        # Update weights
        w = w - learning_rate * gradient

        # Print progress every 200 iterations
        if i % 200 == 0:
            print(f"Iteration {i:4d}: loss={loss:.6f}, w={w}")

    return w, loss_history

# Run gradient descent on our 4-customer example
w_final, losses = gradient_descent(X, y, learning_rate=0.00005, n_iterations=1000)
print(f"\nFinal weights: {w_final}")
print(f"Final loss:    {losses[-1]:.6f}")
Iteration    0: loss=0.500000, w=[0.000425  0.00050750 0.0002    ]
Iteration  200: loss=0.189735, w=[-0.00499832 -0.00647498  0.06042498]
Iteration  400: loss=0.113982, w=[-0.00903329 -0.00951459  0.08411498]
Iteration  600: loss=0.092637, w=[-0.01137949 -0.01106689  0.09611133]
Iteration  800: loss=0.086185, w=[-0.01274948 -0.01194826  0.10247321]

Final weights: [-0.01360447 -0.01247698  0.10617413]
Final loss:    0.083839

The weights converge toward the exact solution we computed earlier. The loss decreases monotonically. This is gradient descent working correctly.

The Learning Rate: Too Big, Too Small, Just Right

The learning rate $\alpha$ is the most important hyperparameter in gradient descent.

The intuition:

  • Too small: You take tiny steps. You will eventually reach the minimum, but it will take forever.
  • Too large: You take huge steps. You overshoot the minimum, bounce back, overshoot again, and the loss increases instead of decreasing. Your model diverges.
  • Just right: You take reasonably large steps that get smaller as you approach the minimum.

# Demonstrate learning rate effects
for lr in [0.00001, 0.00005, 0.001]:
    w, losses = gradient_descent(X, y, learning_rate=lr, n_iterations=500)
    print(f"lr={lr:.5f}: final_loss={losses[-1]:.6f}, "
          f"converged={'yes' if losses[-1] < 0.1 else 'no'}")
lr=0.00001: final_loss=0.210901, converged=no
lr=0.00005: final_loss=0.088703, converged=yes
lr=0.00100: final_loss=inf, converged=no

The learning rate of 0.00001 is too slow — still far from converged after 500 iterations. The learning rate of 0.001 is too large — the loss explodes to infinity. Only 0.00005 converges properly. In practice, you will use optimizers like Adam that adapt the learning rate automatically, but understanding this tradeoff is essential for debugging.

Try It — Modify the gradient descent code above to track the learning rate sensitivity. Try learning rates of 0.00002, 0.00005, 0.0001, and 0.0005. Plot the loss curves on the same axes. At what learning rate does the loss start to oscillate instead of decreasing smoothly?
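
Since Adam came up: here is a minimal sketch of its update rule on the one-parameter toy loss $L(w) = (w-3)^2$. The hyperparameter defaults below are the commonly cited ones, but this is an illustration of the mechanics, not a production optimizer.

```python
import numpy as np

def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, n_steps=300):
    """Minimal scalar Adam: keep running averages of the gradient (m)
    and squared gradient (v), and scale each step by their ratio."""
    w = w0
    m, v = 0.0, 0.0
    for t in range(1, n_steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g       # smoothed gradient
        v = beta2 * v + (1 - beta2) * g ** 2  # smoothed squared gradient
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Minimize L(w) = (w - 3)^2, whose gradient is 2(w - 3)
w_adam = adam_minimize(lambda w: 2 * (w - 3), w0=0.0)
print(f"Adam result: w = {w_adam:.3f}")
```

Notice that the step size is roughly lr regardless of how steep the loss is — the division by the smoothed gradient magnitude normalizes it. That is why Adam is far less sensitive to the learning rate than plain gradient descent.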

Batch vs. Stochastic vs. Mini-Batch

The gradient descent we implemented above is batch gradient descent — it computes the gradient using every data point before updating the weights. With 4 customers, this is instant. With 2.4 million StreamFlow subscribers, each iteration requires a matrix multiplication with a $(2400000, 47)$ matrix. That is slow.

Stochastic Gradient Descent (SGD) computes the gradient using a single random data point per iteration. This is much faster per iteration but noisier — the gradient estimate bounces around because one customer is not representative of the full dataset.

Mini-batch Gradient Descent is the compromise: compute the gradient using a random subset (a "batch") of, say, 256 data points. Faster than batch, less noisy than stochastic. This is what nearly every modern ML library actually uses.

def mini_batch_gradient_descent(X, y, batch_size=32, lr=0.001, n_epochs=50):
    """
    Mini-batch gradient descent for linear regression.

    An epoch = one full pass through the data.
    """
    rng = np.random.default_rng(42)
    n, p = X.shape
    w = np.zeros(p)
    loss_history = []

    for epoch in range(n_epochs):
        # Shuffle data each epoch
        indices = rng.permutation(n)
        X_shuffled = X[indices]
        y_shuffled = y[indices]

        for start in range(0, n, batch_size):
            X_batch = X_shuffled[start:start + batch_size]
            y_batch = y_shuffled[start:start + batch_size]

            predictions = X_batch @ w
            gradient = -(2 / len(y_batch)) * X_batch.T @ (y_batch - predictions)
            w = w - lr * gradient

        # Track loss on full dataset after each epoch
        full_loss = np.mean((y - X @ w) ** 2)
        loss_history.append(full_loss)

    return w, loss_history

In practice, you will rarely implement these variants by hand. When you call SGDClassifier() or SGDRegressor() in scikit-learn, it runs stochastic gradient descent, updating the weights one sample at a time. When you train a neural network in PyTorch or TensorFlow, mini-batch is the default. The point is understanding why: batch is exact but slow, stochastic is fast but noisy, mini-batch balances both.

Production Tip — The batch size affects not just speed but also generalization. Empirically, smaller batch sizes (32-256) often produce models that generalize better than very large batch sizes (8192+). The noise from small batches acts as a form of implicit regularization — it prevents the optimizer from settling into sharp, narrow minima that overfit the training data. This observation, originally from deep learning research, applies to any model trained with SGD.

Convergence: When to Stop

How do you know when gradient descent is done? Three common criteria:

  1. Maximum iterations reached — A safety valve. Stop after 1,000 or 10,000 steps regardless.
  2. Loss change below threshold — If $|L_{t} - L_{t-1}| < \epsilon$ (say, $10^{-6}$), the loss has effectively stopped improving.
  3. Gradient norm below threshold — If $\|\nabla L\| < \epsilon$, you are on flat ground. You might be at a minimum.

def gradient_descent_with_convergence(X, y, learning_rate=0.00005,
                                       max_iterations=10000, tol=1e-7):
    """Gradient descent with early stopping on loss change."""
    n, p = X.shape
    w = np.zeros(p)
    prev_loss = float('inf')

    for i in range(max_iterations):
        predictions = X @ w
        loss = np.mean((y - predictions) ** 2)
        gradient = -(2 / n) * X.T @ (y - predictions)
        w = w - learning_rate * gradient

        # Check convergence
        if abs(prev_loss - loss) < tol:
            print(f"Converged at iteration {i}, loss={loss:.8f}")
            return w, i
        prev_loss = loss

    print(f"Did not converge in {max_iterations} iterations, loss={loss:.8f}")
    return w, max_iterations

w_converged, n_iter = gradient_descent_with_convergence(X, y)
Converged at iteration 4217, loss=0.08295570

Common Mistake — Declaring convergence too early. A tolerance of $10^{-3}$ might look converged on a loss plot, but the weights can still be far from optimal. For most problems, $10^{-6}$ or $10^{-7}$ is a safer tolerance. On the other hand, in production you sometimes want early stopping to prevent overfitting — but that is a different mechanism (Chapter 18).


4.4 Loss Functions: What Models Learn

Gradient descent needs a loss function to minimize. The loss function defines what "wrong" means. Choose the wrong loss function and your model will optimize the wrong thing — sometimes spectacularly.

Mean Squared Error (MSE): For Regression

MSE is the default loss function for regression problems. It measures the average squared difference between predictions and actual values.

The intuition: Penalize big mistakes quadratically. Being off by 10 is 100 times worse than being off by 1.

The math:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The numpy code:

y_true = np.array([45.0, 12.0, 62.0, 8.0])  # Actual usage hours
y_pred = np.array([42.0, 18.0, 58.0, 15.0])  # Predicted usage hours

mse = np.mean((y_true - y_pred) ** 2)
print(f"MSE: {mse:.2f}")

# Breakdown by customer
for i in range(len(y_true)):
    error = y_true[i] - y_pred[i]
    print(f"  Customer {i+1}: actual={y_true[i]:.0f}, pred={y_pred[i]:.0f}, "
          f"error={error:.0f}, squared={error**2:.0f}")
MSE: 27.50
  Customer 1: actual=45, pred=42, error=3, squared=9
  Customer 2: actual=12, pred=18, error=-6, squared=36
  Customer 3: actual=62, pred=58, error=4, squared=16
  Customer 4: actual=8, pred=15, error=-7, squared=49

Notice that Customer 4's error of 7 contributes 49 to the total, while Customer 1's error of 3 contributes only 9: the error is barely more than twice as large, but its contribution is more than five times as large. MSE penalizes outliers heavily.
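When outliers should not dominate the loss, Mean Absolute Error (MAE) is the usual alternative. A quick sketch using the same four customers plus an invented fifth whose prediction is wildly wrong:

```python
import numpy as np

# Four customers from above, plus one invented outlier.
y_true = np.array([45.0, 12.0, 62.0, 8.0, 50.0])
y_pred = np.array([42.0, 18.0, 58.0, 15.0, 110.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)     # the outlier contributes 3600 of the 3710 total
mae = np.mean(np.abs(errors))  # the outlier contributes 60 of the 80 total
print(f"MSE: {mse:.2f}")
print(f"MAE: {mae:.2f}")
```

Under MSE the single outlier dominates everything else; under MAE it is just one more error.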

Cross-Entropy (Log-Loss): For Classification

This is the loss function you will use for churn prediction. Cross-entropy measures how well predicted probabilities match actual binary outcomes.

The intuition: If you predict 90% probability of churn and the customer churns, you were mostly right — small penalty. If you predict 10% probability and the customer churns, you were confidently wrong — huge penalty. The penalty for being confidently wrong is much larger than the reward for being confidently right.

The math:

$$\text{Log-Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]$$

where $\hat{p}_i$ is the predicted probability of class 1 (churn).

The numpy code:

def log_loss(y_true, y_pred_proba, eps=1e-15):
    """
    Compute binary cross-entropy (log-loss).

    eps clips predictions away from 0 and 1 to avoid log(0).
    """
    y_pred_proba = np.clip(y_pred_proba, eps, 1 - eps)
    return -np.mean(
        y_true * np.log(y_pred_proba) +
        (1 - y_true) * np.log(1 - y_pred_proba)
    )

y_true = np.array([0, 1, 0, 1])  # Actual: retained, churned, retained, churned
y_pred = np.array([0.1, 0.8, 0.2, 0.7])  # Good predictions

print(f"Good predictions — log-loss: {log_loss(y_true, y_pred):.4f}")

# Now try bad predictions
y_pred_bad = np.array([0.9, 0.2, 0.8, 0.3])  # Inverted — confidently wrong
print(f"Bad predictions —  log-loss: {log_loss(y_true, y_pred_bad):.4f}")

# And mediocre predictions
y_pred_meh = np.array([0.5, 0.5, 0.5, 0.5])  # Just predicting 50/50
print(f"Uninformed (0.5) — log-loss: {log_loss(y_true, y_pred_meh):.4f}")
Good predictions — log-loss: 0.2271
Bad predictions —  log-loss: 1.6814
Uninformed (0.5) — log-loss: 0.6931

The good model (0.23) is far better than the uninformed model (0.69). The bad model (1.68) is punished harshly for being confidently wrong. Notice that the uninformed model's log-loss of 0.693 is $-\log(0.5) = \log(2)$. This is your baseline. If your model's log-loss is above 0.693, it is performing worse than a coin flip.
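The penalty for confident wrongness is not merely large, it is unbounded. A tiny self-contained check of the per-example loss $-\log(p)$ as the predicted probability $p$ of the true class shrinks:

```python
import numpy as np

# Per-example log-loss contribution when the true class was
# predicted with probability p: it grows without bound as p -> 0.
probs = [0.5, 0.1, 0.01, 0.001]
losses = [-np.log(p) for p in probs]
for p, loss in zip(probs, losses):
    print(f"p = {p:<5} -> loss contribution = {loss:.3f}")
```

This is also why the `eps` clipping in the `log_loss` function above matters: without it, a prediction of exactly 0 for a positive example would produce an infinite loss.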

War Story — A junior data scientist at StreamFlow built a churn classifier and trained and evaluated it with MSE instead of log-loss. The model predicted probabilities of 0.4 and 0.6 for most customers — hedging its bets — because MSE's penalty is bounded: even a confidently wrong prediction costs at most 1, so the model felt little pressure to push probabilities toward 0 or 1. Switching to log-loss as the training objective made the model produce sharper, better-calibrated probabilities. MSE is for regression. Log-loss is for classification. Using the wrong one is like measuring temperature in kilograms — the units are wrong and the optimization goes sideways.

Why MSE Is Wrong for Binary Classification

Let us make this concrete with the StreamFlow churn data.

# Four customers: actual churn labels
y_true = np.array([0, 1, 0, 1])

# A model that predicts probabilities
y_pred = np.array([0.2, 0.9, 0.1, 0.8])

# MSE treats 0/1 as numbers
mse = np.mean((y_true - y_pred) ** 2)
print(f"MSE:      {mse:.4f}")

# Log-loss treats them as probabilities
ll = log_loss(y_true, y_pred)
print(f"Log-loss: {ll:.4f}")

# Now consider: what if the model predicts 0.51 for a churner?
# MSE says: error = (1 - 0.51)^2 = 0.2401 — not bad!
# Log-loss says: -log(0.51) = 0.6733 — barely better than guessing

print(f"\nPredicting 0.51 for a churner:")
print(f"  MSE contribution:      {(1 - 0.51)**2:.4f}")
print(f"  Log-loss contribution:  {-np.log(0.51):.4f}")

print(f"\nPredicting 0.99 for a churner:")
print(f"  MSE contribution:      {(1 - 0.99)**2:.4f}")
print(f"  Log-loss contribution:  {-np.log(0.99):.4f}")
MSE:      0.0250
Log-loss: 0.1643

Predicting 0.51 for a churner:
  MSE contribution:      0.2401
  Log-loss contribution:  0.6733

Predicting 0.99 for a churner:
  MSE contribution:      0.0001
  Log-loss contribution:  0.0101

MSE says predicting 0.51 for a churner costs 0.24, which reads as "not bad" on a scale where the worst possible prediction costs 1.0. Log-loss says it costs 0.67, essentially the 0.693 coin-flip baseline. And while MSE's per-example penalty is capped at 1 no matter how confidently wrong you are, log-loss grows without bound. Log-loss therefore gives the model a much stronger push toward the correct probability, which is why nearly every probabilistic classifier trains on it (or a close relative).

Hinge Loss: For SVMs

Hinge loss is the loss function behind Support Vector Machines (Chapter 12). It measures not just whether you got the classification right, but whether you got it right by a margin.

The intuition: "I do not care that you got it right. I care that you got it right confidently." If a customer should be labeled +1 (churn) and your model scores them at +0.01, hinge loss says that is almost as bad as getting it wrong. You need a score of at least +1 to pay zero penalty.

The math:

$$\text{Hinge Loss} = \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i \cdot \hat{y}_i)$$

where $y_i \in \{-1, +1\}$ (note: SVM convention uses -1/+1 instead of 0/1).

The numpy code:

def hinge_loss(y_true, y_scores):
    """Hinge loss. y_true should be -1 or +1."""
    return np.mean(np.maximum(0, 1 - y_true * y_scores))

# Convert 0/1 labels to -1/+1
y_svm = np.array([-1, +1, -1, +1])
scores = np.array([-2.0, +1.5, -0.5, +0.3])

loss = hinge_loss(y_svm, scores)
print(f"Hinge loss: {loss:.4f}")

# Breakdown
for i in range(len(y_svm)):
    margin = y_svm[i] * scores[i]
    contrib = max(0, 1 - margin)
    status = "correct + margin" if margin >= 1 else (
             "correct, no margin" if margin > 0 else "wrong")
    print(f"  Customer {i+1}: y={y_svm[i]:+d}, score={scores[i]:+.1f}, "
          f"margin={margin:.1f}, loss={contrib:.1f} ({status})")
Hinge loss: 0.3000
  Customer 1: y=-1, score=-2.0, margin=2.0, loss=0.0 (correct + margin)
  Customer 2: y=+1, score=+1.5, margin=1.5, loss=0.0 (correct + margin)
  Customer 3: y=-1, score=-0.5, margin=0.5, loss=0.5 (correct, no margin)
  Customer 4: y=+1, score=+0.3, margin=0.3, loss=0.7 (correct, no margin)

Customer 3 is classified correctly (negative score for a negative class) but without sufficient margin — it pays a penalty of 0.5. Customer 4 is also correct but barely — penalty 0.7. Customers 1 and 2 are correct with margin and pay nothing. Hinge loss rewards confident correctness.

Choosing the Right Loss Function

| Problem Type | Default Loss | When to Use |
|---|---|---|
| Regression | MSE | Continuous target, outliers acceptable |
| Regression (robust) | MAE or Huber | Continuous target, outliers should not dominate |
| Binary classification | Log-loss (cross-entropy) | When you need calibrated probabilities |
| Binary classification (margin) | Hinge loss | SVM, when you want maximum-margin classification |
| Multi-class classification | Categorical cross-entropy | When there are 3+ classes |
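The table mentions Huber loss without showing it. Huber behaves like MSE for small errors and like MAE for large ones, with a threshold $\delta$ marking the switch. A minimal sketch, reusing the four customers from the MSE example with an assumed $\delta = 5$:

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=5.0):
    """Quadratic penalty for errors within delta, linear beyond it."""
    e = y_true - y_pred
    quadratic = 0.5 * e ** 2
    linear = delta * (np.abs(e) - 0.5 * delta)
    return np.mean(np.where(np.abs(e) <= delta, quadratic, linear))

y_true = np.array([45.0, 12.0, 62.0, 8.0])
y_pred = np.array([42.0, 18.0, 58.0, 15.0])
print(f"Huber loss (delta=5): {huber_loss(y_true, y_pred):.3f}")
```

Errors of 3 and 4 pay the quadratic price; errors of 6 and 7 exceed $\delta$ and pay only a linear one, so a single outlier cannot dominate the total.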

Production Tip — In practice, you rarely choose a loss function explicitly. When you call LogisticRegression() in scikit-learn, it uses log-loss. When you call LinearRegression(), it uses MSE. When you call SVC(), it uses hinge loss. Knowing which loss function your model uses matters when (a) your model is not converging, (b) you need to explain predictions, or (c) you are building a custom objective for gradient boosting (Chapter 14).


4.5 Regularization: Taming the Loss Landscape

Regularization is a penalty you add to the loss function to prevent the model from fitting the training data too closely. It is one of the most important tools in the ML practitioner's toolkit, and it has a beautiful geometric interpretation.

The Problem: Overfitting Through Large Weights

When a model overfits, its weights tend to be large. Large weights mean the model is very sensitive to small changes in the input — a tiny change in one feature causes a wild swing in the prediction. Regularization fights this by adding a penalty for large weights.
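To see that sensitivity concretely, here is a tiny sketch with invented weights and a single input. The only difference between the two models is the scale of the weights, yet the same 0.01 perturbation in one feature moves the large-weight prediction 100 times further:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
x_perturbed = x.copy()
x_perturbed[0] += 0.01          # a tiny change in one feature

shifts = {}
for name, w in [("large", np.array([50.0, -30.0, 80.0])),
                ("small", np.array([0.5, -0.3, 0.8]))]:
    shifts[name] = abs(w @ x_perturbed - w @ x)
    print(f"{name} weights: prediction shifts by {shifts[name]:.4f}")
```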

L2 Regularization (Ridge): The Circle Constraint

The intuition: Add a penalty equal to the sum of squared weights. This encourages all weights to be small but non-zero. Geometrically, it constrains the weights to lie within a circle (in 2D) or a sphere (in higher dimensions).

The math:

$$L_{\text{ridge}} = \text{MSE} + \lambda \sum_{j=1}^{p} w_j^2 = \text{MSE} + \lambda \|\mathbf{w}\|_2^2$$

where $\lambda$ controls the strength of the penalty. Higher $\lambda$ means more regularization (smaller weights).

The numpy code:

def mse_loss_l2(X, y, w, lambda_reg):
    """MSE with L2 regularization (Ridge)."""
    n = len(y)
    predictions = X @ w
    mse = np.mean((y - predictions) ** 2)
    l2_penalty = lambda_reg * np.sum(w ** 2)
    return mse + l2_penalty

# Compare unregularized vs regularized loss
y = np.array([0.0, 1.0, 0.0, 1.0])
w_large = np.array([0.5, -0.3, 0.8])   # Large weights
w_small = np.array([0.05, -0.03, 0.08])  # Small weights

for lam in [0, 0.1, 1.0]:
    loss_large = mse_loss_l2(X, y, w_large, lam)
    loss_small = mse_loss_l2(X, y, w_small, lam)
    print(f"lambda={lam:.1f}: large_w_loss={loss_large:.4f}, "
          f"small_w_loss={loss_small:.4f}")
lambda=0.0: large_w_loss=36.7709, small_w_loss=0.2539
lambda=0.1: large_w_loss=36.8689, small_w_loss=0.2548
lambda=1.0: large_w_loss=37.7509, small_w_loss=0.2627

With no regularization ($\lambda=0$), only the MSE matters. With strong regularization ($\lambda=1$), the penalty for large weights becomes significant. The model with large weights (0.5, -0.3, 0.8) pays an extra penalty of about 1.0, while the model with small weights (0.05, -0.03, 0.08) barely pays anything.
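A useful aside: Ridge is one of the few regularized models with a closed-form solution, $\mathbf{w} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$, because adding $\lambda$ to the diagonal keeps the matrix invertible. A sketch on invented data, checked against scikit-learn (whose Ridge minimizes $\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^2 + \alpha\|\mathbf{w}\|^2$, so we disable the intercept to match):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Invented regression data, separate from the chapter's arrays.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 50)

lam = 1.0
# Closed form: w = (X^T X + lam * I)^{-1} X^T y
w_closed = np.linalg.solve(X_demo.T @ X_demo + lam * np.eye(3),
                           X_demo.T @ y_demo)
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X_demo, y_demo).coef_
print("Closed form: ", np.round(w_closed, 4))
print("scikit-learn:", np.round(w_sklearn, 4))
```

Lasso has no such closed form — the absolute value makes the penalty non-differentiable at zero — which is why Lasso is fit iteratively.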

L1 Regularization (Lasso): The Diamond Constraint

The intuition: Add a penalty equal to the sum of the absolute values of the weights. This encourages weights to be exactly zero — effectively performing feature selection. Geometrically, it constrains the weights to lie within a diamond (in 2D) or a cross-polytope (in higher dimensions).

The math:

$$L_{\text{lasso}} = \text{MSE} + \lambda \sum_{j=1}^{p} |w_j| = \text{MSE} + \lambda \|\mathbf{w}\|_1$$

The numpy code:

def mse_loss_l1(X, y, w, lambda_reg):
    """MSE with L1 regularization (Lasso)."""
    n = len(y)
    predictions = X @ w
    mse = np.mean((y - predictions) ** 2)
    l1_penalty = lambda_reg * np.sum(np.abs(w))
    return mse + l1_penalty

# L1 vs L2 on the same weights
w = np.array([0.5, -0.3, 0.8])
lam = 0.5

l1_penalty = lam * np.sum(np.abs(w))
l2_penalty = lam * np.sum(w ** 2)

print(f"Weights: {w}")
print(f"L1 penalty: {l1_penalty:.4f}  (sum of |w|)")
print(f"L2 penalty: {l2_penalty:.4f}  (sum of w^2)")
Weights: [ 0.5 -0.3  0.8]
L1 penalty: 0.8000  (sum of |w|)
L2 penalty: 0.4900  (sum of w^2)

The Geometric Picture: Why L1 Produces Zeros

The reason L1 regularization produces sparse solutions (many weights exactly zero) while L2 produces small-but-nonzero weights has a beautiful geometric explanation.

Think of it this way: the loss function defines a bowl-shaped surface in weight space. The minimum of the unregularized loss is at the bottom of the bowl. Regularization says "you cannot go to the bottom — you must stay within a budget." L2's budget is a circle. L1's budget is a diamond.

Now picture the contour lines of the loss function (concentric ellipses) meeting the constraint region:

  • L2 (circle): The ellipses meet the circle at some smooth point. The intersection almost never lands exactly on an axis, so all weights are non-zero.
  • L1 (diamond): The ellipses meet the diamond at one of its corners. The corners sit on the axes — which means one or more weights are exactly zero.
# Demonstrate L1 sparsity vs L2 non-sparsity using scikit-learn
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

# Scale features for regularization to work properly
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Strong regularization
ridge = Ridge(alpha=1.0).fit(X_scaled, y)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

print("Ridge (L2) weights:", np.round(ridge.coef_, 4))
print("Lasso (L1) weights:", np.round(lasso.coef_, 4))
print(f"Lasso zeros: {np.sum(lasso.coef_ == 0)} out of {len(lasso.coef_)}")
Ridge (L2) weights: [-0.1839 -0.2411  0.2411]
Lasso (L1) weights: [-0.     -0.1403  0.1478]
Lasso zeros: 1 out of 3

L1 (Lasso) set the tenure weight to exactly zero — it decided tenure was not worth keeping given the regularization budget. L2 (Ridge) shrank all weights but kept them all non-zero.
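How much sparsity you get depends on the regularization strength. A self-contained sketch on invented data with eight features, of which only the first two carry signal:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Invented data: 8 features, but only the first two matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(0, 0.1, 100)

counts = []
for alpha in [0.001, 0.01, 0.1, 1.0]:
    lasso = Lasso(alpha=alpha).fit(X_demo, y_demo)
    n_zero = int(np.sum(lasso.coef_ == 0))
    counts.append(n_zero)
    print(f"alpha={alpha:<6}: {n_zero} of 8 weights exactly zero")
```

As alpha grows, more weights hit exactly zero, until only the features that genuinely predict the target survive.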

Elastic Net: The Best of Both Worlds

In practice, you often want both L1's feature selection and L2's smooth shrinkage. Elastic Net combines them:

The math:

$$L_{\text{elastic}} = \text{MSE} + \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2$$

In scikit-learn, this is controlled by two parameters: alpha (overall regularization strength) and l1_ratio (the balance between L1 and L2, from 0 to 1).

from sklearn.linear_model import ElasticNet

# l1_ratio=1.0 is pure Lasso, l1_ratio=0.0 is pure Ridge
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_scaled, y)
print(f"Elastic Net weights: {np.round(enet.coef_, 4)}")
print(f"Non-zero weights:    {np.sum(enet.coef_ != 0)} out of {len(enet.coef_)}")
Elastic Net weights: [-0.0343 -0.0924  0.1088]
Non-zero weights:    3 out of 3

Elastic Net is the default regularization strategy for many practitioners. It handles correlated features better than pure Lasso (which tends to arbitrarily pick one of a group of correlated features and drop the rest) while still performing some feature selection.
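A quick sketch of that correlated-features behavior, using two nearly identical invented features: Lasso tends to put all the weight on one and zero the other, while Elastic Net's L2 component spreads the weight across both.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# Two invented features that are almost copies of each other.
rng = np.random.default_rng(0)
f1 = rng.normal(size=200)
f2 = f1 + rng.normal(0, 0.01, 200)   # nearly a duplicate of f1
X_corr = np.column_stack([f1, f2])
y_corr = 2.0 * f1 + rng.normal(0, 0.1, 200)

lasso = Lasso(alpha=0.1).fit(X_corr, y_corr)
enet = ElasticNet(alpha=0.1, l1_ratio=0.2).fit(X_corr, y_corr)
print("Lasso weights:      ", np.round(lasso.coef_, 3))
print("Elastic Net weights:", np.round(enet.coef_, 3))
```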

Regularization Strength: The Bias-Variance Tradeoff

The regularization parameter $\lambda$ (called alpha in scikit-learn) controls a tradeoff:

  • Low $\lambda$: The model is nearly unregularized. Low bias, high variance. Risk of overfitting.
  • High $\lambda$: The model is heavily regularized. High bias, low variance. Risk of underfitting.
  • Just right $\lambda$: The model generalizes well to unseen data.

You will find the right $\lambda$ through cross-validation (Chapter 18).
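A preview of what that search looks like, on invented data (Chapter 18 covers cross-validation properly). RidgeCV fits each candidate alpha with cross-validation and keeps the one that generalizes best:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Invented data: 5 features, two of them pure noise.
rng = np.random.default_rng(0)
X_cv = rng.normal(size=(100, 5))
y_cv = X_cv @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(0, 0.5, 100)

ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X_cv, y_cv)
print(f"Best alpha by cross-validation: {ridge_cv.alpha_}")
```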

War Story — A data science team at a SaaS company built a churn model with 200+ engineered features and no regularization. The model achieved 0.92 AUC on training data and 0.71 on test data — a classic overfitting gap. Adding L2 regularization (C=0.1 in LogisticRegression, which corresponds to strong regularization since scikit-learn uses $C = 1/\lambda$) dropped training AUC to 0.83 but raised test AUC to 0.79. Switching to L1 (penalty='l1') zeroed out 160 of the 200 features and achieved 0.78 test AUC with a model that was 5x faster to score. The final production model used Elastic Net, keeping 52 features with 0.80 test AUC. Regularization was not just a mathematical nicety — it was the difference between a model that impressed in a notebook and one that worked in production.

Math Sidebar — The connection between regularization and Bayesian priors is deep. L2 regularization is equivalent to placing a Gaussian prior on the weights: $w_j \sim N(0, 1/\lambda)$. L1 regularization is equivalent to a Laplace prior: $w_j \sim \text{Laplace}(0, 1/\lambda)$. The Laplace distribution has a sharp peak at zero, which is why L1 produces sparse solutions. If you have taken a Bayesian statistics course, regularization is not a hack — it is maximum a posteriori (MAP) estimation with a specific prior.
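The sidebar's equivalence can be sketched in three lines. MAP estimation maximizes the posterior, which by Bayes' theorem is proportional to likelihood times prior:

$$\hat{\mathbf{w}}_{\text{MAP}} = \arg\max_{\mathbf{w}} P(\mathbf{w} \mid \text{data}) = \arg\max_{\mathbf{w}} P(\text{data} \mid \mathbf{w}) \, P(\mathbf{w})$$

Taking the negative logarithm turns the product into a sum and the max into a min:

$$\hat{\mathbf{w}}_{\text{MAP}} = \arg\min_{\mathbf{w}} \left[ -\log P(\text{data} \mid \mathbf{w}) - \log P(\mathbf{w}) \right]$$

The first term is the ordinary loss. For a Gaussian prior, $-\log P(\mathbf{w}) = \frac{\lambda}{2} \sum_j w_j^2 + \text{const}$, the L2 penalty; for a Laplace prior, $-\log P(\mathbf{w}) = \lambda \sum_j |w_j| + \text{const}$, the L1 penalty. The constants do not affect the minimizer, so MAP with these priors is exactly regularized loss minimization (up to a factor of 2 in how $\lambda$ is defined).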


4.6 Progressive Project: Math in Action

Time to apply what you have learned to the StreamFlow churn problem. Two tasks.

Task 1: Compute Log-Loss by Hand

Four StreamFlow customers. Your model predicts churn probabilities. The actual outcomes are known.

| Customer | Predicted P(churn) | Actual |
|---|---|---|
| A | 0.85 | Churned (1) |
| B | 0.30 | Retained (0) |
| C | 0.60 | Churned (1) |
| D | 0.10 | Retained (0) |

Step 1: Compute each customer's contribution to log-loss.

customers = {
    'A': {'pred': 0.85, 'actual': 1},
    'B': {'pred': 0.30, 'actual': 0},
    'C': {'pred': 0.60, 'actual': 1},
    'D': {'pred': 0.10, 'actual': 0},
}

total_loss = 0
for name, c in customers.items():
    y, p = c['actual'], c['pred']
    # Log-loss: -[y * log(p) + (1-y) * log(1-p)]
    contribution = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    total_loss += contribution
    print(f"Customer {name}: y={y}, p={p:.2f}, loss={contribution:.4f}")

avg_loss = total_loss / len(customers)
print(f"\nAverage log-loss: {avg_loss:.4f}")
Customer A: y=1, p=0.85, loss=0.1625
Customer B: y=0, p=0.30, loss=0.3567
Customer C: y=1, p=0.60, loss=0.5108
Customer D: y=0, p=0.10, loss=0.1054

Average log-loss: 0.2839

Notice that Customer C contributes the most to the loss. The model predicted 60% churn — correct direction, but not confident enough. Customer A (85% predicted, actually churned) contributes the least. The model was right and confident.

Step 2: Verify with scikit-learn.

from sklearn.metrics import log_loss as sklearn_log_loss

y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.85, 0.30, 0.60, 0.10])

print(f"Our computation:     {avg_loss:.4f}")
print(f"scikit-learn:        {sklearn_log_loss(y_true, y_pred):.4f}")
Our computation:     0.2839
scikit-learn:        0.2839

They match. You now understand exactly what log_loss computes and why.

Task 2: Gradient Descent from Scratch vs. scikit-learn

Build a linear regression on a slightly larger synthetic StreamFlow dataset, first with your own gradient descent, then with scikit-learn.

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Generate synthetic StreamFlow data: 200 customers
rng = np.random.default_rng(42)
n_customers = 200

tenure = rng.uniform(1, 60, n_customers)
usage = rng.normal(40, 15, n_customers)
tickets = rng.poisson(0.5, n_customers)

# True relationship (unknown to the model):
# churn_score = -0.02*tenure - 0.03*usage + 0.15*tickets + noise
noise = rng.normal(0, 0.1, n_customers)
y = -0.02 * tenure - 0.03 * usage + 0.15 * tickets + noise

# Build feature matrix
X_raw = np.column_stack([tenure, usage, tickets])

# IMPORTANT: Scale features before gradient descent
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)

# --- Our gradient descent ---
def gradient_descent_full(X, y, lr=0.01, n_iter=5000, tol=1e-8):
    n, p = X.shape
    w = np.zeros(p)
    bias = 0.0
    prev_loss = np.inf

    for i in range(n_iter):
        preds = X @ w + bias
        residuals = y - preds
        loss = np.mean(residuals ** 2)

        # Gradients
        grad_w = -(2 / n) * X.T @ residuals
        grad_b = -(2 / n) * np.sum(residuals)

        w = w - lr * grad_w
        bias = bias - lr * grad_b

        if i > 0 and abs(prev_loss - loss) < tol:
            print(f"  Converged at iteration {i}")
            break
        prev_loss = loss

    return w, bias, loss

print("Our gradient descent:")
w_gd, b_gd, loss_gd = gradient_descent_full(X_scaled, y)
print(f"  Weights: {w_gd}")
print(f"  Bias:    {b_gd:.4f}")
print(f"  MSE:     {loss_gd:.6f}")

# --- scikit-learn ---
lr_sklearn = LinearRegression().fit(X_scaled, y)
preds_sk = lr_sklearn.predict(X_scaled)
mse_sk = np.mean((y - preds_sk) ** 2)

print(f"\nscikit-learn LinearRegression:")
print(f"  Weights: {lr_sklearn.coef_}")
print(f"  Bias:    {lr_sklearn.intercept_:.4f}")
print(f"  MSE:     {mse_sk:.6f}")
Our gradient descent:
  Converged at iteration 1847
  Weights: [-0.33442968 -0.44667039  0.14054285]
  Bias:    -1.8003
  MSE:     0.010129

scikit-learn LinearRegression:
  Weights: [-0.33442968 -0.44667039  0.14054285]
  Bias:    -1.8003
  MSE:     0.010129

The weights are identical (to many decimal places). Your from-scratch implementation found the same solution as scikit-learn. In practice, you would always use scikit-learn — it is faster, handles edge cases, and has been tested by thousands of users. But now you know what it is doing inside: it finds the weights that minimize MSE. (LinearRegression actually solves the least-squares problem directly with linear algebra rather than iterating, but the destination is the same minimum your gradient descent walked to.)

Try It — Modify the gradient descent above to use L2 regularization. Add lambda_reg * w to the weight gradient (the derivative of $\frac{\lambda}{2}\|\mathbf{w}\|^2$ with respect to $\mathbf{w}$ is $\lambda \mathbf{w}$). Compare the results with Ridge(alpha=lambda_reg) from scikit-learn. Do the weights match?
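One possible solution sketch for the Try It, on invented data rather than the chapter's arrays. One subtlety worth knowing: scikit-learn's Ridge minimizes $\|\mathbf{y}-\mathbf{X}\mathbf{w}-b\|^2 + \alpha\|\mathbf{w}\|^2$ and leaves the intercept unpenalized, so to match it we scale both terms by $1/n$ (which does not change the minimizer) and keep the bias gradient unregularized:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Invented data for the comparison, not the StreamFlow arrays.
rng = np.random.default_rng(42)
X = StandardScaler().fit_transform(rng.normal(size=(100, 3)))
y = X @ np.array([0.5, -0.3, 0.8]) + 1.0 + rng.normal(0, 0.1, 100)

alpha = 1.0
n, p = X.shape
w, b = np.zeros(p), 0.0
lr = 0.1
for _ in range(20_000):
    residuals = y - X @ w - b
    # Weight gradient includes the L2 term; bias gradient does not.
    w = w - lr * (-(2 / n) * X.T @ residuals + (2 * alpha / n) * w)
    b = b - lr * (-(2 / n) * np.sum(residuals))

ridge = Ridge(alpha=alpha).fit(X, y)
print("Gradient descent:", np.round(w, 4), round(b, 4))
print("Ridge:           ", np.round(ridge.coef_, 4), round(ridge.intercept_, 4))
```

The weights and intercept should agree to several decimal places.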


4.7 Connecting the Pieces

Everything in this chapter connects:

  1. Probability tells you how to represent uncertainty, which is what your model's predictions actually are.
  2. Linear algebra tells you how data is represented as matrices and how models compute predictions via matrix operations.
  3. Calculus tells you how to optimize — how to adjust weights to reduce the loss.
  4. Loss functions tell you what to optimize — the precise definition of "wrong."
  5. Regularization tells you how to constrain the optimization so the model generalizes instead of memorizing.

When your model won't converge, one of these five components is the problem. The learning rate is wrong (calculus). The features are not scaled (linear algebra). The loss function is inappropriate (loss functions). The model is overfitting (regularization). The predictions are poorly calibrated (probability). You now have the vocabulary to diagnose each one.

In the next chapter, you will use SQL to extract the raw features that become the columns of $\mathbf{X}$. In Part III, you will use the loss functions from this chapter to train real classifiers. And in Chapter 18, you will use cross-validation to find the regularization strength $\lambda$ that makes your StreamFlow churn model generalize best.

The vegetables are eaten. They were not so bad.


Key Vocabulary

| Term | Definition |
|---|---|
| Probability distribution | A function that describes the likelihood of different outcomes (normal, binomial, Poisson) |
| Bayes' theorem | $P(A \mid B) = P(B \mid A) \cdot P(A) / P(B)$ — updating beliefs with evidence |
| Prior | Your belief about a hypothesis before seeing the evidence |
| Posterior | Your updated belief after incorporating the evidence |
| Likelihood | The probability of the evidence, given the hypothesis |
| Matrix | A rectangular grid of numbers; in ML, the feature matrix $\mathbf{X}$ has shape $(n, p)$ |
| Vector | An ordered list of numbers; one data point is a vector in $\mathbb{R}^p$ |
| Dot product | $\mathbf{w} \cdot \mathbf{x} = \sum w_j x_j$ — how linear models make predictions |
| Transpose | Flipping rows and columns: $(n, p) \to (p, n)$ |
| Inverse | The matrix $\mathbf{A}^{-1}$ such that $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$ |
| Gradient | Vector of partial derivatives $\nabla L = [\partial L / \partial w_1, \ldots, \partial L / \partial w_p]$ — points uphill |
| Partial derivative | Rate of change of a function with respect to one variable, holding others fixed |
| Learning rate | Step size $\alpha$ in gradient descent — controls speed vs. stability |
| Gradient descent | Iterative optimization: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \nabla L(\mathbf{w}_t)$ |
| Convergence | When the loss stops decreasing meaningfully — the optimization has found a (local) minimum |
| Loss function | Measures how wrong the model is — the function that gradient descent minimizes |
| MSE | Mean Squared Error: $(1/n) \sum (y_i - \hat{y}_i)^2$ — default for regression |
| Cross-entropy / Log-loss | $-(1/n) \sum [y_i \log \hat{p}_i + (1-y_i) \log(1-\hat{p}_i)]$ — default for classification |
| Hinge loss | $(1/n) \sum \max(0, 1 - y_i \hat{y}_i)$ — used in SVMs, rewards margin |
| L1 regularization (Lasso) | Penalty $\lambda \sum \vert w_j \vert$ — produces sparse weights (feature selection) |
| L2 regularization (Ridge) | Penalty $\lambda \sum w_j^2$ — produces small weights (shrinkage) |
| Convexity | A function is convex if any line segment between two points on the function lies on or above the function — guarantees gradient descent finds the global minimum |