Case Study 1: Gradient Descent Debugging — When the Model Won't Converge

The Situation

It is Wednesday afternoon. Priya, a data scientist at StreamFlow, has been tasked with building a quick proof-of-concept model to predict monthly subscriber churn. She has 200,000 subscribers with 12 features each. She decides to build logistic regression from scratch using gradient descent — partly because she wants to understand the internals before handing the project off to the production team, and partly because she read Chapter 4 of a certain textbook and wants to prove she can do it.

She writes the following code.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_gradient_descent(X, y, lr=0.01, n_iter=1000):
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    losses = []

    for i in range(n_iter):
        z = X @ w + b
        preds = sigmoid(z)

        # Log-loss
        loss = -np.mean(y * np.log(preds) + (1 - y) * np.log(1 - preds))
        losses.append(loss)

        # Gradients
        error = preds - y
        grad_w = (1 / n) * X.T @ error
        grad_b = (1 / n) * np.sum(error)

        w -= lr * grad_w
        b -= lr * grad_b

        if i % 100 == 0:
            print(f"Iter {i}: loss={loss:.6f}")

    return w, b, losses

She loads her data, calls the function, and sees this:

Iter 0:   loss=0.693147
Iter 100: loss=nan

The loss went to NaN somewhere within the first 100 iterations. Something is very wrong.

Diagnosis 1: The Exploding Sigmoid

Priya adds some debugging output inside the loop:

if i < 5 or i % 100 == 0:
    print(f"Iter {i}: loss={loss:.6f}, "
          f"z range=[{z.min():.1f}, {z.max():.1f}], "
          f"max |grad_w|={np.abs(grad_w).max():.4f}")

The output:

Iter 0: loss=0.693147, z range=[ 0.0,  0.0], max |grad_w|=0.0820
Iter 1: loss=0.672841, z range=[-0.5,  0.3], max |grad_w|=0.1203
Iter 2: loss=0.691501, z range=[-2.1,  1.8], max |grad_w|=0.4518
Iter 3: loss=0.823419, z range=[-15.2, 12.7], max |grad_w|=3.8012
Iter 4: loss=3.241890, z range=[-142.3, 118.9], max |grad_w|=41.2301

The z values are exploding. By iteration 4, the linear combination $z = \mathbf{X}\mathbf{w} + b$ ranges from -142 to +119. At those magnitudes, sigmoid(z) saturates so close to 0 or 1 that np.log(preds) or np.log(1 - preds) evaluates to -inf, and the loss becomes nan.

Root cause: The features are not scaled. One of her features is monthly_revenue in dollars, ranging from $9.99 to $199.99. Another is account_age_days, ranging from 1 to 2,190. When the gradient updates the weight for account_age_days, even a small weight change produces enormous changes in $z$ because the feature values are so large.
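The failure mode is easy to reproduce in isolation. This standalone check uses the same sigmoid as the code above; the 0.02 weight is just an illustrative value, not taken from Priya's run:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# In float64, 1 + exp(-37) rounds to exactly 1.0, so the sigmoid saturates
# long before z reaches the 119 seen in the debug output.
p = sigmoid(37.0)
print(p == 1.0)              # True: no representable value left below 1.0

with np.errstate(divide='ignore'):
    print(np.log(1 - p))     # -inf, the term that poisons the loss

# An unscaled feature makes saturation trivial to hit: a weight of only
# 0.02 on account_age_days (values up to 2190) pushes z to about 43.8,
# already past the saturation point.
print(0.02 * 2190)
```

The threshold is not special to this problem: any feature whose scale lets z drift past roughly ±37 will produce the same -inf.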

Fix: Standardize the features.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)

# Now all features have mean 0 and std 1
print(f"Before scaling — Feature ranges: {X_raw.min(axis=0)[:3]} to {X_raw.max(axis=0)[:3]}")
print(f"After scaling  — Feature ranges: {X_scaled.min(axis=0)[:3].round(2)} to {X_scaled.max(axis=0)[:3].round(2)}")

The output:

Before scaling — Feature ranges: [  1.    9.99  0.  ] to [2190.   199.99   47.  ]
After scaling  — Feature ranges: [-1.73 -1.69 -0.58] to [1.72 1.74 4.12]

She reruns with scaled features:

Iter 0:   loss=0.693147
Iter 100: loss=0.412389
Iter 200: loss=0.389712
Iter 300: loss=0.381044

The loss is now decreasing smoothly. Problem partially solved.

Diagnosis 2: Painfully Slow Convergence

After scaling, the model converges but slowly. After 1,000 iterations, the loss is 0.3621. After 5,000 iterations, it is 0.3589. After 10,000 iterations: 0.3577. It is creeping toward some minimum, but each iteration barely moves the needle.

Priya suspects the learning rate is too conservative. She experiments:

for lr in [0.01, 0.1, 1.0, 5.0]:
    w, b, losses = logistic_gradient_descent(X_scaled, y, lr=lr, n_iter=1000)
    print(f"lr={lr}: final_loss={losses[-1]:.6f}")

The output:

lr=0.01: final_loss=0.362108
lr=0.1:  final_loss=0.357814
lr=1.0:  final_loss=0.357702
lr=5.0:  final_loss=nan

A learning rate of 1.0 converges to 0.3577 in 1,000 iterations — what took 10,000 iterations at 0.01. A learning rate of 5.0 blows up. The divergence threshold for this problem sits somewhere between 1.0 and 5.0, so 1.0 is a near-optimal choice with a comfortable safety margin.

Key Insight — Scaling and learning rate are deeply connected. Before scaling, a learning rate of 0.01 was too large (the loss went to NaN). After scaling, 0.01 was too small (slow convergence). Scaling evens out the loss landscape so that a single learning rate works reasonably well for all parameters. This is why standard ML pipelines put an explicit scaling step in front of any gradient-based model; scikit-learn, for example, does not scale features for you automatically.
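The entanglement is visible after a single gradient step. The sketch below uses made-up data (one account_age_days-like feature with invented labels) and the same weight update as the loop above, comparing how far one update at lr=0.01 pushes z on raw versus standardized inputs:

```python
import numpy as np

rng = np.random.default_rng(42)
days = rng.uniform(1, 2190, size=2000)   # raw account_age_days-like feature
y = (days > 1100).astype(float)          # invented churn labels

def max_abs_z_after_one_step(x, lr=0.01):
    # One update of w from w=0 (the bias update is omitted for brevity)
    n = len(x)
    error = np.full(n, 0.5) - y          # sigmoid(0) = 0.5 for everyone
    grad_w = (1 / n) * x @ error
    w = -lr * grad_w
    return np.abs(w * x).max()           # largest |z| going into iteration 2

x_std = (days - days.mean()) / days.std()
print(max_abs_z_after_one_step(days))    # in the thousands: deep saturation
print(max_abs_z_after_one_step(x_std))   # far below 1: still well-behaved
```

The same learning rate that instantly saturates the sigmoid on raw values keeps z comfortably in the sigmoid's linear range once the feature is standardized.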

Diagnosis 3: The Wrong Loss Function

Priya's colleague Malik sees her code and asks: "Why are you using logistic regression? Have you tried linear regression first? It is simpler."

Priya tries it:

def linear_regression_gd(X, y, lr=0.01, n_iter=1000):
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    losses = []

    for i in range(n_iter):
        preds = X @ w + b
        loss = np.mean((y - preds) ** 2)  # MSE
        losses.append(loss)

        error = preds - y
        grad_w = (2 / n) * X.T @ error
        grad_b = (2 / n) * np.sum(error)

        w -= lr * grad_w
        b -= lr * grad_b

    return w, b, losses

w_lin, b_lin, losses_lin = linear_regression_gd(X_scaled, y, lr=0.1, n_iter=1000)
preds_lin = X_scaled @ w_lin + b_lin

print(f"Prediction range: [{preds_lin.min():.3f}, {preds_lin.max():.3f}]")
print(f"Predictions < 0: {(preds_lin < 0).sum()}")
print(f"Predictions > 1: {(preds_lin > 1).sum()}")

The output:

Prediction range: [-0.214, 0.347]
Predictions < 0: 18432
Predictions > 1: 0

The linear regression model predicts negative values for 18,432 customers. Probabilities cannot be negative. And the maximum prediction is 0.347 — the model never predicts above 35% churn probability, even for the highest-risk customers.

Root cause: MSE on binary labels treats 0 and 1 as numbers on a continuous scale. With weakly informative features, the MSE-optimal model hedges toward the mean: for a class with an 8.2% base rate, that means predicting close to 0.082 for almost everyone. MSE does not "know" that the target is a probability.

Log-loss, by contrast, operates on the probability scale through the sigmoid function. It penalizes confident wrong predictions severely (as we saw in the chapter) and produces calibrated probability estimates.

Malik was wrong. For binary classification, log-loss is the correct objective. MSE produces numbers, not probabilities: its predictions are uncalibrated and can fall outside [0, 1].
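The difference in how the two losses punish a confident wrong answer can be seen with a single example. For a true label of 1, as the predicted probability p shrinks toward 0:

```python
import numpy as np

# Per-example penalty when the true label is 1 and the model predicts p
for p in [0.5, 0.1, 0.01, 0.001]:
    mse = (1 - p) ** 2      # bounded above by 1: barely tells 0.1 from 0.001
    logloss = -np.log(p)    # unbounded: grows by ~2.3 per factor of 10
    print(f"p={p:<6}  MSE={mse:.3f}  log-loss={logloss:.3f}")
```

MSE tops out near 1.0 no matter how confidently wrong the model is, while log-loss keeps climbing, which is exactly the pressure that drives predictions toward calibrated probabilities.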

Diagnosis 4: Numerical Instability in Log-Loss

After fixing the scaling and learning rate, Priya runs her final training for 5,000 iterations. At iteration 3,847, she gets a RuntimeWarning: divide by zero encountered in log.

The issue: for some customers, sigmoid(z) returns values so close to 1.0 that 1 - preds rounds to 0.0, and np.log(0) is negative infinity.

Fix: Clip the predictions away from 0 and 1.

def safe_log_loss(y, preds, eps=1e-15):
    preds = np.clip(preds, eps, 1 - eps)
    return -np.mean(y * np.log(preds) + (1 - y) * np.log(1 - preds))

This mirrors what sklearn.metrics.log_loss does internally. The clipping introduces a negligible error (a prediction of 1 - 1e-15 instead of exactly 1.0) but prevents numerical catastrophe.
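A related precaution, not shown in Priya's fix: np.exp(-z) itself overflows (with a RuntimeWarning) when z is a large negative number, even though the final sigmoid value comes out correct. A numerically stable sigmoid routes each element through a branch that never exponentiates a large positive number; the sketch below is one common formulation (scipy.special.expit provides the same behavior):

```python
import numpy as np

def stable_sigmoid(z):
    # Exponentiate only non-positive values, so np.exp can underflow
    # harmlessly to 0.0 but can never overflow to inf.
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1 / (1 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1 + ez)
    return out

z = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
print(stable_sigmoid(z))   # saturates cleanly at 0.0 and 1.0, no warnings
```

For z = -1000 the naive 1 / (1 + np.exp(-z)) would compute np.exp(1000) = inf and warn; the stable version computes np.exp(-1000), which quietly underflows to 0.0.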

The Final Working Implementation

After all four fixes:

import numpy as np
from sklearn.preprocessing import StandardScaler

def train_logistic_regression(X_raw, y, lr=1.0, n_iter=2000, tol=1e-7):
    """
    Logistic regression via gradient descent.
    Includes: scaling, numerical stability, convergence checking.
    """
    # Fix 1: Scale features
    scaler = StandardScaler()
    X = scaler.fit_transform(X_raw)

    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    prev_loss = float('inf')

    for i in range(n_iter):
        z = X @ w + b
        preds = 1 / (1 + np.exp(-z))

        # Fix 4: Numerical stability
        preds_clipped = np.clip(preds, 1e-15, 1 - 1e-15)
        loss = -np.mean(y * np.log(preds_clipped) +
                        (1 - y) * np.log(1 - preds_clipped))

        # Fix 3: Using log-loss, not MSE
        error = preds - y
        grad_w = (1 / n) * X.T @ error
        grad_b = (1 / n) * np.sum(error)

        # Fix 2: Appropriate learning rate (works because features are scaled)
        w -= lr * grad_w
        b -= lr * grad_b

        if abs(prev_loss - loss) < tol:
            print(f"Converged at iteration {i}, loss={loss:.6f}")
            return w, b, scaler
        prev_loss = loss

    print(f"Completed {n_iter} iterations, loss={loss:.6f}")
    return w, b, scaler

Diagnosis 5: Feature Correlation and Oscillation

Priya's model converges, but she notices something odd when she adds more features. With 12 features, convergence takes 2,341 iterations. When she adds 8 interaction features (bringing the total to 20), convergence takes 12,800 iterations — even though the learning rate and scaling are the same.

She plots the loss curve and sees a new pattern: the loss decreases, then briefly increases, then decreases again. It oscillates, slowly dampening.

Root cause: Several of the interaction features are highly correlated with the original features. For example, tenure_months * usage_hours is correlated with both tenure_months and usage_hours. Correlated features create an elongated, valley-shaped loss landscape. Gradient descent bounces back and forth across the narrow valley instead of walking straight down it.

Fix: There are three options.

Option A: Remove correlated features (check the correlation matrix, drop features with $|r| > 0.9$).

Option B: Use L2 regularization. An L2 penalty (the same idea behind ridge regression) handles correlated features gracefully by distributing weight among them instead of fighting over it.

Option C: Use a smarter optimizer. Algorithms like Adam and L-BFGS adapt their step size per-parameter, so they can take large steps along the valley and small steps across it.

from sklearn.linear_model import LogisticRegression

# Scikit-learn's default solver (lbfgs) handles correlations automatically
model = LogisticRegression(max_iter=1000, C=1.0)  # C=1/lambda, so C=1 means moderate regularization
model.fit(X_with_interactions_scaled, y)
print(f"scikit-learn converged in {model.n_iter_[0]} iterations")

The output:

scikit-learn converged in 47 iterations

Forty-seven iterations with L-BFGS versus 12,800 with vanilla gradient descent. This is not a small difference. It is the difference between a model that trains in 0.1 seconds and one that takes 30 seconds — and on production-sized data, the gap would be minutes versus hours.
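For completeness, Option A's correlation check can be sketched in a few lines. The data below is synthetic: feature 2 is a near-copy of feature 0, standing in for an interaction term like tenure_months * usage_hours:

```python
import numpy as np

def correlated_pairs(X, threshold=0.9):
    # Scan the upper triangle of the feature correlation matrix
    r = np.corrcoef(X, rowvar=False)
    i, j = np.triu_indices_from(r, k=1)
    keep = np.abs(r[i, j]) > threshold
    return [(int(a), int(b), float(c))
            for a, b, c in zip(i[keep], j[keep], r[i, j][keep])]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=500)   # near-duplicate of feature 0

for a, b, r_ab in correlated_pairs(X):
    print(f"features {a} and {b}: r = {r_ab:.3f}")
```

Dropping or combining each flagged pair before training restores a rounder loss surface, which is what lets plain gradient descent walk straight downhill again.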

Comparison with scikit-learn

After all five diagnoses, Priya puts her from-scratch implementation side by side with scikit-learn.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Our implementation
w_ours, b_ours, scaler_ours = train_logistic_regression(X_raw, y, lr=1.0, n_iter=5000)

# scikit-learn
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=5000, C=1e10))  # very large C (=1/lambda) effectively disables regularization
])
pipe.fit(X_raw, y)

sk_w = pipe.named_steps['model'].coef_[0]
sk_b = pipe.named_steps['model'].intercept_[0]

print("Our weights:     ", np.round(w_ours, 4))
print("sklearn weights: ", np.round(sk_w, 4))
print(f"\nMax weight difference: {np.abs(w_ours - sk_w).max():.6f}")

The output:

Converged at iteration 2341, loss=0.357702

Our weights:      [-0.2134 -0.0891  0.1523  0.3012 -0.0472  0.0983 -0.1845  0.0211  0.0567 -0.0134  0.2789  0.0045]
sklearn weights:  [-0.2134 -0.0891  0.1523  0.3012 -0.0472  0.0983 -0.1845  0.0211  0.0567 -0.0134  0.2789  0.0045]

Max weight difference: 0.000003

The results are virtually identical. The difference: scikit-learn's implementation handles all five of Priya's issues automatically — scaling in the pipeline, optimized solver with adaptive learning rate, log-loss by default, numerical stability built in, and L-BFGS for efficient convergence with correlated features. It also runs about 50x faster.

The Debugging Checklist

Priya documents her debugging journey as a checklist for her team. When a gradient-based model will not converge, check these in order:

  1. Are features scaled? Symptom: loss goes to NaN within a few iterations. Fix: add StandardScaler() to the pipeline.

  2. Is the learning rate appropriate? Symptom: loss decreases very slowly (too small) or oscillates and explodes (too large). Fix: grid search over [0.001, 0.01, 0.1, 1.0].

  3. Is the loss function correct? Symptom: the model converges but predictions are nonsensical (e.g., probabilities outside [0, 1]). Fix: MSE for regression, log-loss for classification.

  4. Is there numerical instability? Symptom: NaN/inf appears after many good iterations. Fix: clip predictions and use numerically stable implementations.

  5. Are features correlated? Symptom: slow convergence with an oscillating loss curve. Fix: remove correlated features, add regularization, or switch to L-BFGS.

This checklist, in this order, resolves 95% of gradient descent convergence issues. The remaining 5% usually involve label errors in the training data — but that is a data quality problem, not a math problem.

Lessons Learned

  1. Always scale features before gradient descent. Unscaled features create an elongated loss landscape where the gradient oscillates along steep dimensions and crawls along flat ones.

  2. The learning rate and feature scaling are entangled. A learning rate that works perfectly on scaled data will diverge on raw data, and vice versa.

  3. Use the right loss function for the right problem type. MSE for regression, log-loss for classification. Using MSE for binary classification produces uncalibrated predictions that can fall outside [0, 1].

  4. Numerical stability matters in practice. Clip predictions away from 0 and 1 before taking logarithms. Every production ML library does this.

  5. Gradient descent debugging is systematic, not magical. Check, in order: (a) Are features scaled? (b) Is the learning rate appropriate? (c) Is the loss function correct for the problem type? (d) Is there numerical instability? (e) Are features correlated?

Discussion Questions

  1. Priya implemented gradient descent from scratch. In what situations would this be justified in a production environment, versus simply using scikit-learn?

  2. The learning rate of 1.0 worked well for this problem after scaling. Would the same learning rate work for a dataset with 500 features instead of 12? Why or why not?

  3. Malik suggested using linear regression for a classification problem. Under what (narrow) circumstances might this actually be acceptable?

  4. How would you modify Priya's debugging approach if the model converged but produced poor predictions on a held-out test set?