Case Study 1: The Feynman Technique in Action
David Applies Plain-Language Explanation to Gradient Descent — and Discovers What He Didn't Know
David had been studying machine learning for seven months when he encountered gradient descent for the first time. He read the chapter in his textbook. He watched a YouTube lecture. He followed along with the code in his Jupyter notebook. He got the math — calculus, partial derivatives, the chain rule. He felt like he understood it.
He was about to find out he didn't.
The Setup
In month eight, David decided to apply the Feynman technique systematically to his ML concepts. He'd read about it in a blog post and thought it seemed useful. His rule: every major concept gets a Feynman session before he considers it "understood." He starts a notebook entry with the concept name, the date, and the words "Explain this to a 12-year-old."
Gradient descent. Date: March 14. He makes himself a fresh coffee, sits in his office chair, and begins writing.
First Attempt: The Explanation That Fell Apart
David's first explanation attempt, reconstructed:
"Gradient descent is an optimization algorithm used to minimize the loss function of a machine learning model. You calculate the gradient of the loss function with respect to the model's parameters, then update the parameters in the direction of the negative gradient. The learning rate controls the size of each step. Repeat until convergence."
He read it back. Correct. Complete. Technically accurate.
And a twelve-year-old would understand none of it.
He tried again, with the constraint of no technical vocabulary:
"Imagine you're on a hillside wearing a blindfold. You want to get to the bottom of the valley — that's where the model is performing best. You can't see, but you can feel which direction is downhill from where you're standing. You take a step in that direction. Then you check again — which direction is now downhill? Take another step. Keep going until you're as low as you can get.
In gradient descent, the 'hill' is the error of your model. The 'direction downhill' is the gradient — a calculation that tells you which way to adjust your model's numbers to reduce error. The 'step size' is how much you adjust at once. Keep adjusting, keep checking, until the error stops decreasing much."
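The loop David is describing can be sketched in a few lines of Python. This is a minimal illustration, not production code: the one-parameter "hill" f(w) = (w − 3)², the learning rate, and the step count are all chosen for the example.

```python
# Minimal gradient descent on a one-dimensional "hill": f(w) = (w - 3)**2.
# The bottom of the valley is at w = 3.

def loss(w):
    return (w - 3) ** 2

def gradient(w):
    # Derivative of (w - 3)**2 with respect to w: the "which way is uphill" signal.
    return 2 * (w - 3)

w = 0.0              # start somewhere on the hillside
learning_rate = 0.1  # the step size

for step in range(100):
    # Step in the downhill direction: subtract the gradient.
    w -= learning_rate * gradient(w)

print(round(w, 4))  # → 3.0
```

Each pass of the loop is one "feel the slope, take a step" cycle from the blindfold metaphor; the subtraction is the "negative gradient" David is about to be asked about.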
Better. But now the questions started. David's rule: after writing the explanation, ask himself the questions a skeptical, curious student would ask.
Question 1: Why the NEGATIVE gradient? You said you go in the "downhill" direction — but you called it the negative gradient. Why negative?
David knew the answer in the abstract: the gradient points uphill (toward steepest increase), so you go in the opposite direction (negative) to decrease. But could he explain why the gradient specifically points uphill rather than some other direction? Could he explain this using the "blindfold on a hillside" metaphor in a way that made it intuitively obvious?
He tried. His explanation got circular: "the gradient is the direction of steepest increase, which is why you go negative..." But when the imaginary student asked "why is it the steepest direction? Why not just any uphill direction?" David's answer collapsed into "because that's what calculus says."
First real gap identified.
The Gap Investigation
David went back to his calculus textbook. He specifically wanted to understand, intuitively, why the gradient vector points in the direction of steepest ascent — not just that it does, but why.
He worked through the geometric interpretation for two hours. The gradient is a vector whose components are the partial derivatives in each dimension. The direction of steepest ascent is the direction in which the directional derivative (the rate of change in that direction) is maximized. A theorem from multivariate calculus shows that this maximum is achieved precisely in the direction of the gradient vector: for a unit vector u, the directional derivative is the dot product ∇f · u = |∇f| cos θ, which is largest when u points along the gradient (θ = 0).
He could now explain this intuitively: "The gradient tells you 'if you move in this exact direction, you'll go uphill faster than in any other direction.' We want to go down fastest, so we move exactly opposite to this direction."
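This claim can be checked numerically. The sketch below uses a hypothetical function f(x, y) = x² + 3y², whose gradient at (1, 1) is (2, 6), and sweeps unit directions around the circle: the direction with the largest directional derivative comes out parallel to the gradient.

```python
import math

# Hypothetical surface f(x, y) = x**2 + 3 * y**2; its gradient at (1, 1) is (2, 6).
grad = (2.0, 6.0)

def directional_derivative(grad, angle):
    # Rate of change of f when moving in the unit direction (cos a, sin a):
    # the dot product of the gradient with that direction.
    return grad[0] * math.cos(angle) + grad[1] * math.sin(angle)

# Sweep 3600 directions around the circle and keep the steepest one.
angles = [2 * math.pi * k / 3600 for k in range(3600)]
best = max(angles, key=lambda a: directional_derivative(grad, a))

# The winning direction matches the gradient's own direction.
gradient_angle = math.atan2(grad[1], grad[0])
print(abs(best - gradient_angle) < 0.01)  # → True
```

In the blindfold metaphor: of all the ways you could probe the slope with your foot, the gradient is the one probe direction where the ground rises fastest, so stepping exactly opposite it drops you fastest.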
He updated his explanation. Then continued.
Question 2: Why don't you just find the bottom of the valley directly? Why this step-by-step process?
David's answer: in real ML problems, you can't see the whole landscape. You can only calculate your local slope. You don't have a map of the whole hill — you only know which direction is downhill from where you're standing right now. Gradient descent is necessary because you can't solve for the minimum directly in problems with thousands or millions of parameters.
Could he explain WHY you can't solve directly? He had a vague sense ("the equations are too complex") but couldn't articulate it precisely.
Second gap: the connection between gradient descent and why closed-form solutions (like ordinary least squares for linear regression) aren't available for more complex models.
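The contrast David is circling can be made concrete with toy data (assumed here for illustration): for simple linear regression a closed-form formula hands you the best parameters directly, and gradient descent, taking thousands of small steps, lands on the same answer. For complex non-linear models only the second route exists.

```python
# Toy data lying exactly on y = 2x + 1 (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
n = len(xs)

# Closed-form (ordinary least squares) solution: one formula, no iteration.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope_cf = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
           sum((x - mean_x) ** 2 for x in xs)
intercept_cf = mean_y - slope_cf * mean_x
print(slope_cf, intercept_cf)  # → 2.0 1.0

# Gradient descent on mean squared error: many small steps, same destination.
slope, intercept = 0.0, 0.0
lr = 0.02
for _ in range(20000):
    grad_slope = sum(2 * (slope * x + intercept - y) * x for x, y in zip(xs, ys)) / n
    grad_intercept = sum(2 * (slope * x + intercept - y) for x, y in zip(xs, ys)) / n
    slope -= lr * grad_slope
    intercept -= lr * grad_intercept

print(round(slope, 3), round(intercept, 3))  # → 2.0 1.0
```

The point of the comparison is exactly David's second gap: when a direct formula exists, gradient descent is optional; when it doesn't (millions of parameters, non-linearities), the iterative route is the only one available.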
Second Attempt: The Revised Explanation
After two more hours of reading and thinking, David produced a second explanation that addressed both gaps:
"When a machine learning model makes predictions, it makes mistakes — the error between its predictions and the right answers. We want to reduce that error as much as possible by adjusting the model's internal numbers (called parameters or weights).
For simple models like basic linear regression, there's a formula that directly calculates the best parameters — like solving an equation. But for complex models with millions of parameters and non-linear relationships, there's no such formula. You can't solve it directly.
Instead, we use gradient descent. Imagine the error as a landscape of hills and valleys. We want to be in the lowest valley. We start somewhere random, and we ask: which direction is downhill from here? The answer is given by the gradient — a calculation using calculus that points in the direction of steepest uphill. We go the opposite direction (downhill).
We take a small step in that direction. Then recalculate: where is downhill from our new position? Take another step. Repeat thousands or millions of times until we're approximately at the bottom.
The tricky part: we might get stuck in a small local valley that isn't the deepest valley in the whole landscape. This is called a local minimum, and it's a genuine problem for gradient descent. More advanced versions of gradient descent (stochastic, mini-batch, with momentum) are partly designed to help escape these local traps."
That last paragraph appeared spontaneously — because in explaining the straightforward version, David found himself thinking about its limitations, which forced him to address local vs. global minima more concretely than he had before.
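The local-minimum trap is easy to demonstrate on a toy landscape. The function below, f(x) = x⁴ − 2x² + 0.3x, is an assumed example with two valleys, a deeper one on the left and a shallower one on the right; where gradient descent ends up depends entirely on where it starts.

```python
# A bumpy one-dimensional landscape with two valleys (toy example):
# f(x) = x**4 - 2*x**2 + 0.3*x.

def f(x):
    return x ** 4 - 2 * x ** 2 + 0.3 * x

def grad(x):
    # Derivative of f.
    return 4 * x ** 3 - 4 * x + 0.3

def descend(x, lr=0.01, steps=10000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = descend(-2.0)   # slides into the deeper (global) valley, near x ≈ -1.04
right = descend(2.0)   # gets stuck in the shallower local valley, near x ≈ 0.96
print(f(left) < f(right))  # → True: same algorithm, worse outcome on the right
```

Both runs follow the downhill rule perfectly; the right-hand run is simply blindfolded in the wrong valley and has no way to know a deeper one exists.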
What the Feynman Session Revealed
Gap 1: Geometric intuition for why the gradient points uphill. Filled by two hours of calculus review. David now has this as an Anki card: "Explain intuitively why the gradient vector points in the direction of steepest ascent" — with an answer that uses the directional derivative concept.
Gap 2: Why closed-form solutions aren't available for complex models. Filled by reading about the computational complexity of finding exact optima in high-dimensional non-convex spaces. David adds a card: "Why can't we just solve for the optimal ML model parameters directly (without gradient descent)?"
Unexpected discovery: local vs. global minima. By forcing himself to explain gradient descent clearly and then think about what could go wrong, David arrived at a more concrete understanding of local minima than his textbook had given him. He had read the words "local minimum problem" but the Feynman session made the problem viscerally clear: if the landscape is lumpy (not convex), gradient descent might get stuck in a small dip and mistake it for the bottom.
This led to a third hour of reading about why deep neural networks can avoid local minima in practice (saddle points are more common than local minima in high dimensions; stochastic gradient descent naturally escapes saddle points). None of this was in his original reading. The explanation generated the questions that generated the learning.
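One piece of that third-hour reading can also be made tangible. The sketch below uses a toy saddle, f(x, y) = x² − y², and models stochastic-gradient noise as small seeded random jitter; both the surface and the noise model are assumptions for illustration, not how SGD noise actually arises from mini-batches.

```python
import random

# Toy saddle: f(x, y) = x**2 - y**2 has a saddle point at the origin.
# The gradient there is exactly zero, so plain gradient descent started
# on the saddle never moves. A little noise (standing in for the jitter
# of stochastic gradients) knocks it off, and it escapes along -y**2.

def grad(x, y):
    return 2 * x, -2 * y

def descend(x, y, noise, steps=100, lr=0.05):
    rng = random.Random(0)  # seeded so the run is reproducible
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= lr * (gx + noise * rng.gauss(0, 1))
        y -= lr * (gy + noise * rng.gauss(0, 1))
    return x, y

stuck = descend(0.0, 0.0, noise=0.0)    # stays pinned at (0.0, 0.0)
escaped = descend(0.0, 0.0, noise=0.1)  # drifts away from the saddle
print(stuck, escaped)
```

The escape is not a clever mechanism, just instability: any perturbation in the y direction gets amplified by the downhill curvature, which is the intuition behind "stochastic gradient descent naturally escapes saddle points."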
The Longer-Term Effect
Two months later, David was explaining gradient descent to a colleague at work — a software engineer who was curious about ML but hadn't studied it. He used the hill-and-blindfold metaphor, explained the necessity of gradient descent for complex models, and walked through the local minimum issue without hesitation.
"The explanation came out fluently," he says. "Not because I'd memorized it — because I genuinely understood it. And I knew I understood it because I had found and filled the actual gaps in my understanding, rather than just being able to repeat the right words."
He has since applied the Feynman technique to: backpropagation, the attention mechanism in transformers, the bias-variance tradeoff, Bayesian inference, and the intuition behind regularization. Each session has revealed gaps he didn't know he had and produced understanding he hadn't gotten from his primary sources.
"The Feynman technique is uncomfortable in the best possible way. Every time you try to explain something and your explanation collapses, you learn exactly what you need to learn next. That's actually useful. That's rare."