Case Study 18.2: David Designs His Machine Learning Practice
The Setup
David has been at this for eight months. Eight months of evenings and weekends, three completed online courses, a textbook, numerous articles, and two "project" attempts that stalled. By any measure of input hours, he has put in real effort.
By the measure that matters — can he actually do machine learning on novel problems? — the answer is uncomfortable: sort of, but not reliably, and not well.
He sits down with a notebook and commits to an honest post-mortem on his ML learning to date.
The Post-Mortem
What he knows: The concepts. He can explain what gradient descent does. He understands why regularization prevents overfitting. He knows the difference between precision and recall. He can implement a basic neural network and a logistic regression. He knows the vocabulary of the field.
What he can't do reliably:
- Look at a real dataset and know where to start
- Evaluate whether a model's performance is good, bad, or irrelevant to the actual problem
- Diagnose why a model is performing poorly in a specific way
- Make sensible feature engineering decisions
- Recognize when cross-validation results are meaningful versus misleading
The gap is specific and stark. He knows the map of the territory. He cannot actually navigate it.
Why this happened: His primary learning activity — completing online courses — is structured to teach concepts clearly and efficiently. This is genuinely valuable. But course learning is pedagogically guided: the instructor knows what problem is being solved, presents it clearly, walks through the solution, provides the conceptual framework. David was always operating with a navigator. Real ML problems don't come with a navigator.
He'd been learning the theory of driving without getting behind the wheel.
The Deliberate Practice Design
David approaches this as an engineering problem: he needs to identify specific, targetable weaknesses and design practice that directly addresses them.
He identifies three primary gaps from his post-mortem:
Gap 1: Model Evaluation Intuition
He can compute metrics. He doesn't yet have reliable intuition for what metrics mean in context: when an F1 score of 0.72 is excellent or terrible, when cross-validation results are trustworthy or misleading, when held-out set performance will generalize to real deployment.

Gap 2: Failure Diagnosis
When a model performs poorly, he doesn't have a systematic diagnostic approach. He adjusts things somewhat randomly (try more layers, try regularization, try different features) without a principled, hypothesis-driven approach.

Gap 3: Feature Engineering Judgment
He understands what feature engineering is. He doesn't have well-developed intuitions for which features matter, when to create interaction terms, how to handle different data types, or when categorical encoding choices affect model behavior.
For each gap, he designs a specific practice regime:
Practice for Gap 1: Model Evaluation Intuition
David finds twenty datasets from an open repository covering a range of domains and data conditions: medical, financial, natural language, time series, tabular, imbalanced, clean, messy. For each:
- He builds a baseline model without looking at any performance benchmarks or prior work
- He evaluates it with cross-validation and records his assessment: "I believe this model is [strong/adequate/poor] for the intended purpose because..."
- He records his confidence in that assessment (1–5)
- He then looks up any available benchmarks and compares his assessment to what experts say about similar tasks
The key: he's not just building models, he's building and evaluating his calibration — how well his judgments match reality. When he's wrong, he writes down why. What did he misread? What did he not consider?
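To make the drill concrete, here is a minimal sketch of what one logged iteration might look like, assuming scikit-learn; the built-in dataset stands in for one of the twenty, and the log fields are illustrative rather than taken from David's actual notes.

```python
# A minimal sketch of one evaluation-calibration entry, assuming scikit-learn.
# The built-in dataset stands in for one of the twenty; the log fields
# ("assessment", "confidence", "why_wrong") are illustrative, not from any library.
import json
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 1. Build a baseline model without looking at benchmarks or prior work.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(baseline, X, y, cv=5, scoring="f1")

# 2. Commit to an assessment and a confidence rating before looking anything up.
entry = {
    "dataset": "breast_cancer",
    "cv_f1_mean": round(float(np.mean(scores)), 3),
    "cv_f1_std": round(float(np.std(scores)), 3),
    "assessment": "adequate",        # strong / adequate / poor for the intended purpose
    "reasoning": "Low variance across folds; classes are only mildly imbalanced.",
    "confidence": 3,                 # 1-5
    "benchmark_found_later": None,   # filled in after checking published results
    "why_wrong": None,               # filled in only if the assessment missed
}

# 3. Append to a running log so calibration can be reviewed over time.
with open("calibration_log.jsonl", "a") as f:
    f.write(json.dumps(entry) + "\n")
```

The point of the log is the comparison over many entries: whether his stated confidence tracks how often his assessments turn out to be right.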
Practice for Gap 2: Failure Diagnosis
David deliberately introduces specific known problems into working models and practices diagnosing them:
- Remove the most informative features and observe performance impact
- Introduce severe class imbalance and observe recall/precision behavior
- Use an incorrect loss function for the problem type
- Train on a different distribution than test data
- Include data leakage in features
For each sabotaged model, he practices diagnosis without knowing what the sabotage was. He commits to a hypothesis in writing before checking. He's right about 40% of the time in the first month — which means he's in the learning zone, not the mastery zone.
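As one concrete illustration of the setup (not of his blind diagnosis), a minimal sketch of the data-leakage sabotage might look like the following, assuming scikit-learn and a synthetic dataset; `inject_leakage` is a hypothetical helper, not anything from a library, and in the actual regime the corruption would be chosen at random so the diagnosis stays blind.

```python
# A minimal sketch of one sabotage drill (data leakage), assuming scikit-learn
# and synthetic data. `inject_leakage` is a hypothetical helper, not a library call.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def inject_leakage(X, y, noise=0.05, seed=0):
    """Append a feature that is a thinly disguised copy of the target."""
    rng = np.random.default_rng(seed)
    leaked = y + rng.normal(scale=noise, size=y.shape)
    return np.column_stack([X, leaked])

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

clean_cv = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
leaky_cv = cross_val_score(RandomForestClassifier(random_state=0),
                           inject_leakage(X, y), y, cv=5)

# Suspiciously near-perfect cross-validation is one of the symptoms the drill
# trains him to notice; the leaked feature makes the effect unmistakable here.
print(f"clean CV accuracy: {clean_cv.mean():.3f}")
print(f"leaky CV accuracy: {leaky_cv.mean():.3f}")
```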
Practice for Gap 3: Feature Engineering Judgment
David studies winning solutions from Kaggle competitions in domains related to his work, with specific focus on feature engineering decisions. Not on the models (the models are often similar across solutions) but on the features — what did the winning teams create, why, and what did it contribute to performance?
For each feature engineering decision he studies, he writes a "principle extraction": "This illustrates the principle that [X] — in contexts where [Y], creating this type of feature is likely to help because [Z]."
He's building a mental model of feature engineering, not memorizing specific solutions.
The Feedback Loop Problem
David quickly realizes one fundamental challenge: feedback is slow.
In chess puzzles, feedback is instant: the solution is correct or it isn't. In music, the note is right or wrong. In ML, you don't know whether your model will generalize until you test it on truly held-out data, and often not even then, because the deployment environment differs from the test set.
He engineers faster feedback where he can:
- Uses datasets with known benchmarks (so he can compare to a standard)
- Creates held-out test sets before any modeling to preserve them as true held-out data (see the sketch after this list)
- Reviews his reasoning with a more senior ML practitioner monthly (not his code, his reasoning process)
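For the held-out set discipline, a minimal sketch might look like this, assuming scikit-learn and pandas; the synthetic DataFrame and file names are placeholders for his real data.

```python
# A minimal sketch of sealing off a held-out set before any modeling, assuming
# scikit-learn and pandas; the synthetic DataFrame and file names are placeholders.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for one of the raw datasets.
X, y = make_classification(n_samples=5000, n_features=12, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
df["target"] = y

# Split once, up front, before any exploration or feature engineering.
working, holdout = train_test_split(df, test_size=0.2, random_state=42,
                                    stratify=df["target"])

# The held-out rows are written out and not opened again until the final
# generalization check; everything else (EDA, CV, tuning) uses `working` only.
holdout.to_csv("holdout_do_not_touch.csv", index=False)
working.to_csv("working_set.csv", index=False)
```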
The monthly reasoning review is the most valuable feedback he gets. His reviewer isn't evaluating whether his models performed well. She's evaluating whether his reasoning process is principled and calibrated. This is process feedback, not outcome feedback — and it changes how David thinks about what he's trying to develop.
Three Months Later
David's ML work has transformed in ways he finds hard to describe without comparing it to where he started.
The most significant change: he now has mental representations of what good ML practice looks like. When he opens a new dataset, he has a process — not a rigid algorithm but a principled exploration sequence that reflects developed judgment. When a model performs poorly, he has diagnostic hypotheses before he starts experimenting.
He's still not a senior ML practitioner. He's been doing this for less than a year; practitioners have years of domain-specific experience that he doesn't have yet. But the quality of his reasoning has changed.
He also has much more accurate self-assessment. Before, he would have said he was "learning ML pretty well" because his course completion metrics said so. Now he can identify specific gaps in his knowledge with precision — which is both humbling and useful, because identified gaps can be targeted.
The Insight
"The courses taught me what machine learning is," David says. "The deliberate practice is teaching me to think like someone who does machine learning."
The distinction maps cleanly onto mental representations: the courses built his conceptual knowledge of the domain. The deliberate practice is building his models of how good ML practitioners reason, evaluate, and diagnose — the internal representations that enable judgment rather than just recall.
He's also confronted with an unexpected corollary: deliberate practice is more tiring than coursework. A focused session of diagnosing deliberately broken models, with committed hypotheses and careful reasoning, leaves him more mentally depleted than two hours of watching excellent instruction. This matches exactly what Ericsson's research describes — genuine deliberate practice is cognitively exhausting in a way that passive learning is not.
That exhaustion has become, perversely, a signal he trusts. If a session leaves him tired in that specific way, he probably did real work. If he finishes feeling fine, he may have been going through the motions.
Real practice is hard. The hardness is the point.