Chapter 13 Key Takeaways: Diagnosing and Fixing Bad Outputs
-
Bad output is information, not failure. Every bad AI output answers the question: "What did the model understand from my prompt?" Treating it as a diagnostic signal rather than a frustrating failure transforms the experience from random to systematic.
-
There are seven distinct root causes of bad AI output. Insufficient context, vague instruction, format mismatch, wrong capability, hallucination, training data bias, and context window overflow each have specific diagnostic signals and specific fixes. Identifying the root cause is the first step to an effective repair.
-
Insufficient context (~35%) and vague instruction (~25%) account for approximately 60% of AI output failures. These are the most common root causes and are entirely preventable with the fundamental prompting skills from Chapters 7-9. Investing in fundamentals pays off more than investing in advanced techniques for most users.
-
The diagnostic framework has five questions: What specific failure is present? Which root cause(s) are responsible? What was missing or wrong in the prompt? How far is the output from what I need? Should I repair or restart?
-
Name the failure specifically before writing the repair prompt. "This isn't right" produces a vague repair. "The output is generic — no reference to our industry or specific situation" produces a targeted repair. The more precisely you can describe the failure, the more precisely the model can fix it.
-
The Triage Matrix uses two dimensions: distance from goal and effort to repair. Low distance + low effort = quick repair. High distance = consider restarting, especially if a restart would be quick. The matrix prevents investing repair effort in outputs that cannot be salvaged.
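The two-dimensional decision can be sketched as a small function. This is a minimal illustration under stated assumptions — the quadrant actions are paraphrased from the takeaway above, and the exact labels are not the book's wording:

```python
def triage(distance: str, effort: str) -> str:
    """Map (distance from goal, effort to repair) to an action.

    Both inputs are "low" or "high" — a deliberate simplification
    of the judgment call the matrix asks you to make.
    """
    if distance == "low" and effort == "low":
        return "quick repair"                  # Quadrant 1: fix in place
    if distance == "low" and effort == "high":
        return "repair, but scope it tightly"  # preserve what already works
    if distance == "high" and effort == "low":
        return "restart with a better prompt"
    return "restart; diagnose the root cause first"
```

The point of the sketch is the asymmetry: high distance pushes toward restart regardless of repair effort, because repair effort spent on an unsalvageable output is wasted.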
-
The retry loop doesn't work for systematic failures. Submitting the same prompt again addresses only random variance failures. For systematic failures caused by missing context, vague instruction, or capability limits, the same conditions produce the same failure. Diagnose first, then re-prompt.
-
Format mismatch is the easiest failure to repair. When content is right but format is wrong, the Format Fix template preserves the content and restructures it. This is a Triage Matrix Quadrant 1 situation — quick repair, no restart needed.
-
Hallucination has consistent patterns — it is not random. Hallucination risk is highest for: specific numerical claims, citation attribution, internal/custom library API calls, events near or after the training cutoff, and obscure topics with limited training data. Knowing the hallucination risk profile lets you apply verification effort where it matters most.
-
For code hallucination, internal library method calls are the highest-risk pattern. AI models generate plausible API surfaces for custom libraries based on usage patterns, but they cannot know which methods were actually implemented. The "does this method exist?" verification check takes 30 seconds and prevents hours of debugging.
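The 30-second check can even be mechanized. A minimal sketch, assuming a hypothetical internal class (`QueryBuilder` stands in for any custom library object; the method names are illustrative):

```python
class QueryBuilder:
    """Hypothetical internal library class."""
    def filter(self, **kwargs): ...
    def order_by(self, field): ...

def phantom_methods(obj_or_cls, names):
    """Return the subset of `names` that do NOT exist on the object/class."""
    return [n for n in names if not hasattr(obj_or_cls, n)]

# Suppose AI-generated code called .filter() and .advanced_filter().
# The check surfaces the phantom before any debugging starts:
missing = phantom_methods(QueryBuilder, ["filter", "advanced_filter"])
```

Here `missing` contains only `"advanced_filter"` — the plausible-looking method the model invented from usage patterns.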
-
Unit tests that mock objects will not catch phantom method hallucinations. Raj's case demonstrates this directly — the tests passed because they mocked the QueryBuilder, never instantiating the non-existent AdvancedFilter class. Integration tests that use real objects are necessary to catch this failure mode.
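The mechanism behind this gap is easy to demonstrate. A plain mock fabricates any attribute on demand, so code calling a phantom method "passes"; a spec'd mock mirrors the real API and rejects the call. This sketch uses a hypothetical `QueryBuilder` with an illustrative phantom method, not the book's exact code:

```python
from unittest import mock

class QueryBuilder:
    """The real internal class — note it has no advanced_filter."""
    def filter(self, **kwargs):
        return self

# A plain Mock happily accepts a call to a method that doesn't exist,
# which is why Raj's mocked tests passed:
fake = mock.Mock()
fake.advanced_filter(region="EU")   # no error raised

# An autospec'd mock is constrained to the real class's API and
# raises AttributeError on the phantom call:
strict = mock.create_autospec(QueryBuilder, instance=True)
try:
    strict.advanced_filter(region="EU")
    caught = False
except AttributeError:
    caught = True
```

Autospec narrows the gap, but as the takeaway says, only integration tests against real objects close it completely.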
-
Authoritative tone is a hallucination risk signal, not a confidence signal. The more professionally written and specifically attributed a hallucinated claim sounds, the less likely it is to be questioned. "According to a 2023 McKinsey study..." followed by precise percentages should increase your verification attention, not your confidence.
-
"Keep everything else as is" is the most important instruction in a targeted repair prompt. Without it, repair prompts can inadvertently fix the stated problem while changing elements that were already correct. Scoping the repair to only the problem element preserves working content.
-
There are seven repair prompt templates for systematic use: Targeted Correction (specific element is wrong), Task Clarification (model misunderstood the task), Format Fix (content right, format wrong), Context Reload (add missing context), Factual Correction (specific claim is wrong), Depth Request (correct direction but too shallow), and Full Restart (fundamentally wrong, restart required).
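For systematic reuse, the seven situations can be kept as fill-in-the-blank stubs. The wording below is paraphrased for illustration — it is not the book's exact template text:

```python
# Illustrative paraphrases of the seven repair situations.
REPAIR_TEMPLATES = {
    "targeted_correction": "The {element} is wrong: {problem}. Fix only that. "
                           "Keep everything else as is.",
    "task_clarification":  "You misunderstood the task. What I actually need "
                           "is: {task}.",
    "format_fix":          "The content is right. Restructure it as {format}, "
                           "changing nothing else.",
    "context_reload":      "Here is context you were missing: {context}. "
                           "Revise with it in mind.",
    "factual_correction":  "This claim is incorrect: {claim}. The correct "
                           "fact is: {fact}. Keep everything else as is.",
    "depth_request":       "Right direction, but too shallow. Go deeper on "
                           "{topic}.",
    "full_restart":        "Discard that output. Starting over: {new_prompt}",
}

def repair_prompt(kind: str, **fields) -> str:
    """Fill a repair template with the specifics of this failure."""
    return REPAIR_TEMPLATES[kind].format(**fields)
```

Keeping the stubs in one place reinforces the habit the chapter recommends: name the failure, pick the matching template, then write the repair.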
-
Some outputs are genuinely unfixable with better prompting. When the model has a genuine capability gap, when the task requires knowledge the model doesn't have, or when specialized accuracy is required that the model cannot reliably provide — stop trying to prompt your way around the limitation. Change the approach.
-
The personal failure taxonomy builds the long-term learning that makes prompting progressively better. Logging failures with root cause, repair, and prevention creates a personalized improvement roadmap. After 20-30 entries, patterns become visible that are invisible in the moment.
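A minimal sketch of such a log as a data structure. The field names are an assumption — the book specifies root cause, repair, and prevention as the essentials:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class FailureEntry:
    date: str
    root_cause: str   # one of the seven root causes
    failure: str      # one-sentence description of the specific failure
    repair: str       # which repair template worked
    prevention: str   # what to do differently next time

def top_causes(log: list, n: int = 3):
    """Surface the dominant root causes — the patterns that are
    invisible in the moment but obvious after 20-30 entries."""
    return Counter(e.root_cause for e in log).most_common(n)
```

A usage example: if `top_causes` keeps returning `insufficient_context`, the roadmap points straight back to the Chapter 8 fundamentals rather than to advanced techniques.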
-
Training data bias produces systematic failures in a consistent direction. When recommendations consistently favor large companies, US-based contexts, or mainstream solutions, training data bias is the likely cause. The fix is an explicit counter-instruction: "recommendations should be appropriate for a [specific context] — not for a large enterprise."
-
Context window overflow is a conversation management problem, not a prompting problem. The fix is to periodically restate key constraints in long conversations, or to start a fresh conversation with essential context included from the beginning.
-
Each root cause in the diagnostic framework maps back to a preventive skill in an earlier chapter. Insufficient context → Chapter 8. Vague instruction → Chapter 9. Format mismatch → Chapter 7. Hallucination → Chapter 10 (self-critique). Context overflow → Chapter 11 (pattern design). The diagnostic framework is a feedback loop to the earlier chapters.
-
Alex's Black Friday case demonstrates the value of using your patterns even under time pressure. The failure was predictable: she skipped her brand voice reference library because she was working fast. The pattern existed; she just didn't use it.
The most common way pattern libraries fail is not that they were never built — it's that they exist and go unused.
-
Raj's phantom function demonstrates that code hallucination can pass all standard quality checks. The code compiled, the unit tests passed, the code review passed — and none of these steps detected the non-existent method. The lesson: for AI-generated code that calls internal APIs, verification requires actually checking whether the method exists, not just whether the code compiles.
-
The three-minute diagnosis process: Minute 1 — read the output carefully and write one sentence naming the specific failure. Minute 2 — match to a root cause and identify what was missing from the prompt. Minute 3 — choose a repair template and write the repair prompt. This process consistently produces better repairs than immediately re-prompting.
-
For high-stakes outputs, add hallucination prevention instructions proactively. "Flag any specific factual claim with [UNVERIFIED] if you're not certain it's accurate," "Avoid specific numeric claims unless you are confident in their accuracy," and "If you cite a source, include enough detail to verify it" are all prevention instructions that reduce hallucination risk before it manifests.
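Making this proactive step habitual can be as simple as prepending the instructions before every high-stakes prompt. The instruction strings below come from the takeaway above; the function name is an illustrative assumption:

```python
# Prevention instructions quoted from the chapter's takeaway.
PREVENTION_INSTRUCTIONS = [
    "Flag any specific factual claim with [UNVERIFIED] if you're not "
    "certain it's accurate.",
    "Avoid specific numeric claims unless you are confident in their "
    "accuracy.",
    "If you cite a source, include enough detail to verify it.",
]

def harden(prompt: str) -> str:
    """Prepend hallucination-prevention instructions to a prompt."""
    return "\n".join(PREVENTION_INSTRUCTIONS) + "\n\n" + prompt
```

The design choice is to front-load the instructions so they apply to the whole response, rather than appending them where they compete with the task description.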
-
The most dangerous AI failures are not the obvious ones — they're the plausible ones. An obviously wrong output is easy to catch. A confidently stated, authoritatively framed, specifically attributed wrong output is what causes real professional damage. This is why calibrated skepticism — targeted at the specific content types where hallucination is most common — is more valuable than uniform skepticism.
-
Chapter 13 closes the Part 2 loop: every prompting skill has a corresponding failure mode it prevents. Learning the diagnostic framework makes every earlier chapter's skill more meaningful — because now you understand exactly what problem each skill is preventing, and you have a systematic way to identify which skill to apply when something goes wrong.