Chapter 10 Key Takeaways: Advanced Prompting Techniques

  1. Standard prompting hits a ceiling on complex tasks. The four-element framework (task, context, format, constraints) is necessary but insufficient for tasks requiring multi-step reasoning, style matching, quality assurance, or complex scope management. Advanced techniques address the specific failure modes that the fundamentals alone cannot fix.

  2. Chain-of-thought prompting works because it externalizes reasoning. By making intermediate steps explicit in the generated text, each step constrains the next, preventing pattern-matched conclusions from substituting for actual reasoning. The improvement is in the generation process, not the model's knowledge.

  3. Zero-shot CoT is the minimum viable reasoning boost. Adding "Let's think step by step" (or equivalent phrasing) to a prompt produces roughly 60% of the benefit of full few-shot CoT examples — with zero additional examples required. This is the easiest upgrade available for reasoning-heavy tasks.
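
As a minimal sketch (the function name is illustrative, and the actual model call is omitted), the zero-shot upgrade is nothing more than a suffix on an otherwise ordinary prompt:

```python
# Zero-shot CoT: append a reasoning trigger. The trigger phrase is the only
# change -- no examples are added.
REASONING_TRIGGER = "Let's think step by step."

def zero_shot_cot(prompt: str) -> str:
    """Return the prompt unchanged except for the appended reasoning trigger."""
    return f"{prompt.rstrip()}\n\n{REASONING_TRIGGER}"

upgraded = zero_shot_cot(
    "A store sells pens at $2 each, with a 10% discount on orders of 50 or "
    "more. What does an order of 60 pens cost?"
)
```

The same pattern works with equivalent phrasings ("Work through this step by step before answering"); the point is that the trigger goes at the end, after the task itself.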

  4. CoT improves accuracy by approximately 3× on multi-step reasoning tasks. The Wei et al. (2022) finding — from ~18% to ~57% on math word problems — illustrates the magnitude of possible improvement. Gains vary by task and model but are consistently positive for any task requiring more than one inferential step.

  5. Use CoT for reasoning, not for everything. CoT adds value on multi-step logic, analysis, diagnosis, and planning. It adds noise on simple factual retrieval, short creative tasks, and direct formatting requests. Knowing when not to use CoT is as important as knowing when to use it.

  6. Fake reasoning is a real risk with CoT. Some models produce outputs that look like step-by-step reasoning but are actually vague, non-building, pattern-matched summaries. To detect fake reasoning, check that each step commits to specific intermediate conclusions that constrain the next step. If the steps could be reordered without consequence, they're probably not genuine reasoning.

  7. Few-shot prompting shows rather than describes. Abstract instruction ("conversational and warm") communicates a genre. Concrete examples communicate a specific voice. For style, format, and classification tasks, demonstrating is always more effective than describing.

  8. Two to six examples is the optimal few-shot range for large models. The marginal benefit of additional examples drops sharply after six. Three well-chosen examples often achieve near-maximum performance. The constraint is example quality, not quantity.
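
A hedged sketch of how a small example set might be assembled into a prompt — the function name, the `Input:`/`Output:` labels, and the two-to-six guard are illustrative, not a prescribed API:

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Assemble a few-shot prompt: instruction, worked examples, new input.

    examples is a list of (source, target) pairs shown in identical format.
    """
    if not 2 <= len(examples) <= 6:
        raise ValueError("use two to six examples; more adds little for large models")
    parts = [instruction]
    for source, target in examples:
        parts.append(f"Input: {source}\nOutput: {target}")
    # End with the new input and an open "Output:" for the model to complete.
    parts.append(f"Input: {new_input}\nOutput:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    "Rewrite each sentence in our warm, plain-spoken support voice.",
    [
        ("Your request has been processed.", "Done! Your request went through."),
        ("We are unable to refund this item.", "I'm sorry, this one isn't refundable."),
    ],
    "The account has been terminated.",
)
```

Keeping every example in the exact same format matters as much as the examples themselves: the model imitates the pattern it sees.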

  9. Example selection is the highest-leverage decision in few-shot prompting. Examples that represent the range of your task, demonstrate the qualities you care about most, and maintain consistent format will outperform twice as many mediocre examples. Choose based on representativeness, not recency or effort.

  10. Few-shot example quality can be tested. A/B testing with different example sets — keeping everything else constant — is the most reliable way to identify which examples are producing the results you want. If output quality varies significantly between example sets, the examples are doing the work.
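
One way such an A/B test could be scripted — `generate` and `score` are placeholders for your model call and your quality rubric, and the dummy demo below only exercises the plumbing:

```python
def ab_test_example_sets(example_sets, inputs, generate, score):
    """Compare few-shot example sets on the same inputs, all else held constant.

    generate(examples, x) -> output text; score(output) -> float.
    Returns the mean score per named example set.
    """
    results = {}
    for name, examples in example_sets.items():
        outputs = [generate(examples, x) for x in inputs]
        results[name] = sum(score(o) for o in outputs) / len(outputs)
    return results

# Dummy stand-ins, just to show the shape of a run:
demo = ab_test_example_sets(
    {"set_a": ["ex1", "ex2"], "set_b": ["ex1", "ex2", "ex3"]},
    ["input1", "input2"],
    generate=lambda examples, x: f"{len(examples)} {x}",
    score=lambda output: float(output.split()[0]),
)
```

If the per-set means differ meaningfully under a real `generate` and `score`, the examples are doing the work — which is exactly the signal the takeaway describes.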

  11. Self-critique is a quality layer, not a replacement for human review. Asking the model to evaluate and improve its own output catches a meaningful portion of errors — approximately 20-40% depending on task type — but does not eliminate them. Its value is lowering the error rate in your drafts before human review, not removing the need for it.

  12. Self-critique requires explicit criteria to avoid sycophancy. Open-ended self-critique ("how could this be better?") often produces weak, validating responses. Criteria-based self-critique ("evaluate against these 4 specific standards") forces genuine evaluation. The more specific the criteria, the more useful the critique.

  13. Constitutional self-critique applies explicit standards, not open-ended judgment. Specifying a "constitution" — a numbered list of quality standards the output must meet — produces more rigorous and actionable critique than asking for general improvement.
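
A possible template for constitution-based critique; the four standards shown are placeholders to be swapped for your own, and the wording of the instruction is a sketch rather than a canonical formula:

```python
# A "constitution": a numbered list of explicit quality standards.
CONSTITUTION = [
    "Every factual claim is either sourced or flagged as unverified.",
    "The tone matches the provided voice examples.",
    "Each section answers the question its heading promises.",
    "No recommendation is made without stating its main trade-off.",
]

def constitutional_critique_prompt(draft, constitution=CONSTITUTION):
    """Build a critique prompt that forces evaluation against numbered standards."""
    standards = "\n".join(f"{i}. {s}" for i, s in enumerate(constitution, 1))
    return (
        "Evaluate the draft below against each numbered standard. For each, "
        "quote the passage that passes or fails it, then revise the draft to "
        "fix every failure.\n\n"
        f"Standards:\n{standards}\n\n"
        f"Draft:\n{draft}"
    )
```

Requiring a quoted passage per standard is one way to block the vague, validating responses that open-ended "how could this be better?" critique tends to produce.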

  14. Structured decomposition prevents scope overload. Asking AI to produce a large, complex deliverable in a single prompt often produces broad, shallow output. Breaking the task into defined subtasks and addressing each with focused attention produces better results on every subtask.
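
Decomposition can be as simple as mapping subtasks to focused prompts; this sketch (names and wording illustrative) turns one oversized request into one prompt per subtask:

```python
def decompose(deliverable, subtasks):
    """Produce one focused prompt per subtask of a larger deliverable."""
    return [
        f"We are producing: {deliverable}\n"
        f"Address ONLY this subtask, in depth: {subtask}\n"
        "Do not summarize or preview the other parts."
        for subtask in subtasks
    ]

prompts = decompose(
    "a competitive analysis of three vendors",
    ["pricing comparison", "feature gaps", "switching costs"],
)
```

Each sub-prompt then gets the model's full attention, instead of a shallow pass over everything at once.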

  15. "Plan then execute" gives you structural control. The two-stage approach — get the plan first, review it, then execute — prevents you from discovering structural problems after all the content is written. Approve the skeleton before the flesh.

  16. Tree-of-Thought is for problems with genuinely multiple viable paths. For most business tasks, chain-of-thought is sufficient. Consider Tree-of-Thought when the problem has multiple legitimate approaches and the early choice of approach materially affects the outcome.

  17. Technique combinations amplify individual benefits. Few-shot + CoT teaches both the format and the reasoning process. Role + self-critique establishes evaluative standards through the role. Decomposition + CoT applies step-by-step reasoning within each defined subtask. The combinations are more powerful than any single technique.

  18. Your few-shot library is a strategic asset. A curated set of high-quality examples for your recurring tasks is the closest thing to a "trained model" that you can build without technical resources. Invest time in building it; maintain it as the work evolves.

  19. Self-critique creates a verification checklist. Having the model flag which claims are verified vs. unverified doesn't eliminate hallucinations — it creates a prioritized list of what you need to verify. This transforms the human review from line-by-line reading to targeted checking of flagged content.
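
One way to request the checklist is a standard suffix on generation prompts — the wording here is a sketch, not a canonical formula:

```python
# Suffix asking the model to flag each factual claim for targeted review.
VERIFY_SUFFIX = (
    "After the draft, list every factual claim you made, marking each "
    "VERIFIED (supported by the provided source material) or "
    "UNVERIFIED (needs human checking)."
)

def with_verification_checklist(prompt):
    """Append the claim-flagging request to a generation prompt."""
    return f"{prompt.rstrip()}\n\n{VERIFY_SUFFIX}"
```

The human reviewer then starts from the UNVERIFIED list instead of reading every line with equal suspicion.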

  20. Technique selection is a skill that develops with practice. The selection guide in this chapter provides a starting framework, but the most reliable guide is your own experience with your specific tasks. Document which techniques work best for which of your recurring tasks — this knowledge compounds.

  21. The "master prompt" combining multiple techniques is worth building for your highest-stakes tasks. A prompt that combines role + context + few-shot + CoT + self-critique for your most important recurring task takes time to build but produces consistently excellent results. Build one; measure the improvement.

  22. CoT makes debugging tractable. For code debugging, the five-step CoT format — intent analysis, failure mode analysis, isolation analysis, suspect ranking, investigation step — consistently outperforms generic "what's wrong with this code?" prompts. The specificity of the reasoning constraint is the key variable.
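
The five-step format could be templated as follows; the step names come from the chapter, while the one-line gloss on each is an illustrative paraphrase:

```python
# The five reasoning steps, in the order the model should work through them.
DEBUG_STEPS = [
    "Intent analysis: what is this code supposed to do?",
    "Failure mode analysis: what exactly goes wrong, and on which inputs?",
    "Isolation analysis: which components can be ruled out, and why?",
    "Suspect ranking: rank the remaining candidates by likelihood.",
    "Investigation step: what single check would confirm or eliminate "
    "the top suspect?",
]

def debug_prompt(code, failure_description):
    """Build a debugging prompt that constrains reasoning to the five steps."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(DEBUG_STEPS, 1))
    return (
        f"Debug the code below. Work through these steps in order:\n{steps}\n\n"
        f"Observed failure: {failure_description}\n\n"
        f"Code:\n{code}"
    )
```

Compare this to "what's wrong with this code?": the steps force the model to commit to intermediate conclusions (intent, failure mode, ruled-out components) before naming a culprit.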

  23. The 3% failure rate is a clue, not just a rate. Raj's debugging case demonstrates a general principle: the specific characteristics of a failure pattern (its frequency, which inputs trigger it, when it occurs) are diagnostic data. Prompting the model to reason about those characteristics specifically leads to faster root-cause identification.

  24. Few-shot libraries need maintenance. Examples that are used repeatedly may become too prominent in the model's output (borrowed phrases appearing literally). Review and refresh examples periodically. When the brand or style evolves, update the library to reflect the current standard.

  25. Advanced techniques work because they change how the model generates, not what it knows. CoT, few-shot, self-critique, and decomposition all work by restructuring the generation process — making reasoning explicit, showing standards through examples, separating generation from evaluation, limiting scope per step. They are not tricks; they are principled interventions in the generation process.