Case Study 2: McKinsey's AI-Augmented Consulting — Prompt Chains in Professional Services


Introduction

In September 2023, McKinsey & Company — the world's most prestigious management consulting firm, with $16 billion in annual revenue and 45,000 consultants across 130 cities — published the results of a controlled experiment that sent ripples through the professional services industry.

Working with researchers from Harvard Business School, the Wharton School, and MIT Sloan, McKinsey enrolled 758 consultants in a randomized study to measure how GPT-4 affected the quality and speed of consulting work. The results were striking: consultants using AI completed 12.2 percent more tasks, finished them 25.1 percent faster, and produced output that was rated 40 percent higher in quality than consultants working without AI.

But the headline numbers obscure a more nuanced story — one that reveals both the extraordinary power and the subtle dangers of AI augmentation in knowledge work. This case study examines how elite consulting firms are using advanced prompt engineering — particularly prompt chaining, structured outputs, and multi-agent patterns — to augment research, analysis, and deliverable creation. It also explores where the approach breaks down and the governance practices the industry is developing in response.


Phase 1: The Harvard-Wharton-MIT Experiment

Study Design

The experiment, led by Harvard's Fabrizio Dell'Acqua, Ethan Mollick (Wharton), and others, was designed to measure AI's impact on realistic consulting tasks — not toy problems. The 758 McKinsey consultants (ranging from business analysts to partners) were randomly assigned to one of three groups:

  1. No AI. Completed tasks using traditional methods only.
  2. AI with basic access. Given access to GPT-4 with no specific guidance on how to use it.
  3. AI with training. Given access to GPT-4 plus instruction in prompt engineering techniques — including many of the techniques covered in Chapter 20.

The tasks were drawn from real consulting work: market sizing, competitive analysis, creative brainstorming, persuasive writing, and strategic recommendation development.

Key Findings

For tasks within the AI "frontier" — tasks the model could handle competently — the results were dramatic. Consultants in the AI groups outperformed the no-AI group on every metric. The trained group (Group 3) performed particularly well, suggesting that prompt engineering skill significantly amplifies AI's value.

For tasks outside the frontier — tasks requiring judgment the model lacked — AI actually reduced performance. When consultants were given a task that required specialized knowledge the model did not possess, those using AI were more likely to produce incorrect recommendations than those working without AI. The researchers called this the "falling asleep at the wheel" effect: consultants trusted the model's confident-sounding output and failed to apply their own critical judgment.

Business Insight: This finding has profound implications. AI does not uniformly improve knowledge work. It improves tasks within its capability frontier and degrades tasks outside it — and the degradation is particularly dangerous because the model's output looks equally confident in both cases. The consultants who performed best were those who could accurately judge which tasks to delegate to AI and which to handle themselves.

The Role of Prompt Engineering

The study found that consultants who used structured prompting techniques — including chain-of-thought reasoning, role-based prompting, and iterative refinement — produced significantly better outputs than those who used the model conversationally. Specifically:

  • Consultants who structured their prompts with explicit reasoning steps (CoT) produced analyses rated 35 percent more rigorous than those who asked open-ended questions.
  • Consultants who broke complex analyses into sequential prompts (chaining) produced recommendations rated 28 percent more actionable.
  • Consultants who used the model to critique and revise its own output (self-critique) submitted deliverables with 45 percent fewer factual errors.

These findings validate the central thesis of Chapter 20: advanced prompt engineering techniques are not academic curiosities — they are practical skills with measurable business impact.


Phase 2: From Experiment to Practice

McKinsey's Internal AI Platform

Following the experiment, McKinsey accelerated its internal AI strategy. By mid-2024, the firm had deployed Lilli — an internal generative AI tool named after Lillian Dombrowski, the first professional woman McKinsey hired, in 1945. Lilli was designed not as a general-purpose chatbot but as a specialized research and analysis assistant for consultants.

Lilli's architecture incorporated several advanced prompt engineering principles from Chapter 20:

Retrieval-augmented generation. Lilli connects to McKinsey's vast internal knowledge base — tens of thousands of client engagement reports, research publications, and proprietary frameworks — so that every response is grounded in McKinsey-specific expertise rather than generic internet knowledge. (This connects to the RAG concepts introduced in Chapter 20 and explored in depth in Chapter 21.)

Prompt chaining for research workflows. When a consultant asks Lilli to analyze a market entry opportunity, the system does not generate a single response. Instead, it executes a chain:

  1. Search the knowledge base for relevant prior engagements and market analyses.
  2. Synthesize findings into a structured market overview.
  3. Identify gaps in the available data and flag them for the consultant.
  4. Generate a preliminary analysis framework (e.g., market sizing, competitive landscape, entry barriers).
  5. Format the output as a McKinsey-standard slide structure.

Each step has its own prompt, its own system message, and its own quality checks.
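
A chain of this shape can be sketched as a minimal orchestration loop. Everything below is illustrative: the step names, the stub model, and the per-step quality checks are assumptions made for the sketch, not Lilli's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChainStep:
    name: str
    system_message: str                   # each step has its own system message
    build_prompt: Callable[[dict], str]   # turns accumulated context into a prompt
    quality_check: Callable[[str], bool]  # gate before output flows downstream

def run_chain(steps: list, llm: Callable[[str, str], str], task: str) -> dict:
    """Execute steps in order; stop and flag for the consultant if a check fails."""
    context = {"task": task}
    for step in steps:
        output = llm(step.system_message, step.build_prompt(context))
        if not step.quality_check(output):
            context[step.name + "_flagged"] = True  # surface the failure, don't continue
            break
        context[step.name] = output
    return context

# Stub standing in for a real LLM call, so the sketch runs offline.
def stub_llm(system: str, prompt: str) -> str:
    return "stub response to: " + prompt

steps = [
    ChainStep("search", "You retrieve prior engagements.",
              lambda c: "Find prior work on: " + c["task"], lambda o: len(o) > 0),
    ChainStep("synthesize", "You write structured market overviews.",
              lambda c: "Summarize into a market overview: " + c["search"], lambda o: len(o) > 0),
]

result = run_chain(steps, stub_llm, "market entry opportunity")
```

The same skeleton extends to the remaining steps (gap identification, framework generation, slide formatting); the design point is that each step carries its own system message and its own gate, exactly as described above.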

Structured output for slide generation. Consulting deliverables follow rigid formatting standards — the "McKinsey slide" is a specific artifact with a governing thought at the top, supporting evidence in the body, and a source line at the bottom. Lilli generates structured output that maps directly to this format, reducing the manual effort of converting free-form text into presentation-ready content.
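
A minimal sketch of that mapping, assuming a JSON contract with three required fields (the field names and validation logic are illustrative, not McKinsey's actual schema):

```python
import json
from dataclasses import dataclass

@dataclass
class Slide:
    governing_thought: str  # the action-oriented headline at the top
    evidence: list          # supporting data points in the body
    source_line: str        # attribution at the bottom

REQUIRED = {"governing_thought", "evidence", "source_line"}

def parse_slide(raw_json: str) -> Slide:
    """Reject model output that is missing any required slide element."""
    data = json.loads(raw_json)
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError("model output missing slide fields: " + ", ".join(sorted(missing)))
    return Slide(**data)

raw = ('{"governing_thought": "Enter market X via partnership",'
       ' "evidence": ["TAM est. $2B", "2 incumbent players"],'
       ' "source_line": "Internal analysis"}')
slide = parse_slide(raw)
```

Because the output is parsed into a typed object rather than accepted as free text, a malformed draft fails loudly before it can reach a deck.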

Multi-agent critique. Before presenting research output to the consultant, Lilli runs a critique step that evaluates the output against McKinsey's quality standards: "Is the governing thought clear and action-oriented? Is every data point sourced? Does the analysis address the 'so what' — the implication for the client's decision?"
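
A critique step of this kind can be approximated with a second model pass over a fixed rubric. The prompt wording and the PASS/FAIL convention below are assumptions for the sketch:

```python
# Hypothetical critique pass: a reviewer prompt scores a draft against
# the quality questions from the text before it reaches the consultant.
RUBRIC = [
    "Is the governing thought clear and action-oriented?",
    "Is every data point sourced?",
    "Does the analysis state the implication for the client's decision?",
]

def build_critique_prompt(draft: str) -> str:
    checklist = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(RUBRIC))
    return (
        "Review the draft below against each criterion. "
        "Answer PASS or FAIL per criterion, with one sentence of justification.\n\n"
        f"Criteria:\n{checklist}\n\nDraft:\n{draft}"
    )

def passed(critique_output: str) -> bool:
    # Conservative gate: any FAIL sends the draft back for revision.
    return "FAIL" not in critique_output.upper()
```

The gate errs toward false rejections on purpose: in a professional-services setting, an unnecessary revision cycle is far cheaper than a flawed deliverable reaching a client.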

Measured Impact

McKinsey reported the following results from Lilli's deployment (as of early 2025):

  • Research time reduction: 30-40 percent reduction in the time consultants spend on initial research and data gathering.
  • Draft quality improvement: First drafts produced with Lilli assistance were rated "engagement-ready" 60 percent of the time, compared to 25 percent for unassisted first drafts.
  • Knowledge access democratization: Junior consultants gained access to insights that previously required either years of experience or knowing which senior partner to ask. A first-year business analyst in the Tokyo office could access the same depth of prior work as a principal in New York.
  • Consistency improvement: Deliverables produced with Lilli showed less variance in quality across geographies and experience levels.

Business Insight: The democratization effect is arguably Lilli's most strategically significant impact. Consulting firms' competitive advantage has historically depended on the accumulated knowledge of their senior practitioners. By making that knowledge searchable and usable through AI, McKinsey reduces its dependency on individual experts while simultaneously making every consultant more effective.


Phase 3: The Challenges

Challenge 1: The Over-Reliance Problem

The "falling asleep at the wheel" effect identified in the Harvard study persisted in practice. Some consultants — particularly junior ones — began treating Lilli's output as authoritative rather than as a starting point for their own analysis.

In one widely discussed internal incident, a team submitted a market sizing analysis to a client that contained a significant error in a key assumption. The error originated in Lilli's output, was not caught during human review, and was identified by the client's internal analysts. The engagement partner described it as "the first time a McKinsey team was embarrassed by trusting an AI more than their own training."

McKinsey's response was instructive: rather than restricting access to Lilli, the firm added mandatory "AI output validation" steps to its engagement quality processes. Every Lilli-assisted analysis now requires at least one team member to independently verify key assumptions and data points — a human-in-the-loop requirement that mirrors the validation functions in the PromptChain class from Chapter 20.
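
A validation gate of that kind can be sketched in a few lines; the field and reviewer names are hypothetical. The idea is simply that an AI-assisted analysis cannot be marked client-ready until at least one team member has independently signed off on its key assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Analysis:
    key_assumptions: list
    verified_by: set = field(default_factory=set)  # humans who independently checked

def ready_for_client(analysis: Analysis, min_verifiers: int = 1) -> bool:
    """Gate: ship only after enough humans have verified the key assumptions."""
    return bool(analysis.key_assumptions) and len(analysis.verified_by) >= min_verifiers

a = Analysis(key_assumptions=["TAM grows 8 percent annually"])
blocked = ready_for_client(a)    # False: no human sign-off yet
a.verified_by.add("reviewer_1")  # hypothetical team member records verification
cleared = ready_for_client(a)    # True once at least one verifier signs off
```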

Challenge 2: Client Confidentiality

Consulting firms handle extraordinarily sensitive client data. A market entry analysis for one pharmaceutical company could contain information that would be material to that company's competitors — competitors that might also be McKinsey clients. Using this data to train or fine-tune AI models would be a catastrophic breach of confidentiality.

McKinsey addressed this through strict data isolation:

  • Lilli does not retain conversation history across different engagement teams.
  • Client-specific data is processed but never stored in the model's retrievable knowledge base.
  • Engagement-specific prompts are logged for quality assurance, but access is restricted to the engagement team and quality reviewers.
  • The firm uses private deployments of the underlying LLM, with contractual guarantees that data is not used for model training.

Challenge 3: Preserving Analytical Rigor

Consulting firms sell analytical rigor. When a McKinsey recommendation carries weight, it is because clients trust that the analysis behind it was exhaustive, methodologically sound, and informed by deep expertise. If AI-assisted analysis is perceived as "cheap" or "automated," it could undermine the very brand premium that supports McKinsey's pricing.

The firm's approach has been to position AI as an accelerant rather than a replacement. "AI does not replace the consultant," as one McKinsey partner put it. "It replaces the three days the consultant spent reading background reports so they can spend those three days thinking."

This positioning is reflected in how Lilli's chain is structured: the AI handles data gathering, synthesis, and formatting (Steps 1, 2, and 5 in the chain above), while the consultant owns analysis, judgment, and recommendation (Steps 3 and 4).

Challenge 4: Governance at Scale

With 45,000 potential users across 65+ countries, governing AI use at McKinsey is a massive operational challenge. The firm established an AI governance framework with several notable features:

Prompt libraries. Rather than allowing consultants to write ad hoc prompts, McKinsey maintains curated prompt libraries for common tasks — market sizing, competitive analysis, interview guide generation, and others. These prompts have been tested, reviewed for security and bias, and optimized for quality.
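
In code, a curated library reduces to a lookup of reviewed templates; the task names and template text below are invented for illustration:

```python
# Hypothetical curated prompt library: consultants fill in parameters on
# reviewed templates rather than writing ad hoc prompts from scratch.
PROMPT_LIBRARY = {
    "market_sizing": (
        "Estimate the size of the {market} market in {region}. "
        "Show each assumption on its own line with its source."
    ),
    "competitive_analysis": (
        "List the top competitors in {market}, with share, positioning, "
        "and one key vulnerability each. Cite a source per claim."
    ),
}

def get_prompt(task: str, **params: str) -> str:
    if task not in PROMPT_LIBRARY:
        raise KeyError(f"No curated prompt for task '{task}'; ad hoc prompts not permitted.")
    return PROMPT_LIBRARY[task].format(**params)

prompt = get_prompt("market_sizing", market="EV charging", region="Brazil")
```

The trade-off is deliberate: consultants lose some flexibility, but every prompt that reaches the model has already been tested and reviewed for security and bias.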

Usage monitoring. All Lilli interactions are logged. An internal team monitors for misuse patterns — excessive reliance without independent verification, use of client-confidential data in prompts, and outputs that violate quality standards.

Training requirements. All consultants complete an AI literacy and prompt engineering training program before gaining access to Lilli. The training includes the advanced techniques from Chapter 20 and emphasizes the boundaries of AI capability — when to trust the model and when to override it.

Tiered access. Access to Lilli's most powerful features (client data integration, knowledge base search, structured deliverable generation) is restricted to consultants who have completed advanced training and are working under an engagement with appropriate data handling protocols.


Industry-Wide Implications

McKinsey's experience is not unique — it is representative of a transformation sweeping through professional services:

Boston Consulting Group deployed its own internal AI platform and reported similar productivity gains, with consultants completing tasks 25 percent faster and producing higher-quality deliverables.

Deloitte invested in AI-augmented audit and advisory tools, using prompt chains to automate initial reviews of financial statements and flag anomalies for human auditors.

Law firms including Allen & Overy (which deployed Harvey, an AI legal assistant) are using structured output extraction to analyze contracts, identify risk clauses, and generate first drafts of legal documents.

Big Four accounting firms are building prompt-chain workflows for tax preparation, regulatory filing, and advisory services.

The common thread across all these implementations is the architecture described in Chapter 20: prompt chains that decompose complex professional tasks into steps, with structured outputs, quality checkpoints, and human oversight at critical decision points.


Lessons for Business Leaders

Lesson 1: AI Amplifies Existing Skill, It Does Not Replace It

The McKinsey study's most important finding is that AI made good consultants better and poor consultants more dangerous. The consultants who benefited most from AI were those with strong analytical foundations, clear thinking, and the ability to critically evaluate AI output. Those who lacked these skills produced worse work with AI than without it, because they accepted the model's confident-sounding errors at face value.

For business leaders, the implication is that AI training must be paired with domain training. Teaching employees to use AI tools without ensuring they have the judgment to evaluate AI output is like giving someone a car without teaching them to drive.

Lesson 2: Prompt Chains Are the Architecture of AI-Augmented Work

The ad hoc, conversational use of AI (open a chatbot, type a question, use the answer) is the lowest tier of AI augmentation. Meaningful productivity gains come from structured workflows — prompt chains that decompose tasks, ensure quality at each step, and integrate AI output into existing business processes.

McKinsey's Lilli is not a chatbot. It is a multi-step workflow engine that happens to use language models as its computational substrate. The sophistication is not in the model — it is in the orchestration.

Lesson 3: Governance Is a Competitive Advantage

Companies that govern AI well — curated prompt libraries, usage monitoring, mandatory training, human-in-the-loop requirements — will outperform those that govern it poorly. Not because governance makes AI faster, but because it makes AI trustworthy. And trust is the currency of professional services.

Lesson 4: The Economic Model Is Changing

If AI can reduce a three-week consulting engagement to one week of equivalent quality, the traditional billing model (hours × rate) comes under pressure. Clients will question why they are paying for three weeks of work that can be done in one. This forces professional services firms to shift from selling time to selling outcomes — a fundamental transformation of the industry's economic model.

Lesson 5: AI Creates New Roles, Not Just Efficiency

McKinsey did not reduce its headcount after deploying Lilli. Instead, it redeployed capacity. Consultants spent less time on research and formatting and more time on client relationship management, creative problem-solving, and implementation support — activities where human judgment and interpersonal skill remain irreplaceable. The firm also created new roles: prompt engineers, AI quality reviewers, and AI governance specialists.


Discussion Questions

  1. The Harvard-Wharton-MIT study found that AI made good consultants better and poor consultants more dangerous. How should firms use this finding in their hiring, training, and performance evaluation practices?

  2. McKinsey's Lilli uses curated prompt libraries rather than allowing ad hoc prompts. What are the trade-offs of this approach? When does curation help, and when does it constrain innovation?

  3. If AI can produce consulting-quality analysis at a fraction of the cost, how should consulting firms adjust their pricing models? What are the risks of undervaluing or overvaluing AI-augmented work?

  4. The case describes an incident where a team submitted AI-generated analysis with an undetected error. How should professional services firms allocate liability when AI contributes to a work product? Should the responsible consultant, the firm, or the AI vendor bear the risk?

  5. Consider the democratization effect: a first-year analyst with Lilli access can produce analysis comparable to a principal with 15 years of experience. What does this mean for career development, mentorship, and the apprenticeship model that consulting firms have traditionally relied on?

  6. Compare McKinsey's prompt chain architecture for market analysis with Athena's QBR chain described in Chapter 20. What structural similarities do you observe? How do the validation and governance requirements differ between a consulting deliverable and an internal business report?


Key Takeaways

  • AI augmentation in consulting produces a 25-40 percent improvement in speed and quality for tasks within the AI's capability frontier — but can degrade performance for tasks outside it.
  • The consultants who benefit most from AI are those with the strongest domain expertise and critical thinking skills — AI amplifies existing capability rather than substituting for it.
  • Prompt chains, structured outputs, and multi-agent critique patterns are the architectural foundation of AI-augmented professional services, mirroring the techniques in Chapter 20.
  • Governance — curated prompt libraries, usage monitoring, mandatory training, human-in-the-loop requirements — is essential for maintaining quality and trust.
  • AI is transforming the economics of professional services, forcing a shift from billing for time to billing for outcomes.
  • New roles (prompt engineers, AI governance specialists) are emerging alongside productivity gains, suggesting that AI creates employment at the same time it automates tasks.

This case study connects to the prompt chaining, structured output, and enterprise governance concepts in Chapter 20. For deeper exploration of AI's impact on organizational transformation, see Chapters 35-36. For the ethical dimensions of AI-augmented work, see Chapter 38.