Case Study: Elena's Research Accelerator — Elicit and Consensus for Literature Synthesis

The Challenge

Elena's consulting project required her to build an evidence-based perspective on a question her client — a large hospital system — was wrestling with: whether AI-assisted clinical documentation tools actually improve physician wellbeing and reduce burnout, or whether the productivity gains were being absorbed by increased patient volumes with no net benefit to physicians.

This was not a question with an obvious answer. Elena had relevant intuitions from client conversations, but the client's medical leadership would expect a rigorous perspective grounded in research evidence. The literature on physician burnout is large, methodologically heterogeneous, and partly published in clinical journals she did not have training to assess independently.

In a traditional workflow, Elena would have spent two to three days searching PubMed and Google Scholar, trying to identify the relevant studies, reading abstracts and deciding what to request full-text for, and eventually synthesizing across 10-15 papers. She would have needed to be careful to capture the range of evidence rather than cherry-picking, and she would have needed domain expert input to assess study quality.

She decided to use Elicit and Consensus to see whether she could do this differently.

Phase 1: Scoping with Consensus (45 minutes)

Elena started with Consensus because she had a specific, answerable research question: "Does AI clinical documentation reduce physician burnout?"

The Consensus result was immediately informative — and appropriately hedged. It found 11 studies with relevant findings, reported a mixed-to-positive evidence direction, and noted that most evidence was on documentation burden reduction (strong evidence) and physician satisfaction (moderate evidence), while actual burnout metric improvement (using validated burnout scales) had more limited evidence.

This three-way distinction — documentation burden, satisfaction, clinical burnout — was useful framing she had not started with. The Consensus output prompted her to refine the question.

She ran two follow-up queries:

1. "Does reducing documentation burden improve physician wellbeing?"
2. "Do physicians exposed to AI clinical tools report higher or lower satisfaction after 12+ months?"

The first query had stronger evidence (documentation reduction is fairly consistently associated with wellbeing improvements). The second was more interesting — the time dimension revealed something the general question had obscured. Several studies showed high initial satisfaction that somewhat declined at 12 months as workload dynamics adjusted.

Total time for scoping phase: 45 minutes. At this point she had a clearer question, an initial orientation on the evidence landscape, and a specific hypothesis to investigate further: that time-horizon matters significantly for interpreting AI documentation tool benefits.

Phase 2: Systematic Paper Search with Elicit (90 minutes)

With clearer questions, Elena moved to Elicit for more detailed paper analysis.

Her primary Elicit query: "What is the effect of AI ambient documentation tools on physician time, wellbeing, and practice patterns?"

Elicit returned 23 papers. For each, it extracted:

- Study type (RCT, cohort study, survey, etc.)
- Study population (specialty, practice setting, country)
- Tool studied (Nuance DAX was most common, several others)
- Key outcomes measured
- Reported findings
- Study limitations noted by authors

This structured extraction across 23 papers in a single view was the feature Elena found most valuable. Scanning the extractions took 20 minutes and let her triage the papers into three piles:

- Clearly relevant and high quality: 8 papers
- Potentially relevant, needing a closer look: 7 papers
- Low relevance or methodological concerns: 8 papers

She was not making final quality judgments from Elicit's summaries — that would require reading the actual papers. But the triage was informed. Several papers were cohort studies with very short follow-up periods; Elicit's extraction made this visible across all papers simultaneously rather than requiring her to open each one to check.
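The triage logic is simple enough to sketch in code. The example below is hypothetical: the field names, thresholds, and rules are illustrative assumptions, not Elicit's actual export schema or Elena's exact criteria. It shows the general pattern of flagging short-follow-up cohort studies across all papers at once rather than one at a time.

```python
# Hypothetical triage over extracted paper metadata.
# Field names and thresholds are illustrative assumptions only.

def triage(paper: dict) -> str:
    """Bucket a paper into high / medium / low relevance."""
    strong_designs = {"RCT", "cohort study"}
    # Short follow-up undermines the time-horizon question.
    short_followup = paper.get("followup_months", 0) < 12
    if paper["study_type"] in strong_designs and not short_followup:
        return "high"
    if paper["study_type"] in strong_designs or paper.get("uses_validated_burnout_scale"):
        return "medium"
    return "low"

papers = [
    {"study_type": "RCT", "followup_months": 12},
    {"study_type": "cohort study", "followup_months": 3},
    {"study_type": "survey", "uses_validated_burnout_scale": True},
    {"study_type": "survey", "followup_months": 1},
]
buckets: dict[str, list[dict]] = {}
for p in papers:
    buckets.setdefault(triage(p), []).append(p)
```

The point of encoding the rules, even informally, is that the same criteria get applied to every paper, which is what made the cross-study follow-up pattern visible.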

She added one more Elicit query: "Studies of negative outcomes or unintended consequences of AI clinical documentation tools." This was deliberate — she wanted to avoid ending up with only the evidence supporting a favorable conclusion. This query returned 6 additional papers covering concerns about documentation quality changes, physician over-reliance on AI documentation, and one study on EHR error rates in AI-documented notes.

Phase 3: Reading and Analysis (3 hours)

Elena read 15 papers in full — the 8 high-quality relevant papers plus 7 from the potentially relevant pile that remained after closer review. She read the 6 negative outcomes papers selectively, focusing on methodology and key findings.

Across this reading, she kept notes organized around the framing she had developed in Phase 1:

Documentation burden evidence: Robust. Multiple well-designed studies showed 40-60% reductions in documentation time with ambient AI tools. This finding was consistent and credible.

Physician satisfaction: Positive but with important nuance. Initial satisfaction was high (typically 80%+ reporting the tool as positive). Studies with 12+ month follow-up showed some decline, often attributed to workload backfill — physicians seeing more patients as documentation time freed up, often without compensating reductions in other demands.

Burnout metrics specifically: Limited evidence. Most studies measured satisfaction and documentation time, not validated burnout scales (like the Maslach Burnout Inventory). The studies that did use validated measures showed smaller effects and more mixed results than the satisfaction studies suggested.

Unintended consequences: Real but manageable. Documentation quality concerns were the most frequent, with AI-generated notes sometimes missing clinical nuances, temporal relationships, or physician-specific reasoning. EHR error rates were higher in AI-documented notes in one study — a finding worth flagging.

Phase 4: Synthesis with AI Assistance (60 minutes)

With her reading complete, Elena used Claude to help structure the synthesis.

Her prompt:

I've completed a literature review on AI clinical documentation tools and physician wellbeing.
I want to synthesize this into a 3-4 page section for a strategy report for a hospital client.

Key findings from my research:

1. Documentation burden reduction: strong evidence (40-60% time reduction, multiple robust studies)
2. Initial physician satisfaction: high (80%+ positive) but moderates over 12+ months
3. Validated burnout metrics: limited evidence, smaller effects than satisfaction suggests
4. Workload backfill risk: documented in multiple 12-month studies
5. Documentation quality concerns: some evidence of reduced clinical nuance in AI notes
6. EHR error risk: one study found higher error rates in AI-documented notes

My client is a hospital system evaluating deployment of AI documentation tools.
They want honest assessment of the evidence, not just the positive case.

Structure this synthesis as:
- What the evidence clearly shows
- What the evidence suggests with less certainty
- Important uncertainties and gaps
- Implications for implementation strategy

Maintain an analytically rigorous, consulting-appropriate tone.

Claude produced a well-structured draft that organized the evidence in the three-tier structure she requested and captured the key nuances. She edited it for approximately 40 minutes — adjusting several framings, adding specific study citations she wanted to include, and softening one conclusion that Claude had stated more definitively than the evidence supported.

The Final Product and What It Did

The literature synthesis section was 2,100 words. It took Elena approximately 6 hours from starting Consensus to having an edited synthesis — versus her estimate of 2-3 days using traditional search.

The quality difference compared to what she typically produces on research-heavy questions was real:

- Broader coverage: she found 23 papers versus the 8-10 she typically finds with keyword search alone
- More systematic: the structured extraction made it easier to notice patterns across papers (the 12-month follow-up finding was visible only because she could compare time horizons across studies simultaneously)
- More balanced: the deliberate "negative outcomes" search query produced papers she likely would not have found through standard search

The hospital's CMO — a physician with direct experience of AI documentation tools — reviewed the synthesis and specifically noted that the workload backfill finding and the burnout-versus-satisfaction distinction were the most clinically important points and had been understated in the vendor presentations the system had received.

What Elena Learned About These Research Tools

Consensus first, Elicit second. Consensus is faster for getting initial orientation and testing specific hypotheses. Elicit is more powerful for comprehensive paper analysis. The right order: Consensus to understand the landscape, Elicit for systematic deep-dive.

Deliberately search for disconfirming evidence. Her negative outcomes query was the single most important methodological choice she made. Research synthesis without a deliberate effort to find opposing evidence produces a skewed picture. The tools make it easy to only search for what confirms your prior view — you have to deliberately work against that tendency.

Elicit summaries are scaffolding, not conclusions. Several times, Elicit's paper summaries slightly mischaracterized findings. She found these discrepancies when reading the papers. The summaries were right enough to usefully triage but not reliable enough to cite without verification.

AI-assisted synthesis needs expert judgment for framing. The most important analytical contribution in this project — the three-way distinction between documentation burden, satisfaction, and validated burnout metrics — was not something the tools provided. It emerged from her reading and domain reasoning. The tools accelerated the retrieval and processing. She supplied the analytical framework.

The comparison matters: Elena estimated that the AI-assisted workflow saved approximately 10-12 hours compared to traditional literature review. The output was also more comprehensive. For research-intensive consulting work, she has adopted this as her standard workflow for any evidence synthesis task.

The Workflow as a Repeatable Template

Based on this project, Elena documented a repeatable four-phase workflow for AI-assisted literature synthesis:

Phase 1 (30-60 min): Scoping with Consensus
Run 2-4 specific questions to understand the evidence landscape. Refine your questions based on what you find. Identify the key distinctions that matter.

Phase 2 (60-120 min): Systematic retrieval with Elicit
Run comprehensive queries. Include at least one query explicitly seeking disconfirming evidence. Triage results into high/medium/low relevance.

Phase 3 (variable): Careful reading
Read high-relevance papers fully. Verify Elicit summaries against the actual papers. Take structured notes organized around your analytical framework.

Phase 4 (60-90 min): AI-assisted synthesis
Provide Claude or a similar assistant with your structured notes and key findings. Request a specific synthesis structure. Edit and verify the output against your notes.
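Phase 4 can be made repeatable by templating the synthesis prompt from the structured notes. The sketch below is a hypothetical helper, not part of any tool's API: the function name, note format, and audience parameter are assumptions, and the section headings mirror the four-part structure Elena requested.

```python
# Hypothetical helper that turns structured reading notes into a
# synthesis prompt using the four-part structure from Phase 4.

SECTIONS = [
    "What the evidence clearly shows",
    "What the evidence suggests with less certainty",
    "Important uncertainties and gaps",
    "Implications for implementation strategy",
]

def build_synthesis_prompt(topic: str, findings: list[str], audience: str) -> str:
    numbered = "\n".join(f"{i}. {f}" for i, f in enumerate(findings, 1))
    structure = "\n".join(f"- {s}" for s in SECTIONS)
    return (
        f"I've completed a literature review on {topic}.\n"
        f"Key findings from my research:\n\n{numbered}\n\n"
        f"My audience: {audience}. They want an honest assessment "
        "of the evidence, not just the positive case.\n\n"
        f"Structure this synthesis as:\n{structure}\n\n"
        "Maintain an analytically rigorous, consulting-appropriate tone."
    )

prompt = build_synthesis_prompt(
    "AI clinical documentation tools and physician wellbeing",
    ["Documentation burden reduction: strong evidence",
     "Workload backfill risk: documented in 12-month studies"],
    "a hospital system evaluating deployment",
)
```

Keeping the prompt as a template means each new evidence-synthesis task only requires swapping in the topic, findings, and audience, while the honest-assessment framing and the four-tier structure travel with the workflow.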

She shared this workflow with two junior colleagues. One adapted it immediately for a healthcare policy project. The other adapted it for a technology adoption research question in a different industry. Both reported significant time savings and improved systematic coverage compared to their traditional research approach.

This is the outcome that matters most: a workflow that is more efficient, more systematic, and produces better output than the previous approach — not because the AI replaced the analytical work, but because it compressed the retrieval and processing work enough that the practitioner could spend more time on the analysis that actually requires human judgment.