Case Study 2: MediCore Clinical Note Extraction — Structured Data from Unstructured Text

Context

MediCore Pharmaceuticals is conducting a retrospective analysis of treatment outcomes across 14 clinical sites. The study requires structured data — medication names, dosages, diagnoses, lab results, adverse events — but much of this information exists only in unstructured clinical notes written by physicians during patient visits. Manual chart review by trained abstractors costs $15-25 per note, and MediCore has 340,000 notes to process. At $20 per note, manual extraction would cost $6.8 million and take 8-12 months.

The clinical informatics team proposes using an LLM to automate the extraction. The approach is conceptually simple: feed each clinical note to the model with a structured extraction prompt, and parse the output into database-ready fields. The promise is transformative — reducing months of manual work to days of compute. The risks are equally significant: a hallucinated medication name or fabricated lab value could corrupt the study's statistical analysis.

This case study implements the extraction pipeline, evaluates it against expert annotations, and quantifies the failure modes that make LLM-based clinical extraction both powerful and dangerous.

The Extraction Task

Each clinical note must yield a structured record with the following fields:

Field          Type             Example
-------------  ---------------  -------------------------------------------------
Symptoms       List[str]        ["persistent cough", "low-grade fever", "fatigue"]
Vital signs    Dict[str, str]   {"temperature": "100.2F", "BP": "128/82"}
Diagnoses      List[str]        ["community-acquired pneumonia"]
Medications    List[Dict]       [{"name": "azithromycin", "dose": "500mg", "frequency": "daily", "duration": "5 days"}]
Lab results    List[Dict]       [{"test": "WBC", "value": "12.4", "unit": "K/uL", "flag": "high"}]
Imaging        List[str]        ["CXR: bilateral infiltrates"]
Follow-up      str              "1 week"

Prompt Design

The extraction prompt uses a few-shot strategy with two annotated examples followed by the target note. Chain-of-thought is avoided for this task: the team found that instructing the model to "think step by step" increases the risk of the model inferring diagnoses not explicitly stated in the note (a hallucination pattern specific to clinical reasoning).

import json
import re
from typing import Dict, List, Any, Optional
from dataclasses import dataclass, field


@dataclass
class ClinicalExtraction:
    """Structured extraction from a clinical note."""
    symptoms: List[str] = field(default_factory=list)
    vital_signs: Dict[str, str] = field(default_factory=dict)
    diagnoses: List[str] = field(default_factory=list)
    medications: List[Dict[str, str]] = field(default_factory=list)
    lab_results: List[Dict[str, str]] = field(default_factory=list)
    imaging: List[str] = field(default_factory=list)
    follow_up: str = ""


def build_extraction_prompt(note: str) -> str:
    """Construct a few-shot extraction prompt for clinical notes.

    Uses two annotated examples to demonstrate the expected output
    format, followed by the target note.

    Args:
        note: The clinical note to extract from.

    Returns:
        Formatted prompt string.
    """
    return f"""You are a clinical data extraction system. Extract structured information from the clinical note below. Return ONLY a JSON object with the specified fields. Do NOT infer diagnoses or information not explicitly stated in the note.

Example 1:
Note: "45M presents with chest pain x 2 hours, radiating to left arm. BP 145/92, HR 98. ECG shows ST elevation in leads II, III, aVF. Troponin I elevated at 2.3 ng/mL (normal <0.04). Started on aspirin 325mg, heparin drip, nitroglycerin SL. Cardiology consulted for emergent cath."

Extraction:
{{
  "symptoms": ["chest pain radiating to left arm, 2 hours duration"],
  "vital_signs": {{"BP": "145/92", "HR": "98"}},
  "diagnoses": [],
  "medications": [
    {{"name": "aspirin", "dose": "325mg", "frequency": "once", "route": "oral"}},
    {{"name": "heparin", "dose": "drip", "frequency": "continuous", "route": "IV"}},
    {{"name": "nitroglycerin", "dose": "not specified", "frequency": "as needed", "route": "sublingual"}}
  ],
  "lab_results": [
    {{"test": "Troponin I", "value": "2.3", "unit": "ng/mL", "flag": "elevated", "reference": "<0.04"}}
  ],
  "imaging": ["ECG: ST elevation in leads II, III, aVF"],
  "follow_up": "emergent cardiac catheterization"
}}

Example 2:
Note: "72F with Type 2 DM, here for routine follow-up. A1c improved to 7.1% from 8.3%. Fasting glucose 132. Continues metformin 1000mg BID. Added jardiance 10mg daily. BP well-controlled on lisinopril 20mg daily. Recheck A1c in 3 months."

Extraction:
{{
  "symptoms": [],
  "vital_signs": {{}},
  "diagnoses": ["Type 2 diabetes mellitus"],
  "medications": [
    {{"name": "metformin", "dose": "1000mg", "frequency": "BID", "route": "oral"}},
    {{"name": "jardiance", "dose": "10mg", "frequency": "daily", "route": "oral"}},
    {{"name": "lisinopril", "dose": "20mg", "frequency": "daily", "route": "oral"}}
  ],
  "lab_results": [
    {{"test": "A1c", "value": "7.1", "unit": "%", "flag": "improved", "reference": "previous 8.3%"}},
    {{"test": "fasting glucose", "value": "132", "unit": "mg/dL", "flag": ""}}
  ],
  "imaging": [],
  "follow_up": "recheck A1c in 3 months"
}}

Now extract from this note:
Note: "{note}"

Extraction:"""


# Test notes for the extraction pipeline
test_notes = [
    (
        "55M presents with persistent cough x 3 weeks, low-grade fever (100.2F), "
        "and fatigue. No hemoptysis. PMH: HTN, hyperlipidemia. CXR shows bilateral "
        "infiltrates. WBC 11.8 K/uL. Started on azithromycin 500mg day 1, then "
        "250mg days 2-5. Continue home meds: lisinopril 10mg daily, atorvastatin "
        "20mg daily. Follow-up in 1 week, sooner if worsening."
    ),
    (
        "38F c/o intermittent headaches x 2 months, worse in morning, associated "
        "with nausea. Neuro exam nonfocal. BP 118/76, HR 72. MRI brain ordered. "
        "Start sumatriptan 50mg PRN for acute episodes, max 2 doses/day. Headache "
        "diary recommended. Return in 4 weeks or sooner if new neurological symptoms."
    ),
    (
        "62M with known CKD stage 3b, here for nephrology follow-up. Cr 2.1 "
        "(baseline 1.8), GFR 32. K+ 5.1. BP 142/88 on amlodipine 10mg and "
        "losartan 100mg. Added sodium bicarbonate 650mg TID for metabolic acidosis. "
        "Low potassium diet counseling. Recheck labs in 6 weeks."
    ),
]

for i, note in enumerate(test_notes):
    prompt = build_extraction_prompt(note)
    print(f"Note {i+1} prompt length: {len(prompt.split())} words")
Note 1 prompt length: 461 words
Note 2 prompt length: 441 words
Note 3 prompt length: 444 words

Simulated Extraction and Validation

In production, the prompt is sent to an LLM API. Here, we simulate extractions and build the validation pipeline that catches errors before they enter the study database.
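
Between the API response and the database sits a parsing step. A minimal, defensive sketch of that step (`parse_extraction` is an illustrative helper, not part of MediCore's actual pipeline): models sometimes wrap the JSON in markdown fences or add surrounding prose, so locating the outermost braces is more robust than decoding the raw string.

```python
import json
from typing import Any, Dict, Optional


def parse_extraction(raw_output: str) -> Optional[Dict[str, Any]]:
    """Parse a model response into a dictionary, or None on failure.

    Locates the outermost JSON object by brace position, so stray
    prose or markdown fences around the JSON are tolerated. Returns
    None rather than raising, so callers can route parse failures
    to human review instead of crashing the batch.
    """
    start, end = raw_output.find("{"), raw_output.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(raw_output[start:end + 1])
    except json.JSONDecodeError:
        return None
```

A response like `Extraction:\n{"symptoms": ["cough"]}` parses cleanly, while malformed output yields `None` and is queued for review.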

def simulate_llm_extraction(note: str, seed: int = 42) -> Dict[str, Any]:
    """Simulate an LLM extraction with realistic error patterns.

    Demonstrates the types of errors LLMs make on clinical notes:
    - Correct extractions (majority of cases)
    - Hallucinated diagnoses (inferred, not stated)
    - Missed medications (especially complex dosing schedules)
    - Fabricated lab values (combining values from different notes)

    Args:
        note: Clinical note text.
        seed: Random seed (unused; the simulated extractions are deterministic).

    Returns:
        Simulated extraction as a dictionary.
    """
    # Simulated extraction for note 1 (with realistic errors)
    if "persistent cough" in note:
        return {
            "symptoms": ["persistent cough, 3 weeks", "low-grade fever", "fatigue"],
            "vital_signs": {"temperature": "100.2F"},
            "diagnoses": ["community-acquired pneumonia"],  # HALLUCINATION: not stated
            "medications": [
                {"name": "azithromycin", "dose": "500mg then 250mg",
                 "frequency": "daily", "duration": "5 days"},
                {"name": "lisinopril", "dose": "10mg", "frequency": "daily",
                 "route": "oral"},
                {"name": "atorvastatin", "dose": "20mg", "frequency": "daily",
                 "route": "oral"},
            ],
            "lab_results": [
                {"test": "WBC", "value": "11.8", "unit": "K/uL", "flag": "elevated"},
            ],
            "imaging": ["CXR: bilateral infiltrates"],
            "follow_up": "1 week",
        }
    elif "headaches" in note:
        return {
            "symptoms": ["intermittent headaches, 2 months, worse in morning",
                         "nausea"],
            "vital_signs": {"BP": "118/76", "HR": "72"},
            "diagnoses": [],
            "medications": [
                {"name": "sumatriptan", "dose": "50mg", "frequency": "PRN",
                 "route": "oral", "max_dose": "2 doses/day"},
            ],
            "lab_results": [],
            "imaging": ["MRI brain: ordered"],
            "follow_up": "4 weeks",
        }
    else:
        return {
            "symptoms": [],
            "vital_signs": {"BP": "142/88"},
            "diagnoses": ["CKD stage 3b", "metabolic acidosis",
                          "hyperkalemia"],  # HALLUCINATION: hyperkalemia not diagnosed
            "medications": [
                {"name": "amlodipine", "dose": "10mg", "frequency": "daily",
                 "route": "oral"},
                {"name": "losartan", "dose": "100mg", "frequency": "daily",
                 "route": "oral"},
                {"name": "sodium bicarbonate", "dose": "650mg", "frequency": "TID",
                 "route": "oral"},
            ],
            "lab_results": [
                {"test": "creatinine", "value": "2.1", "unit": "mg/dL",
                 "flag": "elevated", "reference": "baseline 1.8"},
                {"test": "GFR", "value": "32", "unit": "mL/min", "flag": "low"},
                {"test": "potassium", "value": "5.1", "unit": "mEq/L",
                 "flag": "elevated"},
                {"test": "bicarbonate", "value": "18", "unit": "mEq/L",
                 "flag": "low"},  # FABRICATION: not in note
            ],
            "follow_up": "6 weeks",
        }


class ExtractionValidator:
    """Validate LLM extractions against the source note.

    Implements heuristic checks that catch common LLM failure modes
    without requiring expert annotations. These checks flag suspicious
    extractions for human review rather than silently accepting them.
    """

    # Known medication names for validation
    COMMON_MEDICATIONS = {
        "aspirin", "metformin", "lisinopril", "atorvastatin", "amlodipine",
        "losartan", "azithromycin", "amoxicillin", "omeprazole", "levothyroxine",
        "sumatriptan", "jardiance", "heparin", "nitroglycerin", "sodium bicarbonate",
        "metoprolol", "furosemide", "warfarin", "gabapentin", "prednisone",
    }

    def validate(
        self, extraction: Dict[str, Any], source_note: str
    ) -> Dict[str, List[str]]:
        """Run validation checks on an extraction.

        Args:
            extraction: The LLM's structured extraction.
            source_note: The original clinical note.

        Returns:
            Dictionary mapping check names to lists of warnings.
        """
        warnings: Dict[str, List[str]] = {
            "medication_verification": [],
            "lab_value_verification": [],
            "diagnosis_verification": [],
            "completeness": [],
        }

        note_lower = source_note.lower()

        # Check 1: Verify medications appear in source note
        for med in extraction.get("medications", []):
            med_name = med.get("name", "").lower()
            if med_name and med_name not in note_lower:
                warnings["medication_verification"].append(
                    f"Medication '{med_name}' not found in source note"
                )

        # Check 2: Verify lab values appear in source note
        for lab in extraction.get("lab_results", []):
            value = lab.get("value", "")
            if value and value not in source_note:
                warnings["lab_value_verification"].append(
                    f"Lab value '{value}' for {lab.get('test', 'unknown')} "
                    f"not found in source note — possible fabrication"
                )

        # Check 3: Flag diagnoses not explicitly mentioned
        for dx in extraction.get("diagnoses", []):
            dx_lower = dx.lower()
            # Simple keyword check on whitespace tokens; punctuation is not
            # stripped, so trailing periods and commas reduce the overlap.
            # Production would use clinical NER and concept normalization.
            dx_words = set(dx_lower.split())
            note_words = set(note_lower.split())
            overlap = dx_words & note_words
            if len(overlap) <= len(dx_words) * 0.5:
                warnings["diagnosis_verification"].append(
                    f"Diagnosis '{dx}' may be inferred rather than explicitly "
                    f"stated (only {len(overlap)}/{len(dx_words)} words found)"
                )

        # Check 4: Completeness — look for common patterns missed
        # Check for medication-like patterns in note not in extraction
        med_pattern = re.compile(
            r'\b(\w+)\s+(\d+\s*mg|\d+\s*mcg|\d+\s*units)', re.IGNORECASE
        )
        note_meds = set()
        for match in med_pattern.finditer(source_note):
            candidate = match.group(1).lower()
            if candidate in self.COMMON_MEDICATIONS:
                note_meds.add(candidate)

        extracted_meds = {
            med.get("name", "").lower()
            for med in extraction.get("medications", [])
        }
        missed = note_meds - extracted_meds
        if missed:
            warnings["completeness"].append(
                f"Possible missed medications: {', '.join(missed)}"
            )

        return warnings


# Run extraction and validation
validator = ExtractionValidator()

for i, note in enumerate(test_notes):
    print(f"\n{'='*60}")
    print(f"Note {i+1}")
    print(f"{'='*60}")

    extraction = simulate_llm_extraction(note, seed=i)
    warnings = validator.validate(extraction, note)

    print(f"Extracted {len(extraction.get('medications', []))} medications, "
          f"{len(extraction.get('lab_results', []))} lab results, "
          f"{len(extraction.get('diagnoses', []))} diagnoses")

    total_warnings = sum(len(w) for w in warnings.values())
    if total_warnings == 0:
        print("  PASS: No validation warnings")
    else:
        print(f"  WARNING: {total_warnings} issue(s) found:")
        for category, msgs in warnings.items():
            for msg in msgs:
                print(f"    [{category}] {msg}")
============================================================
Note 1
============================================================
Extracted 3 medications, 1 lab results, 1 diagnoses
  WARNING: 1 issue(s) found:
    [diagnosis_verification] Diagnosis 'community-acquired pneumonia' may be inferred rather than explicitly stated (only 0/2 words found)

============================================================
Note 2
============================================================
Extracted 1 medications, 0 lab results, 0 diagnoses
  PASS: No validation warnings

============================================================
Note 3
============================================================
Extracted 3 medications, 4 lab results, 3 diagnoses
  WARNING: 3 issue(s) found:
    [lab_value_verification] Lab value '18' for bicarbonate not found in source note — possible fabrication
    [diagnosis_verification] Diagnosis 'metabolic acidosis' may be inferred rather than explicitly stated (only 1/2 words found)
    [diagnosis_verification] Diagnosis 'hyperkalemia' may be inferred rather than explicitly stated (only 0/1 words found)

Quantifying Extraction Quality

The team evaluates the LLM extraction pipeline against 500 notes with expert annotations. Metrics are computed per field at the entity level (exact match for medications, partial match for symptoms).

def compute_extraction_metrics(
    predicted: List[str], ground_truth: List[str]
) -> Dict[str, float]:
    """Compute precision, recall, and F1 for extracted entities.

    Uses exact string matching (case-insensitive). In production,
    use fuzzy matching or clinical concept normalization (RxNorm
    for medications, SNOMED CT for diagnoses).

    Args:
        predicted: List of extracted entity strings.
        ground_truth: List of annotated entity strings.

    Returns:
        Dictionary with precision, recall, and F1.
    """
    pred_set = {s.lower().strip() for s in predicted}
    truth_set = {s.lower().strip() for s in ground_truth}

    if len(pred_set) == 0 and len(truth_set) == 0:
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0}
    if len(pred_set) == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    if len(truth_set) == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    true_positives = len(pred_set & truth_set)
    precision = true_positives / len(pred_set)
    recall = true_positives / len(truth_set)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

    return {"precision": precision, "recall": recall, "f1": f1}
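
A worked example of the entity-level matching, with hypothetical medication lists and the arithmetic inlined so each quantity is visible:

```python
pred = {"azithromycin", "lisinopril", "ibuprofen"}      # ibuprofen: false positive
truth = {"azithromycin", "lisinopril", "atorvastatin"}  # atorvastatin: missed

tp = len(pred & truth)        # 2 true positives
precision = tp / len(pred)    # of what was extracted, how much is right
recall = tp / len(truth)      # of what was there, how much was found
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}")  # P=0.67  R=0.67  F1=0.67
```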


# Simulated aggregate results across 500 notes
field_metrics = {
    "Medications (name)":  {"precision": 0.94, "recall": 0.89, "f1": 0.91},
    "Medications (dose)":  {"precision": 0.88, "recall": 0.85, "f1": 0.86},
    "Lab results":         {"precision": 0.91, "recall": 0.87, "f1": 0.89},
    "Diagnoses":           {"precision": 0.82, "recall": 0.78, "f1": 0.80},
    "Symptoms":            {"precision": 0.86, "recall": 0.81, "f1": 0.83},
    "Vital signs":         {"precision": 0.96, "recall": 0.93, "f1": 0.94},
}

print(f"{'Field':<25} {'Precision':>10} {'Recall':>10} {'F1':>10}")
print("-" * 55)
for field_name, metrics in field_metrics.items():
    print(f"{field_name:<25} {metrics['precision']:>10.2f} {metrics['recall']:>10.2f} {metrics['f1']:>10.2f}")
Field                      Precision     Recall         F1
-------------------------------------------------------
Medications (name)              0.94       0.89       0.91
Medications (dose)              0.88       0.85       0.86
Lab results                     0.91       0.87       0.89
Diagnoses                       0.82       0.78       0.80
Symptoms                        0.86       0.81       0.83
Vital signs                     0.96       0.93       0.94

Lessons Learned

  1. Extraction is easier than inference, and the distinction matters. The LLM performs well at extracting information explicitly stated in the note (medication names, vital signs, lab values) but poorly at inferring diagnoses that require clinical reasoning. The validation pipeline's most important job is catching inferred-but-not-stated diagnoses — which constitute the most dangerous failure mode for downstream analysis.

  2. Hallucinated lab values are rare but catastrophic. In the 500-note evaluation, the LLM fabricated a lab value in 2.3% of notes. Each fabrication is a data corruption event that could bias the study's statistical conclusions. The string-matching validator catches most fabrications (if the value does not appear in the source note, it was fabricated), but edge cases remain: the model might report a creatinine of "2.1" when the note says "12.1" (the shorter string matches as a substring and passes the check), or attach a value that genuinely appears in the note to the wrong test, corrupting the data either way.
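
     The substring weakness in Check 2 is easy to demonstrate with hypothetical values; a digit-boundary pattern is one possible hardening:

```python
import re

note = "Cr 12.1 (baseline 1.8), GFR 28."
reported = "2.1"  # model reports 2.1; the note actually says 12.1

# The naive membership check passes, because "2.1" is a substring of "12.1":
print(reported in note)  # True

# A digit-boundary pattern rejects partial numeric matches:
pattern = r"(?<!\d)" + re.escape(reported) + r"(?!\d)"
print(bool(re.search(pattern, note)))  # False
```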

  3. Complex dosing schedules are the hardest extraction target. "Azithromycin 500mg day 1, then 250mg days 2-5" requires the model to parse a multi-step dosing protocol into structured fields. The LLM frequently simplifies this to "500mg daily" or "250mg daily," losing the loading-dose structure. For the study's purposes, this information loss may be acceptable — but it must be documented.
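
     One way to avoid that information loss is to represent such regimens as a list of phases rather than a single dose string. The schema below is illustrative, not the study's actual data model:

```python
# "Azithromycin 500mg day 1, then 250mg days 2-5" as a phased regimen,
# preserving the loading dose instead of collapsing it to one dose string
regimen = {
    "name": "azithromycin",
    "route": "oral",
    "phases": [
        {"dose": "500mg", "frequency": "daily", "days": "1"},
        {"dose": "250mg", "frequency": "daily", "days": "2-5"},
    ],
}

print(regimen["phases"][0]["dose"])  # 500mg (the loading dose survives)
```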

  4. Cost analysis favors LLMs with human review, not LLMs alone. At $0.01 per note (API cost for a 500-token note with a GPT-4-class model), the 340,000-note extraction costs approximately $3,400, compared to $6.8 million for manual extraction. But the LLM output requires human review. If reviewers check every flagged note (approximately 15% are flagged by the validator), the review cost is 51,000 notes at $5 per review = $255,000. Total cost: ~$258,000, a 96% reduction from fully manual extraction, with comparable accuracy on the fields that matter most for the study.
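
     The cost arithmetic can be reproduced as a quick check; the per-note rates are the estimates stated above:

```python
notes = 340_000
manual_cost = notes * 20             # fully manual abstraction at $20/note
llm_cost = notes * 0.01              # API cost at ~$0.01/note
review_cost = int(notes * 0.15) * 5  # 15% flagged, reviewed at $5/note
total = llm_cost + review_cost

print(f"LLM + review: ${total:,.0f}")               # $258,400
print(f"Reduction: {1 - total / manual_cost:.0%}")  # 96%
```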

  5. Regulatory implications are nonnegotiable. For FDA submission-quality data, LLM-extracted data requires documented validation against a "gold standard" human annotation on a representative sample. The team's 500-note evaluation serves this purpose, but the sample must be stratified by note complexity, clinical site, and physician writing style. The validation report becomes a regulatory submission artifact.