Chapter 12: Multimodal Prompting: Working with Images, Documents, and Data

Beyond Text

The default image of AI interaction is a chat interface: you type, the AI responds. This is still the most common mode, but it is increasingly the least interesting one.

Modern AI tools can accept images, PDFs, spreadsheets, code files, and increasingly audio and video. They can read a photograph of a whiteboard and extract the content. They can analyze a 300-page report and answer specific questions about it. They can look at a screenshot of a bug and describe what's wrong. They can compare two product designs side by side and give detailed design feedback.

This range of capabilities is not merely a convenience — it opens up entirely new workflows that text-only AI cannot support. A competitive analysis that used to require manually reviewing dozens of screenshots can be partially automated. A dense technical report that used to require hours of careful reading can be queried like a database. A financial dataset that used to require Excel expertise can be analyzed through natural language.

But these capabilities come with important nuances. Each input type has specific strengths, specific limitations, and specific prompting strategies that unlock its potential. What works for a text prompt does not automatically translate to a document prompt or an image prompt. This chapter covers each major modality with the specificity it requires.


1. What "Multimodal" Means

"Multimodal" refers to AI systems that accept multiple types of input — text, images, documents, audio, video — rather than text alone. A multimodal model can process information from different modalities and reason across them.

The progression of AI capabilities has been roughly:

  • Text-only (early LLMs): input and output are text
  • Text + image (vision models, circa 2023): input can include images; output is still text
  • Text + document (document understanding): input can include PDFs and structured documents
  • Text + data (data analysis capabilities): input can include spreadsheets and structured data, with code execution for analysis
  • Text + audio/video (emerging): input can include audio transcription and video frames

Not all platforms support all modalities. The platform comparison in this chapter will help you understand what's available where.

The Prompting Implication

When you add a non-text input to a prompt, you are giving the AI a new source of information to reason with. The prompting challenge is telling the AI what to do with that information — specifically and usefully.

A prompt with an image attached that says "what do you think?" is not a multimodal prompt in any useful sense. A prompt that says "analyze this screenshot of our competitor's pricing page for the specific elements they use to create urgency, and compare each element to our own pricing page as I've described it" — that is a multimodal prompt. The modality is doing useful work.


2. Image Inputs: Vision Prompting

What Vision Models Can Do

Modern vision-capable AI models (GPT-4o in ChatGPT, Claude 3.x, Gemini, and others) can:

  • Describe and analyze visual content: What is in the image, how is it laid out, what is its overall design style
  • Read text in images: OCR-like capability — can read printed text, handwritten text (with some accuracy), and text overlays in photos
  • Analyze diagrams and charts: Can interpret flowcharts, organizational charts, bar charts, line graphs (though quantitative precision varies)
  • Evaluate design quality: Can apply design principles to critique UI/UX, marketing materials, presentation slides
  • Identify objects and scenes: Can identify products, logos, people's roles (but not identities), settings, and contexts
  • Compare multiple images: Can identify differences, commonalities, and relative quality between uploaded images
  • Analyze screenshots: Can read interface states, error messages, layout issues, and UI content
  • Process whiteboard photos: Can read handwritten content and transcribe brainstorming sessions

What Vision Models Cannot Do Reliably

Understanding limitations prevents frustration and misplaced trust:

Counting precisely: Vision models frequently miscount items in images, especially when items overlap, are small, or number more than about 10.

Fine text in images: Very small text, text at odd angles, low-contrast text, or highly stylized fonts may be misread or missed entirely. Critical accuracy for financial figures or legal text in images requires human verification.

Spatial reasoning: Complex questions about relative positions ("is the blue box to the left of or above the red box?") or precise measurements are unreliable. Vision models understand approximate spatial relationships, not precise ones.

Facial recognition: Most public AI tools deliberately do not identify specific individuals from photographs. They can describe general characteristics but not name people.

Image manipulation history: Cannot determine if an image has been digitally altered, when it was created (beyond visible indicators), or what was in the original before editing.

Color differentiation: In some cases, distinguishing between very similar colors (particularly in low-quality images or at scale) can produce errors.

Prompt Structure for Image Analysis

Structure your image prompts in three parts:

  1. Context — what the image is and why you're analyzing it
  2. Specific questions or analysis tasks — exactly what you want to know or assess
  3. Output format — how you want the analysis structured

Example structure:

[CONTEXT: what this image is and what you're trying to accomplish]

Please analyze this image for:
1. [Specific question or criterion 1]
2. [Specific question or criterion 2]
3. [Specific question or criterion 3]

Format your response as: [list / table / sections with headers / etc.]
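
When you run the same analysis repeatedly, this three-part structure can be assembled programmatically. A minimal sketch — the helper name and example values are illustrative, not any platform's API:

```python
def build_image_prompt(context, questions, output_format):
    """Assemble the three-part image prompt: context, questions, format.

    A hypothetical helper mirroring the template above, not any
    platform's API.
    """
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return (
        f"{context}\n\n"
        f"Please analyze this image for:\n{numbered}\n\n"
        f"Format your response as: {output_format}"
    )

prompt = build_image_prompt(
    "This is a screenshot of a competitor's pricing page.",
    ["Which elements create urgency?",
     "How is the recommended tier highlighted?"],
    "a table",
)
print(prompt)
```

The same helper can be reused across a batch of screenshots, so every image gets the full context-questions-format treatment rather than an ad hoc "what do you think?"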

Full Image Prompt Template

This is [DESCRIPTION OF IMAGE TYPE — e.g., "a screenshot of a competitor's
product page" or "a photograph of a whiteboard from our planning session"].

Context: [WHY I'M ANALYZING THIS — what I'm trying to understand or accomplish]

Please [ANALYSIS TASK]:
- [Specific thing 1 to identify/evaluate/extract]
- [Specific thing 2]
- [Specific thing 3]

[OPTIONAL: Compare this to [REFERENCE POINT — e.g., "our own design" or "industry standard practice"]]
[OPTIONAL: Ignore [ELEMENT TO EXCLUDE — e.g., "the company logo area" or "any stock photography"]]

Output format: [DESIRED FORMAT]

Use Cases and Examples

Diagram Analysis:

This is a network architecture diagram from our current system documentation.
Context: I'm assessing whether this architecture meets our new high-availability requirements.

Please analyze:
1. Single points of failure — identify any components where failure would take down the system
2. Redundancy — where is redundancy present, and where is it absent?
3. Scalability bottlenecks — which components would be constrained first under 10x load?

Format as a bulleted list under each heading.

Screenshot Troubleshooting:

This is a screenshot of an error message in our application. Context: user reported they
cannot complete checkout.

Please:
1. Read and transcribe the error message exactly as shown
2. Based on the error text, what are the most likely causes of this error?
3. What information should our support team gather from this user to diagnose it?

Note: read the error text carefully — the exact wording matters for diagnosis.

Design Feedback:

This is a mockup of our new product packaging design. Context: preparing for
a consumer focus group and want to anticipate key questions.

Analyze from the perspective of a first-time consumer encountering this product
in a grocery store:
1. What is the first element that catches the eye? Is that optimal for conversion?
2. Can the product purpose be understood in under 3 seconds from the front panel?
3. What questions might a consumer have after reading the visible text?
4. On a scale of 1-5, how professional does this look relative to CPG industry standards?
   Justify your rating with specific observations.

Alex's Scenario: Analyzing Competitor Ads

Alex needs to systematically review competitor advertising for her quarterly competitive brief. She has collected 12 screenshots of competitor digital ads — banner ads, social posts, and display ads. Rather than manually cataloging each one, she uploads them to Claude (which she uses for vision tasks due to its strong image analysis capabilities) with this prompt:

These are screenshots of [COMPETITOR NAME]'s recent digital advertising.
Context: I'm compiling our Q3 competitive brief and need to understand their
current campaign themes and messaging strategy.

For each ad in this batch, please provide:
1. Primary message/claim (what is the ad saying?)
2. Target audience signal (who is this ad aimed at, based on imagery, copy, and context)
3. Call to action (what is the viewer being asked to do?)
4. Emotional appeal (what feeling is the ad trying to create?)
5. Any offers or promotions visible

After analyzing all ads, provide:
- The 2-3 dominant themes across this ad set
- Any notable differences from their previous messaging (I'll add this context)
- How their current positioning compares to ours: [2-3 sentences on our positioning]

Format: Individual ad analyses in a table, followed by the synthesis section.

The output gives her a structured catalog of 12 ads in about 3 minutes of AI processing, compared to 45 minutes of manual review and note-taking. Her role: review for accuracy, add the competitive context that AI cannot provide, and draw the strategic implications.


3. Document Inputs: PDFs and Word Files

Uploading vs. Pasting: Trade-offs

When your source is a text document, you have two main options: upload the file or paste the text.

| Factor | Uploading the File | Pasting the Text |
|---|---|---|
| Convenience | High — drag and drop | Low — requires extraction and formatting |
| Format preservation | High — maintains tables, headers, layout | Medium — formatting often breaks in paste |
| Accuracy | Medium — OCR errors possible in scanned PDFs | High — you control exactly what text is included |
| Privacy | Lower — full document sent to provider | Higher — you control what is sent |
| Context efficiency | Lower — entire document included | Higher — can include only relevant sections |
| Long document handling | Platform-dependent | Can be chunked and managed |

Best practices:

  • Upload PDFs for most use cases — the convenience and format preservation outweigh the minor accuracy risk for typical documents
  • Paste text for scanned documents (PDF OCR can be unreliable for complex layouts), for sensitive content where you want to control exactly what is sent, or when you only need a specific section
  • For very long documents (over 50-100 pages), see the long document strategies section below

Prompt Structure for Document Analysis

I'm sharing [DOCUMENT DESCRIPTION — type, purpose, source].

[CONTEXT: why I'm analyzing this, what I'll do with the output]

Please [SPECIFIC TASK]:
[Task details in numbered or bulleted list]

[OPTIONAL: Focus on [SECTION OR ASPECT] and skip [IRRELEVANT SECTIONS]]
[OPTIONAL: For this task, assume I am [YOUR ROLE/CONTEXT]]

Output format: [DESIRED FORMAT]

Multi-Document Comparison

One of the most powerful document prompting capabilities is comparing multiple documents. This is especially useful for:

  • Comparing proposals or bids on the same criteria
  • Finding contradictions between policy documents
  • Comparing versions of a document (before/after review)
  • Synthesizing multiple research sources on the same topic

Multi-document comparison template:

I'm sharing [NUMBER] documents: [brief description of each].

Context: [WHY I'm comparing these]

Please compare them on these criteria:
1. [CRITERION 1]
2. [CRITERION 2]
3. [CRITERION 3]

Format: A table with documents as columns and criteria as rows.
After the table, provide: [2-3 key synthesis observations / a recommendation /
any significant contradictions between documents]

Note: If a criterion is not addressed in a document, write "Not addressed" — do
not infer or assume.

Extraction Tasks with Documents

Documents are an excellent source for structured extraction using the Extractor pattern from Chapter 11. The key rules apply: specify exactly what to extract, what format to use, and what to write when information is absent.

Document extraction template:

Extract the following information from this [DOCUMENT TYPE]:

For each [ITEM TYPE — e.g., "contract clause" or "recommendation"], provide:
- [FIELD 1]: [description]
- [FIELD 2]: [description]
- [FIELD 3]: [description]

Rules:
- Only extract what is explicitly stated in the document
- If a field is not present, write "Not in document"
- Page number reference for each extracted item (if visible)

[DOCUMENT]
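
If you consume the extraction programmatically, the "Not in document" rule can also be enforced on the model's output before it reaches a table or report. A small sketch with illustrative field names:

```python
def normalize_extraction(record, fields, missing="Not in document"):
    """Fill absent or empty fields in an extracted record with the
    sentinel the prompt's rules require, so downstream tables stay
    consistent. Field names are illustrative, not a fixed schema.
    """
    return {f: (record.get(f) or missing) for f in fields}

fields = ["clause_name", "obligation", "deadline"]
raw = {"clause_name": "Termination", "obligation": "90-day notice"}
clean = normalize_extraction(raw, fields)
print(clean)
```

This catches the common failure where the model silently omits a field instead of writing the sentinel you asked for.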

Long Document Strategies

Documents over 50-100 pages require specific strategies because:

  • They may exceed context window limits
  • Even within context limits, attention quality decreases across very long documents
  • Long documents contain much irrelevant content for any specific question

Strategy 1: Question-and-Answer Mode

For specific questions, formulate targeted Q&A prompts rather than asking for a general summary:

I'm sharing a [PAGE COUNT]-page [DOCUMENT TYPE]. I have specific questions about it.

Please answer ONLY from the document's content — if information is not in the document,
say so. Do not supplement with outside knowledge.

Questions:
1. [SPECIFIC QUESTION — cite the section if possible]
2. [SPECIFIC QUESTION]
3. [SPECIFIC QUESTION]

Strategy 2: Section-by-Section Summarization with Index Building

Process the document in sections, building a master index:

This is Section [X] of a [PAGE COUNT]-page report. I'm processing it section
by section to build a reference index.

For this section, please provide:
- Section title and page range
- Key findings or claims (bulleted, under 3 sentences each)
- Important data points, statistics, or figures
- Questions raised or topics that appear unresolved

I'll compile these into a master index and then ask targeted questions.
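
Strategy 2 presupposes splitting the document into sections. If you have the text, a simple chunker that breaks only at paragraph boundaries will do; a sketch, with an arbitrary size limit you would tune to your platform's context window:

```python
def split_into_sections(text, max_chars=8000):
    """Split a long document into chunks of roughly max_chars,
    breaking only at blank-line (paragraph) boundaries so no section
    is cut mid-paragraph. The default limit is a placeholder; size
    chunks to your platform's context window.
    """
    sections, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            sections.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        sections.append(current)
    return sections

# Synthetic 40-paragraph document for illustration
doc = "\n\n".join(f"Paragraph {i} " + "x" * 500 for i in range(40))
chunks = split_into_sections(doc, max_chars=4000)
```

Each chunk then becomes one "This is Section [X]" prompt, and the per-section outputs are compiled into the master index.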

Strategy 3: Provide the Table of Contents First

For navigating a long document, provide the table of contents first and ask for guidance:

Here is the table of contents for a [PAGE COUNT]-page report on [TOPIC]:
[TOC]

I need to [GOAL — e.g., "find all recommendations that affect our IT infrastructure"
or "answer the question of whether our current approach is supported by this research"].

Based on the table of contents, which sections should I prioritize sharing with you,
and in what order, to answer this question most efficiently?

Elena's Scenario: The 200-Page Report

Elena receives a 200-page regulatory impact assessment from a government agency. Her client needs to understand which specific regulations will affect their operations and what changes they will need to make. Elena has 4 hours before the client call.

Her workflow:

Step 1 (5 minutes): Upload the table of contents and ask:

Here is the TOC of a 200-page regulatory impact assessment.
My client is a [INDUSTRY] company with [SPECIFIC OPERATIONAL PROFILE].
Which sections are most likely to contain regulations affecting their operations?
Rank them by likely relevance to my client's context.

Step 2 (15 minutes): Upload the 4 high-priority sections (approximately 60 pages total). Ask:

These are the high-priority sections of the regulatory impact assessment.
My client context: [detailed description].

Extract all regulations that:
1. Apply to companies in [SPECIFIC INDUSTRY] with [CHARACTERISTICS]
2. Require operational or process changes
3. Have compliance deadlines in the next 24 months

For each regulation: [regulatory name / requirement / deadline / operational implication]

Step 3 (10 minutes): Ask follow-up questions on the extracted regulations:

Based on the regulations you extracted, which 3-5 represent the most significant
operational changes for a company with this profile? Explain your ranking.
What are the dependencies between these changes?

Step 4 (30 minutes): Elena reviews the output with her domain expertise, adds client-specific context, identifies the two regulations where she wants to check the document text directly, and builds the client briefing from the AI-structured foundation.

Total time from 200-page document to client-ready briefing: approximately 3 hours — versus the 8-10 hours this would have taken without AI document analysis.


4. Spreadsheet and Data Inputs

Pasting CSV Data

The simplest approach to data input is pasting CSV (comma-separated values) data directly into the prompt. This works for:

  • Small to medium datasets (up to a few hundred rows before readability degrades)
  • Quick analysis requests where you want to describe findings in natural language
  • Tasks that don't require statistical computation (AI is not reliably accurate at arithmetic on large datasets)

CSV paste template:

Here is data from [SOURCE/CONTEXT] in CSV format:

[PASTE CSV DATA]

Please [ANALYSIS REQUEST]:
- [Specific question 1]
- [Specific question 2]
- [Specific question 3]

Note: [Any relevant context about the data — what the columns mean,
what the data represents, any known issues]

Important limitation: AI language models are not reliable calculators. For anything requiring precise arithmetic on more than a few rows, use ChatGPT Advanced Data Analysis (which executes real code) or Excel/Python rather than asking an LLM to compute from pasted data.
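
One way to respect this limitation is to compute the numbers locally and hand the model only verified results for interpretation. A sketch using Python's standard library, with made-up data:

```python
import csv
import io
import statistics

# Compute locally, then let the LLM interpret the verified summary.
# The CSV content below is invented for illustration.
raw = """month,revenue
Jan,12000
Feb,13500
Mar,11800
"""
rows = list(csv.DictReader(io.StringIO(raw)))
revenues = [float(r["revenue"]) for r in rows]

summary = (
    f"Total revenue: {sum(revenues):.0f}; "
    f"mean: {statistics.mean(revenues):.1f}; "
    f"max month: {max(rows, key=lambda r: float(r['revenue']))['month']}"
)
print(summary)
```

The `summary` string — not the raw table — then goes into the prompt, so the model is interpreting numbers that were actually computed, not estimating them.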

ChatGPT Advanced Data Analysis

ChatGPT's Advanced Data Analysis (formerly Code Interpreter) capability allows you to upload spreadsheet files and ask for analysis that is actually computed via Python execution, not estimated by language model inference. This is a qualitatively different capability from standard text analysis.

With Advanced Data Analysis you can:

  • Upload Excel (.xlsx), CSV, or other structured files
  • Ask for statistical summaries, regression analysis, correlation matrices
  • Request visualizations (histograms, scatter plots, time series, etc.)
  • Perform data cleaning and transformation
  • Ask follow-up questions, with code re-executed to answer each

Advanced Data Analysis template:

I'm uploading a [FILE TYPE] containing [DESCRIPTION OF DATA — what it represents,
time period, key variables].

[ANALYSIS CONTEXT: what business question this data should answer]

Please:
1. First, describe the data — dimensions, data types, any obvious quality issues
2. Then analyze: [SPECIFIC ANALYSIS REQUEST]
3. Create a visualization showing: [WHAT YOU WANT TO SEE]

For all statistical claims, show the calculation method so I can verify.

What to Ask For With Data

Effective data prompts specify the type of analysis, not just the question:

| Instead of... | Ask for... |
|---|---|
| "What's interesting about this data?" | "Identify the top 3 outliers by [metric] and describe what makes each notable" |
| "Analyze this data" | "Calculate the month-over-month growth rate for each category and identify which categories are growing fastest and slowest" |
| "What does this mean?" | "Based on this conversion funnel data, identify the step with the highest drop-off rate and hypothesize 3 causes" |

5. Code Inputs

When to Use Code as Input

Code is a particularly powerful input type because it is already structured and precisely defined. The most productive code-as-input use cases are:

Code review: Paste code and ask for review with specific criteria (security, performance, readability, test coverage). Use the Critic pattern from Chapter 11 with a role appropriate to your language and context.

Code explanation: Ask the model to explain what code does, at whatever level of detail is appropriate (line-by-line, function-level, architectural).

Debugging: Paste code plus error output or symptom description. Use the CoT debugging approach from Chapter 10 Case Study 1.

Refactoring: Paste code and specify the target state ("refactor this to use dependency injection" or "convert this from callback style to async/await").

Documentation generation: Paste code and ask for docstrings, comments, README sections, or API documentation.

Structuring Code Prompts

[LANGUAGE/CONTEXT: language, framework, relevant versions]
[TASK: what you want done]

Specific requirements:
- [REQUIREMENT 1]
- [REQUIREMENT 2]

[OPTIONAL: What the code is supposed to do (if not obvious from the code)]
[OPTIONAL: What error or behavior is occurring]

```[LANGUAGE]
[PASTE CODE]
```

Critical rule for code debugging: Always paste the actual error output, not a description of it. The error message, stack trace, and line numbers are information the model needs. "I'm getting an error" is much less useful than pasting the actual error.
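
When you control the failing code, this rule can even be automated: run the snippet, capture the real traceback, and embed it in the prompt. A sketch — exec'ing snippets like this is only appropriate for code you wrote yourself:

```python
import traceback

def build_debug_prompt(code_snippet, language="python"):
    """Run the snippet and capture the real traceback, so the prompt
    carries the actual error text rather than a paraphrase of it.
    """
    try:
        exec(code_snippet, {})
        error = "No error raised."
    except Exception:
        error = traceback.format_exc()
    return (
        f"Language: {language}\n"
        f"Task: explain why this code fails and propose a fix.\n\n"
        f"Code:\n{code_snippet}\n\n"
        f"Actual error output:\n{error}"
    )

prompt = build_debug_prompt("totals = {}\nprint(totals['missing'])")
```

The resulting prompt contains the genuine `KeyError` traceback, line numbers included, instead of a lossy human summary.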


6. Audio and Video (Where Available)

Current State

As of early 2026, audio and video input capabilities are available in some AI tools but not yet standard:

Audio transcription + analysis: Several tools (including ChatGPT's Voice mode and dedicated transcription tools like Whisper) can transcribe audio. Once transcribed, the text can be analyzed with standard text prompting techniques.

Video: Gemini 1.5 Pro and some other models accept video input and can analyze frames, transcribe audio, and describe visual content. This capability is powerful but expensive in tokens and compute.

The practical implication for most users: The most reliable current workflow for audio/video is:
1. Transcribe audio using a dedicated transcription tool (Whisper, Otter.ai, Descript, or native platform features)
2. Use the resulting transcript as a document input with standard document prompting techniques

This workflow is more reliable, more auditable, and allows you to review the transcript for errors before analysis.

Transcription + Analysis Workflow

I'm sharing a transcript from [CONTEXT — meeting, podcast, interview, etc.].

Note on transcript quality: [Any known issues — background noise, multiple speakers, technical jargon that may have been misheard]

Please [ANALYSIS TASK]: [Specific request]

For any quote you include in your response, please include the approximate time code (if available) or a brief surrounding context marker so I can verify against the original.



7. Mixed-Modal Prompts

The most powerful multimodal prompts combine multiple input types:

Text + Image:

[Image attached: screenshot of our website's checkout flow]
[Text: description of the checkout conversion problem]

Using both the screenshot and the context I've described, identify the 3-5 elements in the checkout flow that are most likely contributing to our conversion drop-off.
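
When calling a vision model through an API rather than a chat interface, the text and the image travel together in one message. A sketch of an OpenAI-style payload — the content-array shape follows OpenAI's vision input format, and the image bytes and wording here are placeholders:

```python
import base64

def build_vision_message(text_prompt, image_bytes, mime="image/png"):
    """Pair a text prompt with an image in one OpenAI-style chat
    message. Image bytes are base64-encoded into a data URL; check
    your platform's current docs for the exact shape it expects.
    """
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text_prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }

msg = build_vision_message(
    "Identify the 3-5 checkout-flow elements most likely causing drop-off.",
    b"\x89PNG...",  # placeholder bytes, not a real screenshot
)
```

The key point carries over from the chat examples: the text part supplies the context and the specific question, so the image is never sent with a bare "analyze this."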


Document + Targeted Question:

[PDF attached: competitive landscape report]

Based on this report, answer these specific questions relevant to our strategy:
1. [Question grounded in the document's content]
2. [Question grounded in the document's content]
3. [Question that requires synthesis across multiple sections]


Multiple Documents + Synthesis:

[PDF 1: our current product roadmap]
[PDF 2: competitor product announcement]

Identify any areas where the competitor announcement reveals a product direction we should reconsider on our roadmap. Be specific — cite both the competitor's announcement and the relevant section of our roadmap.


8. Platform Comparison for Modalities

| Capability | ChatGPT (GPT-4o) | Claude 3.5/3.7 | Gemini Ultra | Notes |
|---|---|---|---|---|
| Image analysis | Excellent | Excellent | Excellent | All three are comparable for typical use cases |
| PDF upload | Good | Excellent | Good | Claude tends to be strongest for long PDF analysis |
| Code analysis | Excellent | Excellent | Good | ChatGPT and Claude are strongest for code |
| Advanced data analysis | Excellent (with tool) | Limited | Good | ChatGPT's Code Interpreter is the leader |
| Long document context | Good (128K) | Excellent (200K) | Excellent (1M) | Gemini has the largest context window |
| Video input | Limited | Not available | Good | Gemini leads; all are early-stage |
| Audio input | Good (Voice mode) | Not standard | Good | Improving rapidly across all platforms |

Note: These capabilities change frequently. Check current platform documentation before designing workflows around specific capabilities.


9. Privacy Considerations with Multimodal Inputs

Multimodal inputs raise privacy stakes that text-only prompting does not:

Documents: Uploading a PDF sends the entire document — including information you may not intend to share. A contract, a financial report, or a patient record uploaded as a document for analysis exposes the full document to the AI provider's data practices.

Images: Photos can contain incidental sensitive information — faces, location metadata, screen contents, whiteboards with confidential content. Always review an image before uploading with the question: "what is in this image that I don't intend to share?"

Code: Pasting code for review may include API keys, credentials, database schemas, or proprietary algorithms. Before pasting code, remove any hardcoded credentials and review whether the code itself is sensitive intellectual property.

The enterprise question: Many enterprise AI platforms offer contractual data privacy guarantees that consumer tiers do not. If you're working with sensitive client data, proprietary business information, or anything that falls under regulatory data handling requirements (HIPAA, GDPR, etc.), confirm your organization's policies on AI tool use before uploading documents.

Best practices:

  • Redact sensitive fields before uploading documents where analysis of those fields isn't needed
  • Be aware that uploaded images include any text visible in them — review the full image, not just the area you're focusing on
  • Use enterprise-grade platforms with appropriate data handling agreements for sensitive work
  • For highly sensitive documents, consider pasting only the relevant sections rather than uploading the full document


10. Common Multimodal Mistakes

Mistake 1: Sending an image without context. "What do you think of this?" tells the AI nothing about what you need. Provide: what the image is, why you're analyzing it, and specifically what you want assessed.

Mistake 2: Trusting OCR output without verification. AI reading text in images makes errors, especially for small text, stylized fonts, or low-contrast content. Any critical number, name, or claim read from an image should be verified against the original document or image.

Mistake 3: Uploading an entire large document when you need one section. Uploading a 200-page document to answer a question that appears in a 5-page section wastes context window and often produces worse results than uploading the relevant section with focused context.

Mistake 4: Using language-model reasoning for precise data computation. Pasting CSV data and asking "what is the sum of column X?" is unreliable. Use dedicated data analysis tools (Python, Excel, or ChatGPT's Code Interpreter) for computation. Use AI for interpretation, pattern identification, and synthesis.

Mistake 5: Uploading sensitive documents to consumer AI tools. Consumer-tier AI platforms typically include uploaded content in their data practices. Know your platform's privacy terms before uploading anything confidential.

Mistake 6: Not iterating on multimodal prompts. First attempts at multimodal analysis often need refinement. If the initial analysis misses what you needed, clarify: "That analysis was helpful, but I specifically need to understand [X]. Looking at the document again, can you focus on..."


11. Research Breakdown: Multimodal AI Capabilities

Vision model performance: Modern vision language models (GPT-4V, Claude 3, Gemini Pro Vision) achieve near-human performance on many standard visual understanding benchmarks, including document understanding, chart analysis, and natural scene recognition. Performance on tasks requiring precise counting, spatial reasoning, and fine-detail text recognition remains notably weaker than human performance.

Document understanding: Research on long-document understanding models shows that performance degrades for information in the middle of very long documents (the "lost in the middle" finding, Liu et al., 2023). Information at the beginning and end of documents is processed more reliably than information in the middle sections of very long inputs. This supports the section-by-section strategy for long documents rather than uploading them whole.

Cross-modal reasoning: A key research finding is that multimodal models perform best when the text prompt and the non-text input are clearly aligned — when the text explains what the image/document is and what analysis is needed. Open-ended multimodal prompts ("analyze this") consistently underperform specific, structured multimodal prompts.


Content Blocks

💡 Intuition: The Modality Is Just Another Information Source

Thinking about multimodal prompting becomes simpler when you remember that the image, document, or dataset is just a new source of information to reason with. Your job in the prompt is exactly what it is for text-only prompting: tell the AI what the information is, why you need to analyze it, and what specific output you want. The modality changes what information is available; the prompting principles stay the same.


⚠️ Common Pitfall: The "Look at This" Prompt

The single most common multimodal mistake is attaching an image or document and writing "analyze this" or "what do you think?" without context. Without knowing what the image is, who needs the analysis, and what specific questions to answer, the AI produces a generic description that is almost never what you needed. Always add the three elements: what it is, why you're analyzing it, and specifically what you want.


✅ Best Practice: Verify Critical Information from Images

AI-read text in images is accurate most of the time — but not all the time. For any number, name, date, or critical claim read from an image, verify against the original source. This is especially important for financial data, legal text, or any content where a single character error could be consequential. The more stylized or small the text, the higher the error rate.


🎭 Scenario Walkthrough: Elena's Three-Hour Report Sprint

Elena's workflow for a 200-page regulatory report: (1) Upload TOC and ask which sections to prioritize (5 min), (2) Upload priority sections and extract structured regulatory requirements (15 min), (3) Ask follow-up questions for prioritization and dependencies (10 min), (4) Apply her expertise to the AI-structured output and build the client briefing (30 min). Total: approximately 3 hours for a task that would typically take 8-10. The AI handles the structured extraction; she handles the strategic interpretation.


📋 Action Checklist: Multimodal Prompting Readiness

- [ ] Know which platforms you use that support image inputs
- [ ] Know which platforms support PDF upload and their context limits
- [ ] For data work: confirm whether you have access to ChatGPT Advanced Data Analysis or equivalent
- [ ] Review your organization's data privacy policy for AI document uploads
- [ ] Build your first image analysis prompt using the three-part template (what, why, what specifically)
- [ ] Test your most common document type with a document analysis prompt


🗣️ Script/Template: All-Purpose Multimodal Prompt

I'm sharing [DESCRIPTION OF INPUT: what it is, where it's from].

Context: [WHY I'm analyzing this — what decision or task this supports]

Please analyze for:
1. [SPECIFIC QUESTION OR CRITERIA 1]
2. [SPECIFIC QUESTION OR CRITERIA 2]
3. [SPECIFIC QUESTION OR CRITERIA 3]

[OPTIONAL: Focus especially on [PRIORITY AREA]]
[OPTIONAL: Do not speculate beyond what is visible/stated in the input]

Format: [DESIRED OUTPUT FORMAT]


⚠️ Common Pitfall: Trusting AI Data Computation

Language models are not reliable calculators. When you paste tabular data and ask for sums, averages, or statistical analysis, you are relying on the model to reason about numbers — not execute computation. The results may look correct but be wrong by small amounts, especially for larger datasets. Use ChatGPT Advanced Data Analysis or dedicated tools for any computation that must be accurate. Use LLMs for interpretation, pattern identification, and synthesis of data, not for the computations themselves.


Summary

Multimodal prompting expands what AI tools can do for you — from text analysis to visual analysis, document processing, data interpretation, and code review. The capabilities are substantial and practical, but they work best when your prompts are specific about what the input is, why you're analyzing it, and what you want to know.

The core principles remain the same as text-only prompting: context, specificity, and clear output specification. What changes is the type of input that provides the context, and the capabilities and limitations that apply to each input type.

Chapter 13, the final chapter of Part 2, closes the loop: when any type of AI output goes wrong — text, multimodal, or anything else — here is how to diagnose why and fix it systematically.