Chapter 12 Quiz: Multimodal Prompting


Question 1

What are the three essential elements of an effective image analysis prompt?

A) Image quality, lighting description, and resolution specification
B) Context (what the image is and why you're analyzing it), specific questions or analysis tasks, and output format
C) The AI model version, the image file format, and the analysis deadline
D) A description of the image, a comparison to a reference standard, and a requested improvement

Answer: **B**. The three-part structure for image prompts is: (1) context: what the image is and why you're analyzing it; (2) specific questions or analysis tasks: exactly what you want to know or assess about it; and (3) output format: how you want the analysis structured. Without context, the AI has no frame for what matters. Without specific questions, it produces generic descriptions. Without a format specification, the output may not be usable.
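
As an illustration of the three-part structure, here is a minimal sketch of a prompt builder. The function name `build_image_prompt` and the sample wording are assumptions made for this example; they do not come from the chapter.

```python
def build_image_prompt(context: str, tasks: list[str], output_format: str) -> str:
    """Assemble a three-part image analysis prompt (hypothetical helper)."""
    # Refuse to build a prompt that is missing any of the three elements.
    if not context.strip() or not tasks or not output_format.strip():
        raise ValueError("context, tasks, and output_format are all required")
    numbered = "\n".join(f"{i}. {task}" for i, task in enumerate(tasks, start=1))
    return (
        f"Context: {context}\n\n"
        f"Tasks:\n{numbered}\n\n"
        f"Output format: {output_format}"
    )


# Illustrative values only.
prompt = build_image_prompt(
    context="Screenshot of our checkout page, reviewed for a conversion audit.",
    tasks=[
        "Identify elements that could distract from the purchase button.",
        "Assess whether the pricing is easy to find.",
    ],
    output_format="A bulleted list of issues, each with a suggested fix.",
)
print(prompt)
```

The resulting text would be sent alongside the uploaded image; the helper only enforces that none of the three parts is left out.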

Question 2

A vision model is asked to count the number of items in a busy market photograph that contains approximately 40 individual products on a shelf. What is the most likely outcome?

A) The model will count accurately because vision models are designed for precise counting
B) The model will refuse to count items in photographs
C) The model will produce an inaccurate count; reliable counting beyond about 10 overlapping or similar items is a known limitation of vision models
D) The model will count accurately but will be very slow

Answer: **C**. Precise counting of more than about 10 similar or overlapping items is a known limitation of current vision models. The model may produce a count that seems reasonable (e.g., "approximately 35-40"), but it cannot be relied upon for exact counts in complex visual scenes. This is one of the specific capability limitations covered in the chapter. For precise inventory counting, dedicated computer vision tools designed for that task are more appropriate.

Question 3

When should you paste text from a document rather than uploading the document file?

A) Always; pasting is always more reliable than uploading
B) Never; uploading preserves formatting that pasting destroys
C) When the document is scanned (risk of OCR errors), when you need to control exactly what text is sent (privacy), or when you only need a specific section
D) Only when the document is longer than 50 pages

Answer: **C**. Pasting text is preferable to uploading when the document is scanned and OCR may introduce errors, when you want to control exactly what information is sent to the AI provider (privacy), or when you only need a specific section of a longer document (context efficiency). For most standard digital documents where privacy is not a concern, uploading is more convenient and preserves formatting.

Question 4

What is the "lost in the middle" finding, and why does it affect long document prompting strategy?

A) AI models lose their formatting standards halfway through long responses
B) Research shows that information in the middle of very long documents is processed less reliably than information at the beginning and end, which supports section-by-section strategies over whole-document upload
C) AI models often stop reading long documents before reaching the end
D) Long documents cause AI models to forget the user's original question

Answer: **B**. The "lost in the middle" finding (Liu et al., 2023) shows that models use information less reliably when it sits in the middle of a very long input than when it sits near the beginning or end, even when the input fits within the context window. This directly supports the section-by-section processing strategy for long documents: rather than uploading a 200-page document and asking a general question, it is more reliable to upload the relevant sections and ask targeted questions about them.

Question 5

Elena needs to extract regulatory requirements from a 200-page government report for a client briefing in 4 hours. What is the recommended approach?

A) Upload the full 200-page document and ask for a comprehensive summary
B) Upload the table of contents first, identify priority sections, upload those sections for structured extraction, then apply expert judgment to the AI-organized output
C) Read the document manually and use AI only for formatting the final output
D) Ask the AI to generate what a typical regulatory report would contain, without uploading the document

Answer: **B**. Elena's scenario in the chapter illustrates the recommended approach: (1) upload the table of contents and ask which sections are most relevant to the client's situation, (2) upload the priority sections and run a structured extraction, (3) ask follow-up questions on the extracted content, and (4) apply expert judgment to the AI-structured output. This targeted approach is both faster and more accurate than uploading the full document.
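
The structured-extraction step can be sketched as one targeted prompt per priority section. Everything in this example is hypothetical: the section titles, the sample text, the client description, and the helper name `build_extraction_prompts` are placeholders, and in practice the priority list would come from the table-of-contents step rather than being hard-coded.

```python
# Hypothetical report sections keyed by table-of-contents title.
report_sections = {
    "1. Executive Summary": "Overview of the report's scope and findings.",
    "4. Licensing Requirements": "Operators must renew licenses every two years.",
    "7. Reporting Obligations": "Incidents must be reported within 72 hours.",
    "9. Historical Background": "The agency was established by statute.",
}


def build_extraction_prompts(sections, priority_titles, client_context):
    """Build one structured-extraction prompt per priority section."""
    prompts = []
    for title in priority_titles:
        if title not in sections:
            raise KeyError(f"Unknown section title: {title}")
        prompts.append(
            f"Client context: {client_context}\n"
            f"Section: {title}\n"
            "Task: List each regulatory requirement in this section, with "
            "the obligation, who it applies to, and any deadline.\n"
            "Output format: a table with columns Requirement, Applies to, "
            "Deadline.\n\n"
            f"Section text:\n{sections[title]}"
        )
    return prompts


# Assumed outcome of the table-of-contents step.
priority = ["4. Licensing Requirements", "7. Reporting Obligations"]
prompts = build_extraction_prompts(
    report_sections, priority, "Regional logistics operator"
)
print(f"{len(prompts)} targeted prompts built")
```

Only the selected sections are ever sent, which keeps each request short and leaves the low-priority material out of the context entirely.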

Question 6

Why should you NOT rely on an AI language model for precise data computation when working with pasted CSV data?

A) Language models cannot read CSV format
B) Language models are unreliable calculators: they reason about numbers rather than executing computation, and can produce plausible-looking but incorrect arithmetic, especially on larger datasets
C) CSV data contains formatting characters that confuse language models
D) Language models always round numbers, which introduces inaccuracy

Answer: **B**. Language models predict token sequences probabilistically; they do not execute arithmetic operations. When asked to compute sums, averages, or statistical measures from pasted data, they may produce plausible-looking numbers that are wrong. The solution is to use ChatGPT Advanced Data Analysis (which executes actual Python code) or dedicated tools like Excel for computation, and to use language models for interpretation, pattern identification, and synthesis.
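
To make the contrast concrete, here is a minimal sketch of what executing the computation looks like, using only the Python standard library. The CSV contents and the `column_stats` helper are made up for illustration.

```python
import csv
import io
import statistics

# Made-up sample data standing in for a pasted CSV.
csv_text = """region,revenue
North,1200.50
South,980.25
East,1430.00
West,1105.75
"""


def column_stats(text: str, column: str) -> dict:
    """Compute count, sum, and mean of a numeric CSV column by executing code."""
    rows = csv.DictReader(io.StringIO(text))
    values = [float(row[column]) for row in rows]
    return {
        "count": len(values),
        "sum": sum(values),
        "mean": statistics.mean(values),
    }


stats = column_stats(csv_text, "revenue")
print(stats)
```

The computed figures can then be pasted into a prompt that asks the model to interpret them, so the model never has to do the arithmetic itself.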

Question 7

What makes ChatGPT Advanced Data Analysis qualitatively different from standard language model data analysis?

A) It has access to a larger training dataset with more numerical examples
B) It actually executes Python code to perform computations, producing mathematically accurate results rather than probabilistic estimates
C) It can connect to external databases in real time
D) It uses a specialized numerical reasoning model rather than a language model

Answer: **B**. Advanced Data Analysis (formerly Code Interpreter) is different because it actually runs Python code to answer your data questions. When you ask "what is the average of column X?", it doesn't reason about what the average might be; it computes the value by executing Python. The results are therefore calculated rather than probabilistically estimated. It can also create visualizations, perform statistical tests, and handle data transformations that pure language models cannot.

Question 8

Alex needs to analyze 12 competitor ad screenshots for a quarterly competitive brief. What is the most effective approach?

A) Describe each ad in text and ask AI to analyze the descriptions
B) Upload all 12 images with a structured prompt that specifies what to extract from each ad (primary message, target audience, CTA, emotional appeal) and requests a synthesis section at the end
C) Have a human analyst review the ads, then have AI format the findings
D) Use image recognition software to tag objects in each ad, then analyze the tags

Answer: **B**. The chapter's scenario demonstrates Alex's approach: upload the images with a structured prompt that specifies exactly what to extract from each ad (using the multi-image analysis template), and request both individual ad analyses and a synthesis of themes across the full set. This structured approach ensures consistency across all 12 ads and enables the synthesis that gives the brief its competitive intelligence value.

Question 9

When working with code as input for debugging, what is the most important information to include alongside the code?

A) The programming language version history
B) The names of all libraries and frameworks being used
C) The actual error output, stack trace, and line numbers, rather than a description of what the error is
D) Comments explaining what each function is supposed to do

Answer: **C**. The actual error output, including the exact error message, the full stack trace, and line numbers, is the most important supplementary information for code debugging prompts. "I'm getting an error" is dramatically less useful than the complete pasted error output. The error text contains the information the model needs to narrow down the cause: the error type, the exact line, and the call chain that led there. Describing the error in natural language loses this precision.
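
One way to capture the exact error output for a debugging prompt is Python's standard `traceback` module. In this sketch, `parse_price` is a deliberately fragile toy function and `capture_error_report` is a hypothetical helper; both are invented for the example.

```python
import traceback


def parse_price(raw: str) -> float:
    """Toy function with a bug: it cannot handle '12,50 USD'."""
    return float(raw.replace("$", ""))


def capture_error_report(func, *args) -> str:
    """Run func and return the full traceback text if it raises, else ''."""
    try:
        func(*args)
    except Exception:
        # format_exc() returns the same text Python would print to stderr.
        return traceback.format_exc()
    return ""


report = capture_error_report(parse_price, "12,50 USD")

debug_prompt = (
    "The function below fails on some inputs. Here is the full error output.\n\n"
    f"{report}\n"
    "Explain the cause and suggest a fix."
)
print(debug_prompt)
```

The captured text names the exception type, the failing line, and each frame in the call chain, which is the information a paraphrase such as "it crashes on some prices" leaves out.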

Question 10

What is the primary privacy risk of uploading a document to a consumer AI platform?

A) The document may be indexed by search engines
B) The AI may use factual claims from the document in future responses to other users
C) The entire document, including information you did not intend to share, is sent to and stored by the AI provider under their data practices, which consumer tiers typically do not contractually protect
D) Documents can only be read once before they are deleted from the system

Answer: **C**. When you upload a document to a consumer AI platform, the entire document is sent to the provider and is subject to their data handling practices. Consumer-tier subscriptions typically do not include the contractual data privacy guarantees that enterprise tiers provide. For sensitive documents (client data, financial information, proprietary research, PII), this is a significant risk. The recommended practices are to redact sensitive fields before uploading, paste only the relevant sections, or use an enterprise platform with appropriate data agreements.
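
As a rough sketch of redacting before sending, the snippet below masks two kinds of sensitive fields with regular expressions. The patterns (email addresses and US-style Social Security numbers) and the sample sentence are illustrative assumptions; they are far from a complete PII filter, and redacted text should still be reviewed by a person before it is shared.

```python
import re

# Illustrative patterns only; real documents need broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace matched email addresses and SSN-shaped numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = SSN.sub("[SSN]", text)
    return text


sample = "Contact jane.doe@example.com, SSN 123-45-6789, about the renewal."
redacted = redact(sample)
print(redacted)
```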

Question 11

What does "cross-modal reasoning" mean in the context of multimodal AI, and why does prompt specificity matter for it?

A) The AI using information from one language to answer in another language
B) The AI integrating information from multiple input types (e.g., an image AND the text context) to produce a response that neither input alone could support
C) The AI switching between different reasoning modes (creative vs. analytical)
D) The ability to convert images to text and text to images

Answer: **B**. Cross-modal reasoning is when the AI integrates information from different input types, for example using both a screenshot (visual information) and your text description of a conversion goal (contextual information) to produce analysis that requires both. Research shows that open-ended multimodal prompts ("analyze this") consistently underperform specific, structured prompts because specific prompts more effectively direct the model to integrate across the modalities. The more clearly you tell the AI which aspects of each input matter for your question, the better the cross-modal reasoning.

Question 12

Why is the platform comparison for multimodal capabilities important for professionals building AI workflows?

A) All platforms have identical multimodal capabilities; the comparison is mainly for marketing purposes
B) Different platforms have different capability strengths (e.g., Gemini has the largest context window, ChatGPT leads on data analysis with code execution, Claude tends to be strongest for long PDF analysis), and designing workflows on one platform's capability may not transfer to another
C) Platform comparison is mainly relevant to enterprise buyers, not individual users
D) The capabilities are changing so fast that any comparison is immediately obsolete

Answer: **B**. The platform comparison matters because capabilities genuinely differ. A workflow designed around ChatGPT's Advanced Data Analysis code execution doesn't transfer to a platform or plan that lacks code execution. A workflow designed around Claude's 200K token context window won't work on a platform with a 16K limit. Because feature sets change quickly, professionals building AI workflows need to check which platform they're designing for and whether their tool dependencies are platform-specific.

Question 13

What is the recommended workflow for analyzing audio or video content with current AI tools, and why?

A) Upload video files directly to vision models, which can analyze both audio and visual content simultaneously
B) Transcribe audio using a dedicated transcription tool first, then analyze the resulting text using standard document prompting techniques; this is more reliable, auditable, and allows transcript error review before analysis
C) Ask the AI to imagine what was said in the meeting based on a description of the participants and topic
D) Wait until audio/video AI capabilities mature further before attempting this type of analysis

Answer: **B**. The recommended workflow is to transcribe first using a dedicated transcription tool (Whisper, Otter.ai, etc.), then analyze the transcript using standard document prompting techniques. This approach is more reliable than direct audio/video processing because (1) transcription quality can be reviewed and corrected before analysis, (2) transcript-based analysis is more accurate than audio/video frame analysis for most content types, and (3) the text transcript can be stored and re-analyzed without reprocessing the audio.

Question 14

A prompt that says "Here are two vendor proposals. Tell me which one is better." is likely to produce a poor multimodal document analysis. What should be changed?

A) Add more exclamation points to emphasize urgency
B) Specify the comparison criteria, define what "better" means for your specific context and decision, ask for a structured comparison format, and request that the AI note any significant risks or uncertainties
C) Remove one of the two documents so the AI only needs to analyze one at a time
D) Ask for the comparison in a different language to improve model performance

Answer: **B**. "Which is better" is underspecified in the same way that "analyze this" is underspecified for image prompts. "Better" by which criteria? For which audience? In what context? The improvements are to specify evaluation criteria (price, implementation timeline, vendor experience, etc.), define what matters most for your decision, request a structured comparison format (a table with criteria as rows), and ask for uncertainty flags (where is the comparison difficult to make from the documents alone?). Specificity transforms a vague question into a decision-support analysis.

Question 15

What is the key difference between asking AI to "interpret this data" vs. asking AI to "compute the average of this column"?

A) There is no meaningful difference; AI handles both types of requests equally well
B) "Interpret this data" asks for reasoning about patterns and meaning (which AI language models do well), while "compute the average" asks for arithmetic execution (which language models do unreliably and should be done with code-execution tools or spreadsheets)
C) "Interpret" requests are slower because they require more processing
D) "Compute" requests are more expensive because they use more tokens

Answer: **B**. This distinction is one of the chapter's most important practical guidelines. Language models are strong at interpretation: identifying patterns, generating hypotheses, explaining what data might mean, comparing trends. They are unreliable for computation: arithmetic on multiple rows, statistical calculations, aggregations. The solution is to use each tool for what it does well: language models for interpretation and synthesis, and code-execution tools (or spreadsheets) for computation.