Chapter 12 Further Reading: Multimodal Prompting

DataField.Dev

Affiliate disclosure

Book titles on this page link to Amazon. As an Amazon Associate, DataField.Dev earns from qualifying purchases — at no additional cost to you.

Research Papers

"Lost in the Middle: How Language Models Use Long Contexts" Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023) Stanford University

The foundational paper on long-context document processing limitations. Demonstrates that language models perform best when the relevant information sits at the beginning or end of the input, and that performance degrades for information positioned in the middle of long documents, even when the entire document fits within the context window. This finding directly motivates the chapter's section-by-section strategies for long documents. Essential reading for anyone building document analysis workflows.

Available: arxiv.org/abs/2307.03172
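
The section-by-section strategy that this paper motivates can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the chapter's own implementation: it assumes each section starts with a heading line that a regular expression can recognize (here, lines beginning with "Section" and a number), and the names `HEADING`, `split_into_sections`, and `build_prompts` are hypothetical.

```python
import re

# Hypothetical heading pattern: a line such as "Section 3: Results".
# Adjust it to match the headings in your own documents.
HEADING = re.compile(r"^Section \d+.*$", re.MULTILINE)


def split_into_sections(text):
    """Return (heading, body) pairs; text before the first heading is ignored."""
    matches = list(HEADING.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((match.group().strip(), text[start:end].strip()))
    return sections


def build_prompts(text, question):
    """Build one prompt per section so no section is buried mid-context."""
    return [
        f"{question}\n\nAnswer using only this section.\n\n{heading}\n{body}"
        for heading, body in split_into_sections(text)
    ]


document = (
    "Section 1: Background\nThe study began in 2019.\n"
    "Section 2: Methods\nWe surveyed 400 users.\n"
    "Section 3: Results\nSatisfaction rose by 12 percent.\n"
)
prompts = build_prompts(document, "What sample size was used?")
print(len(prompts))
```

Each prompt can then be sent to the model separately and the answers combined afterwards, which keeps every section near the start of its own short context.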

"GPT-4V(ision) System Card" OpenAI (2023)

OpenAI's technical documentation for GPT-4V, including detailed capability descriptions and limitations for vision tasks. The limitations section is particularly valuable: it explicitly documents known weaknesses in counting, spatial reasoning, text reading, and facial recognition. Read this alongside the chapter's capability/limitation framework for a more complete picture.

Available: openai.com/research/gpt-4v-system-card

"Claude's Vision Capabilities" Anthropic Documentation

Anthropic's technical documentation for Claude's vision capabilities, including the specific types of images and analysis it supports. Useful for understanding the differences between platform vision implementations.

Available: docs.anthropic.com/en/docs/build-with-claude/vision

"Gemini: A Family of Highly Capable Multimodal Models" Google DeepMind (2023)

The technical paper introducing Gemini's multimodal architecture. Provides both capability descriptions and quantitative benchmark results. Relevant background for the long-context section: the models described in this paper support a 32K-token context, and the 1M-token context window that arrived with the later Gemini 1.5 models represents a different approach to the long-document problem than section-by-section chunking.

Available: arxiv.org/abs/2312.11805

"Document AI: Benchmarks, Models, and Applications" Cui, L., Xu, Y., Lv, T., & Wei, F. (2021) Microsoft Research

A survey of document understanding AI capabilities and benchmarks. Provides the technical foundation for understanding how PDF and document analysis models work, including the specific challenges of complex layouts, tables, and mixed-format documents. More technical than the chapter requires but useful for practitioners who want to understand capabilities and limitations at a deeper level.

Available: arxiv.org/abs/2111.08609

Tools and Documentation

Whisper (OpenAI) https://openai.com/research/whisper

OpenAI's open-source speech recognition model. The recommended tool for transcription workflows before AI analysis. Can be used via API, through local deployment, or through interfaces that have integrated it. Particularly strong for English transcription with various accents and background noise conditions.

Otter.ai https://otter.ai

A commercial transcription service with real-time and recorded audio transcription, speaker identification, and integration with video conferencing platforms (Zoom, Teams, Meet). Good choice for professionals who need transcription as a regular workflow, not just an occasional use case.

Adobe Acrobat AI Assistant https://www.adobe.com/acrobat/ai-assistant.html

Adobe's AI-powered PDF analysis tool, integrated with Acrobat. For professionals who work heavily with PDFs, this provides document analysis capabilities with the privacy and security of Adobe's enterprise products. An alternative to uploading PDFs to general-purpose AI chat tools.

Microsoft 365 Copilot https://www.microsoft.com/en-us/microsoft-365/microsoft-copilot

For organizations using Microsoft 365, Copilot integrates AI capabilities directly into Word, Excel, PowerPoint, and Teams. Particularly relevant for document analysis within the enterprise suite where the privacy and security model is determined by the organization's Microsoft contracts.

Google Gemini in Workspace https://workspace.google.com/products/gemini

Google's AI integration within Google Workspace products (Docs, Sheets, Gmail, Drive). For organizations using Google Workspace, this provides document analysis within the existing workflow with Google's enterprise privacy terms.

Descript https://www.descript.com

A multimedia editing tool with strong transcription and audio/video analysis capabilities. Particularly useful for podcasters, content creators, and anyone who regularly works with recorded meetings and needs word-level accurate transcripts with speaker labels.

Platform-Specific Guides

ChatGPT Advanced Data Analysis Guide https://help.openai.com/en/articles/8437071-advanced-data-analysis-code-interpreter

OpenAI's official documentation for Advanced Data Analysis. Covers how to upload files, what file types are supported, and how to structure data analysis requests. Includes examples for common use cases including statistical analysis and visualization.

Claude's File Upload Documentation https://docs.anthropic.com/en/docs/build-with-claude/files

Anthropic's documentation for Claude's file handling capabilities — what file types are supported, size limits, and how file uploads work within the API and consumer interface.

Books and Practical Guides

"Data Analysis with ChatGPT" Various authors, O'Reilly Learning platform (2024)

Practical tutorials on using ChatGPT's code execution capabilities for data analysis tasks. Covers everything from basic summary statistics to regression analysis and visualization. Best for professionals who want to develop data analysis capabilities with AI tools but are not trained data scientists.

"Practical AI for Journalists and Researchers" A growing body of journalism school guides from institutions including the Reynolds School of Journalism, the Columbia Journalism School, and the Nieman Foundation.

While targeted at journalists, the techniques for document analysis, interview transcript processing, and factual verification in these guides are directly applicable to any knowledge worker dealing with large document sets. Particularly useful: techniques for verifying AI-extracted claims against source documents.
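
One simple form of this verification is to check that every passage an AI tool presents as a direct quote actually appears in the source document. The following is a minimal sketch under stated assumptions, not a method taken from the guides themselves: `normalize` and `find_unsupported_quotes` are hypothetical helper names, and the check catches only verbatim quotes, not paraphrased claims.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_unsupported_quotes(quotes, source_text):
    """Return the quotes that do not appear verbatim in the source."""
    haystack = normalize(source_text)
    return [q for q in quotes if normalize(q) not in haystack]


source = """The council approved the budget on 4 March.
Spending on road repairs will rise
by 8 percent next year."""

ai_quotes = [
    "Spending on road repairs will rise by 8 percent",  # spans a line break
    "The council rejected the budget",                  # not in the source
]

unsupported = find_unsupported_quotes(ai_quotes, source)
print(unsupported)
```

Any quote returned by the function should be traced back to the source by hand before it is used; a quote that passes the check still needs to be read in context.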

Privacy and Data Handling

"AI Tools and Data Privacy: A Practitioner's Guide" IAPP (International Association of Privacy Professionals) https://iapp.org

The IAPP provides ongoing guidance on privacy implications of AI tool use, including the specific questions organizations should answer before uploading documents to AI platforms. Essential reading for anyone making organizational decisions about AI data handling policies.

"Enterprise AI Privacy Considerations" Each major AI provider publishes enterprise data handling documentation. Key documents to review before building organizational AI document workflows:

- OpenAI Enterprise Privacy: openai.com/enterprise-privacy
- Anthropic Claude for Enterprise: anthropic.com/claude-for-enterprise
- Google Workspace AI Compliance: workspace.google.com/compliance

For regulated industries (healthcare, finance, legal, government), review these documents with your compliance team before deploying document-analysis workflows.

A Note on the Pace of Change

Multimodal AI capabilities are evolving faster than any other area covered in this book. The capability comparison table in Chapter 12 reflects the state of the platforms in early 2026; significant changes, particularly for audio and video capabilities, are likely within months.

Recommended approach to staying current:

1. Subscribe to announcement feeds for the platforms you use (the OpenAI, Anthropic, and Google DeepMind blogs).
2. When a major new capability is announced, test it on your specific workflows before fully adopting it, because announced capabilities and practical capabilities for your use cases can differ.
3. Rely on the principles in this chapter, which stay stable even as platform capabilities change: context, specificity, output format, privacy awareness, and verification habits apply regardless of which specific capabilities are available at any given time.