Chapter 18 Further Reading: Generative AI — Multimodal
Image Generation and Diffusion Models
1. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10684-10695. The foundational paper behind Stable Diffusion. Rombach et al. introduced the latent diffusion architecture that made high-quality image generation computationally feasible on consumer hardware. While technically dense, the paper's introduction and related work sections provide an excellent overview of how diffusion models evolved. Essential for readers who want to understand the technology beneath the business applications discussed in this chapter.
2. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). "Hierarchical Text-Conditional Image Generation with CLIP Latents." arXiv preprint arXiv:2204.06125. The paper describing DALL-E 2, OpenAI's breakthrough text-to-image model. Ramesh et al. explain how text embeddings (from CLIP) are used to guide image generation — the mechanism by which text prompts are translated into visual outputs. Understanding this connection between language and image models illuminates why multimodal AI works as well as it does.
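The shared embedding space Ramesh et al. rely on can be illustrated with a toy sketch: text and images are encoded into the same vector space, and cosine similarity scores how well an image matches a caption. The vectors below are invented for illustration; real CLIP embeddings are high-dimensional and come from trained encoders.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: how closely two embeddings point in the same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-d embeddings standing in for CLIP outputs.
text_emb = np.array([0.9, 0.1, 0.3])             # caption: "a red apple"
image_embs = {
    "apple_photo": np.array([0.8, 0.2, 0.25]),   # visually similar content
    "beach_photo": np.array([0.1, 0.9, 0.4]),    # unrelated content
}

scores = {name: cosine(text_emb, e) for name, e in image_embs.items()}
best = max(scores, key=scores.get)
print(best)  # the image whose embedding best aligns with the caption
```

In DALL-E 2 this alignment runs in the generative direction: a text embedding is mapped to a predicted image embedding, which then conditions the image decoder. The sketch only shows the matching principle that makes that conditioning possible.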
3. Oppenlaender, J. (2022). "The Creativity of Text-to-Image Generation." Proceedings of the 25th International Academic Mindtrek Conference, 192-202. An empirical study of how users interact with text-to-image tools and what constitutes "creativity" in the human-AI image generation process. Oppenlaender's findings are particularly relevant to the chapter's discussion of the creative industry impact: creative output quality depends more on the user's prompting skill and aesthetic judgment than on the model's technical capabilities.
Audio, Speech, and Music Generation
4. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). "Robust Speech Recognition via Large-Scale Weak Supervision." Proceedings of the 40th International Conference on Machine Learning (ICML). The paper introducing OpenAI's Whisper model for speech recognition. Whisper's approach — training on 680,000 hours of multilingual audio data — achieved near-human accuracy across languages, accents, and acoustic conditions. For business leaders evaluating speech-to-text solutions, this paper explains why the latest generation of speech recognition represents a step change in reliability.
5. Borsos, Z., Marinier, R., Vincent, D., et al. (2023). "AudioLM: A Language Modeling Approach to Audio Generation." Transactions on Machine Learning Research. AudioLM demonstrated that techniques from large language models could be applied to audio generation, producing realistic continuations of both speech and music. The paper illustrates the convergence between language modeling and audio generation that underlies modern TTS and music generation systems.
6. Lajszczak, M., Cong, J., & Li, T. (2024). "The State of Voice Cloning: Quality, Ethics, and Detection." IEEE Signal Processing Magazine, 41(3), 44-56. A comprehensive survey of voice cloning technology, covering both the technical state of the art and the ethical challenges. Particularly valuable for the chapter's discussion of voice cloning risks — the paper includes a taxonomy of malicious uses and an assessment of current detection capabilities. A strong complement to the deepfakes section.
Video Generation
7. Brooks, T., Peebles, B., Holmes, C., et al. (2024). "Video Generation Models as World Simulators." OpenAI Research. OpenAI's technical report accompanying the Sora announcement. While more of a research preview than a traditional paper, it articulates the vision of video generation models as "world simulators" — systems that learn not just visual patterns but the physics and dynamics of the visual world. The framing is aspirational, but the technical demonstrations are genuinely impressive. Read with the understanding that production capabilities lag significantly behind research demonstrations.
8. Singer, U., Polyak, A., Hayes, T., et al. (2023). "Make-A-Video: Text-to-Video Generation without Text-Video Data." Proceedings of the International Conference on Learning Representations (ICLR). Meta's approach to video generation, which leveraged existing text-to-image models and extended them to video without requiring paired text-video training data. The paper illustrates the technical challenges of maintaining temporal consistency in generated video — the key limitation discussed in the chapter.
Code Generation
9. Peng, S., Kalliamvakou, E., Cihon, P., & Demirer, M. (2023). "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot." arXiv preprint arXiv:2302.06590. The most rigorous study of GitHub Copilot's impact on developer productivity. Peng et al. conducted a randomized controlled trial with professional developers, finding a 55.8 percent increase in task completion speed. The study's methodology and nuanced discussion of where productivity gains do and do not occur make it essential reading for anyone making a business case for code generation tools.
10. Perry, N., Srivastava, M., Kumar, D., & Boneh, D. (2023). "Do Users Write More Insecure Code with AI Assistants?" Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, 2785-2799. The Stanford study referenced in the chapter, finding that developers using AI coding assistants produced more security vulnerabilities than those coding without assistance. Perry et al. hypothesize that AI-generated code may inspire false confidence, reducing the developer's inclination to scrutinize the output for security issues. Critical reading for organizations deploying code generation tools — the productivity gains are real, but so are the security risks.
11. Vaithilingam, P., Zhang, T., & Glassman, E. L. (2022). "Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models." CHI Conference on Human Factors in Computing Systems Extended Abstracts. An HCI study examining how developers actually use code generation tools — their expectations, frustrations, and workarounds. The finding that developers frequently struggle to verify the correctness of generated code is directly relevant to Tom's observation in the chapter about the need for rigorous code review.
Multimodal Models
12. OpenAI. (2023). "GPT-4V(ision) System Card." OpenAI Technical Report. OpenAI's documentation of GPT-4V's multimodal capabilities and limitations, including known failure modes, safety evaluations, and usage guidelines. System cards are the primary documentation for understanding what a model can and cannot do — essential reading for any business deploying multimodal AI systems.
13. Reid, M., Savinov, N., Teplyashin, D., et al. (2024). "Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context." arXiv preprint arXiv:2403.05530. Google's technical report on Gemini 1.5, notable for its extremely long context window and ability to process large documents, lengthy videos, and extended audio inputs. The paper illustrates the trajectory of multimodal models toward more comprehensive content understanding — relevant to the chapter's discussion of document understanding and visual analysis.
Intellectual Property and Copyright
14. Sag, M. (2023). "Copyright Safety for Generative AI." Houston Law Review, 61(2), 295-372. The most comprehensive legal analysis of copyright issues in generative AI, written by a leading copyright scholar. Sag examines the fair use arguments for AI training, the copyrightability of AI-generated outputs, and potential legislative solutions. Accessible to non-lawyers and essential for any business leader managing IP risk in generative AI. Directly relevant to the Getty v. Stability AI case study.
15. US Copyright Office. (2023). "Copyright Registration Guidance: Works Containing Material Generated by Artificial Intelligence Systems." 88 Fed. Reg. 16190 (March 16, 2023). The official US Copyright Office guidance on whether and how AI-generated works qualify for copyright registration. The guidance establishes that purely AI-generated content is not copyrightable but that works involving "sufficient human authorship" in the selection, arrangement, or modification of AI-generated elements may qualify. The primary source for understanding the current US copyright framework for AI-generated content.
16. Sobel, B. L. W. (2024). "A New Common Law of Artificial Intelligence Generated Works." Stanford Technology Law Review, 27(1), 1-68. Sobel proposes a new legal framework for AI-generated works, arguing that existing copyright law is inadequate for the generative AI era. The paper's analysis of the ownership, liability, and attribution questions raised by generative AI is thoughtful and practical. Particularly useful for the discussion questions in Case Study 1 (Getty v. Stability AI).
Deepfakes and Content Authenticity
17. Chesney, R., & Citron, D. K. (2019). "Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security." California Law Review, 107(6), 1753-1820. The foundational legal analysis of deepfake risks, published before the current wave of generative AI but remarkably prescient. Chesney and Citron's taxonomy of deepfake harms — to individuals, organizations, and democratic institutions — remains the standard framework. Their proposed policy responses (authentication, liability, and legal remedies) inform the C2PA and content provenance approaches discussed in the chapter.
18. Coalition for Content Provenance and Authenticity (C2PA). (2024). "C2PA Technical Specification 2.0." C2PA.org. The full technical specification for the C2PA content provenance standard. While highly technical, the overview sections explain the standard's goals, architecture, and adoption roadmap in accessible terms. Required reading for organizations considering C2PA adoption — which, as the chapter suggests, should be most organizations that produce or distribute digital content.
Business Strategy and Creative Industry Impact
19. Mollick, E. (2024). Co-Intelligence: Living and Working with AI. Portfolio. Ethan Mollick's practical guide includes extensive discussion of generative AI's impact on creative work, drawn from his experiments at Wharton. Chapter 6 ("AI as Creative") is particularly relevant to this chapter's discussion of the creative industry impact and the new creative workflow. Mollick's framework for understanding when AI enhances creativity versus when it produces mediocrity is directly useful for business leaders making content strategy decisions.
20. Bain & Company. (2024). "Generative AI in Marketing: Moving from Experimentation to Scale." Bain & Company Report. Bain's analysis of how leading brands (including their work with Coca-Cola, discussed in Case Study 2) are integrating generative AI into marketing operations. The report includes quantitative data on cost savings, productivity gains, and consumer engagement metrics from early adopters. The framework for moving from experimentation to scaled deployment is practical and actionable.
21. Eloundou, T., Manning, S., Mishkin, P., & Rock, D. (2023). "GPTs Are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models." arXiv preprint arXiv:2303.10130. An analysis of how large language models (and by extension, multimodal AI) will affect specific occupations and task categories. The finding that creative and analytical occupations are more exposed than manual occupations challenges common assumptions about AI's labor market impact. Referenced in Chapter 1's further reading — revisit it now with the multimodal perspective from this chapter.
22. McKinsey Global Institute. (2024). "The Economic Potential of Generative AI: The Next Productivity Frontier." McKinsey & Company. McKinsey's comprehensive analysis of generative AI's economic impact, including sector-by-sector estimates of value creation potential. The report estimates that generative AI could add $2.6-4.4 trillion annually to the global economy, with marketing, software development, and customer operations among the highest-impact functions. The sector estimates provide useful benchmarks for the business application discussions in this chapter.
Technical Foundations (For Deeper Understanding)
23. Ho, J., Jain, A., & Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." Advances in Neural Information Processing Systems (NeurIPS), 33, 6840-6851. The seminal paper that established denoising diffusion as a viable approach to image generation. While mathematically rigorous, the paper's introduction explains the noise-addition and noise-removal intuition described in the chapter. For readers with quantitative backgrounds who want to understand how diffusion works beyond the business intuition.
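The noise-addition intuition from Ho et al. can be sketched in a few lines of NumPy. The closed form q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I) lets us jump to any noising step directly; the linear beta schedule below matches the paper's setup, while the 32x32 random array is just a stand-in for an image.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t from q(x_t | x_0): scaled clean signal plus Gaussian noise."""
    alpha_bar = np.cumprod(1.0 - betas)   # abar_t = product of (1 - beta_s) up to t
    eps = rng.standard_normal(x0.shape)   # the noise a trained network learns to predict
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)     # linear variance schedule, 1000 steps
x0 = rng.standard_normal((32, 32))        # stand-in for an image

x_early = forward_diffuse(x0, 10, betas, rng)    # lightly noised
x_late = forward_diffuse(x0, 999, betas, rng)    # nearly pure noise

# Correlation with the clean signal stays high early and collapses late.
print(np.corrcoef(x0.ravel(), x_early.ravel())[0, 1])
print(np.corrcoef(x0.ravel(), x_late.ravel())[0, 1])
```

Generation is this process run in reverse: a network trained to predict eps is applied step by step to turn pure noise back into a sample. This sketch covers only the forward direction, which is the part with a simple closed form.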
24. Saharia, C., Chan, W., Saxena, S., et al. (2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding." Advances in Neural Information Processing Systems (NeurIPS), 35, 36479-36494. Google's Imagen paper, which demonstrated that combining large language models with diffusion models produces dramatically better text-to-image quality. The key insight — that the quality of the text understanding is as important as the quality of the image generation — has implications for prompt engineering (Chapter 19) and for understanding why multimodal models outperform unimodal approaches.
This reading list focuses on works accessible to MBA students and business professionals. For a comprehensive bibliography of all sources cited in this textbook, see Appendix C. For foundational readings on neural networks and deep learning, see Chapter 13's further reading. For LLM-specific readings, see Chapter 17's further reading.