Chapter 33 Further Reading: AI Product Management


AI Product Management Foundations

  1. Bock, O. & Wiener, M. (2024). "Product Management for AI Products: A Systematic Literature Review and Research Agenda." Journal of Product Innovation Management, 41(2), 312-340. The first comprehensive academic survey of AI-specific product management practices. Reviews 87 studies and identifies five core competency areas for AI PMs: probabilistic requirements management, stakeholder translation, ethical product design, continuous evaluation, and AI-specific user research. Provides an empirical foundation for the AI PM skill stack described in Section 33.1.

  2. Builders of AI (2024). The AI Product Manager's Handbook. Pragmatic Press. A practitioner-oriented guide to AI product management, written by a consortium of AI PMs from Google, Microsoft, Amazon, and Spotify. Covers the full PM lifecycle with AI-specific modifications, including chapters on setting performance thresholds, designing fallback strategies, and communicating with non-technical stakeholders. The most accessible single resource for professionals transitioning from traditional PM to AI PM.

  3. Huyen, C. (2022). Designing Machine Learning Systems. O'Reilly Media. Though written primarily for ML engineers, this book is invaluable for AI PMs because it explains the technical constraints that shape product decisions — data management, feature engineering, model evaluation, deployment patterns, and monitoring. Chapter 11 on ML system evaluation is particularly relevant to the metrics framework in Section 33.7. Referenced also in Chapter 6's further reading for its lifecycle perspective.

  4. Cagan, M. (2017). Inspired: How to Create Tech Products Customers Love. 2nd ed. Wiley. The standard reference on modern product management practice. While not AI-specific, its frameworks for product discovery, product teams, and stakeholder management form the foundation on which AI PM builds. Sections on empowered product teams and continuous discovery are directly applicable to AI product development. Essential context for understanding what AI PM adds to the traditional PM discipline.


Probabilistic Thinking and Decision-Making Under Uncertainty

  1. Kahneman, D., Sibony, O., & Sunstein, C.R. (2021). Noise: A Flaw in Human Judgment. Little, Brown Spark. Explores the concept of "noise" — unwanted variability in human judgment — which provides useful context for understanding why probabilistic AI systems can outperform human decision-makers even when the AI is imperfect. Relevant to the "perfection trap" discussion in Section 33.2 and the challenge of comparing AI performance to human baselines.

  2. Silver, N. (2012). The Signal and the Noise: Why So Many Predictions Fail — but Some Don't. Penguin Press. A masterful exploration of prediction, uncertainty, and probabilistic thinking across domains from weather forecasting to political polling. The chapter on weather forecasting is particularly relevant to AI PM: weathercasters face the same communication challenge as AI PMs — conveying probabilistic information to audiences that prefer certainty.

  3. Agrawal, A., Gans, J., & Goldfarb, A. (2022). Power and Prediction: The Disruptive Economics of Artificial Intelligence. Harvard Business Review Press. The sequel to Prediction Machines, focusing on how AI changes decision-making within organizations. The framework of "AI as a prediction technology that unbundles decisions from judgment" is directly applicable to AI product management. Particularly relevant to Section 33.2 (managing probabilistic products) and Section 33.5 (defining requirements that separate prediction from decision).


User Research and Trust in AI

  1. Kocielnik, R., Amershi, S., & Bennett, P.N. (2019). "Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-User Expectations of AI Systems." Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1-14. A landmark study on how different design strategies (setting expectations, expressing confidence, providing context) affect user acceptance of imperfect AI systems. Directly informs the expectation calibration strategies in Section 33.4. Key finding: users who are told in advance that the AI "is right about 80% of the time" are more satisfied with 80% accuracy than users who receive no expectation setting.

  2. Yin, M., Wortman Vaughan, J., & Wallach, H. (2019). "Understanding the Effect of Accuracy on Trust in Machine Learning Models." Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1-12. Examines the relationship between model accuracy and user trust. Finds that trust is not a linear function of accuracy — there are threshold effects (trust drops sharply below certain accuracy levels) and anchoring effects (initial accuracy exposure sets expectations that persist even when accuracy changes). Essential reading for PMs defining performance thresholds and launch strategies.

  3. Amershi, S., Weld, D., Vorvoreanu, M., Fourney, A., Nushi, B., Collisson, P., ... & Horvitz, E. (2019). "Guidelines for Human-AI Interaction." Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1-13. Microsoft Research's 18 guidelines for designing AI-powered user experiences, organized into four temporal phases: initially (before interaction), during interaction, when wrong (handling errors), and over time (adaptation). Provides a practical checklist for AI PMs designing user-facing AI features. The "when wrong" guidelines are particularly relevant to the failure mode design in Section 33.8.

  4. Lee, M.K. (2018). "Understanding Perception of Algorithmic Decisions: Fairness, Trust, and Emotion in Response to Algorithmic Management." Big Data & Society, 5(1). Explores how people perceive and respond to algorithmic decisions, particularly in high-stakes contexts like hiring and performance evaluation. Finds that perceived fairness — not just actual accuracy — drives trust in AI systems. Relevant to Section 33.4's discussion of user mental models and Section 33.7's fairness metrics.


AI Product Metrics and Evaluation

  1. Fabijan, A., Dmitriev, P., Olsson, H.H., & Bosch, J. (2018). "Online Controlled Experiments at Large Scale." Proceedings of the 40th International Conference on Software Engineering (ICSE), 1-4. A practical guide to running A/B tests at scale, based on experience at Microsoft. Covers sample size calculation, experiment duration, novelty effects, and common pitfalls — all relevant to the A/B testing challenges described in Section 33.9. It also offers guidance on organizational readiness for experimentation.

  2. Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. The definitive reference on A/B testing, written by practitioners from Microsoft and Google. Covers experimental design, metric selection, statistical significance, common mistakes, and organizational best practices. Chapter 21 on "Experimentation for Machine Learning" directly addresses the AI-specific A/B testing challenges described in Section 33.9 (novelty effects, network effects, long-tail effects).

  3. Doshi-Velez, F. & Kim, B. (2017). "Towards A Rigorous Science of Interpretable Machine Learning." arXiv:1702.08608. Proposes a taxonomy of interpretability evaluation methods — from application-grounded (domain expert evaluation) to human-grounded (lay user evaluation) to functionally grounded (proxy metrics). Useful for AI PMs designing explainability features (Section 33.4) and defining how to measure the effectiveness of "Why was this recommended?" interfaces.
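The sample-size guidance in Kohavi, Tang, and Xu (2020) and Fabijan et al. (2018) rests on standard power analysis. A minimal sketch of the textbook two-proportion calculation (stdlib only; this is the common normal-approximation formula, not code from either source, and real experimentation platforms add refinements such as variance reduction):

```python
import math
from statistics import NormalDist

def required_sample_size(p_baseline, mde, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a two-sided
    two-proportion z-test.

    p_baseline: baseline conversion rate (e.g. 0.10)
    mde: minimum detectable effect, absolute (e.g. 0.01 for +1 point)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)            # ~0.84 for 80% power
    p2 = p_baseline + mde
    p_bar = (p_baseline + p2) / 2                   # pooled rate under H0
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * math.sqrt(p_baseline * (1 - p_baseline)
                               + p2 * (1 - p2))) ** 2) / mde ** 2
    return math.ceil(n)

# Detecting a lift from 10% to 11% needs roughly 15k users per arm,
# which is why small AI quality gains can take weeks to confirm.
print(required_sample_size(0.10, 0.01))
```

The quadratic dependence on the minimum detectable effect is the practical point for PMs: halving the effect you want to detect quadruples the required traffic.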


Failure Mode Design and Reliability

  1. Nushi, B., Kamar, E., Horvitz, E., & Kossmann, D. (2017). "On Human Intellect and Machine Failures: Troubleshooting Integrative Machine Learning Systems." Proceedings of the AAAI Conference on Artificial Intelligence, 31(1). Examines how to diagnose and troubleshoot failures in deployed ML systems, particularly systems that combine multiple models or human-AI collaboration. Proposes a framework for identifying root causes of failure — relevant to the failure mode taxonomy in Section 33.8 and the distinction between model, data, infrastructure, and concept drift failures.

  2. Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., ... & Dennison, D. (2015). "Hidden Technical Debt in Machine Learning Systems." Advances in Neural Information Processing Systems, 28. The seminal paper on ML system maintenance, arguing that ML systems accumulate "technical debt" faster than traditional software. Directly relevant to Section 33.9's discussion of continuous improvement vs. concept drift, and to the infrastructure investment track of the AI product roadmap (Section 33.11). If you read one paper on the long-term costs of AI products, read this one. (Also referenced in Chapter 6.)

  3. Klaise, J., Van Looveren, A., Vacanti, G., & Coca, A. (2020). "Monitoring and Explainability of Models in Deployment." arXiv:2007.06299. Practical guide to monitoring ML models in production, covering concept drift detection, outlier detection, and explainability monitoring. Relevant to Section 33.8 (failure mode design) and Section 33.9 (concept drift). Includes open-source tool references that AI PMs can share with their engineering teams.
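The drift monitoring these papers describe can be as simple as comparing the distribution of a model input or score between training time and production. A minimal sketch using the Population Stability Index with equal-width bins (an illustrative simplification — production monitors typically use quantile bins and statistical tests, as surveyed by Klaise et al.):

```python
import math

def psi(reference, production, bins=10):
    """Population Stability Index between a reference sample (e.g. a
    feature's values at training time) and a production sample.
    Common rule of thumb: < 0.10 stable, 0.10-0.25 moderate shift,
    > 0.25 major shift worth investigating.
    """
    lo, hi = min(reference), max(reference)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(1 for e in edges if x > e)] += 1
        eps = 1e-4  # floor empty bins to avoid log(0)
        return [max(c / len(sample), eps) for c in counts]

    ref, prod = bin_fractions(reference), bin_fractions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))
```

A PM-friendly property of PSI is that it yields a single number per feature that can be tracked on a dashboard and alerted on, without anyone needing to inspect raw distributions daily.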


Case Study Sources

  1. Settles, B. & Meeder, B. (2016). "A Trainable Spaced Repetition Model for Language Learning." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 1848-1858. The paper describing Duolingo's half-life regression model — the core AI system discussed in Case Study 1. Provides the technical foundation for understanding how Duolingo's spaced repetition system works and how it balances memory retention with engagement.

  2. Nayak, P. (2019). "Understanding Searches Better Than Ever Before." Google Blog, October 25, 2019. Google's official announcement of BERT's integration into search, described in Case Study 2. Provides Google's product framing of a major AI model change — a useful example of how to communicate a technical improvement to a non-technical audience.

  3. Metz, C. (2023). "Google Puts New AI-Powered Features in Search." The New York Times, May 10, 2023. Journalistic account of Google's Search Generative Experience announcement, including the internal debates about risks, cannibalization, and competitive pressure from ChatGPT. Provides the organizational and strategic context for the generative AI phase discussed in Case Study 2.
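The half-life regression model in Settles and Meeder (2016) above predicts recall probability as p = 2^(-Δ/h), where Δ is days since last practice and the half-life h is estimated as 2^(Θ·x) from the learner's practice history. A minimal sketch (the feature set shown is illustrative, not the paper's exact one):

```python
import math

def recall_probability(delta_days, half_life_days):
    """Predicted probability the learner still recalls a word
    delta_days after last practice: p = 2^(-delta/h)."""
    return 2.0 ** (-delta_days / half_life_days)

def estimated_half_life(weights, features):
    """HLR estimates h = 2^(dot(theta, x)); weights are learned by
    regression against observed recall rates. Features might include
    the learner's correct and incorrect counts for the word
    (hypothetical names, for illustration only)."""
    return 2.0 ** sum(w * x for w, x in zip(weights, features))

# When delta equals the half-life, predicted recall is exactly 50% --
# the point at which a review is most valuable to schedule.
h = estimated_half_life([0.4, -0.3], [5.0, 1.0])  # 5 correct, 1 wrong
print(recall_probability(h, h))
```

For the product discussion in Case Study 1, the key design lever is that the same model supports two objectives: scheduling reviews near the 50% recall point maximizes learning efficiency, while scheduling them earlier keeps sessions easier and more engaging.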


Ethics, Fairness, and Responsible AI Product Design

  1. Holstein, K., Wortman Vaughan, J., Daumé III, H., Dudik, M., & Wallach, H. (2019). "Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need?" Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1-16. Based on interviews with ML practitioners at major technology companies, this paper identifies the practical tools and processes that teams need to address fairness in deployed AI products. Relevant to Section 33.7's fairness metrics and Section 33.5's fairness acceptance criteria. Particularly useful for AI PMs who need to advocate for fairness investments with skeptical stakeholders.

  2. Gebru, T., Morgenstern, J., Vecchione, B., Wortman Vaughan, J., Wallach, H., Daumé III, H., & Crawford, K. (2021). "Datasheets for Datasets." Communications of the ACM, 64(12), 86-92. Proposes standardized documentation for training datasets, including composition, collection methodology, intended use, and known biases. Essential for AI PMs who are responsible for ensuring the data underlying their products is well-documented and ethically sourced. Complements the model cards framework referenced in Chapter 6.

  3. Raji, I.D., Smart, A., White, R.N., Mitchell, M., Gebru, T., Hutchinson, B., ... & Barnes, P. (2020). "Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing." Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 33-44. Proposes an end-to-end framework for internal AI auditing — from problem formulation through deployment and monitoring. Relevant to the AI PM's responsibility for ensuring ethical product design throughout the lifecycle (Sections 33.3 and 33.5), and to the governance frameworks discussed in Chapter 27.


Broader Perspectives on AI Products

  1. Elish, M.C. & Boyd, D. (2018). "Situating Methods in the Magic of Big Data and AI." Communication Monographs, 85(1), 57-80. Examines how the "magic" framing of AI shapes user expectations and organizational decision-making. Directly relevant to Section 33.4's discussion of user mental models — particularly the "magic model" where users expect AI to be infallible. Argues that responsible AI product design requires actively demystifying AI for users and stakeholders.

  2. Muller, M., Lange, I., Wang, D., Piorkowski, D., Tsay, J., Liao, Q.V., ... & Erickson, T. (2019). "How Data Science Workers Work with Data: Discovery, Capture, Curation, Design, Creation." Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1-15. An ethnographic study of how data science teams actually work, revealing the messy reality of data-driven product development. Useful for AI PMs who need to manage data science teams (Section 33.3) and understand why model development timelines are non-linear and inherently uncertain.


These readings span AI product management practice, probabilistic thinking, user research for AI, metrics and experimentation, failure mode design, and ethical product development. For readers prioritizing depth: start with Huyen (2022) for technical context, the Builders of AI handbook (2024) for PM-specific practices, and Amershi et al. (2019) for the most actionable design guidelines. For readers prioritizing breadth: Agrawal, Gans, and Goldfarb (2022) provides the strategic framework, and Silver (2012) provides the probabilistic mindset.