Further Reading: Python for AI Engineering

Foundational References

Books

  1. McKinney, W. (2022). Python for Data Analysis, 3rd Edition. O'Reilly Media. The definitive guide to pandas, written by the library's creator. Covers DataFrame operations, GroupBy, time series, and integration with NumPy. Essential reference for Sections 5.3 and 5.4.

  2. VanderPlas, J. (2016). Python Data Science Handbook. O'Reilly Media. Comprehensive coverage of NumPy, pandas, matplotlib, and scikit-learn. Freely available online. Excellent companion to this chapter's treatment of the Python scientific stack.

  3. Harris, C. R. et al. (2020). "Array programming with NumPy." Nature, 585, 357-362. The canonical reference paper for NumPy, explaining its design philosophy, array model, and role in the scientific Python ecosystem. Provides the theoretical underpinning for Section 5.2.

  4. Gorelick, M. & Ozsvald, I. (2020). High Performance Python, 2nd Edition. O'Reilly Media. Deep treatment of Python performance optimization, including profiling, Cython, Numba, multiprocessing, and memory management. Extends Section 5.5's optimization strategies significantly.

  5. Ramalho, L. (2022). Fluent Python, 2nd Edition. O'Reilly Media. Advanced Python programming covering data model, iterators, generators, concurrency, and metaprogramming. Strengthens the coding best practices from Section 5.7.

Documentation

  1. NumPy Documentation. https://numpy.org/doc/ Official reference for all NumPy functions, including detailed explanations of broadcasting rules, memory layout, and universal functions (ufuncs).

  2. pandas Documentation. https://pandas.pydata.org/docs/ Comprehensive API reference and user guide. The "10 Minutes to pandas" tutorial is an excellent quick-start.

  3. matplotlib Documentation. https://matplotlib.org/stable/ The gallery section provides hundreds of example plots with source code. The tutorials section covers the Figure-Axes model in detail.

Advanced Topics

Performance and Scaling

  1. Lam, S. K., Pitrou, A., & Seibert, S. (2015). "Numba: A LLVM-based Python JIT Compiler." Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC. Technical paper describing Numba's JIT compilation approach for accelerating numerical Python code. Extends the Numba discussion in Section 5.5.3.

  2. Dask Documentation. https://docs.dask.org/ Dask extends pandas and NumPy to out-of-core and parallel computation. Essential when datasets exceed available RAM, as discussed in Case Study 2.

  3. CuPy Documentation. https://docs.cupy.dev/ NumPy-compatible GPU array library. Provides a practical path to GPU acceleration that is often simpler than writing custom CUDA kernels.

  4. Polars Documentation. https://pola.rs/ A modern DataFrame library written in Rust with an optional lazy API and multithreaded execution. Often substantially faster than pandas on large datasets, though the gain depends on the workload.

Visualization

  1. Knaflic, C. N. (2015). Storytelling with Data. Wiley. Not a Python book, but an excellent resource on data visualization principles: choosing chart types, reducing clutter, and directing attention. Its advice carries over directly to plots made with matplotlib.

  2. Tufte, E. R. (2001). The Visual Display of Quantitative Information, 2nd Edition. Graphics Press. Classic text on information design and statistical graphics. Introduces principles like data-ink ratio and chartjunk avoidance that apply directly to matplotlib figure design.

  3. seaborn Documentation. https://seaborn.pydata.org/ Statistical visualization library built on matplotlib. The tutorial section demonstrates how to create complex statistical plots with minimal code.

Reproducibility and Engineering

  1. Kluyver, T. et al. (2016). "Jupyter Notebooks -- a publishing format for reproducible computational workflows." In Positioning and Power in Academic Publishing: Players, Agents and Agendas, pp. 87-90. IOS Press. Paper describing the Jupyter notebook architecture and its role in reproducible research. Provides context for Section 5.5.1.

  2. Cookiecutter Data Science. https://drivendata.github.io/cookiecutter-data-science/ Project template for data science work that inspired the project structure in Section 5.6.4. Widely adopted in industry and academia.

  3. Mypy Documentation. https://mypy.readthedocs.io/ Static type checker for Python. Extends the type hints discussion in Section 5.7.1 with detailed guidance on generics and protocols. For annotating NumPy arrays specifically, see also the numpy.typing module in the NumPy documentation.

Online Resources

  1. Jake VanderPlas, "A Whirlwind Tour of Python." https://jakevdp.github.io/WhirlwindTourOfPython/ Free online book covering Python fundamentals for scientific computing.

  2. Real Python. https://realpython.com/ High-quality tutorials on Python programming, including detailed articles on NumPy internals, pandas optimization, and matplotlib customization.

  3. NumPy for MATLAB Users. https://numpy.org/doc/stable/user/numpy-for-matlab-users.html Translation guide for those coming from MATLAB, mapping MATLAB idioms to NumPy equivalents.

Connections to Other Chapters

Topic                     This Chapter    Connected Chapters
NumPy array operations    Section 5.2     Chapter 2 (linear algebra implementation)
Vectorized computation    Section 5.2.4   Chapter 3 (gradient computation)
pandas data handling      Section 5.3     Chapter 9 (feature engineering)
Visualization             Section 5.4     Chapter 8 (model evaluation plots)
Profiling                 Section 5.5     Chapter 12 (training pipeline optimization)
Environment management    Section 5.6     Chapter 38 (MLOps and deployment)
Type hints and testing    Section 5.7     All subsequent chapters