Further Reading — Chapter 18: Working with PDFs and Word Documents

Official Documentation

pypdf Documentation https://pypdf.readthedocs.io/ The official pypdf documentation covering all reader and writer operations, metadata extraction, encryption, watermarking, and page transformations. The Changelog is worth reading to understand the migration path from the deprecated PyPDF2.

python-docx Documentation https://python-docx.readthedocs.io/ Complete API reference for python-docx. The "Working with Text" and "Working with Tables" sections are the most directly useful. The section on styles is essential when you need to match a corporate document template's formatting exactly.

reportlab User Guide https://www.reportlab.com/docs/reportlab-userguide.pdf The official reportlab documentation in PDF format (appropriately). Chapters 1–4 cover the canvas API from this chapter. Chapters 5–9 cover the platypus framework for production report generation. Dense but comprehensive.

Books

Automate the Boring Stuff with Python (3rd Edition) Al Sweigart — No Starch Press Chapter 15 (Working with PDF and Word Documents) covers similar ground with a slightly different emphasis. Chapter 14 (Working with Excel Spreadsheets) and Chapter 16 (Working with CSV Files and JSON Data) connect naturally to this chapter's patterns. Free online at automatetheboringstuff.com.

Python Cookbook (3rd Edition) David Beazley and Brian K. Jones — O'Reilly Chapter 6 (Data Encoding and Processing) contains recipes for parsing and processing structured document formats. The approach to handling edge cases and malformed data is directly applicable to the PDF extraction challenges covered in this chapter.

Online Resources

Real Python: Working with PDFs in Python https://realpython.com/pdf-python/ A comprehensive tutorial covering pypdf (they note the PyPDF2 → pypdf migration), text extraction, merging, splitting, and adding watermarks. Good companion reading to this chapter with additional worked examples.

Real Python: Python-docx Tutorial https://realpython.com/python-docx/ Covers document creation, reading, and editing with python-docx in more depth than this chapter. The section on reading existing documents and the discussion of paragraph styles are particularly useful.

Towards Data Science: PDF Text Extraction in Python A practical analysis of which PDF extraction approaches work for which document types, including a comparison of pypdf, pdfminer.six, and pdfplumber. Worth reading before choosing your extraction library for a production use case.

Tools and Libraries

pypdf (PyPI) pip install pypdf PDF reading, splitting, merging, and metadata extraction. Required for pdf_reader.py in this chapter.

python-docx (PyPI) pip install python-docx Word document creation and manipulation. Note: the package name on PyPI is python-docx, but the module you import is docx (import docx). Required for word_generator.py.

reportlab (PyPI) pip install reportlab PDF generation from Python code. Required for the create_simple_pdf() examples in Section 18.7.

pdfplumber pip install pdfplumber A higher-level PDF extraction library built on top of pdfminer.six. Better at preserving table structure than pypdf's extract_text() for many document types. Worth evaluating if pypdf's output is too noisy for your use case.

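camelot-py pip install "camelot-py[cv]" Designed specifically for extracting tables from PDFs. Handles both bordered tables (its lattice mode) and borderless tables (its stream mode), and is usually much better than general-purpose text extraction for tabular data. Requires additional dependencies (Ghostscript, OpenCV). The quotes around the package name keep shells such as zsh from interpreting the square brackets.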

tabula-py pip install tabula-py Python wrapper for Tabula, a Java-based PDF table extractor. Requires Java to be installed. Excellent for well-structured tables in PDFs generated from databases or billing systems.
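Because a PyPI name and its import name can differ (python-docx installs the docx module, camelot-py installs camelot, tabula-py installs tabula), a quick pre-flight check can save a confusing ImportError. The following is a minimal sketch using only the standard library. The helper name missing_packages and the single IMPORT_NAMES dictionary are illustrative assumptions, not part of this chapter's scripts.

```python
import importlib.util

# PyPI distribution name -> top-level module name used in `import`.
IMPORT_NAMES = {
    "pypdf": "pypdf",
    "python-docx": "docx",
    "reportlab": "reportlab",
    "pdfplumber": "pdfplumber",
    "camelot-py": "camelot",
    "tabula-py": "tabula",
}


def missing_packages(import_names=IMPORT_NAMES):
    """Return the PyPI names whose import module cannot be located."""
    return [
        pypi_name
        for pypi_name, module_name in import_names.items()
        if importlib.util.find_spec(module_name) is None
    ]
```

Calling missing_packages() with no arguments checks all six libraries listed in this section; pass a smaller dictionary to check only the ones a given script needs. The check only confirms the Python module is findable, not that external tools such as Java or Ghostscript are present.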

Advanced Topics

Tesseract OCR (for scanned PDFs) https://github.com/tesseract-ocr/tesseract When PDFs are scanned images with no text layer, you need OCR. Tesseract is a widely used open-source OCR engine. The Python wrapper pytesseract (pip install pytesseract) makes it straightforward to apply OCR to PDF page images; it calls the Tesseract binary, which must be installed separately. The workflow: render PDF pages to images (using pdf2image, which in turn requires Poppler), run OCR on each image, then process the resulting text.
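The three-step workflow above (render pages, OCR each image, process the text) might be sketched as follows. This is a rough outline rather than a production pipeline: the function names ocr_images and ocr_pdf and the dpi default of 300 are assumptions made for illustration. It relies on pdf2image.convert_from_path and pytesseract.image_to_string, which need Poppler and the Tesseract binary installed on the system.

```python
def ocr_images(images, image_to_text):
    """Apply an OCR callable to each page image; return one string per page."""
    return [image_to_text(image).strip() for image in images]


def ocr_pdf(pdf_path, dpi=300):
    """Render each page of a scanned PDF to an image, then OCR it."""
    # Imported here so this module still loads when the OCR stack is absent.
    from pdf2image import convert_from_path  # requires Poppler
    import pytesseract  # requires the Tesseract binary

    images = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return ocr_images(images, pytesseract.image_to_string)
```

Keeping the OCR call as a parameter of ocr_images makes the per-page loop easy to exercise with a stand-in function, without a scanned PDF or Tesseract on hand.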

python-docx Style Reference https://python-docx.readthedocs.io/en/latest/user/styles-understanding.html Understanding Word styles is essential for matching corporate document templates. This guide explains the difference between paragraph styles, character styles, and table styles, and how to inspect what styles are available in an existing document.

Adobe PDF Reference (for the curious) https://opensource.adobe.com/dc-acrobat-sdk-docs/ The technical specification for the PDF format. Explains exactly why text extraction is lossy — PDF content streams are drawing instructions, not structured documents. Understanding the format makes the limitations of pypdf and similar libraries much more intuitive.
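To make the "drawing instructions" point concrete, here is a toy illustration. The content stream below is hand-written and heavily simplified (real streams are usually compressed and use relative moves, font-specific encodings, and kerning arrays), and the regular expression only understands this exact pattern. It is a sketch of the idea, not a PDF parser, and all names in it are hypothetical.

```python
import re

# A hand-written, simplified content stream (not taken from a real file).
# Each line sets an absolute text position with Tm, then draws a string with Tj.
STREAM = """
BT 1 0 0 1 72 100 Tm (Total:) Tj ET
BT 1 0 0 1 72 700 Tm (Invoice) Tj ET
BT 1 0 0 1 300 100 Tm (42.00) Tj ET
"""

SHOW_TEXT = re.compile(r"1 0 0 1 (\d+) (\d+) Tm \((.*?)\) Tj")


def draw_ops(stream):
    """Return (x, y, text) for each positioned string, in stream order."""
    return [(int(x), int(y), text) for x, y, text in SHOW_TEXT.findall(stream)]


def stream_order(stream):
    """Text in the order the drawing instructions appear in the stream."""
    return [text for _, _, text in draw_ops(stream)]


def reading_order(stream):
    """Text sorted top-to-bottom, then left-to-right (PDF y grows upward)."""
    ops = sorted(draw_ops(stream), key=lambda op: (-op[1], op[0]))
    return [text for _, _, text in ops]
```

Here stream_order(STREAM) yields the footer label before the title, while reading_order(STREAM) recovers the visual layout by sorting on coordinates. Extraction libraries have to apply heuristics of this kind, which is why their output can differ from what you see on the page.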

When to Use Each Tool

Need                                   Tool
-------------------------------------  ------------------------
Read text from an existing PDF         pypdf
Extract tables from PDFs               pdfplumber or camelot
OCR a scanned PDF                      pytesseract + pdf2image
Merge or split PDFs                    pypdf
Create a Word document                 python-docx
Fill a Word template                   python-docx
Read a Word document                   python-docx
Create a PDF from scratch (simple)     reportlab canvas API
Create a PDF from scratch (complex)    reportlab platypus
Generate reports with charts to PDF    reportlab + matplotlib