Further Reading — Chapter 18: Working with PDFs and Word Documents
Official Documentation
pypdf Documentation
https://pypdf.readthedocs.io/
The official pypdf documentation covering all reader and writer operations, metadata extraction, encryption, watermarking, and page transformations. The Changelog is worth reading to understand the migration path from the deprecated PyPDF2.
python-docx Documentation
https://python-docx.readthedocs.io/
Complete API reference for python-docx. The "Working with Text" and "Working with Tables" sections are the most directly useful. The section on styles is essential when you need to match a corporate document template's formatting exactly.
reportlab User Guide
https://www.reportlab.com/docs/reportlab-userguide.pdf
The official reportlab documentation in PDF format (appropriately). Chapters 1–4 cover the canvas API from this chapter. Chapters 5–9 cover the platypus framework for production report generation. Dense but comprehensive.
Books
Automate the Boring Stuff with Python (3rd Edition)
Al Sweigart, No Starch Press
Chapter 15 (Working with PDF and Word Documents) covers similar ground with a slightly different emphasis. Chapter 14 (Working with Excel Spreadsheets) and Chapter 16 (Working with CSV Files and JSON Data) connect naturally to this chapter's patterns. Free online at automatetheboringstuff.com.
Python Cookbook (3rd Edition)
David Beazley and Brian K. Jones, O'Reilly
Chapter 6 (Data Encoding and Processing) contains recipes for parsing and processing structured document formats. The approach to handling edge cases and malformed data is directly applicable to the PDF extraction challenges covered in this chapter.
Online Resources
Real Python: Working with PDFs in Python
https://realpython.com/pdf-python/
A comprehensive tutorial covering pypdf (they note the PyPDF2 → pypdf migration), text extraction, merging, splitting, and adding watermarks. Good companion reading to this chapter with additional worked examples.
Real Python: Python-docx Tutorial
https://realpython.com/python-docx/
Covers document creation, reading, and editing with python-docx in more depth than this chapter. The section on reading existing documents and the discussion of paragraph styles are particularly useful.
Towards Data Science: PDF Text Extraction in Python
A practical analysis of which PDF extraction approaches work for which document types, including a comparison of pypdf, pdfminer.six, and pdfplumber. Worth reading before choosing your extraction library for a production use case.
Tools and Libraries
pypdf (PyPI)
pip install pypdf
PDF reading, splitting, merging, and metadata extraction. Required for pdf_reader.py in this chapter.
python-docx (PyPI)
pip install python-docx
Word document creation and manipulation. Note: the package name on PyPI is python-docx, but the module you import is docx (import docx). Required for word_generator.py.
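Because a PyPI name and its import name can differ, a quick importability check can save a confusing ImportError later. The following is a minimal sketch using only the standard library; the CHAPTER_PACKAGES mapping and the missing_packages() helper are illustrative names, not part of this chapter's code.

```python
from importlib.util import find_spec

# PyPI package name -> top-level module name used in import statements.
# Hypothetical mapping covering the three core libraries of this chapter.
CHAPTER_PACKAGES = {
    "pypdf": "pypdf",
    "python-docx": "docx",  # installed as python-docx, imported as docx
    "reportlab": "reportlab",
}


def missing_packages(packages=None):
    """Return the PyPI names whose import module cannot be found."""
    if packages is None:
        packages = CHAPTER_PACKAGES
    return [
        pypi_name
        for pypi_name, module_name in packages.items()
        if find_spec(module_name) is None
    ]
```

Calling missing_packages() returns an empty list once all three libraries are installed; otherwise the returned names can be passed straight to pip install.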
reportlab (PyPI)
pip install reportlab
PDF generation from Python code. Required for the create_simple_pdf() examples in Section 18.7.
pdfplumber
pip install pdfplumber
A higher-level PDF extraction library built on top of pdfminer.six. Better at preserving table structure than pypdf's extract_text() for many document types. Worth evaluating if pypdf's output is too noisy for your use case.
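As a rough illustration of what table extraction with pdfplumber can look like, the sketch below collects every table in a PDF as a list of dictionaries. It assumes the first row of each table is a header row, which will not hold for every document; table_to_records() and extract_table_records() are hypothetical helper names, while pdfplumber.open(), pdf.pages, and page.extract_tables() are the library's own calls.

```python
def table_to_records(table):
    """Turn one extracted table (a list of rows) into a list of dicts.

    Assumes the first row holds the column headers. pdfplumber returns
    None for empty cells, so those are normalised to empty strings.
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):  # skip rows that are entirely empty
            records.append(dict(zip(header, cells)))
    return records


def extract_table_records(pdf_path):
    """Collect records from every table on every page of a PDF."""
    import pdfplumber  # imported lazily; requires: pip install pdfplumber

    records = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                records.extend(table_to_records(table))
    return records
```

Keeping the row clean-up separate from the file handling makes it easy to test the clean-up on hand-written rows before pointing it at a real document.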
camelot-py
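pip install "camelot-py[cv]"
Specifically designed for extracting tables from PDFs. Works with both bordered tables (its lattice mode) and borderless tables (its stream mode). Typically much better than general-purpose text extraction for tabular data. Requires additional dependencies (Ghostscript, OpenCV). The quotes around the package name keep shells such as zsh from interpreting the square brackets.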
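One way to handle a mix of bordered and borderless tables is to try camelot's lattice flavor first and fall back to stream when nothing is found. This is only a sketch of that idea: read_tables_with_fallback() is a hypothetical helper, and the read_pdf parameter exists purely so the function can be exercised without camelot installed. The camelot.read_pdf() call and its pages and flavor arguments are the library's own.

```python
def read_tables_with_fallback(pdf_path, pages="1", read_pdf=None):
    """Try the lattice flavor first, then stream; return (flavor, tables).

    Returns (None, []) when neither flavor finds a table.
    """
    if read_pdf is None:
        import camelot  # imported lazily; requires: pip install "camelot-py[cv]"

        read_pdf = camelot.read_pdf

    for flavor in ("lattice", "stream"):
        tables = read_pdf(str(pdf_path), pages=pages, flavor=flavor)
        if len(tables) > 0:
            return flavor, tables
    return None, []
```

With camelot itself, each table in the returned list exposes its contents as a pandas DataFrame through the .df attribute.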
tabula-py
pip install tabula-py
Python wrapper for Tabula, a Java-based PDF table extractor. Requires Java to be installed. Excellent for well-structured tables in PDFs generated from databases or billing systems.
Advanced Topics
Tesseract OCR (for scanned PDFs)
https://github.com/tesseract-ocr/tesseract
When PDFs are scanned images with no text layer, you need OCR. Tesseract is the leading open-source OCR engine. The Python wrapper pytesseract (pip install pytesseract) makes it straightforward to apply OCR to PDF page images; it calls the Tesseract executable, which must be installed separately. The workflow: render PDF pages to images with pdf2image (pip install pdf2image, which in turn needs Poppler installed), run OCR on each image, then process the resulting text.
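The render-then-recognize workflow can be sketched in a few lines. The function name ocr_pdf() and its render and recognize parameters are assumptions made for illustration (they allow the steps to be swapped out or tested without the OCR stack installed); convert_from_path() and image_to_string() are the pdf2image and pytesseract calls they default to.

```python
def ocr_pdf(pdf_path, dpi=300, render=None, recognize=None):
    """Return one string of recognized text per page of a scanned PDF."""
    if render is None:
        # Requires: pip install pdf2image, plus Poppler on the system.
        from pdf2image import convert_from_path

        render = convert_from_path
    if recognize is None:
        # Requires: pip install pytesseract, plus the Tesseract executable.
        import pytesseract

        recognize = pytesseract.image_to_string

    # Step 1: render each PDF page to an image.
    images = render(str(pdf_path), dpi=dpi)
    # Step 2: run OCR on each image; step 3 (processing) is left to the caller.
    return [recognize(image).strip() for image in images]
```

A higher dpi generally gives the OCR engine more detail to work with at the cost of slower rendering, so the value of 300 here is a starting point rather than a recommendation.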
python-docx Style Reference
https://python-docx.readthedocs.io/en/latest/user/styles-understanding.html
Understanding Word styles is essential for matching corporate document templates. This guide explains the difference between paragraph styles, character styles, and table styles, and how to inspect what styles are available in an existing document.
Adobe PDF Reference (for the curious)
https://opensource.adobe.com/dc-acrobat-sdk-docs/
The technical specification for the PDF format. Explains exactly why text extraction is lossy: PDF content streams are drawing instructions, not structured documents. Understanding the format makes the limitations of pypdf and similar libraries much more intuitive.
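To make the "drawing instructions" point concrete, the toy example below parses a hand-written, heavily simplified content stream. Real streams are usually compressed and use font-specific encodings, kerning arrays, and relative positioning, none of which is handled here; SAMPLE_STREAM, text_fragments(), and reading_order() are invented for this illustration. Tm (set text position) and Tj (show text) are genuine PDF operators.

```python
import re

# A simplified content stream: each fragment is placed at absolute x, y
# coordinates, and the drawing order does not match the reading order.
SAMPLE_STREAM = """\
BT
/F1 12 Tf
1 0 0 1 300 700 Tm
(Total) Tj
1 0 0 1 72 700 Tm
(Invoice) Tj
1 0 0 1 140 700 Tm
(#1042) Tj
ET
"""

# Matches "1 0 0 1 x y Tm" followed by "(text) Tj".
FRAGMENT = re.compile(r"1 0 0 1 (\d+) (\d+) Tm\s+\((.*?)\) Tj")


def text_fragments(stream):
    """Return (x, y, text) tuples in the order they are drawn."""
    return [(int(x), int(y), text) for x, y, text in FRAGMENT.findall(stream)]


def reading_order(fragments):
    """Guess reading order: top of page first, then left to right."""
    ordered = sorted(fragments, key=lambda frag: (-frag[1], frag[0]))
    return " ".join(text for _, _, text in ordered)


fragments = text_fragments(SAMPLE_STREAM)
print([text for _, _, text in fragments])  # draw order
print(reading_order(fragments))            # reconstructed order
```

The stream never says that these three fragments form one line or where the spaces go; an extractor has to infer both from coordinates, which is why output quality varies so much between documents.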
When to Use Each Tool
| Need | Tool |
|---|---|
| Read text from an existing PDF | pypdf |
| Extract tables from PDFs | pdfplumber, camelot-py, or tabula-py |
| OCR a scanned PDF | pytesseract + pdf2image |
| Merge or split PDFs | pypdf |
| Create a Word document | python-docx |
| Fill a Word template | python-docx |
| Read a Word document | python-docx |
| Create a PDF from scratch (simple) | reportlab canvas API |
| Create a PDF from scratch (complex) | reportlab platypus |
| Generate reports with charts to PDF | reportlab + matplotlib |