Key Takeaways — Chapter 18: Working with PDFs and Word Documents

Understanding the Landscape

  • PDFs are designed for human readers, not machine processing. They store text as positioned strings on a canvas — there is no inherent structure. Extraction is heuristic, not reliable.
  • The single most important diagnostic before writing PDF extraction code: open the file in a PDF viewer and try to select text. If you can select it, Python can likely extract it. If you cannot, the PDF is a scanned image and requires OCR (not covered by pypdf).
  • Word's .docx format is genuinely structured XML. python-docx can read, create, and modify these files reliably and predictably.
  • python-docx only works with .docx files (Office 2007 and later). It does not support the older .doc format.

Reading PDFs with pypdf

  • Install with pip install pypdf. The correct import is import pypdf (pypdf is the successor to the older PyPDF2 package, so import PyPDF2 is not the name to use).
  • Always open PDF files in binary mode: open(pdf_path, "rb").
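  • pypdf.PdfReader(file) gives you the reader object. Check reader.is_encrypted before extraction: an encrypted PDF must first be unlocked with reader.decrypt(password), otherwise extraction will fail silently or raise an error.
  • reader.pages is a list-like sequence of PageObject instances. Call .extract_text() on each page to get its text content; a page with no text layer yields an empty result.
  • reader.metadata is a dict-like object with PDF properties: /Title, /Author, /CreationDate, etc. It can be None when the file carries no metadata, so check before indexing.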
  • Use pypdf.PdfWriter for output operations. Call writer.add_page(page) to copy pages in, then writer.write(output_file) to save.
  • Splitting: one PdfWriter per output file, add one page, save. Merging: one PdfWriter, add all pages from all readers, save once.
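
The checks in this section can be combined into one helper. What follows is a minimal sketch rather than the chapter's own code: the function name extract_page_texts and the layout of the returned dictionary are illustrative assumptions. It takes an already-open pypdf.PdfReader, refuses encrypted files, tolerates missing metadata, and flags pages that yield no text, which is a common sign of a scanned page.

```python
def extract_page_texts(reader):
    """Return the title and per-page text from an open pypdf.PdfReader.

    Hypothetical helper: the name and the return layout are assumptions.
    """
    if reader.is_encrypted:
        # The caller should run reader.decrypt(password) before calling this.
        raise ValueError("PDF is encrypted; decrypt it before extracting text")

    # reader.metadata can be None when the file carries no metadata.
    metadata = reader.metadata
    title = metadata.title if metadata is not None else None

    texts = []
    empty_pages = []
    for number, page in enumerate(reader.pages, start=1):
        text = (page.extract_text() or "").strip()
        texts.append(text)
        if not text:
            # No text layer: probably a scanned image that needs OCR.
            empty_pages.append(number)

    return {"title": title, "pages": texts, "empty_pages": empty_pages}


# Usage with pypdf (assumes `pip install pypdf`; pdf_path is your own file):
#     import pypdf
#     with open(pdf_path, "rb") as f:
#         result = extract_page_texts(pypdf.PdfReader(f))
```

If empty_pages lists most of the document, the select-text diagnostic from the first section would likely have failed too, and OCR is the next step.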

Working with Extracted PDF Text

  • Extracted text is a flat string — all structural information (table structure, column relationships, header/value pairing) is lost.
  • Use regular expressions to find patterns: invoice numbers, dollar amounts, dates.
  • Build confidence levels into extraction logic: "high" (all fields found), "medium" (partial), "low" (key fields missing). Route low-confidence results to a manual review queue — never silently include uncertain data.
  • The hybrid approach (automate the majority, review the edge cases) is the professional standard for PDF data extraction in business workflows.
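
The regex-plus-confidence idea above can be sketched as follows. The three patterns, the field names, and the choice of invoice number and total as the "key" fields are assumptions made for illustration; real documents need patterns tuned to their own layout.

```python
import re
from decimal import Decimal

# Illustrative patterns only; adjust them to the documents you actually receive.
INVOICE_RE = re.compile(r"\bINV-\d{4,}\b")
TOTAL_RE = re.compile(r"\bTotal\s*:?\s*\$\s*([\d,]+\.\d{2})", re.IGNORECASE)
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_invoice_fields(text):
    """Pull three fields out of flat PDF text and grade the result."""
    invoice = INVOICE_RE.search(text)
    total = TOTAL_RE.search(text)
    date = DATE_RE.search(text)

    fields = {
        "invoice_number": invoice.group(0) if invoice else None,
        "total": Decimal(total.group(1).replace(",", "")) if total else None,
        "date": date.group(0) if date else None,
    }

    if all(value is not None for value in fields.values()):
        confidence = "high"      # every field found
    elif fields["invoice_number"] is None or fields["total"] is None:
        confidence = "low"       # a key field is missing
    else:
        confidence = "medium"    # only a non-key field is missing
    return {**fields, "confidence": confidence}


def route(results):
    """Split results into accepted records and a manual review queue."""
    accepted = [r for r in results if r["confidence"] != "low"]
    review_queue = [r for r in results if r["confidence"] == "low"]
    return accepted, review_queue
```

The \b before Total keeps the pattern from matching inside "Subtotal", the kind of edge case that makes a review queue worthwhile.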

Creating Word Documents with python-docx

  • docx.Document() creates a new blank document. docx.Document(path) opens an existing file.
  • The document is a sequence of block-level elements: paragraphs and tables.
  • doc.add_heading(text, level=N) adds a heading (0 = Title, 1 = Heading 1, etc.).
  • doc.add_paragraph(text) adds a normal paragraph.
  • doc.add_paragraph(text, style="List Bullet") or "List Number" creates list items.
  • doc.add_table(rows=N, cols=M) creates a table. Access cells via table.rows[i].cells[j], set content with cell.text = "...".
  • For inline formatting (bold, color, font size), work with Run objects: run = para.add_run(text), then run.font.bold = True, run.font.size = Pt(12), run.font.color.rgb = RGBColor(r, g, b). Pt and RGBColor are imported from docx.shared.
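
Tables are the one building block in this list without a pattern at the end of the summary, so here is a small sketch. The helper name add_records_table and the list-of-dictionaries input are assumptions; the calls on doc and table are the python-docx ones named above. It also assumes the document was created with docx.Document(), whose default template includes the "Table Grid" style.

```python
def add_records_table(doc, headers, records):
    """Append a table to a python-docx Document: one header row, one row per record.

    Hypothetical helper; `records` is assumed to be a list of dicts keyed by header.
    """
    table = doc.add_table(rows=1, cols=len(headers))
    # "Table Grid" ships with python-docx's default template; a custom
    # template may not define it.
    table.style = "Table Grid"

    # Fill the header row that add_table created.
    for cell, header in zip(table.rows[0].cells, headers):
        cell.text = header

    # Append one row per record.
    for record in records:
        cells = table.add_row().cells
        for cell, header in zip(cells, headers):
            # cell.text expects a string, so convert numbers explicitly.
            cell.text = str(record.get(header, ""))
    return table


# Usage (assumes `pip install python-docx`):
#     import docx
#     doc = docx.Document()
#     add_records_table(doc, ["Item", "Qty"], [{"Item": "Widget", "Qty": 3}])
#     doc.save("output.docx")
```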

Template-Based Document Generation

  • The placeholder replacement pattern: design a template in Word with markers like {{CLIENT_NAME}}, then have Python replace the markers with real values.
  • Use distinctive markers (double curly braces, ALL_CAPS) to avoid accidentally replacing normal words in the document.
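  • Word may split a placeholder across multiple Run objects in its internal XML, so a run-by-run search can miss it. Handle this by reconstructing the full paragraph text from its runs, replacing the placeholder, writing the result back into the first run, and clearing the remaining runs. The paragraph then takes on the first run's formatting.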
  • Replacements should happen in body paragraphs, table cells, headers, and footers — not just the body.
  • Template vs. from-scratch: Use a template when a business user needs control over layout, fonts, and visual design. Build from scratch when the document structure is purely data-driven and no designer needs to control it.
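
The split-run handling and the "not just the body" rule can be sketched together. The function names below are assumptions for illustration; they rely only on python-docx attributes (paragraphs, tables, runs, sections, header, footer, is_linked_to_previous). Headers and footers that merely inherit from the previous section are skipped, on the assumption that they define no content of their own.

```python
def replace_in_paragraph(paragraph, replacements):
    """Replace markers even when Word has split them across runs."""
    runs = paragraph.runs
    if not runs:
        return False
    original = "".join(run.text for run in runs)
    updated = original
    for marker, value in replacements.items():
        updated = updated.replace(marker, str(value))
    if updated == original:
        return False  # untouched paragraphs keep their run formatting
    # Write everything into the first run and empty the others.
    runs[0].text = updated
    for run in runs[1:]:
        run.text = ""
    return True


def iter_paragraphs(container):
    """Yield paragraphs from a body, header, footer, or cell, including nested tables."""
    yield from container.paragraphs
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                yield from iter_paragraphs(cell)


def fill_template(doc, replacements):
    """Apply replacements everywhere; return how many paragraphs changed."""
    containers = [doc]
    for section in doc.sections:
        for part in (section.header, section.footer):
            if not part.is_linked_to_previous:
                containers.append(part)
    return sum(
        replace_in_paragraph(paragraph, replacements)
        for container in containers
        for paragraph in iter_paragraphs(container)
    )
```

A typical call might open a template with docx.Document(path), run fill_template(doc, {"{{CLIENT_NAME}}": "Acme Corp"}), and save under a new name so the template itself stays untouched.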

Generating PDFs with reportlab

  • reportlab creates PDFs from Python code via a canvas API: position content at explicit (x, y) coordinates.
  • The canvas coordinate origin (0, 0) is the bottom-left corner of the page. Y increases upward.
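  • c.drawString(x, y, text) places text at a specific position, where c is a reportlab canvas.Canvas object.
  • c.setFont("Helvetica", 12) and c.setFillColorRGB(r, g, b) set text properties before drawing. The RGB components are floats from 0 to 1, not 0 to 255.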
  • c.save() writes the PDF to disk. Full coverage of reportlab's higher-level platypus API is in Chapter 36.
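
Because y grows upward from the bottom-left origin, writing lines top-down means starting near the page height and subtracting. The sketch below shows that arithmetic; the helper name draw_lines and its default margins and line spacing are assumptions, with defaults chosen for a US Letter page (612 x 792 points) and one-inch margins.

```python
def draw_lines(c, lines, page_height=792, left=72, top_margin=72, leading=16):
    """Draw text lines top-down on a reportlab canvas `c`.

    Hypothetical helper. Returns the y position where the next line would go.
    """
    c.setFont("Helvetica", 12)
    c.setFillColorRGB(0, 0, 0)  # black; components run from 0 to 1

    # First baseline, measured upward from the bottom edge of the page.
    y = page_height - top_margin
    for line in lines:
        c.drawString(left, y, line)
        y -= leading  # moving down the page means decreasing y
    return y


# Usage (assumes `pip install reportlab`; the file name is illustrative):
#     from reportlab.lib.pagesizes import letter
#     from reportlab.pdfgen import canvas
#     c = canvas.Canvas("summary.pdf", pagesize=letter)
#     draw_lines(c, ["Quarterly Summary", "Prepared by Finance"], page_height=letter[1])
#     c.save()
```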

Batch Document Generation

  • The report generation pattern scales linearly: once your generation function works for one document, generating ten or a hundred is just a loop.
  • Structure your input data as a list of dictionaries (or rows in a CSV/DataFrame). One dict produces one document.
  • Always test on a single record before running the batch. One bug in the generation function will produce the same bug in every document.
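
The loop-over-records pattern, with the single-record test built in, might look like the sketch below. Everything named here is an assumption for illustration: generate_one stands for whatever function builds one document from one dictionary, and name_key and suffix control the output file name.

```python
from pathlib import Path


def generate_batch(records, generate_one, out_dir, name_key="client", suffix=".docx"):
    """Run generate_one(record, path) for every record.

    Returns (created_paths, failures), where failures holds (record, exception) pairs.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    if not records:
        return [], []

    def build(record):
        path = out_dir / f"{record[name_key]}{suffix}"
        generate_one(record, path)
        return path

    # Test on a single record first. A bug in generate_one would repeat in
    # every document, so let the exception propagate and stop the batch.
    created = [build(records[0])]

    # For the rest, record failures and keep going so one bad row
    # does not block the others.
    failures = []
    for record in records[1:]:
        try:
            created.append(build(record))
        except Exception as exc:
            failures.append((record, exc))
    return created, failures
```

Rows read with csv.DictReader or DataFrame.to_dict("records") already have the one-dict-per-document shape this expects.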

Code Patterns to Remember

# Read a PDF
import pypdf
with open(pdf_path, "rb") as f:
    reader = pypdf.PdfReader(f)
    # Keep every page's text; fall back to "" when a page yields nothing
    page_texts = [page.extract_text() or "" for page in reader.pages]

# Merge PDFs
writer = pypdf.PdfWriter()
for pdf_path in pdf_paths:
    with open(pdf_path, "rb") as f:
        reader = pypdf.PdfReader(f)
        for page in reader.pages:
            writer.add_page(page)
with open(output_path, "wb") as f:
    writer.write(f)

# Create a Word document
import docx
doc = docx.Document()
doc.add_heading("Title", level=0)
doc.add_heading("Section 1", level=1)
doc.add_paragraph("Body text here.")
doc.add_paragraph("Bullet point", style="List Bullet")
doc.save("output.docx")

# Format a run
from docx.shared import Pt, RGBColor
para = doc.add_paragraph()
run = para.add_run("Bold colored text")
run.font.bold = True
run.font.size = Pt(14)
run.font.color.rgb = RGBColor(0x00, 0x35, 0x6B)