Chapter 18: Working with PDFs and Word Documents

"A document is not a document when it is a PDF sent by a vendor who generates it from a system you cannot access, containing data you need in a format you cannot parse." — Every analyst who has ever manually re-typed a vendor invoice


Opening Scenario: The Invoice Reconciliation Problem

Sandra Chen drops a folder of 47 PDF files on Priya's desk — metaphorically, through a shared drive. They are vendor invoices from the previous quarter. Finance needs the totals reconciled against the purchase order system before end of day.

Priya opens the first PDF, reads the total at the bottom, types it into a spreadsheet, and closes it. She opens the second one. It is from a different vendor and the total is in a different location on the page. She finds it, types it in. Third one. Different format again.

At this rate, 47 invoices will take 90 minutes. There will be typos. Finance will push back on at least two entries because the handwritten annotations on three invoices obscure the totals.

Meanwhile, Sandra has asked Priya to generate branded Word reports for the Q4 regional sales review. The report has a cover page, an executive summary, four tables (one per region), and a standardized conclusion. There are eight regions. Generating eight nearly-identical documents manually will take most of a day.

Priya has a better idea for both of these.


18.1 The Landscape: PDFs, Word Files, and What Python Can Do

Before writing any code, it is worth being precise about what is possible and what is not.

PDFs: A Format Designed for Humans, Not Machines

PDF stands for Portable Document Format. It was designed in 1993 with one goal: make a document look exactly the same on every printer and screen, regardless of what software created it. It succeeded brilliantly at that goal. As a consequence, it is a deeply unfriendly format for automated data extraction.

A PDF file does not contain structured text the way an HTML file or a Word document does. It contains a series of rendering instructions: "draw the string 'Invoice Total:' at position (x=412, y=218), in 10-point Helvetica." The text is there, but it is positioned on a canvas. There is no structural relationship — no concept of "this value belongs to this label" — unless you build that understanding yourself from the positions.
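To make "build that understanding yourself from the positions" concrete, here is a minimal sketch in plain Python. It assumes you already have text fragments with coordinates. The `Fragment` tuple, the sample values, and the `y_tolerance` parameter are illustrative inventions, not the output format of any particular library.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """One positioned piece of text, as a PDF renderer would draw it."""
    x: float
    y: float
    text: str


def group_into_lines(fragments: list[Fragment], y_tolerance: float = 2.0) -> list[str]:
    """
    Rebuild reading-order lines from positioned fragments.

    Fragments whose y values differ by at most y_tolerance are treated as
    one line. Lines are returned top to bottom, assuming a larger y is
    higher on the page (the usual PDF convention).
    """
    lines: list[list[Fragment]] = []

    # Walk the fragments from the top of the page down
    for fragment in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - fragment.y) <= y_tolerance:
            lines[-1].append(fragment)
        else:
            lines.append([fragment])

    # Within each line, order the fragments left to right
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    ]


# Hypothetical fragments, deliberately out of reading order
sample = [
    Fragment(412, 218, "Invoice Total:"),
    Fragment(72, 700, "Acme Supplies"),
    Fragment(520, 218.5, "$1,234.56"),
    Fragment(72, 680, "Invoice #"),
    Fragment(130, 680, "10042"),
]

for line in group_into_lines(sample):
    print(line)
```

Only after this grouping step does "Invoice Total:" sit next to its value. Real documents need more care (columns, rotated text, varying font sizes), which is why positional extraction is fragile.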

When PDF text extraction works well:

- Simple, consistently-formatted documents (pay stubs, bank statements, invoices from modern billing systems)
- Documents with mostly text content and simple layouts
- PDFs generated directly from software (Word exports, accounting system outputs)

When PDF text extraction does not work or works poorly:

- Scanned documents (images masquerading as PDFs: the text is a picture of text, not actual text)
- Complex multi-column layouts where text order breaks when extracted linearly
- Tables where the relationship between cells is positional, not structural
- PDFs with significant graphics, watermarks, or heavy formatting
- Password-protected or rights-managed PDFs

The single most important diagnostic: open the PDF, try to select and copy some text. If you can select text, Python can likely extract it. If you cannot select text (or if you select what looks like text and get garbage), the document is a scanned image and you need OCR (optical character recognition), which is beyond this chapter's scope.
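The same select-and-copy check can be roughly automated over a batch. The sketch below assumes you already hold one string of extracted text per page. The function name, the labels it returns, and the 20-character threshold are arbitrary choices for illustration, so tune them against your own documents.

```python
def classify_extraction(page_texts: list[str], min_chars_per_page: int = 20) -> str:
    """
    Classify extraction output as "text", "mixed", or "likely_scanned".

    A page counts as readable when it contains at least
    min_chars_per_page letters or digits.
    """
    if not page_texts:
        return "likely_scanned"

    readable_pages = sum(
        1
        for text in page_texts
        if sum(ch.isalnum() for ch in text) >= min_chars_per_page
    )

    if readable_pages == len(page_texts):
        return "text"
    if readable_pages == 0:
        return "likely_scanned"
    return "mixed"


print(classify_extraction(["Invoice 10042  Total due: $1,234.56  Net 30 days"]))
print(classify_extraction(["", "  \n"]))
```

A "mixed" result usually deserves a manual look: it often means a text PDF with a scanned attachment appended.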

Word Documents: Structured and Python-Friendly

Microsoft Word's .docx format (introduced in Office 2007) is actually a ZIP archive containing XML files. Python's python-docx library reads and writes that XML through a clean API. Word documents have genuine structure: paragraphs, headings, tables, runs, styles. This makes them far more amenable to automation than PDFs.
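You can see the "ZIP archive containing XML" claim for yourself with nothing but the standard library. The sketch below builds a toy archive in memory that holds only a `word/document.xml` part, then reads the paragraph text back out. The toy archive is an assumption made for the demonstration: a real .docx contains additional parts (content types, relationships, styles), so this one will not open in Word. The helper name is also illustrative.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by WordprocessingML elements such as w:p, w:r, and w:t
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Q4 Regional Sales Review</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Revenue grew by </w:t></w:r><w:r><w:t>4.2%</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

# Build a toy archive in memory (NOT a complete, openable .docx)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", DOCUMENT_XML)


def paragraphs_from_docx_bytes(data: bytes) -> list[str]:
    """Return the text of each w:p paragraph, joining the text of its runs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        texts = [node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(texts))
    return paragraphs


print(paragraphs_from_docx_bytes(buffer.getvalue()))
```

Notice that the second paragraph is stored as two runs. python-docx hides most of this XML from you, but the run structure resurfaces later when replacing placeholder text.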

What python-docx can do well:

- Read all text content from a document
- Create documents from scratch with headings, paragraphs, tables
- Apply styles (Bold, Heading 1, Table Style, etc.)
- Replace placeholder text throughout a document
- Add images
- Control fonts, sizes, colors, alignment

What python-docx cannot do well:

- Preserve complex layouts from existing documents exactly
- Work with .doc (pre-2007 Word format) files; only .docx is supported
- Render documents (you cannot generate a PDF from a docx through python-docx alone)
- Handle all the edge cases of macro-enabled documents (.docm)


18.2 Reading PDFs with pypdf

pypdf (the successor to PyPDF2, which is now deprecated) is a widely used pure-Python library for PDF operations. It is a third-party package, not part of Python's standard library, and it needs no external tools.

pip install pypdf

Basic Text Extraction

from pathlib import Path
import pypdf


def extract_text_from_pdf(pdf_path: Path) -> str:
    """
    Extract all text from a PDF file as a single string.

    Returns an empty string if the PDF has no extractable text.
    """
    full_text = []

    with open(pdf_path, "rb") as pdf_file:
        reader = pypdf.PdfReader(pdf_file)

        # Iterate over all pages
        for page_number, page in enumerate(reader.pages, start=1):
            page_text = page.extract_text()

            if page_text:
                full_text.append(f"--- Page {page_number} ---\n{page_text}")

    return "\n\n".join(full_text)


# Use it
invoice_path = Path("/data/invoices/vendor_invoice_q4.pdf")
text = extract_text_from_pdf(invoice_path)
print(text[:500])  # Preview the first 500 characters

Reading PDF Metadata

PDFs carry metadata about the document — author, creation date, title, software used to create it. This is often useful for logging and validation.

from pathlib import Path
import pypdf


def get_pdf_metadata(pdf_path: Path) -> dict:
    """
    Extract metadata from a PDF file.

    Returns a dict with keys: title, author, creator, producer,
    creation_date, modification_date, page_count, is_encrypted,
    file_path, file_size_kb.
    """
    with open(pdf_path, "rb") as pdf_file:
        reader = pypdf.PdfReader(pdf_file)
        # reader.metadata is None when the PDF has no document-info dictionary
        metadata = reader.metadata or {}

        return {
            "title": metadata.get("/Title", ""),
            "author": metadata.get("/Author", ""),
            "creator": metadata.get("/Creator", ""),
            "producer": metadata.get("/Producer", ""),
            "creation_date": metadata.get("/CreationDate", ""),
            "modification_date": metadata.get("/ModDate", ""),
            "page_count": len(reader.pages),
            "is_encrypted": reader.is_encrypted,
            "file_path": str(pdf_path),
            "file_size_kb": round(pdf_path.stat().st_size / 1024, 1),
        }


metadata = get_pdf_metadata(Path("/data/invoices/vendor_invoice_q4.pdf"))
print(f"Title:    {metadata['title']}")
print(f"Pages:    {metadata['page_count']}")
print(f"Created:  {metadata['creation_date']}")
print(f"Size:     {metadata['file_size_kb']} KB")

Extracting Text Page by Page

For business documents, you often need text from specific pages rather than the whole document:

from pathlib import Path
import pypdf


def extract_page_text(pdf_path: Path, page_number: int) -> str:
    """
    Extract text from a specific page (1-indexed).

    Args:
        pdf_path: Path to the PDF.
        page_number: Page to extract (starts at 1).

    Returns:
        Text content of the page, or empty string if page doesn't exist.
    """
    with open(pdf_path, "rb") as pdf_file:
        reader = pypdf.PdfReader(pdf_file)

        # Convert 1-indexed page number to 0-indexed list position
        page_index = page_number - 1

        if page_index < 0 or page_index >= len(reader.pages):
            return ""

        return reader.pages[page_index].extract_text() or ""


def extract_all_pages(pdf_path: Path) -> list[tuple[int, str]]:
    """
    Extract text from every page, returning a list of (page_number, text) tuples.
    """
    pages = []

    with open(pdf_path, "rb") as pdf_file:
        reader = pypdf.PdfReader(pdf_file)

        for page_number, page in enumerate(reader.pages, start=1):
            text = page.extract_text() or ""
            pages.append((page_number, text))

    return pages
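Once you have the `(page_number, text)` tuples, locating the page you care about is a plain list operation. The helper below is a small sketch that works on that tuple shape. The function name and the sample pages are invented for illustration.

```python
def find_pages_containing(pages: list[tuple[int, str]], keyword: str) -> list[int]:
    """Return the 1-indexed page numbers whose text contains keyword (case-insensitive)."""
    needle = keyword.casefold()
    return [number for number, text in pages if needle in text.casefold()]


# Hypothetical extraction result in the (page_number, text) shape
sample_pages = [
    (1, "Acme Supplies\nInvoice 10042"),
    (2, "Line items continued"),
    (3, "Subtotal $50.00\nTOTAL DUE $54.10"),
]

print(find_pages_containing(sample_pages, "total due"))
```

For long documents this lets you run the more expensive pattern matching on one or two pages instead of the whole file.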

Finding Patterns in PDF Text

Once you have extracted text, you can search it for business-relevant patterns using regular expressions:

import re
from pathlib import Path
import pypdf


def find_dollar_amounts(text: str) -> list[float]:
    """
    Find all dollar amounts in extracted PDF text.

    Matches patterns like: $1,234.56  $1234.56  $1,234  $0.99
    """
    # Pattern: a required $, then digits with optional commas, optional .cents
    # (requiring the $ keeps dates, quantities, and invoice numbers out of the results)
    pattern = re.compile(r"\$\s*(\d[\d,]*(?:\.\d{2})?)")
    matches = pattern.findall(text)

    amounts = []
    for match in matches:
        # Remove commas before converting to float
        cleaned = match.replace(",", "")
        try:
            amount = float(cleaned)
            if amount > 0:  # Filter out zeros and noise
                amounts.append(amount)
        except ValueError:
            continue

    return amounts


def find_invoice_total(text: str) -> float | None:
    """
    Attempt to find the invoice total in extracted PDF text.

    Looks for lines containing "total" (case-insensitive) followed by
    or near a dollar amount. Returns the amount found, or None.

    This is a heuristic — it works for consistently-formatted invoices.
    Always verify the results against a sample of actual documents.
    """
    lines = text.split("\n")

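    for line in lines:
        lowered = line.lower()
        # Skip "Subtotal" / "Sub-total" lines, which also contain "total"
        if "subtotal" in lowered or "sub-total" in lowered:
            continue
        if "total" in lowered:
            # Look for a dollar amount on this line
            amounts = find_dollar_amounts(line)
            if amounts:
                # Take the largest amount on the "total" line
                # (avoids picking up a smaller tax figure printed on the same line)
                return max(amounts)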

    return None


# Use it
text = extract_text_from_pdf(Path("/data/invoices/vendor_q4.pdf"))
total = find_invoice_total(text)

if total is not None:
    print(f"Invoice total: ${total:,.2f}")
else:
    print("Could not automatically extract total — manual review required")

Splitting and Merging PDFs

from pathlib import Path
import pypdf


def split_pdf_by_page(
    source_pdf: Path,
    output_dir: Path,
    base_name: str | None = None,
) -> list[Path]:
    """
    Split a multi-page PDF into individual single-page PDFs.

    Args:
        source_pdf: The PDF to split.
        output_dir: Where to save the page files.
        base_name: Base filename for pages (defaults to source filename stem).

    Returns:
        List of paths to created page files.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    base = base_name or source_pdf.stem
    created_files = []

    with open(source_pdf, "rb") as pdf_file:
        reader = pypdf.PdfReader(pdf_file)

        for page_number, page in enumerate(reader.pages, start=1):
            writer = pypdf.PdfWriter()
            writer.add_page(page)

            output_path = output_dir / f"{base}_page_{page_number:03d}.pdf"

            with open(output_path, "wb") as output_file:
                writer.write(output_file)

            created_files.append(output_path)

    return created_files


def merge_pdfs(
    pdf_paths: list[Path],
    output_path: Path,
) -> Path:
    """
    Merge multiple PDF files into a single PDF.

    Pages are appended in the order provided in pdf_paths.

    Returns:
        Path to the created merged PDF.
    """
    writer = pypdf.PdfWriter()

    for pdf_path in pdf_paths:
        with open(pdf_path, "rb") as pdf_file:
            reader = pypdf.PdfReader(pdf_file)
            for page in reader.pages:
                writer.add_page(page)

    output_path.parent.mkdir(parents=True, exist_ok=True)

    with open(output_path, "wb") as output_file:
        writer.write(output_file)

    return output_path


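# Example: merge the Q4 monthly statements (October to December) into a quarterly PDF
# (assumes files are named statement_2024_MM.pdf with a two-digit month)
statements_dir = Path("/data/statements")
monthly_pdfs = sorted(statements_dir.glob("statement_2024_1[0-2].pdf"))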
quarterly_pdf = Path("/data/statements/Q4_2024_complete.pdf")

merge_pdfs(monthly_pdfs, quarterly_pdf)
print(f"Merged {len(monthly_pdfs)} files into {quarterly_pdf.name}")
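Because pages are merged in the order of the list you pass in, the sort order of the file names matters. A plain `sorted()` compares names as text, which misorders files when numbers are not zero-padded. The sketch below shows the problem and one possible fix, a "natural" sort key. The `natural_key` helper and the file names are illustrative, not part of pypdf.

```python
import re
from pathlib import Path


def natural_key(path: Path) -> list:
    """
    Split a file name into text and integer chunks so that 2 sorts before 10.

    re.split with a capturing group alternates text, digits, text, ...,
    so the same positions always hold the same type and compare safely.
    """
    return [
        int(chunk) if chunk.isdigit() else chunk.lower()
        for chunk in re.split(r"(\d+)", path.name)
    ]


# Hypothetical statement files whose month numbers are not zero-padded
names = [
    Path("statement_2024_10.pdf"),
    Path("statement_2024_2.pdf"),
    Path("statement_2024_1.pdf"),
]

print(sorted(p.name for p in names))                    # text order: 1, 10, 2
print([p.name for p in sorted(names, key=natural_key)])  # numeric order: 1, 2, 10
```

If you control the file naming, zero-padding (`_01`, `_02`, ...) avoids the issue entirely and is usually the simpler choice.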

18.3 Working with Word Documents Using python-docx

python-docx is the most widely used third-party library for reading and writing .docx files. Despite the name you install it under, you import it as docx.

pip install python-docx

Reading a Word Document

from pathlib import Path
import docx


def read_document_text(doc_path: Path) -> str:
    """
    Extract all text from a Word document as a single string.

    Each paragraph is on its own line.
    """
    doc = docx.Document(str(doc_path))
    paragraphs = [para.text for para in doc.paragraphs if para.text.strip()]
    return "\n".join(paragraphs)


def read_document_structure(doc_path: Path) -> list[dict]:
    """
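    Read a Word document and return its structure as a list of elements.

    All paragraphs are listed first, followed by all tables, so the list
    does not preserve where each table sits between paragraphs.

    Each element is a dict with:
        type: "paragraph" or "table"
        style: paragraph style name (e.g., "Heading 1", "Normal"; paragraphs only)
        text: the text content (for paragraphs)
        rows: list of row data (for tables)
        row_count, column_count: table dimensions (for tables)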
    """
    doc = docx.Document(str(doc_path))
    elements = []

    for paragraph in doc.paragraphs:
        if paragraph.text.strip():
            elements.append({
                "type": "paragraph",
                "style": paragraph.style.name,
                "text": paragraph.text,
            })

    for table in doc.tables:
        table_data = []
        for row in table.rows:
            row_data = [cell.text.strip() for cell in row.cells]
            table_data.append(row_data)

        elements.append({
            "type": "table",
            "rows": table_data,
            "row_count": len(table.rows),
            "column_count": len(table.columns),
        })

    return elements
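One practical use of the element list is building a quick outline of a document from its heading styles. The sketch below works on dicts of the shape described above. The `build_outline` name and the sample elements are invented for the example, and it assumes the document uses Word's built-in "Heading N" style names.

```python
def build_outline(elements: list[dict]) -> list[str]:
    """Return heading texts, indented two spaces per heading level below 1."""
    outline = []
    for element in elements:
        style = element.get("style", "")
        if element["type"] != "paragraph" or not style.startswith("Heading "):
            continue

        level_text = style.removeprefix("Heading ")
        if level_text.isdigit():
            indent = "  " * (int(level_text) - 1)
            outline.append(f"{indent}{element['text']}")
    return outline


# Hypothetical elements in the shape returned by read_document_structure
sample_elements = [
    {"type": "paragraph", "style": "Heading 1", "text": "Executive Summary"},
    {"type": "paragraph", "style": "Normal", "text": "Revenue exceeded target."},
    {"type": "paragraph", "style": "Heading 1", "text": "Regional Performance"},
    {"type": "paragraph", "style": "Heading 2", "text": "Chicago Region"},
    {"type": "table", "rows": [["Product", "Revenue"]], "row_count": 1, "column_count": 2},
]

print("\n".join(build_outline(sample_elements)))
```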

Creating a Word Document from Scratch

The real power of python-docx is generating documents programmatically from data. This is the report generation pattern.

from pathlib import Path
import docx
from docx.shared import Inches, Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.enum.table import WD_TABLE_ALIGNMENT


def create_sales_report(
    report_data: dict,
    output_path: Path,
) -> Path:
    """
    Generate a formatted Word sales report from structured data.

    Args:
        report_data: dict with keys:
            title, period, prepared_by, region, summary_text,
            sales_table (list of row dicts), notes
        output_path: Where to save the .docx file.

    Returns:
        Path to the created document.
    """
    doc = docx.Document()

    # ── PAGE SETUP ────────────────────────────────────────────────────────────
    # Set margins (Inches)
    for section in doc.sections:
        section.top_margin = Inches(1.0)
        section.bottom_margin = Inches(1.0)
        section.left_margin = Inches(1.25)
        section.right_margin = Inches(1.25)

    # ── COVER HEADING ─────────────────────────────────────────────────────────
    title_para = doc.add_heading(report_data["title"], level=0)
    title_para.alignment = WD_ALIGN_PARAGRAPH.CENTER

    doc.add_paragraph()  # Spacer

    # Subtitle (Region + Period)
    subtitle = doc.add_paragraph()
    subtitle.alignment = WD_ALIGN_PARAGRAPH.CENTER
    subtitle_run = subtitle.add_run(
        f"{report_data['region']} Region  |  {report_data['period']}"
    )
    subtitle_run.font.size = Pt(14)
    subtitle_run.font.color.rgb = RGBColor(0x44, 0x44, 0x44)

    # Prepared by
    prepared = doc.add_paragraph()
    prepared.alignment = WD_ALIGN_PARAGRAPH.CENTER
    prepared_run = prepared.add_run(f"Prepared by: {report_data['prepared_by']}")
    prepared_run.font.size = Pt(11)
    prepared_run.font.italic = True

    doc.add_paragraph()  # Spacer

    # ── EXECUTIVE SUMMARY ─────────────────────────────────────────────────────
    doc.add_heading("Executive Summary", level=1)
    doc.add_paragraph(report_data["summary_text"])

    doc.add_paragraph()  # Spacer

    # ── SALES TABLE ───────────────────────────────────────────────────────────
    doc.add_heading("Sales Performance by Product", level=1)

    sales_rows = report_data["sales_table"]
    if sales_rows:
        # Create table with header + data rows
        header_keys = list(sales_rows[0].keys())
        table = doc.add_table(rows=1 + len(sales_rows), cols=len(header_keys))
        table.style = "Light Shading Accent 1"
        table.alignment = WD_TABLE_ALIGNMENT.CENTER

        # Header row
        header_row = table.rows[0]
        for col_index, column_name in enumerate(header_keys):
            cell = header_row.cells[col_index]
            cell.text = column_name
            # Bold the header
            for run in cell.paragraphs[0].runs:
                run.font.bold = True

        # Data rows
        for row_index, row_data in enumerate(sales_rows, start=1):
            table_row = table.rows[row_index]
            for col_index, key in enumerate(header_keys):
                table_row.cells[col_index].text = str(row_data[key])

    doc.add_paragraph()  # Spacer

    # ── NOTES ─────────────────────────────────────────────────────────────────
    if report_data.get("notes"):
        doc.add_heading("Notes", level=2)
        for note in report_data["notes"]:
            doc.add_paragraph(note, style="List Bullet")

    # ── FOOTER LINE ───────────────────────────────────────────────────────────
    doc.add_paragraph()
    footer_para = doc.add_paragraph()
    footer_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
    footer_run = footer_para.add_run(
        f"Acme Corp Confidential  |  {report_data['period']}"
    )
    footer_run.font.size = Pt(9)
    footer_run.font.italic = True
    footer_run.font.color.rgb = RGBColor(0x99, 0x99, 0x99)

    # ── SAVE ──────────────────────────────────────────────────────────────────
    output_path.parent.mkdir(parents=True, exist_ok=True)
    doc.save(str(output_path))
    return output_path

Calling this function with a data dictionary:

report_data = {
    "title": "Q4 2024 Regional Sales Review",
    "period": "Q4 2024 (Oct–Dec)",
    "prepared_by": "Priya Okonkwo",
    "region": "Chicago",
    "summary_text": (
        "The Chicago region delivered Q4 revenue of $1,234,567, exceeding target "
        "by 4.2%. Office supplies led growth, with stapler category up 18% YoY. "
        "Year-end stocking orders from school districts drove a strong December. "
        "Q1 2025 pipeline is healthy with three pending enterprise contracts."
    ),
    "sales_table": [
        {"Product": "Office Supplies", "Q4 Revenue": "$487,234", "vs Target": "+8.1%"},
        {"Product": "Furniture",       "Q4 Revenue": "$312,890", "vs Target": "+1.2%"},
        {"Product": "Technology",      "Q4 Revenue": "$289,120", "vs Target": "-2.4%"},
        {"Product": "Cleaning",        "Q4 Revenue": "$145,323", "vs Target": "+6.7%"},
    ],
    "notes": [
        "Technology shortfall due to delayed shipment resolved in early Q1.",
        "See Appendix A for complete product-level breakdowns.",
    ],
}

output_file = Path("/data/reports/q4_chicago_report.docx")
created = create_sales_report(report_data, output_file)
print(f"Report created: {created}")
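Because `create_sales_report` reads keys straight out of the dictionary, a missing key or an inconsistent table row surfaces as a `KeyError` partway through building the document. A small check before generation can report every problem at once. The sketch below is one possible shape for that check. The `REQUIRED_KEYS` set mirrors the keys listed in the function's docstring, and the function name and messages are assumptions for this example.

```python
# Keys that create_sales_report indexes directly ("notes" is optional)
REQUIRED_KEYS = {"title", "period", "prepared_by", "region", "summary_text", "sales_table"}


def validate_report_data(report_data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the data looks usable."""
    problems = []

    missing = sorted(REQUIRED_KEYS - set(report_data))
    if missing:
        problems.append(f"missing keys: {', '.join(missing)}")

    # The table header comes from the first row, so every row must share its columns
    rows = report_data.get("sales_table") or []
    if rows:
        expected_columns = list(rows[0].keys())
        for row_number, row in enumerate(rows[1:], start=2):
            if list(row.keys()) != expected_columns:
                problems.append(
                    f"sales_table row {row_number} has different columns than row 1"
                )

    return problems


incomplete = {
    "title": "Q4 2024 Regional Sales Review",
    "sales_table": [
        {"Product": "Furniture", "Q4 Revenue": "$312,890"},
        {"Product": "Technology"},
    ],
}

for problem in validate_report_data(incomplete):
    print(problem)
```

When generating eight regional reports in a loop, running this check first means one bad region is reported clearly instead of stopping the batch halfway.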

18.4 Word Templates: Finding and Replacing Placeholders

Rather than building a document from scratch, you often want to fill in a pre-designed template — a Word document with placeholders like {{CLIENT_NAME}} or {{PROJECT_DATE}} that Python replaces with real values.

This is the pattern for proposals, contracts, offer letters, and any other document where the structure is designed by a human but the content varies.

The Placeholder Replacement Pattern

from pathlib import Path
import docx


def replace_placeholders_in_paragraph(paragraph, replacements: dict) -> None:
    """
    Replace placeholder text in a paragraph's runs.

    Handles the case where a placeholder like {{CLIENT_NAME}} may be split
    across multiple runs by Word's XML structure.

    Args:
        paragraph: A python-docx Paragraph object.
        replacements: Dict mapping placeholder to replacement value.
                      e.g., {"{{CLIENT_NAME}}": "Acme Corp"}
    """
    for placeholder, value in replacements.items():
        # First, check if the whole placeholder is in the paragraph text
        if placeholder not in paragraph.text:
            continue

        # The placeholder might be split across runs — reassemble and replace
        # Strategy: combine all run text, replace, then put it back in run 0
        full_text = "".join(run.text for run in paragraph.runs)

        if placeholder in full_text:
            new_text = full_text.replace(placeholder, str(value))
            # Put the replaced text into the first run, clear the rest
            if paragraph.runs:
                paragraph.runs[0].text = new_text
                for run in paragraph.runs[1:]:
                    run.text = ""


def fill_word_template(
    template_path: Path,
    output_path: Path,
    replacements: dict,
) -> Path:
    """
    Fill a Word template by replacing all placeholders with values.

    Placeholders in the template should use a distinctive format like
    {{PLACEHOLDER_NAME}} to avoid accidentally replacing normal text.

    Args:
        template_path: Path to the .docx template file.
        output_path: Path for the filled output document.
        replacements: Dict mapping each placeholder to its replacement value.

    Returns:
        Path to the filled document.
    """
    doc = docx.Document(str(template_path))

    # Replace in body paragraphs
    for paragraph in doc.paragraphs:
        replace_placeholders_in_paragraph(paragraph, replacements)

    # Replace in table cells
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    replace_placeholders_in_paragraph(paragraph, replacements)

    # Replace in headers and footers
    for section in doc.sections:
        for paragraph in section.header.paragraphs:
            replace_placeholders_in_paragraph(paragraph, replacements)
        for paragraph in section.footer.paragraphs:
            replace_placeholders_in_paragraph(paragraph, replacements)

    output_path.parent.mkdir(parents=True, exist_ok=True)
    doc.save(str(output_path))
    return output_path
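To see why the helper joins the runs before replacing, it helps to model a paragraph's runs as a plain list of strings. The sketch below does exactly that. It uses no python-docx objects, and both function names and the sample runs are made up for the demonstration.

```python
def replace_per_run(runs: list[str], placeholder: str, value: str) -> list[str]:
    """Naive approach: replace inside each run independently."""
    return [run.replace(placeholder, value) for run in runs]


def replace_joined(runs: list[str], placeholder: str, value: str) -> list[str]:
    """Join all run text, replace once, and put the result in the first run."""
    if not runs:
        return []
    joined = "".join(runs).replace(placeholder, value)
    return [joined] + [""] * (len(runs) - 1)


# Word may store "{{CLIENT_NAME}}" as two runs, for example after a spell-check edit
runs = ["Dear {{CLIENT", "_NAME}}", ", welcome."]

print(replace_per_run(runs, "{{CLIENT_NAME}}", "Acme Corp"))  # placeholder survives
print(replace_joined(runs, "{{CLIENT_NAME}}", "Acme Corp"))   # placeholder replaced
```

The trade-off is formatting: once everything lives in the first run, the whole paragraph takes that run's font, size, and emphasis. For templates where a placeholder paragraph mixes bold and plain text, keep the placeholder in its own uniformly formatted paragraph.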

Using a Template for Batch Document Generation

from pathlib import Path


def generate_regional_reports(
    template_path: Path,
    regions_data: list[dict],
    output_dir: Path,
) -> list[Path]:
    """
    Generate one filled report per region using a shared template.

    Each dict in regions_data must contain all placeholders needed by the template.

    Returns:
        List of paths to created documents.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    created_files = []

    for region_data in regions_data:
        region_name = region_data.get("{{REGION}}", "unknown").lower().replace(" ", "_")
        output_filename = f"q4_report_{region_name}.docx"
        output_path = output_dir / output_filename

        fill_word_template(template_path, output_path, region_data)
        created_files.append(output_path)
        print(f"  Generated: {output_filename}")

    return created_files


# Example data — each dict fills one report
regions_data = [
    {
        "{{REGION}}": "Chicago",
        "{{PERIOD}}": "Q4 2024",
        "{{REVENUE}}": "$1,234,567",
        "{{VARIANCE}}": "+4.2%",
        "{{PREPARED_BY}}": "Priya Okonkwo",
    },
    {
        "{{REGION}}": "Nashville",
        "{{PERIOD}}": "Q4 2024",
        "{{REVENUE}}": "$987,432",
        "{{VARIANCE}}": "-1.3%",
        "{{PREPARED_BY}}": "Priya Okonkwo",
    },
    # ... additional regions
]

created = generate_regional_reports(
    template_path=Path("/templates/regional_report_template.docx"),
    regions_data=regions_data,
    output_dir=Path("/data/reports/q4_2024"),
)
print(f"\nGenerated {len(created)} reports.")
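A batch run can finish without errors and still leave `{{PLACEHOLDER}}` text in the output, for example when a region's dictionary lacks a key or the template contains a typo. The sketch below is one way to catch that using plain text and a regular expression. The pattern assumes placeholders are upper-case letters, digits, and underscores inside double braces, matching the examples in this section. The function names are illustrative.

```python
import re

# Assumed placeholder convention: {{UPPER_CASE_NAME}}
PLACEHOLDER_PATTERN = re.compile(r"\{\{[A-Z0-9_]+\}\}")


def find_unfilled_placeholders(text: str) -> list[str]:
    """Return the distinct placeholders still present in text, sorted."""
    return sorted(set(PLACEHOLDER_PATTERN.findall(text)))


def missing_replacements(template_text: str, replacements: dict) -> list[str]:
    """Return placeholders used by the template that have no replacement value."""
    needed = set(PLACEHOLDER_PATTERN.findall(template_text))
    return sorted(needed - set(replacements))


template_text = "{{REGION}} Region, {{PERIOD}}: revenue {{REVENUE}} ({{VARIANCE}})"
region_data = {"{{REGION}}": "Chicago", "{{PERIOD}}": "Q4 2024", "{{REVENUE}}": "$1,234,567"}

print(missing_replacements(template_text, region_data))
print(find_unfilled_placeholders("Chicago Region, Q4 2024: revenue $1,234,567 ({{VARIANCE}})"))
```

Running the first check on the template text before the batch, and the second on each generated document's text afterwards, covers both directions.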

18.5 Generating Structured Documents: Headings, Tables, and Formatting

Here are the key python-docx patterns you will use repeatedly.

Adding Headings

import docx

doc = docx.Document()

# Heading levels 0–9
# Level 0 is the "Title" style (document title)
# Levels 1–9 are "Heading 1" through "Heading 9"
doc.add_heading("Annual Sales Report 2024", level=0)
doc.add_heading("Executive Summary", level=1)
doc.add_heading("Regional Performance", level=1)
doc.add_heading("Chicago Region", level=2)
doc.add_heading("Q4 Detail", level=3)

Adding Paragraphs with Formatting

import docx
from docx.shared import Pt, RGBColor

doc = docx.Document()

# Simple paragraph
doc.add_paragraph("This is a normal paragraph.")

# Paragraph with inline formatting using runs
para = doc.add_paragraph()
para.add_run("Revenue grew by ").font.size = Pt(12)
bold_run = para.add_run("4.2%")
bold_run.font.bold = True
bold_run.font.size = Pt(12)
bold_run.font.color.rgb = RGBColor(0x00, 0x70, 0xC0)  # Blue
para.add_run(" compared to the same period last year.")

# Bullet list
doc.add_paragraph("First bullet point", style="List Bullet")
doc.add_paragraph("Second bullet point", style="List Bullet")

# Numbered list
doc.add_paragraph("First step", style="List Number")
doc.add_paragraph("Second step", style="List Number")

Adding Tables

import docx
from docx.document import Document as DocxDocument  # the class; docx.Document is a factory function
from docx.shared import Pt


def add_data_table(doc: DocxDocument, headers: list, rows: list) -> None:
    """
    Add a formatted data table to a document.

    Args:
        doc: The Document object to add the table to.
        headers: List of column header strings.
        rows: List of lists, where each inner list is one row of data.
    """
    table = doc.add_table(rows=1 + len(rows), cols=len(headers))
    table.style = "Light Grid Accent 1"

    # Header row
    header_row = table.rows[0]
    for col_index, header_text in enumerate(headers):
        cell = header_row.cells[col_index]
        cell.text = header_text
        # Make header text bold
        for paragraph in cell.paragraphs:
            for run in paragraph.runs:
                run.font.bold = True
                run.font.size = Pt(11)

    # Data rows
    for row_index, row_data in enumerate(rows, start=1):
        table_row = table.rows[row_index]
        for col_index, cell_value in enumerate(row_data):
            table_row.cells[col_index].text = str(cell_value)


# Use it
doc = docx.Document()
doc.add_heading("Q4 Regional Summary", level=1)

headers = ["Region", "Revenue", "Units Sold", "vs Target"]
rows = [
    ["Chicago",     "$1,234,567", "9,842",  "+4.2%"],
    ["Nashville",   "$987,432",   "7,891",  "-1.3%"],
    ["Cincinnati",  "$876,543",   "7,012",  "+2.1%"],
    ["St. Louis",   "$765,432",   "6,123",  "+0.8%"],
]

add_data_table(doc, headers, rows)
doc.save("/data/reports/q4_summary.docx")
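Data rarely arrives as a ready-made list of lists. It more often comes as a list of dictionaries (from `csv.DictReader`, a database query, or an API). The sketch below converts such records into the `headers` and `rows` arguments that `add_data_table` expects. The helper name and the sample records are assumptions for illustration.

```python
def records_to_table(records: list[dict]) -> tuple[list[str], list[list[str]]]:
    """
    Convert a list of dicts into (headers, rows).

    Headers come from the first record's keys. Missing values in later
    records become empty strings so every row has the same length.
    """
    if not records:
        return [], []

    headers = list(records[0].keys())
    rows = [[str(record.get(header, "")) for header in headers] for record in records]
    return headers, rows


# Hypothetical records, e.g. from csv.DictReader
records = [
    {"Region": "Chicago", "Revenue": "$1,234,567", "vs Target": "+4.2%"},
    {"Region": "Nashville", "Revenue": "$987,432"},
]

headers, rows = records_to_table(records)
print(headers)
print(rows)
```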

18.6 Combining Python Data with Document Templates: The Report Generation Pattern

The pattern that makes document automation powerful:

  1. A business user designs the report layout in Word (or defines the data structure)
  2. Python connects to the data source (CSV, database, API)
  3. Python populates the template with live data
  4. The output is a properly formatted, branded document

Here is the full pattern assembled:

"""
The Report Generation Pattern

Data Source -> Python -> Template -> Final Document
"""
import datetime
from pathlib import Path
import pandas
import docx


def load_regional_data(csv_path: Path, region: str) -> dict:
    """
    Load and aggregate data for a specific region from the Acme sales CSV.

    Returns a dict of summary statistics ready to be inserted into a report.
    """
    df = pandas.read_csv(csv_path)
    region_df = df[df["region"] == region].copy()

    if region_df.empty:
        raise ValueError(f"No data found for region: {region}")

    total_revenue = region_df["revenue"].sum()
    total_units = region_df["units_sold"].sum()
    top_product = region_df.groupby("product")["revenue"].sum().idxmax()
    avg_order = region_df["revenue"].mean()

    # Build the sales table for the document
    product_summary = (
        region_df.groupby("product")
        .agg(revenue=("revenue", "sum"), units=("units_sold", "sum"))
        .reset_index()
        .sort_values("revenue", ascending=False)
    )

    return {
        "region": region,
        "total_revenue": f"${total_revenue:,.0f}",
        "total_units": f"{total_units:,}",
        "top_product": top_product,
        "avg_order_value": f"${avg_order:,.2f}",
        "product_rows": product_summary.values.tolist(),
        "product_headers": ["Product", "Revenue", "Units Sold"],
        "report_date": datetime.date.today().strftime("%B %d, %Y"),
    }


def generate_regional_report(
    data: dict,
    template_path: Path,
    output_path: Path,
) -> Path:
    """
    Generate a filled regional report from a template and data dict.
    """
    # Build the simple placeholder replacements
    replacements = {
        "{{REGION}}": data["region"],
        "{{TOTAL_REVENUE}}": data["total_revenue"],
        "{{TOTAL_UNITS}}": data["total_units"],
        "{{TOP_PRODUCT}}": data["top_product"],
        "{{AVG_ORDER_VALUE}}": data["avg_order_value"],
        "{{REPORT_DATE}}": data["report_date"],
    }

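    # Fill text placeholders. Word may split a placeholder across several
    # runs, so join the run text, replace, and write the result back to the
    # first run (the same strategy as in Section 18.4).
    doc = docx.Document(str(template_path))
    for paragraph in doc.paragraphs:
        if not paragraph.runs:
            continue
        full_text = "".join(run.text for run in paragraph.runs)
        new_text = full_text
        for placeholder, value in replacements.items():
            new_text = new_text.replace(placeholder, str(value))
        if new_text != full_text:
            paragraph.runs[0].text = new_text
            for run in paragraph.runs[1:]:
                run.text = ""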

    # Find the product table placeholder and fill it
    # (A paragraph containing "{{PRODUCT_TABLE}}" marks where the table goes)
    for paragraph in doc.paragraphs:
        if "{{PRODUCT_TABLE}}" in paragraph.text:
            # Clear the placeholder paragraph
            paragraph.clear()
            # doc.add_table() always appends the table at the end of the body
            table = doc.add_table(
                rows=1 + len(data["product_rows"]),
                cols=len(data["product_headers"])
            )
            table.style = "Light Shading Accent 1"
            # Header
            for col, header in enumerate(data["product_headers"]):
                table.rows[0].cells[col].text = header
            # Data
            for row_i, row in enumerate(data["product_rows"], start=1):
                for col, val in enumerate(row):
                    table.rows[row_i].cells[col].text = str(val)
            # Move the table so it sits directly after the placeholder paragraph.
            # This relies on python-docx's underlying XML elements (_p and _tbl),
            # which are internal attributes rather than documented public API.
            paragraph._p.addnext(table._tbl)
            break

    output_path.parent.mkdir(parents=True, exist_ok=True)
    doc.save(str(output_path))
    return output_path
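One detail worth noticing in `load_regional_data`: the summary values are formatted as strings, but `product_rows` comes straight from the DataFrame, so the table cells would show raw numbers such as `487234.0`. The sketch below is one way to format those rows before they go into the table. It assumes each row has the `[product, revenue, units]` order used above, and the function name is invented for this example.

```python
def format_product_rows(rows: list[list]) -> list[list[str]]:
    """Format [product, revenue, units] rows as display strings for a report table."""
    formatted = []
    for product, revenue, units in rows:
        formatted.append([
            str(product),
            f"${revenue:,.0f}",   # whole dollars with thousands separators
            f"{int(units):,}",    # unit counts with thousands separators
        ])
    return formatted


# Hypothetical rows in the shape produced by product_summary.values.tolist()
raw_rows = [
    ["Office Supplies", 487234.0, 4210],
    ["Furniture", 312890.75, 1875],
]

for row in format_product_rows(raw_rows):
    print(row)
```

Keeping formatting in one small function also makes it easy to change the house style (for example, showing cents) without touching the document-building code.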

18.7 Generating PDFs with reportlab (An Introduction)

pypdf can read, split, and merge existing PDFs, but it has no tools for laying out new text, tables, or graphics on a page. For generating new PDF files from Python data, reportlab is the most established third-party library.

pip install reportlab

reportlab is significantly more complex than python-docx — it works at a lower level, placing content at specific coordinates on a page. This is a brief introduction; full coverage is in Chapter 36 (Automated Report Generation).
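Before reading canvas code, it helps to get comfortable with the coordinate arithmetic on its own. PDF pages are measured in points (72 per inch) with the origin at the bottom-left corner, so "one inch from the top" has to be converted into a y value counted up from the bottom. The sketch below does that conversion in plain Python, without importing reportlab. The helper names are made up for illustration.

```python
POINTS_PER_INCH = 72                      # 1 point = 1/72 inch
LETTER_WIDTH = 8.5 * POINTS_PER_INCH      # 612 points
LETTER_HEIGHT = 11 * POINTS_PER_INCH      # 792 points


def y_from_top(inches_from_top: float, page_height: float = LETTER_HEIGHT) -> float:
    """Convert a distance measured down from the top edge into a y coordinate."""
    return page_height - inches_from_top * POINTS_PER_INCH


def column_x_positions(
    column_count: int,
    margin: float = POINTS_PER_INCH,
    page_width: float = LETTER_WIDTH,
) -> list[float]:
    """Return the left x coordinate of each of column_count equal-width columns."""
    usable_width = page_width - 2 * margin
    column_width = usable_width / column_count
    return [margin + index * column_width for index in range(column_count)]


print(y_from_top(1.0))          # 720.0: one inch below the top of a LETTER page
print(column_x_positions(4))    # [72.0, 189.0, 306.0, 423.0]
```

The same two calculations appear inline in the canvas example: `page_height - inch` for the title position, and `x_start + col_index * col_width` for each column.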

from pathlib import Path
from reportlab.lib.pagesizes import LETTER
from reportlab.lib.units import inch
from reportlab.pdfgen import canvas as pdf_canvas


def create_simple_pdf(output_path: Path, data: dict) -> Path:
    """
    Create a simple PDF with a title and some data rows.

    This demonstrates the reportlab canvas API.
    Full report generation with tables and styles is covered in Chapter 36.

    Args:
        output_path: Where to save the PDF.
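        data: Dict with 'title' (str), an optional 'subtitle' (str), and
              optional 'headers' (list of str) with 'rows' (list of lists,
              one inner list of cell values per row).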

    Returns:
        Path to the created PDF.
    """
    output_path.parent.mkdir(parents=True, exist_ok=True)

    # Create a canvas object (the "page")
    c = pdf_canvas.Canvas(str(output_path), pagesize=LETTER)
    page_width, page_height = LETTER  # 8.5 x 11 inches in points

    # Coordinates in reportlab: (0, 0) is bottom-left corner
    # LETTER page: ~612 x 792 points

    # ── TITLE ─────────────────────────────────────────────────────────────────
    c.setFont("Helvetica-Bold", 18)
    c.drawString(inch, page_height - inch, data["title"])

    # ── SUBTITLE ──────────────────────────────────────────────────────────────
    c.setFont("Helvetica", 12)
    c.setFillColorRGB(0.4, 0.4, 0.4)
    c.drawString(inch, page_height - 1.4 * inch, data.get("subtitle", ""))

    # ── LINE SEPARATOR ────────────────────────────────────────────────────────
    c.setStrokeColorRGB(0.8, 0.8, 0.8)
    c.line(inch, page_height - 1.6 * inch, page_width - inch, page_height - 1.6 * inch)

    # ── COLUMN HEADERS ────────────────────────────────────────────────────────
    if data.get("headers") and data.get("rows"):
        c.setFillColorRGB(0, 0, 0)
        c.setFont("Helvetica-Bold", 10)
        y_position = page_height - 2.0 * inch
        x_start = inch
        col_width = (page_width - 2 * inch) / len(data["headers"])

        for col_index, header in enumerate(data["headers"]):
            c.drawString(x_start + col_index * col_width, y_position, str(header))

        # ── DATA ROWS ─────────────────────────────────────────────────────────
        c.setFont("Helvetica", 10)
        row_height = 0.25 * inch

        for row_index, row in enumerate(data["rows"]):
            y_position -= row_height

            # Alternate row shading
            if row_index % 2 == 0:
                c.setFillColorRGB(0.95, 0.95, 0.95)
                c.rect(
                    inch, y_position - 3,
                    page_width - 2 * inch, row_height,
                    fill=1, stroke=0
                )
                c.setFillColorRGB(0, 0, 0)

            for col_index, value in enumerate(row):
                c.drawString(
                    x_start + col_index * col_width,
                    y_position,
                    str(value)
                )

    # Finish the page and write the PDF file to disk
    c.save()
    return output_path


# Quick usage example
data = {
    "title": "Acme Corp — Q4 2024 Summary",
    "subtitle": "Prepared by Priya Okonkwo  |  January 10, 2025",
    "headers": ["Region", "Revenue", "Units", "Variance"],
    "rows": [
        ["Chicago",    "$1,234,567", "9,842", "+4.2%"],
        ["Nashville",  "$987,432",   "7,891", "-1.3%"],
        ["Cincinnati", "$876,543",   "7,012", "+2.1%"],
        ["St. Louis",  "$765,432",   "6,123", "+0.8%"],
    ],
}

output = Path("/data/reports/q4_summary.pdf")
create_simple_pdf(output, data)
print(f"PDF created: {output}")

For professional report generation with complex layouts, tables, charts, and page headers/footers, Chapter 36 covers reportlab in depth using its higher-level platypus API.
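
create_simple_pdf draws every row on a single page, so a long table would run off the bottom edge. The following is a minimal sketch of the page-break arithmetic, assuming the same 1-inch margins and 0.25-inch row height used above. The helper name layout_rows and its parameters are hypothetical, and it deliberately avoids importing reportlab so the logic can be checked on its own.

```python
from typing import Iterator

POINTS_PER_INCH = 72                    # same value as reportlab.lib.units.inch
LETTER_HEIGHT = 11 * POINTS_PER_INCH    # 792 points


def layout_rows(
    row_count: int,
    first_row_y: float,
    row_height: float = 0.25 * POINTS_PER_INCH,
    page_height: float = LETTER_HEIGHT,
    margin: float = POINTS_PER_INCH,
) -> Iterator[tuple[int, float]]:
    """
    Yield (page_number, y) for each data row (hypothetical helper).

    first_row_y is the y coordinate of the header line; each row is drawn
    one row_height below the previous one. When the next row would fall
    below the bottom margin, the row moves to the top of a new page.
    """
    page_number = 0
    y_position = first_row_y
    for _ in range(row_count):
        if y_position - row_height < margin:
            # Not enough room left: continue at the top margin of a new page
            page_number += 1
            y_position = page_height - margin
        y_position -= row_height
        yield page_number, y_position


# Header at 2 inches from the top, as in create_simple_pdf
for page_number, y in layout_rows(40, first_row_y=LETTER_HEIGHT - 2 * POINTS_PER_INCH):
    print(page_number, y)
```

In the drawing loop, you would call c.showPage() each time the page number changes and then set the font and fill color again, because reportlab forgets font and color settings when a new page starts.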


18.8 Extracting Data from Vendor PDFs: Practical Patterns

The invoice reconciliation scenario that opened this chapter is one of the most common real-world PDF automation tasks. Here is the complete approach.

The Problem with Vendor PDFs

Every vendor's invoice is formatted differently. Priya's 47 invoices come from 12 different vendors, each with its own template. There are three realistic ways to automate the extraction:

  1. Write a vendor-specific extractor for each format. High accuracy, but requires maintenance when vendors update their template.
  2. Write a general-purpose extractor using heuristics. Lower accuracy, works across formats, but will have failures that require manual review.
  3. Hybrid: general extractor with a manual review queue. The professional approach.
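
To make option 1 concrete, here is a minimal sketch of a vendor-specific extractor: one regular expression per vendor, selected by looking for the vendor's name in the extracted text. The vendor names, invoice layouts, and the function name extract_total_for_vendor are all hypothetical, invented for illustration.

```python
import re
from typing import Optional

# Hypothetical vendors and layouts, for illustration only.
# Each pattern matches the exact wording that one vendor's template uses.
VENDOR_TOTAL_PATTERNS = {
    "Northwind Supply": re.compile(r"Amount Due:\s*\$([\d,]+\.\d{2})"),
    "Lakeside Freight": re.compile(r"TOTAL \(USD\)\s+([\d,]+\.\d{2})"),
}


def extract_total_for_vendor(text: str) -> Optional[tuple[str, float]]:
    """
    Return (vendor_name, total) if the text belongs to a known vendor
    and that vendor's pattern matches; otherwise return None.
    """
    lowered = text.lower()
    for vendor, pattern in VENDOR_TOTAL_PATTERNS.items():
        if vendor.lower() not in lowered:
            continue  # this invoice is not from this vendor
        match = pattern.search(text)
        if match:
            return vendor, float(match.group(1).replace(",", ""))
    return None  # unknown vendor, or the vendor changed its template


sample = "Northwind Supply\nInvoice 881\nAmount Due: $1,204.50"
print(extract_total_for_vendor(sample))
```

A None result is the signal to fall back to something else. In the hybrid approach, that fallback is the general-purpose extractor plus a review queue.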

The Hybrid Approach

import re
from pathlib import Path
from dataclasses import dataclass, field
import pypdf


@dataclass
class InvoiceExtractionResult:
    """Results from attempting to extract invoice data from a PDF."""
    file_path: Path
    vendor_name: str = ""
    invoice_number: str = ""
    invoice_date: str = ""
    total_amount: float | None = None
    confidence: str = "low"  # "high", "medium", "low"
    raw_text: str = ""
    extraction_notes: list[str] = field(default_factory=list)


def extract_invoice_data(pdf_path: Path) -> InvoiceExtractionResult:
    """
    Attempt to extract key fields from a vendor invoice PDF.

    Uses a series of heuristic patterns to find the invoice total,
    vendor name, invoice number, and date.

    Returns an InvoiceExtractionResult. Always check the 'confidence'
    field: anything below "high" should go to manual review.
    """
    result = InvoiceExtractionResult(file_path=pdf_path)

    # Extract raw text
    try:
        with open(pdf_path, "rb") as pdf_file:
            reader = pypdf.PdfReader(pdf_file)
            if reader.is_encrypted:
                result.extraction_notes.append("PDF is encrypted — cannot extract text")
                return result
            all_text = "\n".join(
                page.extract_text() or "" for page in reader.pages
            )
    except Exception as error:
        result.extraction_notes.append(f"PDF read error: {error}")
        return result

    if not all_text.strip():
        result.extraction_notes.append("No text extracted — likely a scanned image PDF")
        return result

    result.raw_text = all_text

    # ── EXTRACT TOTAL ─────────────────────────────────────────────────────────
    # Look for common "total due" phrases and grab the nearest dollar amount
    total_patterns = [
        r"(?:total\s+due|amount\s+due|balance\s+due|invoice\s+total)[:\s]*\$?([\d,]+\.?\d*)",
        r"(?:grand\s+total|net\s+total|total\s+amount)[:\s]*\$?([\d,]+\.?\d*)",
        # \b stops this fallback from matching the "total" inside "subtotal"
        r"\btotal[:\s]+\$?([\d,]+\.\d{2})",
    ]

    for pattern in total_patterns:
        match = re.search(pattern, all_text, re.IGNORECASE)
        if match:
            amount_str = match.group(1).replace(",", "")
            try:
                # Confidence is assigned once, at the end of this function
                result.total_amount = float(amount_str)
                break
            except ValueError:
                # The capture was not a valid number; try the next pattern
                continue

    if result.total_amount is None:
        result.extraction_notes.append("Could not find invoice total — manual review required")

    # ── EXTRACT INVOICE NUMBER ────────────────────────────────────────────────
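    # Require at least one digit so that a neighboring label such as "Date"
    # is not mistaken for the invoice number.
    inv_patterns = [
        r"invoice\s+(?:no\.?|number|#)\s*[:\s]?\s*([A-Z0-9\-]*\d[A-Z0-9\-]*)",
        r"inv[.\s]*(?:no\.?|#)\s*[:\s]?\s*([A-Z0-9\-]*\d[A-Z0-9\-]*)",
    ]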

    for pattern in inv_patterns:
        match = re.search(pattern, all_text, re.IGNORECASE)
        if match:
            result.invoice_number = match.group(1).strip()
            break

    # ── EXTRACT DATE ──────────────────────────────────────────────────────────
    date_pattern = re.compile(
        r"(?:invoice\s+date|date)[:\s]+(\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4})"
        r"|(\w+\s+\d{1,2},\s+\d{4})",
        re.IGNORECASE
    )
    date_match = date_pattern.search(all_text)
    if date_match:
        result.invoice_date = (date_match.group(1) or date_match.group(2) or "").strip()

    # ── DETERMINE CONFIDENCE ──────────────────────────────────────────────────
    has_total = result.total_amount is not None  # a 0.00 total is still a total
    if has_total and result.invoice_number and result.invoice_date:
        result.confidence = "high"
    elif has_total:
        result.confidence = "medium"
    else:
        result.confidence = "low"

    return result


def process_invoice_folder(
    invoices_dir: Path,
) -> tuple[list[InvoiceExtractionResult], list[InvoiceExtractionResult]]:
    """
    Process all PDF invoices in a folder.

    Returns:
        (high_confidence_results, needs_review_results)
    """
    high_confidence = []
    needs_review = []

    for pdf_file in sorted(invoices_dir.glob("*.pdf")):
        result = extract_invoice_data(pdf_file)
        if result.total_amount is not None:
            total_display = f"${result.total_amount:,.2f}"
        else:
            total_display = "NOT FOUND"
        print(
            f"  {pdf_file.name:<40} "
            f"Total: {total_display:<15} "
            f"[{result.confidence}]"
        )

        if result.confidence == "high":
            high_confidence.append(result)
        else:
            needs_review.append(result)

    return high_confidence, needs_review
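
The needs_review list only helps if a person can work through it. One possible way to hand it over is a CSV file that opens in any spreadsheet program. This is a sketch, not part of the extractor above: the function name write_review_queue and the column names are hypothetical, and it assumes each result exposes the same attribute names as InvoiceExtractionResult.

```python
import csv
from pathlib import Path
from typing import Iterable

# Hypothetical column layout for the review spreadsheet
REVIEW_COLUMNS = [
    "file_name", "invoice_number", "invoice_date",
    "total_amount", "confidence", "notes",
]


def write_review_queue(results: Iterable, csv_path: Path) -> int:
    """
    Write one CSV row per result that needs manual review.

    Each result is expected to have the attributes of
    InvoiceExtractionResult. Returns the number of rows written.
    """
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    rows_written = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(REVIEW_COLUMNS)
        for result in results:
            # Leave the cell blank when no total was found
            if result.total_amount is None:
                total = ""
            else:
                total = f"{result.total_amount:.2f}"
            writer.writerow([
                Path(result.file_path).name,
                result.invoice_number,
                result.invoice_date,
                total,
                result.confidence,
                "; ".join(result.extraction_notes),
            ])
            rows_written += 1
    return rows_written
```

Called as write_review_queue(needs_review, Path("review_queue.csv")), this would give the reviewer a file name, whatever fields were recovered, and the extraction notes explaining why each invoice was flagged.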

Summary

  • pypdf reads PDF files: it can extract text and metadata, split a PDF by page, and merge multiple PDFs. Import it with import pypdf.
  • PDF text extraction only works on text-based PDFs, not scanned images. Always test by trying to select text manually before building automation.
  • PDF extraction is heuristic, not structural. Use regular expressions to find patterns, and always implement a manual review queue for low-confidence results.
  • python-docx creates and reads .docx files. Use Document() for a blank document or Document(path) to open an existing one.
  • The report generation pattern: load data from any source, populate a template or build a document programmatically, save to .docx.
  • Placeholder replacement (finding {{CLIENT_NAME}} and replacing it) is the standard approach for template-based document generation.
  • python-docx handles headings (add_heading()), paragraphs (add_paragraph()), and tables (add_table()). Formatting is applied through Run objects with .font.bold, .font.size, .font.color, and similar attributes.
  • reportlab generates PDFs from scratch via a canvas API. It requires positioning content at explicit coordinates, making it more complex than python-docx. Full coverage in Chapter 36.
  • For batch work (generating eight regional reports, processing 47 vendor invoices), loop over your data and call the relevant function once per record.
