In This Chapter
- Opening Scenario: The Invoice Reconciliation Problem
- 18.1 The Landscape: PDFs, Word Files, and What Python Can Do
- 18.2 Reading PDFs with pypdf
- 18.3 Working with Word Documents Using python-docx
- 18.4 Word Templates: Finding and Replacing Placeholders
- 18.5 Generating Structured Documents: Headings, Tables, and Formatting
- 18.6 Combining Python Data with Document Templates: The Report Generation Pattern
- 18.7 Generating PDFs with reportlab (An Introduction)
- 18.8 Extracting Data from Vendor PDFs: Practical Patterns
- Summary
Chapter 18: Working with PDFs and Word Documents
"A document is not a document when it is a PDF sent by a vendor who generates it from a system you cannot access, containing data you need in a format you cannot parse." — Every analyst who has ever manually re-typed a vendor invoice
Opening Scenario: The Invoice Reconciliation Problem
Sandra Chen drops a folder of 47 PDF files on Priya's desk — metaphorically, through a shared drive. They are vendor invoices from the previous quarter. Finance needs the totals reconciled against the purchase order system before end of day.
Priya opens the first PDF, reads the total at the bottom, types it into a spreadsheet, and closes it. She opens the second one. It is from a different vendor and the total is in a different location on the page. She finds it, types it in. Third one. Different format again.
At this rate, 47 invoices will take 90 minutes. There will be typos. Finance will push back on at least two entries because the handwritten annotations on three invoices obscure the totals.
Meanwhile, Sandra has asked Priya to generate branded Word reports for the Q4 regional sales review. The report has a cover page, an executive summary, four tables (one per region), and a standardized conclusion. There are eight regions. Generating eight nearly-identical documents manually will take most of a day.
Priya has a better idea for both of these.
18.1 The Landscape: PDFs, Word Files, and What Python Can Do
Before writing any code, it is worth being precise about what is possible and what is not.
PDFs: A Format Designed for Humans, Not Machines
PDF stands for Portable Document Format. It was designed in 1993 with one goal: make a document look exactly the same on every printer and screen, regardless of what software created it. It succeeded brilliantly at that goal. As a consequence, it is a deeply unfriendly format for automated data extraction.
A PDF file does not contain structured text the way an HTML file or a Word document does. It contains a series of rendering instructions: "draw the string 'Invoice Total:' at position (x=412, y=218), in 10-point Helvetica." The text is there, but it is positioned on a canvas. There is no structural relationship — no concept of "this value belongs to this label" — unless you build that understanding yourself from the positions.
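To make "build that understanding yourself from the positions" concrete, here is a minimal sketch that regroups positioned text fragments into lines. The fragment list and the `group_into_lines` helper are hypothetical illustrations rather than part of any library, and bucketing by rounded y value is a crude assumption that can misplace fragments sitting near a bucket boundary.

```python
from collections import defaultdict

# Hypothetical fragments as (text, x, y); y grows upward, as in PDF coordinates.
fragments = [
    ("$1,234.56", 480.0, 218.0),
    ("Invoice Total:", 412.0, 218.4),
    ("Subtotal:", 412.0, 240.0),
    ("$1,140.00", 480.0, 240.1),
]

def group_into_lines(fragments, y_tolerance=2.0):
    """Group fragments with similar y positions, then order each group by x."""
    lines = defaultdict(list)
    for text, x, y in fragments:
        # Bucket by rounded y so small baseline differences land together
        lines[round(y / y_tolerance)].append((x, text))
    # A higher y is nearer the top of the page, so emit buckets in descending order
    return [
        " ".join(text for _, text in sorted(parts))
        for _, parts in sorted(lines.items(), reverse=True)
    ]

for line in group_into_lines(fragments):
    print(line)
```

Only after this regrouping does "Invoice Total:" end up next to its value, which is the relationship the PDF itself never records.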
When PDF text extraction works well:
- Simple, consistently-formatted documents (pay stubs, bank statements, invoices from modern billing systems)
- Documents with mostly text content and simple layouts
- PDFs generated directly from software (Word exports, accounting system outputs)
When PDF text extraction does not work or works poorly:
- Scanned documents (images masquerading as PDFs; the text is a picture of text, not actual text)
- Complex multi-column layouts where text order breaks when extracted linearly
- Tables where the relationship between cells is positional, not structural
- PDFs with significant graphics, watermarks, or heavy formatting
- Password-protected or rights-managed PDFs
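The single most important diagnostic: open the PDF and try to select and copy some text. If you can select text and it pastes correctly, Python can likely extract it. If you cannot select text at all, the document is almost certainly a scanned image and you need OCR (optical character recognition), which is beyond this chapter's scope. If you can select what looks like text but pasting it yields garbage, the PDF's fonts probably lack a usable character mapping, and OCR is often the practical fallback there as well.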
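The same diagnostic can be approximated in code once text has been extracted page by page. The sketch below is a heuristic under stated assumptions: `looks_like_scanned_pdf` is a hypothetical helper, `page_texts` is assumed to be a list of strings (one per page, for example from `page.extract_text()`), and the 20-character threshold is an arbitrary starting point to tune on your own documents.

```python
def looks_like_scanned_pdf(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is image-only from its per-page extracted text."""
    if not page_texts:
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    # Very little text per page suggests the pages are pictures of text
    return total_chars / len(page_texts) < min_chars_per_page

print(looks_like_scanned_pdf(["", "  "]))
print(looks_like_scanned_pdf(["Invoice Total: $1,234.56 due on receipt"]))
```

A check like this only flags candidates for manual inspection; it cannot tell a scanned page from a page that is genuinely almost blank.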
Word Documents: Structured and Python-Friendly
Microsoft Word's .docx format (introduced in Office 2007) is actually a ZIP archive containing XML files. Python's python-docx library reads and writes that XML through a clean API. Word documents have genuine structure: paragraphs, headings, tables, runs, styles. This makes them far more amenable to automation than PDFs.
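The ZIP structure is easy to see with the standard library alone. The sketch below builds a tiny stand-in archive in memory so that it runs without a real file; the two part names are typical of what Word writes, but a real .docx contains more parts. `list_archive_parts` is a hypothetical helper name.

```python
import io
import zipfile

# Build a small stand-in archive in memory (an assumption for the demo;
# with a real document you would pass its path instead).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<w:document/>")

def list_archive_parts(file_or_path):
    """Return the names of the files stored inside a .docx-style ZIP archive."""
    with zipfile.ZipFile(file_or_path) as archive:
        return archive.namelist()

parts = list_archive_parts(buffer)
print(parts)
```

Calling `list_archive_parts` on the path of a real report should show `word/document.xml`, the part that holds the body text python-docx works with.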
What python-docx can do well:
- Read all text content from a document
- Create documents from scratch with headings, paragraphs, tables
- Apply styles (Bold, Heading 1, Table Style, etc.)
- Replace placeholder text throughout a document
- Add images
- Control fonts, sizes, colors, alignment
What python-docx cannot do well:
- Preserve complex layouts from existing documents exactly
- Work with .doc (pre-2007 Word format) files — only .docx
- Render documents (you cannot generate a PDF from a docx through python-docx alone)
- Handle all the edge cases of macro-enabled documents (.docm)
18.2 Reading PDFs with pypdf
pypdf (the successor to the now-deprecated PyPDF2) is a widely used pure-Python library for PDF operations. It needs no external tools.
pip install pypdf
Basic Text Extraction
from pathlib import Path
import pypdf

def extract_text_from_pdf(pdf_path: Path) -> str:
    """
    Extract all text from a PDF file as a single string.
    Returns an empty string if the PDF has no extractable text.
    """
    full_text = []
    with open(pdf_path, "rb") as pdf_file:
        reader = pypdf.PdfReader(pdf_file)
        # Iterate over all pages
        for page_number, page in enumerate(reader.pages, start=1):
            page_text = page.extract_text()
            if page_text:
                full_text.append(f"--- Page {page_number} ---\n{page_text}")
    return "\n\n".join(full_text)

# Use it
invoice_path = Path("/data/invoices/vendor_invoice_q4.pdf")
text = extract_text_from_pdf(invoice_path)
print(text[:500])  # Preview the first 500 characters
Reading PDF Metadata
PDFs carry metadata about the document — author, creation date, title, software used to create it. This is often useful for logging and validation.
from pathlib import Path
import pypdf

def get_pdf_metadata(pdf_path: Path) -> dict:
    """
    Extract metadata from a PDF file.
    Returns a dict with keys: title, author, creator, producer,
    creation_date, modification_date, page_count, is_encrypted,
    file_path, file_size_kb.
    """
    with open(pdf_path, "rb") as pdf_file:
        reader = pypdf.PdfReader(pdf_file)
        # reader.metadata is None when the PDF carries no metadata at all
        metadata = reader.metadata or {}
        return {
            "title": metadata.get("/Title", ""),
            "author": metadata.get("/Author", ""),
            "creator": metadata.get("/Creator", ""),
            "producer": metadata.get("/Producer", ""),
            "creation_date": metadata.get("/CreationDate", ""),
            "modification_date": metadata.get("/ModDate", ""),
            "page_count": len(reader.pages),
            "is_encrypted": reader.is_encrypted,
            "file_path": str(pdf_path),
            "file_size_kb": round(pdf_path.stat().st_size / 1024, 1),
        }

metadata = get_pdf_metadata(Path("/data/invoices/vendor_invoice_q4.pdf"))
print(f"Title: {metadata['title']}")
print(f"Pages: {metadata['page_count']}")
print(f"Created: {metadata['creation_date']}")
print(f"Size: {metadata['file_size_kb']} KB")
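The raw "/CreationDate" and "/ModDate" values are strings in the PDF date layout, for example D:20250110093000-06'00'. For logging it is usually more convenient to have a datetime. The sketch below is one way to parse that layout with the standard library; `parse_pdf_date` is a hypothetical helper, and it returns None rather than raising when a value does not follow the layout. Recent pypdf releases also appear to expose parsed values (such as `reader.metadata.creation_date`), so check the version you have installed before writing your own parser.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by Z or a +HH'mm' / -HH'mm' offset;
# everything after the year is optional.
PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)

def parse_pdf_date(raw):
    """Convert a PDF date string to a datetime, or None if it does not parse."""
    match = PDF_DATE.match(raw.strip())
    if not match:
        return None
    year, month, day, hour, minute, second, zulu, sign, tz_h, tz_m = match.groups()
    tzinfo = None
    if zulu:
        tzinfo = timezone.utc
    elif sign:
        offset = timedelta(hours=int(tz_h), minutes=int(tz_m or 0))
        tzinfo = timezone(-offset if sign == "-" else offset)
    try:
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=tzinfo,
        )
    except ValueError:
        # For example month 13 or an offset outside the valid range
        return None

print(parse_pdf_date("D:20250110093000-06'00'"))
```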
Extracting Text Page by Page
For business documents, you often need text from specific pages rather than the whole document:
from pathlib import Path
import pypdf

def extract_page_text(pdf_path: Path, page_number: int) -> str:
    """
    Extract text from a specific page (1-indexed).
    Args:
        pdf_path: Path to the PDF.
        page_number: Page to extract (starts at 1).
    Returns:
        Text content of the page, or empty string if page doesn't exist.
    """
    with open(pdf_path, "rb") as pdf_file:
        reader = pypdf.PdfReader(pdf_file)
        # Convert 1-indexed page number to 0-indexed list position
        page_index = page_number - 1
        if page_index < 0 or page_index >= len(reader.pages):
            return ""
        return reader.pages[page_index].extract_text() or ""

def extract_all_pages(pdf_path: Path) -> list[tuple[int, str]]:
    """
    Extract text from every page, returning a list of (page_number, text) tuples.
    """
    pages = []
    with open(pdf_path, "rb") as pdf_file:
        reader = pypdf.PdfReader(pdf_file)
        for page_number, page in enumerate(reader.pages, start=1):
            text = page.extract_text() or ""
            pages.append((page_number, text))
    return pages
Finding Patterns in PDF Text
Once you have extracted text, you can search it for business-relevant patterns using regular expressions:
import re
from pathlib import Path
import pypdf

def find_dollar_amounts(text: str) -> list[float]:
    """
    Find all dollar amounts in extracted PDF text.
    Matches patterns like: $1,234.56 $1234.56 $1,234 $0.99
    Because the $ is optional, bare numbers (dates, quantities) match too.
    """
    # Pattern: optional $, then a digit followed by digits or commas, optional .cents
    # (starting on a digit keeps a stray comma from matching on its own)
    pattern = re.compile(r"\$?\s*(\d[\d,]*(?:\.\d{2})?)")
    matches = pattern.findall(text)
    amounts = []
    for match in matches:
        # Remove commas before converting to float
        cleaned = match.replace(",", "")
        try:
            amount = float(cleaned)
            if amount > 0:  # Filter out zeros and noise
                amounts.append(amount)
        except ValueError:
            continue
    return amounts

def find_invoice_total(text: str) -> float | None:
    """
    Attempt to find the invoice total in extracted PDF text.
    Looks for lines containing the word "total" (case-insensitive) followed by
    or near a dollar amount. Returns the amount found, or None.
    This is a heuristic — it works for consistently-formatted invoices.
    Always verify the results against a sample of actual documents.
    """
    # Match "total" as a whole word so that "Subtotal" lines are skipped
    total_word = re.compile(r"\btotal\b", re.IGNORECASE)
    for line in text.split("\n"):
        if total_word.search(line):
            # Look for a dollar amount on this line
            amounts = find_dollar_amounts(line)
            if amounts:
                # Take the largest amount on the "total" line
                # (avoids picking up tax sub-totals)
                return max(amounts)
    return None

# Use it
text = extract_text_from_pdf(Path("/data/invoices/vendor_q4.pdf"))
total = find_invoice_total(text)
if total is not None:
    print(f"Invoice total: ${total:,.2f}")
else:
    print("Could not automatically extract total — manual review required")
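Because the extracted totals are reconciled against another system, exact decimal arithmetic may be preferable to float. As a hedged alternative to the float conversion used above, the sketch below parses amount text into `decimal.Decimal`; `parse_amount` is a hypothetical helper, not something the chapter's functions depend on.

```python
from decimal import Decimal, InvalidOperation

def parse_amount(raw):
    """Convert text like '$1,234.56' to a Decimal, or None if it is not a number."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None

# Decimal keeps cents exact, where binary floats can drift
print(parse_amount("$0.10") + parse_amount("$0.20") == Decimal("0.30"))
print(0.10 + 0.20 == 0.30)
```

The first comparison is True and the second is False, which is the kind of discrepancy that produces one-cent mismatches during reconciliation.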
Splitting and Merging PDFs
from pathlib import Path
import pypdf

def split_pdf_by_page(
    source_pdf: Path,
    output_dir: Path,
    base_name: str | None = None,
) -> list[Path]:
    """
    Split a multi-page PDF into individual single-page PDFs.
    Args:
        source_pdf: The PDF to split.
        output_dir: Where to save the page files.
        base_name: Base filename for pages (defaults to source filename stem).
    Returns:
        List of paths to created page files.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    base = base_name or source_pdf.stem
    created_files = []
    with open(source_pdf, "rb") as pdf_file:
        reader = pypdf.PdfReader(pdf_file)
        for page_number, page in enumerate(reader.pages, start=1):
            # One writer per page produces one single-page file
            writer = pypdf.PdfWriter()
            writer.add_page(page)
            output_path = output_dir / f"{base}_page_{page_number:03d}.pdf"
            with open(output_path, "wb") as output_file:
                writer.write(output_file)
            created_files.append(output_path)
    return created_files

def merge_pdfs(
    pdf_paths: list[Path],
    output_path: Path,
) -> Path:
    """
    Merge multiple PDF files into a single PDF.
    Pages are appended in the order provided in pdf_paths.
    Returns:
        Path to the created merged PDF.
    """
    writer = pypdf.PdfWriter()
    for pdf_path in pdf_paths:
        with open(pdf_path, "rb") as pdf_file:
            reader = pypdf.PdfReader(pdf_file)
            for page in reader.pages:
                writer.add_page(page)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "wb") as output_file:
        writer.write(output_file)
    return output_path

# Example: merge all monthly statements into a quarterly PDF
statements_dir = Path("/data/statements")
monthly_pdfs = sorted(statements_dir.glob("statement_2024_*.pdf"))
quarterly_pdf = Path("/data/statements/Q4_2024_complete.pdf")
merge_pdfs(monthly_pdfs, quarterly_pdf)
print(f"Merged {len(monthly_pdfs)} files into {quarterly_pdf.name}")
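Since pages are merged in the order of `pdf_paths`, the sort matters. Plain `sorted()` compares paths as text, so if month numbers are not zero-padded, month 10 sorts ahead of month 2. The sketch below shows one common workaround, a "natural" sort key; `natural_key` is a hypothetical helper and the filenames are made up for the demonstration.

```python
import re
from pathlib import Path

def natural_key(path):
    """Sort key that compares embedded numbers numerically rather than as text."""
    # re.split with a capture group alternates text and digit chunks
    parts = re.split(r"(\d+)", path.name)
    return [int(part) if part.isdigit() else part.lower() for part in parts]

names = [
    Path("statement_2024_10.pdf"),
    Path("statement_2024_2.pdf"),
    Path("statement_2024_11.pdf"),
]
print([p.name for p in sorted(names)])
print([p.name for p in sorted(names, key=natural_key)])
```

If your files already use zero-padded months (01 through 12), plain `sorted()` gives the same order and no key is needed.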
18.3 Working with Word Documents Using python-docx
python-docx is the de facto third-party library for reading and writing .docx files. Note that it is installed as python-docx but imported as docx.
pip install python-docx
Reading a Word Document
from pathlib import Path
import docx

def read_document_text(doc_path: Path) -> str:
    """
    Extract all text from a Word document as a single string.
    Each paragraph is on its own line.
    """
    doc = docx.Document(str(doc_path))
    paragraphs = [para.text for para in doc.paragraphs if para.text.strip()]
    return "\n".join(paragraphs)

def read_document_structure(doc_path: Path) -> list[dict]:
    """
    Read a Word document and return its structure as a list of elements.
    Each element is a dict with:
        type: "paragraph" or "table"
        style: paragraph style name (e.g., "Heading 1", "Normal")
        text: the text content (for paragraphs)
        rows: list of row data (for tables)
    Note: all paragraphs are listed first, then all tables, so the
    original interleaving of paragraphs and tables is not preserved.
    """
    doc = docx.Document(str(doc_path))
    elements = []
    for paragraph in doc.paragraphs:
        if paragraph.text.strip():
            elements.append({
                "type": "paragraph",
                "style": paragraph.style.name,
                "text": paragraph.text,
            })
    for table in doc.tables:
        table_data = []
        for row in table.rows:
            row_data = [cell.text.strip() for cell in row.cells]
            table_data.append(row_data)
        elements.append({
            "type": "table",
            "rows": table_data,
            "row_count": len(table.rows),
            "column_count": len(table.columns),
        })
    return elements
Creating a Word Document from Scratch
The real power of python-docx is generating documents programmatically from data. This is the report generation pattern.
from pathlib import Path
import docx
from docx.shared import Inches, Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.enum.table import WD_TABLE_ALIGNMENT

def create_sales_report(
    report_data: dict,
    output_path: Path,
) -> Path:
    """
    Generate a formatted Word sales report from structured data.
    Args:
        report_data: dict with keys:
            title, period, prepared_by, region, summary_text,
            sales_table (list of row dicts), notes
        output_path: Where to save the .docx file.
    Returns:
        Path to the created document.
    """
    doc = docx.Document()

    # ── PAGE SETUP ────────────────────────────────────────────────────────────
    # Set margins (Inches)
    for section in doc.sections:
        section.top_margin = Inches(1.0)
        section.bottom_margin = Inches(1.0)
        section.left_margin = Inches(1.25)
        section.right_margin = Inches(1.25)

    # ── COVER HEADING ─────────────────────────────────────────────────────────
    title_para = doc.add_heading(report_data["title"], level=0)
    title_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
    doc.add_paragraph()  # Spacer

    # Subtitle (Region + Period)
    subtitle = doc.add_paragraph()
    subtitle.alignment = WD_ALIGN_PARAGRAPH.CENTER
    subtitle_run = subtitle.add_run(
        f"{report_data['region']} Region | {report_data['period']}"
    )
    subtitle_run.font.size = Pt(14)
    subtitle_run.font.color.rgb = RGBColor(0x44, 0x44, 0x44)

    # Prepared by
    prepared = doc.add_paragraph()
    prepared.alignment = WD_ALIGN_PARAGRAPH.CENTER
    prepared_run = prepared.add_run(f"Prepared by: {report_data['prepared_by']}")
    prepared_run.font.size = Pt(11)
    prepared_run.font.italic = True
    doc.add_paragraph()  # Spacer

    # ── EXECUTIVE SUMMARY ─────────────────────────────────────────────────────
    doc.add_heading("Executive Summary", level=1)
    doc.add_paragraph(report_data["summary_text"])
    doc.add_paragraph()  # Spacer

    # ── SALES TABLE ───────────────────────────────────────────────────────────
    doc.add_heading("Sales Performance by Product", level=1)
    sales_rows = report_data["sales_table"]
    if sales_rows:
        # Create table with header + data rows
        header_keys = list(sales_rows[0].keys())
        table = doc.add_table(rows=1 + len(sales_rows), cols=len(header_keys))
        table.style = "Light Shading Accent 1"
        table.alignment = WD_TABLE_ALIGNMENT.CENTER

        # Header row
        header_row = table.rows[0]
        for col_index, column_name in enumerate(header_keys):
            cell = header_row.cells[col_index]
            cell.text = column_name
            # Bold the header
            for run in cell.paragraphs[0].runs:
                run.font.bold = True

        # Data rows
        for row_index, row_data in enumerate(sales_rows, start=1):
            table_row = table.rows[row_index]
            for col_index, key in enumerate(header_keys):
                table_row.cells[col_index].text = str(row_data[key])
    doc.add_paragraph()  # Spacer

    # ── NOTES ─────────────────────────────────────────────────────────────────
    if report_data.get("notes"):
        doc.add_heading("Notes", level=2)
        for note in report_data["notes"]:
            doc.add_paragraph(note, style="List Bullet")

    # ── FOOTER LINE ───────────────────────────────────────────────────────────
    doc.add_paragraph()
    footer_para = doc.add_paragraph()
    footer_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
    footer_run = footer_para.add_run(
        f"Acme Corp Confidential | {report_data['period']}"
    )
    footer_run.font.size = Pt(9)
    footer_run.font.italic = True
    footer_run.font.color.rgb = RGBColor(0x99, 0x99, 0x99)

    # ── SAVE ──────────────────────────────────────────────────────────────────
    output_path.parent.mkdir(parents=True, exist_ok=True)
    doc.save(str(output_path))
    return output_path
Calling this function with a data dictionary:
report_data = {
    "title": "Q4 2024 Regional Sales Review",
    "period": "Q4 2024 (Oct–Dec)",
    "prepared_by": "Priya Okonkwo",
    "region": "Chicago",
    "summary_text": (
        "The Chicago region delivered Q4 revenue of $1,234,567, exceeding target "
        "by 4.2%. Office supplies led growth, with stapler category up 18% YoY. "
        "Year-end stocking orders from school districts drove a strong December. "
        "Q1 2025 pipeline is healthy with three pending enterprise contracts."
    ),
    "sales_table": [
        {"Product": "Office Supplies", "Q4 Revenue": "$487,234", "vs Target": "+8.1%"},
        {"Product": "Furniture", "Q4 Revenue": "$312,890", "vs Target": "+1.2%"},
        {"Product": "Technology", "Q4 Revenue": "$289,120", "vs Target": "-2.4%"},
        {"Product": "Cleaning", "Q4 Revenue": "$145,323", "vs Target": "+6.7%"},
    ],
    "notes": [
        "Technology shortfall due to delayed shipment resolved in early Q1.",
        "See Appendix A for complete product-level breakdowns.",
    ],
}
output_file = Path("/data/reports/q4_chicago_report.docx")
created = create_sales_report(report_data, output_file)
print(f"Report created: {created}")
18.4 Word Templates: Finding and Replacing Placeholders
Rather than building a document from scratch, you often want to fill in a pre-designed template — a Word document with placeholders like {{CLIENT_NAME}} or {{PROJECT_DATE}} that Python replaces with real values.
This is the pattern for proposals, contracts, offer letters, and any other document where the structure is designed by a human but the content varies.
The Placeholder Replacement Pattern
from pathlib import Path
import docx

def replace_placeholders_in_paragraph(paragraph, replacements: dict) -> None:
    """
    Replace placeholder text in a paragraph's runs.
    Handles the case where a placeholder like {{CLIENT_NAME}} may be split
    across multiple runs by Word's XML structure.
    Args:
        paragraph: A python-docx Paragraph object.
        replacements: Dict mapping placeholder to replacement value.
            e.g., {"{{CLIENT_NAME}}": "Acme Corp"}
    """
    for placeholder, value in replacements.items():
        # First, check if the whole placeholder is in the paragraph text
        if placeholder not in paragraph.text:
            continue
        # The placeholder might be split across runs — reassemble and replace
        # Strategy: combine all run text, replace, then put it back in run 0
        full_text = "".join(run.text for run in paragraph.runs)
        if placeholder in full_text:
            new_text = full_text.replace(placeholder, str(value))
            # Put the replaced text into the first run, clear the rest
            if paragraph.runs:
                paragraph.runs[0].text = new_text
                for run in paragraph.runs[1:]:
                    run.text = ""

def fill_word_template(
    template_path: Path,
    output_path: Path,
    replacements: dict,
) -> Path:
    """
    Fill a Word template by replacing all placeholders with values.
    Placeholders in the template should use a distinctive format like
    {{PLACEHOLDER_NAME}} to avoid accidentally replacing normal text.
    Args:
        template_path: Path to the .docx template file.
        output_path: Path for the filled output document.
        replacements: Dict mapping each placeholder to its replacement value.
    Returns:
        Path to the filled document.
    """
    doc = docx.Document(str(template_path))

    # Replace in body paragraphs
    for paragraph in doc.paragraphs:
        replace_placeholders_in_paragraph(paragraph, replacements)

    # Replace in table cells
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    replace_placeholders_in_paragraph(paragraph, replacements)

    # Replace in headers and footers
    for section in doc.sections:
        for paragraph in section.header.paragraphs:
            replace_placeholders_in_paragraph(paragraph, replacements)
        for paragraph in section.footer.paragraphs:
            replace_placeholders_in_paragraph(paragraph, replacements)

    output_path.parent.mkdir(parents=True, exist_ok=True)
    doc.save(str(output_path))
    return output_path
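The split-run problem is easier to see on plain strings. The sketch below mimics the join, replace, and collapse strategy without python-docx; the run texts and the `replace_across_runs` helper are hypothetical stand-ins for a paragraph's runs, not library objects.

```python
# Hypothetical run texts for one paragraph. Word may split a placeholder
# across runs after spell-checking or small formatting edits.
run_texts = ["Dear ", "{{CLIENT", "_NAME}}", ", thank you."]

def replace_across_runs(run_texts, replacements):
    """Mimic the join-replace-collapse strategy on plain strings."""
    if not run_texts:
        return []
    full_text = "".join(run_texts)
    for placeholder, value in replacements.items():
        full_text = full_text.replace(placeholder, str(value))
    # All text goes into the first slot; the others are emptied, not removed
    return [full_text] + [""] * (len(run_texts) - 1)

# No single run contains the whole placeholder, so a run-by-run search fails
print(any("{{CLIENT_NAME}}" in text for text in run_texts))
print(replace_across_runs(run_texts, {"{{CLIENT_NAME}}": "Acme Corp"}))
```

The trade-off is visible in the result: the whole paragraph now lives in the first run, so any formatting that differed between runs (a bold word, for example) takes on the first run's formatting.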
Using a Template for Batch Document Generation
from pathlib import Path

def generate_regional_reports(
    template_path: Path,
    regions_data: list[dict],
    output_dir: Path,
) -> list[Path]:
    """
    Generate one filled report per region using a shared template.
    Each dict in regions_data must contain all placeholders needed by the template.
    Returns:
        List of paths to created documents.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    created_files = []
    for region_data in regions_data:
        region_name = region_data.get("{{REGION}}", "unknown").lower().replace(" ", "_")
        output_filename = f"q4_report_{region_name}.docx"
        output_path = output_dir / output_filename
        fill_word_template(template_path, output_path, region_data)
        created_files.append(output_path)
        print(f"  Generated: {output_filename}")
    return created_files

# Example data — each dict fills one report
regions_data = [
    {
        "{{REGION}}": "Chicago",
        "{{PERIOD}}": "Q4 2024",
        "{{REVENUE}}": "$1,234,567",
        "{{VARIANCE}}": "+4.2%",
        "{{PREPARED_BY}}": "Priya Okonkwo",
    },
    {
        "{{REGION}}": "Nashville",
        "{{PERIOD}}": "Q4 2024",
        "{{REVENUE}}": "$987,432",
        "{{VARIANCE}}": "-1.3%",
        "{{PREPARED_BY}}": "Priya Okonkwo",
    },
    # ... additional regions
]
created = generate_regional_reports(
    template_path=Path("/templates/regional_report_template.docx"),
    regions_data=regions_data,
    output_dir=Path("/data/reports/q4_2024"),
)
print(f"\nGenerated {len(created)} reports.")
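The requirement that each dict contain every placeholder the template needs is easy to violate silently: a missing key simply leaves {{SOMETHING}} in the finished report. One way to catch that before generating files is to scan the template text for placeholders and compare against the dict. This is a sketch under assumptions: `find_missing_placeholders` is a hypothetical helper, placeholders are assumed to be upper-case names in double braces, and `template_text` is assumed to be plain text already read from the template (for example with `read_document_text` from 18.3, which covers body paragraphs only).

```python
import re

# Assumed placeholder shape: {{UPPER_CASE_NAME}}
PLACEHOLDER = re.compile(r"\{\{[A-Z0-9_]+\}\}")

def find_missing_placeholders(template_text, replacements):
    """Return placeholders present in the template text but absent from the dict."""
    found = set(PLACEHOLDER.findall(template_text))
    return sorted(found - set(replacements))

template_text = "Region: {{REGION}}  Revenue: {{REVENUE}} ({{VARIANCE}})"
incomplete = {"{{REGION}}": "Chicago", "{{REVENUE}}": "$1,234,567"}
print(find_missing_placeholders(template_text, incomplete))
```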
18.5 Generating Structured Documents: Headings, Tables, and Formatting
Here are the key python-docx patterns you will use repeatedly.
Adding Headings
import docx
doc = docx.Document()
# Heading levels 0–9
# Level 0 is the "Title" style (document title)
# Levels 1–9 map to "Heading 1" through "Heading 9"
doc.add_heading("Annual Sales Report 2024", level=0)
doc.add_heading("Executive Summary", level=1)
doc.add_heading("Regional Performance", level=1)
doc.add_heading("Chicago Region", level=2)
doc.add_heading("Q4 Detail", level=3)
Adding Paragraphs with Formatting
import docx
from docx.shared import Pt, RGBColor
doc = docx.Document()
# Simple paragraph
doc.add_paragraph("This is a normal paragraph.")
# Paragraph with inline formatting using runs
para = doc.add_paragraph()
para.add_run("Revenue grew by ").font.size = Pt(12)
bold_run = para.add_run("4.2%")
bold_run.font.bold = True
bold_run.font.size = Pt(12)
bold_run.font.color.rgb = RGBColor(0x00, 0x70, 0xC0) # Blue
para.add_run(" compared to the same period last year.")
# Bullet list
doc.add_paragraph("First bullet point", style="List Bullet")
doc.add_paragraph("Second bullet point", style="List Bullet")
# Numbered list
doc.add_paragraph("First step", style="List Number")
doc.add_paragraph("Second step", style="List Number")
Adding Tables
import docx
from docx.document import Document as DocxDocument  # the class, for type hints
from docx.shared import Pt

def add_data_table(doc: DocxDocument, headers: list, rows: list) -> None:
    """
    Add a formatted data table to a document.
    Args:
        doc: The Document object to add the table to.
        headers: List of column header strings.
        rows: List of lists, where each inner list is one row of data.
    """
    table = doc.add_table(rows=1 + len(rows), cols=len(headers))
    table.style = "Light Grid Accent 1"

    # Header row
    header_row = table.rows[0]
    for col_index, header_text in enumerate(headers):
        cell = header_row.cells[col_index]
        cell.text = header_text
        # Make header text bold
        for paragraph in cell.paragraphs:
            for run in paragraph.runs:
                run.font.bold = True
                run.font.size = Pt(11)

    # Data rows
    for row_index, row_data in enumerate(rows, start=1):
        table_row = table.rows[row_index]
        for col_index, cell_value in enumerate(row_data):
            table_row.cells[col_index].text = str(cell_value)

# Use it
doc = docx.Document()
doc.add_heading("Q4 Regional Summary", level=1)
headers = ["Region", "Revenue", "Units Sold", "vs Target"]
rows = [
    ["Chicago", "$1,234,567", "9,842", "+4.2%"],
    ["Nashville", "$987,432", "7,891", "-1.3%"],
    ["Cincinnati", "$876,543", "7,012", "+2.1%"],
    ["St. Louis", "$765,432", "6,123", "+0.8%"],
]
add_data_table(doc, headers, rows)
doc.save("/data/reports/q4_summary.docx")
18.6 Combining Python Data with Document Templates: The Report Generation Pattern
The pattern that makes document automation powerful:
- A business user designs the report layout in Word (or defines the data structure)
- Python connects to the data source (CSV, database, API)
- Python populates the template with live data
- The output is a properly formatted, branded document
Here is the full pattern assembled:
"""
The Report Generation Pattern
Data Source -> Python -> Template -> Final Document
"""
import datetime
from pathlib import Path
import pandas
import docx

def load_regional_data(csv_path: Path, region: str) -> dict:
    """
    Load and aggregate data for a specific region from the Acme sales CSV.
    Returns a dict of summary statistics ready to be inserted into a report.
    """
    df = pandas.read_csv(csv_path)
    region_df = df[df["region"] == region].copy()
    if region_df.empty:
        raise ValueError(f"No data found for region: {region}")
    total_revenue = region_df["revenue"].sum()
    total_units = region_df["units_sold"].sum()
    top_product = region_df.groupby("product")["revenue"].sum().idxmax()
    avg_order = region_df["revenue"].mean()

    # Build the sales table for the document
    product_summary = (
        region_df.groupby("product")
        .agg(revenue=("revenue", "sum"), units=("units_sold", "sum"))
        .reset_index()
        .sort_values("revenue", ascending=False)
    )
    return {
        "region": region,
        "total_revenue": f"${total_revenue:,.0f}",
        "total_units": f"{total_units:,}",
        "top_product": top_product,
        "avg_order_value": f"${avg_order:,.2f}",
        "product_rows": product_summary.values.tolist(),
        "product_headers": ["Product", "Revenue", "Units Sold"],
        "report_date": datetime.date.today().strftime("%B %d, %Y"),
    }

def generate_regional_report(
    data: dict,
    template_path: Path,
    output_path: Path,
) -> Path:
    """
    Generate a filled regional report from a template and data dict.
    """
    # Build the simple placeholder replacements
    replacements = {
        "{{REGION}}": data["region"],
        "{{TOTAL_REVENUE}}": data["total_revenue"],
        "{{TOTAL_UNITS}}": data["total_units"],
        "{{TOP_PRODUCT}}": data["top_product"],
        "{{AVG_ORDER_VALUE}}": data["avg_order_value"],
        "{{REPORT_DATE}}": data["report_date"],
    }

    # Fill text placeholders. A placeholder may be split across runs, so
    # reuse replace_placeholders_in_paragraph() from Section 18.4 rather
    # than searching each run separately.
    doc = docx.Document(str(template_path))
    for paragraph in doc.paragraphs:
        replace_placeholders_in_paragraph(paragraph, replacements)

    # Find the product table placeholder and fill it
    # (A paragraph containing "{{PRODUCT_TABLE}}" marks where the table goes)
    for paragraph in doc.paragraphs:
        if "{{PRODUCT_TABLE}}" in paragraph.text:
            # Clear the placeholder paragraph
            paragraph.clear()
            # add_table() always appends at the end of the document body
            table = doc.add_table(
                rows=1 + len(data["product_rows"]),
                cols=len(data["product_headers"])
            )
            table.style = "Light Shading Accent 1"
            # Header
            for col, header in enumerate(data["product_headers"]):
                table.rows[0].cells[col].text = header
            # Data
            for row_i, row in enumerate(data["product_rows"], start=1):
                for col, val in enumerate(row):
                    table.rows[row_i].cells[col].text = str(val)
            # Move the table so it sits directly after the placeholder paragraph.
            # _p and _tbl are python-docx internals (the underlying XML elements).
            paragraph._p.addnext(table._tbl)
            break

    output_path.parent.mkdir(parents=True, exist_ok=True)
    doc.save(str(output_path))
    return output_path
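One detail worth noticing: `product_rows` comes from `.values.tolist()`, so the revenue and unit cells reach the table as raw numbers, while the summary placeholders are formatted with currency symbols and thousands separators. A small formatting step before building the table keeps the two consistent. The sketch below assumes each row is [product, revenue, units], matching the headers above; `format_product_rows` is a hypothetical helper.

```python
def format_product_rows(rows):
    """Turn [product, revenue, units] rows into display strings for a table."""
    formatted = []
    for product, revenue, units in rows:
        formatted.append([
            str(product),
            f"${revenue:,.0f}",   # whole dollars with thousands separators
            f"{int(units):,}",
        ])
    return formatted

sample_rows = [["Furniture", 312890.4, 1520], ["Cleaning", 145323.0, 980]]
for row in format_product_rows(sample_rows):
    print(row)
```

Applying it to `data["product_rows"]` before filling the table would be one option; another is to format inside `load_regional_data` so every consumer of the dict sees the same strings.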
18.7 Generating PDFs with reportlab (An Introduction)
pypdf can read, split, and merge existing PDFs, but it has no tools for laying out new content from scratch. For generating new PDF files from Python data, reportlab is the usual choice.
pip install reportlab
reportlab is significantly more complex than python-docx — it works at a lower level, placing content at specific coordinates on a page. This is a brief introduction; full coverage is in Chapter 36 (Automated Report Generation).
from pathlib import Path
from reportlab.lib.pagesizes import LETTER
from reportlab.lib.units import inch
from reportlab.pdfgen import canvas as pdf_canvas

def create_simple_pdf(output_path: Path, data: dict) -> Path:
    """
    Create a simple PDF with a title and some data rows.
    This demonstrates the reportlab canvas API.
    Full report generation with tables and styles is covered in Chapter 36.
    Args:
        output_path: Where to save the PDF.
        data: Dict with 'title' (str), optional 'subtitle' (str),
            'headers' (list of str), and 'rows' (list of lists, one per row).
    Returns:
        Path to the created PDF.
    """
    output_path.parent.mkdir(parents=True, exist_ok=True)

    # Create a canvas object (the "page")
    c = pdf_canvas.Canvas(str(output_path), pagesize=LETTER)
    page_width, page_height = LETTER  # 8.5 x 11 inches in points
    # Coordinates in reportlab: (0, 0) is bottom-left corner
    # LETTER page: 612 x 792 points

    # ── TITLE ─────────────────────────────────────────────────────────────────
    c.setFont("Helvetica-Bold", 18)
    c.drawString(inch, page_height - inch, data["title"])

    # ── SUBTITLE ──────────────────────────────────────────────────────────────
    c.setFont("Helvetica", 12)
    c.setFillColorRGB(0.4, 0.4, 0.4)
    c.drawString(inch, page_height - 1.4 * inch, data.get("subtitle", ""))

    # ── LINE SEPARATOR ────────────────────────────────────────────────────────
    c.setStrokeColorRGB(0.8, 0.8, 0.8)
    c.line(inch, page_height - 1.6 * inch, page_width - inch, page_height - 1.6 * inch)

    # ── COLUMN HEADERS ────────────────────────────────────────────────────────
    if data.get("headers") and data.get("rows"):
        c.setFillColorRGB(0, 0, 0)
        c.setFont("Helvetica-Bold", 10)
        y_position = page_height - 2.0 * inch
        x_start = inch
        col_width = (page_width - 2 * inch) / len(data["headers"])
        for col_index, header in enumerate(data["headers"]):
            c.drawString(x_start + col_index * col_width, y_position, str(header))

        # ── DATA ROWS ─────────────────────────────────────────────────────────
        c.setFont("Helvetica", 10)
        row_height = 0.25 * inch
        for row_index, row in enumerate(data["rows"]):
            y_position -= row_height
            # Alternate row shading
            if row_index % 2 == 0:
                c.setFillColorRGB(0.95, 0.95, 0.95)
                c.rect(
                    inch, y_position - 3,
                    page_width - 2 * inch, row_height,
                    fill=1, stroke=0
                )
            # Switch back to black before drawing the row text
            c.setFillColorRGB(0, 0, 0)
            for col_index, value in enumerate(row):
                c.drawString(
                    x_start + col_index * col_width,
                    y_position,
                    str(value)
                )

    # Finish the page and write the PDF file
    c.save()
    return output_path

# Quick usage example
data = {
    "title": "Acme Corp — Q4 2024 Summary",
    "subtitle": "Prepared by Priya Okonkwo | January 10, 2025",
    "headers": ["Region", "Revenue", "Units", "Variance"],
    "rows": [
        ["Chicago", "$1,234,567", "9,842", "+4.2%"],
        ["Nashville", "$987,432", "7,891", "-1.3%"],
        ["Cincinnati", "$876,543", "7,012", "+2.1%"],
        ["St. Louis", "$765,432", "6,123", "+0.8%"],
    ],
}
output = Path("/data/reports/q4_summary.pdf")
create_simple_pdf(output, data)
print(f"PDF created: {output}")
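Because the origin is the bottom-left corner, positions measured "from the top" have to be converted, and a long table will run off the bottom of the page unless you count rows. The arithmetic can be sketched without reportlab at all. Both helpers below are hypothetical, and they assume 72 points per inch on a LETTER page (612 x 792 points) with a one-inch bottom margin.

```python
POINTS_PER_INCH = 72
LETTER_WIDTH = 8.5 * POINTS_PER_INCH    # 612 points
LETTER_HEIGHT = 11 * POINTS_PER_INCH    # 792 points

def y_from_top(inches_from_top, page_height=LETTER_HEIGHT):
    """Convert a distance from the top edge into a bottom-left-origin y value."""
    return page_height - inches_from_top * POINTS_PER_INCH

def rows_that_fit(first_row_top_in, row_height_in, bottom_margin_in=1.0,
                  page_height=LETTER_HEIGHT):
    """Count the rows that fit between the first row and the bottom margin."""
    usable = y_from_top(first_row_top_in, page_height) - bottom_margin_in * POINTS_PER_INCH
    return max(0, int(usable // (row_height_in * POINTS_PER_INCH)))

print(y_from_top(1.0))           # y for content one inch below the top edge
print(rows_that_fit(2.0, 0.25))  # rows available below a header two inches down
```

When a table has more rows than fit, the canvas method `showPage()` ends the current page so drawing can continue at the top of a new one; the function above does not do that, so it is only suitable for short tables.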
For professional report generation with complex layouts, tables, charts, and page headers/footers, Chapter 36 covers reportlab in depth using its higher-level platypus API.
18.8 Extracting Data from Vendor PDFs: Practical Patterns
The invoice reconciliation scenario that opened this chapter is one of the most common real-world PDF automation tasks. Here is the complete approach.
The Problem with Vendor PDFs
Every vendor's invoice is formatted differently. Priya's 47 invoices come from 12 different vendors, each with its own template. The realistic options are:
- Write a vendor-specific extractor for each format. High accuracy, but requires maintenance when vendors update their template.
- Write a general-purpose extractor using heuristics. Lower accuracy, works across formats, but will have failures that require manual review.
- Hybrid: general extractor with a manual review queue. The professional approach.
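The routing behind the hybrid option can be sketched as a small registry: try a vendor-specific extractor when the vendor is recognized, fall back to a general heuristic, and send anything left over to manual review. Everything named here is a hypothetical illustration: the vendor "Northwind Supply", its "AMOUNT PAYABLE" line, and the `route_invoice` helper are assumptions for the example, not part of the chapter's extractor.

```python
from typing import Callable, Optional

Extractor = Callable[[str], Optional[float]]

# Hypothetical vendor-specific extractor for a made-up invoice layout
def extract_northwind_total(text: str) -> Optional[float]:
    for line in text.splitlines():
        if line.startswith("AMOUNT PAYABLE"):
            return float(line.split()[-1].replace(",", "").lstrip("$"))
    return None

# Registry mapping a vendor name (as it appears in the text) to its extractor
VENDOR_EXTRACTORS: dict[str, Extractor] = {
    "Northwind Supply": extract_northwind_total,
}

def route_invoice(text: str, general_extractor: Extractor):
    """Return (total, source), where source is 'vendor', 'general', or 'manual'."""
    for vendor_name, extractor in VENDOR_EXTRACTORS.items():
        if vendor_name.lower() in text.lower():
            total = extractor(text)
            if total is not None:
                return total, "vendor"
    total = general_extractor(text)
    if total is not None:
        return total, "general"
    # Nothing worked: this invoice belongs in the manual review queue
    return None, "manual"

sample = "Northwind Supply\nAMOUNT PAYABLE $1,250.00"
print(route_invoice(sample, lambda text: None))
```

The design choice is that vendor-specific extractors are optional additions: you can start with the general extractor alone and add registry entries only for the vendors that fail most often.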
The Hybrid Approach
import re
from pathlib import Path
from dataclasses import dataclass, field

import pypdf


@dataclass
class InvoiceExtractionResult:
    """Results from attempting to extract invoice data from a PDF."""
    file_path: Path
    vendor_name: str = ""
    invoice_number: str = ""
    invoice_date: str = ""
    total_amount: float | None = None
    confidence: str = "low"  # "high", "medium", "low"
    raw_text: str = ""
    extraction_notes: list[str] = field(default_factory=list)
def extract_invoice_data(pdf_path: Path) -> InvoiceExtractionResult:
    """
    Attempt to extract key fields from a vendor invoice PDF.

    Uses a series of heuristic patterns to find the invoice total,
    invoice number, and date. The vendor_name field is not populated here.

    Returns an InvoiceExtractionResult. Always check the 'confidence'
    field: only "high" means the total, number, and date were all found.
    """
    result = InvoiceExtractionResult(file_path=pdf_path)

    # Extract raw text
    try:
        with open(pdf_path, "rb") as pdf_file:
            reader = pypdf.PdfReader(pdf_file)
            if reader.is_encrypted:
                result.extraction_notes.append("PDF is encrypted: cannot extract text")
                return result
            all_text = "\n".join(
                page.extract_text() or "" for page in reader.pages
            )
    except Exception as error:
        result.extraction_notes.append(f"PDF read error: {error}")
        return result

    if not all_text.strip():
        result.extraction_notes.append("No text extracted: likely a scanned image PDF")
        return result
    result.raw_text = all_text

    # ── EXTRACT TOTAL ─────────────────────────────────────────────────────────
    # Look for common "total due" phrases and grab the nearest dollar amount.
    # Patterns are tried in order, most specific first.
    total_patterns = [
        r"(?:total\s+due|amount\s+due|balance\s+due|invoice\s+total)[:\s]*\$?([\d,]+\.?\d*)",
        r"(?:grand\s+total|net\s+total|total\s+amount)[:\s]*\$?([\d,]+\.?\d*)",
        r"total[:\s]+\$?([\d,]+\.\d{2})",
    ]
    for pattern in total_patterns:
        match = re.search(pattern, all_text, re.IGNORECASE)
        if match:
            amount_str = match.group(1).replace(",", "")
            try:
                result.total_amount = float(amount_str)
                break
            except ValueError:
                # Matched something that is not a number (e.g. a lone comma)
                continue
    if result.total_amount is None:
        result.extraction_notes.append("Could not find invoice total: manual review required")

    # ── EXTRACT INVOICE NUMBER ────────────────────────────────────────────────
    inv_patterns = [
        r"invoice\s+(?:no\.?|number|#)\s*[:\s]?\s*([A-Z0-9\-]+)",
        r"inv[.\s]*(?:no\.?|#)\s*[:\s]?\s*([A-Z0-9\-]+)",
    ]
    for pattern in inv_patterns:
        match = re.search(pattern, all_text, re.IGNORECASE)
        if match:
            result.invoice_number = match.group(1).strip()
            break

    # ── EXTRACT DATE ──────────────────────────────────────────────────────────
    # Group 1: numeric dates after a "date" label; group 2: "Month D, YYYY" anywhere
    date_pattern = re.compile(
        r"(?:invoice\s+date|date)[:\s]+(\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4})"
        r"|(\w+\s+\d{1,2},\s+\d{4})",
        re.IGNORECASE
    )
    date_match = date_pattern.search(all_text)
    if date_match:
        result.invoice_date = (date_match.group(1) or date_match.group(2) or "").strip()

    # ── DETERMINE CONFIDENCE ──────────────────────────────────────────────────
    # Compare against None so a legitimate total of 0.00 still counts as found
    has_total = result.total_amount is not None
    if has_total and result.invoice_number and result.invoice_date:
        result.confidence = "high"
    elif has_total:
        result.confidence = "medium"
    else:
        result.confidence = "low"
    return result
def process_invoice_folder(
    invoices_dir: Path,
) -> tuple[list[InvoiceExtractionResult], list[InvoiceExtractionResult]]:
    """
    Process all PDF invoices in a folder.

    Returns:
        (high_confidence_results, needs_review_results)
    """
    high_confidence = []
    needs_review = []
    for pdf_file in sorted(invoices_dir.glob("*.pdf")):
        result = extract_invoice_data(pdf_file)

        # Compare against None so a legitimate $0.00 total is still displayed
        if result.total_amount is not None:
            total_display = f"${result.total_amount:,.2f}"
        else:
            total_display = "NOT FOUND"
        print(
            f"  {pdf_file.name:<40} "
            f"Total: {total_display:<15} "
            f"[{result.confidence}]"
        )

        # Anything short of "high" goes to the manual review queue
        if result.confidence == "high":
            high_confidence.append(result)
        else:
            needs_review.append(result)
    return high_confidence, needs_review
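The review queue itself can be as simple as a CSV file that a person works through. The sketch below is one possible shape, not part of the chapter's original code: write_review_queue and its choice of columns are assumptions made for illustration, and it relies only on the attribute names defined on InvoiceExtractionResult.

```python
import csv
from pathlib import Path
from typing import Iterable

# Columns written for each invoice. These match attribute names on
# InvoiceExtractionResult; which columns to include is an assumption.
REVIEW_COLUMNS = ["file_path", "invoice_number", "invoice_date", "total_amount", "confidence"]


def write_review_queue(results: Iterable[object], csv_path: Path) -> int:
    """Write one CSV row per result that needs manual review; return the row count."""
    row_count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(REVIEW_COLUMNS + ["notes"])
        for result in results:
            row = [getattr(result, column) for column in REVIEW_COLUMNS]
            # Leave a cell blank when nothing was extracted for that field
            row = ["" if value is None else value for value in row]
            # Join the extraction notes so the reviewer sees why it was flagged
            notes = "; ".join(getattr(result, "extraction_notes", []))
            writer.writerow(row + [notes])
            row_count += 1
    return row_count
```

Passing the second list returned by process_invoice_folder to write_review_queue produces a file listing exactly which invoices still need a human, along with the reason each one was flagged.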
Summary
- pypdf reads PDF files: extract text, metadata, split by page, and merge multiple PDFs. Import as import pypdf.
- PDF text extraction only works on text-based PDFs, not scanned images. Always test by trying to select text manually before building automation.
- PDF extraction is heuristic, not structural. Use regular expressions to find patterns, and always implement a manual review queue for low-confidence results.
- python-docx creates and reads .docx files. Use Document() for a blank document or Document(path) to open an existing one.
- The report generation pattern: load data from any source, populate a template or build a document programmatically, save to .docx.
- Placeholder replacement (finding {{CLIENT_NAME}} and replacing it) is the standard approach for template-based document generation.
- python-docx handles headings (add_heading()), paragraphs (add_paragraph()), and tables (add_table()). Formatting is applied through Run objects with .font.bold, .font.size, .font.color, and similar attributes.
- reportlab generates PDFs from scratch via a canvas API. It requires positioning content at explicit coordinates, making it more complex than python-docx. Full coverage in Chapter 36.
- For batch document generation (eight regional reports, forty vendor invoices), loop over your data and call the generation function once per record.