Exercises — Chapter 18: Working with PDFs and Word Documents

Starred exercises (★) have worked solutions in Appendix B.

Tier 1: Recall

1.1 What Python library do you use to read PDF files? What is the correct import statement? (Note: PyPDF2 is deprecated — name the current library.)

1.2 What does "PDF text extraction" mean, and what is the primary method used to do it with pypdf?

1.3 ★ When is PDF text extraction likely to fail completely? Describe the diagnostic test you can perform in a PDF viewer to predict whether extraction will work before writing any code.

1.4 What is python-docx? What file format does it work with, and what file format does it NOT support?

1.5 In python-docx, what is the difference between a Paragraph and a Run? When would you need to work with Run objects directly?

1.6 ★ What does pypdf.PdfReader.is_encrypted tell you, and why does it matter before attempting text extraction?

1.7 What is a "placeholder replacement" approach to document generation? Give an example of what a placeholder might look like in a template.

1.8 What is reportlab used for, and how does it differ in approach from python-docx?

Tier 2: Apply

2.1 ★ Write a function count_pdf_pages(pdf_path: Path) -> int that returns the number of pages in a PDF file. Return 0 if the file cannot be read.

2.2 Write a function extract_emails_from_pdf(pdf_path: Path) -> list[str] that:
- Extracts all text from the PDF
- Uses a regular expression to find all email addresses in the text
- Returns a deduplicated, sorted list of email addresses found

Test it by creating a simple PDF that contains some email addresses.

2.3 ★ Write a function merge_pdfs(pdf_paths: list[Path], output_path: Path) -> int that merges a list of PDFs into a single output file. Return the total number of pages in the merged document.

2.4 Using python-docx, write a function create_meeting_notes(title: str, date: str, attendees: list[str], agenda_items: list[str], action_items: list[dict]) -> Path that creates a Word document with:
- A bold title heading
- Date and attendee list
- Numbered agenda items
- An action items table with columns: Action, Owner, Due Date

Save the file as meeting_notes_YYYYMMDD.docx in the current directory.

2.5 ★ Write a function replace_text_in_document(template_path: Path, output_path: Path, replacements: dict) -> Path that opens a .docx file, replaces all placeholder strings with their values, and saves the result. Replacements should work in body paragraphs and table cells. Test it with a simple template containing three different placeholders.

2.6 Write a function extract_table_data(docx_path: Path) -> list[list[list[str]]] that extracts all tables from a Word document. Return a list of tables, where each table is a list of rows, and each row is a list of cell text strings. Print the result in a readable format.

Tier 3: Analyze

3.1 ★ The chapter describes a three-tier confidence system (high/medium/low) for invoice extraction results. Why is this preferable to either: (a) treating all extracted results as reliable, or (b) flagging everything that isn't 100% certain for manual review?

Describe the business consequences of each of those two alternatives.

3.2 PDF text extraction of tables often produces garbled, incorrectly ordered text. Explain why this happens based on how PDF files store content. What are your options when you need structured table data from a PDF?

3.3 The placeholder replacement function in the chapter has a subtle issue: if a placeholder like {{CLIENT_NAME}} spans multiple Run objects in the document, the simple approach of checking run.text will miss it. The chapter's implementation handles this by reconstructing the full paragraph text. What is the tradeoff of this approach? Under what circumstances might it lose formatting applied to individual runs?
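
To see the problem before analyzing the tradeoff, the following minimal sketch models runs without python-docx. FakeRun is a hypothetical stand-in for a Run (text plus one formatting flag), and the way the placeholder is split is an assumed example, not taken from the chapter's template.

```python
from dataclasses import dataclass


@dataclass
class FakeRun:
    """Hypothetical stand-in for a python-docx Run: text plus one format flag."""
    text: str
    bold: bool = False


def found_in_any_run(runs, placeholder):
    # The simple approach: look for the placeholder inside each run separately.
    return any(placeholder in run.text for run in runs)


def found_in_joined_text(runs, placeholder):
    # The reconstruction approach: join all run text, then search once.
    return placeholder in "".join(run.text for run in runs)


# Assumed example: the editor split the placeholder across two runs,
# and the second run carries its own formatting.
runs = [
    FakeRun("Dear {{CLIENT"),
    FakeRun("_NAME}}", bold=True),
    FakeRun(", welcome."),
]

print(found_in_any_run(runs, "{{CLIENT_NAME}}"))      # False
print(found_in_joined_text(runs, "{{CLIENT_NAME}}"))  # True
```

Note that the second run carries its own bold flag. Consider what has to happen to that flag once the replaced text is written back into the paragraph.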

3.4 ★ Compare the following two approaches to generating 50 Word reports:

# Approach A: Template filling
for client_data in clients:
    fill_template(template_path, output_path, client_data)

# Approach B: Build from scratch
for client_data in clients:
    build_report_from_scratch(client_data, output_path)

What are the tradeoffs? When would you prefer Approach A over B, and vice versa?

3.5 The chapter recommends using "rb" (read binary) mode when opening PDF files. Why is binary mode required? What would happen if you tried to open a PDF with open(path, "r") in text mode?
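
One way to investigate before answering is to run a small experiment with the standard library only. The sketch below writes a few PDF-like bytes to a temporary file and opens it in both modes. The helper name try_both_modes and the sample bytes are assumptions made for illustration. The text-mode attempt pins the encoding to UTF-8; the platform default may behave differently, which is worth considering in your answer.

```python
import tempfile
from pathlib import Path


def try_both_modes(data: bytes) -> dict:
    """Write data to a temp file, then read it back in binary and text mode."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.pdf"
        path.write_bytes(data)

        # Binary mode: bytes come back exactly as written.
        with open(path, "rb") as f:
            binary_round_trip = f.read() == data

        # Text mode: Python must decode the bytes into str first.
        try:
            with open(path, "r", encoding="utf-8") as f:
                f.read()
            text_mode = "decoded without error"
        except UnicodeDecodeError as exc:
            text_mode = f"UnicodeDecodeError: {exc.reason}"

    return {"binary_round_trip": binary_round_trip, "text_mode": text_mode}


# A PDF header followed by a comment line of high bytes, which many PDF
# writers emit to mark the file as binary.
sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"
print(try_both_modes(sample))
```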

Tier 4: Synthesize

4.1 ★ Build a complete statement_parser.py script that processes a folder of bank statement PDFs and extracts:
- Statement date (month/year)
- Opening balance
- Closing balance
- Total credits
- Total debits

Write the results to a CSV with one row per statement. Include a confidence level for each extraction. Handle the case where some statements are scanned images (no extractable text) by flagging them in the output.

4.2 Create a contract_generator.py that:
- Reads a .docx contract template with at least 8 different placeholders
- Reads client data from a CSV file with one row per client
- Generates one filled contract per CSV row
- Names output files as YYYY-MM-DD_ClientName_Contract.docx
- Logs a summary of generated contracts including any clients with missing required fields
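
Hint: the output naming scheme is a small, separately testable piece. The sketch below is one possible helper, not a required design. The name contract_filename and the "UnknownClient" fallback are assumptions, and the simple character filter drops accented letters, which may or may not suit your client list.

```python
import re
from datetime import date


def contract_filename(client_name: str, on: date) -> str:
    """Build a YYYY-MM-DD_ClientName_Contract.docx file name."""
    # Keep only ASCII letters and digits so the name is safe in a file path.
    safe_name = re.sub(r"[^A-Za-z0-9]+", "", client_name) or "UnknownClient"
    return f"{on.isoformat()}_{safe_name}_Contract.docx"


print(contract_filename("Acme Corp.", date(2024, 3, 5)))
# 2024-03-05_AcmeCorp_Contract.docx
```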

4.3 Build a pdf_splitter_by_keyword.py that:
- Takes a multi-page PDF (e.g., a combined monthly statement file)
- Splits it into separate PDFs at each page that contains a specific keyword (e.g., "Invoice", "Statement", or a customer name)
- Names the output files based on the keyword found on the first page of each section

4.4 Build a weekly_report_generator.py for Maya that:
- Reads her maya_projects.csv (active and invoiced projects)
- Generates a Word document summary report showing active projects, recently completed projects, and outstanding invoices
- Includes a summary table at the top with total active project value, total outstanding invoices, and total invoiced this month
- Appends a "generated on" timestamp footer to each page

Tier 5: Challenge

5.1 (Research and Build) The chapter introduces reportlab briefly. Research the reportlab.platypus (Page Layout and Typography Using Scripts) higher-level API. Build a financial_report_pdf.py that generates a multi-page PDF report with:
- A branded cover page (company name, logo placeholder, date)
- A Table of Contents (manually constructed with page references)
- At least two data tables using platypus.Table with custom table styles
- A page header and footer on every page (using a BaseDocTemplate with frames)
- Proper page numbering

5.2 (Open-Ended) Research the limitations of PDF text extraction for financial documents. Write a 400-word analysis covering: (1) why scanned PDFs cannot be processed with pypdf alone and what additional step is needed (hint: OCR), (2) two Python libraries that can perform OCR on PDFs and their tradeoffs, and (3) the point at which the complexity of automated extraction stops justifying the effort, so that manual entry or a vendor-provided API becomes the better solution. Include at least one concrete business scenario where you would recommend against automated PDF extraction.

5.3 (Build) Create a pdf_to_excel.py tool that:
- Reads a PDF containing tabular data (with a text layer, not scanned)
- Attempts to detect and parse tables using positional text analysis (group text elements by their Y-coordinate, then by X-coordinate within each row)
- Exports detected tables to Excel, one sheet per detected table
- Works without any third-party table extraction library; use only pypdf and openpyxl
- Includes a confidence score for each detected table based on row/column regularity

Note: This is genuinely hard. The challenge is that pypdf's default extract_text() does not preserve table structure. You will need to use page.extract_text(extraction_mode="layout") in newer versions of pypdf; the visitor_text callback of extract_text() is another way to get at text positions.
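
Hint: the grouping step can be developed and tested on its own, before any PDF is involved. The sketch below assumes you have already collected text fragments as (x, y, text) tuples; how you obtain them from pypdf is the hard part and is left to you. The function name group_into_rows and the y_tolerance default are assumptions, and the tolerance will need tuning per document.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows, top to bottom, left to right.

    Assumes PDF-style coordinates where y grows upward, so larger y values
    are nearer the top of the page.
    """
    rows = []  # each entry: (anchor_y, [(x, text), ...])
    for x, y, text in sorted(fragments, key=lambda frag: -frag[1]):
        # Join the current row if this fragment sits close to its anchor y.
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order cells by x so columns read left to right.
    return [
        [text for _, text in sorted(cells, key=lambda cell: cell[0])]
        for _, cells in rows
    ]


# Made-up fragments for illustration: two rows with slightly uneven baselines.
fragments = [
    (50, 700.0, "Item"),
    (200, 700.5, "Qty"),
    (200, 680.0, "3"),
    (50, 680.4, "Widget"),
]
print(group_into_rows(fragments))
# [['Item', 'Qty'], ['Widget', '3']]
```

Rows with the same number of cells at similar x positions are one possible signal for the confidence score the exercise asks for.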