Quiz — Chapter 18: Working with PDFs and Word Documents

DataField.Dev

Questions

1. Which Python library should you use in 2024/2025 to read PDF files? (PyPDF2 is deprecated.)

A) pdfplumber
B) pypdf
C) PyPDF2
D) pdf-python

2. You open a PDF in your PDF viewer and try to select some text. No text can be selected — clicking anywhere on the page just shows a cursor but never highlights anything. What does this tell you about automated text extraction?

A) The PDF is password-protected; you need the password to extract text
B) The PDF is a scanned image with no embedded text layer; pypdf cannot extract its text
C) The PDF uses an uncommon font encoding; pypdf needs a font mapping to decode it
D) The PDF is too large; you need to split it into smaller files before extracting

3. What is the correct way to open a PDF file with pypdf?

A) reader = pypdf.PdfReader("file.pdf")
B) reader = pypdf.PdfReader(open("file.pdf", "r"))
C) reader = pypdf.PdfReader(open("file.pdf", "rb"))
D) Both A and C are correct

4. pypdf.PdfReader.pages returns:

A) A list of strings containing the text of each page
B) A list of Page objects; call .extract_text() on each to get the text
C) An integer count of pages in the document
D) A generator that yields page number/text tuples

5. Which of the following correctly creates a new, blank Word document with python-docx?

A) doc = docx.Word()
B) doc = docx.Document()
C) doc = docx.Document.new()
D) doc = docx.open()

6. In python-docx, what is the difference between doc.add_heading("Title", level=0) and doc.add_heading("Executive Summary", level=1)?

A) Level 0 creates a bold paragraph; level 1 creates a document title
B) Level 0 applies the "Title" style (document title, largest); level 1 applies "Heading 1" style
C) Level 0 is the first heading in the document; level 1 is the second
D) Level 0 and level 1 produce identical output

7. You want to add bold text followed by normal text in the same paragraph using python-docx. Which is the correct approach?

A) doc.add_paragraph("**Revenue grew** by 4.2%")
B) Use para.add_run() twice: one with run.font.bold = True, one without
C) Use para.add_paragraph("Revenue grew", style="Bold") then para.add_paragraph(" by 4.2%")
D) doc.add_paragraph("<b>Revenue grew</b> by 4.2%")

8. When using the placeholder replacement pattern, why do placeholders typically use a distinctive format like {{CLIENT_NAME}} rather than just CLIENT_NAME?

A) Python requires double curly braces for variable substitution in strings
B) Word's built-in spell checker flags CLIENT_NAME as a spelling error
C) Distinctive markers prevent accidentally replacing common words that appear in normal document text
D) The {{}} syntax makes the placeholder detectable by python-docx's built-in find_replace() method

9. What does pypdf.PdfWriter.add_page(page) do?

A) Adds a blank new page to the writer
B) Copies an existing page from a reader into the writer for output
C) Adds a page from an image file to the writer
D) Appends all pages from a reader to the writer

10. You run doc.save("report.docx") with python-docx. The file already exists at that path. What happens?

A) A FileExistsError is raised
B) The file is overwritten without warning
C) The new content is appended to the existing document
D) A backup of the old file is created before overwriting

11. shutil.make_archive() creates a ZIP file. To create a PDF from Python data, you use:

A) pypdf.PdfWriter: it can create PDFs from scratch
B) python-docx: save as .pdf using the format="pdf" parameter
C) reportlab: it generates PDFs from Python code via a canvas API
D) PyPDF2: the modern replacement for pypdf supports PDF creation

12. You extract text from a vendor invoice PDF and the result is:

Invoice#: 2024-1042 Total$: 12,450.00 Due: 2025-01-15
DESCRIPTION QTY UNIT PRICETOTAL Office Supplies 50 $45.00 $2,250...

The table text runs together with no proper spacing between columns. This is because:

A) pypdf has a bug that removes newlines from extracted text
B) PDF table cells do not have structural boundaries; text is positioned by coordinates and may be extracted in a reading order that concatenates cells
C) The invoice was generated with a non-standard encoding that pypdf cannot parse correctly
D) The PDF was created with a password that garbled the text layer

13. In the invoice extraction case study, Priya routes "low confidence" results to a manual review queue rather than including them in the main output or raising an error. What is the business justification for this approach?

A) Low confidence results are always wrong and should be discarded
B) Finance requires 100% automated extraction with no manual involvement
C) The hybrid approach captures the efficiency gain from automation on most invoices while ensuring data quality through human review of edge cases
D) Python cannot display an error when confidence is low

14. Which python-docx method correctly adds a table with 3 rows and 4 columns to a document?

A) doc.add_table(3, 4)
B) doc.add_table(rows=3, cols=4)
C) doc.Table(rows=3, cols=4)
D) doc.insert_table(rows=3, columns=4)

15. You want to set the font color of a run to a specific RGB value (dark blue: R=0, G=53, B=107). Which code is correct?

A) run.font.color = (0, 53, 107)
B) run.font.color.rgb = RGBColor(0, 53, 107)
C) run.font.color = "#00356B"
D) run.color = RGBColor(0x00, 0x35, 0x6B)

16. To merge three PDF files into one with pypdf, you should:

A) Concatenate the file bytes and save
B) Create a PdfWriter, call writer.add_page(page) for every page from every reader, then save the writer
C) Use pypdf.merge(file1, file2, file3, output)
D) Open all three with a single PdfReader call and save

17. What does reader.metadata.get("/Title", "") return for a PDF that has no title set in its properties?

A) Raises a KeyError
B) Returns None
C) Returns ""
D) Returns "/Title"

18. In python-docx, styles like "List Bullet" and "Light Shading Accent 1" must:

A) Be defined in your Python code before use
B) Be imported from docx.styles
C) Already exist in the document's style definitions (built-in to Word's default document or the template you opened)
D) Be downloaded from the python-docx style repository

Answer Key

Q	Answer	Explanation
1	B	`pypdf` is the actively maintained library. PyPDF2 was deprecated in 2022 and should not be used in new code.
2	B	If no text can be selected in a PDF viewer, the page is most likely a scanned image with no text layer. pypdf extracts text from the text layer; it cannot read image content, so a file like this needs OCR before its text can be extracted.
3	D	Both A and C work. `PdfReader` accepts either a file path (option A, the modern shorthand) or a file object opened in binary mode (option C). Option B fails because PDF is a binary format and `"r"` opens the file in text mode.
4	B	`reader.pages` is a list-like sequence of `PageObject` instances that supports indexing, `len()`, and iteration. You call `.extract_text()` on each one to get its text.
5	B	`docx.Document()` with no arguments creates a new blank document. `docx.Document(path)` opens an existing one.
6	B	Level 0 applies the "Title" style (the document title, the largest heading). Level 1 applies "Heading 1" (the standard section heading).
7	B	python-docx does not parse markdown or HTML. You add runs to a paragraph and set formatting on each run individually.
8	C	Placeholders like `{{CLIENT_NAME}}` are intentionally distinctive. If you used `CLIENT_NAME`, you might accidentally replace that text where it appears naturally in the document.
9	B	`writer.add_page(page)` copies an existing page object (from a reader) into the writer. This is the foundation of PDF splitting and merging.
10	B	`doc.save()` overwrites the file silently. If you need to preserve the original, copy it first or use a different output filename.
11	C	`reportlab` generates PDFs from Python code. pypdf manipulates existing PDFs (merge, split) and can add blank pages, but it has no canvas-style API for laying out new text or graphics. `python-docx` works only with `.docx`.
12	B	PDFs store text as positioned strings on a canvas. There are no row/column boundaries for tables. Extracted text follows the order of those positioned strings, which often concatenates what visually appear to be separate cells.
13	C	The hybrid model (automate most, review the rest) is the professional approach. It maximizes efficiency gains while maintaining data quality. Automating blindly risks financial errors; routing everything to manual review defeats the purpose.
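14	B	The method signature is `add_table(rows, cols, style=None)`, and option B names both arguments explicitly. python-docx also accepts the positional call in option A, so B is the best answer because it leaves no doubt about which number is rows and which is columns. Options C and D call methods that do not exist.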
15	B	`run.font.color.rgb` accepts an `RGBColor` object. Import `RGBColor` from `docx.shared`.
16	B	Create a `PdfWriter`, iterate over pages from each reader using `writer.add_page()`, then write the output.
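17	C	`metadata.get("/Title", "")` returns the default value (`""`) when the key is absent, which is standard Python dict behavior. This assumes `reader.metadata` itself is not `None`; pypdf returns `None` when the PDF has no document information dictionary at all.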
18	C	python-docx style names must already be defined in the document's style set. Built-in Word styles (like "List Bullet", "Heading 1", table styles) are available in the default blank document. Custom styles must be in the document or template you opened.
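
The reasoning behind answer 8 can be sketched in a few lines of plain Python. This is a minimal illustration, not the chapter's own code: the template lines stand in for paragraph text read from a .docx file, and the names `fill_placeholders`, `fill_bare`, and the sample values are hypothetical. It compares `{{KEY}}` markers against bare-word placeholders on the same text.

```python
import re

# Hypothetical values to merge into the template.
values = {"CLIENT_NAME": "Acme Corp", "DATE": "2025-01-15"}

# Stand-ins for paragraph text; the second line contains "DATE" inside "UPDATE".
marked_template = [
    "Prepared for {{CLIENT_NAME}} on {{DATE}}",
    "See the UPDATE log for changes.",
]
bare_template = [
    "Prepared for CLIENT_NAME on DATE",
    "See the UPDATE log for changes.",
]

# Matches {{KEY}} where KEY is letters, digits, or underscores.
MARKER = re.compile(r"\{\{(\w+)\}\}")


def fill_placeholders(lines, values):
    """Replace each {{KEY}} marker with values[KEY]; leave unknown markers as-is."""
    return [
        MARKER.sub(lambda m: values.get(m.group(1), m.group(0)), line)
        for line in lines
    ]


def fill_bare(lines, values):
    """Replace every occurrence of each bare key, wherever it appears."""
    filled = []
    for line in lines:
        for key, value in values.items():
            line = line.replace(key, value)
        filled.append(line)
    return filled


marked_result = fill_placeholders(marked_template, values)
bare_result = fill_bare(bare_template, values)

print(marked_result[1])  # See the UPDATE log for changes.
print(bare_result[1])    # See the UP2025-01-15 log for changes.
```

Both versions fill the first line correctly, but only the bare-word version corrupts "UPDATE" in the second. In a real .docx file a marker can also be split across several runs by Word, so a replacement that inspects one run at a time may miss it; that case is outside this sketch.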