Chapter 9 Further Reading: File I/O — Reading and Writing Business Data

This list is organized by topic and annotated with guidance on who each resource is most useful for. All resources listed here were accurate and actively maintained as of early 2025. URLs for official documentation are stable; for third-party resources, titles are provided so you can search if a URL changes.


Official Python Documentation

pathlib — Object-Oriented Filesystem Paths

URL: https://docs.python.org/3/library/pathlib.html

The complete reference for the pathlib module. The "Basic use" section at the top is worth reading as a unit — it walks through the most common operations with clear examples. The full method reference is the place to go when you want to know whether a method exists (e.g., "Can I compare two paths? Can I get the home directory?"). Intermediate level.
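As a quick taste of the operations the reference answers (the path names here are hypothetical):

```python
from pathlib import Path

# Common lookups: the home directory, and joining segments with "/"
home = Path.home()                        # the current user's home directory
config = home / ".config" / "app.json"    # hypothetical config path

# Paths compare by their components, so equivalent spellings are equal
print(Path("reports/2024") == Path("reports") / "2024")   # True

# Components are attributes rather than string-slicing exercises
print(config.name)     # app.json
print(config.suffix)   # .json
```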

io — Core Tools for Working with Streams

URL: https://docs.python.org/3/library/io.html

The technical reference for Python's I/O system, including the TextIOWrapper class that underlies open(). Most working developers rarely need this level of detail, but it is the authoritative source if you encounter unusual encoding behavior or need to understand why a file object behaves a certain way. Advanced level.

csv — CSV File Reading and Writing

URL: https://docs.python.org/3/library/csv.html

The complete reference for the csv module. Pay particular attention to the Dialect class and the csv.register_dialect() function — these allow you to handle CSVs with non-standard delimiters (pipes, tabs, semicolons) or unusual quoting rules, which you will encounter when dealing with data exported from older enterprise systems. Intermediate level.
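A minimal sketch of the dialect mechanism, using a made-up dialect name and pipe-delimited sample data:

```python
import csv
import io

# Register a hypothetical dialect for pipe-delimited legacy exports
csv.register_dialect("legacy_pipe", delimiter="|", quoting=csv.QUOTE_MINIMAL)

data = io.StringIO("id|name|amount\n1|Acme Corp|1200.50\n")
rows = list(csv.DictReader(data, dialect="legacy_pipe"))
print(rows[0]["name"])   # Acme Corp
```

Once registered, the dialect name works anywhere the csv module accepts a `dialect` argument, so the quirky format is described once rather than at every call site.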

json — JSON Encoder and Decoder

URL: https://docs.python.org/3/library/json.html

The complete reference for the json module. The section on encoding/decoding custom Python objects (the default parameter of json.dump()) is relevant once you start working with datetime objects, Decimal values, or Path objects that are not natively JSON-serializable. Intermediate level.
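A sketch of the `default` hook in action; the converter function and field names are invented for illustration:

```python
import json
from datetime import date
from decimal import Decimal
from pathlib import Path

def to_jsonable(obj):
    """Hypothetical fallback for types json cannot serialize natively."""
    if isinstance(obj, date):
        return obj.isoformat()
    if isinstance(obj, Decimal):
        return str(obj)    # strings preserve exact decimal values
    if isinstance(obj, Path):
        return str(obj)
    raise TypeError(f"Not serializable: {type(obj).__name__}")

record = {"invoiced": date(2025, 1, 15), "total": Decimal("1999.99")}
text = json.dumps(record, default=to_jsonable)
print(text)   # {"invoiced": "2025-01-15", "total": "1999.99"}
```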

os.path — Common Pathname Manipulations

URL: https://docs.python.org/3/library/os.path.html

The older, string-based API for path operations. You will encounter this in legacy codebases. Understanding the differences between os.path and pathlib helps you work confidently with code written before Python 3.4. The key functions — os.path.join(), os.path.exists(), os.path.basename() — have direct pathlib equivalents.
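The correspondence is close enough to show side by side (file names here are hypothetical):

```python
import os.path
from pathlib import Path

# Legacy, string-based style
old = os.path.join("data", "2025", "sales.csv")
print(os.path.basename(old))   # sales.csv
print(os.path.exists(old))     # False unless the file really exists

# The pathlib equivalents
new = Path("data") / "2025" / "sales.csv"
print(new.name)                # sales.csv
print(new.exists())
```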

glob — Unix Style Pathname Pattern Expansion

URL: https://docs.python.org/3/library/glob.html

The standalone glob module (distinct from pathlib's .glob() method). Useful when you need to pass a glob pattern as a string to a function or work with paths as strings rather than Path objects. The pattern syntax matches pathlib's, with one difference: recursive "**" matching must be enabled explicitly by passing recursive=True to glob.glob().
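A small self-contained sketch of both forms, using a throwaway directory so there is nothing to clean up:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a tiny hypothetical tree: two CSVs (one nested) and a text file
    os.makedirs(os.path.join(root, "q1"))
    for name in ("a.csv", os.path.join("q1", "b.csv"), "notes.txt"):
        open(os.path.join(root, name), "w").close()

    top = glob.glob(os.path.join(root, "*.csv"))                         # top level only
    deep = glob.glob(os.path.join(root, "**", "*.csv"), recursive=True)  # all levels
    print(len(top), len(deep))   # 1 2
```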


Python Enhancement Proposals (PEPs)

PEP 428 — The pathlib Module

URL: https://peps.python.org/pep-0428/

The design rationale for pathlib, written by Antoine Pitrou when proposing the module. Reading the "Rationale" section gives you the full picture of why the pathlib API was designed the way it was, and why the older string-based approach caused so many cross-platform bugs. Useful context, not required reading.


Books

"Fluent Python" by Luciano Ramalho (O'Reilly)

Chapter coverage varies by edition, but the sections on the data model and I/O are consistently excellent. Ramalho's explanation of how Python's I/O layer works at the buffer level provides the mental model for why newline="" matters. Best for learners who want deep understanding rather than quick answers. Intermediate to advanced.
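The newline="" point Ramalho explains can be seen directly: csv.writer emits "\r\n" row endings itself, and newline="" stops the text layer from translating them again (which would produce "\r\r\n" on Windows). A minimal sketch using a temporary file:

```python
import csv
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)

# newline="" hands line-ending control to csv.writer, as the csv docs require
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerow(["id", "name"])

with open(path, "rb") as f:          # read raw bytes to see the actual endings
    raw = f.read()
print(raw)   # b'id,name\r\n'
os.remove(path)
```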

"Python Cookbook" by David Beazley and Brian K. Jones (O'Reilly)

Chapter 5 ("Files and I/O") contains a compact set of recipes covering binary files, compressed files, temporary files, memory-mapped files, and serialization. Each recipe is self-contained and includes a discussion section. Excellent for learning the "why" behind less obvious patterns. Intermediate level.

"Automate the Boring Stuff with Python" by Al Sweigart

Chapters 8 and 9 cover reading/writing files and organizing files with pathlib and os. Sweigart's emphasis is on practical automation, and his examples are business-adjacent (working with files, directories, and Excel sheets). Freely available online at https://automatetheboringstuff.com. Beginner to intermediate.

"Python for Data Analysis" by Wes McKinney (O'Reilly)

While primarily about pandas, the early chapters cover reading and writing CSV and other data formats in ways that complement this chapter. McKinney's treatment of encoding issues and the differences between the standard library csv module and pandas' read_csv() is practically useful. Intermediate level; assumes some data analysis context.


Online Tutorials and Guides

Real Python: "Reading and Writing Files in Python (Guide)"

URL: https://realpython.com/read-write-files-python/

A thorough, well-structured tutorial covering the same ground as this chapter with additional depth on binary files, buffering, and working with file-like objects. The "Tips and Tricks" section at the end covers some patterns not in the standard introductory curriculum. Beginner to intermediate.

Real Python: "Working With JSON Data in Python"

URL: https://realpython.com/python-json/

Covers the json module in detail, including the challenge of serializing types that JSON does not natively support (datetime, Decimal, custom classes). The section on building a custom encoder by subclassing json.JSONEncoder is the right approach when you need to serialize Python objects with mixed types. Intermediate level.
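The subclassing approach looks like this; the encoder class and payload are hypothetical:

```python
import json
from datetime import datetime
from decimal import Decimal

class BusinessEncoder(json.JSONEncoder):
    """Hypothetical encoder for records that mix dates and exact decimals."""
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, Decimal):
            return str(obj)
        return super().default(obj)   # raises TypeError for anything else

payload = {"when": datetime(2025, 3, 1, 9, 30), "total": Decimal("42.10")}
text = json.dumps(payload, cls=BusinessEncoder)
print(text)   # {"when": "2025-03-01T09:30:00", "total": "42.10"}
```

Passing `cls=` once per dump call keeps the conversion rules in a single reusable place, which is the advantage over scattering `default=` functions through the codebase.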

Real Python: "Python's pathlib Module: Taming the File System"

URL: https://realpython.com/python-pathlib/

Possibly the most comprehensive pathlib tutorial available outside the official docs. Covers path components, reading and writing shortcuts (.read_text(), .write_text()), directory scanning, moving and deleting files, and the interaction between pathlib and the older os module. Intermediate level.

Real Python: "Reading and Writing CSV Files in Python"

URL: https://realpython.com/python-csv/

Covers the csv module with additional depth on dialects, custom delimiters, and the quoting constants (csv.QUOTE_ALL, csv.QUOTE_MINIMAL). Useful when working with non-standard CSV exports from enterprise software. Intermediate level.


Specific Topics Worth Exploring Next

Encoding: Understanding UTF-8 and Why It Matters

If you work with international data — names from multiple languages, currency symbols, smart quotes from Word documents — you will run into encoding issues sooner or later. Two resources:

  • "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets" by Joel Spolsky — a 2003 essay that remains the clearest plain-English explanation of character encoding. Available at https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

  • Python documentation: "Unicode HOWTO" at https://docs.python.org/3/howto/unicode.html — the official technical reference.
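The core idea both resources build on fits in a few lines: characters and bytes are different things, and UTF-8 is the bridge between them.

```python
# A str holds characters; bytes hold the encoded form written to disk
text = "Café €99"
data = text.encode("utf-8")

# Multi-byte characters make the two lengths differ: "é" is 2 bytes, "€" is 3
print(len(text), len(data))           # 8 11
print(data.decode("utf-8") == text)   # True: UTF-8 round-trips losslessly
```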

configparser — INI-Style Configuration Files

While JSON is the recommended configuration format in this chapter, many Python programs and Linux tools use .ini or .cfg files (the Windows INI format). Python's built-in configparser module reads and writes this format. If you are integrating with existing tooling that uses INI files, this is the right module.

URL: https://docs.python.org/3/library/configparser.html
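A minimal sketch of reading INI data; the section and keys are invented:

```python
import configparser

# Hypothetical settings.ini content, parsed from a string for brevity
INI = """
[database]
host = localhost
port = 5432
"""

config = configparser.ConfigParser()
config.read_string(INI)   # config.read("settings.ini") for a real file
print(config["database"]["host"])          # localhost
print(config.getint("database", "port"))   # 5432
```

Note that configparser stores everything as strings; the typed getters (getint(), getboolean(), getfloat()) do the conversion for you.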

tomllib — TOML Configuration Files (Python 3.11+)

TOML (Tom's Obvious, Minimal Language) is a more readable configuration format gaining popularity in the Python ecosystem — Python's own pyproject.toml uses it. Python 3.11 added tomllib to the standard library for reading TOML files (it is read-only; writing TOML requires a third-party package). If you are starting new projects and targeting Python 3.11+, TOML is worth knowing.

URL: https://docs.python.org/3/library/tomllib.html

tempfile — Temporary Files and Directories

The tempfile module creates temporary files that are automatically deleted when closed or when the script exits. This is safer than manually creating and deleting temp files, and is the right approach for the "safe write" pattern (write to temp, then rename to final).

URL: https://docs.python.org/3/library/tempfile.html
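One way the safe-write pattern can be sketched with tempfile; the helper name and target path are hypothetical:

```python
import os
import tempfile

def safe_write(path, text):
    """Sketch: write to a temp file in the target's directory, then
    atomically swap it into place with os.replace()."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        os.replace(tmp, path)   # readers never see a half-written file
    except BaseException:
        os.unlink(tmp)          # clean up the temp file on any failure
        raise

target = os.path.join(tempfile.gettempdir(), "report.txt")
safe_write(target, "quarterly totals\n")
```

Creating the temp file in the same directory as the target matters: os.replace() is only atomic when source and destination are on the same filesystem.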

shutil — High-Level File Operations

For copying, moving, archiving, and deleting files and entire directory trees, the shutil module provides high-level operations that pathlib does not. shutil.copy2() (copies with metadata), shutil.move(), and shutil.make_archive() (creates ZIP or TAR archives) are the most commonly needed.

URL: https://docs.python.org/3/library/shutil.html
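A small self-contained sketch of the two most common operations, using a throwaway directory and invented file names:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a hypothetical directory with one CSV in it
    src = os.path.join(root, "q1")
    os.makedirs(src)
    with open(os.path.join(src, "sales.csv"), "w") as f:
        f.write("id,amount\n1,100\n")

    # Copy preserving metadata, then bundle the directory into a ZIP
    shutil.copy2(os.path.join(src, "sales.csv"),
                 os.path.join(root, "backup.csv"))
    archive = shutil.make_archive(os.path.join(root, "q1_backup"), "zip", src)
    print(os.path.basename(archive))   # q1_backup.zip
```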


When You Outgrow the Standard Library

The patterns in this chapter handle the vast majority of business file I/O tasks. When the data grows larger or the requirements become more complex, these libraries become relevant:

pandas

pandas is the dominant library for tabular data analysis in Python. Its pd.read_csv() function reads a CSV file in a single line of code, performs automatic type inference, handles large files efficiently, and has built-in options for skipping or reporting bad rows. If you find yourself writing complex logic to aggregate or reshape CSV data, pandas is the natural next step.

URL: https://pandas.pydata.org/docs/

Recommended starting point: "10 minutes to pandas" in the official docs.

openpyxl and xlsxwriter

For working directly with Excel files (.xlsx) — reading data from spreadsheets, writing Excel files with formatting and multiple sheets — these libraries are the standard tools. openpyxl handles both reading and writing; xlsxwriter is write-only but has richer formatting support.

  • openpyxl: https://openpyxl.readthedocs.io
  • xlsxwriter: https://xlsxwriter.readthedocs.io

Note: for simple data extraction from Excel, exporting to CSV and using the standard csv module is often faster and more reliable.

SQLite and sqlite3

When your data exceeds a few thousand rows or you need to query it with complex filters and joins, a database is the right tool. Python's built-in sqlite3 module provides a full SQL database that lives in a single file with no server required. It is the natural evolution of the CSV-based patterns in this chapter.

URL: https://docs.python.org/3/library/sqlite3.html
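The shift from CSV loops to SQL queries can be sketched in a few lines; the table and figures are invented, and ":memory:" stands in for a real .db file path:

```python
import sqlite3

# An in-memory database; use a filename like "sales.db" to persist
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 1200.0), ("South", 800.0), ("North", 300.0)])

# One SQL query replaces the manual loop-and-accumulate over CSV rows
total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = ?", ("North",)
).fetchone()[0]
print(total)   # 1500.0
conn.close()
```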

Watchdog

For real-time file monitoring — triggering code when a new file appears in a directory — the watchdog library is the standard Python solution. This is the production implementation of the "file watcher" concept introduced in Exercise 16.

URL: https://python-watchdog.readthedocs.io


Practice Datasets

Working through this chapter with real-looking data makes the learning more concrete. These public data sources provide clean, downloadable CSV files suitable for the exercises:

  • Kaggle Datasets (https://www.kaggle.com/datasets) — thousands of real-world datasets in CSV format; the sales, finance, and HR categories are most relevant to this book's themes. Free account required.

  • Data.gov (https://data.gov) — U.S. government open data; CSV exports available for most datasets. Good for practicing with real business and economic data.

  • Sample data generators: websites like Mockaroo (https://www.mockaroo.com) let you design a custom schema and download generated CSV data with realistic-looking names, dates, and numbers. Useful for creating test data exactly shaped to your exercise needs.


A Note on Learning Sequencing

The concepts in this chapter feed directly into several important topics later in this book:

Chapter 10 (Exception Handling) builds directly on file I/O, because file operations are among the most common places where exceptions occur in production Python. The try/except patterns introduced in Chapter 10 will make your file-handling code substantially more robust.

Chapter 12 (Working with APIs) uses json.loads() to parse API responses — the same json module introduced here, applied to data arriving over HTTP rather than from a file on disk.

Chapter 14 (Automation and Scheduling) shows how to schedule scripts to run automatically, turning the manual "run this script every Monday" into a fully automated pipeline.

The file I/O skills you have built here are not a standalone topic. They are infrastructure — the layer everything else runs on.