Chapter 9 Key Takeaways: File I/O — Reading and Writing Business Data


The Big Ideas

Files make programs persistent. Every program you wrote before this chapter forgot everything the moment it stopped running. File I/O is the mechanism that lets your work survive between sessions — and that makes the difference between a script you run once to see what happens and a tool you actually rely on.

Knowing the file type determines your tool. Text files use open() with .read(), .write(), and iteration. CSV files use the csv module's DictReader and DictWriter. JSON files use json.load() and json.dump(). Using the right tool for each format prevents a class of parsing bugs that would take you hours to diagnose.

pathlib is not optional — it is the right way. String concatenation for file paths was always fragile. pathlib.Path objects build correct paths on any operating system, expose useful metadata, and read like English. Use them everywhere.


Concept-by-Concept Summary

pathlib and File Paths

  • Import with from pathlib import Path
  • Build paths with the / operator: Path("data") / "sales" / "report.csv"
  • Key attributes: .name (filename), .stem (name without extension), .suffix (extension), .parent (containing directory), .resolve() (absolute path)
  • Key methods: .exists(), .is_file(), .is_dir(), .stat() (size, timestamps), .mkdir(parents=True, exist_ok=True), .unlink() (delete), .rename()
  • Find files: .glob("*.csv") (current folder), .rglob("*.csv") (recursive), .iterdir() (all items)
  • Always sort glob() results with sorted() for deterministic ordering
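The attributes and methods above can be sketched together in one short example (the data/sales/report.csv path is hypothetical):

```python
from pathlib import Path

# Build a path with the / operator (correct on any OS)
report = Path("data") / "sales" / "report.csv"

# Inspect the parts of the path
print(report.name)    # report.csv
print(report.stem)    # report
print(report.suffix)  # .csv
print(report.parent)  # data/sales (data\sales on Windows)

# Check the filesystem before acting
if report.exists() and report.is_file():
    size_bytes = report.stat().st_size

# Find every CSV under data/, sorted for deterministic ordering
for csv_path in sorted(Path("data").rglob("*.csv")):
    print(csv_path)
```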

The open() Function

  • Core parameters: file, mode, encoding, newline
  • Common modes: "r" read (default), "w" write (overwrites), "a" append, "x" create-only
  • Always specify encoding="utf-8" — relying on the OS default produces encoding bugs on Windows
  • Always use a with statement — this guarantees the file is closed even if an exception occurs
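Putting those rules together, a minimal round trip looks like this (the notes.txt file name is hypothetical):

```python
from pathlib import Path

notes_path = Path("notes.txt")  # hypothetical file name

# "w" mode creates the file, or overwrites it if it exists; always pass encoding
with open(notes_path, mode="w", encoding="utf-8") as f:
    f.write("first line\n")

# "r" mode reads; the with block closes the file even if an exception occurs
with open(notes_path, mode="r", encoding="utf-8") as f:
    text = f.read()

print(text)  # first line
```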

Reading Text Files

  • .read() — loads entire file as one string; good for small files
  • .readline() — one line at a time; returns "" at end of file
  • .readlines() — all lines as a list; good when you need index access
  • Direct iteration (for line in file_handle) — best for large files; one line per iteration, never loads the whole file into memory
  • Use .strip() to remove trailing \n and other whitespace from lines you read
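Direct iteration plus .strip() is the combination you will reach for most often; a small sketch (the regions.txt file and its contents are hypothetical):

```python
from pathlib import Path

sample = Path("regions.txt")  # hypothetical sample file
sample.write_text("North\nSouth\nEast\n", encoding="utf-8")

# Direct iteration: one line per loop pass, never the whole file in memory
regions = []
with open(sample, mode="r", encoding="utf-8") as f:
    for line in f:
        regions.append(line.strip())  # drop the trailing "\n"

print(regions)  # ['North', 'South', 'East']
```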

Writing Text Files

  • .write(string) — writes a string; does not add newlines automatically
  • .writelines(iterable) — writes each string from an iterable; also does not add newlines
  • Mode "w" truncates (empties) the file on open — all previous content is gone
  • Mode "a" appends to the end; creates the file if it does not exist
  • For critical append logs, open, write one entry, and close each time — closing the file is what guarantees the data is flushed to disk
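A short sketch of both write methods and both modes (the activity.log file name is hypothetical):

```python
from pathlib import Path

log_path = Path("activity.log")  # hypothetical log file

# "w" truncates: after this block the file holds exactly these three lines
with open(log_path, mode="w", encoding="utf-8") as f:
    f.write("run started\n")                 # .write() adds no newline itself
    f.writelines(["step 1\n", "step 2\n"])   # neither does .writelines()

# "a" appends to the end; closing at the end of the with block flushes to disk
with open(log_path, mode="a", encoding="utf-8") as f:
    f.write("run finished\n")

print(log_path.read_text(encoding="utf-8"))
```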

The csv Module

  • Always open CSV files with newline="" — prevents blank-row bugs from newline translation
  • csv.reader — rows as lists; positional access only; brittle when columns change
  • csv.DictReader — rows as dicts keyed by header; recommended for all business use
  • csv.writer — write rows as lists
  • csv.DictWriter — write rows as dicts; requires explicit fieldnames; call writeheader() before writerows()
  • extrasaction="ignore" on DictWriter — silently drops dict keys not in fieldnames
  • CSV always returns strings — convert float(row["revenue"]), int(row["units_sold"]) before arithmetic
  • The most common CSV bug: forgetting type conversion and getting string concatenation instead of addition
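The DictWriter/DictReader round trip, including the type conversion that prevents the concatenation bug, might look like this (the sales.csv file and its rows are hypothetical):

```python
import csv
from pathlib import Path

sales_path = Path("sales.csv")  # hypothetical file

# Write with DictWriter: explicit fieldnames, header first, newline=""
rows = [
    {"rep_name": "Priya", "revenue": "1500"},
    {"rep_name": "Maya", "revenue": "2000"},
]
with open(sales_path, mode="w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["rep_name", "revenue"])
    writer.writeheader()
    writer.writerows(rows)

# Read with DictReader: every value comes back as a string
total = 0.0
with open(sales_path, mode="r", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):
        total += float(row["revenue"])  # convert before arithmetic!

print(total)  # 3500.0
```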

JSON Files

  • json.load(file_handle) — parse a JSON file into a Python dict/list
  • json.dump(data, file_handle, indent=2) — write a Python object to a JSON file
  • json.loads(string) — parse a JSON string (not a file) into a Python object
  • json.dumps(data, indent=2) — serialize a Python object to a JSON string
  • JSON → Python type mapping: {} → dict, [] → list, "string" → str, numbers → int/float, true/false → True/False, null → None
  • Use indent=2 for any JSON that a human might read; omit for compact machine-to-machine output
  • Dates are not a native JSON type — store them as ISO strings ("2024-04-01")
  • JSON is the right choice for: config files, API response caching, structured metadata/summaries
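All four json functions in one sketch (the config.json file and the settings dict are hypothetical):

```python
import json
from pathlib import Path

config_path = Path("config.json")  # hypothetical config file

settings = {
    "output_directory": "reports",
    "run_date": "2024-04-01",   # dates stored as ISO strings
    "verbose": True,
}

# dump/load work with file handles...
with open(config_path, mode="w", encoding="utf-8") as f:
    json.dump(settings, f, indent=2)

with open(config_path, mode="r", encoding="utf-8") as f:
    loaded = json.load(f)

# ...dumps/loads work with strings
text = json.dumps(settings, indent=2)
assert json.loads(text) == loaded == settings
```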

The Patterns You Will Use Again and Again

Pattern 1: Load → Modify → Save

records = load_from_csv(path)     # read everything into memory
records = transform(records)      # modify the in-memory list
save_to_csv(path, records)        # overwrite the file

Simple, predictable, works perfectly for files up to tens of thousands of rows.
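Filled in with this chapter's csv tools, the pattern might look like the following sketch (the reps.csv file, its rows, and the 10% revenue uplift are all hypothetical):

```python
import csv
from pathlib import Path

path = Path("reps.csv")  # hypothetical input/output file
path.write_text("rep_name,revenue\nPriya,1500\nMaya,2000\n", encoding="utf-8")

# Load: read everything into memory
with open(path, mode="r", encoding="utf-8", newline="") as f:
    records = list(csv.DictReader(f))

# Modify: apply a hypothetical 10% revenue uplift to the in-memory list
for record in records:
    record["revenue"] = str(round(float(record["revenue"]) * 1.1, 2))

# Save: overwrite the file with the transformed records
with open(path, mode="w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["rep_name", "revenue"])
    writer.writeheader()
    writer.writerows(records)
```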

Pattern 2: Append Logging

with open(log_path, mode="a", encoding="utf-8") as log_file:
    log_file.write(f"{timestamp}  {message}\n")

Open, write one entry, close. Repeat each time you have something to log. Closing after each write ensures data survives a crash.

Pattern 3: Bulk Processing with glob

for csv_path in sorted(input_dir.glob("*.csv")):
    records = read_one_file(csv_path)
    all_records.extend(records)
write_combined(output_path, all_records)

The sorted() call ensures the output is deterministic regardless of filesystem ordering.

Pattern 4: Config from JSON

with open(config_path, mode="r", encoding="utf-8") as json_file:
    config = json.load(json_file)
output_dir = Path(config["output_directory"])

Never hardcode settings in source code. Read them from a JSON or INI file at startup.

Pattern 5: Validate Before Processing

required = {"rep_id", "rep_name", "revenue", "quota"}
actual   = set(reader.fieldnames or [])
missing  = required - actual
if missing:
    raise ValueError(f"Missing columns: {missing}")

Check that required columns exist before reading any data. Fail loudly at the source rather than confusingly in the middle of processing.


Common Mistakes and How to Avoid Them

Mistake → Symptom → Fix

  • Forgetting the with statement → file left open; data may not be written to disk → always use with open(...) as f:
  • Omitting encoding="utf-8" → UnicodeDecodeError or garbled text on Windows → add encoding="utf-8" to every open() call
  • Omitting newline="" for CSV → blank rows between every data row → add newline="" to open() when using the csv module
  • Not converting CSV string types → "1500" + "2000" gives "15002000" instead of 3500 → always call float() or int() on numeric CSV fields
  • Opening in "w" mode accidentally → an existing file is wiped out → double-check your mode; use "a" when you want to append
  • Using string concatenation for paths → path breaks on a different OS → use pathlib.Path and the / operator
  • Calling .mkdir() without exist_ok=True → crashes if the directory already exists → always use mkdir(parents=True, exist_ok=True)
  • Forgetting writeheader() with DictWriter → CSV has no column names → call writer.writeheader() before writer.writerows()
  • Dividing by a field that might be zero → ZeroDivisionError → check if field != 0: before dividing
  • Storing computed values in a CSV → stale values, sync bugs → compute derived values at read time, not write time

The Professional Mindset

Separate your concerns. Your data-reading functions should not know anything about the business logic. Your business logic functions should not know anything about file formats. This makes each piece independently testable and reusable.

Write metadata. Whenever a script processes files, write a JSON metadata file recording what it did: which files it processed, how many records it found, whether any errors occurred, and when it ran. This costs you five minutes to add and saves you hours of debugging six months later.
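A metadata file of this kind is a small json.dump away; one possible shape (the field names and values here are purely illustrative, not a prescribed schema):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical run summary for a script that consolidated two region files
metadata = {
    "processed_files": ["north.csv", "south.csv"],
    "record_count": 842,
    "errors": [],
    "run_at": datetime.now(timezone.utc).isoformat(),
}

meta_path = Path("run_metadata.json")  # hypothetical output name
with open(meta_path, mode="w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```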

Validate at the boundary. The moment data enters your program from a file, validate it: check that required columns exist, that numeric fields can actually be converted, that dates are in an expected format. Errors caught at the boundary produce clear messages. Errors that slip through produce mysterious failures later.

Make scripts idempotent. A script is idempotent if running it twice produces the same result as running it once. For file-writing scripts, this usually means using "w" mode (not "a") for the final output, and using exist_ok=True when creating directories. Idempotent scripts are safe to re-run after a failure.


What These Skills Unlock

The techniques in this chapter are the foundation of a very large category of real business automation:

  • ETL pipelines: Extract data from source files, Transform it, Load it somewhere useful
  • Report automation: Read raw data, compute summaries, write formatted reports on a schedule
  • Data auditing: Compare old and new files, flag differences, track changes over time
  • Config-driven tools: Scripts that non-programmers can customize by editing a JSON file
  • Batch processing: Process a folder of files automatically without touching each one manually

Priya's Monday consolidation went from an hour of copy-pasting to a two-second script run. Maya's project tracking went from a confusing multi-tab spreadsheet to a clean CSV with automated reports. These are not toy examples — they are the kind of work that genuinely changes how a business day feels.