Chapter 9 Key Takeaways: File I/O — Reading and Writing Business Data


The Big Ideas

Files make programs persistent. Every program you wrote before this chapter forgot everything the moment it stopped running. File I/O is the mechanism that lets your work survive between sessions — and that makes the difference between a script you run once to see what happens and a tool you actually rely on.

Knowing the file type determines your tool. Text files use open() with .read(), .write(), and iteration. CSV files use the csv module's DictReader and DictWriter. JSON files use json.load() and json.dump(). Using the right tool for each format prevents a class of parsing bugs that would take you hours to diagnose.

pathlib is not optional — it is the right way. String concatenation for file paths was always fragile. pathlib.Path objects build correct paths on any operating system, expose useful metadata, and read like English. Use them everywhere.


Concept-by-Concept Summary

pathlib and File Paths

  • Import with from pathlib import Path
  • Build paths with the / operator: Path("data") / "sales" / "report.csv"
  • Key attributes: .name (filename), .stem (name without extension), .suffix (extension), .parent (containing directory), .resolve() (absolute path)
  • Key methods: .exists(), .is_file(), .is_dir(), .stat() (size, timestamps), .mkdir(parents=True, exist_ok=True), .unlink() (delete), .rename()
  • Find files: .glob("*.csv") (current folder), .rglob("*.csv") (recursive), .iterdir() (all items)
  • Always sort glob() results with sorted() for deterministic ordering
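The attributes and methods above can be sketched together in one short example (the data/sales/report.csv path is hypothetical):

```python
from pathlib import Path

# Build a path with the / operator (correct on any OS)
report = Path("data") / "sales" / "report.csv"

# Inspect the parts of the path
print(report.name)    # report.csv
print(report.stem)    # report
print(report.suffix)  # .csv
print(report.parent)  # data/sales (data\sales on Windows)

# Check the filesystem before acting
if report.exists() and report.is_file():
    size_bytes = report.stat().st_size

# Find every CSV under data/, sorted for deterministic ordering
for csv_path in sorted(Path("data").rglob("*.csv")):
    print(csv_path)
```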

The open() Function

  • Core parameters: file, mode, encoding, newline
  • Common modes: "r" read (default), "w" write (overwrites), "a" append, "x" create-only
  • Always specify encoding="utf-8" — relying on the OS default produces encoding bugs on Windows
  • Always use a with statement — this guarantees the file is closed even if an exception occurs
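Putting those rules together, a minimal round trip looks like this (the notes.txt file name is hypothetical):

```python
from pathlib import Path

notes_path = Path("notes.txt")  # hypothetical file name

# "w" mode creates the file, or overwrites it if it exists; always pass encoding
with open(notes_path, mode="w", encoding="utf-8") as f:
    f.write("first line\n")

# "r" mode reads; the with block closes the file even if an exception occurs
with open(notes_path, mode="r", encoding="utf-8") as f:
    text = f.read()

print(text)  # first line
```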

Reading Text Files

  • .read() — loads entire file as one string; good for small files
  • .readline() — one line at a time; returns "" at end of file
  • .readlines() — all lines as a list; good when you need index access
  • Direct iteration (for line in file_handle) — best for large files; one line per iteration, never loads the whole file into memory
  • Use .strip() to remove trailing \n and other whitespace from lines you read
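Direct iteration plus .strip() is the combination you will reach for most often; a small sketch (the regions.txt file and its contents are hypothetical):

```python
from pathlib import Path

sample = Path("regions.txt")  # hypothetical sample file
sample.write_text("North\nSouth\nEast\n", encoding="utf-8")

# Direct iteration: one line per loop pass, never the whole file in memory
regions = []
with open(sample, mode="r", encoding="utf-8") as f:
    for line in f:
        regions.append(line.strip())  # drop the trailing "\n"

print(regions)  # ['North', 'South', 'East']
```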

Writing Text Files

  • .write(string) — writes a string; does not add newlines automatically
  • .writelines(iterable) — writes each string from an iterable; also does not add newlines
  • Mode "w" truncates (empties) the file on open — all previous content is gone
  • Mode "a" appends to the end; creates the file if it does not exist
  • For critical append logs, open, write one entry, and close each time — closing the file is what guarantees the data is flushed to disk
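A short sketch of both write methods and both modes (the activity.log file name is hypothetical):

```python
from pathlib import Path

log_path = Path("activity.log")  # hypothetical log file

# "w" truncates: after this block the file holds exactly these three lines
with open(log_path, mode="w", encoding="utf-8") as f:
    f.write("run started\n")                 # .write() adds no newline itself
    f.writelines(["step 1\n", "step 2\n"])   # neither does .writelines()

# "a" appends to the end; closing at the end of the with block flushes to disk
with open(log_path, mode="a", encoding="utf-8") as f:
    f.write("run finished\n")

print(log_path.read_text(encoding="utf-8"))
```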

The csv Module

  • Always open CSV files with newline="" — prevents blank-row bugs from newline translation
  • csv.reader — rows as lists; positional access only; brittle when columns change
  • csv.DictReader — rows as dicts keyed by header; recommended for all business use
  • csv.writer — write rows as lists
  • csv.DictWriter — write rows as dicts; requires explicit fieldnames; call writeheader() before writerows()
  • extrasaction="ignore" on DictWriter — silently drops dict keys not in fieldnames
  • CSV always returns strings — convert float(row["revenue"]), int(row["units_sold"]) before arithmetic
  • The most common CSV bug: forgetting type conversion and getting string concatenation instead of addition
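The DictWriter/DictReader round trip, including the type conversion that prevents the concatenation bug, might look like this (the sales.csv file and its rows are hypothetical):

```python
import csv
from pathlib import Path

sales_path = Path("sales.csv")  # hypothetical file

# Write with DictWriter: explicit fieldnames, header first, newline=""
rows = [
    {"rep_name": "Priya", "revenue": "1500"},
    {"rep_name": "Maya", "revenue": "2000"},
]
with open(sales_path, mode="w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["rep_name", "revenue"])
    writer.writeheader()
    writer.writerows(rows)

# Read with DictReader: every value comes back as a string
total = 0.0
with open(sales_path, mode="r", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):
        total += float(row["revenue"])  # convert before arithmetic!

print(total)  # 3500.0
```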

JSON Files

  • json.load(file_handle) — parse a JSON file into a Python dict/list
  • json.dump(data, file_handle, indent=2) — write a Python object to a JSON file
  • json.loads(string) — parse a JSON string (not a file) into a Python object
  • json.dumps(data, indent=2) — serialize a Python object to a JSON string
  • JSON → Python type mapping: {} → dict, [] → list, "string" → str, numbers → int/float, true/false → True/False, null → None
  • Use indent=2 for any JSON that a human might read; omit for compact machine-to-machine output
  • Dates are not a native JSON type — store them as ISO strings ("2024-04-01")
  • JSON is the right choice for: config files, API response caching, structured metadata/summaries
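All four json functions in one sketch (the config.json file and the settings dict are hypothetical):

```python
import json
from pathlib import Path

config_path = Path("config.json")  # hypothetical config file

settings = {
    "output_directory": "reports",
    "run_date": "2024-04-01",   # dates stored as ISO strings
    "verbose": True,
}

# dump/load work with file handles...
with open(config_path, mode="w", encoding="utf-8") as f:
    json.dump(settings, f, indent=2)

with open(config_path, mode="r", encoding="utf-8") as f:
    loaded = json.load(f)

# ...dumps/loads work with strings
text = json.dumps(settings, indent=2)
assert json.loads(text) == loaded == settings
```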

The Patterns You Will Use Again and Again

Pattern 1: Load → Modify → Save

records = load_from_csv(path)     # read everything into memory
records = transform(records)      # modify the in-memory list
save_to_csv(path, records)        # overwrite the file

Simple, predictable, works perfectly for files up to tens of thousands of rows.
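Filled in with this chapter's csv tools, the pattern might look like the following sketch (the reps.csv file, its rows, and the 10% revenue uplift are all hypothetical):

```python
import csv
from pathlib import Path

path = Path("reps.csv")  # hypothetical input/output file
path.write_text("rep_name,revenue\nPriya,1500\nMaya,2000\n", encoding="utf-8")

# Load: read everything into memory
with open(path, mode="r", encoding="utf-8", newline="") as f:
    records = list(csv.DictReader(f))

# Modify: apply a hypothetical 10% revenue uplift to the in-memory list
for record in records:
    record["revenue"] = str(round(float(record["revenue"]) * 1.1, 2))

# Save: overwrite the file with the transformed records
with open(path, mode="w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["rep_name", "revenue"])
    writer.writeheader()
    writer.writerows(records)
```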

Pattern 2: Append Logging

with open(log_path, mode="a", encoding="utf-8") as log_file:
    log_file.write(f"{timestamp}  {message}\n")

Open, write one entry, close. Repeat each time you have something to log. Closing after each write ensures data survives a crash.

Pattern 3: Bulk Processing with glob

for csv_path in sorted(input_dir.glob("*.csv")):
    records = read_one_file(csv_path)
    all_records.extend(records)
write_combined(output_path, all_records)

The sorted() call ensures the output is deterministic regardless of filesystem ordering.

Pattern 4: Config from JSON

with open(config_path, mode="r", encoding="utf-8") as json_file:
    config = json.load(json_file)
output_dir = Path(config["output_directory"])

Never hardcode settings in source code. Read them from a JSON or INI file at startup.

Pattern 5: Validate Before Processing

required = {"rep_id", "rep_name", "revenue", "quota"}
actual   = set(reader.fieldnames or [])
missing  = required - actual
if missing:
    raise ValueError(f"Missing columns: {missing}")

Check that required columns exist before reading any data. Fail loudly at the source rather than confusingly in the middle of processing.


Common Mistakes and How to Avoid Them

Mistake → Symptom → Fix

  • Forgetting the with statement → file left open; data may not be written to disk → always use with open(...) as f:
  • Omitting encoding="utf-8" → UnicodeDecodeError or garbled text on Windows → add encoding="utf-8" to every open() call
  • Omitting newline="" for CSV → blank rows between every data row → add newline="" to open() when using the csv module
  • Not converting CSV string types → "1500" + "2000" gives "15002000" instead of 3500 → always call float() or int() on numeric CSV fields
  • Opening in "w" mode accidentally → an existing file is wiped out → double-check your mode; use "a" when you want to append
  • Using string concatenation for paths → path breaks on a different OS → use pathlib.Path and the / operator
  • Calling .mkdir() without exist_ok=True → crashes if the directory already exists → always use mkdir(parents=True, exist_ok=True)
  • Forgetting writeheader() with DictWriter → CSV has no column names → call writer.writeheader() before writer.writerows()
  • Dividing by a field that might be zero → ZeroDivisionError → check if field != 0: before dividing
  • Storing computed values in a CSV → stale values, sync bugs → compute derived values at read time, not write time

The Professional Mindset

Separate your concerns. Your data-reading functions should not know anything about the business logic. Your business logic functions should not know anything about file formats. This makes each piece independently testable and reusable.

Write metadata. Whenever a script processes files, write a JSON metadata file recording what it did: which files it processed, how many records it found, whether any errors occurred, and when it ran. This costs you five minutes to add and saves you hours of debugging six months later.
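A metadata file of this kind is a small json.dump away; one possible shape (the field names and values here are purely illustrative, not a prescribed schema):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical run summary for a script that consolidated two region files
metadata = {
    "processed_files": ["north.csv", "south.csv"],
    "record_count": 842,
    "errors": [],
    "run_at": datetime.now(timezone.utc).isoformat(),
}

meta_path = Path("run_metadata.json")  # hypothetical output name
with open(meta_path, mode="w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```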

Validate at the boundary. The moment data enters your program from a file, validate it: check that required columns exist, that numeric fields can actually be converted, that dates are in an expected format. Errors caught at the boundary produce clear messages. Errors that slip through produce mysterious failures later.

Make scripts idempotent. A script is idempotent if running it twice produces the same result as running it once. For file-writing scripts, this usually means using "w" mode (not "a") for the final output, and using exist_ok=True when creating directories. Idempotent scripts are safe to re-run after a failure.


What These Skills Unlock

The techniques in this chapter are the foundation of a very large category of real business automation:

  • ETL pipelines: Extract data from source files, Transform it, Load it somewhere useful
  • Report automation: Read raw data, compute summaries, write formatted reports on a schedule
  • Data auditing: Compare old and new files, flag differences, track changes over time
  • Config-driven tools: Scripts that non-programmers can customize by editing a JSON file
  • Batch processing: Process a folder of files automatically without touching each one manually

Priya's Monday consolidation went from an hour of copy-pasting to a two-second script run. Maya's project tracking went from a confusing multi-tab spreadsheet to a clean CSV with automated reports. These are not toy examples — they are the kind of work that genuinely changes how a business day feels.