Chapter 9 Key Takeaways: File I/O — Reading and Writing Business Data
The Big Ideas
Files make programs persistent. Every program you wrote before this chapter forgot everything the moment it stopped running. File I/O is the mechanism that lets your work survive between sessions — and that makes the difference between a script you run once to see what happens and a tool you actually rely on.
The file format determines your tool. Text files use open() with .read(), .write(), and iteration. CSV files use the csv module's DictReader and DictWriter. JSON files use json.load() and json.dump(). Using the right tool for each format prevents a whole class of parsing bugs that would take hours to diagnose.
pathlib is not optional — it is the right way. String concatenation for file paths was always fragile. pathlib.Path objects build correct paths on any operating system, expose useful metadata, and read like English. Use them everywhere.
Concept-by-Concept Summary
pathlib and File Paths
- Import with `from pathlib import Path`
- Build paths with the `/` operator: `Path("data") / "sales" / "report.csv"`
- Key attributes: `.name` (filename), `.stem` (name without extension), `.suffix` (extension), `.parent` (containing directory)
- Key methods: `.resolve()` (absolute path), `.exists()`, `.is_file()`, `.is_dir()`, `.stat()` (size, timestamps), `.mkdir(parents=True, exist_ok=True)`, `.unlink()` (delete), `.rename()`
- Find files: `.glob("*.csv")` (current folder), `.rglob("*.csv")` (recursive), `.iterdir()` (all items)
- Always sort `glob()` results with `sorted()` for deterministic ordering
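The bullets above can be condensed into a short sketch; the `data/sales` layout and the scratch folder are hypothetical, used only so the example is self-contained:

```python
from pathlib import Path
import tempfile

base = Path(tempfile.mkdtemp())                    # hypothetical scratch folder
report = base / "data" / "sales" / "report.csv"    # build a path with /

print(report.name)    # report.csv
print(report.stem)    # report
print(report.suffix)  # .csv

# parents=True creates missing intermediate folders;
# exist_ok=True avoids a crash if the folder is already there.
report.parent.mkdir(parents=True, exist_ok=True)
report.write_text("rep_id,revenue\n", encoding="utf-8")

csv_files = sorted(report.parent.glob("*.csv"))    # sorted() => deterministic order
print([p.name for p in csv_files])                 # ['report.csv']
```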
The open() Function
- Core parameters: `file`, `mode`, `encoding`, `newline`
- Common modes: `"r"` read (default), `"w"` write (overwrites), `"a"` append, `"x"` create-only
- Always specify `encoding="utf-8"` — relying on the OS default produces encoding bugs on Windows
- Always use a `with` statement — this guarantees the file is closed even if an exception occurs
Reading Text Files
- `.read()` — loads the entire file as one string; good for small files
- `.readline()` — one line at a time; returns `""` at end of file
- `.readlines()` — all lines as a list; good when you need index access
- Direct iteration (`for line in file_handle`) — best for large files; one line per iteration, never loads the whole file into memory
- Use `.strip()` to remove the trailing `\n` and other whitespace from lines you read
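A minimal sketch of the recommended iteration approach; the sample file is created inline so the example is self-contained:

```python
from pathlib import Path

path = Path("notes.txt")  # hypothetical file, created here for the demo
path.write_text("first line\nsecond line\n", encoding="utf-8")

lines = []
with open(path, mode="r", encoding="utf-8") as text_file:
    for raw_line in text_file:          # one line per iteration; memory-friendly
        lines.append(raw_line.strip())  # drop the trailing "\n"

print(lines)  # ['first line', 'second line']
```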
Writing Text Files
- `.write(string)` — writes a string; does not add newlines automatically
- `.writelines(iterable)` — writes each string from an iterable; also does not add newlines
- Mode `"w"` truncates (empties) the file on open — all previous content is gone
- Mode `"a"` appends to the end; creates the file if it does not exist
- For critical append logs, close the file after each individual write to ensure data is flushed to disk
The csv Module
- Always open CSV files with `newline=""` — prevents blank-row bugs from newline translation
- `csv.reader` — rows as lists; positional access only; brittle when columns change
- `csv.DictReader` — rows as dicts keyed by header; recommended for all business use
- `csv.writer` — write rows as lists
- `csv.DictWriter` — write rows as dicts; requires explicit `fieldnames`; call `writeheader()` before `writerows()`
- `extrasaction="ignore"` on `DictWriter` — silently drops dict keys not in `fieldnames`
- CSV always returns strings — convert with `float(row["revenue"])` or `int(row["units_sold"])` before arithmetic
- The most common CSV bug: forgetting type conversion and getting string concatenation instead of addition
JSON Files
- `json.load(file_handle)` — parse a JSON file into a Python dict/list
- `json.dump(data, file_handle, indent=2)` — write a Python object to a JSON file
- `json.loads(string)` — parse a JSON string (not a file) into a Python object
- `json.dumps(data, indent=2)` — serialize a Python object to a JSON string
- JSON → Python type mapping: `{}` → `dict`, `[]` → `list`, `"string"` → `str`, numbers → `int`/`float`, `true`/`false` → `True`/`False`, `null` → `None`
- Use `indent=2` for any JSON that a human might read; omit it for compact machine-to-machine output
- Dates are not a native JSON type — store them as ISO strings (`"2024-04-01"`)
- JSON is the right choice for config files, API response caching, and structured metadata/summaries
The Patterns You Will Use Again and Again
Pattern 1: Load → Modify → Save
```python
records = load_from_csv(path)   # read everything into memory
records = transform(records)    # modify the in-memory list
save_to_csv(path, records)      # overwrite the file
```
Simple, predictable, works perfectly for files up to tens of thousands of rows.
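One way to flesh out the helpers named in the pattern; `load_from_csv` and `save_to_csv` come from the sketch above, while the doubling transform and the `reps.csv` file are hypothetical:

```python
import csv
from pathlib import Path

def load_from_csv(path):
    """Read every row into memory as a list of dicts."""
    with open(path, mode="r", encoding="utf-8", newline="") as csv_file:
        return list(csv.DictReader(csv_file))

def save_to_csv(path, records):
    """Overwrite the file with the in-memory records."""
    with open(path, mode="w", encoding="utf-8", newline="") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=list(records[0]))
        writer.writeheader()
        writer.writerows(records)

# Demo: load, apply a trivial transform (doubling revenue), save.
path = Path("reps.csv")
path.write_text("rep_name,revenue\nPriya,1500\n", encoding="utf-8")

records = load_from_csv(path)
for record in records:
    record["revenue"] = str(float(record["revenue"]) * 2)
save_to_csv(path, records)
```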
Pattern 2: Append Logging
```python
with open(log_path, mode="a", encoding="utf-8") as log_file:
    log_file.write(f"{timestamp} {message}\n")
```
Open, write one entry, close. Repeat each time you have something to log. Closing after each write ensures data survives a crash.
Pattern 3: Bulk Processing with glob
```python
for csv_path in sorted(input_dir.glob("*.csv")):
    records = read_one_file(csv_path)
    all_records.extend(records)
write_combined(output_path, all_records)
```
The sorted() call ensures the output is deterministic regardless of filesystem ordering.
Pattern 4: Config from JSON
```python
with open(config_path, mode="r", encoding="utf-8") as json_file:
    config = json.load(json_file)
output_dir = Path(config["output_directory"])
```
Never hardcode settings in source code. Read them from a JSON or INI file at startup.
Pattern 5: Validate Before Processing
```python
required = {"rep_id", "rep_name", "revenue", "quota"}
actual = set(reader.fieldnames or [])
missing = required - actual
if missing:
    raise ValueError(f"Missing columns: {missing}")
```
Check that required columns exist before reading any data. Fail loudly at the source rather than confusingly in the middle of processing.
Common Mistakes and How to Avoid Them
| Mistake | Symptom | Fix |
|---|---|---|
| Forgetting the `with` statement | File left open; data may not be written to disk | Always use `with open(...) as f:` |
| Omitting `encoding="utf-8"` | `UnicodeDecodeError` or garbled text on Windows | Add `encoding="utf-8"` to every `open()` call |
| Omitting `newline=""` for CSV | Blank rows between every data row | Add `newline=""` to `open()` when using the `csv` module |
| Not converting CSV string types | `"1500" + "2000"` gives `"15002000"` instead of `3500` | Always call `float()` or `int()` on numeric CSV fields |
| Opening in `"w"` mode accidentally | Wipes out an existing file | Double-check your mode; use `"a"` when you want to append |
| Using string concatenation for paths | Path breaks on a different OS | Use `pathlib.Path` and the `/` operator |
| Calling `.mkdir()` without `exist_ok=True` | Crashes if the directory already exists | Always use `mkdir(parents=True, exist_ok=True)` |
| Forgetting `writeheader()` with `DictWriter` | CSV has no column names | Call `writer.writeheader()` before `writer.writerows()` |
| Dividing by a field that might be zero | `ZeroDivisionError` | Check `if field != 0:` before dividing |
| Storing computed values in a CSV | Stale values, sync bugs | Compute derived values at read time, not write time |
The Professional Mindset
Separate your concerns. Your data-reading functions should not know anything about the business logic. Your business logic functions should not know anything about file formats. This makes each piece independently testable and reusable.
Write metadata. Whenever a script processes files, write a JSON metadata file recording what it did: which files it processed, how many records it found, whether any errors occurred, and when it ran. This costs you five minutes to add and saves you hours of debugging six months later.
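A sketch of that habit; every key and value here is a hypothetical example of what a run might record:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical results gathered while the script ran.
metadata = {
    "ran_at": datetime.now(timezone.utc).isoformat(),  # ISO string: JSON-safe
    "files_processed": ["north.csv", "south.csv"],
    "record_count": 412,
    "errors": [],
}

with open(Path("run_metadata.json"), mode="w", encoding="utf-8") as json_file:
    json.dump(metadata, json_file, indent=2)  # indent=2: humans will read this
```

Six months later, this file answers "what did the script actually do that day?" without any debugging.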
Validate at the boundary. The moment data enters your program from a file, validate it: check that required columns exist, that numeric fields can actually be converted, that dates are in an expected format. Errors caught at the boundary produce clear messages. Errors that slip through produce mysterious failures later.
Make scripts idempotent. A script is idempotent if running it twice produces the same result as running it once. For file-writing scripts, this usually means using "w" mode (not "a") for the final output, and using exist_ok=True when creating directories. Idempotent scripts are safe to re-run after a failure.
What These Skills Unlock
The techniques in this chapter are the foundation of a very large category of real business automation:
- ETL pipelines: Extract data from source files, Transform it, Load it somewhere useful
- Report automation: Read raw data, compute summaries, write formatted reports on a schedule
- Data auditing: Compare old and new files, flag differences, track changes over time
- Config-driven tools: Scripts that non-programmers can customize by editing a JSON file
- Batch processing: Process a folder of files automatically without touching each one manually
Priya's Monday consolidation went from an hour of copy-pasting to a two-second script run. Maya's project tracking went from a confusing multi-tab spreadsheet to a clean CSV with automated reports. These are not toy examples — they are the kind of work that genuinely changes how a business day feels.