
Learning Objectives

  • Open, read, and write text files using context managers (the with statement)
  • Process files line by line for memory-efficient handling of large files
  • Work with file paths using pathlib.Path for cross-platform compatibility
  • Read and write structured data in CSV and JSON formats
  • Handle common file I/O errors gracefully

Chapter 10: File Input and Output: Persistent Data

"The file system is the most democratic database — every programming language, every operating system, every era of computing agrees on its fundamental idea: bytes in a file." — Adapted from Rob Pike

Chapter Overview

Every program you've written so far has a fatal flaw: it forgets everything the moment it stops running. Your grade calculator computed a perfect average, displayed it on screen, and then — poof — the data vanished. Your TaskFlow task list? Gone as soon as you pressed Ctrl+C. All that work, evaporated.

In the real world, this is a non-starter. Your phone's contacts persist when you restart it. Spreadsheets survive power outages. Web applications remember your login across sessions. How? They write data to files — or, at scale, to databases, which are themselves just sophisticated file systems under the hood.

This chapter is about making your programs remember. You'll learn to read data from files, write results to disk, and work with the two most common structured data formats in the industry: CSV and JSON. By the end of this chapter, your programs will outlive their own execution — and that's a fundamental shift in what your code can do.

In this chapter, you will learn to:

  • Open files for reading, writing, and appending using open() and context managers
  • Process files efficiently, line by line, without loading everything into memory
  • Use pathlib.Path for cross-platform file path handling
  • Read and write CSV files using the csv module
  • Read and write JSON files using the json module
  • Diagnose and fix common file I/O errors

🏃 Fast Track: If you're comfortable with basic file reading/writing and want to jump to structured data formats, skim sections 10.1-10.5 and start at section 10.7 (CSV) or 10.8 (JSON).

🔬 Deep Dive: After this chapter, read Case Study 02 for a comparison of data formats used in real-world applications, and the Further Reading for links to the Python documentation on io, csv, and json.


10.1 Why File I/O Matters

🚪 Threshold Concept: Persistence

Until now, every variable you've created lives only as long as your program is running. When the program ends, Python reclaims that memory, and the data is gone forever. Persistence is the idea that programs can outlive their execution by writing data to disk. This is the difference between a calculator and a spreadsheet — the spreadsheet remembers. Once you understand persistence, you start thinking about programs differently: every application becomes a conversation between the running code and the data it stores.

Think about the programs you use every day. Your text editor saves documents. Your music app remembers your playlists. Your web browser keeps your bookmarks across updates, crashes, and new laptop setups. All of these programs read data from files when they start and write data to files when things change.

File I/O (input/output) is the bridge between your program's temporary memory (RAM) and permanent storage (your hard drive or SSD). Here's the mental model:

The File I/O Lifecycle:

Program starts → Open file → Read data into variables → Process data
    → Write results to file → Close file → Program ends
                                              ↓
                              Data survives on disk
                                              ↓
                              Next time program starts → Read saved data → Continue

This lifecycle is at the heart of virtually every useful application. The pattern has three phases:

  1. Open the file (establish a connection between your program and a file on disk)
  2. Read or write data through that connection
  3. Close the file (release the connection so the operating system can clean up)
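A minimal sketch of the three phases, using a hypothetical notes.txt (the with statement in section 10.3 will improve on these manual close() calls):

```python
f = open("notes.txt", "w")    # 1. open
f.write("remember me\n")      # 2. write
f.close()                     # 3. close

f = open("notes.txt", "r")    # a later run: open again
saved = f.read()              # the data survived on disk
f.close()
print(saved)
```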

🧩 Productive Struggle

Before reading further, think about this: You built a grade calculator in earlier chapters. It computes averages, letter grades, maybe even handles weighted categories. But every time you close the program, all the student data disappears. A student asks: "Can I save my grades and load them next week?" How would you solve this? What would you need to store? Where would you put it? Jot down your ideas before reading on — you'll be surprised how close your intuition gets.

Why Not Just Use a Database?

Fair question. Databases (like SQLite, PostgreSQL, or MongoDB) are powerful tools for storing data, and you'll encounter them in later courses. But they're overkill for many tasks, and they all sit on top of file I/O at the lowest level. Understanding files first means:

  • You understand what databases are actually doing under the hood.
  • You can work with data formats (CSV, JSON, log files) that don't require a database.
  • You can prototype quickly — write to a file now, switch to a database later when your needs grow.

💡 Intuition: File I/O is like paper. Everyone can read it, everyone can write on it, and it doesn't require electricity to store. Databases are like filing cabinets with locks, indexes, and a librarian — more powerful, but sometimes you just need to jot something on a Post-it.


10.2 Opening and Reading Files

The built-in open() function is your gateway to the file system. It creates a file object — a Python object that represents a connection to a file on disk.

The Basics of open()

file = open("myfile.txt", "r")   # open for reading
content = file.read()             # read the entire file
file.close()                      # ALWAYS close when done

The first argument is the filename (or path). The second argument is the mode — what you intend to do with the file:

Mode   Meaning            Creates file if missing?     Overwrites existing?
"r"    Read (default)     No (raises error)            N/A
"w"    Write              Yes                          Yes — erases all content
"a"    Append             Yes                          No (adds to end)
"x"    Exclusive create   Yes (but errors if exists)   N/A

⚠️ Pitfall: The "w" mode is dangerous. If the file exists, opening it in write mode immediately erases all its content — before you've written a single byte. There is no undo. This is the file I/O equivalent of rm — it doesn't ask for confirmation.
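When you want to create a file only if it doesn't already exist, the "x" mode from the table refuses to clobber anything. A small sketch (the unlink() call just resets the demo so it can be rerun):

```python
from pathlib import Path

target = Path("results.txt")
target.unlink(missing_ok=True)   # start clean for the demo

try:
    with open(target, "x") as f:     # "x" fails if the file exists
        f.write("fresh results\n")
    print("created results.txt")
except FileExistsError:
    print("results.txt already exists; not overwriting.")
```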

Reading Methods

Once you have a file object open in read mode, you have three ways to get data out of it:

read() — everything at once:

with open("zen.txt", "r") as f:
    contents = f.read()   # one big string
print(len(contents))      # number of characters

This loads the entire file into a single string. Simple and convenient for small files, but dangerous for large ones — a 2 GB log file would try to consume 2 GB of RAM.

readline() — one line at a time:

with open("zen.txt", "r") as f:
    first_line = f.readline()    # "Beautiful is better than ugly.\n"
    second_line = f.readline()   # "Explicit is better than implicit.\n"

Each call reads up to and including the next newline character (\n). When you reach the end of the file, readline() returns an empty string "".
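That empty-string sentinel makes an explicit read loop possible. A sketch that first writes a small zen.txt so it can run on its own:

```python
# Create a small zen.txt so the demo is self-contained
with open("zen.txt", "w") as f:
    f.write("Beautiful is better than ugly.\n")
    f.write("Explicit is better than implicit.\n")

# Read every line with readline(), stopping at the "" sentinel
with open("zen.txt", "r") as f:
    while True:
        line = f.readline()
        if line == "":        # empty string means end of file
            break
        print(line.strip())
```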

readlines() — all lines as a list:

with open("zen.txt", "r") as f:
    lines = f.readlines()   # ["Beautiful is...\n", "Explicit is...\n", ...]
print(len(lines))           # number of lines

This gives you a list of strings, one per line, each ending with \n. It loads the entire file into memory, so the same size warning as read() applies.

🔗 Connection (Ch 8 — Lists): readlines() returns a list of strings. Every list operation you learned — indexing, slicing, iterating, list comprehensions — works on the result. lines[0] gives you the first line. lines[-1] gives you the last. [line.strip() for line in lines] removes all trailing newlines.

Dr. Patel's FASTA Files

Dr. Anika Patel processes DNA sequence files in a format called FASTA. Each sequence starts with a header line beginning with >, followed by the sequence data on subsequent lines:

# Reading a simplified FASTA file
sequences = {}
current_name = ""

with open("sequences.fasta", "r") as f:
    for line in f:
        line = line.strip()
        if line.startswith(">"):
            current_name = line[1:]   # remove the '>'
            sequences[current_name] = ""
        else:
            sequences[current_name] += line

# Now sequences is a dict: {"Gene_ABC": "ATCGATCG...", ...}
for name, seq in sequences.items():
    print(f"{name}: {len(seq)} nucleotides")

This pattern — reading a file and building a dictionary — shows how file I/O connects directly to the data structures you learned in Chapters 8 and 9.

🔄 Check Your Understanding

  1. What's the difference between read() and readlines()?
  2. If you call f.read() twice on the same file object (without reopening), what does the second call return? Why?
  3. Why does readline() include the \n character at the end?

Verify

  1. read() returns the entire file as a single string. readlines() returns a list of strings, one per line.
  2. The second call returns an empty string "". The file object maintains a cursor (position pointer) that advances as you read. After read() reaches the end, the cursor is at the end — there's nothing left to read.
  3. Because \n is a character in the file, just like any letter. Python faithfully reports what's there. Use .strip() to remove it if you don't want it.
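You can watch the cursor behavior from answer 2 directly. The seek(0) call, which rewinds the cursor to the start, isn't covered further in this chapter but makes the demo clear:

```python
# Create a small file so the demo is self-contained
with open("zen.txt", "w") as f:
    f.write("Beautiful is better than ugly.\n")

with open("zen.txt", "r") as f:
    first = f.read()
    second = f.read()     # cursor is already at the end
    f.seek(0)             # rewind the cursor to the beginning
    again = f.read()

print(second == "")       # True
print(again == first)     # True
```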

10.3 Context Managers: The with Statement

You may have noticed that every example above uses with open(...) as f:. This is a context manager, and it's the single most important pattern in file I/O.

The Problem: Forgetting to Close

Without with, you must close the file manually:

f = open("data.txt", "r")
content = f.read()
f.close()   # easy to forget!

This looks fine, but what if an error occurs between open() and close()? The file stays open, and the operating system keeps holding onto it. Open too many files without closing them and your program (or your entire system) can run out of file handles and crash.

# This is BROKEN — if process() raises an error, f.close() never runs
f = open("data.txt", "r")
content = f.read()
result = process(content)   # What if this crashes?
f.close()                   # This line never executes

The Solution: with Guarantees Cleanup

The with statement guarantees that the file is closed when the indented block ends — whether it ends normally or because of an exception:

with open("data.txt", "r") as f:
    content = f.read()
    result = process(content)   # Even if this crashes...
# ...the file is closed here, guaranteed.

Best Practice: Always use with for file operations. There is no good reason to use open() without with in modern Python. If you see code that opens a file without with, that's a code smell — it's not necessarily broken, but it's fragile.

How with Works (Briefly)

A context manager is any object that defines __enter__ and __exit__ methods. When Python enters the with block, it calls __enter__ (which returns the file object). When the block ends — for any reason — Python calls __exit__ (which closes the file). You don't need to understand these methods yet; you just need to use with.

# What with open(...) as f: actually does (conceptually):
# 1. Call open("data.txt", "r") → returns a file object
# 2. Call file_object.__enter__() → returns self (assigned to f)
# 3. Execute the indented block
# 4. Call file_object.__exit__() → closes the file (always, even on error)

🔄 Spaced Review (Ch 6 — Functions): The with block creates a scope for file operations, similar to how functions create a scope for local variables. The file object f is accessible inside the block and technically exists after the block, but the file is closed — you can't read from or write to it anymore. Think of with as a function that automatically cleans up after itself.
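You can verify the guarantee yourself through the file object's closed attribute (a tiny demo that writes its own data.txt first):

```python
# Create a file so the demo is self-contained
with open("data.txt", "w") as f:
    f.write("hello\n")

with open("data.txt", "r") as f:
    content = f.read()

# The variable f survives the block, but the file behind it is closed
print(f.closed)   # True
```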


10.4 Writing and Appending to Files

Reading data is only half the story. To achieve persistence, your programs need to write data back to disk.

Write Mode ("w"): Start Fresh

with open("output.txt", "w") as f:
    f.write("Line one\n")
    f.write("Line two\n")
    f.write("Line three\n")

The write() method takes a string and writes it to the file. Unlike print(), it does not add a newline — you must include \n yourself.

# write() vs. print() comparison
with open("comparison.txt", "w") as f:
    f.write("write: no newline added")
    f.write("write: this is on the same line!")
    print("print: newline added automatically", file=f)
    print("print: this is on a new line", file=f)

The file would contain:

write: no newline addedwrite: this is on the same line!print: newline added automatically
print: this is on a new line

⚠️ Pitfall: Opening a file in "w" mode truncates (empties) the file immediately — even if you never call write(). This code destroys the file's content:

with open("important_data.txt", "w") as f:
    pass   # Oops — file is now empty, even though we wrote nothing

Append Mode ("a"): Add to the End

Append mode opens the file for writing but positions the cursor at the end. Existing content is preserved:

# First, create a log file
with open("app.log", "w") as f:
    f.write("=== Application Log ===\n")

# Later, append entries
with open("app.log", "a") as f:
    f.write("2025-01-15 09:00: App started\n")

# Even later, append more
with open("app.log", "a") as f:
    f.write("2025-01-15 09:05: User logged in\n")

The file now contains all three lines. This is the pattern used for log files — you keep appending entries and the history accumulates.

writelines(): Write a List of Strings

The writelines() method writes each string in a list to the file. Like write(), it does not add newlines:

lines = ["Alice: 92\n", "Bob: 87\n", "Carol: 95\n"]

with open("grades.txt", "w") as f:
    f.writelines(lines)

💡 Intuition: Think of write() and writelines() as the opposites of read() and readlines(). The read/write pair works with a single string; the readlines/writelines pair works with a list of strings.
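A quick round-trip confirms the symmetry, using a throwaway pairs.txt:

```python
lines_out = ["alpha\n", "beta\n"]

with open("pairs.txt", "w") as f:
    f.writelines(lines_out)      # write the list, newlines included

with open("pairs.txt", "r") as f:
    lines_in = f.readlines()     # read it back as a list

print(lines_in == lines_out)     # True
```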

Grade Calculator: Writing a Report

Let's extend the grade calculator to produce an output file:

def write_grade_report(students: list[dict], filename: str) -> None:
    """Write a formatted grade report to a text file.

    Each student dict has 'name' and 'scores' keys.
    """
    with open(filename, "w") as f:
        f.write("Grade Report\n")
        f.write("=" * 40 + "\n\n")

        for student in students:
            avg = sum(student["scores"]) / len(student["scores"])
            f.write(f"{student['name']:<20} Average: {avg:.1f}\n")

        overall = sum(
            sum(s["scores"]) / len(s["scores"]) for s in students
        ) / len(students)
        f.write(f"\n{'Class average:':<20} {overall:.1f}\n")

    print(f"Report written to {filename}")


# Usage
students = [
    {"name": "Alice", "scores": [92, 88, 95]},
    {"name": "Bob", "scores": [85, 79, 88]},
    {"name": "Carol", "scores": [97, 93, 96]},
]
write_grade_report(students, "report.txt")

Output in report.txt:

Grade Report
========================================

Alice                Average: 91.7
Bob                  Average: 84.0
Carol                Average: 95.3

Class average:       90.3

🔄 Spaced Review (Ch 6 — Functions): Notice how write_grade_report() is a well-designed function — it takes data and a filename as parameters, does one job, and uses a context manager internally. The caller doesn't need to worry about file handles or closing. This is the power of combining functions with file I/O.


10.5 Processing Files Line by Line

When you're working with large files — log files, datasets, genome sequences — loading the entire file into memory with read() or readlines() is a bad idea. A 10 GB log file would consume 10 GB of RAM.

The solution is to iterate over the file object directly. Python reads one line at a time, keeping memory usage constant regardless of file size:

# Memory-efficient: only one line is in memory at a time
total = 0
count = 0

with open("huge_dataset.txt", "r") as f:
    for line in f:
        value = float(line.strip())
        total += value
        count += 1

print(f"Average: {total / count:.2f}")

This pattern works for files of any size — 1 KB or 100 GB. Each iteration, Python reads one line from disk, you process it, and then that line's memory can be reclaimed.

Common Line-by-Line Patterns

Counting specific lines:

# Count lines containing the word "ERROR"
error_count = 0
with open("server.log", "r") as f:
    for line in f:
        if "ERROR" in line:
            error_count += 1
print(f"Found {error_count} errors")

Building a list from a file:

# Read a file of names into a list (stripping whitespace)
with open("names.txt", "r") as f:
    names = [line.strip() for line in f if line.strip()]

Processing and writing simultaneously:

# Elena's pattern: read one file, write results to another
with open("raw_data.txt", "r") as infile, \
     open("processed.txt", "w") as outfile:
    for line in infile:
        cleaned = line.strip().upper()
        if cleaned:   # skip blank lines
            outfile.write(cleaned + "\n")

💡 Intuition: You can open multiple files in a single with statement by separating them with commas (or using a backslash \ for line continuation). Both files are guaranteed to be closed when the block ends.
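On Python 3.10 and later you can also group the context managers in parentheses instead of using a backslash. This sketch creates its own raw_data.txt so it runs standalone:

```python
# Create sample input so the demo is self-contained
with open("raw_data.txt", "w") as f:
    f.write("alpha\n\n  beta  \n")

# Parenthesized form of multiple context managers (Python 3.10+)
with (
    open("raw_data.txt", "r") as infile,
    open("processed.txt", "w") as outfile,
):
    for line in infile:
        cleaned = line.strip().upper()
        if cleaned:                      # skip blank lines
            outfile.write(cleaned + "\n")
```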

Elena's Report: Processing Monthly Data

Elena Vasquez at the nonprofit receives a plain-text report each month where each line contains a donor name and amount separated by a pipe character:

def process_donations(input_file: str, output_file: str) -> None:
    """Read raw donation data, compute stats, write summary."""
    total = 0.0
    count = 0
    largest_donor = ""
    largest_amount = 0.0

    with open(input_file, "r") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):   # skip blanks/comments
                continue
            name, amount_str = line.split("|")
            amount = float(amount_str.strip())
            total += amount
            count += 1
            if amount > largest_amount:
                largest_amount = amount
                largest_donor = name.strip()

    with open(output_file, "w") as f:
        f.write(f"Donations Summary\n")
        f.write(f"Total donations:  ${total:,.2f}\n")
        f.write(f"Number of donors: {count}\n")
        f.write(f"Average donation: ${total / count:,.2f}\n")
        f.write(f"Largest donor:    {largest_donor} (${largest_amount:,.2f})\n")

    print(f"Summary written to {output_file}")

🔄 Spaced Review (Ch 8 — Lists): The list comprehension [line.strip() for line in f if line.strip()] combines file iteration (Chapter 10) with list comprehension filtering (Chapter 8). Every line of the file is stripped of whitespace, and blank lines are excluded — all in one expression.


10.6 Working with Paths: pathlib

So far, we've used simple filenames like "data.txt". That works when the file is in the same directory as your script, but real programs need to work with files in other directories, on different operating systems, and in locations that might not exist yet.

The pathlib module provides the Path class — an object-oriented way to work with file system paths that works correctly on Windows, macOS, and Linux.

Creating Path Objects

from pathlib import Path

# Simple filename
p = Path("data.txt")

# Subdirectory path
p = Path("reports") / "2025" / "january.csv"
print(p)   # reports/2025/january.csv (or reports\2025\january.csv on Windows)

# Home directory
home = Path.home()
print(home)   # /Users/yourname (macOS) or C:\Users\yourname (Windows)

# Current working directory
cwd = Path.cwd()
print(cwd)   # wherever your script is running from

The / operator on Path objects joins path components. This is cleaner and more reliable than string concatenation, because Path handles the right separator for your operating system automatically.

Useful Path Operations

from pathlib import Path

p = Path("reports") / "2025" / "january.csv"

# Components
print(p.name)      # "january.csv"       — filename with extension
print(p.stem)      # "january"           — filename without extension
print(p.suffix)    # ".csv"              — file extension
print(p.parent)    # reports/2025        — directory containing the file

# Checking existence
print(p.exists())     # True or False
print(p.is_file())    # True if it's a file (not a directory)
print(p.is_dir())     # True if it's a directory

# Creating directories
output_dir = Path("output") / "processed"
output_dir.mkdir(parents=True, exist_ok=True)
# parents=True:   create intermediate directories if needed
# exist_ok=True:  don't error if the directory already exists

Path Objects Work with open()

Path objects work seamlessly with open():

from pathlib import Path

data_dir = Path("data")
report_path = data_dir / "monthly_report.csv"

# Both of these work:
with open(report_path, "r") as f:         # open() accepts Path objects
    content = f.read()

content = report_path.read_text()          # Path has its own read method

Path objects also have convenience methods for quick reads and writes:

from pathlib import Path

p = Path("quick.txt")

# Quick write (no need for open/with)
p.write_text("Hello, pathlib!\n")

# Quick read
content = p.read_text()
print(content)   # "Hello, pathlib!\n"

Best Practice: Use pathlib.Path for all file path manipulation. String concatenation with / or \\ is fragile and platform-dependent. Path("data") / "file.csv" works on every operating system. "data/" + "file.csv" might not work on Windows.

Finding Your Script's Directory

A common pattern is to locate files relative to your script, not relative to where the user happens to run it:

from pathlib import Path

# Directory containing the current script
SCRIPT_DIR = Path(__file__).parent

# Data file in the same directory as the script
data_path = SCRIPT_DIR / "sample-data.csv"

# Data file in a sibling directory
config_path = SCRIPT_DIR.parent / "config" / "settings.json"

This pattern is critical for distributing programs. Without it, your code breaks if someone runs it from a different directory.

🔄 Check Your Understanding

  1. What does Path("reports") / "q1" / "summary.csv" produce on macOS vs. Windows?
  2. Why is Path.mkdir(parents=True, exist_ok=True) safer than just Path.mkdir()?
  3. What does Path(__file__).parent give you, and why is it useful?

Verify

  1. On macOS/Linux: reports/q1/summary.csv. On Windows: reports\q1\summary.csv. The Path class uses the correct separator automatically.
  2. parents=True creates intermediate directories (like reports/q1/) that don't exist yet. exist_ok=True avoids a FileExistsError if the directory already exists. Without these flags, either condition raises an exception.
  3. It gives the directory containing the currently running script. It's useful because it lets you locate data files relative to your code, regardless of where the user runs the script from.

10.7 Reading and Writing CSV

CSV (Comma-Separated Values) is the most common format for tabular data — anything that looks like a spreadsheet. Every spreadsheet application can export CSV, every data tool can import it, and every programming language has CSV support.

A CSV file is just a text file where each line is a row and values are separated by commas:

name,department,hours_worked,hourly_rate
Elena Vasquez,Programs,42,28.50
Marcus Chen,Development,38,32.00

You could parse this yourself with line.split(","), but don't. Real CSV files have edge cases that will bite you: values containing commas, quoted strings, embedded newlines, different delimiters. The csv module handles all of this correctly.
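To see why, compare a naive split against the csv module on a value that contains a comma (the name "Doe, Jane" is made up for the demo):

```python
import csv
import io

line = '"Doe, Jane",Programs,40\n'

naive = line.strip().split(",")               # splits inside the quoted name
proper = next(csv.reader(io.StringIO(line)))  # respects the quotes

print(naive)    # four broken fields
print(proper)   # ['Doe, Jane', 'Programs', '40']
```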

Reading with csv.reader

import csv

with open("employees.csv", "r", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)   # grab the header row
    print(f"Columns: {header}")

    for row in reader:
        name, dept, hours, rate = row   # each row is a list of strings
        pay = float(hours) * float(rate)
        print(f"  {name}: ${pay:.2f}")

⚠️ Pitfall: Always pass newline="" when opening CSV files. Without it, Python's universal newline handling can interfere with the csv module's own newline parsing, leading to blank rows or corrupted data on some platforms.

DictReader maps each row to a dictionary using the header row as keys:

import csv

with open("employees.csv", "r", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # row is a dict: {"name": "Elena Vasquez", "department": "Programs", ...}
        pay = float(row["hours_worked"]) * float(row["hourly_rate"])
        print(f"  {row['name']}: ${pay:.2f}")

DictReader is almost always preferable to csv.reader because:

  • Your code is readable: row["name"] vs. row[0]
  • Column order doesn't matter — you access by name
  • Adding a new column to the CSV doesn't break existing code

🔗 Connection (Ch 9 — Dictionaries): DictReader turns each row into a dictionary. All the dict operations from Chapter 9 — row["key"], row.get("key", default), iterating with .items() — work exactly as you'd expect.

Writing with csv.writer and csv.DictWriter

import csv

# csv.writer — write lists
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Score", "Grade"])   # header
    writer.writerow(["Alice", 92, "A"])
    writer.writerow(["Bob", 87, "B+"])

# csv.DictWriter — write from dicts (recommended)
with open("output.csv", "w", newline="") as f:
    fieldnames = ["Name", "Score", "Grade"]
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({"Name": "Alice", "Score": 92, "Grade": "A"})
    writer.writerow({"Name": "Bob", "Score": 87, "Grade": "B+"})

Elena's Report: CSV Pipeline

Elena processes monthly payroll CSVs to produce department summaries. Here's the complete read-process-write pattern:

import csv

def summarize_payroll(input_csv: str, output_csv: str) -> None:
    """Read employee data, compute department totals, write summary."""
    dept_data: dict[str, list[float]] = {}

    # Phase 1: Read and aggregate
    with open(input_csv, "r", newline="") as f:
        for row in csv.DictReader(f):
            dept = row["department"]
            pay = float(row["hours_worked"]) * float(row["hourly_rate"])
            dept_data.setdefault(dept, []).append(pay)

    # Phase 2: Compute and write
    with open(output_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[
            "department", "employees", "total_pay", "avg_pay"
        ])
        writer.writeheader()
        for dept in sorted(dept_data):
            pays = dept_data[dept]
            writer.writerow({
                "department": dept,
                "employees": len(pays),
                "total_pay": f"{sum(pays):.2f}",
                "avg_pay": f"{sum(pays) / len(pays):.2f}",
            })

    print(f"Summary: {len(dept_data)} departments → {output_csv}")

# Usage
summarize_payroll("sample-data.csv", "department_summary.csv")

10.8 Reading and Writing JSON

JSON (JavaScript Object Notation) is the lingua franca of the modern web. APIs return JSON. Configuration files use JSON. Mobile apps store settings in JSON. If you've ever looked at data from a web service, you've seen JSON.

JSON looks almost identical to Python dictionaries and lists:

{
  "name": "Alice Chen",
  "gpa": 3.87,
  "courses": ["CS 101", "MATH 201"],
  "graduated": false,
  "advisor": null
}

Python's json module provides four functions — two for files, two for strings:

Function               Direction               Works With
json.dump(obj, file)   Python → JSON file      File objects
json.load(file)        JSON file → Python      File objects
json.dumps(obj)        Python → JSON string    Strings
json.loads(string)     JSON string → Python    Strings

💡 Intuition: The s in dumps and loads stands for "string." dump writes to a file; dumps writes to a string. load reads from a file; loads reads from a string.

Writing JSON

import json

student = {
    "name": "Alice",
    "scores": [92, 88, 95],
    "graduated": False,
}

# Write to a file (pretty-printed)
with open("student.json", "w") as f:
    json.dump(student, f, indent=2)

# Convert to a string
json_string = json.dumps(student, indent=2)
print(json_string)

Output:

{
  "name": "Alice",
  "scores": [
    92,
    88,
    95
  ],
  "graduated": false
}

Notice that Python False becomes JSON false, and Python None becomes JSON null. The indent=2 parameter produces human-readable output; without it, everything lands on one line.
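You can confirm both conversions in one line:

```python
import json

s = json.dumps({"advisor": None, "graduated": False})
print(s)   # {"advisor": null, "graduated": false}
```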

Reading JSON

import json

# Read from a file
with open("student.json", "r") as f:
    student = json.load(f)

print(student["name"])           # "Alice"
print(student["scores"])         # [92, 88, 95]
print(type(student["scores"]))   # <class 'list'>

# Parse from a string
json_text = '{"city": "Portland", "pop": 652503}'
data = json.loads(json_text)
print(data["city"])   # "Portland"

Python ↔ JSON Type Mapping

Python         JSON           Notes
dict           object {}      Keys must be strings in JSON
list           array []
str            string
int, float     number
True / False   true / false   Lowercase in JSON
None           null
tuple          array []       Tuples become lists — the distinction is lost

⚠️ Pitfall: JSON dictionary keys must be strings. If you have integer keys in a Python dict, json.dump converts them to strings. When you json.load the data back, those keys are still strings — not integers. This can cause subtle KeyError bugs:

scores = {1: "Alice", 2: "Bob"}
json_text = json.dumps(scores)   # '{"1": "Alice", "2": "Bob"}'
loaded = json.loads(json_text)
print(loaded[1])     # KeyError! Keys are now "1", "2"
print(loaded["1"])   # "Alice" — works
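One common workaround is to convert the keys back after loading; a sketch:

```python
import json

scores = {1: "Alice", 2: "Bob"}
loaded = json.loads(json.dumps(scores))            # keys are now "1", "2"

restored = {int(k): v for k, v in loaded.items()}  # convert keys back to int
print(restored[1])
```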

Grade Calculator: JSON Persistence

Here's the grade calculator with save/load functionality:

import json
from pathlib import Path

RECORDS_FILE = Path("student_records.json")

def save_records(records: list[dict]) -> None:
    """Save student records to a JSON file."""
    with open(RECORDS_FILE, "w") as f:
        json.dump(records, f, indent=2)
    print(f"Saved {len(records)} records.")

def load_records() -> list[dict]:
    """Load student records from JSON, or return empty list."""
    if not RECORDS_FILE.exists():
        return []
    with open(RECORDS_FILE, "r") as f:
        return json.load(f)

# Usage: records persist between runs
records = load_records()
records.append({"name": "David", "scores": [76, 82, 80]})
save_records(records)

When to Use CSV vs. JSON

This is one of the most common design decisions in data programming. Here's a comparison:

Criterion          Plain Text            CSV                             JSON
Best for           Logs, notes, config   Tabular data (rows & columns)   Nested/hierarchical data
Human readable     Excellent             Good                            Good (with indent)
Structure          Unstructured          Flat table                      Nested objects & arrays
Excel compatible   No                    Yes                             No (without conversion)
API standard       No                    Rare                            Yes (dominant)
Python module      Built-in I/O          csv                             json
Example use case   Server log            Spreadsheet export              Config file, API response

Rule of thumb:

  • If your data looks like a spreadsheet (rows and columns, all the same fields), use CSV.
  • If your data is nested, has varying fields per record, or needs to round-trip through a web API, use JSON.
  • If your data is just human-readable notes or logs, plain text is fine.

🔄 Check Your Understanding

  1. What's the difference between json.dump() and json.dumps()?
  2. You have a list of 10,000 student records, each with the same fields (name, ID, GPA). Would you choose CSV or JSON? Why?
  3. What happens when you json.dump a Python dict with tuple values?

Verify

  1. json.dump(obj, file) writes directly to a file object. json.dumps(obj) returns a JSON-formatted string. The "s" stands for "string."
  2. CSV — the data is tabular (same columns for every record), CSV is more compact than JSON for flat data, and spreadsheet tools can open it directly.
  3. Tuples are converted to JSON arrays (which become Python lists when loaded back). The tuple-vs-list distinction is lost in the round-trip.
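You can verify answers 1 and 3 directly in the REPL with a quick round-trip:

```python
import json

# dumps returns a string; dump would write to an open file object instead.
data = {"point": (3, 4)}
text = json.dumps(data)
print(text)                      # {"point": [3, 4]} (the tuple became a JSON array)

restored = json.loads(text)
print(type(restored["point"]))   # <class 'list'> (the tuple-ness is gone)
```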

10.9 Common File I/O Errors

File I/O is one of the most error-prone areas in programming. Files might not exist, you might not have permission to access them, the encoding might be wrong, or the disk might be full. Here are the errors you'll encounter most often.

FileNotFoundError: Wrong Path

The most common beginner error. You try to read a file that doesn't exist — usually because the path is wrong:

# This fails if data.txt isn't in the current working directory
with open("data.txt", "r") as f:
    content = f.read()
# FileNotFoundError: [Errno 2] No such file or directory: 'data.txt'
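Chapter 11 covers try/except in depth, but as a preview, you can catch this exception instead of letting it crash the program. A minimal sketch; falling back to empty content is just one possible policy:

```python
try:
    with open("data.txt", "r", encoding="utf-8") as f:
        content = f.read()
except FileNotFoundError:
    # The file is missing: report it and continue with a safe default.
    print("data.txt not found, falling back to empty content")
    content = ""
```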

🐛 Debugging Walkthrough: FileNotFoundError

Symptom: FileNotFoundError: [Errno 2] No such file or directory: 'data/results.csv'

Common causes:

1. Typo in the filename. Double-check spelling and extension (.csv vs .CSV).
2. Wrong working directory. Your script assumes the file is in the same folder, but you're running it from a different directory. Fix: use Path(__file__).parent / "data" / "results.csv" instead of a relative path.
3. The file genuinely doesn't exist yet. If your program is supposed to create the file on first run, check for existence first:

```python
from pathlib import Path

path = Path("data") / "results.csv"
if path.exists():
    with open(path, "r") as f:
        data = f.read()
else:
    print(f"File not found: {path}")
    print(f"Current directory: {Path.cwd()}")
    print(f"Files here: {list(Path.cwd().iterdir())}")
```

The Path.cwd() trick is your best debugging friend. When a file can't be found, print the current working directory — the answer is almost always "I thought I was in folder X, but I'm actually in folder Y."

PermissionError: Access Denied

You don't have permission to read or write the file — common on shared servers or when trying to write to system directories:

# This might fail on some systems
with open("/etc/shadow", "r") as f:   # Linux system file
    content = f.read()
# PermissionError: [Errno 13] Permission denied: '/etc/shadow'

Encoding Errors

Text files are stored as bytes, and those bytes must be interpreted using a character encoding. The most common encoding today is UTF-8, but older files might use Latin-1, Windows-1252, or other encodings. If Python guesses wrong, you get garbled text or crashes:

🐛 Debugging Walkthrough: Encoding Errors

Symptom: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 42

What happened: The file isn't UTF-8 encoded. In Latin-1, byte 0xe9 is the letter "é" (e with an acute accent).

Fix: Specify the correct encoding:

```python
# Try UTF-8 first (most common), fall back to latin-1
try:
    with open("data.txt", "r", encoding="utf-8") as f:
        content = f.read()
except UnicodeDecodeError:
    with open("data.txt", "r", encoding="latin-1") as f:
        content = f.read()
    print("Warning: file is not UTF-8 — read as Latin-1")
```

Prevention: When creating files, always specify encoding="utf-8":

```python
with open("output.txt", "w", encoding="utf-8") as f:
    f.write("Café, naïve, résumé\n")
```

Defensive File I/O Pattern

Here's a robust pattern that handles the most common errors:

from pathlib import Path

def safe_read(filepath: str | Path) -> str | None:
    """Read a file safely, returning None on failure."""
    path = Path(filepath)
    if not path.exists():
        print(f"File not found: {path}")
        return None
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except PermissionError:
        print(f"Permission denied: {path}")
        return None
    except UnicodeDecodeError:
        print(f"Encoding error — trying latin-1: {path}")
        with open(path, "r", encoding="latin-1") as f:
            return f.read()

📊 Real-World Application: In production systems, file I/O errors are expected, not exceptional. Elena's automated report pipeline includes error handling for every file operation — because in a real nonprofit, someone will eventually rename the monthly CSV, change permissions on the shared drive, or save a file in the wrong encoding. Robust code handles all of these gracefully.


10.10 Project Checkpoint: TaskFlow v0.9

It's time for the biggest upgrade to TaskFlow yet: persistence. After this checkpoint, your tasks will survive between program runs. Close the program, shut down your computer, come back tomorrow — your tasks will still be there.

What's New in v0.9

  • Tasks are saved to a JSON file (taskflow_data.json)
  • Tasks load automatically when the program starts
  • Every change (add, delete, complete) auto-saves immediately
  • The program handles missing or corrupted data files gracefully

Implementation

The core persistence logic is two functions:

from datetime import datetime   # used below for task timestamps
import json
from pathlib import Path

DATA_FILE = Path(__file__).parent / "taskflow_data.json"

def load_tasks(path: Path) -> list[dict]:
    """Load tasks from a JSON file.

    Returns an empty list if the file doesn't exist or is corrupted.
    """
    if not path.exists():
        print("  No saved tasks found — starting fresh.")
        return []
    try:
        with open(path, "r", encoding="utf-8") as f:
            tasks = json.load(f)
        print(f"  Loaded {len(tasks)} task(s) from {path.name}")
        return tasks
    except json.JSONDecodeError:
        print(f"  Warning: {path.name} is corrupted. Starting fresh.")
        return []

def save_tasks(tasks: list[dict], path: Path) -> None:
    """Save tasks to a JSON file with pretty formatting."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(tasks, f, indent=2)
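To convince yourself that saving and re-loading preserves tasks exactly, here is a self-contained round-trip check using a temporary directory. This is a sketch for experimentation, not part of TaskFlow itself:

```python
import json
import tempfile
from pathlib import Path

tasks = [{"title": "Test task", "done": False, "priority": "high"}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "taskflow_data.json"

    # Save, then immediately reload, exactly as TaskFlow does between runs.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(tasks, f, indent=2)
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == tasks)   # True: strings, booleans, and nesting all survive
```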

Then integrate auto-save into every operation:

def add_task(tasks: list[dict]) -> None:
    """Add a new task and auto-save."""
    title = input("  Task title: ").strip()
    if not title:
        print("  Title cannot be empty.")
        return

    priority = input("  Priority (high/medium/low) [medium]: ").strip().lower()
    if priority not in ("high", "medium", "low"):
        priority = "medium"

    category = input("  Category [general]: ").strip() or "general"

    task = {
        "title": title,
        "priority": priority,
        "category": category,
        "done": False,
        "created": datetime.now().strftime("%Y-%m-%d %H:%M"),
    }
    tasks.append(task)
    save_tasks(tasks, DATA_FILE)   # <-- auto-save after every change
    print(f"  Added: '{title}'")

Why JSON (and Not CSV)?

This is a design decision. We chose JSON over CSV for TaskFlow because:

  1. Tasks have nested structure. A task might eventually have sub-tasks or tags (a list within a dict). JSON handles nesting naturally; CSV doesn't.
  2. Fields vary. Not every task needs every field. JSON handles missing fields gracefully; CSV requires every row to have the same columns.
  3. Human-readable. With indent=2, the JSON file is easy to inspect and debug.
  4. Round-trip fidelity. Booleans stay booleans, numbers stay numbers. In CSV, everything is a string that you'd need to convert back.
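Point 4 is easy to demonstrate: in a JSON round-trip the boolean survives, while in a CSV round-trip every field comes back as a string. A minimal sketch using in-memory buffers instead of real files:

```python
import csv
import io
import json

task = {"title": "Buy groceries", "done": True}

# JSON round-trip: types are preserved.
via_json = json.loads(json.dumps(task))
print(type(via_json["done"]))    # <class 'bool'>

# CSV round-trip: every field becomes a string.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "done"])
writer.writeheader()
writer.writerow(task)
buf.seek(0)
via_csv = next(csv.DictReader(buf))
print(type(via_csv["done"]))     # <class 'str'>
```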

What the JSON File Looks Like

[
  {
    "title": "Read Chapter 10",
    "priority": "high",
    "category": "homework",
    "done": false,
    "created": "2025-01-15 09:30"
  },
  {
    "title": "Buy groceries",
    "priority": "medium",
    "category": "personal",
    "done": true,
    "created": "2025-01-14 18:00"
  }
]

Try It Yourself

  1. Run the TaskFlow v0.9 script from code/project-checkpoint.py.
  2. Add three tasks with different priorities and categories.
  3. Close the program (option 7).
  4. Reopen the program — your tasks should still be there.
  5. Open taskflow_data.json in a text editor and inspect it.
  6. Try deleting taskflow_data.json — the program should start fresh without crashing.

🚪 Threshold Concept Callback: This is persistence in action. Your TaskFlow program now has memory — it outlives its own execution. Every professional application you use (email, social media, games, banking) is built on this same principle: read state from disk, let the user modify it, write state back to disk. The details get more sophisticated (databases, cloud storage, distributed systems), but the core idea is exactly what you just built.

What's Next

In Chapter 11, we'll add robust error handling to every part of TaskFlow. Right now, if the user enters non-numeric input where a number is expected, the program crashes. Chapter 11 fixes that with try/except blocks — making TaskFlow bulletproof.


Chapter Summary

This chapter introduced the fundamental skill of file I/O — making programs that persist data beyond a single execution.

Key concepts:

- open() creates a file object; the mode ("r", "w", "a") determines what you can do with it.
- Context managers (with) guarantee files are closed, even when errors occur.
- Process large files line by line to keep memory usage constant.
- pathlib.Path provides cross-platform path handling — always prefer it over string concatenation.
- The csv module handles tabular data with reader/DictReader and writer/DictWriter.
- The json module handles nested/hierarchical data with dump/load (files) and dumps/loads (strings).
- Common errors (FileNotFoundError, PermissionError, encoding issues) are expected in production code and should be handled gracefully.

New terms introduced: file object, open(), context manager, with, read mode, write mode, append mode, pathlib, Path, CSV, JSON, json module, csv module, encoding, newline, readline(), readlines()

Looking ahead: Chapter 11 introduces error handling with try/except — the Python philosophy of EAFP (Easier to Ask Forgiveness than Permission). You'll learn to catch specific exceptions, write your own error messages, and make your programs resilient to bad input, missing files, and unexpected conditions. The file I/O error patterns from section 10.9 are just the beginning.