
Chapter 9: File I/O — Reading and Writing Business Data

"Every business runs on data. Data lives in files. Learning to read and write files fluently is the moment Python stops being a toy and starts being a tool you actually use at work every single day."


Why This Chapter Changes Everything

Up to this point every program you have written starts fresh. You type some numbers, the program runs, Python forgets everything the moment the script ends. That is fine for learning. It is useless for business.

Real business data lives in files — CSV exports from your CRM, JSON responses from payment APIs, text logs from automated processes, Excel sheets your finance team has maintained for a decade. The moment you learn to read those files and write back to them, you unlock the thing most people are actually trying to do when they pick up Python at work: automate the tedious, repetitive data tasks that are eating their afternoons.

This chapter covers the full toolkit for file operations in Python. We will start with the fundamentals — how files and paths work — and build up to the patterns you will use in production: reading and writing CSV records, storing configuration in JSON, iterating through folders of files, and building the kind of logging system that makes your future self grateful.

Our two recurring characters anchor everything to real scenarios. Priya Okonkwo, the analyst at Acme Corp, needs to consolidate regional sales reports that land as separate CSV files every Monday morning. Maya Reyes, the freelance consultant, needs a project log that she can open, update, and query without ever touching a spreadsheet application. By the end of this chapter, both of them have working solutions — and so do you.


9.1 How Files and Paths Work

Before you write a single line of file-reading code, you need to understand what Python means when it talks about a file path.

What Is a File Path?

A file path is the address of a file in your operating system's storage hierarchy. Just as a mailing address specifies country, state, city, street, and house number, a file path specifies drive, folder, subfolder (and so on), and filename.

Two styles of path exist:

Absolute paths start from the root of the filesystem and are complete on their own. They tell the operating system exactly where to go, regardless of where your Python script is running.

# Windows absolute path
C:\Users\Priya\Documents\acme\sales\q1_report.csv

# Mac / Linux absolute path
/home/priya/acme/sales/q1_report.csv

Relative paths are measured from your script's current working directory — the folder Python considers "here" when your script runs. They are shorter and more portable across machines, but they only work if your working directory is what you expect.

# Relative paths (current working directory is /home/priya/acme/)
sales/q1_report.csv          # one folder down
../finance/budget.xlsx       # one folder up, then into finance
q1_report.csv                # same folder as the script

A common source of frustration for beginners: running a script from a different directory than expected and getting a FileNotFoundError. The pathlib module, described next, helps you avoid this.
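A quick way to diagnose that error is to print the current working directory and the absolute location your relative path actually resolves to. This short sketch uses the pathlib module introduced next:

```python
from pathlib import Path

# Every relative path is resolved against this directory, not against
# the folder the script file happens to live in.
print(Path.cwd())

# resolve() shows the absolute path a relative path would point to,
# which usually explains a surprise FileNotFoundError immediately.
print((Path("sales") / "q1_report.csv").resolve())
```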

The pathlib Module — The Right Way to Handle Paths

Python 3.4 introduced pathlib, and it has been the recommended approach to file paths ever since. Before pathlib, developers concatenated strings like "/home/priya/" + "sales/" + "report.csv", which was error-prone and awkward across operating systems: Windows paths use backslashes (\) while Mac and Linux use forward slashes. pathlib handles the separator differences automatically.

The central class is Path. You construct a Path object and then manipulate it with clean, readable operators.

from pathlib import Path

# Build a path — works identically on Windows, Mac, and Linux
reports_folder = Path("data") / "sales" / "regional"
report_file    = reports_folder / "north_q1.csv"

print(report_file)           # data/sales/regional/north_q1.csv  (backslashes on Windows)
print(report_file.name)      # north_q1.csv
print(report_file.stem)      # north_q1          (name without extension)
print(report_file.suffix)    # .csv              (extension only)
print(report_file.parent)    # data/sales/regional
print(report_file.resolve()) # /home/priya/acme/data/sales/regional/north_q1.csv

The / operator — when used between Path objects or between a Path and a string — builds a new path. This is far more readable and safer than string concatenation.

Useful Path Operations

pathlib goes well beyond constructing paths. Here are the operations you will reach for constantly:

from pathlib import Path

config_file = Path("config") / "settings.json"

# Does the file exist?
if config_file.exists():
    print("Config found")

# Is it a file or a directory?
print(config_file.is_file())       # True if it's a regular file
print(config_file.is_dir())        # True if it's a directory

# File metadata
stat = config_file.stat()
print(stat.st_size)                # size in bytes
print(stat.st_mtime)               # last-modified time (Unix timestamp)

# Create directories (like mkdir -p in the terminal)
output_dir = Path("output") / "reports" / "2024"
output_dir.mkdir(parents=True, exist_ok=True)
# parents=True   → create intermediate directories if needed
# exist_ok=True  → do nothing (no error) if directory already exists

# Rename / move a file
old_path = Path("data") / "temp_report.csv"
new_path = Path("data") / "archive" / "q1_report_final.csv"
old_path.rename(new_path)

# Delete a file
temp_file = Path("data") / "temp.txt"
if temp_file.exists():
    temp_file.unlink()

Listing Files in a Directory

pathlib makes it straightforward to iterate through a folder's contents — something Priya does every Monday when her regional CSVs arrive:

from pathlib import Path

reports_dir = Path("data") / "regional_reports"

# Iterate over every item in the directory
for item in reports_dir.iterdir():
    if item.is_file():
        print(f"  {item.name}  —  {item.stat().st_size} bytes")

# Find only CSV files using glob pattern matching
for csv_file in reports_dir.glob("*.csv"):
    print(f"  Found CSV: {csv_file.name}")

# Recursive glob: find all CSV files in all subdirectories
for csv_file in reports_dir.rglob("*.csv"):
    print(f"  {csv_file.relative_to(reports_dir)}")

Glob patterns use wildcards: * matches any sequence of characters, ? matches exactly one character, and ** (in rglob, or in glob("**/*.csv")) matches across directory levels. Common business patterns:

# All CSV files in the folder
list(reports_dir.glob("*.csv"))

# All files whose name starts with "q1_"
list(reports_dir.glob("q1_*"))

# All JSON files anywhere in the tree
list(reports_dir.rglob("*.json"))

9.2 Opening Files: The open() Function

With a good understanding of paths, you are ready to actually open files. The built-in open() function is the gateway to all file I/O in Python.

Signature and Mode Parameter

open(file, mode='r', encoding=None, newline=None)

The mode parameter is a one-or-two-character string that tells Python what you intend to do with the file:

Mode   Meaning                            File must exist?   Overwrites?
'r'    Read text (default)                Yes                No
'w'    Write text                         No (creates)       Yes
'a'    Append text                        No (creates)       No
'x'    Create new file, fail if exists    No (creates)       Error if exists
'r+'   Read and write                     Yes                No
'rb'   Read binary                        Yes                No
'wb'   Write binary                       No (creates)       Yes

For everyday text file work, you will use 'r', 'w', and 'a'. The binary modes ('rb', 'wb') are for images, PDFs, and other non-text files.
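Mode 'x' is worth a short demonstration, because it is the safe choice when a script must never overwrite existing output. This sketch writes into a temporary directory so it can run anywhere:

```python
import tempfile
from pathlib import Path

target = Path(tempfile.mkdtemp()) / "report_lock.txt"

# 'x' creates the file; it fails rather than overwrite an existing one
with open(target, mode="x", encoding="utf-8") as file_handle:
    file_handle.write("first write wins\n")

# A second 'x' open of the same path raises FileExistsError
try:
    with open(target, mode="x", encoding="utf-8") as file_handle:
        file_handle.write("this never runs\n")
except FileExistsError:
    print(f"Refused to overwrite {target.name}")
```

The try/except pattern shown here is how 'x' is typically used: attempt the create, and treat FileExistsError as the signal that the output already exists.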

The encoding Parameter — Always Specify It

Python's open() uses your operating system's default encoding if you omit encoding. On Windows that is often cp1252; on Mac/Linux it is typically utf-8. If your file was created on a different system (or if it contains any non-ASCII characters — accented names, currency symbols, em dashes), you will get garbled text or a UnicodeDecodeError.

The safe professional practice is to always specify encoding="utf-8" explicitly:

# Always do this:
with open("report.csv", mode="r", encoding="utf-8") as file_handle:
    content = file_handle.read()

# Never rely on the default:
with open("report.csv") as file_handle:   # risky on Windows
    content = file_handle.read()

9.3 Context Managers: The with Statement

Whenever you open a file, you create a connection between Python and the operating system. That connection consumes resources. If your script crashes, raises an exception, or simply forgets to close the file, those resources are leaked — and in some cases the file's data may not be fully written to disk.

The with statement solves this elegantly. It is called a context manager, and it guarantees that the file is closed when the block exits, regardless of whether the block completes normally or raises an exception.

# WITHOUT a context manager — risky
file_handle = open("report.txt", mode="r", encoding="utf-8")
content = file_handle.read()
file_handle.close()   # if an exception happens above, this never runs

# WITH a context manager — always safe
with open("report.txt", mode="r", encoding="utf-8") as file_handle:
    content = file_handle.read()
# file_handle is automatically closed here, even if read() raised an exception

The variable after as (file_handle in the example above) is simply the name you give to the open file object inside the block. The names file_handle, f, fh, infile, and csv_file are all common conventions. Choose whatever makes your intent clear.

You can also open multiple files in a single with statement when you need to read from one and write to another simultaneously:

with open("input.csv", mode="r", encoding="utf-8") as infile, \
     open("output.csv", mode="w", encoding="utf-8") as outfile:
    for line in infile:
        outfile.write(line.upper())

Rule of thumb: every file open call in professional Python code should be inside a with statement. There is essentially never a good reason to use the bare open() + close() pattern.


9.4 Reading Text Files

Python gives you three methods for reading text, each suited to different situations.

.read() — Load the Entire File

.read() returns the entire file as a single string. Use it when you need all the content at once and the file is small enough to fit comfortably in memory (a few hundred megabytes at most).

from pathlib import Path

memo_path = Path("data") / "q1_memo.txt"

with open(memo_path, mode="r", encoding="utf-8") as file_handle:
    full_text = file_handle.read()

print(f"Loaded {len(full_text)} characters")
print(full_text[:200])   # preview the first 200 characters

.readline() — One Line at a Time

.readline() returns the next line from the file, including the trailing newline character \n. When the end of the file is reached, it returns an empty string "". This is useful for processing a file incrementally when you need fine-grained control.

with open("report.txt", mode="r", encoding="utf-8") as file_handle:
    line_number = 0
    while True:
        line = file_handle.readline()
        if not line:      # empty string means end of file
            break
        line_number += 1
        print(f"Line {line_number}: {line.rstrip()}")   # rstrip() removes the \n

.readlines() — All Lines as a List

.readlines() reads the entire file and returns a list of strings, one per line (each including its \n). This is handy when you need random access to specific lines by index.

with open("report.txt", mode="r", encoding="utf-8") as file_handle:
    all_lines = file_handle.readlines()

print(f"Total lines: {len(all_lines)}")
print(f"Line 3: {all_lines[2].strip()}")        # index 2 is the 3rd line

Iterating Directly — Best for Large Files

The most Pythonic and memory-efficient approach for line-by-line processing is to iterate over the file object directly. Python reads one line per iteration without loading the whole file into memory:

word_count = 0

with open("large_report.txt", mode="r", encoding="utf-8") as file_handle:
    for line in file_handle:
        word_count += len(line.split())

print(f"Total words: {word_count}")

This pattern scales to files of any size. For gigabyte-scale log files, this is the only approach that will not crash your program.

Stripping Whitespace

Lines read from a file include the trailing \n character (and sometimes \r\n on Windows). Use .strip() to remove all leading and trailing whitespace, or .rstrip() to remove only trailing whitespace (preserving any intentional leading indentation):

with open("report.txt", mode="r", encoding="utf-8") as file_handle:
    for line in file_handle:
        clean_line = line.strip()    # remove leading and trailing whitespace
        if clean_line:               # skip blank lines
            process(clean_line)

9.5 Writing Text Files

Writing is the mirror of reading. The key decision is whether you want to overwrite ('w'), append ('a'), or create-only-if-new ('x').

.write() — Write a String

.write() takes a string and writes it to the file. It returns the number of characters written. Unlike print(), it does not add a newline automatically — you must include \n when you need line breaks.

from pathlib import Path

output_dir = Path("output")
output_dir.mkdir(exist_ok=True)
report_path = output_dir / "weekly_summary.txt"

with open(report_path, mode="w", encoding="utf-8") as file_handle:
    file_handle.write("ACME CORP — WEEKLY SUMMARY\n")
    file_handle.write("=" * 40 + "\n")
    file_handle.write("Total revenue this week: $142,500\n")
    file_handle.write("Units sold: 312\n")

.writelines() — Write a List of Strings

.writelines() accepts any iterable of strings and writes them in sequence. Like .write(), it does not add newlines — each string in your list must include its own \n if you want one.

summary_lines = [
    "Region     Revenue\n",
    "------     -------\n",
    "North      $142,500\n",
    "South      $ 98,200\n",
    "East       $167,800\n",
    "West       $131,400\n",
]

with open(report_path, mode="w", encoding="utf-8") as file_handle:
    file_handle.writelines(summary_lines)

Appending to an Existing File

Mode 'a' positions the write cursor at the end of the file. New content is added after whatever was already there. If the file does not exist, 'a' creates it automatically — this makes 'a' perfect for logs that grow over time.

log_path = Path("logs") / "process_log.txt"
log_path.parent.mkdir(exist_ok=True)

def log_event(message: str) -> None:
    """Append a timestamped event to the process log."""
    from datetime import datetime
    timestamp = datetime.now().isoformat(timespec="seconds")
    with open(log_path, mode="a", encoding="utf-8") as log_file:
        log_file.write(f"{timestamp}  {message}\n")

log_event("Started weekly report processing")
log_event("Read 4 regional CSV files")
log_event("Wrote consolidated report: 847 rows")

Notice that each call to log_event() opens, writes, and closes the file. This is intentional: closing the file after each write flushes the data to disk. If your process crashes after log_event("Read 4 regional CSV files"), you still have that entry preserved. A file that is held open across many writes may lose its most recent content on a crash.
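If you do need to keep one handle open across many writes (for performance, say), calling .flush() after each write hands Python's buffered data to the operating system immediately; note that the operating system may itself buffer briefly before the physical disk:

```python
import tempfile
from pathlib import Path

log_path = Path(tempfile.mkdtemp()) / "flush_demo.log"

with open(log_path, mode="a", encoding="utf-8") as log_file:
    for step in ["started", "read 4 files", "wrote output"]:
        log_file.write(f"{step}\n")
        log_file.flush()   # push the buffer to the OS after every entry

print(log_path.read_text(encoding="utf-8"))
```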


9.6 The csv Module

Plain text files are for human-readable content. When your data has structure — rows and columns — you want CSV (Comma-Separated Values). CSV is the universal language of tabular data. Every spreadsheet application, database, CRM, and analytics tool can export and import CSV.

Python's standard library includes the csv module, which handles the numerous edge cases that make hand-parsing CSV unreliable: fields containing commas, fields containing newlines, quoted fields, different delimiters (tabs, pipes, semicolons), and BOM markers added by Excel.
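To see why hand-parsing is unreliable, consider a field that itself contains a comma. This sketch substitutes an in-memory io.StringIO for a real file; the csv module quotes the field on the way out and restores it on the way back in:

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)

# The company name contains a comma, so csv.writer quotes it
writer.writerow(["Acme Corp, Inc.", "142500.00"])
print(buffer.getvalue().strip())   # "Acme Corp, Inc.",142500.00

# csv.reader removes the quoting and restores the original field
buffer.seek(0)
row = next(csv.reader(buffer))
print(row[0])                      # Acme Corp, Inc.
```

A naive line.split(",") on that same row would have produced three fields instead of two.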

The newline="" Parameter for CSV Files

Before anything else: when opening CSV files, always pass newline="" to open(). This prevents Python's universal newline translation from interfering with the csv module's own newline handling, which matters particularly on Windows where files may have \r\n line endings:

with open("sales.csv", mode="r", newline="", encoding="utf-8") as csv_file:
    reader = csv.reader(csv_file)
    ...

csv.reader — List-Based Reading

csv.reader wraps a file object and yields each row as a list of strings:

import csv
from pathlib import Path

sales_path = Path("data") / "q1_sales.csv"

with open(sales_path, mode="r", newline="", encoding="utf-8") as csv_file:
    reader = csv.reader(csv_file)
    header_row = next(reader)    # read the first row as headers
    print(f"Columns: {header_row}")

    for row in reader:           # each row is a list of strings
        region   = row[0]
        revenue  = float(row[3])   # must convert types manually
        print(f"  {region}: ${revenue:,.2f}")

Positional indexing (row[0], row[3]) is fragile — if someone adds a column to the CSV, your code breaks silently. The dictionary-based approach described next is almost always preferable.

csv.DictReader — Dictionary-Based Reading

csv.DictReader reads each row as a dictionary, using the first row as keys. You access fields by name instead of position, making your code self-documenting and resilient to column reordering:

import csv
from pathlib import Path

with open("data/q1_sales.csv", mode="r", newline="", encoding="utf-8") as csv_file:
    reader = csv.DictReader(csv_file)
    # reader.fieldnames contains the header row

    for row in reader:
        # Access by column name — clear and resilient
        region        = row["region"]
        product_line  = row["product_line"]
        revenue       = float(row["revenue"])    # still need type conversion
        quota         = float(row["quota"])

        attainment = (revenue / quota) * 100
        print(f"  {region} / {product_line}: ${revenue:,.0f}  ({attainment:.1f}% of quota)")

Important: csv.DictReader (like all CSV readers) always returns strings. Every numeric field will be a string "142500.00" until you convert it. This is a very common source of bugs — the + operator on two CSV strings produces concatenation, not addition.

# WRONG — string concatenation
total = row["q1_revenue"] + row["q2_revenue"]   # "142500.00108000.00" !!!

# RIGHT — convert first
total = float(row["q1_revenue"]) + float(row["q2_revenue"])   # 250500.0

csv.writer — List-Based Writing

csv.writer accepts a list for each row and handles all quoting and escaping:

import csv

rows = [
    ["region", "product_line", "revenue", "quota"],
    ["North",  "Enterprise",   189000,    175000],
    ["South",  "Enterprise",   130500,    160000],
    ["East",   "Enterprise",   247500,    220000],
]

with open("output/summary.csv", mode="w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(rows[0])       # write header
    writer.writerows(rows[1:])     # write all data rows at once

csv.DictWriter — Dictionary-Based Writing

csv.DictWriter takes dictionaries and writes them according to the fieldnames you specify at construction. You control column order through fieldnames, not through the order keys appear in your dicts.

import csv

sales_records = [
    {"rep_name": "Fatima Hassan", "region": "North", "revenue": 105600, "quota": 100000},
    {"rep_name": "Marcus Webb",   "region": "East",  "revenue": 247500, "quota": 220000},
    {"rep_name": "Tom Reilly",    "region": "South", "revenue": 130500, "quota": 160000},
]

fieldnames = ["rep_name", "region", "revenue", "quota"]

with open("output/reps.csv", mode="w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()           # writes the header row from fieldnames
    writer.writerows(sales_records)

The extrasaction="ignore" parameter on DictWriter is worth knowing — it tells the writer to silently drop any keys in your dicts that are not in fieldnames. Without it, a single extra key raises a ValueError:

writer = csv.DictWriter(
    csv_file,
    fieldnames=["rep_name", "revenue"],
    extrasaction="ignore",     # silently drop "region", "quota", etc.
)

9.7 JSON Files: Configuration and API Responses

JSON (JavaScript Object Notation) is the most common format for structured data exchanged between systems: REST API responses, configuration files, cached query results, and webhook payloads. Python's json module makes reading and writing JSON straightforward.

When to Use JSON vs. CSV

  • CSV: tabular data, rows of records with the same fields, large datasets
  • JSON: nested structures, configuration settings, API payloads, data with mixed types

An Acme Corp example: the sales records belong in CSV (hundreds of rows, consistent columns). The configuration that describes how the consolidation script should run — which regions to include, where to write output, when it was last run — belongs in JSON.

json.load() — Reading JSON from a File

import json
from pathlib import Path

config_path = Path("config") / "report_settings.json"

with open(config_path, mode="r", encoding="utf-8") as json_file:
    config = json.load(json_file)

# config is now a plain Python dict / list / str / int / float / bool / None
print(config["output_directory"])
print(config["regions"])           # might be a list
print(config["last_run"])          # might be a string date

JSON maps directly to Python types:

JSON              Python
{} (object)       dict
[] (array)        list
"string"          str
123 (number)      int
3.14 (number)     float
true / false      True / False
null              None
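The mapping works in both directions. A small round trip shows True becoming JSON true and None becoming null, with everything restored when the string is parsed back:

```python
import json

record = {"region": "North", "revenue": 142500.0, "verified": True, "notes": None}

text = json.dumps(record)
print(text)   # {"region": "North", "revenue": 142500.0, "verified": true, "notes": null}

restored = json.loads(text)
print(restored == record)   # True
```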

json.dump() — Writing JSON to a File

import json
from datetime import datetime
from pathlib import Path

processing_metadata = {
    "report_name":      "Q1 Consolidated Sales",
    "generated_at":     datetime.now().isoformat(timespec="seconds"),
    "source_files":     ["north_q1.csv", "south_q1.csv", "east_q1.csv", "west_q1.csv"],
    "total_records":    847,
    "total_revenue":    539900.00,
    "processing_flags": {
        "excluded_cancelled": True,
        "currency_normalized": True,
    },
}

metadata_path = Path("output") / "processing_metadata.json"
metadata_path.parent.mkdir(exist_ok=True)

with open(metadata_path, mode="w", encoding="utf-8") as json_file:
    json.dump(processing_metadata, json_file, indent=2)

The indent=2 argument makes the output human-readable (pretty-printed). Without it, json.dump() writes everything on one line — compact but unreadable for debugging. Use indent=2 for any file a human might read; omit it for pure machine-to-machine communication where file size matters.

json.loads() and json.dumps() — For Strings, Not Files

If you are working with JSON as a string (for example, parsing an API response body), use json.loads() (load string) and json.dumps() (dump to string):

import json

api_response_text = '{"status": "success", "records": 42, "flags": ["processed", "verified"]}'

# Parse a JSON string into a Python object
data = json.loads(api_response_text)
print(data["records"])     # 42

# Serialize a Python object to a JSON string
output_string = json.dumps(data, indent=2)
print(output_string)

9.8 Maya's Project Log — Putting It Together

Let us see how Maya uses everything covered so far. She needs a system that:

  1. Stores her consulting projects in a CSV file that persists between sessions
  2. Lets her add new projects and update hours logged
  3. Flags any project that has exceeded the estimated hours by more than 10%
  4. Produces a clean "active projects" report she can review each morning
  5. Tracks her earnings in a JSON summary she can glance at without opening Python

Here is the core of her project log system:

import csv
import json
from pathlib import Path
from datetime import date, datetime

LOG_DIR = Path("data") / "maya_logs"
LOG_DIR.mkdir(parents=True, exist_ok=True)

PROJECT_CSV = LOG_DIR / "project_log.csv"
EARNINGS_JSON = LOG_DIR / "earnings_summary.json"

FIELDNAMES = [
    "project_id", "project_name", "client", "start_date",
    "status", "estimated_hours", "actual_hours", "hourly_rate", "notes",
]

OVER_BUDGET_THRESHOLD = 1.10   # flag at 110% of estimated hours


def load_projects() -> list[dict]:
    """Read all projects from the CSV log, with type conversion."""
    if not PROJECT_CSV.exists():
        return []

    projects = []
    with open(PROJECT_CSV, mode="r", newline="", encoding="utf-8") as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            projects.append({
                "project_id":       row["project_id"],
                "project_name":     row["project_name"],
                "client":           row["client"],
                "start_date":       row["start_date"],
                "status":           row["status"],
                "estimated_hours":  float(row["estimated_hours"]),
                "actual_hours":     float(row["actual_hours"]),
                "hourly_rate":      float(row["hourly_rate"]),
                "notes":            row["notes"],
            })
    return projects


def save_projects(projects: list[dict]) -> None:
    """Overwrite the CSV log with the current project list."""
    with open(PROJECT_CSV, mode="w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(projects)


def is_over_budget(project: dict) -> bool:
    return project["actual_hours"] > project["estimated_hours"] * OVER_BUDGET_THRESHOLD


# Load, update, save pattern
projects = load_projects()

# Flag over-budget projects
for project in projects:
    if is_over_budget(project) and project["status"] == "active":
        excess_hours = project["actual_hours"] - project["estimated_hours"]
        print(
            f"WARNING: {project['project_name']} ({project['client']}) "
            f"is {excess_hours:.1f}h over estimate"
        )

save_projects(projects)

The full implementation, with earnings calculations and report writing, lives in code/maya_project_log.py.
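The earnings summary itself (item 5 on Maya's list) is a natural fit for json.dump(). What follows is a simplified sketch rather than the full implementation, and the summary field names here are illustrative:

```python
import json
import tempfile
from pathlib import Path

def write_earnings_summary(projects: list[dict], summary_path: Path) -> None:
    """Write a glance-able JSON earnings summary (illustrative fields)."""
    summary = {
        "active_projects": sum(1 for p in projects if p["status"] == "active"),
        "total_hours": round(sum(p["actual_hours"] for p in projects), 1),
        "total_earnings": round(
            sum(p["actual_hours"] * p["hourly_rate"] for p in projects), 2
        ),
    }
    with open(summary_path, mode="w", encoding="utf-8") as json_file:
        json.dump(summary, json_file, indent=2)

# Tiny demo with made-up records shaped like load_projects() output
sample = [
    {"status": "active", "actual_hours": 12.5, "hourly_rate": 150.0},
    {"status": "complete", "actual_hours": 40.0, "hourly_rate": 120.0},
]
summary_path = Path(tempfile.mkdtemp()) / "earnings_summary.json"
write_earnings_summary(sample, summary_path)
```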


9.9 Working With Multiple Files: Priya's Consolidation Script

Priya's Monday morning task: four regional CSV files appear in her data/regional_reports/ folder — north_q1.csv, south_q1.csv, east_q1.csv, west_q1.csv. She needs to combine them into a single master CSV and write a JSON metadata file recording how the processing went.

The key pattern here is: iterate over the folder, read each file, accumulate the data, write the combined output.

import csv
import json
from pathlib import Path
from datetime import datetime

REGIONAL_DIR = Path("data") / "regional_reports"
OUTPUT_DIR   = Path("output")
OUTPUT_DIR.mkdir(exist_ok=True)

MASTER_CSV      = OUTPUT_DIR / "q1_consolidated.csv"
METADATA_JSON   = OUTPUT_DIR / "processing_metadata.json"

FIELDNAMES = ["region", "rep_name", "product_line", "revenue", "quota", "month"]

def consolidate_regional_reports() -> dict:
    """
    Read all CSV files in REGIONAL_DIR, combine into one master CSV,
    and return processing metadata.
    """
    all_records = []
    source_files = []
    error_files = []

    # glob("*.csv") returns a generator of Path objects
    csv_files = sorted(REGIONAL_DIR.glob("*.csv"))

    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in {REGIONAL_DIR}")

    for csv_path in csv_files:
        try:
            with open(csv_path, mode="r", newline="", encoding="utf-8") as csv_file:
                reader = csv.DictReader(csv_file)
                file_records = list(reader)
            all_records.extend(file_records)
            source_files.append(csv_path.name)
            print(f"  Read {len(file_records):>4} rows from {csv_path.name}")
        except Exception as error:
            print(f"  ERROR reading {csv_path.name}: {error}")
            error_files.append(csv_path.name)

    # Write the consolidated CSV
    with open(MASTER_CSV, mode="w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=FIELDNAMES, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(all_records)

    # Build metadata
    return {
        "generated_at":   datetime.now().isoformat(timespec="seconds"),
        "source_files":   source_files,
        "error_files":    error_files,
        "total_records":  len(all_records),
        "output_file":    str(MASTER_CSV),
    }


metadata = consolidate_regional_reports()

with open(METADATA_JSON, mode="w", encoding="utf-8") as json_file:
    json.dump(metadata, json_file, indent=2)

print(f"\nConsolidated {metadata['total_records']} records from {len(metadata['source_files'])} files")

Notice several production-quality practices in this code:

  • sorted(REGIONAL_DIR.glob("*.csv")) processes files in alphabetical order, making the output deterministic
  • The try/except around each file read allows one bad file to log an error without stopping the entire consolidation
  • Metadata is written to JSON separately from the data, so you can always audit what the script did
  • extrasaction="ignore" makes the DictWriter tolerant of any extra columns in source files

9.10 Common Real-World Patterns

Pattern 1: Reading a Config File at Startup

Hardcoding paths, credentials, and settings directly in a script is fragile — every deployment requires editing source code. The professional pattern is to read configuration from a JSON or .ini file at startup:

import json
from pathlib import Path

CONFIG_PATH = Path("config") / "app_settings.json"

def load_config() -> dict:
    """Load application settings from JSON. Raise clearly if missing."""
    if not CONFIG_PATH.exists():
        raise FileNotFoundError(
            f"Config file not found: {CONFIG_PATH}\n"
            "Copy config/app_settings.example.json and customize it."
        )
    with open(CONFIG_PATH, mode="r", encoding="utf-8") as json_file:
        return json.load(json_file)

config = load_config()
output_directory = Path(config["output_directory"])
max_records      = config.get("max_records", 10000)   # default if key absent

Pattern 2: Append-Based Logging

For scripts that run on a schedule, you want a log file that accumulates entries over time. Mode 'a' makes this trivial:

from datetime import datetime
from pathlib import Path

LOG_PATH = Path("logs") / "weekly_consolidation.log"
LOG_PATH.parent.mkdir(exist_ok=True)

def write_log_entry(level: str, message: str) -> None:
    timestamp = datetime.now().isoformat(timespec="seconds")
    entry = f"{timestamp}  [{level.upper():<7}]  {message}\n"
    with open(LOG_PATH, mode="a", encoding="utf-8") as log_file:
        log_file.write(entry)

write_log_entry("info",    "Weekly consolidation started")
write_log_entry("info",    "Read 4 regional files, 847 records total")
write_log_entry("warning", "South region file had 2 malformed rows — skipped")
write_log_entry("info",    "Consolidation complete. Output: output/q1_master.csv")

Pattern 3: Safe File Writing with a Temporary File

When overwriting an important file, there is a brief window during which the file is partially written and therefore corrupt. The safe professional pattern is to write to a temporary file first, then rename it:

import csv
from pathlib import Path

TARGET = Path("output") / "master_report.csv"
TEMP   = TARGET.with_suffix(".csv.tmp")

# Write to the temp file first (fieldnames and records come from earlier processing)
with open(TEMP, mode="w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)

# Atomically replace the target (Path.replace is atomic on the same filesystem)
TEMP.replace(TARGET)
print(f"Safely updated {TARGET}")

Pattern 4: Bulk File Processing with glob

A common data engineering task: process every file that matches a pattern in a directory and produce one output file per input file.

from pathlib import Path
import csv

INPUT_DIR  = Path("data") / "incoming"
OUTPUT_DIR = Path("data") / "processed"
OUTPUT_DIR.mkdir(exist_ok=True)

for input_file in sorted(INPUT_DIR.glob("sales_*.csv")):
    # Derive output filename from input filename
    output_file = OUTPUT_DIR / input_file.name.replace("sales_", "processed_")

    with open(input_file, mode="r", newline="", encoding="utf-8") as infile, \
         open(output_file, mode="w", newline="", encoding="utf-8") as outfile:

        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames, extrasaction="ignore")
        writer.writeheader()

        for row in reader:
            # Transform: add a computed column
            # (assumes revenue and quota are present and quota is nonzero)
            row["revenue_vs_quota_pct"] = str(
                round(float(row["revenue"]) / float(row["quota"]) * 100, 1)
            )
            writer.writerow(row)

    print(f"Processed {input_file.name} → {output_file.name}")

9.11 Error Handling for File Operations

File operations fail for reasons outside your code's control: the file does not exist, you lack permission, the disk is full, the network drive is unavailable. Robust file-handling code anticipates these failures.

Common File Exceptions

| Exception | Cause |
| --- | --- |
| FileNotFoundError | File or directory does not exist |
| PermissionError | You do not have read/write access |
| IsADirectoryError | You tried to open a directory as a file |
| FileExistsError | Mode 'x' was used and the file already exists |
| UnicodeDecodeError | The file's encoding does not match the one you specified (a ValueError, not an OSError) |
| OSError | Generic OS-level failure — the parent class of the four file exceptions above |
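
That parent-child relationship is useful in practice: a broad `except OSError` can serve as a safety net behind the specific handlers. A minimal sketch (the `try_open` helper and its return strings are hypothetical, added for illustration):

```python
from pathlib import Path
import tempfile

def try_open(path: Path) -> str:
    """Attempt to read a file, reporting failures as short strings."""
    try:
        with open(path, mode="r", encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        return "missing"
    except OSError as exc:  # catches PermissionError, IsADirectoryError, ...
        return f"os-error: {type(exc).__name__}"

with tempfile.TemporaryDirectory() as tmp:
    results = [
        try_open(Path(tmp) / "no_such_file.txt"),  # FileNotFoundError
        try_open(Path(tmp)),  # IsADirectoryError on POSIX, PermissionError on Windows
    ]
    print(results)
```

Order matters: Python tries the `except` clauses top to bottom, so the specific subclasses must come before the `OSError` catch-all or they will never run.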

Handling Errors Gracefully

from pathlib import Path

def read_report_safely(file_path: Path) -> str | None:
    """
    Read a report file, returning None on failure instead of crashing.
    Always logs what happened.
    """
    try:
        with open(file_path, mode="r", encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        print(f"ERROR: File not found — {file_path}")
        return None
    except PermissionError:
        print(f"ERROR: Permission denied — {file_path}")
        return None
    except UnicodeDecodeError:
        print(f"ERROR: Encoding issue — {file_path}. Try encoding='latin-1'")
        return None
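
The UnicodeDecodeError branch above suggests retrying with latin-1, and that retry is easy to automate. A hedged sketch — the `read_with_fallback` name is hypothetical — relying on the fact that latin-1 maps every possible byte to some character, so the fallback read always succeeds (though accented characters may display incorrectly if latin-1 was the wrong guess):

```python
from pathlib import Path
import tempfile

def read_with_fallback(file_path: Path) -> str:
    """Read as UTF-8; on a decode error, retry as latin-1."""
    try:
        return file_path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        return file_path.read_text(encoding="latin-1")

# Demo: byte 0xE9 is 'é' in latin-1 but invalid as a lone byte in UTF-8
with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "legacy.txt"
    p.write_bytes(b"caf\xe9")
    text = read_with_fallback(p)
    print(text)
```

This is a pragmatic rescue for one-off legacy files; for data you control, standardizing on UTF-8 everywhere is the better long-term fix.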

9.12 Quick Reference

Opening Files

# Read text
with open(path, mode="r", encoding="utf-8") as f:

# Write text (overwrite)
with open(path, mode="w", encoding="utf-8") as f:

# Append text
with open(path, mode="a", encoding="utf-8") as f:

# Read CSV
with open(path, mode="r", newline="", encoding="utf-8") as f:

# Write CSV
with open(path, mode="w", newline="", encoding="utf-8") as f:

# Read JSON
with open(path, mode="r", encoding="utf-8") as f:
    data = json.load(f)

# Write JSON
with open(path, mode="w", encoding="utf-8") as f:
    json.dump(data, f, indent=2)

pathlib Cheat Sheet

from pathlib import Path

p = Path("data") / "sales" / "report.csv"

p.name          # "report.csv"
p.stem          # "report"
p.suffix        # ".csv"
p.parent        # Path("data/sales")
p.resolve()     # absolute path

p.exists()      # True / False
p.is_file()     # True / False
p.is_dir()      # True / False
p.stat().st_size  # size in bytes

p.mkdir(parents=True, exist_ok=True)   # create directory
p.unlink()                              # delete file
p.rename(new_path)                      # rename / move

list(p.parent.glob("*.csv"))           # find CSVs in same folder
list(p.parent.rglob("*.csv"))          # find CSVs recursively

csv Module Cheat Sheet

import csv

# Reading with DictReader
with open(path, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        value = row["column_name"]     # always a string — convert if needed

# Writing with DictWriter
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["col1", "col2"], extrasaction="ignore")
    writer.writeheader()
    writer.writerows(list_of_dicts)

# Appending a row
with open(path, "a", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["col1", "col2"])
    writer.writerow(one_dict)

Chapter Summary

Files are how Python programs communicate with the world beyond a single run. This chapter gave you the complete toolkit:

Paths and pathlib: Use pathlib.Path objects instead of string concatenation. The / operator builds paths that work on any OS. Key methods: exists(), mkdir(), glob(), rglob(), iterdir(), stat().

open() and context managers: Always use with open(...) as f:. Always specify encoding="utf-8". Choose your mode deliberately: 'r' to read, 'w' to overwrite, 'a' to append.

Reading text files: .read() for small files you need whole, direct iteration for large files, .readlines() when you need a list of lines. Strip \n with .strip() or .rstrip().

Writing text files: .write() for single strings (include your own \n), .writelines() for iterables of strings. Use mode 'a' for append-based logs.

csv module: Always pass newline="" when opening CSV files. Use csv.DictReader and csv.DictWriter — they are more readable and more resilient than the list-based counterparts. Remember that all CSV values arrive as strings — convert numeric fields explicitly.

json module: json.load() / json.dump() for files; json.loads() / json.dumps() for strings. Use indent=2 for human-readable output. JSON is the right format for config files and structured metadata.

Real-world patterns: bulk processing with glob(), append logging, config files, and safe atomic writes.

Priya can now automate her Monday consolidation in minutes rather than hours. Maya has a project log that tracks her time, flags scope creep, and summarizes her earnings — all without touching a spreadsheet application. Both of them are working with Python the way professionals do.


What's Next

Chapter 10 introduces exception handling in depth — the try/except/finally structure, custom exception classes, and how to build programs that fail gracefully rather than crashing into a traceback. We will build directly on this chapter, because file operations are among the most common places where things go wrong in production code.