Chapter 9 Exercises: File I/O — Reading and Writing Business Data

These exercises are organized into five tiers of increasing complexity. Complete each tier before advancing. All exercises use the characters, scenarios, and file formats introduced in Chapter 9.


Tier 1: Foundations (Exercises 1–4)

Core mechanics — file modes, context managers, reading and writing text


Exercise 1: Your First File, Written and Read Back

Scenario: You are writing a utility to generate a simple business memo as a text file.

Task:

Write a Python script that does the following in order:

  1. Creates a folder called output/ if it does not already exist (use pathlib)
  2. Writes a text file called output/business_memo.txt with at least 8 lines of content. Include:
     - A "To:" line
     - A "From:" line
     - A "Date:" line (use today's date — you can hardcode it for this exercise)
     - A blank line
     - At least four lines of memo body text about a fictional business topic
  3. Reads the file back using .read() and prints the entire contents to the console
  4. Prints the file's size in bytes using pathlib's .stat() method

Constraints:
  - Use a with statement for both the write and the read
  - Specify encoding="utf-8" explicitly
  - Use pathlib.Path to construct the file path (no string concatenation)

Expected output: The memo text printed to the console, followed by a line like:

File size: 312 bytes
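
If you are unsure where to start, the core round trip (write, read back, check size) follows this minimal sketch; the memo lines themselves are left for you to fill in:

    from pathlib import Path

    output_dir = Path("output")
    output_dir.mkdir(exist_ok=True)            # create output/ if it does not exist
    memo_path = output_dir / "business_memo.txt"

    with open(memo_path, "w", encoding="utf-8") as f:
        f.write("To: ...\n")                   # ...plus the rest of the memo lines

    with open(memo_path, "r", encoding="utf-8") as f:
        print(f.read())

    print(f"File size: {memo_path.stat().st_size} bytes")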

Exercise 2: The Line Counter

Scenario: Priya needs a quick utility to count lines, words, and characters in any text report file.

Task:

Write a function called count_file_contents(file_path) that:

  1. Takes a pathlib.Path as its argument
  2. Opens the file and reads it line by line using direct iteration (not .readlines())
  3. Counts and returns a dictionary with three keys:
     - "lines" — total number of lines
     - "words" — total number of words (split on whitespace)
     - "characters" — total number of characters (including spaces and newlines)
  4. Handles a missing file gracefully: if the file does not exist, print an error message and return {"lines": 0, "words": 0, "characters": 0}

Then write a main block that:
  - Creates a sample text file with at least 10 lines
  - Calls your function on it
  - Prints the result in a formatted way

Stretch goal: Add a "non_blank_lines" key that counts only lines with at least one non-whitespace character.
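
The heart of the function is the direct-iteration loop. A minimal sketch, assuming the dictionary shape described in step 3:

    from pathlib import Path

    def count_file_contents(file_path):
        counts = {"lines": 0, "words": 0, "characters": 0}
        if not file_path.exists():                       # step 4: missing-file guard
            print(f"Error: file not found: {file_path}")
            return counts
        with open(file_path, "r", encoding="utf-8") as f:
            for line in f:                               # direct iteration, not .readlines()
                counts["lines"] += 1
                counts["words"] += len(line.split())
                counts["characters"] += len(line)        # includes spaces and the newline
        return counts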


Exercise 3: The Append Log

Scenario: Marcus Webb wants a simple log file that records each time a script runs.

Task:

Write a function called log_script_run(script_name, status, message) that:

  1. Appends one line to logs/run_history.log
  2. Creates the logs/ directory if it does not exist
  3. Each line should follow this format:

     2024-04-01T08:52:14 [SUCCESS] weekly_consolidation.py Processed 847 records

  4. Uses datetime.now().isoformat(timespec="seconds") for the timestamp

Then simulate three separate script runs by calling log_script_run() three times with different arguments (including at least one "ERROR" status call). After the three calls, read the log file back and print its contents to verify all three entries are present.

Key concept to practice: Each call to log_script_run() should open the file, write one line, and close it — not hold the file open across all three calls.
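
A sketch of the open-once-per-call pattern; note that the with block lives inside the function, so each call opens, writes, and closes the file:

    from datetime import datetime
    from pathlib import Path

    def log_script_run(script_name, status, message):
        log_dir = Path("logs")
        log_dir.mkdir(exist_ok=True)                     # step 2: create logs/ if needed
        timestamp = datetime.now().isoformat(timespec="seconds")
        with open(log_dir / "run_history.log", "a", encoding="utf-8") as f:
            f.write(f"{timestamp} [{status}] {script_name} {message}\n")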


Exercise 4: File Existence Checker

Scenario: Before Priya runs her weekly consolidation, she wants to verify all four regional report files are present.

Task:

Write a function called check_required_files(file_paths) that:

  1. Takes a list of pathlib.Path objects
  2. Checks whether each file exists using .exists()
  3. Returns a dictionary with two keys:
     - "found" — list of Path objects for files that exist
     - "missing" — list of Path objects for files that do not exist
  4. Prints a clear status line for each file: either "[FOUND] filename.csv" or "[MISSING] filename.csv"

Create a list of four fictitious regional CSV paths (you only need to actually create two of them on disk — leave the other two missing). Call your function and print the summary counts.

Stretch goal: Also check file size for files that exist, and flag any that are suspiciously small (under 100 bytes) as "[EMPTY?]".
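
A sketch of the existence check, assuming the two-key dictionary described in step 3:

    from pathlib import Path

    def check_required_files(file_paths):
        result = {"found": [], "missing": []}
        for path in file_paths:
            if path.exists():
                print(f"[FOUND] {path.name}")
                result["found"].append(path)
            else:
                print(f"[MISSING] {path.name}")
                result["missing"].append(path)
        return result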


Tier 2: CSV Fundamentals (Exercises 5–8)

Reading and writing CSV files with the csv module


Exercise 5: Client Contact List

Scenario: Maya needs to maintain a CSV of her client contacts.

Task:

  1. Create a list of at least 6 client contact dictionaries, each with these keys: client_name, contact_person, email, phone, city, industry
  2. Write them to data/clients.csv using csv.DictWriter:
     - Specify fieldnames explicitly
     - Call writer.writeheader() before writing rows
  3. Read the file back using csv.DictReader and print each row in this format:

     Hartwell & Sons | Jennifer Walsh | jennifer@hartwell.com | Boston

  4. Count and print the number of clients in each industry

Key concept: Verify that the column order in the output CSV matches your fieldnames list, regardless of the order in which the keys appear in your dictionaries.
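
If the DictWriter mechanics are still new, the write step follows this pattern (write_clients is an illustrative helper name, not required by the exercise):

    import csv

    FIELDNAMES = ["client_name", "contact_person", "email", "phone", "city", "industry"]

    def write_clients(csv_path, clients):
        # column order in the file follows FIELDNAMES, not dict insertion order
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
            writer.writeheader()
            writer.writerows(clients)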


Exercise 6: Type Conversion and Validation

Scenario: Priya receives a CSV of expense reports. Some rows have bad data that must be filtered out.

Task:

Write a function called load_and_validate_expenses(file_path) that:

  1. Reads a CSV with columns: expense_id, employee_name, department, amount, category, date
  2. For each row, attempts to convert amount to a float
  3. Skips any row where:
     - amount cannot be converted (print a warning with the row number)
     - amount is negative (print a warning)
     - employee_name is blank
  4. Returns the list of valid records with amount as a float

Create a sample CSV file with at least 10 rows, including 2–3 intentionally invalid rows. Call your function and print:
  - How many rows were loaded
  - How many were skipped and why
  - The total amount of all valid expense records

Stretch goal: Write the valid records to a new output/cleaned_expenses.csv file.
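
One possible shape for the validation loop; starting enumerate() at 2 is one way to report human-friendly row numbers, since row 1 of the file is the header:

    import csv

    def load_and_validate_expenses(file_path):
        valid = []
        with open(file_path, "r", newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            for row_num, row in enumerate(reader, start=2):
                try:
                    amount = float(row["amount"])
                except ValueError:
                    print(f"Row {row_num}: cannot convert amount {row['amount']!r}, skipping")
                    continue
                if amount < 0:
                    print(f"Row {row_num}: negative amount, skipping")
                    continue
                if not row["employee_name"].strip():
                    print(f"Row {row_num}: blank employee_name, skipping")
                    continue
                row["amount"] = amount                   # store the converted float
                valid.append(row)
        return valid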


Exercise 7: Filtered Report Writer

Scenario: Sandra wants a CSV containing only the sales reps who hit their quota this quarter.

Task:

Using the SAMPLE_ROWS data from csv_handler.py (or create your own similar dataset), write a complete script that:

  1. Reads the full sales data CSV into a list of dicts using csv.DictReader
  2. Converts revenue and quota to floats
  3. Calculates quota_attainment_pct for each rep as (revenue / quota) * 100
  4. Writes a new CSV called output/quota_achievers.csv containing only reps who achieved >= 100% quota
  5. Sorts the output by quota_attainment_pct descending

The output CSV must include these columns: rep_name, region, product_line, revenue, quota, quota_attainment_pct

Print to the console: how many reps made quota out of the total.
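
A sketch of the filter-sort-write step, assuming quota_attainment_pct has already been computed as a float on each dict (write_quota_achievers is an illustrative name; extrasaction="ignore" is one way to drop any columns not in the fieldnames list):

    import csv

    FIELDS = ["rep_name", "region", "product_line", "revenue", "quota", "quota_attainment_pct"]

    def write_quota_achievers(out_path, reps):
        # keep only reps at or above 100% quota, highest attainment first
        achievers = sorted(
            (r for r in reps if r["quota_attainment_pct"] >= 100),
            key=lambda r: r["quota_attainment_pct"],
            reverse=True,
        )
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(achievers)
        return len(achievers)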


Exercise 8: Appending New Records

Scenario: At the end of each week, Maya adds new time entries to her project log.

Task:

Write a function called append_time_entry(csv_path, entry_dict) that:

  1. Reads the existing CSV to get the current fieldnames (from the header row)
  2. Opens the file in append mode ("a") using csv.DictWriter
  3. Writes the new entry without writing the header again
  4. Validates that the entry contains all required fields before appending — if a required field is missing, raise a ValueError with a clear message

Create a sample CSV with 3 rows and valid headers. Then call append_time_entry() three times with different entries. Read the file back at the end and confirm it has 6 rows (3 original + 3 appended).

Important: Remember that appending CSV rows requires newline="" just like writing.
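
A sketch of the read-header-then-append pattern described above:

    import csv

    def append_time_entry(csv_path, entry_dict):
        # learn the expected columns from the existing header row
        with open(csv_path, "r", newline="", encoding="utf-8") as f:
            fieldnames = csv.DictReader(f).fieldnames
        missing = [name for name in fieldnames if name not in entry_dict]
        if missing:
            raise ValueError(f"Entry is missing required fields: {missing}")
        with open(csv_path, "a", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writerow(entry_dict)                  # no writeheader() in append mode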


Tier 3: JSON and Pathlib (Exercises 9–12)

Working with JSON configuration, pathlib operations, and directory processing


Exercise 9: Configuration File System

Scenario: Priya wants the consolidation script to read its settings from a JSON config file instead of hardcoding values.

Task:

  1. Create a JSON config file at config/consolidation_settings.json with at least these keys:

     {
       "input_directory": "data/regional_reports",
       "output_directory": "output/consolidated",
       "required_columns": ["rep_id", "rep_name", "region", "revenue", "quota"],
       "over_quota_threshold_pct": 110,
       "report_title": "Q1 Consolidated Sales"
     }

  2. Write a function load_config(config_path) that reads this JSON and returns the parsed dict. If the file is missing, it should raise a FileNotFoundError with a helpful message.
  3. Write a function save_config(config_path, config_dict) that writes an updated config back to the file with indent=2
  4. Demonstrate: load the config, change over_quota_threshold_pct to 105, save it back, reload it, and confirm the change persisted

Stretch goal: Add a "last_run" key that gets updated with the current timestamp each time the config is loaded. This creates a simple "last accessed" audit trail.
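
The load/save pair can be as small as this sketch:

    import json
    from pathlib import Path

    def load_config(config_path):
        config_path = Path(config_path)
        if not config_path.exists():
            raise FileNotFoundError(f"Config file not found: {config_path}")
        with open(config_path, "r", encoding="utf-8") as f:
            return json.load(f)

    def save_config(config_path, config_dict):
        with open(config_path, "w", encoding="utf-8") as f:
            json.dump(config_dict, f, indent=2)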


Exercise 10: Directory Scanner

Scenario: Marcus needs a script that scans a folder and produces a report of all files found, organized by extension.

Task:

Write a function called scan_directory(directory_path) that:

  1. Takes a pathlib.Path as input (raises FileNotFoundError if it does not exist)
  2. Uses .iterdir() to examine every item in the directory (non-recursive)
  3. Returns a dictionary where:
     - Keys are file extensions (e.g., ".csv", ".json", ".txt")
     - Values are lists of tuples: (filename, size_in_bytes)
     - Files with no extension go under the key "(no extension)"
     - Directories are counted separately under a "(directories)" key
  4. Prints a formatted summary grouped by extension

Create a test directory with at least 6 files of mixed types. Call your function and print the result.

Stretch goal: Accept an optional pattern argument (e.g., "*.csv") and use .glob() instead of .iterdir() when a pattern is provided.
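
A sketch of the grouping loop using .iterdir(), .suffix, and .stat(); storing directory names without sizes is one reasonable reading of the spec:

    from pathlib import Path

    def scan_directory(directory_path):
        if not directory_path.exists():
            raise FileNotFoundError(f"Directory not found: {directory_path}")
        report = {}
        for item in directory_path.iterdir():            # non-recursive
            if item.is_dir():
                report.setdefault("(directories)", []).append(item.name)
            else:
                key = item.suffix if item.suffix else "(no extension)"
                report.setdefault(key, []).append((item.name, item.stat().st_size))
        return report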


Exercise 11: Bulk File Renamer

Scenario: Priya receives regional report files named inconsistently: North Q1.csv, south_report_q1.csv, EAST-Q1-SALES.csv. She wants to normalize them all to north_q1.csv style.

Task:

Write a function called normalize_filename(original_name) that:

  1. Converts the filename to lowercase
  2. Replaces all spaces and hyphens with underscores
  3. Removes any characters that are not letters, digits, underscores, or dots

Then write a function called batch_rename_files(directory_path, dry_run=True) that:

  1. Scans all CSV files in the directory
  2. Computes the normalized name for each
  3. If dry_run=True, prints what would be renamed without actually doing it
  4. If dry_run=False, performs the rename using pathlib's .rename() method and logs each rename

Create 5 test files with messy names, run in dry-run mode first, then run for real, and confirm the results.

Key constraint: Before renaming, check whether the normalized name would collide with an existing file. If so, skip that file and print a warning.
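
One way to implement normalize_filename() is str methods plus a regular expression for the final character filter (re is not required, but it keeps step 3 short):

    import re

    def normalize_filename(original_name):
        name = original_name.lower()
        name = name.replace(" ", "_").replace("-", "_")
        return re.sub(r"[^a-z0-9_.]", "", name)          # keep letters, digits, _, and .

With this sketch, "North Q1.csv" becomes "north_q1.csv" and "EAST-Q1-SALES.csv" becomes "east_q1_sales.csv".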


Exercise 12: JSON Earnings Loader

Scenario: Maya wants to compare her earnings across multiple weeks by reading a folder of JSON summary files and producing a trend report.

Task:

  1. Create at least 3 JSON earnings summary files in data/earnings_history/, named earnings_2024_week_10.json, earnings_2024_week_11.json, earnings_2024_week_12.json. Each should have at least: week, total_hours, total_earnings, active_projects_count.
  2. Write a function load_earnings_history(directory_path) that:
     - Uses .glob("earnings_*.json") to find all matching files
     - Loads each JSON file
     - Returns a list of dicts sorted by week
  3. Write a function print_earnings_trend(history) that prints a simple week-by-week trend table showing earnings and hours, plus the change from the previous week

Stretch goal: Calculate and print the average earnings per week and flag any week that is more than 20% above or below the average.
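
A sketch of the loader; sorting the glob results keeps file order deterministic, and the final sort keys on each summary's week value:

    import json

    def load_earnings_history(directory_path):
        history = []
        for path in sorted(directory_path.glob("earnings_*.json")):
            with open(path, "r", encoding="utf-8") as f:
                history.append(json.load(f))
        return sorted(history, key=lambda summary: summary["week"])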


Tier 4: Integration Challenges (Exercises 13–15)

Combining file I/O with business logic in complete mini-programs


Exercise 13: Priya's Region Consolidator (Reduced Version)

Scenario: Replicate the core of the Case Study 1 consolidation by building it from scratch.

Task:

Build a complete script (without looking at the case study code) that:

  1. Creates four sample regional CSV files in data/test_regions/, each with 5 rows and columns: region, rep_name, revenue, quota
  2. Reads all four files using pathlib.glob("*.csv")
  3. Validates that each file has the required columns — skip and log any that fail validation
  4. Combines all valid records into one list
  5. Writes the combined list to output/combined_regions.csv using csv.DictWriter
  6. Writes a JSON metadata file with: total_files, total_records, files_processed (list), generated_at

Quality criteria:
  - Uses context managers for all file operations
  - Handles a missing or empty directory gracefully
  - The combined CSV has rows sorted by region, then rep_name
  - All numeric fields in the output CSV are formatted consistently
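
For step 3, comparing the header against a set of required column names is a compact validation check, as in this sketch (read_region_file is an illustrative name):

    import csv

    REQUIRED = {"region", "rep_name", "revenue", "quota"}

    def read_region_file(csv_path):
        # returns the rows of one regional file, or None if validation fails
        with open(csv_path, "r", newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            if not REQUIRED.issubset(reader.fieldnames or []):
                print(f"Skipping {csv_path.name}: missing required columns")
                return None
            return list(reader)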


Exercise 14: Maya's Weekly Time Report

Scenario: Maya wants to generate a weekly earnings report from her project log.

Task:

Using the project log CSV format from maya_project_log.py, write a complete script that:

  1. Loads the project log (create a sample file with at least 8 projects, mix of statuses)
  2. Calculates for each project: actual_earnings, projected_earnings, hours_over_under
  3. Groups projects by status and calculates subtotals for each group
  4. Writes a formatted weekly report to output/weekly_report_YYYY_MM_DD.txt (use today's date in the filename) — this should be a human-readable text file, not CSV
  5. The text report should include: a header with the date, a table of active projects, subtotals per status group, and a total earnings line
  6. Also writes the raw data to output/weekly_report_data.csv for use in other tools

The text report should look something like:

MAYA REYES — WEEKLY EARNINGS REPORT
Generated: 2024-04-01
=====================================

ACTIVE PROJECTS (7)
--------------------
Hartwell & Sons  / Financial Dashboard   38.5h / 40.0h est.   $6,737.50
...

SUBTOTALS
  Active    : $28,525.00  (7 projects)
  Completed : $19,162.50  (2 projects)
  ...

TOTAL EARNINGS TO DATE: $51,362.50
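
For the aligned columns in the table, f-string width and alignment specifiers do most of the work. A tiny illustration with hardcoded sample values:

    client, project = "Hartwell & Sons", "Financial Dashboard"
    hours, est_hours, earnings = 38.5, 40.0, 6737.50

    # <17 and <22 left-align the text columns within fixed widths; >5.1f right-aligns the hours
    print(f"{client:<17}/ {project:<22}{hours:>5.1f}h / {est_hours:.1f}h est.   ${earnings:,.2f}")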

Exercise 15: The Master Batch Processor

Scenario: Acme Corp receives CSV files from multiple vendors throughout the week. Each file has different columns, but all share vendor_id, invoice_number, amount, and date. Marcus needs a script that processes them all and writes one unified invoice register.

Task:

  1. Create 4 vendor CSV files in data/vendor_invoices/. Each should have the required columns plus 2–3 vendor-specific extra columns.
  2. Write a process_vendor_file(file_path) function that:
     - Reads the file
     - Extracts only the four required columns
     - Converts amount to float and date to YYYY-MM-DD format (assume dates may come in as MM/DD/YYYY)
     - Returns valid records and a list of any skipped rows with reasons
  3. Write a main function that:
     - Processes all files using .rglob("*.csv")
     - Collects all valid records across all files
     - Writes the unified register to output/invoice_register.csv, sorted by date then vendor_id
     - Writes a processing log to logs/vendor_processing.log using append mode, with one entry per file processed
     - Writes a summary JSON at output/invoice_summary.json with total count and total amount

Quality criteria:
  - extrasaction="ignore" on all DictWriters
  - The date normalization handles both MM/DD/YYYY and YYYY-MM-DD gracefully
  - Any file that fails completely (unreadable, missing required columns) is logged and skipped, but does not stop the rest of processing
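
For the date normalization, trying each accepted format in turn with datetime.strptime is one robust approach (normalize_date is an illustrative name):

    from datetime import datetime

    def normalize_date(raw):
        # accept MM/DD/YYYY or YYYY-MM-DD; always return YYYY-MM-DD
        for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
            try:
                return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
            except ValueError:
                continue
        raise ValueError(f"Unrecognized date format: {raw!r}")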


Tier 5: Stretch Challenges (Exercises 16–17)

Open-ended problems for learners ready to go beyond the chapter


Exercise 16: File Watcher Simulation

Scenario: Design a "polling" file processor that simulates what a real-time file watcher would do.

Task:

Create a script that:

  1. Maintains a JSON "state file" at state/processed_files.json tracking which files have already been processed (stored as a dict mapping filename to timestamp)
  2. On each run, scans a data/incoming/ directory for new CSV files
  3. Processes only files that are NOT in the state file (new files since the last run)
  4. After processing each new file (just read it and count its rows), adds it to the state file with the current timestamp
  5. Writes a summary of new vs. already-seen files to the console

Simulate three "runs" by manually dropping new files into data/incoming/ between runs. Confirm that the second run only processes files added since the first run.

This exercise teaches: idempotent processing, state management with JSON, and the foundation of real ETL (Extract, Transform, Load) pipelines.
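
A sketch of the state-file half of the exercise; the processing half is up to you (load_state and save_state are illustrative names):

    import json
    from pathlib import Path

    STATE_PATH = Path("state/processed_files.json")

    def load_state():
        # returns {filename: timestamp}; an empty dict means nothing processed yet
        if STATE_PATH.exists():
            with open(STATE_PATH, "r", encoding="utf-8") as f:
                return json.load(f)
        return {}

    def save_state(state):
        STATE_PATH.parent.mkdir(exist_ok=True)
        with open(STATE_PATH, "w", encoding="utf-8") as f:
            json.dump(state, f, indent=2)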


Exercise 17: CSV Diff Tool

Scenario: Priya wants to know what changed between this week's consolidated report and last week's.

Task:

Write a csv_diff(old_path, new_path, key_column) function that:

  1. Reads both CSV files into lists of dicts
  2. Uses the key_column value (e.g., "rep_id") to match rows between the two files
  3. Identifies and returns:
     - "added": rows in new_path but not in old_path (by key)
     - "removed": rows in old_path but not in new_path (by key)
     - "changed": rows present in both but with different values in at least one column (report which columns changed)
     - "unchanged": rows identical in both
  4. Writes a diff report to output/csv_diff_report.txt in a human-readable format

Create two versions of a sales CSV (old and new) where some reps' revenue changed, one rep was added, and one was removed. Call your function and verify the output correctly identifies all four categories.

This is a genuine professional tool — variations of this are used in data pipeline validation everywhere.
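
For the matching logic in steps 2 and 3, dictionaries keyed on key_column plus set operations on their key views give you added, removed, and shared keys directly, as in this sketch (diff_keys is an illustrative helper; comparing the shared rows column by column then splits them into changed and unchanged):

    def diff_keys(old_rows, new_rows, key_column):
        old_by_key = {row[key_column]: row for row in old_rows}
        new_by_key = {row[key_column]: row for row in new_rows}
        added = new_by_key.keys() - old_by_key.keys()    # set math on dict key views
        removed = old_by_key.keys() - new_by_key.keys()
        shared = old_by_key.keys() & new_by_key.keys()
        return added, removed, shared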


Answer Notes

Exercise answers and worked solutions are provided in the instructor supplement. For self-study learners: the case studies and code files in this chapter contain all the patterns needed to complete these exercises. If you find yourself looking up a specific method, that is a sign you are engaging correctly with the material — not a sign you are doing it wrong.

Recommended time estimates per tier:
  - Tier 1: 15–20 minutes per exercise
  - Tier 2: 25–35 minutes per exercise
  - Tier 3: 35–45 minutes per exercise
  - Tier 4: 60–90 minutes per exercise
  - Tier 5: 90–120 minutes per exercise