Exercises — Chapter 17: Automating Repetitive Office Tasks

Starred exercises (*) have worked solutions in Appendix B.


Tier 1: Recall

1.1 What is the automation audit? Describe the three dimensions of the Time × Frequency × Consistency matrix and explain what each dimension measures.

1.2 What is the difference between os and shutil? Give one example of an operation you would use each module for.

1.3 ★ Write a code snippet using pathlib that:
- Creates a Path object for /data/reports/2024-01
- Creates all necessary directories (including parents) if they don't exist
- Builds the full path to a file called summary.xlsx inside that directory

1.4 What does shutil.move() do? How is it different from shutil.copy2()?

1.5 Explain the purpose of the exist_ok=True parameter when calling Path.mkdir(). What happens without it if the directory already exists?
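
For reference while answering, here is a minimal demonstration of both behaviors, run against a throwaway temp directory (the directory names are illustrative):

```python
import tempfile
from pathlib import Path

base = Path(tempfile.mkdtemp())   # throwaway scratch directory
target = base / "reports"

target.mkdir()                    # first call: creates the directory
target.mkdir(exist_ok=True)       # second call: silently does nothing

try:
    target.mkdir()                # without exist_ok, an existing dir raises
except FileExistsError:
    print("FileExistsError raised")
```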

1.6 ★ Write a one-line expression using pathlib glob to find all .pdf files in a directory called /reports and its subdirectories.

1.7 What is watchdog and what problem does it solve? How is it different from writing a loop that checks a folder every few seconds?
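
For contrast with watchdog, one step of the naive polling loop can be sketched like this (new_files_since is an illustrative helper, not from the chapter):

```python
import tempfile
from pathlib import Path

def new_files_since(folder: Path, seen: set) -> tuple:
    """One step of a polling loop: diff the current listing against a
    previous snapshot. A real loop would call this every few seconds,
    wasting cycles and reacting only as fast as the poll interval."""
    current = {p.name for p in folder.iterdir() if p.is_file()}
    return sorted(current - seen), current

# demo: two polls around a file arriving
inbox = Path(tempfile.mkdtemp())
_, snapshot = new_files_since(inbox, set())
(inbox / "invoice.pdf").write_text("placeholder")
arrived, snapshot = new_files_since(inbox, snapshot)
```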

1.8 What is a "dry run" in the context of file automation scripts? Why is it a professional best practice to implement one?
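
As a starting point for your answer, a common dry-run pattern looks like this (a minimal sketch; move_file is an illustrative name):

```python
import shutil
import tempfile
from pathlib import Path

def move_file(source: Path, dest: Path, dry_run: bool = True) -> str:
    """Move a file, or only describe the move when dry_run is True."""
    action = f"MOVE {source.name} -> {dest.name}"
    if dry_run:
        return "[DRY RUN] " + action   # report only; nothing touches disk
    shutil.move(str(source), str(dest))
    return action

# demo on a throwaway temp directory
base = Path(tempfile.mkdtemp())
src = base / "report.pdf"
src.write_text("placeholder")
print(move_file(src, base / "renamed.pdf", dry_run=True))
```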


Tier 2: Apply

2.1 ★ Write a function list_files_by_extension(folder: Path) -> dict that:
- Takes a folder path
- Returns a dictionary mapping each file extension to a list of filenames with that extension
- For example: {".csv": ["sales.csv", "inventory.csv"], ".xlsx": ["report.xlsx"]}
- Skips directories and hidden files

2.2 Write a function get_largest_files(folder: Path, n: int = 10) -> list that returns the n largest files in a folder, sorted by size (largest first). Each item in the list should be a dict with name, size_bytes, and size_mb keys.

2.3 ★ Using the resolve_naming_conflict pattern from the chapter, write a function safe_copy(source: Path, destination_dir: Path) -> Path that:
- Copies a file to the destination directory
- If a file with the same name already exists, adds a numeric suffix (e.g., report_1.pdf, report_2.pdf)
- Returns the path of the created copy

2.4 Write a script that generates a summary report of a given folder. The report should print:
- Total number of files
- Total size in MB
- Breakdown by extension (count and total size for each)
- Oldest and newest file (by modification date)

2.5 ★ Write a function create_dated_folder(base_dir: Path, date: datetime.date = None) -> Path that creates a subdirectory in base_dir named with the given date in YYYY/MM/DD format (nested). For example, calling it with 2024-01-15 would create base_dir/2024/01/15/. Defaults to today's date.

2.6 Write a function standardize_filename(name: str) -> str that:
- Converts the name to lowercase
- Replaces spaces and any character that is not alphanumeric or underscore with an underscore
- Collapses multiple consecutive underscores to one
- Strips leading and trailing underscores

Test it on at least five example filenames including ones with parentheses, hyphens, and multiple spaces.


Tier 3: Analyze

3.1 ★ Consider the following two approaches to checking if a file exists before moving it:

# Approach A
if os.path.exists(source_path):
    shutil.move(source_path, destination_path)

# Approach B
source = Path(source_path)
if source.exists():
    shutil.move(str(source), str(destination_path))

What are the functional differences, if any? Which is preferable and why? Is there a subtle bug or race condition that exists in both approaches?
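
One way to reason about the race condition: both approaches check, then act, and the file can vanish in between. An EAFP alternative (attempt the move, handle the failure) closes that gap; the sketch below uses an illustrative function name:

```python
import shutil
import tempfile
from pathlib import Path

def move_if_present(source: Path, dest: Path) -> bool:
    """Attempt the move directly instead of calling exists() first;
    a pre-check can go stale before the move actually runs."""
    try:
        shutil.move(str(source), str(dest))
        return True
    except FileNotFoundError:
        return False

# demo: a missing source is handled, not pre-checked
base = Path(tempfile.mkdtemp())
moved_missing = move_if_present(base / "ghost.txt", base / "x.txt")
```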

3.2 The chapter uses shutil.move() extensively. Research what happens when shutil.move() is called across different file system volumes (e.g., from an SSD to a network drive). Does it always do the same thing internally? What are the implications for performance in a business context?

3.3 The folder_watcher.py code includes a SETTLE_TIME_SECONDS = 2.0 constant and a time.sleep() call. Why is this necessary? What problem does it solve, and what would happen without it?
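
To frame your answer, consider what a settle delay guards against: a file still being written changes size between polls. A size-stability sketch (wait_until_settled is an illustrative name, not the chapter's code):

```python
import tempfile
import time
from pathlib import Path

def wait_until_settled(path: Path, settle: float = 2.0,
                       poll: float = 0.5, timeout: float = 30.0) -> bool:
    """Return True once the file size has stopped changing for `settle`
    seconds, a sign the writing process has (probably) finished."""
    deadline = time.monotonic() + timeout
    last_size = -1
    stable_since = time.monotonic()
    while time.monotonic() < deadline:
        size = path.stat().st_size
        if size != last_size:            # still growing: reset the clock
            last_size = size
            stable_since = time.monotonic()
        elif time.monotonic() - stable_since >= settle:
            return True
        time.sleep(poll)
    return False

# demo with a file that is already complete
done = Path(tempfile.mkdtemp()) / "incoming.csv"
done.write_text("id,amount\n1,100\n")
```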

3.4 ★ The chapter's archive_old_exports() function uses modification time (st_mtime) to determine whether a file is "old." What are the limitations of this approach? Give two scenarios where st_mtime might not reflect what you actually want, and describe alternative approaches for each.
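
One alternative worth weighing in your answer: when export filenames embed a date stamp, parse that instead of trusting st_mtime (which copy and sync tools can reset). A hedged sketch with an illustrative function name:

```python
import re
from datetime import date
from typing import Optional

def date_from_name(filename: str) -> Optional[date]:
    """Extract a YYYY-MM-DD stamp from a filename such as
    'export_2024-01-15.csv'; returns None when no valid date is found."""
    match = re.search(r"(\d{4})-(\d{2})-(\d{2})", filename)
    if not match:
        return None
    try:
        return date(*(int(g) for g in match.groups()))
    except ValueError:   # e.g. month 13: digits matched, but not a real date
        return None
```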

3.5 Analyze the automation ROI formula from Section 17.1. What are its hidden assumptions? Identify at least three factors the formula doesn't account for (e.g., maintenance time, initial testing time, error scenarios). How would you modify the calculation to be more accurate?
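
The exact formula is in Section 17.1; as a stand-in while working the exercise, a common break-even form is build cost divided by time saved per run (an assumption for illustration, not the chapter's formula):

```python
def break_even_runs(build_hours: float, minutes_saved_per_run: float) -> float:
    """Runs needed before the automation pays for itself. Deliberately
    naive: it ignores maintenance, testing, and failure handling, which
    is exactly what this exercise asks you to critique."""
    return build_hours * 60 / minutes_saved_per_run

# 2 hours to build, 10 minutes saved per run -> pays off after 12 runs
```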


Tier 4: Synthesize

4.1 Build a project_initializer.py script that, given a project name and type (from a configurable list like "analytics", "consulting", "audit"), creates the appropriate folder structure with correct subfolders, a README.md with project metadata, and a status.txt file set to "active". The folder structure should differ by project type (e.g., analytics projects have data/raw/, data/processed/, notebooks/ subdirectories; consulting projects have deliverables/, client_materials/, invoices/).

4.2 ★ Build a download_organizer.py script that organizes a user's Downloads folder:
- Moves image files to ~/Pictures/from_downloads/YYYY-MM/
- Moves PDF files to ~/Documents/PDFs/YYYY-MM/
- Moves spreadsheets and CSV files to ~/Documents/Data/YYYY-MM/
- Archives any file older than 90 days to ~/Downloads/_archive/YYYY/
- Generates a summary log of what was moved
- Fully supports dry-run mode

4.3 Create a folder_health_report.py script for Marcus at Acme Corp. Given a root directory, it should:
- Walk all subdirectories recursively
- Identify folders that have had no file modifications in more than 180 days (candidates for archiving)
- Identify folders that are completely empty
- Identify files larger than a configurable threshold (default: 100 MB) that might be worth archiving
- Output a formatted report to both console and a text file

4.4 Extend the bulk_renamer.py from the chapter to support a --undo flag. The script should save a JSON file called .rename_log.json in the source directory that records every rename operation. When --undo is passed, it reads this log and reverses all the renames.


Tier 5: Challenge

5.1 (Research and Build) The chapter briefly covers Windows Task Scheduler and cron for scheduling. Research and implement a Python-based scheduling approach using the schedule library (pip install schedule). Build a task_runner.py that:
- Runs morning_organizer.py equivalent logic at 7 AM on weekdays
- Logs success/failure to a daily file
- Keeps itself running indefinitely (runs in the background)
- Sends a Windows desktop notification on failure (using the plyer library)

Compare this approach to using the operating system's native scheduler. When would you prefer each approach?

5.2 (Open-Ended) The chapter focuses on local file operations. Research and write a 500-word analysis comparing the pathlib/shutil approach to using cloud storage APIs (specifically Google Drive API or Microsoft Graph API for OneDrive). What changes when files are in the cloud? What stays the same? For a business where all files are in SharePoint, how would you rewrite the morning_organizer.py case study?

5.3 (Build) Create a complete file_deduplicator.py that:
- Scans a directory tree for duplicate files (files with identical content, regardless of name)
- Uses file hashing (MD5 or SHA-256) to identify duplicates efficiently — do NOT compare files byte-by-byte
- Groups duplicates and presents them for review, showing which to keep and which to remove
- Supports a safe mode that moves duplicates to a staging area rather than deleting
- Handles large files efficiently (hash in chunks, not all at once)
- Reports total disk space that would be recovered
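
The chunked-hashing requirement can be sketched as follows (a minimal sketch; hash_file is an illustrative name):

```python
import hashlib
import tempfile
from pathlib import Path

def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in 1 MiB chunks so a multi-gigabyte
    file never has to fit in memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# demo: identical content hashes identically, regardless of filename
base = Path(tempfile.mkdtemp())
(base / "a.bin").write_bytes(b"x" * 4096)
(base / "copy of a.bin").write_bytes(b"x" * 4096)
```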