In This Chapter
- Opening Scenario: The 7 AM File Shuffle
- 17.1 The Automation Audit: What Should You Actually Automate?
- 17.2 The os and shutil Modules
- 17.3 pathlib: The Modern Way to Handle Paths
- 17.4 Organizing Files into Folders
- 17.5 Bulk Renaming
- 17.6 Copying, Moving, and Archiving
- 17.7 Generating Folder Structures Automatically
- 17.8 Folder Watching with watchdog
- 17.9 A Complete Monthly File Organizer
- 17.10 Running Scripts on a Schedule
- Summary
Chapter 17: Automating Repetitive Office Tasks
"The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency." — Bill Gates
Opening Scenario: The 7 AM File Shuffle
Every morning at 7:15, Marcus Webb opens four different folders on his workstation and spends the next twenty minutes doing the same thing he did yesterday morning, and the morning before that.
He opens the data exports folder from the overnight ETL process. He manually checks whether the four regional CSV files are present — Chicago, Nashville, Cincinnati, and St. Louis. If they are, he moves them to the processing folder and updates a shared Excel log. If any are missing, he sends a Slack message to the regional IT contacts.
Then he checks the downloads folder on the shared drive where sales reps drop their daily call reports. He moves each file into a subfolder named for the rep, renames it with the date, and archives any files older than two weeks into a ZIP file.
The whole operation takes twenty-five minutes. Every single weekday.
Marcus has been doing this for three years. He has never once thought of it as a problem. It is, to him, just part of the job.
Priya sits three desks away and has been watching this ritual for four months. Two weeks ago she started wondering: what if none of that had to be manual?
This chapter is about answering that question systematically.
17.1 The Automation Audit: What Should You Actually Automate?
Not everything that feels tedious is worth automating. Automation takes time to build, test, and maintain. The decision is a genuine trade-off.
The Time × Frequency × Consistency Matrix
Before writing a single line of code, apply three filters to any candidate task:
Time: How long does the task take per occurrence? A task that takes 30 seconds once a week isn't worth a full automation script. A task that takes 30 minutes every day definitely is.
Frequency: How often does it happen? Monthly processes are easier to do manually. Daily or hourly processes beg for automation.
Consistency: Does the task follow the same rules every time, or does it require judgment? Automation excels at consistent, rule-based operations. If the rules change case-by-case, automation may create more problems than it solves.
Here is the matrix in practice:
| Task | Time | Frequency | Consistent? | Automate? |
|---|---|---|---|---|
| Moving ETL output files | 5 min | Daily | Yes | Strongly yes |
| Renaming files to standard format | 3 min | Daily | Yes | Strongly yes |
| Deciding which vendor to use | 45 min | Monthly | No (judgment) | No |
| Generating weekly status report | 20 min | Weekly | Yes | Yes |
| Checking for missing regional files | 2 min | Daily | Yes | Yes |
| Reviewing an unusual invoice | 30 min | Occasional | No | No |
Marcus's morning ritual scores high on all three dimensions. It takes meaningful time, happens every day without exception, and follows exactly the same rules every time. It is a textbook automation candidate.
The ROI Calculation
Be practical about the investment. A script that saves 25 minutes per day, 5 days per week, saves approximately 100 hours per year — for one person. If Priya can write the script in 4 hours, the ROI is a 25:1 ratio in the first year alone, with zero additional investment in subsequent years.
The formula:
Annual time saved = (Minutes saved per occurrence × Occurrences per year) / 60
Payback time = Hours to build ÷ Annual hours saved
For Marcus's morning routine:
Annual time saved = (25 min × 250 workdays) / 60 = ~104 hours
Payback time = 4 hours ÷ 104 hours = 0.04 years (about 10 working days)
That is a very compelling case.
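The two formulas translate directly into a small helper. Here is a sketch (the function name and return fields are illustrative, not part of the chapter's code):

```python
def automation_payback(
    minutes_per_occurrence: float,
    occurrences_per_year: int,
    hours_to_build: float,
) -> dict:
    """Estimate annual savings and payback for an automation candidate."""
    annual_hours_saved = (minutes_per_occurrence * occurrences_per_year) / 60
    payback_years = hours_to_build / annual_hours_saved
    # How many task occurrences until the build time is recouped
    occurrences_to_break_even = (hours_to_build * 60) / minutes_per_occurrence
    return {
        "annual_hours_saved": round(annual_hours_saved, 1),
        "payback_years": round(payback_years, 3),
        "occurrences_to_break_even": round(occurrences_to_break_even, 1),
    }

# Marcus's routine: 25 minutes per day, 250 workdays, ~4 hours to build
print(automation_payback(25, 250, 4))
```

Running it on Marcus's numbers reproduces the figures above: about 104 hours saved per year, with the build effort recouped after roughly ten occurrences, i.e. two working weeks.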
17.2 The os and shutil Modules
Python's standard library includes two modules that handle virtually everything you need for file system operations: os and shutil.
os: The Operating System Interface
The os module lets Python interact with the file system and operating system.
import os
# Get the current working directory
current_dir = os.getcwd()
print(current_dir) # e.g., C:\Users\Marcus\Documents
# List files in a directory
file_list = os.listdir("/path/to/exports")
print(file_list) # ['chicago.csv', 'nashville.csv', 'readme.txt']
# Create a directory
os.mkdir("/path/to/new_folder")
# Create nested directories (like mkdir -p)
os.makedirs("/path/to/deep/nested/folder", exist_ok=True)
# exist_ok=True means: don't error if the folder already exists
# Check if a path exists
if os.path.exists("/path/to/some/file.csv"):
print("File is there")
# Check if it's a file vs. a directory
os.path.isfile("/path/to/some/file.csv") # True
os.path.isdir("/path/to/some/folder") # True
# Get file metadata
file_stats = os.stat("/path/to/file.csv")
print(file_stats.st_size) # File size in bytes
print(file_stats.st_mtime) # Last modified timestamp (seconds since epoch)
# Rename a file
os.rename("/path/to/old_name.csv", "/path/to/new_name.csv")
# Remove a file
os.remove("/path/to/unwanted_file.txt")
# Remove an empty directory
os.rmdir("/path/to/empty_folder")
The os.path submodule handles path string manipulation:
import os
# Build a path correctly for the current OS
# (backslash on Windows, forward slash on Mac/Linux)
full_path = os.path.join("/exports", "2024-01", "chicago.csv")
# On Windows: \exports\2024-01\chicago.csv
# On Mac/Linux: /exports/2024-01/chicago.csv
# Split a path into directory and filename
directory, filename = os.path.split("/exports/2024-01/chicago.csv")
# directory = "/exports/2024-01"
# filename = "chicago.csv"
# Split filename from extension
name, extension = os.path.splitext("chicago.csv")
# name = "chicago", extension = ".csv"
# Get the parent directory
parent = os.path.dirname("/exports/2024-01/chicago.csv")
# parent = "/exports/2024-01"
shutil: High-Level File Operations
While os handles individual operations, shutil (shell utilities) handles the more involved operations: copying, moving, archiving entire directory trees.
import shutil
# Copy a file (destination can be a file or directory)
shutil.copy("/source/report.pdf", "/destination/report.pdf")
# Copy with metadata preserved (timestamps, permissions)
shutil.copy2("/source/report.pdf", "/destination/report.pdf")
# Copy an entire directory tree
shutil.copytree("/source/folder", "/destination/folder")
# Move a file or directory
shutil.move("/source/file.csv", "/destination/file.csv")
# Delete an entire directory tree (including all contents)
# WARNING: this is permanent. Use carefully.
shutil.rmtree("/path/to/folder_to_delete")
# Create a ZIP archive from a directory
# Creates archive.zip containing all contents of /source_dir
shutil.make_archive(
base_name="/output/archive", # output path (no extension)
format="zip", # "zip", "tar", "gztar", etc.
root_dir="/source_dir", # directory to archive
)
# Unpack an archive
shutil.unpack_archive("/path/to/archive.zip", "/destination_folder")
17.3 pathlib: The Modern Way to Handle Paths
While os.path works, Python 3.4 introduced pathlib as the more ergonomic replacement. Rather than treating paths as strings and calling functions on them, pathlib gives you Path objects that you can work with directly.
Chapter 9 introduced pathlib briefly. This chapter is where it becomes indispensable, because file automation scripts involve dozens of path manipulations and the difference in readability matters.
The Path Object
from pathlib import Path
# Create a Path object
exports_dir = Path("/data/exports")
report_file = Path("/data/reports/q1_summary.xlsx")
# Path arithmetic with the / operator
# This is pathlib's best feature
data_dir = Path("/data")
daily_file = data_dir / "exports" / "2024-01-15" / "chicago.csv"
# Result: Path('/data/exports/2024-01-15/chicago.csv')
# Check existence
if daily_file.exists():
print("File is present")
if daily_file.is_file():
print("It's a file, not a directory")
# Get components
print(daily_file.name) # 'chicago.csv'
print(daily_file.stem) # 'chicago'
print(daily_file.suffix) # '.csv'
print(daily_file.parent) # Path('/data/exports/2024-01-15')
print(daily_file.parts) # ('/', 'data', 'exports', '2024-01-15', 'chicago.csv')
# Convert to string when needed
print(str(daily_file)) # '/data/exports/2024-01-15/chicago.csv'
Iterating Over Files
from pathlib import Path
exports_dir = Path("/data/exports")
# List all files in a directory
for file_path in exports_dir.iterdir():
if file_path.is_file():
print(file_path.name)
# Find files matching a pattern (glob)
for csv_file in exports_dir.glob("*.csv"):
print(csv_file.name)
# Find files recursively (in all subdirectories)
for csv_file in exports_dir.rglob("*.csv"):
print(csv_file)
# Find files matching a more specific pattern
for report_file in exports_dir.glob("*_regional_*.xlsx"):
print(report_file.name)
Creating Directories and Files
from pathlib import Path
archive_dir = Path("/data/archive/2024-01")
# Create directory and all parents (equivalent to mkdir -p)
archive_dir.mkdir(parents=True, exist_ok=True)
# Write text content to a file
output_file = archive_dir / "manifest.txt"
output_file.write_text("Archived: chicago.csv, nashville.csv\n")
# Read a file's text content
content = output_file.read_text()
print(content)
# Get file stats
stats = output_file.stat()
print(f"Size: {stats.st_size} bytes")
print(f"Last modified: {stats.st_mtime}")
Reading File Metadata with pathlib
import datetime
from pathlib import Path
def get_file_info(file_path: Path) -> dict:
"""Return a dictionary of useful metadata about a file."""
stats = file_path.stat()
modification_time = datetime.datetime.fromtimestamp(stats.st_mtime)
    # Note: st_ctime is creation time on Windows, but metadata-change time on Unix
    creation_time = datetime.datetime.fromtimestamp(stats.st_ctime)
return {
"name": file_path.name,
"extension": file_path.suffix.lower(),
"size_bytes": stats.st_size,
"size_kb": round(stats.st_size / 1024, 1),
"modified": modification_time,
"modified_date": modification_time.date(),
"created": creation_time,
"parent_folder": str(file_path.parent),
}
# Use it
exports_dir = Path("/data/exports")
for csv_file in exports_dir.glob("*.csv"):
info = get_file_info(csv_file)
print(f"{info['name']:<30} {info['size_kb']:>8.1f} KB {info['modified_date']}")
17.4 Organizing Files into Folders
One of the most common file automation tasks is sorting a folder full of mixed files into organized subdirectories. Here is how to approach it systematically.
Sorting by File Type
import shutil
from pathlib import Path
# Define which file types belong in which folders
FOLDER_RULES = {
"spreadsheets": [".xlsx", ".xls", ".csv", ".ods"],
"documents": [".docx", ".doc", ".pdf", ".txt", ".rtf"],
"presentations": [".pptx", ".ppt"],
"images": [".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff"],
"archives": [".zip", ".tar", ".gz", ".7z", ".rar"],
"data": [".json", ".xml", ".parquet", ".feather"],
}
def build_extension_map(folder_rules: dict) -> dict:
"""
Reverse the folder_rules dictionary so extension maps to folder name.
{'spreadsheets': ['.xlsx', '.csv']} becomes {'.xlsx': 'spreadsheets', ...}
"""
extension_map = {}
for folder_name, extensions in folder_rules.items():
for ext in extensions:
extension_map[ext.lower()] = folder_name
return extension_map
def organize_folder_by_type(
source_dir: Path,
dry_run: bool = True,
) -> list[dict]:
"""
Sort files in source_dir into subdirectories based on file type.
Args:
source_dir: The folder to organize.
dry_run: If True, print what would happen without moving anything.
Set to False to actually move files.
Returns:
List of action dicts with keys: file, destination, status
"""
extension_map = build_extension_map(FOLDER_RULES)
actions = []
for file_path in source_dir.iterdir():
# Skip directories and hidden files
if file_path.is_dir():
continue
if file_path.name.startswith("."):
continue
extension = file_path.suffix.lower()
destination_folder_name = extension_map.get(extension, "misc")
destination_dir = source_dir / destination_folder_name
destination_file = destination_dir / file_path.name
action = {
"file": file_path.name,
"source": str(file_path),
"destination": str(destination_file),
"folder": destination_folder_name,
"status": "pending",
}
if not dry_run:
destination_dir.mkdir(exist_ok=True)
# Handle naming conflicts
if destination_file.exists():
destination_file = resolve_naming_conflict(destination_file)
action["destination"] = str(destination_file)
action["status"] = "moved_renamed"
else:
action["status"] = "moved"
shutil.move(str(file_path), str(destination_file))
actions.append(action)
return actions
def resolve_naming_conflict(file_path: Path) -> Path:
"""
If file_path already exists, add a numeric suffix until the name is unique.
chicago.csv -> chicago_1.csv -> chicago_2.csv, etc.
"""
counter = 1
stem = file_path.stem
suffix = file_path.suffix
parent = file_path.parent
while file_path.exists():
file_path = parent / f"{stem}_{counter}{suffix}"
counter += 1
return file_path
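Here is resolve_naming_conflict in action. The snippet repeats the helper so it runs on its own, and uses a temporary directory rather than real files:

```python
import tempfile
from pathlib import Path

def resolve_naming_conflict(file_path: Path) -> Path:
    """Add a numeric suffix until the filename is unique (same logic as above)."""
    counter = 1
    stem, suffix, parent = file_path.stem, file_path.suffix, file_path.parent
    while file_path.exists():
        file_path = parent / f"{stem}_{counter}{suffix}"
        counter += 1
    return file_path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "chicago.csv").write_text("a,b\n")    # the incoming name is taken
    (folder / "chicago_1.csv").write_text("a,b\n")  # ...and so is the first fallback
    unique = resolve_naming_conflict(folder / "chicago.csv")
    print(unique.name)  # chicago_2.csv
```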
Sorting by Date
Often more useful than sorting by type is sorting by the file's modification date, which groups work done around the same time.
import datetime
import shutil
from pathlib import Path
def organize_folder_by_date(
source_dir: Path,
date_format: str = "%Y-%m",
dry_run: bool = True,
) -> list[dict]:
"""
Move files into subdirectories named by the file's modification month.
Creates subfolders named like '2024-01', '2024-02', etc.
Args:
source_dir: The folder to organize.
date_format: strftime format for the subfolder names.
"%Y-%m" gives "2024-01"; "%Y" gives "2024".
dry_run: If True, show actions without moving files.
Returns:
List of action dicts.
"""
actions = []
for file_path in source_dir.iterdir():
if file_path.is_dir():
continue
if file_path.name.startswith("."):
continue
# Get modification date
modified_timestamp = file_path.stat().st_mtime
modified_date = datetime.datetime.fromtimestamp(modified_timestamp)
folder_name = modified_date.strftime(date_format)
destination_dir = source_dir / folder_name
destination_file = destination_dir / file_path.name
action = {
"file": file_path.name,
"date_folder": folder_name,
"destination": str(destination_file),
"status": "pending",
}
if not dry_run:
destination_dir.mkdir(exist_ok=True)
if destination_file.exists():
destination_file = resolve_naming_conflict(destination_file)
action["status"] = "moved_renamed"
else:
action["status"] = "moved"
shutil.move(str(file_path), str(destination_file))
actions.append(action)
return actions
17.5 Bulk Renaming
File naming conventions drift over time. Files arrive from different systems with different naming styles. Bulk renaming scripts restore order.
Common Renaming Operations
import re
import datetime
from pathlib import Path
def standardize_filename(filename: str) -> str:
"""
Convert a filename to a standard format:
- Lowercase
- Spaces and special characters replaced with underscores
- Multiple underscores collapsed to one
Examples:
'Q1 Sales Report (FINAL).xlsx' -> 'q1_sales_report_final.xlsx'
'Chicago-RegionalData v2.csv' -> 'chicago_regionaldata_v2.csv'
"""
stem, suffix = filename.rsplit(".", 1) if "." in filename else (filename, "")
# Lowercase
stem = stem.lower()
# Replace non-alphanumeric characters (except underscores) with underscores
stem = re.sub(r"[^a-z0-9_]", "_", stem)
# Collapse multiple underscores
stem = re.sub(r"_+", "_", stem)
# Strip leading/trailing underscores
stem = stem.strip("_")
if suffix:
return f"{stem}.{suffix.lower()}"
return stem
def add_date_prefix(filename: str, date: datetime.date | None = None) -> str:
"""
Add an ISO-format date prefix to a filename.
Args:
filename: The original filename.
date: The date to use. Defaults to today if not provided.
Examples:
'chicago_sales.csv', date=2024-01-15 -> '2024-01-15_chicago_sales.csv'
"""
if date is None:
date = datetime.date.today()
date_prefix = date.strftime("%Y-%m-%d")
    return f"{date_prefix}_{filename}"
def reformat_date_in_filename(filename: str) -> str:
"""
Find dates in various formats within a filename and normalize to ISO 8601.
    Handles: MM-DD-YYYY, MM/DD/YYYY, MMDDYYYY (dates are assumed month-first)
Converts to: YYYY-MM-DD
Examples:
'report_01-15-2024.csv' -> 'report_2024-01-15.csv'
'data_01152024.xlsx' -> 'data_2024-01-15.xlsx'
"""
# Pattern: MM-DD-YYYY or MM/DD/YYYY
pattern_mdy = re.compile(r"(\d{1,2})[-/](\d{1,2})[-/](\d{4})")
    def replace_mdy(match):
        month, day, year = match.groups()
        # Validate before swapping the fields so an impossible month (e.g. the
        # 15 in 15-01-2024) leaves the original text untouched
        try:
            datetime.date(int(year), int(month), int(day))
        except ValueError:
            return match.group(0)
        return f"{year}-{month.zfill(2)}-{day.zfill(2)}"
    result = pattern_mdy.sub(replace_mdy, filename)
    # Pattern: MMDDYYYY (8 digits with no separator — the lookarounds keep it
    # from matching inside longer runs of digits)
    pattern_mmddyyyy = re.compile(r"(?<!\d)(\d{2})(\d{2})(\d{4})(?!\d)")
def replace_mmddyyyy(match):
month, day, year = match.groups()
# Validate it looks like a real date before replacing
try:
datetime.date(int(year), int(month), int(day))
return f"{year}-{month}-{day}"
except ValueError:
return match.group(0) # Not a valid date — leave it alone
result = pattern_mmddyyyy.sub(replace_mmddyyyy, result)
return result
def bulk_rename_preview(source_dir: Path, rename_function) -> list[dict]:
"""
Preview the result of applying a rename function to all files in source_dir.
Does not move anything. Returns a list of before/after comparisons.
"""
previews = []
for file_path in sorted(source_dir.iterdir()):
if file_path.is_dir():
continue
new_name = rename_function(file_path.name)
previews.append({
"before": file_path.name,
"after": new_name,
"changed": file_path.name != new_name,
})
return previews
# Preview before/after for a dry run
def print_rename_preview(previews: list[dict]) -> None:
"""Print a formatted before/after table for rename operations."""
changes = [p for p in previews if p["changed"]]
unchanged = len(previews) - len(changes)
print(f"Rename preview: {len(changes)} files will change, {unchanged} unchanged\n")
print(f"{'BEFORE':<45} {'AFTER':<45}")
print("-" * 91)
for preview in changes:
print(f"{preview['before']:<45} {preview['after']:<45}")
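As a quick sanity check, the docstring examples can be verified directly. The function is repeated here in condensed form so the snippet runs on its own:

```python
import re

def standardize_filename(filename: str) -> str:
    """Lowercase, underscore-separate, collapse repeats (same rules as above)."""
    stem, suffix = filename.rsplit(".", 1) if "." in filename else (filename, "")
    stem = re.sub(r"_+", "_", re.sub(r"[^a-z0-9_]", "_", stem.lower())).strip("_")
    return f"{stem}.{suffix.lower()}" if suffix else stem

print(standardize_filename("Q1 Sales Report (FINAL).xlsx"))
# q1_sales_report_final.xlsx
print(standardize_filename("Chicago-RegionalData v2.csv"))
# chicago_regionaldata_v2.csv
```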
17.6 Copying, Moving, and Archiving
The mechanics of moving files around and preserving them in archives.
Moving Files with Conflict Handling
import shutil
import logging
from pathlib import Path
logger = logging.getLogger(__name__)
def move_file_safely(
source: Path,
destination_dir: Path,
on_conflict: str = "rename",
) -> Path:
"""
Move a file to a destination directory, handling name conflicts.
Args:
source: Path to the file to move.
destination_dir: Directory to move it into.
on_conflict: What to do if the destination file already exists.
"rename" (default): add numeric suffix to avoid overwrite
"overwrite": replace the existing file
"skip": do nothing and return the existing file path
"error": raise FileExistsError
Returns:
The final path of the moved (or skipped) file.
"""
destination_dir.mkdir(parents=True, exist_ok=True)
destination = destination_dir / source.name
if destination.exists():
if on_conflict == "rename":
destination = resolve_naming_conflict(destination)
logger.info(f"Conflict resolved: {source.name} -> {destination.name}")
elif on_conflict == "overwrite":
logger.info(f"Overwriting existing file: {destination}")
elif on_conflict == "skip":
logger.info(f"Skipping {source.name}: already exists at {destination}")
return destination
elif on_conflict == "error":
raise FileExistsError(
f"Destination already exists: {destination}"
)
shutil.move(str(source), str(destination))
logger.info(f"Moved: {source} -> {destination}")
return destination
def resolve_naming_conflict(file_path: Path) -> Path:
"""Add a numeric suffix until the filename is unique."""
counter = 1
stem = file_path.stem
suffix = file_path.suffix
parent = file_path.parent
while file_path.exists():
file_path = parent / f"{stem}_{counter}{suffix}"
counter += 1
return file_path
Creating Archives
import shutil
import datetime
from pathlib import Path
def archive_folder(
source_dir: Path,
archive_dir: Path,
    archive_name: str | None = None,
format: str = "zip",
) -> Path:
"""
Compress source_dir into an archive in archive_dir.
Args:
source_dir: The folder to archive.
archive_dir: Where to put the archive file.
archive_name: Name for the archive (without extension).
Defaults to source folder name + timestamp.
format: "zip", "tar", "gztar", or "bztar"
Returns:
Path to the created archive file.
"""
archive_dir.mkdir(parents=True, exist_ok=True)
if archive_name is None:
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
archive_name = f"{source_dir.name}_{timestamp}"
archive_base = archive_dir / archive_name
created_path = shutil.make_archive(
base_name=str(archive_base),
format=format,
root_dir=str(source_dir.parent),
base_dir=source_dir.name,
)
return Path(created_path)
def archive_old_files(
source_dir: Path,
archive_dir: Path,
older_than_days: int = 30,
dry_run: bool = True,
) -> list[Path]:
"""
Move files older than a threshold into an archive directory.
Args:
source_dir: Folder to check for old files.
archive_dir: Where to move old files.
older_than_days: Files older than this many days will be moved.
dry_run: If True, report what would happen without moving anything.
Returns:
List of files that were (or would be) moved.
"""
cutoff = datetime.datetime.now() - datetime.timedelta(days=older_than_days)
files_to_archive = []
for file_path in source_dir.iterdir():
if file_path.is_dir():
continue
modified = datetime.datetime.fromtimestamp(file_path.stat().st_mtime)
if modified < cutoff:
files_to_archive.append(file_path)
if not dry_run:
move_file_safely(file_path, archive_dir, on_conflict="rename")
return files_to_archive
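Date-threshold logic is awkward to verify by waiting for files to age. Backdating a file's modification time with os.utime makes it testable immediately; this sketch condenses the cutoff check into a hypothetical files_older_than helper:

```python
import datetime
import os
import tempfile
from pathlib import Path

def files_older_than(folder: Path, days: int) -> list[Path]:
    """Return files whose modification time is older than the cutoff."""
    cutoff = datetime.datetime.now() - datetime.timedelta(days=days)
    return [
        p for p in folder.iterdir()
        if p.is_file()
        and datetime.datetime.fromtimestamp(p.stat().st_mtime) < cutoff
    ]

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "fresh.csv").write_text("a")
    stale_file = folder / "stale.csv"
    stale_file.write_text("b")
    # Backdate stale.csv by 40 days (os.utime takes an (atime, mtime) tuple)
    then = (datetime.datetime.now() - datetime.timedelta(days=40)).timestamp()
    os.utime(stale_file, (then, then))
    stale = [p.name for p in files_older_than(folder, 30)]
    print(stale)  # ['stale.csv']
```

The same trick is handy for testing archive_old_files itself before pointing it at a real folder.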
17.7 Generating Folder Structures Automatically
When a new project starts, creating the standard folder structure manually is tedious and prone to inconsistency. Python can generate it in seconds.
from pathlib import Path
# Define a project folder template as a nested dictionary
# None values represent files to create (empty or with placeholder content)
PROJECT_TEMPLATE = {
"data": {
"raw": {},
"processed": {},
"exports": {},
},
"reports": {
"weekly": {},
"monthly": {},
},
"scripts": {},
"archive": {},
"docs": {},
}
def create_folder_structure(
base_dir: Path,
template: dict,
create_gitkeep: bool = True,
) -> list[Path]:
"""
Recursively create a folder structure from a template dictionary.
Args:
base_dir: Root directory to create the structure under.
template: Nested dict where keys are folder names and values
are sub-templates (empty dict for leaf folders).
create_gitkeep: If True, add a .gitkeep file to empty directories
so they are tracked by git.
Returns:
List of all created directory paths.
"""
created = []
for folder_name, sub_template in template.items():
folder_path = base_dir / folder_name
folder_path.mkdir(parents=True, exist_ok=True)
created.append(folder_path)
if sub_template:
# Recurse into sub-template
sub_created = create_folder_structure(
folder_path, sub_template, create_gitkeep
)
created.extend(sub_created)
elif create_gitkeep:
# Leaf folder: create a .gitkeep placeholder
gitkeep = folder_path / ".gitkeep"
if not gitkeep.exists():
gitkeep.touch()
return created
def setup_regional_project(
base_dir: Path,
project_name: str,
regions: list[str],
) -> Path:
"""
Create a standard Acme Corp regional analysis project structure.
Creates a top-level project folder with subfolders for each region,
plus shared data and reports directories.
Returns:
The created project root directory.
"""
project_root = base_dir / project_name
# Build a dynamic template based on the regions list
regional_template = {region: {} for region in regions}
template = {
"data": {
"raw": regional_template.copy(),
"processed": {},
},
"reports": {
"drafts": {},
"final": {},
},
"scripts": {},
"archive": {},
}
created = create_folder_structure(project_root, template)
# Create a README in the project root
readme_content = f"""# {project_name}
Created: {__import__('datetime').date.today()}
Regions: {', '.join(regions)}
## Structure
- data/raw/ Source data files by region
- data/processed/ Cleaned and joined data
- reports/ Analysis outputs
- scripts/ Python scripts for this analysis
- archive/ Old versions and superseded files
"""
(project_root / "README.md").write_text(readme_content)
print(f"Created project structure: {project_root}")
print(f" {len(created)} directories created")
return project_root
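A quick demonstration that the recursion produces the expected tree. This condensed version drops the .gitkeep handling and runs against a temporary directory:

```python
import tempfile
from pathlib import Path

def create_folders(base: Path, template: dict) -> list[Path]:
    """Condensed create_folder_structure: recurse through a nested dict."""
    created = []
    for name, sub_template in template.items():
        path = base / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
        created.extend(create_folders(path, sub_template))
    return created

with tempfile.TemporaryDirectory() as tmp:
    made = create_folders(
        Path(tmp), {"data": {"raw": {}, "processed": {}}, "scripts": {}}
    )
    names = sorted(p.relative_to(tmp).as_posix() for p in made)
    print(names)  # ['data', 'data/processed', 'data/raw', 'scripts']
```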
17.8 Folder Watching with watchdog
All the operations above run on demand: you trigger the script when you want things organized. watchdog goes further — it monitors a folder and triggers actions the moment new files arrive, without any human involvement.
This is the pattern behind many real-world automation workflows: files dropped in a folder are automatically processed and moved.
Installing watchdog
pip install watchdog
A Basic Folder Watcher
import time
import logging
from pathlib import Path
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
logging.basicConfig(level=logging.INFO, format="%(asctime)s — %(message)s")
logger = logging.getLogger(__name__)
class CSVArrivalHandler(FileSystemEventHandler):
"""
Handles file system events in the watched directory.
Triggers processing when a CSV file is created or moved into the folder.
"""
def __init__(self, processing_function):
super().__init__()
self.processing_function = processing_function
def on_created(self, event):
"""Called when a file or directory is created."""
if event.is_directory:
return
file_path = Path(event.src_path)
# Only process CSV files
if file_path.suffix.lower() == ".csv":
logger.info(f"New CSV detected: {file_path.name}")
self.processing_function(file_path)
def on_moved(self, event):
"""Called when a file or directory is moved into the watched folder."""
if event.is_directory:
return
dest_path = Path(event.dest_path)
if dest_path.suffix.lower() == ".csv":
logger.info(f"CSV moved in: {dest_path.name}")
self.processing_function(dest_path)
def process_incoming_csv(file_path: Path) -> None:
"""
Process a newly arrived CSV file.
This function contains whatever business logic you need:
validate columns, move to processing folder, send a notification, etc.
"""
logger.info(f"Processing: {file_path.name}")
# Example: move to a 'processing' subdirectory
processing_dir = file_path.parent / "processing"
processing_dir.mkdir(exist_ok=True)
destination = processing_dir / file_path.name
import shutil
shutil.move(str(file_path), str(destination))
logger.info(f"Moved to processing: {destination}")
def watch_folder(folder_path: Path, processing_function) -> None:
"""
Watch a folder for new CSV files and process them automatically.
Runs indefinitely until interrupted with Ctrl+C.
Args:
folder_path: The directory to monitor.
processing_function: Called with the Path of each new CSV file.
"""
event_handler = CSVArrivalHandler(processing_function)
observer = Observer()
observer.schedule(event_handler, str(folder_path), recursive=False)
observer.start()
logger.info(f"Watching for new CSV files in: {folder_path}")
logger.info("Press Ctrl+C to stop.\n")
try:
while True:
time.sleep(1)
except KeyboardInterrupt:
logger.info("Stopping watcher...")
observer.stop()
observer.join()
logger.info("Watcher stopped.")
if __name__ == "__main__":
watch_folder(
folder_path=Path("/data/exports/incoming"),
processing_function=process_incoming_csv,
)
How watchdog Works
watchdog uses the operating system's native file change notification mechanism (inotify on Linux, FSEvents on macOS, ReadDirectoryChangesW on Windows). This is far more efficient than polling — it does not repeatedly check every file; it receives an event the instant the OS detects a change.
The Observer object is the watcher. You attach one or more FileSystemEventHandler subclasses to it, each watching a path. When a file system event occurs (create, modify, delete, move), the corresponding method on your handler is called.
The key events:
- on_created(event) — a new file or directory appeared
- on_deleted(event) — a file or directory was removed
- on_modified(event) — a file's contents changed
- on_moved(event) — a file was renamed or moved (including moved in from outside)
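For contrast, the polling approach that watchdog replaces can be sketched with only the standard library — a snapshot of filenames compared between passes. It is less efficient, but still occasionally useful, such as on network shares where OS change notifications can be unreliable (the function and variable names here are illustrative):

```python
import tempfile
from pathlib import Path

def poll_for_new_csvs(folder: Path, known: set[str], handler) -> set[str]:
    """One polling pass: invoke handler for each CSV not seen on a prior pass."""
    current = {p.name for p in folder.glob("*.csv")}
    for name in sorted(current - known):
        handler(folder / name)
    return current

# Demo: two passes over a temporary folder, collecting detected filenames
detected = []
with tempfile.TemporaryDirectory() as tmp:
    incoming = Path(tmp)
    (incoming / "chicago.csv").write_text("a,b\n")
    seen = poll_for_new_csvs(incoming, set(), lambda p: detected.append(p.name))
    (incoming / "nashville.csv").write_text("a,b\n")
    seen = poll_for_new_csvs(incoming, seen, lambda p: detected.append(p.name))
print(detected)  # ['chicago.csv', 'nashville.csv']
```

In production this pass would sit inside a while loop with a sleep between iterations; that repeated directory scan is exactly the overhead watchdog's event-driven model avoids.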
17.9 A Complete Monthly File Organizer
Let's combine everything into a single, production-quality script that Marcus (or Priya's automated scheduler) can run every morning.
The script will:

1. Check that the four expected regional CSVs are present
2. Move those CSVs to the day's processing folder
3. Archive any files older than one day
4. Log everything
"""
monthly_file_organizer.py
Morning file organization script for Acme Corp data exports.
Designed to run daily via Task Scheduler or cron.
Usage:
python monthly_file_organizer.py
python monthly_file_organizer.py --dry-run
"""
import argparse
import datetime
import logging
import shutil
import sys
from pathlib import Path
# ── CONFIGURATION ─────────────────────────────────────────────────────────────
EXPORTS_DIR = Path("/data/acme/exports")
PROCESSING_DIR = Path("/data/acme/processing")
ARCHIVE_DIR = Path("/data/acme/archive")
LOG_DIR = Path("/data/acme/logs")
EXPECTED_REGIONAL_FILES = [
"chicago_sales.csv",
"nashville_sales.csv",
"cincinnati_sales.csv",
"stlouis_sales.csv",
]
ARCHIVE_AFTER_DAYS = 1 # Archive exports older than 1 day
# ── LOGGING SETUP ─────────────────────────────────────────────────────────────
def setup_logging(log_dir: Path) -> logging.Logger:
"""Configure logging to both console and a daily log file."""
log_dir.mkdir(parents=True, exist_ok=True)
today = datetime.date.today().strftime("%Y-%m-%d")
log_file = log_dir / f"organizer_{today}.log"
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)-8s %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
handlers=[
logging.FileHandler(log_file),
logging.StreamHandler(sys.stdout),
],
)
return logging.getLogger(__name__)
# ── CORE FUNCTIONS ─────────────────────────────────────────────────────────────
def check_expected_files(
source_dir: Path,
expected_files: list[str],
logger: logging.Logger,
) -> tuple[list[str], list[str]]:
"""
Check which expected files are present and which are missing.
Returns:
(present_files, missing_files) — two lists of filenames
"""
present = []
missing = []
for filename in expected_files:
if (source_dir / filename).exists():
present.append(filename)
            logger.info(f"  PRESENT: {filename}")
else:
missing.append(filename)
            logger.warning(f"  MISSING: {filename}")
return present, missing
def move_files_to_processing(
source_dir: Path,
destination_dir: Path,
filenames: list[str],
logger: logging.Logger,
dry_run: bool = False,
) -> list[Path]:
"""
Move a list of files from source_dir to destination_dir.
Returns:
List of destination Paths for moved files.
"""
destination_dir.mkdir(parents=True, exist_ok=True)
moved = []
for filename in filenames:
source_file = source_dir / filename
dest_file = destination_dir / filename
if not source_file.exists():
            logger.warning(f"Cannot move (not found): {filename}")
continue
if dest_file.exists():
# Avoid overwrite: add a timestamp suffix
timestamp = datetime.datetime.now().strftime("%H%M%S")
stem = source_file.stem
suffix = source_file.suffix
dest_file = destination_dir / f"{stem}_{timestamp}{suffix}"
            logger.info(f"  Conflict resolved: {filename} -> {dest_file.name}")
if not dry_run:
shutil.move(str(source_file), str(dest_file))
            logger.info(f"  Moved: {source_file} -> {dest_file}")
else:
            logger.info(f"  [DRY RUN] Would move: {source_file} -> {dest_file}")
moved.append(dest_file)
return moved
def archive_old_exports(
    source_dir: Path,
    archive_dir: Path,
    older_than_days: int,
    logger: logging.Logger,
    dry_run: bool = False,
) -> list[Path]:
    """
    Move files older than the threshold into the archive directory,
    organized into date-named subfolders.
    """
    cutoff = datetime.datetime.now() - datetime.timedelta(days=older_than_days)
    archived = []
    for file_path in source_dir.iterdir():
        if file_path.is_dir():
            continue
        modified = datetime.datetime.fromtimestamp(file_path.stat().st_mtime)
        if modified < cutoff:
            date_folder = modified.strftime("%Y-%m-%d")
            dest_dir = archive_dir / date_folder
            dest_file = dest_dir / file_path.name
            if not dry_run:
                dest_dir.mkdir(parents=True, exist_ok=True)
                if dest_file.exists():
                    # Already archived (can happen on reruns)
                    logger.info(f" Already archived, skipping: {file_path.name}")
                    continue
                shutil.move(str(file_path), str(dest_file))
                logger.info(f" Archived: {file_path.name} -> {date_folder}/")
            else:
                logger.info(
                    f" [DRY RUN] Would archive: {file_path.name} -> {date_folder}/"
                )
            archived.append(dest_file)
    return archived
# ── MAIN ORCHESTRATION ─────────────────────────────────────────────────────────


def run_morning_organizer(dry_run: bool = False) -> None:
    """
    Full morning file organization routine.

    1. Archive files from yesterday
    2. Check for today's expected regional files
    3. Move present files to processing
    4. Report missing files
    """
    logger = setup_logging(LOG_DIR)
    today = datetime.date.today().strftime("%Y-%m-%d")
    today_processing_dir = PROCESSING_DIR / today

    logger.info("=" * 60)
    logger.info(f"Morning File Organizer — {today}")
    if dry_run:
        logger.info("DRY RUN MODE: no files will be moved")
    logger.info("=" * 60)

    # Step 1: Archive old files
    logger.info("\n[1/3] Archiving old export files...")
    archived = archive_old_exports(
        source_dir=EXPORTS_DIR,
        archive_dir=ARCHIVE_DIR,
        older_than_days=ARCHIVE_AFTER_DAYS,
        logger=logger,
        dry_run=dry_run,
    )
    logger.info(f" {len(archived)} files archived")

    # Step 2: Check for expected files
    logger.info("\n[2/3] Checking for expected regional files...")
    present_files, missing_files = check_expected_files(
        source_dir=EXPORTS_DIR,
        expected_files=EXPECTED_REGIONAL_FILES,
        logger=logger,
    )
    logger.info(f" {len(present_files)}/4 expected files present")

    # Step 3: Move present files to processing
    if present_files:
        logger.info("\n[3/3] Moving files to today's processing folder...")
        moved = move_files_to_processing(
            source_dir=EXPORTS_DIR,
            destination_dir=today_processing_dir,
            filenames=present_files,
            logger=logger,
            dry_run=dry_run,
        )
        logger.info(f" {len(moved)} files moved to {today_processing_dir}")
    else:
        logger.warning("\n[3/3] No expected files found — nothing to move")

    # Final summary
    logger.info("\n" + "=" * 60)
    logger.info("SUMMARY")
    logger.info(f" Archived: {len(archived)} files")
    logger.info(f" Present: {len(present_files)}/4 regional CSVs")
    logger.info(f" Missing: {missing_files or 'None'}")
    logger.info(f" Processing: {today_processing_dir}")
    logger.info("=" * 60)

    if missing_files:
        logger.warning(
            f"\nACTION REQUIRED: {len(missing_files)} regional files are missing."
        )
        for f in missing_files:
            logger.warning(f" - {f}")
        sys.exit(1)  # Non-zero exit signals failure to the scheduler
# ── ENTRY POINT ───────────────────────────────────────────────────────────────


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Acme Corp morning file organization script"
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Print what would happen without moving any files",
    )
    args = parser.parse_args()
    run_morning_organizer(dry_run=args.dry_run)
Running with --dry-run first is a professional discipline. Always verify what a file-moving script will do before it does it.
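The conflict-handling pattern in `move_files_to_processing` is worth isolating, because you will reuse it in almost every file-moving script. Here is a minimal sketch of the same idea — `resolve_destination` is a hypothetical helper for illustration, not part of the script above:

```python
import datetime
from pathlib import Path


def resolve_destination(dest_file: Path) -> Path:
    """Return dest_file unchanged, or a timestamp-suffixed variant
    if a file already exists at that path (avoids silent overwrites)."""
    if not dest_file.exists():
        return dest_file
    timestamp = datetime.datetime.now().strftime("%H%M%S")
    # e.g. report.csv -> report_093012.csv
    return dest_file.with_name(f"{dest_file.stem}_{timestamp}{dest_file.suffix}")
```

One caveat applies here just as it does in the main script: two conflicts within the same second would still collide. Appending an incrementing counter instead of a timestamp removes that edge case.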
17.10 Running Scripts on a Schedule
Writing the script is only half the work. For daily automation to be truly hands-off, the script must run without anyone triggering it.
Windows Task Scheduler
Task Scheduler is built into Windows. To schedule a Python script:
- Open Task Scheduler (search for it in the Start menu)
- Click "Create Basic Task"
- Set the trigger: Daily, at 7:00 AM
- Set the action: Start a program
- Program: C:\Users\Marcus\AppData\Local\Programs\Python\Python311\python.exe
- Arguments: C:\data\scripts\morning_organizer.py
- Start in: C:\data\scripts\
From the command line (using schtasks):
schtasks /create /tn "Acme Morning Organizer" ^
/tr "python C:\data\scripts\morning_organizer.py" ^
/sc DAILY /st 07:00
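After creating the task, it is worth confirming it actually registered. schtasks can query a task by the same name you gave it above:

```shell
schtasks /query /tn "Acme Morning Organizer"
```

If the task name is listed with a "Ready" status and the next run time you expect, the schedule is in place.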
macOS/Linux: cron
Open the cron table editor:
crontab -e
Add a line:
# Run morning organizer at 7:00 AM on weekdays
0 7 * * 1-5 /usr/local/bin/python3 /data/scripts/morning_organizer.py
The five time fields are, in order: minute, hour, day of month, month, and weekday (where 1–5 means Monday through Friday), followed by the command to run.
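One practical detail: cron jobs run without a terminal, and by default cron mails any output to the local user account, which is rarely monitored. Redirecting stdout and stderr to a file keeps the output somewhere you will actually look — the log path below is illustrative:

```shell
# Weekday 7:00 AM run, with all output appended to a log file
0 7 * * 1-5 /usr/local/bin/python3 /data/scripts/morning_organizer.py >> /var/log/morning_organizer.log 2>&1
```

The `2>&1` sends stderr to the same place as stdout, so tracebacks land in the log too.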
Summary
- Start with the automation audit: filter candidates by time × frequency × consistency
- `os` handles file system operations; `os.path` handles path string manipulation
- `pathlib.Path` is the modern, recommended approach — use the `/` operator to build paths
- `shutil` handles copying, moving, and archiving at a higher level
- Conflict handling is non-optional: every file-moving script must handle the case where the destination already exists
- Dry-run mode is professional practice: verify before committing
- `watchdog` enables reactive automation — trigger actions when files arrive without polling
- Build scripts to log everything; logs are how you debug problems that happen at 7 AM while you're still asleep
- Schedule with Windows Task Scheduler or cron to make automation truly hands-off