Chapter 17: Automating Repetitive Office Tasks

"The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency." — Bill Gates


Opening Scenario: The 7 AM File Shuffle

Every morning at 7:15, Marcus Webb opens four different folders on his workstation and spends the next twenty minutes doing the same thing he did yesterday morning, and the morning before that.

He opens the data exports folder from the overnight ETL process. He manually checks whether the four regional CSV files are present — Chicago, Nashville, Cincinnati, and St. Louis. If they are, he moves them to the processing folder and updates a shared Excel log. If any are missing, he sends a Slack message to the regional IT contacts.

Then he checks the downloads folder on the shared drive where sales reps drop their daily call reports. He moves each file into a subfolder named for the rep, renames it with the date, and archives any files older than two weeks into a ZIP file.

The whole operation takes twenty-five minutes. Every single weekday.

Marcus has been doing this for three years. He has never once thought of it as a problem. It is, to him, just part of the job.

Priya sits three desks away and has been watching this ritual for four months. Two weeks ago she started wondering: what if none of that had to be manual?

This chapter is about answering that question systematically.


17.1 The Automation Audit: What Should You Actually Automate?

Not everything that feels tedious is worth automating. Automation takes time to build, test, and maintain. The decision is a genuine trade-off.

The Time × Frequency × Consistency Matrix

Before writing a single line of code, apply three filters to any candidate task:

Time: How long does the task take per occurrence? A task that takes 30 seconds once a week isn't worth a full automation script. A task that takes 30 minutes every day definitely is.

Frequency: How often does it happen? A monthly process is often tolerable to do by hand. Daily or hourly processes beg for automation.

Consistency: Does the task follow the same rules every time, or does it require judgment? Automation excels at consistent, rule-based operations. If the rules change case-by-case, automation may create more problems than it solves.

Here is the matrix in practice:

| Task | Time | Frequency | Consistent? | Automate? |
| --- | --- | --- | --- | --- |
| Moving ETL output files | 5 min | Daily | Yes | Strongly yes |
| Renaming files to standard format | 3 min | Daily | Yes | Strongly yes |
| Deciding which vendor to use | 45 min | Monthly | No (judgment) | No |
| Generating weekly status report | 20 min | Weekly | Yes | Yes |
| Checking for missing regional files | 2 min | Daily | Yes | Yes |
| Reviewing an unusual invoice | 30 min | Occasional | No | No |

Marcus's morning ritual scores high on all three dimensions. It takes meaningful time, happens every day without exception, and follows exactly the same rules every time. It is a textbook automation candidate.
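The matrix can be encoded as a quick screening helper. This is only a sketch: the 8-hour and 20-hour cutoffs are illustrative assumptions, not rules from this chapter, and real decisions should still weigh maintenance cost.

```python
def automation_score(minutes_per_run: float, runs_per_year: int,
                     is_consistent: bool) -> str:
    """Rough screen for automation candidates based on annual hours spent."""
    if not is_consistent:
        # Judgment-based tasks fail the consistency filter outright
        return "no (requires judgment)"

    annual_hours = minutes_per_run * runs_per_year / 60

    if annual_hours >= 20:
        return "strongly yes"
    if annual_hours >= 8:
        return "yes"
    return "probably not"


# Moving ETL output files: 5 minutes, every workday, fully rule-based
print(automation_score(5, 250, True))    # strongly yes

# Deciding which vendor to use: 45 minutes, monthly, requires judgment
print(automation_score(45, 12, False))   # no (requires judgment)
```

The same thresholds reproduce the rest of the matrix: the weekly status report (20 min × 52 weeks ≈ 17 hours/year) lands in the plain "yes" band.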

The ROI Calculation

Be practical about the investment. A script that saves 25 minutes per day, 5 days per week, saves approximately 100 hours per year — for one person. If Priya can write the script in 4 hours, the ROI is a 25:1 ratio in the first year alone, with zero additional investment in subsequent years.

The formula:

Annual hours saved = (minutes saved per occurrence × occurrences per year) ÷ 60
Payback time (years) = hours to build ÷ annual hours saved

For Marcus's morning routine:

Annual hours saved = (25 min × 250 workdays) ÷ 60 ≈ 104 hours
Payback time = 4 hours ÷ 104 hours ≈ 0.04 years (about 10 working days)

That is a very compelling case.
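The same arithmetic works as a small helper function. A sketch, assuming the 250-workday year used in the example above:

```python
def payback_working_days(minutes_saved_per_day: float,
                         hours_to_build: float,
                         workdays_per_year: int = 250) -> float:
    """Return how many working days until an automation script pays for itself."""
    annual_hours_saved = minutes_saved_per_day * workdays_per_year / 60
    payback_years = hours_to_build / annual_hours_saved
    return payback_years * workdays_per_year


# Marcus's routine: 25 minutes saved per day, 4 hours to build the script
print(round(payback_working_days(25, 4), 1))   # 9.6 — about 10 working days
```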


17.2 The os and shutil Modules

Python's standard library includes two modules that handle virtually everything you need for file system operations: os and shutil.

os: The Operating System Interface

The os module lets Python interact with the file system and operating system.

import os

# Get the current working directory
current_dir = os.getcwd()
print(current_dir)   # e.g., C:\Users\Marcus\Documents

# List files in a directory
file_list = os.listdir("/path/to/exports")
print(file_list)   # ['chicago.csv', 'nashville.csv', 'readme.txt']

# Create a directory
os.mkdir("/path/to/new_folder")

# Create nested directories (like mkdir -p)
os.makedirs("/path/to/deep/nested/folder", exist_ok=True)
# exist_ok=True means: don't error if the folder already exists

# Check if a path exists
if os.path.exists("/path/to/some/file.csv"):
    print("File is there")

# Check if it's a file vs. a directory
os.path.isfile("/path/to/some/file.csv")   # True
os.path.isdir("/path/to/some/folder")      # True

# Get file metadata
file_stats = os.stat("/path/to/file.csv")
print(file_stats.st_size)    # File size in bytes
print(file_stats.st_mtime)   # Last modified timestamp (seconds since epoch)

# Rename a file
os.rename("/path/to/old_name.csv", "/path/to/new_name.csv")

# Remove a file
os.remove("/path/to/unwanted_file.txt")

# Remove an empty directory
os.rmdir("/path/to/empty_folder")

The os.path submodule handles path string manipulation:

import os

# Build a path correctly for the current OS
# (backslash on Windows, forward slash on Mac/Linux)
full_path = os.path.join("exports", "2024-01", "chicago.csv")
# On Windows:   exports\2024-01\chicago.csv
# On Mac/Linux: exports/2024-01/chicago.csv
# (join only inserts the OS separator between parts; it never rewrites
#  separators you already typed inside a component)

# Split a path into directory and filename
directory, filename = os.path.split("/exports/2024-01/chicago.csv")
# directory = "/exports/2024-01"
# filename = "chicago.csv"

# Split filename from extension
name, extension = os.path.splitext("chicago.csv")
# name = "chicago", extension = ".csv"

# Get the parent directory
parent = os.path.dirname("/exports/2024-01/chicago.csv")
# parent = "/exports/2024-01"

shutil: High-Level File Operations

While os handles individual operations, shutil (shell utilities) handles the more involved operations: copying, moving, archiving entire directory trees.

import shutil

# Copy a file (destination can be a file or directory)
shutil.copy("/source/report.pdf", "/destination/report.pdf")

# Copy with metadata preserved (timestamps, permissions)
shutil.copy2("/source/report.pdf", "/destination/report.pdf")

# Copy an entire directory tree
shutil.copytree("/source/folder", "/destination/folder")

# Move a file or directory
shutil.move("/source/file.csv", "/destination/file.csv")

# Delete an entire directory tree (including all contents)
# WARNING: this is permanent. Use carefully.
shutil.rmtree("/path/to/folder_to_delete")

# Create a ZIP archive from a directory
# Creates archive.zip containing all contents of /source_dir
shutil.make_archive(
    base_name="/output/archive",    # output path (no extension)
    format="zip",                   # "zip", "tar", "gztar", etc.
    root_dir="/source_dir",         # directory to archive
)

# Unpack an archive
shutil.unpack_archive("/path/to/archive.zip", "/destination_folder")

17.3 pathlib: The Modern Way to Handle Paths

While os.path works, Python 3.4 introduced pathlib as the more ergonomic replacement. Rather than treating paths as strings and calling functions on them, pathlib gives you Path objects that you can work with directly.

Chapter 9 introduced pathlib briefly. This chapter is where it becomes indispensable, because file automation scripts involve dozens of path manipulations and the difference in readability matters.

The Path Object

from pathlib import Path

# Create a Path object
exports_dir = Path("/data/exports")
report_file = Path("/data/reports/q1_summary.xlsx")

# Path arithmetic with the / operator
# This is pathlib's best feature
data_dir = Path("/data")
daily_file = data_dir / "exports" / "2024-01-15" / "chicago.csv"
# Result: Path('/data/exports/2024-01-15/chicago.csv')

# Check existence
if daily_file.exists():
    print("File is present")

if daily_file.is_file():
    print("It's a file, not a directory")

# Get components
print(daily_file.name)        # 'chicago.csv'
print(daily_file.stem)        # 'chicago'
print(daily_file.suffix)      # '.csv'
print(daily_file.parent)      # Path('/data/exports/2024-01-15')
print(daily_file.parts)       # ('/', 'data', 'exports', '2024-01-15', 'chicago.csv')

# Convert to string when needed
print(str(daily_file))        # '/data/exports/2024-01-15/chicago.csv'

Iterating Over Files

from pathlib import Path

exports_dir = Path("/data/exports")

# List all files in a directory
for file_path in exports_dir.iterdir():
    if file_path.is_file():
        print(file_path.name)

# Find files matching a pattern (glob)
for csv_file in exports_dir.glob("*.csv"):
    print(csv_file.name)

# Find files recursively (in all subdirectories)
for csv_file in exports_dir.rglob("*.csv"):
    print(csv_file)

# Find files matching a more specific pattern
for report_file in exports_dir.glob("*_regional_*.xlsx"):
    print(report_file.name)

Creating Directories and Files

from pathlib import Path

archive_dir = Path("/data/archive/2024-01")

# Create directory and all parents (equivalent to mkdir -p)
archive_dir.mkdir(parents=True, exist_ok=True)

# Write text content to a file
output_file = archive_dir / "manifest.txt"
output_file.write_text("Archived: chicago.csv, nashville.csv\n")

# Read a file's text content
content = output_file.read_text()
print(content)

# Get file stats
stats = output_file.stat()
print(f"Size: {stats.st_size} bytes")
print(f"Last modified: {stats.st_mtime}")

Reading File Metadata with pathlib

import datetime
from pathlib import Path

def get_file_info(file_path: Path) -> dict:
    """Return a dictionary of useful metadata about a file."""
    stats = file_path.stat()

    modification_time = datetime.datetime.fromtimestamp(stats.st_mtime)
    creation_time = datetime.datetime.fromtimestamp(stats.st_ctime)

    return {
        "name": file_path.name,
        "extension": file_path.suffix.lower(),
        "size_bytes": stats.st_size,
        "size_kb": round(stats.st_size / 1024, 1),
        "modified": modification_time,
        "modified_date": modification_time.date(),
        "created": creation_time,
        "parent_folder": str(file_path.parent),
    }


# Use it
exports_dir = Path("/data/exports")
for csv_file in exports_dir.glob("*.csv"):
    info = get_file_info(csv_file)
    print(f"{info['name']:<30} {info['size_kb']:>8.1f} KB  {info['modified_date']}")

17.4 Organizing Files into Folders

One of the most common file automation tasks is sorting a folder full of mixed files into organized subdirectories. Here is how to approach it systematically.

Sorting by File Type

import shutil
from pathlib import Path


# Define which file types belong in which folders
FOLDER_RULES = {
    "spreadsheets": [".xlsx", ".xls", ".csv", ".ods"],
    "documents": [".docx", ".doc", ".pdf", ".txt", ".rtf"],
    "presentations": [".pptx", ".ppt"],
    "images": [".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff"],
    "archives": [".zip", ".tar", ".gz", ".7z", ".rar"],
    "data": [".json", ".xml", ".parquet", ".feather"],
}


def build_extension_map(folder_rules: dict) -> dict:
    """
    Reverse the folder_rules dictionary so extension maps to folder name.

    {'spreadsheets': ['.xlsx', '.csv']} becomes {'.xlsx': 'spreadsheets', ...}
    """
    extension_map = {}
    for folder_name, extensions in folder_rules.items():
        for ext in extensions:
            extension_map[ext.lower()] = folder_name
    return extension_map


def organize_folder_by_type(
    source_dir: Path,
    dry_run: bool = True,
) -> list[dict]:
    """
    Sort files in source_dir into subdirectories based on file type.

    Args:
        source_dir: The folder to organize.
        dry_run: If True, print what would happen without moving anything.
                 Set to False to actually move files.

    Returns:
        List of action dicts with keys: file, destination, status
    """
    extension_map = build_extension_map(FOLDER_RULES)
    actions = []

    for file_path in source_dir.iterdir():
        # Skip directories and hidden files
        if file_path.is_dir():
            continue
        if file_path.name.startswith("."):
            continue

        extension = file_path.suffix.lower()
        destination_folder_name = extension_map.get(extension, "misc")

        destination_dir = source_dir / destination_folder_name
        destination_file = destination_dir / file_path.name

        action = {
            "file": file_path.name,
            "source": str(file_path),
            "destination": str(destination_file),
            "folder": destination_folder_name,
            "status": "pending",
        }

        if not dry_run:
            destination_dir.mkdir(exist_ok=True)

            # Handle naming conflicts
            if destination_file.exists():
                destination_file = resolve_naming_conflict(destination_file)
                action["destination"] = str(destination_file)
                action["status"] = "moved_renamed"
            else:
                action["status"] = "moved"

            shutil.move(str(file_path), str(destination_file))

        actions.append(action)

    return actions


def resolve_naming_conflict(file_path: Path) -> Path:
    """
    If file_path already exists, add a numeric suffix until the name is unique.

    chicago.csv -> chicago_1.csv -> chicago_2.csv, etc.
    """
    counter = 1
    stem = file_path.stem
    suffix = file_path.suffix
    parent = file_path.parent

    while file_path.exists():
        file_path = parent / f"{stem}_{counter}{suffix}"
        counter += 1

    return file_path

Sorting by Date

Often more useful than sorting by type is sorting by the file's modification date, which groups work done around the same time.

import datetime
import shutil
from pathlib import Path


def organize_folder_by_date(
    source_dir: Path,
    date_format: str = "%Y-%m",
    dry_run: bool = True,
) -> list[dict]:
    """
    Move files into subdirectories named by the file's modification month.

    Creates subfolders named like '2024-01', '2024-02', etc.

    Args:
        source_dir: The folder to organize.
        date_format: strftime format for the subfolder names.
                     "%Y-%m" gives "2024-01"; "%Y" gives "2024".
        dry_run: If True, show actions without moving files.

    Returns:
        List of action dicts.
    """
    actions = []

    for file_path in source_dir.iterdir():
        if file_path.is_dir():
            continue
        if file_path.name.startswith("."):
            continue

        # Get modification date
        modified_timestamp = file_path.stat().st_mtime
        modified_date = datetime.datetime.fromtimestamp(modified_timestamp)
        folder_name = modified_date.strftime(date_format)

        destination_dir = source_dir / folder_name
        destination_file = destination_dir / file_path.name

        action = {
            "file": file_path.name,
            "date_folder": folder_name,
            "destination": str(destination_file),
            "status": "pending",
        }

        if not dry_run:
            destination_dir.mkdir(exist_ok=True)
            if destination_file.exists():
                destination_file = resolve_naming_conflict(destination_file)
                action["status"] = "moved_renamed"
            else:
                action["status"] = "moved"
            shutil.move(str(file_path), str(destination_file))

        actions.append(action)

    return actions

17.5 Bulk Renaming

File naming conventions drift over time. Files arrive from different systems with different naming styles. Bulk renaming scripts restore order.

Common Renaming Operations

import re
import datetime
from pathlib import Path


def standardize_filename(filename: str) -> str:
    """
    Convert a filename to a standard format:
    - Lowercase
    - Spaces and special characters replaced with underscores
    - Multiple underscores collapsed to one

    Examples:
        'Q1 Sales Report (FINAL).xlsx' -> 'q1_sales_report_final.xlsx'
        'Chicago-RegionalData v2.csv'  -> 'chicago_regionaldata_v2.csv'
    """
    stem, suffix = filename.rsplit(".", 1) if "." in filename else (filename, "")

    # Lowercase
    stem = stem.lower()

    # Replace non-alphanumeric characters (except underscores) with underscores
    stem = re.sub(r"[^a-z0-9_]", "_", stem)

    # Collapse multiple underscores
    stem = re.sub(r"_+", "_", stem)

    # Strip leading/trailing underscores
    stem = stem.strip("_")

    if suffix:
        return f"{stem}.{suffix.lower()}"
    return stem


def add_date_prefix(filename: str, date: datetime.date | None = None) -> str:
    """
    Add an ISO-format date prefix to a filename.

    Args:
        filename: The original filename.
        date: The date to use. Defaults to today if not provided.

    Examples:
        'chicago_sales.csv', date=2024-01-15 -> '2024-01-15_chicago_sales.csv'
    """
    if date is None:
        date = datetime.date.today()

    date_prefix = date.strftime("%Y-%m-%d")
    return f"{date_prefix}_{filename}"


def reformat_date_in_filename(filename: str) -> str:
    """
    Find dates in various formats within a filename and normalize to ISO 8601.

    Handles US-style dates: MM-DD-YYYY, MM/DD/YYYY, MMDDYYYY
    Converts to: YYYY-MM-DD
    (Day-first dates like DD-MM-YYYY are ambiguous here and would be misread.)

    Examples:
        'report_01-15-2024.csv'  -> 'report_2024-01-15.csv'
        'data_01152024.xlsx'     -> 'data_2024-01-15.xlsx'
    """
    # Pattern: MM-DD-YYYY or MM/DD/YYYY
    pattern_mdy = re.compile(r"(\d{1,2})[-/](\d{1,2})[-/](\d{4})")

    def replace_mdy(match):
        month, day, year = match.groups()
        return f"{year}-{month.zfill(2)}-{day.zfill(2)}"

    result = pattern_mdy.sub(replace_mdy, filename)

    # Pattern: MMDDYYYY (8 digits with no separator — requires careful matching)
    pattern_mmddyyyy = re.compile(r"(\d{2})(\d{2})(\d{4})")

    def replace_mmddyyyy(match):
        month, day, year = match.groups()
        # Validate it looks like a real date before replacing
        try:
            datetime.date(int(year), int(month), int(day))
            return f"{year}-{month}-{day}"
        except ValueError:
            return match.group(0)  # Not a valid date — leave it alone

    result = pattern_mmddyyyy.sub(replace_mmddyyyy, result)

    return result


def bulk_rename_preview(source_dir: Path, rename_function) -> list[dict]:
    """
    Preview the result of applying a rename function to all files in source_dir.

    Does not move anything. Returns a list of before/after comparisons.
    """
    previews = []

    for file_path in sorted(source_dir.iterdir()):
        if file_path.is_dir():
            continue

        new_name = rename_function(file_path.name)
        previews.append({
            "before": file_path.name,
            "after": new_name,
            "changed": file_path.name != new_name,
        })

    return previews


# Preview before/after for a dry run
def print_rename_preview(previews: list[dict]) -> None:
    """Print a formatted before/after table for rename operations."""
    changes = [p for p in previews if p["changed"]]
    unchanged = len(previews) - len(changes)

    print(f"Rename preview: {len(changes)} files will change, {unchanged} unchanged\n")
    print(f"{'BEFORE':<45} {'AFTER':<45}")
    print("-" * 91)

    for preview in changes:
        print(f"{preview['before']:<45} {preview['after']:<45}")

17.6 Copying, Moving, and Archiving

The mechanics of moving files around and preserving them in archives.

Moving Files with Conflict Handling

import shutil
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def move_file_safely(
    source: Path,
    destination_dir: Path,
    on_conflict: str = "rename",
) -> Path:
    """
    Move a file to a destination directory, handling name conflicts.

    Args:
        source: Path to the file to move.
        destination_dir: Directory to move it into.
        on_conflict: What to do if the destination file already exists.
                     "rename" (default): add numeric suffix to avoid overwrite
                     "overwrite": replace the existing file
                     "skip": do nothing and return the existing file path
                     "error": raise FileExistsError

    Returns:
        The final path of the moved (or skipped) file.
    """
    destination_dir.mkdir(parents=True, exist_ok=True)
    destination = destination_dir / source.name

    if destination.exists():
        if on_conflict == "rename":
            destination = resolve_naming_conflict(destination)
            logger.info(f"Conflict resolved: {source.name} -> {destination.name}")
        elif on_conflict == "overwrite":
            logger.info(f"Overwriting existing file: {destination}")
        elif on_conflict == "skip":
            logger.info(f"Skipping {source.name}: already exists at {destination}")
            return destination
        elif on_conflict == "error":
            raise FileExistsError(
                f"Destination already exists: {destination}"
            )

    shutil.move(str(source), str(destination))
    logger.info(f"Moved: {source} -> {destination}")
    return destination


def resolve_naming_conflict(file_path: Path) -> Path:
    """Add a numeric suffix until the filename is unique."""
    counter = 1
    stem = file_path.stem
    suffix = file_path.suffix
    parent = file_path.parent

    while file_path.exists():
        file_path = parent / f"{stem}_{counter}{suffix}"
        counter += 1

    return file_path

Creating Archives

import shutil
import datetime
from pathlib import Path


def archive_folder(
    source_dir: Path,
    archive_dir: Path,
    archive_name: str = None,
    format: str = "zip",
) -> Path:
    """
    Compress source_dir into an archive in archive_dir.

    Args:
        source_dir: The folder to archive.
        archive_dir: Where to put the archive file.
        archive_name: Name for the archive (without extension).
                      Defaults to source folder name + timestamp.
        format: "zip", "tar", "gztar", or "bztar"

    Returns:
        Path to the created archive file.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)

    if archive_name is None:
        timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
        archive_name = f"{source_dir.name}_{timestamp}"

    archive_base = archive_dir / archive_name

    created_path = shutil.make_archive(
        base_name=str(archive_base),
        format=format,
        root_dir=str(source_dir.parent),
        base_dir=source_dir.name,
    )

    return Path(created_path)


def archive_old_files(
    source_dir: Path,
    archive_dir: Path,
    older_than_days: int = 30,
    dry_run: bool = True,
) -> list[Path]:
    """
    Move files older than a threshold into an archive directory.

    Args:
        source_dir: Folder to check for old files.
        archive_dir: Where to move old files.
        older_than_days: Files older than this many days will be moved.
        dry_run: If True, report what would happen without moving anything.

    Returns:
        List of files that were (or would be) moved.
    """

    cutoff = datetime.datetime.now() - datetime.timedelta(days=older_than_days)
    files_to_archive = []

    for file_path in source_dir.iterdir():
        if file_path.is_dir():
            continue

        modified = datetime.datetime.fromtimestamp(file_path.stat().st_mtime)

        if modified < cutoff:
            files_to_archive.append(file_path)

            if not dry_run:
                move_file_safely(file_path, archive_dir, on_conflict="rename")

    return files_to_archive

17.7 Generating Folder Structures Automatically

When a new project starts, creating the standard folder structure manually is tedious and prone to inconsistency. Python can generate it in seconds.

from pathlib import Path


# Define a project folder template as a nested dictionary
# Empty dicts represent leaf folders (optionally seeded with a .gitkeep file)
PROJECT_TEMPLATE = {
    "data": {
        "raw": {},
        "processed": {},
        "exports": {},
    },
    "reports": {
        "weekly": {},
        "monthly": {},
    },
    "scripts": {},
    "archive": {},
    "docs": {},
}


def create_folder_structure(
    base_dir: Path,
    template: dict,
    create_gitkeep: bool = True,
) -> list[Path]:
    """
    Recursively create a folder structure from a template dictionary.

    Args:
        base_dir: Root directory to create the structure under.
        template: Nested dict where keys are folder names and values
                  are sub-templates (empty dict for leaf folders).
        create_gitkeep: If True, add a .gitkeep file to empty directories
                        so they are tracked by git.

    Returns:
        List of all created directory paths.
    """
    created = []

    for folder_name, sub_template in template.items():
        folder_path = base_dir / folder_name
        folder_path.mkdir(parents=True, exist_ok=True)
        created.append(folder_path)

        if sub_template:
            # Recurse into sub-template
            sub_created = create_folder_structure(
                folder_path, sub_template, create_gitkeep
            )
            created.extend(sub_created)
        elif create_gitkeep:
            # Leaf folder: create a .gitkeep placeholder
            gitkeep = folder_path / ".gitkeep"
            if not gitkeep.exists():
                gitkeep.touch()

    return created


def setup_regional_project(
    base_dir: Path,
    project_name: str,
    regions: list[str],
) -> Path:
    """
    Create a standard Acme Corp regional analysis project structure.

    Creates a top-level project folder with subfolders for each region,
    plus shared data and reports directories.

    Returns:
        The created project root directory.
    """
    project_root = base_dir / project_name

    # Build a dynamic template based on the regions list
    regional_template = {region: {} for region in regions}

    template = {
        "data": {
            "raw": regional_template.copy(),
            "processed": {},
        },
        "reports": {
            "drafts": {},
            "final": {},
        },
        "scripts": {},
        "archive": {},
    }

    created = create_folder_structure(project_root, template)

    # Create a README in the project root
    readme_content = f"""# {project_name}

Created: {__import__('datetime').date.today()}
Regions: {', '.join(regions)}

## Structure
- data/raw/       Source data files by region
- data/processed/ Cleaned and joined data
- reports/        Analysis outputs
- scripts/        Python scripts for this analysis
- archive/        Old versions and superseded files
"""
    (project_root / "README.md").write_text(readme_content)

    print(f"Created project structure: {project_root}")
    print(f"  {len(created)} directories created")

    return project_root

17.8 Folder Watching with watchdog

All the operations above are run on-demand: you trigger the script when you want things organized. watchdog goes further — it monitors a folder and triggers actions the moment new files arrive, without any human involvement.

This is the pattern behind many real-world automation workflows: files dropped in a folder are automatically processed and moved.

Installing watchdog

pip install watchdog

A Basic Folder Watcher

import time
import logging
from pathlib import Path
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

logging.basicConfig(level=logging.INFO, format="%(asctime)s — %(message)s")
logger = logging.getLogger(__name__)


class CSVArrivalHandler(FileSystemEventHandler):
    """
    Handles file system events in the watched directory.

    Triggers processing when a CSV file is created or moved into the folder.
    """

    def __init__(self, processing_function):
        super().__init__()
        self.processing_function = processing_function

    def on_created(self, event):
        """Called when a file or directory is created."""
        if event.is_directory:
            return

        file_path = Path(event.src_path)

        # Only process CSV files
        if file_path.suffix.lower() == ".csv":
            logger.info(f"New CSV detected: {file_path.name}")
            self.processing_function(file_path)

    def on_moved(self, event):
        """Called when a file or directory is moved into the watched folder."""
        if event.is_directory:
            return

        dest_path = Path(event.dest_path)

        if dest_path.suffix.lower() == ".csv":
            logger.info(f"CSV moved in: {dest_path.name}")
            self.processing_function(dest_path)


def process_incoming_csv(file_path: Path) -> None:
    """
    Process a newly arrived CSV file.

    This function contains whatever business logic you need:
    validate columns, move to processing folder, send a notification, etc.
    """
    logger.info(f"Processing: {file_path.name}")

    # Example: move to a 'processing' subdirectory
    processing_dir = file_path.parent / "processing"
    processing_dir.mkdir(exist_ok=True)

    destination = processing_dir / file_path.name
    import shutil
    shutil.move(str(file_path), str(destination))
    logger.info(f"Moved to processing: {destination}")


def watch_folder(folder_path: Path, processing_function) -> None:
    """
    Watch a folder for new CSV files and process them automatically.

    Runs indefinitely until interrupted with Ctrl+C.

    Args:
        folder_path: The directory to monitor.
        processing_function: Called with the Path of each new CSV file.
    """
    event_handler = CSVArrivalHandler(processing_function)
    observer = Observer()
    observer.schedule(event_handler, str(folder_path), recursive=False)
    observer.start()

    logger.info(f"Watching for new CSV files in: {folder_path}")
    logger.info("Press Ctrl+C to stop.\n")

    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        logger.info("Stopping watcher...")
        observer.stop()

    observer.join()
    logger.info("Watcher stopped.")


if __name__ == "__main__":
    watch_folder(
        folder_path=Path("/data/exports/incoming"),
        processing_function=process_incoming_csv,
    )

How watchdog Works

watchdog uses the operating system's native file change notification mechanism (inotify on Linux, FSEvents on macOS, ReadDirectoryChangesW on Windows). This is far more efficient than polling — it does not repeatedly check every file; it receives an event the instant the OS detects a change.

The Observer object is the watcher. You attach one or more FileSystemEventHandler subclasses to it, each watching a path. When a file system event occurs (create, modify, delete, move), the corresponding method on your handler is called.

The key events:

- on_created(event) — a new file or directory appeared
- on_deleted(event) — a file or directory was removed
- on_modified(event) — a file's contents changed
- on_moved(event) — a file was renamed or moved (including moved in from outside)


17.9 A Complete Morning File Organizer

Let's combine everything into a single, production-quality script that Marcus (or Priya's automated scheduler) can run every morning.

The script will:

1. Check that the four expected regional CSVs are present
2. Move those CSVs to the day's processing folder
3. Archive any files older than one day
4. Log everything

"""
monthly_file_organizer.py

Morning file organization script for Acme Corp data exports.
Designed to run daily via Task Scheduler or cron.

Usage:
    python morning_organizer.py
    python morning_organizer.py --dry-run
"""
import argparse
import datetime
import logging
import shutil
import sys
from pathlib import Path

# ── CONFIGURATION ─────────────────────────────────────────────────────────────
EXPORTS_DIR = Path("/data/acme/exports")
PROCESSING_DIR = Path("/data/acme/processing")
ARCHIVE_DIR = Path("/data/acme/archive")
LOG_DIR = Path("/data/acme/logs")

EXPECTED_REGIONAL_FILES = [
    "chicago_sales.csv",
    "nashville_sales.csv",
    "cincinnati_sales.csv",
    "stlouis_sales.csv",
]

ARCHIVE_AFTER_DAYS = 1  # Archive exports older than 1 day


# ── LOGGING SETUP ─────────────────────────────────────────────────────────────
def setup_logging(log_dir: Path) -> logging.Logger:
    """Configure logging to both console and a daily log file."""
    log_dir.mkdir(parents=True, exist_ok=True)
    today = datetime.date.today().strftime("%Y-%m-%d")
    log_file = log_dir / f"organizer_{today}.log"

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s  %(levelname)-8s  %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        handlers=[
            logging.FileHandler(log_file),
            logging.StreamHandler(sys.stdout),
        ],
    )
    return logging.getLogger(__name__)


# ── CORE FUNCTIONS ─────────────────────────────────────────────────────────────
def check_expected_files(
    source_dir: Path,
    expected_files: list[str],
    logger: logging.Logger,
) -> tuple[list[str], list[str]]:
    """
    Check which expected files are present and which are missing.

    Returns:
        (present_files, missing_files) — two lists of filenames
    """
    present = []
    missing = []

    for filename in expected_files:
        if (source_dir / filename).exists():
            present.append(filename)
            logger.info(f"  PRESENT:  {filename}")
        else:
            missing.append(filename)
            logger.warning(f"  MISSING:  {filename}")

    return present, missing


def move_files_to_processing(
    source_dir: Path,
    destination_dir: Path,
    filenames: list[str],
    logger: logging.Logger,
    dry_run: bool = False,
) -> list[Path]:
    """
    Move a list of files from source_dir to destination_dir.

    Returns:
        List of destination Paths for moved files.
    """
    destination_dir.mkdir(parents=True, exist_ok=True)
    moved = []

    for filename in filenames:
        source_file = source_dir / filename
        dest_file = destination_dir / filename

        if not source_file.exists():
            logger.warning(f"Cannot move (not found): {source_file}")
            continue

        if dest_file.exists():
            # Avoid overwrite: add a timestamp suffix
            timestamp = datetime.datetime.now().strftime("%H%M%S")
            stem = source_file.stem
            suffix = source_file.suffix
            dest_file = destination_dir / f"{stem}_{timestamp}{suffix}"
            logger.info(f"  Conflict resolved: -> {dest_file.name}")

        if not dry_run:
            shutil.move(str(source_file), str(dest_file))
            logger.info(f"  Moved: {filename} -> {dest_file}")
        else:
            logger.info(f"  [DRY RUN] Would move: {filename} -> {dest_file}")

        moved.append(dest_file)

    return moved


def archive_old_exports(
    source_dir: Path,
    archive_dir: Path,
    older_than_days: int,
    logger: logging.Logger,
    dry_run: bool = False,
) -> list[Path]:
    """
    Move files older than the threshold into the archive directory,
    organized into date-named subfolders.
    """
    cutoff = datetime.datetime.now() - datetime.timedelta(days=older_than_days)
    archived = []

    for file_path in source_dir.iterdir():
        if file_path.is_dir():
            continue

        modified = datetime.datetime.fromtimestamp(file_path.stat().st_mtime)

        if modified < cutoff:
            date_folder = modified.strftime("%Y-%m-%d")
            dest_dir = archive_dir / date_folder
            dest_file = dest_dir / file_path.name

            if not dry_run:
                dest_dir.mkdir(parents=True, exist_ok=True)
                if dest_file.exists():
                    # Already archived (can happen on reruns)
                    logger.info(f"  Already archived, skipping: {file_path.name}")
                    continue
                shutil.move(str(file_path), str(dest_file))
                logger.info(f"  Archived: {file_path.name} -> {date_folder}/")
            else:
                logger.info(
                    f"  [DRY RUN] Would archive: {file_path.name} -> {date_folder}/"
                )

            archived.append(dest_file)

    return archived


# ── MAIN ORCHESTRATION ─────────────────────────────────────────────────────────
def run_morning_organizer(dry_run: bool = False) -> None:
    """
    Full morning file organization routine.

    1. Archive files from yesterday
    2. Check for today's expected regional files
    3. Move present files to processing
    4. Report missing files
    """
    logger = setup_logging(LOG_DIR)
    today = datetime.date.today().strftime("%Y-%m-%d")
    today_processing_dir = PROCESSING_DIR / today

    logger.info("=" * 60)
    logger.info(f"Morning File Organizer — {today}")
    if dry_run:
        logger.info("DRY RUN MODE: no files will be moved")
    logger.info("=" * 60)

    # Step 1: Archive old files
    logger.info("\n[1/3] Archiving old export files...")
    archived = archive_old_exports(
        source_dir=EXPORTS_DIR,
        archive_dir=ARCHIVE_DIR,
        older_than_days=ARCHIVE_AFTER_DAYS,
        logger=logger,
        dry_run=dry_run,
    )
    logger.info(f"      {len(archived)} files archived")

    # Step 2: Check for expected files
    logger.info("\n[2/3] Checking for expected regional files...")
    present_files, missing_files = check_expected_files(
        source_dir=EXPORTS_DIR,
        expected_files=EXPECTED_REGIONAL_FILES,
        logger=logger,
    )
    logger.info(
        f"      {len(present_files)}/{len(EXPECTED_REGIONAL_FILES)} expected files present"
    )

    # Step 3: Move present files to processing
    if present_files:
        logger.info(f"\n[3/3] Moving files to today's processing folder...")
        moved = move_files_to_processing(
            source_dir=EXPORTS_DIR,
            destination_dir=today_processing_dir,
            filenames=present_files,
            logger=logger,
            dry_run=dry_run,
        )
        logger.info(f"      {len(moved)} files moved to {today_processing_dir}")
    else:
        logger.warning("\n[3/3] No expected files found — nothing to move")

    # Final summary
    logger.info("\n" + "=" * 60)
    logger.info("SUMMARY")
    logger.info(f"  Archived:    {len(archived)} files")
    logger.info(
        f"  Present:     {len(present_files)}/{len(EXPECTED_REGIONAL_FILES)} regional CSVs"
    )
    logger.info(f"  Missing:     {missing_files or 'None'}")
    logger.info(f"  Processing:  {today_processing_dir}")
    logger.info("=" * 60)

    if missing_files:
        logger.warning(
            f"\nACTION REQUIRED: {len(missing_files)} regional files are missing."
        )
        for f in missing_files:
            logger.warning(f"  - {f}")
        sys.exit(1)  # Non-zero exit signals failure to the scheduler


# ── ENTRY POINT ───────────────────────────────────────────────────────────────
if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Acme Corp morning file organization script"
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Print what would happen without moving any files",
    )
    args = parser.parse_args()
    run_morning_organizer(dry_run=args.dry_run)

Running with --dry-run first is a professional discipline. Always verify what a file-moving script will do before it does it.
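The timestamp-suffix conflict strategy inside move_files_to_processing is worth internalizing on its own. Here is a self-contained sketch of just that technique; the helper name, paths, and file contents are illustrative:

```python
import datetime
import shutil
import tempfile
from pathlib import Path


def safe_move(source: Path, dest_dir: Path) -> Path:
    """Move source into dest_dir, adding a timestamp suffix on name collision."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / source.name
    if dest.exists():
        stamp = datetime.datetime.now().strftime("%H%M%S")
        dest = dest_dir / f"{source.stem}_{stamp}{source.suffix}"
    shutil.move(str(source), str(dest))
    return dest


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "in").mkdir()
    (root / "out").mkdir()
    (root / "in" / "report.csv").write_text("new data")
    (root / "out" / "report.csv").write_text("old data")  # pre-existing conflict
    moved = safe_move(root / "in" / "report.csv", root / "out")
    print(moved.name)  # prints the timestamped name, e.g. report_143022.csv
```

The existing file is never overwritten; the incoming file simply lands under a disambiguated name, and the log records the rename.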


17.10 Running Scripts on a Schedule

Writing the script is only half the work. For daily automation to be truly hands-off, the script must run without anyone triggering it.

Windows Task Scheduler

Task Scheduler is built into Windows. To schedule a Python script:

  1. Open Task Scheduler (search for it in the Start menu)
  2. Click "Create Basic Task"
  3. Set the trigger: Daily, at 7:00 AM
  4. Set the action: Start a program
     • Program: C:\Users\Marcus\AppData\Local\Programs\Python\Python311\python.exe
     • Arguments: C:\data\scripts\morning_organizer.py
     • Start in: C:\data\scripts\

From the command line (using schtasks):

schtasks /create /tn "Acme Morning Organizer" ^
    /tr "python C:\data\scripts\morning_organizer.py" ^
    /sc DAILY /st 07:00
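Once the task exists, you can confirm it registered and trigger a manual test run with two more schtasks subcommands (the task name matches the one created above):

```shell
schtasks /query /tn "Acme Morning Organizer"
schtasks /run /tn "Acme Morning Organizer"
```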

macOS/Linux: cron

Open the cron table editor:

crontab -e

Add a line:

# Run morning organizer at 7:00 AM on weekdays
0 7 * * 1-5 /usr/local/bin/python3 /data/scripts/morning_organizer.py

The five schedule fields, in order, are minute, hour, day of month, month, and day of week, followed by the command to run.
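A few more schedule patterns for reference; the commands themselves are omitted here, only the timing fields are shown:

```shell
# minute  hour  day-of-month  month  day-of-week
0 7 * * 1-5     # 7:00 AM, Monday through Friday
30 18 * * 5     # 6:30 PM every Friday
0 */4 * * *     # every four hours, on the hour
```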


Summary

  • Start with the automation audit: filter candidates by time × frequency × consistency
  • os handles file system operations; os.path handles path string manipulation
  • pathlib.Path is the modern, recommended approach — use the / operator to build paths
  • shutil handles copying, moving, and archiving at a higher level
  • Conflict handling is non-optional: every file-moving script must handle the case where the destination already exists
  • Dry-run mode is professional practice: verify before committing
  • watchdog enables reactive automation — trigger actions when files arrive without polling
  • Build scripts to log everything; logs are how you debug problems that happen at 7 AM while you're still asleep
  • Schedule with Windows Task Scheduler or cron to make automation truly hands-off

Chapter 18: Working with PDFs and Word Documents →