Answers to Selected Exercises

Python for Business for Beginners: Coding for Every Person

This appendix provides worked solutions to the starred (*) exercises from each chapter. Starred exercises are the ones most instructive to complete and review. For Chapters 1–10, solutions are presented in full detail with explanation. For Chapters 11–40, solutions focus on the key insight and most educational aspects of each problem.

How to use this appendix: Try each exercise yourself first. If you are stuck for more than 20 minutes, look at the approach here — not the final answer — and try again. Understanding why a solution works is more valuable than seeing the solution. Pay attention to the "Common Mistakes" sections; they reflect errors real students make.


Part I: Python Foundations (Chapters 1–8)


Chapter 1: Getting Started with Python

Exercise 1.1* — Run a Python script that prints your name, today's date, and a message to a potential business client.

Solution:

# my_intro.py
import datetime

name = "Your Name"
today = datetime.date.today()
message = "I help businesses unlock insights from their data using Python."

print(f"Hello! My name is {name}.")
print(f"Today is {today.strftime('%B %d, %Y')}.")
print(message)

Output:

Hello! My name is Your Name.
Today is February 24, 2026.
I help businesses unlock insights from their data using Python.

Explanation: datetime.date.today() returns today's date as a date object. The .strftime() method formats it as a human-readable string. %B is the full month name, %d the zero-padded day, and %Y the four-digit year. F-strings interpolate variables directly into the string using curly braces.

Common Mistakes:
- Forgetting import datetime — Python doesn't know what datetime is without it.
- Writing datetime.today() instead of datetime.date.today() — the today() method lives on the date class inside the datetime module.
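You can verify both points in a few lines. This sketch, using only the standard library, shows where today() actually lives:

```python
import datetime

d = datetime.date.today()          # works: today() is on the date class
print(type(d).__name__)            # -> date

dt = datetime.datetime.today()     # also valid: returns a datetime (date + time)
print(type(dt).__name__)           # -> datetime

# The common mistake — calling today() on the module itself — fails:
try:
    datetime.today()
except AttributeError as e:
    print("AttributeError:", e)
```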


Exercise 1.2* — Use the Python REPL to calculate the following: If Acme Corp has 12 salespeople, each making an average of 23 calls per day, how many calls are made per week (5 days)?

Solution:

>>> salespeople = 12
>>> calls_per_day_each = 23
>>> days_per_week = 5
>>> total_weekly_calls = salespeople * calls_per_day_each * days_per_week
>>> print(f"Total weekly calls: {total_weekly_calls:,}")
Total weekly calls: 1,380

Explanation: This is straightforward arithmetic. The key learning is naming variables meaningfully. Notice the :, format specifier inside the f-string — this adds a comma as a thousands separator, which is good practice for business numbers. total_weekly_calls is far more readable than x or result.
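The format specifiers used throughout this appendix are worth memorizing. A quick sketch of the most common ones for business numbers:

```python
calls = 1380
revenue = 1234567.891

print(f"{calls:,}")        # 1,380         comma as thousands separator
print(f"{revenue:,.2f}")   # 1,234,567.89  comma separator + 2 decimal places
print(f"{0.07:.0%}")       # 7%            decimal displayed as a percentage
print(f"{-0.035:+.1%}")    # -3.5%         leading + forces the sign to show
```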


Exercise 1.3* — Install the pandas library using pip and verify the installation by importing it and printing its version number.

Solution:

In the terminal:

pip install pandas

In a Python file or REPL:

import pandas as pd
print(pd.__version__)

Output: Something like 2.2.0 (your version may differ)

Explanation: pip install pandas downloads pandas and all its dependencies from PyPI. The __version__ attribute is a convention in Python packages — it stores the installed version as a string. If the import succeeds without error, pandas is correctly installed.

Common Mistakes: - Running pip install pandas inside a Python script (it's a terminal command, not Python code). - Having multiple Python installations and installing to the wrong one — use python -m pip install pandas to ensure pip installs to the same Python you're using.


Chapter 2: Variables, Data Types, and Expressions

Exercise 2.1* — Maya Reyes is quoting a client project. She charges $150/hour, estimates 40 hours of work, and adds a 15% contingency. Calculate the total quote. Then update the hourly rate to $175 and recalculate without changing anything else.

Solution:

# Part 1: Original quote
hourly_rate = 150.00
estimated_hours = 40
contingency_rate = 0.15

base_cost = hourly_rate * estimated_hours
contingency = base_cost * contingency_rate
total_quote = base_cost + contingency

print(f"Hourly rate: ${hourly_rate:.2f}")
print(f"Base cost: ${base_cost:,.2f}")
print(f"Contingency ({contingency_rate:.0%}): ${contingency:,.2f}")
print(f"Total quote: ${total_quote:,.2f}")

# Part 2: Updated rate — only this line changes
hourly_rate = 175.00

base_cost = hourly_rate * estimated_hours
contingency = base_cost * contingency_rate
total_quote = base_cost + contingency

print(f"\nUpdated total quote at ${hourly_rate:.2f}/hr: ${total_quote:,.2f}")

Output:

Hourly rate: $150.00
Base cost: $6,000.00
Contingency (15%): $900.00
Total quote: $6,900.00

Updated total quote at $175.00/hr: $8,050.00

Explanation: The power of using named variables is demonstrated here. When the rate changes, only one line needs updating and all downstream calculations automatically reflect the change. :.2f formats to 2 decimal places; :,.2f adds comma separators; :.0% converts a decimal to a percentage with 0 decimal places.

Common Mistakes:
- Writing 15 instead of 0.15 for a 15% rate — Python has no percent literal (% is the modulo operator), so percentages must be expressed as decimals.
- Hardcoding 6000 * 0.15 instead of using base_cost * contingency_rate — hardcoding prevents easy updates.


Exercise 2.2* — Demonstrate the difference between integer division (//), regular division (/), and modulo (%) by calculating how many full boxes of 12 items can be packed from an order of 500 items, and how many items are left over.

Solution:

order_quantity = 500
box_size = 12

full_boxes = order_quantity // box_size      # integer (floor) division
remainder = order_quantity % box_size         # modulo = remainder

print(f"Order quantity: {order_quantity} items")
print(f"Box size: {box_size} items")
print(f"Full boxes needed: {full_boxes}")
print(f"Leftover items: {remainder}")
print(f"Verification: {full_boxes} boxes × {box_size} + {remainder} = "
      f"{full_boxes * box_size + remainder}")

# Regular division for comparison
exact_division = order_quantity / box_size
print(f"\nRegular division result: {exact_division:.4f} (not useful here)")

Output:

Order quantity: 500 items
Box size: 12 items
Full boxes needed: 41
Leftover items: 8
Verification: 41 boxes × 12 + 8 = 500

Regular division result: 41.6667 (not useful here)

Explanation: // discards the decimal part (floor division), giving you 41 full boxes. % gives you what's left over after dividing evenly. These two operators always satisfy: (a // b) * b + (a % b) == a. In shipping/logistics contexts, you almost always want // and % rather than regular division.
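The built-in divmod() computes both values in a single call, which is handy when you always need the pair. A short sketch:

```python
# divmod(a, b) returns the tuple (a // b, a % b)
full_boxes, remainder = divmod(500, 12)
print(full_boxes, remainder)   # 41 8

# The floor-division identity from the explanation above always holds
assert (500 // 12) * 12 + (500 % 12) == 500
```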


Exercise 2.3* — Create a dictionary representing a product in Acme Corp's catalog. Include keys for product_id, name, category, unit_price, and units_in_stock. Then write code that prints a formatted product card and calculates the total inventory value.

Solution:

product = {
    'product_id': 'ACM-1042',
    'name': 'Precision Bearing Assembly',
    'category': 'Industrial Components',
    'unit_price': 47.99,
    'units_in_stock': 2340
}

# Formatted product card
print("=" * 40)
print("       ACME CORP — PRODUCT CARD")
print("=" * 40)
print(f"ID:         {product['product_id']}")
print(f"Name:       {product['name']}")
print(f"Category:   {product['category']}")
print(f"Unit Price: ${product['unit_price']:.2f}")
print(f"In Stock:   {product['units_in_stock']:,} units")

# Calculated field — not stored in dict, computed on demand
inventory_value = product['unit_price'] * product['units_in_stock']
print(f"Inv. Value: ${inventory_value:,.2f}")
print("=" * 40)

Output:

========================================
       ACME CORP — PRODUCT CARD
========================================
ID:         ACM-1042
Name:       Precision Bearing Assembly
Category:   Industrial Components
Unit Price: $47.99
In Stock:   2,340 units
Inv. Value: $112,296.60
========================================

Explanation: Dictionaries are the natural Python structure for representing records with named fields. Note that inventory_value is computed, not stored — this is good practice. If you stored it and later updated unit_price, the stored inventory_value would be stale. Compute derived values from source values on demand.
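To see why storing a derived value is risky, here is a small sketch of the staleness problem described above:

```python
product = {'unit_price': 47.99, 'units_in_stock': 2340}

# Storing the derived value freezes it at today's price...
product['inventory_value'] = product['unit_price'] * product['units_in_stock']

product['unit_price'] = 52.99                       # price update

stored = product['inventory_value']                 # still based on 47.99
computed = product['unit_price'] * product['units_in_stock']

print(stored == computed)   # False — the stored value is stale
```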


Chapter 3: Control Flow

Exercise 3.1* — Acme Corp uses a commission structure: 5% for sales under $10,000, 7% for sales $10,000–$49,999, 10% for sales $50,000 and above. Write a function using if/elif/else that computes the commission for any sales amount.

Solution:

def calculate_commission(sales_amount: float) -> float:
    """
    Calculate sales commission based on Acme Corp's tiered structure.

    Args:
        sales_amount: Total sales in dollars.

    Returns:
        Commission amount in dollars.
    """
    if sales_amount < 0:
        raise ValueError(f"Sales amount cannot be negative: {sales_amount}")
    elif sales_amount < 10_000:
        rate = 0.05
    elif sales_amount < 50_000:
        rate = 0.07
    else:
        rate = 0.10

    return sales_amount * rate


# Test with representative values
test_cases = [0, 5_000, 9_999.99, 10_000, 25_000, 49_999, 50_000, 120_000]

print(f"{'Sales Amount':>15}  {'Rate':>6}  {'Commission':>12}")
print("-" * 38)
for amount in test_cases:
    commission = calculate_commission(amount)
    rate = commission / amount if amount > 0 else 0
    print(f"${amount:>14,.2f}  {rate:>5.0%}  ${commission:>11,.2f}")

Output:

   Sales Amount    Rate    Commission
--------------------------------------
$          0.00     0%          $0.00
$      5,000.00     5%        $250.00
$      9,999.99     5%        $500.00
$     10,000.00     7%        $700.00
$     25,000.00     7%      $1,750.00
$     49,999.00     7%      $3,499.93
$     50,000.00    10%      $5,000.00
$    120,000.00    10%     $12,000.00

Explanation: Notice the boundary conditions: 9,999.99 is at 5%, 10,000 jumps to 7%. The negative check at the top is a "guard clause" — it validates input before doing real work. The underscore in 10_000 is a Python 3.6+ readability feature that has no effect on the value.

Common Mistakes:
- Using sales_amount == 10_000 as a boundary test — this misses sales amounts like $10,000.01. Use range comparisons (<, >=) so every amount falls into exactly one tier.
- Forgetting that elif conditions only run if all previous conditions were False, so you don't need to write sales_amount >= 10_000 and sales_amount < 50_000 for the middle tier — sales_amount < 50_000 alone is sufficient given the preceding elif.


Exercise 3.2* — Using a for loop and the range() function, print a multiplication table for the numbers 1 through 8. Format it so columns align.

Solution:

n = 8

# Header row — 5 leading spaces to match the "  1 |" row prefix below
print(" " * 5, end="")
for col in range(1, n + 1):
    print(f"{col:>5}", end="")
print()  # newline

# Separator
print("-" * (5 + 5 * n))

# Data rows
for row in range(1, n + 1):
    print(f"{row:>3} |", end="")
    for col in range(1, n + 1):
        print(f"{row * col:>5}", end="")
    print()  # newline at end of each row

Output:

         1    2    3    4    5    6    7    8
---------------------------------------------
  1 |    1    2    3    4    5    6    7    8
  2 |    2    4    6    8   10   12   14   16
  3 |    3    6    9   12   15   18   21   24
  4 |    4    8   12   16   20   24   28   32
  5 |    5   10   15   20   25   30   35   40
  6 |    6   12   18   24   30   36   42   48
  7 |    7   14   21   28   35   42   49   56
  8 |    8   16   24   32   40   48   56   64

Explanation: print(f"{value:>5}", end="") — the >5 right-aligns the value in a field of width 5 characters, and end="" suppresses the default newline so values print on the same line. Nested for loops let you handle rows and columns independently.


Exercise 3.3* — Write a while loop that repeatedly asks Sandra for a sales figure until she enters a number greater than 0. Then print a formatted summary.

Solution:

while True:
    user_input = input("Enter a sales figure (must be > 0): ")

    try:
        sales = float(user_input)
    except ValueError:
        print(f"  Error: '{user_input}' is not a valid number. Please try again.")
        continue

    if sales <= 0:
        print(f"  Error: Sales must be greater than 0. Got {sales}. Please try again.")
        continue

    break  # valid input received, exit the loop

commission = sales * 0.07
print(f"\nSales recorded: ${sales:,.2f}")
print(f"Commission (7%): ${commission:,.2f}")

Explanation: The while True / break pattern is the idiomatic way to handle "keep asking until valid" loops. The try/except handles the case where the user types non-numeric text. The continue statements skip back to the while condition after printing an error. The break only executes when both validation checks pass.


Chapter 4: Functions

Exercise 4.1* — Refactor the commission calculator from Chapter 3 to also accept an optional cap parameter that limits the maximum commission paid. Include a docstring and test with several values.

Solution:

def calculate_commission(
    sales_amount: float,
    cap: float | None = None   # Python 3.10+ union syntax for "float or None"
) -> dict:
    """
    Calculate tiered sales commission with optional cap.

    Commission rates:
        - Under $10,000: 5%
        - $10,000 to $49,999: 7%
        - $50,000 and above: 10%

    Args:
        sales_amount: Total sales in dollars. Must be non-negative.
        cap: Maximum commission in dollars. If None, no cap is applied.

    Returns:
        Dictionary with 'rate', 'raw_commission', 'final_commission', and 'capped' keys.

    Raises:
        ValueError: If sales_amount is negative or cap is non-positive.
    """
    if sales_amount < 0:
        raise ValueError(f"sales_amount must be non-negative, got {sales_amount}")
    if cap is not None and cap <= 0:
        raise ValueError(f"cap must be positive, got {cap}")

    # Determine rate
    if sales_amount < 10_000:
        rate = 0.05
    elif sales_amount < 50_000:
        rate = 0.07
    else:
        rate = 0.10

    raw_commission = sales_amount * rate

    # Apply cap if provided
    if cap is not None and raw_commission > cap:
        final_commission = cap
        capped = True
    else:
        final_commission = raw_commission
        capped = False

    return {
        'rate': rate,
        'raw_commission': raw_commission,
        'final_commission': final_commission,
        'capped': capped
    }


# Tests
scenarios = [
    (8_000, None),
    (25_000, None),
    (100_000, None),
    (100_000, 8_000),   # cap kicks in
    (25_000, 2_000),    # cap kicks in
]

print(f"{'Sales':>12} {'Cap':>8} {'Rate':>6} {'Raw Comm':>12} {'Final Comm':>12} {'Capped'}")
print("-" * 62)
for sales, cap in scenarios:
    result = calculate_commission(sales, cap)
    cap_str = f"${cap:,.0f}" if cap is not None else "None"
    print(
        f"${sales:>11,.0f} {cap_str:>8} {result['rate']:>5.0%} "
        f"${result['raw_commission']:>11,.2f} "
        f"${result['final_commission']:>11,.2f} "
        f"{'Yes' if result['capped'] else 'No'}"
    )

Explanation: Returning a dictionary (rather than just a single number) is a common pattern when a function computes several related values. The caller can choose which fields to use. The cap=None default parameter makes the cap optional without breaking existing code that calls the function with just a sales amount.

Common Mistakes:
- Using if cap: instead of if cap is not None:if cap: evaluates False for a cap of 0.0, so a truthiness test conflates "no cap supplied" with a zero value. Always use is None / is not None when checking optional arguments.
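The difference is easy to demonstrate in two lines; a sketch:

```python
cap = 0.0

print(bool(cap))          # False — 0.0 is falsy, so `if cap:` would skip
print(cap is not None)    # True  — but a value WAS provided

# `is not None` distinguishes "no argument" from "an argument that happens
# to be falsy", which truthiness alone cannot do.
```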


Exercise 4.2* — Write a function summarize_sales that accepts *args for any number of monthly sales figures and **kwargs for metadata (like salesperson name and region). Return a formatted summary string.

Solution:

def summarize_sales(*monthly_sales, **metadata):
    """
    Summarize an arbitrary number of monthly sales figures with optional metadata.

    Args:
        *monthly_sales: Variable number of numeric monthly sales amounts.
        **metadata: Optional keyword arguments for context (e.g., name, region).

    Returns:
        Formatted summary string.
    """
    if not monthly_sales:
        return "No sales data provided."

    total = sum(monthly_sales)
    average = total / len(monthly_sales)
    best_month = max(monthly_sales)
    worst_month = min(monthly_sales)

    # Build header from metadata
    header_parts = []
    if 'name' in metadata:
        header_parts.append(metadata['name'])
    if 'region' in metadata:
        header_parts.append(metadata['region'])
    header = " | ".join(header_parts) if header_parts else "Sales Summary"

    lines = [
        f"=== {header} ({len(monthly_sales)} months) ===",
        f"  Total:    ${total:>12,.2f}",
        f"  Average:  ${average:>12,.2f}/month",
        f"  Best:     ${best_month:>12,.2f}",
        f"  Worst:    ${worst_month:>12,.2f}",
    ]

    return "\n".join(lines)


# Test cases
print(summarize_sales(
    45_000, 52_000, 38_000, 61_000, 55_000, 49_000,
    name="Sandra Chen", region="West Coast"
))

print()

print(summarize_sales(
    22_000, 31_000, 28_000,
    name="Marcus Webb"
))

Output:

=== Sandra Chen | West Coast (6 months) ===
  Total:    $  300,000.00
  Average:  $   50,000.00/month
  Best:     $   61,000.00
  Worst:    $   38,000.00

=== Marcus Webb (3 months) ===
  Total:    $   81,000.00
  Average:  $   27,000.00/month
  Best:     $   31,000.00
  Worst:    $   22,000.00

Explanation: *monthly_sales collects all positional arguments into a tuple. **metadata collects all keyword arguments into a dictionary. This makes the function flexible for different amounts of data. The in metadata membership checks let the function degrade gracefully when a piece of metadata (like region) is missing.
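An equivalent way to read optional metadata is dict.get() with a default, which avoids the explicit in check. A sketch:

```python
metadata = {'name': 'Sandra Chen'}   # note: no 'region' supplied

# .get() returns the default instead of raising KeyError on a missing key
name = metadata.get('name', 'Unknown')
region = metadata.get('region', 'Unassigned')

print(name)    # Sandra Chen
print(region)  # Unassigned
```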


Chapter 5: Objects and Methods

Exercise 5.1* — Explore string methods by cleaning a list of product names that have inconsistent formatting. Some have extra spaces, inconsistent capitalization, and special characters that should be removed.

Solution:

import re

raw_product_names = [
    "  WIDGET ALPHA  ",
    "sprocket BETA-plus",
    "  Gear   Assembly  (v2)  ",
    "bearing UNIT -- deluxe",
    "MODULE gamma",
    "   ",
    "COMPONENT-DELTA  extra  ",
]

def clean_product_name(name: str) -> str:
    """Standardize product name to Title Case with single spaces."""
    # Strip leading/trailing whitespace
    cleaned = name.strip()

    # Collapse runs of internal whitespace into a single space
    cleaned = re.sub(r'\s+', ' ', cleaned)

    # Convert to title case
    cleaned = cleaned.title()

    return cleaned


cleaned_names = []
skipped = 0

print("Cleaned product names:")
for name in raw_product_names:
    if not name.strip():  # skip empty/whitespace-only strings
        skipped += 1
        continue
    cleaned = clean_product_name(name)
    cleaned_names.append(cleaned)
    print(f"  {repr(name):30s} -> {repr(cleaned)}")

print(f"\nTotal: {len(cleaned_names)} valid, {skipped} skipped")

Output:

Cleaned product names:
  '  WIDGET ALPHA  '             -> 'Widget Alpha'
  'sprocket BETA-plus'           -> 'Sprocket Beta-Plus'
  '  Gear   Assembly  (v2)  '    -> 'Gear Assembly (V2)'
  'bearing UNIT -- deluxe'       -> 'Bearing Unit -- Deluxe'
  'MODULE gamma'                 -> 'Module Gamma'
  'COMPONENT-DELTA  extra  '     -> 'Component-Delta Extra'

Total: 6 valid, 1 skipped

Explanation: Method chaining (.strip().title()) applies transformations sequentially. The re.sub(r'\s+', ' ', text) call uses a regular expression to collapse multiple whitespace characters into one. The repr() function wraps strings in quotes so you can see the before/after spacing clearly.
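The whitespace-collapsing step is worth isolating, since it is the one piece regular strings can't do with a simple method call. A minimal sketch:

```python
import re

# \s+ matches a run of one or more whitespace characters (spaces, tabs, newlines)
messy = "  Gear \t  Assembly \n (v2)  "
collapsed = re.sub(r'\s+', ' ', messy.strip())

print(repr(collapsed))   # 'Gear Assembly (v2)'
```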


Chapter 6: Files and Exceptions

Exercise 6.1* — Write a function that reads a CSV file of sales transactions, handles common errors gracefully (file not found, permission error, malformed rows), and returns a summary.

Solution:

import csv
import os
from pathlib import Path


def read_sales_csv(filepath: str) -> dict:
    """
    Read a sales CSV file and return a summary.

    Expected CSV columns: date, salesperson, amount, region

    Args:
        filepath: Path to the CSV file.

    Returns:
        Dictionary with 'records', 'total', 'errors', 'filename'.

    Raises:
        No exceptions — all errors are captured and returned.
    """
    path = Path(filepath)
    result = {
        'filename': path.name,
        'records': [],
        'total': 0.0,
        'error_count': 0,
        'errors': [],
        'success': False
    }

    # Check existence before opening
    if not path.exists():
        result['errors'].append(f"File not found: {filepath}")
        return result

    if not path.is_file():
        result['errors'].append(f"Path is not a file: {filepath}")
        return result

    try:
        with open(path, 'r', encoding='utf-8', newline='') as f:
            reader = csv.DictReader(f)

            # Validate required columns
            required_cols = {'date', 'salesperson', 'amount', 'region'}
            if not required_cols.issubset(set(reader.fieldnames or [])):
                missing = required_cols - set(reader.fieldnames or [])
                result['errors'].append(f"Missing required columns: {missing}")
                return result

            for line_num, row in enumerate(reader, start=2):  # 2 = first data row
                try:
                    amount = float(row['amount'])
                    if amount < 0:
                        raise ValueError(f"Negative amount: {amount}")

                    result['records'].append({
                        'date': row['date'],
                        'salesperson': row['salesperson'].strip(),
                        'amount': amount,
                        'region': row['region'].strip()
                    })
                    result['total'] += amount

                except ValueError as e:
                    result['error_count'] += 1
                    result['errors'].append(f"Row {line_num}: {e} — skipping row")

    except PermissionError:
        result['errors'].append(f"Permission denied reading: {filepath}")
        return result
    except UnicodeDecodeError:
        result['errors'].append(
            f"Encoding error — try opening with encoding='latin-1'"
        )
        return result

    result['success'] = True
    return result


# Demonstration with a test file
test_csv = "test_sales.csv"

# Create a test file
with open(test_csv, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'salesperson', 'amount', 'region'])
    writer.writerow(['2025-01-15', 'Sandra Chen', '52000', 'West'])
    writer.writerow(['2025-01-20', 'Marcus Webb', 'not_a_number', 'East'])
    writer.writerow(['2025-01-22', 'Priya Okonkwo', '38500', 'Central'])
    writer.writerow(['2025-02-01', 'Sandra Chen', '-100', 'West'])
    writer.writerow(['2025-02-10', 'Priya Okonkwo', '41200', 'Central'])

result = read_sales_csv(test_csv)
print(f"File: {result['filename']}")
print(f"Records loaded: {len(result['records'])}")
print(f"Total sales: ${result['total']:,.2f}")
print(f"Errors encountered: {result['error_count']}")
for err in result['errors']:
    print(f"  - {err}")

# Clean up test file
os.remove(test_csv)

Explanation: This function demonstrates the "never let the program crash silently" philosophy. All errors are captured and returned so the caller can decide how to handle them. The with open() context manager ensures the file is always closed. csv.DictReader makes rows accessible by column name rather than index, which is more readable and robust.
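One testing trick worth knowing: csv.DictReader accepts any text stream, not just an open file, so io.StringIO lets you exercise parsing logic without touching disk. A sketch:

```python
import csv
import io

csv_text = "date,salesperson,amount,region\n2025-01-15,Sandra Chen,52000,West\n"
reader = csv.DictReader(io.StringIO(csv_text))

row = next(reader)
print(row['salesperson'])     # Sandra Chen
print(float(row['amount']))   # 52000.0
```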


Chapter 7: The Python Environment

Exercise 7.1* — Write a script that checks whether the required packages for this textbook are installed, and prints a status report. If any are missing, print the pip install command to fix it.

Solution:

import importlib
import sys

REQUIRED_PACKAGES = {
    'pandas': 'pandas',
    'numpy': 'numpy',
    'matplotlib': 'matplotlib',
    'seaborn': 'seaborn',
    'sklearn': 'scikit-learn',    # import name differs from pip name
    'flask': 'flask',
    'requests': 'requests',
    'openpyxl': 'openpyxl',
    'sqlalchemy': 'sqlalchemy',
}

print(f"Python {sys.version}")
print(f"{'Package':<15} {'Status':<12} {'Version'}")
print("-" * 45)

missing_packages = []

for import_name, pip_name in REQUIRED_PACKAGES.items():
    try:
        module = importlib.import_module(import_name)
        version = getattr(module, '__version__', 'unknown')
        print(f"{pip_name:<15} {'OK':<12} {version}")
    except ImportError:
        print(f"{pip_name:<15} {'MISSING':<12} -")
        missing_packages.append(pip_name)

if missing_packages:
    print(f"\nTo install missing packages:")
    print(f"  pip install {' '.join(missing_packages)}")
else:
    print("\nAll required packages are installed!")

Explanation: importlib.import_module() lets you try to import a module by string name, which is necessary here since you're checking names dynamically. getattr(module, '__version__', 'unknown') safely retrieves the version, defaulting to 'unknown' if the attribute doesn't exist. Note that some packages have different import names vs. pip names (scikit-learn is imported as sklearn).
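Both techniques work on any module. A quick sketch using math (which, like re, ships without a __version__ attribute):

```python
import importlib

# import_module() takes the module name as a string, unlike the import statement
math_mod = importlib.import_module('math')
print(math_mod.sqrt(9))   # 3.0

# getattr with a default never raises — math has no __version__
print(getattr(math_mod, '__version__', 'unknown'))   # unknown
```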


Chapter 8: Working with Data Structures

Exercise 8.1* — Given a list of sales dictionaries, use a list comprehension to extract all sales above $20,000 from the West region, and a dict comprehension to create a lookup mapping salesperson names to their total sales.

Solution:

sales_data = [
    {'rep': 'Sandra Chen',    'region': 'West',    'amount': 52_000},
    {'rep': 'Priya Okonkwo',  'region': 'Central', 'amount': 38_500},
    {'rep': 'Marcus Webb',    'region': 'East',    'amount': 15_200},
    {'rep': 'Sandra Chen',    'region': 'West',    'amount': 61_000},
    {'rep': 'Priya Okonkwo',  'region': 'Central', 'amount': 41_200},
    {'rep': 'Marcus Webb',    'region': 'East',    'amount': 22_800},
    {'rep': 'Sandra Chen',    'region': 'West',    'amount': 8_500},
    {'rep': 'Priya Okonkwo',  'region': 'Central', 'amount': 18_900},
]

# List comprehension: West region, above $20,000
high_west_sales = [
    sale for sale in sales_data
    if sale['region'] == 'West' and sale['amount'] > 20_000
]

print("High-value West region sales:")
for sale in high_west_sales:
    print(f"  {sale['rep']}: ${sale['amount']:,.0f}")

# Dict comprehension: total sales per rep
unique_reps = {sale['rep'] for sale in sales_data}   # set comprehension

rep_totals = {
    rep: sum(sale['amount'] for sale in sales_data if sale['rep'] == rep)
    for rep in unique_reps
}

print("\nTotal sales by representative:")
for rep, total in sorted(rep_totals.items(), key=lambda x: x[1], reverse=True):
    print(f"  {rep}: ${total:,.0f}")

Output:

High-value West region sales:
  Sandra Chen: $52,000
  Sandra Chen: $61,000

Total sales by representative:
  Sandra Chen: $121,500
  Priya Okonkwo: $98,600
  Marcus Webb: $38,000

Explanation: The list comprehension combines two conditions with and. The dict comprehension pairs each unique rep name (collected first with a set comprehension) with the sum of that rep's sales, computed by a generator expression. This rescans the list once per rep; for large datasets, a single pass that accumulates totals into a collections.defaultdict(float) is more efficient. The sorted() call uses a lambda to sort by value (the total), in descending order.


Part II: Data Analysis with Pandas (Chapters 9–18)


Chapter 9: NumPy Fundamentals

Exercise 9.1* — Create a NumPy array of 12 months of sales data. Calculate the mean, standard deviation, and months where sales exceeded the mean. Identify whether Q4 (months 10–12) outperformed the annual average.

Solution:

import numpy as np

monthly_sales = np.array([
    42_500, 38_200, 51_300, 47_800, 55_100, 49_600,
    43_900, 52_700, 58_400, 61_200, 67_800, 72_500
])

mean_sales = monthly_sales.mean()
std_sales = monthly_sales.std()
months_above_mean = np.where(monthly_sales > mean_sales)[0] + 1  # +1 for 1-indexed months

q4_sales = monthly_sales[9:]  # months 10, 11, 12 (0-indexed: 9, 10, 11)
q4_average = q4_sales.mean()

print(f"Annual Summary:")
print(f"  Total:         ${monthly_sales.sum():>12,.0f}")
print(f"  Mean/month:    ${mean_sales:>12,.2f}")
print(f"  Std deviation: ${std_sales:>12,.2f}")
print(f"  Months above mean: {months_above_mean.tolist()}")
print(f"\nQ4 Performance:")
print(f"  Q4 total:   ${q4_sales.sum():>12,.0f}")
print(f"  Q4 average: ${q4_average:>12,.2f}")
print(f"  Vs. annual: {(q4_average - mean_sales) / mean_sales:>+11.1%}")
print(f"  Q4 beat annual mean: {q4_average > mean_sales}")

Output:

Annual Summary:
  Total:         $     641,000
  Mean/month:    $   53,416.67
  Std deviation: $    9,796.67
  Months above mean: [5, 9, 10, 11, 12]

Q4 Performance:
  Q4 total:   $     201,500
  Q4 average: $   67,166.67
  Vs. annual:         +25.7%
  Q4 beat annual mean: True

Explanation: NumPy operations work on entire arrays at once without explicit loops. np.where(condition)[0] returns indices where the condition is True. Array slicing with [9:] extracts the last three elements. The +25.7% format uses + to always show the sign, which is useful for variance reporting.


Chapter 10: Introduction to Pandas

Exercise 10.1* — Create a DataFrame from a dictionary of sales data. Practice accessing a column, a row by label, a row by position, and a subset using boolean filtering. Print the shape and dtypes.

Solution:

import pandas as pd

data = {
    'salesperson': ['Sandra Chen', 'Marcus Webb', 'Priya Okonkwo', 'David Kim', 'Aisha Patel'],
    'region':      ['West', 'East', 'Central', 'West', 'East'],
    'q1_sales':    [152_000, 89_000, 134_000, 98_000, 115_000],
    'q2_sales':    [165_000, 94_000, 141_000, 87_500, 128_000],
    'quota':       [150_000, 100_000, 130_000, 110_000, 120_000],
}

df = pd.DataFrame(data)
df = df.set_index('salesperson')  # use name as the row label

print("Shape:", df.shape)
print("\nDtypes:")
print(df.dtypes)

# Access a single column
print("\nQ1 Sales (all reps):")
print(df['q1_sales'])

# Access a single row by label
print("\nSandra Chen's data:")
print(df.loc['Sandra Chen'])

# Access a single row by position
print("\nFirst row (iloc[0]):")
print(df.iloc[0])

# Boolean filter: reps who beat their quota in Q1
beat_quota = df[df['q1_sales'] >= df['quota']]
print("\nReps who met/beat quota in Q1:")
print(beat_quota[['q1_sales', 'quota']])

Explanation: Setting salesperson as the index makes .loc['Sandra Chen'] work intuitively. .iloc[0] accesses by position regardless of the index. Boolean filtering compares entire columns element-wise and returns a boolean Series that serves as a mask.


Chapter 11: Selecting and Filtering Data

Exercise 11.1* — Using Acme Corp's product catalog DataFrame, demonstrate at least five different selection operations. Include .loc[], .iloc[], boolean filters with multiple conditions, and selecting multiple columns.

Solution:

import pandas as pd

df = pd.DataFrame({
    'product_id':  ['ACM-001', 'ACM-002', 'ACM-003', 'ACM-004', 'ACM-005', 'ACM-006'],
    'name':        ['Widget A', 'Sprocket B', 'Gear C', 'Widget D', 'Sprocket E', 'Bearing F'],
    'category':    ['Widget', 'Sprocket', 'Gear', 'Widget', 'Sprocket', 'Bearing'],
    'price':       [24.99, 12.50, 89.99, 34.99, 18.75, 156.00],
    'stock':       [1200, 450, 78, 3400, 890, 22],
    'active':      [True, True, True, False, True, True],
})

# 1. Single cell by label
print("1. Price of ACM-003:")
print(df.loc[df['product_id'] == 'ACM-003', 'price'].values[0])

# 2. Multiple columns by name
print("\n2. Name and price columns:")
print(df[['name', 'price']].head())

# 3. Rows and columns with iloc
print("\n3. First 3 rows, first 3 columns (iloc):")
print(df.iloc[:3, :3])

# 4. Boolean: active products priced over $20 (a boolean column is already a mask)
high_value_active = df[df['active'] & (df['price'] > 20)]
print("\n4. Active products > $20:")
print(high_value_active[['name', 'price', 'stock']])

# 5. isin() for category filter
target_categories = ['Widget', 'Bearing']
filtered = df[df['category'].isin(target_categories)]
print("\n5. Widgets and Bearings only:")
print(filtered[['name', 'category', 'price']])

Common Mistake: Combining conditions with the Python keywords and/or instead of & and | — this raises a ValueError because and/or ask for a single truth value of an entire Series, which is ambiguous. Always use &, |, and ~ for element-wise boolean operations, and wrap each condition in parentheses.
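A short demonstration of this mistake on a toy Series (invented values for illustration):

```python
import pandas as pd

s = pd.Series([10, 25, 40])

# Correct: element-wise AND, each condition in parentheses
result = s[(s > 15) & (s < 35)]
print(result.tolist())  # [25]

# Incorrect: Python's `and` asks for one truth value of the whole Series
try:
    s[(s > 15) and (s < 35)]
except ValueError as e:
    print(f"ValueError: {e}")
```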


Chapter 12: Cleaning Data

Exercise 12.1* — Given a "dirty" employee DataFrame, perform a complete cleaning process: identify and handle missing values, fix data types, remove duplicates, and standardize string formatting.

Solution:

import pandas as pd
import numpy as np

# Dirty data as it might arrive from a real system
raw_data = {
    'emp_id':   ['E001', 'E002', 'E003', 'E003', 'E004', 'E005', 'E006'],
    'name':     ['Sandra Chen', ' marcus webb ', 'PRIYA OKONKWO', 'Priya Okonkwo',
                 'David Kim', None, 'Aisha Patel'],
    'dept':     ['Sales', 'IT', 'Analytics', 'Analytics', 'Sales', 'IT', 'analytics'],
    'salary':   ['95000', '78000', '82000', '82000', 'N/A', '71000', '88000'],
    'start_date': ['2019-03-15', '2020-07-01', '2021-01-10', '2021-01-10',
                   '2018-11-20', '2022-05-15', 'invalid_date'],
}

df = pd.DataFrame(raw_data)
print("=== ORIGINAL DATA ===")
print(df)
print(f"\nShape: {df.shape}")

# Step 1: Remove duplicates
df = df.drop_duplicates(subset='emp_id', keep='first')
print(f"\nAfter removing duplicates: {df.shape[0]} rows")

# Step 2: Fix name formatting (strip whitespace, title case)
df['name'] = df['name'].str.strip().str.title()

# Step 3: Standardize department (title case)
df['dept'] = df['dept'].str.strip().str.title()

# Step 4: Handle missing names
print(f"\nRows with missing name: {df['name'].isna().sum()}")
df['name'] = df['name'].fillna('Unknown')

# Step 5: Convert salary to numeric (handling 'N/A')
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')
print(f"Salary NaN count (from 'N/A'): {df['salary'].isna().sum()}")
# Fill missing salary with department median
dept_median_salary = df.groupby('dept')['salary'].transform('median')
df['salary'] = df['salary'].fillna(dept_median_salary)

# Step 6: Parse dates (handling invalid dates)
df['start_date'] = pd.to_datetime(df['start_date'], errors='coerce')
print(f"Invalid dates converted to NaT: {df['start_date'].isna().sum()}")

print("\n=== CLEANED DATA ===")
print(df.to_string())
print(f"\nFinal dtypes:\n{df.dtypes}")

Key Insight: pd.to_numeric(..., errors='coerce') and pd.to_datetime(..., errors='coerce') are the standard approaches to converting columns that contain a mix of valid and invalid values. Both convert invalid values to NaN/NaT instead of raising an error, giving you control over how to handle them next.
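A minimal sketch of the coerce-then-handle pattern on a toy Series (made-up values):

```python
import pandas as pd

mixed = pd.Series(['95000', 'N/A', '88000'])

# Without coercion, pd.to_numeric would raise on 'N/A';
# with errors='coerce', the bad value becomes NaN instead
nums = pd.to_numeric(mixed, errors='coerce')
print(nums)

# NaN can then be handled deliberately — here, filled with the median
filled = nums.fillna(nums.median())
print(filled.tolist())  # [95000.0, 91500.0, 88000.0]
```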


Part III: Business Automation (Chapters 19–25)

Solutions for Chapters 13–18 follow the same pattern. Key solutions below focus on the most instructive starred exercises.


Chapter 19: Automating File Operations

Exercise 19.1* — Write a script that monitors a folder for new CSV files, reads each one, appends a "processed" timestamp column, and moves the file to a processed subfolder.

Solution:

import shutil
from pathlib import Path
from datetime import datetime
import pandas as pd


def process_incoming_csv(input_folder: str, output_folder: str) -> int:
    """
    Process all CSV files in input_folder and move them to output_folder.

    Returns:
        Number of files processed.
    """
    input_path = Path(input_folder)
    output_path = Path(output_folder)
    output_path.mkdir(parents=True, exist_ok=True)

    processed_count = 0

    for csv_file in input_path.glob('*.csv'):
        print(f"Processing: {csv_file.name}")

        try:
            df = pd.read_csv(csv_file)
            df['processed_at'] = datetime.now().isoformat()

            # Save enriched version to output folder
            output_file = output_path / csv_file.name
            df.to_csv(output_file, index=False)

            # Archive original (move to processed/archive)
            archive_path = output_path / 'archive'
            archive_path.mkdir(exist_ok=True)
            shutil.move(str(csv_file), str(archive_path / csv_file.name))

            print(f"  {len(df)} rows processed -> {output_file.name}")
            processed_count += 1

        except Exception as e:
            print(f"  ERROR: {e}")

    return processed_count


if __name__ == '__main__':
    count = process_incoming_csv('incoming', 'processed')
    print(f"\nComplete: {count} files processed")

Explanation: Path.glob('*.csv') finds all CSV files in the folder. mkdir(parents=True, exist_ok=True) creates the folder and any intermediate directories, and doesn't fail if the folder already exists. shutil.move() moves files across directories (unlike os.rename(), which fails across file system boundaries). The if __name__ == '__main__': guard is best practice for scripts.
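A throwaway demonstration of the pathlib calls used above, run in a temporary folder with made-up file names:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'a.csv').write_text('x\n1')
    (root / 'b.csv').write_text('x\n2')
    (root / 'notes.txt').write_text('ignore me')

    # glob('*.csv') matches only the CSV files, not notes.txt
    csv_names = sorted(p.name for p in root.glob('*.csv'))
    print(csv_names)  # ['a.csv', 'b.csv']

    # mkdir with parents=True, exist_ok=True is idempotent — safe to repeat
    archive = root / 'processed' / 'archive'
    archive.mkdir(parents=True, exist_ok=True)
    archive.mkdir(parents=True, exist_ok=True)  # no error the second time
    print(archive.exists())
```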


Chapter 20: Automated Reports

Exercise 20.1* — Generate a monthly sales report as an Excel file with multiple sheets: a summary sheet, a by-region sheet, and a raw data sheet. Apply basic formatting.

Solution:

import pandas as pd
from datetime import datetime


def generate_monthly_report(df: pd.DataFrame, output_path: str, month: str) -> None:
    """
    Generate a formatted Excel report with multiple sheets.

    Args:
        df: Sales DataFrame with columns: date, region, salesperson, amount
        output_path: Path for output .xlsx file
        month: Month label string (e.g., 'January 2025')
    """
    # Summary stats
    summary_data = {
        'Metric': [
            'Total Sales', 'Average Sale', 'Median Sale',
            'Highest Single Sale', 'Number of Transactions',
            'Number of Sales Reps'
        ],
        'Value': [
            df['amount'].sum(),
            df['amount'].mean(),
            df['amount'].median(),
            df['amount'].max(),
            len(df),
            df['salesperson'].nunique()
        ]
    }
    summary_df = pd.DataFrame(summary_data)

    # By region
    region_df = (
        df.groupby('region')
        .agg(
            total_sales=('amount', 'sum'),
            transaction_count=('amount', 'count'),
            average_sale=('amount', 'mean')
        )
        .sort_values('total_sales', ascending=False)
        .reset_index()
    )

    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        summary_df.to_excel(writer, sheet_name='Summary', index=False)
        region_df.to_excel(writer, sheet_name='By Region', index=False)
        df.to_excel(writer, sheet_name='Raw Data', index=False)

        # Apply column widths to Summary sheet
        ws = writer.sheets['Summary']
        ws.column_dimensions['A'].width = 30
        ws.column_dimensions['B'].width = 20

    print(f"{month} report saved: {output_path}")


# Create sample data and demonstrate
import numpy as np
rng = np.random.default_rng(42)

dates = pd.date_range('2025-01-01', '2025-01-31', freq='D')
n = 50

sample_data = pd.DataFrame({
    'date': rng.choice(dates, n),
    'region': rng.choice(['West', 'East', 'Central', 'South'], n),
    'salesperson': rng.choice(['Sandra Chen', 'Marcus Webb', 'Priya Okonkwo'], n),
    'amount': rng.uniform(5_000, 75_000, n).round(2)
})

generate_monthly_report(sample_data, 'acme_jan2025_report.xlsx', 'January 2025')


Chapter 25: Regular Expressions for Business Data

Exercise 25.1* — Write regular expression patterns to validate and extract: US phone numbers (various formats), email addresses, US ZIP codes, and currency values from a text string.

Solution:

import re

# Test data
test_strings = [
    "Call Sandra at (555) 867-5309 or email sandra.chen@acmecorp.com",
    "Office: 800-555-0199 | Zip: 90210-3456 | Revenue was $1,234,567.89",
    "Contact: +1.555.234.5678 for pricing. Budget: $50,000",
    "Not a phone: 123-45 | Not an email: user@ | Not a zip: 9021",
]

# Patterns
phone_pattern = re.compile(
    r'(?:\+1[.\-\s]?)?'          # optional country code
    r'(?:\(?\d{3}\)?)'            # area code, optional parens
    r'[\s.\-]?'                   # separator
    r'\d{3}'                      # exchange
    r'[\s.\-]?'                   # separator
    r'\d{4}'                      # subscriber
)

email_pattern = re.compile(
    r'\b[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}\b'
)

zip_pattern = re.compile(
    r'\b\d{5}(?:-\d{4})?\b'
)

currency_pattern = re.compile(
    r'\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?'
)

print("=== Extracted Matches ===\n")
for text in test_strings:
    print(f"Text: {text[:60]}...")
    print(f"  Phones:    {phone_pattern.findall(text)}")
    print(f"  Emails:    {email_pattern.findall(text)}")
    print(f"  ZIP codes: {zip_pattern.findall(text)}")
    print(f"  Currency:  {currency_pattern.findall(text)}")
    print()

Explanation: Each pattern uses raw strings (r'...') to avoid needing double backslashes. (?:...) is a non-capturing group — it groups without creating a capture group. \b is a word boundary anchor that prevents matching partial words. findall() returns all non-overlapping matches as a list.
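One subtlety worth demonstrating: with findall(), a capturing group changes what is returned, which is exactly why the patterns above use (?:...). A toy example with invented ZIP codes:

```python
import re

text = "Zip: 90210-3456 and 10001"

# Capturing group: findall returns only the captured part of each match
capturing = re.findall(r'\d{5}(-\d{4})?', text)
print(capturing)   # ['-3456', ''] — not the full ZIP codes!

# Non-capturing group: findall returns the full match
non_capturing = re.findall(r'\d{5}(?:-\d{4})?', text)
print(non_capturing)  # ['90210-3456', '10001']
```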


Part IV: Business Analytics (Chapters 26–33)


Chapter 28: Database Operations

Exercise 28.1* — Create a SQLite database for Acme Corp's sales system, insert data, and write queries using Python. Demonstrate parameterized queries to prevent SQL injection.

Solution:

import sqlite3
import pandas as pd
from contextlib import contextmanager


@contextmanager
def get_db_connection(db_path: str):
    """Context manager for SQLite database connections."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # enables column name access
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()


# Create and populate database
DB_PATH = 'acme_sales.db'

with get_db_connection(DB_PATH) as conn:
    cursor = conn.cursor()

    # Create table
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS sales (
            sale_id    INTEGER PRIMARY KEY AUTOINCREMENT,
            rep_name   TEXT NOT NULL,
            region     TEXT NOT NULL,
            amount     REAL NOT NULL,
            sale_date  TEXT NOT NULL,
            product_id TEXT
        )
    """)

    # Insert with PARAMETERIZED QUERIES (prevents SQL injection)
    sales_to_insert = [
        ('Sandra Chen',    'West',    52_000.00, '2025-01-15', 'ACM-001'),
        ('Priya Okonkwo',  'Central', 38_500.00, '2025-01-20', 'ACM-003'),
        ('Marcus Webb',    'East',    22_800.00, '2025-01-22', 'ACM-002'),
        ('Sandra Chen',    'West',    61_000.00, '2025-02-01', 'ACM-001'),
        ('Priya Okonkwo',  'Central', 41_200.00, '2025-02-10', 'ACM-003'),
    ]

    # SAFE: parameters are passed separately — never concatenate user input!
    cursor.executemany(
        "INSERT INTO sales (rep_name, region, amount, sale_date, product_id) "
        "VALUES (?, ?, ?, ?, ?)",
        sales_to_insert
    )
    print(f"Inserted {cursor.rowcount} rows")

# Query with parameters
with get_db_connection(DB_PATH) as conn:
    # Aggregation query
    df = pd.read_sql_query("""
        SELECT rep_name,
               region,
               COUNT(*) as transactions,
               SUM(amount) as total_sales,
               AVG(amount) as avg_sale
        FROM sales
        GROUP BY rep_name, region
        ORDER BY total_sales DESC
    """, conn)

    print("\nSales Summary by Rep:")
    print(df.to_string())

    # Parameterized filter
    min_amount = 40_000
    high_sales = pd.read_sql_query(
        "SELECT * FROM sales WHERE amount >= ?",
        conn,
        params=(min_amount,)
    )
    print(f"\nSales >= ${min_amount:,}:")
    print(high_sales.to_string())

# Cleanup
import os
os.remove(DB_PATH)

Key Security Insight: Never build SQL queries by concatenating strings with user input:

# DANGEROUS — SQL injection vulnerability:
cursor.execute(f"SELECT * FROM sales WHERE rep_name = '{user_input}'")

# SAFE — parameterized query:
cursor.execute("SELECT * FROM sales WHERE rep_name = ?", (user_input,))

A malicious user could enter '; DROP TABLE sales; -- as their name, which in the dangerous version would be interpreted as SQL and could destroy your data. (Python's sqlite3 driver happens to refuse more than one statement per execute() call, but many drivers and databases do not — never rely on that.) Parameterized queries treat the input as data, not code.
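An in-memory demonstration that a parameterized insert stores the attack string as inert data (toy table, not the schema above):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE sales (rep_name TEXT)")
conn.execute("INSERT INTO sales VALUES ('Sandra Chen')")

malicious = "'; DROP TABLE sales; --"

# Parameterized: the malicious string becomes an ordinary row, not SQL
conn.execute("INSERT INTO sales (rep_name) VALUES (?)", (malicious,))
rows = conn.execute("SELECT rep_name FROM sales").fetchall()
print(rows)  # table intact; the attack string is just another value
conn.close()
```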


Part V: Machine Learning and Deployment (Chapters 34–40)


Chapter 35: Your First ML Model

Exercise 35.1* — Build a complete classification pipeline: load data, split into train/test, fit a logistic regression model, evaluate with accuracy, precision, recall, and F1, and interpret the confusion matrix.

Solution:

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, classification_report
)
import warnings
warnings.filterwarnings('ignore')  # demo convenience only — don't blanket-silence warnings in real projects


# Generate synthetic customer churn data
rng = np.random.default_rng(42)
n = 500

data = pd.DataFrame({
    'tenure_months':     rng.integers(1, 72, n),
    'monthly_spend':     rng.uniform(50, 300, n).round(2),
    'support_tickets':   rng.integers(0, 10, n),
    'login_frequency':   rng.uniform(0, 30, n).round(1),
})

# Target: churn (1) if high tickets, low tenure, high spend pattern
churn_prob = (
    (data['support_tickets'] > 5).astype(float) * 0.4 +
    (data['tenure_months'] < 12).astype(float) * 0.3 +
    (data['monthly_spend'] > 200).astype(float) * 0.2 +
    rng.uniform(0, 0.1, n)
)
data['churned'] = (churn_prob > 0.4).astype(int)

print(f"Dataset: {len(data)} customers")
print(f"Churn rate: {data['churned'].mean():.1%}")

# Feature matrix and target vector
X = data.drop('churned', axis=1)
y = data['churned']

# Split: 80% train, 20% test (stratified to preserve churn ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features (recommended for logistic regression: faster convergence, comparable coefficients)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit AND transform on training data
X_test_scaled = scaler.transform(X_test)         # ONLY transform on test data (no fit!)

# Train model
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)

# Predict
y_pred = model.predict(X_test_scaled)

# Evaluate
print("\n=== Model Evaluation ===")
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.3f}")

print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(f"  True Negatives  (correctly predicted stay):  {cm[0][0]}")
print(f"  False Positives (predicted churn, stayed):   {cm[0][1]}")
print(f"  False Negatives (predicted stay, churned):   {cm[1][0]}")
print(f"  True Positives  (correctly predicted churn): {cm[1][1]}")

print("\nFull Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Stayed', 'Churned']))

# Feature importance (coefficients)
print("Feature Coefficients (impact on churn probability):")
for feature, coef in sorted(
    zip(X.columns, model.coef_[0]),
    key=lambda x: abs(x[1]), reverse=True
):
    direction = "increases" if coef > 0 else "decreases"
    print(f"  {feature:<20}: {coef:>+.3f} ({direction} churn risk)")

Key Learning: The most important line pair is:

X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)         # apply same scaling, don't learn new

Fitting the scaler on test data constitutes "data leakage" — the model would be evaluated with knowledge it wouldn't have in production.
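A toy NumPy illustration of what the fit/transform split does under the hood (invented numbers; StandardScaler uses the same population mean/std formula):

```python
import numpy as np

train = np.array([10.0, 20.0, 30.0])
test = np.array([40.0, 50.0])

# "fit": learn the mean and std from TRAINING data only
mu, sigma = train.mean(), train.std()

# "transform": apply the SAME parameters to both sets
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma
print(train_scaled)  # centered on 0 by construction
print(test_scaled)   # larger than any training value — as it should be
```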


Chapter 36: Model Evaluation

Exercise 36.1* — Demonstrate k-fold cross-validation on the churn model from Chapter 35. Compare single train/test split results with 5-fold and 10-fold CV results. Discuss what the results tell you.

Solution:

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
import numpy as np

# Using data and model from Chapter 35 exercise
# Build a pipeline (ensures scaler is applied correctly in each fold)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(random_state=42))
])

# 5-fold CV
skf5 = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_5fold = cross_val_score(pipeline, X, y, cv=skf5, scoring='f1')

# 10-fold CV
skf10 = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores_10fold = cross_val_score(pipeline, X, y, cv=skf10, scoring='f1')

print("Cross-Validation Results (F1 Score):")
print("\n5-Fold CV:")
for i, score in enumerate(scores_5fold, 1):
    print(f"  Fold {i}: {score:.3f}")
print(f"  Mean: {scores_5fold.mean():.3f} ± {scores_5fold.std():.3f}")

print("\n10-Fold CV:")
for i, score in enumerate(scores_10fold, 1):
    print(f"  Fold {i}: {score:.3f}")
print(f"  Mean: {scores_10fold.mean():.3f} ± {scores_10fold.std():.3f}")

print(f"\nSingle Train/Test Split F1: {f1_score(y_test, y_pred):.3f}")

Key Insight: The standard deviation from cross-validation tells you how stable your model's performance is. A mean of 0.75 ± 0.02 is much more trustworthy than a single test score of 0.78, which might be due to a "lucky" split. Using Pipeline in cross-validation is critical — it ensures the scaler is refit on each training fold separately, preventing leakage.
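The invented fold scores below make the point concrete: two models with the same mean F1 but very different stability (numbers are made up for illustration):

```python
import numpy as np

stable   = np.array([0.74, 0.75, 0.76, 0.75, 0.75])
unstable = np.array([0.60, 0.90, 0.70, 0.85, 0.70])

# Both means are 0.75, but the second model's performance is far less predictable
for name, scores in [('stable', stable), ('unstable', unstable)]:
    print(f"{name:>8}: {scores.mean():.3f} ± {scores.std():.3f}")
```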


Chapter 40: Deployment

Exercise 40.1* — Deploy the trained churn model as a Flask REST API that accepts customer features as JSON and returns a churn prediction and probability.

Solution:

# save_model.py — run this first to create the model files
import pickle
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Train pipeline (same synthetic features as Chapter 35; the churn rule is slightly simplified)
rng = np.random.default_rng(42)
n = 500
data = pd.DataFrame({
    'tenure_months':   rng.integers(1, 72, n),
    'monthly_spend':   rng.uniform(50, 300, n).round(2),
    'support_tickets': rng.integers(0, 10, n),
    'login_frequency': rng.uniform(0, 30, n).round(1),
})
churn_prob = (
    (data['support_tickets'] > 5) * 0.4 +
    (data['tenure_months'] < 12) * 0.3
).astype(float) + rng.uniform(0, 0.1, n)
data['churned'] = (churn_prob > 0.4).astype(int)

X = data.drop('churned', axis=1)
y = data['churned']

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(random_state=42))
])
pipeline.fit(X, y)

with open('churn_model.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

print("Model saved to churn_model.pkl")


# app.py — the Flask API
from flask import Flask, request, jsonify
import pickle
import pandas as pd
import os

app = Flask(__name__)

# Load model at startup (once)
MODEL_PATH = 'churn_model.pkl'
with open(MODEL_PATH, 'rb') as f:
    model = pickle.load(f)

REQUIRED_FEATURES = ['tenure_months', 'monthly_spend', 'support_tickets', 'login_frequency']


@app.route('/health', methods=['GET'])
def health():
    """Health check endpoint."""
    return jsonify({'status': 'ok', 'model': MODEL_PATH})


@app.route('/predict', methods=['POST'])
def predict():
    """
    Predict customer churn.

    Expected JSON body:
    {
        "tenure_months": 6,
        "monthly_spend": 250.00,
        "support_tickets": 7,
        "login_frequency": 3.5
    }
    """
    data = request.get_json()

    if data is None:
        return jsonify({'error': 'Request body must be JSON'}), 400

    # Validate required features
    missing = [f for f in REQUIRED_FEATURES if f not in data]
    if missing:
        return jsonify({'error': f'Missing features: {missing}'}), 400

    # Convert to DataFrame (model expects this shape)
    try:
        features = pd.DataFrame([{f: float(data[f]) for f in REQUIRED_FEATURES}])
    except (ValueError, TypeError) as e:
        return jsonify({'error': f'Invalid feature values: {e}'}), 400

    # Predict
    prediction = model.predict(features)[0]
    probability = model.predict_proba(features)[0].tolist()

    return jsonify({
        'prediction': int(prediction),
        'label': 'CHURN' if prediction == 1 else 'STAY',
        'probability_stay': round(probability[0], 4),
        'probability_churn': round(probability[1], 4),
        'risk_level': 'HIGH' if probability[1] > 0.7 else
                      'MEDIUM' if probability[1] > 0.4 else 'LOW'
    })


if __name__ == '__main__':
    app.run(debug=True, port=5000)


# Test the API (in a separate terminal after starting the app):
# curl -X POST http://localhost:5000/predict \
#   -H "Content-Type: application/json" \
#   -d '{"tenure_months": 6, "monthly_spend": 280, "support_tickets": 8, "login_frequency": 2}'

Explanation: The critical architecture decision is loading the model once at startup, not on every request — loading a model file is slow and should not happen per-request in production. Input validation before calling the model prevents unhelpful 500 errors. Returning a human-readable risk_level alongside raw probabilities makes the API useful to non-technical consumers.
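One way to keep that validation logic testable is to pull it into a plain function. validate_features below is a hypothetical helper (not part of the solution above) that mirrors the route's checks so they can be unit-tested without starting Flask:

```python
def validate_features(payload, required):
    """Return (features, error): mirrors the checks in the /predict route."""
    if payload is None:
        return None, 'Request body must be JSON'
    missing = [f for f in required if f not in payload]
    if missing:
        return None, f'Missing features: {missing}'
    try:
        return {f: float(payload[f]) for f in required}, None
    except (ValueError, TypeError) as e:
        return None, f'Invalid feature values: {e}'

REQUIRED = ['tenure_months', 'monthly_spend', 'support_tickets', 'login_frequency']

ok, err = validate_features(
    {'tenure_months': 6, 'monthly_spend': 250,
     'support_tickets': 7, 'login_frequency': 3.5},
    REQUIRED,
)
print(ok, err)        # full feature dict, no error

bad, err2 = validate_features({'tenure_months': 6}, REQUIRED)
print(bad, err2)      # None plus a 'Missing features' message
```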


For complete solutions to all exercises in every chapter, see the companion solution files available at the book's online repository. Remember: the exercises you struggle with the most are the ones that teach you the most. The goal is understanding, not completion.