
Learning Objectives

  • Write unit tests using pytest conventions and run them from the command line
  • Apply test-driven development (TDD) to design functions by writing tests first
  • Identify edge cases and boundary conditions for any given function
  • Use pytest fixtures to share setup code across tests
  • Apply systematic debugging strategies including binary search debugging
  • Evaluate code coverage reports and explain why 100% coverage is not the goal
  • Write testable code by separating logic from I/O

Chapter 13: Testing and Debugging: Writing Code You Can Trust

"Beware of bugs in the above code; I have only proved it correct, not tested it." — Donald Knuth

Chapter Overview

Here's a question that separates hobbyist programmers from professionals: how do you know your code works?

If your answer is "I ran it and it seemed fine," you're in good company — that's how most beginners operate. But "seemed fine" doesn't scale. Elena Vasquez learned this the hard way when her nonprofit's quarterly report script — the one she'd been running successfully for months — silently dropped every donation record where the donor's name contained an apostrophe. O'Brien, D'Angelo, O'Malley — all gone. Nobody noticed until the executive director presented the numbers to the board and a donor in the room said, "Wait, I gave $5,000 this quarter. Where am I?"

The bug had been there since day one. Elena's manual spot-checks never caught it because she'd always tested with names like "Smith" and "Johnson." What she needed wasn't more manual testing. She needed automated tests — code that checks other code, systematically, every single time.

This chapter teaches you to write those tests. You'll learn pytest (the industry-standard testing framework for Python), practice test-driven development (a technique that changes how you think about designing code), and build systematic debugging skills for when things go wrong despite your best efforts.

In this chapter, you will learn to:

  • Write and run unit tests using pytest
  • Apply the Red-Green-Refactor cycle of test-driven development
  • Identify edge cases and boundary conditions before they bite you
  • Use fixtures to keep test code clean and maintainable
  • Debug systematically using binary search and other strategies
  • Write code that's inherently easier to test

🏃 Fast Track: If you're already familiar with the concept of testing and just want the pytest mechanics, skip to section 13.3. If you've used pytest before and want TDD, jump to section 13.5.

🔬 Deep Dive: After this chapter, read Appendix D's FAQ section on "Setting up pytest in VS Code" for IDE integration details.


13.1 Why Testing Matters

Let's talk about money. In 2012, Knight Capital Group deployed a software update to their automated stock trading system. A bug in the deployment caused the system to execute unintended trades at an extraordinary rate. In 45 minutes, the firm lost $440 million. The company never recovered.

You might be thinking: "I'm writing a grade calculator, not a trading system." Fair point. But the principle scales down. A bug in a student's CSV processing script can corrupt a semester of data. A bug in Elena's report can misrepresent a nonprofit's finances. A bug in Dr. Patel's DNA analysis pipeline can invalidate months of research.

Testing isn't about paranoia. It's about insurance. You pay a small premium (writing tests) to protect against a large loss (broken code in production).

The Cost of Late Bug Discovery

Here's an industry pattern that's been observed consistently across decades of software projects: the later you find a bug, the more expensive it is to fix.

When the Bug Is Found                        Relative Cost to Fix
While writing the code (tests catch it)      1x
During code review                           5x
During integration testing                   10x
After deployment to users                    50-100x

The "50-100x" factor isn't just about the code fix itself. It includes debugging time (finding the bug in a large system), data recovery, user communication, reputation damage, and the stress of patching production systems at 2 AM.

What "Testing" Actually Means

When we say "testing" in software, we mean writing code that automatically checks whether other code behaves correctly. Not clicking around. Not "I'll try a few inputs and see." We mean executable, repeatable verification that you can run with a single command, every time you change anything.

💡 Intuition: Think of tests like a safety net for a trapeze artist. The artist is skilled and probably won't fall — but the net is there anyway, because "probably" isn't good enough. Every time you modify your code, you're doing a trapeze act. Your tests are the net.

🔗 Connection (Ch 11): In Chapter 11, you learned to write code that handles errors gracefully with try/except. Testing goes one step further — it verifies that your error handling actually works by deliberately triggering those error conditions and checking the results.


13.2 Types of Testing

Software testing comes in several flavors. In this course, we'll focus almost entirely on one type, but you should know the landscape.

Unit Testing (Our Focus)

A unit test checks a single, small piece of code in isolation — typically one function or one method. You give it specific inputs and verify it produces the expected outputs.

def add(a, b):
    return a + b

# A unit test for add():
# Given inputs 2 and 3, the expected output is 5.
# If add(2, 3) returns anything other than 5, the test fails.

Unit tests are fast (milliseconds each), focused (one thing at a time), and independent (each test stands alone).

Integration Testing

Integration tests check that multiple components work together correctly. For example: does the storage.py module correctly save data that models.py produces? Integration tests are slower and more complex than unit tests, but they catch bugs that unit tests miss — the kind where each piece works fine alone but they break when combined.

End-to-End (E2E) Testing

End-to-end tests simulate a real user interacting with the entire system. For a web app, that might mean automating a browser to click buttons and fill forms. For a CLI app like TaskFlow, it might mean feeding input to the program and checking the full output.

The Testing Pyramid

Most teams aim for something like this distribution:

        /   E2E    \         Few — slow, expensive, broad
       /------------\
      / Integration  \       Some — moderate speed, moderate scope
     /----------------\
    /    Unit Tests    \     Many — fast, cheap, focused
    ‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾

Many unit tests at the base (fast, cheap, run constantly), fewer integration tests in the middle, and a small number of E2E tests at the top. This chapter focuses on the base of the pyramid.


13.3 Your First Unit Test with pytest

Python ships with a built-in testing framework called unittest. It works, but it's verbose. The Python community overwhelmingly prefers pytest — a third-party framework that's simpler, more powerful, and more readable. We'll use pytest throughout this chapter.

Installing pytest

Open your terminal and run:

pip install pytest

Verify the installation:

pytest --version

You should see something like pytest 8.x.x.

⚠️ Pitfall: If you're using a virtual environment (from Chapter 12), make sure it's activated before you install pytest. Otherwise, you'll install it globally and wonder why your project can't find it.

Writing Your First Test

Here's a simple function and a test for it. Create a file called math_tools.py:

# math_tools.py
def calculate_average(numbers):
    """Return the arithmetic mean of a list of numbers."""
    return sum(numbers) / len(numbers)

Now create a file called test_math_tools.py in the same directory:

# test_math_tools.py
from math_tools import calculate_average

def test_average_of_three_numbers():
    result = calculate_average([10, 20, 30])
    assert result == 20.0

def test_average_of_single_number():
    result = calculate_average([42])
    assert result == 42.0

def test_average_with_decimals():
    result = calculate_average([1.5, 2.5, 3.0])
    assert result == 2.3333333333333335

Running Your Tests

From the terminal, in the directory containing your test file:

pytest test_math_tools.py

Expected output:

========================= test session starts ==========================
collected 3 items

test_math_tools.py ...                                           [100%]

========================== 3 passed in 0.01s ===========================

Each dot represents a passing test. Three dots, three passes. That's it — you've written and run your first automated test suite.

pytest Conventions

pytest uses simple conventions to discover and run tests:

Convention            Rule
Test file names       Must start with test_ or end with _test.py
Test function names   Must start with test_
Assertions            Use plain assert statements
Test discovery        Run pytest with no arguments to find all tests automatically
No classes required. No special setup. Just functions that start with test_ and use assert.

unittest vs. pytest: A Comparison

Since you'll encounter both in the wild, here's how they compare:

Feature             unittest (built-in)                          pytest (third-party)
Import needed       import unittest                              import pytest (often not even needed)
Test structure      Classes inheriting unittest.TestCase         Plain functions
Assertions          self.assertEqual(a, b), self.assertTrue(x)   assert a == b, assert x
Setup/Teardown      setUp() / tearDown() methods                 Fixtures (more flexible)
Running tests       python -m unittest                           pytest
Output on failure   Basic                                        Detailed diff showing exactly what went wrong
Verbosity           More boilerplate                             Minimal boilerplate
Availability        Standard library, always available           Must install; de facto industry standard

Both are valid. This book uses pytest because it lets you focus on the testing rather than the framework boilerplate.

🔄 Check Your Understanding (try to answer without scrolling up)

  1. What naming convention must test files and test functions follow for pytest to discover them?
  2. What Python keyword does pytest use for assertions, instead of methods like assertEqual()?
  3. What command runs all discovered tests in the current directory?

Verify

  1. Test files must start with test_ (or end with _test.py). Test functions must start with test_.
  2. The plain assert keyword — e.g., assert result == expected.
  3. Simply pytest with no arguments. It recursively discovers and runs all test files.

13.4 Assertions and Test Design

The assert statement is the backbone of every test. It says: "This condition must be true. If it isn't, something is wrong."

Basic Assertions

# Equality
assert calculate_average([10, 20]) == 15.0

# Truthiness
assert is_valid_email("user@example.com")

# Containment
assert "error" in log_message.lower()

# Type checking
assert isinstance(result, dict)

# Comparison
assert len(tasks) >= 1

When an assertion fails, pytest shows exactly what happened:

    def test_average_of_two():
        result = calculate_average([10, 20])
>       assert result == 16.0
E       assert 15.0 == 16.0

test_math_tools.py:5: AssertionError

That's one of pytest's superpowers — the failure message shows both the actual value (15.0) and the expected value (16.0), so you can immediately see what went wrong.

Testing Floating-Point Results

Floating-point arithmetic is imprecise (remember Chapter 3?). Never test floats with ==:

# BAD — may fail due to floating-point imprecision
assert 0.1 + 0.2 == 0.3  # Fails! 0.1 + 0.2 == 0.30000000000000004

# GOOD — use pytest.approx() for float comparisons
import pytest
assert 0.1 + 0.2 == pytest.approx(0.3)
assert calculate_average([1, 2]) == pytest.approx(1.5)

Testing That Exceptions Are Raised

Sometimes the correct behavior is to raise an exception. You learned in Chapter 11 that functions should raise exceptions for invalid input. Our naive calculate_average() currently raises ZeroDivisionError on an empty list and TypeError on non-numeric input. Here's how to test for that:

import pytest
from math_tools import calculate_average

def test_average_of_empty_list_raises():
    with pytest.raises(ZeroDivisionError):
        calculate_average([])

def test_average_rejects_non_numeric():
    with pytest.raises(TypeError):
        calculate_average(["a", "b", "c"])

The pytest.raises() context manager says: "I expect this code to raise this specific exception. If it doesn't raise it, the test fails."
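pytest.raises() can also check the exception's message, not just its type. Its match= parameter is treated as a regular expression searched against the message, and an `as excinfo` clause captures the exception object for further assertions. A short sketch, reusing the naive calculate_average() from above:

```python
import pytest

def calculate_average(numbers):
    """Naive version: dividing by len([]) raises ZeroDivisionError."""
    return sum(numbers) / len(numbers)

def test_empty_list_error_message():
    # match= searches the exception message with a regular expression
    with pytest.raises(ZeroDivisionError, match="division by zero"):
        calculate_average([])

def test_empty_list_error_object():
    # excinfo gives access to the raised exception object itself
    with pytest.raises(ZeroDivisionError) as excinfo:
        calculate_average([])
    assert "zero" in str(excinfo.value)
```

Checking the message matters when a function can raise the same exception type for different reasons: the type alone wouldn't tell you which failure path your test actually exercised.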

🔗 Spaced Review (Ch 11): Remember EAFP — "Easier to Ask Forgiveness than Permission"? Testing exception-raising behavior is EAFP in action. You're not checking if the exception might happen — you're asserting that it must happen for certain inputs. This is how you verify your error handling from Chapter 11 actually works.

Testing Dictionary-Returning Functions

🔗 Spaced Review (Ch 9): In Chapter 9, you learned to use dictionaries as structured data. Many real functions return dicts, and testing them requires checking specific keys and values.

import pytest

def build_student_record(name, scores):
    """Build a student record dict with computed average."""
    return {
        "name": name,
        "scores": scores,
        "average": sum(scores) / len(scores),
        "grade_count": len(scores)
    }

def test_student_record_structure():
    record = build_student_record("Alice", [90, 80, 70])
    assert record["name"] == "Alice"
    assert record["scores"] == [90, 80, 70]
    assert record["average"] == pytest.approx(80.0)
    assert record["grade_count"] == 3

def test_student_record_has_all_keys():
    record = build_student_record("Bob", [100])
    assert set(record.keys()) == {"name", "scores", "average", "grade_count"}

What Makes a Good Test?

A good test is:

  1. Focused — tests one behavior, not five
  2. Independent — doesn't depend on other tests or shared state
  3. Readable — the function name describes what's being tested
  4. Deterministic — produces the same result every time
  5. Fast — runs in milliseconds, not seconds

✅ Best Practice: Name your tests like sentences: test_average_of_empty_list_raises_error, not test_1 or test_calculate_average. When a test fails at 2 AM, you want to know exactly what broke without reading the test code.


13.5 Test-Driven Development

🚪 Threshold Concept: TDD Changes How You Think About Design

Test-driven development (TDD) is a technique where you write the test before you write the code. This sounds backwards until you try it — then it transforms how you think about programming.

  • Before TDD: "I write code, then check if it works."
  • After TDD: "I define what 'working' means first, then write code to satisfy that definition."

This isn't just a testing technique. It's a design technique. When you write the test first, you're forced to think about the function's interface — what it takes as input, what it returns, how it handles errors — before you write a single line of implementation. You're designing from the user's perspective, not the implementer's perspective.

Students who internalize TDD report that it reduces anxiety about where to start. Instead of staring at a blank screen thinking "how do I implement this?", you think "what should this function do?" — and that question is almost always easier to answer.

The Red-Green-Refactor Cycle

TDD follows a simple three-step loop:

  1. 🔴 Red: Write a test for behavior that doesn't exist yet. Run it. It fails. (This is expected — and important. A test that passes immediately might be testing nothing.)

  2. 🟢 Green: Write the minimum code to make the test pass. Don't optimize. Don't handle edge cases you haven't tested. Just make the red test go green.

  3. 🔄 Refactor: Clean up your code without changing its behavior. Rename variables, extract helper functions, remove duplication. Run the tests again to make sure your refactoring didn't break anything.

Then repeat: write the next test, make it pass, clean up.

Worked Example: TDD for calculate_average()

Let's build calculate_average() from scratch using TDD. Pretend the function doesn't exist yet.

Step 1 — 🔴 Red: Write the first test

# test_math_tools.py
from math_tools import calculate_average

def test_average_of_three_integers():
    assert calculate_average([10, 20, 30]) == 20.0

Run it:

pytest test_math_tools.py
E   ImportError: cannot import name 'calculate_average' from 'math_tools'

It fails. Good. That's Red.

Step 2 — 🟢 Green: Write minimum code to pass

# math_tools.py
def calculate_average(numbers):
    return sum(numbers) / len(numbers)

Run the test again:

test_math_tools.py .                                             [100%]
1 passed

Green. The test passes.

Step 3 — 🔄 Refactor: The code is simple enough — nothing to refactor yet.

Step 4 — 🔴 Red: Next test — what about a single number?

def test_average_of_single_number():
    assert calculate_average([42]) == 42.0

Run it — passes immediately. Good, our implementation already handles this case.

Step 5 — 🔴 Red: What about an empty list?

def test_average_of_empty_list_raises_error():
    with pytest.raises(ValueError, match="Cannot calculate average of empty list"):
        calculate_average([])

Run it:

E   Failed: DID NOT RAISE <class 'ValueError'>

Red — it raises ZeroDivisionError instead of a clear ValueError. Let's fix that.

Step 6 — 🟢 Green: Handle the empty list

# math_tools.py
def calculate_average(numbers):
    if not numbers:
        raise ValueError("Cannot calculate average of empty list")
    return sum(numbers) / len(numbers)

Run again — all three tests pass. Green.

Step 7 — 🔴 Red: What about floating-point inputs?

def test_average_with_floats():
    result = calculate_average([1.5, 2.5, 3.0])
    assert result == pytest.approx(2.3333, rel=1e-3)

Passes immediately — sum() and / handle floats. Green.

Notice the rhythm: test, code, clean up. Test, code, clean up. Each cycle takes seconds to minutes, not hours. And at every step, you know exactly what works and what doesn't.

💡 Intuition: TDD is like GPS navigation. Instead of memorizing the entire route before you start driving, you get one instruction at a time: "Turn left here." Each test is one instruction. You follow it, confirm you're on track, then get the next one. You always know where you are.


13.6 Edge Cases and Boundary Conditions

An edge case is an input at the extreme end of what a function should handle. A boundary condition is the exact point where behavior changes (e.g., the threshold between passing and failing a course).

Most bugs live at edges and boundaries. Experienced developers test these first.

Common Edge Cases to Always Consider

Category             Examples
Empty input          Empty list [], empty string "", empty dict {}
Single element       List with one item [42], string with one char "x"
Negative numbers     -1, -0.5, mixing positive and negative
Zero                 0 as divisor, zero-length, index 0
Very large values    10**18, float('inf')
Duplicates           All identical values [5, 5, 5, 5]
Type boundaries      Int vs. float: 5 vs 5.0
Special characters   Names with apostrophes, accented characters, emoji

Example: Testing a Letter Grade Function

def letter_grade(score):
    """Convert a numeric score (0-100) to a letter grade."""
    if score < 0 or score > 100:
        raise ValueError(f"Score must be 0-100, got {score}")
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    elif score >= 70:
        return "C"
    elif score >= 60:
        return "D"
    else:
        return "F"

A beginner might write one test: assert letter_grade(85) == "B". A professional thinks about boundaries:

def test_grade_boundaries():
    """Test the exact cutoff points where grades change."""
    assert letter_grade(90) == "A"   # Boundary: exactly 90
    assert letter_grade(89) == "B"   # Just below A cutoff
    assert letter_grade(80) == "B"   # Boundary: exactly 80
    assert letter_grade(79) == "C"   # Just below B cutoff
    assert letter_grade(70) == "C"
    assert letter_grade(69) == "D"
    assert letter_grade(60) == "D"
    assert letter_grade(59) == "F"

def test_grade_extremes():
    """Test the outermost valid values."""
    assert letter_grade(100) == "A"  # Maximum valid score
    assert letter_grade(0) == "F"    # Minimum valid score

def test_grade_invalid_scores():
    """Test that invalid scores raise ValueError."""
    with pytest.raises(ValueError):
        letter_grade(101)
    with pytest.raises(ValueError):
        letter_grade(-1)
    with pytest.raises(ValueError):
        letter_grade(150)

Notice how the boundary tests check both sides of each cutoff — 89 and 90, 79 and 80. That's where off-by-one errors hide.
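Boundary tables like this get repetitive. pytest's @pytest.mark.parametrize decorator runs one test function once per row of inputs, so each boundary shows up as its own reported test. A sketch of the same boundary checks in that style (letter_grade is repeated here so the file stands alone):

```python
import pytest

def letter_grade(score):
    """Convert a numeric score (0-100) to a letter grade."""
    if score < 0 or score > 100:
        raise ValueError(f"Score must be 0-100, got {score}")
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    elif score >= 70:
        return "C"
    elif score >= 60:
        return "D"
    return "F"

# Each (score, expected) pair becomes a separate test case in the report
@pytest.mark.parametrize("score, expected", [
    (90, "A"), (89, "B"),
    (80, "B"), (79, "C"),
    (70, "C"), (69, "D"),
    (60, "D"), (59, "F"),
    (100, "A"), (0, "F"),
])
def test_grade_boundaries(score, expected):
    assert letter_grade(score) == expected
```

A failing boundary now points at the exact pair that broke, e.g. test_grade_boundaries[89-B], instead of one big assertion block.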

🧩 Productive Struggle: Find the Bug

The following function is supposed to find the second-largest number in a list. It has a subtle bug. Try to find it without running the code — just by reading carefully.

def second_largest(numbers):
    """Return the second-largest number in a list."""
    if len(numbers) < 2:
        raise ValueError("Need at least two numbers")
    largest = max(numbers)
    remaining = [n for n in numbers if n != largest]
    return max(remaining)

Think about it before reading the answer.

The Bug

What happens if all numbers are the same? Like second_largest([5, 5, 5])?

The list comprehension [n for n in numbers if n != largest] removes all instances of the largest value. If every number equals the largest, remaining is an empty list, and max([]) raises a ValueError — but with an unhelpful message ("max() arg is an empty sequence") instead of something meaningful.

A better version would use sorting or track the two largest values:

def second_largest(numbers):
    if len(numbers) < 2:
        raise ValueError("Need at least two numbers")
    unique = sorted(set(numbers), reverse=True)
    if len(unique) < 2:
        raise ValueError("Need at least two distinct numbers")
    return unique[1]

This is exactly the kind of bug that edge case testing catches.

🔄 Check Your Understanding (try to answer without scrolling up)

  1. What is the difference between an edge case and a boundary condition?
  2. When testing a function that accepts a list, what are three edge cases you should always consider?

Verify

  1. An edge case is an input at the extreme end of valid input (empty list, very large number). A boundary condition is the exact point where behavior changes (the cutoff between an A and a B grade).
  2. Empty list, single-element list, and a list with all duplicate values. (Also valid: very large lists, lists with mixed types, lists with negative numbers.)

13.7 Fixtures and Setup/Teardown

When multiple tests need the same setup data, you don't want to copy-paste it into every test function. pytest fixtures solve this.

The Problem: Repeated Setup

# Without fixtures — lots of repetition
def test_task_count():
    tasks = [
        {"title": "Buy groceries", "priority": "high", "done": False},
        {"title": "Read chapter 13", "priority": "medium", "done": True},
        {"title": "Walk the dog", "priority": "low", "done": False},
    ]
    assert len(tasks) == 3

def test_completed_tasks():
    tasks = [
        {"title": "Buy groceries", "priority": "high", "done": False},
        {"title": "Read chapter 13", "priority": "medium", "done": True},
        {"title": "Walk the dog", "priority": "low", "done": False},
    ]
    completed = [t for t in tasks if t["done"]]
    assert len(completed) == 1

That task list is repeated verbatim. If you add a field (like "category"), you'd have to update every test.

The Solution: Fixtures

import pytest

@pytest.fixture
def sample_tasks():
    """Provide a list of sample tasks for testing."""
    return [
        {"title": "Buy groceries", "priority": "high", "done": False},
        {"title": "Read chapter 13", "priority": "medium", "done": True},
        {"title": "Walk the dog", "priority": "low", "done": False},
    ]

def test_task_count(sample_tasks):
    assert len(sample_tasks) == 3

def test_completed_tasks(sample_tasks):
    completed = [t for t in sample_tasks if t["done"]]
    assert len(completed) == 1

def test_high_priority_tasks(sample_tasks):
    high = [t for t in sample_tasks if t["priority"] == "high"]
    assert len(high) == 1
    assert high[0]["title"] == "Buy groceries"

A fixture is a function decorated with @pytest.fixture that returns test data. When a test function includes the fixture name as a parameter, pytest automatically calls the fixture and passes the result. Each test gets a fresh copy, so tests can't interfere with each other.
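Fixtures can also handle teardown. If a fixture uses yield instead of return, everything before the yield is setup, the yielded value goes to the test, and everything after the yield runs once the test finishes, even if it failed. A minimal sketch (the dict stands in for a real resource like a database connection):

```python
import pytest

@pytest.fixture
def db_connection():
    conn = {"open": True, "queries": []}  # setup: stand-in for a real connection
    yield conn                            # the test runs at this point
    conn["open"] = False                  # teardown: always runs after the test

def test_query_logging(db_connection):
    db_connection["queries"].append("SELECT 1")
    assert db_connection["open"]
```

This replaces unittest's setUp()/tearDown() pair with a single function that reads top to bottom.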

Fixtures for File Operations

Fixtures are especially useful for tests that need temporary files:

import pytest
import json
import os

@pytest.fixture
def temp_tasks_file(tmp_path):
    """Create a temporary JSON file with sample tasks."""
    tasks = [
        {"title": "Test task", "priority": "high", "done": False}
    ]
    file_path = tmp_path / "tasks.json"
    file_path.write_text(json.dumps(tasks))
    return file_path

def test_load_tasks_from_file(temp_tasks_file):
    with open(temp_tasks_file) as f:
        tasks = json.load(f)
    assert len(tasks) == 1
    assert tasks[0]["title"] == "Test task"

The tmp_path fixture is built into pytest — it provides a temporary directory that's automatically cleaned up after the test runs. No leftover test files cluttering your project.

🔗 Connection (Ch 12): In Chapter 12, you split code into modules. Fixtures often live in a special file called conftest.py — pytest automatically discovers fixtures defined there and makes them available to all test files in the directory. It's like a shared module for test setup.
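For example, the sample_tasks fixture from earlier could move into conftest.py like this (both files shown in one block via comments; test_reports.py is a hypothetical test file in the same directory):

```python
# conftest.py — pytest discovers fixtures defined here automatically
import pytest

@pytest.fixture
def sample_tasks():
    return [
        {"title": "Buy groceries", "priority": "high", "done": False},
        {"title": "Read chapter 13", "priority": "medium", "done": True},
    ]

# test_reports.py — no import of conftest needed; pytest injects by name
def test_pending_tasks(sample_tasks):
    pending = [t for t in sample_tasks if not t["done"]]
    assert len(pending) == 1
```

Any test file in the directory (or its subdirectories) can now take sample_tasks as a parameter without importing anything.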


13.8 Debugging Strategies

Tests tell you that something is wrong. Debugging tells you what and where. Even with great tests, you'll spend time debugging — it's an unavoidable part of programming. The question is whether you debug systematically or randomly.

Strategy 1: Print Debugging (The Quick and Dirty Approach)

The simplest debugging technique: add print() statements to see what's happening inside your code.

def find_discount_price(price, discount_percent):
    print(f"DEBUG: price={price}, discount={discount_percent}")  # Temporary
    discount_amount = price * discount_percent
    print(f"DEBUG: discount_amount={discount_amount}")  # Temporary
    final_price = price - discount_amount
    print(f"DEBUG: final_price={final_price}")  # Temporary
    return final_price

# When called: find_discount_price(100, 20)
# DEBUG: price=100, discount=20
# DEBUG: discount_amount=2000   <-- AHA! Should be 20, not 2000!
# DEBUG: final_price=-1900

The bug is obvious once you see the intermediate values: discount_percent should be divided by 100 (or passed as 0.20 instead of 20). Print debugging is fast and effective for simple bugs.
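Once the intermediate values reveal the problem, the fix is a one-line change, with the debug prints removed:

```python
def find_discount_price(price, discount_percent):
    """discount_percent is a whole number, e.g. 20 means 20% off."""
    discount_amount = price * (discount_percent / 100)  # the missing / 100
    return price - discount_amount

print(find_discount_price(100, 20))  # 80.0
```

The alternative fix, changing the caller to pass 0.20 instead of 20, would also work; what matters is that the function and its callers agree on the convention, and that a test pins it down.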

⚠️ Pitfall: Remove or comment out debug prints before committing your code. Leftover print(f"DEBUG: ...") statements in production code are a hallmark of rushed work.

Strategy 2: Rubber Duck Debugging

Explain your code, line by line, out loud — to a rubber duck, a stuffed animal, a patient friend, or an empty room. The act of articulating what each line should do often reveals the line where what it should do and what it actually does diverge.

This sounds silly. It works embarrassingly well. The key is to explain what the code does, not what you meant it to do. Read the actual code.

Strategy 3: Binary Search Debugging

🐛 Debugging Walkthrough: The Binary Search Approach

When you have a long function (or pipeline of functions) and something is wrong with the output, you could check every line from top to bottom. But that's like reading every page of a dictionary to find a word. Binary search is faster.

The technique:

  1. Find the midpoint of your code
  2. Add a check (print, assert, or breakpoint) at the midpoint
  3. Determine if the data is still correct at that point
  4. If correct at midpoint → the bug is in the second half
  5. If incorrect at midpoint → the bug is in the first half
  6. Repeat with the narrower half

Example: Elena's report script processes 500 donation records through 8 steps: load CSV, clean names, parse dates, validate amounts, compute totals, group by category, format output, write report.

The final report has wrong totals. Where's the bug?

  1. Check after step 4 (validate amounts): are the individual amounts correct? → Yes
  2. Bug is in steps 5-8. Check after step 6 (group by category): are the category groups correct? → No, some donations are in the wrong category
  3. Bug is in step 5 or 6. Check after step 5 (compute totals): are totals correct before grouping? → Yes
  4. The bug is in step 6 — the grouping logic.

Three checks instead of eight. For larger codebases, the savings are dramatic — binary search narrows the problem space by half with each check.
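Here is the idea in miniature, using a made-up four-step pipeline with a bug deliberately planted in step 3 (all function names and data are hypothetical):

```python
def load(raw):            # step 1: strip whitespace
    return [r.strip() for r in raw]

def parse(rows):          # step 2: convert to integers
    return [int(r) for r in rows]

def scale(values):        # step 3: BUG — the spec said multiply by 100
    return [v * 10 for v in values]

def total(values):        # step 4: sum everything
    return sum(values)

raw = [" 1 ", " 2 ", " 3 "]
values = parse(load(raw))

# Midpoint check: is the data still correct halfway through the pipeline?
assert values == [1, 2, 3]   # passes, so the bug must be in step 3 or 4

scaled = scale(values)
if scaled != [100, 200, 300]:
    print(f"bug is in scale(): got {scaled}")  # prints: bug is in scale(): got [10, 20, 30]
```

One midpoint assert cleared the first half of the pipeline; one more check pinned the bug to scale(). The same pattern works with prints or breakpoints instead of asserts.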

Strategy 4: Using the VS Code Debugger

The debugger is a tool built into VS Code (and most IDEs) that lets you pause your program at any line and inspect the state of every variable. It's more powerful than print debugging because you can step through code line by line and examine complex data structures interactively.

Key concepts:

  • Breakpoint: A marker on a line of code where execution will pause. In VS Code, click the gutter (left margin) next to a line number — a red dot appears.
  • Step Over (F10): Execute the current line and move to the next one.
  • Step Into (F11): If the current line calls a function, jump into that function.
  • Step Out (Shift+F11): Finish the current function and return to the caller.
  • Continue (F5): Resume execution until the next breakpoint.
  • Variables panel: Shows all variables and their current values.
  • Watch panel: Add specific expressions to monitor as you step through.

To debug a pytest test in VS Code:

  1. Open the test file
  2. Set a breakpoint on the line you want to inspect
  3. Right-click the test function and select "Debug Test"
  4. Use the debugger controls to step through and inspect

✅ Best Practice: Use print debugging for quick investigations. Use the VS Code debugger when you need to explore complex state, step through loops iteration by iteration, or understand how data flows through multiple function calls.


13.9 Writing Testable Code

Some code is easy to test. Some code fights you every step of the way. The difference usually comes down to how the code is structured.

The Problem: Logic Mixed with I/O

# HARD TO TEST — logic and I/O are tangled together
def get_grade_report():
    name = input("Enter student name: ")
    scores = input("Enter scores separated by commas: ")
    score_list = [int(s.strip()) for s in scores.split(",")]
    average = sum(score_list) / len(score_list)
    if average >= 90:
        grade = "A"
    elif average >= 80:
        grade = "B"
    elif average >= 70:
        grade = "C"
    elif average >= 60:
        grade = "D"
    else:
        grade = "F"
    print(f"{name}: {grade} ({average:.1f})")

How do you test this? You can't — not easily. It calls input() (requires user interaction) and print() (outputs to the screen instead of returning a value). The logic is trapped inside I/O operations.

The Solution: Separate Logic from I/O

# EASY TO TEST — pure logic, no I/O
def calculate_grade(scores):
    """Given a list of scores, return the average and letter grade."""
    if not scores:
        raise ValueError("No scores provided")
    average = sum(scores) / len(scores)
    if average >= 90:
        grade = "A"
    elif average >= 80:
        grade = "B"
    elif average >= 70:
        grade = "C"
    elif average >= 60:
        grade = "D"
    else:
        grade = "F"
    return {"average": average, "grade": grade}

# I/O lives in a separate function
def main():
    name = input("Enter student name: ")
    scores_input = input("Enter scores separated by commas: ")
    score_list = [int(s.strip()) for s in scores_input.split(",")]
    result = calculate_grade(score_list)
    print(f"{name}: {result['grade']} ({result['average']:.1f})")

Now calculate_grade() is a pure function — it takes input as parameters and returns output as a return value. No input(), no print(), no side effects. Testing it is trivial:

import pytest

def test_grade_a():
    result = calculate_grade([95, 92, 88, 97])
    assert result["grade"] == "A"
    assert result["average"] == pytest.approx(93.0)

def test_grade_f():
    result = calculate_grade([30, 45, 20])
    assert result["grade"] == "F"

Dependency Injection (A Preview)

What if your function needs to interact with a file or a database? Instead of hardcoding the dependency, pass it in as a parameter:

import json

# HARD TO TEST — hardcoded file path
def load_tasks():
    with open("tasks.json") as f:
        return json.load(f)

# EASY TO TEST — file path is a parameter
def load_tasks(file_path="tasks.json"):
    with open(file_path) as f:
        return json.load(f)

The second version lets tests pass a temporary file (using pytest's built-in tmp_path fixture) instead of relying on a real tasks.json being present. This technique — passing dependencies as parameters instead of hardcoding them — is called dependency injection. You'll use it extensively in Chapter 14 and beyond.

🔄 Check Your Understanding (try to answer without scrolling up)

  1. Why is a function that calls input() and print() hard to test?
  2. What does "separate logic from I/O" mean in practice?
  3. What is a pure function?

Verify

  1. Because input() requires interactive user input (you can't automate it easily) and print() outputs to the screen instead of returning a testable value.
  2. Put the computation (math, logic, data transformation) in functions that take parameters and return values. Put the I/O (input, print, file reading) in a separate function — usually main() — that calls the logic functions.
  3. A function that takes input only through its parameters and produces output only through its return value. No side effects, no global state, no I/O.
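
To make the pure/impure distinction concrete, here is a minimal illustrative sketch (both function names are invented for this example):

```python
# Pure: output depends only on the parameters. Calling it twice with the
# same argument always gives the same result, and nothing else changes.
def slugify(title):
    return title.lower().replace(" ", "-")

# Impure: touches state outside its parameters (a module-level counter)
# and performs I/O (print). Both are side effects that make tests harder
# to write and harder to isolate from one another.
_counter = 0

def next_task_id():
    global _counter
    _counter += 1
    print(f"issued id {_counter}")  # side effect: writes to the screen
    return _counter
```

Calling next_task_id() twice gives two different results, so a test's outcome depends on what ran before it. slugify() has no such history.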

13.10 Code Coverage

Code coverage measures what percentage of your code is actually executed by your tests. It answers the question: "Are there lines of code that no test ever touches?"

Installing and Using pytest-cov

pip install pytest-cov

Run your tests with coverage:

pytest --cov=math_tools test_math_tools.py

Example output:

---------- coverage: platform linux, python 3.12.0 -----------
Name              Stmts   Miss  Cover
--------------------------------------
math_tools.py         6      0   100%
--------------------------------------
TOTAL                 6      0   100%

You can also get a detailed line-by-line report:

pytest --cov=math_tools --cov-report=term-missing test_math_tools.py
Name              Stmts   Miss  Cover   Missing
------------------------------------------------
math_tools.py         6      1    83%   8
------------------------------------------------

The "Missing" column tells you exactly which lines aren't covered — line 8 in this case.

Why 100% Coverage Is Not the Goal

This might surprise you: 100% code coverage doesn't mean your code is correct. Here's why:

def add(a, b):
    return a + b

def test_add():
    add(2, 3)  # 100% coverage! But we never checked the result!

This test achieves 100% code coverage — every line of add() is executed. But it doesn't assert anything. The function could return "pizza" and the test would still pass.
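
The remedy costs exactly one line: assert on the return value. Coverage stays at 100%, but now a wrong result actually fails the test:

```python
def add(a, b):
    return a + b

def test_add():
    # Same coverage as the empty test above, but this time the result is
    # verified: if add() ever returns anything but 5 here, the test fails.
    assert add(2, 3) == 5

test_add()
```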

Coverage tells you what code was executed, not what code was verified. It's a measure of breadth, not depth.

Practical guideline: Aim for 80-90% coverage on most projects. The last 10-20% is usually boilerplate, I/O code, or error handlers for extremely rare conditions. Chasing 100% often leads to brittle, low-value tests that test implementation details rather than behavior.

⚠️ Pitfall: Don't confuse high coverage with high quality. A test suite with 95% coverage and thoughtful assertions is infinitely better than one with 100% coverage and no assertions. Coverage is a useful metric, not a goal in itself.


13.11 Project Checkpoint: TaskFlow v1.2 — Full Test Suite

🔗 Bridge from Chapter 12: In v1.1, you split TaskFlow into modules: models.py, storage.py, display.py, and cli.py. Now you'll write a comprehensive test suite that verifies each module works correctly — and works correctly together.

In this checkpoint, you'll create test_taskflow.py with tests covering:

  1. Adding tasks — verify a task is added with correct fields
  2. Deleting tasks — verify deletion by index, handle invalid indices
  3. Searching tasks — keyword search, case-insensitive matching
  4. Saving/loading JSON — round-trip persistence (save, then load, verify data)
  5. Error handling — invalid inputs raise appropriate errors

Here's the structure. Refer to code/project-checkpoint-tests.py for the complete implementation.

# test_taskflow.py
import pytest
import json

# Import the functions under test from your v1.1 modules. The exact
# module names depend on your Chapter 12 layout, e.g.:
# from models import create_task
# from storage import save_tasks, load_tasks

# --- Task model tests ---

def test_create_task():
    """A new task should have a title, priority, and done=False."""
    task = create_task("Buy groceries", "high")
    assert task["title"] == "Buy groceries"
    assert task["priority"] == "high"
    assert task["done"] is False

def test_create_task_default_priority():
    """Tasks should default to 'medium' priority."""
    task = create_task("Read chapter 13")
    assert task["priority"] == "medium"

def test_create_task_empty_title_raises():
    """Empty titles should raise ValueError."""
    with pytest.raises(ValueError):
        create_task("")

# --- Storage tests (with fixtures) ---

@pytest.fixture
def sample_task_list():
    return [
        create_task("Task A", "high"),
        create_task("Task B", "low"),
        create_task("Task C", "medium"),
    ]

def test_save_and_load_roundtrip(tmp_path, sample_task_list):
    """Tasks saved to JSON should load back identically."""
    file_path = tmp_path / "tasks.json"
    save_tasks(sample_task_list, str(file_path))
    loaded = load_tasks(str(file_path))
    assert loaded == sample_task_list

def test_load_missing_file_returns_empty():
    """Loading from a nonexistent file should return an empty list."""
    result = load_tasks("nonexistent_file_xyz.json")
    assert result == []

# --- Search tests ---

def test_search_finds_matching_tasks(sample_task_list):
    """Search should find tasks whose title contains the keyword."""
    results = search_tasks(sample_task_list, "Task A")
    assert len(results) == 1
    assert results[0]["title"] == "Task A"

def test_search_case_insensitive(sample_task_list):
    """Search should be case-insensitive."""
    results = search_tasks(sample_task_list, "task a")
    assert len(results) == 1

def test_search_no_results(sample_task_list):
    """Search with no matches should return empty list."""
    results = search_tasks(sample_task_list, "nonexistent")
    assert results == []

# --- Delete tests ---

def test_delete_task_by_index(sample_task_list):
    """Deleting by valid index should remove the task."""
    tasks = sample_task_list.copy()
    deleted = delete_task(tasks, 0)
    assert deleted["title"] == "Task A"
    assert len(tasks) == 2

def test_delete_invalid_index_raises(sample_task_list):
    """Deleting with an out-of-range index should raise IndexError."""
    with pytest.raises(IndexError):
        delete_task(sample_task_list, 99)

Run your test suite:

pytest test_taskflow.py -v

The -v flag gives verbose output, showing each test name and its pass/fail status:

test_taskflow.py::test_create_task PASSED
test_taskflow.py::test_create_task_default_priority PASSED
test_taskflow.py::test_create_task_empty_title_raises PASSED
test_taskflow.py::test_save_and_load_roundtrip PASSED
test_taskflow.py::test_load_missing_file_returns_empty PASSED
test_taskflow.py::test_search_finds_matching_tasks PASSED
test_taskflow.py::test_search_case_insensitive PASSED
test_taskflow.py::test_search_no_results PASSED
test_taskflow.py::test_delete_task_by_index PASSED
test_taskflow.py::test_delete_invalid_index_raises PASSED

========================== 10 passed in 0.05s ==========================

What You Just Built

Ten passing tests. Each one documents a specific behavior of your TaskFlow app. Each one runs in milliseconds. Each one will scream at you if a future change breaks something. That's a regression test suite — tests that catch regressions (things that used to work but now don't) when you modify code.

✅ Best Practice (From the field): In professional development, you almost never push code without running the test suite first. Many teams make this automatic: the tests run every time someone proposes a code change, and the change is rejected if any test fails. You'll learn more about this workflow in Chapter 25 (version control) and Chapter 26 (software development lifecycle).


Chapter Summary

You started this chapter knowing how to write code. You're ending it knowing how to prove your code works.

Testing isn't extra work — it's the work that makes all your other work trustworthy. TDD isn't just a testing technique — it's a design technique that forces you to think about what your code should do before you think about how. And systematic debugging — binary search, rubber ducking, using the debugger — replaces panicked guessing with methodical investigation.

Here's what we covered:

Topic               | Key Takeaway
--------------------|-------------------------------------------------------------------
Why testing matters | Bugs found late cost 50-100x more to fix than bugs caught by tests
Types of testing    | Unit tests (our focus), integration tests, E2E tests
pytest basics       | test_ prefix, assert statements, pytest command
TDD                 | Red-Green-Refactor: define "working" before writing code
Edge cases          | Empty inputs, boundaries, duplicates — test these first
Fixtures            | @pytest.fixture for shared test data
Debugging           | Print → Rubber duck → Binary search → Debugger
Testable code       | Separate logic from I/O; pure functions are easy to test
Code coverage       | 80-90% is practical; 100% is a vanity metric

Elena never got ambushed by the apostrophe bug again. She wrote a test: test_donor_name_with_apostrophe(). It failed. She fixed the bug. The test passed. And every time she modifies her report script, that test runs — automatically, silently, in milliseconds — making sure O'Brien's donation is always counted.

What's Next: In Chapter 14, you'll learn object-oriented programming — bundling data and behavior together into classes. The test-writing skills from this chapter will be essential: you'll test your classes from day one, and TDD will help you design clean, focused class interfaces.