Chapter 21: AI-Assisted Testing Strategies

"Beware of bugs in the above code; I have only proved it correct, not tested it." --- Donald Knuth

Learning Objectives

By the end of this chapter, you will be able to:

  • Explain why testing is more critical when working with AI-generated code than with manually written code (Understand)
  • Apply pytest fundamentals including fixtures, parametrized tests, and markers to organize and run test suites effectively (Apply)
  • Design integration testing strategies that verify how AI-generated components work together (Create)
  • Implement end-to-end tests that validate complete user workflows in AI-built applications (Apply)
  • Construct property-based tests using Hypothesis to uncover edge cases that AI-generated code may mishandle (Create)
  • Execute a test-driven development workflow where you write tests first and have AI implement the code (Apply)
  • Evaluate test coverage metrics and quality indicators to determine when a test suite is sufficient (Evaluate)
  • Integrate automated testing into continuous integration pipelines for AI-assisted projects (Apply)
  • Build a comprehensive, multi-layered test suite for a real application developed with AI assistance (Create)

Introduction

There is a persistent myth among newcomers to vibe coding that AI-generated code needs less testing because the AI "knows what it is doing." This belief is dangerously wrong. AI-generated code needs more testing, not less, and it needs different kinds of testing than code you write yourself. When you write code by hand, you carry a mental model of every decision you made, every edge case you considered, and every shortcut you took. When an AI generates code for you, that mental model does not exist in your head. The code might look correct, pass a quick visual inspection, and even work for the first few inputs you try --- yet harbor subtle bugs that surface only under specific conditions.

This chapter equips you with a thorough testing toolkit. We begin with the philosophical case for rigorous testing of AI code, then move through practical techniques: pytest fundamentals, integration testing, end-to-end testing, property-based testing with Hypothesis, test-driven development with AI as your implementer, mocking strategies, coverage metrics, and continuous testing workflows. By the end, you will be able to build a comprehensive test suite that gives you genuine confidence in AI-generated code.


21.1 Why Testing Matters More with AI-Generated Code

The Trust Gap

When you write code yourself, you build understanding line by line. You know why you chose a particular data structure, how you handle null values, and where the tricky parts are. With AI-generated code, you receive a finished product without that journey. This creates a trust gap --- the distance between what the code appears to do and what it actually does.

Consider a simple example. You ask an AI to write a function that calculates the average of a list of numbers:

def calculate_average(numbers: list[float]) -> float:
    """Calculate the arithmetic mean of a list of numbers."""
    return sum(numbers) / len(numbers)

This looks correct. It will work for most inputs. But what happens when you pass an empty list? You get a ZeroDivisionError. A human developer might or might not catch this --- but the critical difference is that a human developer thought about the function while writing it. With AI-generated code, you may never have thought about empty lists at all.
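
One defensive rewrite looks like this. It is a sketch, not the only correct design: raising is one choice, and returning math.nan or a default is another, but raising keeps the failure visible:

```python
import pytest

def calculate_average(numbers: list[float]) -> float:
    """Calculate the arithmetic mean, rejecting empty input explicitly."""
    if not numbers:
        raise ValueError("cannot compute the average of an empty list")
    return sum(numbers) / len(numbers)

def test_average_of_empty_list_raises():
    with pytest.raises(ValueError):
        calculate_average([])

def test_average_of_known_values():
    assert calculate_average([1.0, 2.0, 3.0]) == 2.0
```

The point is not this particular fix; it is that without a test for the empty list, you would never have known the decision needed to be made.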

Common AI Code Failure Modes

Testing AI-generated code is essential because AI assistants exhibit predictable failure patterns:

  1. Happy-path bias. AI models are trained on code that often demonstrates the "normal" case. Edge cases, boundary conditions, and error handling are frequently incomplete or missing entirely.

  2. Plausible but incorrect logic. AI-generated code can contain logic that looks right but produces wrong results for certain inputs. The syntax is valid, the variable names make sense, and the structure follows conventions --- but the algorithm has a subtle flaw.

  3. Inconsistent error handling. An AI might handle errors differently across functions in the same codebase, sometimes raising exceptions, sometimes returning None, and sometimes silently failing.

  4. Stale patterns. AI models may generate code using deprecated APIs, outdated library versions, or patterns that were common in training data but are no longer best practice.

  5. Context drift. As you iterate with an AI across a long conversation, later code may subtly contradict assumptions made in earlier code.

Key Insight: Testing is not about distrusting AI. It is about establishing a verification layer that catches problems regardless of their source. The best developers in the world write tests for their own code. AI-generated code deserves the same rigor --- and often more.

The Testing Mindset for Vibe Coding

Adopt this principle: the AI writes the implementation, you own the specification. Your tests are the specification. They define what the code should do, how it should handle errors, and what invariants must hold. When you write tests before or alongside AI-generated code, you are not doing busywork --- you are defining the contract that the code must fulfill.

This mindset transforms your relationship with AI-generated code from passive acceptance to active verification. You become the quality gate, and your tests are the mechanism through which that gate operates.

A Real-World Example of the Trust Gap

To make this concrete, consider a developer who asked an AI to implement a function for calculating compound interest:

def compound_interest(principal: float, rate: float, years: int) -> float:
    """Calculate compound interest."""
    return principal * (1 + rate) ** years

The developer tested it with a few values and it worked. They shipped it. Weeks later, a user reported that negative years produced bizarre results. Another user found that a rate of -1.5 (invalid in this context, since an account cannot lose more than its entire balance) was accepted silently, and because Python does not enforce type hints at runtime, a caller who also passed fractional years got a complex number back. A third user passed a rate of 5.0 expecting 5% interest and got wildly inflated balances: the docstring never specified whether rate should be 0.05 or 5.0 for 5% interest.

None of these bugs would have survived even a modest test suite:

import pytest

def test_compound_interest_negative_years_raises():
    with pytest.raises(ValueError):
        compound_interest(1000, 0.05, -1)

def test_compound_interest_invalid_rate_raises():
    with pytest.raises(ValueError):
        compound_interest(1000, -1.5, 5)

def test_compound_interest_known_value():
    # $1000 at 5% for 10 years = $1628.89
    result = compound_interest(1000, 0.05, 10)
    assert result == pytest.approx(1628.89, abs=0.01)

These three simple tests would have immediately revealed that the AI's implementation lacked input validation and that the interface was ambiguous about the rate format. This is the trust gap in action: the code looked correct, passed a superficial check, and hid three distinct bugs.
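
An implementation that satisfies those tests might look like the following. The choice to interpret rate as a decimal fraction is an assumption, made explicit in the docstring precisely because the original interface left it ambiguous:

```python
import pytest

def compound_interest(principal: float, rate: float, years: int) -> float:
    """Compound principal at a decimal-fraction rate (0.05 means 5%) over years."""
    if years < 0:
        raise ValueError("years must be non-negative")
    if rate <= -1:
        raise ValueError("rate cannot be -100% or lower")
    return principal * (1 + rate) ** years

def test_known_value():
    # $1000 at 5% for 10 years
    assert compound_interest(1000, 0.05, 10) == pytest.approx(1628.89, abs=0.01)
```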

The Economics of Testing AI Code

Some developers resist testing because they see it as slowing down the "vibe" of rapid AI-assisted development. This is short-sighted. Consider the economics:

  • Time to generate code with AI: Minutes
  • Time to write basic tests: Minutes
  • Time to debug a production bug without tests: Hours to days
  • Cost of a production bug in a financial application: Potentially enormous

The math is clear. Testing is the cheapest form of quality assurance, and it becomes even cheaper when you factor in that AI can help generate the tests themselves. The investment is small; the protection is significant.


21.2 Unit Testing with pytest

Why pytest

Python's built-in unittest module works, but pytest has become the de facto standard for Python testing for good reasons: simpler syntax, powerful fixtures, excellent plugin ecosystem, and readable output. When you ask an AI to generate tests, specify pytest --- the results will be cleaner and more maintainable.

Writing Your First pytest Tests

A pytest test is simply a function whose name starts with test_:

def test_addition():
    assert 1 + 1 == 2

def test_string_concatenation():
    result = "hello" + " " + "world"
    assert result == "hello world"

No classes needed, no inheritance from TestCase, no special assertion methods. Just assert statements with plain Python expressions. When a test fails, pytest provides detailed output showing exactly what went wrong.

Fixtures: Managing Test Dependencies

Fixtures are pytest's mechanism for setup and teardown. They replace the setUp and tearDown methods from unittest with something far more flexible:

import pytest

@pytest.fixture
def sample_user():
    """Create a sample user for testing."""
    return {
        "username": "testuser",
        "email": "test@example.com",
        "age": 30
    }

@pytest.fixture
def database_connection():
    """Create and tear down a test database connection."""
    conn = create_test_database()
    yield conn
    conn.close()
    cleanup_test_database()

def test_user_has_email(sample_user):
    assert "email" in sample_user
    assert "@" in sample_user["email"]

def test_user_age_positive(sample_user):
    assert sample_user["age"] > 0

The yield keyword in fixtures separates setup from teardown. Code before yield runs before the test; code after yield runs after the test completes (even if it fails).
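
Under the hood, a yield fixture is an ordinary generator, and pytest drives it roughly like this (a simplified stdlib-only illustration, not pytest's actual implementation):

```python
def db_fixture():
    conn = {"open": True}   # setup: a stand-in for a real connection
    yield conn              # hand the resource to the test
    conn["open"] = False    # teardown: runs after the test finishes

gen = db_fixture()
conn = next(gen)            # pytest advances the generator to the yield (setup)
assert conn["open"]         # ... the test body runs here ...
try:
    next(gen)               # pytest resumes past the yield (teardown)
except StopIteration:
    pass
assert conn["open"] is False
```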

Fixture scopes control how often a fixture is created:

@pytest.fixture(scope="session")
def expensive_resource():
    """Created once for the entire test session."""
    return load_large_dataset()

@pytest.fixture(scope="module")
def module_resource():
    """Created once per test module."""
    return initialize_module_state()

@pytest.fixture(scope="function")  # This is the default
def fresh_resource():
    """Created anew for each test function."""
    return create_clean_state()

Parametrized Tests

Parametrization lets you run the same test logic with different inputs, which is especially valuable for testing AI-generated code across multiple scenarios:

@pytest.mark.parametrize("input_val,expected", [
    (0, "zero"),
    (1, "one"),
    (-1, "negative"),
    (100, "positive"),
    (999999, "positive"),
])
def test_classify_number(input_val, expected):
    result = classify_number(input_val)
    assert result == expected

When testing AI-generated functions, parametrize aggressively. Include normal cases, boundary values, negative numbers, zero, empty strings, None, very large values, and Unicode characters. AI-generated code frequently fails on these edge cases.

Markers: Organizing Tests

Markers let you categorize and selectively run tests:

@pytest.mark.slow
def test_large_dataset_processing():
    """This test takes several seconds."""
    result = process_million_records()
    assert len(result) == 1_000_000

@pytest.mark.integration
def test_api_endpoint():
    """Requires a running server."""
    response = requests.get("http://localhost:8000/api/health")
    assert response.status_code == 200

@pytest.mark.skip(reason="Waiting for API key")
def test_external_service():
    pass

@pytest.mark.xfail(reason="Known bug in AI-generated parser")
def test_malformed_input():
    result = parse_data("<<<invalid>>>")
    assert result is None

Run specific markers from the command line:

pytest -m "not slow"          # Skip slow tests
pytest -m integration         # Only integration tests
pytest -m "unit and not slow" # Fast unit tests only
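
Custom markers like slow and integration should be registered, or recent pytest versions emit PytestUnknownMarkWarning for each one. One way to register them, assuming a pytest.ini at the project root:

```ini
# pytest.ini
[pytest]
markers =
    slow: tests that take several seconds to run
    integration: tests that require external services
```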

Prompting Tip: When asking AI to generate tests, specify the testing framework and style explicitly: "Write pytest tests with fixtures and parametrize for the following function. Include edge cases for empty input, None, negative numbers, and very large values. Use type hints and docstrings."

The conftest.py File

The conftest.py file is pytest's mechanism for sharing fixtures across multiple test files. Place it in your test directory, and any fixture defined there is automatically available to all test files in that directory and its subdirectories:

# conftest.py
import pytest

@pytest.fixture
def app_config():
    """Shared application configuration for all tests."""
    return {
        "debug": True,
        "database_url": "sqlite:///:memory:",
        "secret_key": "test-secret-key"
    }

@pytest.fixture
def mock_api_client(app_config):
    """Create a mock API client using shared config."""
    client = MockAPIClient(config=app_config)
    yield client
    client.cleanup()

21.3 Integration Testing Strategies

What Integration Tests Verify

Unit tests verify individual functions in isolation. Integration tests verify that multiple components work together correctly. With AI-generated code, integration testing is particularly important because the AI may generate components that are individually correct but incompatible with each other.

Common integration test scenarios include:

  • Database interactions: Does the code correctly read from and write to the database?
  • API communication: Do the request and response handlers work together?
  • File I/O: Does the code correctly read, process, and write files?
  • Service interactions: Do multiple services communicate correctly?

Testing Database Interactions

import pytest
import sqlite3

@pytest.fixture
def db_connection():
    """Create an in-memory SQLite database for testing."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE users (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            username TEXT UNIQUE NOT NULL,
            email TEXT NOT NULL,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()
    yield conn
    conn.close()

def test_create_and_retrieve_user(db_connection):
    """Test that a user can be created and then retrieved."""
    # Create
    db_connection.execute(
        "INSERT INTO users (username, email) VALUES (?, ?)",
        ("alice", "alice@example.com")
    )
    db_connection.commit()

    # Retrieve
    cursor = db_connection.execute(
        "SELECT username, email FROM users WHERE username = ?",
        ("alice",)
    )
    row = cursor.fetchone()

    assert row is not None
    assert row[0] == "alice"
    assert row[1] == "alice@example.com"

def test_duplicate_username_raises_error(db_connection):
    """Test that duplicate usernames are rejected."""
    db_connection.execute(
        "INSERT INTO users (username, email) VALUES (?, ?)",
        ("bob", "bob@example.com")
    )
    db_connection.commit()

    with pytest.raises(sqlite3.IntegrityError):
        db_connection.execute(
            "INSERT INTO users (username, email) VALUES (?, ?)",
            ("bob", "bob2@example.com")
        )

Testing API Endpoints

For web applications, use test clients provided by your framework. Here is an example with Flask:

import pytest
from myapp import create_app

@pytest.fixture
def client():
    """Create a test client for the Flask application."""
    app = create_app(testing=True)
    with app.test_client() as client:
        yield client

def test_health_endpoint(client):
    response = client.get("/api/health")
    assert response.status_code == 200
    data = response.get_json()
    assert data["status"] == "healthy"

def test_create_item_requires_auth(client):
    response = client.post("/api/items", json={"name": "Test"})
    assert response.status_code == 401

def test_create_item_with_auth(client, auth_headers):
    response = client.post(
        "/api/items",
        json={"name": "Test Item", "price": 9.99},
        headers=auth_headers
    )
    assert response.status_code == 201
    data = response.get_json()
    assert data["name"] == "Test Item"

Warning

AI-generated integration tests often use real external services (live APIs, production databases). Always ensure your tests use test doubles, in-memory databases, or sandbox environments. A test that calls a production API is a test that can cause real damage.

The Arrange-Act-Assert Pattern

Structure every test clearly using the AAA pattern:

def test_order_total_with_discount():
    # Arrange: Set up the test data and preconditions
    order = Order(items=[
        Item("Widget", price=10.00, quantity=3),
        Item("Gadget", price=25.00, quantity=1),
    ])
    discount = PercentageDiscount(10)

    # Act: Execute the behavior being tested
    total = order.calculate_total(discount=discount)

    # Assert: Verify the results
    assert total == pytest.approx(49.50)

21.4 End-to-End Testing

The Purpose of E2E Tests

End-to-end tests simulate real user workflows from start to finish. They are the highest-level tests in your pyramid and verify that the entire system works as a user would experience it. While slower and more brittle than unit tests, E2E tests catch integration issues that lower-level tests miss.

E2E Testing for CLI Applications

For command-line applications (a common output of vibe coding), use subprocess or Click's CliRunner:

from click.testing import CliRunner
from myapp.cli import main

def test_full_workflow():
    """Test a complete user workflow through the CLI."""
    runner = CliRunner()

    # Step 1: Initialize the application
    result = runner.invoke(main, ["init", "--name", "test-project"])
    assert result.exit_code == 0
    assert "Project initialized" in result.output

    # Step 2: Add an item
    result = runner.invoke(main, ["add", "Buy groceries", "--priority", "high"])
    assert result.exit_code == 0

    # Step 3: List items
    result = runner.invoke(main, ["list"])
    assert result.exit_code == 0
    assert "Buy groceries" in result.output
    assert "high" in result.output

    # Step 4: Complete the item
    result = runner.invoke(main, ["complete", "1"])
    assert result.exit_code == 0

    # Step 5: Verify completion
    result = runner.invoke(main, ["list", "--show-completed"])
    assert "Buy groceries" in result.output
    assert "[done]" in result.output.lower() or "completed" in result.output.lower()

E2E Testing for Web Applications

For web applications, Playwright or Selenium automate browser interactions:

import pytest
from playwright.sync_api import Page

def test_user_registration_flow(page: Page):
    """Test the complete user registration workflow."""
    # Navigate to registration page
    page.goto("http://localhost:3000/register")

    # Fill in the registration form
    page.fill("#username", "newuser")
    page.fill("#email", "newuser@example.com")
    page.fill("#password", "SecurePass123!")
    page.fill("#confirm-password", "SecurePass123!")

    # Submit the form
    page.click("button[type='submit']")

    # Verify redirect to dashboard
    page.wait_for_url("**/dashboard")
    assert "Welcome, newuser" in page.text_content("h1")

The Testing Pyramid

Structure your tests following the testing pyramid:

         /\
        /  \        E2E Tests (few, slow, high confidence)
       /    \
      /------\
     /        \     Integration Tests (moderate number)
    /          \
   /------------\
  /              \  Unit Tests (many, fast, focused)
 /________________\

  • Unit tests (70-80% of tests): Fast, isolated, test individual functions
  • Integration tests (15-25% of tests): Test component interactions
  • E2E tests (5-10% of tests): Test complete workflows

Practical Tip: When building with AI, start with unit tests for each generated function, then add integration tests for components that interact, and finally write a handful of E2E tests for critical user workflows. Do not try to achieve comprehensive E2E coverage --- it is too slow and too brittle.


21.5 Property-Based Testing with Hypothesis

Why Property-Based Testing Is Powerful for AI Code

Traditional example-based tests check specific inputs against expected outputs. Property-based testing takes a fundamentally different approach: you define properties that should hold true for any valid input, and the testing framework generates hundreds of random inputs to verify those properties.

This approach is exceptionally valuable for AI-generated code because:

  1. It tests inputs you never thought of. The AI might not handle Unicode strings, extremely long inputs, or negative numbers. Hypothesis will find these gaps.
  2. It finds edge cases systematically. Rather than guessing which inputs might cause problems, you let the framework discover them.
  3. It verifies invariants. Instead of checking "does f(3) equal 9?", you check "does f always return a non-negative number?" --- a much stronger guarantee.
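
You can see the core idea with nothing but the standard library: generate random inputs and check that the code survives them. Hypothesis does this far better (smarter generation, automatic shrinking of failing cases to minimal examples), but even this crude loop finds the empty-list bug from Section 21.1:

```python
import random

def calculate_average(numbers):          # the unguarded version from 21.1
    return sum(numbers) / len(numbers)

random.seed(0)                           # make the run reproducible
failures = []
for _ in range(200):
    lst = [random.uniform(-1e6, 1e6) for _ in range(random.randint(0, 5))]
    try:
        calculate_average(lst)
    except ZeroDivisionError:
        failures.append(lst)

# The random search reliably stumbles onto the empty-list input.
assert failures, "expected at least one crashing input"
```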

Getting Started with Hypothesis

from hypothesis import given, settings, assume
from hypothesis import strategies as st

@given(st.integers(), st.integers())
def test_addition_is_commutative(a, b):
    """Addition should be commutative for all integers."""
    assert a + b == b + a

@given(st.lists(st.integers()))
def test_sorting_preserves_length(lst):
    """Sorting a list should not change its length."""
    sorted_lst = sorted(lst)
    assert len(sorted_lst) == len(lst)

@given(st.lists(st.integers(), min_size=1))
def test_sorted_list_is_ordered(lst):
    """Every element in a sorted list should be <= the next element."""
    sorted_lst = sorted(lst)
    for i in range(len(sorted_lst) - 1):
        assert sorted_lst[i] <= sorted_lst[i + 1]

Hypothesis Strategies

Strategies are Hypothesis's way of describing what kind of data to generate:

# Basic strategies
st.integers()                          # Any integer
st.integers(min_value=0, max_value=100)  # Bounded integers
st.floats(allow_nan=False)             # Floats without NaN
st.text()                              # Any Unicode string
st.text(min_size=1, max_size=50)       # Bounded strings
st.booleans()                          # True or False
st.none()                              # Always None

# Collection strategies
st.lists(st.integers())               # Lists of integers
st.lists(st.text(), min_size=1)        # Non-empty lists of strings
st.dictionaries(st.text(), st.integers())  # Dict[str, int]
st.tuples(st.integers(), st.text())    # Tuple[int, str]

# Composite strategies for custom data
@st.composite
def user_strategy(draw):
    """Generate random but valid user data."""
    username = draw(st.text(
        alphabet=st.characters(whitelist_categories=("L", "N")),
        min_size=3,
        max_size=20
    ))
    age = draw(st.integers(min_value=13, max_value=120))
    email = draw(st.emails())
    return {"username": username, "age": age, "email": email}

Verifying AI-Generated Code with Properties

Suppose an AI generates a function to encode and decode data. You can verify it with round-trip properties:

from hypothesis import given
from hypothesis import strategies as st

@given(st.text())
def test_encode_decode_roundtrip(original):
    """Encoding then decoding should return the original string."""
    encoded = encode(original)
    decoded = decode(encoded)
    assert decoded == original

@given(st.binary())
def test_compress_decompress_roundtrip(data):
    """Compressing then decompressing should return original data."""
    compressed = compress(data)
    decompressed = decompress(compressed)
    assert decompressed == data

Other powerful property patterns:

# Idempotency: doing it twice is the same as doing it once
@given(st.text())
def test_normalize_is_idempotent(text):
    once = normalize(text)
    twice = normalize(normalize(text))
    assert once == twice

# Invariant preservation: certain properties always hold
@given(st.lists(st.integers()))
def test_sort_preserves_elements(lst):
    sorted_lst = sorted(lst)
    assert sorted(sorted_lst) == sorted(lst)  # Same elements

# Oracle testing: compare against a known-good implementation
@given(st.integers(min_value=0, max_value=30))  # small n keeps the slow oracle tractable
def test_fast_fib_matches_slow_fib(n):
    """Compare AI's optimized fibonacci against a known-correct version."""
    assert fast_fibonacci(n) == slow_but_correct_fibonacci(n)

Key Insight: Property-based testing is your most powerful tool for verifying AI-generated code. While you might not know every specific output to expect, you almost always know properties that outputs should satisfy. Leverage those properties relentlessly.


21.6 Test-Driven Development with AI

The TDD-AI Workflow

Test-driven development (TDD) takes on a new dimension with AI assistance. The classic TDD cycle is Red-Green-Refactor:

  1. Red: Write a failing test
  2. Green: Write the minimum code to pass the test
  3. Refactor: Clean up while keeping tests green

With AI, this becomes:

  1. Red: You write a failing test that specifies desired behavior
  2. Green: AI implements the code to pass your test
  3. Refactor: AI refactors while you verify tests remain green

This workflow is powerful because it places you in control of the specification while leveraging AI for implementation.

Practical TDD-AI Example

Let us build a password validator step by step.

Step 1: Write the first test (you)

def test_password_minimum_length():
    """Password must be at least 8 characters."""
    assert validate_password("Ab1!xxxx") is True
    assert validate_password("Ab1!xxx") is False  # Only 7 chars

Step 2: Prompt the AI

"Implement a validate_password function that passes this test. The password must be at least 8 characters long. Return True if valid, False otherwise."

Step 3: AI generates implementation

def validate_password(password: str) -> bool:
    """Validate a password against security requirements."""
    if len(password) < 8:
        return False
    return True

Step 4: Write the next test (you)

def test_password_requires_uppercase():
    """Password must contain at least one uppercase letter."""
    assert validate_password("abcdefg1!") is False
    assert validate_password("Abcdefg1!") is True

Step 5: Prompt the AI to extend

"Update validate_password to also require at least one uppercase letter. It should still pass all previous tests."

Step 6: Continue iterating

Each cycle adds a new requirement: lowercase letters, digits, special characters, no common passwords, no repeated characters. Your tests accumulate into a comprehensive specification, and the AI incrementally builds the implementation.
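
After several cycles, the accumulated implementation might look like the sketch below. The exact rule set is illustrative; your tests define the real contract:

```python
import string

def validate_password(password: str) -> bool:
    """Validate a password: length, uppercase, lowercase, digit, special char."""
    if len(password) < 8:
        return False
    if not any(c.isupper() for c in password):
        return False
    if not any(c.islower() for c in password):
        return False
    if not any(c.isdigit() for c in password):
        return False
    if not any(c in string.punctuation for c in password):
        return False
    return True
```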

Benefits of TDD with AI

  1. You maintain control. The tests define behavior; the AI is an implementation tool.
  2. Incremental verification. Each small addition is verified before moving on.
  3. Living documentation. Your tests document every requirement.
  4. Regression protection. When you ask the AI to change something, existing tests catch regressions.
  5. Clearer prompts. Failing tests give the AI an unambiguous specification.

Prompting Tip: Include your test code directly in your prompt to the AI. Say: "Here are my tests. Implement the function(s) that make all these tests pass. Do not modify the tests." This gives the AI a precise specification to work from.

When Tests Fail After AI Implementation

When the AI's implementation does not pass your tests, you have valuable information:

  • The test is wrong: Re-examine your test. Does it correctly specify the desired behavior?
  • The prompt was ambiguous: Clarify your requirements and re-prompt.
  • The AI made an error: Share the failing test output with the AI and ask it to fix the implementation.
  • The requirement is contradictory: Sometimes adding a new test reveals that your requirements conflict with each other.

Each scenario is a learning opportunity that improves both your specification and the implementation.


21.7 Mocking and Test Doubles

Why Mocking Matters

Real applications depend on external systems: databases, APIs, file systems, clocks, random number generators. Tests that depend on external systems are slow, flaky, and hard to reproduce. Mocking replaces these dependencies with controlled substitutes.

AI-generated code frequently calls external services. Without mocking, you cannot test that code reliably.

Types of Test Doubles

  • Stub: Returns predetermined values. "When get_user(1) is called, return this user object."
  • Mock: Records how it was called. "Verify that send_email was called exactly once with these arguments."
  • Fake: A simplified working implementation. An in-memory database instead of PostgreSQL.
  • Spy: Wraps a real object to record calls while still executing real behavior.
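
The distinctions are easier to see with a hand-rolled double. Here a fake email sender (all names invented for illustration) stands in for a real SMTP client while also recording calls, mock-style:

```python
class FakeEmailSender:
    """A test double: records every send instead of talking to SMTP."""
    def __init__(self):
        self.sent = []          # call record, like a mock

    def send(self, to: str, subject: str, body: str) -> bool:
        self.sent.append((to, subject, body))
        return True             # stubbed return value

def notify_user(sender, email: str, message: str) -> bool:
    return sender.send(to=email, subject="Notification", body=message)

def test_notify_records_one_send():
    sender = FakeEmailSender()
    assert notify_user(sender, "a@example.com", "Hello") is True
    assert sender.sent == [("a@example.com", "Notification", "Hello")]
```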

Using unittest.mock

Python's standard library provides a powerful mocking framework:

from unittest.mock import Mock, patch, MagicMock

# Creating a mock object
mock_api = Mock()
mock_api.get_user.return_value = {"id": 1, "name": "Alice"}

# Using the mock
result = mock_api.get_user(1)
assert result["name"] == "Alice"
mock_api.get_user.assert_called_once_with(1)

The patch Decorator

patch temporarily replaces an object with a mock during a test:

from unittest.mock import patch

@patch("myapp.services.requests.get")
def test_fetch_weather(mock_get):
    """Test weather fetching without calling the real API."""
    mock_get.return_value.status_code = 200
    mock_get.return_value.json.return_value = {
        "temperature": 72,
        "condition": "sunny"
    }

    result = fetch_weather("New York")

    assert result["temperature"] == 72
    assert result["condition"] == "sunny"
    mock_get.assert_called_once()

Using pytest-mock

The pytest-mock plugin provides a cleaner fixture-based interface:

def test_send_notification(mocker):
    """Test notification sending without actually sending."""
    mock_send = mocker.patch("myapp.notifications.send_email")
    mock_send.return_value = True

    result = notify_user(user_id=1, message="Hello")

    assert result is True
    mock_send.assert_called_once_with(
        to="user1@example.com",
        subject="Notification",
        body="Hello"
    )

Mocking Best Practices

  1. Mock at the boundary. Mock external dependencies, not internal functions. If your code calls requests.get, mock requests.get, not some internal wrapper.

  2. Mock the right target. Use the import path where the object is used, not where it is defined:

# If myapp/services.py does: from requests import get
# Mock it as:
@patch("myapp.services.get")  # Where it's used
# NOT:
@patch("requests.get")  # Where it's defined
  3. Do not over-mock. If you mock everything, your tests verify your mocks, not your code. Mock external boundaries; let internal logic run for real.

  4. Use spec. Passing spec=SomeClass to Mock (or autospec=True to patch) ensures your mock exposes the same interface as the real object, catching attribute errors:

mock_db = Mock(spec=DatabaseConnection)
mock_db.qurey()  # AttributeError! (catches typo "qurey")

Warning

AI-generated test code often mocks too aggressively, replacing internal functions with mocks and creating tests that pass even when the code is completely broken. Review AI-generated mocks carefully and ensure they only replace genuine external dependencies.


21.8 Test Coverage and Quality Metrics

What Coverage Measures

Test coverage measures which lines, branches, or paths in your code are executed during testing. The most common metric is line coverage: the percentage of lines executed by at least one test.

# Run tests with coverage
pytest --cov=myapp --cov-report=term-missing

# Output example:
# Name                    Stmts   Miss  Cover   Missing
# myapp/models.py            45      3    93%   67-69
# myapp/services.py          82     12    85%   34-38, 91-97
# myapp/utils.py             28      0   100%
# TOTAL                     155     15    90%

Coverage Types

  • Line coverage: Which lines were executed (most common)
  • Branch coverage: Which branches of conditionals were taken (more thorough)
  • Path coverage: Which complete execution paths were traversed (most thorough, rarely practical)

def categorize(value: int) -> str:
    if value > 0:       # Branch 1: True/False
        if value > 100: # Branch 2: True/False
            return "large positive"
        return "small positive"
    return "non-positive"

100% line coverage requires three tests (one for each return statement). 100% branch coverage requires exercising both outcomes of each if, which the same three tests achieve. Because the inner conditional is only reached when the outer one is true, this function has only three distinct execution paths; with sequential (non-nested) conditionals the number of paths multiplies quickly, which is why full path coverage is rarely practical.
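A minimal test set achieving full branch coverage for categorize looks like this (plain asserts for brevity; in a real suite these would be parametrized pytest cases):

```python
def categorize(value: int) -> str:
    if value > 0:
        if value > 100:
            return "large positive"
        return "small positive"
    return "non-positive"

# Exercise both outcomes of the outer if and both outcomes of the inner if:
assert categorize(150) == "large positive"  # outer True, inner True
assert categorize(50) == "small positive"   # outer True, inner False
assert categorize(0) == "non-positive"      # outer False (boundary)
assert categorize(-7) == "non-positive"     # outer False (negative)
```

Note that the boundary values 0, 1, 100, and 101 are where off-by-one bugs hide; a thorough suite tests each of them explicitly.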

When Coverage Matters (and When It Doesn't)

Coverage is a useful metric with important limitations:

Coverage is useful for:

  • Identifying untested code paths
  • Ensuring edge cases are tested
  • Detecting dead code
  • Setting minimum quality bars in CI

Coverage does NOT tell you:

  • Whether your tests assert the right things
  • Whether your tests cover meaningful scenarios
  • Whether your code is correct

A test suite with 100% coverage can still miss bugs if the assertions are weak or wrong:

def test_bad_coverage():
    """This gives 100% coverage but tests nothing useful."""
    result = complex_calculation(42)
    assert result is not None  # Weak assertion!

Coverage Targets

For AI-generated code, aim for these targets:

  • Core business logic: 90-100% line coverage, 80%+ branch coverage
  • Utility functions: 100% line coverage (they're usually small and self-contained)
  • API endpoints: 85%+ coverage including error paths
  • Configuration and boilerplate: 60-70% (diminishing returns)
  • Overall project: 80%+ is a good target

Practical Tip: Do not chase 100% coverage everywhere. Instead, use coverage reports to find untested code that matters. An uncovered error handler for a critical financial calculation is far more important than an uncovered __repr__ method.

Mutation Testing: A Stronger Metric

Mutation testing goes beyond coverage by checking whether your tests actually detect bugs. It introduces small changes (mutations) to your code --- flipping a > to <, changing a + to a -, removing a return statement --- and checks whether any test fails. If no test catches a mutation, your tests have a gap.

# Using mutmut for Python mutation testing
mutmut run --paths-to-mutate=myapp/
mutmut results

Mutation testing is computationally expensive but reveals weaknesses that coverage alone cannot. Consider running it periodically on critical code paths rather than on every commit.
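To see why mutation testing is stronger than coverage, here is the idea by hand, using a hypothetical add function and a mutant with its + flipped to -. The weak check gives full line coverage yet cannot tell the two apart; the exact check kills the mutant.

```python
def add(a: int, b: int) -> int:
    return a + b          # original

def add_mutant(a: int, b: int) -> int:
    return a - b          # mutation: + flipped to -

# Weak check: executes every line (100% coverage) but the mutant survives.
def weak_check(fn):
    return fn(0, 0) is not None   # True for both versions

# Strong check: asserts an exact value and kills the mutant.
def strong_check(fn):
    return fn(2, 3) == 5

assert weak_check(add) and weak_check(add_mutant)        # mutant survives
assert strong_check(add) and not strong_check(add_mutant)  # mutant killed
```

Tools like mutmut automate exactly this loop across your whole codebase, reporting every mutant that no test kills.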


21.9 Continuous Testing Workflows

Integrating Tests into CI/CD

Tests provide value only if they run consistently. Continuous Integration (CI) automates test execution on every code change. Here is a GitHub Actions workflow for a Python project:

# .github/workflows/test.yml
name: Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install -r requirements-test.txt

      - name: Run linting
        run: |
          ruff check .
          mypy myapp/

      - name: Run tests
        run: |
          pytest --cov=myapp --cov-report=xml -v

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          file: ./coverage.xml

Test Organization for CI

Structure your test commands to enable incremental testing:

# Fast feedback: unit tests only (seconds)
pytest tests/unit/ -x --timeout=10

# Medium feedback: unit + integration (minutes)
pytest tests/unit/ tests/integration/ --timeout=60

# Full validation: everything including E2E (minutes to hours)
pytest --timeout=300

The -x flag stops on the first failure, providing fast feedback during development.

Pre-Commit Hooks

Run fast tests before every commit to catch problems early:

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: pytest-fast
        name: Run fast tests
        entry: pytest tests/unit/ -x -q --timeout=10
        language: system
        pass_filenames: false
        always_run: true

Continuous Testing During Development

Modern IDEs and tools can run tests automatically as you save files:

# Watch mode: re-run tests on file changes
pytest-watch --runner "pytest tests/unit/ -x -q"

# Or using ptw (pytest-watch)
ptw -- --testmon  # Only re-run tests affected by changes (requires pytest-testmon)

Prompting Tip: When you ask AI to build a feature, also ask it to generate the CI configuration: "Generate a GitHub Actions workflow that runs pytest with coverage on Python 3.11 and 3.12, fails if coverage drops below 80%, and runs mypy for type checking."


21.10 Building a Comprehensive Test Suite

Test Suite Architecture

A well-organized test suite mirrors your application structure:

project/
    myapp/
        __init__.py
        models.py
        services.py
        api.py
        utils.py
    tests/
        __init__.py
        conftest.py              # Shared fixtures
        unit/
            __init__.py
            test_models.py
            test_services.py
            test_utils.py
        integration/
            __init__.py
            test_database.py
            test_api.py
        e2e/
            __init__.py
            test_workflows.py
        property/
            __init__.py
            test_properties.py

A Complete Test Strategy for AI-Built Applications

When building an application with AI, follow this testing strategy:

Phase 1: Specification (before AI generates code)

  1. Write key test cases that define the desired behavior
  2. Define properties that should hold for all inputs
  3. Identify integration points that need testing

Phase 2: Verification (after AI generates code)

  1. Run your pre-written tests against the AI code
  2. Review and fix failures
  3. Add tests for behaviors you discover during review
  4. Run property-based tests to find edge cases

Phase 3: Hardening (before deployment)

  1. Achieve target coverage levels
  2. Add E2E tests for critical user workflows
  3. Test error handling and failure modes
  4. Run mutation testing on critical paths

Prompting AI to Generate Tests

AI is excellent at generating tests when given the right prompts. Here are effective patterns:

Pattern 1: Generate tests for existing code

"Write comprehensive pytest tests for the following function. Include:

  • Normal/happy-path cases
  • Edge cases (empty input, None, boundary values)
  • Error cases (invalid input types, out-of-range values)
  • Use parametrize for multiple test cases
  • Include docstrings explaining each test

def calculate_discount(price: float, percentage: float) -> float: ..."
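For illustration, here is the shape of response that prompt should produce. The implementation of calculate_discount below is an assumption (percentage expressed as 0-100, ValueError outside that range, non-negative prices); only the structure of the test cases is the point.

```python
import math

def calculate_discount(price: float, percentage: float) -> float:
    """Assumed implementation, for illustration only."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if price < 0:
        raise ValueError("price must be non-negative")
    return price * (1 - percentage / 100)

# Happy-path and boundary cases (would be @pytest.mark.parametrize in a suite):
cases = [
    (100.0, 10.0, 90.0),    # normal discount
    (100.0, 0.0, 100.0),    # boundary: no discount
    (100.0, 100.0, 0.0),    # boundary: full discount
    (0.0, 50.0, 0.0),       # edge: free item
]
for price, pct, expected in cases:
    assert math.isclose(calculate_discount(price, pct), expected)

# Error cases: out-of-range percentage and negative price must raise.
for bad_price, bad_pct in [(100.0, -5.0), (100.0, 101.0), (-1.0, 10.0)]:
    try:
        calculate_discount(bad_price, bad_pct)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Using math.isclose rather than == for the float comparisons is deliberate: exact equality on computed floats is a common source of flaky AI-generated tests.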

Pattern 2: Generate property-based tests

"Write Hypothesis property-based tests for this function. Identify at least 3 properties that should hold for all valid inputs:

def serialize(data: dict) -> str: ...
def deserialize(text: str) -> dict: ..."

Pattern 3: Generate test fixtures

"Create pytest fixtures in a conftest.py for testing an e-commerce application. I need fixtures for: a sample product, a shopping cart with items, a mock payment gateway, and a test database connection."

Using Tests to Verify AI Code Quality

Here is a checklist for using tests to verify AI-generated code:

  1. Run existing tests. Does the new code break anything?
  2. Check coverage. Which lines of AI code are untested? Pay special attention to error handlers, edge cases, and complex conditionals.
  3. Run property-based tests. Do properties hold for random inputs?
  4. Test error messages. Does the code produce helpful error messages for invalid inputs?
  5. Test concurrency. If the code is meant to be thread-safe or async, test it under concurrent load.
  6. Test with realistic data. Use data that resembles production data in size and complexity.

Writing Tests That AI Cannot Fool

Some tests are more robust at catching AI errors than others:

Weak tests (easy for buggy code to pass):

def test_weak():
    result = process_data([1, 2, 3])
    assert result is not None
    assert isinstance(result, list)

Strong tests (precisely verify behavior):

def test_strong():
    result = process_data([1, 2, 3])
    assert result == [2, 4, 6]  # Exact expected output

def test_strong_edge_cases():
    assert process_data([]) == []
    assert process_data([0]) == [0]
    assert process_data([-1]) == [-2]

Strongest tests (verify properties across all inputs):

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_strongest(data):
    result = process_data(data)
    assert len(result) == len(data)  # property: length is preserved
    for original, processed in zip(data, result):
        assert processed == original * 2  # property: each element is doubled

Key Insight: Think of your test suite as a net. Unit tests are a fine mesh that catches small bugs. Integration tests are a wider mesh that catches interaction problems. Property-based tests are a net that adapts its shape to catch whatever slips through. E2E tests are a final safety check at the boundaries. Together, they form a comprehensive safety net that gives you genuine confidence in AI-generated code.

Maintaining Tests Over Time

Tests are code, and like all code, they require maintenance:

  1. Delete tests that no longer apply. When requirements change, remove tests for the old requirements.
  2. Update tests when interfaces change. If a function's signature changes, update all tests that call it.
  3. Refactor tests for readability. Tests serve as documentation. Keep them clear.
  4. Avoid test interdependence. Each test should run independently. Do not rely on test execution order.
  5. Keep tests fast. Slow tests do not get run. If a test takes more than a few seconds, consider whether it can be restructured.

Summary

Testing AI-generated code is not optional --- it is the foundation of responsible vibe coding. This chapter covered the full spectrum of testing techniques, from fast unit tests with pytest to sophisticated property-based testing with Hypothesis. The key themes bear repeating:

  • AI-generated code needs more testing, not less, because you lack the mental model that comes from writing code yourself.
  • pytest provides a powerful, flexible foundation for all levels of testing, from simple assertions to complex fixture-based integration tests.
  • Property-based testing with Hypothesis is uniquely suited to AI code verification because it tests properties across thousands of randomly generated inputs, finding edge cases you would never think of.
  • TDD with AI is a powerful workflow where you write the specification (tests) and the AI writes the implementation. This keeps you in control while leveraging AI's speed.
  • Mocking isolates your tests from external dependencies, making them fast, reliable, and reproducible.
  • Coverage metrics are useful guides, not goals in themselves. Use them to find untested code that matters, not to chase arbitrary percentage targets.
  • Continuous testing through CI/CD ensures that tests run consistently and catch regressions early.

The most important takeaway from this chapter is a mindset shift: your tests are the specification, and the AI-generated code is the implementation that must conform to that specification. When you internalize this principle, testing transforms from a chore into your primary tool for maintaining quality and control in the age of AI-assisted development.


What's Next

In Chapter 22, we will explore Debugging and Troubleshooting --- what to do when tests fail, how to diagnose problems in AI-generated code, and strategies for working with AI to fix bugs efficiently. The testing skills you have built in this chapter will be your foundation for effective debugging.