Case Study 2: Catching AI Bugs with Property-Based Testing

Overview

This case study demonstrates how property-based testing with Hypothesis reveals subtle bugs in AI-generated code --- bugs that look correct on casual inspection and pass traditional example-based tests, yet fail under conditions that humans rarely think to test. We examine three AI-generated functions, show how standard example-based tests miss their flaws, and then use Hypothesis to probe each one systematically --- confirming one function as correct and uncovering real bugs in the other two.

The Scenario

A developer asks an AI to generate a utility library with three functions:

  1. flatten_list --- Recursively flatten a nested list into a single flat list.
  2. merge_sorted --- Merge two sorted lists into a single sorted list.
  3. json_safe_encode --- Convert a Python dictionary to a JSON string, handling special types like datetime and Decimal.

The developer writes basic example-based tests, and all three functions pass. Months later, a production bug leads the developer to add property-based tests. The results are revealing.

Function 1: flatten_list

The AI-Generated Code

def flatten_list(nested: list) -> list:
    """Recursively flatten a nested list structure.

    Args:
        nested: A potentially nested list of values.

    Returns:
        A flat list containing all leaf values.
    """
    result = []
    for item in nested:
        if isinstance(item, list):
            result.extend(flatten_list(item))
        else:
            result.append(item)
    return result

Example-Based Tests That Pass

def test_flatten_simple():
    assert flatten_list([1, 2, 3]) == [1, 2, 3]

def test_flatten_nested():
    assert flatten_list([1, [2, 3], [4, [5, 6]]]) == [1, 2, 3, 4, 5, 6]

def test_flatten_empty():
    assert flatten_list([]) == []

def test_flatten_deeply_nested():
    assert flatten_list([[[1]], [[2]], [[3]]]) == [1, 2, 3]

All four tests pass. The code looks correct. But is it?

Property-Based Tests Reveal the Bug

from hypothesis import given, settings
from hypothesis import strategies as st

# Strategy for generating nested lists
nested_lists = st.recursive(
    st.integers() | st.text(max_size=10) | st.none(),
    lambda children: st.lists(children, max_size=5),
    max_leaves=20
)

@given(nested_lists)
def test_flatten_preserves_all_elements(nested):
    """Every leaf element in the input should appear in the output."""
    if not isinstance(nested, list):
        nested = [nested]
    flat = flatten_list(nested)

    def count_leaves(structure):
        if not isinstance(structure, list):
            return 1
        return sum(count_leaves(item) for item in structure)

    assert len(flat) == count_leaves(nested)

@given(st.lists(st.integers() | st.tuples(st.integers(), st.integers())))
def test_flatten_handles_tuples(data):
    """Tuples should be preserved as leaf values, not flattened."""
    flat = flatten_list(data)
    for item in flat:
        assert not isinstance(item, list)

These tests pass --- tuples are indeed left as leaf values. But consider another property:

@given(st.lists(st.integers() | st.text() | st.binary()))
def test_flatten_handles_all_iterables_correctly(data):
    """Non-list iterables like strings and bytes should be leaf values."""
    flat = flatten_list(data)
    assert len(flat) == len(data)

This test passes because the implementation checks isinstance(item, list) specifically, so strings and bytes are treated as leaf values rather than recursed into. But what about a stronger property?

@given(st.lists(
    st.integers() | st.lists(st.integers(), max_size=3),
    max_size=10
))
def test_flatten_idempotent(data):
    """Flattening an already-flat list should produce the same list."""
    once = flatten_list(data)
    twice = flatten_list(once)
    assert once == twice

This idempotency test passes. The function seems solid. Now let us move to a function with a real bug.
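The leaf-type guarantees that the properties above rely on can also be spot-checked by hand. A minimal sketch, using the flatten_list implementation copied from above:

```python
# flatten_list exactly as generated above: only `list` instances are
# recursed into, so strings, bytes, and tuples all survive as leaf values.
def flatten_list(nested: list) -> list:
    result = []
    for item in nested:
        if isinstance(item, list):
            result.extend(flatten_list(item))
        else:
            result.append(item)
    return result

assert flatten_list(["ab", ["cd", b"ef"]]) == ["ab", "cd", b"ef"]
assert flatten_list([(1, 2), [3, (4, 5)]]) == [(1, 2), 3, (4, 5)]

flat = flatten_list([1, [2, [3]]])
assert flatten_list(flat) == flat  # idempotent once already flat
```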

Function 2: merge_sorted

The AI-Generated Code

def merge_sorted(list_a: list[int], list_b: list[int]) -> list[int]:
    """Merge two sorted lists into a single sorted list.

    Args:
        list_a: A sorted list of integers.
        list_b: A sorted list of integers.

    Returns:
        A single sorted list containing all elements from both inputs.
    """
    result = []
    i, j = 0, 0

    while i < len(list_a) and j < len(list_b):
        if list_a[i] <= list_b[j]:
            result.append(list_a[i])
            i += 1
        else:
            result.append(list_b[j])
            j += 1

    # Append remaining elements
    result.extend(list_a[i:])
    result.extend(list_b[j:])

    return result

Example-Based Tests That Pass

def test_merge_basic():
    assert merge_sorted([1, 3, 5], [2, 4, 6]) == [1, 2, 3, 4, 5, 6]

def test_merge_empty_first():
    assert merge_sorted([], [1, 2, 3]) == [1, 2, 3]

def test_merge_empty_second():
    assert merge_sorted([1, 2, 3], []) == [1, 2, 3]

def test_merge_both_empty():
    assert merge_sorted([], []) == []

def test_merge_duplicates():
    assert merge_sorted([1, 2, 3], [2, 3, 4]) == [1, 2, 2, 3, 3, 4]

All five tests pass. The function handles the normal cases, empty lists, and duplicates. Shipping it.

Property-Based Tests Reveal the Bug

from hypothesis import given
from hypothesis import strategies as st

sorted_lists = st.lists(st.integers(min_value=-1000, max_value=1000)).map(sorted)

@given(sorted_lists, sorted_lists)
def test_merge_result_is_sorted(list_a, list_b):
    """The merged result should always be sorted."""
    result = merge_sorted(list_a, list_b)
    for i in range(len(result) - 1):
        assert result[i] <= result[i + 1]

@given(sorted_lists, sorted_lists)
def test_merge_preserves_all_elements(list_a, list_b):
    """The merged result should contain exactly all elements from both lists."""
    result = merge_sorted(list_a, list_b)
    assert sorted(result) == sorted(list_a + list_b)

@given(sorted_lists, sorted_lists)
def test_merge_length(list_a, list_b):
    """The merged result length should equal the sum of input lengths."""
    result = merge_sorted(list_a, list_b)
    assert len(result) == len(list_a) + len(list_b)

These tests all pass; merge_sorted is correct for sorted inputs. But the docstring states a precondition --- both inputs must be sorted --- and nothing in the code enforces it. What happens with unsorted inputs?

@given(st.lists(st.integers()), st.lists(st.integers()))
def test_merge_with_unsorted_inputs(list_a, list_b):
    """What happens when inputs are not sorted?"""
    result = merge_sorted(list_a, list_b)
    # The function silently produces incorrect results!
    # It does not validate its precondition.
    expected = sorted(list_a + list_b)
    assert result == expected  # This FAILS for unsorted inputs

Hypothesis finds the bug: when given unsorted inputs like [2, 1] and [0], the function returns [0, 2, 1] --- not sorted at all. The function lacks input validation. This is a classic AI code failure: the happy path works perfectly, but the precondition is not enforced.
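The counterexample is easy to replay by hand. A minimal sketch, with the merge logic copied from the implementation above:

```python
# merge_sorted as generated above, without input validation.
def merge_sorted(list_a, list_b):
    result = []
    i, j = 0, 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] <= list_b[j]:
            result.append(list_a[i])
            i += 1
        else:
            result.append(list_b[j])
            j += 1
    result.extend(list_a[i:])  # leftover elements are appended verbatim,
    result.extend(list_b[j:])  # which is only correct for sorted inputs
    return result

# The merge loop emits 0, then the unsorted tail [2, 1] is appended as-is.
print(merge_sorted([2, 1], [0]))  # [0, 2, 1] -- not sorted
```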

The Fix

We prompt the AI to add input validation:

def merge_sorted(list_a: list[int], list_b: list[int]) -> list[int]:
    """Merge two sorted lists into a single sorted list.

    Args:
        list_a: A sorted list of integers.
        list_b: A sorted list of integers.

    Returns:
        A single sorted list containing all elements from both inputs.

    Raises:
        ValueError: If either input list is not sorted.
    """
    # Validate preconditions
    for i in range(len(list_a) - 1):
        if list_a[i] > list_a[i + 1]:
            raise ValueError(f"list_a is not sorted: {list_a[i]} > {list_a[i+1]}")
    for i in range(len(list_b) - 1):
        if list_b[i] > list_b[i + 1]:
            raise ValueError(f"list_b is not sorted: {list_b[i]} > {list_b[i+1]}")

    # ... rest of implementation

Now the updated property test verifies proper behavior:

import pytest

@given(st.lists(st.integers()), st.lists(st.integers()))
def test_merge_rejects_unsorted_inputs(list_a, list_b):
    """Unsorted inputs should raise ValueError."""
    a_sorted = all(list_a[i] <= list_a[i+1] for i in range(len(list_a)-1))
    b_sorted = all(list_b[i] <= list_b[i+1] for i in range(len(list_b)-1))

    if a_sorted and b_sorted:
        result = merge_sorted(list_a, list_b)
        assert result == sorted(list_a + list_b)
    else:
        with pytest.raises(ValueError):
            merge_sorted(list_a, list_b)

Function 3: json_safe_encode

The AI-Generated Code

import json
from datetime import datetime, date
from decimal import Decimal


class SafeEncoder(json.JSONEncoder):
    """JSON encoder that handles datetime and Decimal types."""

    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, date):
            return obj.isoformat()
        if isinstance(obj, Decimal):
            return float(obj)
        return super().default(obj)


def json_safe_encode(data: dict) -> str:
    """Encode a dictionary to JSON, handling special Python types.

    Args:
        data: A dictionary potentially containing datetime, date,
              and Decimal values.

    Returns:
        A JSON string representation.
    """
    return json.dumps(data, cls=SafeEncoder)

Example-Based Tests That Pass

from datetime import datetime, date
from decimal import Decimal

def test_encode_basic():
    assert json_safe_encode({"key": "value"}) == '{"key": "value"}'

def test_encode_datetime():
    dt = datetime(2024, 1, 15, 10, 30, 0)
    result = json_safe_encode({"created": dt})
    assert "2024-01-15T10:30:00" in result

def test_encode_decimal():
    result = json_safe_encode({"price": Decimal("19.99")})
    parsed = json.loads(result)
    assert parsed["price"] == 19.99

def test_encode_date():
    result = json_safe_encode({"birthday": date(1990, 5, 20)})
    assert "1990-05-20" in result

All pass. But Hypothesis finds a precision problem.

Property-Based Tests Reveal the Bug

from hypothesis import given
from hypothesis import strategies as st
from decimal import Decimal
import json

@given(st.decimals(allow_nan=False, allow_infinity=False))
def test_decimal_roundtrip_precision(value):
    """Decimal values should survive JSON encoding without precision loss."""
    data = {"amount": value}
    encoded = json_safe_encode(data)
    decoded = json.loads(encoded)
    assert Decimal(str(decoded["amount"])) == value

Hypothesis finds the bug. For a simple value like Decimal("0.1"), the round trip happens to succeed: the encoder writes float(Decimal("0.1")) as 0.1 in JSON, and Decimal(str(0.1)) gives Decimal('0.1'), which equals the original. (Trailing zeros are not the problem either --- Decimal("1.1") and Decimal("1.10") compare equal, since Decimal equality is numeric.) The genuine failure appears once a Decimal carries more significant digits than a 64-bit float can represent; the float conversion then silently discards them:

>>> float(Decimal("0.3333333333333333333333333333"))
0.3333333333333333  # Precision lost!

The Fix

Convert Decimals to strings instead of floats:

class SafeEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, date):
            return obj.isoformat()
        if isinstance(obj, Decimal):
            return str(obj)  # Preserve full precision
        return super().default(obj)

This changes the JSON output from {"price": 19.99} (a JSON number) to {"price": "19.99"} (a JSON string), which preserves full precision. The developer must decide which behavior is correct for their use case, but the key insight is that Hypothesis surfaced the precision issue that none of the example-based tests caught.
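A minimal round trip with the string-based encoder (datetime handling omitted for brevity) confirms that precision now survives:

```python
import json
from decimal import Decimal

class SafeEncoder(json.JSONEncoder):
    """Encoder with the fix applied: Decimal becomes a JSON string."""
    def default(self, obj):
        if isinstance(obj, Decimal):
            return str(obj)  # preserve every significant digit
        return super().default(obj)

original = Decimal("0.3333333333333333333333333333")
encoded = json.dumps({"amount": original}, cls=SafeEncoder)
decoded = Decimal(json.loads(encoded)["amount"])
assert decoded == original  # no precision lost
print(encoded)  # {"amount": "0.3333333333333333333333333333"}
```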

Quantifying the Results

Function            Example Tests   Bugs Found by Examples   Hypothesis Tests   Bugs Found by Hypothesis
flatten_list              4                    0                     4           0 (function was correct)
merge_sorted              5                    0                     4           1 (missing input validation)
json_safe_encode          4                    0                     1           1 (Decimal precision loss)
Total                    13                    0                     9           2

Thirteen hand-written example tests found zero bugs. Nine property-based tests found two bugs. This is not because the example tests were poorly written --- they covered all the "obvious" cases. The bugs existed in territories that humans rarely think to explore: unsorted inputs to a function that assumes sorted inputs, and precision edge cases in type conversions.

Lessons Learned

1. Property-Based Testing Finds Bugs at Boundaries

Both bugs existed at the boundaries of the functions' expected inputs. merge_sorted failed at the boundary between valid (sorted) and invalid (unsorted) inputs. json_safe_encode failed at the boundary between float-representable and float-lossy Decimal values. These are exactly the boundaries that humans --- and AI --- tend to overlook.

2. AI-Generated Code Has Systematic Blind Spots

The AI generated clean, well-documented code that worked for all standard cases. Its blind spots were predictable: missing input validation and naive type conversions. Property-based testing is systematically effective against these blind spots because it explores the input space much more broadly than hand-picked examples.

3. Start with Properties, Not Examples

When verifying AI-generated code, start by asking: "What properties should always hold?" For merge_sorted, the property is that the output is always sorted and contains exactly the elements from both inputs. For json_safe_encode, the property is that encoding and decoding should preserve data. These properties are easier to define than specific examples and far more powerful at finding bugs.

4. Hypothesis's Shrinking Is Invaluable

When Hypothesis finds a failing input, it automatically "shrinks" it to the simplest possible example that still triggers the failure. For merge_sorted, instead of reporting a failure with a 50-element unsorted list, it shrank the input to [1, 0] and [] --- the simplest possible unsorted list with the simplest second argument. This makes debugging dramatically easier.
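Once shrinking has produced a minimal counterexample, Hypothesis's @example decorator can pin it so that exact input is replayed on every future run. A sketch, assuming the fixed, validating merge_sorted from above:

```python
from hypothesis import example, given
from hypothesis import strategies as st

def merge_sorted(list_a, list_b):
    # Fixed version: validate the sorted-input precondition first.
    for name, lst in (("list_a", list_a), ("list_b", list_b)):
        if any(lst[i] > lst[i + 1] for i in range(len(lst) - 1)):
            raise ValueError(f"{name} is not sorted")
    result, i, j = [], 0, 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] <= list_b[j]:
            result.append(list_a[i])
            i += 1
        else:
            result.append(list_b[j])
            j += 1
    result.extend(list_a[i:])
    result.extend(list_b[j:])
    return result

@given(st.lists(st.integers()), st.lists(st.integers()))
@example([1, 0], [])  # the shrunk counterexample, replayed on every run
def test_merge_handles_any_input(list_a, list_b):
    def is_sorted(lst):
        return all(lst[i] <= lst[i + 1] for i in range(len(lst) - 1))
    if is_sorted(list_a) and is_sorted(list_b):
        assert merge_sorted(list_a, list_b) == sorted(list_a + list_b)
    else:
        try:
            merge_sorted(list_a, list_b)
        except ValueError:
            pass
        else:
            raise AssertionError("expected ValueError for unsorted input")
```

Pinned examples run before the randomly generated ones, so a once-found regression can never silently reappear.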

5. Property-Based Testing Complements Example-Based Testing

Property-based tests do not replace example-based tests. Example-based tests are easier to read, serve as documentation, and verify specific important behaviors. Property-based tests provide breadth and find edge cases. The combination is far stronger than either approach alone.

Conclusion

This case study demonstrates that property-based testing with Hypothesis is one of the most effective tools for verifying AI-generated code. The bugs it finds are exactly the kind that AI tends to introduce: missing validation, naive type handling, and incomplete edge case coverage. By investing a small amount of time in defining properties --- invariants that should always hold --- you gain a testing approach that explores thousands of cases and surfaces problems that hand-written tests routinely miss. For any AI-generated code that handles diverse inputs, property-based testing should be a mandatory part of your verification strategy.