Case Study: Raj's Test Suite — Copilot as Testing Partner

Background

Raj was three days into building a new REST API for his company's payment processing service. The API handled incoming webhook events from three payment providers, normalized them into a consistent internal format, and routed them to downstream services. The business logic was straightforward but the edge cases were not: different providers formatted timestamps differently, some sent optional fields inconsistently, amounts came in different currencies and required conversion, and error states varied enough between providers that normalization had real complexity.

By day three, Raj had working code. He had also written almost no tests.

This is a familiar position for developers working under deadline pressure. The code worked — he had manually tested the main paths. But he had a technical review scheduled in two days and knew the reviewer would ask about test coverage. More importantly, he knew from experience that payment processing code without comprehensive tests was a liability waiting to materialize.

He had an afternoon to build the test suite. He decided to use it as a systematic experiment in Copilot-as-testing-partner.

The Starting Point

The core of Raj's normalization logic was a function called normalize_payment_event:

def normalize_payment_event(
    raw_event: dict,
    provider: str,
    received_at: datetime
) -> PaymentEvent:
    """
    Normalize a raw webhook event from a payment provider into a standard PaymentEvent.

    Args:
        raw_event: The raw webhook payload as a dictionary
        provider: One of 'stripe', 'paypal', 'braintree'
        received_at: UTC timestamp when the webhook was received

    Returns:
        A normalized PaymentEvent with standardized fields

    Raises:
        ValueError: If provider is not recognized or required fields are missing
        AmountConversionError: If currency conversion fails
    """

He also had a PaymentEvent dataclass, three provider-specific extraction functions, a currency conversion helper, and a routing function that dispatched normalized events to downstream services.
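The PaymentEvent dataclass itself is not shown in the case study. A minimal sketch of what such a dataclass might look like — the field names here are assumptions for illustration, not Raj's actual definition:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal

@dataclass(frozen=True)
class PaymentEvent:
    provider: str               # 'stripe', 'paypal', or 'braintree'
    provider_event_id: str      # the provider's own event identifier
    event_type: str             # normalized type, e.g. 'payment.succeeded'
    amount: Decimal             # amount after normalization/conversion
    currency: str               # ISO 4217 code after conversion
    occurred_at: datetime       # provider-reported event time (UTC)
    received_at: datetime       # when the webhook arrived (UTC)
```

A frozen dataclass is a reasonable choice here because a normalized event should not be mutated once it enters the routing pipeline.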

Phase 1: Generating the Baseline Test Suite

Raj selected the normalize_payment_event function and used Copilot Chat's /tests command. The prompt was brief — just the slash command with the function selected.

Copilot generated twelve tests covering:

- A basic Stripe success event
- A basic PayPal success event
- A basic Braintree success event
- A test for an invalid provider raising ValueError
- Tests for missing fields raising ValueError

Raj's immediate evaluation: this was a reasonable starting point. The happy path tests were structurally correct — they would catch if the function broke entirely. But they were thin.

Problems he noted immediately:

- All three provider tests used identical mock data structures, which meant they were testing the normalization logic only once, not three times
- None of the tests used realistic provider payloads — they used minimal, idealized dicts that no real provider actually sends
- The error tests checked that ValueError was raised but not what the message said
- No tests for the AmountConversionError path
- No tests for the currency conversion logic specifically
- No boundary condition tests

This was useful scaffolding. It was not adequate coverage.
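To make the critique concrete, a test in the style of that initial generation might look like the following. The payload and the stub implementation are illustrative, not Copilot's literal output:

```python
from datetime import datetime, timezone

# Illustrative stand-in for Raj's real function, so this sketch is
# self-contained; his actual implementation is not shown in the case study.
def normalize_payment_event(raw_event, provider, received_at):
    return {"provider": provider, "event_id": raw_event["id"]}

# The kind of thin test the initial /tests run produced: a minimal,
# idealized payload no real provider sends, and an assertion that only
# confirms the call did not blow up.
def test_stripe_success_event():
    raw = {"id": "evt_1", "amount": 19.99}  # not a realistic Stripe shape
    result = normalize_payment_event(raw, "stripe", datetime.now(timezone.utc))
    assert result is not None               # weak: presence, not correctness

test_stripe_success_event()
```

A test like this catches a total failure of the function but nothing subtler — exactly the "thin" quality Raj identified.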

Phase 2: Using Chat to Surface Edge Cases

Rather than writing all the missing tests himself, Raj turned to Copilot Chat with a different kind of request:

I have a function that normalizes webhook events from payment providers.
Here is the function signature and docstring: [pasted the full docstring]

Here is the PaymentEvent dataclass: [pasted the definition]

Think carefully about edge cases and unusual inputs that this function might
receive in production. List all the edge cases you can think of, organized
by category (input validation, data format issues, edge values, error states).
Don't generate test code yet — just list the cases.

Copilot's response listed thirty-one edge cases across six categories. Some Raj had already thought of:

- Missing required fields (various combinations)
- Invalid provider name
- Currency conversion failure

Others were genuinely useful additions he had not fully thought through:

- Timestamp edge cases: Unix timestamps vs. ISO 8601 strings vs. "relative" timestamps some providers include
- Amount precision: floating-point representations from providers, amounts with more than 2 decimal places, amounts of exactly zero
- Duplicate event handling: the same event ID appearing twice
- Events from the future (received_at earlier than event timestamp)
- Very large transaction amounts near float precision limits
- Provider payloads with extra unexpected fields (should these be ignored, preserved, or raise errors?)
- Refund events where the amount is negative
- Partial events (Braintree specifically sends incremental authorization updates)
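The timestamp-format edge case is representative of why these lists are worth generating. A self-contained sketch of a parser handling Unix integers and ISO 8601 strings — the helper name is hypothetical, not from Raj's code:

```python
from datetime import datetime, timezone

def parse_provider_timestamp(value):
    # Providers disagree: some send Unix integer seconds, others
    # ISO 8601 strings, sometimes with a trailing 'Z'.
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    if isinstance(value, str):
        # fromisoformat only accepts 'Z' from Python 3.11 on;
        # normalize it for older versions.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    raise ValueError(f"unsupported timestamp: {value!r}")

cases = [
    (1700000000, datetime(2023, 11, 14, 22, 13, 20, tzinfo=timezone.utc)),
    ("2023-11-14T22:13:20Z", datetime(2023, 11, 14, 22, 13, 20, tzinfo=timezone.utc)),
]
for raw, expected in cases:
    assert parse_provider_timestamp(raw) == expected
```

Each entry in an edge case list like Copilot's maps naturally onto a parametrized case in this style.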

Several of these immediately rang true. Raj had a Stripe production issue three months earlier caused by unexpected float precision in amount fields — Stripe sends amounts as integers in the smallest currency unit (cents for USD) but the internal system expected decimals. He had fixed the bug at the time but the test coverage for that fix had not been carried forward into the new service.
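The cents-vs-decimals mismatch behind that production issue is easy to reproduce. A hedged sketch of the safe conversion — the function name is illustrative:

```python
from decimal import Decimal

# Stripe reports amounts as integers in the smallest currency unit
# (1999 means $19.99 for USD). Converting through float invites
# precision errors; going straight from int to Decimal does not.
def stripe_amount_to_decimal(amount_minor: int, exponent: int = 2) -> Decimal:
    return Decimal(amount_minor) / (Decimal(10) ** exponent)

assert stripe_amount_to_decimal(1999) == Decimal("19.99")
# Zero-decimal currencies such as JPY use exponent 0.
assert stripe_amount_to_decimal(500, exponent=0) == Decimal("500")
```

A regression test pinning this conversion is exactly the coverage that had not been carried forward into the new service.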

Phase 3: Structuring the Test Suite

With the edge case list in hand, Raj made a structural decision: he would write one test file per provider plus a shared test file for cross-cutting concerns. This was his design choice, not Copilot's — the organizational structure of a test suite is a judgment call about maintainability that AI should not make unilaterally.

He then went back to Copilot, provider by provider, asking for test generation with explicit context:

Generate pytest tests for the Stripe-specific normalization path of
normalize_payment_event. Use realistic Stripe webhook payloads
(Stripe sends amounts as integers in cents, timestamps as Unix integers,
uses their standard event object structure with 'id', 'type', 'data.object').

Cover these specific cases: [listed the relevant edge cases from the earlier output]

Use pytest fixtures for test data. Mock the currency conversion function
with unittest.mock. Use descriptive test names that explain what each test verifies.

This produced significantly better tests than the initial /tests generation. Because Raj had provided realistic payload structure, Copilot generated tests that actually used Stripe's data format. Because he had specified the edge cases, the tests covered the right ground. Because he had specified the fixture pattern, the test organization was consistent.

He repeated this for PayPal and Braintree, adjusting the payload description for each provider's actual format.
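A test shaped by that prompt might look like the following. The payload mirrors Stripe's real event envelope (top-level id and type, object nested under data.object, amount in integer cents, Unix timestamp), but the fixture and the normalizer stand-in are illustrative, not Raj's actual code:

```python
from datetime import datetime, timezone
from decimal import Decimal

def stripe_payment_succeeded_payload():
    # Realistic envelope shape, per the prompt's description of Stripe events.
    return {
        "id": "evt_test_123",
        "type": "payment_intent.succeeded",
        "created": 1700000000,
        "data": {"object": {"amount": 1999, "currency": "usd"}},
    }

# Illustrative normalizer stand-in so the sketch is self-contained.
def normalize_payment_event(raw_event, provider, received_at):
    obj = raw_event["data"]["object"]
    return {
        "provider_event_id": raw_event["id"],
        "amount": Decimal(obj["amount"]) / 100,
        "currency": obj["currency"].upper(),
        "occurred_at": datetime.fromtimestamp(raw_event["created"], tz=timezone.utc),
    }

def test_stripe_amount_is_converted_from_integer_cents():
    result = normalize_payment_event(
        stripe_payment_succeeded_payload(), "stripe", datetime.now(timezone.utc)
    )
    assert result["amount"] == Decimal("19.99")
    assert result["currency"] == "USD"
    assert result["provider_event_id"] == "evt_test_123"

test_stripe_amount_is_converted_from_integer_cents()
```

In a real pytest suite the payload function would be a fixture; it is called directly here only to keep the sketch runnable on its own.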

Phase 4: Review and Quality Control

After Copilot generated the bulk of the test code, Raj conducted a systematic review. His review checklist:

For each test: Is the assertion meaningful? Three of the generated tests had assertions like assert result is not None or assert isinstance(result, PaymentEvent). These were structurally valid but behaviorally weak — they confirmed the function returned something but not whether it returned the right thing. Raj replaced these with assertions on specific fields: assert result.amount == Decimal('19.99'), assert result.currency == 'USD', assert result.provider_event_id == 'evt_test_123'.

For each error test: Does it test the right error and message? The generated error tests often only checked pytest.raises(ValueError) without checking the message. For user-facing errors, the message content matters — it affects debugging. Raj added match= patterns to key error tests.
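pytest's match= argument applies re.search to the string form of the raised exception. A stdlib sketch of the equivalent check — the error message wording here is hypothetical:

```python
import re

# Illustrative validator: the real function raises ValueError with a
# message naming the bad provider.
def normalize_payment_event(raw_event, provider, received_at):
    if provider not in {"stripe", "paypal", "braintree"}:
        raise ValueError(f"unknown provider: {provider}")
    return raw_event

# With pytest, the message-aware check reads:
#   with pytest.raises(ValueError, match=r"unknown provider"):
#       normalize_payment_event({}, "square", None)
# which is equivalent to this re.search on the message:
try:
    normalize_payment_event({}, "square", None)
    raised = False
except ValueError as exc:
    raised = re.search(r"unknown provider", str(exc)) is not None

assert raised
```

The bare pytest.raises(ValueError) form would pass even if the function raised ValueError for an entirely unrelated reason, which is why the message pattern matters.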

For the currency conversion tests: Are we testing the conversion or the mock? Raj discovered that two tests were testing that the mock returned the mocked value rather than testing that the function correctly called the converter. He rewrote these to test the integration: that when the converter raised AmountConversionError, the normalization function propagated it correctly.
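The rewritten propagation test follows a common pattern: give the mock a side_effect and assert that the error reaches the caller. A self-contained sketch — the function names and dependency-injection style are illustrative, not Raj's module layout (his code would more likely use mock.patch on a module-level helper):

```python
from unittest.mock import Mock

class AmountConversionError(Exception):
    pass

# Illustrative: the normalizer takes its converter as a parameter so the
# test can substitute a failing one.
def normalize_amount(amount, from_currency, converter):
    return converter(amount, from_currency, "USD")

# Testing the integration, not the mock: make the converter fail and
# assert the normalizer lets the error propagate unchanged.
failing_converter = Mock(side_effect=AmountConversionError("rate unavailable"))

try:
    normalize_amount(100, "EUR", failing_converter)
    propagated = False
except AmountConversionError:
    propagated = True

assert propagated                                    # error reached the caller
failing_converter.assert_called_once_with(100, "EUR", "USD")
```

The two assertions together verify both halves of the contract: the converter was actually invoked with the right arguments, and its failure was not swallowed.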

Are there any always-passing tests? He ran each test in isolation with a deliberately broken implementation of normalize_payment_event to verify it actually failed. Two tests passed even against a broken implementation. He rewrote those.
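Running tests against a deliberately broken implementation is an informal form of mutation testing. A minimal sketch of why a weak assertion survives the broken version while a strong one does not:

```python
from decimal import Decimal

def normalize_amount_correct(cents):
    return Decimal(cents) / 100

def normalize_amount_broken(cents):
    return Decimal(0)                 # deliberately wrong implementation

def weak_test(fn):
    # Always-passing style: only checks that something came back.
    return fn(1999) is not None

def strong_test(fn):
    # Checks the actual value, so it fails against the broken version.
    return fn(1999) == Decimal("19.99")

# The weak test cannot distinguish correct from broken; the strong one can.
assert weak_test(normalize_amount_correct) and weak_test(normalize_amount_broken)
assert strong_test(normalize_amount_correct)
assert not strong_test(normalize_amount_broken)
```

Any test that passes against an obviously wrong implementation is providing no protection and should be rewritten, which is exactly what Raj did with the two he found.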

Phase 5: Gap Analysis

After the structured review, Raj used conversational AI (Claude, not Copilot) for a final gap analysis:

I've built a test suite for a payment event normalization function.
Here is the complete test file: [pasted all tests]
Here is the function under test: [pasted the implementation]

What test cases are missing? Look specifically for:
1. Paths through the code that have no test coverage
2. Error states that might occur but aren't tested
3. Interaction effects between edge cases (e.g., missing field AND wrong type simultaneously)
4. Performance edge cases for list/batch inputs

Claude identified four gaps Raj had missed:

- No tests for the routing function downstream from normalization
- No test for what happens when received_at is in the past by more than 24 hours (the system had a staleness check Raj had implemented but forgotten to test)
- No tests for concurrent calls with the same event ID
- No test for the batch processing path

The first two were immediate additions. The concurrent access test was deferred — it required integration test infrastructure he could not spin up in an afternoon. The batch path test was added.
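The forgotten staleness check is the kind of small guard that is easy to implement and never test. A hedged sketch of the check and a deterministic test for it — the threshold and function name are assumptions based on the 24-hour figure in the gap list:

```python
from datetime import datetime, timedelta, timezone

STALENESS_LIMIT = timedelta(hours=24)   # assumed threshold from the gap analysis

def is_stale(received_at, now=None):
    # Events received more than 24 hours before processing are flagged
    # as stale rather than routed downstream.
    if now is None:
        now = datetime.now(timezone.utc)
    return now - received_at > STALENESS_LIMIT

# Deterministic test: inject 'now' instead of relying on the wall clock.
ref = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
assert is_stale(ref - timedelta(hours=25), now=ref)
assert not is_stale(ref - timedelta(hours=1), now=ref)
```

Injecting the reference time keeps the test stable; comparing against datetime.now() directly would make the boundary cases flaky.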

The Final Suite

The final test suite: 67 tests across 4 files. Coverage on the normalization module: 94%.

Comparison to what Copilot's initial /tests command produced: 12 tests, coverage approximately 38%.

Time to build the final suite: approximately 3.5 hours for a module that would have taken 6-8 hours to test without AI assistance. Raj's estimate of time saved: 3-4 hours.

More importantly: Raj found two bugs in his own implementation while reviewing AI-generated test failures. Both were in edge cases he had not manually tested: a timestamp normalization error for PayPal's specific format and an incorrect handling of zero-amount events (which occur in certain authorization flows).

What Raj Learned About Copilot as a Testing Tool

The highest-value use is edge case generation, not initial test generation. The first /tests output is scaffolding. The real value is using Chat to enumerate edge cases you might have missed, then generating tests for those specific cases.

Provider-specific context dramatically improves test quality. Generic tests use generic data. Realistic tests require context about the actual system. The more specific the prompt about data formats, the more realistic and valuable the generated tests.

AI-generated tests require a review of assertion strength. The most consistent quality issue is weak assertions — tests that verify presence rather than correctness. Budget time for this review pass.

Two tools are better than one. Raj found Copilot stronger for test structure and boilerplate; Claude stronger for gap analysis reasoning. Using both produced better coverage than either alone.

The human still designs the test architecture. How tests are organized, what constitutes a logical test group, how to structure fixtures for maintainability — these are design decisions that reflect experience and system knowledge AI does not have. Raj made these calls. Copilot filled them in.

The test suite Raj built in that afternoon caught two real bugs, survived the technical review without complaints about coverage, and has since caught three regressions during ongoing development. That is the outcome measure that matters.