Case Study 2: Payment Integration Done Right

Overview

Project: SubscribePro -- A SaaS subscription management platform handling monthly recurring payments.

Team: Three developers at a fintech startup transitioning from a manual billing process to automated payment processing.

Timeline: Six-week development cycle, followed by a phased rollout to 500 existing customers.

Core Integration: Stripe-style payment processor for credit card billing, with supporting integrations for email receipts, invoice storage, and internal Slack notifications.


The Challenge

SubscribePro had been processing payments manually -- a developer would run a script each month that generated invoices, and the finance team would process them one by one. With 500 customers and growing, this was unsustainable. The team needed to build an automated system that could:

  1. Process monthly recurring payments automatically on each customer's billing date
  2. Handle failed payments with intelligent retry logic
  3. Send email receipts for successful payments and dunning emails for failures
  4. Generate and store PDF invoices in cloud storage
  5. Notify the team of high-value payments and critical failures via Slack
  6. Handle subscription upgrades, downgrades, and cancellations mid-cycle with prorated billing

The stakes were high. A bug in payment processing could miss payments, overcharge customers, or otherwise bill the wrong amount. Every edge case needed to be handled correctly.


Architecture Decisions

Decision 1: Never Handle Raw Card Data

The team's first and most critical decision was to ensure that card numbers never touched their servers. They used the payment processor's client-side tokenization:

┌──────────┐    Card Details     ┌──────────────┐
│  Client  │ ───────────────────>│   Payment    │
│  Browser │    Payment Token    │  Processor   │
│          │ <───────────────────│   (Stripe)   │
└──────────┘                     └──────────────┘
      │
      │  Token Only
      ▼
┌──────────┐   Create Customer   ┌──────────────┐
│   Our    │ ───────────────────>│   Payment    │
│  Server  │   Customer ID       │  Processor   │
│          │ <───────────────────│              │
└──────────┘                     └──────────────┘

The client-side JavaScript widget collected card details and sent them directly to the payment processor. The application server only ever received a token, which it used to create a customer profile on the payment processor. This reduced the team's PCI compliance scope from SAQ D (the most burdensome) to SAQ A (the simplest), which applies when the card fields are served entirely by the processor, for example in a hosted iframe.
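
The server side of this flow can be sketched as follows. This is a minimal illustration, not the team's actual code: `ProcessorClient`, `create_customer`, `register_customer`, `FakeProcessor`, and the field names are hypothetical stand-ins for whatever the payment processor's SDK provides. It shows that the request body carries only a token, and that anything resembling raw card data is rejected outright.

```python
from typing import Protocol

# Field names that would indicate raw card data reached the server.
# (Illustrative list; an assumption, not taken from any specification.)
FORBIDDEN_FIELDS = {"card_number", "cvc", "cvv", "exp_month", "exp_year"}


class ProcessorClient(Protocol):
    """Hypothetical interface to the payment processor's server SDK."""

    def create_customer(self, email: str, payment_token: str) -> str:
        """Create a customer from a token and return the customer ID."""
        ...


def register_customer(body: dict, processor: ProcessorClient) -> str:
    """Create a processor-side customer from a client-supplied token."""
    leaked = FORBIDDEN_FIELDS.intersection(body)
    if leaked:
        # Fail loudly: card data must travel browser -> processor only.
        raise ValueError(f"Raw card fields received: {sorted(leaked)}")
    return processor.create_customer(
        email=body["email"],
        payment_token=body["payment_token"],
    )


class FakeProcessor:
    """In-memory stand-in used to exercise the sketch."""

    def create_customer(self, email: str, payment_token: str) -> str:
        return f"cus_test_{payment_token}"


customer_id = register_customer(
    {"email": "a@example.com", "payment_token": "tok_123"},
    FakeProcessor(),
)
print(customer_id)
```

Only the returned customer ID would be stored in the application database; the token itself has no further use once the customer exists.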

Decision 2: Idempotency Everywhere

The team learned early in development that payment operations must be idempotent. During testing, a network timeout caused a payment to be created twice -- the test customer was charged $49.99 twice. This led to a strict idempotency policy:

import hashlib


def generate_idempotency_key(
    customer_id: str,
    invoice_id: str,
    amount: int,
) -> str:
    """Generate a deterministic idempotency key for a payment.

    Using a deterministic key (rather than a random UUID) ensures
    that retries of the same logical operation use the same key,
    even across different server instances or restarts.
    """
    raw = f"{customer_id}:{invoice_id}:{amount}"
    return hashlib.sha256(raw.encode()).hexdigest()

The key insight was using deterministic idempotency keys rather than random UUIDs. A random UUID generated at call time would be different on each retry unless it had been persisted first, which defeats the purpose. A deterministic key based on the customer, invoice, and amount ensured that even if the retry happened from a different server instance (after a restart, for example), the payment processor would recognize it as a duplicate.
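
To see why determinism matters, the sketch below compares the two approaches. It repeats the key function from above so that it runs on its own; the customer and invoice IDs are made-up values.

```python
import hashlib
import uuid


def generate_idempotency_key(customer_id: str, invoice_id: str, amount: int) -> str:
    raw = f"{customer_id}:{invoice_id}:{amount}"
    return hashlib.sha256(raw.encode()).hexdigest()


# Two attempts at the same logical charge, e.g. before and after a restart.
first_attempt = generate_idempotency_key("cus_001", "inv_2024_03", 4999)
retry_attempt = generate_idempotency_key("cus_001", "inv_2024_03", 4999)
assert first_attempt == retry_attempt  # the processor can deduplicate

# A different amount is a different logical operation, so the key changes.
other_amount = generate_idempotency_key("cus_001", "inv_2024_03", 9900)
assert other_amount != first_attempt

# A UUID generated at call time differs on every attempt unless it is
# persisted and reloaded before retrying.
assert uuid.uuid4() != uuid.uuid4()
```

One consequence worth keeping in mind: a deliberate re-attempt after a decline, such as a dunning retry, is a new operation. Under this scheme it would likely need something extra in the key, for example an attempt number, because a processor may otherwise treat it as a replay of the earlier request.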

Decision 3: Webhook-Driven State Machine

Rather than relying on synchronous API responses for payment status, the team built a state machine driven by webhooks:

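from enum import Enum


class InvalidStateTransition(Exception):
    """Raised when a payment state change is not allowed."""


class PaymentState(Enum):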
    CREATED = "created"
    PROCESSING = "processing"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    REQUIRES_ACTION = "requires_action"  # 3D Secure
    CANCELED = "canceled"
    REFUNDED = "refunded"


VALID_TRANSITIONS = {
    PaymentState.CREATED: {
        PaymentState.PROCESSING,
        PaymentState.CANCELED,
    },
    PaymentState.PROCESSING: {
        PaymentState.SUCCEEDED,
        PaymentState.FAILED,
        PaymentState.REQUIRES_ACTION,
    },
    PaymentState.REQUIRES_ACTION: {
        PaymentState.PROCESSING,
        PaymentState.FAILED,
        PaymentState.CANCELED,
    },
    PaymentState.FAILED: {
        PaymentState.PROCESSING,  # retry
    },
    PaymentState.SUCCEEDED: {
        PaymentState.REFUNDED,
    },
}


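def transition_payment(
    current: PaymentState,
    target: PaymentState,
) -> bool:
    """Validate a payment state transition.

    Returns True if the transition is allowed and raises otherwise.
    States missing from VALID_TRANSITIONS (CANCELED, REFUNDED) are
    terminal and allow no further transitions.
    """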
    valid_targets = VALID_TRANSITIONS.get(current, set())
    if target not in valid_targets:
        raise InvalidStateTransition(
            f"Cannot transition from {current} to {target}"
        )
    return True

The state machine prevented impossible transitions (e.g., refunding a failed payment) and provided a clear audit trail. Every state change was logged with the webhook event that triggered it.
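
The audit trail mentioned above could be kept alongside the transition check roughly like this. The sketch uses a trimmed-down copy of the state enum so that it runs on its own; `PaymentRecord`, its `history` list, and the event IDs are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class PaymentState(Enum):
    CREATED = "created"
    PROCESSING = "processing"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    REFUNDED = "refunded"


# Subset of the transition table from the text.
VALID_TRANSITIONS = {
    PaymentState.CREATED: {PaymentState.PROCESSING},
    PaymentState.PROCESSING: {PaymentState.SUCCEEDED, PaymentState.FAILED},
    PaymentState.FAILED: {PaymentState.PROCESSING},
    PaymentState.SUCCEEDED: {PaymentState.REFUNDED},
}


class InvalidStateTransition(Exception):
    pass


@dataclass
class PaymentRecord:
    state: PaymentState = PaymentState.CREATED
    # (from_state, to_state, webhook_event_id) for every applied change
    history: list[tuple[str, str, str]] = field(default_factory=list)

    def apply(self, target: PaymentState, event_id: str) -> None:
        """Validate the change, then record which event caused it."""
        if target not in VALID_TRANSITIONS.get(self.state, set()):
            raise InvalidStateTransition(
                f"Cannot transition from {self.state} to {target}"
            )
        self.history.append((self.state.value, target.value, event_id))
        self.state = target


payment = PaymentRecord()
payment.apply(PaymentState.PROCESSING, "evt_1")
payment.apply(PaymentState.FAILED, "evt_2")

rejected = False
try:
    payment.apply(PaymentState.REFUNDED, "evt_3")  # refunding a failed payment
except InvalidStateTransition:
    rejected = True

print(payment.history, rejected)
```

Rejected transitions leave both the state and the history untouched, so the history only ever contains changes that actually happened.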

Decision 4: Intelligent Retry for Failed Payments

Failed payments are common -- expired cards, insufficient funds, bank declines. The team implemented a graduated retry schedule:

from datetime import datetime, timedelta, timezone

RETRY_SCHEDULE = [
    {"delay_days": 1, "notify_customer": False},
    {"delay_days": 3, "notify_customer": True,
     "message": "gentle_reminder"},
    {"delay_days": 5, "notify_customer": True,
     "message": "second_reminder"},
    {"delay_days": 7, "notify_customer": True,
     "message": "final_warning"},
]

MAX_RETRIES = len(RETRY_SCHEDULE)


async def handle_payment_failure(
    payment: Payment,
    subscription: Subscription,
) -> None:
    """Handle a failed payment with graduated retry logic."""
    retry_count = payment.retry_count

    # All scheduled retries have been used: cancel and notify
    if retry_count >= MAX_RETRIES:
        await cancel_subscription(subscription)
        await send_cancellation_email(subscription.customer)
        await notify_team_slack(
            f"Subscription {subscription.id} canceled "
            f"after {MAX_RETRIES} failed payment attempts"
        )
        return

    schedule = RETRY_SCHEDULE[retry_count]

    # Schedule the retry (timezone-aware UTC; utcnow() is deprecated)
    retry_date = datetime.now(timezone.utc) + timedelta(
        days=schedule["delay_days"]
    )
    await schedule_payment_retry(payment.id, retry_date)

    # Notify customer if required
    if schedule["notify_customer"]:
        await send_dunning_email(
            subscription.customer,
            template=schedule["message"],
            amount=payment.amount,
            retry_date=retry_date,
        )

This "dunning" process -- gradually escalating notifications for failed payments -- recovered 68% of initially failed payments on the first retry and 82% across all four retries.
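
Because each delay counts from the previous failure, the offsets accumulate. The short sketch below works out the resulting timeline for a payment that keeps failing, assuming every retry runs exactly on schedule; `retry_timeline` and the example date are illustrative.

```python
from datetime import date, timedelta

# Delays from RETRY_SCHEDULE in the text, in days after the previous failure.
RETRY_DELAYS_DAYS = [1, 3, 5, 7]


def retry_timeline(first_failure: date) -> list[date]:
    """Return the date of each retry if every attempt fails on schedule."""
    dates = []
    current = first_failure
    for delay in RETRY_DELAYS_DAYS:
        current += timedelta(days=delay)
        dates.append(current)
    return dates


first_failure = date(2024, 3, 1)
timeline = retry_timeline(first_failure)
for attempt, when in enumerate(timeline, start=1):
    offset = (when - first_failure).days
    print(f"retry {attempt}: {when.isoformat()} (day {offset})")
```

Under these assumptions the retries land 1, 4, 9, and 16 days after the first failure, and the subscription is canceled only if the last of them also fails.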

Decision 5: Comprehensive Webhook Handling

The payment processor sent webhooks for numerous event types. The team needed to handle them all:

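import logging

# Assumption: the decorator and request API used below match FastAPI.
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
logger = logging.getLogger(__name__)

# Every handler is an async callable that takes the event's data object.
# handle_payment_failure(payment, subscription) from Decision 4 takes
# loaded records instead, so the failure event is routed through a thin
# adapter (handle_payment_failed_event, a hypothetical name) that looks
# up the Payment and Subscription before calling it.
WEBHOOK_HANDLERS = {
    "payment_intent.succeeded": handle_payment_success,
    "payment_intent.payment_failed": handle_payment_failed_event,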
    "customer.subscription.created": handle_subscription_created,
    "customer.subscription.updated": handle_subscription_updated,
    "customer.subscription.deleted": handle_subscription_deleted,
    "invoice.created": handle_invoice_created,
    "invoice.paid": handle_invoice_paid,
    "invoice.payment_failed": handle_invoice_payment_failed,
    "charge.refunded": handle_refund,
    "charge.dispute.created": handle_dispute,
}


@app.post("/webhooks/payments")
async def payment_webhook(request: Request):
    """Process payment processor webhooks."""
    payload = await request.body()
    sig_header = request.headers.get("Stripe-Signature", "")

    # Verify signature
    try:
        event = verify_and_parse_webhook(
            payload, sig_header, WEBHOOK_SECRET
        )
    except SignatureVerificationError:
        raise HTTPException(status_code=401)

    # Check for duplicate delivery
    event_id = event["id"]
    if await is_event_processed(event_id):
        return {"status": "already_processed"}

    # Route to handler
    event_type = event["type"]
    handler = WEBHOOK_HANDLERS.get(event_type)

    if handler:
        try:
            await handler(event["data"]["object"])
            await mark_event_processed(event_id)
        except Exception as exc:
            logger.error(
                f"Webhook handler failed: {event_type}",
                extra={"event_id": event_id, "error": str(exc)},
            )
            # Return 500 so the processor retries
            raise HTTPException(status_code=500)
    else:
        logger.info(f"Unhandled webhook event type: {event_type}")

    return {"status": "processed"}

A critical implementation detail: when the handler fails, the endpoint returns a 500 status code. This tells the payment processor to retry the webhook later. If the handler succeeds, the event ID is stored in the database to prevent duplicate processing on redelivery.
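
The duplicate check can be sketched with any store that enforces uniqueness. The example below uses SQLite from the Python standard library purely for illustration; the `processed_events` table and the `claim_event` and `release_event` helpers are assumptions, not the team's schema. Unlike a separate check followed by a later mark, it records the event ID in a single statement, so two concurrent deliveries of the same event cannot both pass the check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")


def claim_event(event_id: str) -> bool:
    """Record the event ID; return False if it was already recorded."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
    # rowcount is 1 when the row was inserted, 0 when it already existed
    return cursor.rowcount == 1


def release_event(event_id: str) -> None:
    """Forget a claimed event so that a redelivery is processed again."""
    with conn:
        conn.execute(
            "DELETE FROM processed_events WHERE event_id = ?",
            (event_id,),
        )


print(claim_event("evt_123"))  # first delivery
print(claim_event("evt_123"))  # redelivery of the same event
release_event("evt_123")       # e.g. the handler raised and returned 500
print(claim_event("evt_123"))  # the processor's retry is accepted again
```

The trade-off is that the claim happens before the handler runs, so a failed handler has to release it; otherwise the processor's retry would be ignored as a duplicate.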


The Hardest Bug: Race Condition in Payment Confirmation

Three weeks into development, the team discovered a subtle race condition. The payment flow was:

  1. Server creates a payment intent via the API
  2. Payment processor processes the payment
  3. Payment processor sends a webhook
  4. Server updates the order based on the webhook

The problem: sometimes the webhook arrived before the synchronous API response was processed. The server would receive the webhook, look up the payment in the database, and find nothing -- because the code creating the payment had not yet committed the database transaction.

The solution had two parts. First, the webhook handler retried the lookup briefly before giving up:

import asyncio


async def handle_payment_webhook(event_data: dict) -> None:
    """Handle payment webhook with retry for missing records."""
    payment_processor_id = event_data["id"]

    # Try to find the payment, with retries for race condition
    for attempt in range(3):
        payment = await db.get_payment_by_processor_id(
            payment_processor_id
        )
        if payment:
            break
        # Payment not in DB yet -- back off before the next lookup,
        # but do not sleep after the final attempt
        if attempt < 2:
            await asyncio.sleep(1.0 * (attempt + 1))
    else:
        # After 3 attempts, store the event for later processing
        await store_unmatched_webhook(event_data)
        logger.warning(
            f"Payment {payment_processor_id} not found after "
            f"3 attempts -- stored for later processing"
        )
        return

    # Process the webhook
    await update_payment_status(payment, event_data)

Additionally, the team added a periodic job that checked for unmatched webhooks and attempted to process them again. This "eventually consistent" approach handled the race condition gracefully without any data loss.
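
A minimal sketch of such a reprocessing pass is shown below. It uses an in-memory dictionary and list in place of the database, and every name in it (`payments`, `unmatched_webhooks`, `reprocess_unmatched_webhooks`, `reprocess_demo`) is hypothetical. A real job would run on a scheduler and would also give up on events that stay unmatched for too long.

```python
import asyncio

# Stand-ins for database tables.
payments: dict[str, dict] = {}       # processor ID -> payment record
unmatched_webhooks: list[dict] = []  # events whose payment was not found


async def reprocess_unmatched_webhooks() -> int:
    """Retry each stored event once; return how many were matched."""
    still_unmatched = []
    matched = 0
    for event in unmatched_webhooks:
        payment = payments.get(event["id"])
        if payment is None:
            still_unmatched.append(event)  # keep for the next pass
            continue
        payment["status"] = event["status"]
        matched += 1
    unmatched_webhooks[:] = still_unmatched
    return matched


async def reprocess_demo() -> tuple[int, int]:
    # The webhook arrived before the payment row was committed.
    unmatched_webhooks.append({"id": "pi_1", "status": "succeeded"})
    first_pass = await reprocess_unmatched_webhooks()   # payment still missing
    payments["pi_1"] = {"status": "processing"}         # transaction commits
    second_pass = await reprocess_unmatched_webhooks()  # now it matches
    return first_pass, second_pass


passes = asyncio.run(reprocess_demo())
print(passes, payments["pi_1"]["status"])
```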


Security Measures

Webhook Signature Verification

Every webhook was verified using HMAC-SHA256:

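import hashlib
import hmac
import json
import time


class SignatureVerificationError(Exception):
    """Raised when a webhook signature cannot be verified."""


def verify_and_parse_webhook(
    payload: bytes,
    signature_header: str,
    secret: str,
) -> dict:
    """Verify webhook signature and parse payload."""
    # Parse the signature header (Stripe format: "t=...,v1=...")
    try:
        elements = dict(
            item.split("=", 1)
            for item in signature_header.split(",")
        )
        timestamp = elements["t"]
        signature = elements["v1"]
        webhook_time = int(timestamp)
    except (KeyError, ValueError) as exc:
        # A missing or malformed header must fail verification
        # rather than surface as an unhandled error
        raise SignatureVerificationError(
            "Malformed signature header"
        ) from exc

    # Check timestamp to prevent replay attacks
    current_time = int(time.time())
    if abs(current_time - webhook_time) > 300:  # 5 minute tolerance
        raise SignatureVerificationError(
            "Webhook timestamp outside tolerance"
        )

    # Verify signature over "<timestamp>.<raw body>", using the raw bytes
    signed_payload = timestamp.encode() + b"." + payload
    expected = hmac.new(
        secret.encode(),
        signed_payload,
        hashlib.sha256,
    ).hexdigest()

    # Constant-time comparison avoids leaking the digest via timing
    if not hmac.compare_digest(expected, signature):
        raise SignatureVerificationError("Invalid signature")

    return json.loads(payload)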
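
For tests, it helps to be able to produce a header that the verifier accepts. The sketch below builds one in the `t=...,v1=...` layout parsed above and shows that changing the body afterwards, or replaying an old header, invalidates it. `sign`, `build_signature_header`, and `is_valid` are hypothetical helpers, and the secret is a dummy value.

```python
import hashlib
import hmac
import time


def sign(payload: bytes, secret: str, timestamp: int) -> str:
    """HMAC-SHA256 over "<timestamp>.<payload>", hex encoded."""
    signed = f"{timestamp}.".encode() + payload
    return hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()


def build_signature_header(payload: bytes, secret: str, timestamp: int) -> str:
    return f"t={timestamp},v1={sign(payload, secret, timestamp)}"


def is_valid(payload: bytes, header: str, secret: str, tolerance: int = 300) -> bool:
    """Compact restatement of the timestamp and signature checks."""
    parts = dict(item.split("=", 1) for item in header.split(","))
    timestamp = int(parts["t"])
    if abs(int(time.time()) - timestamp) > tolerance:
        return False
    return hmac.compare_digest(sign(payload, secret, timestamp), parts["v1"])


body = b'{"id": "evt_1", "type": "invoice.paid"}'
now = int(time.time())
header = build_signature_header(body, "whsec_dummy", now)
stale_header = build_signature_header(body, "whsec_dummy", now - 3600)

untouched_ok = is_valid(body, header, "whsec_dummy")
tampered_ok = is_valid(body + b" ", header, "whsec_dummy")
stale_ok = is_valid(body, stale_header, "whsec_dummy")
print(untouched_ok, tampered_ok, stale_ok)
```

Note that the signature covers the raw request bytes, which is why the endpoint reads `request.body()` instead of a parsed JSON object.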

Secret Management

All API keys and secrets were stored in a secrets manager, never in code or environment files committed to version control:

class SecretManager:
    """Retrieve secrets from the secrets management service."""

    def __init__(self, project_id: str):
        self.project_id = project_id
        self._cache: dict[str, str] = {}

    async def get_secret(self, name: str) -> str:
        """Retrieve a secret by name."""
        if name in self._cache:
            return self._cache[name]

        # In production, this calls AWS Secrets Manager, GCP Secret
        # Manager, or HashiCorp Vault
        secret = await self._fetch_from_vault(name)
        self._cache[name] = secret
        return secret
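
`_fetch_from_vault` is left undefined above. To exercise the caching behaviour without a real secrets service, a test double can supply it. The sketch below restates a minimal version of the class so that it runs on its own; `InMemorySecretManager`, `secret_demo`, and the secret name are hypothetical.

```python
import asyncio


class SecretManager:
    """Minimal restatement of the class above."""

    def __init__(self, project_id: str):
        self.project_id = project_id
        self._cache: dict[str, str] = {}

    async def get_secret(self, name: str) -> str:
        if name in self._cache:
            return self._cache[name]
        secret = await self._fetch_from_vault(name)
        self._cache[name] = secret
        return secret

    async def _fetch_from_vault(self, name: str) -> str:
        raise NotImplementedError


class InMemorySecretManager(SecretManager):
    """Test double that counts how often the backend is hit."""

    def __init__(self, project_id: str, secrets: dict[str, str]):
        super().__init__(project_id)
        self._secrets = secrets
        self.fetch_count = 0

    async def _fetch_from_vault(self, name: str) -> str:
        self.fetch_count += 1
        return self._secrets[name]


async def secret_demo() -> int:
    manager = InMemorySecretManager(
        "demo-project", {"WEBHOOK_SECRET": "whsec_dummy"}
    )
    await manager.get_secret("WEBHOOK_SECRET")
    await manager.get_secret("WEBHOOK_SECRET")  # served from the cache
    return manager.fetch_count


fetches = asyncio.run(secret_demo())
print(fetches)
```

One trade-off of a cache with no expiry: a rotated secret is not picked up until the process restarts or the cache is cleared.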

Audit Logging

Every payment operation was logged with full context:

from datetime import datetime, timezone


async def log_payment_event(
    event_type: str,
    payment_id: str,
    customer_id: str,
    amount: int,
    currency: str,
    metadata: dict | None = None,
) -> None:
    """Log a payment event for audit purposes."""
    await audit_db.insert({
        "event_type": event_type,
        "payment_id": payment_id,
        "customer_id": customer_id,
        "amount": amount,
        "currency": currency,
        "metadata": metadata or {},
        # Timezone-aware UTC timestamp (utcnow() is deprecated)
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "server_id": INSTANCE_ID,
    })

Rollout Strategy

The team did not flip a switch for all 500 customers at once. They used a phased rollout:

Phase 1 (Week 1): 10 internal test accounts. The team subscribed themselves and monitored every webhook, email, and payment.

Phase 2 (Week 2): 50 customers who had opted in to beta testing. The team manually verified each payment for the first billing cycle.

Phase 3 (Week 3): 200 customers. Automated monitoring was in place. The team tracked key metrics: payment success rate, webhook delivery rate, email delivery rate.

Phase 4 (Week 4): All 500 customers. The manual billing process was retired.


Results

Key Metrics After 3 Months

Metric                                   Before (Manual)   After (Automated)
---------------------------------------  ----------------  -------------------------
Time to process monthly billing          3 days            4 hours (fully automated)
Payment success rate (first attempt)     95%               94.8%
Payment recovery rate (after retries)    Unknown           82%
Invoice delivery time                    1-3 days          Immediate
Billing errors per month                 5-8               0
Team hours spent on billing              40/month          2/month (monitoring)

The slight decrease in first-attempt success rate was attributed to automated billing catching expired cards that the manual process had been silently skipping. The retry system recovered most of these.

Lessons Learned

Lesson 1: Idempotency Is Not Optional

The double-charge bug during testing was a wake-up call. Idempotency keys were added to every payment operation, and the team established a rule: no payment API call without an idempotency key. The deterministic key generation (based on customer + invoice + amount) was more robust than random UUIDs because it survived server restarts and retries from different instances.

Lesson 2: Webhooks Are the Source of Truth

Initially, the team updated payment status based on the synchronous API response. This led to inconsistencies when the response indicated "processing" but the actual result was different. Moving to a webhook-driven state machine made the system more reliable. The synchronous response was used only to detect immediate errors (invalid parameters, authentication failures).

Lesson 3: Test with Real Money (Small Amounts)

The payment processor's test mode was useful for development, but it did not catch every edge case. During the Phase 1 rollout, the team used real payments of $1.00 to verify the complete flow, including actual bank processing, real email delivery, and real webhook timing. Several issues were found that test mode did not reveal, including a timezone bug in prorated billing calculations.

Lesson 4: The Dunning Process Pays for Itself

Implementing the graduated retry system with dunning emails took a full week of development. It recovered 82% of failed payments that would otherwise have gone uncollected. For a subscription business, this directly impacts monthly recurring revenue. The team calculated that the dunning system recovered approximately $12,000/month in revenue that would otherwise have been lost.

Lesson 5: AI Excels at Payment Edge Cases

When implementing prorated billing for mid-cycle upgrades, the team asked the AI assistant: "Calculate the prorated amount when a customer upgrades from a $49/month plan to a $99/month plan on day 15 of a 30-day billing cycle, considering the unused credit from the current plan." The AI produced the correct calculation on the first attempt, including handling of leap years and varying month lengths. This kind of domain-specific calculation is where AI coding assistants provide tremendous value.
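
The arithmetic in that prompt can be written down directly. The sketch below is one common way to prorate, not necessarily what the team shipped: it assumes the upgrade takes effect with 15 of the 30 days remaining, works in integer cents, and rounds half-cent results up. `prorated_upgrade_charge` is a hypothetical helper.

```python
def prorated_upgrade_charge(
    old_price_cents: int,
    new_price_cents: int,
    days_remaining: int,
    days_in_cycle: int,
) -> int:
    """Amount due now when switching plans mid-cycle, in cents.

    Equals the new plan's cost for the remaining days minus the unused
    credit on the old plan. A negative result is a credit (downgrade).
    """
    if days_in_cycle <= 0 or not 0 <= days_remaining <= days_in_cycle:
        raise ValueError("days_remaining must fall within the billing cycle")
    numerator = (new_price_cents - old_price_cents) * days_remaining
    # Integer round-half-up: floor(numerator / days_in_cycle + 1/2)
    return (2 * numerator + days_in_cycle) // (2 * days_in_cycle)


# $49 -> $99 with 15 of 30 days left:
#   unused credit on the old plan   = 4900 * 15 / 30 = 2450
#   new plan for the remaining days = 9900 * 15 / 30 = 4950
charge = prorated_upgrade_charge(4900, 9900, days_remaining=15, days_in_cycle=30)
print(charge)  # 2500 cents, i.e. $25.00
```

Varying month lengths enter only through `days_in_cycle`, which would be 28, 29, 30, or 31 depending on the cycle being billed.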


Post-Mortem: The One Incident

Six weeks after full rollout, the payment processor experienced a 45-minute outage. During this period:

  • 23 payments failed at the API level
  • 12 webhooks were delayed by up to 2 hours
  • 3 customers saw temporary error states on their dashboards

The system handled the outage correctly:

  1. Failed API calls were caught by the retry logic and rescheduled
  2. Delayed webhooks were processed successfully when they arrived (idempotency prevented duplicates)
  3. Customer-facing errors were logged and the team was notified via Slack within 2 minutes

All 23 payments were successfully processed once the outage resolved. No manual intervention was required. The team's investment in resilience patterns -- retries, idempotency, webhook-driven state management -- paid off exactly as designed.

This case study demonstrates that payment integration is fundamentally about managing uncertainty. Networks fail, services go down, and edge cases appear. The combination of idempotency, state machines, intelligent retries, and webhook-driven architecture creates a system that handles uncertainty gracefully, protecting both the business and its customers.