Case Study 2: Payment Integration Done Right
Overview
Project: SubscribePro -- A SaaS subscription management platform handling monthly recurring payments.
Team: Three developers at a fintech startup transitioning from a manual billing process to automated payment processing.
Timeline: Six-week development cycle, followed by a phased rollout to 500 existing customers.
Core Integration: Stripe-style payment processor for credit card billing, with supporting integrations for email receipts, invoice storage, and internal Slack notifications.
The Challenge
SubscribePro had been processing payments manually: each month a developer ran a script that generated invoices, and the finance team processed them one by one. With 500 customers and growing, this was unsustainable. The team needed to build an automated system that could:
- Process monthly recurring payments automatically on each customer's billing date
- Handle failed payments with intelligent retry logic
- Send email receipts for successful payments and dunning emails for failures
- Generate and store PDF invoices in cloud storage
- Notify the team of high-value payments and critical failures via Slack
- Handle subscription upgrades, downgrades, and cancellations mid-cycle with prorated billing
The stakes were high. A bug in payment processing could miss payments, double-charge customers, or charge the wrong amount. Every edge case needed to be handled correctly.
Architecture Decisions
Decision 1: Never Handle Raw Card Data
The team's first and most critical decision was to ensure that card numbers never touched their servers. They used the payment processor's client-side tokenization:
┌──────────┐    Card Details     ┌──────────────┐
│  Client  │ ───────────────────>│   Payment    │
│  Browser │    Payment Token    │  Processor   │
│          │ <───────────────────│   (Stripe)   │
└──────────┘                     └──────────────┘
     │
     │ Token Only
     ▼
┌──────────┐   Create Customer   ┌──────────────┐
│   Our    │ ───────────────────>│   Payment    │
│  Server  │     Customer ID     │  Processor   │
│          │ <───────────────────│              │
└──────────┘                     └──────────────┘
The client-side JavaScript widget collected card details and sent them directly to the payment processor. The application server only ever received a token, which it used to create a customer profile on the payment processor. This reduced their PCI compliance scope from SAQ D (the most burdensome) to SAQ A (the simplest).
Decision 2: Idempotency Everywhere
The team learned early in development that payment operations must be idempotent. During testing, a network timeout caused a payment to be created twice -- the test customer was charged $49.99 twice. This led to a strict idempotency policy:
import hashlib


def generate_idempotency_key(
    customer_id: str,
    invoice_id: str,
    amount: int,
) -> str:
    """Generate a deterministic idempotency key for a payment.

    Using a deterministic key (rather than a random UUID) ensures
    that retries of the same logical operation use the same key,
    even across different server instances or restarts.
    """
    raw = f"{customer_id}:{invoice_id}:{amount}"
    return hashlib.sha256(raw.encode()).hexdigest()
The key insight was using deterministic idempotency keys rather than random UUIDs. A random UUID generated at call time would be different on each retry unless it was persisted and looked up again, which defeats the purpose. A deterministic key based on the customer, invoice, and amount ensured that even if the retry happened from a different server instance (after a restart, for example), the payment processor would recognize it as a duplicate.
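To make the effect concrete, here is a minimal sketch of a retry that reuses a deterministic key. The `FakeProcessor` class is a hypothetical in-memory stand-in, not the real processor API, and the customer and invoice IDs are made up; it only illustrates the replay behavior the team relied on.

```python
import hashlib


def generate_idempotency_key(customer_id: str, invoice_id: str, amount: int) -> str:
    """Same deterministic key derivation as in the case study."""
    raw = f"{customer_id}:{invoice_id}:{amount}"
    return hashlib.sha256(raw.encode()).hexdigest()


class FakeProcessor:
    """Hypothetical stand-in for a payment processor.

    It stores the result of each idempotency key and replays it
    instead of creating a second charge.
    """

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}
        self.charges_created = 0

    def charge(self, customer_id: str, amount: int, idempotency_key: str) -> dict:
        if idempotency_key in self._results:
            # Duplicate request: return the stored result, charge nothing.
            return self._results[idempotency_key]
        self.charges_created += 1
        result = {"charge_id": f"ch_{self.charges_created}", "amount": amount}
        self._results[idempotency_key] = result
        return result


processor = FakeProcessor()
key = generate_idempotency_key("cus_123", "inv_2024_01", 4999)
first = processor.charge("cus_123", 4999, key)

# Simulate a retry after a timeout, possibly from another server instance:
# the key is recomputed from the same inputs rather than read from memory.
retry_key = generate_idempotency_key("cus_123", "inv_2024_01", 4999)
second = processor.charge("cus_123", 4999, retry_key)
```

Because both calls derive the same key, the second call returns the stored result and only one charge is created. Note that changing the amount produces a different key, so a corrected invoice amount is treated as a new operation.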
Decision 3: Webhook-Driven State Machine
Rather than relying on synchronous API responses for payment status, the team built a state machine driven by webhooks:
from enum import Enum


class InvalidStateTransition(Exception):
    """Raised when a requested state change is not allowed."""


class PaymentState(Enum):
    CREATED = "created"
    PROCESSING = "processing"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    REQUIRES_ACTION = "requires_action"  # 3D Secure
    CANCELED = "canceled"
    REFUNDED = "refunded"


# CANCELED and REFUNDED are terminal: they have no entry here, so any
# transition out of them is rejected.
VALID_TRANSITIONS = {
    PaymentState.CREATED: {
        PaymentState.PROCESSING,
        PaymentState.CANCELED,
    },
    PaymentState.PROCESSING: {
        PaymentState.SUCCEEDED,
        PaymentState.FAILED,
        PaymentState.REQUIRES_ACTION,
    },
    PaymentState.REQUIRES_ACTION: {
        PaymentState.PROCESSING,
        PaymentState.FAILED,
        PaymentState.CANCELED,
    },
    PaymentState.FAILED: {
        PaymentState.PROCESSING,  # retry
    },
    PaymentState.SUCCEEDED: {
        PaymentState.REFUNDED,
    },
}


def transition_payment(
    current: PaymentState,
    target: PaymentState,
) -> bool:
    """Validate a payment state transition.

    Returns True if the transition is allowed and raises
    InvalidStateTransition otherwise. Persisting the new state is
    left to the caller.
    """
    valid_targets = VALID_TRANSITIONS.get(current, set())
    if target not in valid_targets:
        raise InvalidStateTransition(
            f"Cannot transition from {current} to {target}"
        )
    return True
The state machine prevented impossible transitions (e.g., refunding a failed payment) and provided a clear audit trail. Every state change was logged with the webhook event that triggered it.
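The following sketch shows how such a state machine might be driven by incoming events while keeping an audit trail. The `PaymentRecord` class, the reduced transition table, and the `evt_...` IDs are illustrative assumptions, not the team's actual code.

```python
from enum import Enum


class PaymentState(Enum):
    CREATED = "created"
    PROCESSING = "processing"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    REFUNDED = "refunded"


# Reduced transition table, enough for this illustration.
VALID_TRANSITIONS = {
    PaymentState.CREATED: {PaymentState.PROCESSING},
    PaymentState.PROCESSING: {PaymentState.SUCCEEDED, PaymentState.FAILED},
    PaymentState.FAILED: {PaymentState.PROCESSING},
    PaymentState.SUCCEEDED: {PaymentState.REFUNDED},
}


class InvalidStateTransition(Exception):
    """Raised when a requested state change is not allowed."""


class PaymentRecord:
    """Hypothetical record holding the current state and its audit trail."""

    def __init__(self) -> None:
        self.state = PaymentState.CREATED
        # Each entry: (old state, new state, ID of the triggering event).
        self.audit_trail: list[tuple[str, str, str]] = []

    def apply(self, target: PaymentState, event_id: str) -> None:
        if target not in VALID_TRANSITIONS.get(self.state, set()):
            raise InvalidStateTransition(
                f"Cannot transition from {self.state.value} to {target.value}"
            )
        self.audit_trail.append((self.state.value, target.value, event_id))
        self.state = target


record = PaymentRecord()
record.apply(PaymentState.PROCESSING, "evt_1")
record.apply(PaymentState.FAILED, "evt_2")

# Refunding a failed payment is not in the table, so it is rejected
# and neither the state nor the audit trail changes.
try:
    record.apply(PaymentState.REFUNDED, "evt_3")
    rejected = False
except InvalidStateTransition:
    rejected = True
```

Recording the triggering event ID alongside each transition is what makes the trail useful for audits: every state change can be traced back to a specific webhook delivery.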
Decision 4: Intelligent Retry for Failed Payments
Failed payments are common -- expired cards, insufficient funds, bank declines. The team implemented a graduated retry schedule:
from datetime import datetime, timedelta, timezone

# Payment, Subscription, and the helper coroutines used below are
# defined elsewhere in the application.

RETRY_SCHEDULE = [
    {"delay_days": 1, "notify_customer": False},
    {"delay_days": 3, "notify_customer": True,
     "message": "gentle_reminder"},
    {"delay_days": 5, "notify_customer": True,
     "message": "second_reminder"},
    {"delay_days": 7, "notify_customer": True,
     "message": "final_warning"},
]
MAX_RETRIES = len(RETRY_SCHEDULE)


async def handle_payment_failure(
    payment: Payment,
    subscription: Subscription,
) -> None:
    """Handle a failed payment with graduated retry logic."""
    retry_count = payment.retry_count

    # All retries exhausted: cancel and tell both customer and team
    if retry_count >= MAX_RETRIES:
        await cancel_subscription(subscription)
        await send_cancellation_email(subscription.customer)
        await notify_team_slack(
            f"Subscription {subscription.id} canceled "
            f"after {MAX_RETRIES} failed payment attempts"
        )
        return

    schedule = RETRY_SCHEDULE[retry_count]

    # Schedule the retry (timezone-aware UTC; utcnow() is deprecated)
    retry_date = datetime.now(timezone.utc) + timedelta(
        days=schedule["delay_days"]
    )
    await schedule_payment_retry(payment.id, retry_date)

    # Notify customer if required
    if schedule["notify_customer"]:
        await send_dunning_email(
            subscription.customer,
            template=schedule["message"],
            amount=payment.amount,
            retry_date=retry_date,
        )
This "dunning" process -- gradually escalating notifications for failed payments -- recovered 68% of initially failed payments on the first retry and 82% across all four retries.
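Since each delay in the schedule is counted from the most recent failure, the delays add up. The sketch below works out the resulting calendar, assuming every retry fails on the day it is attempted and that `retry_count` is incremented after each attempt (the source does not show where that happens). The `retry_dates` helper and the start date are illustrative.

```python
from datetime import date, timedelta

RETRY_DELAYS_DAYS = [1, 3, 5, 7]  # the delay_days values from RETRY_SCHEDULE


def retry_dates(first_failure: date, delays: list[int]) -> list[date]:
    """Return the date of each retry, assuming each one fails again
    on the day it is attempted."""
    dates = []
    current = first_failure
    for delay in delays:
        current = current + timedelta(days=delay)
        dates.append(current)
    return dates


timeline = retry_dates(date(2024, 3, 1), RETRY_DELAYS_DAYS)
days_until_cancellation = (timeline[-1] - date(2024, 3, 1)).days
```

Under these assumptions the retries fall 1, 4, 9, and 16 days after the first failure, so a customer whose card keeps failing has a little over two weeks before the subscription is canceled.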
Decision 5: Comprehensive Webhook Handling
The payment processor sent webhooks for numerous event types. The team needed to handle them all:
# Every handler takes the event's data object (a dict) as its only argument.
WEBHOOK_HANDLERS = {
    "payment_intent.succeeded": handle_payment_success,
    "payment_intent.payment_failed": handle_payment_failure,
    "customer.subscription.created": handle_subscription_created,
    "customer.subscription.updated": handle_subscription_updated,
    "customer.subscription.deleted": handle_subscription_deleted,
    "invoice.created": handle_invoice_created,
    "invoice.paid": handle_invoice_paid,
    "invoice.payment_failed": handle_invoice_payment_failed,
    "charge.refunded": handle_refund,
    "charge.dispute.created": handle_dispute,
}


@app.post("/webhooks/payments")
async def payment_webhook(request: Request):
    """Process payment processor webhooks."""
    payload = await request.body()
    sig_header = request.headers.get("Stripe-Signature", "")

    # Verify signature
    try:
        event = verify_and_parse_webhook(
            payload, sig_header, WEBHOOK_SECRET
        )
    except SignatureVerificationError:
        raise HTTPException(status_code=401)

    # Check for duplicate delivery
    event_id = event["id"]
    if await is_event_processed(event_id):
        return {"status": "already_processed"}

    # Route to handler
    event_type = event["type"]
    handler = WEBHOOK_HANDLERS.get(event_type)
    if handler:
        try:
            await handler(event["data"]["object"])
            # Mark only after success so a failed event is retried
            await mark_event_processed(event_id)
        except Exception as exc:
            logger.error(
                "Webhook handler failed: %s",
                event_type,
                extra={"event_id": event_id, "error": str(exc)},
            )
            # Return 500 so the processor retries
            raise HTTPException(status_code=500) from exc
    else:
        logger.info("Unhandled webhook event type: %s", event_type)

    return {"status": "processed"}
A critical implementation detail: when the handler fails, the endpoint returns a 500 status code. This tells the payment processor to retry the webhook later. If the handler succeeds, the event ID is stored in the database to prevent duplicate processing on redelivery.
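The source does not show how processed event IDs were stored. One possible sketch uses a table with the event ID as primary key, here with SQLite from the standard library. The table name is an assumption, and these helpers are synchronous for brevity, whereas the endpoint above awaits its versions.

```python
import sqlite3

# In-memory database for illustration; a real deployment would use the
# application's shared database so all server instances see the same rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")


def is_event_processed(event_id: str) -> bool:
    """Return True if this event ID has already been recorded."""
    row = conn.execute(
        "SELECT 1 FROM processed_events WHERE event_id = ?", (event_id,)
    ).fetchone()
    return row is not None


def mark_event_processed(event_id: str) -> bool:
    """Record the event ID. Returns False if it was already recorded."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
        (event_id,),
    )
    conn.commit()
    # rowcount is 1 for a new row and 0 when the primary key already existed.
    return cursor.rowcount == 1


seen_before = is_event_processed("evt_1")
first_mark = mark_event_processed("evt_1")
seen_after = is_event_processed("evt_1")
second_mark = mark_event_processed("evt_1")
```

The primary-key constraint matters when two deliveries of the same event arrive at nearly the same time: both may pass the initial check, but only one insert creates a row. For that reason the handlers themselves should also be safe to run twice.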
The Hardest Bug: Race Condition in Payment Confirmation
Three weeks into development, the team discovered a subtle race condition. The payment flow was:
- Server creates a payment intent via the API
- Payment processor processes the payment
- Payment processor sends a webhook
- Server updates the order based on the webhook
The problem: sometimes the webhook arrived before the synchronous API response was processed. The server would receive the webhook, look up the payment in the database, and find nothing -- because the code creating the payment had not yet committed the database transaction.
The solution was a two-part approach:
import asyncio


async def handle_payment_webhook(event_data: dict) -> None:
    """Handle payment webhook with retry for missing records."""
    payment_processor_id = event_data["id"]

    # Try to find the payment, with retries for race condition
    for attempt in range(3):
        payment = await db.get_payment_by_processor_id(
            payment_processor_id
        )
        if payment:
            break
        # Payment not in DB yet -- wait and retry (1s, 2s, 3s)
        await asyncio.sleep(1.0 * (attempt + 1))
    else:
        # The loop ended without a break: after 3 attempts, store the
        # event for later processing
        await store_unmatched_webhook(event_data)
        logger.warning(
            f"Payment {payment_processor_id} not found after "
            f"3 attempts -- stored for later processing"
        )
        return

    # Process the webhook
    await update_payment_status(payment, event_data)
Additionally, the team added a periodic job that checked for unmatched webhooks and attempted to process them again. This "eventually consistent" approach handled the race condition gracefully without any data loss.
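The periodic job itself is not shown in the source. A minimal sketch of what it might look like follows, with in-memory dictionaries and lists standing in for the database tables. The names `payments`, `unmatched_webhooks`, and `reprocess_unmatched_webhooks` are hypothetical, as is the `status` field on the event.

```python
import asyncio

# Hypothetical in-memory stand-ins for the database tables.
payments: dict[str, dict] = {}
unmatched_webhooks: list[dict] = []


async def store_unmatched_webhook(event_data: dict) -> None:
    """Keep an event whose payment record could not be found yet."""
    unmatched_webhooks.append(event_data)


async def reprocess_unmatched_webhooks() -> int:
    """Retry stored webhooks; return how many were matched in this run."""
    matched = 0
    still_unmatched = []
    for event_data in unmatched_webhooks:
        payment = payments.get(event_data["id"])
        if payment is None:
            # Still no record: keep the event for the next run.
            still_unmatched.append(event_data)
            continue
        payment["status"] = event_data["status"]
        matched += 1
    unmatched_webhooks[:] = still_unmatched
    return matched


async def demo() -> tuple[int, int]:
    await store_unmatched_webhook({"id": "pi_1", "status": "succeeded"})
    first_run = await reprocess_unmatched_webhooks()  # record not committed yet
    payments["pi_1"] = {"status": "processing"}       # the transaction commits
    second_run = await reprocess_unmatched_webhooks()
    return first_run, second_run


first_run, second_run = asyncio.run(demo())
```

The first run matches nothing and leaves the event in place; once the payment record exists, the second run applies the stored status and clears the backlog. A production version would also want an age limit so that events which never match are escalated rather than retried forever.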
Security Measures
Webhook Signature Verification
Every webhook was verified using HMAC-SHA256:
import hashlib
import hmac
import json
import time


def verify_and_parse_webhook(
    payload: bytes,
    signature_header: str,
    secret: str,
) -> dict:
    """Verify webhook signature and parse payload."""
    # Parse the signature header (Stripe format: "t=<ts>,v1=<sig>").
    # A missing or malformed header must fail verification rather
    # than escape as an uncaught ValueError.
    try:
        elements = dict(
            item.split("=", 1)
            for item in signature_header.split(",")
        )
        timestamp = elements.get("t", "")
        webhook_time = int(timestamp)
    except ValueError:
        raise SignatureVerificationError("Malformed signature header")
    signature = elements.get("v1", "")

    # Check timestamp to prevent replay attacks
    current_time = int(time.time())
    if abs(current_time - webhook_time) > 300:  # 5 minute tolerance
        raise SignatureVerificationError("Webhook timestamp too old")

    # Verify signature over the raw bytes (no decode/encode round trip)
    signed_payload = timestamp.encode() + b"." + payload
    expected = hmac.new(
        secret.encode(),
        signed_payload,
        hashlib.sha256,
    ).hexdigest()
    # Constant-time comparison avoids leaking timing information
    if not hmac.compare_digest(expected, signature):
        raise SignatureVerificationError("Invalid signature")

    return json.loads(payload)
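For local testing it helps to be able to produce a header in the same `t=...,v1=...` format. The sketch below is an assumption about how such a test helper could look. `compute_signature`, `build_signature_header`, and `signature_matches` are illustrative names, and the secret is a made-up value.

```python
import hashlib
import hmac
import json
import time


def compute_signature(payload: bytes, secret: str, timestamp: int) -> str:
    """HMAC-SHA256 over '<timestamp>.<payload>', hex encoded."""
    signed_payload = f"{timestamp}.".encode() + payload
    return hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()


def build_signature_header(payload: bytes, secret: str, timestamp: int) -> str:
    """Build a header in the 't=...,v1=...' format for use in tests."""
    return f"t={timestamp},v1={compute_signature(payload, secret, timestamp)}"


def signature_matches(payload: bytes, header: str, secret: str) -> bool:
    """Simplified check mirroring the verification steps above."""
    parts = dict(item.split("=", 1) for item in header.split(","))
    timestamp = int(parts["t"])
    if abs(int(time.time()) - timestamp) > 300:  # 5 minute tolerance
        return False
    expected = compute_signature(payload, secret, timestamp)
    return hmac.compare_digest(expected, parts.get("v1", ""))


secret = "whsec_test_secret"  # made-up value for the example
payload = json.dumps({"id": "evt_1", "type": "invoice.paid"}).encode()
now = int(time.time())

valid = signature_matches(payload, build_signature_header(payload, secret, now), secret)
tampered = signature_matches(payload + b" ", build_signature_header(payload, secret, now), secret)
stale = signature_matches(payload, build_signature_header(payload, secret, now - 600), secret)
```

The three cases cover the checks that matter: a correctly signed payload passes, a payload altered after signing fails, and a correctly signed but ten-minute-old payload fails the replay check.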
Secret Management
All API keys and secrets were stored in a secrets manager, never in code or environment files committed to version control:
class SecretManager:
    """Retrieve secrets from the secrets management service."""

    def __init__(self, project_id: str):
        self.project_id = project_id
        self._cache: dict[str, str] = {}

    async def get_secret(self, name: str) -> str:
        """Retrieve a secret by name."""
        if name in self._cache:
            return self._cache[name]
        # In production, this calls AWS Secrets Manager, GCP Secret
        # Manager, or HashiCorp Vault
        secret = await self._fetch_from_vault(name)
        self._cache[name] = secret
        return secret
Audit Logging
Every payment operation was logged with full context:
from datetime import datetime, timezone


async def log_payment_event(
    event_type: str,
    payment_id: str,
    customer_id: str,
    amount: int,
    currency: str,
    metadata: dict | None = None,
) -> None:
    """Log a payment event for audit purposes."""
    await audit_db.insert({
        "event_type": event_type,
        "payment_id": payment_id,
        "customer_id": customer_id,
        "amount": amount,
        "currency": currency,
        "metadata": metadata or {},
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "server_id": INSTANCE_ID,
    })
Rollout Strategy
The team did not flip a switch for all 500 customers at once. They used a phased rollout:
Phase 1 (Week 1): 10 internal test accounts. The team subscribed themselves and monitored every webhook, email, and payment.
Phase 2 (Week 2): 50 customers who had opted in to beta testing. The team manually verified each payment for the first billing cycle.
Phase 3 (Week 3): 200 customers. Automated monitoring was in place. The team tracked key metrics: payment success rate, webhook delivery rate, email delivery rate.
Phase 4 (Week 4): All 500 customers. The manual billing process was retired.
Results
Key Metrics After 3 Months
| Metric | Before (Manual) | After (Automated) |
|---|---|---|
| Time to process monthly billing | 3 days | 4 hours (fully automated) |
| Payment success rate (first attempt) | 95% | 94.8% |
| Payment recovery rate (after retries) | Unknown | 82% |
| Invoice delivery time | 1-3 days | Immediate |
| Billing errors per month | 5-8 | 0 |
| Team hours spent on billing | 40/month | 2/month (monitoring) |
The slight decrease in first-attempt success rate was attributed to automated billing catching expired cards that the manual process had been silently skipping. The retry system recovered most of these.
Lesson 1: Idempotency Is Not Optional
The double-charge bug during testing was a wake-up call. Idempotency keys were added to every payment operation, and the team established a rule: no payment API call without an idempotency key. The deterministic key generation (based on customer + invoice + amount) was more robust than random UUIDs because it survived server restarts and retries from different instances.
Lesson 2: Webhooks Are the Source of Truth
Initially, the team updated payment status based on the synchronous API response. This led to inconsistencies when the response indicated "processing" but the actual result was different. Moving to a webhook-driven state machine made the system more reliable. The synchronous response was used only to detect immediate errors (invalid parameters, authentication failures).
Lesson 3: Test with Real Money (Small Amounts)
The payment processor's test mode was useful for development, but it did not catch every edge case. During the Phase 1 rollout, the team used real payments of $1.00 to verify the complete flow, including actual bank processing, real email delivery, and real webhook timing. Several issues were found that test mode did not reveal, including a timezone bug in prorated billing calculations.
Lesson 4: The Dunning Process Pays for Itself
Implementing the graduated retry system with dunning emails took a full week of development. It recovered 82% of failed payments that would otherwise have been lost. For a subscription business, this directly impacts monthly recurring revenue. The team calculated that the dunning system recovered approximately $12,000/month in payments that would have otherwise churned.
Lesson 5: AI Excels at Payment Edge Cases
When implementing prorated billing for mid-cycle upgrades, the team asked the AI assistant: "Calculate the prorated amount when a customer upgrades from a $49/month plan to a $99/month plan on day 15 of a 30-day billing cycle, considering the unused credit from the current plan." The AI produced the correct calculation on the first attempt, including handling of leap years and varying month lengths. This kind of domain-specific calculation is where AI coding assistants provide tremendous value.
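The source does not reproduce the AI's answer. As a worked illustration of the scenario in the prompt, the sketch below uses one common proration convention: credit the unused share of the old plan and charge the same share of the new plan. It assumes that "day 15 of a 30-day cycle" leaves 15 days remaining, works in integer cents, and rounds halves up. The function names are hypothetical.

```python
def _div_round_half_up(numerator: int, denominator: int) -> int:
    """Integer division that rounds halves up (non-negative inputs only)."""
    return (2 * numerator + denominator) // (2 * denominator)


def prorated_upgrade_charge(
    old_price_cents: int,
    new_price_cents: int,
    days_remaining: int,
    days_in_cycle: int,
) -> int:
    """Net charge in cents for upgrading with days_remaining left in the cycle."""
    if days_in_cycle <= 0 or not 0 <= days_remaining <= days_in_cycle:
        raise ValueError("days_remaining must be between 0 and days_in_cycle")
    # Credit for the part of the old plan that will not be used.
    unused_credit = _div_round_half_up(
        old_price_cents * days_remaining, days_in_cycle
    )
    # Cost of the new plan for the rest of the cycle.
    new_plan_cost = _div_round_half_up(
        new_price_cents * days_remaining, days_in_cycle
    )
    return new_plan_cost - unused_credit


# $49/month to $99/month with 15 of 30 days remaining:
# credit 4900 * 15 / 30 = 2450, new plan 9900 * 15 / 30 = 4950.
charge = prorated_upgrade_charge(4900, 9900, days_remaining=15, days_in_cycle=30)
```

Under these assumptions the customer is charged 2500 cents ($25.00) at upgrade time. Passing the real cycle length as `days_in_cycle` is what accommodates months of different lengths; deciding which calendar day counts as "remaining" is exactly where timezone handling can go wrong.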
Post-Mortem: The One Incident
Six weeks after full rollout, the payment processor experienced a 45-minute outage. During this period:
- 23 payments failed at the API level
- 12 webhooks were delayed by up to 2 hours
- 3 customers saw temporary error states on their dashboards
The system handled the outage correctly:
- Failed API calls were caught by the retry logic and rescheduled
- Delayed webhooks were processed successfully when they arrived (idempotency prevented duplicates)
- Customer-facing errors were logged and the team was notified via Slack within 2 minutes
All 23 payments were successfully processed once the outage resolved. No manual intervention was required. The team's investment in resilience patterns -- retries, idempotency, webhook-driven state management -- paid off exactly as designed.
This case study demonstrates that payment integration is fundamentally about managing uncertainty. Networks fail, services go down, and edge cases appear. The combination of idempotency, state machines, intelligent retries, and webhook-driven architecture creates a system that handles uncertainty gracefully, protecting both the business and its customers.