Case Study 1: A Payment System Architecture Review — What Passes and What Doesn't
Background
Diane Chen sits at the head of the conference table in Pinnacle Financial Group's 14th-floor boardroom. It is 9:00 AM on a Tuesday, and two architecture review presentations are scheduled back-to-back. Both propose designs for PinnaclePay. One was prepared by an internal team led by Ahmad Rashid. The other was prepared by an external consulting firm, TechForward Solutions, led by their principal architect, Derek Morrison.
Both teams had the same requirements. Both had three months to prepare. The outcomes of these two reviews illustrate everything you need to know about what makes an architecture defensible.
Part 1: The TechForward Proposal — What Fails
The Presentation
Derek Morrison opens with a polished slide deck. The graphics are beautiful. The diagrams use four colors and drop shadows. The executive summary promises "a next-generation, cloud-enabled, AI-powered payment platform that leverages cutting-edge microservices architecture."
Diane's first question comes on slide 3: "What is the availability target?"
Derek responds: "We're targeting four nines — 99.99% availability. Industry standard."
Diane: "The requirement document specifies five nines. Four nines is 52 minutes of downtime per year. Our Fedwire participation agreement requires five nines. Did you read the requirements?"
Derek: "Five nines is aspirational. In practice, four nines is achievable and widely accepted."
The CISO, Robert Huang, leans forward: "It is not aspirational. It is contractual. If we miss our Fedwire SLA, the Federal Reserve can revoke our access. This is not a nice-to-have."
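The arithmetic behind this exchange is worth making explicit. A minimal sketch (the annual-minutes figures follow directly from the percentages):

```python
# Downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of permitted downtime per year at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability)

four_nines = downtime_minutes_per_year(0.9999)   # ~52.6 minutes/year
five_nines = downtime_minutes_per_year(0.99999)  # ~5.3 minutes/year
print(f"99.99%:  {four_nines:.1f} min/yr")
print(f"99.999%: {five_nines:.1f} min/yr")
```

The gap between the targets is a factor of ten: five nines leaves barely five minutes of total annual downtime, which is why it effectively rules out any design with a planned-outage maintenance model.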
The review continues. Key problems emerge:
Problem 1: No Data Sharing — Active-Passive Replication
TechForward's design uses DB2 in active-passive replication mode. The primary site handles all processing; the secondary site receives replicated data with a lag of "typically under 5 seconds."
Diane: "What is the RPO?"
Derek: "Under 5 seconds in normal conditions."
Diane: "The requirement is zero RPO. Zero data loss. If the primary fails at 14:00:00 and the last replicated transaction was at 13:59:55, those five seconds of wire transfers — potentially hundreds of millions of dollars — are lost. How do you recover them?"
Derek: "We would reconstruct from the Federal Reserve's records."
Robert Huang: "That is not our data to access. Fedwire messages are one-way. We cannot ask the Federal Reserve to replay our inbound wires because our replication lagged. This is a fundamental design flaw."
The TechForward design does not use Parallel Sysplex or DB2 data sharing. Without the coupling facility providing shared buffer pools and lock structures, there is always a replication window. For a system that moves money, this window is unacceptable.
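The exposure in that window can be estimated from the case's stated volume of five million transactions per day. This sketch assumes a uniform arrival rate for illustration; real payment traffic is bursty, so peak-second exposure would be higher:

```python
# Rough exposure of a 5-second replication lag at the case's stated volume.
# Assumes a uniform arrival rate; real traffic is bursty, so a peak-second
# estimate would be worse.
TX_PER_DAY = 5_000_000          # requirement stated in the case
SECONDS_PER_DAY = 24 * 60 * 60
LAG_SECONDS = 5

tx_per_second = TX_PER_DAY / SECONDS_PER_DAY   # ~57.9 tx/s on average
at_risk = tx_per_second * LAG_SECONDS          # ~289 transactions in the lag window
print(f"Average rate: {tx_per_second:.1f} tx/s")
print(f"Transactions at risk in a {LAG_SECONDS}s lag window: {at_risk:.0f}")
```

Nearly three hundred in-flight transactions on an average second, any of which could be a multimillion-dollar wire, is the concrete meaning of "typically under 5 seconds."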
Problem 2: Single MQ Queue Manager
TechForward's MQ design uses a single queue manager on the primary site with a standby queue manager at the DR site. During the review, the Operations Director, Janet Williams, asks:
"What happens when you need to apply a queue manager maintenance fix? You shut down the queue manager, apply the fix, and restart. During that window, which is typically 15-30 minutes, where do incoming Fedwire messages go?"
Derek: "We would schedule maintenance during a low-traffic window."
Janet: "Fedwire operates from 9 PM to 6:30 PM the next day. There is no low-traffic window for wire transfers during those hours. And RTP runs 24/7/365. There is literally no maintenance window where shutting down the queue manager is acceptable."
The correct design uses an MQ cluster with multiple queue managers, so that maintenance can be performed on one queue manager while others continue processing. This is standard practice for high-availability messaging, covered in Chapter 21 of the textbook.
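The cluster argument can be shown with a toy routing model: messages go only to queue managers that are up, so quiescing one for maintenance never stops the flow. The queue manager names (QM1..QM3) are hypothetical, and this is a sketch of the principle, not of IBM MQ's actual workload-balancing algorithm:

```python
# Toy model of rolling maintenance in a messaging cluster: traffic is
# balanced across running queue managers, so one can be quiesced for a
# fix while the others keep processing.
from itertools import cycle

class Cluster:
    def __init__(self, names):
        self.up = {n: True for n in names}

    def quiesce(self, name):
        """Take one queue manager down for maintenance."""
        self.up[name] = False

    def restart(self, name):
        self.up[name] = True

    def route(self, n_messages):
        """Round-robin messages across the queue managers still running."""
        alive = [n for n, ok in self.up.items() if ok]
        if not alive:
            raise RuntimeError("no queue manager available")
        targets = cycle(alive)
        return [next(targets) for _ in range(n_messages)]

cluster = Cluster(["QM1", "QM2", "QM3"])
cluster.quiesce("QM2")            # apply the maintenance fix to QM2
routed = cluster.route(6)         # inbound Fedwire traffic continues
assert "QM2" not in routed        # on QM1 and QM3, with no outage
```

With a single queue manager, the same `quiesce` call would leave `alive` empty and raise, which is exactly the 15-30 minute outage Janet is objecting to.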
Problem 3: No OFAC Screening Architecture
When Robert Huang asks about OFAC compliance, Derek's slide shows a box labeled "OFAC Screening Service" with an arrow from the wire transfer processing component. No further detail.
Robert: "What is the screening methodology? How do you handle fuzzy matching for name variations? What is the latency budget? What happens when the screening service is unavailable? How often is the SDN list updated? Where are screening results stored for audit purposes?"
Derek: "Those are implementation details that would be addressed during the detailed design phase."
Robert turns to Diane: "I cannot approve an architecture for a payment system where the compliance component is a black box. OFAC screening is not an implementation detail. A missed OFAC hit is a federal crime. The architecture must define the screening approach, the performance characteristics, and the failure modes."
Problem 4: Batch Window Not Analyzed
TechForward's batch design shows five processing steps for ACH but includes no timing analysis. Janet Williams asks:
"How long does the ACH batch take to process 3 million transactions?"
Derek: "We estimate approximately 2-3 hours based on similar systems we've built."
Janet: "Based on what? What MIPS rating? What DB2 commit frequency? What parallelism factor? Is 2 hours the normal case or the worst case? What happens on payroll day when volume doubles?"
Derek: "We would tune during the implementation phase."
Janet: "If the batch takes 5 hours instead of 3, we miss the ACH cutoff time, and every direct deposit for every Pinnacle customer is delayed by a day. I need timing analysis in the architecture, not promises about future tuning."
Problem 5: No Operational Story
The most damaging moment comes when Janet asks a simple question:
"It is 2 AM. Your monitoring system pages the on-call operator. A CICS region has crashed. Walk me through exactly what the on-call person does."
Silence.
Derek: "We would develop runbooks during the implementation phase."
Janet: "The architecture is the implementation plan. If you haven't thought about operations, you haven't thought about the system. This is a production system, not a proof of concept."
The Verdict
After 55 minutes, Diane calls for a decision. The vote is unanimous: Rejected. Return for complete redesign. The feedback letter lists 14 findings across five categories: availability, data integrity, compliance, operational readiness, and cost analysis (TechForward's TCO was missing MLC software costs entirely, understating the 5-year cost by approximately $16 million).
Part 2: Ahmad Rashid's Proposal — What Passes
The Presentation
Ahmad opens with no preamble: "PinnaclePay. National payment platform. ACH, wire, RTP. Five million transactions per day. Five nines. Zero data loss. $60.1 million over five years, $6.1 million annual savings over our current outsourced solution. I'll walk you through how."
The first difference is tone. Ahmad does not sell. He explains.
What Works: The Architecture Is Complete
Every component from the logical architecture appears in the physical topology. Every external interface has a documented protocol, format, and SLA. Every online transaction has a response time budget with millisecond allocations per step. Every batch job has a schedule, dependencies, and restart procedure. The architecture document is 127 pages with 42 diagrams.
Diane: "How do you achieve zero RPO?"
Ahmad: "DB2 data sharing group across four members on two sites. The coupling facility provides synchronous shared state. When member PPDB1 commits a transaction, the data is immediately visible to all other members through the group buffer pool. There is no replication lag because there is no replication — it is shared data, not replicated data."
What Works: The Security Design Is Specific
Robert Huang spends 20 minutes on the security slides. Ahmad has prepared:
- A STRIDE threat model with 36 identified threats and mitigations
- RACF group hierarchy with separation of duties matrix
- Encryption architecture covering at rest, in transit, and key management
- PCI-DSS v4.0 controls mapping with specific z/OS implementations
- OFAC screening design with performance analysis, hash table methodology, and failure mode documentation
Robert: "Your OFAC hash table is updated daily. What about intraday SDN additions?"
Ahmad: "We subscribe to OFAC's RSS feed for emergency updates. When an emergency SDN update is published, an automated process rebuilds the hash table and loads it into the CICS shared data table within 15 minutes. During the 15-minute window, all wire transfers are held in the exception queue for manual screening. We tested this process on February 12 — the hash table rebuild took 3 minutes 42 seconds."
Robert nods. He writes: "Acceptable" on his review sheet.
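The exact-match path of a hash-table screening design can be sketched briefly. Real screening also needs the fuzzy matching Robert asked about (transliterations, initials, transposed name parts), which this deliberately omits; the sample names are hypothetical, not real SDN entries:

```python
# Minimal sketch of hash-table sanctions screening on normalized names.
# Shows only the exact-match path; production screening layers fuzzy
# matching on top of this. Sample names are hypothetical.
import unicodedata

def normalize(name: str) -> str:
    """Uppercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode()
    cleaned = "".join(c if c.isalnum() or c.isspace() else " "
                      for c in ascii_only)
    return " ".join(cleaned.upper().split())

def build_table(sdn_names):
    """Rebuildable in-memory table, analogous to the daily hash load."""
    return {normalize(n) for n in sdn_names}

sdn = build_table(["Ivan Petrov-Example", "ACME Front Co."])
assert normalize("ivan  PETROV example") in sdn   # hit despite formatting
assert normalize("Jane Customer") not in sdn      # clean
```

Normalizing before hashing is what keeps the lookup O(1) while still catching trivial formatting variations; anything beyond that (edit distance, phonetic codes) belongs in the fuzzy-matching layer.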
What Works: The Operations Story Is Real
Janet Williams asks the same 2 AM question.
Ahmad: "The on-call operator receives a page with the region name, the abend code, and the last transaction processed. They open Runbook 1 — CICS Region Recovery. Step 1: Check CICSPlex SM dashboard for the specific failure. Step 2: If the abend is an ASRA (program check), the failing program is already disabled by CICS, and the region continues processing other transactions. The operator reviews the transaction dump, pages the on-call developer, and files an incident ticket. Step 3: If the region is unresponsive, the operator initiates a controlled shutdown and verifies that CICSPlex SM has rerouted traffic to the alternate AOR. Step 4: Restart the region. Step 5: Verify health metrics for 30 minutes. Estimated MTTR: 8 minutes for a program abend, 15 minutes for a region restart."
Janet: "Who is the on-call person?"
Ahmad: "We have a three-tier on-call rotation. Tier 1 is the operations team — they handle routine incidents using the runbooks. Tier 2 is the system programmers — they handle infrastructure issues that runbooks cannot resolve. Tier 3 is the development team — they handle application logic issues. Each tier has a primary and backup person on rotation."
Janet: "How do you prevent alert fatigue?"
Ahmad: "We have three alert severity levels. Critical alerts — system down or data integrity risk — go to all three tiers simultaneously. We expect fewer than 2 critical alerts per month in steady state. Warning alerts go to Tier 1 only and require response within 1 hour. We expect 5-10 per week. Informational alerts go to a dashboard, reviewed daily. We expect 20-30 per day. The thresholds are documented in the monitoring architecture and will be tuned during the first 90 days of production."
What Works: Honesty About Limitations
The most impressive moment comes when Diane asks: "What is the biggest weakness in this architecture?"
Ahmad does not hesitate: "Skills. The architecture is technically sound — every component is proven, and I can point to production references at CNB, Federal Benefits, and SecureFirst. But we need 12 full-time staff with COBOL, CICS, DB2, and MQ expertise. The current market has a severe shortage of these skills. Our mitigation is threefold: first, a training program that takes experienced Java developers and cross-trains them on mainframe technologies over six months. Rob Chen at CNB has done this successfully with three cohorts. Second, the CI/CD pipeline reduces the skill barrier for routine code changes — a developer does not need to understand JCL to deploy a COBOL program change. Third, the Year 3 modernization roadmap progressively reduces the mainframe-specific surface area by moving read workloads to cloud-native services."
Diane appreciates the honesty. Every architecture has weaknesses. The ones that get approved acknowledge them and present mitigations. The ones that get rejected pretend they have none.
What Works: The DR Plan Has Been Tested
Janet asks about disaster recovery, and Ahmad presents actual test results:
"We conducted a tabletop DR exercise on January 20 and a component-level test on February 8. The tabletop identified three gaps: the MQ channel security profiles were not replicated to the DR site, the CICS CSD at the DR site was 6 days behind production due to a failed copy job, and the batch scheduler configuration had not been updated after the November schedule change. All three gaps were remediated by February 5. The component-level test on February 8 — failing over one CICS AOR and one DB2 member to the DR site — completed in 11 minutes 23 seconds. The full application failover test is scheduled for March 15."
Janet writes: "This is how DR should be presented. Not as a design, but as evidence."
The Verdict
After 85 minutes and extensive Q&A, Diane calls for a decision: Approved with conditions. The conditions:
- Complete the full-application DR test by March 15 and present results to the ARB
- Finalize the staffing plan with HR, including budget for the training program
- Add a section on FedNow readiness to the modernization roadmap (the CFO's representative noted that FedNow is launching and Pinnacle should be ready)
Ahmad agrees to all three conditions. The project is funded.
Analysis: The Five Differences
1. Specificity vs. Abstraction
Ahmad's proposal contained specific numbers: 150ms wire processing budget, 1,000-record commit frequency, 12,000 MIPS capacity, 127-page architecture document. Derek's proposal contained adjectives: "next-generation," "cutting-edge," "industry-standard." Architecture reviews reward specificity and punish abstraction.
2. Tested vs. Theoretical
Ahmad presented DR test results with dates, durations, and findings. Derek presented DR as a future activity. The review board trusts evidence over intentions.
3. Complete vs. Deferred
Ahmad's architecture addressed every layer from WLM policy to production runbooks. Derek deferred OFAC screening, batch timing, operational procedures, and MLC cost calculations to "the implementation phase." An architecture that defers critical decisions is not an architecture — it is a wish list.
4. Honest vs. Optimistic
Ahmad identified skills shortage as the biggest risk and presented a mitigation plan. Derek did not identify any risks. The review board knows that a risk-free architecture does not exist. When an architect presents zero risks, the board assumes the architect has not thought deeply enough.
5. Integrated vs. Assembled
Ahmad's architecture was a coherent system where every component referenced the others — the WLM policy matched the CICS topology, the MQ queue design matched the batch schedule, the RACF profiles matched the operational roles. Derek's architecture was a collection of components that happened to be drawn on the same diagram but whose interactions were not analyzed.
Discussion Questions
- TechForward's RPO gap: Derek proposed 5-second replication lag as "acceptable." Under what circumstances might a regulator accept a non-zero RPO for a payment system? Are there payment types where 5 seconds of data loss might be tolerable?
- The skills argument: Ahmad identified skills as the biggest risk. If you were the CFO, would this concern make you more or less likely to fund the mainframe-based architecture? What alternative would you suggest?
- The honesty advantage: Why does admitting weaknesses increase credibility in an architecture review? Does this apply in other professional contexts (job interviews, project status reports)?
- Architecture document length: Ahmad's document was 127 pages. Is this too long? What is the right balance between completeness and readability? Who reads the whole document vs. who reads just their section?
- The cloud question: If TechForward had proposed a cloud-native architecture that genuinely met five nines and zero RPO (perhaps using CockroachDB or Spanner for distributed consensus), would the review board's decision have been different? What are the real barriers to cloud-native payment processing?
Key Takeaways
- An architecture review is not a sales pitch. It is a technical examination where every claim must be supported by evidence.
- The most common reasons for architecture rejection are: incomplete requirements coverage, missing security analysis, no operational story, unrealistic cost model, and deferred critical decisions.
- A strong architecture acknowledges its weaknesses and presents mitigations. A weak architecture pretends it has no weaknesses.
- Test results trump theoretical designs. "We tested it on February 8 and it took 11 minutes" is infinitely more convincing than "We expect it will take about 15 minutes."
- The architecture document is the single most important artifact in enterprise systems engineering. It is worth the time to make it complete, specific, and honest.