Case Study 1: SecureFirst's API Gateway Implementation
Background
SecureFirst Insurance processes 2.3 million active policies across personal auto, homeowner's, commercial property, and specialty lines. Their mainframe runs a z15 with three LPARs, hosting 847 CICS transactions that collectively handle policy administration, claims processing, billing, and underwriting. Every one of those transactions is backed by COBOL programs written and refined over 30 years.
When the board approved a digital transformation initiative — a new customer-facing portal, a mobile app, and a partner API program — Yuki Tanaka, the enterprise architect, had to answer a fundamental question: how do we get data out of the mainframe and into these new channels?
The Initial Proposal: Data Replication
The first proposal came from the cloud architecture team. Replicate mainframe data into a cloud-native database (Aurora PostgreSQL on AWS). Build the portal, mobile app, and partner APIs against the cloud database. Synchronize nightly via batch extract.
Yuki sat through the presentation and asked three questions:
- "Which database is authoritative when the customer calls at 3 PM and the cloud data is from last night's extract?"
- "Who maintains the synchronization when the mainframe copybook adds a field?"
- "How do we explain to the state insurance commissioner that our policy records exist in two databases with potentially different values?"
The cloud team didn't have satisfactory answers. The estimated cost for the replication approach was $4.2 million over 18 months, with ongoing synchronization costs of $600K annually.
Yuki's Counter-Proposal: API-First
Yuki's proposal: expose the existing CICS transactions as RESTful APIs using z/OS Connect EE. The mainframe remains the single source of truth. Every consumer — portal, mobile app, partner, internal system — calls the same APIs.
Phase 1: Foundation (Months 1-2)
Infrastructure deployment. Carlos Rivera led the infrastructure setup:
- z/OS Connect EE installed on all three production LPARs (LPAR-A, LPAR-B, LPAR-C)
- Zowe API ML deployed on a dedicated z/OS Connect instance (LPAR-A) with failover to LPAR-B
- IBM API Connect deployed on-premises for the enterprise API management layer
- TLS certificates provisioned (2048-bit RSA, SHA-256) for all z/OS Connect endpoints
- IPIC connections established between z/OS Connect and CICS regions on each LPAR
Security framework. The security architecture was designed collaboratively between the mainframe security team, the network team, and the API team:
- OAuth 2.0 authorization server: IBM DataPower Gateway (existing investment)
- Three RACF user IDs: APIRD01 (read-only APIs), APIWR01 (transactional APIs), APIADM1 (administrative APIs)
- SAFCredentialMapper configured for role-based mapping: OAuth scope policy:read maps to APIRD01, scope policy:write maps to APIWR01
- Mutual TLS required for all partner API connections
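The scope-to-user-ID rule can be sketched in a few lines of Python. This is illustrative only: in production the mapping is enforced by the SAF credential mapper inside z/OS Connect, not by application code, and the admin scope name below is an assumption (the source names only policy:read and policy:write).

```python
# Illustrative sketch of the OAuth-scope-to-RACF-ID mapping. The "admin"
# scope name is assumed; the read/write scopes and user IDs are from the
# case study. Ordered from most to least privileged so a token carrying
# several scopes is mapped to the strongest ID it is entitled to.
SCOPE_TO_RACF_ID = {
    "admin": "APIADM1",         # administrative APIs (scope name assumed)
    "policy:write": "APIWR01",  # transactional APIs
    "policy:read": "APIRD01",   # read-only APIs
}

def map_scopes_to_racf_id(token_scopes: list[str]) -> str:
    """Return the RACF user ID for a request based on granted scopes."""
    for scope, racf_id in SCOPE_TO_RACF_ID.items():
        if scope in token_scopes:
            return racf_id
    raise PermissionError("no recognized scope in token")
```

Dict insertion order carries the privilege ranking here; a real mapper would evaluate configured rules rather than rely on ordering.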
Design standards. Yuki and Carlos authored a 12-page API design standard covering naming conventions, versioning policy, error response format, pagination, and security requirements. Every API would go through design review before implementation.
Phase 2: First APIs (Months 2-4)
The team prioritized 40 APIs based on business value and consumer demand. The first five, chosen because they were the most requested by the portal team:
| API | CICS Transaction | Complexity | Status |
|---|---|---|---|
| Policy Inquiry | PLCYINQ | Low — single COMMAREA, read-only | Live month 3 |
| Policy Search | PLCYSCH | Medium — variable-length results, pagination | Live month 3 |
| Claims Status | CLMSSTS | Low — single COMMAREA, read-only | Live month 3 |
| Premium Quote | PRMMQTE | High — calls rating engine, complex calculation | Live month 4 |
| Policy Change | PLCYCHG | High — multi-step, validation, writes | Live month 4 |
Policy Inquiry — the first API. The COBOL program PLCYINQ had been running since 1997. It accepted a policy number in a COMMAREA and returned policy details. The copybook was PLCY-INQ-AREA, 2,847 bytes.
Carlos's team created the service archive in four hours:
- Imported the COBOL copybook into the z/OS Connect API toolkit
- Mapped COBOL field names to JSON names (e.g., PLCY-HLDR-LST-NM became policyHolderLastName)
- Configured COMP-3 monetary fields as JSON strings (lesson learned: the premium field PIC S9(07)V99 COMP-3 was initially mapped to JSON number, and they caught the precision issue in testing when a $99,999,999.99 test policy came back as $100,000,000.00)
- Wrote the OpenAPI 3.0 specification with full schema documentation
- Deployed the .sar file to all three z/OS Connect instances
The COBOL program was not modified. Not a single line of code changed.
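The COMP-3 precision lesson is easy to reproduce. One plausible mechanism for the rounding they observed is the value passing through single-precision floating point somewhere in the JSON-number path; the Python sketch below shows that failure mode, and why a string representation parsed as a decimal survives intact.

```python
import struct
from decimal import Decimal

# The premium from the story: more significant digits than a 32-bit float
# can hold.
premium = "99999999.99"

# Round-trip through single-precision binary floating point, as can happen
# when a packed-decimal (COMP-3) field is mapped to a lossy numeric type.
as_float32 = struct.unpack("f", struct.pack("f", float(premium)))[0]
print(as_float32)    # 100000000.0 -- the cents (and more) are gone

# Mapping the field to a JSON string and parsing it as a decimal on the
# consumer side preserves every digit.
as_decimal = Decimal(premium)
print(as_decimal)    # 99999999.99
```

Float32 values near 10^8 are spaced 8 apart, so $99,999,999.99 snaps to exactly $100,000,000.00, which matches the symptom the team saw in testing.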
Premium Quote — the hard one. The premium rating API was complex because PRMMQTE was a conversational transaction. It required multiple CICS calls: initialize rating → apply base rates → apply discounts → apply surcharges → calculate final premium. The original 3270 interface handled this with a multi-screen conversation.
For the API, they redesigned the interaction as a single request/response. Carlos's team wrote a new CICS wrapper program, PRMMAPI, that called PRMMQTE's subroutines non-conversationally. This was the only case where new COBOL code was written — and it was a thin wrapper that composed existing business logic, not new logic.
The resulting API accepted all rating factors in a single JSON request and returned the quoted premium in a single response. Response time: 450ms (the multi-screen 3270 conversation had taken human operators 3-5 minutes).
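The shape of the PRMMAPI change can be illustrated in Python: the five conversational steps become one pure request-to-response function. All names, rates, and request fields below are invented for illustration; the real logic lives in PRMMQTE's unchanged COBOL subroutines.

```python
# Hypothetical sketch of the PRMMAPI idea: the old multi-screen conversation
# collapsed into one stateless request -> response function. The rate and
# field names are made-up illustration values.
from decimal import Decimal

BASE_RATE = Decimal("0.0125")  # illustrative rate per $ of coverage

def quote_premium(request: dict) -> dict:
    coverage = Decimal(request["coverageAmount"])
    premium = coverage * BASE_RATE                                        # apply base rates
    premium -= premium * Decimal(request.get("discountPct", "0")) / 100   # apply discounts
    premium += premium * Decimal(request.get("surchargePct", "0")) / 100  # apply surcharges
    return {"quotedPremium": str(premium.quantize(Decimal("0.01")))}      # final premium

# One JSON-shaped request in, one response out -- no conversation state.
print(quote_premium({"coverageAmount": "250000", "discountPct": "10"}))
```

The key design point is that the wrapper holds no state between calls, which is what lets a single HTTP request replace the multi-screen 3270 conversation.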
Phase 3: Gateway and Management (Months 3-5)
API gateway configuration. The Zowe API ML gateway was deployed in a three-node active-active configuration behind an F5 load balancer:
Internet → F5 Load Balancer → Gateway Node 1 (LPAR-A)
→ Gateway Node 2 (LPAR-B)
→ Gateway Node 3 (LPAR-C)
Each gateway node was configured with:
- Connection to the Zowe Discovery Service
- Rate limiting: 100 requests/minute for external partners, 1,000/minute for the portal, 5,000/minute for internal services
- JWT validation against the DataPower authorization server
- Health check monitoring every 10 seconds for all z/OS Connect instances
- Circuit breaker: if a z/OS Connect instance returns 5 consecutive errors, remove it from the routing table for 30 seconds, then probe
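The circuit-breaker rule can be sketched as a small state machine. Thresholds are taken from the text; everything else, including the class and method names, is illustrative, and the real behavior is Zowe API ML gateway configuration rather than hand-written code.

```python
import time

class CircuitBreaker:
    """Sketch of the per-instance breaker: trip after 5 consecutive errors,
    hold the instance out of routing for 30 seconds, then allow a probe."""

    def __init__(self, error_threshold=5, open_seconds=30):
        self.error_threshold = error_threshold
        self.open_seconds = open_seconds
        self.consecutive_errors = 0
        self.opened_at = None  # None means the circuit is closed

    def record_success(self):
        self.consecutive_errors = 0
        self.opened_at = None

    def record_error(self):
        self.consecutive_errors += 1
        if self.consecutive_errors >= self.error_threshold:
            self.opened_at = time.monotonic()

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.open_seconds:
            return True  # cool-down elapsed: let one probe request through
        return False
```

A production breaker would also distinguish the half-open probe state from fully closed; the sketch keeps only the trip/cool-down logic described in the text.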
Developer portal launch. The API Connect developer portal went live with the first five APIs. Portal features:
- Interactive API documentation (Swagger UI)
- Self-service API key provisioning for partners
- Usage analytics dashboards
- Getting started guides for each API
- Sandbox environment with test data
Phase 4: Scale (Months 5-12)
The remaining 35 APIs were built at a rate of approximately 8 per month. By month 12, all 40 priority APIs were live.
Key metrics at the 12-month mark:
| Metric | Value |
|---|---|
| APIs in production | 40 |
| Daily API calls | 2.1 million |
| Average response time | 187ms |
| p99 response time | 1,200ms |
| Error rate | 0.03% |
| Availability | 99.97% |
| Active API consumers | 14 (portal, mobile, 12 partners) |
| Mainframe CPU impact | +4.2% (from API overhead) |
Technical Challenges and Solutions
Challenge 1: The Character Encoding Incident
Three weeks after launch, a partner reported that policy holder names with accented characters (common in their Spanish-speaking customer base) were displaying as garbage characters. Investigation revealed that the CICS region was using EBCDIC code page 037 (US/Canada), which doesn't include Latin-1 accented characters. The z/OS Connect conversion to UTF-8 was producing incorrect mappings.
Solution: Updated the z/OS Connect configuration to specify EBCDIC code page 1047 (which includes Latin-1 characters) for the CICS service provider connection. Tested with a comprehensive character set including accented Latin characters, and added automated character encoding tests to the CI/CD pipeline.
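The automated encoding check they added amounts to a round-trip test. A sketch, using Python's cp500 codec (EBCDIC International, another Latin-1-repertoire code page) as a stand-in because the standard library does not ship a cp1047 codec:

```python
# Round-trip encoding test sketch. cp500 stands in for the production code
# page (1047), which is not in Python's standard codec set.
EBCDIC_CODEC = "cp500"

# Characters the business actually uses -- Spanish-language names here.
REQUIRED_CHARS = "ÁÉÍÓÚÑÜáéíóúñü¿¡"

def roundtrip_ok(text: str, codec: str = EBCDIC_CODEC) -> bool:
    """True if every character survives host encoding and back to Unicode."""
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeEncodeError:
        return False

assert all(roundtrip_ok(ch) for ch in REQUIRED_CHARS)
```

The value of the test is the REQUIRED_CHARS list: it pins the character repertoire the business depends on, so a future code page change that drops any of them fails the pipeline immediately.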
Challenge 2: The Peak-Hour Capacity Conflict
API traffic peaked between 9 AM and 5 PM (business hours). But so did online CICS transaction volume. Adding API traffic on top of existing 3270 traffic was causing CICS region queuing during peak hours, pushing response times above the 2-second SLA.
Solution: Dynamic rate limiting based on CICS region health. Carlos built a custom z/OS Connect interceptor that checks the CICS region's transaction queue depth every 5 seconds. When the queue depth exceeds 80% of the configured maximum, the interceptor reduces the API rate limit by 50% and returns 503 Service Unavailable with a Retry-After: 5 header for excess requests.
This kept the existing 3270 users unaffected while degrading API performance gracefully. The portal team implemented retry logic with exponential backoff, and end users rarely noticed the throttling.
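Both halves of the solution reduce to a few lines each. The sketch below restates the interceptor's admission rule and the portal's backoff schedule in Python; the thresholds come from the text, while the function names and backoff constants are assumptions (the real interceptor runs inside z/OS Connect, and queue depth comes from CICS monitoring).

```python
def admission_decision(queue_depth: int, max_queue: int, base_rate: int):
    """Return (effective_rate_limit, headers_for_rejected_requests).

    When the CICS transaction queue exceeds 80% of its configured maximum,
    halve the API rate limit; requests over the reduced limit get an
    HTTP 503 with a Retry-After header.
    """
    if queue_depth > 0.8 * max_queue:
        return max(base_rate // 2, 1), {"Retry-After": "5"}
    return base_rate, None

def backoff_schedule(attempts: int, base: float = 0.5, cap: float = 30.0):
    """Illustrative client-side exponential backoff delays, in seconds."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]
```

Pairing a server-side Retry-After hint with client-side exponential backoff is what made the throttling largely invisible to end users.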
Challenge 3: The COMMAREA Size Limit
The policy search API needed to return up to 100 matching policies. Each policy summary was 350 bytes. 100 policies × 350 bytes = 35,000 bytes — exceeding the 32,763-byte COMMAREA limit.
Solution: Redesigned the backend to use CICS channels and containers. The search results were placed in a container named SEARCH-RESULTS within a channel named POLICY-SEARCH. z/OS Connect was reconfigured to use the channel-based service provider instead of COMMAREA-based. The container has no practical size limit, so the API could return up to 100 results without issue.
They also implemented server-side pagination — the API defaults to 20 results per page, and the consumer can request up to 100 with the pageSize query parameter.
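The pagination contract can be restated as a small helper. The parameter name and limits are from the text; clamping out-of-range values, rather than rejecting them, is an assumption about the API's behavior.

```python
# Sketch of the pageSize rules: default 20 results per page, capped at 100.
DEFAULT_PAGE_SIZE = 20
MAX_PAGE_SIZE = 100

def resolve_page_size(query_params: dict) -> int:
    """Resolve the effective page size from the request's query parameters."""
    raw = query_params.get("pageSize")
    if raw is None:
        return DEFAULT_PAGE_SIZE
    return max(1, min(int(raw), MAX_PAGE_SIZE))
```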
Challenge 4: Partner Onboarding Friction
The first three partner onboardings each took 6 weeks. The process involved: legal agreement review, API key provisioning, firewall rule changes, mTLS certificate exchange, RACF profile creation, rate limit assignment, and sandbox testing. Too many teams were involved, and none owned the end-to-end process.
Solution: Carlos created an API Partner Onboarding Runbook — a single document with a checklist, responsible parties, SLAs for each step, and escalation contacts. He also automated what he could: API key provisioning became self-service through the developer portal, and sandbox environment access was automatic upon registration. Onboarding time dropped to 2 weeks.
Business Results
Eighteen months after the first API went live:
- Portal launch: On time, consuming 28 of the 40 APIs. Customer self-service rate increased from 23% to 67%.
- Mobile app launch: On time, consuming 22 APIs. App store rating: 4.4 stars.
- Partner program: 12 active partners generating $14M in new annual premium through API-based integrations.
- Cost avoidance: The data replication approach would have cost $4.2M + $600K/year. The API approach cost $800K + $200K/year in ongoing operations. Five-year savings: $5.4M.
- Mainframe ROI: The mainframe, previously seen as a cost center, was now recognized as the engine powering digital channels. The z15 investment was reframed from "legacy maintenance" to "digital platform."
Lessons Learned
- Start with read-only APIs. The first five APIs were mostly read-only. This let the team build confidence and work out infrastructure issues without risking data integrity.
- Don't change the COBOL unless you must. Only 1 of the 40 APIs required new COBOL code, and even that was a thin wrapper. The business logic stayed untouched.
- Character encoding is not boring. The EBCDIC/UTF-8 conversion seems trivial until it isn't. Test with the full character set your business actually uses.
- Rate limiting is not optional. Without rate limiting, the first partner would have consumed 60% of the CICS region's capacity. Protect your mainframe.
- API management is a team sport. The API team, mainframe team, security team, network team, and business stakeholders all need to be aligned. The API Center of Excellence (CoE) was the glue.
Discussion Questions
1. Yuki's proposal required the portal and mobile app to depend on mainframe availability. What are the trade-offs of this single-source-of-truth approach vs. data replication? Under what circumstances would replication be the better choice?
2. The dynamic rate limiting solution prioritizes existing 3270 users over API consumers. Is this the right priority? How would you design a more sophisticated prioritization scheme?
3. SecureFirst's 40 APIs represent about 5% of their 847 CICS transactions. What criteria should they use to decide which transactions to expose next? Should they aim to expose all 847?
4. The partner onboarding process took 6 weeks initially. What additional automation could reduce this further? What's the minimum viable onboarding time?
5. Carlos's team chose to map COMP-3 fields to JSON strings for precision. This means API consumers have to parse the string to a decimal type. Is this the right trade-off? What alternatives exist?