Chapter 37 Quiz: The Hybrid Architecture
Section 1: Multiple Choice
1. What is the threshold concept of this chapter?
a) Hybrid architecture is a transitional state that should be eliminated as quickly as possible
b) Cloud-native is always the superior platform and should be the migration target
c) Hybrid is the destination, not a waypoint — permanent coexistence of mainframe and cloud is the intentional target state
d) The mainframe should be protected from all cloud integration to preserve its reliability
Answer: c) Hybrid is the destination, not a waypoint — permanent coexistence of mainframe and cloud is the intentional target state
Explanation: The threshold concept is the mental shift from treating hybrid as a temporary compromise during migration to recognizing it as the permanent, designed-for architecture. The economic reality (cost of rewriting 240 billion lines of COBOL), technical reality (z/OS's unmatched reliability for transaction processing), and practical reality (6,000+ developer-years to rewrite a large system) all point to permanent coexistence. When architects internalize this, they stop building "temporary" integration layers and start investing in durable hybrid infrastructure.
2. In the anti-corruption layer (ACL) pattern, who "owns the contract" between mainframe and cloud?
a) The COBOL program defines the contract through its COMMAREA layout
b) The cloud microservice defines the contract through its JSON schema
c) The ACL owns the contract — it translates between both platforms without forcing either to adopt the other's model
d) The API gateway owns the contract and both platforms must conform to it
Answer: c) The ACL owns the contract — it translates between both platforms without forcing either to adopt the other's model
Explanation: The ACL is a translation boundary that prevents the domain model of one system from corrupting the other. By owning the contract, the ACL allows the COBOL program's COMMAREA layout to change without breaking cloud consumers, and allows cloud consumers' expected JSON format to change without modifying COBOL. The ACL absorbs the impedance mismatch — data type translation (packed decimal to JSON number), encoding translation (EBCDIC to UTF-8), date format translation, and error code mapping all happen in the ACL.
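The data-type and encoding translations the ACL absorbs can be made concrete with a minimal sketch. This is an illustrative stand-in, not the chapter's implementation: the 24-byte COMMAREA layout, field names, and offsets are assumptions chosen for the example.

```python
import codecs
import json
from decimal import Decimal

def unpack_comp3(data: bytes, scale: int) -> Decimal:
    """Decode an IBM packed-decimal (COMP-3) field into a Decimal.
    Each byte holds two BCD digits; the low nibble of the final byte
    is the sign (0xD = negative, 0xC or 0xF = positive)."""
    digits = []
    for byte in data[:-1]:
        digits.append(str(byte >> 4))
        digits.append(str(byte & 0x0F))
    digits.append(str(data[-1] >> 4))       # last byte: high nibble is a digit
    sign = "-" if (data[-1] & 0x0F) == 0x0D else ""
    return Decimal(sign + "".join(digits)) / (10 ** scale)

def translate_commarea(raw: bytes) -> str:
    """ACL translation sketch for a hypothetical COMMAREA:
    bytes 0-9   account ID, EBCDIC (code page cp037)
    bytes 10-14 balance, COMP-3 with 2 decimal places
    Remaining bytes are ignored in this sketch."""
    account_id = codecs.decode(raw[0:10], "cp037").strip()
    balance = unpack_comp3(raw[10:15], scale=2)
    return json.dumps({"accountId": account_id, "balance": str(balance)})
```

Note that neither side sees the other's model: the COBOL program keeps its fixed-layout COMMAREA, the cloud consumer keeps its JSON, and only this translation code changes when either side evolves.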
3. Why is "dual write" — updating both the mainframe database and the cloud database in the same operation — dangerous in hybrid architecture?
a) It doubles the network latency for every transaction
b) There is no distributed transaction coordinator spanning z/OS and cloud platforms, so if one write succeeds and the other fails, data becomes inconsistent with no automatic recovery
c) It violates RACF security policies on z/OS
d) DB2 does not support writes from external systems
Answer: b) There is no distributed transaction coordinator spanning z/OS and cloud platforms, so if one write succeeds and the other fails, data becomes inconsistent with no automatic recovery
Explanation: XA two-phase commit does not work across a WAN with cloud endpoints — the latency and failure modes make it impractical. If you write to DB2 on z/OS and to a cloud database in the same logical operation, and the cloud write fails after the DB2 commit succeeds, you have inconsistent data with no coordinator to detect or resolve the inconsistency. The correct pattern is a single system of record (typically DB2 on z/OS for financial data) with CDC or event-driven replication to the cloud replica.
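The single-system-of-record pattern can be sketched in a few lines. This is a toy model, not a real CDC API: the dictionaries stand in for DB2, the DB2 recovery log, and the cloud replica.

```python
# Minimal sketch of the single-system-of-record pattern: the only
# direct write path is the system of record; the replica is updated
# solely by consuming the change log (standing in for CDC here).
system_of_record = {}   # stands in for DB2 on z/OS
change_log = []         # stands in for the DB2 recovery log read by CDC
cloud_replica = {}      # stands in for the cloud data warehouse

def write_balance(account: str, balance: int) -> None:
    """Single write path: commit to the system of record and append
    to its log. Nothing ever writes to the replica directly."""
    system_of_record[account] = balance
    change_log.append((account, balance))

def replicate_once() -> None:
    """CDC stand-in: drain the change log into the replica. If this
    crashes mid-stream, replaying from the log converges the replica.
    A failed dual write has no such log to replay from."""
    while change_log:
        account, balance = change_log.pop(0)
        cloud_replica[account] = balance
```

The key property is that a replication failure leaves the replica stale but recoverable, whereas a failed dual write leaves two authoritative stores permanently disagreeing.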
4. In the event mesh pattern, why do events flow out of the mainframe (to cloud) more easily than events flow into the mainframe (from cloud)?
a) MQ only supports outbound messaging
b) The mainframe is an event-driven system by design
c) Events flowing into the mainframe as state changes require synchronous API calls with ACID transactional guarantees — you can't fire-and-forget into CICS because it needs to commit or rollback
d) Cloud events use JSON, which CICS cannot process
Answer: c) Events flowing into the mainframe as state changes require synchronous API calls with ACID transactional guarantees — you can't fire-and-forget into CICS because it needs to commit or rollback
Explanation: The mainframe's transactional model requires that state changes go through CICS's unit of work mechanism with explicit SYNCPOINT. An asynchronous event from the cloud that needs to change mainframe state must be converted into a synchronous API call through the ACL, so the calling system knows whether the change committed or rolled back. Events flow outward easily because MQ is designed for reliable async messaging — a COBOL program writes to an MQ queue within its unit of work, and the message is guaranteed delivered. The cloud consumer processes the event asynchronously without needing the mainframe to participate.
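The inbound direction can be sketched as follows. Both functions are hypothetical stand-ins (there is no real z/OS Connect call here); the point is the shape of the contract, not the plumbing.

```python
def call_cics_transaction(payload: dict) -> bool:
    """Stand-in for a synchronous z/OS Connect -> CICS invocation.
    Returns True only if the unit of work reached SYNCPOINT (committed).
    The toy business rule below is purely illustrative."""
    return payload.get("amount", 0) > 0

def handle_cloud_event(event: dict) -> str:
    """ACL inbound handler: no fire-and-forget. The async cloud event
    is converted into a synchronous call, and the transactional outcome
    is returned so the event source can retry or dead-letter on rollback."""
    committed = call_cics_transaction(event)
    return "committed" if committed else "rolled_back"
```

Contrast this with the outbound direction, where the COBOL program simply puts a message on an MQ queue inside its own unit of work and never waits for the cloud consumer.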
5. What is the primary purpose of the correlation ID in hybrid monitoring?
a) To authenticate requests between platforms
b) To encrypt data in transit between mainframe and cloud
c) To trace a single business transaction across both mainframe and cloud platforms, enabling end-to-end diagnosis during incidents
d) To route API requests to the correct backend platform
Answer: c) To trace a single business transaction across both mainframe and cloud platforms, enabling end-to-end diagnosis during incidents
Explanation: The correlation ID is a UUID generated at the API gateway and propagated through every component in the request path: API gateway, z/OS Connect, CICS (stored in the TCTUA), DB2, MQ messages, Kafka events, and cloud microservices. During incidents, the correlation ID allows engineers to trace a specific customer transaction from the mobile app all the way through both platforms and back. SecureFirst reported that propagating correlation IDs into CICS reduced mean time to diagnose cross-platform issues from 45 minutes to 8 minutes.
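The gateway-side behavior reduces to two rules: mint an ID at the edge if none exists, and pass the same ID unchanged on every hop. A minimal sketch (the header name is illustrative):

```python
import uuid

def ensure_correlation_id(headers: dict) -> dict:
    """Gateway sketch: generate a correlation ID at the edge if the
    caller didn't supply one; otherwise propagate the caller's ID
    untouched. Header name is an assumption for this example."""
    headers = dict(headers)
    headers.setdefault("X-Correlation-Id", str(uuid.uuid4()))
    return headers

def call_downstream(headers: dict, hop_log: list) -> None:
    """Stand-in for each downstream hop (z/OS Connect, CICS TCTUA,
    MQ message property, Kafka header): every component records the
    same ID, which is what makes end-to-end tracing possible."""
    hop_log.append(headers["X-Correlation-Id"])
```

During an incident, grepping every platform's logs for one such ID reconstructs the full cross-platform path of a single customer transaction.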
6. In the CNB reference architecture, why are MQ and Kafka both used rather than choosing one messaging platform?
a) MQ is cheaper than Kafka
b) MQ is a proven z/OS component with coupling facility integration for Sysplex-wide reliability; Kafka is the cloud-native standard for event streaming — the MQ-Kafka connector bridges both worlds without forcing either team to adopt the other's platform
c) Kafka cannot handle the message volume from the mainframe
d) IBM licensing requires MQ on z/OS
Answer: b) MQ is a proven z/OS component with coupling facility integration for Sysplex-wide reliability; Kafka is the cloud-native standard for event streaming — the MQ-Kafka connector bridges both worlds without forcing either team to adopt the other's platform
Explanation: This is ADR-004 in the reference architecture. MQ on z/OS uses queue-sharing groups in the coupling facility for Sysplex-wide reliable messaging. Kafka provides the cloud-native event streaming capabilities (topic-based pub/sub, consumer groups, log compaction) that cloud consumers expect. The MQ-Kafka connector is the bridge. This is a direct application of the hybrid principle: use each platform's native capabilities rather than forcing one platform's tools onto the other.
7. What is the purpose of tokenization at the mainframe boundary in Pinnacle's hybrid architecture?
a) To improve query performance in the cloud analytics database
b) To compress data for faster network transfer
c) To ensure that Protected Health Information (PHI) never leaves z/OS in identifiable form, limiting the HIPAA compliance scope of cloud systems
d) To convert EBCDIC data to ASCII format
Answer: c) To ensure that Protected Health Information (PHI) never leaves z/OS in identifiable form, limiting the HIPAA compliance scope of cloud systems
Explanation: Ahmad Rashidi's architectural decision at Pinnacle is that PHI is tokenized on z/OS before CDC replication to the cloud. The cloud analytics team works with tokenized data — they can perform aggregations and trend analysis but cannot identify individual patients. This limits the HIPAA compliance scope: the cloud analytics platform does not process identifiable PHI, so it faces fewer regulatory requirements. When identifiable data is needed, a controlled API with full audit logging provides access to specific records.
8. Which organizational model for hybrid teams does CNB use, and why?
a) Product teams with embedded platform expertise — because they are a small organization
b) Platform teams with integration squad — because their scale (500M transactions/day, 4 LPARs) requires concentrated mainframe expertise, and the integration squad bridges the cultural and technical divide between platform teams
c) Federated with architecture governance — because they have strict organizational boundaries
d) A single merged team with no platform specialization
Answer: b) Platform teams with integration squad — because their scale (500M transactions/day, 4 LPARs) requires concentrated mainframe expertise, and the integration squad bridges the cultural and technical divide between platform teams
Explanation: At CNB's scale, the depth of mainframe expertise required (DB2 data sharing, coupling facility tuning, CICS performance, batch scheduling) makes concentrated platform teams more practical than distributing mainframe skills across product teams. The integration squad (6 people, 3 with mainframe backgrounds + 3 with cloud backgrounds) owns the connections between platforms: API gateway, event mesh, identity bridge, CDC pipeline, and unified monitoring. The integration squad translates requirements and designs integration patterns between teams that might otherwise struggle with different vocabularies and toolchains.
9. In the saga pattern for hybrid architecture, why is orchestration almost always preferred over choreography?
a) Orchestration is faster than choreography
b) Orchestration provides centralized visibility into saga progress, which is critical when a step fails at 2 AM and the on-call engineer needs to determine where the saga stopped
c) Choreography doesn't work with COBOL programs
d) IBM's z/OS Connect only supports orchestration
Answer: b) Orchestration provides centralized visibility into saga progress, which is critical when a step fails at 2 AM and the on-call engineer needs to determine where the saga stopped
Explanation: In choreography, each step emits an event that triggers the next step, but there's no central place to see the overall saga state. When a step fails in a hybrid environment (where the failure might be in CICS, in a cloud service, or in the MQ bridge between them), you need a central orchestrator that knows exactly which steps have completed, which compensating transactions need to run, and in what order. This centralized visibility is essential for operational reliability in production hybrid systems.
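The "central place to see saga state" argument can be made concrete with a small orchestrator sketch. The step and compensation functions are illustrative; real steps would be calls into CICS, cloud services, or the MQ bridge.

```python
# Minimal orchestration sketch: the orchestrator records every step's
# outcome centrally, so on failure the on-call engineer (or the code
# itself) can see exactly which steps committed and run their
# compensating transactions in reverse order.

def run_saga(steps):
    """steps: list of (name, action, compensate) triples.
    Returns (status, log); the log is the centralized saga record
    that choreography lacks."""
    log = []
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            log.append((name, "completed"))
            completed.append((name, compensate))
        except Exception:
            log.append((name, "failed"))
            # Undo already-committed steps in reverse order.
            for done_name, comp in reversed(completed):
                comp()
                log.append((done_name, "compensated"))
            return "aborted", log
    return "committed", log
```

At 2 AM, the log answers the operational questions directly: which steps completed, where the saga stopped, and which compensations ran.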
10. What is the "provisional permanence trap" described by Sandra Chen?
a) The tendency to permanently provision too much mainframe capacity
b) The phenomenon where "temporary" integration solutions become permanent the moment they carry production traffic, accumulating technical debt faster than purpose-built durable solutions
c) The practice of provisioning cloud resources temporarily during mainframe maintenance windows
d) The regulatory requirement to maintain temporary copies of all data
Answer: b) The phenomenon where "temporary" integration solutions become permanent the moment they carry production traffic, accumulating technical debt faster than purpose-built durable solutions
Explanation: This is one of the most costly patterns in hybrid architecture. Organizations build "temporary" bridges intending to replace them when the migration completes. But the migration never completes (because hybrid is the destination), and the temporary bridges accumulate technical debt — poor error handling, no monitoring, brittle data type conversions, hardcoded connection strings. Sandra Chen's observation is that investing in durable integration infrastructure from the start has lower lifetime cost than maintaining, debugging, and eventually replacing temporary solutions.
Section 2: Short Answer
11. List the four architecture patterns for permanent hybrid COBOL-cloud systems and provide a one-sentence description of each.
Answer: 1. Anti-Corruption Layer (ACL): A translation boundary that handles synchronous request-response integration between platforms, preventing each platform's domain model from corrupting the other. 2. Event Mesh: An asynchronous communication fabric (MQ + Kafka connector) that translates between the mainframe's transactional paradigm and the cloud's event-driven paradigm. 3. Shared-Nothing Data Architecture: Each platform owns its data with no direct cross-platform database access; data flows between platforms through CDC, batch transfer, or API-based sync. 4. Hybrid API Gateway: A single entry point for all external consumers that routes requests to the appropriate platform, provides unified cross-cutting concerns, and enables transparent service migration.
12. Explain why z/OS's coupling-facility-based lock management provides stronger consistency guarantees than any cloud database, and quantify the performance difference.
Answer: The coupling facility provides hardware-accelerated global lock management with 10-30 microsecond lock acquisition latency across all LPARs in a Parallel Sysplex. This enables atomic multi-member commits — a transaction that updates tables in four different DB2 members on four different LPARs commits atomically (all or nothing). Cloud distributed databases use software-based consensus protocols (Paxos, Raft) with typical latencies of 1-10 milliseconds per consensus round — 100-1,000x slower than coupling facility locks. This hardware-speed consistency at scale is what makes z/OS irreplaceable for high-volume transactional workloads like CNB's 500 million transactions per day.
13. Describe the data ownership model in the CNB reference architecture. Which platform is the system of record for financial data, and what mechanism synchronizes data to the other platform?
Answer: DB2 on z/OS is the single system of record for all financial data (account balances, transaction history, loan records). All writes to financial data go through CICS/COBOL programs. Cloud databases contain replicas populated by CDC (InfoSphere Data Replication), which reads the DB2 recovery log and publishes changes to the cloud data warehouse with a target lag of less than 30 seconds (SLA: within 60 seconds, 99.5% of the time). Cloud services that need real-time current data (e.g., transaction authorization) call the mainframe API directly rather than reading the eventually-consistent replica.
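An SLA of the form "lag within 60 seconds, 99.5% of the time" is straightforward to check against a series of measured lag samples. A minimal sketch (the sampling and alerting around it are out of scope):

```python
def within_sla(lag_samples, sla_seconds=60, target=0.995):
    """Check a series of CDC lag measurements (in seconds) against an
    SLA of the form 'lag <= sla_seconds at least target fraction of
    the time'. Defaults mirror the 60-second / 99.5% figures above."""
    ok = sum(1 for lag in lag_samples if lag <= sla_seconds)
    return ok / len(lag_samples) >= target
```

This is also the measurement that would feed the Assumptions Register: sustained `False` results are the "breaking evidence" that the replication-lag assumption no longer holds.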
14. What five principles does CNB's zero-trust hybrid security model implement?
Answer: 1. Encrypt everything in transit — TLS 1.3 for API traffic, AT-TLS on z/OS, IPSec for MQ channels. 2. Encrypt everything at rest — DFSMS dataset encryption on z/OS, cloud KMS encryption; keys never cross the platform boundary. 3. Authenticate every request — every API call carries a token; every MQ message carries sender identity; no "trusted network" exceptions. 4. Authorize at the finest grain — API gateway enforces coarse-grained authorization; RACF enforces fine-grained transaction-level and column-level authorization on the mainframe. 5. Log everything — SMF on z/OS, cloud audit logs, API gateway access logs, all correlated by correlation ID.
15. Explain the difference between a compensating transaction and a rollback. Why is this distinction important in the saga pattern?
Answer: A rollback undoes uncommitted changes within a single transaction manager's scope — it restores the exact prior state as if the operation never happened. A compensating transaction is a new forward transaction that undoes the business effect of an already-committed transaction. It may not restore the exact prior state — for example, if a saga step created an account and a CDC event already replicated that account to the analytics warehouse, the compensating transaction deletes the account from DB2 but the analytics warehouse may briefly show the account before CDC replicates the delete. This distinction matters because saga steps commit locally and cannot be rolled back across platforms. Compensating transactions are the only recovery mechanism, and business stakeholders must accept that the "undo" is not instantaneous or perfectly clean.
Section 3: Scenario Analysis
16. A retail bank's cloud team proposes connecting their cloud-based fraud detection ML model directly to DB2 on z/OS via JDBC to read transaction data in real-time. Using principles from this chapter, explain why this is a bad idea and propose a better architecture.
Answer: Direct JDBC access from cloud to DB2 violates the shared-nothing data architecture pattern (Pattern 3). Problems: (1) A network disruption between cloud and z/OS takes down fraud detection. (2) Cloud analytics queries compete with CICS online transactions for DB2 buffer pool pages and CPU. (3) The cloud team needs DB2 connection credentials, expanding the security attack surface. (4) No rate limiting — a bug in the ML model could generate excessive queries and degrade mainframe performance.
Better architecture: Use CDC to replicate transaction data from DB2 to a cloud database in near-real-time (< 30 seconds lag). The ML model reads from the cloud replica. For the rare case where the model needs the absolute latest balance (e.g., during real-time transaction scoring), it calls a rate-limited, authenticated mainframe API through the ACL. This preserves performance isolation, fault isolation, and security boundaries.
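The "rate-limited" part of that API deserves a sketch, since it is what prevents a buggy ML client from degrading mainframe performance. A token bucket is one common choice; the rate and capacity numbers here are illustrative, not figures from the chapter.

```python
import time

class TokenBucket:
    """Sketch of the rate limit the ACL could enforce in front of the
    mainframe API: a runaway client exhausts its tokens and is refused
    instead of flooding DB2. Parameters are illustrative."""
    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Refill tokens for elapsed time, then spend one if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Requests refused by the bucket fail fast at the ACL; they never reach CICS or DB2, which is exactly the performance-isolation boundary the shared-nothing pattern calls for.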
17. During a hybrid incident at CNB, the funds transfer API shows p99 latency of 4,200ms (normal: 350ms). CICS response time is normal (45ms). Cloud services are normal. API gateway shows no errors. Where should the incident commander look first, and why?
Answer: The incident commander should look at the integration layer — specifically: (1) MQ queue depth between z/OS Connect and the cloud, (2) z/OS Connect thread pool utilization, (3) the MQ-Kafka connector health, and (4) the identity bridge (ISAM) response time. The symptoms — high end-to-end latency with normal CICS and normal cloud metrics — point to a bottleneck in the components between the platforms. The most likely culprits are: z/OS Connect thread exhaustion (all threads busy, requests queuing), MQ channel backlog (messages accumulating), or identity bridge slowness (token validation taking longer than normal, perhaps due to a certificate or LDAP issue). The correlation ID should allow tracing a specific slow request through every component to identify where the delay is introduced.
18. FBA is considering implementing bidirectional sync for beneficiary address data (mainframe and cloud portal can both update). Sandra Chen is skeptical. Using the conflict resolution principles from Section 37.3, design the minimal safe architecture for this requirement.
Answer: (1) Choose field-level merge as the primary strategy: the cloud portal owns electronically submitted address changes (self-service); the mainframe batch owns USPS-sourced address updates (returned mail processing). (2) Implement conflict detection: if both platforms update the same beneficiary's address within a 24-hour reconciliation window, flag it as a conflict. (3) Conflict metadata: each update carries a timestamp, source system ID, and source type (self-reported vs. USPS vs. caseworker). (4) Resolution: conflicts are queued for caseworker review with both versions displayed. The caseworker selects the correct address, and their choice is written to the mainframe (system of record) with CDC replicating to cloud. (5) Prevention: add a real-time check — before the mainframe batch updates an address, it queries whether a cloud portal update occurred in the last 24 hours for the same beneficiary. If so, the batch update is held for review rather than applied automatically.
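Step (2), conflict detection within the 24-hour reconciliation window, can be sketched directly. The source labels and data shape are assumptions for the example:

```python
from datetime import datetime, timedelta

RECONCILE_WINDOW = timedelta(hours=24)

def detect_conflict(updates):
    """updates: list of (timestamp, source) address updates for one
    beneficiary, e.g. source in {"portal", "usps_batch", "caseworker"}.
    Flags a conflict whenever two different sources update within the
    24-hour reconciliation window described above."""
    updates = sorted(updates)
    for (t1, s1), (t2, s2) in zip(updates, updates[1:]):
        if s1 != s2 and t2 - t1 <= RECONCILE_WINDOW:
            return True
    return False
```

Flagged pairs would then carry their conflict metadata (timestamp, source system, source type) into the caseworker review queue from step (4).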
19. A vendor proposes replacing the MQ-Kafka connector in CNB's reference architecture with a direct Kafka client running on z/OS. Evaluate this proposal using the architectural principles from Section 37.7.
Answer: The proposal violates ADR-004 (MQ as mainframe-side canonical messaging, Kafka for cloud distribution) and the "replaceable components" principle. Arguments against: (1) MQ on z/OS integrates with the coupling facility for queue-sharing group reliability — this is a native z/OS capability that Kafka on z/OS cannot match. (2) Running a Kafka client on z/OS introduces a non-standard z/OS component that the mainframe team must learn, monitor, and maintain. (3) The MQ-Kafka connector is a replaceable component at the integration layer — it can be swapped for a different bridge technology without affecting either platform. A Kafka client embedded in CICS programs creates tight coupling between mainframe application code and a specific event platform. (4) If Kafka is replaced in 5 years by the next event streaming platform, the MQ-Kafka connector changes; COBOL programs that write to MQ don't change. Arguments for (to be fair): fewer moving parts (one fewer component), lower latency (no connector hop). But the reliability and maintainability arguments outweigh the latency savings.
20. Design the annual "Assumptions Register" review for a hybrid architecture. List at least five assumptions that should be reviewed, the evidence that would indicate each assumption has broken, and the architectural response if it breaks.
Answer:
| # | Assumption | Breaking Evidence | Architectural Response |
|---|---|---|---|
| 1 | IBM continues z/OS development through 2040 | IBM announces z/OS end-of-life or end-of-support | Accelerate strangler fig; begin planning replatform for core transactions |
| 2 | Cloud provider maintains API backward compatibility | Breaking changes in cloud APIs used by integration layer | Increase ACL abstraction layer thickness; evaluate multi-cloud |
| 3 | COBOL compilation toolchain remains available | IBM discontinues Enterprise COBOL compiler | Evaluate GnuCOBOL or automated COBOL-to-Java conversion for new development |
| 4 | CDC replication lag stays under 60 seconds | Consistently exceeding 60-second SLA for >5% of time | Upgrade CDC infrastructure; evaluate alternative CDC tools; adjust analytics SLAs |
| 5 | MQ-Kafka connector handles message volume growth | Connector throughput reaches 80% of rated capacity | Scale connector cluster; evaluate direct Kafka-on-z/OS as backup option |
| 6 | Regulatory framework does not require data residency changes | New regulation restricts where financial data can be processed | Evaluate on-premises cloud (private cloud); adjust data sovereignty policies |
| 7 | Mainframe talent pipeline produces adequate staff | Unable to fill mainframe positions within 6 months | Accelerate cross-training; increase AI-assisted development tools; consider managed services |
Review process: Annual meeting chaired by the chief architect, attended by platform leads, integration squad, security, and compliance. Each assumption is scored Green (solid), Yellow (early warning signs), or Red (assumption has broken or will break within 12 months). Red assumptions trigger an architecture response plan within 30 days.